Passing arguments to hook to perform basic auth #360
Comments
@luisferreira93 Thanks for using Crawl4ai. A few things to explain to make the job easier for you. Before that, I want to let you know that we will release our scraping module very soon; it is under review and will provide a lot of efficiency, so I definitely suggest you use it for scraping. Now, back to your questions: I will add some explanations and code examples for clarity, and suggest some improvements to make your code more efficient.
Here's an improved version of your code:

```python
import base64

from scrapy.spiders import CrawlSpider

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlResult
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

# BasicAuth is assumed to be the auth model coming from your own connector_args


class WebCrawlerSpider(CrawlSpider):
    def __init__(self, connector_args, documents, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = connector_args.urls if connector_args.urls else []
        self.allowed_domains = self.extract_domains(connector_args.urls)
        self.documents = documents
        self.authentication = connector_args.authentication

        # Set up the crawler strategy with authentication applied via a hook
        async def on_page_context_created(page, **kwargs):
            if isinstance(self.authentication, BasicAuth):
                credentials = base64.b64encode(
                    f"{self.authentication.username}:{self.authentication.password}".encode()
                ).decode()
                await page.set_extra_http_headers({
                    'Authorization': f'Basic {credentials}'
                })

        self.crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
        self.crawler_strategy.set_hook('on_page_context_created', on_page_context_created)
        self.crawler = AsyncWebCrawler(
            verbose=True,
            crawler_strategy=self.crawler_strategy
        )

    async def spider_opened(self):
        """Initialize the crawler when the spider starts."""
        await self.crawler.start()

    async def spider_closed(self):
        """Clean up the crawler when the spider finishes."""
        await self.crawler.close()

    async def process_url(self, url) -> CrawlResult:
        return await self.crawler.arun(
            url=url,
            cache_mode=CacheMode.BYPASS,
            exclude_external_links=True,
            exclude_social_media_links=True,
        )
```

Key improvements in this code:

- Basic auth is applied once in the `on_page_context_created` hook, so every page created by the strategy gets the `Authorization` header automatically.
- The hook is registered with `set_hook`, so the `page` object is supplied by Crawl4ai; you never have to obtain or pass it yourself.
- The crawler lifecycle is explicit: `spider_opened` starts it and `spider_closed` cleans it up.
- `arun` bypasses the cache and excludes external and social-media links on each crawl.
To use this code, you don't need to manually execute the hook or worry about the `page` parameter; the crawler strategy passes it to the hook automatically. For example, usage with explicit lifecycle management:

```python
# Initialize the spider
spider = WebCrawlerSpider(connector_args, documents)

# Start the crawler
await spider.spider_opened()

try:
    # Process URLs
    for url in spider.start_urls:
        result = await spider.process_url(url)
        # Handle result...
finally:
    # Clean up
    await spider.spider_closed()
```

This approach is much more efficient, as it gives you explicit control over the crawler's lifecycle: the browser starts once, is reused for every URL, and is always cleaned up in the `finally` block.
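If you prefer not to manage start/close calls yourself, recent Crawl4ai versions also let you use `AsyncWebCrawler` as an async context manager, which covers the same lifecycle. A minimal sketch, assuming the same strategy setup as above (the `crawl_urls` helper is illustrative):

```python
from crawl4ai import AsyncWebCrawler, CacheMode


async def crawl_urls(urls, crawler_strategy):
    # Entering the context starts the browser; exiting closes it,
    # equivalent to calling spider_opened() / spider_closed() by hand.
    async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
        results = []
        for url in urls:
            results.append(await crawler.arun(url=url, cache_mode=CacheMode.BYPASS))
        return results
```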
Let me know if you need any clarification or have questions about implementing these improvements!
Hello! I am working on a solution where I use scrapy to crawl through several levels of a website and crawl4AI to extract the content. Currently, I need to support basic authentication, and I am trying a solution with hooks (I already found something similar in the issues section here).
I have this hook, which should receive a username and password as parameters.
And this is the code that calls the hook:
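A minimal sketch of this kind of hook and its registration, assuming Crawl4ai's `AsyncPlaywrightCrawlerStrategy` and Playwright's page API; the `make_basic_auth_hook` helper and the credential values are illustrative, not the original snippet:

```python
import base64

from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy


def make_basic_auth_hook(username, password):
    # Returns a hook closed over the credentials; Crawl4ai calls it with the
    # Playwright page, so `page` is never constructed or passed by user code.
    async def on_page_context_created(page, **kwargs):
        credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
        await page.set_extra_http_headers({"Authorization": f"Basic {credentials}"})
    return on_page_context_created


strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
strategy.set_hook("on_page_context_created", make_basic_auth_hook("user", "secret"))
```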
The main problem here is that I don't know how to obtain the `page` parameter. Can you help me here? Also, is this the correct way to support basic auth? Thank you in advance.