Passing arguments to hook to perform basic auth #360

Open
luisferreira93 opened this issue Dec 19, 2024 · 1 comment

Comments

@luisferreira93

Hello! I am working on a solution where I use Scrapy to crawl through several levels of a website and Crawl4AI to extract the content. I now need to support basic authentication, and I am trying a hook-based solution (I already found something similar in the issues here).
I have this hook, which should receive a username and password as parameters:

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser, BrowserContext

async def before_goto(page, **kwargs):
    # kwargs might contain original_url and session_id etc.
    # Store original_url somewhere if needed, or print
    # TODO: create the Base64-encoded username:password value for the header below
    await page.set_extra_http_headers({'Authorization': 'Basic <base64 of username:password>'})
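
(For illustration, a minimal sketch of how that header could be built, assuming the username and password are supplied to the hook via kwargs:)

import base64

async def before_goto(page, **kwargs):
    # Assumes 'username' and 'password' are provided in kwargs when the hook is executed
    username = kwargs["username"]
    password = kwargs["password"]
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    await page.set_extra_http_headers({"Authorization": f"Basic {credentials}"})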

And this is the code that calls the hook:

import base64
from typing import Any, AsyncIterator
from urllib.parse import urlparse

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlResult
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from scrapy.crawler import Crawler
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from playwright.async_api import Page, Browser, BrowserContext

from common.connector import IndexableContent
from connectors.webcrawler.args import WebCrawlerConnectorArgs
from connectors.webcrawler.basic_auth import BasicAuth
from connectors.webcrawler.scrapy_webcrawler.scrapy_webcrawler.hooks import before_goto


class WebCrawlerSpider(CrawlSpider):
    name = "webcrawler"

    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def __init__(
        self,
        connector_args: WebCrawlerConnectorArgs,
        documents: list[IndexableContent],
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        # If start_urls empty, quit (?)
        self.start_urls = (
            connector_args.urls
            if connector_args.urls
            else []
        )
        self.allowed_domains = self.extract_domains(connector_args.urls)
        self.documents: list[IndexableContent] = documents
        self.authentication = connector_args.authentication
        if isinstance(self.authentication, BasicAuth):
            self.http_user = self.authentication.username
            self.http_auth_domain = None
            self.http_pass = self.authentication.password

    @classmethod
    def from_crawler(
        cls,
        crawler: Crawler,
        connector_args: WebCrawlerConnectorArgs,
        documents: list[IndexableContent],
        *args: Any,
        **kwargs: Any,
    ) -> "WebCrawlerSpider":
        crawler.settings.set("CRAWLSPIDER_FOLLOW_LINKS", connector_args.crawl_depth > 0)
        spider = super().from_crawler(
            crawler, connector_args, documents, *args, **kwargs
        )
        spider.settings.set(
            "DEPTH_LIMIT",
            connector_args.crawl_depth,
            priority="spider",
        )
        spider.settings.set(
            "ITEM_PIPELINES",
            {
                "scrapy_webcrawler.scrapy_webcrawler.pipelines.ScrapyWebcrawlerPipeline": 300,
            },
        )
        spider.settings.set(
            "TWISTED_REACTOR",
            "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        )
        return spider

    def extract_domains(self, urls: list[str]) -> list[str]:
        """
        Extracts domains from a list of URLs.

        Args:
            urls (list[str]): A list of URLs.

        Returns:
            list[str]: A list of domains extracted from the URLs.
        """
        domains = []
        for url in urls:
            parsed_url = urlparse(url)
            if parsed_url.netloc:
                domain = parsed_url.netloc
                # Remove 'www.' only if it is at the start of the domain
                if domain.startswith("www."):
                    domain = domain[4:]
                domains.append(domain)
        return domains

    def parse_start_url(self, response):
        return self.parse_item(response)

    async def parse_item(self, response) -> AsyncIterator[IndexableContent]:
        # TODO: extract to a method
        # We need the domain of the basic-auth website here; with multiple start_urls this becomes confusing
        crawl_result = await self.process_url(response.url)
        document = IndexableContent(
            identifier="id1",
            content=str(crawl_result.markdown),
            title="title",
            url=response.url,
            metadata={},
        )
        self.documents.append(document)
        yield document

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url)

    async def process_url(self, url) -> CrawlResult:
        crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
        crawler_strategy.set_hook('before_goto', before_goto)
        # TODO: pass the username and password here -- but how do I obtain the page argument?
        # await crawler_strategy.execute_hook('before_goto', ...)
        async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
            return await crawler.arun(
                url=url,
                cache_mode=CacheMode.BYPASS,
                exclude_external_links=True,
                exclude_social_media_links=True,
            )

The main problem is that I don't know how to obtain the page parameter. Can you help me with that? Also, is this the correct way to support basic auth? Thank you in advance!

@unclecode
Owner

@luisferreira93 Thanks for using Crawl4ai. I have a few things to explain that should make this easier for you. Before that, I want to let you know that we will release our own scraping module very soon; it is under review and will bring significant efficiency gains, so I definitely suggest using it for scraping once it is out. Now, back to your questions; I will add some explanations and show you some code examples for clarity.

Let me address your questions and suggest some improvements to make your code more efficient:

  1. Hook Selection: Instead of using before_goto, I recommend using the on_page_context_created hook for setting authentication headers. This hook is more appropriate as it's called right after a new page context is created, ensuring your headers are set up properly.

  2. Browser Instance Management: Currently, you're creating a new crawler instance for each URL. This is inefficient as it involves starting and stopping the browser repeatedly. Let's improve this by creating the crawler once and reusing it.

Here's an improved version of your code:

import base64

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlResult
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from scrapy.spiders import CrawlSpider

from connectors.webcrawler.basic_auth import BasicAuth


class WebCrawlerSpider(CrawlSpider):
    def __init__(self, connector_args, documents, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = connector_args.urls if connector_args.urls else []
        self.allowed_domains = self.extract_domains(connector_args.urls)
        self.documents = documents
        self.authentication = connector_args.authentication
        
        # Set up the crawler strategy with authentication
        async def on_page_context_created(page, **kwargs):
            if isinstance(self.authentication, BasicAuth):
                credentials = base64.b64encode(
                    f"{self.authentication.username}:{self.authentication.password}".encode()
                ).decode()
                await page.set_extra_http_headers({
                    'Authorization': f'Basic {credentials}'
                })

        self.crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
        self.crawler_strategy.set_hook('on_page_context_created', on_page_context_created)
        self.crawler = AsyncWebCrawler(
            verbose=True, 
            crawler_strategy=self.crawler_strategy
        )
        
    async def spider_opened(self):
        """Initialize crawler when spider starts"""
        await self.crawler.start()
        
    async def spider_closed(self):
        """Clean up crawler when spider finishes"""
        await self.crawler.close()

    async def process_url(self, url) -> CrawlResult:
        return await self.crawler.arun(
            url=url,
            cache_mode=CacheMode.BYPASS,
            exclude_external_links=True,
            exclude_social_media_links=True,
        )

Key improvements in this code:

  1. Better Hook: Using on_page_context_created instead of before_goto ensures headers are set immediately after a page context is created.

  2. Efficient Browser Management: The crawler is created once in __init__ and managed through spider_opened and spider_closed. This prevents the overhead of creating/destroying browser instances for each URL.

  3. Clean Authentication: The authentication logic is encapsulated in the hook function, making it cleaner and more maintainable.

To use this code, you don't need to manually execute the hook or worry about the page parameter - the crawler strategy will handle that for you. The hook will be called automatically with the correct page instance whenever a new page context is created.

For example usage with explicit lifecycle management:

# Initialize the spider
spider = WebCrawlerSpider(connector_args, documents)

# Start the crawler
await spider.spider_opened()

try:
    # Process URLs
    for url in spider.start_urls:
        result = await spider.process_url(url)
        # Handle result...
finally:
    # Clean up
    await spider.spider_closed()

This approach is much more efficient as it:

  • Reuses the browser instance across multiple URLs
  • Properly manages resources
  • Handles authentication consistently
  • Integrates well with Scrapy's lifecycle (see the signal-wiring sketch below)
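
One detail worth noting: Scrapy will only call spider_opened and spider_closed automatically if they are connected to its signals. A minimal sketch of that wiring, assuming a Scrapy version that accepts coroutine handlers for these signals and keeping the rest of the class as shown above:

from scrapy import signals
from scrapy.spiders import CrawlSpider


class WebCrawlerSpider(CrawlSpider):
    # ... __init__, spider_opened, spider_closed, process_url as above ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Start and stop the shared AsyncWebCrawler together with the spider
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider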

Let me know if you need any clarification or have questions about implementing these improvements!

@unclecode unclecode self-assigned this Dec 25, 2024
@unclecode unclecode added the question Further information is requested label Dec 25, 2024