When running in docker container #354

Open
Jevli opened this issue Dec 17, 2024 · 11 comments

Comments


Jevli commented Dec 17, 2024

Hi,

When I'm running crawl4ai in a Docker container I get two odd errors. The first one comes from the logger:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/PROJECT_NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 123, in _monitor_browser_process
    self.logger.error(
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'error'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/PROJECT_NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 141, in _monitor_browser_process
    self.logger.error(
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'error'

And here is the other one:

EDIT: This one is solved, see my comments below. It was a problem with asyncio.gather() when running multiple crawlers.

Error in crawler collect_PAGE_gig_data: BrowserType.connect_over_cdp: connect ECONNREFUSED 127.0.0.1:9222
Call log:
  - <ws preparing> retrieving websocket url from http://localhost:9222


Jevli commented Dec 17, 2024

Also, when trying to debug with screenshot=True in crawler.arun(), I got the following error:
(error screenshot attached)

unclecode (Owner) commented

@Jevli Hi, you're right. It's definitely a weird bug. For the second one, can you share your code snippet with me? As for the Docker issue, I will check tomorrow or the day after tomorrow to see why it behaves that way.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=True, verbose=True)

    # Set run configurations
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://kidocode.com/',
            config=crawl_config
        )

        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))
            # print("Fit Markdown Length:", len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())

And this is the output:

[INIT].... → Crawl4AI 0.4.23
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[EXPORT].. ℹ Exporting PDF and taking screenshot took 0.18s
[FETCH]... ↓ https://kidocode.com/... | Status: True | Time: 3.22s
[SCRAPE].. ◆ Processed https://kidocode.com/... | Time: 2035ms
[COMPLETE] ● https://kidocode.com/... | Status: True | Total: 5.26s
Raw Markdown Length: 118984
Citations Markdown Length: 107444

@unclecode unclecode self-assigned this Dec 17, 2024
@unclecode unclecode added the question Further information is requested label Dec 17, 2024

Jevli commented Dec 17, 2024

Dammit, I think the Chrome DevTools error was mine. It came from using asyncio.gather()... I removed that and got it working outside of Docker.

Though I still have a problem in Docker. I will paste snippets tomorrow (I don't have more time today to "clean" the code :)

For the screenshot problem, here is the code that triggers it:

If I set screenshot=False the code works. And I'm sorry this code looks awful, hopefully it's readable. I haven't had time to clean it up...

async def authenticate_and_collect(
    self, url: str, second_crawler_hooks={}, **kwargs
):
    # Setup crawler strategy
    browser_config = BrowserConfig(
        headless=True,
        use_persistent_context=True,
        user_data_dir="./states/browser_data",
        storage_state=self.storage_path,
    )

    crawler_config = CrawlerRunConfig(
        magic=True,
        screenshot=True,
        cache_mode=CacheMode.BYPASS,
        wait_until="domcontentloaded",
        page_timeout=10000,
        **kwargs,
    )

    async def after_goto(page: Page, context: BrowserContext):
        await page.wait_for_load_state("networkidle")

    crawler_strategy = AsyncPlaywrightCrawlerStrategy(browser_config=browser_config)
    hooks = {
        "on_browser_created": self.on_browser_created,
        "after_goto": after_goto,
        **second_crawler_hooks,
    }
    crawler_strategy.hooks = hooks

    # First crawler for authentication
    async with AsyncWebCrawler(
        config=browser_config, crawler_strategy=crawler_strategy
    ) as crawler:
        return await crawler.arun(
            url=url,
            config=crawler_config,
        )

Btw, **kwargs contains css_selector and extraction_strategy when I'm running it.


Jevli commented Dec 18, 2024

Update for Docker: I get the following error inside the container (it runs as it should outside of Docker):

ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-134' coro=<ManagedBrowser._monitor_browser_process() done, defined at /root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py:111> exception=AttributeError("'NoneType' object has no attribute 'error'")>
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 123, in _monitor_browser_process
    self.logger.error(
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'error'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 141, in _monitor_browser_process
    self.logger.error(
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'error'
Traceback (most recent call last):
  File "/app/src/main.py", line 47, in <module>
    PAGE_gigs = asyncio.run(collect_PAGE_gig_data())
  File "/usr/local/lib/python3.13/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/local/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/local/lib/python3.13/asyncio/base_events.py", line 721, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/app/src/provider_crawler/PAGE_gigs.py", line 12, in collect_PAGE_gig_data
    gigs = await collect_PAGE_links()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/src/provider_crawler/PAGE_gigs.py", line 93, in collect_PAGE_links
    rows = await crawler.authenticate_and_collect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
    )
    ^
  File "/app/src/crawler/crawler.py", line 107, in authenticate_and_collect
    async with AsyncWebCrawler(
               ~~~~~~~~~~~~~~~^
        config=browser_config, crawler_strategy=crawler_strategy
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ) as crawler:
    ^
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_webcrawler.py", line 131, in __aenter__
    await self.crawler_strategy.__aenter__()
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 501, in __aenter__
    await self.start()
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 508, in start
    await self.browser_manager.start()
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 268, in start
    self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/playwright/async_api/_generated.py", line 14779, in connect_over_cdp
    await self._impl_obj.connect_over_cdp(
    ...<4 lines>...
    )
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/playwright/_impl/_browser_type.py", line 174, in connect_over_cdp
    response = await self._channel.send_return_as_dict("connectOverCDP", params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 67, in send_return_as_dict
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
    )
    ^
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: BrowserType.connect_over_cdp: connect ECONNREFUSED ::1:9222
Call log:
  - <ws preparing> retrieving websocket url from http://localhost:9222

Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f6199ed7920>
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/asyncio/base_subprocess.py", line 130, in __del__
  File "/usr/local/lib/python3.13/asyncio/base_subprocess.py", line 107, in close
  File "/usr/local/lib/python3.13/asyncio/unix_events.py", line 802, in close
  File "/usr/local/lib/python3.13/asyncio/unix_events.py", line 788, in write_eof
  File "/usr/local/lib/python3.13/asyncio/base_events.py", line 829, in call_soon
  File "/usr/local/lib/python3.13/asyncio/base_events.py", line 552, in _check_closed
RuntimeError: Event loop is closed

I'm probably just overlooking something Docker-specific versus native... (The screenshot problem still exists too, but that happens both outside and inside Docker.)
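
Since the traceback ends in connect ECONNREFUSED when Playwright tries to reach the managed browser's CDP endpoint, one thing worth trying inside the container is to probe that endpoint directly. This is only a minimal sketch of an idea, assuming the browser is expected to expose CDP on port 9222 as the error message suggests; /json/version is Chrome's standard DevTools metadata URL:

# Hypothetical sanity check, not part of crawl4ai: probe the DevTools endpoint
# the crawler is trying to connect to. Run this inside the container while the
# crawl is starting up.
import json
import urllib.request

CDP_URL = "http://localhost:9222/json/version"  # port taken from the error above

try:
    with urllib.request.urlopen(CDP_URL, timeout=5) as resp:
        info = json.load(resp)
    print("CDP endpoint reachable:", info.get("Browser"))
    print("WebSocket URL:", info.get("webSocketDebuggerUrl"))
except OSError as exc:
    print("CDP endpoint not reachable:", exc)

If this also fails, the browser process never started (or died) inside the container, which would explain both the _monitor_browser_process logger error and the ECONNREFUSED.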

unclecode (Owner) commented

@Jevli I am working on this and have also been busy updating the documentation. Let me summarize: right now, in your case with Docker, the screenshot doesn't work, and you encounter this error: AttributeError: 'NoneType' object has no attribute 'error'. Right?

Just a quick guess: when you get the error, the Docker image likely doesn't contain the latest version of the library. I will double-check this very soon. In the meantime, can you try to do the same thing without Docker and see if everything works on your computer without any issues? I also noticed that you have Python 3.13, which is not my usual version; I try to maintain around 3.10. If you don't encounter any errors like that outside Docker, that means the issue lies in Docker, and we definitely need to focus on that. I'd appreciate it if you can do this.
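
A minimal way to check the version inside the container (just a suggestion on my side, not an official crawl4ai command) is to print the installed version and compare it with the 0.4.23 shown in your [INIT] log:

# Print the crawl4ai version installed in this environment (e.g. inside the
# Docker image) and compare it with the version you run outside Docker.
from importlib.metadata import version

print("crawl4ai version:", version("crawl4ai"))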


Jevli commented Dec 24, 2024

I had to put this crawl4ai project on the back burner for a bit, but it's still under development. I can try to test this more thoroughly later this week, but here are some quick answers first.

When I tested this last time I didn't get a screenshot inside the container, and not outside either (when run from my MacBook in the terminal). But I have only tested with Python 3.13; I can try 3.10 later this week. Although the screenshot worked before the 0.4.2 release with the same Python version (I used screenshots to verify login status etc.).


Jevli commented Dec 24, 2024

I was able to test with 3.10.16 from the terminal and I still get a similar error to the one I got inside Docker before:
(error screenshot attached)

I have the following crawler_config:

crawler_config = CrawlerRunConfig(
    magic=True,
    screenshot=True,
    cache_mode=CacheMode.BYPASS,
    wait_until="domcontentloaded",
)

And I'm calling the crawler with:

async with AsyncWebCrawler(
    config=browser_config, crawler_strategy=crawler_strategy
) as crawler:
    return await crawler.arun(
        url=url,
        config=crawler_config,
    )

I didn't have more time this round to debug the Docker setup any further.

unclecode (Owner) commented

@Jevli Right now, I can confirm that the code below works (I am not talking about Docker). Both the PDF and the screenshot are included. Regarding Docker, I can't say much right now because I am working on it. It will definitely become stable in a couple of weeks, especially now that I have found a much more efficient way to manage Docker. However, if you want to use the library directly, the code below confirms that it works.

import asyncio
import os
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Directory of this script, used below when saving the screenshot and PDF
__location__ = os.path.dirname(os.path.abspath(__file__))

async def main():
    """Example script showing both context manager and explicit lifecycle patterns"""
    
    # Configure the browser settings
    browser_config = BrowserConfig(headless=True, verbose=True)

    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=True,
    )

    # Example 1: Using context manager (recommended for simple cases)
    print("\n=== Using Context Manager ===")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
            config=crawl_config
        )
        if result.success:
            # Save screenshot
            if result.screenshot:
                from base64 import b64decode
                with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
                    f.write(b64decode(result.screenshot))
        
        # Save PDF
        if result.pdf:
            pdf_bytes = b64decode(result.pdf)
            with open(os.path.join(__location__, "page.pdf"), "wb") as f:
                f.write(pdf_bytes)


if __name__ == "__main__":
    asyncio.run(main())


Jevli commented Dec 25, 2024

@unclecode I started to think this might be because I pass both a browser config and a crawler strategy to AsyncWebCrawler.

async with AsyncWebCrawler(
    config=browser_config, crawler_strategy=crawler_strategy
) as crawler:
    return await crawler.arun(
        url=url,
        config=crawler_config,
    )

And this is a problem because of crawl4ai/async_webcrawler.py __init__, line 100:

# Initialize crawler strategy
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
    browser_config=browser_config,
    logger=self.logger,
    **kwargs  # Pass remaining kwargs for backwards compatibility
)

I haven't yet figured out how to handle this. I verified that logger is None because I create the strategy myself and don't set a logger. I also haven't figured out how else to pass hooks in that case.

I need the strategy for the hooks, or is there another way to pass hooks?

Edit: If I change the crawler call to the following, my code works again. Is this the right way to set hooks?
(I have built a small general-purpose crawler for different pages where I have to log in; they all have different needs for hooks, except before_created, which every one needs...)

# First crawler for authentication
async with AsyncWebCrawler(
    config=browser_config
) as crawler:
    for hook in list(hooks.keys()):
        crawler.crawler_strategy.set_hook(hook, hooks[hook])

    return await crawler.arun(
        url=url,
        config=crawler_config,
    )

unclecode (Owner) commented

@Jevli You are right; there are a few things to address. First, if you create a crawler strategy and do not set any logger, my code didn't check for that, which could raise an error. Now, if you pass a crawler strategy without a logger, I use the crawler's default logger instead. This way, we won't hit the bug where the logger is undefined or None.

Second, you should be able to create and set hooks in both of the ways you presented. You can do it separately by first creating an instance of the crawler strategy class and then passing it as a parameter to the AsyncWebCrawler constructor. Alternatively, you can do the same thing after creation by setting your hooks on the crawler's strategy. Both should work now. I'm going to release version 0.4.24 today or tomorrow; please give it a try.
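
For reference, here is a rough sketch of the first approach (my own illustration, not taken verbatim from the docs): build the strategy yourself, register hooks on it, and pass it to the constructor. The after_goto signature with **kwargs mirrors the hooks example in the next comment; the URL is just a placeholder.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)

    # Create the strategy up front and attach the hooks you need
    strategy = AsyncPlaywrightCrawlerStrategy(browser_config=browser_config)

    async def after_goto(page, context, **kwargs):
        # Wait for the page to settle before the content is extracted
        await page.wait_for_load_state("networkidle")
        return page

    strategy.set_hook("after_goto", after_goto)

    # Pass the pre-built strategy to the crawler constructor
    async with AsyncWebCrawler(config=browser_config, crawler_strategy=strategy) as crawler:
        result = await crawler.arun(url="https://example.com", config=crawler_config)
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())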

unclecode (Owner) commented

@Jevli I want to share an example of how to use hooks with the library. You can use a context manager, or you can manage the crawler lifecycle explicitly. In this code example, I use the crawler's explicit start/close lifecycle; both approaches are possible. The goal is to show you how to access all the hooks. I will also place this example in the docs folder.

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext

async def main():
    print("🔗 Hooks Example: Demonstrating different hook use cases")

    # Configure browser settings
    browser_config = BrowserConfig(
        headless=True
    )
    
    # Configure crawler settings
    crawler_run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="body",
        cache_mode=CacheMode.BYPASS
    )

    # Create crawler instance
    crawler = AsyncWebCrawler(config=browser_config)

    # Define and set hook functions
    async def on_browser_created(browser, context: BrowserContext, **kwargs):
        """Hook called after the browser is created"""
        print("[HOOK] on_browser_created - Browser is ready!")
        # Example: Set a cookie that will be used for all requests
        return browser

    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
        """Hook called after a new page and context are created"""
        print("[HOOK] on_page_context_created - New page created!")
        # Example: Set default viewport size
        await context.add_cookies([{
            'name': 'session_id',
            'value': 'example_session',
            'domain': '.example.com',
            'path': '/'
        }])
        await page.set_viewport_size({"width": 1920, "height": 1080})
        return page

    async def on_user_agent_updated(page: Page, context: BrowserContext, user_agent: str, **kwargs):
        """Hook called when the user agent is updated"""
        print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
        return page

    async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
        """Hook called after custom JavaScript execution"""
        print("[HOOK] on_execution_started - Custom JS executed!")
        return page

    async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
        """Hook called before navigating to each URL"""
        print(f"[HOOK] before_goto - About to visit: {url}")
        # Example: Add custom headers for the request
        await page.set_extra_http_headers({
            "Custom-Header": "my-value"
        })
        return page

    async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
        """Hook called after navigating to each URL"""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        # Example: Wait for a specific element to be loaded
        try:
            await page.wait_for_selector('.content', timeout=1000)
            print("Content element found!")
        except:
            print("Content element not found, continuing anyway")
        return page

    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
        """Hook called before retrieving the HTML content"""
        print("[HOOK] before_retrieve_html - About to get HTML content")
        # Example: Scroll to bottom to trigger lazy loading
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        return page

    async def before_return_html(page: Page, context: BrowserContext, html:str, **kwargs):
        """Hook called before returning the HTML content"""
        print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
        # Example: You could modify the HTML content here if needed
        return page

    # Set all the hooks
    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
    crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
    crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("after_goto", after_goto)
    crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)
    crawler.crawler_strategy.set_hook("before_return_html", before_return_html)

    await crawler.start()

    # Example usage: crawl a simple website
    url = 'https://example.com'
    result = await crawler.arun(url, config=crawler_run_config)
    print(f"\nCrawled URL: {result.url}")
    print(f"HTML length: {len(result.html)}")
    
    await crawler.close()

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
