When running in docker container #354
@Jevli Hi, you're right. It's definitely a weird bug. For the second one, can you share your code snippet with me? As for the Docker one, I will check tomorrow or the day after tomorrow to see why it behaves that way.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=True, verbose=True)

    # Set run configurations
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://kidocode.com/',
            config=crawl_config
        )
        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))
            # print("Fit Markdown Length:", len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())

And this is the output:
Dammit, I think the Chrome developer tools issue is mine; it came from using asyncio.gather(). I removed that and got it working outside of Docker, though I still have the problem in Docker. I will paste snippets tomorrow (I don't have more time today to "clean" the code :) For the screenshot problem, here is the code which triggers it: If I turn
Btw, **kwargs has css_selector and extraction_strategy when I'm running it.
Update for Docker: I get the following error with the Dockerfile (outside of Docker it runs as it should):
Maybe I'm just overlooking some Docker-specific stuff versus native... (The screenshot problem still exists too, but that occurs both outside and inside Docker.)
@Jevli I am working on this and have also been busy updating the documentation. Let me summarize: right now, in your case with Docker, the screenshot doesn't work, and you encounter this error: 'AttributeError: 'NoneType' object has no attribute 'error''. Right? Just a quick guess: when you get the error, the Docker image likely doesn't point to the latest version of the library. I will double-check this very soon. In the meantime, can you try to do the same thing without Docker and see if everything works on your computer without any issues? I also noticed that you have Python 3.13, which is not my usual version; I try to stay around 3.10. If you don't encounter any errors like that outside Docker, that means the issue lies in Docker, and we definitely need to focus on that. I'd appreciate it if you could do this.
I had to put this crawl4ai project a little on the back burner, but it's still under development. I can try to test this more thoroughly later this week, but first some quick answers for you. When I tested this last time, I didn't get a screenshot inside the container, but not outside either (when run from my MacBook in the terminal). I have only tested with Python 3.13; I can try 3.10 later this week. Although the screenshot did work before the 0.4.2 release with the same Python version (I used the screenshot to verify login status, etc.).
@Jevli Right now, I can confirm that the code below works (I am not talking about Docker). Both the PDF and the screenshot are included. Regarding Docker, I can't say much right now because I am working on it. It will definitely become stable in a couple of weeks, especially now that I have found a much more efficient way to manage Docker. However, if you want to use the library directly, the code below confirms that it works.

import asyncio
import os
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Added so the example is self-contained: save output files next to this script
__location__ = os.path.dirname(os.path.abspath(__file__))

async def main():
    """Example script showing both context manager and explicit lifecycle patterns"""
    # Configure the browser settings
    browser_config = BrowserConfig(headless=True, verbose=True)

    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=True,
    )

    # Example 1: Using context manager (recommended for simple cases)
    print("\n=== Using Context Manager ===")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
            config=crawl_config
        )
        if result.success:
            # Save screenshot
            if result.screenshot:
                with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
                    f.write(b64decode(result.screenshot))
            # Save PDF
            if result.pdf:
                pdf_bytes = b64decode(result.pdf)
                with open(os.path.join(__location__, "page.pdf"), "wb") as f:
                    f.write(pdf_bytes)

if __name__ == "__main__":
    asyncio.run(main())
@unclecode I started to think this might be because I pass both a crawler config and a crawler strategy to AsyncWebCrawler.
And this is a problem because in
I was not yet able to figure out how to handle this. I verified that the logger is None because I pass the strategy in myself and don't set a logger. I haven't figured out yet how to pass the hooks in that case. Do I need the strategy for the hooks, or is there another way to pass them? Edit: If I change the crawler call to this, my code works again. Is this the right way to set hooks?
@Jevli You are right; there are a few things to address. First, if you create a crawler strategy and do not set any logger, my code didn't check for that, which could raise an error. So now, if you pass a crawler strategy without a logger, I use the default logger from the library; this way, we won't hit any bugs caused by the logger being undefined or null. Second, you should be able to create and set hooks in both ways that you presented: either separately, by first creating an instance of the crawler strategy class and then passing it as a parameter to the constructor of AsyncWebCrawler, or after creation, by going to the crawler's strategy and setting the hooks on it. Both should work now. I'm going to release version 0.4.24 today or tomorrow; please give it a try.
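A minimal sketch of the two wiring styles described above (my illustration, not code from the thread; it assumes AsyncPlaywrightCrawlerStrategy as the strategy class and uses a toy before_goto hook):

from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

async def before_goto(page, context, url, **kwargs):
    # Illustrative hook: just log the URL before navigation
    print(f"[HOOK] before_goto - about to visit {url}")
    return page

browser_config = BrowserConfig(headless=True)

# Style 1: build the strategy yourself, attach hooks, then hand it to the crawler.
# After the 0.4.24 fix described above, the strategy's missing logger should be
# filled in with the library default instead of raising an AttributeError.
strategy = AsyncPlaywrightCrawlerStrategy(browser_config=browser_config)
strategy.set_hook("before_goto", before_goto)
crawler = AsyncWebCrawler(crawler_strategy=strategy, config=browser_config)

# Style 2: let AsyncWebCrawler build its default strategy, then attach hooks to it
crawler2 = AsyncWebCrawler(config=browser_config)
crawler2.crawler_strategy.set_hook("before_goto", before_goto)

Either way, the hooks end up on the same strategy object, so the crawler should behave identically.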
@Jevli I want to share an example of how to use hooks with the library. You can use a context manager, or you can manage the crawler's lifecycle explicitly. In this code example, I use the crawler asynchronously with explicit start/close calls; both methods are possible. The goal is to show you the best way to access all the hooks. I will also place this example in the docs folder.
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext

async def main():
    print("🔗 Hooks Example: Demonstrating different hook use cases")

    # Configure browser settings
    browser_config = BrowserConfig(
        headless=True
    )

    # Configure crawler settings
    crawler_run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="body",
        cache_mode=CacheMode.BYPASS
    )

    # Create crawler instance
    crawler = AsyncWebCrawler(config=browser_config)

    # Define and set hook functions
    async def on_browser_created(browser, context: BrowserContext, **kwargs):
        """Hook called after the browser is created"""
        print("[HOOK] on_browser_created - Browser is ready!")
        # Example: this is where you could set state shared by all requests
        return browser

    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
        """Hook called after a new page and context are created"""
        print("[HOOK] on_page_context_created - New page created!")
        # Example: set a session cookie and a default viewport size
        await context.add_cookies([{
            'name': 'session_id',
            'value': 'example_session',
            'domain': '.example.com',
            'path': '/'
        }])
        await page.set_viewport_size({"width": 1920, "height": 1080})
        return page

    async def on_user_agent_updated(page: Page, context: BrowserContext, user_agent: str, **kwargs):
        """Hook called when the user agent is updated"""
        print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
        return page

    async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
        """Hook called after custom JavaScript execution"""
        print("[HOOK] on_execution_started - Custom JS executed!")
        return page

    async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
        """Hook called before navigating to each URL"""
        print(f"[HOOK] before_goto - About to visit: {url}")
        # Example: add custom headers for the request
        await page.set_extra_http_headers({
            "Custom-Header": "my-value"
        })
        return page

    async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
        """Hook called after navigating to each URL"""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        # Example: wait for a specific element to be loaded
        try:
            await page.wait_for_selector('.content', timeout=1000)
            print("Content element found!")
        except Exception:
            print("Content element not found, continuing anyway")
        return page

    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
        """Hook called before retrieving the HTML content"""
        print("[HOOK] before_retrieve_html - About to get HTML content")
        # Example: scroll to bottom to trigger lazy loading
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        return page

    async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs):
        """Hook called before returning the HTML content"""
        print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
        # Example: you could inspect or post-process the HTML here if needed
        return page

    # Set all the hooks
    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
    crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
    crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("after_goto", after_goto)
    crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)
    crawler.crawler_strategy.set_hook("before_return_html", before_return_html)

    # Explicit lifecycle: start the crawler, run, then close it
    await crawler.start()

    # Example usage: crawl a simple website
    url = 'https://example.com'
    result = await crawler.arun(url, config=crawler_run_config)

    print(f"\nCrawled URL: {result.url}")
    print(f"HTML length: {len(result.html)}")

    await crawler.close()

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
Hi,
When I'm running crawl4ai in a Docker container, I get two odd errors. The first one is about the logger:
And the other one:
THIS vvv IS solved, see my comments below.
Problem with asyncio.gather() when running multiple crawlers...
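For context, running multiple crawls concurrently with asyncio.gather() would look roughly like the sketch below (my reconstruction, not the reporter's actual code; the URLs and configuration are placeholders):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Placeholder URLs; the reporter's actual targets are not shown in the thread
    urls = ["https://example.com", "https://example.org"]

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Schedule one arun() per URL and await them all together
        results = await asyncio.gather(
            *(crawler.arun(url, config=run_config) for url in urls)
        )
    for result in results:
        print(result.url, result.success)

if __name__ == "__main__":
    asyncio.run(main())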