bug when it comes to extracting websites #421

Open
ordersofmagnitude opened this issue Jan 7, 2025 · 9 comments
@ordersofmagnitude

ordersofmagnitude commented Jan 7, 2025

Hi all, I am trying to extract this website:

https://www.zacks.com/stock/news/2375522/snowflake-rises-post-q3-earnings-how-should-you-play-the-stock

Here is my code:

async def ai_process_news(self, item):
    #agent = self.get_browser()
    js_commands = ["document.querySelector('#accept_cookie').click();",
        "window.scrollTo(0, 900);" 
    ]
    async with AsyncWebCrawler(verbose=True, headless=True, sleep_on_close =False) as crawler:
        result = await crawler.arun(
            url=item["link"],        
            simulate_user=True,        
            override_navigator=True,   
            magic=True,
            js_code = js_commands)
        session_id = "dynamic-session"
        result = await crawler.arun(
            url = item["link"],
            session_id = session_id,
            wait_for = "css:.show_article",
            js_code = [
            "document.querySelector('span.show_article').click();"],
            css_selector=".commentary_body"
            )
        
    #await crawler.crawler_strategy.kill_session(session_id)
    print(result.markdown)
    return result.markdown

def process_news(self, item):
    article = asyncio.run(self.ai_process_news(item))
    return article

The result is a bunch of links but no actual content. Please advise.
Services Overview
* Zacks Ultimate
* Zacks Investor Collection
* Zacks Premium

Investor Services

* [ETF Investor](https://www.zacks.com/stock/news/2375522/</www.zacks.com/etfinvestor/>)
* [Home Run Investor](https://www.zacks.com/stock/news/2375522/</www.zacks.com/homerun/?adid=TOP_ONLINE_NAV>)
* [Income Investor](https://www.zacks.com/stock/news/2375522/</www.zacks.com/incomeinvestor/>)
* [Stocks Under $10](https://www.zacks.com/stock/news/2375522/</www.zacks.com/stocksunder10/>)
* [Value Investor](https://www.zacks.com/stock/news/2375522/</www.zacks.com/valueinvestor/?adid=TOP_ONLINE_NAV>)
* [Top 10 Stocks](https://www.zacks.com/stock/news/2375522/</www.zacks.com/top-10-stocks.php>)

Other Services

* [Method for Trading](https://www.zacks.com/stock/news/2375522/</www.zacks.com/registration/zmt/welcome/eoffer/33e3/?adid=TOP_ONLINE_NAV>)
* [Research Wizard](https://www.zacks.com/stock/news/2375522/</www.zacks.com/registration/rw/welcome/eoffer/3dd7/?adid=TOP_ONLINE_NAV>)
* [Zacks Confidential](https://www.zacks.com/stock/news/2375522/</www.zacks.com/confidential/?adid=TOP_ONLINE_NAV>)

Trading Services

* [Black Box Trader](https://www.zacks.com/stock/news/2375522/</www.zacks.com/blackboxtrader/>)
* [Counterstrike](https://www.zacks.com/stock/news/2375522/</www.zacks.com/counterstrike/>)
* [Headline Trader](https://www.zacks.com/stock/news/2375522/</www.zacks.com/headlinetrader/>)
* [Insider Trader](https://www.zacks.com/stock/news/2375522/</www.zacks.com/insidertrader/>)
* [Large-Cap Trader](https://www.zacks.com/stock/news/2375522/</www.zacks.com/largecaptrader/>)
* [Options Trader](https://www.zacks.com/stock/news/2375522/</www.zacks.com/optionstrader/>)
* [Short Sell List](https://www.zacks.com/stock/news/2375522/</www.zacks.com/shortlist/>)
* [Surprise Trader](https://www.zacks.com/stock/news/2375522/</www.zacks.com/surprisetrader/>)
* [TAZR](https://www.zacks.com/stock/news/2375522/</www.zacks.com/tazr/>)

Innovators

* [Alternative Energy](https://www.zacks.com/stock/news/2375522/</www.zacks.com/alternativeenergyinnovators/>)
* [Blockchain](https://www.zacks.com/stock/news/2375522/</www.zacks.com/blockchaininnovators/>)
* [Commodity](https://www.zacks.com/stock/news/2375522/</www.zacks.com/commodityinnovators/>)
* [Healthcare](https://www.zacks.com/stock/news/2375522/</www.zacks.com/healthcareinnovators/>)
* [Marijuana](https://www.zacks.com/stock/news/2375522/</www.zacks.com/marijuanainnovators/>)
* [Technology](https://www.zacks.com/stock/news/2375522/</www.zacks.com/technologyinnovators/>)

Zacks Investment Research Home
You are being directed to ZacksTrade, a division of LBMZ Securities and licensed broker-dealer. ZacksTrade and Zacks.com are separate companies. The web link between the two companies is not a solicitation or offer to invest in a particular security or type of security. ZacksTrade does not endorse or adopt any particular investment strategy, any analyst opinion/rating/report or any approach to evaluating individual securities.
If you wish to go to ZacksTrade, click OK. If you do not, click Cancel.
OK Cancel

Zacks Investment Research

Don't Miss the Latest Market News

Get daily market news and analysis from Zacks Investment Research.
Maybe Later Get Notifications
Back to top

A SPECIAL WELCOME GIFT FROM ZACKS.COM

Zacks' 7 Strongest Buys for January, 2025

See our "best of the best" short-term stocks. Hand-picked from 220 new Strong Buys, they could be the most profitable stocks you own over the next 90 days. Recent picks have climbed as much as +67.5% in 1 month. Our new recommendations may soar just as high.

A SPECIAL WELCOME GIFT FROM ZACKS.COM

Zacks' 7 Strongest Buys for January, 2025

See our "best of the best" short-term stocks. Hand-picked from 220 new Strong Buys, they could be the most profitable stocks you own over the next 90 days. Recent picks have climbed as much as +67.5% in 1 month. Our new recommendations may soar just as high. Today's market dip makes now an ideal time to get in.
*Enter your best email address ▸ Click Here, It's Really Free
No cost. No credit card. No obligation to buy anything ever. Past performance is no guarantee of future results.We respect your privacy. Zacks Ultimate Member? Click here.

@devatbosch

devatbosch commented Jan 7, 2025

Hi @ordersofmagnitude,

Can you try the code below and let me know how it goes?

        async def ai_process_news(self, item):
            js_commands = [
                "document.querySelector('.cookie-accept-btn').click();",
                "window.scrollTo(0, 900);"
            ]
            async with AsyncWebCrawler(verbose=True, headless=True, sleep_on_close=False) as crawler:
                # Initial request to load the page and accept cookies
                await crawler.arun(
                    url=item["link"],
                    simulate_user=True,
                    override_navigator=True,
                    magic=True,
                    js_code=js_commands
                )
        
                session_id = "dynamic-session"
        
                # Second request to interact with "Show Full Article" and extract content
                result = await crawler.arun(
                    url=item["link"],
                    session_id=session_id,
                    wait_for="css:.show_article",
                    js_code=["document.querySelector('span.show_article').click();"],
                    css_selector=".commentary_body"
                )
        
                # Debugging step to inspect HTML content
                print(result.html)
        
            return result.markdown
      
        def process_news(self, item):
            article = asyncio.run(self.ai_process_news(item))
            return article

I changed the cookie-button selector in the first JavaScript command to '.cookie-accept-btn', so I would like to know whether it works.

Please do share your feedback.

Thank You!
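One thing that may be worth checking in the snippet above: the first arun() call is made without a session_id, while the second uses "dynamic-session". If the crawler opens a fresh page for calls that have no session, page state from the first call (such as the dismissed cookie banner) might not carry over to the second. The sketch below is only an assumption about a possible fix, not a confirmed one. It reuses the parameters already shown in this thread and passes the same session_id to both calls. The helper names build_run_kwargs and crawl_article are hypothetical.

```python
import asyncio


def build_run_kwargs(url: str, session_id: str = "dynamic-session"):
    """Return keyword arguments for two arun() calls that share one session_id."""
    first = {
        "url": url,
        "session_id": session_id,  # assumption: pass the session on the first call too
        "js_code": ["document.querySelector('#accept_cookie')?.click();"],
    }
    second = {
        "url": url,
        "session_id": session_id,  # same session as the first call
        "wait_for": "css:span.show_article",
        "js_code": ["document.querySelector('span.show_article')?.click();"],
        "css_selector": ".commentary_body",
    }
    return first, second


async def crawl_article(url: str):
    # Imported lazily so the helper above can be inspected without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    first, second = build_run_kwargs(url)
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        await crawler.arun(**first)            # accept cookies in the shared session
        result = await crawler.arun(**second)  # expand and extract in the same session
    return result.markdown


# Usage (requires crawl4ai and network access):
# print(asyncio.run(crawl_article("https://www.zacks.com/stock/news/2375522/snowflake-rises-post-q3-earnings-how-should-you-play-the-stock")))
```

Whether this changes the output depends on how the installed crawl4ai version handles sessions, so treat it as something to try rather than a known solution.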

@ordersofmagnitude
Author

Hi,

Here is the result. It is just a bunch of links and ads. I am hoping to extract the actual content of the article. Please advise!

We use cookies to understand how you use our site and to improve your experience. This includes personalizing content and advertising. To learn more, click here. By continuing to use our site, you accept our use of cookies, revised Privacy Policy and Terms of Service.

I acceptX
Zacks Investment Research Home
                <span class="z1-mobile-menu">Menu</span>
                <span class="sr-only">Toggle header navigation</span>
            </button>            
            
            <ul id="navbar-collapse" class="navbar-collapse">
                <li class=" selected" aria-current="page">
                    <a href="//www.zacks.com" aria-current="page">Home</a>
                    </li>
    
  • Stocks
  • Stocks
                    </div>
                </li>
    
                
                <!-- search container -->
                 <li class="search-container home_search">
                     <form id="search_form" name="frmsearch" method="post" action="/search.php" autocomplete="off">
                     <label id="header_search" for="search-q" class="sr-only">Quote or Search</label>
                         <input type="search" id="search-q" name="search-q" class="dropdown " value="" hasdefault="false" searchextras="all" placeholder="Quote or Search" autofocus="autofocus" aria-autocomplete="both" aria-controls="result_list_view" aria-haspopup="listbox" aria-expanded="false" autocomplete="off">
                         <button type="submit">
                             <i class="fa fa-search" aria-hidden="true"></i>
                             <span class="sr-only">Search</span>
                         </button>
                         <input type="hidden" class="search_mode" name="search_mode" value="">
                         <input type="hidden" class="ticker_type" name="ticker_type" value="">
                         <div id="result_list_view" class="result_container">
                    <div class="results_tickers"></div>
                </div>
                     </form>
                     <script>document.getElementById('search-q').setAttribute("autocomplete", "off");</script>
                 </li>
                
                
            </ul>
    

    New to Zacks? Get started here.

    • Join Now
    • Sign In
                                  <h2>Member Sign In</h2>
      
                                  <div class="form-field float-label">
      
                                      <input id="username" type="text" name="username" placeholder="Username or Email Address" required="" autocomplete="username">
                                      <label for="username">Username or Email Address</label>
                                  </div>
      
                                  <div class="form-field float-label">
                                      <input id="password" class="password" type="password" name="password" placeholder="Password" required="" autocomplete="password">
                                      <button type="button" toggle=".password" class="password-toggle eye-ico-dark"></button>
                                      <label for="password">Password</label>
                                  </div>
      
                                  <div class="form-field">
                                      <input type="checkbox" id="rememberMe" name="remember_me" checked="checked" role="checkbox"> <label for="rememberMe">Keep Me Signed In </label>
                                      <span class="remember_me help_me_link" onclick="alert('If you are the only one who uses your computer \nor if you are not concerned about keeping\nprivate information from the others who use \nyour computer (such as your spouse), you can \nselect the   “Remember Me” option on the “Login”.\n\nWith this option you can move freely in and out \nof the Zacks Investment Research site with all \nits parts, and each time you return, \nZacks Investment Research will automatically sign \nyou in.\n\nFor your security you should keep your \nZacks Investment Research Member Name and\nPassword in a safe place so that you can refer to\nthem when you need to.\n\nThe Zacks Investment Research Team')">What does "Remember Me" do?</span>
                                  </div>
      
                                                                  <p><a href="/my-account/forgot.php?continue_to=%2F%2Fwww.zacks.com%2Findex.php" class="forgotPasswordLink" role="link">Don't Know Your Password?</a></p>
      
                                  <p><input type="submit" value="Sign In" class="fancy_button green" role="button"></p>
      
                                  <button type="button" tabindex="0" aria-label="Close" role="button" id="close_login">
                                      <span class="sr-only">Close this window</span>
                                  </button>
                              </form>
                          </div>
                          <!-- End: Login form -->
                      </li>
                      <li><a href="https://www.zacks.com/help/">Help</a></li>
      </ul>
                  </div>
      
    Zacks Investment Research Home
    • Join Now
    • Sign In
                                  <h1>Member Sign In</h1>
                                  <div class="form-field float-label">
                                      <input id="username" type="text" name="username" placeholder="Username or Email Address" autocomplete="username">
                                      <label for="username">Username or Email Address</label>
                                  </div>
                                  <div class="form-field float-label">
                                      <input id="password" type="password" name="password" placeholder="Password" autocomplete="password" class="password">
                                      <button type="button" toggle=".password" class="password-toggle eye-ico-dark"></button>
                                      <label for="password" class="visuallyhidden">Password</label>
                                  </div>
                                  <div class="form-field">
                                      <input type="checkbox" id="rememberMe_m" checked="checked"> <label for="rememberMe_m">Keep Me Signed In </label>
                                  </div>
                                  <p><a href="/my-account/forgot.php?continue_to=%2F%2Fwww.zacks.com%2Findex.php" class="forgotPasswordLink">Don't Know Your Password?</a></p>
                                  <p><input type="submit" value="Sign In" class="fancy_button green"></p>
                                  <button type="button" tabindex="0" aria-label="Close" role="button">
                                      <span id="mob_close_login">Close this window</span>
                                  </button>
                              </form>
                          </div>
                          <!-- End: Login form -->
                      </li></ul>
                  </div>
                  <div class="mobile_menu_trigger">
                      <button class="mobile_toggle">Menu</button>
                  </div>
              </div>
              <div class="header-bottom" id="header-bottom">
                  <!-- search goes here -->
                  <div class="relocated_searchbox">
                      <form id="search_form" name="frmsearch" method="post" action="/search.php" autocomplete="off">
                          <label id="header_search" for="search-q" class="sr-only">Quote or Search</label>
                          <input type="search" id="search-q" name="search-q" class="dropdown " value="" hasdefault="false" searchextras="all" placeholder="Quote or Search" autofocus="autofocus" aria-autocomplete="both" aria-controls="result_list_view" aria-haspopup="listbox" aria-expanded="false">
                          <button type="submit">
                              <i class="fa fa-search" aria-hidden="true"></i>
                              <span class="sr-only">Search</span>
                          </button>
                          <input type="hidden" class="search_mode" name="search_mode" value="">
                          <input type="hidden" class="ticker_type" name="ticker_type" value="">
                          <div id="result_list_view" class="result_container">
                              <div class="results_tickers"></div>
                          </div>
                      </form>
                  </div>
                  <script>document.getElementById('search-q').setAttribute("autocomplete", "off");</script>
              </div>
          </div>
      </header>
      <div class="mobile-menu-close-container">
          <button class="mobile-menu-close">
              <span class="icon close" role="img">
                  <!--?xml version="1.0" ?--><svg viewBox="0 0 32 32" xmlns="http://www.w3.org/2000/svg"><defs><style>.cls-1{fill:none;stroke:#000;stroke-linecap:round;stroke-linejoin:round;stroke-width:2px;}</style></defs><title></title><g id="cross"><line class="cls-1" x1="7" x2="25" y1="7" y2="25"></line><line class="cls-1" x1="7" x2="25" y1="25" y2="7"></line></g></svg>
              </span>
          </button>
      </div>
      <div class="mobile_menu">
          <div class="mobile_menu_header">
              <a href="/" class="logo_link">
                  <img src="https://staticx.zacks.com/images/zacks/logos/zacks_logo_200x57.png" alt="Zacks">
              </a>
          </div>
          <div class="mobile-menu-content">
              <ul>
              <li><a href="//www.zacks.com" class="mobile_menu_link">Home</a></li>
      
    • Stocks
    • Funds
    • Crypto
    • Earnings
    • Screening
    • Finance
    • Portfolio
    • Education
    • Video
    • Podcasts
    • Services
                      <div class="mobile-menu-sub-section">
                         <div class="mobile_menu_header">
                          <button class="mobile_menu_sub_section_close">
                              <span class="icon arrow-left" role="img">
                                  <!--?xml version="1.0" ?--><svg height="512px" id="Layer_12" style="enable-background:new 0 0 512 512;" version="1.1" viewBox="0 0 512 512" width="512px" xml:space="preserve" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><polygon points="352,115.4 331.3,96 160,256 331.3,416 352,396.7 201.5,256 "></polygon></svg>
                              </span>
                              <span class="menu-title">Services</span>
                          </button>
                         </div>
                         <div class="services-menu-content"> <h3>
                                      <span class="menu_name">Services Overview</span>
                                  </h3>
                                  <ul>
      
    • Zacks Ultimate
    • Zacks Investor Collection
    • Zacks Premium

    Investor Services

    Other Services

    Trading Services

    Innovators

            </ul>
         </div>
    </div>
    
    Zacks Investment Research Home
        <!-- nav toggle button for mobile menu -->
        <div class="nav-toggle">
            <span></span>
            <span></span>
            <span></span>
        </div>
        <!-- nav toggle button for mobile menu --><!-- investing channel adcode -->
    

    You are being directed to ZacksTrade, a division of LBMZ Securities and licensed broker-dealer. ZacksTrade and Zacks.com are separate companies. The web link between the two companies is not a solicitation or offer to invest in a particular security or type of security. ZacksTrade does not endorse or adopt any particular investment strategy, any analyst opinion/rating/report or any approach to evaluating individual securities.

    If you wish to go to ZacksTrade, click OK. If you do not, click Cancel.

    OK Cancel

    Zacks Investment Research

    Don't Miss the Latest Market News

    Get daily market news and analysis from Zacks Investment Research.

        <div class="btn-group" role="group" aria-label="Web Notification Button">
          <button type="button" class="btn secondary" onclick="setNoThanks()">Maybe Later</button>
          <button type="button" class="btn active">Get Notifications</button>
        </div>
    </div>
    <span class="arrow-left" aria-hidden="true"></span>
    
    <script defer="" src="https://staticx.zacks.com/js/zacks/web_notifications.js?v=20241223"></script>
    Back to top
    <div class="main_body">
    
    <!-- In Your face Start -->
    <!-- [login status: false] -->
    <!-- [iyf_cookie: ] -->
    <div id="iyf_pfp_signup">
        <script>
        var messageFromPage = 'Close This Panel X';
        var messageCloseFromPage = "Free Membership - discover all the benefits of free membership to Zacks.com";
        var pfpCookie = 'iyf_pfp_module';
        </script>
        <div id="iyf_pfp_content">
            <div id="IYFmodule" class="sp500-positive">
    <div id="positive-content">
        <h1 class="headline sub">A SPECIAL WELCOME GIFT FROM ZACKS.COM</h1>
        <h1 class="headline lead">Zacks' 7 Strongest Buys for January, 2025</h1>
        <div class="article">
            <p>See our "best of the best" short-term stocks. Hand-picked from 220 new Strong Buys, they could be the most profitable stocks you own over the next 90 days. Recent picks have climbed as much as +67.5% in 1 month. Our new recommendations may soar just as high.</p>
        </div>
    </div>
    <div id="negative-content">
        <h1 class="headline sub">A SPECIAL WELCOME GIFT FROM ZACKS.COM</h1>
        <h1 class="headline lead">Zacks' 7 Strongest Buys for January, 2025</h1>
        <div class="article">
            <p>See our "best of the best" short-term stocks. Hand-picked from 220 new Strong Buys, they could be the most profitable stocks you own over the next 90 days. Recent picks have climbed as much as +67.5% in 1 month. Our new recommendations may soar just as high. Today's market dip makes now an ideal time to get in.</p>
        </div>
    </div>
    <form action="/registration/pfp/step2.php" method="post" name="iyfModuleForm" id="iyfModuleForm" onsubmit="return false;">
        
        <label class="label" for="email">*Enter your best email address</label>
        <input name="email" type="text" class="textbox" id="email" onfocus="changeInputBackground('email','focus')" onclick="changeInputBackground('email','click')" onblur="changeInputBackground('email','blur')" autocomplete="off" required="">
        <input name="ADID" type="hidden" value="ZCOM_IYFHOME_7BEST_SPECIAL">
        <input name="ALERT" type="hidden" value="IYF_HOME_554_A381">
        <button class="input_button submit-iyf-no-exp-api" id="button"><span aria-hidden="true">▸</span> Click Here, It's Really Free</button>
        <p class="footnote">No cost. No credit card. No obligation to buy anything ever. Past performance is no guarantee of future results.<br>We respect your <a class="link" href="javascript:void(0);" onclick="javascript:window.open('/privacy_ecom.html','Privacy','width=390,height=475,menubar=no,scrollbars=yes,toolbar=no,left=200,top=72')" title="Privacy Policy">privacy.</a> <a class="link" href="/ultimate/special_reports.php" title="Zacks Ultimate subscribers can access this report here"><em>Zacks Ultimate</em> Member? Click here</a>.</p>
        
    </form>
    
    <style>
    #featured_zacks_rank_stocks {padding:3px 0 0 10px;}                      
    #IYFmodule {padding:15px; border:solid 2px #0f611c; background:#fff; overflow:hidden; position:relative; text-align:center;}
    #IYFmodule, #IYFmodule .headline.sub, #IYFmodule p, #IYFmodule .article {font-size:16px; line-height:1.375; color:#333; margin:10px 0;} #IYFmodule a {color:#1d5eb5;}
    #IYFmodule .headline.lead {font-size:24px; line-height:1.15; color:#0f611c; margin:20px 0 0; padding:0;}
    #IYFmodule .headline.sub {padding:10px; margin:-15px -15px 10px; background:#0f611c; color:#fff; text-shadow:none;}
    #IYFmodule p.footnote {width:100%; font-size:11px; line-height:1.375; margin:5px 0 0;}
    
    /* FORM ELEMENTS */
    #IYFmodule #iyfModuleForm {position:relative;}
    #IYFmodule .textbox, #IYFmodule .input_button.submit-iyf-no-exp-api {width:100% !important; display:block; border-radius:2px; margin:5px 0; font-size:16px; line-height:1.15; text-align:center; box-sizing:border-box;}
    #IYFmodule .textbox {position:relative; padding:10px 3px; border:solid 1px #484848; z-index:1; background:none;}
    #IYFmodule .label {position:absolute; top:18px; left:0; right:0; font-size:16px; line-height:1; color:#666;}
    #IYFmodule .input_button.submit-iyf-no-exp-api {
        padding:10px 20px !important; border:solid 1px #484848; font-weight:bold; text-shadow:0 1px 2px #333; color:#fff; cursor:pointer; float:right; box-shadow:1px 1px 1px rgba(0, 0, 0, .25);
        background:#F06C00; background:linear-gradient(to bottom, #F06C00 0%, #C25700 100%); /* W3C, IE10+, FF16+, Chrome26+, Opera12+, Safari7+ */
        filter: progid:DXImageTransform.Microsoft.gradient( startColorstr='#F06C00', endColorstr='#C25700', GradientType=0); /* IE6-9 */
    }
    
    /* POSITIVE & NEGATIVE STYLES */
    #positive-content {display:block;} #negative-content {display:none;}
    #positive-content .highlight {background:#0f611c; color:#fff;}
    #negative-content .headline.sub, #negative-content .highlight {background:#a50000; color:#fff;}
    #negative-content .headline.lead {color:#000;}
    
    #IYFmodule.sp500-negative {border-color:#a50000;}
    #IYFmodule.sp500-negative #positive-content {display:none;} #IYFmodule.sp500-negative #negative-content {display:block;}
    
    /* @MEDIA QUERIES */
    @media screen and (min-width:1024px) {#IYFmodule {min-height:270px;} #IYFmodule.sp500-negative {min-height:290px;} #IYFmodule .textbox, #IYFmodule .input_button.submit-iyf-no-exp-api {display:inline-block; max-width:280px;} #IYFmodule .label {width:280px; left:16px; right:auto;} #negative-content {height:219px;}}
    @media screen and (max-width:1023px) {#featured_zacks_rank_stocks {width:auto !important; float:none !important; padding:3px;} #IYFmodule .article {display:none;} #IYFmodule .label {top:12px;}}
    
    </style> <script src="https://staticx-tuner.zacks.com/woas/adv/services/js/experian-validation-1.0.4-iyf.js?date=20241211"></script>

    @devatbosch

    Hi @ordersofmagnitude,

    Thanks for the feedback. After a few rounds of testing, the code below is working as expected for me. Can you check and let me know?

              import asyncio
              import re
              from bs4 import BeautifulSoup
              from crawl4ai import AsyncWebCrawler
              
              class ArticleScraper:
                  async def ai_process_news(self, item):
                      # JavaScript commands for handling cookies and scrolling
                      js_commands_initial = [
                          "document.querySelector('#accept_cookie')?.click();",  # Click "Accept Cookies"
                          "window.scrollTo(0, 900);",  # Scroll down to simulate user interaction
                      ]
              
                      # JavaScript commands for expanding the article
                      js_commands_show_article = [
                          "document.querySelector('span.show_article')?.click();",  # Click "Show Full Article"
                      ]
              
                      async with AsyncWebCrawler(verbose=True, headless=True, sleep_on_close=False) as crawler:
                          await crawler.arun(
                              url=item["link"],
                              simulate_user=True,
                              override_navigator=True,
                              magic=True,
                              js_code=js_commands_initial
                          )
              
                          session_id = "dynamic-session"
                          result = await crawler.arun(
                              url=item["link"],
                              session_id=session_id,
                              wait_for="css:span.show_article",  # Wait for the "Show Full Article" button
                              js_code=js_commands_show_article,  # Execute JavaScript to click the button
                              css_selector=".commentary_body"  # Extract article content
                          )
              
                      clean_content = self.clean_html_content(result.html)
                      return clean_content
              
                  def clean_html_content(self, html_content):
                      soup = BeautifulSoup(html_content, "html.parser")
              
                      # Target the main article content container
                      article_container = soup.select_one(".commentary_body")  # Adjust if needed
              
                      if not article_container:
                          return "Unable to find the main article content."
              
                      # Remove unnecessary elements (e.g., links, images, scripts)
                      for tag in article_container.find_all(["a", "img", "script"]):
                          tag.decompose()
              
                      clean_text = article_container.get_text(separator="\n", strip=True)
                      
                      # Enhance formatting for indentation and readability
                      formatted_text = self.format_article_content(clean_text)
                      return formatted_text
              
                  def format_article_content(self, text):
                      """
                      Format the article content for better readability.
                      
                      Args:
                          text (str): Raw cleaned text from the article.
                      
                      Returns:
                          str: Formatted and indented text.
                      """
                      # Normalize spaces
                      text = re.sub(r'\s+', ' ', text)
              
                      # Split after each period so every sentence becomes its own chunk
                      paragraphs = re.split(r'(?<=\.)\s+', text)
              
                      # Reassemble with a blank line between chunks
                      return "\n\n".join(paragraphs)
              
                  def process_news(self, item):
                      article = asyncio.run(self.ai_process_news(item))
                      return article
              
              
              # Usage
              if __name__ == "__main__":
                  scraper = ArticleScraper()
                  article_url = "https://www.zacks.com/stock/news/2375522/snowflake-rises-post-q3-earnings-how-should-you-play-the-stock"
                  item = {"link": article_url}
                  content = scraper.process_news(item)
                  print(content)
    

    @devatbosch

    Here is a screenshot of the output from the code above.

    @devatbosch

    @unclecode, tagging you in case you can help @ordersofmagnitude with any further info.

    @devatbosch

    @ordersofmagnitude, any update?

    @ordersofmagnitude
    Author

    ordersofmagnitude commented Jan 7, 2025

    Hi devatbosch: thank you sooo much for your help!

    Update:
    I get "Unable to find the main article content."

    Quick questions:

    • why is it "document.querySelector('span.show_article')?.click();" rather than "document.querySelector('span.show_article').click();"?
    • why is it wait_for="css:span.show_article" instead of "css:.show_article"?
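On the two questions above, some general background may help. In JavaScript, querySelector returns null when nothing matches, so calling .click() on the result throws a TypeError and stops the script. The ?. operator (optional chaining) skips the call when the result is null. As for the selectors, .show_article matches any element with that class, while span.show_article matches only span elements with that class. If the button is a span, both select the same element. The sketch below illustrates both points using only the Python standard library. The helper names and the sample HTML are hypothetical and are not taken from the Zacks page or from crawl4ai.

```python
import json
from html.parser import HTMLParser


def safe_click_js(selector: str) -> str:
    """Build a JS snippet that clicks the first match, or does nothing if there is none."""
    # json.dumps produces a correctly quoted JS string literal for the selector.
    return f"document.querySelector({json.dumps(selector)})?.click();"


class SimpleSelectorMatcher(HTMLParser):
    """Collect tag names matching a 'tag.class' or '.class' selector (illustration only)."""

    def __init__(self, selector: str):
        super().__init__()
        tag, _, cls = selector.partition(".")
        self.tag = tag or None  # None means "any element"
        self.cls = cls
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.cls in classes and self.tag in (None, tag):
            self.matches.append(tag)


def match_tags(selector: str, html: str) -> list:
    """Return the tag names of all elements in html that match the simple selector."""
    matcher = SimpleSelectorMatcher(selector)
    matcher.feed(html)
    return matcher.matches


# Hypothetical markup: two elements share the class, only one is a span.
sample = (
    '<div class="show_article">teaser</div>'
    '<span class="show_article">Show Full Article</span>'
)

print(safe_click_js("span.show_article"))
print(match_tags(".show_article", sample))      # class only: matches both elements
print(match_tags("span.show_article", sample))  # tag plus class: matches the span only
```

The matcher here handles only the two selector shapes discussed in this thread; a real browser or an HTML library supports the full CSS selector syntax.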

    @unclecode
    Owner

    @ordersofmagnitude Hi, thanks for trying Crawl4ai. Let’s see what we can do for you. Just one question: are you trying to extract only the article contents? This page contains a lot of things: images, ads, headers, footers, and other site elements. My assumption is that you want to extract the article about Snowflake rising after its third-quarter earnings. Is that correct? Can you confirm?

    @devatbosch Thanks for tagging me and helping out here 👍

    @unclecode unclecode self-assigned this Jan 7, 2025
    @ordersofmagnitude
    Author

    ordersofmagnitude commented Jan 7, 2025

    Hi unclecode,

    I am trying to extract the article contents. My company has the appropriate infrastructure for data cleaning; my role is to extract the raw HTML containing the article. When I use crawl4ai, I get the output described above. There's a bug, and I'm not sure how to fix it.

    Zacks Investment Research Home

    Image: Bigstock
    Snowflake Rises Post Q3 Earnings: How Should You Play the Stock?

    Nida Ali Saigal November 26, 2024

    MSFT NVDA NOW

    SNOW

    Better trading starts here.
    Follow
    Hide Full Article

    Snowflake (SNOW - Free Report) shares jumped 32.8% following its impressive third-quarter fiscal 2025 results on Nov. 20. SNOW reported non-GAAP earnings of 20 cents per share, beating the Zacks Consensus Estimate by 33.33% and reflecting strong operational performance and customer-base expansion.

    SNOW’s earnings beat the Zacks Consensus Estimate in three of the trailing four quarters while missing once, the average earnings surprise being 35.39%.

    Find the latest EPS estimates and surprises on Zacks Earnings Calendar.

    SNOW shares have plunged 13.8% year to date against the Zacks Computer & Technology sector’s growth of 27.7%.

    It has been suffering from stiff competition from companies like Databricks, as well as increasing pricing pressure and growing GPU-related costs as it aggressively invests in AI initiatives.

    YTD Performance

    Image Source: Zacks Investment Research

    Nevertheless, Snowflake’s prospects are robust, driven by a strong portfolio and an expanding partner base.

    In the reported quarter, Snowflake partnered with Microsoft (MSFT - Free Report) and ServiceNow (NOW - Free Report) to improve data interoperability, making it easier for customers to move data in and out of Snowflake and develop applications faster.

    So, the question that arises is - how should investors now play SNOW stock after third-quarter fiscal 2025 results? Let’s dig deep to find out.

    SNOW’s Strong Portfolio: A Key Catalyst

    SNOW benefits from a strong portfolio with new capabilities, including Marketplace, Listing Auto-Fulfillment & Monetization, account replication & failover, Query Acceleration Service, geospatial analytics, Snowpark and Snowpipe Streaming.

    In the third quarter of fiscal 2025, Snowpark contributed a significant portion toward SNOW’s top-line growth and is on track to be roughly 3% of total revenues.

    Iceberg tables, Hybrid tables and Cortex Large Language Model (LLM) and machine learning-powered functions strengthen SNOW’s offerings.

    More than 3,200 accounts adopted SNOW’s AI and ML features in the reported quarter, and approximately 500 accounts adopted Iceberg.

    SNOW has an expanding partner base that includes Amazon, Microsoft, ServiceNow, NVIDIA (NVDA - Free Report), Fiserv, EY, Deloitte, LTMindtree, Next Pathway and S&P Global, among others.

    Snowflake has adopted NVIDIA AI Enterprise software to integrate NeMo Retriever microservices into Snowflake Cortex AI, its fully managed LLM and vector search service. The integration will help enterprises to smoothly connect custom models to diverse business data and deliver highly accurate responses.

    SNOW’s recent announcement of its partnership with Anthropic brings the latter’s advanced AI models to Snowflake Cortex AI, enabling enterprises to build secure, cutting-edge AI applications with ease and flexibility.

    Snowflake’s growing client base includes industry leaders like Disney, Accor, Comcast, Chipotle, Kraft Heinz, NBC Universal, Sanofi and Toyota. Disney utilizes SNOW’s platform for in-park optimization.

    SNOW remained acquisitive in the reported quarter. Its planned acquisition of Datavolo strengthens Snowflake’s platform for both structured and unstructured data, enhancing flexibility and simplifying data engineering workloads for customers.

    SNOW Offers Positive Guidance for Q4 and Fiscal 2025

    For the fourth quarter of fiscal 2025, Snowflake expects product revenues in the range of $906-$911 million. The figure indicates year-over-year growth of approximately 23%.

    The Zacks Consensus Estimate for fiscal fourth-quarter 2025 revenues is currently pegged at $952.55 million, indicating 22.96% year-over-year growth. The consensus mark for earnings is pegged at 17 cents per share, up 41.67% over the past week. The figure suggests a decrease of 51.43% year over year.

    For fiscal 2025, SNOW expects product revenues to increase 29% year over year to $3.43 billion.

    The non-GAAP product gross margin is expected to be 76% and the non-GAAP operating margin is expected to be 5%.

    The non-GAAP adjusted free cash flow margin is expected to be 26% in fiscal 2025.

    The Zacks Consensus Estimate for SNOW’s fiscal 2025 revenues is pegged at $3.59 billion, indicating year-over-year growth of 28%. The consensus mark for earnings is pegged at 68 cents per share, up by 17.24% over the past week, and suggesting a decrease of 30.61% on a year-over-year basis.

    Snowflake Inc. Price and Consensus

    Snowflake Inc. price-consensus-chart | Snowflake Inc. Quote

    SNOW Shares Trading at a Premium

    Snowflake’s Value Score of F suggests a stretched valuation at the moment.

    Currently, SNOW is trading at a premium, with a forward 12-month Price/Sales of 13.6X compared with the sector’s 6.13X and the Internet Software Industry’s 3.02X.

    Price/Sales Ratio (F12M)

    Image Source: Zacks Investment Research

    SNOW shares have been trading above the 50-day and 200-day moving average, indicating a bullish trend.

    SNOW Shares Trade Above 50-Day and 200-Day SMA

    Image Source: Zacks Investment Research

    SNOW: Buy, Sell, or Hold the Stock?

    Snowflake is a risky bet in the near term, given its modest growth prospect and a stretched valuation. However, investors who already own the stock may expect the company's growth prospects to be rewarding, given its strong portfolio and partner base over the long term.

    SNOW currently has a Zacks Rank #3 (Hold), suggesting that it may be wise to wait for a more favorable entry point in the stock. You can see the complete list of today’s Zacks #1 Rank (Strong Buy) stocks here.

    5 Stocks Set to Double
