This project scrapes specific web pages and converts the content into HTML files, organizing them into a knowledge base. The project uses Selenium for web scraping and BeautifulSoup for HTML parsing.
Prerequisites:

- Python 3.x
- pip (Python package installer)
- Google Chrome browser
- ChromeDriver
Installation:

- Clone the repository:

  ```sh
  git clone https://github.com/Glebuar/KnowledgeBaseBuilder.git
  cd knowledge-base-builder
  ```
- Create a virtual environment and activate it:

  ```sh
  python -m venv .venv
  source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
  ```
- Install the dependencies:

  ```sh
  pip install -r requirements.txt
  ```
- Download ChromeDriver:
  - Download the version of ChromeDriver that matches your installed Chrome browser from the official ChromeDriver downloads page.
  - Place the `chromedriver` executable in a known location.
- Update `config.json`:
  - Make sure the `config.json` file is present in the root directory with the correct structure, and update `chrome_driver_path` to the path where you placed the `chromedriver` executable.

  Example `config.json`:

  ```json
  {
    "chrome_driver_path": "path/to/chromedriver",
    "urls": [
      {
        "url": "https://example.com/page1",
        "children": []
      },
      {
        "url": "https://example.com/page2",
        "children": [
          {
            "url": "https://example.com/page2-1",
            "children": []
          }
        ]
      }
    ]
  }
  ```
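The `urls` field is a tree: each entry has a `url` and a list of `children` with the same shape. As a rough sketch of how such a config could be consumed (the function name `flatten_urls` is an assumption for illustration, not the project's actual code), a recursive walk collects every URL depth-first:

```python
# Sketch (assumption, not project code): load the example config and
# flatten the nested "urls" tree into a depth-first list of page URLs.
import json

def flatten_urls(entries):
    """Collect "url" values from [{"url": ..., "children": [...]}] entries, depth-first."""
    urls = []
    for entry in entries:
        urls.append(entry["url"])
        urls.extend(flatten_urls(entry.get("children", [])))
    return urls

config = json.loads("""
{
  "chrome_driver_path": "path/to/chromedriver",
  "urls": [
    {"url": "https://example.com/page1", "children": []},
    {"url": "https://example.com/page2", "children": [
      {"url": "https://example.com/page2-1", "children": []}
    ]}
  ]
}
""")

print(flatten_urls(config["urls"]))
# → ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page2-1']
```

Keeping the tree shape (rather than a flat list) lets child pages be saved under their parent's folder in the generated knowledge base.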
- Run the script:

  ```sh
  python main.py
  ```