# Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks
- **Built-in Utilities**:
  - Rate limiting and retry logic
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)
- **Best Practices**:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices
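
The retry utility listed above (`utils/retry.py`) can be sketched as a small decorator; the name `retry_with_backoff` and its parameters are assumptions for illustration, not the project's actual API:

```python
import functools
import time

def retry_with_backoff(max_retries=3, base_delay=1.0):
    """Retry a function with exponential backoff on any exception."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: propagate the error
                    # wait base_delay, 2x, 4x, ... between attempts
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```

Applied as `@retry_with_backoff(max_retries=3)` on a scrape function, this turns transient network failures into delayed retries instead of crashes.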
## Project Structure

```
.
├── config.py                 # Configuration and settings
├── requirements.txt          # Python dependencies
├── .env.example              # Environment variables template
├── scrapers/                 # Scraper implementations
│   ├── base_scraper.py       # Abstract base class
│   ├── basic_scraper.py      # requests + BeautifulSoup
│   ├── selenium_scraper.py   # Selenium WebDriver
│   ├── jina_scraper.py       # Jina AI integration
│   ├── firecrawl_scraper.py  # Firecrawl integration
│   ├── agentql_scraper.py    # AgentQL workflows
│   └── multion_scraper.py    # Multion AI agent
├── utils/                    # Utility modules
│   ├── logger.py             # Logging configuration
│   ├── rate_limiter.py       # Rate limiting
│   └── retry.py              # Retry with backoff
├── data_processors/          # Data processing
│   ├── validator.py          # Data validation
│   └── storage.py            # Data storage
├── examples/                 # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
└── tests/                    # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
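
The tree above includes `utils/logger.py` for logging configuration; a minimal sketch of what such a module might contain, built on the standard `logging` module (the function name `get_logger` is an assumption):

```python
import logging

def get_logger(name, level=logging.INFO):
    """Return a named logger with a console handler attached exactly once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```

The `if not logger.handlers` guard keeps repeated calls from stacking duplicate handlers, which would otherwise print each log line more than once.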
## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd <project-directory>
   ```
2. **Create virtual environment**:
   ```bash
   python -m venv venv
   # Windows
   venv\Scripts\activate
   # Unix/MacOS
   source venv/bin/activate
   ```
3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
4. **Configure environment variables**:
   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```
## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")
    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```

### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )
    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )
    if result["success"]:
        print(result["content"])
```
## Usage Examples ## Usage Examples
See the `examples/` directory for detailed usage examples: See the `examples/` directory for detailed usage examples:
- `basic_example.py` - Static page scraping - `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction - `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.) - `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)
Run examples: Run examples:
```bash ```bash
python examples/basic_example.py python examples/basic_example.py
``` ```
## Configuration

Edit `config.py` or set environment variables in `.env`:

```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
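
Inside `config.py`, these variables would typically be read with the stdlib's `os.getenv` and cast to the right types; a minimal sketch (the variable names match `.env` above, the fallback defaults are assumptions):

```python
import os

# Each setting falls back to a default when the environment variable is unset.
JINA_API_KEY = os.getenv("JINA_API_KEY", "")
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```

A library such as python-dotenv can load `.env` into the process environment before these lines run, so the same code works locally and in CI.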
## Data Storage

Save scraped data in multiple formats:

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
## Testing

Run tests with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:

```bash
pytest tests/test_basic_scraper.py -v
```
## Best Practices

1. **Respect robots.txt**: Always check and follow website scraping policies
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set realistic User-Agent headers
4. **Error Handling**: Implement robust error handling and retries
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging
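
Point 1 can be automated with the standard library's `urllib.robotparser`; a minimal sketch (the User-Agent string is a placeholder, not the project's actual identity):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyScraperBot/1.0"  # placeholder; identify your real bot

def allowed_by_robots(url: str) -> bool:
    """Check a site's robots.txt before scraping a URL."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # downloads and parses robots.txt over HTTP
    return parser.can_fetch(USER_AGENT, url)
```

Call `allowed_by_robots(url)` before fetching each new path and skip the URL when it returns False.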
## Tool Selection Guide

- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios
## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.