# Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks

- **Built-in Utilities**:
  - Rate limiting and retry logic
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)

- **Best Practices**:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices

## Project Structure

```
.
├── config.py                 # Configuration and settings
├── requirements.txt          # Python dependencies
├── .env.example              # Environment variables template
│
├── scrapers/                 # Scraper implementations
│   ├── base_scraper.py       # Abstract base class
│   ├── basic_scraper.py      # requests + BeautifulSoup
│   ├── selenium_scraper.py   # Selenium WebDriver
│   ├── jina_scraper.py       # Jina AI integration
│   ├── firecrawl_scraper.py  # Firecrawl integration
│   ├── agentql_scraper.py    # AgentQL workflows
│   └── multion_scraper.py    # Multion AI agent
│
├── utils/                    # Utility modules
│   ├── logger.py             # Logging configuration
│   ├── rate_limiter.py       # Rate limiting
│   └── retry.py              # Retry with backoff
│
├── data_processors/          # Data processing
│   ├── validator.py          # Data validation
│   └── storage.py            # Data storage
│
├── examples/                 # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                    # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```

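`utils/rate_limiter.py` in the tree above isn't shown in this README; one common shape for it is a small limiter that enforces a floor on the time between consecutive requests (a sketch under that assumption, not the project's actual implementation):

```python
import time

class RateLimiter:
    """Block until at least `delay` seconds have passed since the last call."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)  # sleep off the remaining delay
        self._last = time.monotonic()

# Demo with a short delay: the first call passes immediately,
# the next two are throttled to ~0.1s apart.
limiter = RateLimiter(delay=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
total = time.monotonic() - start
```
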
## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd <project-directory>
   ```

2. **Create virtual environment**:
   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # Unix/MacOS
   source venv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure environment variables**:
   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```

## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```

### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )

    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```

## Usage Examples

See the `examples/` directory for detailed usage examples:

- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)

Run examples:

```bash
python examples/basic_example.py
```

## Configuration

Edit `config.py` or set environment variables in `.env`:

```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```

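`config.py` itself isn't reproduced here, but a typical way to pick these settings up, with the defaults above as fallbacks, is plain `os.getenv` (projects often layer `python-dotenv` on top so `.env` is loaded automatically; that step is omitted in this sketch):

```python
import os

# Defaults mirror the values shown above; environment variables override them.
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
JINA_API_KEY = os.getenv("JINA_API_KEY")  # None when not configured
```
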
## Data Storage

Save scraped data in multiple formats:

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```

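The companion `data_processors/validator.py` isn't shown in this README; validation before storage could be as simple as rejecting records that miss required fields and sanitizing string values. A hedged sketch along those lines (not the module's real interface):

```python
import html

def validate_record(record, required=("title", "url")):
    """Return a cleaned copy of a scraped record, or None if it is unusable."""
    if not all(record.get(field) for field in required):
        return None  # drop records missing required data
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = html.unescape(value).strip()  # basic sanitization
        cleaned[key] = value
    return cleaned

good = validate_record({"title": " Hello &amp; welcome ", "url": "https://example.com"})
bad = validate_record({"title": "", "url": "https://example.com"})
```
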
## Testing

Run the test suite with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:

```bash
pytest tests/test_basic_scraper.py -v
```

## Best Practices

1. **Respect robots.txt**: Always check and follow website scraping policies
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set realistic User-Agent headers
4. **Error Handling**: Implement robust error handling and retries
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging

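The robots.txt check in point 1 can be automated with the standard library's `urllib.robotparser`. This sketch parses rules from a string for illustration; in practice you would call `set_url(...)` and `read()` to fetch the site's live `robots.txt` (the bot name here is a placeholder):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="MyScraperBot"):
    """Check a URL against robots.txt rules before scraping it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
allowed = is_allowed(rules, "https://example.com/public/page")
blocked = is_allowed(rules, "https://example.com/private/page")
```
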
## Tool Selection Guide

- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios

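The guide above is easy to encode as a small dispatcher. The return values below are illustrative tags (kept as strings so the sketch stands alone); a real version would import and instantiate the corresponding classes from `scrapers/`:

```python
def pick_scraper(needs_js=False, deep_crawl=False, multi_step=False, exploratory=False):
    """Map page characteristics to the scraper family suggested above."""
    if exploratory:          # unpredictable scenarios
        return "multion"
    if multi_step:           # login, forms, multi-step workflows
        return "agentql"
    if deep_crawl:           # hierarchical site-wide crawling
        return "firecrawl"
    if needs_js:             # JavaScript-rendered content
        return "selenium"
    return "basic"           # static HTML is the cheapest path
```
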
## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.