# Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks

- **Built-in Utilities**:
  - Rate limiting and retry logic (see the sketch after this list)
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)

- **Best Practices**:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices
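
The `utils/` modules behind the rate-limiting and retry features are not documented in detail in this README. As a rough, self-contained illustration of the pattern (the names `retry_with_backoff` and `RateLimiter` are placeholders, not necessarily the actual interface of `utils/retry.py` or `utils/rate_limiter.py`):

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def retry_with_backoff(max_retries=3, base_delay=1.0):
    """Retry a function on exception, doubling the delay after each failed attempt."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_retries:
                        raise
                    logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator


class RateLimiter:
    """Block until at least `delay` seconds have passed since the previous call."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self._last_call = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_call = time.monotonic()
```

A scraper would typically decorate its request method with the retry helper and call the limiter before every HTTP request.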
## Project Structure

```
.
├── config.py                  # Configuration and settings
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
│
├── scrapers/                  # Scraper implementations
│   ├── base_scraper.py        # Abstract base class
│   ├── basic_scraper.py       # requests + BeautifulSoup
│   ├── selenium_scraper.py    # Selenium WebDriver
│   ├── jina_scraper.py        # Jina AI integration
│   ├── firecrawl_scraper.py   # Firecrawl integration
│   ├── agentql_scraper.py     # AgentQL workflows
│   └── multion_scraper.py     # Multion AI agent
│
├── utils/                     # Utility modules
│   ├── logger.py              # Logging configuration
│   ├── rate_limiter.py        # Rate limiting
│   └── retry.py               # Retry with backoff
│
├── data_processors/           # Data processing
│   ├── validator.py           # Data validation
│   └── storage.py             # Data storage
│
├── examples/                  # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                     # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd <project-directory>
   ```

2. **Create a virtual environment**:
   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # Unix/macOS
   source venv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure environment variables**:
   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```
## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```

### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )

    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```
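
### Other Scrapers

The Firecrawl, AgentQL, and Multion scrapers are not shown above. Assuming they expose the same context-manager and result-dictionary interface as the scrapers in the previous examples (the exact parameters and result keys may differ; check the corresponding module in `scrapers/`), a Firecrawl call would look roughly like this:

```python
from scrapers.firecrawl_scraper import FirecrawlScraper

# Hypothetical usage mirroring the other scrapers' interface;
# see scrapers/firecrawl_scraper.py for the actual parameters.
with FirecrawlScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        print(result["content"])
```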
## Usage Examples

See the `examples/` directory for detailed usage examples:

- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)

Run examples:
```bash
python examples/basic_example.py
```
## Configuration

Edit `config.py` or set environment variables in `.env`:

```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
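
How `config.py` consumes these variables is not shown here. As a rough sketch, assuming `python-dotenv` is available (otherwise plain `os.environ` works the same way for variables exported in the shell), it might look like:

```python
# Illustrative only; the real config.py may be structured differently.
import os

from dotenv import load_dotenv  # assumed dependency; loads variables from .env

load_dotenv()

JINA_API_KEY = os.getenv("JINA_API_KEY", "")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")

RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```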
## Data Storage

Save scraped data in multiple formats:

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
## Testing

Run the test suite with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:
```bash
pytest tests/test_basic_scraper.py -v
```
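
New tests follow the usual pytest conventions. As a minimal sketch (the assertion assumes `DataStorage.save_json` writes standard JSON, which may not match the actual implementation), a round-trip test could look like:

```python
# Illustrative sketch, not part of the existing suite.
import json

from data_processors.storage import DataStorage


def test_save_json_roundtrip(tmp_path):
    """Data saved as JSON should be readable back unchanged (assumed behaviour)."""
    data = [{"title": "Example Domain", "url": "https://example.com"}]
    output = tmp_path / "output.json"

    DataStorage().save_json(data, str(output))

    assert json.loads(output.read_text()) == data
```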
## Best Practices

1. **Respect robots.txt**: Always check and follow each site's scraping policy (see the sketch after this list)
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set a realistic User-Agent header
4. **Error Handling**: Implement robust error handling and retries
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging
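
A minimal sketch of the first three points using only the standard library and `requests` (the User-Agent string and URLs are placeholders):

```python
import time
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "MyScraperBot/1.0 (+https://example.com/contact)"  # placeholder identity

# Check robots.txt before fetching
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

url = "https://example.com/some/page"
if robots.can_fetch(USER_AGENT, url):
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    response.raise_for_status()
    time.sleep(2)  # simple fixed delay between requests
else:
    print(f"robots.txt disallows fetching {url}")
```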
## Tool Selection Guide

- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios

## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.