Delete README.md
This commit is contained in:
parent
644ea16f94
commit
f4285381bd
1 changed file with 0 additions and 233 deletions
README.md
@@ -1,233 +0,0 @@
# Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks

- **Built-in Utilities**:
  - Rate limiting and retry logic (see the sketch after this list)
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)

- **Best Practices**:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices
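As a taste of the utilities, here is a minimal sketch of the retry-with-backoff idea behind `utils/retry.py` (the shipped decorator's name and signature may differ):

```python
# Hypothetical sketch only -- the real utils/retry.py may expose a different interface.
import time
from functools import wraps


def retry(max_retries: int = 3, base_delay: float = 1.0):
    """Retry a function with exponential backoff between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: propagate the last error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```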
## Project Structure

```
.
├── config.py                  # Configuration and settings
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
│
├── scrapers/                  # Scraper implementations
│   ├── base_scraper.py        # Abstract base class
│   ├── basic_scraper.py       # requests + BeautifulSoup
│   ├── selenium_scraper.py    # Selenium WebDriver
│   ├── jina_scraper.py        # Jina AI integration
│   ├── firecrawl_scraper.py   # Firecrawl integration
│   ├── agentql_scraper.py     # AgentQL workflows
│   └── multion_scraper.py     # Multion AI agent
│
├── utils/                     # Utility modules
│   ├── logger.py              # Logging configuration
│   ├── rate_limiter.py        # Rate limiting
│   └── retry.py               # Retry with backoff
│
├── data_processors/           # Data processing
│   ├── validator.py           # Data validation
│   └── storage.py             # Data storage
│
├── examples/                  # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                     # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
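Each scraper inherits shared plumbing from `scrapers/base_scraper.py`. A hedged sketch of what that abstract base plausibly looks like, inferred from the usage examples below (the actual class may define more hooks):

```python
# Illustrative sketch of scrapers/base_scraper.py; the real base class may differ.
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Shared context-manager plumbing for all scraper implementations."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions

    @abstractmethod
    def scrape(self, url: str, **kwargs) -> dict:
        """Fetch a URL and return a result dict carrying at least 'success'."""

    def close(self) -> None:
        """Release sessions, drivers, or other resources; override as needed."""
```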
## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd <project-directory>
   ```

2. **Create virtual environment**:
   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # Unix/macOS
   source venv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure environment variables**:
   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```

## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```
### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )

    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```

## Usage Examples

See the `examples/` directory for detailed usage examples:

- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)

Run examples:

```bash
python examples/basic_example.py
```
## Configuration

Edit `config.py` or set environment variables in `.env`:

```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
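A hedged sketch of how `config.py` might surface these values, assuming the project uses `python-dotenv` (the variable names match `.env.example`; the loading code itself is illustrative):

```python
# Illustrative only -- the real config.py may load settings differently.
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # pull values from .env into the process environment

JINA_API_KEY = os.getenv("JINA_API_KEY")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")

# Numeric settings, with the defaults shown above
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```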
## Data Storage

Save scraped data in multiple formats:

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
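Under the hood, `save_json` plausibly reduces to a thin wrapper over the standard library; a minimal sketch of that idea (the real method may add path handling or validation):

```python
# Hypothetical internals of DataStorage.save_json; shown only to set expectations.
import json
from pathlib import Path


def save_json(data, filename: str) -> None:
    """Write data as pretty-printed UTF-8 JSON, creating parent dirs if needed."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```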
## Testing

Run tests with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:

```bash
pytest tests/test_basic_scraper.py -v
```
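New tests follow the usual pytest conventions; an illustrative example of the shape of a test in `tests/` (the real suite's assertions may differ):

```python
# Illustrative test shape only; the real suite's assertions may differ.
import json

from data_processors.storage import DataStorage


def test_save_json_roundtrip(tmp_path):
    """save_json should write JSON that reads back unchanged."""
    storage = DataStorage()
    out = tmp_path / "output.json"
    storage.save_json({"title": "hello"}, str(out))

    assert json.loads(out.read_text(encoding="utf-8")) == {"title": "hello"}
```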
## Best Practices

1. **Respect robots.txt**: Always check and follow website scraping policies (see the sketch after this list)
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set realistic User-Agent headers
4. **Error Handling**: Implement robust error handling and retries
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging
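For the robots.txt check, the standard library is enough; a minimal sketch using `urllib.robotparser` (the user agent string is a placeholder):

```python
# Standard-library robots.txt check; the user agent below is a placeholder.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch("MyScraperBot/1.0", "https://example.com/some/page"):
    print("Fetching is allowed for this user agent")
else:
    print("Skip this page: disallowed by robots.txt")
```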
## Tool Selection Guide

- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios

## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.