Delete README.md

This commit is contained in:
Creepso 2025-10-31 18:06:57 +00:00
parent 644ea16f94
commit f4285381bd

README.md | 233 deletions

@@ -1,233 +0,0 @@
# Web Scraping Project
A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.
## Features
- **Multiple Scraping Methods**:
- Basic HTTP requests with BeautifulSoup
- Selenium for JavaScript-heavy sites
- Jina AI for intelligent text extraction
- Firecrawl for deep web crawling
- AgentQL for complex workflows
- Multion for exploratory tasks
- **Built-in Utilities**:
  - Rate limiting and retry logic (a sketch follows this list)
- Comprehensive logging
- Data validation and sanitization
- Multiple storage formats (JSON, CSV, text)
- **Best Practices**:
- PEP 8 compliant code
- Modular and reusable components
- Error handling and recovery
- Ethical scraping practices
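The retry helper lives in `utils/retry.py`; its exact API is not shown in this README, so the following is only a minimal sketch of the kind of retry-with-backoff decorator that module provides (the names `retry_with_backoff`, `max_retries`, and `base_delay` are hypothetical):
```python
import functools
import time

def retry_with_backoff(max_retries=3, base_delay=1.0):
    """Illustrative retry decorator with exponential backoff
    (a sketch, not the actual utils/retry.py implementation)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    # Wait 1s, 2s, 4s, ... between attempts
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3, base_delay=2.0)
def fetch_page(url):
    import requests
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```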
## Project Structure
```
.
├── config.py                 # Configuration and settings
├── requirements.txt          # Python dependencies
├── .env.example              # Environment variables template
├── scrapers/                 # Scraper implementations
│   ├── base_scraper.py       # Abstract base class
│   ├── basic_scraper.py      # requests + BeautifulSoup
│   ├── selenium_scraper.py   # Selenium WebDriver
│   ├── jina_scraper.py       # Jina AI integration
│   ├── firecrawl_scraper.py  # Firecrawl integration
│   ├── agentql_scraper.py    # AgentQL workflows
│   └── multion_scraper.py    # Multion AI agent
├── utils/                    # Utility modules
│   ├── logger.py             # Logging configuration
│   ├── rate_limiter.py       # Rate limiting
│   └── retry.py              # Retry with backoff
├── data_processors/          # Data processing
│   ├── validator.py          # Data validation
│   └── storage.py            # Data storage
├── examples/                 # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
└── tests/                    # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
## Installation
1. **Clone the repository**:
```bash
git clone <repository-url>
cd <project-directory>
```
2. **Create virtual environment**:
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# Unix/MacOS
source venv/bin/activate
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **Configure environment variables**:
```bash
cp .env.example .env
# Edit .env with your API keys
```
## Quick Start
### Basic Scraping
```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")
    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```
### Dynamic Content (Selenium)
```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )
    if result["success"]:
        print(result["title"])
```
### AI-Powered Extraction (Jina)
```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )
    if result["success"]:
        print(result["content"])
```
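The remaining scrapers (Firecrawl, AgentQL, Multion) follow the same context-manager pattern. The snippet below is only a hypothetical illustration assuming `FirecrawlScraper` mirrors the interface of the scrapers above; check `scrapers/firecrawl_scraper.py` and `examples/advanced_example.py` for the real signatures.
```python
from scrapers.firecrawl_scraper import FirecrawlScraper

# Hypothetical usage: the constructor arguments and result keys
# are assumed to match the other scrapers shown in this README.
with FirecrawlScraper() as scraper:
    result = scraper.scrape("https://example.com")
    if result["success"]:
        print(result["content"])
```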
## Usage Examples
See the `examples/` directory for detailed usage examples:
- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)
Run examples:
```bash
python examples/basic_example.py
```
## Configuration
Edit `config.py` or set environment variables in `.env`:
```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key
# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
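`config.py` is where these values are read. A minimal sketch of the usual pattern, assuming `python-dotenv` is installed to load `.env` (the variable names match the template above; everything else is illustrative):
```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load variables from .env into the process environment
load_dotenv()

JINA_API_KEY = os.getenv("JINA_API_KEY")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")
AGENTQL_API_KEY = os.getenv("AGENTQL_API_KEY")
MULTION_API_KEY = os.getenv("MULTION_API_KEY")

# Scraping settings, with defaults when a variable is unset
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```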
## Data Storage
Save scraped data in multiple formats:
```python
from data_processors.storage import DataStorage

storage = DataStorage()
data = [{"title": "Example", "url": "https://example.com"}]  # example scraped records
content = "Raw page text"                                    # example plain-text content

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
## Testing
Run tests with pytest:
```bash
pytest tests/ -v
```
Run specific test file:
```bash
pytest tests/test_basic_scraper.py -v
```
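New tests follow the same layout as the existing files in `tests/`. As a rough sketch, a test for `DataStorage.save_json` could look like the following (it assumes `save_json(data, path)` writes standard JSON to the given path; adjust to the actual API):
```python
import json

from data_processors.storage import DataStorage

def test_save_json_round_trip(tmp_path):
    """save_json should write data that can be read back as valid JSON
    (sketch; assumes save_json accepts a full output path)."""
    storage = DataStorage()
    data = {"title": "Example", "url": "https://example.com"}
    output_file = tmp_path / "output.json"

    storage.save_json(data, str(output_file))

    with open(output_file, encoding="utf-8") as f:
        assert json.load(f) == data
```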
## Best Practices
1. **Respect robots.txt**: Always check and follow website scraping policies (see the sketch after this list)
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set realistic User-Agent headers
4. **Error Handling**: Implement robust error handling and retries
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging
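For the robots.txt and rate-limiting points, the standard library is enough; the sketch below is independent of the project's scrapers (the URL and User-Agent string are placeholders):
```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent="MyScraperBot/1.0"):
    """Return True if the site's robots.txt permits fetching this URL."""
    parsed = urlparse(url)
    robots = RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)

url = "https://example.com/page"
if allowed_to_fetch(url):
    # ...fetch the page with a realistic User-Agent header...
    time.sleep(2)  # polite delay before the next request
```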
## Tool Selection Guide
- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios
## Contributing
1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages
## License
[Your License Here]
## Disclaimer
This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.