Delete README.md

This commit is contained in:
Creepso 2025-10-31 18:06:57 +00:00
parent 644ea16f94
commit f4285381bd

README.md | 233 deletions

@@ -1,233 +0,0 @@
# Web Scraping Project
A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.
## Features
- **Multiple Scraping Methods**:
- Basic HTTP requests with BeautifulSoup
- Selenium for JavaScript-heavy sites
- Jina AI for intelligent text extraction
- Firecrawl for deep web crawling
- AgentQL for complex workflows
- Multion for exploratory tasks
- **Built-in Utilities**:
  - Rate limiting and retry logic (a sketch follows this list)
- Comprehensive logging
- Data validation and sanitization
- Multiple storage formats (JSON, CSV, text)
- **Best Practices**:
- PEP 8 compliant code
- Modular and reusable components
- Error handling and recovery
- Ethical scraping practices
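The retry helper lives in `utils/retry.py`; its exact API is not shown in this README, so the following is only a minimal sketch of the kind of retry-with-backoff decorator that module provides (the names `retry_with_backoff`, `max_retries`, and `base_delay` are hypothetical):
```python
import functools
import time

def retry_with_backoff(max_retries=3, base_delay=1.0):
    """Illustrative retry decorator with exponential backoff
    (a sketch, not the actual utils/retry.py implementation)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    # Wait 1s, 2s, 4s, ... between attempts
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3, base_delay=2.0)
def fetch_page(url):
    import requests
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```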
## Project Structure
```
.
├── config.py                 # Configuration and settings
├── requirements.txt          # Python dependencies
├── .env.example              # Environment variables template
├── scrapers/                 # Scraper implementations
│   ├── base_scraper.py       # Abstract base class
│   ├── basic_scraper.py      # requests + BeautifulSoup
│   ├── selenium_scraper.py   # Selenium WebDriver
│   ├── jina_scraper.py       # Jina AI integration
│   ├── firecrawl_scraper.py  # Firecrawl integration
│   ├── agentql_scraper.py    # AgentQL workflows
│   └── multion_scraper.py    # Multion AI agent
├── utils/                    # Utility modules
│   ├── logger.py             # Logging configuration
│   ├── rate_limiter.py       # Rate limiting
│   └── retry.py              # Retry with backoff
├── data_processors/          # Data processing
│   ├── validator.py          # Data validation
│   └── storage.py            # Data storage
├── examples/                 # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
└── tests/                    # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
## Installation
1. **Clone the repository**:
```bash
git clone <repository-url>
cd <project-directory>
```
2. **Create virtual environment**:
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# Unix/MacOS
source venv/bin/activate
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **Configure environment variables**:
```bash
cp .env.example .env
# Edit .env with your API keys
```
## Quick Start
### Basic Scraping
```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")
    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```
### Dynamic Content (Selenium)
```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )
    if result["success"]:
        print(result["title"])
```
### AI-Powered Extraction (Jina)
```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )
    if result["success"]:
        print(result["content"])
```
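The remaining scrapers (Firecrawl, AgentQL, Multion) follow the same context-manager pattern. The snippet below is only a hypothetical illustration assuming `FirecrawlScraper` mirrors the interface of the scrapers above; check `scrapers/firecrawl_scraper.py` and `examples/advanced_example.py` for the real signatures.
```python
from scrapers.firecrawl_scraper import FirecrawlScraper

# Hypothetical usage: the constructor arguments and result keys
# are assumed to match the other scrapers shown in this README.
with FirecrawlScraper() as scraper:
    result = scraper.scrape("https://example.com")
    if result["success"]:
        print(result["content"])
```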
## Usage Examples
See the `examples/` directory for detailed usage examples:
- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)
Run examples:
```bash
python examples/basic_example.py
```
## Configuration
Edit `config.py` or set environment variables in `.env`:
```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key
# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
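`config.py` is where these values are read. A minimal sketch of the usual pattern, assuming `python-dotenv` is installed to load `.env` (the variable names match the template above; everything else is illustrative):
```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load variables from .env into the process environment
load_dotenv()

JINA_API_KEY = os.getenv("JINA_API_KEY")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")
AGENTQL_API_KEY = os.getenv("AGENTQL_API_KEY")
MULTION_API_KEY = os.getenv("MULTION_API_KEY")

# Scraping settings, with defaults when a variable is unset
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```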
## Data Storage
Save scraped data in multiple formats:
```python
from data_processors.storage import DataStorage

storage = DataStorage()
data = [{"title": "Example", "url": "https://example.com"}]  # example scraped records
content = "Raw page text"                                    # example plain-text content

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
## Testing
Run tests with pytest:
```bash
pytest tests/ -v
```
Run specific test file:
```bash
pytest tests/test_basic_scraper.py -v
```
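New tests follow the same layout as the existing files in `tests/`. As a rough sketch, a test for `DataStorage.save_json` could look like the following (it assumes `save_json(data, path)` writes standard JSON to the given path; adjust to the actual API):
```python
import json

from data_processors.storage import DataStorage

def test_save_json_round_trip(tmp_path):
    """save_json should write data that can be read back as valid JSON
    (sketch; assumes save_json accepts a full output path)."""
    storage = DataStorage()
    data = {"title": "Example", "url": "https://example.com"}
    output_file = tmp_path / "output.json"

    storage.save_json(data, str(output_file))

    with open(output_file, encoding="utf-8") as f:
        assert json.load(f) == data
```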
## Best Practices
1. **Respect robots.txt**: Always check and follow website scraping policies (see the sketch after this list)
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set realistic User-Agent headers
4. **Error Handling**: Implement robust error handling and retries
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging
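For the robots.txt and rate-limiting points, the standard library is enough; the sketch below is independent of the project's scrapers (the URL and User-Agent string are placeholders):
```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent="MyScraperBot/1.0"):
    """Return True if the site's robots.txt permits fetching this URL."""
    parsed = urlparse(url)
    robots = RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)

url = "https://example.com/page"
if allowed_to_fetch(url):
    # ...fetch the page with a realistic User-Agent header...
    time.sleep(2)  # polite delay before the next request
```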
## Tool Selection Guide
- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios
## Contributing
1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages
## License
[Your License Here]
## Disclaimer
This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.