Delete README.md
This commit is contained in:
parent
644ea16f94
commit
f4285381bd
1 changed file with 0 additions and 233 deletions
README.md
@@ -1,233 +0,0 @@
# Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks

- **Built-in Utilities**:
  - Rate limiting and retry logic (see the sketch after this list)
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)

- **Best Practices**:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices
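As a taste of the utilities, here is a minimal sketch of the retry-with-backoff idea behind `utils/retry.py` (the shipped decorator's name and signature may differ):

```python
# Hypothetical sketch only -- the real utils/retry.py may expose a different interface.
import time
from functools import wraps


def retry(max_retries: int = 3, base_delay: float = 1.0):
    """Retry a function with exponential backoff between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: propagate the last error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```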
## Project Structure

```
.
├── config.py                  # Configuration and settings
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
│
├── scrapers/                  # Scraper implementations
│   ├── base_scraper.py        # Abstract base class
│   ├── basic_scraper.py       # requests + BeautifulSoup
│   ├── selenium_scraper.py    # Selenium WebDriver
│   ├── jina_scraper.py        # Jina AI integration
│   ├── firecrawl_scraper.py   # Firecrawl integration
│   ├── agentql_scraper.py     # AgentQL workflows
│   └── multion_scraper.py     # Multion AI agent
│
├── utils/                     # Utility modules
│   ├── logger.py              # Logging configuration
│   ├── rate_limiter.py        # Rate limiting
│   └── retry.py               # Retry with backoff
│
├── data_processors/           # Data processing
│   ├── validator.py           # Data validation
│   └── storage.py             # Data storage
│
├── examples/                  # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                     # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
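Each scraper inherits shared plumbing from `scrapers/base_scraper.py`. A hedged sketch of what that abstract base plausibly looks like, inferred from the usage examples below (the actual class may define more hooks):

```python
# Illustrative sketch of scrapers/base_scraper.py; the real base class may differ.
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Shared context-manager plumbing for all scraper implementations."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions

    @abstractmethod
    def scrape(self, url: str, **kwargs) -> dict:
        """Fetch a URL and return a result dict carrying at least 'success'."""

    def close(self) -> None:
        """Release sessions, drivers, or other resources; override as needed."""
```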
## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd <project-directory>
   ```

2. **Create virtual environment**:
   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # Unix/macOS
   source venv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure environment variables**:
   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```

## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```
### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )

    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```

## Usage Examples

See the `examples/` directory for detailed usage examples:

- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)

Run examples:

```bash
python examples/basic_example.py
```
## Configuration

Edit `config.py` or set environment variables in `.env`:

```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
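A hedged sketch of how `config.py` might surface these values, assuming the project uses `python-dotenv` (the variable names match `.env.example`; the loading code itself is illustrative):

```python
# Illustrative only -- the real config.py may load settings differently.
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # pull values from .env into the process environment

JINA_API_KEY = os.getenv("JINA_API_KEY")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")

# Numeric settings, with the defaults shown above
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```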
## Data Storage

Save scraped data in multiple formats:

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
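Under the hood, `save_json` plausibly reduces to a thin wrapper over the standard library; a minimal sketch of that idea (the real method may add path handling or validation):

```python
# Hypothetical internals of DataStorage.save_json; shown only to set expectations.
import json
from pathlib import Path


def save_json(data, filename: str) -> None:
    """Write data as pretty-printed UTF-8 JSON, creating parent dirs if needed."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```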
## Testing

Run tests with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:

```bash
pytest tests/test_basic_scraper.py -v
```
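New tests follow the usual pytest conventions; an illustrative example of the shape of a test in `tests/` (the real suite's assertions may differ):

```python
# Illustrative test shape only; the real suite's assertions may differ.
import json

from data_processors.storage import DataStorage


def test_save_json_roundtrip(tmp_path):
    """save_json should write JSON that reads back unchanged."""
    storage = DataStorage()
    out = tmp_path / "output.json"
    storage.save_json({"title": "hello"}, str(out))

    assert json.loads(out.read_text(encoding="utf-8")) == {"title": "hello"}
```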
## Best Practices

1. **Respect robots.txt**: Always check and follow website scraping policies (see the sketch after this list)
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set realistic User-Agent headers
4. **Error Handling**: Implement robust error handling and retries
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging
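For the robots.txt check, the standard library is enough; a minimal sketch using `urllib.robotparser` (the user agent string is a placeholder):

```python
# Standard-library robots.txt check; the user agent below is a placeholder.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch("MyScraperBot/1.0", "https://example.com/some/page"):
    print("Fetching is allowed for this user agent")
else:
    print("Skip this page: disallowed by robots.txt")
```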
## Tool Selection Guide

- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios

## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.