# Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks

- **Built-in Utilities**:
  - Rate limiting and retry logic (see the sketch after this list)
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)

- **Best Practices**:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices
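
The `utils/` modules behind the rate-limiting and retry features are not documented in detail in this README. As a rough, self-contained illustration of the pattern (the names `retry_with_backoff` and `RateLimiter` are placeholders, not necessarily the actual interface of `utils/retry.py` or `utils/rate_limiter.py`):

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def retry_with_backoff(max_retries=3, base_delay=1.0):
    """Retry a function on exception, doubling the delay after each failed attempt."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_retries:
                        raise
                    logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator


class RateLimiter:
    """Block until at least `delay` seconds have passed since the previous call."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self._last_call = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_call = time.monotonic()
```

A scraper would typically decorate its request method with the retry helper and call the limiter before every HTTP request.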
## Project Structure

```
.
├── config.py                  # Configuration and settings
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
│
├── scrapers/                  # Scraper implementations
│   ├── base_scraper.py        # Abstract base class
│   ├── basic_scraper.py       # requests + BeautifulSoup
│   ├── selenium_scraper.py    # Selenium WebDriver
│   ├── jina_scraper.py        # Jina AI integration
│   ├── firecrawl_scraper.py   # Firecrawl integration
│   ├── agentql_scraper.py     # AgentQL workflows
│   └── multion_scraper.py     # Multion AI agent
│
├── utils/                     # Utility modules
│   ├── logger.py              # Logging configuration
│   ├── rate_limiter.py        # Rate limiting
│   └── retry.py               # Retry with backoff
│
├── data_processors/           # Data processing
│   ├── validator.py           # Data validation
│   └── storage.py             # Data storage
│
├── examples/                  # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                     # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd <project-directory>
   ```

2. **Create a virtual environment**:
   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # Unix/macOS
   source venv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure environment variables**:
   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```
## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```

### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )

    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```
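
### Other Scrapers

The Firecrawl, AgentQL, and Multion scrapers are not shown above. Assuming they expose the same context-manager and result-dictionary interface as the scrapers in the previous examples (the exact parameters and result keys may differ; check the corresponding module in `scrapers/`), a Firecrawl call would look roughly like this:

```python
from scrapers.firecrawl_scraper import FirecrawlScraper

# Hypothetical usage mirroring the other scrapers' interface;
# see scrapers/firecrawl_scraper.py for the actual parameters.
with FirecrawlScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        print(result["content"])
```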
## Usage Examples

See the `examples/` directory for detailed usage examples:

- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)

Run examples:
```bash
python examples/basic_example.py
```
## Configuration

Edit `config.py` or set environment variables in `.env`:

```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
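
How `config.py` consumes these variables is not shown here. As a rough sketch, assuming `python-dotenv` is available (otherwise plain `os.environ` works the same way for variables exported in the shell), it might look like:

```python
# Illustrative only; the real config.py may be structured differently.
import os

from dotenv import load_dotenv  # assumed dependency; loads variables from .env

load_dotenv()

JINA_API_KEY = os.getenv("JINA_API_KEY", "")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")

RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```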
## Data Storage

Save scraped data in multiple formats:

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
## Testing

Run the test suite with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:
```bash
pytest tests/test_basic_scraper.py -v
```
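
New tests follow the usual pytest conventions. As a minimal sketch (the assertion assumes `DataStorage.save_json` writes standard JSON, which may not match the actual implementation), a round-trip test could look like:

```python
# Illustrative sketch, not part of the existing suite.
import json

from data_processors.storage import DataStorage


def test_save_json_roundtrip(tmp_path):
    """Data saved as JSON should be readable back unchanged (assumed behaviour)."""
    data = [{"title": "Example Domain", "url": "https://example.com"}]
    output = tmp_path / "output.json"

    DataStorage().save_json(data, str(output))

    assert json.loads(output.read_text()) == data
```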
## Best Practices

1. **Respect robots.txt**: Always check and follow each site's scraping policy (see the sketch after this list)
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set a realistic User-Agent header
4. **Error Handling**: Implement robust error handling and retries
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging
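
A minimal sketch of the first three points using only the standard library and `requests` (the User-Agent string and URLs are placeholders):

```python
import time
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "MyScraperBot/1.0 (+https://example.com/contact)"  # placeholder identity

# Check robots.txt before fetching
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

url = "https://example.com/some/page"
if robots.can_fetch(USER_AGENT, url):
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    response.raise_for_status()
    time.sleep(2)  # simple fixed delay between requests
else:
    print(f"robots.txt disallows fetching {url}")
```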
## Tool Selection Guide

- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios

## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.