# Web Scraping Project

A Python web scraping framework supporting multiple approaches, from basic static-page scraping to AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks

- **Built-in Utilities**:
  - Rate limiting and retry logic
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)

- **Best Practices**:
  - PEP 8 compliant code
  - Modular, reusable components
  - Error handling and recovery
  - Ethical scraping practices

## Project Structure

```
.
├── config.py                  # Configuration and settings
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
│
├── scrapers/                  # Scraper implementations
│   ├── base_scraper.py        # Abstract base class
│   ├── basic_scraper.py       # requests + BeautifulSoup
│   ├── selenium_scraper.py    # Selenium WebDriver
│   ├── jina_scraper.py        # Jina AI integration
│   ├── firecrawl_scraper.py   # Firecrawl integration
│   ├── agentql_scraper.py     # AgentQL workflows
│   └── multion_scraper.py     # Multion AI agent
│
├── utils/                     # Utility modules
│   ├── logger.py              # Logging configuration
│   ├── rate_limiter.py        # Rate limiting
│   └── retry.py               # Retry with backoff
│
├── data_processors/           # Data processing
│   ├── validator.py           # Data validation
│   └── storage.py             # Data storage
│
├── examples/                  # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                     # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```

## Installation

1. **Clone the repository**:
```bash
git clone <repository-url>
cd <project-directory>
```

2. **Create a virtual environment**:
```bash
python -m venv venv

# Windows
venv\Scripts\activate

# Unix/macOS
source venv/bin/activate
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```

4. **Configure environment variables**:
```bash
cp .env.example .env
# Edit .env with your API keys
```

## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```

### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )

    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```

## Usage Examples

See the `examples/` directory for detailed usage examples:

- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)

Run an example:
```bash
python examples/basic_example.py
```

## Configuration

Edit `config.py` or set environment variables in `.env`:

```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```

## Data Storage

Save scraped data in multiple formats:

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```

## Testing

Run the test suite with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:
```bash
pytest tests/test_basic_scraper.py -v
```

## Best Practices

1. **Respect robots.txt**: Always check and follow a site's scraping policies (a minimal check is sketched after this list)
2. **Rate Limiting**: Use appropriate delays between requests
3. **User-Agent**: Set realistic User-Agent headers
4. **Error Handling**: Implement robust error handling and retries (see the backoff sketch after this list)
5. **Data Validation**: Validate and sanitize scraped data
6. **Logging**: Maintain detailed logs for debugging
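As a concrete illustration of the first point, the check below uses only Python's standard `urllib.robotparser`. It is a generic sketch rather than one of this project's modules, and the `can_scrape` helper name is illustrative:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def can_scrape(url: str, user_agent: str = "*") -> bool:
    """Return True if the site's robots.txt allows user_agent to fetch url."""
    # Illustrative sketch; not part of this project's scrapers/ or utils/.
    parser = RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))  # e.g. https://example.com/robots.txt
    parser.read()  # fetches and parses robots.txt; may raise on network errors
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(can_scrape("https://example.com/some/page"))
```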
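For retries, the usual pattern is exponential backoff: wait after a failure, then wait twice as long before the next attempt, and so on. The decorator below is an illustrative sketch of that pattern, not the actual API of `utils/retry.py`; its defaults are simply chosen to mirror the `MAX_RETRIES` and `RATE_LIMIT_DELAY` settings shown above:

```python
import functools
import time

import requests  # already a project dependency (used by the basic scraper)


def retry(max_retries: int = 3, base_delay: float = 2.0):
    """Retry a function on request failures with exponential backoff."""
    # Sketch of the backoff pattern; not the real utils/retry.py interface.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except requests.RequestException:
                    if attempt == max_retries - 1:
                        raise  # out of attempts; surface the last error
                    time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
        return wrapper
    return decorator


@retry(max_retries=3, base_delay=2.0)
def fetch(url: str) -> str:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # treat HTTP errors as retryable failures
    return response.text
```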
## Tool Selection Guide

- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios

## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update the documentation
4. Write meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.