diff --git a/README.md b/README.md
new file mode 100644
index 0000000..701ff90
--- /dev/null
+++ b/README.md
@@ -0,0 +1,233 @@

# Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static-page scraping to advanced AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks

- **Built-in Utilities**:
  - Rate limiting and retry logic
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)

- **Best Practices**:
  - PEP 8-compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices

## Project Structure

```
.
├── config.py                 # Configuration and settings
├── requirements.txt          # Python dependencies
├── .env.example              # Environment variables template
│
├── scrapers/                 # Scraper implementations
│   ├── base_scraper.py       # Abstract base class
│   ├── basic_scraper.py      # requests + BeautifulSoup
│   ├── selenium_scraper.py   # Selenium WebDriver
│   ├── jina_scraper.py       # Jina AI integration
│   ├── firecrawl_scraper.py  # Firecrawl integration
│   ├── agentql_scraper.py    # AgentQL workflows
│   └── multion_scraper.py    # Multion AI agent
│
├── utils/                    # Utility modules
│   ├── logger.py             # Logging configuration
│   ├── rate_limiter.py       # Rate limiting
│   └── retry.py              # Retry with backoff
│
├── data_processors/          # Data processing
│   ├── validator.py          # Data validation
│   └── storage.py            # Data storage
│
├── examples/                 # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                    # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```

## Installation
1. **Clone the repository**:
```bash
git clone <repository-url>
cd <project-directory>
```

2. **Create a virtual environment**:
```bash
python -m venv venv

# Windows
venv\Scripts\activate

# Unix/macOS
source venv/bin/activate
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```

4. **Configure environment variables**:
```bash
cp .env.example .env
# Edit .env with your API keys
```

## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```

### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )

    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```

## Usage Examples

See the `examples/` directory for detailed usage examples:

- `basic_example.py` - Static page scraping
- `selenium_example.py` - Dynamic content and interaction
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)
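The retry helper listed under `utils/retry.py` can be sketched as a small decorator with exponential backoff. This is a hypothetical illustration of the idea, not the project's actual implementation; the function name, parameters, and defaults are assumptions:

```python
import functools
import time


def retry_with_backoff(max_retries=3, base_delay=1.0, exceptions=(Exception,)):
    """Retry the wrapped function, doubling the delay after each failure.

    Sketch only -- utils/retry.py may differ in names and behavior.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator


@retry_with_backoff(max_retries=3, base_delay=0.5, exceptions=(ConnectionError,))
def fetch_page(url):
    ...  # e.g. an HTTP GET that may raise ConnectionError
```

Applied to a flaky network call with these settings, the first failure waits 0.5 s, the second waits 1 s, and the third re-raises the exception to the caller.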
Run the examples:
```bash
python examples/basic_example.py
```

## Configuration

Edit `config.py` or set environment variables in `.env`:

```bash
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```

## Data Storage

Save scraped data in multiple formats:

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as plain text
storage.save_text(content, "output.txt")
```

## Testing

Run the full test suite with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:
```bash
pytest tests/test_basic_scraper.py -v
```

## Best Practices

1. **Respect robots.txt**: always check and follow each site's scraping policies
2. **Rate limiting**: use appropriate delays between requests
3. **User-Agent**: set realistic User-Agent headers
4. **Error handling**: implement robust error handling and retries
5. **Data validation**: validate and sanitize scraped data
6. **Logging**: maintain detailed logs for debugging

## Tool Selection Guide

- **Basic Scraper**: static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: deep crawling, hierarchical content
- **AgentQL**: complex workflows (logins, forms, multi-step processes)
- **Multion**: exploratory tasks, unpredictable scenarios

## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.
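As a concrete illustration of the robots.txt check recommended in the best practices, Python's standard library already provides `urllib.robotparser`. The ruleset below is made up for the example; in practice you would fetch the target site's live `robots.txt` with `set_url(...)` and `read()`:

```python
from urllib import robotparser

# A hypothetical robots.txt ruleset, inlined for an offline demonstration.
rules = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Check whether a given URL may be fetched before scraping it.
print(parser.can_fetch("*", "https://example.com/public/page"))   # True
print(parser.can_fetch("*", "https://example.com/private/data"))  # False
```

Calling `parser.can_fetch(user_agent, url)` before each request, and honoring `parser.crawl_delay(user_agent)` when present, covers the first two best-practice items with no extra dependencies.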