Web Scraping Project
A Python web scraping framework that supports multiple approaches, from basic static-page scraping to advanced AI-driven data extraction.
Features
- Multiple Scraping Methods:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks
- Built-in Utilities (see the sketch after this list):
  - Rate limiting and retry logic
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)
- Best Practices:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices
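The utility modules are shared by the scrapers but can also be used on their own. The sketch below shows how rate limiting and retries are meant to combine; the names and signatures (RateLimiter, retry_with_backoff, wait()) are assumptions for illustration and may not match the actual modules in utils/.

```python
import requests

# Hypothetical imports: the real utils/ APIs may use different names and signatures.
from utils.rate_limiter import RateLimiter
from utils.retry import retry_with_backoff

limiter = RateLimiter(delay=2)  # assumed: enforce at least 2 seconds between requests

@retry_with_backoff(max_retries=3)  # assumed: retry failures with exponential backoff
def fetch(url: str) -> str:
    limiter.wait()  # assumed blocking call that applies the configured delay
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(len(fetch("https://example.com")))
```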
Project Structure
```
.
├── config.py                  # Configuration and settings
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
│
├── scrapers/                  # Scraper implementations
│   ├── base_scraper.py        # Abstract base class
│   ├── basic_scraper.py       # requests + BeautifulSoup
│   ├── selenium_scraper.py    # Selenium WebDriver
│   ├── jina_scraper.py        # Jina AI integration
│   ├── firecrawl_scraper.py   # Firecrawl integration
│   ├── agentql_scraper.py     # AgentQL workflows
│   └── multion_scraper.py     # Multion AI agent
│
├── utils/                     # Utility modules
│   ├── logger.py              # Logging configuration
│   ├── rate_limiter.py        # Rate limiting
│   └── retry.py               # Retry with backoff
│
├── data_processors/           # Data processing
│   ├── validator.py           # Data validation
│   └── storage.py             # Data storage
│
├── examples/                  # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                     # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
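Every scraper in the Quick Start below is used as a context manager (with ... as scraper). The real base_scraper.py is not reproduced in this README; the following is only a sketch of the kind of contract such an abstract base class could define, with assumed method names:

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Illustrative abstract base; the actual base_scraper.py may differ."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never suppress exceptions

    @abstractmethod
    def scrape(self, url: str, **kwargs) -> dict:
        """Fetch a URL and return a result dict containing at least a 'success' key."""

    def close(self) -> None:
        """Release sessions, drivers, or other resources (no-op by default)."""
```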
Installation
- Clone the repository:

```bash
git clone <repository-url>
cd <project-directory>
```

- Create virtual environment:

```bash
python -m venv venv

# Windows
venv\Scripts\activate

# Unix/MacOS
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure environment variables:

```bash
cp .env.example .env
# Edit .env with your API keys
```
Quick Start
Basic Scraping
```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")
    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```
Dynamic Content (Selenium)
```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )
    if result["success"]:
        print(result["title"])
```
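The wait_for argument indicates that the wrapper waits for a CSS selector before reading the page. The wrapper's internals are not shown in this README; for orientation only, the equivalent behaviour in plain Selenium looks roughly like this:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    # Block until the dynamic element appears (up to 10 seconds).
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".dynamic-content"))
    )
    print(driver.title)
finally:
    driver.quit()
```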
AI-Powered Extraction (Jina)
```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )
    if result["success"]:
        print(result["content"])
```
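JinaScraper wraps Jina's Reader service. If you need to bypass the wrapper, the service can be called directly with plain requests; this sketch assumes the public r.jina.ai Reader endpoint and a bearer-token Authorization header, so verify both against Jina's current documentation:

```python
import os

import requests

target = "https://example.com"
headers = {}
api_key = os.getenv("JINA_API_KEY")
if api_key:
    # An API key is optional for the Reader endpoint but raises rate limits.
    headers["Authorization"] = f"Bearer {api_key}"

# The Reader endpoint is the target URL prefixed with https://r.jina.ai/
response = requests.get(f"https://r.jina.ai/{target}", headers=headers, timeout=30)
response.raise_for_status()
print(response.text)  # page content returned as markdown-like text
```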
Usage Examples
See the examples/ directory for detailed usage examples:
- basic_example.py - Static page scraping
- selenium_example.py - Dynamic content and interaction
- advanced_example.py - Advanced tools (Jina, Firecrawl, etc.)

Run examples:

```bash
python examples/basic_example.py
```
Configuration
Edit config.py or set environment variables in .env:
```
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
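The exact contents of config.py are not shown here. A minimal sketch of how such settings are commonly loaded, assuming the python-dotenv package is available (that dependency is an assumption, not confirmed by this README):

```python
import os

from dotenv import load_dotenv  # assumed dependency (python-dotenv)

load_dotenv()  # read key=value pairs from .env into the process environment

# API keys default to empty strings; a missing key then fails at the API call, not at import.
JINA_API_KEY = os.getenv("JINA_API_KEY", "")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
AGENTQL_API_KEY = os.getenv("AGENTQL_API_KEY", "")
MULTION_API_KEY = os.getenv("MULTION_API_KEY", "")

# Scraping settings with the defaults listed above.
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```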
Data Storage
Save scraped data in multiple formats:
```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
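DataStorage is part of this project rather than a third-party library, and its implementation is not shown in this README. As an illustration only, a CSV writer for a list of flat dicts could look like this:

```python
import csv
from pathlib import Path


def save_csv(rows: list[dict], path: str) -> None:
    """Write a list of flat dicts to CSV, using the first row's keys as the header."""
    if not rows:
        return
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


save_csv([{"title": "Example Domain", "url": "https://example.com"}], "output.csv")
```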
Testing
Run tests with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:

```bash
pytest tests/test_basic_scraper.py -v
```
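New tests follow the same pattern as the existing suite. Below is a minimal illustrative test; it assumes the result-dict shape shown in Quick Start ("success" and "soup" keys) and makes a live request, so the project's real tests may mock the network instead:

```python
# Illustrative only; drop into tests/ and run with pytest.
from scrapers.basic_scraper import BasicScraper


def test_scrape_returns_result_dict():
    # Hits the network; real tests may stub this out.
    with BasicScraper() as scraper:
        result = scraper.scrape("https://example.com")

    assert isinstance(result, dict)
    assert "success" in result
    if result["success"]:
        assert result["soup"] is not None
```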
Best Practices
- Respect robots.txt: Always check and follow website scraping policies (see the sketch after this list for a programmatic check)
- Rate Limiting: Use appropriate delays between requests
- User-Agent: Set realistic User-Agent headers
- Error Handling: Implement robust error handling and retries
- Data Validation: Validate and sanitize scraped data
- Logging: Maintain detailed logs for debugging
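The robots.txt and User-Agent points can be covered in a few lines with the standard library and requests. A sketch; the User-Agent string is a placeholder you should replace with one identifying your own project:

```python
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "MyScraperBot/1.0 (+https://example.com/contact)"  # placeholder value


def allowed_by_robots(url: str, user_agent: str = USER_AGENT) -> bool:
    """Check the site's robots.txt before requesting a URL."""
    parts = urlparse(url)
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)


url = "https://example.com/"
if allowed_by_robots(url):
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    print(response.status_code)
else:
    print("Disallowed by robots.txt; skipping.")
```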
Tool Selection Guide
- Basic Scraper: Static HTML pages, simple data extraction
- Selenium: JavaScript-rendered content, interactive elements
- Jina: AI-driven text extraction, structured data
- Firecrawl: Deep crawling, hierarchical content
- AgentQL: Complex workflows (login, forms, multi-step processes)
- Multion: Exploratory tasks, unpredictable scenarios
Contributing
- Follow PEP 8 style guidelines
- Add tests for new features
- Update documentation
- Use meaningful commit messages
License
[Your License Here]
Disclaimer
This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.