Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.

Features

  • Multiple Scraping Methods:

    • Basic HTTP requests with BeautifulSoup
    • Selenium for JavaScript-heavy sites
    • Jina AI for intelligent text extraction
    • Firecrawl for deep web crawling
    • AgentQL for complex workflows
    • Multion for exploratory tasks
  • Built-in Utilities:

    • Rate limiting and retry logic
    • Comprehensive logging
    • Data validation and sanitization
    • Multiple storage formats (JSON, CSV, text)
  • Best Practices:

    • PEP 8 compliant code
    • Modular and reusable components
    • Error handling and recovery
    • Ethical scraping practices

Project Structure

.
├── config.py                 # Configuration and settings
├── requirements.txt          # Python dependencies
├── .env.example             # Environment variables template
│
├── scrapers/                # Scraper implementations
│   ├── base_scraper.py      # Abstract base class
│   ├── basic_scraper.py     # requests + BeautifulSoup
│   ├── selenium_scraper.py  # Selenium WebDriver
│   ├── jina_scraper.py      # Jina AI integration
│   ├── firecrawl_scraper.py # Firecrawl integration
│   ├── agentql_scraper.py   # AgentQL workflows
│   └── multion_scraper.py   # Multion AI agent
│
├── utils/                   # Utility modules
│   ├── logger.py           # Logging configuration
│   ├── rate_limiter.py     # Rate limiting
│   └── retry.py            # Retry with backoff
│
├── data_processors/         # Data processing
│   ├── validator.py        # Data validation
│   └── storage.py          # Data storage
│
├── examples/               # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                  # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py

Installation

  1. Clone the repository:
git clone <repository-url>
cd <project-directory>

  2. Create a virtual environment:
python -m venv venv

# Windows
venv\Scripts\activate

# Unix/macOS
source venv/bin/activate

  3. Install dependencies:
pip install -r requirements.txt

  4. Configure environment variables:
cp .env.example .env
# Edit .env with your API keys

Quick Start

Basic Scraping

from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")
    
    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)

Dynamic Content (Selenium)

from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )
    
    if result["success"]:
        print(result["title"])

AI-Powered Extraction (Jina)

from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )
    
    if result["success"]:
        print(result["content"])

Usage Examples

See the examples/ directory for detailed usage examples:

  • basic_example.py - Static page scraping
  • selenium_example.py - Dynamic content and interaction
  • advanced_example.py - Advanced tools (Jina, Firecrawl, etc.)

Run examples:

python examples/basic_example.py

Configuration

Edit config.py or set environment variables in .env:

# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
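
If you keep settings in .env, they can be loaded at startup. The snippet below is a minimal sketch of how config.py might read them using python-dotenv; the variable names match the ones above, but the loading code itself is illustrative, not the project's actual implementation:

import os

from dotenv import load_dotenv  # pip install python-dotenv

# Pull variables from .env into the process environment
load_dotenv()

# API keys (empty string if unset)
JINA_API_KEY = os.getenv("JINA_API_KEY", "")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")

# Scraping settings, falling back to the defaults shown above
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))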

Data Storage

Save scraped data in multiple formats:

from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
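
A common pattern is to validate records before handing them to DataStorage. The helper below is a self-contained sketch of that flow; the field names and rules are illustrative and are not the API of data_processors/validator.py:

from data_processors.storage import DataStorage

def is_valid_record(record: dict) -> bool:
    # Illustrative rules: require a non-empty title and an http(s) URL
    url = str(record.get("url", ""))
    return bool(record.get("title")) and url.startswith(("http://", "https://"))

records = [
    {"title": "Example Domain", "url": "https://example.com"},
    {"title": "", "url": "not-a-url"},  # dropped by the filter below
]

clean = [r for r in records if is_valid_record(r)]

storage = DataStorage()
storage.save_json(clean, "clean_output.json")  # save_json as documented above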

Testing

Run tests with pytest:

pytest tests/ -v

Run a specific test file:

pytest tests/test_basic_scraper.py -v
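
New tests follow the same layout. As an illustration only, a round-trip test for DataStorage could look like the sketch below; it assumes save_json accepts a full output path, which may differ from the actual implementation:

import json

from data_processors.storage import DataStorage

def test_save_json_roundtrip(tmp_path):
    storage = DataStorage()
    data = [{"title": "Example", "url": "https://example.com"}]
    out_file = tmp_path / "output.json"

    # Assumes save_json writes to the exact path it is given
    storage.save_json(data, str(out_file))

    with open(out_file, encoding="utf-8") as f:
        assert json.load(f) == data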

Best Practices

  1. Respect robots.txt: Always check and follow website scraping policies
  2. Rate Limiting: Use appropriate delays between requests
  3. User-Agent: Set realistic User-Agent headers
  4. Error Handling: Implement robust error handling and retries (a sketch combining practices 1-4 follows this list)
  5. Data Validation: Validate and sanitize scraped data
  6. Logging: Maintain detailed logs for debugging
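
The first four practices combine naturally into a small request helper. The snippet below is a generic illustration using only requests and the standard library, not the project's utils modules; tune the User-Agent string, delay, and retry count for your target site:

import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "my-scraper/1.0 (+https://example.com/contact)"  # practice 3: identify yourself

def allowed_by_robots(url: str) -> bool:
    # Practice 1: check robots.txt before fetching
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_get(url: str, delay: float = 2.0, max_retries: int = 3) -> requests.Response:
    if not allowed_by_robots(url):
        raise PermissionError(f"robots.txt disallows fetching {url}")

    for attempt in range(1, max_retries + 1):
        try:
            time.sleep(delay)  # practice 2: rate limiting between requests
            response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries:
                raise  # practice 4: fail loudly after bounded retries
            time.sleep(delay * attempt)  # simple linear backoff before retrying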

Tool Selection Guide

  • Basic Scraper: Static HTML pages, simple data extraction
  • Selenium: JavaScript-rendered content, interactive elements
  • Jina: AI-driven text extraction, structured data
  • Firecrawl: Deep crawling, hierarchical content
  • AgentQL: Complex workflows (login, forms, multi-step processes)
  • Multion: Exploratory tasks, unpredictable scenarios

Contributing

  1. Follow PEP 8 style guidelines
  2. Add tests for new features
  3. Update documentation
  4. Use meaningful commit messages

License

[Your License Here]

Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.