Web Scraping Project
A Python web scraping framework that supports multiple approaches, from basic static-page scraping to advanced AI-driven data extraction.
Features
- Multiple Scraping Methods:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks
- Built-in Utilities (see the sketch after this list):
  - Rate limiting and retry logic
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)
- Best Practices:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices
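The utility modules are shared by the scrapers but can also be used on their own. The sketch below shows how rate limiting and retries are meant to combine; the names and signatures (RateLimiter, retry_with_backoff, wait()) are assumptions for illustration and may not match the actual modules in utils/.

```python
import requests

# Hypothetical imports: the real utils/ APIs may use different names and signatures.
from utils.rate_limiter import RateLimiter
from utils.retry import retry_with_backoff

limiter = RateLimiter(delay=2)  # assumed: enforce at least 2 seconds between requests

@retry_with_backoff(max_retries=3)  # assumed: retry failures with exponential backoff
def fetch(url: str) -> str:
    limiter.wait()  # assumed blocking call that applies the configured delay
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(len(fetch("https://example.com")))
```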
Project Structure
```
.
├── config.py                  # Configuration and settings
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
│
├── scrapers/                  # Scraper implementations
│   ├── base_scraper.py        # Abstract base class
│   ├── basic_scraper.py       # requests + BeautifulSoup
│   ├── selenium_scraper.py    # Selenium WebDriver
│   ├── jina_scraper.py        # Jina AI integration
│   ├── firecrawl_scraper.py   # Firecrawl integration
│   ├── agentql_scraper.py     # AgentQL workflows
│   └── multion_scraper.py     # Multion AI agent
│
├── utils/                     # Utility modules
│   ├── logger.py              # Logging configuration
│   ├── rate_limiter.py        # Rate limiting
│   └── retry.py               # Retry with backoff
│
├── data_processors/           # Data processing
│   ├── validator.py           # Data validation
│   └── storage.py             # Data storage
│
├── examples/                  # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                     # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```
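Every scraper in the Quick Start below is used as a context manager (with ... as scraper). The real base_scraper.py is not reproduced in this README; the following is only a sketch of the kind of contract such an abstract base class could define, with assumed method names:

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Illustrative abstract base; the actual base_scraper.py may differ."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never suppress exceptions

    @abstractmethod
    def scrape(self, url: str, **kwargs) -> dict:
        """Fetch a URL and return a result dict containing at least a 'success' key."""

    def close(self) -> None:
        """Release sessions, drivers, or other resources (no-op by default)."""
```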
Installation
- Clone the repository:

```bash
git clone <repository-url>
cd <project-directory>
```

- Create virtual environment:

```bash
python -m venv venv

# Windows
venv\Scripts\activate

# Unix/MacOS
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure environment variables:

```bash
cp .env.example .env
# Edit .env with your API keys
```
Quick Start
Basic Scraping
```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")
    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```
Dynamic Content (Selenium)
```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )
    if result["success"]:
        print(result["title"])
```
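The wait_for argument indicates that the wrapper waits for a CSS selector before reading the page. The wrapper's internals are not shown in this README; for orientation only, the equivalent behaviour in plain Selenium looks roughly like this:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    # Block until the dynamic element appears (up to 10 seconds).
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".dynamic-content"))
    )
    print(driver.title)
finally:
    driver.quit()
```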
AI-Powered Extraction (Jina)
```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )
    if result["success"]:
        print(result["content"])
```
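JinaScraper wraps Jina's Reader service. If you need to bypass the wrapper, the service can be called directly with plain requests; this sketch assumes the public r.jina.ai Reader endpoint and a bearer-token Authorization header, so verify both against Jina's current documentation:

```python
import os

import requests

target = "https://example.com"
headers = {}
api_key = os.getenv("JINA_API_KEY")
if api_key:
    # An API key is optional for the Reader endpoint but raises rate limits.
    headers["Authorization"] = f"Bearer {api_key}"

# The Reader endpoint is the target URL prefixed with https://r.jina.ai/
response = requests.get(f"https://r.jina.ai/{target}", headers=headers, timeout=30)
response.raise_for_status()
print(response.text)  # page content returned as markdown-like text
```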
Usage Examples
See the examples/ directory for detailed usage examples:
- basic_example.py - Static page scraping
- selenium_example.py - Dynamic content and interaction
- advanced_example.py - Advanced tools (Jina, Firecrawl, etc.)

Run examples:

```bash
python examples/basic_example.py
```
Configuration
Edit config.py or set environment variables in .env:
```
# API Keys
JINA_API_KEY=your_api_key
FIRECRAWL_API_KEY=your_api_key
AGENTQL_API_KEY=your_api_key
MULTION_API_KEY=your_api_key

# Scraping Settings
RATE_LIMIT_DELAY=2
MAX_RETRIES=3
TIMEOUT=30
```
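The exact contents of config.py are not shown here. A minimal sketch of how such settings are commonly loaded, assuming the python-dotenv package is available (that dependency is an assumption, not confirmed by this README):

```python
import os

from dotenv import load_dotenv  # assumed dependency (python-dotenv)

load_dotenv()  # read key=value pairs from .env into the process environment

# API keys default to empty strings; a missing key then fails at the API call, not at import.
JINA_API_KEY = os.getenv("JINA_API_KEY", "")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
AGENTQL_API_KEY = os.getenv("AGENTQL_API_KEY", "")
MULTION_API_KEY = os.getenv("MULTION_API_KEY", "")

# Scraping settings with the defaults listed above.
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT = int(os.getenv("TIMEOUT", "30"))
```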
Data Storage
Save scraped data in multiple formats:
```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
storage.save_json(data, "output.json")

# Save as CSV
storage.save_csv(data, "output.csv")

# Save as text
storage.save_text(content, "output.txt")
```
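DataStorage is part of this project rather than a third-party library, and its implementation is not shown in this README. As an illustration only, a CSV writer for a list of flat dicts could look like this:

```python
import csv
from pathlib import Path


def save_csv(rows: list[dict], path: str) -> None:
    """Write a list of flat dicts to CSV, using the first row's keys as the header."""
    if not rows:
        return
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


save_csv([{"title": "Example Domain", "url": "https://example.com"}], "output.csv")
```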
Testing
Run tests with pytest:

```bash
pytest tests/ -v
```

Run a specific test file:

```bash
pytest tests/test_basic_scraper.py -v
```
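New tests follow the same pattern as the existing suite. Below is a minimal illustrative test; it assumes the result-dict shape shown in Quick Start ("success" and "soup" keys) and makes a live request, so the project's real tests may mock the network instead:

```python
# Illustrative only; drop into tests/ and run with pytest.
from scrapers.basic_scraper import BasicScraper


def test_scrape_returns_result_dict():
    # Hits the network; real tests may stub this out.
    with BasicScraper() as scraper:
        result = scraper.scrape("https://example.com")

    assert isinstance(result, dict)
    assert "success" in result
    if result["success"]:
        assert result["soup"] is not None
```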
Best Practices
- Respect robots.txt: Always check and follow website scraping policies (see the sketch after this list for a programmatic check)
- Rate Limiting: Use appropriate delays between requests
- User-Agent: Set realistic User-Agent headers
- Error Handling: Implement robust error handling and retries
- Data Validation: Validate and sanitize scraped data
- Logging: Maintain detailed logs for debugging
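The robots.txt and User-Agent points can be covered in a few lines with the standard library and requests. A sketch; the User-Agent string is a placeholder you should replace with one identifying your own project:

```python
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "MyScraperBot/1.0 (+https://example.com/contact)"  # placeholder value


def allowed_by_robots(url: str, user_agent: str = USER_AGENT) -> bool:
    """Check the site's robots.txt before requesting a URL."""
    parts = urlparse(url)
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)


url = "https://example.com/"
if allowed_by_robots(url):
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    print(response.status_code)
else:
    print("Disallowed by robots.txt; skipping.")
```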
Tool Selection Guide
- Basic Scraper: Static HTML pages, simple data extraction
- Selenium: JavaScript-rendered content, interactive elements
- Jina: AI-driven text extraction, structured data
- Firecrawl: Deep crawling, hierarchical content
- AgentQL: Complex workflows (login, forms, multi-step processes)
- Multion: Exploratory tasks, unpredictable scenarios
Contributing
- Follow PEP 8 style guidelines
- Add tests for new features
- Update documentation
- Use meaningful commit messages
License
[Your License Here]
Disclaimer
This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.