Quick Start Guide
Get started with web scraping in minutes!
1. Installation
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# Unix/MacOS:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
copy .env.example .env # Windows
# or
cp .env.example .env # Unix/MacOS
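Once .env is in place, the API keys in it are available to the scrapers through the environment. The snippet below is a minimal sketch of loading them yourself, assuming the project uses python-dotenv; the variable names shown are illustrative placeholders - check .env.example for the real ones.

import os
from dotenv import load_dotenv

# Load key/value pairs from .env into the process environment
load_dotenv()

# Hypothetical key names, for illustration only
jina_key = os.getenv("JINA_API_KEY")
firecrawl_key = os.getenv("FIRECRAWL_API_KEY")

if jina_key is None:
    print("JINA_API_KEY is not set - the AI-powered examples will fail")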
2. Basic Usage
Command Line Interface
Scrape any website using the CLI:
# Basic scraping
python main.py https://example.com
# Use Selenium for JavaScript sites
python main.py https://example.com -m selenium
# Use Jina AI for text extraction
python main.py https://example.com -m jina -o output.txt
# Enable verbose logging
python main.py https://example.com -v
Python Scripts
Simple Static Page Scraping
from scrapers.basic_scraper import BasicScraper
# Scrape a static website
with BasicScraper() as scraper:
    result = scraper.scrape("https://quotes.toscrape.com/")

    if result["success"]:
        soup = result["soup"]

        # Extract quotes
        for quote in soup.select(".quote"):
            text = quote.select_one(".text").get_text()
            author = quote.select_one(".author").get_text()
            print(f"{text} - {author}")
JavaScript-Heavy Websites
from scrapers.selenium_scraper import SeleniumScraper
# Scrape dynamic content
with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://quotes.toscrape.com/js/",
        wait_for=".quote"  # Wait for this element to load
    )

    if result["success"]:
        print(f"Page title: {result['title']}")
        # Process the data...
AI-Powered Text Extraction
from scrapers.jina_scraper import JinaScraper
# Extract text intelligently with AI
with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://news.ycombinator.com/",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
3. Save Your Data
from data_processors.storage import DataStorage
storage = DataStorage()
# Save as JSON
data = {"title": "Example", "content": "Hello World"}
storage.save_json(data, "output.json")
# Save as CSV
data_list = [
    {"name": "John", "age": 30},
    {"name": "Jane", "age": 25}
]
storage.save_csv(data_list, "people.csv")
# Save as text
storage.save_text("Some text content", "output.txt")
4. Run Examples
Try the included examples:
# Basic scraping example
python examples/basic_example.py
# Selenium example
python examples/selenium_example.py
# Advanced tools example (requires API keys)
python examples/advanced_example.py
5. Common Patterns
Extract Links from a Page
from scrapers.basic_scraper import BasicScraper
with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        links = scraper.extract_links(
            result["soup"],
            base_url="https://example.com"
        )

        for link in links:
            print(link)
Click Buttons and Fill Forms
from scrapers.selenium_scraper import SeleniumScraper
with SeleniumScraper(headless=False) as scraper:
    scraper.scrape("https://example.com/login")

    # Fill form fields
    scraper.fill_form("#username", "myuser")
    scraper.fill_form("#password", "mypass")

    # Click submit button
    scraper.click_element("#submit-btn")

    # Take screenshot
    scraper.take_screenshot("logged_in.png")
Validate and Clean Data
from data_processors.validator import DataValidator
# Validate email
is_valid = DataValidator.validate_email("test@example.com")
# Clean text
cleaned = DataValidator.clean_text(" Multiple spaces ")
# Validate required fields
data = {"name": "John", "email": "john@example.com"}
validation = DataValidator.validate_required_fields(
    data,
    required_fields=["name", "email", "phone"]
)

if not validation["valid"]:
    print(f"Missing: {validation['missing_fields']}")
6. Testing
Run the test suite:
# Run all tests
pytest tests/ -v
# Run specific test
pytest tests/test_basic_scraper.py -v
# Run with coverage
pytest tests/ --cov=scrapers --cov=utils --cov=data_processors
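If you want to add your own tests, a minimal sketch in the same spirit is shown below. It assumes BasicScraper's scrape() returns a dict with a "success" key, as in the examples above; the test name, and the fact that it hits the network, are illustrative rather than how the bundled suite is written.

from scrapers.basic_scraper import BasicScraper

def test_scrape_returns_success_flag():
    # Hypothetical test: scrape a small public page and check the result shape
    with BasicScraper() as scraper:
        result = scraper.scrape("https://quotes.toscrape.com/")
    assert "success" in result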
7. Advanced Features
Deep Crawling with Firecrawl
from scrapers.firecrawl_scraper import FirecrawlScraper
with FirecrawlScraper() as scraper:
    result = scraper.crawl(
        "https://example.com",
        max_depth=3,
        max_pages=50,
        include_patterns=["*/blog/*"],
        exclude_patterns=["*/admin/*"]
    )

    if result["success"]:
        print(f"Crawled {result['total_pages']} pages")
        for page in result["pages"]:
            print(f"- {page['url']}")
Complex Workflows with AgentQL
from scrapers.agentql_scraper import AgentQLScraper
with AgentQLScraper() as scraper:
    # Automated login
    result = scraper.login_workflow(
        url="https://example.com/login",
        username="user@example.com",
        password="password123",
        username_field="input[name='email']",
        password_field="input[name='password']",
        submit_button="button[type='submit']"
    )
Exploratory Tasks with Multion
from scrapers.multion_scraper import MultionScraper
with MultionScraper() as scraper:
    # Find best deal automatically
    result = scraper.find_best_deal(
        search_query="noise cancelling headphones",
        filters={
            "max_price": 200,
            "rating": "4.5+",
            "brand": "Sony"
        }
    )

    if result["success"]:
        print(result["final_result"])
8. Tips & Best Practices
- Always use context managers (with statement) to ensure proper cleanup
- Respect rate limits - the default is 2 seconds between requests
- Check robots.txt before scraping a website (see the sketch after this list)
- Use appropriate User-Agent headers
- Handle errors gracefully - the scrapers include built-in retry logic
- Validate and clean data before storing it
- Log everything for debugging purposes
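The robots.txt check and rate limiting from the list above can be done with the standard library alone. The helper below is a minimal sketch, not part of the project; the 2-second pause mirrors the default delay mentioned above.

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_scrape(url, user_agent="my-scraper/1.0"):
    # Fetch and parse the site's robots.txt, then ask whether this URL may be fetched
    parsed = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

for url in ["https://example.com/page1", "https://example.com/page2"]:
    if allowed_to_scrape(url):
        # ... scrape the page with one of the scrapers above ...
        time.sleep(2)  # stay at the default of 2 seconds between requests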
9. Troubleshooting
Issue: Selenium driver not found
# The project uses webdriver-manager to auto-download drivers
# If you have issues, manually install ChromeDriver:
# 1. Download from https://chromedriver.chromium.org/
# 2. Add to your system PATH
Issue: Import errors
# Make sure you've activated the virtual environment
# and installed all dependencies
pip install -r requirements.txt
Issue: API keys not working
# Make sure you've copied .env.example to .env
# and added your actual API keys
cp .env.example .env
# Edit .env with your keys
10. Next Steps
- Explore the examples/ directory for more use cases
- Read the full README.md for detailed documentation
- Check out the tests/ directory to see testing patterns
- Customize config.py for your specific needs
- Build your own scrapers by extending BaseScraper (see the sketch below)
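As a starting point for your own scraper, here is a minimal sketch of a subclass. The module path, the BaseScraper interface, and the result shape are assumptions based on the examples in this guide - check the actual base class for the real contract.

import requests
from scrapers.base_scraper import BaseScraper  # assumed module path

class StatusScraper(BaseScraper):
    """Toy scraper that only records a page's HTTP status code."""

    def scrape(self, url):
        try:
            response = requests.get(url, timeout=10)
            return {"success": True, "url": url, "status": response.status_code}
        except requests.RequestException as exc:
            return {"success": False, "url": url, "error": str(exc)}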
Happy Scraping! 🚀