Quick Start Guide
Get started with web scraping in minutes!
1. Installation
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# Unix/MacOS:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
copy .env.example .env # Windows
# or
cp .env.example .env # Unix/MacOS
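Once .env is in place, the API keys in it are available to the scrapers through the environment. The snippet below is a minimal sketch of loading them yourself, assuming the project uses python-dotenv; the variable names shown are illustrative placeholders - check .env.example for the real ones.

import os
from dotenv import load_dotenv

# Load key/value pairs from .env into the process environment
load_dotenv()

# Hypothetical key names, for illustration only
jina_key = os.getenv("JINA_API_KEY")
firecrawl_key = os.getenv("FIRECRAWL_API_KEY")

if jina_key is None:
    print("JINA_API_KEY is not set - the AI-powered examples will fail")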
2. Basic Usage
Command Line Interface
Scrape any website using the CLI:
# Basic scraping
python main.py https://example.com
# Use Selenium for JavaScript sites
python main.py https://example.com -m selenium
# Use Jina AI for text extraction
python main.py https://example.com -m jina -o output.txt
# Enable verbose logging
python main.py https://example.com -v
Python Scripts
Simple Static Page Scraping
from scrapers.basic_scraper import BasicScraper
# Scrape a static website
with BasicScraper() as scraper:
    result = scraper.scrape("https://quotes.toscrape.com/")

    if result["success"]:
        soup = result["soup"]

        # Extract quotes
        for quote in soup.select(".quote"):
            text = quote.select_one(".text").get_text()
            author = quote.select_one(".author").get_text()
            print(f"{text} - {author}")
JavaScript-Heavy Websites
from scrapers.selenium_scraper import SeleniumScraper
# Scrape dynamic content
with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://quotes.toscrape.com/js/",
        wait_for=".quote"  # Wait for this element to load
    )

    if result["success"]:
        print(f"Page title: {result['title']}")
        # Process the data...
AI-Powered Text Extraction
from scrapers.jina_scraper import JinaScraper
# Extract text intelligently with AI
with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://news.ycombinator.com/",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
3. Save Your Data
from data_processors.storage import DataStorage
storage = DataStorage()
# Save as JSON
data = {"title": "Example", "content": "Hello World"}
storage.save_json(data, "output.json")
# Save as CSV
data_list = [
    {"name": "John", "age": 30},
    {"name": "Jane", "age": 25}
]
storage.save_csv(data_list, "people.csv")
# Save as text
storage.save_text("Some text content", "output.txt")
4. Run Examples
Try the included examples:
# Basic scraping example
python examples/basic_example.py
# Selenium example
python examples/selenium_example.py
# Advanced tools example (requires API keys)
python examples/advanced_example.py
5. Common Patterns
Extract Links from a Page
from scrapers.basic_scraper import BasicScraper
with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        links = scraper.extract_links(
            result["soup"],
            base_url="https://example.com"
        )

        for link in links:
            print(link)
Click Buttons and Fill Forms
from scrapers.selenium_scraper import SeleniumScraper
with SeleniumScraper(headless=False) as scraper:
    scraper.scrape("https://example.com/login")

    # Fill form fields
    scraper.fill_form("#username", "myuser")
    scraper.fill_form("#password", "mypass")

    # Click submit button
    scraper.click_element("#submit-btn")

    # Take screenshot
    scraper.take_screenshot("logged_in.png")
Validate and Clean Data
from data_processors.validator import DataValidator
# Validate email
is_valid = DataValidator.validate_email("test@example.com")
# Clean text
cleaned = DataValidator.clean_text(" Multiple spaces ")
# Validate required fields
data = {"name": "John", "email": "john@example.com"}
validation = DataValidator.validate_required_fields(
    data,
    required_fields=["name", "email", "phone"]
)

if not validation["valid"]:
    print(f"Missing: {validation['missing_fields']}")
6. Testing
Run the test suite:
# Run all tests
pytest tests/ -v
# Run specific test
pytest tests/test_basic_scraper.py -v
# Run with coverage
pytest tests/ --cov=scrapers --cov=utils --cov=data_processors
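If you want to add your own tests, a minimal sketch in the same spirit is shown below. It assumes BasicScraper's scrape() returns a dict with a "success" key, as in the examples above; the test name, and the fact that it hits the network, are illustrative rather than how the bundled suite is written.

from scrapers.basic_scraper import BasicScraper

def test_scrape_returns_success_flag():
    # Hypothetical test: scrape a small public page and check the result shape
    with BasicScraper() as scraper:
        result = scraper.scrape("https://quotes.toscrape.com/")
    assert "success" in result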
7. Advanced Features
Deep Crawling with Firecrawl
from scrapers.firecrawl_scraper import FirecrawlScraper
with FirecrawlScraper() as scraper:
    result = scraper.crawl(
        "https://example.com",
        max_depth=3,
        max_pages=50,
        include_patterns=["*/blog/*"],
        exclude_patterns=["*/admin/*"]
    )

    if result["success"]:
        print(f"Crawled {result['total_pages']} pages")
        for page in result["pages"]:
            print(f"- {page['url']}")
Complex Workflows with AgentQL
from scrapers.agentql_scraper import AgentQLScraper
with AgentQLScraper() as scraper:
    # Automated login
    result = scraper.login_workflow(
        url="https://example.com/login",
        username="user@example.com",
        password="password123",
        username_field="input[name='email']",
        password_field="input[name='password']",
        submit_button="button[type='submit']"
    )
Exploratory Tasks with Multion
from scrapers.multion_scraper import MultionScraper
with MultionScraper() as scraper:
    # Find best deal automatically
    result = scraper.find_best_deal(
        search_query="noise cancelling headphones",
        filters={
            "max_price": 200,
            "rating": "4.5+",
            "brand": "Sony"
        }
    )

    if result["success"]:
        print(result["final_result"])
8. Tips & Best Practices
- Always use context managers (with statement) to ensure proper cleanup
- Respect rate limits - the default is 2 seconds between requests
- Check robots.txt before scraping a website (see the sketch after this list)
- Use appropriate User-Agent headers
- Handle errors gracefully - the scrapers include built-in retry logic
- Validate and clean data before storing it
- Log everything for debugging purposes
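The robots.txt check and rate limiting from the list above can be done with the standard library alone. The helper below is a minimal sketch, not part of the project; the 2-second pause mirrors the default delay mentioned above.

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_scrape(url, user_agent="my-scraper/1.0"):
    # Fetch and parse the site's robots.txt, then ask whether this URL may be fetched
    parsed = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

for url in ["https://example.com/page1", "https://example.com/page2"]:
    if allowed_to_scrape(url):
        # ... scrape the page with one of the scrapers above ...
        time.sleep(2)  # stay at the default of 2 seconds between requests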
9. Troubleshooting
Issue: Selenium driver not found
# The project uses webdriver-manager to auto-download drivers
# If you have issues, manually install ChromeDriver:
# 1. Download from https://chromedriver.chromium.org/
# 2. Add to your system PATH
Issue: Import errors
# Make sure you've activated the virtual environment
# and installed all dependencies
pip install -r requirements.txt
Issue: API keys not working
# Make sure you've copied .env.example to .env
# and added your actual API keys
cp .env.example .env
# Edit .env with your keys
10. Next Steps
- Explore the examples/ directory for more use cases
- Read the full README.md for detailed documentation
- Check out the tests/ directory to see testing patterns
- Customize config.py for your specific needs
- Build your own scrapers by extending BaseScraper (see the sketch below)
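As a starting point for your own scraper, here is a minimal sketch of a subclass. The module path, the BaseScraper interface, and the result shape are assumptions based on the examples in this guide - check the actual base class for the real contract.

import requests
from scrapers.base_scraper import BaseScraper  # assumed module path

class StatusScraper(BaseScraper):
    """Toy scraper that only records a page's HTTP status code."""

    def scrape(self, url):
        try:
            response = requests.get(url, timeout=10)
            return {"success": True, "url": url, "status": response.status_code}
        except requests.RequestException as exc:
            return {"success": False, "url": url, "error": str(exc)}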
Happy Scraping! 🚀