# Quick Start Guide

Get started with web scraping in minutes!

## 1. Installation

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# Unix/MacOS:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
copy .env.example .env   # Windows
# or
cp .env.example .env     # Unix/MacOS
```

## 2. Basic Usage

### Command Line Interface

Scrape any website using the CLI:

```bash
# Basic scraping
python main.py https://example.com

# Use Selenium for JavaScript sites
python main.py https://example.com -m selenium

# Use Jina AI for text extraction
python main.py https://example.com -m jina -o output.txt

# Enable verbose logging
python main.py https://example.com -v
```

### Python Scripts

#### Simple Static Page Scraping

```python
from scrapers.basic_scraper import BasicScraper

# Scrape a static website
with BasicScraper() as scraper:
    result = scraper.scrape("https://quotes.toscrape.com/")

    if result["success"]:
        soup = result["soup"]

        # Extract quotes
        for quote in soup.select(".quote"):
            text = quote.select_one(".text").get_text()
            author = quote.select_one(".author").get_text()
            print(f"{text} - {author}")
```

#### JavaScript-Heavy Websites

```python
from scrapers.selenium_scraper import SeleniumScraper

# Scrape dynamic content
with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://quotes.toscrape.com/js/",
        wait_for=".quote"  # Wait for this element to load
    )

    if result["success"]:
        print(f"Page title: {result['title']}")
        # Process the data...
```

#### AI-Powered Text Extraction

```python
from scrapers.jina_scraper import JinaScraper

# Extract text intelligently with AI
with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://news.ycombinator.com/",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```

## 3. Save Your Data

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
data = {"title": "Example", "content": "Hello World"}
storage.save_json(data, "output.json")

# Save as CSV
data_list = [
    {"name": "John", "age": 30},
    {"name": "Jane", "age": 25}
]
storage.save_csv(data_list, "people.csv")

# Save as text
storage.save_text("Some text content", "output.txt")
```
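### Putting It Together

For a minimal end-to-end flow you can chain the calls shown above: scrape a page with `BasicScraper`, collect the fields you need, then persist them with `DataStorage`. The selectors and output filenames below are only illustrative; point them at your own target site.

```python
from scrapers.basic_scraper import BasicScraper
from data_processors.storage import DataStorage

quotes = []

# Scrape the page and collect each quote as a dictionary
with BasicScraper() as scraper:
    result = scraper.scrape("https://quotes.toscrape.com/")

    if result["success"]:
        for quote in result["soup"].select(".quote"):
            quotes.append({
                "text": quote.select_one(".text").get_text(strip=True),
                "author": quote.select_one(".author").get_text(strip=True),
            })

# Persist the collected data using the storage helpers from above
storage = DataStorage()
storage.save_json({"quotes": quotes}, "quotes.json")
storage.save_csv(quotes, "quotes.csv")
```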
## 4. Run Examples

Try the included examples:

```bash
# Basic scraping example
python examples/basic_example.py

# Selenium example
python examples/selenium_example.py

# Advanced tools example (requires API keys)
python examples/advanced_example.py
```

## 5. Common Patterns

### Extract Links from a Page

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        links = scraper.extract_links(
            result["soup"],
            base_url="https://example.com"
        )
        for link in links:
            print(link)
```

### Click Buttons and Fill Forms

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=False) as scraper:
    scraper.scrape("https://example.com/login")

    # Fill form fields
    scraper.fill_form("#username", "myuser")
    scraper.fill_form("#password", "mypass")

    # Click submit button
    scraper.click_element("#submit-btn")

    # Take screenshot
    scraper.take_screenshot("logged_in.png")
```

### Validate and Clean Data

```python
from data_processors.validator import DataValidator

# Validate email
is_valid = DataValidator.validate_email("test@example.com")

# Clean text
cleaned = DataValidator.clean_text("  Multiple   spaces  ")

# Validate required fields
data = {"name": "John", "email": "john@example.com"}
validation = DataValidator.validate_required_fields(
    data,
    required_fields=["name", "email", "phone"]
)

if not validation["valid"]:
    print(f"Missing: {validation['missing_fields']}")
```

## 6. Testing

Run the test suite:

```bash
# Run all tests
pytest tests/ -v

# Run specific test
pytest tests/test_basic_scraper.py -v

# Run with coverage
pytest tests/ --cov=scrapers --cov=utils --cov=data_processors
```

## 7. Advanced Features

### Deep Crawling with Firecrawl

```python
from scrapers.firecrawl_scraper import FirecrawlScraper

with FirecrawlScraper() as scraper:
    result = scraper.crawl(
        "https://example.com",
        max_depth=3,
        max_pages=50,
        include_patterns=["*/blog/*"],
        exclude_patterns=["*/admin/*"]
    )

    if result["success"]:
        print(f"Crawled {result['total_pages']} pages")
        for page in result["pages"]:
            print(f"- {page['url']}")
```

### Complex Workflows with AgentQL

```python
from scrapers.agentql_scraper import AgentQLScraper

with AgentQLScraper() as scraper:
    # Automated login
    result = scraper.login_workflow(
        url="https://example.com/login",
        username="user@example.com",
        password="password123",
        username_field="input[name='email']",
        password_field="input[name='password']",
        submit_button="button[type='submit']"
    )
```

### Exploratory Tasks with Multion

```python
from scrapers.multion_scraper import MultionScraper

with MultionScraper() as scraper:
    # Find best deal automatically
    result = scraper.find_best_deal(
        search_query="noise cancelling headphones",
        filters={
            "max_price": 200,
            "rating": "4.5+",
            "brand": "Sony"
        }
    )

    if result["success"]:
        print(result["final_result"])
```

## 8. Tips & Best Practices

1. **Always use context managers** (`with` statement) to ensure proper cleanup
2. **Respect rate limits** - the default is 2 seconds between requests
3. **Check robots.txt** before scraping a website (see the sketch after this list)
4. **Use appropriate User-Agent** headers
5. **Handle errors gracefully** - the scrapers include built-in retry logic
6. **Validate and clean data** before storing it
7. **Log everything** for debugging purposes
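As a concrete way to follow tips 2 and 3, Python's standard-library `urllib.robotparser` can check a site's `robots.txt` and any declared crawl delay before you start scraping. This guide doesn't ship a project helper for that, so treat the snippet below as a generic sketch; the User-Agent string and URLs are placeholders:

```python
from urllib import robotparser

USER_AGENT = "MyScraperBot/1.0"  # placeholder: use the User-Agent your scraper actually sends
target_url = "https://quotes.toscrape.com/some/page"

# Download and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()

if not rp.can_fetch(USER_AGENT, target_url):
    print(f"robots.txt disallows fetching {target_url}")
else:
    # Honour an explicit crawl delay if the site declares one
    delay = rp.crawl_delay(USER_AGENT)
    print(f"OK to fetch; requested crawl delay: {delay or 'none specified'}")
```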
## 9. Troubleshooting

### Issue: Selenium driver not found

```bash
# The project uses webdriver-manager to auto-download drivers
# If you have issues, manually install ChromeDriver:
# 1. Download from https://chromedriver.chromium.org/
# 2. Add to your system PATH
```

### Issue: Import errors

```bash
# Make sure you've activated the virtual environment
# and installed all dependencies
pip install -r requirements.txt
```

### Issue: API keys not working

```bash
# Make sure you've copied .env.example to .env
# and added your actual API keys
cp .env.example .env
# Edit .env with your keys
```

## 10. Next Steps

- Explore the `examples/` directory for more use cases
- Read the full `README.md` for detailed documentation
- Check out the `tests/` directory to see testing patterns
- Customize `config.py` for your specific needs
- Build your own scrapers extending `BaseScraper` (a starting-point sketch follows below)
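The last bullet suggests building on `BaseScraper`, but its exact interface isn't reproduced in this guide, so the sketch below is only a hypothetical starting point. It assumes the base class provides the context-manager behaviour and an HTTP session like the built-in scrapers do, and that a subclass returns the same `{"success": ...}` result dictionary used throughout this guide; check the actual base class in the `scrapers` package before copying it.

```python
from scrapers.base_scraper import BaseScraper  # import path is assumed; verify it in the scrapers package


class MyApiScraper(BaseScraper):
    """Hypothetical scraper for a site that exposes a JSON endpoint."""

    def scrape(self, url: str) -> dict:
        # self.session is assumed to be a requests-style session provided by BaseScraper;
        # adapt this to whatever the real base class exposes.
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            return {"success": True, "url": url, "data": response.json()}
        except Exception as exc:
            return {"success": False, "url": url, "error": str(exc)}
```

Happy Scraping! 🚀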