# Advanced Concepts

This guide covers advanced topics for building robust and reliable scrapers.

## Stream Protection and Proxying

Modern streaming services use various protection mechanisms.

### Common Protections

1. **Referer Checking** - URLs only work from specific domains
2. **CORS Restrictions** - Prevent browser access from unauthorized origins
3. **Geographic Blocking** - IP-based access restrictions
4. **Time-Limited Tokens** - URLs expire after short periods
5. **User-Agent Filtering** - Only allow specific browsers/clients

### Handling Protected Streams

**Convert HLS playlists to data URLs:**

Data URLs embed the playlist content directly in the URL using base64 encoding. Because the browser makes no HTTP request to another origin to load the playlist, CORS restrictions on the playlist no longer apply. This is often the most effective solution for HLS streams.

```typescript
import { convertPlaylistsToDataUrls } from '@/utils/playlist';

// Original playlist URL from streaming service
const playlistUrl = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required to access the playlist
const headers = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

// Convert playlist and all variants to data URLs
const dataUrl = await convertPlaylistsToDataUrls(
  ctx.proxiedFetcher,
  playlistUrl,
  headers
);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: dataUrl, // Self-contained data URL
    flags: [flags.CORS_ALLOWED], // No CORS issues with data URLs
    captions: []
  }]
};
```

**Why data URLs work for CORS bypass:**

- **Initial proxied request with headers**: The proxy fetches the playlist with the required headers, so the playlist can load
- **Fewer external requests**: Playlist content is embedded directly in the URL as base64
- **Same-origin**: Browsers treat data URLs as same-origin content
- **Complete isolation**: No network request for the playlist means no CORS preflight checks for it
- **Self-contained**: All playlist
data is embedded in the data URL; the segments themselves are still requested from their original URLs

**How the conversion works:**

1. Fetches the master playlist using the provided headers
2. For each quality variant, fetches the variant playlist
3. Converts all playlists to base64-encoded data URLs
4. Returns a master data URL containing all embedded variants

**When to use data URLs vs M3U8 proxy:**

- **Use data URLs** when you can fetch all playlist data upfront
- **Use M3U8 proxy** when playlists are too large or change frequently
- **Data URLs are preferred** for most HLS streams due to simplicity and reliability
- **Use M3U8 proxy** when each segment is origin or header locked: converting to a data URL does not apply the headers to the segments

**Use M3U8 proxy for HLS streams:**

The `createM3U8ProxyUrl` function uses the configured M3U8 proxy to send the headers with requests for the playlist and all its segments.

```typescript
import { createM3U8ProxyUrl } from '@/utils/proxy';

// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required by the streaming service
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

// Create proxied URL that handles headers and CORS
const proxiedUrl = createM3U8ProxyUrl(originalPlaylist, streamHeaders);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: proxiedUrl, // Use proxied URL
    flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
    captions: []
  }]
};
```

### Stream Validation Bypass

Streams are checked with a proxied fetcher before they are returned. However, some streams may be blocked by proxies, so you may need to bypass automatic stream validation in `valid.ts`:

```typescript
// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];
```

**Why this is needed:**

- By default, all streams are validated by attempting to fetch metadata
- The validation uses `proxiedFetcher` to check whether streams are playable
- If the stream blocks the fetcher, validation fails
- The proxied URL should still work in the actual player context
- Adding the ID to the skip list bypasses validation and returns the proxied URL directly without checking it

**When to skip validation:**

- Validation consistently fails but streams work in browsers
- The stream may be origin or IP locked
- The stream blocks the extension or proxy

**Use setupProxy for MP4 streams:**

When you add headers to the stream response, you will usually need the **extension** or a native client to send the correct headers with the request.

```typescript
import { setupProxy } from '@/utils/proxy';

let stream = {
  id: 'primary',
  type: 'file',
  flags: [],
  qualities: {
    '1080p': { url: 'https://protected-cdn.example.com/video.mp4' }
  },
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0...'
  },
  captions: []
};

// setupProxy will handle proxying if needed
stream = setupProxy(stream);

return { stream: [stream] };
```

## Performance Optimization

### Efficient Data Extraction

**Use targeted selectors:**

```typescript
// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');

// ❌ Bad - searches entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');
```

**Cache expensive operations:**

```typescript
// Cache parsed data to avoid re-parsing
let cachedConfig;
if (!cachedConfig) {
  cachedConfig = JSON.parse(configString);
}
```

### Minimize HTTP Requests

**Combine operations when possible:**

```typescript
// ✅ Good - single request with full processing
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);

// ❌ Bad - multiple requests for same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);
```

## Security Considerations

### Input Validation

**Validate external data:**

```typescript
// Validate URLs before using them
const isValidUrl = (url: string) => {
  try {
    new URL(url);
    return url.startsWith('http://') || url.startsWith('https://');
  } catch {
    return false;
  }
};

if (!isValidUrl(streamUrl)) {
  throw new Error('Invalid stream URL received');
}
```

**Sanitize regex inputs:**

```typescript
// Be careful with dynamic regex
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');
```

### Safe JSON Parsing

```typescript
// Handle malformed JSON gracefully
let config;
try {
  config = JSON.parse(configString);
} catch (error) {
  throw new Error('Invalid configuration format');
}

// Validate expected structure
if (!config || typeof config !== 'object' || !config.streams) {
  throw new Error('Invalid configuration structure');
}
```

## Testing and Debugging

### Debug Logging

```typescript
// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);
```

### Test Edge Cases

- Content with special characters in titles
- Very new releases (might not be available)
- Old content (might have different URL patterns)
- Different regions (geographic restrictions)
- Different quality levels

### Common Debugging Steps

1. **Verify URLs are correct**
2. **Check HTTP status codes**
3. **Inspect response headers**
4. **Validate extracted data structure**
5. **Test with different content types**

## Best Practices Summary

1. **Always use `ctx.proxiedFetcher`** for external requests
2. **Throw `NotFoundError`** for content-not-found scenarios
3. **Update progress** at meaningful milestones
4. **Use appropriate flags** for stream capabilities
5. **Handle protected streams** with proxy functions
6. **Validate external data** before using it
7. **Test thoroughly** with diverse content
8. **Document your implementation** in pull requests

## Next Steps

With these advanced concepts:

1. Review [Sources vs Embeds](/in-depth/sources-and-embeds) for architectural patterns
2. Study existing scrapers in `src/providers/` for real examples
3. Test your implementation thoroughly
4. Submit pull requests with detailed testing documentation
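
When testing thoroughly, the first item under Test Edge Cases (special characters in titles) is easy to check with a small standalone script. The sketch below reuses the escaping pattern from the regex sanitization example. `escapeRegex` and `titleMatches` are hypothetical helper names for illustration and are not part of the codebase; the sample titles and listing string are made up.

```typescript
// Hypothetical helpers for illustration; neither name exists in the providers codebase.

// Escape every regex metacharacter so the title is matched as literal text.
function escapeRegex(input: string): string {
  return input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Case-insensitive check that `candidate` contains `title` literally.
function titleMatches(title: string, candidate: string): boolean {
  return new RegExp(escapeRegex(title), 'i').test(candidate);
}

// Sample titles containing characters that are special inside a regex.
const edgeCaseTitles = ['What If...?', 'Face/Off', 'C++ (2024)', '[REC]', '$9.99'];

for (const title of edgeCaseTitles) {
  // Made-up listing text that contains the title in a different letter case.
  const listing = `Watch ${title.toUpperCase()} online`;
  console.log(title, titleMatches(title, listing));
}
```

Without the escaping step, a title such as `C++ (2024)` is not even a valid pattern, and a title such as `a.c` would also match `abc`, so a check like this is worth running against any scraper that builds a `RegExp` from `ctx.media.title`.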