Advanced Concepts
This guide covers advanced topics for building robust and reliable scrapers.
Stream Protection and Proxying
Modern streaming services use various protection mechanisms; a small probe sketch follows the list below.
Common Protections
- Referer Checking - URLs only work from specific domains
- CORS Restrictions - Prevent browser access from unauthorized origins
- Geographic Blocking - IP-based access restrictions
- Time-Limited Tokens - URLs expire after short periods
- User-Agent Filtering - Only allow specific browsers/clients
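As a rough illustration, the sketch below probes a stream URL to guess which of these protections applies. It is not part of the provider API: probeProtection and the placeholder Referer are invented for illustration, and it assumes a Node 18+ runtime where fetch may set the Referer header (browsers forbid it).
// Hypothetical helper (not part of the library): probe a URL to guess its protection
async function probeProtection(url: string): Promise<string> {
  // Bare request with no headers at all
  const bare = await fetch(url, { method: 'HEAD' });
  if (bare.ok) return 'unprotected';

  // Retry with a Referer the CDN might expect (placeholder domain)
  const withReferer = await fetch(url, {
    method: 'HEAD',
    headers: { Referer: 'https://player.example.com/' },
  });
  if (withReferer.ok) return 'referer-checked';

  // A 403 either way often indicates geo-blocking or an expired token
  if (bare.status === 403) return 'geo-blocked or expired token';
  return `unknown (HTTP ${bare.status})`;
}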
Handling Protected Streams
Use an M3U8 proxy for HLS streams:
The createM3U8ProxyUrl function routes the playlist and all of its segments through the configured M3U8 proxy, which attaches the required headers to each request.
import { createM3U8ProxyUrl } from '@/utils/proxy';
import { flags } from '@/entrypoint/utils/targets';

// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required by the streaming service
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

// Create proxied URL that handles headers and CORS
const proxiedUrl = createM3U8ProxyUrl(originalPlaylist, streamHeaders);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: proxiedUrl, // Use proxied URL
    flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
    captions: []
  }]
};
Stream Validation Bypass
When using M3U8 proxies that are origin-locked (like P-Stream proxies), you may need to bypass automatic stream validation in valid.ts:
// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];
Why this is needed:
- By default, all streams are validated by attempting to fetch metadata
- The validation uses proxiedFetcher to check if streams are playable
- If your proxy blocks the fetcher (origin-locked), validation will fail
- But the proxied URL should still work in the actual player context
- Adding your ID to the skip list bypasses validation and returns the proxied URL directly without checking it (a simplified sketch of this gate follows)
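To make the mechanism concrete, here is a simplified sketch of what such a validation gate could look like. It is an assumption-laden illustration, not the real valid.ts: validateStream is an invented name, and plain fetch stands in for the library's proxiedFetcher to keep the sketch self-contained.
// Simplified, hypothetical sketch of a validation gate (not the real valid.ts)
async function validateStream(playlistUrl: string, scraperId: string): Promise<boolean> {
  // Origin-locked proxies would block this fetch, so listed scrapers skip it
  if (SKIP_VALIDATION_CHECK_IDS.includes(scraperId)) return true;
  try {
    // Try to fetch the playlist to confirm the stream is reachable
    const response = await fetch(playlistUrl);
    return response.ok;
  } catch {
    return false;
  }
}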
When to skip validation:
- Your scraper uses origin-locked proxies
- The proxy service blocks programmatic access
- Validation consistently fails but streams work in browsers
- You're certain the proxy setup is correct
Use setupProxy for MP4 streams: when a stream response includes custom headers, the player usually needs the extension or a native client to attach them to the request; setupProxy handles the proxying when that support is unavailable.
import { setupProxy } from '@/utils/proxy';

let stream = {
  id: 'primary',
  type: 'file',
  flags: [],
  qualities: {
    // Quality keys like '360', '720', '1080'; each entry needs a type and url
    '1080': { type: 'mp4', url: 'https://protected-cdn.example.com/video.mp4' }
  },
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0...'
  },
  captions: []
};

// setupProxy will handle proxying if needed
stream = setupProxy(stream);
return { stream: [stream] };
Performance Optimization
Efficient Data Extraction
Use targeted selectors:
// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');
// ❌ Bad - searches entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');
Cache expensive operations:
// Cache parsed data at module scope so repeat calls skip re-parsing
let cachedConfig: Record<string, unknown> | undefined;

function getConfig(configString: string) {
  if (!cachedConfig) {
    cachedConfig = JSON.parse(configString);
  }
  return cachedConfig;
}
Minimize HTTP Requests
Combine operations when possible:
// ✅ Good - single request with full processing
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);
// ❌ Bad - multiple requests for same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);
Security Considerations
Input Validation
Validate external data:
// Validate URLs before using them
const isValidUrl = (url: string) => {
  try {
    const parsed = new URL(url);
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch {
    return false;
  }
};

if (!isValidUrl(streamUrl)) {
  throw new Error('Invalid stream URL received');
}
Sanitize regex inputs:
// Be careful with dynamic regex
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');
Safe JSON Parsing
// Handle malformed JSON gracefully
let config;
try {
  config = JSON.parse(configString);
} catch {
  throw new Error('Invalid configuration format');
}

// Validate expected structure
if (!config || typeof config !== 'object' || !config.streams) {
  throw new Error('Invalid configuration structure');
}
Testing and Debugging
Debug Logging
// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);
Test Edge Cases
- Content with special characters in titles
- Very new releases (might not be available)
- Old content (might have different URL patterns)
- Different regions (geographic restrictions)
- Different quality levels (the sketch below shows one way to organize such cases)
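One way to exercise these cases systematically is to keep a small matrix of representative media and run the scraper over it. This is an illustrative sketch only: the titles are arbitrary and runScraper is a hypothetical test helper, not a library function.
// Illustrative edge-case matrix; runScraper is a hypothetical helper
const edgeCases = [
  { title: 'Amélie', releaseYear: 2001 }, // accented characters in the title
  { title: 'Brand New Release', releaseYear: 2025 }, // may not be available yet
  { title: 'Metropolis', releaseYear: 1927 }, // old content, different URL patterns
];

for (const media of edgeCases) {
  // Expect streams or a clean NotFoundError, never an unhandled crash
  await runScraper(media);
}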
Common Debugging Steps
- Verify URLs are correct
- Check HTTP status codes
- Inspect response headers
- Validate extracted data structure
- Test with different content types (a debug helper sketch follows)
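A throwaway helper can cover several of these steps in one place. This is a hedged sketch: debugResponse is not part of the library, and it assumes you have a standard Fetch Response object in hand (ctx.proxiedFetcher normally returns the parsed body instead).
// Hypothetical debug helper; remove it before submitting, like the logs above
async function debugResponse(label: string, response: Response): Promise<void> {
  console.log(`[${label}] status:`, response.status, response.statusText);
  console.log(`[${label}] headers:`, Object.fromEntries(response.headers));
  // Clone before reading so the original body stream stays consumable
  const body = await response.clone().text();
  console.log(`[${label}] body preview:`, body.slice(0, 200));
}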
Best Practices Summary
- Always use ctx.proxiedFetcher for external requests
- Throw NotFoundError for content-not-found scenarios
- Update progress at meaningful milestones
- Use appropriate flags for stream capabilities
- Handle protected streams with proxy functions
- Validate external data before using it
- Test thoroughly with diverse content
- Document your implementation in pull requests
Next Steps
With these advanced concepts in hand:
- Review Sources vs Embeds for architectural patterns
- Study existing scrapers in src/providers/ for real examples
- Test your implementation thoroughly
- Submit pull requests with detailed testing documentation