# Advanced Concepts
This guide covers advanced topics for building robust and reliable scrapers.
## Stream Protection and Proxying
Modern streaming services use various protection mechanisms.
### Common Protections
1. **Referer Checking** - URLs only work from specific domains
2. **CORS Restrictions** - Prevent browser access from unauthorized origins
3. **Geographic Blocking** - IP-based access restrictions
4. **Time-Limited Tokens** - URLs expire after short periods
5. **User-Agent Filtering** - Only allow specific browsers/clients
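Several of these checks (referer, CORS, user-agent) come down to what the request looks like, so the usual first step is to replay the headers the official player would send. A minimal sketch, assuming the fetcher accepts a `headers` option and using placeholder URLs and header values:
```typescript
// Replay the headers the official player would send so referer and
// user-agent checks pass; the URL and values below are placeholders.
const playlistText = await ctx.proxiedFetcher('https://protected-cdn.example.com/playlist.m3u8', {
  headers: {
    'Referer': 'https://player.example.com/',
    'Origin': 'https://player.example.com',
  },
});
```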
### Handling Protected Streams
**Convert HLS playlists to data URLs:**
Data URLs embed content directly in the URL using base64 encoding, bypassing CORS restrictions entirely since the player never makes an HTTP request to a different origin to load the playlist. This is often the most effective solution for HLS streams.
```typescript
import { convertPlaylistsToDataUrls } from '@/utils/playlist';
// Original playlist URL from streaming service
const playlistUrl = 'https://protected-cdn.example.com/playlist.m3u8';
// Headers required to access the playlist
const headers = {
'Referer': 'https://player.example.com/',
'Origin': 'https://player.example.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};
// Convert playlist and all variants to data URLs
const dataUrl = await convertPlaylistsToDataUrls(
ctx.proxiedFetcher,
playlistUrl,
headers
);
return {
stream: [{
id: 'primary',
type: 'hls',
playlist: dataUrl, // Self-contained data URL
flags: [flags.CORS_ALLOWED], // No CORS issues with data URLs
captions: []
}]
};
```
**Why data URLs work for CORS bypass:**
- **Initial proxied request with headers**: the playlist is fetched through the proxy *with* the required headers, allowing the protected playlist to load
- **Fewer external requests**: Content is embedded directly in the URL as base64
- **Same-origin**: Browsers treat data URLs as same-origin content
- **Complete isolation**: No network requests means no CORS preflight checks
- **Self-contained**: All playlist data and segments are embedded in the response
**How the conversion works:**
1. Fetches the master playlist using provided headers
2. For each quality variant, fetches the variant playlist
3. Converts all playlists to base64-encoded data URLs
4. Returns a master data URL containing all embedded variants
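As a rough illustration of the idea (not the utility's actual implementation), a single playlist can be wrapped in a data URL roughly like this; the simplified fetcher signature, the use of Node's `Buffer`, and the MIME type are assumptions of the sketch:
```typescript
// Simplified sketch: fetch one playlist with its headers and embed it in a
// base64 data URL (the real utility also rewrites variant playlist URIs).
async function playlistToDataUrl(
  fetcher: (url: string, opts?: { headers?: Record<string, string> }) => Promise<string>,
  url: string,
  headers: Record<string, string>,
): Promise<string> {
  const playlistText = await fetcher(url, { headers });
  const base64 = Buffer.from(playlistText, 'utf-8').toString('base64');
  return `data:application/vnd.apple.mpegurl;base64,${base64}`;
}
```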
**When to use data URLs vs M3U8 proxy:**
- **Use data URLs** when you can fetch all playlist data upfront
- **Use M3U8 proxy** when playlists are too large or change frequently
- **Data URLs are preferred** for most HLS streams due to simplicity and reliability
- **Watch for header-locked segments**: converting playlists to data URLs does not apply headers to the segment requests, so if individual segments are origin- or header-locked, use the M3U8 proxy instead
**Use M3U8 proxy for HLS streams:**
The `createM3U8ProxyUrl` function routes the stream through the configured M3U8 proxy, which attaches the required headers to requests for the playlist and all of its segments.
```typescript
import { createM3U8ProxyUrl } from '@/utils/proxy';
// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';
// Headers required by the streaming service
const streamHeaders = {
'Referer': 'https://player.example.com/',
'Origin': 'https://player.example.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};
// Create proxied URL that handles headers and CORS
const proxiedUrl = createM3U8ProxyUrl(originalPlaylist, streamHeaders);
return {
stream: [{
id: 'primary',
type: 'hls',
playlist: proxiedUrl, // Use proxied URL
flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
captions: []
}]
};
```
### Stream Validation Bypass
Streams are checked with a proxied fetcher before returning. However, some streams may be blocked by proxies, so you may need to bypass automatic stream validation in `valid.ts`:
```typescript
// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
// ... existing IDs
'your-scraper-id', // Add your scraper ID here
];
```
**Why this is needed:**
- By default, all streams are validated by attempting to fetch metadata
- The validation uses `proxiedFetcher` to check if streams are playable
- If the stream blocks the fetcher, validation will fail
- But the proxied URL should still work in the actual player context
- Adding your scraper ID to the skip list bypasses validation and returns the stream directly without checking it
**When to skip validation:**
- Validation consistently fails but streams work in browsers
- The stream may be origin or IP locked
- The stream blocks the extension or proxy
**Use setupProxy for MP4 streams:**
When adding headers to a stream response, you usually need the **extension** or the native client to send the correct headers with the request.
```typescript
import { setupProxy } from '@/utils/proxy';
let stream = {
id: 'primary',
type: 'file',
flags: [],
qualities: {
'1080p': { url: 'https://protected-cdn.example.com/video.mp4' }
},
headers: {
'Referer': 'https://player.example.com/',
'User-Agent': 'Mozilla/5.0...'
},
captions: []
};
// setupProxy will handle proxying if needed
stream = setupProxy(stream);
return { stream: [stream] };
```
## Performance Optimization
### Efficient Data Extraction
**Use targeted selectors:**
```typescript
// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');
// ❌ Bad - searches entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');
```
**Cache expensive operations:**
```typescript
// Cache parsed data at module scope so repeated calls don't re-parse it
let cachedConfig: Record<string, unknown> | undefined;
function getConfig(configString: string) {
  if (!cachedConfig) {
    cachedConfig = JSON.parse(configString);
  }
  return cachedConfig;
}
```
### Minimize HTTP Requests
**Combine operations when possible:**
```typescript
// ✅ Good - single request with full processing
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);
// ❌ Bad - multiple requests for same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);
```
## Security Considerations
### Input Validation
**Validate external data:**
```typescript
// Validate URLs before using them
const isValidUrl = (url: string) => {
try {
new URL(url);
return url.startsWith('http://') || url.startsWith('https://');
} catch {
return false;
}
};
if (!isValidUrl(streamUrl)) {
throw new Error('Invalid stream URL received');
}
```
**Sanitize regex inputs:**
```typescript
// Be careful with dynamic regex
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');
```
### Safe JSON Parsing
```typescript
// Handle malformed JSON gracefully
let config;
try {
config = JSON.parse(configString);
} catch (error) {
throw new Error('Invalid configuration format');
}
// Validate expected structure
if (!config || typeof config !== 'object' || !config.streams) {
throw new Error('Invalid configuration structure');
}
```
## Testing and Debugging
### Debug Logging
```typescript
// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);
```
### Test Edge Cases
- Content with special characters in titles (see the sketch after this list)
- Very new releases (might not be available)
- Old content (might have different URL patterns)
- Different regions (geographic restrictions)
- Different quality levels
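The special-character case, for example, often shows up when matching `ctx.media.title` against a site's search results; a hedged sketch of a tolerant comparison (the `titlesMatch` helper is hypothetical, not part of the codebase):
```typescript
// Hypothetical helper: normalize punctuation, case, and whitespace so
// titles like "WALL·E" and "Wall-E" compare as equal.
function titlesMatch(a: string, b: string): boolean {
  const normalize = (s: string) => s.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
  return normalize(a) === normalize(b);
}
```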
### Common Debugging Steps
1. **Verify URLs are correct**
2. **Check HTTP status codes**
3. **Inspect response headers**
4. **Validate extracted data structure**
5. **Test with different content types**
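For the status-code and header checks, a quick temporary approach during local debugging is to inspect the raw response with a plain `fetch` outside the provider fetchers (remove it before submitting, like the debug logs above):
```typescript
// Temporary debugging snippet: inspect the raw response directly.
const res = await fetch(requestUrl);
console.log('Status:', res.status);
console.log('Content-Type:', res.headers.get('content-type'));
```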
## Best Practices Summary
1. **Always use `ctx.proxiedFetcher`** for external requests
2. **Throw `NotFoundError`** for content-not-found scenarios
3. **Update progress** at meaningful milestones
4. **Use appropriate flags** for stream capabilities
5. **Handle protected streams** with proxy functions
6. **Validate external data** before using it
7. **Test thoroughly** with diverse content
8. **Document your implementation** in pull requests
## Next Steps
With these advanced concepts:
1. Review [Sources vs Embeds](/in-depth/sources-and-embeds) for architectural patterns
2. Study existing scrapers in `src/providers/` for real examples
3. Test your implementation thoroughly
4. Submit pull requests with detailed testing documentation