# Advanced Concepts

This guide covers advanced topics for building robust and reliable scrapers.

## Stream Protection and Proxying

Modern streaming services use various protection mechanisms to prevent direct access to their streams.

### Common Protections

1. **Referer Checking** - URLs only work from specific domains
2. **CORS Restrictions** - Prevent browser access from unauthorized origins
3. **Geographic Blocking** - IP-based access restrictions
4. **Time-Limited Tokens** - URLs expire after short periods
5. **User-Agent Filtering** - Only allow specific browsers/clients

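Time-limited tokens, for example, often embed an expiry timestamp directly in the URL. Below is a minimal sketch of checking for one, assuming the token lives in an `expires` query parameter holding a Unix timestamp in seconds (the parameter name is an assumption and varies by service):

```typescript
// Check whether a time-limited stream URL has expired, assuming the
// expiry is carried in an `expires` query parameter as a Unix timestamp
// in seconds. The parameter name is an assumption; real services vary.
function isUrlExpired(url: string, nowMs: number = Date.now()): boolean {
  const expires = new URL(url).searchParams.get('expires');
  if (!expires) return false; // no token found; assume the URL does not expire
  return Number(expires) * 1000 < nowMs;
}
```

If such a check fails, re-run the scraper to obtain a fresh URL rather than returning the stale one.
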
### Handling Protected Streams

**Convert HLS playlists to data URLs:**

Data URLs embed content directly in the URL using base64 encoding. Because the player never makes an HTTP request to a different origin, CORS restrictions do not apply. This is often the most effective solution for HLS streams.

```typescript
import { convertPlaylistsToDataUrls } from '@/utils/playlist';

// Original playlist URL from the streaming service
const playlistUrl = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required to access the playlist
const headers = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
};

// Convert the playlist and all of its variants to data URLs
const dataUrl = await convertPlaylistsToDataUrls(
  ctx.proxiedFetcher,
  playlistUrl,
  headers,
);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: dataUrl, // Self-contained data URL
    flags: [flags.CORS_ALLOWED], // No CORS issues with data URLs
    captions: [],
  }],
};
```

**Why data URLs work for CORS bypass:**

- **Initial proxied request with headers**: the first request goes through the proxy *with* the required headers, so the playlist loads
- **Fewer external requests**: content is embedded directly in the URL as base64
- **Same-origin**: browsers treat data URLs as same-origin content
- **Complete isolation**: no network request for the playlist means no CORS preflight checks
- **Self-contained**: all playlist data (master and variants) is embedded in the response

**How the conversion works:**

1. Fetches the master playlist using the provided headers
2. For each quality variant, fetches the variant playlist
3. Converts all playlists to base64-encoded data URLs
4. Returns a master data URL containing all embedded variants

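The encoding step can be sketched for a single playlist. This is a simplified illustration of the idea, not the actual `convertPlaylistsToDataUrls` implementation; a real conversion also fetches each variant and rewrites its URI inside the master playlist before encoding:

```typescript
import { Buffer } from 'node:buffer';

// Encode an HLS playlist's text as a base64 data URL. Illustrative only:
// the real helper also converts each variant playlist and rewrites its
// URI in the master playlist before encoding the master itself.
function playlistToDataUrl(playlistText: string): string {
  const base64 = Buffer.from(playlistText, 'utf-8').toString('base64');
  return `data:application/vnd.apple.mpegurl;base64,${base64}`;
}
```
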
**When to use data URLs vs the M3U8 proxy:**

- **Use data URLs** when you can fetch all playlist data upfront
- **Use the M3U8 proxy** when playlists are too large or change frequently
- **Data URLs are preferred** for most HLS streams due to their simplicity and reliability
- **Segment-locked streams need the proxy**: converting to a data URL does not apply the headers to segment requests, so if each segment is origin- or header-locked, use the M3U8 proxy instead


**Use the M3U8 proxy for HLS streams:**

Using the `createM3U8ProxyUrl` function, we can route the playlist through our configured M3U8 proxy, which sends the required headers with requests for the playlist and all of its segments.

```typescript
import { createM3U8ProxyUrl } from '@/utils/proxy';

// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required by the streaming service
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
};

// Create a proxied URL that handles headers and CORS
const proxiedUrl = createM3U8ProxyUrl(originalPlaylist, streamHeaders);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: proxiedUrl, // Use the proxied URL
    flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
    captions: [],
  }],
};
```
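For intuition, a proxy URL of this kind typically just carries the target URL and the headers as query parameters. The shape below is an assumption for illustration; the actual URL produced by `createM3U8ProxyUrl` depends on the configured proxy, and `proxyBase`, `url`, and `headers` are hypothetical names:

```typescript
// Hypothetical sketch of how an M3U8 proxy URL might be built: the proxy
// receives the target playlist URL and the headers to attach, both as
// query parameters. Parameter names and proxyBase are assumptions.
function buildProxyUrl(
  proxyBase: string,
  target: string,
  headers: Record<string, string>,
): string {
  const proxied = new URL(proxyBase);
  proxied.searchParams.set('url', target);
  proxied.searchParams.set('headers', JSON.stringify(headers));
  return proxied.toString();
}
```

The key design difference from data URLs is that the proxy can also rewrite segment URIs inside the playlist it serves, so segment requests flow through the proxy with the same headers attached.
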

### Stream Validation Bypass

Streams are checked with a proxied fetcher before they are returned. Some streams block requests from proxies, however, so you may need to bypass automatic stream validation in `valid.ts`:

```typescript
// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];
```

**Why this is needed:**

- By default, all streams are validated by attempting to fetch their metadata
- The validation uses `proxiedFetcher` to check whether streams are playable
- If the stream blocks the fetcher, validation fails
- The proxied URL may still work in the actual player context
- Adding your scraper ID to the skip list bypasses validation and returns the proxied URL directly, without checking it

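Conceptually, the skip list turns validation into a check like the following. This is a simplified sketch; the real logic lives in `src/utils/valid.ts` and the scraper ID here is illustrative:

```typescript
// Illustrative sketch: streams from scrapers on the skip list are
// returned without the proxiedFetcher playability check.
const SKIP_VALIDATION_CHECK_IDS = ['blocked-by-proxy-scraper'];

function shouldValidate(scraperId: string): boolean {
  return !SKIP_VALIDATION_CHECK_IDS.includes(scraperId);
}
```
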
**When to skip validation:**

- Validation consistently fails but streams work in browsers
- The stream may be origin- or IP-locked
- The stream blocks the extension or proxy


**Use setupProxy for MP4 streams:**

When adding headers to a stream response, you may need the **extension** or native client to send the correct headers with the request.

```typescript
import { setupProxy } from '@/utils/proxy';

let stream = {
  id: 'primary',
  type: 'file',
  flags: [],
  qualities: {
    '1080p': { url: 'https://protected-cdn.example.com/video.mp4' },
  },
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0...',
  },
  captions: [],
};

// setupProxy will handle proxying if needed
stream = setupProxy(stream);

return { stream: [stream] };
```

## Performance Optimization

### Efficient Data Extraction

**Use targeted selectors:**

```typescript
// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');

// ❌ Bad - searches the entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');
```

**Cache expensive operations:**

```typescript
// Cache parsed data at module scope so it survives between calls;
// a locally declared variable would be reset on every invocation
let cachedConfig: unknown;

function getConfig(configString: string) {
  if (!cachedConfig) {
    cachedConfig = JSON.parse(configString);
  }
  return cachedConfig;
}
```

### Minimize HTTP Requests

**Combine operations when possible:**

```typescript
// ✅ Good - single request, then process everything from it
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);

// ❌ Bad - multiple requests for the same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);
```

## Security Considerations

### Input Validation

**Validate external data:**

```typescript
// Validate URLs before using them
const isValidUrl = (url: string) => {
  try {
    new URL(url);
    return url.startsWith('http://') || url.startsWith('https://');
  } catch {
    return false;
  }
};

if (!isValidUrl(streamUrl)) {
  throw new Error('Invalid stream URL received');
}
```

**Sanitize regex inputs:**

```typescript
// Be careful with dynamic regex - escape special characters first
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');
```

### Safe JSON Parsing

```typescript
// Handle malformed JSON gracefully
let config;
try {
  config = JSON.parse(configString);
} catch (error) {
  throw new Error('Invalid configuration format');
}

// Validate the expected structure
if (!config || typeof config !== 'object' || !config.streams) {
  throw new Error('Invalid configuration structure');
}
```
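The two steps can be combined into one reusable helper. A sketch, where the `parseConfig` name and the `streams` shape are illustrative rather than a repository API:

```typescript
// Parse a JSON config string and validate its expected shape before use.
// Throws a descriptive error instead of letting malformed data propagate.
function parseConfig(configString: string): { streams: unknown[] } {
  let config: unknown;
  try {
    config = JSON.parse(configString);
  } catch {
    throw new Error('Invalid configuration format');
  }
  if (!config || typeof config !== 'object' || !Array.isArray((config as { streams?: unknown }).streams)) {
    throw new Error('Invalid configuration structure');
  }
  return config as { streams: unknown[] };
}
```
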

## Testing and Debugging

### Debug Logging

```typescript
// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);
```

### Test Edge Cases

- Content with special characters in titles
- Very new releases (might not be available yet)
- Old content (might use different URL patterns)
- Different regions (geographic restrictions)
- Different quality levels

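For the special-characters case in particular, it helps to normalize titles before comparing them with search results. A hedged sketch follows; the normalization rules are illustrative, not the repository's actual matching helper:

```typescript
// Normalize a title for fuzzy comparison: lowercase, strip punctuation,
// collapse whitespace. The exact rules here are illustrative.
function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}
```
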
### Common Debugging Steps

1. **Verify URLs are correct**
2. **Check HTTP status codes**
3. **Inspect response headers**
4. **Validate the extracted data structure**
5. **Test with different content types**

## Best Practices Summary

1. **Always use `ctx.proxiedFetcher`** for external requests
2. **Throw `NotFoundError`** for content-not-found scenarios
3. **Update progress** at meaningful milestones
4. **Use appropriate flags** for stream capabilities
5. **Handle protected streams** with proxy functions
6. **Validate external data** before using it
7. **Test thoroughly** with diverse content
8. **Document your implementation** in pull requests

## Next Steps

With these advanced concepts in hand:

1. Review [Sources vs Embeds](/in-depth/sources-and-embeds) for architectural patterns
2. Study existing scrapers in `src/providers/` for real examples
3. Test your implementation thoroughly
4. Submit pull requests with detailed testing documentation