pstreams-providers/.docs/content/3.in-depth/5.advanced-concepts.md
2025-07-11 14:08:43 -06:00

6.3 KiB

Advanced Concepts

This guide covers advanced topics for building robust and reliable scrapers.

Stream Protection and Proxying

Modern streaming services use various protection mechanisms.

Common Protections

  1. Referer Checking - URLs only work from specific domains
  2. CORS Restrictions - Prevent browser access from unauthorized origins
  3. Geographic Blocking - IP-based access restrictions
  4. Time-Limited Tokens - URLs expire after short periods
  5. User-Agent Filtering - Only allow specific browsers/clients

Handling Protected Streams

Use M3U8 proxy for HLS streams:

Using the createM3U8ProxyUrl function we can use our configured M3U8 proxy to send headers to the playlist and all it's segments.

import { createM3U8ProxyUrl } from '@/utils/proxy';

// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required by the streaming service
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

// Create proxied URL that handles headers and CORS
const proxiedUrl = createM3U8ProxyUrl(originalPlaylist, streamHeaders);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: proxiedUrl, // Use proxied URL
    flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
    captions: []
  }]
};

Stream Validation Bypass

When using M3U8 proxies that are origin-locked (like P-Stream proxies), you may need to bypass automatic stream validation in valid.ts:

// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];

Why this is needed:

  • By default, all streams are validated by attempting to fetch metadata
  • The validation uses proxiedFetcher to check if streams are playable
  • If your proxy blocks the fetcher (origin-locked), validation will fail
  • But the proxied URL should still work in the actual player context
  • Adding to skip list bypasses validation and returns the proxied URL directly without checking it

When to skip validation:

  • Your scraper uses origin-locked proxies
  • The proxy service blocks programmatic access
  • Validation consistently fails but streams work in browsers
  • You're certain the proxy setup is correct

Use setupProxy for MP4 streams: When adding headers in the stream response, usually may need to use the extension or native to send the correct headers in the request.

import { setupProxy } from '@/utils/proxy';

let stream = {
  id: 'primary',
  type: 'file',
  flags: [],
  qualities: {
    '1080p': { url: 'https://protected-cdn.example.com/video.mp4' }
  },
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0...'
  },
  captions: []
};

// setupProxy will handle proxying if needed
stream = setupProxy(stream);

return { stream: [stream] };

Performance Optimization

Efficient Data Extraction

Use targeted selectors:

// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');

// ❌ Bad - searches entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');

Cache expensive operations:

// Cache parsed data to avoid re-parsing
let cachedConfig;
if (!cachedConfig) {
  cachedConfig = JSON.parse(configString);
}

Minimize HTTP Requests

Combine operations when possible:

// ✅ Good - single request with full processing
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);

// ❌ Bad - multiple requests for same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);

Security Considerations

Input Validation

Validate external data:

// Validate URLs before using them
const isValidUrl = (url: string) => {
  try {
    new URL(url);
    return url.startsWith('http://') || url.startsWith('https://');
  } catch {
    return false;
  }
};

if (!isValidUrl(streamUrl)) {
  throw new Error('Invalid stream URL received');
}

Sanitize regex inputs:

// Be careful with dynamic regex
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');

Safe JSON Parsing

// Handle malformed JSON gracefully
let config;
try {
  config = JSON.parse(configString);
} catch (error) {
  throw new Error('Invalid configuration format');
}

// Validate expected structure
if (!config || typeof config !== 'object' || !config.streams) {
  throw new Error('Invalid configuration structure');
}

Testing and Debugging

Debug Logging

// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);

Test Edge Cases

  • Content with special characters in titles
  • Very new releases (might not be available)
  • Old content (might have different URL patterns)
  • Different regions (geographic restrictions)
  • Different quality levels

Common Debugging Steps

  1. Verify URLs are correct
  2. Check HTTP status codes
  3. Inspect response headers
  4. Validate extracted data structure
  5. Test with different content types

Best Practices Summary

  1. Always use ctx.proxiedFetcher for external requests
  2. Throw NotFoundError for content-not-found scenarios
  3. Update progress at meaningful milestones
  4. Use appropriate flags for stream capabilities
  5. Handle protected streams with proxy functions
  6. Validate external data before using it
  7. Test thoroughly with diverse content
  8. Document your implementation in pull requests

Next Steps

With these advanced concepts:

  1. Review Sources vs Embeds for architectural patterns
  2. Study existing scrapers in src/providers/ for real examples
  3. Test your implementation thoroughly
  4. Submit pull requests with detailed testing documentation