Advanced Concepts
This guide covers advanced topics for building robust and reliable scrapers.
Stream Protection and Proxying
Modern streaming services use various protection mechanisms; a small probe sketch follows the list below.
Common Protections
- Referer Checking - URLs only work from specific domains
- CORS Restrictions - Prevent browser access from unauthorized origins
- Geographic Blocking - IP-based access restrictions
- Time-Limited Tokens - URLs expire after short periods
- User-Agent Filtering - Only allow specific browsers/clients
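As a rough illustration, the sketch below probes a stream URL to guess which of these protections applies. It is not part of the provider API: probeProtection and the placeholder Referer are invented for illustration, and it assumes a Node 18+ runtime where fetch may set the Referer header (browsers forbid it).
// Hypothetical helper (not part of the library): probe a URL to guess its protection
async function probeProtection(url: string): Promise<string> {
  // Bare request with no headers at all
  const bare = await fetch(url, { method: 'HEAD' });
  if (bare.ok) return 'unprotected';

  // Retry with a Referer the CDN might expect (placeholder domain)
  const withReferer = await fetch(url, {
    method: 'HEAD',
    headers: { Referer: 'https://player.example.com/' },
  });
  if (withReferer.ok) return 'referer-checked';

  // A 403 either way often indicates geo-blocking or an expired token
  if (bare.status === 403) return 'geo-blocked or expired token';
  return `unknown (HTTP ${bare.status})`;
}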
Handling Protected Streams
Use an M3U8 proxy for HLS streams:
The createM3U8ProxyUrl function routes the playlist and all of its segments through the configured M3U8 proxy, which attaches the required headers to each request.
import { createM3U8ProxyUrl } from '@/utils/proxy';
import { flags } from '@/entrypoint/utils/targets';

// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required by the streaming service
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

// Create proxied URL that handles headers and CORS
const proxiedUrl = createM3U8ProxyUrl(originalPlaylist, streamHeaders);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: proxiedUrl, // Use proxied URL
    flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
    captions: []
  }]
};
Stream Validation Bypass
When using M3U8 proxies that are origin-locked (like P-Stream proxies), you may need to bypass automatic stream validation in valid.ts:
// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];
Why this is needed:
- By default, all streams are validated by attempting to fetch metadata
- The validation uses proxiedFetcher to check if streams are playable
- If your proxy blocks the fetcher (origin-locked), validation will fail
- But the proxied URL should still work in the actual player context
- Adding your ID to the skip list bypasses validation and returns the proxied URL directly without checking it (a simplified sketch of this gate follows)
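To make the mechanism concrete, here is a simplified sketch of what such a validation gate could look like. It is an assumption-laden illustration, not the real valid.ts: validateStream is an invented name, and plain fetch stands in for the library's proxiedFetcher to keep the sketch self-contained.
// Simplified, hypothetical sketch of a validation gate (not the real valid.ts)
async function validateStream(playlistUrl: string, scraperId: string): Promise<boolean> {
  // Origin-locked proxies would block this fetch, so listed scrapers skip it
  if (SKIP_VALIDATION_CHECK_IDS.includes(scraperId)) return true;
  try {
    // Try to fetch the playlist to confirm the stream is reachable
    const response = await fetch(playlistUrl);
    return response.ok;
  } catch {
    return false;
  }
}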
When to skip validation:
- Your scraper uses origin-locked proxies
- The proxy service blocks programmatic access
- Validation consistently fails but streams work in browsers
- You're certain the proxy setup is correct
Use setupProxy for MP4 streams: when a stream response includes custom headers, the player usually needs the extension or a native client to attach them to the request; setupProxy handles the proxying when that support is unavailable.
import { setupProxy } from '@/utils/proxy';

let stream = {
  id: 'primary',
  type: 'file',
  flags: [],
  qualities: {
    // Quality keys like '360', '720', '1080'; each entry needs a type and url
    '1080': { type: 'mp4', url: 'https://protected-cdn.example.com/video.mp4' }
  },
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0...'
  },
  captions: []
};

// setupProxy will handle proxying if needed
stream = setupProxy(stream);
return { stream: [stream] };
Performance Optimization
Efficient Data Extraction
Use targeted selectors:
// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');
// ❌ Bad - searches entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');
Cache expensive operations:
// Cache parsed data at module scope so repeat calls skip re-parsing
let cachedConfig: Record<string, unknown> | undefined;

function getConfig(configString: string) {
  if (!cachedConfig) {
    cachedConfig = JSON.parse(configString);
  }
  return cachedConfig;
}
Minimize HTTP Requests
Combine operations when possible:
// ✅ Good - single request with full processing
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);
// ❌ Bad - multiple requests for same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);
Security Considerations
Input Validation
Validate external data:
// Validate URLs before using them
const isValidUrl = (url: string) => {
  try {
    const parsed = new URL(url);
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch {
    return false;
  }
};

if (!isValidUrl(streamUrl)) {
  throw new Error('Invalid stream URL received');
}
Sanitize regex inputs:
// Be careful with dynamic regex
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');
Safe JSON Parsing
// Handle malformed JSON gracefully
let config;
try {
  config = JSON.parse(configString);
} catch {
  throw new Error('Invalid configuration format');
}

// Validate expected structure
if (!config || typeof config !== 'object' || !config.streams) {
  throw new Error('Invalid configuration structure');
}
Testing and Debugging
Debug Logging
// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);
Test Edge Cases
- Content with special characters in titles
- Very new releases (might not be available)
- Old content (might have different URL patterns)
- Different regions (geographic restrictions)
- Different quality levels (the sketch below shows one way to organize such cases)
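One way to exercise these cases systematically is to keep a small matrix of representative media and run the scraper over it. This is an illustrative sketch only: the titles are arbitrary and runScraper is a hypothetical test helper, not a library function.
// Illustrative edge-case matrix; runScraper is a hypothetical helper
const edgeCases = [
  { title: 'Amélie', releaseYear: 2001 }, // accented characters in the title
  { title: 'Brand New Release', releaseYear: 2025 }, // may not be available yet
  { title: 'Metropolis', releaseYear: 1927 }, // old content, different URL patterns
];

for (const media of edgeCases) {
  // Expect streams or a clean NotFoundError, never an unhandled crash
  await runScraper(media);
}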
Common Debugging Steps
- Verify URLs are correct
- Check HTTP status codes
- Inspect response headers
- Validate extracted data structure
- Test with different content types (a debug helper sketch follows)
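A throwaway helper can cover several of these steps in one place. This is a hedged sketch: debugResponse is not part of the library, and it assumes you have a standard Fetch Response object in hand (ctx.proxiedFetcher normally returns the parsed body instead).
// Hypothetical debug helper; remove it before submitting, like the logs above
async function debugResponse(label: string, response: Response): Promise<void> {
  console.log(`[${label}] status:`, response.status, response.statusText);
  console.log(`[${label}] headers:`, Object.fromEntries(response.headers));
  // Clone before reading so the original body stream stays consumable
  const body = await response.clone().text();
  console.log(`[${label}] body preview:`, body.slice(0, 200));
}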
Best Practices Summary
- Always use ctx.proxiedFetcher for external requests
- Throw NotFoundError for content-not-found scenarios
- Update progress at meaningful milestones
- Use appropriate flags for stream capabilities
- Handle protected streams with proxy functions
- Validate external data before using it
- Test thoroughly with diverse content
- Document your implementation in pull requests
Next Steps
With these advanced concepts in hand:
- Review Sources vs Embeds for architectural patterns
- Study existing scrapers in src/providers/ for real examples
- Test your implementation thoroughly
- Submit pull requests with detailed testing documentation