# Advanced Concepts
This guide covers advanced topics for building robust and reliable scrapers.
## Stream Protection and Proxying
Modern streaming services use various protection mechanisms.
### Common Protections
1. **Referer Checking** - URLs only work from specific domains
2. **CORS Restrictions** - Prevent browser access from unauthorized origins
3. **Geographic Blocking** - IP-based access restrictions
4. **Time-Limited Tokens** - URLs expire after short periods
5. **User-Agent Filtering** - Only allow specific browsers/clients
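Several of these checks (referer, CORS, user-agent) come down to what the request looks like, so the usual first step is to replay the headers the official player would send. A minimal sketch, assuming the fetcher accepts a `headers` option and using placeholder URLs and header values:
```typescript
// Replay the headers the official player would send so referer and
// user-agent checks pass; the URL and values below are placeholders.
const playlistText = await ctx.proxiedFetcher('https://protected-cdn.example.com/playlist.m3u8', {
  headers: {
    'Referer': 'https://player.example.com/',
    'Origin': 'https://player.example.com',
  },
});
```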
### Handling Protected Streams
**Convert HLS playlists to data URLs:**
Data URLs embed content directly in the URL using base64 encoding, bypassing CORS restrictions entirely since the player never makes an HTTP request to a different origin to load the playlist. This is often the most effective solution for HLS streams.
```typescript
import { convertPlaylistsToDataUrls } from '@/utils/playlist';
// Original playlist URL from streaming service
const playlistUrl = 'https://protected-cdn.example.com/playlist.m3u8';
// Headers required to access the playlist
const headers = {
'Referer': 'https://player.example.com/',
'Origin': 'https://player.example.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};
// Convert playlist and all variants to data URLs
const dataUrl = await convertPlaylistsToDataUrls(
ctx.proxiedFetcher,
playlistUrl,
headers
);
return {
stream: [{
id: 'primary',
type: 'hls',
playlist: dataUrl, // Self-contained data URL
flags: [flags.CORS_ALLOWED], // No CORS issues with data URLs
captions: []
}]
};
```
**Why data URLs work for CORS bypass:**
- **Initial proxied request with headers**: the playlist is fetched through the proxy *with* the required headers, allowing the protected playlist to load
- **Fewer external requests**: Content is embedded directly in the URL as base64
- **Same-origin**: Browsers treat data URLs as same-origin content
- **Complete isolation**: No network requests means no CORS preflight checks
- **Self-contained**: All playlist data and segments are embedded in the response
**How the conversion works:**
1. Fetches the master playlist using provided headers
2. For each quality variant, fetches the variant playlist
3. Converts all playlists to base64-encoded data URLs
4. Returns a master data URL containing all embedded variants
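As a rough illustration of the idea (not the utility's actual implementation), a single playlist can be wrapped in a data URL roughly like this; the simplified fetcher signature, the use of Node's `Buffer`, and the MIME type are assumptions of the sketch:
```typescript
// Simplified sketch: fetch one playlist with its headers and embed it in a
// base64 data URL (the real utility also rewrites variant playlist URIs).
async function playlistToDataUrl(
  fetcher: (url: string, opts?: { headers?: Record<string, string> }) => Promise<string>,
  url: string,
  headers: Record<string, string>,
): Promise<string> {
  const playlistText = await fetcher(url, { headers });
  const base64 = Buffer.from(playlistText, 'utf-8').toString('base64');
  return `data:application/vnd.apple.mpegurl;base64,${base64}`;
}
```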
**When to use data URLs vs M3U8 proxy:**
- **Use data URLs** when you can fetch all playlist data upfront
- **Use M3U8 proxy** when playlists are too large or change frequently
- **Data URLs are preferred** for most HLS streams due to simplicity and reliability
- **Watch for header-locked segments**: converting playlists to data URLs does not apply headers to the segment requests, so if individual segments are origin- or header-locked, use the M3U8 proxy instead
**Use M3U8 proxy for HLS streams:**
The `createM3U8ProxyUrl` function routes the stream through the configured M3U8 proxy, which attaches the required headers to requests for the playlist and all of its segments.
```typescript
import { createM3U8ProxyUrl } from '@/utils/proxy';
// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';
// Headers required by the streaming service
const streamHeaders = {
'Referer': 'https://player.example.com/',
'Origin': 'https://player.example.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};
// Create proxied URL that handles headers and CORS
const proxiedUrl = createM3U8ProxyUrl(originalPlaylist, streamHeaders);
return {
stream: [{
id: 'primary',
type: 'hls',
playlist: proxiedUrl, // Use proxied URL
flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
captions: []
}]
};
```
### Stream Validation Bypass
Streams are checked with a proxied fetcher before returning. However, some streams may be blocked by proxies, so you may need to bypass automatic stream validation in `valid.ts`:
```typescript
// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
// ... existing IDs
'your-scraper-id', // Add your scraper ID here
];
```
**Why this is needed:**
- By default, all streams are validated by attempting to fetch metadata
- The validation uses `proxiedFetcher` to check if streams are playable
- If the stream blocks the fetcher, validation will fail
- But the proxied URL should still work in the actual player context
- Adding your scraper ID to the skip list bypasses validation and returns the stream directly without checking it
**When to skip validation:**
- Validation consistently fails but streams work in browsers
- The stream may be origin or IP locked
- The stream blocks the extension or proxy
**Use setupProxy for MP4 streams:**
When adding headers to a stream response, you usually need the **extension** or the native client to send the correct headers with the request.
```typescript
import { setupProxy } from '@/utils/proxy';
let stream = {
id: 'primary',
type: 'file',
flags: [],
qualities: {
'1080p': { url: 'https://protected-cdn.example.com/video.mp4' }
},
headers: {
'Referer': 'https://player.example.com/',
'User-Agent': 'Mozilla/5.0...'
},
captions: []
};
// setupProxy will handle proxying if needed
stream = setupProxy(stream);
return { stream: [stream] };
```
## Performance Optimization
### Efficient Data Extraction
**Use targeted selectors:**
```typescript
// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');
// ❌ Bad - searches entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');
```
**Cache expensive operations:**
```typescript
// Cache parsed data at module scope so repeated calls don't re-parse it
let cachedConfig: Record<string, unknown> | undefined;
function getConfig(configString: string) {
  if (!cachedConfig) {
    cachedConfig = JSON.parse(configString);
  }
  return cachedConfig;
}
```
### Minimize HTTP Requests
**Combine operations when possible:**
```typescript
// ✅ Good - single request with full processing
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);
// ❌ Bad - multiple requests for same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);
```
## Security Considerations
### Input Validation
**Validate external data:**
```typescript
// Validate URLs before using them
const isValidUrl = (url: string) => {
try {
new URL(url);
return url.startsWith('http://') || url.startsWith('https://');
} catch {
return false;
}
};
if (!isValidUrl(streamUrl)) {
throw new Error('Invalid stream URL received');
}
```
**Sanitize regex inputs:**
```typescript
// Be careful with dynamic regex
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');
```
### Safe JSON Parsing
```typescript
// Handle malformed JSON gracefully
let config;
try {
config = JSON.parse(configString);
} catch (error) {
throw new Error('Invalid configuration format');
}
// Validate expected structure
if (!config || typeof config !== 'object' || !config.streams) {
throw new Error('Invalid configuration structure');
}
```
## Testing and Debugging
### Debug Logging
```typescript
// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);
```
### Test Edge Cases
- Content with special characters in titles (see the sketch after this list)
- Very new releases (might not be available)
- Old content (might have different URL patterns)
- Different regions (geographic restrictions)
- Different quality levels
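The special-character case, for example, often shows up when matching `ctx.media.title` against a site's search results; a hedged sketch of a tolerant comparison (the `titlesMatch` helper is hypothetical, not part of the codebase):
```typescript
// Hypothetical helper: normalize punctuation, case, and whitespace so
// titles like "WALL·E" and "Wall-E" compare as equal.
function titlesMatch(a: string, b: string): boolean {
  const normalize = (s: string) => s.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
  return normalize(a) === normalize(b);
}
```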
### Common Debugging Steps
1. **Verify URLs are correct**
2. **Check HTTP status codes**
3. **Inspect response headers**
4. **Validate extracted data structure**
5. **Test with different content types**
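For the status-code and header checks, a quick temporary approach during local debugging is to inspect the raw response with a plain `fetch` outside the provider fetchers (remove it before submitting, like the debug logs above):
```typescript
// Temporary debugging snippet: inspect the raw response directly.
const res = await fetch(requestUrl);
console.log('Status:', res.status);
console.log('Content-Type:', res.headers.get('content-type'));
```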
## Best Practices Summary
1. **Always use `ctx.proxiedFetcher`** for external requests
2. **Throw `NotFoundError`** for content-not-found scenarios
3. **Update progress** at meaningful milestones
4. **Use appropriate flags** for stream capabilities
5. **Handle protected streams** with proxy functions
6. **Validate external data** before using it
7. **Test thoroughly** with diverse content
8. **Document your implementation** in pull requests
## Next Steps
With these advanced concepts:
1. Review [Sources vs Embeds](/in-depth/sources-and-embeds) for architectural patterns
2. Study existing scrapers in `src/providers/` for real examples
3. Test your implementation thoroughly
4. Submit pull requests with detailed testing documentation