pstreams-providers/.docs/content/3.in-depth/4.sources-and-embeds.md
2025-10-21 17:49:35 -06:00

# Sources vs Embeds
Understanding the difference between sources and embeds is crucial for building scrapers effectively. They work together in a two-stage pipeline to extract playable video streams.
## The Two-Stage Pipeline
```
User Request → Source Scraper → What did source find?
                                   ┌─────────────┐
                                   ↓             ↓
                             Direct Stream  Embed URLs
                                   ↓             ↓
                              Play Video   Embed Scraper
                                                 ↓
                                          Extract Stream
                                                 ↓
                                            Play Video
```
**Flow Breakdown:**
1. **User requests** content (movie/TV show)
2. **Source scraper** searches the target website
3. **Source returns** either:
   - **Direct streams** → Ready to play immediately
   - **Embed URLs** → Need further processing
4. **Embed scraper** (if needed) extracts streams from player URLs
5. **Final result** → Playable video stream
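
The flow above can be sketched as a small resolver. This is a minimal illustration, not the library's actual runner: the `Stream`, `EmbedRef`, `SourceOutput`, and `EmbedScraper` shapes and the `resolveStreams` function are simplified, hypothetical stand-ins for the real types.

```typescript
// Hypothetical, simplified shapes; the real types in the repo carry more fields.
type Stream = { id: string; type: 'hls' | 'file'; playlist: string };
type EmbedRef = { embedId: string; url: string };
type SourceOutput = { stream?: Stream[]; embeds: EmbedRef[] };
type EmbedScraper = (url: string) => Promise<Stream[]>;

async function resolveStreams(
  output: SourceOutput,
  embedScrapers: Record<string, EmbedScraper>,
): Promise<Stream[]> {
  // Stage 1 result: direct streams are ready to play immediately.
  if (output.stream && output.stream.length > 0) return output.stream;

  // Stage 2: hand each embed URL to the scraper registered under its embedId.
  for (const embed of output.embeds) {
    const scraper = embedScrapers[embed.embedId];
    if (!scraper) continue; // no scraper known for this embedId
    const streams = await scraper(embed.url);
    if (streams.length > 0) return streams;
  }
  return [];
}
```

A source that returns direct streams skips the second stage entirely; one that returns only embeds depends on a matching embed scraper being available.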
## Sources: The Content Finders
**Sources** are the first stage - they find content on websites and return either:
1. **Direct video streams** (ready to play)
2. **Embed URLs** that need further processing
### Example: Autoembed Source
```typescript
// From src/providers/sources/autoembed.ts
async function comboScraper(ctx: ShowScrapeContext | MovieScrapeContext): Promise<SourcererOutput> {
  // 1. Call an API to find video sources
  const data = await ctx.proxiedFetcher(`/api/getVideoSource`, {
    baseUrl: 'https://tom.autoembed.cc',
    query: { type: mediaType, id }
  });
  // 2. Return embed URLs for further processing
  return {
    embeds: [{
      embedId: 'autoembed-english', // Points to an embed scraper
      url: data.videoSource // URL that embed will process
    }]
  };
}
```
**What this source does:**
- Queries an API with the TMDB ID
- Gets back a video source URL
- Returns it as an embed for the `autoembed-english` embed scraper to handle
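
The excerpt above uses `mediaType` and `id` without showing where they come from. One plausible way to derive them from the scrape context's media is sketched below. The `Media` shapes, the `buildQuery` helper, and the `tmdbId/season/episode` ID format are all assumptions made for illustration, not code taken from `autoembed.ts`.

```typescript
// Hypothetical media shapes; the real context types may differ.
type MovieMedia = { type: 'movie'; tmdbId: string };
type ShowMedia = {
  type: 'show';
  tmdbId: string;
  season: { number: number };
  episode: { number: number };
};
type Media = MovieMedia | ShowMedia;

// Assumed query format: shows append season and episode to the TMDB ID.
function buildQuery(media: Media): { type: string; id: string } {
  if (media.type === 'show') {
    return {
      type: 'tv',
      id: `${media.tmdbId}/${media.season.number}/${media.episode.number}`,
    };
  }
  return { type: 'movie', id: media.tmdbId };
}
```

Handling both branches in one helper is what lets a single combo scraper serve movies and shows.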
### Example: Catflix Source
```typescript
// From src/providers/sources/catflix.ts
async function comboScraper(ctx: ShowScrapeContext | MovieScrapeContext): Promise<SourcererOutput> {
  // 1. Build URL to the movie/show page
  const watchPageUrl = `${baseUrl}/movie/${mediaTitle}-${movieId}`;
  // 2. Scrape the page for embedded player URLs
  const watchPage = await ctx.proxiedFetcher(watchPageUrl);
  const $ = load(watchPage);
  // 3. Extract and decode the embed URL
  const mainOriginMatch = scriptData.data.match(/main_origin = "(.*?)";/);
  const decodedUrl = atob(mainOriginMatch[1]);
  // 4. Return embed URL for turbovid embed to process
  return {
    embeds: [{
      embedId: 'turbovid', // Points to turbovid embed scraper
      url: decodedUrl // Turbovid player URL
    }]
  };
}
```
**What this source does:**
- Scrapes a streaming website
- Finds encoded embed player URLs in the page source
- Decodes the URL and returns it for the `turbovid` embed scraper
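
The decode step in the excerpt assumes the regex always matches. A more defensive variant might look like the sketch below. `extractMainOrigin` is a hypothetical helper name; the regex and the use of `atob` come from the excerpt.

```typescript
// Hypothetical helper: returns the decoded embed URL, or null when the
// page script does not contain a `main_origin` assignment.
function extractMainOrigin(scriptText: string): string | null {
  const match = scriptText.match(/main_origin = "(.*?)";/);
  if (!match) return null; // page layout changed or content missing
  return atob(match[1]); // the captured value is Base64-encoded
}
```

Returning `null` lets the caller decide whether a missing match means "content not found" or a scraper that needs updating.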
## Embeds: The Stream Extractors
**Embeds** are the second stage - they take URLs from sources and extract the actual playable video streams. Each embed type knows how to handle a specific player or service.
### Example: Autoembed Embed (Simple)
```typescript
// From src/providers/embeds/autoembed.ts
async scrape(ctx) {
  // The URL from the source is already a direct HLS playlist
  return {
    stream: [{
      id: 'primary',
      type: 'hls',
      playlist: ctx.url, // Use the URL directly as HLS playlist
      flags: [flags.CORS_ALLOWED],
      captions: []
    }]
  };
}
```
**What this embed does:**
- Takes the URL from the autoembed source
- Treats it as a direct HLS playlist (no further processing needed)
- Returns it as a playable stream
### Example: Turbovid Embed (Complex)
```typescript
// From src/providers/embeds/turbovid.ts
async scrape(ctx) {
  // 1. Fetch the turbovid player page
  const embedPage = await ctx.proxiedFetcher(ctx.url);
  // 2. Extract encryption keys from the page
  const apkey = embedPage.match(/const\s+apkey\s*=\s*"(.*?)";/)?.[1];
  const xxid = embedPage.match(/const\s+xxid\s*=\s*"(.*?)";/)?.[1];
  // 3. Get decryption key from API
  const encodedJuiceKey = JSON.parse(
    await ctx.proxiedFetcher('/api/cucked/juice_key', { baseUrl })
  ).juice;
  // 4. Get encrypted playlist data
  const data = JSON.parse(
    await ctx.proxiedFetcher('/api/cucked/the_juice_v2/', {
      baseUrl, query: { [apkey]: xxid }
    })
  ).data;
  // 5. Decrypt the playlist URL
  const playlist = decrypt(data, atob(encodedJuiceKey));
  // 6. Return proxied stream (handles CORS/headers)
  return {
    stream: [{
      type: 'hls',
      id: 'primary',
      playlist: createM3U8ProxyUrl(playlist, ctx.features, streamHeaders),
      headers: streamHeaders,
      flags: [], captions: []
    }]
  };
}
```
**What this embed does:**
- Takes the turbovid player URL from the catflix source
- Performs complex extraction: fetches page → gets keys → decrypts data
- Returns the final HLS playlist with proper proxy handling
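
In the excerpt, `apkey` and `xxid` can be `undefined` if the player page changes, and the failure would only surface later in the API call. A small helper can fail early with a clear message instead. This is a sketch: `extractConst` is a hypothetical name, and it assumes `name` is a plain identifier that is safe to place in a regular expression.

```typescript
// Hypothetical helper: pulls `const <name> = "...";` out of the page text
// and throws a descriptive error when the value is missing or empty.
function extractConst(page: string, name: string): string {
  const pattern = new RegExp(`const\\s+${name}\\s*=\\s*"(.*?)";`);
  const value = page.match(pattern)?.[1];
  if (!value) throw new Error(`Could not find ${name} in embed page`);
  return value;
}

// Usage with a stand-in page body:
const samplePage = 'const apkey = "k1"; const xxid = "x9";';
const sampleApkey = extractConst(samplePage, 'apkey');
const sampleXxid = extractConst(samplePage, 'xxid');
```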
## Key Differences
| Sources | Embeds |
|---------|--------|
| **Find content** on websites | **Extract streams** from players |
| Return embed URLs OR direct streams | Always return direct streams |
| Handle website navigation/search | Handle player-specific extraction |
| Can return multiple server options | Process one specific player type |
| Example: "Find Avengers on Catflix" | Example: "Extract stream from Turbovid player" |
## Why This Separation?
### 1. **Reusability**
Multiple sources can use the same embed:
```typescript
// Both catflix and other sources can return turbovid embeds
{ embedId: 'turbovid', url: 'https://turbovid.com/player123' }
```
### 2. **Multiple Server Options**
Sources can provide backup servers:
```typescript
return {
  embeds: [
    { embedId: 'turbovid', url: 'https://turbovid.com/player123' },
    { embedId: 'vidcloud', url: 'https://vidcloud.co/embed456' },
    { embedId: 'dood', url: 'https://dood.watch/789' }
  ]
};
```
### 3. **Language/Quality Variants**
Sources can offer different options:
```typescript
return {
  embeds: [
    { embedId: 'autoembed-english', url: streamUrl },
    { embedId: 'autoembed-spanish', url: streamUrlEs },
    { embedId: 'autoembed-hindi', url: streamUrlHi }
  ]
};
```
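
When not every language is available for a title, the embed list can be built from whatever URLs were found. The sketch below assumes a hypothetical `embedsByLanguage` helper and an `autoembed-<language>` naming scheme inferred from the IDs above.

```typescript
type EmbedRef = { embedId: string; url: string };

// Hypothetical helper: turns a language → URL map into embed entries,
// skipping languages for which no URL was found.
function embedsByLanguage(urls: Record<string, string | undefined>): EmbedRef[] {
  const embeds: EmbedRef[] = [];
  for (const language of Object.keys(urls)) {
    const url = urls[language];
    if (url) embeds.push({ embedId: `autoembed-${language}`, url });
  }
  return embeds;
}
```

Skipping missing URLs avoids handing an embed scraper an empty or undefined `url`.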
### 4. **Specialization**
- **Sources** specialize in website structures and search
- **Embeds** specialize in player technologies and decryption
## How They Work Together
### Flow Example: Finding "Spirited Away"
1. **Source (catflix)**:
   - Searches catflix.su for "Spirited Away"
   - Finds movie page with embedded player
   - Extracts turbovid URL: `https://turbovid.com/embed/abc123`
   - Returns: `{ embedId: 'turbovid', url: 'https://turbovid.com/embed/abc123' }`
2. **Embed (turbovid)**:
   - Receives the turbovid URL
   - Scrapes the player page for encryption keys
   - Decrypts the actual HLS playlist URL
   - Returns: `{ stream: [{ playlist: 'https://cdn.example.com/movie.m3u8' }] }`
3. **Result**: User can now play the video stream
### Error Handling Chain
If an embed fails to extract a stream, the other embeds returned by the source act as fallbacks:
```typescript
// Source provides multiple backup options
return {
  embeds: [
    { embedId: 'turbovid', url: url1 }, // Try first
    { embedId: 'mixdrop', url: url2 }, // Fallback 1
    { embedId: 'dood', url: url3 } // Fallback 2
  ]
};
```
The system tries each embed in rank order until one succeeds.
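
The fallback behaviour can be sketched as follows. This is a simplified, synchronous model (real scrapers are async), and it assumes that a higher `rank` is tried first; the `RankedEmbed` shape and `firstWorkingStream` function are hypothetical.

```typescript
type Stream = { id: string; type: 'hls'; playlist: string };
type EmbedRef = { embedId: string; url: string };
// Hypothetical registered embed: an ID, a rank, and an extraction function.
type RankedEmbed = { embedId: string; rank: number; scrape: (url: string) => Stream[] };

function firstWorkingStream(
  refs: EmbedRef[],
  scrapers: RankedEmbed[],
): { embedId: string; streams: Stream[] } | null {
  const byId = new Map(scrapers.map((s) => [s.embedId, s] as const));
  const candidates = refs
    .map((ref) => ({ ref, scraper: byId.get(ref.embedId) }))
    .filter((c): c is { ref: EmbedRef; scraper: RankedEmbed } => c.scraper !== undefined)
    .sort((a, b) => b.scraper.rank - a.scraper.rank); // assumption: higher rank first

  for (const { ref, scraper } of candidates) {
    try {
      const streams = scraper.scrape(ref.url);
      if (streams.length > 0) return { embedId: ref.embedId, streams };
    } catch (err) {
      // This embed failed; fall through to the next candidate.
    }
  }
  return null; // every embed failed or none was registered
}
```

Because ordering comes from the registered scrapers' ranks in this model, the order in which a source lists its embeds does not decide which one is tried first.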
## Best Practices
### For Sources:
- Provide multiple embed options when possible
- Use descriptive embed IDs that match existing embeds
- Handle both movies and TV shows (combo scraper pattern)
- Return direct streams when embed processing isn't needed
### For Embeds:
- Focus on one player type per embed
- Handle errors gracefully with clear error messages
- Use proxy functions for protected streams
- Include proper headers and flags
### Registration:
```typescript
// In src/providers/all.ts
export function gatherAllSources(): Array<Sourcerer> {
  return [catflixScraper, autoembedScraper, /* ... */];
}
export function gatherAllEmbeds(): Array<Embed> {
  return [turbovidScraper, autoembedEnglishScraper, /* ... */];
}
```
Both sources and embeds must be registered in `all.ts` to be available for use.
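
Since a source refers to embeds only by their ID string, a typo or an unregistered embed is easy to miss. A quick check along these lines could catch it during development. `findUnregisteredEmbedIds` is a hypothetical helper, not part of the library.

```typescript
type EmbedRef = { embedId: string; url: string };

// Hypothetical helper: lists embed IDs returned by a source that have no
// matching registered embed scraper.
function findUnregisteredEmbedIds(refs: EmbedRef[], registeredIds: string[]): string[] {
  const known = new Set(registeredIds);
  return refs.map((ref) => ref.embedId).filter((id) => !known.has(id));
}

// Usage: 'turbovidd' is a deliberate typo and gets reported.
const missingIds = findUnregisteredEmbedIds(
  [
    { embedId: 'turbovid', url: 'https://example.com/1' },
    { embedId: 'turbovidd', url: 'https://example.com/2' },
  ],
  ['turbovid', 'autoembed-english'],
);
```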