pstreams-providers/.docs/content/3.in-depth/3.building-scrapers.md

# Building Scrapers

This guide covers the technical details of implementing scrapers, from basic structure to advanced patterns.

## The Combo Scraper Pattern

The most common and recommended pattern is the "combo scraper" that handles both movies and TV shows with a single function. This reduces code duplication and ensures consistent behavior.

### Basic Structure

```typescript
import { SourcererEmbed, SourcererOutput, makeSourcerer } from '@/providers/base';
import { MovieScrapeContext, ShowScrapeContext } from '@/utils/context';
import { NotFoundError } from '@/utils/errors';

// Main scraping function that handles both movies and TV shows
async function comboScraper(ctx: ShowScrapeContext | MovieScrapeContext): Promise<SourcererOutput> {
  // 1. Build the appropriate URL based on media type
  const embedUrl = `https://embed.su/embed/${
    ctx.media.type === 'movie'
      ? `movie/${ctx.media.tmdbId}`
      : `tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`
  }`;

  // 2. Fetch the embed page using proxied fetcher
  const embedPage = await ctx.proxiedFetcher<string>(embedUrl, {
    headers: {
      Referer: 'https://embed.su/',
      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    },
  });

  // 3. Extract and decode configuration
  const vConfigMatch = embedPage.match(/window\.vConfig\s*=\s*JSON\.parse\(atob\(`([^`]+)/i);
  const encodedConfig = vConfigMatch?.[1];
  if (!encodedConfig) throw new NotFoundError('No encoded config found');

  // 4. Process the data (decode, decrypt, etc.)
  const decodedConfig = JSON.parse(await stringAtob(encodedConfig));
  if (!decodedConfig?.hash) throw new NotFoundError('No stream hash found');

  // 5. Update progress to show we're making progress
  ctx.progress(50);

  // 6. Build the final result
  const embeds: SourcererEmbed[] = secondDecode.map((server) => ({
    embedId: 'viper',  // ID of the embed scraper to handle this URL
    url: `https://embed.su/api/e/${server.hash}`,
  }));

  ctx.progress(90);

  return { embeds };
}

// Export the scraper configuration
export const embedsuScraper = makeSourcerer({
  id: 'embedsu',              // Unique identifier
  name: 'embed.su',           // Display name
  rank: 165,                  // Priority rank (must be unique)
  disabled: false,            // Whether the scraper is disabled
  flags: [],                  // Feature flags (see Advanced Concepts)
  scrapeMovie: comboScraper,  // Function for movies
  scrapeShow: comboScraper,   // Function for TV shows
});
```

### Alternative: Separate Functions

For complex cases where movie and TV show logic differs significantly. However, its best to use combo scraper!

```typescript
async function scrapeMovie(ctx: MovieScrapeContext): Promise<SourcererOutput> {
  // Movie-specific logic
  const movieUrl = `${baseUrl}/movie/${ctx.media.tmdbId}`;
  // ... movie processing
}

async function scrapeShow(ctx: ShowScrapeContext): Promise<SourcererOutput> {
  // TV show-specific logic
  const showUrl = `${baseUrl}/tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`;
  // ... show processing
}

export const myScraper = makeSourcerer({
  id: 'my-scraper',
  name: 'My Scraper',
  rank: 150,
  disabled: false,
  flags: [],
  scrapeMovie: scrapeMovie,    // Separate functions
  scrapeShow: scrapeShow,
});
```

## Return Types

A `SourcererOutput` can return two types of data. Understanding when to use each is crucial:

### 1. Embeds Array (Most Common)

Use when your scraper finds embed players that need further processing:

```typescript
return {
  embeds: [
    {
      embedId: 'turbovid',  // Must match an existing embed scraper ID
      url: 'https://turbovid.com/embed/abc123'
    },
    {
      embedId: 'mixdrop',   // Backup option
      url: 'https://mixdrop.co/embed/def456'
    }
  ]
};
```

**When to use:**
- Your scraper finds embed player URLs
- You want to leverage existing embed scrapers
- The site uses common players (turbovid, mixdrop, etc.)
- You want to provide multiple server options

### 2. Stream Array (Direct Streams)

Use when your scraper finds direct video streams that are ready to play:

```typescript
import { flags } from '@/entrypoint/utils/targets';

// For HLS streams
return {
  embeds: [], // Can be empty when returning streams
  stream: [
    {
      id: 'primary',
      type: 'hls',
      playlist: streamUrl,
      flags: [flags.CORS_ALLOWED],
      captions: [], // Subtitle tracks (optional)
    }
  ]
};

// For MP4 files with a single quality
return {
    embeds: [],
    stream: [
      {
        id: 'primary',
        captions,
        qualities: {
          unknown: {
            type: 'mp4',
            url: streamUrl,
          },
        },
        type: 'file',
        flags: [flags.CORS_ALLOWED],
      },
    ],
  };

  // For MP4 files with multiple qualities:
  // It's recommended to return it using a function similar to this:

  const streams = Object.entries(data.streams).reduce((acc: Record<string, string>, [quality, url]) => {
    let qualityKey: number;
    if (quality === 'ORG') {
      // Only add unknown quality if it's an mp4 (handle URLs with query parameters)
      const urlPath = url.split('?')[0]; // Remove query parameters
      if (urlPath.toLowerCase().endsWith('.mp4')) {
        acc.unknown = url;
      }
      return acc;
    }
    if (quality === '4K') {
      qualityKey = 2160;
    } else {
      qualityKey = parseInt(quality.replace('P', ''), 10);
    }
    if (Number.isNaN(qualityKey) || acc[qualityKey]) return acc;
    acc[qualityKey] = url;
    return acc;
  }, {});

  // Filter qualities based on provider type
  const filteredStreams = Object.entries(streams).reduce((acc: Record<string, string>, [quality, url]) => {
    // Skip unknown for cached provider
    if (provider.useCacheUrl && quality === 'unknown') {
      return acc;
    }

    acc[quality] = url;
    return acc;
  }, {});

  // Returning each quality like so
  return {
    stream: [
      {
        id: 'primary',
        captions: [],
        qualities: {
          ...(filteredStreams[2160] && {
            '4k': {
              type: 'mp4',
              url: filteredStreams[2160],
            },
          }),
          ...(filteredStreams[1080] && {
            1080: {
              type: 'mp4',
              url: filteredStreams[1080],
            },
          }),
          ...(filteredStreams[720] && {
            720: {
              type: 'mp4',
              url: filteredStreams[720],
            },
          }),
          ...(filteredStreams[480] && {
            480: {
              type: 'mp4',
              url: filteredStreams[480],
            },
          }),
          ...(filteredStreams[360] && {
            360: {
              type: 'mp4',
              url: filteredStreams[360],
            },
          }),
          ...(filteredStreams.unknown && {
            unknown: {
              type: 'mp4',
              url: filteredStreams.unknown,
            },
          }),
        },
        type: 'file',
        flags: [flags.CORS_ALLOWED],
      },
    ],
  };
```

**When to use:**
- Your scraper can extract direct video URLs
- The site provides its own player technology
- You need fine control over stream handling
- The streams don't require complex embed processing

## Context and Utilities

The scraper context (`ctx`) provides everything you need for implementation:

### Media Information
```typescript
// Basic media info (always available)
ctx.media.title          // "Spirited Away"
ctx.media.type           // "movie" | "show"
ctx.media.tmdbId         // 129
ctx.media.releaseYear    // 2001
ctx.media.imdbId         // "tt0245429" (when available)

// For TV shows only (check ctx.media.type === 'show')
ctx.media.season.number  // 1
ctx.media.season.tmdbId  // Season TMDB ID
ctx.media.episode.number // 5
ctx.media.episode.tmdbId // Episode TMDB ID
```

### HTTP Client
```typescript
// Always use proxiedFetcher for external requests to avoid CORS
const response = await ctx.proxiedFetcher<string>('https://example.com/api', {
  method: 'POST',
  headers: {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://example.com'
  },
  body: JSON.stringify({ key: 'value' })
});

// For API calls with base URL
const data = await ctx.proxiedFetcher('/search', {
  baseUrl: 'https://api.example.com',
  query: { q: ctx.media.title, year: ctx.media.releaseYear }
});
```

### Progress Updates
```typescript
// Update the loading indicator (0-100)
ctx.progress(25);  // Found media page
// ... processing ...
ctx.progress(50);  // Extracted embed links
// ... more processing ...
ctx.progress(90);  // Almost done
```

## Common Patterns

### 1. URL Building
```typescript
// Handle different media types
const buildUrl = (ctx: ShowScrapeContext | MovieScrapeContext) => {
  const apiUrl = ctx.media.type === 'movie'
    ? `${baseUrl}/movie/${ctx.media.tmdbId}`
    : `${baseUrl}/tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`;

  return apiUrl;
};
```

### 2. Data Extraction
```typescript
import { load } from 'cheerio';

// Scraping with Cheerio
const $ = load(embedPage);
const embedUrls = $('iframe[src*="turbovid"]')
  .map((_, el) => $(el).attr('src'))
  .get()
  .filter(Boolean);

// Regex extraction
const configMatch = embedPage.match(/window\.playerConfig\s*=\s*({.*?});/s);
if (configMatch) {
  const config = JSON.parse(configMatch[1]);
  // Process config...
}
```

### 3. Error Handling
```typescript
import { NotFoundError } from '@/utils/errors';

// Throw NotFoundError for content not found
if (!embedUrls.length) {
  throw new NotFoundError('No embed players found');
}

// Throw generic Error for other issues
if (!apiResponse.success) {
  throw new Error(`API request failed: ${apiResponse.message}`);
}
```

### 4. Protected Streams
There are several ways to bypass protections on streams.

Using the M3U8 proxy:
```typescript
import { createM3U8ProxyUrl } from '@/utils/proxy';

// For streams that require special headers
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0...'
};

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: createM3U8ProxyUrl(originalPlaylist, ctx.features, streamHeaders),
    headers: streamHeaders, // Include headers in the createM3U8ProxyUrl function and here for native and extension targets
    flags: [flags.CORS_ALLOWED], // createM3U8ProxyUrl (or the extension) bypasses cors so we say it's allowed to play in a browser
    captions: []
  }]
};
```

Using the browser extension:
```typescript
// For streams that require special headers
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0...'
};

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: originalPlaylist,
    headers: streamHeaders,
    flags: [], // Use the extension becuase it can pass headers, include no flag for extension or native
    captions: []
  }]
};
```

## Building Embed Scrapers

Embed scrapers follow a simpler pattern since they only handle one URL type:

```typescript
import { makeEmbed } from '@/providers/base';

export const myEmbedScraper = makeEmbed({
  id: 'my-embed',
  name: 'My Embed Player',
  rank: 120,
  async scrape(ctx) {
    // ctx.url contains the embed URL from a source

    // 1. Fetch the embed page
    const embedPage = await ctx.proxiedFetcher(ctx.url);

    // 2. Extract the stream URL (example with regex)
    const streamMatch = embedPage.match(/src:\s*["']([^"']+\.m3u8[^"']*)/);
    if (!streamMatch) {
      throw new NotFoundError('No stream found in embed');
    }

    // 3. Return the stream
    return {
      stream: [{
        id: 'primary',
        type: 'hls',
        playlist: streamMatch[1],
        flags: [flags.CORS_ALLOWED],
        captions: []
      }]
    };
  },
});
```

## Testing Your Scrapers

### 1. Basic Testing
```sh
# Test your scraper with CLI
pnpm cli --source-id my-scraper --tmdb-id 11527

# Test different content types
pnpm cli --source-id my-scraper --tmdb-id 94605 --season 1 --episode 1  # TV show
```

### 2. Real CLI Output Examples

**Testing a source that returns embeds:**
```sh
pnpm cli --source-id catflix --tmdb-id 11527
```
```json
{
  embeds: [
    {
      embedId: 'turbovid',
      url: 'https://turbovid.eu/embed/DjncbDBEmbLW'
    }
  ]
}
```

**Testing an embed that returns streams:**
```sh
pnpm cli --source-id turbovid --url "https://turbovid.eu/embed/DjncbDBEmbLW"
```
```json
{
  stream: [
    {
      type: 'hls',
      id: 'primary',
      playlist: 'https://proxy.fifthwit.net/m3u8-proxy?url=https%3A%2F%2Fqueenselti.pro%2Fwrofm%2Fuwu.m3u8&headers=%7B%22referer%22%3A%22https%3A%2F%2Fturbovid.eu%2F%22%2C%22origin%22%3A%2F%2Fturbovid.eu%22%7D',
      flags: [flags.CORS_ALLOWED],
      captions: []
    }
  ]
}
```

**Notice**: The playlist URL shows how `createM3U8ProxyUrl()` creates proxied URLs to handle protected streams.

### 3. Comprehensive Testing
Test with various content:
- Popular movies (The Shining: 11527, Spirited Away: 129, Avatar: 19995)
- Recent releases (check current popular movies)
- TV shows with multiple seasons
- Anime series (different episode numbering)
- Different languages/regions

### 4. Debug Mode
```sh
# Add debug logging to your scraper
console.log('Fetching URL:', embedUrl);
console.log('Response status:', response.status);
console.log('Extracted data:', extractedData);
```

## Next Steps

Once you've built your scraper:

1. Test thoroughly with multiple content types
2. Check [Advanced Concepts](/in-depth/advanced-concepts) for flags and error handling
3. Register in `all.ts` with a unique rank
4. Submit a pull request with testing documentation

::alert{type="warning"}
Always test your scrapers with both movies and TV shows, and include multiple examples in your pull request description.
::