# Building Scrapers

This guide covers the technical details of implementing scrapers, from basic structure to advanced patterns.

## The Combo Scraper Pattern

The most common and recommended pattern is the "combo scraper", which handles both movies and TV shows with a single function. This reduces code duplication and ensures consistent behavior.

### Basic Structure

```typescript
import { SourcererEmbed, SourcererOutput, makeSourcerer } from '@/providers/base';
import { MovieScrapeContext, ShowScrapeContext } from '@/utils/context';
import { NotFoundError } from '@/utils/errors';

// Main scraping function that handles both movies and TV shows
async function comboScraper(ctx: ShowScrapeContext | MovieScrapeContext): Promise<SourcererOutput> {
  // 1. Build the appropriate URL based on media type
  const embedUrl = `https://embed.su/embed/${
    ctx.media.type === 'movie'
      ? `movie/${ctx.media.tmdbId}`
      : `tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`
  }`;

  // 2. Fetch the embed page using the proxied fetcher
  const embedPage = await ctx.proxiedFetcher(embedUrl, {
    headers: {
      Referer: 'https://embed.su/',
      'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    },
  });

  // 3. Extract and decode configuration
  const vConfigMatch = embedPage.match(/window\.vConfig\s*=\s*JSON\.parse\(atob\(`([^`]+)/i);
  const encodedConfig = vConfigMatch?.[1];
  if (!encodedConfig) throw new NotFoundError('No encoded config found');

  // 4. Process the data (decode, decrypt, etc.)
  // `stringAtob` is a base64-decoding helper that is not shown in this excerpt
  const decodedConfig = JSON.parse(await stringAtob(encodedConfig));
  if (!decodedConfig?.hash) throw new NotFoundError('No stream hash found');

  // 5. Report progress to the caller
  ctx.progress(50);

  // 6. Build the final result
  // `secondDecode` is the decoded server list; the steps that produce it are omitted from this excerpt
  const embeds: SourcererEmbed[] = secondDecode.map((server) => ({
    embedId: 'viper', // ID of the embed scraper to handle this URL
    url: `https://embed.su/api/e/${server.hash}`,
  }));

  ctx.progress(90);

  return { embeds };
}

// Export the scraper configuration
export const embedsuScraper = makeSourcerer({
  id: 'embedsu', // Unique identifier
  name: 'embed.su', // Display name
  rank: 165, // Priority rank (must be unique)
  disabled: false, // Whether the scraper is disabled
  flags: [], // Feature flags (see Advanced Concepts)
  scrapeMovie: comboScraper, // Function for movies
  scrapeShow: comboScraper, // Function for TV shows
});
```

### Alternative: Separate Functions

Use separate functions only for complex cases where the movie and TV show logic differs significantly. Otherwise, prefer the combo scraper.

```typescript
async function scrapeMovie(ctx: MovieScrapeContext): Promise<SourcererOutput> {
  // Movie-specific logic
  const movieUrl = `${baseUrl}/movie/${ctx.media.tmdbId}`;
  // ... movie processing
}

async function scrapeShow(ctx: ShowScrapeContext): Promise<SourcererOutput> {
  // TV show-specific logic
  const showUrl = `${baseUrl}/tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`;
  // ... show processing
}

export const myScraper = makeSourcerer({
  id: 'my-scraper',
  name: 'My Scraper',
  rank: 150,
  disabled: false,
  flags: [],
  scrapeMovie: scrapeMovie, // Separate functions
  scrapeShow: scrapeShow,
});
```

## Return Types

A `SourcererOutput` can contain two types of data. Understanding when to use each is crucial:

### 1. Embeds Array (Most Common)

Use when your scraper finds embed players that need further processing:

```typescript
return {
  embeds: [
    {
      embedId: 'turbovid', // Must match an existing embed scraper ID
      url: 'https://turbovid.com/embed/abc123'
    },
    {
      embedId: 'mixdrop', // Backup option
      url: 'https://mixdrop.co/embed/def456'
    }
  ]
};
```

**When to use:**
- Your scraper finds embed player URLs
- You want to leverage existing embed scrapers
- The site uses common players (turbovid, mixdrop, etc.)
- You want to provide multiple server options

### 2. Stream Array (Direct Streams)

Use when your scraper finds direct video streams that are ready to play:

```typescript
import { flags } from '@/entrypoint/utils/targets';

// For HLS streams
return {
  embeds: [], // Can be empty when returning streams
  stream: [
    {
      id: 'primary',
      type: 'hls',
      playlist: streamUrl,
      flags: [flags.CORS_ALLOWED],
      captions: [], // Subtitle tracks (optional)
    }
  ]
};

// For MP4 files with a single quality
return {
  embeds: [],
  stream: [
    {
      id: 'primary',
      captions,
      qualities: {
        unknown: {
          type: 'mp4',
          url: streamUrl,
        },
      },
      type: 'file',
      flags: [flags.CORS_ALLOWED],
    },
  ],
};

// For MP4 files with multiple qualities:
// It's recommended to build the quality map with a function similar to this:
const streams = Object.entries(data.streams).reduce((acc: Record<string, string>, [quality, url]) => {
  let qualityKey: number;
  if (quality === 'ORG') {
    // Only add unknown quality if it's an mp4 (handle URLs with query parameters)
    const urlPath = url.split('?')[0]; // Remove query parameters
    if (urlPath.toLowerCase().endsWith('.mp4')) {
      acc.unknown = url;
    }
    return acc;
  }
  if (quality === '4K') {
    qualityKey = 2160;
  } else {
    qualityKey = parseInt(quality.replace('P', ''), 10);
  }
  // Skip labels that are not numeric and qualities that are already present
  if (Number.isNaN(qualityKey) || acc[qualityKey]) return acc;
  acc[qualityKey] = url;
  return acc;
}, {});

// Filter qualities based on provider type
const filteredStreams = Object.entries(streams).reduce((acc: Record<string, string>, [quality, url]) => {
  // Skip unknown for cached provider
  if (provider.useCacheUrl && quality === 'unknown') {
    return acc;
  }
  acc[quality] = url;
  return acc;
}, {});

// Then return each available quality like so
return {
  stream: [
    {
      id: 'primary',
      captions: [],
      qualities: {
        ...(filteredStreams[2160] && {
          '4k': {
            type: 'mp4',
            url: filteredStreams[2160],
          },
        }),
        ...(filteredStreams[1080] && {
          1080: {
            type: 'mp4',
            url: filteredStreams[1080],
          },
        }),
        ...(filteredStreams[720] && {
          720: {
            type: 'mp4',
            url: filteredStreams[720],
          },
        }),
        ...(filteredStreams[480] && {
          480: {
            type: 'mp4',
            url: filteredStreams[480],
          },
        }),
        ...(filteredStreams[360] && {
          360: {
            type: 'mp4',
            url: filteredStreams[360],
          },
        }),
        ...(filteredStreams.unknown && {
          unknown: {
            type: 'mp4',
            url: filteredStreams.unknown,
          },
        }),
      },
      type: 'file',
      flags: [flags.CORS_ALLOWED],
    },
  ],
};
```

**When to use:**
- Your scraper can extract direct video URLs
- The site provides its own player technology
- You need fine control over stream handling
- The streams don't require complex embed processing

## Context and Utilities

The scraper context (`ctx`) provides everything you need for implementation:

### Media Information

```typescript
// Basic media info (always available)
ctx.media.title        // "Spirited Away"
ctx.media.type         // "movie" | "show"
ctx.media.tmdbId       // 129
ctx.media.releaseYear  // 2001
ctx.media.imdbId       // "tt0245429" (when available)

// For TV shows only (check ctx.media.type === 'show')
ctx.media.season.number   // 1
ctx.media.season.tmdbId   // Season TMDB ID
ctx.media.episode.number  // 5
ctx.media.episode.tmdbId  // Episode TMDB ID
```

### HTTP Client

```typescript
// Always use proxiedFetcher for external requests to avoid CORS issues
const response = await ctx.proxiedFetcher('https://example.com/api', {
  method: 'POST',
  headers: {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://example.com'
  },
  body: JSON.stringify({ key: 'value' })
});

// For API calls with base URL
const data = await ctx.proxiedFetcher('/search', {
  baseUrl: 'https://api.example.com',
  query: { q: ctx.media.title, year: ctx.media.releaseYear }
});
```

### Progress Updates

```typescript
// Update the loading indicator (0-100)
ctx.progress(25); // Found media page
// ... processing ...
ctx.progress(50); // Extracted embed links
// ... more processing ...
ctx.progress(90); // Almost done
```

## Common Patterns

### 1. URL Building

```typescript
// Handle different media types
const buildUrl = (ctx: ShowScrapeContext | MovieScrapeContext) => {
  const apiUrl = ctx.media.type === 'movie'
    ? `${baseUrl}/movie/${ctx.media.tmdbId}`
    : `${baseUrl}/tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`;
  return apiUrl;
};
```

### 2. Data Extraction

```typescript
import { load } from 'cheerio';

// Scraping with Cheerio
const $ = load(embedPage);
const embedUrls = $('iframe[src*="turbovid"]')
  .map((_, el) => $(el).attr('src'))
  .get()
  .filter(Boolean);

// Regex extraction
const configMatch = embedPage.match(/window\.playerConfig\s*=\s*({.*?});/s);
if (configMatch) {
  const config = JSON.parse(configMatch[1]);
  // Process config...
}
```

### 3. Error Handling

```typescript
import { NotFoundError } from '@/utils/errors';

// Throw NotFoundError for content not found
if (!embedUrls.length) {
  throw new NotFoundError('No embed players found');
}

// Throw generic Error for other issues
if (!apiResponse.success) {
  throw new Error(`API request failed: ${apiResponse.message}`);
}
```

### 4. Protected Streams

There are several ways to bypass protections on streams.

Using the M3U8 proxy:

```typescript
import { flags } from '@/entrypoint/utils/targets';
import { createM3U8ProxyUrl } from '@/utils/proxy';

// For streams that require special headers
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0...'
};

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: createM3U8ProxyUrl(originalPlaylist, ctx.features, streamHeaders),
    // Pass the headers to createM3U8ProxyUrl and also here, for native and extension targets
    headers: streamHeaders,
    // createM3U8ProxyUrl (or the extension) bypasses CORS, so the stream is marked as playable in a browser
    flags: [flags.CORS_ALLOWED],
    captions: []
  }]
};
```

Using the browser extension:

```typescript
// For streams that require special headers
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0...'
};

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: originalPlaylist,
    headers: streamHeaders,
    // Rely on the extension because it can pass headers; set no flags for extension or native targets
    flags: [],
    captions: []
  }]
};
```

## Building Embed Scrapers

Embed scrapers follow a simpler pattern since they only handle one URL type:

```typescript
import { flags } from '@/entrypoint/utils/targets';
import { makeEmbed } from '@/providers/base';
import { NotFoundError } from '@/utils/errors';

export const myEmbedScraper = makeEmbed({
  id: 'my-embed',
  name: 'My Embed Player',
  rank: 120,
  async scrape(ctx) {
    // ctx.url contains the embed URL from a source

    // 1. Fetch the embed page
    const embedPage = await ctx.proxiedFetcher(ctx.url);

    // 2. Extract the stream URL (example with regex)
    const streamMatch = embedPage.match(/src:\s*["']([^"']+\.m3u8[^"']*)/);
    if (!streamMatch) {
      throw new NotFoundError('No stream found in embed');
    }

    // 3. Return the stream
    return {
      stream: [{
        id: 'primary',
        type: 'hls',
        playlist: streamMatch[1],
        flags: [flags.CORS_ALLOWED],
        captions: []
      }]
    };
  },
});
```

## Testing Your Scrapers

### 1. Basic Testing

```sh
# Test your scraper with the CLI
pnpm cli --source-id my-scraper --tmdb-id 11527

# Test different content types
pnpm cli --source-id my-scraper --tmdb-id 94605 --season 1 --episode 1  # TV show
```

### 2. Real CLI Output Examples

**Testing a source that returns embeds:**

```sh
pnpm cli --source-id catflix --tmdb-id 11527
```

```json
{
  embeds: [
    {
      embedId: 'turbovid',
      url: 'https://turbovid.eu/embed/DjncbDBEmbLW'
    }
  ]
}
```

**Testing an embed that returns streams:**

```sh
pnpm cli --source-id turbovid --url "https://turbovid.eu/embed/DjncbDBEmbLW"
```

```json
{
  stream: [
    {
      type: 'hls',
      id: 'primary',
      playlist: 'https://proxy.fifthwit.net/m3u8-proxy?url=https%3A%2F%2Fqueenselti.pro%2Fwrofm%2Fuwu.m3u8&headers=%7B%22referer%22%3A%22https%3A%2F%2Fturbovid.eu%2F%22%2C%22origin%22%3A%2F%2Fturbovid.eu%22%7D',
      flags: [flags.CORS_ALLOWED],
      captions: []
    }
  ]
}
```

**Notice**: The playlist URL shows how `createM3U8ProxyUrl()` creates proxied URLs to handle protected streams.

### 3. Comprehensive Testing

Test with various content:
- Popular movies (The Shining: 11527, Spirited Away: 129, Avatar: 19995)
- Recent releases (check current popular movies)
- TV shows with multiple seasons
- Anime series (different episode numbering)
- Different languages/regions

### 4. Debug Mode

```typescript
// Add debug logging to your scraper
console.log('Fetching URL:', embedUrl);
console.log('Response status:', response.status);
console.log('Extracted data:', extractedData);
```

## Next Steps

Once you've built your scraper:

1. Test thoroughly with multiple content types
2. Check [Advanced Concepts](/in-depth/advanced-concepts) for flags and error handling
3. Register in `all.ts` with a unique rank
4. Submit a pull request with testing documentation

::alert{type="warning"}
Always test your scrapers with both movies and TV shows, and include multiple examples in your pull request description.
::
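
Because each scraper must be registered with a unique rank, it can help to check for collisions before opening a pull request. The following is a minimal sketch, not part of the library: `RankedScraper` and `findDuplicateRanks` are hypothetical names, and the check assumes only that each registered scraper exposes the `id` and `rank` fields used in the `makeSourcerer` and `makeEmbed` examples in this guide.

```typescript
// Hypothetical shape: only the fields this check needs.
interface RankedScraper {
  id: string;
  rank: number;
}

// Hypothetical helper (not part of the library): groups scraper IDs by rank
// and returns only the ranks that are claimed by more than one scraper.
function findDuplicateRanks(scrapers: RankedScraper[]): Map<number, string[]> {
  const idsByRank = new Map<number, string[]>();
  for (const scraper of scrapers) {
    const ids = idsByRank.get(scraper.rank) || [];
    ids.push(scraper.id);
    idsByRank.set(scraper.rank, ids);
  }

  const duplicates = new Map<number, string[]>();
  idsByRank.forEach((ids, rank) => {
    if (ids.length > 1) duplicates.set(rank, ids);
  });
  return duplicates;
}

// Example input: two IDs and ranks taken from this guide,
// plus one made-up entry that collides on rank 150.
const registered: RankedScraper[] = [
  { id: 'embedsu', rank: 165 },
  { id: 'my-scraper', rank: 150 },
  { id: 'another-scraper', rank: 150 },
];

findDuplicateRanks(registered).forEach((ids, rank) => {
  console.log(`Rank ${rank} is used by: ${ids.join(', ')}`);
});
// prints "Rank 150 is used by: my-scraper, another-scraper"
```

In practice you would pass in whatever list of scrapers your registration file exports; how that list is assembled depends on the project and is not shown here.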