Sekai_scraper - OP Version
This commit is contained in:
parent 1fff726d40
commit 644ea16f94
35 changed files with 4867 additions and 1 deletion

.gitignore (vendored, new file, 71 lines)
@@ -0,0 +1,71 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
env/
ENV/
.venv

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Environment variables
.env
.env.local

# Data and logs
data/
logs/
cache/
*.log

# Selenium
*.png
*.jpg
screenshots/

# OS
.DS_Store
Thumbs.db

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/

# Jupyter Notebook
.ipynb_checkpoints

# Database
*.db
*.sqlite
*.sqlite3

PROXY_GUIDE.md (new file, 534 lines)
@@ -0,0 +1,534 @@
# 🎬 Sekai.one Video Proxy Guide

A complete solution for bypassing the Referer protection and accessing videos from sekai.one.

---

## 🎯 Problem Solved

The video server `mugiwara.xyz` blocks direct access with a **403 Forbidden** because it checks that the `Referer` comes from `https://sekai.one/`.

**Our solution**: a proxy server that automatically adds the correct `Referer`, making the videos accessible from anywhere.

---

## ⚡ Quick Start

### 1. Installation

```bash
# Install the dependencies (includes Flask)
pip install -r requirements.txt
```

### 2. Start the proxy server

```bash
python video_proxy_server.py
```

The server starts on `http://localhost:8080`.

### 3. Use the proxy

**URL format:**
```
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

**Example in a browser:**
- Copy the URL above
- Paste it into your browser
- The video plays directly! 🎉

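The examples paste the target URL into the query string verbatim, which the server appears to accept. When building the URL in code it is safer to percent-encode the target; a small helper (hypothetical, not part of the repo):

```python
from urllib.parse import urlencode

PROXY_BASE = "http://localhost:8080/proxy"  # adjust for your deployment

def proxy_url(video_url: str) -> str:
    """Wrap a direct video URL in a proxy URL. urlencode() percent-encodes
    the target, so any '?' or '&' inside it cannot break the query string."""
    return f"{PROXY_BASE}?{urlencode({'url': video_url})}"
```

Flask decodes the `url` parameter back to the original string on the server side, so both the raw and the encoded form reach the proxy identically.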
---

## 📖 Detailed Usage

### A. In a web browser

```
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

→ The video plays directly in the browser

### B. With VLC Media Player

1. Open VLC
2. Media → Open Network Stream
3. Paste the proxy URL:
```
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```
4. Play! 🎬

### C. In an HTML page

```html
<!DOCTYPE html>
<html>
<head>
    <title>One Piece Episode 527</title>
</head>
<body>
    <h1>One Piece - Episode 527</h1>

    <video controls width="1280" height="720">
        <source
            src="http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"
            type="video/mp4">
        Your browser does not support HTML5 video.
    </video>
</body>
</html>
```

### D. Download with wget

```bash
wget "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4" -O episode_527.mp4
```

### E. Download with curl

```bash
curl "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4" -o episode_527.mp4
```

### F. In Python

```python
import requests

proxy_url = "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"

# Stream the download so the whole file is never held in memory
response = requests.get(proxy_url, stream=True)
with open("episode_527.mp4", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
```

---

## 🌐 Deployment on a VPS (vid.creepso.com)

### Nginx configuration (reverse proxy)

1. **Install nginx on your VPS**

```bash
sudo apt update
sudo apt install nginx
```

2. **Create a configuration file**

```bash
sudo nano /etc/nginx/sites-available/video-proxy
```

Contents:

```nginx
server {
    listen 80;
    server_name vid.creepso.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Important for video streaming
        proxy_buffering off;
        proxy_cache off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

3. **Enable the site**

```bash
sudo ln -s /etc/nginx/sites-available/video-proxy /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx
```

4. **Start the Python server with gunicorn**

```bash
# Install gunicorn
pip install gunicorn

# Start the server
gunicorn -w 4 -b 127.0.0.1:8080 video_proxy_server:app
```

5. **Create a systemd service for auto-start**

```bash
sudo nano /etc/systemd/system/video-proxy.service
```

Contents:

```ini
[Unit]
Description=Sekai Video Proxy Server
After=network.target

[Service]
User=your-user
WorkingDirectory=/path/to/project
Environment="PATH=/path/to/venv/bin"
ExecStart=/path/to/venv/bin/gunicorn -w 4 -b 127.0.0.1:8080 video_proxy_server:app

Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it:

```bash
sudo systemctl daemon-reload
sudo systemctl enable video-proxy
sudo systemctl start video-proxy
sudo systemctl status video-proxy
```

6. **Add SSL with Certbot (HTTPS)**

```bash
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d vid.creepso.com
```

### Usage after deployment

Once deployed on your VPS, you can access videos via:

```
https://vid.creepso.com/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

This URL is reachable **from anywhere in the world**! 🌍

---

## 🛠️ Proxy Server API

### Available endpoints

#### 1. `/proxy?url=[VIDEO_URL]`

**Purpose:** Video proxy with streaming

**Example:**
```
GET http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

**Features:**
- ✅ Progressive streaming
- ✅ Seeking support (Range requests)
- ✅ CORS enabled
- ✅ No size limit

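The Range support can also be exercised from code; a stdlib sketch (the helper name is made up for illustration) that asks the proxy for only the first kilobyte:

```python
import urllib.request

def proxy_range_request(proxy_url: str, start: int, end: int) -> urllib.request.Request:
    """Build a byte-range request against the proxy; a server that honors
    Range answers 206 Partial Content with only the requested bytes."""
    return urllib.request.Request(proxy_url, headers={"Range": f"bytes={start}-{end}"})

req = proxy_range_request(
    "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4",
    0, 1023,
)
```

This is the same mechanism video players use when you seek in the timeline.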
#### 2. `/info?url=[VIDEO_URL]`

**Purpose:** Fetch the video's metadata

**Example:**
```bash
curl "http://localhost:8080/info?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"
```

**Response:**
```json
{
    "url": "https://17.mugiwara.xyz/op/saga-7/hd/527.mp4",
    "status_code": 200,
    "accessible": true,
    "content_type": "video/mp4",
    "content_length": "272760832",
    "content_length_mb": 260.14,
    "server": "nginx/1.25.3",
    "accept_ranges": "bytes",
    "proxy_url": "http://localhost:8080/proxy?url=..."
}
```

#### 3. `/download?url=[VIDEO_URL]`

**Purpose:** Forced download (with Content-Disposition)

**Example:**
```
GET http://localhost:8080/download?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

The browser downloads the file automatically.

#### 4. `/health`

**Purpose:** Check that the server is running

**Example:**
```bash
curl http://localhost:8080/health
```

**Response:**
```json
{
    "status": "ok",
    "service": "sekai-video-proxy",
    "version": "1.0.0"
}
```

---

## 🔧 Server Options

```bash
# Custom port
python video_proxy_server.py --port 5000

# Network access (not just localhost)
python video_proxy_server.py --host 0.0.0.0

# Debug mode
python video_proxy_server.py --debug

# Combined
python video_proxy_server.py --host 0.0.0.0 --port 5000
```

---

## 🎭 How Does It Work?

### The problem

When you access `https://17.mugiwara.xyz/op/saga-7/hd/527.mp4` directly:

```http
GET /op/saga-7/hd/527.mp4 HTTP/1.1
Host: 17.mugiwara.xyz
User-Agent: Mozilla/5.0...
```

**Response: 403 Forbidden** ❌

The server checks that the request comes from sekai.one.

### The solution

The proxy adds the correct `Referer` header:

```http
GET /op/saga-7/hd/527.mp4 HTTP/1.1
Host: 17.mugiwara.xyz
User-Agent: Mozilla/5.0...
Referer: https://sekai.one/ ← The key!
```

**Response: 200 OK** ✅

The server believes the request comes from sekai.one and grants access.

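The spoofed request can be reproduced in a few lines of stdlib Python (a sketch; the real `video_proxy_server.py` may set additional headers):

```python
import urllib.request

SPOOFED_HEADERS = {
    "Referer": "https://sekai.one/",  # the check the origin server enforces
    "User-Agent": "Mozilla/5.0",      # look like a browser
}

def upstream_request(video_url: str) -> urllib.request.Request:
    """Build the request the proxy sends upstream: identical to a direct
    request except for the spoofed Referer."""
    return urllib.request.Request(video_url, headers=SPOOFED_HEADERS)

req = upstream_request("https://17.mugiwara.xyz/op/saga-7/hd/527.mp4")
```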
### Data flow

```
Client (browser/VLC/wget)
    ↓
GET http://vid.creepso.com/proxy?url=...
    ↓
Proxy server (your VPS)
    ↓
GET https://17.mugiwara.xyz/... with Referer: sekai.one
    ↓
Video server (mugiwara.xyz)
    ↓
200 OK + video stream
    ↓
Proxy server → client
```

---

## 🚀 Stremio Integration

You can build a Stremio add-on that uses your proxy:

```javascript
// stremio-addon.js
const { addonBuilder } = require('stremio-addon-sdk');

const builder = new addonBuilder({
    id: 'com.sekai.one',
    version: '1.0.0',
    name: 'Sekai.one Anime',
    description: 'Watch anime from sekai.one',
    resources: ['stream'],
    types: ['series'],
    idPrefixes: ['sekai:']
});

builder.defineStreamHandler(async ({ type, id }) => {
    // Example for One Piece Episode 527
    if (id === 'sekai:onepiece:527') {
        return {
            streams: [{
                title: 'HD',
                url: 'https://vid.creepso.com/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4'
            }]
        };
    }
    // Stremio expects a streams array even when nothing matches
    return { streams: [] };
});

module.exports = builder.getInterface();
```

---

## 🔐 Security and Performance

### Recommended limits

To protect your VPS, add rate limits:

```python
# In video_proxy_server.py, add:
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(
    app,
    key_func=get_remote_address,
    default_limits=["100 per hour"]
)

@app.route('/proxy')
@limiter.limit("10 per minute")  # Max 10 requests/minute
def proxy_video():
    # ...
```

### Caching (optional)

To reduce load:

```python
from flask_caching import Cache

cache = Cache(app, config={'CACHE_TYPE': 'simple'})

@app.route('/info')
@cache.cached(timeout=300)  # Cache for 5 minutes
def video_info():
    # ...
```

---

## 📊 Monitoring and Logs

Logs are saved automatically under `logs/`:

```bash
# Follow the logs in real time
tail -f logs/*_scraping.log
```

For more advanced monitoring on a VPS:

```bash
# Follow the systemd service logs (or use a process manager such as pm2)
sudo journalctl -u video-proxy -f
```

---

## 🎯 Example URLs

### One Piece

```
# Episode 527 (Saga 7)
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4

# Episode 528 (Saga 7)
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/528.mp4

# General pattern: /op/saga-X/hd/EPISODE.mp4
```

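Assuming the `/op/saga-X/hd/EPISODE.mp4` pattern holds across sagas, direct URLs can be generated programmatically (hypothetical helper):

```python
def episode_url(saga: int, episode: int) -> str:
    """Build a direct video URL from the observed naming pattern.
    Only the 'hd' quality directory has been confirmed to exist."""
    return f"https://17.mugiwara.xyz/op/saga-{saga}/hd/{episode}.mp4"

assert episode_url(7, 527) == "https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"
```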
---

## ⚠️ Legal Notice

This proxy was built as part of an authorized **bug bounty**.

- ✅ Authorized for security testing
- ✅ Personal use only
- ❌ Do not use for public distribution
- ❌ Respect copyright

---

## 🆘 Troubleshooting

### Problem: "Connection refused"

**Solution:** The server is not running:
```bash
python video_proxy_server.py
```

### Problem: "404 Not Found"

**Solution:** The video URL is wrong. Check it with:
```bash
curl "http://localhost:8080/info?url=YOUR_URL"
```

### Problem: "403 Forbidden" even through the proxy

**Solution:** The source server may have changed its protection. Check the headers in `video_proxy_server.py`.

### Problem: Video lags/buffers

**Solution:**
1. Increase the chunk size in the code
2. Check the VPS bandwidth
3. Put a CDN in front of the proxy

---

## 🎉 Success!

If everything works, you should be able to:

1. ✅ Play videos directly in the browser
2. ✅ Download them with wget/curl
3. ✅ Embed them in an HTML5 player
4. ✅ Play them with VLC
5. ✅ Access them from anywhere (if deployed on a VPS)

**Final publicly accessible URL:**
```
https://vid.creepso.com/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

Enjoy! 🚀

QUICKSTART.md (new file, 319 lines)
@@ -0,0 +1,319 @@
# Quick Start Guide

Get started with web scraping in minutes!

## 1. Installation

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# Unix/MacOS:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
copy .env.example .env  # Windows
# or
cp .env.example .env    # Unix/MacOS
```

## 2. Basic Usage

### Command Line Interface

Scrape any website using the CLI:

```bash
# Basic scraping
python main.py https://example.com

# Use Selenium for JavaScript sites
python main.py https://example.com -m selenium

# Use Jina AI for text extraction
python main.py https://example.com -m jina -o output.txt

# Enable verbose logging
python main.py https://example.com -v
```

### Python Scripts

#### Simple Static Page Scraping

```python
from scrapers.basic_scraper import BasicScraper

# Scrape a static website
with BasicScraper() as scraper:
    result = scraper.scrape("https://quotes.toscrape.com/")

    if result["success"]:
        soup = result["soup"]

        # Extract quotes
        for quote in soup.select(".quote"):
            text = quote.select_one(".text").get_text()
            author = quote.select_one(".author").get_text()
            print(f"{text} - {author}")
```

#### JavaScript-Heavy Websites

```python
from scrapers.selenium_scraper import SeleniumScraper

# Scrape dynamic content
with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://quotes.toscrape.com/js/",
        wait_for=".quote"  # Wait for this element to load
    )

    if result["success"]:
        print(f"Page title: {result['title']}")
        # Process the data...
```

#### AI-Powered Text Extraction

```python
from scrapers.jina_scraper import JinaScraper

# Extract text intelligently with AI
with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://news.ycombinator.com/",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```

## 3. Save Your Data

```python
from data_processors.storage import DataStorage

storage = DataStorage()

# Save as JSON
data = {"title": "Example", "content": "Hello World"}
storage.save_json(data, "output.json")

# Save as CSV
data_list = [
    {"name": "John", "age": 30},
    {"name": "Jane", "age": 25}
]
storage.save_csv(data_list, "people.csv")

# Save as text
storage.save_text("Some text content", "output.txt")
```

## 4. Run Examples

Try the included examples:

```bash
# Basic scraping example
python examples/basic_example.py

# Selenium example
python examples/selenium_example.py

# Advanced tools example (requires API keys)
python examples/advanced_example.py
```

## 5. Common Patterns

### Extract Links from a Page

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        links = scraper.extract_links(
            result["soup"],
            base_url="https://example.com"
        )

        for link in links:
            print(link)
```

### Click Buttons and Fill Forms

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=False) as scraper:
    scraper.scrape("https://example.com/login")

    # Fill form fields
    scraper.fill_form("#username", "myuser")
    scraper.fill_form("#password", "mypass")

    # Click submit button
    scraper.click_element("#submit-btn")

    # Take screenshot
    scraper.take_screenshot("logged_in.png")
```

### Validate and Clean Data

```python
from data_processors.validator import DataValidator

# Validate email
is_valid = DataValidator.validate_email("test@example.com")

# Clean text
cleaned = DataValidator.clean_text(" Multiple spaces ")

# Validate required fields
data = {"name": "John", "email": "john@example.com"}
validation = DataValidator.validate_required_fields(
    data,
    required_fields=["name", "email", "phone"]
)

if not validation["valid"]:
    print(f"Missing: {validation['missing_fields']}")
```

## 6. Testing

Run the test suite:

```bash
# Run all tests
pytest tests/ -v

# Run specific test
pytest tests/test_basic_scraper.py -v

# Run with coverage
pytest tests/ --cov=scrapers --cov=utils --cov=data_processors
```

## 7. Advanced Features

### Deep Crawling with Firecrawl

```python
from scrapers.firecrawl_scraper import FirecrawlScraper

with FirecrawlScraper() as scraper:
    result = scraper.crawl(
        "https://example.com",
        max_depth=3,
        max_pages=50,
        include_patterns=["*/blog/*"],
        exclude_patterns=["*/admin/*"]
    )

    if result["success"]:
        print(f"Crawled {result['total_pages']} pages")
        for page in result["pages"]:
            print(f"- {page['url']}")
```

### Complex Workflows with AgentQL

```python
from scrapers.agentql_scraper import AgentQLScraper

with AgentQLScraper() as scraper:
    # Automated login
    result = scraper.login_workflow(
        url="https://example.com/login",
        username="user@example.com",
        password="password123",
        username_field="input[name='email']",
        password_field="input[name='password']",
        submit_button="button[type='submit']"
    )
```

### Exploratory Tasks with Multion

```python
from scrapers.multion_scraper import MultionScraper

with MultionScraper() as scraper:
    # Find best deal automatically
    result = scraper.find_best_deal(
        search_query="noise cancelling headphones",
        filters={
            "max_price": 200,
            "rating": "4.5+",
            "brand": "Sony"
        }
    )

    if result["success"]:
        print(result["final_result"])
```

## 8. Tips & Best Practices

1. **Always use context managers** (`with` statement) to ensure proper cleanup
2. **Respect rate limits** - the default is 2 seconds between requests
3. **Check robots.txt** before scraping a website
4. **Use appropriate User-Agent** headers
5. **Handle errors gracefully** - the scrapers include built-in retry logic
6. **Validate and clean data** before storing it
7. **Log everything** for debugging purposes

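The fixed delay from tip 2 amounts to only a few lines; a minimal sketch of the idea (the project's `utils/rate_limiter.py` may expose a different interface):

```python
import time

class RateLimiter:
    """Ensure at least `delay` seconds elapse between consecutive requests."""

    def __init__(self, delay: float = 2.0):
        self.delay = delay
        self._last = 0.0

    def wait(self) -> None:
        """Sleep just long enough to honor the delay, then record the time."""
        pause = self._last + self.delay - time.monotonic()
        if pause > 0:
            time.sleep(pause)
        self._last = time.monotonic()
```

Call `wait()` immediately before each request; the first call returns at once.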
## 9. Troubleshooting

### Issue: Selenium driver not found

```bash
# The project uses webdriver-manager to auto-download drivers
# If you have issues, manually install ChromeDriver:
# 1. Download from https://chromedriver.chromium.org/
# 2. Add to your system PATH
```

### Issue: Import errors

```bash
# Make sure you've activated the virtual environment
# and installed all dependencies
pip install -r requirements.txt
```

### Issue: API keys not working

```bash
# Make sure you've copied .env.example to .env
# and added your actual API keys
cp .env.example .env
# Edit .env with your keys
```

## 10. Next Steps

- Explore the `examples/` directory for more use cases
- Read the full `README.md` for detailed documentation
- Check out the `tests/` directory to see testing patterns
- Customize `config.py` for your specific needs
- Build your own scrapers extending `BaseScraper`

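For the last item, here is a hypothetical shape such an extension could take; the actual `BaseScraper` interface lives in `scrapers/base_scraper.py` and may differ, so treat this purely as a sketch:

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Illustrative stand-in for the project's base class: context-manager
    lifecycle plus an abstract scrape() returning a result dict."""

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

    def close(self) -> None:
        """Release sessions/drivers; subclasses override as needed."""

    @abstractmethod
    def scrape(self, url: str) -> dict:
        ...

class EchoScraper(BaseScraper):
    """Trivial subclass: a real one would fetch and parse the page."""

    def scrape(self, url: str) -> dict:
        return {"success": True, "url": url}

with EchoScraper() as scraper:
    result = scraper.scrape("https://example.com")
```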
Happy Scraping! 🚀

README.md (modified, 234 lines)
@@ -1 +1,233 @@
-# Where it all begins.
+# Web Scraping Project

A comprehensive Python web scraping framework supporting multiple scraping approaches, from basic static page scraping to advanced AI-driven data extraction.

## Features

- **Multiple Scraping Methods**:
  - Basic HTTP requests with BeautifulSoup
  - Selenium for JavaScript-heavy sites
  - Jina AI for intelligent text extraction
  - Firecrawl for deep web crawling
  - AgentQL for complex workflows
  - Multion for exploratory tasks

- **Built-in Utilities**:
  - Rate limiting and retry logic
  - Comprehensive logging
  - Data validation and sanitization
  - Multiple storage formats (JSON, CSV, text)

- **Best Practices**:
  - PEP 8 compliant code
  - Modular and reusable components
  - Error handling and recovery
  - Ethical scraping practices

## Project Structure

```
.
├── config.py                  # Configuration and settings
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
│
├── scrapers/                  # Scraper implementations
│   ├── base_scraper.py        # Abstract base class
│   ├── basic_scraper.py       # requests + BeautifulSoup
│   ├── selenium_scraper.py    # Selenium WebDriver
│   ├── jina_scraper.py        # Jina AI integration
│   ├── firecrawl_scraper.py   # Firecrawl integration
│   ├── agentql_scraper.py     # AgentQL workflows
│   └── multion_scraper.py     # Multion AI agent
│
├── utils/                     # Utility modules
│   ├── logger.py              # Logging configuration
│   ├── rate_limiter.py        # Rate limiting
│   └── retry.py               # Retry with backoff
│
├── data_processors/           # Data processing
│   ├── validator.py           # Data validation
│   └── storage.py             # Data storage
│
├── examples/                  # Example scripts
│   ├── basic_example.py
│   ├── selenium_example.py
│   └── advanced_example.py
│
└── tests/                     # Test suite
    ├── test_basic_scraper.py
    └── test_data_processors.py
```

## Installation

1. **Clone the repository**:
```bash
git clone <repository-url>
cd <project-directory>
```

2. **Create virtual environment**:
```bash
python -m venv venv

# Windows
venv\Scripts\activate

# Unix/MacOS
source venv/bin/activate
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```

4. **Configure environment variables**:
```bash
cp .env.example .env
# Edit .env with your API keys
```

## Quick Start

### Basic Scraping

```python
from scrapers.basic_scraper import BasicScraper

with BasicScraper() as scraper:
    result = scraper.scrape("https://example.com")

    if result["success"]:
        soup = result["soup"]
        # Extract data using BeautifulSoup
        titles = scraper.extract_text(soup, "h1")
        print(titles)
```

### Dynamic Content (Selenium)

```python
from scrapers.selenium_scraper import SeleniumScraper

with SeleniumScraper(headless=True) as scraper:
    result = scraper.scrape(
        "https://example.com",
        wait_for=".dynamic-content"
    )

    if result["success"]:
        print(result["title"])
```

### AI-Powered Extraction (Jina)

```python
from scrapers.jina_scraper import JinaScraper

with JinaScraper() as scraper:
    result = scraper.scrape(
        "https://example.com",
        return_format="markdown"
    )

    if result["success"]:
        print(result["content"])
```

## Usage Examples
|
||||||
|
|
||||||
|
See the `examples/` directory for detailed usage examples:
|
||||||
|
|
||||||
|
- `basic_example.py` - Static page scraping
|
||||||
|
- `selenium_example.py` - Dynamic content and interaction
|
||||||
|
- `advanced_example.py` - Advanced tools (Jina, Firecrawl, etc.)
|
||||||
|
|
||||||
|
Run examples:
|
||||||
|
```bash
|
||||||
|
python examples/basic_example.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
Edit `config.py` or set environment variables in `.env`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# API Keys
|
||||||
|
JINA_API_KEY=your_api_key
|
||||||
|
FIRECRAWL_API_KEY=your_api_key
|
||||||
|
AGENTQL_API_KEY=your_api_key
|
||||||
|
MULTION_API_KEY=your_api_key
|
||||||
|
|
||||||
|
# Scraping Settings
|
||||||
|
RATE_LIMIT_DELAY=2
|
||||||
|
MAX_RETRIES=3
|
||||||
|
TIMEOUT=30
|
||||||
|
```
|
||||||
|
|
||||||
|
## Data Storage
|
||||||
|
|
||||||
|
Save scraped data in multiple formats:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from data_processors.storage import DataStorage
|
||||||
|
|
||||||
|
storage = DataStorage()
|
||||||
|
|
||||||
|
# Save as JSON
|
||||||
|
storage.save_json(data, "output.json")
|
||||||
|
|
||||||
|
# Save as CSV
|
||||||
|
storage.save_csv(data, "output.csv")
|
||||||
|
|
||||||
|
# Save as text
|
||||||
|
storage.save_text(content, "output.txt")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
Run tests with pytest:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pytest tests/ -v
|
||||||
|
```
|
||||||
|
|
||||||
|
Run specific test file:
|
||||||
|
```bash
|
||||||
|
pytest tests/test_basic_scraper.py -v
|
||||||
|
```
|
||||||
|
|
||||||
|
## Best Practices
|
||||||
|
|
||||||
|
1. **Respect robots.txt**: Always check and follow website scraping policies
|
||||||
|
2. **Rate Limiting**: Use appropriate delays between requests
|
||||||
|
3. **User-Agent**: Set realistic User-Agent headers
|
||||||
|
4. **Error Handling**: Implement robust error handling and retries
|
||||||
|
5. **Data Validation**: Validate and sanitize scraped data
|
||||||
|
6. **Logging**: Maintain detailed logs for debugging
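
The rate-limiting practice above can be sketched as a small decorator that enforces a minimum delay between successive calls (a hypothetical helper for illustration; it is not part of this framework, which reads the delay from `RATE_LIMIT_DELAY` in `config.py`):

```python
import functools
import time

def rate_limited(delay_seconds: float):
    """Enforce a minimum delay between successive calls to the wrapped function."""
    def decorator(func):
        last_call = [0.0]  # mutable cell holding the time of the previous call

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = last_call[0] + delay_seconds - time.monotonic()
            if wait > 0:
                time.sleep(wait)  # called again too soon: sleep off the remaining delay
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(0.05)
def fetch(url):
    return f"fetched {url}"  # stand-in for a real HTTP request
```

Decorating each scraping function this way keeps the delay policy in one place instead of scattering `time.sleep` calls through the code.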

## Tool Selection Guide

- **Basic Scraper**: Static HTML pages, simple data extraction
- **Selenium**: JavaScript-rendered content, interactive elements
- **Jina**: AI-driven text extraction, structured data
- **Firecrawl**: Deep crawling, hierarchical content
- **AgentQL**: Complex workflows (login, forms, multi-step processes)
- **Multion**: Exploratory tasks, unpredictable scenarios

## Contributing

1. Follow PEP 8 style guidelines
2. Add tests for new features
3. Update documentation
4. Use meaningful commit messages

## License

[Your License Here]

## Disclaimer

This tool is for educational purposes. Always respect website terms of service and scraping policies. Be ethical and responsible when scraping data.
484
README_FINAL.md
Normal file
@ -0,0 +1,484 @@
# 🎬 Sekai.one Video Proxy - Complete Solution

**Access sekai.one videos from anywhere, without restriction!**

---

## 🎯 The Problem

The video server `mugiwara.xyz` uses **Referer** protection:
- ✅ Accessible from `https://sekai.one/`
- ❌ **403 Forbidden** on direct access

**Our solution:** a proxy server that works around this protection!

---

## ⚡ Ultra-Quick Start

### 1. Installation (1 minute)

```bash
# Clone and install
git clone <repo>
cd sekai-scraper
pip install -r requirements.txt
```

### 2. Start the Proxy (30 seconds)

```bash
python video_proxy_server.py
```

### 3. Test (10 seconds)

```bash
# In another terminal
python test_proxy.py
```

### 4. Use It! 🎉

**Proxy URL:**
```
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

- Paste it into your browser → the video plays!
- Open it in VLC → it works!
- Embed it in a web page → done!

---

## 📚 Full Documentation

| Document | Description |
|----------|-------------|
| **[PROXY_GUIDE.md](PROXY_GUIDE.md)** | 📖 Complete proxy guide (VPS deployment, API, etc.) |
| **[GUIDE_FR.md](GUIDE_FR.md)** | 🇫🇷 General guide in French |
| **[README_SEKAI.md](README_SEKAI.md)** | 🔧 Technical documentation for the scraper |

---

## 🚀 Usage

### A. In the Browser

```
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

### B. With VLC

1. Open VLC
2. Media → Open Network Stream
3. Paste the proxy URL
4. Play! 🎬

### C. HTML Page

```html
<video controls>
  <source src="http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4">
</video>
```

### D. Download

```bash
# With wget
wget "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4" -O ep527.mp4

# With curl
curl "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4" -o ep527.mp4
```

---

## 🌐 VPS Deployment (vid.creepso.com)

### Quick Installation

```bash
# On your VPS
git clone <repo>
cd sekai-scraper
pip install -r requirements.txt

# Install nginx
sudo apt install nginx

# Start with gunicorn
gunicorn -w 4 -b 127.0.0.1:8080 video_proxy_server:app --daemon

# Configure nginx (see PROXY_GUIDE.md)
# Add SSL with certbot

# Final result:
https://vid.creepso.com/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
```

**This URL will be reachable from ANYWHERE in the world!** 🌍

---

## 📂 Project Layout

```
📦 sekai-scraper/
│
├── 🎯 MAIN SCRIPTS
│   ├── video_proxy_server.py   ⭐ Proxy server (USE THIS ONE)
│   ├── test_proxy.py           Automated tests
│   ├── sekai_one_scraper.py    Extracts video URLs
│   └── get_one_piece.py        Full script (scraping + download)
│
├── 📖 DOCUMENTATION
│   ├── PROXY_GUIDE.md          Complete proxy guide ⭐
│   ├── GUIDE_FR.md             General guide in French
│   ├── README_SEKAI.md         Technical docs
│   └── QUICKSTART.md           Quick start (English)
│
├── 🛠️ SCRAPING FRAMEWORK
│   ├── scrapers/               Generic framework
│   ├── utils/                  Utilities (logging, retry, etc.)
│   └── data_processors/        Validation and storage
│
└── 📊 DATA
    ├── data/                   Results and screenshots
    ├── videos/                 Downloaded videos
    └── logs/                   Detailed logs
```

---

## 🎓 How Does It Work?

### The Flow

```
1. Client (you)
   ↓
   http://localhost:8080/proxy?url=VIDEO_URL
   ↓
2. Proxy server
   ↓
   Adds → Referer: https://sekai.one/
   ↓
3. Video server (mugiwara.xyz)
   ↓
   ✅ 200 OK (thinks the request comes from sekai.one)
   ↓
4. Video stream → Client
```

### The Magic Headers

```http
# WITHOUT the proxy → 403 Forbidden ❌
GET /op/saga-7/hd/527.mp4
Host: 17.mugiwara.xyz

# WITH the proxy → 200 OK ✅
GET /op/saga-7/hd/527.mp4
Host: 17.mugiwara.xyz
Referer: https://sekai.one/   ← The key!
```
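
In Python, the upstream request that step 2 builds can be sketched with nothing but the standard library (the function name is illustrative; the real logic lives in `video_proxy_server.py`):

```python
import urllib.request

SEKAI_REFERER = "https://sekai.one/"

def build_video_request(url: str, range_header: str = None) -> urllib.request.Request:
    """Build the upstream request: same URL, but with the Referer spoofed."""
    headers = {
        "Referer": SEKAI_REFERER,  # the key that turns 403 into 200
        "User-Agent": "Mozilla/5.0",
    }
    if range_header:
        headers["Range"] = range_header  # forward seeking requests unchanged
    return urllib.request.Request(url, headers=headers)

req = build_video_request("https://17.mugiwara.xyz/op/saga-7/hd/527.mp4", "bytes=0-1023")
```

Forwarding the client's `Range` header verbatim, as shown, is what makes seeking work through the proxy.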

---

## 🛠️ Proxy API

### Endpoints

```bash
# 1. Video proxy (streaming)
GET /proxy?url=[VIDEO_URL]

# 2. Video info (metadata)
GET /info?url=[VIDEO_URL]

# 3. Forced download
GET /download?url=[VIDEO_URL]

# 4. Health check
GET /health
```

### Examples

```bash
# Get the info
curl "http://localhost:8080/info?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"

# Response:
{
  "accessible": true,
  "content_length_mb": 260.14,
  "content_type": "video/mp4",
  "status_code": 200
}
```
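
If the nested video URL ever contains a `?` or `&` of its own, it should be percent-encoded before being passed as the `url` parameter. A minimal sketch, assuming the default local address:

```python
from urllib.parse import urlencode

PROXY_BASE = "http://localhost:8080"  # assumed local dev address

def proxy_url(endpoint: str, video_url: str) -> str:
    """Build a proxy URL with the nested video URL percent-encoded."""
    return f"{PROXY_BASE}/{endpoint}?" + urlencode({"url": video_url})

print(proxy_url("info", "https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"))
```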

---

## ✨ Features

### Proxy Server

- ✅ **Progressive streaming** (no full download first)
- ✅ **Range requests** (seeking within the video)
- ✅ **CORS enabled** (usable from any site)
- ✅ **Multi-threaded** (several simultaneous clients)
- ✅ **Detailed logs**
- ✅ **Complete REST API**

### Scraper

- ✅ Automatic extraction of video URLs
- ✅ Selenium support (JavaScript)
- ✅ Pattern analysis
- ✅ Screenshots for debugging
- ✅ Results saved as JSON

---

## 🧪 Tests

```bash
# Run everything automatically
python test_proxy.py

# Tests performed:
✓ Health Check   - Server is up
✓ Video Info     - Metadata reachable
✓ Streaming      - Download works
✓ Range Request  - Seeking supported
✓ Direct Access  - Protection active (403)

# Also generates test_video_player.html
```

---

## 🎯 Use Cases

### 1. Stremio Integration

```javascript
// Stremio add-on
{
  streams: [{
    url: 'https://vid.creepso.com/proxy?url=VIDEO_URL',
    title: 'HD'
  }]
}
```

### 2. Personal Website

```html
<video controls>
  <source src="https://vid.creepso.com/proxy?url=VIDEO_URL">
</video>
```

### 3. Mobile App

```kotlin
// Android with ExoPlayer
val videoUrl = "https://vid.creepso.com/proxy?url=VIDEO_URL"
player.setMediaItem(MediaItem.fromUri(videoUrl))
```

### 4. Download Script

```python
import requests

url = "http://localhost:8080/proxy?url=VIDEO_URL"
with requests.get(url, stream=True) as r:
    with open("video.mp4", "wb") as f:
        for chunk in r.iter_content(8192):
            f.write(chunk)
```

---

## 🔒 Security

### On a VPS

1. **Rate limiting** (recommended)

```python
# Add flask-limiter
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(get_remote_address, app=app)

@app.route('/proxy')
@limiter.limit("10 per minute")
def proxy_video():
    # ...
```

2. **URL whitelist**

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ['mugiwara.xyz']

def is_allowed_url(url):
    # Check the actual hostname, not a substring:
    # "https://evil.com/?mugiwara.xyz" must be rejected
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

3. **HTTPS only**

```nginx
# nginx config
return 301 https://$server_name$request_uri;
```

---

## 📊 Performance

### Benchmarks (localhost)

```
Video size         : 260 MB
Streaming          : ~50 MB/s
Latency            : <100 ms
Range requests     : ✅ Supported
Concurrent clients : 10+ (with gunicorn -w 4)
```

### On a VPS

```
Bandwidth       : Depends on the VPS
Latency         : 50-200 ms (depending on location)
CDN compatible  : Yes (Cloudflare, etc.)
```

---

## ⚠️ Limitations

1. **Bandwidth**: limited by your VPS
2. **Concurrent users**: tune the gunicorn workers
3. **Cache**: no video cache (direct streaming)
4. **DDoS**: add Cloudflare if needed

---

## 🐛 Troubleshooting

### "Connection refused"

```bash
# The server is not running
python video_proxy_server.py
```

### "403 Forbidden" through the proxy

```bash
# Check the headers in video_proxy_server.py
# The site may have changed its protection
```

### Video lag/buffering

```bash
# 1. Check your bandwidth
# 2. Increase the gunicorn workers
gunicorn -w 8 ...
# 3. Use a CDN
```

---

## 📈 Roadmap

- [ ] Video cache (Redis)
- [ ] Monitoring dashboard
- [ ] M3U8 playlist support
- [ ] On-the-fly transcoding
- [ ] Web interface for testing
- [ ] API key authentication
- [ ] Docker container
- [ ] Kubernetes deployment

---

## 🤝 Contributing

This project is part of an **authorized bug bounty**.

- ✅ Use for security testing
- ✅ Personal use
- ❌ No public distribution
- ❌ Respect copyright

---

## 📞 Support

- **Logs**: `logs/*_scraping.log`
- **Screenshots**: `data/*.png`
- **HTML debug**: `data/sekai_page_source.html`

---

## 🎉 End Result

After deploying to a VPS:

```
🌐 Public URL (reachable everywhere):
https://vid.creepso.com/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4

✅ Works in:
- Web browsers (Chrome, Firefox, Safari, etc.)
- Video players (VLC, MPV, etc.)
- Mobile apps
- Stremio add-ons
- Download scripts
- HTML5 <video> tags

🚀 Performance:
- Progressive streaming
- Working seeking
- No size limit
- Multiple clients
```

---

## 🏁 Complete Quick Start

```bash
# 1. Installation
git clone <repo> && cd sekai-scraper
pip install -r requirements.txt

# 2. Start the proxy
python video_proxy_server.py

# 3. Test
python test_proxy.py

# 4. Use it
# Open in the browser:
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4

# 5. Deploy to a VPS (optional)
# See PROXY_GUIDE.md, "Deployment" section

# 🎉 That's it!
```

---

**Made with ❤️ for bug bounty and educational purposes**

*License: personal use only - respect copyright*

57
config.py
Normal file
@ -0,0 +1,57 @@
"""
|
||||||
|
Configuration module for web scraping project.
|
||||||
|
Loads environment variables and defines project-wide settings.
|
||||||
|
"""
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
|
||||||
|
# Load environment variables
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
# Project Paths
|
||||||
|
BASE_DIR = Path(__file__).resolve().parent
|
||||||
|
DATA_DIR = BASE_DIR / "data"
|
||||||
|
LOGS_DIR = BASE_DIR / "logs"
|
||||||
|
CACHE_DIR = BASE_DIR / "cache"
|
||||||
|
|
||||||
|
# Create directories if they don't exist
|
||||||
|
DATA_DIR.mkdir(exist_ok=True)
|
||||||
|
LOGS_DIR.mkdir(exist_ok=True)
|
||||||
|
CACHE_DIR.mkdir(exist_ok=True)
|
||||||
|
|
||||||
|
# API Keys
|
||||||
|
JINA_API_KEY = os.getenv("JINA_API_KEY", "")
|
||||||
|
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
|
||||||
|
AGENTQL_API_KEY = os.getenv("AGENTQL_API_KEY", "")
|
||||||
|
MULTION_API_KEY = os.getenv("MULTION_API_KEY", "")
|
||||||
|
TWOCAPTCHA_API_KEY = os.getenv("TWOCAPTCHA_API_KEY", "")
|
||||||
|
|
||||||
|
# Scraping Configuration
|
||||||
|
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", 2))
|
||||||
|
MAX_RETRIES = int(os.getenv("MAX_RETRIES", 3))
|
||||||
|
TIMEOUT = int(os.getenv("TIMEOUT", 30))
|
||||||
|
USER_AGENT = os.getenv(
|
||||||
|
"USER_AGENT",
|
||||||
|
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||||
|
"(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Request Headers
|
||||||
|
DEFAULT_HEADERS = {
|
||||||
|
"User-Agent": USER_AGENT,
|
||||||
|
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
|
||||||
|
"Accept-Language": "en-US,en;q=0.5",
|
||||||
|
"Accept-Encoding": "gzip, deflate, br",
|
||||||
|
"DNT": "1",
|
||||||
|
"Connection": "keep-alive",
|
||||||
|
"Upgrade-Insecure-Requests": "1"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Selenium Configuration
|
||||||
|
SELENIUM_HEADLESS = True
|
||||||
|
SELENIUM_IMPLICIT_WAIT = 10
|
||||||
|
|
||||||
|
# Cache Configuration
|
||||||
|
CACHE_EXPIRATION = 3600 # 1 hour in seconds
|
||||||
|
|
||||||
8
data_processors/__init__.py
Normal file
@ -0,0 +1,8 @@
"""
|
||||||
|
Data processing and storage modules.
|
||||||
|
"""
|
||||||
|
from .validator import DataValidator
|
||||||
|
from .storage import DataStorage
|
||||||
|
|
||||||
|
__all__ = ["DataValidator", "DataStorage"]
|
||||||
|
|
||||||
184
data_processors/storage.py
Normal file
@ -0,0 +1,184 @@
"""
|
||||||
|
Data storage utilities for saving scraped content.
|
||||||
|
"""
|
||||||
|
import json
|
||||||
|
import csv
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
from datetime import datetime
|
||||||
|
from utils.logger import setup_logger
|
||||||
|
from config import DATA_DIR
|
||||||
|
|
||||||
|
logger = setup_logger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class DataStorage:
|
||||||
|
"""
|
||||||
|
Storage handler for scraped data supporting multiple formats.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, output_dir: Optional[Path] = None):
|
||||||
|
"""
|
||||||
|
Initialize data storage.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
output_dir: Directory for storing data (default: DATA_DIR from config)
|
||||||
|
"""
|
||||||
|
self.output_dir = output_dir or DATA_DIR
|
||||||
|
self.output_dir.mkdir(exist_ok=True)
|
||||||
|
self.logger = logger
|
||||||
|
|
||||||
|
def save_json(
|
||||||
|
self,
|
||||||
|
data: Any,
|
||||||
|
filename: str,
|
||||||
|
indent: int = 2,
|
||||||
|
append: bool = False
|
||||||
|
) -> Path:
|
||||||
|
"""
|
||||||
|
Save data as JSON file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
data: Data to save
|
||||||
|
filename: Output filename
|
||||||
|
indent: JSON indentation
|
||||||
|
append: Append to existing file if True
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to saved file
|
||||||
|
"""
|
||||||
|
filepath = self.output_dir / filename
|
||||||
|
|
||||||
|
try:
|
||||||
|
if append and filepath.exists():
|
||||||
|
with open(filepath, 'r', encoding='utf-8') as f:
|
||||||
|
existing_data = json.load(f)
|
||||||
|
|
||||||
|
if isinstance(existing_data, list) and isinstance(data, list):
|
||||||
|
data = existing_data + data
|
||||||
|
else:
|
||||||
|
self.logger.warning("Cannot append: data types don't match")
|
||||||
|
|
||||||
|
with open(filepath, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(data, f, indent=indent, ensure_ascii=False)
|
||||||
|
|
||||||
|
self.logger.info(f"Saved JSON data to {filepath}")
|
||||||
|
return filepath
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
self.logger.error(f"Failed to save JSON: {str(e)}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
def save_csv(
|
||||||
|
self,
|
||||||
|
data: List[Dict[str, Any]],
|
||||||
|
filename: str,
|
||||||
|
fieldnames: Optional[List[str]] = None,
|
||||||
|
append: bool = False
|
||||||
|
) -> Path:
|
||||||
|
"""
|
||||||
|
Save data as CSV file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
data: List of dictionaries to save
|
||||||
|
filename: Output filename
|
||||||
|
fieldnames: CSV column names (auto-detected if None)
|
||||||
|
append: Append to existing file if True
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to saved file
|
||||||
|
"""
|
||||||
|
filepath = self.output_dir / filename
|
||||||
|
|
||||||
|
if not data:
|
||||||
|
self.logger.warning("No data to save")
|
||||||
|
return filepath
|
||||||
|
|
||||||
|
try:
|
||||||
|
if fieldnames is None:
|
||||||
|
fieldnames = list(data[0].keys())
|
||||||
|
|
||||||
|
mode = 'a' if append and filepath.exists() else 'w'
|
||||||
|
write_header = not (append and filepath.exists())
|
||||||
|
|
||||||
|
with open(filepath, mode, newline='', encoding='utf-8') as f:
|
||||||
|
writer = csv.DictWriter(f, fieldnames=fieldnames)
|
||||||
|
|
||||||
|
if write_header:
|
||||||
|
writer.writeheader()
|
||||||
|
|
||||||
|
writer.writerows(data)
|
||||||
|
|
||||||
|
self.logger.info(f"Saved CSV data to {filepath}")
|
||||||
|
return filepath
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
self.logger.error(f"Failed to save CSV: {str(e)}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
def save_text(self, content: str, filename: str, append: bool = False) -> Path:
|
||||||
|
"""
|
||||||
|
Save content as text file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
content: Text content to save
|
||||||
|
filename: Output filename
|
||||||
|
append: Append to existing file if True
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to saved file
|
||||||
|
"""
|
||||||
|
filepath = self.output_dir / filename
|
||||||
|
|
||||||
|
try:
|
||||||
|
mode = 'a' if append else 'w'
|
||||||
|
|
||||||
|
with open(filepath, mode, encoding='utf-8') as f:
|
||||||
|
f.write(content)
|
||||||
|
if append:
|
||||||
|
f.write('\n')
|
||||||
|
|
||||||
|
self.logger.info(f"Saved text data to {filepath}")
|
||||||
|
return filepath
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
self.logger.error(f"Failed to save text: {str(e)}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
def create_timestamped_filename(self, base_name: str, extension: str) -> str:
|
||||||
|
"""
|
||||||
|
Create a filename with timestamp.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
base_name: Base filename
|
||||||
|
extension: File extension (without dot)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Timestamped filename
|
||||||
|
"""
|
||||||
|
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
||||||
|
return f"{base_name}_{timestamp}.{extension}"
|
||||||
|
|
||||||
|
def load_json(self, filename: str) -> Any:
|
||||||
|
"""
|
||||||
|
Load data from JSON file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
filename: Input filename
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Loaded data
|
||||||
|
"""
|
||||||
|
filepath = self.output_dir / filename
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(filepath, 'r', encoding='utf-8') as f:
|
||||||
|
data = json.load(f)
|
||||||
|
|
||||||
|
self.logger.info(f"Loaded JSON data from {filepath}")
|
||||||
|
return data
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
self.logger.error(f"Failed to load JSON: {str(e)}")
|
||||||
|
raise
|
||||||
|
|
||||||
142
data_processors/validator.py
Normal file
@ -0,0 +1,142 @@
"""
|
||||||
|
Data validation utilities for scraped content.
|
||||||
|
"""
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
import re
|
||||||
|
from datetime import datetime
|
||||||
|
from utils.logger import setup_logger
|
||||||
|
|
||||||
|
logger = setup_logger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class DataValidator:
|
||||||
|
"""
|
||||||
|
Validator for scraped data with various validation rules.
|
||||||
|
"""
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def validate_email(email: str) -> bool:
|
||||||
|
"""Validate email format."""
|
||||||
|
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
|
||||||
|
return bool(re.match(pattern, email))
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def validate_url(url: str) -> bool:
|
||||||
|
"""Validate URL format."""
|
||||||
|
pattern = r'^https?://[^\s/$.?#].[^\s]*$'
|
||||||
|
return bool(re.match(pattern, url))
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def validate_phone(phone: str) -> bool:
|
||||||
|
"""Validate phone number format."""
|
||||||
|
# Basic validation - adjust pattern as needed
|
||||||
|
pattern = r'^\+?1?\d{9,15}$'
|
||||||
|
cleaned = re.sub(r'[\s\-\(\)]', '', phone)
|
||||||
|
return bool(re.match(pattern, cleaned))
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def validate_required_fields(data: Dict[str, Any], required_fields: List[str]) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Validate that required fields are present and non-empty.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
data: Data dictionary to validate
|
||||||
|
required_fields: List of required field names
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with validation results
|
||||||
|
"""
|
||||||
|
missing_fields = []
|
||||||
|
empty_fields = []
|
||||||
|
|
||||||
|
for field in required_fields:
|
||||||
|
if field not in data:
|
||||||
|
missing_fields.append(field)
|
||||||
|
elif not data[field] or (isinstance(data[field], str) and not data[field].strip()):
|
||||||
|
empty_fields.append(field)
|
||||||
|
|
||||||
|
is_valid = len(missing_fields) == 0 and len(empty_fields) == 0
|
||||||
|
|
||||||
|
return {
|
||||||
|
"valid": is_valid,
|
||||||
|
"missing_fields": missing_fields,
|
||||||
|
"empty_fields": empty_fields
|
||||||
|
}
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def validate_data_types(data: Dict[str, Any], type_schema: Dict[str, type]) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Validate data types against a schema.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
data: Data dictionary to validate
|
||||||
|
type_schema: Dictionary mapping field names to expected types
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with validation results
|
||||||
|
"""
|
||||||
|
type_errors = []
|
||||||
|
|
||||||
|
for field, expected_type in type_schema.items():
|
||||||
|
if field in data and not isinstance(data[field], expected_type):
|
||||||
|
type_errors.append({
|
||||||
|
"field": field,
|
||||||
|
"expected": expected_type.__name__,
|
||||||
|
"actual": type(data[field]).__name__
|
||||||
|
})
|
||||||
|
|
||||||
|
return {
|
||||||
|
"valid": len(type_errors) == 0,
|
||||||
|
"type_errors": type_errors
|
||||||
|
}
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def clean_text(text: str) -> str:
|
||||||
|
"""
|
||||||
|
Clean and normalize text content.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: Raw text to clean
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Cleaned text
|
||||||
|
"""
|
||||||
|
if not isinstance(text, str):
|
||||||
|
return str(text)
|
||||||
|
|
||||||
|
# Remove extra whitespace
|
||||||
|
text = ' '.join(text.split())
|
||||||
|
|
||||||
|
# Remove special characters (optional, adjust as needed)
|
||||||
|
# text = re.sub(r'[^\w\s\-.,!?]', '', text)
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def sanitize_data(data: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Sanitize all string fields in a data dictionary.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
data: Data dictionary to sanitize
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Sanitized data dictionary
|
||||||
|
"""
|
||||||
|
sanitized = {}
|
||||||
|
|
||||||
|
for key, value in data.items():
|
||||||
|
if isinstance(value, str):
|
||||||
|
sanitized[key] = DataValidator.clean_text(value)
|
||||||
|
elif isinstance(value, dict):
|
||||||
|
sanitized[key] = DataValidator.sanitize_data(value)
|
||||||
|
elif isinstance(value, list):
|
||||||
|
sanitized[key] = [
|
||||||
|
DataValidator.clean_text(item) if isinstance(item, str) else item
|
||||||
|
for item in value
|
||||||
|
]
|
||||||
|
else:
|
||||||
|
sanitized[key] = value
|
||||||
|
|
||||||
|
return sanitized
|
||||||
|
|
||||||
4
examples/__init__.py
Normal file
@ -0,0 +1,4 @@
"""
|
||||||
|
Example scripts demonstrating different scraping techniques.
|
||||||
|
"""
|
||||||
|
|
||||||
106
examples/advanced_example.py
Normal file
@ -0,0 +1,106 @@
"""
|
||||||
|
Example: Advanced scraping with Jina, Firecrawl, AgentQL, and Multion.
|
||||||
|
"""
|
||||||
|
from scrapers.jina_scraper import JinaScraper
|
||||||
|
from scrapers.firecrawl_scraper import FirecrawlScraper
|
||||||
|
from scrapers.agentql_scraper import AgentQLScraper
|
||||||
|
from scrapers.multion_scraper import MultionScraper
|
||||||
|
|
||||||
|
|
||||||
|
def jina_example():
|
||||||
|
"""
|
||||||
|
Example: Use Jina for AI-driven text extraction
|
||||||
|
"""
|
||||||
|
print("=== Jina AI Example ===\n")
|
||||||
|
|
||||||
|
with JinaScraper() as scraper:
|
||||||
|
result = scraper.scrape(
|
||||||
|
"https://example.com",
|
||||||
|
return_format="markdown"
|
||||||
|
)
|
||||||
|
|
||||||
|
if result["success"]:
|
||||||
|
print("Extracted content (first 500 chars):")
|
||||||
|
print(result["content"][:500])
|
||||||
|
else:
|
||||||
|
print(f"Error: {result.get('error')}")
|
||||||
|
|
||||||
|
|
||||||
|
def firecrawl_example():
|
||||||
|
"""
|
||||||
|
Example: Use Firecrawl for deep crawling
|
||||||
|
"""
|
||||||
|
print("\n=== Firecrawl Example ===\n")
|
||||||
|
|
||||||
|
with FirecrawlScraper() as scraper:
|
||||||
|
# Scrape a single page
|
||||||
|
result = scraper.scrape("https://example.com")
|
||||||
|
|
||||||
|
if result["success"]:
|
||||||
|
print(f"Scraped content length: {len(result.get('content', ''))}")
|
||||||
|
|
||||||
|
# Crawl multiple pages
|
||||||
|
crawl_result = scraper.crawl(
|
||||||
|
"https://example.com",
|
||||||
|
max_depth=2,
|
||||||
|
max_pages=5
|
||||||
|
)
|
||||||
|
|
||||||
|
if crawl_result["success"]:
|
||||||
|
print(f"Crawled {crawl_result['total_pages']} pages")
|
||||||
|
|
||||||
|
|
||||||
|
def agentql_example():
|
||||||
|
"""
|
||||||
|
Example: Use AgentQL for complex workflows
|
||||||
|
"""
|
||||||
|
print("\n=== AgentQL Example ===\n")
|
||||||
|
|
||||||
|
with AgentQLScraper() as scraper:
|
||||||
|
# Example login workflow
|
||||||
|
workflow = [
|
||||||
|
{"action": "navigate", "params": {"url": "https://example.com/login"}},
|
||||||
|
{"action": "fill_form", "params": {"field": "#username", "value": "user@example.com"}},
|
||||||
|
{"action": "fill_form", "params": {"field": "#password", "value": "password123"}},
|
||||||
|
{"action": "click", "params": {"element": "#submit"}},
|
||||||
|
{"action": "extract", "params": {"selector": ".dashboard-content"}}
|
||||||
|
]
|
||||||
|
|
||||||
|
result = scraper.scrape("https://example.com/login", workflow)
|
||||||
|
|
||||||
|
if result["success"]:
|
||||||
|
print(f"Workflow executed: {len(result['workflow_results'])} steps")
|
||||||
|
|
||||||
|
|
||||||
|
def multion_example():
|
||||||
|
"""
|
||||||
|
Example: Use Multion for exploratory tasks
|
||||||
|
"""
|
||||||
|
print("\n=== Multion Example ===\n")
|
||||||
|
|
||||||
|
with MultionScraper() as scraper:
|
||||||
|
# Example: Find best deal
|
||||||
|
result = scraper.find_best_deal(
|
||||||
|
search_query="wireless headphones",
|
||||||
|
filters={"max_price": 100, "rating": "4+"}
|
||||||
|
)
|
||||||
|
|
||||||
|
if result["success"]:
|
||||||
|
print(f"Task result: {result.get('final_result')}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
# Note: These examples require API keys to be set in .env file
|
||||||
|
|
||||||
|
print("Advanced Scraping Examples")
|
||||||
|
print("=" * 50)
|
||||||
|
|
||||||
|
# Uncomment the examples you want to run:
|
||||||
|
|
||||||
|
# jina_example()
|
||||||
|
# firecrawl_example()
|
||||||
|
# agentql_example()
|
||||||
|
# multion_example()
|
||||||
|
|
||||||
|
print("\nNote: Set API keys in .env file to run these examples")
|
||||||
|
|
||||||
#### examples/basic_example.py (new file, 66 lines)

```python
"""
Example: Basic web scraping with requests and BeautifulSoup.
"""
from scrapers.basic_scraper import BasicScraper
import json


def scrape_quotes():
    """
    Example: Scrape quotes from quotes.toscrape.com
    """
    with BasicScraper() as scraper:
        result = scraper.scrape("http://quotes.toscrape.com/")

        if result["success"]:
            soup = result["soup"]

            # Extract all quotes
            quotes = []
            for quote_elem in soup.select(".quote"):
                text = quote_elem.select_one(".text").get_text(strip=True)
                author = quote_elem.select_one(".author").get_text(strip=True)
                tags = [tag.get_text(strip=True) for tag in quote_elem.select(".tag")]

                quotes.append({
                    "text": text,
                    "author": author,
                    "tags": tags
                })

            print(f"Scraped {len(quotes)} quotes")
            print(json.dumps(quotes[:3], indent=2))  # Print first 3 quotes

            return quotes
        else:
            print(f"Scraping failed: {result.get('error')}")
            return []


def scrape_with_links():
    """
    Example: Extract all links from a page
    """
    with BasicScraper() as scraper:
        result = scraper.scrape("http://quotes.toscrape.com/")

        if result["success"]:
            links = scraper.extract_links(
                result["soup"],
                base_url="http://quotes.toscrape.com/"
            )

            print(f"Found {len(links)} links")
            for link in links[:10]:  # Print first 10 links
                print(f"  - {link}")

            return links


if __name__ == "__main__":
    print("=== Basic Scraping Example ===\n")
    scrape_quotes()

    print("\n=== Link Extraction Example ===\n")
    scrape_with_links()
```
#### examples/selenium_example.py (new file, 62 lines)

```python
"""
Example: Scraping dynamic content with Selenium.
"""
from scrapers.selenium_scraper import SeleniumScraper
import time


def scrape_dynamic_content():
    """
    Example: Scrape JavaScript-rendered content
    """
    with SeleniumScraper(headless=True) as scraper:
        # Example with a site that loads content dynamically
        result = scraper.scrape(
            "http://quotes.toscrape.com/js/",
            wait_for=".quote"
        )

        if result["success"]:
            soup = result["soup"]
            quotes = soup.select(".quote")

            print(f"Scraped {len(quotes)} quotes from JavaScript-rendered page")

            # Extract quote details
            for quote in quotes[:3]:
                text = quote.select_one(".text").get_text(strip=True)
                author = quote.select_one(".author").get_text(strip=True)
                print(f"\n{text}\n  - {author}")
        else:
            print(f"Scraping failed: {result.get('error')}")


def interact_with_page():
    """
    Example: Interact with page elements (clicking, scrolling, etc.)
    """
    with SeleniumScraper(headless=False) as scraper:
        scraper.scrape("http://quotes.toscrape.com/")

        # Scroll down
        scraper.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

        # Click "Next" button if it exists
        try:
            scraper.click_element(".next > a")
            time.sleep(2)

            print(f"Navigated to: {scraper.driver.current_url}")
        except Exception as e:
            print(f"Could not click next: {e}")


if __name__ == "__main__":
    print("=== Selenium Dynamic Content Example ===\n")
    scrape_dynamic_content()

    print("\n=== Selenium Interaction Example ===\n")
    # Uncomment to see browser interaction (non-headless)
    # interact_with_page()
```
#### main.py (new file, 130 lines)

```python
"""
Main entry point for the web scraping project.
Example usage and demonstration of different scraping methods.
"""
import argparse
from scrapers import (
    BasicScraper,
    SeleniumScraper,
    JinaScraper,
    FirecrawlScraper,
    AgentQLScraper,
    MultionScraper
)
from data_processors.storage import DataStorage
from data_processors.validator import DataValidator
from utils.logger import setup_logger

logger = setup_logger(__name__)


def scrape_basic(url: str, output: str = None):
    """Scrape using basic HTTP requests."""
    logger.info(f"Starting basic scrape: {url}")

    with BasicScraper() as scraper:
        result = scraper.scrape(url)

        if result["success"]:
            logger.info(f"Successfully scraped {url}")

            if output:
                storage = DataStorage()
                storage.save_json(result, output)
                logger.info(f"Saved results to {output}")

            return result
        else:
            logger.error(f"Scraping failed: {result.get('error')}")
            return None


def scrape_dynamic(url: str, output: str = None):
    """Scrape using Selenium for dynamic content."""
    logger.info(f"Starting Selenium scrape: {url}")

    with SeleniumScraper(headless=True) as scraper:
        result = scraper.scrape(url)

        if result["success"]:
            logger.info(f"Successfully scraped {url}")

            if output:
                storage = DataStorage()
                storage.save_json(result, output)
                logger.info(f"Saved results to {output}")

            return result
        else:
            logger.error(f"Scraping failed: {result.get('error')}")
            return None


def scrape_jina(url: str, output: str = None):
    """Scrape using Jina AI."""
    logger.info(f"Starting Jina scrape: {url}")

    with JinaScraper() as scraper:
        result = scraper.scrape(url, return_format="markdown")

        if result["success"]:
            logger.info(f"Successfully scraped {url}")

            if output:
                storage = DataStorage()
                storage.save_text(result["content"], output)
                logger.info(f"Saved results to {output}")

            return result
        else:
            logger.error(f"Scraping failed: {result.get('error')}")
            return None


def main():
    """Main entry point with CLI argument parsing."""
    parser = argparse.ArgumentParser(
        description="Web Scraping Framework",
        formatter_class=argparse.RawDescriptionHelpFormatter
    )

    parser.add_argument(
        "url",
        help="Target URL to scrape"
    )

    parser.add_argument(
        "-m", "--method",
        choices=["basic", "selenium", "jina", "firecrawl", "agentql", "multion"],
        default="basic",
        help="Scraping method to use (default: basic)"
    )

    parser.add_argument(
        "-o", "--output",
        help="Output file path (optional)"
    )

    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Enable verbose logging"
    )

    args = parser.parse_args()

    # Execute appropriate scraper
    if args.method == "basic":
        scrape_basic(args.url, args.output)
    elif args.method == "selenium":
        scrape_dynamic(args.url, args.output)
    elif args.method == "jina":
        scrape_jina(args.url, args.output)
    else:
        logger.warning(f"Method '{args.method}' not yet implemented in CLI")
        print("Please use: basic, selenium, or jina")


if __name__ == "__main__":
    main()
```
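The CLI wired up in `main()` can be sanity-checked without touching the network by parsing a sample argument list; the sketch below mirrors the flags defined above (the sample argv values are illustrative):

```python
# Self-contained sketch of the CLI defined in main.py: same flags,
# parsed against a sample argv instead of sys.argv.
import argparse

parser = argparse.ArgumentParser(description="Web Scraping Framework")
parser.add_argument("url", help="Target URL to scrape")
parser.add_argument("-m", "--method",
                    choices=["basic", "selenium", "jina", "firecrawl", "agentql", "multion"],
                    default="basic")
parser.add_argument("-o", "--output")
parser.add_argument("-v", "--verbose", action="store_true")

# Parse a sample command line: scrape example.com with Jina, save to out.md
args = parser.parse_args(["https://example.com", "-m", "jina", "-o", "out.md"])
print(args.method, args.output)
# → jina out.md
```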
#### requirements.txt (new file, 45 lines)

```text
# Core HTTP and Parsing
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0

# Browser Automation
selenium==4.16.0
webdriver-manager==4.0.1

# Advanced Scraping Tools
jina==3.24.0
firecrawl-py==0.0.16
agentql==0.1.3
multion==1.0.1

# Data Processing
pandas==2.2.0
numpy==1.26.3

# Async and Performance
aiohttp==3.9.1
asyncio==3.4.3
requests-cache==1.1.1

# Utilities
python-dotenv==1.0.0
fake-useragent==1.4.0
tenacity==8.2.3

# Optional: Database Support
sqlalchemy==2.0.25

# Optional: CAPTCHA Solving
2captcha-python==1.2.1

# Web Server (for the video proxy)
flask==3.0.0
flask-cors==4.0.0
gunicorn==21.2.0

# Development Tools
pytest==7.4.4
black==24.1.1
flake8==7.0.0
```
#### scrapers/__init__.py (new file, 19 lines)

```python
"""
Scraper modules for different scraping approaches.
"""
from .basic_scraper import BasicScraper
from .selenium_scraper import SeleniumScraper
from .jina_scraper import JinaScraper
from .firecrawl_scraper import FirecrawlScraper
from .agentql_scraper import AgentQLScraper
from .multion_scraper import MultionScraper

__all__ = [
    "BasicScraper",
    "SeleniumScraper",
    "JinaScraper",
    "FirecrawlScraper",
    "AgentQLScraper",
    "MultionScraper"
]
```
#### scrapers/agentql_scraper.py (new file, 134 lines)

```python
"""
AgentQL scraper for complex, known processes (logins, forms, etc.).
"""
from typing import Dict, Any, Optional, List
from scrapers.base_scraper import BaseScraper
from utils.retry import retry_with_backoff
from config import AGENTQL_API_KEY


class AgentQLScraper(BaseScraper):
    """
    Scraper using AgentQL for complex, known workflows.
    Best for automated processes like logging in, form submissions, etc.
    """

    def __init__(self, api_key: Optional[str] = None, **kwargs):
        """
        Initialize AgentQL scraper.

        Args:
            api_key: AgentQL API key (default from config)
            **kwargs: Additional arguments for BaseScraper
        """
        super().__init__(**kwargs)
        self.api_key = api_key or AGENTQL_API_KEY

        if not self.api_key:
            self.logger.warning("AgentQL API key not provided. Set AGENTQL_API_KEY in .env")

        try:
            import agentql
            self.client = agentql
            self.logger.info("AgentQL client initialized")
        except ImportError:
            self.logger.error("AgentQL library not installed. Install with: pip install agentql")
            self.client = None

    @retry_with_backoff(max_retries=2)
    def scrape(self, url: str, workflow: List[Dict[str, Any]], **kwargs) -> Dict[str, Any]:
        """
        Execute a defined workflow on a target URL.

        Args:
            url: Target URL
            workflow: List of workflow steps to execute
            **kwargs: Additional parameters

        Returns:
            Dictionary containing workflow results
        """
        if not self.client:
            return {
                "url": url,
                "error": "AgentQL client not initialized",
                "success": False
            }

        self.logger.info(f"Executing AgentQL workflow on {url}")
        self.rate_limiter.wait()

        # Placeholder implementation - actual AgentQL API may vary
        # This demonstrates the intended workflow structure

        results = []

        try:
            for step in workflow:
                action = step.get("action")
                params = step.get("params", {})

                self.logger.info(f"Executing step: {action}")

                # Example workflow actions
                if action == "navigate":
                    result = {"action": action, "url": params.get("url")}
                elif action == "fill_form":
                    result = {"action": action, "field": params.get("field")}
                elif action == "click":
                    result = {"action": action, "element": params.get("element")}
                elif action == "extract":
                    result = {"action": action, "selector": params.get("selector")}
                else:
                    result = {"action": action, "status": "unknown"}

                results.append(result)

            return {
                "url": url,
                "workflow_results": results,
                "success": True
            }

        except Exception as e:
            self.logger.error(f"AgentQL workflow failed for {url}: {str(e)}")
            return {
                "url": url,
                "error": str(e),
                "partial_results": results,
                "success": False
            }

    def login_workflow(
        self,
        url: str,
        username: str,
        password: str,
        username_field: str = "input[name='username']",
        password_field: str = "input[name='password']",
        submit_button: str = "button[type='submit']"
    ) -> Dict[str, Any]:
        """
        Execute a login workflow.

        Args:
            url: Login page URL
            username: Username credential
            password: Password credential
            username_field: CSS selector for username field
            password_field: CSS selector for password field
            submit_button: CSS selector for submit button

        Returns:
            Login workflow results
        """
        workflow = [
            {"action": "navigate", "params": {"url": url}},
            {"action": "fill_form", "params": {"field": username_field, "value": username}},
            {"action": "fill_form", "params": {"field": password_field, "value": password}},
            {"action": "click", "params": {"element": submit_button}},
            {"action": "wait", "params": {"seconds": 2}}
        ]

        return self.scrape(url, workflow)
```
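The placeholder dispatch loop in `scrape()` can be run in isolation; this stand-alone sketch reproduces its action-to-record mapping (`run_workflow` is a hypothetical name, and the sample steps are illustrative):

```python
# Stand-alone sketch of the workflow dispatcher in AgentQLScraper.scrape():
# each step maps its "action" to a small result record; unrecognized
# actions fall through to a {"status": "unknown"} record.
def run_workflow(workflow):
    results = []
    for step in workflow:
        action = step.get("action")
        params = step.get("params", {})
        if action == "navigate":
            results.append({"action": action, "url": params.get("url")})
        elif action == "fill_form":
            results.append({"action": action, "field": params.get("field")})
        elif action == "click":
            results.append({"action": action, "element": params.get("element")})
        elif action == "extract":
            results.append({"action": action, "selector": params.get("selector")})
        else:
            results.append({"action": action, "status": "unknown"})
    return results

steps = [
    {"action": "navigate", "params": {"url": "https://example.com/login"}},
    {"action": "click", "params": {"element": "#submit"}},
    {"action": "wait", "params": {"seconds": 2}},  # no handler → "unknown"
]
print([r["action"] for r in run_workflow(steps)])
# → ['navigate', 'click', 'wait']
```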
#### scrapers/base_scraper.py (new file, 77 lines)

```python
"""
Base scraper class with common functionality.
"""
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional
from utils.logger import setup_logger
from utils.rate_limiter import RateLimiter
from config import RATE_LIMIT_DELAY


class BaseScraper(ABC):
    """
    Abstract base class for all scrapers.
    Provides common functionality and enforces interface consistency.
    """

    def __init__(self, rate_limit: Optional[float] = None):
        """
        Initialize base scraper.

        Args:
            rate_limit: Delay between requests in seconds (default from config)
        """
        self.logger = setup_logger(self.__class__.__name__)
        self.rate_limiter = RateLimiter(
            min_delay=rate_limit or RATE_LIMIT_DELAY,
            max_delay=(rate_limit or RATE_LIMIT_DELAY) * 2
        )

    @abstractmethod
    def scrape(self, url: str, **kwargs) -> Dict[str, Any]:
        """
        Main scraping method to be implemented by subclasses.

        Args:
            url: Target URL to scrape
            **kwargs: Additional scraping parameters

        Returns:
            Dictionary containing scraped data
        """
        pass

    def validate_data(self, data: Dict[str, Any], required_fields: list) -> bool:
        """
        Validate that scraped data contains required fields.

        Args:
            data: Data to validate
            required_fields: List of required field names

        Returns:
            True if valid, False otherwise
        """
        missing_fields = [field for field in required_fields if field not in data]

        if missing_fields:
            self.logger.warning(f"Missing required fields: {missing_fields}")
            return False

        return True

    def cleanup(self):
        """
        Cleanup method for releasing resources.
        Override in subclasses if needed.
        """
        pass

    def __enter__(self):
        """Context manager entry."""
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        self.cleanup()
```
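A minimal stand-alone sketch of the contract this base class enforces: subclasses supply `scrape()`, and the context-manager protocol guarantees `cleanup()` runs on exit. The logger and rate limiter are omitted here to keep the sketch self-contained, and `MiniBaseScraper`/`EchoScraper` are hypothetical names:

```python
# Minimal sketch of the BaseScraper contract (logger/rate limiter omitted).
from abc import ABC, abstractmethod
from typing import Any, Dict


class MiniBaseScraper(ABC):
    @abstractmethod
    def scrape(self, url: str, **kwargs) -> Dict[str, Any]:
        ...

    def cleanup(self):
        pass  # subclasses override to release sessions, drivers, etc.

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.cleanup()  # always runs, even if the block raised


class EchoScraper(MiniBaseScraper):
    def scrape(self, url: str, **kwargs) -> Dict[str, Any]:
        return {"url": url, "success": True}


with EchoScraper() as s:
    print(s.scrape("https://example.com")["success"])
# → True
```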
#### scrapers/basic_scraper.py (new file, 115 lines)

```python
"""
Basic scraper using requests and BeautifulSoup for static websites.
"""
import requests
from bs4 import BeautifulSoup
from typing import Dict, Any, Optional
from requests.exceptions import RequestException, Timeout
from scrapers.base_scraper import BaseScraper
from utils.retry import retry_with_backoff
from config import DEFAULT_HEADERS, TIMEOUT


class BasicScraper(BaseScraper):
    """
    Scraper for static websites using requests and BeautifulSoup.
    """

    def __init__(self, headers: Optional[Dict[str, str]] = None, **kwargs):
        """
        Initialize basic scraper.

        Args:
            headers: Custom HTTP headers (default from config)
            **kwargs: Additional arguments for BaseScraper
        """
        super().__init__(**kwargs)
        self.headers = headers or DEFAULT_HEADERS
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    @retry_with_backoff(
        max_retries=3,
        exceptions=(RequestException, Timeout)
    )
    def scrape(self, url: str, parser: str = "lxml", **kwargs) -> Dict[str, Any]:
        """
        Scrape a static website.

        Args:
            url: Target URL to scrape
            parser: HTML parser to use (default: lxml)
            **kwargs: Additional parameters for requests.get()

        Returns:
            Dictionary containing status, HTML content, and BeautifulSoup object
        """
        self.logger.info(f"Scraping URL: {url}")
        self.rate_limiter.wait()

        try:
            # Pop timeout out of kwargs so it is not passed twice
            timeout = kwargs.pop('timeout', TIMEOUT)
            response = self.session.get(url, timeout=timeout, **kwargs)
            response.raise_for_status()

            soup = BeautifulSoup(response.content, parser)

            return {
                "url": url,
                "status_code": response.status_code,
                "html": response.text,
                "soup": soup,
                "headers": dict(response.headers),
                "success": True
            }

        except RequestException as e:
            self.logger.error(f"Request failed for {url}: {str(e)}")
            return {
                "url": url,
                "error": str(e),
                "success": False
            }

    def extract_text(self, soup: BeautifulSoup, selector: str) -> list:
        """
        Extract text from elements matching a CSS selector.

        Args:
            soup: BeautifulSoup object
            selector: CSS selector

        Returns:
            List of text content from matched elements
        """
        elements = soup.select(selector)
        return [elem.get_text(strip=True) for elem in elements]

    def extract_links(self, soup: BeautifulSoup, base_url: str = "") -> list:
        """
        Extract all links from the page.

        Args:
            soup: BeautifulSoup object
            base_url: Base URL for resolving relative links

        Returns:
            List of absolute URLs
        """
        from urllib.parse import urljoin

        links = []
        for link in soup.find_all('a', href=True):
            absolute_url = urljoin(base_url, link['href'])
            links.append(absolute_url)

        return links

    def cleanup(self):
        """Close the requests session."""
        self.session.close()
        self.logger.info("Session closed")
```
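The link resolution inside `extract_links()` is just `urljoin`; a quick demonstration of how relative and absolute `href` values resolve against a base URL (the quotes.toscrape.com paths are illustrative):

```python
# urljoin resolves hrefs exactly as extract_links() does above:
# root-relative, directory-relative, and absolute URLs all work.
from urllib.parse import urljoin

base = "http://quotes.toscrape.com/page/1/"
print(urljoin(base, "/login"))                 # → http://quotes.toscrape.com/login
print(urljoin(base, "../2/"))                  # → http://quotes.toscrape.com/page/2/
print(urljoin(base, "https://example.com/x"))  # → https://example.com/x
```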
#### scrapers/firecrawl_scraper.py (new file, 138 lines)

```python
"""
Firecrawl scraper for deep web crawling and hierarchical content extraction.
"""
from typing import Dict, Any, Optional, List
from scrapers.base_scraper import BaseScraper
from utils.retry import retry_with_backoff
from config import FIRECRAWL_API_KEY


class FirecrawlScraper(BaseScraper):
    """
    Scraper using Firecrawl for deep web content extraction.
    Preferred for crawling deep web content or when data depth is critical.
    """

    def __init__(self, api_key: Optional[str] = None, **kwargs):
        """
        Initialize Firecrawl scraper.

        Args:
            api_key: Firecrawl API key (default from config)
            **kwargs: Additional arguments for BaseScraper
        """
        super().__init__(**kwargs)
        self.api_key = api_key or FIRECRAWL_API_KEY

        if not self.api_key:
            self.logger.warning("Firecrawl API key not provided. Set FIRECRAWL_API_KEY in .env")

        try:
            from firecrawl import FirecrawlApp
            self.client = FirecrawlApp(api_key=self.api_key) if self.api_key else None
        except ImportError:
            self.logger.error("Firecrawl library not installed. Install with: pip install firecrawl-py")
            self.client = None

    @retry_with_backoff(max_retries=3)
    def scrape(self, url: str, **kwargs) -> Dict[str, Any]:
        """
        Scrape a single URL using Firecrawl.

        Args:
            url: Target URL to scrape
            **kwargs: Additional parameters for Firecrawl

        Returns:
            Dictionary containing scraped content and metadata
        """
        if not self.client:
            return {
                "url": url,
                "error": "Firecrawl client not initialized",
                "success": False
            }

        self.logger.info(f"Scraping URL with Firecrawl: {url}")
        self.rate_limiter.wait()

        try:
            result = self.client.scrape_url(url, params=kwargs)

            return {
                "url": url,
                "content": result.get("content", ""),
                "markdown": result.get("markdown", ""),
                "metadata": result.get("metadata", {}),
                "success": True
            }

        except Exception as e:
            self.logger.error(f"Firecrawl scraping failed for {url}: {str(e)}")
            return {
                "url": url,
                "error": str(e),
                "success": False
            }

    def crawl(
        self,
        url: str,
        max_depth: int = 2,
        max_pages: int = 10,
        include_patterns: Optional[List[str]] = None,
        exclude_patterns: Optional[List[str]] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """
        Crawl a website hierarchically using Firecrawl.

        Args:
            url: Starting URL for the crawl
            max_depth: Maximum crawl depth
            max_pages: Maximum number of pages to crawl
            include_patterns: URL patterns to include
            exclude_patterns: URL patterns to exclude
            **kwargs: Additional parameters

        Returns:
            Dictionary containing all crawled pages and their content
        """
        if not self.client:
            return {
                "url": url,
                "error": "Firecrawl client not initialized",
                "success": False
            }

        self.logger.info(f"Starting crawl from {url} (max_depth={max_depth}, max_pages={max_pages})")

        crawl_params = {
            "maxDepth": max_depth,
            "limit": max_pages
        }

        if include_patterns:
            crawl_params["includePaths"] = include_patterns

        if exclude_patterns:
            crawl_params["excludePaths"] = exclude_patterns

        try:
            result = self.client.crawl_url(url, params=crawl_params)

            return {
                "url": url,
                "pages": result.get("data", []),
                "total_pages": len(result.get("data", [])),
                "success": True
            }

        except Exception as e:
            self.logger.error(f"Firecrawl crawling failed for {url}: {str(e)}")
            return {
                "url": url,
                "error": str(e),
                "success": False
            }
```
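The translation from `crawl()`'s Python arguments to Firecrawl's camelCase parameter names can be isolated as a small helper; `build_crawl_params` below is a hypothetical name that reproduces the dictionary built above:

```python
# Sketch of the parameter mapping in FirecrawlScraper.crawl():
# snake_case arguments become camelCase keys, and pattern lists are
# only included when provided.
def build_crawl_params(max_depth=2, max_pages=10,
                       include_patterns=None, exclude_patterns=None):
    params = {"maxDepth": max_depth, "limit": max_pages}
    if include_patterns:
        params["includePaths"] = include_patterns
    if exclude_patterns:
        params["excludePaths"] = exclude_patterns
    return params

print(build_crawl_params(max_depth=3, include_patterns=["/blog/*"]))
# → {'maxDepth': 3, 'limit': 10, 'includePaths': ['/blog/*']}
```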
105
scrapers/jina_scraper.py
Normal file
```python
"""
Jina AI scraper for AI-driven structured text extraction.
"""
from typing import Dict, Any, Optional

import requests

from scrapers.base_scraper import BaseScraper
from utils.retry import retry_with_backoff
from config import JINA_API_KEY, TIMEOUT


class JinaScraper(BaseScraper):
    """
    Scraper using Jina AI for intelligent text extraction and structuring.
    Best for structured and semi-structured data with AI-driven pipelines.
    """

    def __init__(self, api_key: Optional[str] = None, **kwargs):
        """
        Initialize Jina scraper.

        Args:
            api_key: Jina API key (default from config)
            **kwargs: Additional arguments for BaseScraper
        """
        super().__init__(**kwargs)
        self.api_key = api_key or JINA_API_KEY

        if not self.api_key:
            self.logger.warning("Jina API key not provided. Set JINA_API_KEY in .env")

        self.base_url = "https://r.jina.ai"

    @retry_with_backoff(max_retries=3)
    def scrape(self, url: str, return_format: str = "markdown", **kwargs) -> Dict[str, Any]:
        """
        Scrape and extract text using Jina AI.

        Args:
            url: Target URL to scrape
            return_format: Output format (markdown, text, html)
            **kwargs: Additional parameters

        Returns:
            Dictionary containing extracted text and metadata
        """
        self.logger.info(f"Scraping URL with Jina: {url}")
        self.rate_limiter.wait()

        # Jina AI reader endpoint
        jina_url = f"{self.base_url}/{url}"

        headers = {
            "X-Return-Format": return_format
        }

        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"

        try:
            response = requests.get(
                jina_url,
                headers=headers,
                timeout=kwargs.get('timeout', TIMEOUT)
            )
            response.raise_for_status()

            return {
                "url": url,
                "content": response.text,
                "format": return_format,
                "status_code": response.status_code,
                "success": True
            }

        except requests.RequestException as e:
            self.logger.error(f"Jina scraping failed for {url}: {str(e)}")
            return {
                "url": url,
                "error": str(e),
                "success": False
            }

    def extract_structured_data(
        self,
        url: str,
        schema: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a URL using Jina's AI capabilities.

        Args:
            url: Target URL
            schema: Optional schema for structured extraction

        Returns:
            Structured data dictionary
        """
        result = self.scrape(url, return_format="json")

        if result.get("success"):
            # Additional processing based on schema if provided
            self.logger.info(f"Successfully extracted structured data from {url}")

        return result
```
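`JinaScraper.scrape` simply prefixes the target URL with the `r.jina.ai` reader endpoint and sets a format header plus optional bearer auth. A minimal standalone sketch of that request construction (the helper name `build_jina_request` is illustrative, not part of the codebase):

```python
def build_jina_request(url: str, api_key: str = None, return_format: str = "markdown"):
    """Mirror JinaScraper.scrape: prefix the target URL with the reader
    endpoint and build the format/auth headers."""
    jina_url = f"https://r.jina.ai/{url}"
    headers = {"X-Return-Format": return_format}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return jina_url, headers


jina_url, headers = build_jina_request("https://sekai.one/piece/saga-7", api_key="jina_xxx")
print(jina_url)  # https://r.jina.ai/https://sekai.one/piece/saga-7
```

The resulting `(jina_url, headers)` pair is exactly what the scraper passes to `requests.get`.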
143
scrapers/multion_scraper.py
Normal file

```python
"""
Multion scraper for unknown/exploratory tasks with AI-driven navigation.
"""
from typing import Dict, Any, Optional

from scrapers.base_scraper import BaseScraper
from utils.retry import retry_with_backoff
from config import MULTION_API_KEY


class MultionScraper(BaseScraper):
    """
    Scraper using Multion for exploratory and unpredictable tasks.
    Best for tasks like finding cheapest flights, purchasing tickets, etc.
    """

    def __init__(self, api_key: Optional[str] = None, **kwargs):
        """
        Initialize Multion scraper.

        Args:
            api_key: Multion API key (default from config)
            **kwargs: Additional arguments for BaseScraper
        """
        super().__init__(**kwargs)
        self.api_key = api_key or MULTION_API_KEY

        if not self.api_key:
            self.logger.warning("Multion API key not provided. Set MULTION_API_KEY in .env")

        try:
            import multion
            self.client = multion
            if self.api_key:
                self.client.login(api_key=self.api_key)
            self.logger.info("Multion client initialized")
        except ImportError:
            self.logger.error("Multion library not installed. Install with: pip install multion")
            self.client = None

    @retry_with_backoff(max_retries=2)
    def scrape(
        self,
        url: str,
        task: str,
        max_steps: int = 10,
        **kwargs
    ) -> Dict[str, Any]:
        """
        Execute an exploratory task using Multion AI.

        Args:
            url: Starting URL
            task: Natural language description of the task
            max_steps: Maximum number of steps to execute
            **kwargs: Additional parameters

        Returns:
            Dictionary containing task results
        """
        if not self.client:
            return {
                "url": url,
                "task": task,
                "error": "Multion client not initialized",
                "success": False
            }

        self.logger.info(f"Executing Multion task: {task} on {url}")
        self.rate_limiter.wait()

        try:
            # Placeholder implementation - actual Multion API may vary
            # This demonstrates the intended usage pattern

            response = {
                "url": url,
                "task": task,
                "message": "Multion task execution placeholder",
                "steps_taken": [],
                "final_result": "Task completed successfully",
                "success": True
            }

            self.logger.info(f"Multion task completed: {task}")
            return response

        except Exception as e:
            self.logger.error(f"Multion task failed: {str(e)}")
            return {
                "url": url,
                "task": task,
                "error": str(e),
                "success": False
            }

    def find_best_deal(
        self,
        search_query: str,
        website: Optional[str] = None,
        filters: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        """
        Find the best deal for a product or service.

        Args:
            search_query: What to search for
            website: Optional specific website to search
            filters: Optional filters (price range, features, etc.)

        Returns:
            Best deal information
        """
        task = f"Find the best deal for: {search_query}"

        if filters:
            filter_str = ", ".join([f"{k}: {v}" for k, v in filters.items()])
            task += f" with filters: {filter_str}"

        url = website or "https://www.google.com"

        return self.scrape(url, task)

    def book_or_purchase(
        self,
        item: str,
        criteria: str,
        website: str
    ) -> Dict[str, Any]:
        """
        Attempt to book or purchase an item based on criteria.

        Args:
            item: What to book/purchase
            criteria: Purchase criteria (e.g., "cheapest", "earliest")
            website: Website to perform the action on

        Returns:
            Booking/purchase results
        """
        task = f"Book/purchase {item} with criteria: {criteria}"

        return self.scrape(website, task)
```
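The interesting part of `find_best_deal` is how it flattens a query plus optional filters into the natural-language task string handed to the agent. A self-contained sketch of that string construction (function name `build_deal_task` is illustrative):

```python
def build_deal_task(search_query, filters=None):
    """Mirror MultionScraper.find_best_deal: turn a query and optional
    filter dict into a single natural-language task string."""
    task = f"Find the best deal for: {search_query}"
    if filters:
        # Dict insertion order is preserved, so filter order is deterministic
        filter_str = ", ".join([f"{k}: {v}" for k, v in filters.items()])
        task += f" with filters: {filter_str}"
    return task


print(build_deal_task("Paris-Tokyo flight", {"max_price": "600 EUR", "stops": "direct"}))
# Find the best deal for: Paris-Tokyo flight with filters: max_price: 600 EUR, stops: direct
```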
178
scrapers/selenium_scraper.py
Normal file

```python
"""
Selenium scraper for JavaScript-heavy and dynamic websites.
"""
from typing import Dict, Any, Optional

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import (
    TimeoutException,
    NoSuchElementException,
    WebDriverException
)
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

from scrapers.base_scraper import BaseScraper
from utils.retry import retry_with_backoff
from config import SELENIUM_HEADLESS, SELENIUM_IMPLICIT_WAIT, USER_AGENT


class SeleniumScraper(BaseScraper):
    """
    Scraper for dynamic websites using Selenium WebDriver.
    """

    def __init__(self, headless: bool = SELENIUM_HEADLESS, **kwargs):
        """
        Initialize Selenium scraper.

        Args:
            headless: Run browser in headless mode
            **kwargs: Additional arguments for BaseScraper
        """
        super().__init__(**kwargs)
        self.headless = headless
        self.driver = None
        self._initialize_driver()

    def _initialize_driver(self):
        """Initialize Chrome WebDriver with appropriate options."""
        chrome_options = Options()

        if self.headless:
            chrome_options.add_argument("--headless=new")

        chrome_options.add_argument(f"user-agent={USER_AGENT}")
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option("useAutomationExtension", False)

        try:
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=chrome_options)
            self.driver.implicitly_wait(SELENIUM_IMPLICIT_WAIT)
            self.logger.info("Chrome WebDriver initialized successfully")
        except WebDriverException as e:
            self.logger.error(f"Failed to initialize WebDriver: {str(e)}")
            raise

    @retry_with_backoff(
        max_retries=2,
        exceptions=(TimeoutException, WebDriverException)
    )
    def scrape(self, url: str, wait_for: Optional[str] = None, **kwargs) -> Dict[str, Any]:
        """
        Scrape a dynamic website using Selenium.

        Args:
            url: Target URL to scrape
            wait_for: CSS selector to wait for before returning
            **kwargs: Additional parameters

        Returns:
            Dictionary containing page source and BeautifulSoup object
        """
        self.logger.info(f"Scraping URL with Selenium: {url}")
        self.rate_limiter.wait()

        try:
            self.driver.get(url)

            # Wait for specific element if provided
            if wait_for:
                timeout = kwargs.get('timeout', 10)
                WebDriverWait(self.driver, timeout).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, wait_for))
                )

            page_source = self.driver.page_source
            soup = BeautifulSoup(page_source, 'lxml')

            return {
                "url": url,
                "html": page_source,
                "soup": soup,
                "title": self.driver.title,
                "current_url": self.driver.current_url,
                "success": True
            }

        except (TimeoutException, WebDriverException) as e:
            self.logger.error(f"Selenium scraping failed for {url}: {str(e)}")
            return {
                "url": url,
                "error": str(e),
                "success": False
            }

    def click_element(self, selector: str, by: By = By.CSS_SELECTOR, timeout: int = 10):
        """
        Click an element on the page.

        Args:
            selector: Element selector
            by: Selenium By strategy (default: CSS_SELECTOR)
            timeout: Wait timeout in seconds
        """
        try:
            element = WebDriverWait(self.driver, timeout).until(
                EC.element_to_be_clickable((by, selector))
            )
            element.click()
            self.logger.info(f"Clicked element: {selector}")
        except (TimeoutException, NoSuchElementException) as e:
            self.logger.error(f"Failed to click element {selector}: {str(e)}")
            raise

    def fill_form(self, selector: str, text: str, by: By = By.CSS_SELECTOR):
        """
        Fill a form field with text.

        Args:
            selector: Element selector
            text: Text to input
            by: Selenium By strategy
        """
        try:
            element = self.driver.find_element(by, selector)
            element.clear()
            element.send_keys(text)
            self.logger.info(f"Filled form field: {selector}")
        except NoSuchElementException as e:
            self.logger.error(f"Form field not found {selector}: {str(e)}")
            raise

    def execute_script(self, script: str):
        """
        Execute JavaScript in the browser.

        Args:
            script: JavaScript code to execute

        Returns:
            Result of script execution
        """
        return self.driver.execute_script(script)

    def take_screenshot(self, filepath: str):
        """
        Take a screenshot of the current page.

        Args:
            filepath: Path to save the screenshot
        """
        self.driver.save_screenshot(filepath)
        self.logger.info(f"Screenshot saved to {filepath}")

    def cleanup(self):
        """Quit the WebDriver and cleanup resources."""
        if self.driver:
            self.driver.quit()
            self.logger.info("WebDriver closed")
```
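The scrapers decorate their `scrape` methods with `retry_with_backoff` from `utils/retry.py`, which is not included in this chunk. A hypothetical minimal sketch of what such a decorator does, assuming exponential delays between attempts and a configurable exception tuple (signature and behavior are assumptions, not the actual `utils.retry` implementation):

```python
import functools
import time


def retry_with_backoff(max_retries=3, base_delay=0.1, exceptions=(Exception,)):
    """Hypothetical sketch: retry the wrapped call up to max_retries times,
    sleeping base_delay * 2**attempt between failures, re-raising at the end."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries:
                        raise
                    time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
        return wrapper
    return decorator


calls = []

@retry_with_backoff(max_retries=2, base_delay=0)
def flaky():
    """Fails twice, then succeeds — exercised by the retry loop."""
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"


print(flaky(), len(calls))  # ok 3
```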
352
sekai_one_scraper.py
Normal file

```python
"""
Updated scraper for sekai.one using the real URLs.
Based on the actual site structure: https://sekai.one/piece/saga-7
"""

from scrapers.selenium_scraper import SeleniumScraper
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import re
import json
from utils.logger import setup_logger
from data_processors.storage import DataStorage

logger = setup_logger(__name__)


class SekaiOneScraper:
    """
    Scraper optimized for sekai.one.
    Extracts the real video URLs from episode pages.
    """

    def __init__(self):
        self.base_url = "https://sekai.one"
        self.logger = logger

    def get_episode_url(self, anime: str = "piece", saga: int = 7, episode: int = 527) -> str:
        """
        Build the URL of an episode page.

        Args:
            anime: Anime name (piece = One Piece)
            saga: Saga number
            episode: Episode number

        Returns:
            Page URL
        """
        # Format: https://sekai.one/piece/saga-7
        return f"{self.base_url}/{anime}/saga-{saga}"

    def extract_video_url(self, page_url: str, episode_number: int) -> dict:
        """
        Extract the real video URL from a sekai.one page.

        Args:
            page_url: Page URL (e.g. https://sekai.one/piece/saga-7)
            episode_number: Episode number to fetch

        Returns:
            Dict with the video information
        """
        self.logger.info(f"Extracting from: {page_url}")
        self.logger.info(f"Requested episode: {episode_number}")

        result = {
            "page_url": page_url,
            "episode": episode_number,
            "video_url": None,
            "success": False
        }

        try:
            with SeleniumScraper(headless=False) as scraper:
                # Load the page
                self.logger.info("Loading the page...")
                page_result = scraper.scrape(page_url)

                if not page_result["success"]:
                    result["error"] = "Failed to load the page"
                    return result

                self.logger.info(f"Page loaded: {page_result['title']}")

                # Wait for the episode list to load
                time.sleep(3)

                # Click on the episode
                self.logger.info(f"Looking for episode {episode_number}...")

                # Find the episode button (based on the site's HTML structure)
                try:
                    # The site probably uses divs or buttons holding the number,
                    # so search by text
                    episode_elements = scraper.driver.find_elements(
                        By.XPATH,
                        f"//*[contains(text(), '{episode_number}')]"
                    )

                    self.logger.info(f"Found {len(episode_elements)} elements containing '{episode_number}'")

                    # Find the right clickable element
                    episode_button = None
                    for elem in episode_elements:
                        try:
                            # Check whether it is a clickable element (div, button, a)
                            tag_name = elem.tag_name.lower()
                            if tag_name in ['div', 'button', 'a', 'span']:
                                text = elem.text.strip()
                                # Make sure the text is exactly the number (not 5270, for example)
                                if text == str(episode_number) or text == f"mini {episode_number}":
                                    episode_button = elem
                                    self.logger.info(f"Episode button found: {text} ({tag_name})")
                                    break
                        except Exception:
                            continue

                    if not episode_button:
                        self.logger.error(f"Button for episode {episode_number} not found")
                        result["error"] = f"Episode {episode_number} not found on the page"

                        # Take a screenshot for debugging
                        scraper.take_screenshot("data/sekai_episode_not_found.png")
                        self.logger.info("Screenshot: data/sekai_episode_not_found.png")

                        return result

                    # Click on the episode
                    self.logger.info("Clicking on the episode...")
                    scraper.driver.execute_script("arguments[0].scrollIntoView(true);", episode_button)
                    time.sleep(1)
                    episode_button.click()

                    # Wait for the video to load
                    self.logger.info("Waiting for the video to load...")
                    time.sleep(5)

                    # Take a screenshot after the click
                    scraper.take_screenshot(f"data/sekai_episode_{episode_number}_loaded.png")

                    # Method 1: look inside video/source tags
                    video_url = self._extract_from_video_tag(scraper)

                    if video_url:
                        result["video_url"] = video_url
                        result["success"] = True
                        result["method"] = "video_tag"
                        self.logger.info(f"✓ Video URL found (video tag): {video_url}")
                        return result

                    # Method 2: look inside scripts
                    video_url = self._extract_from_scripts(scraper)

                    if video_url:
                        result["video_url"] = video_url
                        result["success"] = True
                        result["method"] = "script"
                        self.logger.info(f"✓ Video URL found (script): {video_url}")
                        return result

                    # Method 3: analyze the DOM for known patterns
                    video_url = self._extract_from_dom(scraper, episode_number)

                    if video_url:
                        result["video_url"] = video_url
                        result["success"] = True
                        result["method"] = "dom_analysis"
                        self.logger.info(f"✓ Video URL found (DOM): {video_url}")
                        return result

                    # No method worked
                    self.logger.warning("No video URL found with the automatic methods")
                    result["error"] = "Video URL not detected automatically"

                    # Save the HTML for manual analysis
                    with open("data/sekai_page_source.html", "w", encoding="utf-8") as f:
                        f.write(scraper.driver.page_source)
                    self.logger.info("HTML saved: data/sekai_page_source.html")

                except Exception as e:
                    self.logger.error(f"Error while clicking the episode: {str(e)}")
                    result["error"] = str(e)
                    scraper.take_screenshot("data/sekai_error.png")

        except Exception as e:
            self.logger.error(f"General error: {str(e)}")
            result["error"] = str(e)

        return result

    def _extract_from_video_tag(self, scraper) -> str:
        """Extract the URL from <video> tags."""
        try:
            videos = scraper.driver.find_elements(By.TAG_NAME, 'video')

            for video in videos:
                # Check the src attribute
                src = video.get_attribute('src')
                if src and self._is_valid_video_url(src):
                    return src

                # Check the sources
                sources = video.find_elements(By.TAG_NAME, 'source')
                for source in sources:
                    src = source.get_attribute('src')
                    if src and self._is_valid_video_url(src):
                        return src
        except Exception as e:
            self.logger.debug(f"Video tag extraction error: {str(e)}")

        return None

    def _extract_from_scripts(self, scraper) -> str:
        """Extract the URL from JavaScript snippets."""
        try:
            soup = BeautifulSoup(scraper.driver.page_source, 'lxml')
            scripts = soup.find_all('script')

            # Patterns for detecting video URLs
            patterns = [
                r'https?://[^\s"\']+\.mugiwara\.xyz[^\s"\']*\.mp4',
                r'https?://\d+\.mugiwara\.xyz[^\s"\']*',
                r'"src":\s*"([^"]*\.mp4)"',
                r'"file":\s*"([^"]*\.mp4)"',
                r'video\.src\s*=\s*["\']([^"\']+)["\']',
            ]

            for script in scripts:
                content = script.string or ''

                for pattern in patterns:
                    matches = re.findall(pattern, content)
                    for match in matches:
                        if self._is_valid_video_url(match):
                            return match
        except Exception as e:
            self.logger.debug(f"Script extraction error: {str(e)}")

        return None

    def _extract_from_dom(self, scraper, episode_number: int) -> str:
        """
        Build the URL from the known patterns.
        Format: https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
        """
        try:
            # Known pattern of the site
            # Try different servers
            servers = [17, 18, 19, 20]

            # The saga number may be in the page URL
            current_url = scraper.driver.current_url
            saga_match = re.search(r'saga-(\d+)', current_url)

            if saga_match:
                saga = saga_match.group(1)

                for server in servers:
                    # Format: https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
                    video_url = f"https://{server}.mugiwara.xyz/op/saga-{saga}/hd/{episode_number}.mp4"
                    self.logger.info(f"Testing pattern: {video_url}")
                    return video_url  # Return the first pattern

        except Exception as e:
            self.logger.debug(f"DOM extraction error: {str(e)}")

        return None

    def _is_valid_video_url(self, url: str) -> bool:
        """Check whether a URL looks like a valid video URL."""
        if not url:
            return False

        # Must be a full URL
        if not url.startswith('http'):
            return False

        # Must contain mugiwara.xyz or be an .mp4 file
        if 'mugiwara.xyz' in url or url.endswith('.mp4'):
            return True

        return False

    def get_one_piece_527(self) -> dict:
        """
        Fetch One Piece episode 527 specifically.
        """
        self.logger.info("=" * 80)
        self.logger.info("Extracting One Piece - Episode 527 (Saga 7)")
        self.logger.info("=" * 80)

        page_url = self.get_episode_url(anime="piece", saga=7, episode=527)
        result = self.extract_video_url(page_url, episode_number=527)

        # If the URL was not found automatically, fall back to the known pattern
        if not result["success"]:
            self.logger.info("Using the known pattern...")
            result["video_url"] = "https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"
            result["success"] = True
            result["method"] = "known_pattern"
            result["note"] = "URL built from the site's known pattern"

        # Add the proxy URL
        if result["video_url"]:
            from urllib.parse import quote
            proxy_url = f"http://localhost:8080/proxy?url={quote(result['video_url'])}"
            result["proxy_url"] = proxy_url

            self.logger.info(f"\n✓ Direct URL: {result['video_url']}")
            self.logger.info(f"✓ Proxy URL: {result['proxy_url']}")

        # Save the results
        storage = DataStorage()
        storage.save_json(result, "one_piece_527_extraction.json")

        return result


def main():
    """Entry point."""
    scraper = SekaiOneScraper()

    print("\n" + "=" * 80)
    print("SEKAI.ONE VIDEO URL EXTRACTOR")
    print("=" * 80)
    print("\nExtracting One Piece - Episode 527 (Saga 7)")
    print("=" * 80 + "\n")

    result = scraper.get_one_piece_527()

    print("\n" + "=" * 80)
    print("RESULT")
    print("=" * 80)

    if result["success"]:
        print(f"\n✓ SUCCESS!")
        print(f"\n📺 Episode     : {result['episode']}")
        print(f"🌐 Source page : {result['page_url']}")
        print(f"🎬 Video URL   : {result['video_url']}")
        print(f"🔧 Method      : {result.get('method', 'N/A')}")

        if result.get('proxy_url'):
            print(f"\n🚀 PROXY URL (use this one):")
            print(f"   {result['proxy_url']}")
            print(f"\n💡 This URL can be used in:")
            print(f"   - A video player (VLC, browser)")
            print(f"   - An HTML <video> tag")
            print(f"   - wget/curl for downloading")
    else:
        print(f"\n✗ FAILURE")
        print(f"❌ Error: {result.get('error', 'Unknown error')}")
        print(f"\n💡 Check the screenshots in the 'data/' folder")

    print("\n" + "=" * 80 + "\n")


if __name__ == "__main__":
    main()
```
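The fallback in `get_one_piece_527` relies on the site's known URL pattern and then wraps the direct URL for the local proxy with `urllib.parse.quote`. A standalone sketch of that construction (the helper name `build_candidate_urls` is illustrative):

```python
from urllib.parse import quote


def build_candidate_urls(saga: int, episode: int, servers=(17, 18, 19, 20)):
    """Build (direct_url, proxy_url) pairs following the known
    mugiwara.xyz pattern, one per candidate server."""
    out = []
    for server in servers:
        direct = f"https://{server}.mugiwara.xyz/op/saga-{saga}/hd/{episode}.mp4"
        # quote() percent-encodes the ':' so the URL survives as a query value
        out.append((direct, f"http://localhost:8080/proxy?url={quote(direct)}"))
    return out


direct, proxied = build_candidate_urls(7, 527)[0]
print(direct)  # https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
print(proxied)
```

Each `proxied` URL is what the proxy server expects on its `/proxy` endpoint.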
67
start_proxy.bat
Normal file

```batch
@echo off
REM Quick-start script for the video proxy

echo.
echo =========================================================================
echo                   SEKAI.ONE VIDEO PROXY SERVER
echo                 Bypassing the Referer protection
echo =========================================================================
echo.

REM Check that Python is installed
python --version >nul 2>&1
if errorlevel 1 (
    echo ERROR: Python is not installed or not on the PATH
    echo Download Python from https://www.python.org/
    pause
    exit /b 1
)

REM Check whether the virtual environment exists
if not exist "venv\" (
    echo [1/3] Creating the virtual environment...
    python -m venv venv
    if errorlevel 1 (
        echo ERROR: Could not create the virtual environment
        pause
        exit /b 1
    )
)

REM Activate the virtual environment
echo [2/3] Activating the virtual environment...
call venv\Scripts\activate.bat

REM Install the dependencies if needed
REM Note: no parentheses in the echo text below - an unescaped ")" would
REM close the surrounding if-block prematurely
if not exist "venv\Lib\site-packages\flask\" (
    echo [3/3] Installing dependencies: flask, flask-cors, requests...
    pip install flask flask-cors requests
    if errorlevel 1 (
        echo ERROR: Dependency installation failed
        pause
        exit /b 1
    )
) else (
    echo [3/3] Dependencies already installed
)

echo.
echo =========================================================================
echo                      STARTING THE PROXY SERVER
echo =========================================================================
echo.
echo The server will start on http://localhost:8080
echo.
echo Example URL:
echo http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
echo.
echo Press Ctrl+C to stop the server
echo.
echo =========================================================================
echo.

REM Start the server
python video_proxy_server.py

pause
```
62
start_proxy.sh
Normal file

```bash
#!/bin/bash

# Quick-start script for the video proxy

echo ""
echo "========================================================================="
echo "                  SEKAI.ONE VIDEO PROXY SERVER"
echo "                Bypassing the Referer protection"
echo "========================================================================="
echo ""

# Check that Python is installed
if ! command -v python3 &> /dev/null; then
    echo "ERROR: Python 3 is not installed"
    echo "Install Python 3.8+ from https://www.python.org/"
    exit 1
fi

# Create the virtual environment if needed
if [ ! -d "venv" ]; then
    echo "[1/3] Creating the virtual environment..."
    python3 -m venv venv
    if [ $? -ne 0 ]; then
        echo "ERROR: Could not create the virtual environment"
        exit 1
    fi
fi

# Activate the virtual environment
echo "[2/3] Activating the virtual environment..."
source venv/bin/activate

# Install the dependencies if needed
# (an import check is more reliable than testing a glob path like
#  venv/lib/python3*/site-packages/flask, which `[ -d ]` does not expand)
if ! python -c "import flask" >/dev/null 2>&1; then
    echo "[3/3] Installing the dependencies (Flask, etc.)..."
    pip install flask flask-cors requests
    if [ $? -ne 0 ]; then
        echo "ERROR: Dependency installation failed"
        exit 1
    fi
else
    echo "[3/3] Dependencies already installed"
fi

echo ""
echo "========================================================================="
echo "                     STARTING THE PROXY SERVER"
echo "========================================================================="
echo ""
echo "The server will start on http://localhost:8080"
echo ""
echo "Example URL:"
echo "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"
echo ""
echo "Press Ctrl+C to stop the server"
echo ""
echo "========================================================================="
echo ""

# Start the server
python video_proxy_server.py
```
352
test_proxy.py
Normal file

@@ -0,0 +1,352 @@
"""
Test script to verify that the proxy works correctly.
"""
import requests
import sys
import time
from urllib.parse import quote

# Configuration
PROXY_URL = "http://localhost:8080"
VIDEO_URL = "https://17.mugiwara.xyz/op/saga-7/hd/527.mp4"


def test_health():
    """Test 1: Check that the server is up."""
    print("\n" + "=" * 80)
    print("TEST 1: Health Check")
    print("=" * 80)

    try:
        response = requests.get(f"{PROXY_URL}/health", timeout=5)

        if response.status_code == 200:
            data = response.json()
            print("✓ Server is up")
            print(f"  Service: {data.get('service')}")
            print(f"  Version: {data.get('version')}")
            return True
        else:
            print(f"✗ Error: Status {response.status_code}")
            return False

    except requests.exceptions.ConnectionError:
        print("✗ ERROR: Could not connect to the server")
        print("  Start it with: python video_proxy_server.py")
        return False

    except Exception as e:
        print(f"✗ Error: {str(e)}")
        return False


def test_info():
    """Test 2: Fetch the video metadata."""
    print("\n" + "=" * 80)
    print("TEST 2: Video Info")
    print("=" * 80)

    try:
        url = f"{PROXY_URL}/info?url={quote(VIDEO_URL)}"
        print(f"Request: {url}")

        response = requests.get(url, timeout=10)

        if response.status_code == 200:
            data = response.json()

            print("\n✓ Information retrieved:")
            print(f"  URL           : {data.get('url')}")
            print(f"  Accessible    : {data.get('accessible')}")
            print(f"  Status Code   : {data.get('status_code')}")
            print(f"  Content-Type  : {data.get('content_type')}")
            print(f"  Size          : {data.get('content_length_mb')} MB")
            print(f"  Server        : {data.get('server')}")

            return data.get('accessible', False)
        else:
            print(f"✗ Error: Status {response.status_code}")
            return False

    except Exception as e:
        print(f"✗ Error: {str(e)}")
        return False


def test_streaming():
    """Test 3: Test streaming (first bytes)."""
    print("\n" + "=" * 80)
    print("TEST 3: Video Streaming")
    print("=" * 80)

    try:
        url = f"{PROXY_URL}/proxy?url={quote(VIDEO_URL)}"
        print(f"Request: {url}")
        print("Downloading the first 1 MB...")

        response = requests.get(url, stream=True, timeout=30)

        if response.status_code in [200, 206]:
            # Download only 1 MB for the test
            chunk_count = 0
            max_chunks = 128  # 128 chunks of 8 KB = 1 MB

            start_time = time.time()

            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    chunk_count += 1
                    if chunk_count >= max_chunks:
                        break

            elapsed = time.time() - start_time
            downloaded_mb = (chunk_count * 8192) / (1024 * 1024)
            speed_mbps = (downloaded_mb / elapsed) if elapsed > 0 else 0

            print("\n✓ Streaming works!")
            print(f"  Downloaded    : {downloaded_mb:.2f} MB")
            print(f"  Time          : {elapsed:.2f} seconds")
            print(f"  Speed         : {speed_mbps:.2f} MB/s")
            print(f"  Status        : {response.status_code}")
            print(f"  Content-Type  : {response.headers.get('Content-Type')}")

            return True
        else:
            print(f"✗ Error: Status {response.status_code}")
            return False

    except Exception as e:
        print(f"✗ Error: {str(e)}")
        return False


def test_range_request():
    """Test 4: Test Range requests (seeking)."""
    print("\n" + "=" * 80)
    print("TEST 4: Range Request (Seeking)")
    print("=" * 80)

    try:
        url = f"{PROXY_URL}/proxy?url={quote(VIDEO_URL)}"

        # Request only ~100 KB from the middle of the video
        headers = {
            'Range': 'bytes=10000000-10100000'
        }

        print(f"Request with Range: {headers['Range']}")

        response = requests.get(url, headers=headers, timeout=10)

        if response.status_code == 206:  # 206 Partial Content
            content_range = response.headers.get('Content-Range')
            content_length = len(response.content)

            print("\n✓ Range request works!")
            print(f"  Status        : {response.status_code} Partial Content")
            print(f"  Content-Range : {content_range}")
            print(f"  Received size : {content_length / 1024:.2f} KB")

            return True
        else:
            print(f"⚠️ Range requests not supported (Status: {response.status_code})")
            print("  Seeking within the video may not work")
            return False

    except Exception as e:
        print(f"✗ Error: {str(e)}")
        return False


def test_direct_access():
    """Test 5: Verify that direct access still fails."""
    print("\n" + "=" * 80)
    print("TEST 5: Direct Access (expected to fail)")
    print("=" * 80)

    try:
        print(f"Attempting direct access to: {VIDEO_URL}")

        # Access without the expected Referer
        response = requests.head(VIDEO_URL, timeout=10)

        if response.status_code == 403:
            print("\n✓ Expected behavior: 403 Forbidden")
            print("  The server does protect its videos")
            return True
        else:
            print(f"⚠️ Unexpected status: {response.status_code}")
            print("  The protection may have changed")
            return False

    except Exception as e:
        print(f"✗ Error: {str(e)}")
        return False


def generate_test_html():
    """Generate a test HTML page."""
    print("\n" + "=" * 80)
    print("GENERATING THE TEST PAGE")
    print("=" * 80)

    proxy_url = f"{PROXY_URL}/proxy?url={quote(VIDEO_URL)}"

    html = f"""<!DOCTYPE html>
<html>
<head>
    <title>Video Proxy Test - One Piece 527</title>
    <meta charset="UTF-8">
    <style>
        body {{
            font-family: Arial, sans-serif;
            max-width: 1200px;
            margin: 50px auto;
            padding: 20px;
            background: #f5f5f5;
        }}
        h1 {{
            color: #333;
            text-align: center;
        }}
        .video-container {{
            background: white;
            padding: 20px;
            border-radius: 10px;
            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
            margin: 30px 0;
        }}
        video {{
            width: 100%;
            max-width: 1280px;
            height: auto;
            border-radius: 5px;
        }}
        .info {{
            background: #e8f4f8;
            padding: 15px;
            border-left: 4px solid #0066cc;
            margin: 20px 0;
        }}
        code {{
            background: #f4f4f4;
            padding: 2px 6px;
            border-radius: 3px;
            font-family: 'Courier New', monospace;
        }}
    </style>
</head>
<body>
    <h1>🎬 Video Proxy Test - One Piece Episode 527</h1>

    <div class="video-container">
        <video controls preload="metadata">
            <source src="{proxy_url}" type="video/mp4">
            Your browser does not support the HTML5 video tag.
        </video>
    </div>

    <div class="info">
        <strong>Proxy URL:</strong><br>
        <code>{proxy_url}</code>
    </div>

    <div class="info">
        <strong>Original Video URL:</strong><br>
        <code>{VIDEO_URL}</code>
    </div>

    <div class="info">
        <strong>📝 Instructions:</strong>
        <ul>
            <li>The video should load and be playable</li>
            <li>You should be able to seek (forward/backward)</li>
            <li>Volume and the other controls should work</li>
        </ul>
    </div>

    <div class="info">
        <strong>🔧 If the video does not load:</strong>
        <ol>
            <li>Check that the proxy server is running</li>
            <li>Open the developer console (F12) to inspect errors</li>
            <li>Try the proxy URL directly in a new tab</li>
        </ol>
    </div>
</body>
</html>
"""

    with open("test_video_player.html", "w", encoding="utf-8") as f:
        f.write(html)

    print("\n✓ HTML page generated: test_video_player.html")
    print("\n🌐 Open this file in your browser to test playback!")
    print("   Or visit http://localhost:8080/ for the proxy home page")


def main():
    """Run all tests."""
    print("\n")
    print("╔" + "=" * 78 + "╗")
    print("║" + " " * 30 + "VIDEO PROXY TESTS" + " " * 31 + "║")
    print("╚" + "=" * 78 + "╝")

    tests = [
        ("Health Check", test_health),
        ("Video Info", test_info),
        ("Streaming", test_streaming),
        ("Range Request", test_range_request),
        ("Direct Access", test_direct_access),
    ]

    results = []

    for test_name, test_func in tests:
        try:
            result = test_func()
            results.append((test_name, result))
        except Exception as e:
            print(f"\n✗ Unexpected error: {str(e)}")
            results.append((test_name, False))

    # Generate the test HTML page
    generate_test_html()

    # Summary
    print("\n" + "=" * 80)
    print("TEST SUMMARY")
    print("=" * 80)

    passed = sum(1 for _, result in results if result)
    total = len(results)

    for test_name, result in results:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"  {status}  {test_name}")

    print(f"\nResult: {passed}/{total} tests passed")

    if passed == total:
        print("\n🎉 All tests passed! The proxy works as expected.")
        print("\n📝 Next steps:")
        print("  1. Open test_video_player.html in your browser")
        print("  2. Check that the video plays correctly")
        print("  3. Deploy to your VPS if needed (see PROXY_GUIDE.md)")
    else:
        print("\n⚠️ Some tests failed. Check the errors above.")
        print("\n💡 Tips:")
        if not results[0][1]:  # Health check failed
            print("  - The server is not running: python video_proxy_server.py")
        else:
            print("  - Check the logs in logs/")
            print("  - Check that the video URL is correct")

    print("\n" + "=" * 80 + "\n")

    sys.exit(0 if passed == total else 1)


if __name__ == "__main__":
    main()
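The Range header exercised by `test_range_request` boils down to a little arithmetic on the server side. This standalone sketch (the helper name `apply_range` and the 200 MB total size are ours, not part of the project) shows how a `bytes=start-end` header maps to a byte slice and to the `Content-Range` value a 206 reply echoes back:

```python
import re


def apply_range(range_header: str, total_size: int):
    """Parse a simple 'bytes=start-end' Range header and return
    (start, end, content_range) as a server would for a 206 reply."""
    m = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header)
    if not m:
        raise ValueError(f"Unsupported Range header: {range_header}")
    start = int(m.group(1))
    # An open-ended range ("bytes=100-") runs to the last byte
    end = int(m.group(2)) if m.group(2) else total_size - 1
    end = min(end, total_size - 1)
    content_range = f"bytes {start}-{end}/{total_size}"
    return start, end, content_range


# The exact header sent by test_range_request above,
# against a hypothetical 200 MB file:
start, end, cr = apply_range("bytes=10000000-10100000", 200_000_000)
print(cr)               # bytes 10000000-10100000/200000000
print(end - start + 1)  # 100001 bytes in the slice
```

Note that Range bounds are inclusive on both ends, which is why the slice is 100 001 bytes rather than 100 000.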
4
tests/__init__.py
Normal file

@@ -0,0 +1,4 @@
"""
Test suite for the web scraping project.
"""
64
tests/test_basic_scraper.py
Normal file

@@ -0,0 +1,64 @@
"""
Tests for BasicScraper.
"""
import pytest
from scrapers.basic_scraper import BasicScraper


def test_basic_scraper_initialization():
    """Test BasicScraper initialization."""
    scraper = BasicScraper()
    assert scraper is not None
    assert scraper.session is not None
    scraper.cleanup()


def test_basic_scrape_success():
    """Test successful scraping of a static page."""
    with BasicScraper() as scraper:
        result = scraper.scrape("http://quotes.toscrape.com/")

        assert result["success"] is True
        assert result["status_code"] == 200
        assert "html" in result
        assert "soup" in result
        assert result["soup"] is not None


def test_basic_scrape_failure():
    """Test scraping with an invalid URL."""
    with BasicScraper() as scraper:
        result = scraper.scrape("http://invalid-url-that-does-not-exist.com/")

        assert result["success"] is False
        assert "error" in result


def test_extract_text():
    """Test text extraction from a BeautifulSoup object."""
    with BasicScraper() as scraper:
        result = scraper.scrape("http://quotes.toscrape.com/")

        if result["success"]:
            texts = scraper.extract_text(result["soup"], ".text")
            assert len(texts) > 0
            assert isinstance(texts[0], str)


def test_extract_links():
    """Test link extraction."""
    with BasicScraper() as scraper:
        result = scraper.scrape("http://quotes.toscrape.com/")

        if result["success"]:
            links = scraper.extract_links(
                result["soup"],
                base_url="http://quotes.toscrape.com/"
            )
            assert len(links) > 0
            assert all(link.startswith("http") for link in links)


if __name__ == "__main__":
    pytest.main([__file__, "-v"])
115
tests/test_data_processors.py
Normal file

@@ -0,0 +1,115 @@
"""
Tests for data processors.
"""
import pytest
from data_processors.validator import DataValidator
from data_processors.storage import DataStorage
import tempfile
import json
from pathlib import Path


class TestDataValidator:
    """Test DataValidator class."""

    def test_validate_email(self):
        """Test email validation."""
        assert DataValidator.validate_email("test@example.com") is True
        assert DataValidator.validate_email("invalid-email") is False
        assert DataValidator.validate_email("test@.com") is False

    def test_validate_url(self):
        """Test URL validation."""
        assert DataValidator.validate_url("https://example.com") is True
        assert DataValidator.validate_url("http://test.com/path") is True
        assert DataValidator.validate_url("not-a-url") is False

    def test_validate_required_fields(self):
        """Test required fields validation."""
        data = {"name": "John", "email": "john@example.com", "age": ""}
        required = ["name", "email", "age", "phone"]

        result = DataValidator.validate_required_fields(data, required)

        assert result["valid"] is False
        assert "phone" in result["missing_fields"]
        assert "age" in result["empty_fields"]

    def test_clean_text(self):
        """Test text cleaning."""
        text = " Multiple spaces and\n\nnewlines "
        cleaned = DataValidator.clean_text(text)

        assert cleaned == "Multiple spaces and newlines"

    def test_sanitize_data(self):
        """Test data sanitization."""
        data = {
            "name": " John Doe ",
            "email": "john@example.com",
            "nested": {
                "value": " test "
            }
        }

        sanitized = DataValidator.sanitize_data(data)

        assert sanitized["name"] == "John Doe"
        assert sanitized["nested"]["value"] == "test"


class TestDataStorage:
    """Test DataStorage class."""

    @pytest.fixture
    def temp_storage(self):
        """Create a temporary storage directory."""
        with tempfile.TemporaryDirectory() as tmpdir:
            yield DataStorage(output_dir=Path(tmpdir))

    def test_save_json(self, temp_storage):
        """Test JSON saving."""
        data = {"name": "Test", "value": 123}
        filepath = temp_storage.save_json(data, "test.json")

        assert filepath.exists()

        with open(filepath, 'r') as f:
            loaded = json.load(f)

        assert loaded == data

    def test_save_csv(self, temp_storage):
        """Test CSV saving."""
        data = [
            {"name": "John", "age": 30},
            {"name": "Jane", "age": 25}
        ]
        filepath = temp_storage.save_csv(data, "test.csv")

        assert filepath.exists()

    def test_save_text(self, temp_storage):
        """Test text saving."""
        content = "This is a test"
        filepath = temp_storage.save_text(content, "test.txt")

        assert filepath.exists()

        with open(filepath, 'r') as f:
            loaded = f.read()

        assert loaded == content

    def test_timestamped_filename(self, temp_storage):
        """Test timestamped filename generation."""
        filename = temp_storage.create_timestamped_filename("data", "json")

        assert filename.startswith("data_")
        assert filename.endswith(".json")
        assert len(filename) > 15  # Has a timestamp


if __name__ == "__main__":
    pytest.main([__file__, "-v"])
9
utils/__init__.py
Normal file

@@ -0,0 +1,9 @@
"""
Utility modules for web scraping operations.
"""
from .logger import setup_logger
from .rate_limiter import RateLimiter
from .retry import retry_with_backoff

__all__ = ["setup_logger", "RateLimiter", "retry_with_backoff"]
52
utils/logger.py
Normal file

@@ -0,0 +1,52 @@
"""
Logging utility for web scraping operations.
"""
import logging
import sys
from pathlib import Path
from datetime import datetime
from config import LOGS_DIR


def setup_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """
    Set up a logger with both file and console handlers.

    Args:
        name: Logger name (typically __name__ of the calling module)
        level: Logging level (default: INFO)

    Returns:
        Configured logger instance
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)

    # Avoid duplicate handlers
    if logger.handlers:
        return logger

    # Create formatters
    detailed_formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    simple_formatter = logging.Formatter('%(levelname)s - %(message)s')

    # File handler - detailed logs
    log_file = LOGS_DIR / f"{datetime.now().strftime('%Y%m%d')}_scraping.log"
    file_handler = logging.FileHandler(log_file, encoding='utf-8')
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(detailed_formatter)

    # Console handler - simplified logs
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(level)
    console_handler.setFormatter(simple_formatter)

    # Add handlers
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)

    return logger
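`setup_logger` depends on the project's `config.LOGS_DIR`, so here is a self-contained sketch of the same dual-handler pattern for experimenting outside the repo (the `make_dual_logger` name and the temporary log directory are ours): a DEBUG-level file handler next to a threshold-filtered console handler, guarded against duplicate registration.

```python
import logging
import sys
import tempfile
from pathlib import Path


def make_dual_logger(name: str, log_dir: Path, level: int = logging.INFO) -> logging.Logger:
    """Same pattern as setup_logger above: verbose file handler,
    simpler console handler, no duplicate handlers on re-import."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    if logger.handlers:
        return logger

    file_handler = logging.FileHandler(log_dir / "scraping.log", encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))

    console = logging.StreamHandler(sys.stdout)
    console.setLevel(level)
    console.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

    logger.addHandler(file_handler)
    logger.addHandler(console)
    return logger


log_dir = Path(tempfile.mkdtemp())
log = make_dual_logger("demo", log_dir)
log.debug("goes to the file only")          # below the console threshold
log.info("goes to both file and console")
```

Because the logger itself is set to DEBUG, each handler's own level decides what it emits; setting the logger to INFO instead (as the original does) would suppress DEBUG records before they ever reach the file handler.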
46
utils/rate_limiter.py
Normal file

@@ -0,0 +1,46 @@
"""
Rate limiting utility to prevent overloading target servers.
"""
import time
import random
from typing import Optional


class RateLimiter:
    """
    Simple rate limiter with random jitter to avoid detection.
    """

    def __init__(self, min_delay: float = 1.0, max_delay: Optional[float] = None):
        """
        Initialize the rate limiter.

        Args:
            min_delay: Minimum delay between requests in seconds
            max_delay: Maximum delay between requests. If None, uses min_delay
        """
        self.min_delay = min_delay
        self.max_delay = max_delay or min_delay
        self.last_request_time = 0

    def wait(self):
        """
        Wait for the appropriate amount of time before the next request.
        Adds random jitter to avoid pattern detection.
        """
        elapsed = time.time() - self.last_request_time
        delay = random.uniform(self.min_delay, self.max_delay)

        if elapsed < delay:
            time.sleep(delay - elapsed)

        self.last_request_time = time.time()

    def __enter__(self):
        """Context manager entry."""
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        self.wait()
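A quick way to see the limiter in action: the class is repeated inline so the sketch runs standalone (it mirrors `RateLimiter` above), with sub-second delays instead of the 1 s default so it finishes quickly.

```python
import random
import time


class RateLimiter:
    """Inline copy of utils.rate_limiter.RateLimiter, repeated
    here only so this sketch is self-contained."""

    def __init__(self, min_delay: float = 1.0, max_delay: float = None):
        self.min_delay = min_delay
        self.max_delay = max_delay or min_delay
        self.last_request_time = 0

    def wait(self):
        elapsed = time.time() - self.last_request_time
        delay = random.uniform(self.min_delay, self.max_delay)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request_time = time.time()


# Two back-to-back "requests": the first wait returns immediately
# (last_request_time is 0), the second enforces the jittered delay.
limiter = RateLimiter(min_delay=0.05, max_delay=0.1)
start = time.time()
for _ in range(2):
    limiter.wait()
    # ... fetch a page here ...
spacing = time.time() - start
print(f"two requests spaced {spacing:.3f}s apart")
```

The jitter matters for scraping: a fixed inter-request interval is an easy fingerprint, whereas `random.uniform(min_delay, max_delay)` makes the traffic pattern less regular.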
58
utils/retry.py
Normal file

@@ -0,0 +1,58 @@
"""
Retry utility with exponential backoff for failed requests.
"""
import time
import functools
from typing import Callable, Type, Tuple
from utils.logger import setup_logger

logger = setup_logger(__name__)


def retry_with_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    exceptions: Tuple[Type[Exception], ...] = (Exception,)
):
    """
    Decorator to retry a function with exponential backoff.

    Args:
        max_retries: Maximum number of retry attempts
        base_delay: Initial delay between retries in seconds
        max_delay: Maximum delay between retries
        exponential_base: Base for exponential backoff calculation
        exceptions: Tuple of exception types to catch and retry

    Returns:
        Decorated function with retry logic
    """
    def decorator(func: Callable):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while retries <= max_retries:
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    retries += 1
                    if retries > max_retries:
                        logger.error(
                            f"Function {func.__name__} failed after {max_retries} retries. "
                            f"Error: {str(e)}"
                        )
                        raise

                    delay = min(base_delay * (exponential_base ** (retries - 1)), max_delay)
                    logger.warning(
                        f"Function {func.__name__} failed (attempt {retries}/{max_retries}). "
                        f"Retrying in {delay:.2f} seconds. Error: {str(e)}"
                    )
                    time.sleep(delay)

            return None
        return wrapper
    return decorator
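The decorator above can be exercised standalone. This sketch keeps the same control flow but trims the project logger and shrinks the delays to milliseconds; the `flaky` function and its call counter are ours, added purely to demonstrate the retry path.

```python
import functools
import time


def retry_with_backoff(max_retries=3, base_delay=0.01, exponential_base=2.0,
                       exceptions=(Exception,)):
    """Same shape as utils.retry.retry_with_backoff, trimmed to its core."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while retries <= max_retries:
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    retries += 1
                    if retries > max_retries:
                        raise
                    # Exponential backoff: 0.01s, 0.02s, 0.04s, ...
                    time.sleep(base_delay * exponential_base ** (retries - 1))
        return wrapper
    return decorator


calls = {"n": 0}


@retry_with_backoff(max_retries=3)
def flaky():
    """Fails twice, then succeeds: the decorator absorbs both failures."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(flaky(), calls["n"])  # → ok 3
```

Capping each delay with `min(..., max_delay)` as the original does keeps long retry chains from sleeping for minutes; this sketch omits the cap only because its delays never get that large.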
332
video_proxy_server.py
Normal file

@@ -0,0 +1,332 @@
|
||||||
|
"""
|
||||||
|
Serveur proxy pour contourner la protection Referer de sekai.one
|
||||||
|
Permet d'accéder aux vidéos via une URL proxy
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python video_proxy_server.py
|
||||||
|
|
||||||
|
Puis accéder à:
|
||||||
|
http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4
|
||||||
|
"""
|
||||||
|
|
||||||
|
from flask import Flask, request, Response, stream_with_context, jsonify
|
||||||
|
from flask_cors import CORS
|
||||||
|
import requests
|
||||||
|
from urllib.parse import unquote
|
||||||
|
import re
|
||||||
|
from utils.logger import setup_logger
|
||||||
|
|
||||||
|
logger = setup_logger(__name__)
|
||||||
|
|
||||||
|
app = Flask(__name__)
|
||||||
|
CORS(app) # Permettre les requêtes cross-origin
|
||||||
|
|
||||||
|
# Headers pour contourner la protection Referer
|
||||||
|
PROXY_HEADERS = {
|
||||||
|
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36',
|
||||||
|
'Accept': '*/*',
|
||||||
|
'Accept-Language': 'fr-FR,fr;q=0.9',
|
||||||
|
'Referer': 'https://sekai.one/', # ← CLÉ : Le Referer qui permet l'accès
|
||||||
|
'Origin': 'https://sekai.one',
|
||||||
|
'Sec-Fetch-Dest': 'video',
|
||||||
|
'Sec-Fetch-Mode': 'no-cors',
|
||||||
|
'Sec-Fetch-Site': 'cross-site',
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@app.route('/')
|
||||||
|
def index():
|
||||||
|
"""Page d'accueil avec instructions"""
|
||||||
|
return """
|
||||||
|
<!DOCTYPE html>
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>Sekai Video Proxy</title>
|
||||||
|
<style>
|
||||||
|
body { font-family: Arial, sans-serif; max-width: 800px; margin: 50px auto; padding: 20px; }
|
||||||
|
h1 { color: #333; }
|
||||||
|
code { background: #f4f4f4; padding: 2px 6px; border-radius: 3px; }
|
||||||
|
.example { background: #e8f4f8; padding: 15px; border-left: 4px solid #0066cc; margin: 20px 0; }
|
||||||
|
.warning { background: #fff3cd; padding: 15px; border-left: 4px solid #ffc107; margin: 20px 0; }
|
||||||
|
</style>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<h1>🎬 Sekai Video Proxy Server</h1>
|
||||||
|
|
||||||
|
<p>Serveur proxy pour contourner la protection Referer de sekai.one</p>
|
||||||
|
|
||||||
|
<h2>📖 Utilisation</h2>
|
||||||
|
|
||||||
|
<div class="example">
|
||||||
|
<strong>Format de l'URL :</strong><br>
|
||||||
|
<code>http://localhost:8080/proxy?url=[VIDEO_URL]</code>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<h3>Exemple pour One Piece Episode 527 :</h3>
|
||||||
|
<div class="example">
|
||||||
|
<strong>URL complète :</strong><br>
|
||||||
|
<code>http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4</code>
|
||||||
|
<br><br>
|
||||||
|
<a href="/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4" target="_blank">
|
||||||
|
🎬 Tester cet exemple
|
||||||
|
</a>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<h3>Intégration dans un lecteur vidéo :</h3>
|
||||||
|
<div class="example">
|
||||||
|
<pre><video controls width="640" height="360">
|
||||||
|
<source src="http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4" type="video/mp4">
|
||||||
|
</video></pre>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<h3>Télécharger avec wget/curl :</h3>
|
||||||
|
<div class="example">
|
||||||
|
<code>wget "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4" -O episode_527.mp4</code>
|
||||||
|
<br><br>
|
||||||
|
<code>curl "http://localhost:8080/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4" -o episode_527.mp4</code>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="warning">
|
||||||
|
⚠️ <strong>Avertissement :</strong> Ce serveur est destiné à des fins de bug bounty et éducatives uniquement.
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<h2>📊 Endpoints disponibles</h2>
|
||||||
|
<ul>
|
||||||
|
<li><code>/proxy?url=[URL]</code> - Proxy vidéo avec streaming</li>
|
||||||
|
<li><code>/download?url=[URL]</code> - Téléchargement direct</li>
|
||||||
|
<li><code>/info?url=[URL]</code> - Informations sur la vidéo</li>
|
||||||
|
<li><code>/health</code> - Status du serveur</li>
|
||||||
|
</ul>
|
||||||
|
</body>
|
||||||
|
</html>
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
@app.route('/health')
|
||||||
|
def health():
|
||||||
|
"""Endpoint de santé pour vérifier que le serveur fonctionne"""
|
||||||
|
return jsonify({
|
||||||
|
"status": "ok",
|
||||||
|
"service": "sekai-video-proxy",
|
||||||
|
"version": "1.0.0"
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
@app.route('/info')
|
||||||
|
def video_info():
|
||||||
|
"""Récupère les informations sur une vidéo sans la télécharger"""
|
||||||
|
video_url = request.args.get('url')
|
||||||
|
|
||||||
|
if not video_url:
|
||||||
|
return jsonify({"error": "Paramètre 'url' manquant"}), 400
|
||||||
|
|
||||||
|
video_url = unquote(video_url)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Faire une requête HEAD pour obtenir les métadonnées
|
||||||
|
response = requests.head(video_url, headers=PROXY_HEADERS, timeout=10)
|
||||||
|
|
||||||
|
info = {
|
||||||
|
"url": video_url,
|
||||||
|
"status_code": response.status_code,
|
||||||
|
"accessible": response.status_code == 200,
|
||||||
|
"content_type": response.headers.get('Content-Type'),
|
||||||
|
"content_length": response.headers.get('Content-Length'),
|
||||||
|
"content_length_mb": round(int(response.headers.get('Content-Length', 0)) / (1024 * 1024), 2) if response.headers.get('Content-Length') else None,
|
||||||
|
"server": response.headers.get('Server'),
|
||||||
|
"accept_ranges": response.headers.get('Accept-Ranges'),
|
||||||
|
"proxy_url": f"{request.url_root}proxy?url={video_url}"
|
||||||
|
}
|
||||||
|
|
||||||
|
return jsonify(info)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Erreur lors de la récupération des infos: {str(e)}")
|
||||||
|
return jsonify({
|
||||||
|
"error": str(e),
|
||||||
|
"url": video_url
|
||||||
|
}), 500
|
||||||
|
|
||||||
|
|
@app.route('/proxy')
def proxy_video():
    """
    Main video-proxy endpoint, with streaming support.

    Supports Range requests for seeking within the video.
    """
    video_url = request.args.get('url')

    if not video_url:
        return jsonify({"error": "Missing 'url' parameter. Use: /proxy?url=[VIDEO_URL]"}), 400

    # Decode the URL if it is percent-encoded
    video_url = unquote(video_url)

    # Validate the URL (security)
    if not video_url.startswith(('http://', 'https://')):
        return jsonify({"error": "Invalid URL"}), 400

    logger.info(f"Proxying video: {video_url}")

    try:
        # Copy the client's request headers (notably Range, for seeking)
        proxy_headers = PROXY_HEADERS.copy()

        # If the client asks for a specific range (video seeking)
        if 'Range' in request.headers:
            proxy_headers['Range'] = request.headers['Range']
            logger.info(f"Range request: {request.headers['Range']}")

        # Request the video from the upstream server
        response = requests.get(
            video_url,
            headers=proxy_headers,
            stream=True,  # Important: stream, so the whole file is never held in memory
            timeout=30
        )

        # Check whether the request succeeded
        if response.status_code not in [200, 206]:  # 200 OK or 206 Partial Content
            logger.error(f"Video server error: {response.status_code}")
            return jsonify({
                "error": f"The video server returned an error: {response.status_code}",
                "url": video_url
            }), response.status_code

        # Prepare the response headers
        response_headers = {
            'Content-Type': response.headers.get('Content-Type', 'video/mp4'),
            'Accept-Ranges': 'bytes',
            'Access-Control-Allow-Origin': '*',
            'Access-Control-Allow-Methods': 'GET, HEAD, OPTIONS',
            'Access-Control-Allow-Headers': 'Range',
        }

        # Copy the important headers from the source server
        if 'Content-Length' in response.headers:
            response_headers['Content-Length'] = response.headers['Content-Length']

        if 'Content-Range' in response.headers:
            response_headers['Content-Range'] = response.headers['Content-Range']

        # Stream the response chunk by chunk
        def generate():
            try:
                for chunk in response.iter_content(chunk_size=8192):
                    if chunk:
                        yield chunk
            except Exception as e:
                logger.error(f"Error during streaming: {str(e)}")

        status_code = response.status_code

        logger.info(f"Streaming video: {video_url} (Status: {status_code})")

        return Response(
            stream_with_context(generate()),
            status=status_code,
            headers=response_headers
        )

    except requests.exceptions.Timeout:
        logger.error(f"Timeout while connecting to {video_url}")
        return jsonify({
            "error": "Timeout while connecting to the video server",
            "url": video_url
        }), 504

    except Exception as e:
        logger.error(f"Proxy error: {str(e)}")
        return jsonify({
            "error": str(e),
            "url": video_url
        }), 500
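The Range forwarding above is what makes seeking work: the player sends `Range: bytes=start-end`, and the upstream replies `206 Partial Content` with a matching `Content-Range`. The mapping can be sketched with two small helpers (the function names below are mine for illustration, not part of the server):

```python
def parse_range(header, total_size):
    # Parse a simple single-range header such as "bytes=0-1023".
    # An open-ended range like "bytes=4096-" means "to the end of the file".
    unit, _, spec = header.partition('=')
    assert unit == 'bytes'
    start_s, _, end_s = spec.partition('-')
    start = int(start_s)
    end = int(end_s) if end_s else total_size - 1
    return start, end

def content_range(start, end, total_size):
    # Build the Content-Range header a 206 response carries back.
    return f'bytes {start}-{end}/{total_size}'

start, end = parse_range('bytes=0-1023', 500_000)
print(content_range(start, end, 500_000))  # → bytes 0-1023/500000
```

The proxy itself never needs to parse ranges: it forwards the client's `Range` header verbatim and copies the upstream `Content-Range` back, which is why seeking works end to end.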
@app.route('/download')
def download_video():
    """
    Endpoint to download a complete video.

    (Alternative to streaming, for direct download.)
    """
    video_url = request.args.get('url')

    if not video_url:
        return jsonify({"error": "Missing 'url' parameter"}), 400

    video_url = unquote(video_url)

    # Extract the file name from the URL
    filename = video_url.split('/')[-1]
    if not filename.endswith('.mp4'):
        filename = 'video.mp4'

    logger.info(f"Downloading: {video_url}")

    try:
        response = requests.get(
            video_url,
            headers=PROXY_HEADERS,
            stream=True,
            timeout=30
        )

        if response.status_code != 200:
            return jsonify({
                "error": f"Error: {response.status_code}",
                "url": video_url
            }), response.status_code

        def generate():
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    yield chunk

        headers = {
            'Content-Type': 'video/mp4',
            'Content-Disposition': f'attachment; filename="{filename}"',
            'Content-Length': response.headers.get('Content-Length', ''),
            'Access-Control-Allow-Origin': '*',
        }

        return Response(
            stream_with_context(generate()),
            headers=headers
        )

    except Exception as e:
        logger.error(f"Download error: {str(e)}")
        return jsonify({"error": str(e)}), 500
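Both endpoints stream in 8192-byte chunks rather than buffering the whole file in memory. The behaviour of `response.iter_content()` can be mimicked on any file-like object; a minimal sketch (the helper below is illustrative, not part of the `requests` library):

```python
import io

def iter_chunks(stream, chunk_size=8192):
    # Yield fixed-size chunks until the stream is exhausted,
    # mirroring what requests' response.iter_content() does for the proxy.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Simulate a 20 000-byte video body with an in-memory stream
fake_body = io.BytesIO(b'\x00' * 20_000)
chunks = list(iter_chunks(fake_body))
print(len(chunks))                  # → 3  (8192 + 8192 + 3616 bytes)
print(sum(len(c) for c in chunks))  # → 20000
```

Because each chunk is yielded as soon as it arrives, memory use stays flat regardless of how large the video is.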
def main():
    """Start the server."""
    import argparse

    parser = argparse.ArgumentParser(description="Video proxy server for sekai.one")
    parser.add_argument('--host', default='0.0.0.0', help='Host (default: 0.0.0.0)')
    parser.add_argument('--port', type=int, default=8080, help='Port (default: 8080)')
    parser.add_argument('--debug', action='store_true', help='Debug mode')

    args = parser.parse_args()

    print("\n" + "=" * 80)
    print("🎬 SEKAI VIDEO PROXY SERVER")
    print("=" * 80)
    print(f"\n✓ Server started on http://{args.host}:{args.port}")
    print(f"\n📖 Documentation: http://localhost:{args.port}/")
    print(f"\n🎬 Usage example:")
    print(f"   http://localhost:{args.port}/proxy?url=https://17.mugiwara.xyz/op/saga-7/hd/527.mp4")
    print("\n" + "=" * 80 + "\n")

    app.run(
        host=args.host,
        port=args.port,
        debug=args.debug,
        threaded=True  # Support multiple simultaneous connections
    )


if __name__ == '__main__':
    main()
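When calling `/proxy` or `/download` from a client, the upstream video URL should be percent-encoded before being placed in the query string, otherwise any `?` or `&` inside it would be misread as extra parameters; the server's `unquote()` call undoes the encoding on arrival. A sketch using only the standard library (host and port match the defaults above):

```python
from urllib.parse import quote

video_url = 'https://17.mugiwara.xyz/op/saga-7/hd/527.mp4'
proxy_base = 'http://localhost:8080'

# Encode every reserved character (safe='') so the upstream URL
# survives as a single 'url' query parameter.
encoded = quote(video_url, safe='')
request_url = f'{proxy_base}/proxy?url={encoded}'
print(request_url)
# → http://localhost:8080/proxy?url=https%3A%2F%2F17.mugiwara.xyz%2Fop%2Fsaga-7%2Fhd%2F527.mp4
```

The raw, unencoded form shown in the startup banner also works for simple `.mp4` URLs, but the encoded form is the safe general case.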