Mature Approaches to Building an API-Driven Scraper with Playwright (TypeScript/JavaScript)
Playwright, Microsoft's modern browser automation tool, is well suited to building scraper services exposed through an API. Below are mature approaches implemented in TypeScript/JavaScript.
1. Basic API Scraper Service Architecture
1.1 Express + Playwright
A lightweight API scraper service built on Node.js and Express:
```typescript
import express from 'express';
import { chromium, Browser } from 'playwright';

const app = express();
app.use(express.json());

let browser: Browser;

async function initBrowser() {
  browser = await chromium.launch({
    headless: true,
    args: ['--no-sandbox']
  });
}

async function scrapePage(url: string) {
  const context = await browser.newContext();
  const page = await context.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle' });
    const data = await page.evaluate(() => ({
      title: document.title,
      content: document.body.innerText,
      links: [...document.querySelectorAll('a')].map(a => a.href)
    }));
    return { success: true, data };
  } catch (error) {
    return { success: false, error: (error as Error).message };
  } finally {
    await page.close();
    await context.close();
  }
}

app.post('/api/scrape', async (req, res) => {
  const { url } = req.body;
  if (!url) {
    return res.status(400).json({ error: 'URL is required' });
  }
  const result = await scrapePage(url);
  res.json(result);
});

initBrowser().then(() => {
  app.listen(3000, () => {
    console.log('Scraper API running on http://localhost:3000');
  });
});

// Graceful shutdown: close the shared browser before exiting
process.on('SIGTERM', async () => {
  await browser.close();
  process.exit(0);
});
```
This approach provides:
- A RESTful API built on Express
- A managed headless Playwright browser instance
- Basic error handling and resource cleanup
- Graceful startup and shutdown
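For a quick sanity check, the service can be exercised with any HTTP client. The helper below builds the request that the `/api/scrape` endpoint above expects; the endpoint path and body shape come from the code above, while `buildScrapeRequest` itself is an illustrative helper, and the commented `fetch` usage assumes Node 18+:

```typescript
// Build the POST options object that /api/scrape expects.
function buildScrapeRequest(url: string): { method: string; headers: Record<string, string>; body: string } {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url })
  };
}

// Usage (requires the service above to be running):
// const res = await fetch('http://localhost:3000/api/scrape', buildScrapeRequest('https://example.com'));
// const { success, data } = await res.json();
```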
2. Advanced Features
2.1 Dynamic Parameters and Configuration
```typescript
interface ScrapeOptions {
  waitUntil?: 'load' | 'domcontentloaded' | 'networkidle';
  timeout?: number;
  headers?: Record<string, string>;
  screenshot?: boolean;
  pdf?: boolean;
  userAgent?: string;
}

async function scrapeWithOptions(url: string, options: ScrapeOptions = {}) {
  const context = await browser.newContext({
    userAgent: options.userAgent ||
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
  });
  const page = await context.newPage();

  if (options.headers) {
    await page.setExtraHTTPHeaders(options.headers);
  }

  try {
    await page.goto(url, {
      waitUntil: options.waitUntil || 'networkidle',
      timeout: options.timeout || 30000
    });

    const result: any = {
      title: await page.title(),
      url: page.url()
    };

    if (options.screenshot) {
      result.screenshot = await page.screenshot({ fullPage: true });
    }
    if (options.pdf) {
      // page.pdf() is only supported in headless Chromium
      result.pdf = await page.pdf();
    }

    return result;
  } finally {
    await page.close();
    await context.close();
  }
}
```
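The defaults baked into `scrapeWithOptions` can be factored into a small pure function that is easy to unit-test in isolation. This is a sketch; `normalizeOptions` and `ScrapeDefaults` are illustrative names, not part of the code above:

```typescript
// Subset of the ScrapeOptions interface above, kept self-contained.
interface ScrapeDefaults {
  waitUntil?: 'load' | 'domcontentloaded' | 'networkidle';
  timeout?: number;
  userAgent?: string;
}

// Apply the same defaults scrapeWithOptions uses, in one pure function.
function normalizeOptions(options: ScrapeDefaults = {}): Required<ScrapeDefaults> {
  return {
    waitUntil: options.waitUntil ?? 'networkidle',
    timeout: options.timeout ?? 30000,
    userAgent: options.userAgent ??
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
  };
}
```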
2.2 Intercepting Network Requests to Improve Performance
```typescript
async function scrapeWithInterception(url: string) {
  const context = await browser.newContext();
  const page = await context.newPage();

  // Skip images and fonts to speed up page loads
  await context.route('**/*.{png,jpg,jpeg,svg,gif,woff,woff2}', route => route.abort());

  // Capture API responses made by the page itself
  const apiResponses: Array<{ url: string; status: number; body: unknown }> = [];
  page.on('response', async response => {
    if (response.url().includes('/api/')) {
      apiResponses.push({
        url: response.url(),
        status: response.status(),
        body: await response.json().catch(() => null)
      });
    }
  });

  try {
    await page.goto(url);
    return { pageContent: await page.content(), apiResponses };
  } finally {
    await page.close();
    await context.close();
  }
}
```
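The glob pattern above can also be mirrored in a plain predicate, which keeps the blocking decision unit-testable without a browser. `shouldBlockResource` is an illustrative helper, not a Playwright API:

```typescript
const BLOCKED_EXTENSIONS = ['png', 'jpg', 'jpeg', 'svg', 'gif', 'woff', 'woff2'];

// Decide whether a request URL points at a static asset we want to skip.
function shouldBlockResource(url: string): boolean {
  const pathname = new URL(url).pathname;
  const ext = pathname.split('.').pop()?.toLowerCase() ?? '';
  return BLOCKED_EXTENSIONS.includes(ext);
}

// Usage with Playwright:
// await context.route('**/*', route =>
//   shouldBlockResource(route.request().url()) ? route.abort() : route.continue());
```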
3. Production-Grade Architecture
3.1 Full Scraper Service Layout
```
playwright-api/
├── src/
│   ├── config/                  # Configuration
│   │   └── browser.ts           # Browser settings
│   ├── controllers/             # API controllers
│   │   └── scrape.controller.ts
│   ├── services/                # Business logic
│   │   ├── browser.service.ts   # Browser management
│   │   └── scrape.service.ts    # Scraping logic
│   ├── routes/                  # API routes
│   │   └── scrape.route.ts
│   ├── middlewares/             # Middleware
│   │   └── error.middleware.ts
│   └── index.ts                 # Application entry point
├── test/                        # Tests
├── package.json
├── tsconfig.json
└── .env                         # Environment variables
```
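A minimal sketch of what `src/config/browser.ts` might contain, assuming launch settings are read from `.env`. The `HEADLESS` and `BROWSER_ARGS` variable names are illustrative assumptions, not part of the layout above:

```typescript
// Build Playwright launch options from environment variables.
// HEADLESS and BROWSER_ARGS are illustrative names; adjust to your .env schema.
interface BrowserConfig {
  headless: boolean;
  args: string[];
}

function browserConfigFromEnv(env: Record<string, string | undefined> = process.env): BrowserConfig {
  return {
    headless: env.HEADLESS !== 'false',                   // default: headless
    args: (env.BROWSER_ARGS ?? '--no-sandbox').split(' ') // default: --no-sandbox
  };
}
```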
3.2 Browser Service Management
```typescript
import { chromium, Browser, BrowserContext } from 'playwright';

class BrowserService {
  private browser: Browser | null = null;
  private contexts: BrowserContext[] = [];

  async launch() {
    if (this.browser) return;
    this.browser = await chromium.launch({
      headless: true,
      args: ['--no-sandbox']
    });
  }

  async newContext() {
    if (!this.browser) await this.launch();
    const context = await this.browser!.newContext();
    this.contexts.push(context);
    return context;
  }

  async close() {
    for (const context of this.contexts) {
      await context.close();
    }
    this.contexts = [];
    if (this.browser) {
      await this.browser.close();
      this.browser = null;
    }
  }
}

export const browserService = new BrowserService();
```
3.3 Scrape Service Implementation
```typescript
import { browserService } from './browser.service';
import { Page } from 'playwright';

export class ScrapeService {
  async scrape(url: string, options: any = {}) {
    const context = await browserService.newContext();
    const page = await context.newPage();
    try {
      await page.goto(url, {
        waitUntil: options.waitUntil || 'networkidle',
        timeout: options.timeout || 30000
      });
      const data = await this.extractData(page, options);
      return { success: true, data };
    } catch (error) {
      return { success: false, error: (error as Error).message };
    } finally {
      await page.close();
      await context.close(); // release the per-request context, not just the page
    }
  }

  private async extractData(page: Page, options: any) {
    return {
      title: await page.title(),
      content: await page.content(),
    };
  }
}

// Shared instance used by the controllers and the queue consumer below
export const scrapeService = new ScrapeService();
```
4. Performance Optimization and Scaling
4.1 Using a Cluster for Higher Concurrency
```typescript
import { Cluster } from 'playwright-cluster';

async function runCluster() {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 4,
    playwrightOptions: { headless: true }
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    return await page.evaluate(() => document.title);
  });

  cluster.queue('https://example.com');
  cluster.queue('https://example.org');

  cluster.on('taskend', (result) => {
    console.log(`Title: ${result}`);
  });

  await cluster.idle();
  await cluster.close();
}
```
4.2 Distributed Scraping with a Message Queue
```typescript
import { Consumer } from 'sqs-consumer';
import AWS from 'aws-sdk';
import { scrapeService } from './services/scrape.service';

const app = Consumer.create({
  queueUrl: process.env.SQS_QUEUE_URL!,
  handleMessage: async (message) => {
    const { url, options } = JSON.parse(message.Body!);
    const result = await scrapeService.scrape(url, options);
    console.log(result);
  },
  sqs: new AWS.SQS()
});

app.on('error', (err) => {
  console.error(err.message);
});

app.on('processing_error', (err) => {
  console.error(err.message);
});

app.start();
```
5. Deployment and Monitoring
5.1 Docker Deployment
```dockerfile
FROM node:16

WORKDIR /app
COPY package*.json ./
RUN npm install

COPY . .

RUN npx playwright install
RUN npx playwright install-deps

# Compile TypeScript to dist/ (assumes a "build" script, e.g. tsc, in package.json)
RUN npm run build

CMD ["node", "dist/index.js"]
```
5.2 Process Management with PM2
```bash
pm2 start dist/index.js --name "playwright-api" -i max
pm2 save
pm2 startup
```
5.3 Health Checks and Monitoring
```typescript
import client from 'prom-client';

// Liveness endpoint (assumes browserService exposes an isRunning() helper)
app.get('/health', (req, res) => {
  res.json({
    status: 'UP',
    browser: browserService.isRunning(),
    timestamp: new Date().toISOString()
  });
});

// Collect default Node.js process metrics for Prometheus
const collectDefaultMetrics = client.collectDefaultMetrics;
collectDefaultMetrics({ timeout: 5000 });

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```
6. Security Considerations
6.1 API Authentication
```typescript
import passport from 'passport';
import { BasicStrategy } from 'passport-http';

passport.use(new BasicStrategy((username, password, done) => {
  if (username === process.env.API_USER && password === process.env.API_PASS) {
    return done(null, { user: 'api' });
  }
  return done(null, false);
}));

app.use(passport.initialize());

app.post('/api/scrape',
  passport.authenticate('basic', { session: false }),
  scrapeController.scrape
);
```
6.2 Request Rate Limiting
```typescript
import rateLimit from 'express-rate-limit';

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15-minute window
  max: 100,                 // limit each IP to 100 requests per window
  message: 'Too many requests from this IP, please try again later'
});

app.use(limiter);
```
7. Testing and Debugging
7.1 Unit Test Example
```typescript
import { test, expect } from '@playwright/test';
import { scrapeService } from '../src/services/scrape.service';

test.describe('ScrapeService', () => {
  test('should return page title', async () => {
    const result = await scrapeService.scrape('https://example.com');
    expect(result.success).toBe(true);
    expect(result.data.title).toContain('Example');
  });
});
```
7.2 Debugging Tips
```typescript
// Launch a headed browser with DevTools open for interactive debugging
const browser = await chromium.launch({
  headless: false,
  devtools: true
});

// Forward in-page console output to the Node process
page.on('console', msg => {
  console.log('Browser console:', msg.text());
});

// Log every network request and response
page.on('request', request => console.log('>>', request.method(), request.url()));
page.on('response', response => console.log('<<', response.status(), response.url()));
```
Summary
The approaches above chart a complete path, from a basic implementation to a production-grade architecture, for building an API-driven scraper with Playwright in TypeScript/JavaScript:
- Basic API service: a simple Express + Playwright integration
- Advanced features: request interception, dynamic parameters, multiple output formats
- Production architecture: modular design, error handling, resource management
- Performance scaling: cluster support, message queue integration
- Deployment and operations: Docker containerization, process management, monitoring
- Security: API authentication, request rate limiting
- Testing and debugging: unit tests, debugging techniques
These building blocks can be combined and adapted to fit different scenarios. For higher throughput or more complex business logic, consider layering in a distributed task queue, caching, and similar higher-level architecture.