
A Mature Approach to API-Driven Scraping with Playwright in TypeScript/JavaScript


Playwright, Microsoft's modern browser-automation tool, is particularly well suited to building scraping services exposed through an API. The following is a mature approach implemented in TypeScript/JavaScript.

1. Basic API Scraper Service Architecture

1.1 Express + Playwright

This is a lightweight API scraper service built on Node.js and Express:

import express from 'express';
import { chromium, Browser } from 'playwright';

const app = express();
app.use(express.json());

let browser: Browser;

// Initialize a single shared browser instance
async function initBrowser() {
  browser = await chromium.launch({
    headless: true,
    args: ['--no-sandbox']
  });
}

// Core scraping logic
async function scrapePage(url: string) {
  const context = await browser.newContext();
  const page = await context.newPage();

  try {
    await page.goto(url, { waitUntil: 'networkidle' });

    // Customize the data-extraction logic as needed
    const data = await page.evaluate(() => {
      return {
        title: document.title,
        content: document.body.innerText,
        links: [...document.querySelectorAll('a')].map(a => a.href)
      };
    });

    return { success: true, data };
  } catch (error) {
    return { success: false, error: (error as Error).message };
  } finally {
    await page.close();
    await context.close();
  }
}

// API endpoint
app.post('/api/scrape', async (req, res) => {
  const { url } = req.body;
  if (!url) {
    return res.status(400).json({ error: 'URL is required' });
  }

  const result = await scrapePage(url);
  res.json(result);
});

// Start the service
initBrowser().then(() => {
  app.listen(3000, () => {
    console.log('Scraper API running on http://localhost:3000');
  });
});

// Graceful shutdown
process.on('SIGTERM', async () => {
  await browser.close();
  process.exit(0);
});

This setup provides the following features (a quick usage sketch follows the list):

  • A RESTful API built on Express
  • Managed headless browser instances via Playwright
  • Basic error handling and resource cleanup
  • Graceful startup and shutdown
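
To exercise the /api/scrape endpoint, a minimal client can look like the sketch below. This assumes the service is running locally on port 3000 and that you are on Node.js 18+, where fetch is available globally; the helper name is just for illustration.

// Minimal client for the /api/scrape endpoint (assumes Node.js 18+ global fetch)
async function requestScrape(targetUrl: string) {
  const response = await fetch('http://localhost:3000/api/scrape', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: targetUrl })
  });
  return response.json();
}

requestScrape('https://example.com').then(result => console.log(result));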

2. Advanced Features

2.1 Dynamic Parameters and Configuration

interface ScrapeOptions {
  waitUntil?: 'load' | 'domcontentloaded' | 'networkidle';
  timeout?: number;
  headers?: Record<string, string>;
  screenshot?: boolean;
  pdf?: boolean;
  userAgent?: string;
}

async function scrapeWithOptions(url: string, options: ScrapeOptions = {}) {
  const context = await browser.newContext({
    userAgent: options.userAgent || 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
  });

  const page = await context.newPage();
  if (options.headers) {
    await page.setExtraHTTPHeaders(options.headers);
  }

  try {
    await page.goto(url, {
      waitUntil: options.waitUntil || 'networkidle',
      timeout: options.timeout || 30000
    });

    const result: any = {
      title: await page.title(),
      url: page.url()
    };

    if (options.screenshot) {
      result.screenshot = await page.screenshot({ fullPage: true });
    }

    if (options.pdf) {
      result.pdf = await page.pdf();
    }

    return result;
  } finally {
    await page.close();
    await context.close();
  }
}
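
One practical wrinkle: page.screenshot() and page.pdf() return Node.js Buffers, which do not serialize cleanly into a JSON response. Below is a hedged sketch of how an endpoint might forward the request body as ScrapeOptions and base64-encode any binary output; the endpoint path and field handling are assumptions for illustration, not part of the original design.

// Hypothetical endpoint that forwards ScrapeOptions and base64-encodes binary results
app.post('/api/scrape-advanced', async (req, res) => {
  const { url, options } = req.body as { url: string; options?: ScrapeOptions };
  if (!url) {
    return res.status(400).json({ error: 'URL is required' });
  }

  const result = await scrapeWithOptions(url, options);

  // Convert Buffers so the payload stays valid JSON
  res.json({
    ...result,
    screenshot: result.screenshot ? result.screenshot.toString('base64') : undefined,
    pdf: result.pdf ? result.pdf.toString('base64') : undefined
  });
});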

2.2 Intercepting Network Requests to Improve Performance

async function scrapeWithInterception(url: string) {
  const context = await browser.newContext();
  const page = await context.newPage();

  // Abort requests for static assets that are not needed
  await context.route('**/*.{png,jpg,jpeg,svg,gif,woff,woff2}', route => route.abort());

  // Capture responses from API calls made by the page
  const apiResponses: Array<{ url: string; status: number; body: unknown }> = [];
  page.on('response', async response => {
    if (response.url().includes('/api/')) {
      apiResponses.push({
        url: response.url(),
        status: response.status(),
        body: await response.json().catch(() => null)
      });
    }
  });

  try {
    // Wait for the network to go idle so the page's API calls have completed
    await page.goto(url, { waitUntil: 'networkidle' });

    return {
      pageContent: await page.content(),
      apiResponses
    };
  } finally {
    await page.close();
    await context.close();
  }
}
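
When you only need one specific API response rather than everything that matches /api/, page.waitForResponse() is more deterministic. A brief sketch follows; the /api/items path is just an example.

// Wait for one specific API response instead of collecting everything
async function scrapeSingleApiCall(url: string) {
  const context = await browser.newContext();
  const page = await context.newPage();

  try {
    // Start waiting before navigation so the response is not missed
    const responsePromise = page.waitForResponse(
      response => response.url().includes('/api/items') && response.status() === 200
    );
    await page.goto(url);
    const response = await responsePromise;
    return await response.json();
  } finally {
    await page.close();
    await context.close();
  }
}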

3. Production-Grade Architecture

3.1 Full Scraper Service Layout

playwright-api/
├── src/
│   ├── config/                   # Configuration
│   │   └── browser.ts            # Browser settings
│   ├── controllers/              # API controllers
│   │   └── scrape.controller.ts
│   ├── services/                 # Business logic
│   │   ├── browser.service.ts    # Browser management
│   │   └── scrape.service.ts     # Scraping logic
│   ├── routes/                   # API routes
│   │   └── scrape.route.ts
│   ├── middlewares/              # Middleware
│   │   └── error.middleware.ts
│   └── index.ts                  # Application entry point
├── test/                         # Tests
├── package.json
├── tsconfig.json
└── .env                          # Environment variables
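
The layout names a route and a controller that the rest of the section does not show. The following is a minimal sketch of how they might be wired together; the file contents are assumptions consistent with the structure above, not code from the original.

// src/controllers/scrape.controller.ts (sketch)
import { Request, Response, NextFunction } from 'express';
import { ScrapeService } from '../services/scrape.service';

const scrapeService = new ScrapeService();

export const scrapeController = {
  async scrape(req: Request, res: Response, next: NextFunction) {
    try {
      const { url, options } = req.body;
      if (!url) {
        return res.status(400).json({ error: 'URL is required' });
      }
      res.json(await scrapeService.scrape(url, options));
    } catch (err) {
      next(err); // hand off to error.middleware.ts
    }
  }
};

// src/routes/scrape.route.ts (sketch) -- imports scrapeController from the controller above
// and is mounted in src/index.ts via app.use('/api', scrapeRouter)
import { Router } from 'express';

export const scrapeRouter = Router();
scrapeRouter.post('/scrape', scrapeController.scrape);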

3.2 Browser Service Management

// src/services/browser.service.ts
import { chromium, Browser, BrowserContext } from 'playwright';

class BrowserService {
  private browser: Browser | null = null;
  private contexts: BrowserContext[] = [];

  async launch() {
    if (this.browser) return;

    this.browser = await chromium.launch({
      headless: true,
      args: ['--no-sandbox']
    });
  }

  async newContext() {
    if (!this.browser) await this.launch();

    const context = await this.browser!.newContext();
    this.contexts.push(context);
    return context;
  }

  // Used by the /health endpoint in section 5.3
  isRunning() {
    return this.browser !== null && this.browser.isConnected();
  }

  async close() {
    for (const context of this.contexts) {
      await context.close();
    }
    this.contexts = [];

    if (this.browser) {
      await this.browser.close();
      this.browser = null;
    }
  }
}

export const browserService = new BrowserService();
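
The layout also lists src/config/browser.ts without showing it. One plausible shape for that file is reading launch options from .env; the variable names below are assumptions, and BrowserService.launch() could spread browserConfig instead of hard-coding its options.

// src/config/browser.ts (sketch) -- centralizes launch options read from .env
import type { LaunchOptions } from 'playwright';

export const browserConfig: LaunchOptions = {
  headless: process.env.HEADLESS !== 'false',
  args: (process.env.BROWSER_ARGS || '--no-sandbox').split(' ')
};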

3.3 Scrape Service Implementation

// src/services/scrape.service.ts
import { browserService } from './browser.service';
import { Page } from 'playwright';

export class ScrapeService {
  async scrape(url: string, options: any = {}) {
    const context = await browserService.newContext();
    const page = await context.newPage();

    try {
      await page.goto(url, {
        waitUntil: options.waitUntil || 'networkidle',
        timeout: options.timeout || 30000
      });

      // Custom data-extraction logic
      const data = await this.extractData(page, options);
      return { success: true, data };
    } catch (error) {
      return { success: false, error: (error as Error).message };
    } finally {
      await page.close();
      await context.close();
    }
  }

  private async extractData(page: Page, options: any) {
    // Implement the concrete extraction logic here
    return {
      title: await page.title(),
      content: await page.content()
      // other custom fields
    };
  }
}

// Shared instance used by the queue consumer (4.2) and the tests (7.1)
export const scrapeService = new ScrapeService();
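
The middlewares/error.middleware.ts entry in the layout likewise has no code in the original. A minimal sketch of a centralized Express error handler that could live there:

// src/middlewares/error.middleware.ts (sketch)
import { Request, Response, NextFunction } from 'express';

export function errorMiddleware(err: Error, req: Request, res: Response, next: NextFunction) {
  console.error(err);
  res.status(500).json({ success: false, error: err.message });
}

// Registered after all routes in src/index.ts:
// app.use(errorMiddleware);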

4. Performance Optimization and Scaling

4.1 Using a Cluster to Increase Concurrency

import { Cluster } from 'playwright-cluster';

async function runCluster() {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 4, // tune to the number of CPU cores
    playwrightOptions: {
      headless: true
    }
  });

  // Task handler run for every queued URL
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    return await page.evaluate(() => document.title);
  });

  // Enqueue tasks
  cluster.queue('https://example.com');
  cluster.queue('https://example.org');

  // Collect results
  cluster.on('taskend', (result) => {
    console.log(`Title: ${result}`);
  });

  await cluster.idle();
  await cluster.close();
}
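
If you would rather not add a cluster dependency, a small worker pool over browser contexts gets most of the way there. The sketch below reuses the shared browser instance from section 1; the pool size and return shape are illustrative.

// Simple worker pool: N workers share one browser, each pulling the next URL
async function scrapeConcurrently(urls: string[], concurrency = 4) {
  const results: Record<string, string> = {};
  let next = 0;

  async function worker() {
    while (next < urls.length) {
      const url = urls[next++];
      const context = await browser.newContext();
      const page = await context.newPage();
      try {
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        results[url] = await page.title();
      } finally {
        await page.close();
        await context.close();
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return results;
}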

4.2 Distributed Scraping with a Message Queue

import { Consumer } from 'sqs-consumer';
import AWS from 'aws-sdk';
import { scrapeService } from './services/scrape.service';

const app = Consumer.create({
  queueUrl: process.env.SQS_QUEUE_URL!,
  handleMessage: async (message) => {
    const { url, options } = JSON.parse(message.Body!);
    const result = await scrapeService.scrape(url, options);

    // Handle the result, e.g. store it in a database or publish it to another queue
    console.log(result);
  },
  sqs: new AWS.SQS()
});

app.on('error', (err) => {
  console.error(err.message);
});

app.on('processing_error', (err) => {
  console.error(err.message);
});

app.start();
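
The other half of this setup is a producer that enqueues scrape jobs. A hedged sketch using the same AWS SDK v2 as the consumer above; the queue URL environment variable matches the consumer's, and the helper name is illustrative.

// Producer: push a scrape job onto the SQS queue the consumer reads from
import AWS from 'aws-sdk';

const sqs = new AWS.SQS();

async function enqueueScrapeJob(url: string, options: object = {}) {
  await sqs.sendMessage({
    QueueUrl: process.env.SQS_QUEUE_URL!,
    MessageBody: JSON.stringify({ url, options })
  }).promise();
}

enqueueScrapeJob('https://example.com', { waitUntil: 'domcontentloaded' });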

5. Deployment and Monitoring

5.1 Docker Deployment

# Alternatively, base the image on mcr.microsoft.com/playwright, which ships with browsers and system dependencies preinstalled
FROM node:16

WORKDIR /app
COPY package*.json ./
RUN npm install

COPY . .

# Install Playwright browsers and their system dependencies
RUN npx playwright install
RUN npx playwright install-deps

# Compile TypeScript to dist/ (assumes a "build" script in package.json)
RUN npm run build

CMD ["node", "dist/index.js"]

5.2 Process Management with PM2

pm2 start dist/index.js --name "playwright-api" -i max
pm2 save
pm2 startup

5.3 Health Checks and Monitoring

// Health-check endpoint
app.get('/health', (req, res) => {
  res.json({
    status: 'UP',
    browser: browserService.isRunning(),
    timestamp: new Date().toISOString()
  });
});

// Prometheus default metrics
import client from 'prom-client';
client.collectDefaultMetrics();

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
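
Beyond the default metrics, a custom counter and duration histogram for scrape requests usually pay off quickly. A sketch follows; the metric names are arbitrary and the integration points are shown as comments.

// Custom Prometheus metrics for the scrape endpoint (names are illustrative)
const scrapeCounter = new client.Counter({
  name: 'scrape_requests_total',
  help: 'Total number of scrape requests',
  labelNames: ['status']
});

const scrapeDuration = new client.Histogram({
  name: 'scrape_duration_seconds',
  help: 'Scrape duration in seconds'
});

// Inside the scrape handler:
// const endTimer = scrapeDuration.startTimer();
// ...run the scrape...
// endTimer();
// scrapeCounter.inc({ status: result.success ? 'ok' : 'error' });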

6. Security Considerations

6.1 API Authentication

import passport from 'passport';
import { BasicStrategy } from 'passport-http';

passport.use(new BasicStrategy((username, password, done) => {
  if (username === process.env.API_USER && password === process.env.API_PASS) {
    return done(null, { user: 'api' });
  }
  return done(null, false);
}));

// Protect the scrape endpoint
app.post('/api/scrape',
  passport.authenticate('basic', { session: false }),
  scrapeController.scrape
);
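
If HTTP Basic auth is heavier than you need, a plain API-key header check is a lighter alternative. A sketch, where the header name and environment variable are assumptions:

// Lightweight alternative: check a shared API key sent in a request header
import { Request, Response, NextFunction } from 'express';

function apiKeyAuth(req: Request, res: Response, next: NextFunction) {
  if (req.header('x-api-key') !== process.env.API_KEY) {
    return res.status(401).json({ error: 'Invalid API key' });
  }
  next();
}

app.post('/api/scrape', apiKeyAuth, scrapeController.scrape);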

6.2 Rate Limiting

import rateLimit from 'express-rate-limit';

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // limit each IP to 100 requests per window
  message: 'Too many requests from this IP, please try again later'
});

app.use(limiter);

7. Testing and Debugging

7.1 Unit Test Example

import { test, expect } from '@playwright/test';
import { scrapeService } from '../src/services/scrape.service';

test.describe('ScrapeService', () => {
  test('should return page title', async () => {
    const result = await scrapeService.scrape('https://example.com');
    expect(result.success).toBe(true);
    expect(result.data.title).toContain('Example');
  });
});

7.2 Debugging Tips

// Launch in headed mode with DevTools open
const browser = await chromium.launch({
  headless: false,
  devtools: true
});

// Forward the browser console to the Node.js console
page.on('console', msg => {
  console.log('Browser console:', msg.text());
});

// Log network requests and responses
page.on('request', request => console.log('>>', request.method(), request.url()));
page.on('response', response => console.log('<<', response.status(), response.url()));
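
Playwright's built-in tracing is another useful debugging tool: record a trace around a problematic scrape and inspect it afterwards with npx playwright show-trace. A brief sketch using the browser instance launched above:

// Record a trace for a single scrape and save it for offline inspection
const context = await browser.newContext();
await context.tracing.start({ screenshots: true, snapshots: true });

const page = await context.newPage();
await page.goto('https://example.com');

// Writes a trace viewable with: npx playwright show-trace trace.zip
await context.tracing.stop({ path: 'trace.zip' });
await context.close();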

Summary

The material above traces a complete path from a basic implementation to a production-grade architecture for an API-driven scraper built with Playwright and TypeScript/JavaScript, covering:

  1. Basic API service: a simple integration of Express and Playwright
  2. Advanced features: request interception, dynamic parameters, multiple output formats
  3. Production architecture: modular design, error handling, resource management
  4. Performance and scaling: cluster support, message-queue integration
  5. Deployment and operations: Docker containerization, process management, monitoring
  6. Security: API authentication, rate limiting
  7. Testing and debugging: unit tests, debugging tips

These pieces can be combined and adapted to build a scraper API service that fits different scenarios. For higher throughput or more complex business logic, consider adding a distributed task queue, a caching layer, and similar architectural upgrades.