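Background: Medium is a well-known publishing platform. You first register with an email address. Users are either free users or paying members: free users can read three articles per month, while members can read all premium articles for 5 USD per month or 50 USD per year.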
On the topic of web scraping, I previously wrote an introductory article (link: https://www.noraxu.online/f0142a5f6230/). This post picks up where that one left off.
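Analyzing the page
This is our target page. Press F12, or right-click and choose "Inspect", to see its HTML structure, as shown in the figure below. The h2 tag holds the article title.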
Scraping the list page
    const chromium = require('chrome-aws-lambda');

    // Method of an egg.js service class; `this.app` is the egg application.
    // `tag` is accepted, but the URL below is hard-coded to the javascript tag.
    async pageOne(tag) {
      const { app } = this;

      const main = await chromium.puppeteer.launch(app.config.puppeteer.config);
      const browser = await main.createIncognitoBrowserContext();
      const page = await browser.newPage();
      await page.setUserAgent(app.config.puppeteer.agent);

      // Skip images, stylesheets, and fonts to speed up page loads.
      await page.setRequestInterception(true);
      page.on('request', req => {
        if ([ 'image', 'stylesheet', 'font' ].includes(req.resourceType())) {
          return req.abort();
        }
        return req.continue();
      });

      await page.goto(`https://medium.com/tag/javascript`, {
        waitUntil: 'domcontentloaded',
      });
      await page.waitForSelector('article');

      // Build one record per <article> card on the list page.
      const hotels = await page.$$eval('article', anchors => {
        return anchors.map(anchor => {
          const item = anchor.querySelector("a[aria-label='Post Preview Title']");
          const img = anchor.querySelector('img');
          const author = anchor.querySelector("div[style='flex: 1 1 0%;']")
            || anchor.querySelector("div[style='flex:1']");

          const one = {
            href: item.href,
            title: item.querySelector('h2').innerText,
            content: item.querySelector('p').innerText,
            thumb: img.src,
            author: author.querySelector('a').href,
            authorName: author.querySelector('p').innerText,
          };
          return one;
        });
      });

      await page.close();
      // Close the browser as well so the Chromium process does not linger.
      await main.close();
      return hotels;
    }
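The `$$eval` callback above throws as soon as a single `article` lacks the title link, the image, or the author block, and that aborts the whole batch. Below is a minimal sketch of a more defensive mapper. `text` and `toPost` are hypothetical helper names that do not appear in the original code; the selectors are copied from the listing above and may no longer match Medium's current markup.

```javascript
// Hypothetical helper: trimmed innerText of the first match, or null.
function text(root, selector) {
  const el = root && root.querySelector(selector);
  return el ? el.innerText.trim() : null;
}

// Hypothetical mapper: turns one <article> element into a plain record.
// Returns null instead of throwing when the title link is missing.
function toPost(article) {
  const item = article.querySelector("a[aria-label='Post Preview Title']");
  if (!item) return null;

  const img = article.querySelector('img');
  const author =
    article.querySelector("div[style='flex: 1 1 0%;']") ||
    article.querySelector("div[style='flex:1']");
  const authorLink = author ? author.querySelector('a') : null;

  return {
    href: item.href,
    title: text(item, 'h2'),
    content: text(item, 'p'),
    thumb: img ? img.src : null,
    author: authorLink ? authorLink.href : null,
    authorName: text(author, 'p'),
  };
}
```

Because Puppeteer serializes the function passed to `$$eval` and runs it inside the page, these helpers would have to be defined inside that callback, which could then end with `anchors.map(toPost).filter(Boolean)`.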
    // Launch options, read in pageOne as app.config.puppeteer.config.
    config: {
      ignoreHTTPSErrors: true,
      devtools: false,
      args: [
        '--disable-blink-features=AutomationControlled',
        '--no-first-run',
        '--no-zygote',
        '--single-process',
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--disable-gpu',
      ],
      dumpio: false,
    },
Printing the scraped results
Scraping the member-only detail page
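    // Assumes the puppeteer package is installed; the original snippet does not show this import.
    const puppeteer = require('puppeteer');

    async detailOne(site) {
      // PROXY_SERVER_ADDRESS is a placeholder for the proxy address.
      const browser = await puppeteer.launch({
        headless: true,
        args: ['--proxy-server=PROXY_SERVER_ADDRESS'],
      });
      const page = await browser.newPage();

      // Randomize the Chrome version reported in the User-Agent.
      const userAgent =
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/' +
        (59 + Math.round(Math.random() * 10)) +
        '.0.3497.' +
        Math.round(Math.random() * 100) +
        ' Safari/537.36';
      await page.setUserAgent(userAgent);

      // Skip scripts, images, stylesheets, and fonts.
      // Puppeteer reports JavaScript files with the resource type 'script'.
      await page.setRequestInterception(true);
      page.on('request', req => {
        if ([ 'script', 'image', 'stylesheet', 'font' ].includes(req.resourceType())) {
          return req.abort();
        }
        return req.continue();
      });

      // HTTP authentication credentials, for example for the proxy.
      await page.authenticate({
        username: 'xxxx',
        password: 'xxxx',
      });

      await page.goto(site, {
        waitUntil: 'domcontentloaded',
      });

      await page.waitForSelector('article');

      // Grab the full HTML of the article body.
      const detail = await page.$eval('article', el => el.innerHTML);
      console.log('detail: ', detail);

      await page.close();
      // Close the browser as well so the Chromium process does not linger.
      await browser.close();
      return {
        detail,
      };
    }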
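Both the list scraper and the detail scraper repeat the same request-interception logic with a different list of blocked types. As a rough sketch, it could be factored into one small factory. `makeRequestFilter` is a hypothetical name, not something the original code or Puppeteer provides. The type strings are the values returned by Puppeteer's `resourceType()`, which reports JavaScript files as `'script'`.

```javascript
// Hypothetical factory: returns a 'request' handler that aborts any request
// whose resource type is in `blockedTypes` and lets everything else through.
function makeRequestFilter(blockedTypes) {
  const blocked = new Set(blockedTypes);
  return req => (blocked.has(req.resourceType()) ? req.abort() : req.continue());
}

// Usage, after `await page.setRequestInterception(true)`:
// page.on('request', makeRequestFilter(['script', 'image', 'stylesheet', 'font']));
```

Keeping the blocked list as a parameter lets the list page keep loading scripts while the detail page blocks them.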
Saving to MySQL
Mounting the database
    module.exports = app => {
      app.beforeStart(async () => {
        // Load the MySQL settings from the config center, then mount the instance on app.
        const mysqlConfig = await app.configCenter.fetch('mysql');
        app.database = app.mysql.createInstance(mysqlConfig);
      });
    };
Saving the data
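    // Insert one scraped post into the `medium` table.
    // If the instance was mounted as app.database (as in the previous step),
    // call app.database.insert instead.
    await app.mysql.insert('medium', {
      uid: item.uid,
      href: item.href,
      title: item.title,
      content: item.content,
      thumb: item.thumb,
      author: item.author,
      authorName: item.authorName,
      detail: item.detail,
    });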
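The insert above writes `item.uid`, but none of the scraping code shown sets that field. One possible way to fill it, offered only as an assumption about what the author may have done, is to hash the post URL without its query string, so that the same post always gets the same id across hourly runs. `uidFromHref` is a hypothetical helper name.

```javascript
const { createHash } = require('crypto');

// Hypothetical helper: derives a stable uid from a post URL by hashing
// origin + path, so tracking query strings do not produce different ids.
function uidFromHref(href) {
  const url = new URL(href);
  const canonical = url.origin + url.pathname;
  return createHash('sha1').update(canonical).digest('hex');
}
```

With a unique index on the `uid` column, repeated runs of the scheduled job could then detect posts that were already saved.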
Running it automatically with a scheduled task
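    module.exports = {
      schedule: {
        immediate: true, // run once right after the application starts
        interval: '1h', // then run every hour
        type: 'all', // 'all' runs on every worker; 'worker' runs on only one
      },
      async task(ctx) {
        await ctx.service.medium.getAllPost();
      },
    };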
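The task above calls `ctx.service.medium.getAllPost()`, which the post does not show. The sketch below is one guess at how it might tie the earlier steps together: scrape the list, fetch each detail page in turn, and save the merged record. The function shape and the injected `pageOne`, `detailOne`, and `save` parameters are assumptions made so the example runs on its own; in the real service these would presumably be the methods and the MySQL insert shown earlier.

```javascript
// Hypothetical orchestration: scrape the list, fetch each detail page one at
// a time, and save each post. A failure on one post is logged and skipped.
async function getAllPost({ pageOne, detailOne, save, tag = 'javascript' }) {
  const posts = await pageOne(tag);
  let saved = 0;
  for (const post of posts) {
    try {
      const { detail } = await detailOne(post.href);
      await save({ ...post, detail });
      saved += 1;
    } catch (err) {
      console.error('failed to process', post.href, err.message);
    }
  }
  return saved; // number of posts saved successfully
}
```

Processing posts sequentially keeps only one browser open at a time, at the cost of a slower run.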
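Result
If you're interested, take a look at the finished product: https://next-blog-three-hazel.vercel.app/post/javascript?page=1
The content is for personal study only; please do not redistribute it.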