💻 Google Chrome headless mode. php
的爬虫方式是使用curl库函数去抓去静态界面抓取,所以正则匹配的时候不是整个渲染的界面,而现在的网页中,有很多的数据以及界面采用的是二次加载,前端的界面也越来越复杂,为了更好的解决这个问题,大牛们提出了不同的解决方案:
- 分析界面的JS请求,然后模拟。
- 想办法真实的模拟浏览器的请求、然后抓取Js 请求后渲染的界面。
本文就总结了第二种方式。
注意⚠️ google的浏览器的指定版本,已经开始支持了Chrome Headless ,这导致了一些第三方的工具不去再去维护他们的项目。
1. phantomjs (已经停止开发维护)
<aside> 👉 https://phantomjs.org/
Important: PhantomJS development is suspended until further notice (more details).
PhantomJS is a headless web browser scriptable with JavaScript. It runs on Windows, macOS, Linux, and FreeBSD.
Using QtWebKit as the back-end, it offers fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.
The following simple script for PhantomJS loads Google homepage, waits a bit, and then captures it to an image.
var page = require('webpage').create();
page.open('<http://www.google.com>', function() {
setTimeout(function() {
page.render('google.png');
phantom.exit();
}, 200);
});
PhantomJS is an optimal solution for:
Access webpages and extract information using the standard DOM API, or with usual libraries like jQuery.
Programmatically capture web contents, including SVG and Canvas. Create web site screenshots with thumbnail preview.
Run functional tests with frameworks such as Jasmine, QUnit, Mocha, WebDriver, etc.
Monitor page loading and export as standard HAR files. Automate performance analysis using YSlow and Jenkins.
Ready to play with PhantomJS? Install and follow the Quick Start guide.
Want to learn more? Read the FAQ, explore more examples, and study the complete API documentation.
For the source code, issue tracker, and other development information, visit github.com/ariya/phantomjs.
</aside>
2. Goolge Chrome/Chromium Headless CommandLine Function
1. Print DOM
google-chrome --headless --disable-gpu --no-sandbox --dump-dom <https://www.ted.com>
你可能找不到chrome
命令,根据命令行的tab
按键功能提示是否有这样的命令。或者使用ln -s
命令创建一个快捷键。 命令可能执行的会旧,等待一下即可。
2. 创建PDF
google-chrome --headless --disable-gpu --no-sandbox --dump-dom --print-to-pdf <https://www.ted.com>
会打印HTML字符串,并在执行命令的当前目录生成一个output.pdf
文件
3. 截屏
Selenium .(目前看没有支持PHP)
The Selenium Browser Automation Project
PHP扩展
Headless Error
1. No accept header receieved on handshake response (PHP extension composer)
Need google version :
Google Chrome 110.0.5481.177
How found your chrome Version:
google-chrome --version
Install the specifical version :
wget <http://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_110.0.5481.177-1_amd64.deb>
dpkg -i google-chrome-stable_110.0.5481.177-1_amd64.deb
Limited the chrome updated :
apt-mark hold google-chrome-stable
if you want upgrade .
apt-mark unhold google-chrome-stable