Some important tips
networkidle0 – use with SPAs that load their data with fetch requests
networkidle2 – use with pages that do long polling or other background activity
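A minimal sketch of where these options go (the URLs are just placeholders):

```js
const puppeteer = require('puppeteer')

const run = async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()

  // SPA that loads its data with fetch/XHR:
  // wait until there are no network connections for 500 ms
  await page.goto('http://example.com', { waitUntil: 'networkidle0' })

  // page with long polling or other background activity:
  // wait until there are no more than 2 network connections for 500 ms
  await page.goto('http://example.com', { waitUntil: 'networkidle2' })

  await browser.close()
}

run()
```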
How to create a simple example
#Needed libs on Ubuntu 16.04 after installing on a new server
Important: how to fix an error when installing on an Ubuntu 16.04 server
/home/user/erp/node_modules/puppeteer/.local-chromium/linux-555668/chrome-linux/chrome: error while loading shared libraries: libX11-xcb.so.1: cannot open shared object file: No such file or directory
We have to install these libs:
```bash
sudo apt-get install gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
```
#How to close browser properly
Here is an explanation of how the module's global scope behaves. You might think you have to define a global browser variable before the function, but don't; keep it in the function's local scope:
```js
const puppeteer = require('puppeteer')
const fs = require('fs')

// do not keep the browser in the global scope
// because it can't work in parallel: every user
// would be using the same browser, and if somebody
// closes the browser all the users will be affected,
// so use the local scope
// let browser = null // it's wrong

const getData = async () => {
  let browser = null // local scope (do not put it in the global scope)
  try {
    browser = await puppeteer.launch({
      headless: false, // false - to see the browser
      slowMo: 250 // to slow down processes between actions
    }) // create a new browser
    const page = await browser.newPage() // create a new page
    await page.goto('http://google.com') // go to the url
    const content = await page.content() // get the content
    fs.writeFile(`${__dirname}/data.html`, content, err => {
      if (err) throw err
      console.log('Saved');
    })

    // you can create a promise
    const pr1 = () => {
      return new Promise(async (resolve, reject) => {
        try {
          const page1 = await browser.newPage()
          await page1.goto('http://kselax.ru')
          const title = await page1.title()
          return resolve({ title })
        } catch (e) { // the browser could also be closed here
          console.log(e);
          // here you could use
          // resolve()
          // reject()
          // throw
          // whatever you want, it depends on the app's logic
        }
      })
      .catch(e => {
        // we could close the browser here
        console.log('pr1 e = ', e);
      })
    }
    const res = await pr1()
    console.log('res = ', res);

    await browser.close() // close here
  } catch (e) {
    await browser.close() // we are also able to close here
    console.log('e = ', e);
  }
}

// module.exports = getData
getData()
```
Don't use the module's global scope to manage the browser, because it will behave like this code:
```js
// let value = 10 // wrong: a module-level value is shared by reference
const valuePlusFive = () => {
  let value = 10 // a fresh local value on every call
  console.log('value = ', value);
  value += 5
  return { value }
}

module.exports = valuePlusFive
```
When somebody changes or removes value, everybody will use the changed value, because a module-level variable is shared by reference; it is not a local value. So you should define browser inside the function.
#How to wait for a needed selector
In Puppeteer you have a few wait functions:
waitFor – accepts different parameters: a selector, a function, or a number of milliseconds
waitForNavigation – used when we change the URL
waitForSelector – used when we wait for a selector to appear
waitForFunction – used when we check an element with a JS function that runs inside the browser
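Here is a small sketch of how each of them is called; it assumes an already opened page inside an async function, and the selectors are placeholders, so treat it as an illustration rather than code from the post.

```js
// waitFor accepts a selector, a function, or a number of milliseconds
await page.waitFor(1000)              // pause for 1 second
await page.waitFor('.results-table')  // wait for a selector to appear

// waitForNavigation - when a click or a form submit changes the URL
await Promise.all([
  page.waitForNavigation(),
  page.click('a.next-page')
])

// waitForSelector - wait until an element appears in the DOM
await page.waitForSelector('.results-table')

// waitForFunction - run a JS function inside the browser until it returns a truthy value
await page.waitForFunction("document.querySelectorAll('.results-table tr').length > 1")
```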
Here is an example of how to use Promise.race properly. The code below shows the right way to use it; the main function is waitForFunction, which helps check the DOM more precisely.
```js
await Promise.race([
  page1.waitForFunction("document.querySelector('.no-data__message').innerText === 'Нет записей, удовлетворяющих поиску'"),
  page1.waitFor('tr:nth-child(2) > td:nth-child(3) > a')
])
if (! await page1.$('tr:nth-child(2) > td:nth-child(3) > a')) {
  console.log('pr1 [no-data]');
  if ( await page1.$eval('.no-data__message', el => el.innerText) === 'Нет записей, удовлетворяющих поиску') {
    console.log('[Нет записей, удовлетворяющих поиску]');
    return resolve({ error: 'Нет записей, удовлетворяющих поиску' })
  }
  return resolve(null)
```
The code is hard to read; it is a real-world example that I used in one of my apps. (The Russian string 'Нет записей, удовлетворяющих поиску' means 'No records match the search'.)
#How to select a selector with a colon
I got an error when trying to select a selector like '#weeklyPublicationSelectionForm:currGazette_label':
Uncaught DOMException: Failed to execute ‘querySelector’ on ‘Document’: ‘weeklyPublicationSelectionForm:currGazette_input’ is not a valid selector.
We have to replace the ':' with '\\3a ' (the CSS hex escape for a colon followed by a space; the backslash is doubled inside a JS string), like this: '#weeklyPublicationSelectionForm\\3a currGazette_input'
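For example, something like this should work with the selector from the error above:

```js
// ':' in an id has to be escaped, otherwise querySelector treats it as a pseudo-class;
// '\3a ' is the hex escape for ':', and the backslash is doubled inside a JS string
const selector = '#weeklyPublicationSelectionForm\\3a currGazette_input'
await page.waitForSelector(selector)
await page.click(selector)
```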
#After rebooting, PM2 couldn't run an app with puppeteer
When you run pm2 log after rebooting the OS, you'll see a TimeoutError.
The message that urged me to solve the problem is https://github.com/GoogleChrome/puppeteer/issues/1347#issuecomment-357144365
```
(node:662) UnhandledPromiseRejectionWarning: TimeoutError: Timed out after 30000 ms while trying to connect to Chrome! The only Chrome revision guaranteed to work is r624492
    at Timeout.onTimeout (/home/neo/node.js/scrapper-ros-acreditation1/server/node_modules/puppeteer/lib/Launcher.js:371:14)
    at listOnTimeout (timers.js:327:15)
    at processTimers (timers.js:271:5)
(node:662) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:662) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
```
And of course, your app isn't running. The problem is that the timeout is too small; you have to increase it. Setting it to 0 means an endless timeout. You can put any time in milliseconds there; by default it is 30000.
```js
var browser = puppeteer.launch({
    headless: true,
    slowMo: 0,
    timeout: 0
  })
  .then(br => {
    console.log('Browser launched');
    browser = br;
  })
  .catch(e => {
    console.log('e = ', e)
    throw e
  })
```
#Concurrency with puppeteer: how many tabs and browser instances to open
Some links:
Threading and tasks in chrome.
Multiple processes, more than 900 requests at the same time, how can I handle?
When you use a weak server with 1 CPU, it's better to open one browser instance and no more than 4 tabs, because Chrome runs the opened tabs in turn. So when you make many concurrent requests, all the responses slow down, especially if your scraper runs for 10-30 seconds. The more CPUs the server has, the more tabs you can open. Chromium works the same way as on the desktop; it doesn't need a cluster, and all the CPUs will be loaded.
I made a test: I opened 9 tabs and made 9 simultaneous requests to the server. The server is weak, the cheapest one for $3, and it choked; it could only manage four requests. The scraper runs for about 30 seconds, which is long. So, ideally, I think it's one tab per CPU thread, but four tabs is the usual suggestion; run an experiment and find out what works better for you.
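Here is a hedged sketch of that setup: one browser instance and a small fixed pool of tabs working through a queue of links. The TAB_LIMIT value and the scraping body are placeholders to experiment with, not measured recommendations.

```js
const puppeteer = require('puppeteer')

const TAB_LIMIT = 4 // on a 1 CPU server more tabs only slow each other down

const run = async (links) => {
  const browser = await puppeteer.launch({ headless: true })
  const queue = [...links]

  // one worker per tab, each taking the next link from the queue
  const worker = async () => {
    while (queue.length) {
      const link = queue.shift()
      const page = await browser.newPage()
      try {
        await page.goto(link)
        // ... scrape the page here ...
      } catch (e) {
        console.log('failed', link, e.message)
      }
      await page.close()
    }
  }

  await Promise.all(Array.from({ length: TAB_LIMIT }, worker))
  await browser.close()
}
```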
#How to loop over links with puppeteer
Don't use page.goto inside the loop; it's better to do the loop and create a new page each time inside it, like this:
```js
// ...
await page.close()

for (let i = 0; i < permalinks.length; i++) {
  page = await browser.newPage()
  // some code ...
  // ...
  // ...
  await page.close()
}
// ...
```
This is better than always doing page.goto without closing and reopening the page. I tried it out and discovered that a page that is never refreshed eventually breaks down, and functions like page.content() and page.click() stop working. Here is my question on GitHub about it: 'await page.content() is hanging without a response'.
#How to properly click a button or link
Puppeteer's click() function has a delay option, so when you click you should use a delay. Without it, clicks sometimes get swallowed and don't register. I set it to 250 ms.
```js
page.click(selector, { delay: 250 })
```
#How to use a puppeteer timeout
There is a timeout option. When you set it to 0 the waiting is endless, so if you build an app that runs automatically, you'd rather use a specific timeout than an endless one. Always put in a concrete value, for example 2 minutes (120000 ms):
```js
page.waitForFunction(
  `document.querySelector('a[id*="someSelector"]').href !== '${currentLink2}'`,
  { timeout: 120000 } // 2 minutes; 0 would mean waiting forever
)
```
#The structure that I use when building a scraper
There are two variants of how you could structure a scraper:
```
Variant 1:
  loop over categories
    loop over pages
      save the data links to the database; define a "state" column and set it to "fail"
  extract the data links from the database
  loop over the data links
    save the data to the database (usually you have to update the current row)

Variant 2:
  loop over categories
    loop over pages
      loop over the links from this page
        save the data from each link to the database
```
Variant 1 is preferable, simply because you design the database structure and store the links to the donor pages first, then do the scraping over those links whenever you like. Variant 1 is also easier to implement. Variant 2 is straightforward to write, but updating the data later is far more difficult.
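Here is a rough sketch of Variant 1, assuming a launched browser, a categories array with page URLs, a placeholder 'a.item' link selector, and hypothetical saveLink / getFailedLinks / updateRow database helpers (they stand in for whatever storage you use):

```js
const scrapeVariant1 = async (browser, categories) => {
  // phase 1: collect the data links and mark them as not scraped yet
  for (const category of categories) {
    for (const pageUrl of category.pages) {
      const page = await browser.newPage()
      await page.goto(pageUrl)
      const links = await page.$$eval('a.item', els => els.map(el => el.href))
      for (const link of links) {
        await saveLink({ link, state: 'fail' }) // hypothetical DB helper
      }
      await page.close()
    }
  }

  // phase 2: scrape each saved link and update its row in the database
  for (const row of await getFailedLinks()) { // hypothetical DB helper
    const page = await browser.newPage()
    await page.goto(row.link)
    const data = await page.content()
    await updateRow(row.id, { data, state: 'done' }) // hypothetical DB helper
    await page.close()
  }
}
```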
the end