
Puppeteer sclerotic

Posted on 13 February, 2019 (updated 26 February, 2019) by admin


Some important tips

networkidle0 – use with SPA apps that use fetch requests

networkidle2 – use with pages that do long polling and other background activity
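
For example, both values are passed as the waitUntil option of page.goto; a minimal sketch (the url is a placeholder):

await page.goto('https://example.com', { waitUntil: 'networkidle0' }) // no network connections for 500 ms (SPA/fetch)
await page.goto('https://example.com', { waitUntil: 'networkidle2' }) // no more than 2 connections for 500 ms (long polling)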

 

How to create a simple example
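
A minimal sketch of a simple example, assuming only that the puppeteer package is installed (the url is a placeholder): launch a browser, open a page, print its title, close the browser.

const puppeteer = require('puppeteer')

const run = async () => {
  const browser = await puppeteer.launch() // headless by default
  const page = await browser.newPage()
  await page.goto('https://example.com')
  console.log(await page.title()) // print the page title
  await browser.close()
}

run()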

 

#Needed libs on Ubuntu 16.04 after installing on a new server

Important: here is an error that appears when installing on an Ubuntu 16.04 server, and how to fix it:

/home/user/erp/node_modules/puppeteer/.local-chromium/linux-555668/chrome-linux/chrome: error while loading shared libraries: libX11-xcb.so.1: cannot open shared object file: No such file or directory

We have to install these libs:
sudo apt-get install gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

 

 

#How to close the browser properly

Here is an explanation of how to handle the global module scope.

Do not define a global browser variable before the function; define it inside:

const puppeteer = require('puppeteer')
const fs = require('fs')
 
// do not create the browser in the global scope:
// it can't work in parallel there, all users
// would share the same browser, and if somebody
// closed it, every user would be affected,
// so use the local scope
 
// let browser = null // it's wrong
 
const getData = async () => {
  let browser = null // local scope (do not define it in the global scope)
  try {
    browser = await puppeteer.launch({
      headless: false, // false - to see browser
      slowMo: 250 // to slow down processes between actions
    }) // create a new browser
    const page = await browser.newPage() // create a new page
    await page.goto('http://google.com') // go to the url
    const content = await page.content() // get content
    fs.writeFile(`${__dirname}/data.html`, content, err => {
      if (err) throw err
      console.log('Saved');
    })
    
 
    // you can create a promise
    const pr1 = () => {
      return new Promise(async (resolve, reject) => {
        try {
          const page1 = await browser.newPage()
          await page1.goto('http://kselax.ru')
          const title = await page1.title()
          return resolve({ title })
        } catch(e) {
          // browser could also be closed
          console.log(e);
          // here you could use
          // resolve()
          // reject()
          // throw
          // whatever you want; it's up to you, it depends on the app's logic
        }
      })
      .catch(e => {
        // we could close the browser here
        console.log('pr1 e = ', e);
      })
    }
 
    const res = await pr1()
    console.log('res = ', res);
 
    await browser.close() // close here
  } catch(e) {
    if (browser) await browser.close() // close only if the launch succeeded
    console.log('e = ', e);
  }
}
 
// module.exports = getData
 
getData()

Don't use the global scope in modules to manage the browser, because it will behave like this code:

// let value = 10 // wrong: a module-scope variable is shared by everything that requires this module
 
const valuePlusFive = () => {
  let value = 10 // local scope: every call starts from a fresh value
  console.log('value = ', value);
  value += 5
  return { value }
}
 
module.exports = valuePlusFive

When somebody changes or removes a module-scope value, everybody else sees the change: it behaves like a shared reference, not a local value. That is why you should define browser inside the function.

 

#How to wait for a needed selector

In puppeteer you have a few wait functions; a short sketch of each follows the list.

waitFor – accepts different parameters (a delay in ms, a selector, or a function)

waitForNavigation – use it when the url changes

waitForSelector – use it when you wait for a selector to appear

waitForFunction – use it when you check an element with a js function inside the browser
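
A minimal sketch of each one (the selectors and the js condition are placeholders):

await page.waitFor(1000) // a number waits 1000 ms; it also accepts a selector or a function
await Promise.all([
  page.waitForNavigation(), // resolves after the url change...
  page.click('a.next') // ...triggered by this click
])
await page.waitForSelector('#results') // resolves when the selector appears in the DOM
await page.waitForFunction("document.querySelectorAll('#results tr').length > 1") // a js check inside the browser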

Here is an example of how to use Promise.race properly.

The code below shows how to use it properly; the main function is waitForFunction, which helps check the DOM more precisely.

await Promise.race([
  page1.waitForFunction("document.querySelector('.no-data__message').innerText === 'Нет записей, удовлетворяющих поиску'"),
  page1.waitFor('tr:nth-child(2) > td:nth-child(3) > a')
])
if (! await page1.$('tr:nth-child(2) > td:nth-child(3) > a')) {
  console.log('pr1 [no-data]');
  if ( await page1.$eval('.no-data__message', el => el.innerText) === 'Нет записей, удовлетворяющих поиску') {
    console.log('[Нет записей, удовлетворяющих поиску]');
    return resolve({ error: 'Нет записей, удовлетворяющих поиску' })
  }
  return resolve(null)
}

The code is difficult to understand; it is a real-world example that I used in one of my apps (the Russian string 'Нет записей, удовлетворяющих поиску' means 'No records match the search').

 

#How to select a selector with a colon in it

I got an error when trying to select a selector like '#weeklyPublicationSelectionForm:currGazette_label':

Uncaught DOMException: Failed to execute ‘querySelector’ on ‘Document’: ‘weeklyPublicationSelectionForm:currGazette_input’ is not a valid selector.

We have to replace the ':' with the escape '\\3a ' (the CSS hex code for a colon, followed by a space), like this: '#weeklyPublicationSelectionForm\\3a currGazette_input'.
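
For example, a sketch reusing the ids from the error above; both lines select the same element:

const input = await page.$('#weeklyPublicationSelectionForm\\3a currGazette_input') // \3a is the escaped ':'
const input2 = await page.$('[id="weeklyPublicationSelectionForm:currGazette_input"]') // an attribute selector needs no escaping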

 

#PM2 couldn't run an app with puppeteer after rebooting

When you run pm2 log after rebooting the OS, you'll see a TimeoutError:

The message that urged me to solve the problem is https://github.com/GoogleChrome/puppeteer/issues/1347#issuecomment-357144365

(node:662) UnhandledPromiseRejectionWarning: TimeoutError: Timed out after 30000 ms while trying to connect to Chrome! The only Chrome revision guaranteed to work is r624492
at Timeout.onTimeout (/home/neo/node.js/scrapper-ros-acreditation1/server/node_modules/puppeteer/lib/Launcher.js:371:14)
at listOnTimeout (timers.js:327:15)
at processTimers (timers.js:271:5)
(node:662) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:662) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

 

And of course your app isn't running. The problem is the short timeout; you have to increase it. Setting it to 0 means an endless timeout, or you can put any time there in milliseconds (the default is 30000):

var browser = puppeteer.launch({
  headless: true,
  slowMo: 0,
  timeout: 0 // 0 disables the 30000 ms launch timeout
})
  .then(br => {
    console.log('Browser launched');
    browser = br;
  })
  .catch(e => {
    console.log('e = ', e)
    throw e
  })

 

#Concurrency with puppeteer. How many tabs and browser instances to open

Some links:
Threading and tasks in chrome.
Multiple processes, more than 900 requests at the same time, how can I handle?

When you use a weak server with 1 CPU, it is better to open one browser instance and no more than 4 tabs, because chrome runs the opened tabs in turn. So when you make many concurrent requests it will slow down all the responses, especially if your scraper works for 10–30 seconds. The more CPUs a server has, the more tabs you can open. Chromium works as on the desktop: it doesn't need a cluster, and all the CPUs will be loaded.
I made a test: I opened 9 tabs and made 9 simultaneous requests to the server. The server was weak, the cheapest for $3, and it choked; it could only manage four requests. The scraper executed for 30 seconds, which is long. So ideally I think one tab per thread per CPU, but four tabs are a suggested default; run an experiment and find out what is better for you. A sketch of such a tab pool follows.
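
A minimal sketch of such a pool (my own illustration, not the exact code from the test; urls is assumed to be an array of links):

const puppeteer = require('puppeteer')

const MAX_TABS = 4 // assumption: about 4 tabs for a weak 1-CPU server

const scrapeAll = async (urls) => {
  const browser = await puppeteer.launch({ headless: true })
  const queue = urls.slice() // copy, so workers can shift() from it
  const results = []

  // each worker reuses one "slot": take the next url, open a tab, scrape, close the tab
  const worker = async () => {
    while (queue.length) {
      const url = queue.shift()
      const page = await browser.newPage()
      try {
        await page.goto(url)
        results.push({ url, title: await page.title() })
      } catch (e) {
        results.push({ url, error: e.message })
      } finally {
        await page.close()
      }
    }
  }

  // run MAX_TABS workers in parallel, then close the browser
  await Promise.all(Array.from({ length: MAX_TABS }, worker))
  await browser.close()
  return results
}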

 

#How to loop over links with puppeteer

Don't use page.goto inside the loop; you'd better do the loop and create a new page each time inside it, like this:

// ...
await page.close()
for (let i = 0; i < permalinks.length; i++) {
  page = await browser.newPage()
 
  // some code ...
  // ...
  // ...
 
  await page.close()
}
// ...

This is better than always doing page.goto without closing/opening the page. I tried it out and discovered that a page that is never refreshed soon breaks down, and functions like page.content() and page.click() stop working. Here is my question on GitHub about it: await page.content() is hanging without a response.

 

#How to properly click a button or link

Puppeteer's click() function has a delay option, so when you click you should use a delay; without one the click will sometimes miss. I set 250 ms:

await page.click(selector, { delay: 250 })

 

#How to use a puppeteer timeout

There is a timeout option. When you set it to 0 the waiting is endless, so if you build an app that works automatically, you'd rather use a certain timeout than an endless one. Always put some concrete value, for example 2 minutes (120000 ms):

await page.waitForFunction(`document.querySelector('a[id*="someSelector"]').href !== '${currentLink2}'`, { timeout: 120000 }) // 2 minutes, instead of an endless { timeout: 0 }

 

#The structure that I should use when I do a scraper

There are two variants of how you could do scrapers:

 

Variant 1:
looping categories
  looping pages
    save the datalinks to the database, define a state column and set it to "fail"
 
extract the datalinks from the database
looping the datalinks
  save the data to the database (usually you have to update the current row)
 
 
 
Variant 2:
looping categories
  looping pages
    looping links from this page
      save the data from the link to the database

 

Variant 1 is preferable, simply because you design the database structure and store the link to the donor page first, then scrape by those links at any time you like. Variant 1 is easier to implement. Variant 2 is straightforward, but updating the data later is a lot more difficult. A sketch of Variant 1 follows.
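
A minimal sketch of Variant 1 (assumptions: a plain array stands in for the database, and '.item a' is a placeholder selector for the datalinks on a category page):

const db = [] // each row: { link, state, data }

// pass 1: loop the categories/pages and save the datalinks with state "fail"
const collectLinks = async (page, categoryUrls) => {
  for (const url of categoryUrls) {
    await page.goto(url)
    const links = await page.$$eval('.item a', as => as.map(a => a.href))
    for (const link of links) db.push({ link, state: 'fail', data: null })
  }
}

// pass 2: extract the datalinks from the database and update each row with the scraped data
const scrapeLinks = async (browser) => {
  for (const row of db.filter(r => r.state === 'fail')) {
    const page = await browser.newPage()
    await page.goto(row.link)
    row.data = await page.content()
    row.state = 'ok' // a re-run will retry only the rows still marked "fail"
    await page.close()
  }
}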

the end
