How to Scrape Data from Web Pages Using Node.js/Express by gotgame

Web scraping tutorials tend to focus on languages like Python and PHP, which already have popular modules for the job. Much less has been written about how to do the same thing with Node.js/JavaScript.

This post shows exactly how you can scrape data from the web using Node.js/JavaScript.

We are going to use three packages to create our web scraping module, so we need to install them in our Node project.

The packages are:

- Cheerio
- Request
- Request Promise

Once you have set up a working Node.js server for your project, go to the project terminal and install the three packages using this command

```
npm install request cheerio request-promise
```

[Cheerio](https://github.com/cheeriojs/cheerio) is a lean implementation of core jQuery for the server. It parses HTML and lets us query it with jQuery-style selectors from the back-end.

[Request](https://www.npmjs.com/package/request) and [request-promise](https://github.com/request/request-promise) are Node.js packages that will be used to make HTTP requests. `request-promise` wraps `request` so that each call returns a promise.

Create a new file in the root directory of the project and name it `scrape.js`. In the file, add the following starter code as boilerplate

```
const scraper = () => {
	console.log('Scraping tool')
}

module.exports = scraper
```

In `app.js`, which is in our project root directory, add the code

```
// run scraper

const scrape = require('./scrape');
scrape()
```
below the line

```
app.use('/users', usersRouter);
```

Save all files and restart the server. You should get something like the following in your console

![Console output showing the "Scraping tool" message](https://i.ibb.co/cTTfRrs/new.png)

In `scrape.js`, replace the contents of the file with the following code

```
const requestPromise = require('request-promise');
const url = 'https://cointelegraph.com/tags/cryptocurrencies';

const scraper = () => {
	requestPromise(url)
		.then(function(html){
			// success: html is the raw HTML of the page
			console.log(html);
		})
		.catch(function(err){
			// handle error
			console.log(err)
		});
}

module.exports = scraper
```

Run your server again and you should get something like this in your terminal

![Terminal output showing the raw HTML of the page](https://i.ibb.co/4NCgvL4/new.png)

The code above uses the `request-promise` library that we installed earlier to fetch the HTML contents of a given URL and log them to the console.

In this case the URL is stored in the variable `url`. The library is imported as `requestPromise`, a function that takes `url` as an argument and returns a promise that resolves to the HTML contents of this page [https://cointelegraph.com/tags/cryptocurrencies](https://cointelegraph.com/tags/cryptocurrencies), which lists the latest crypto news on the Cointelegraph website.
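The same promise chain can also be written with `async`/`await`. Here is a minimal sketch of that idea. The names `fetchPage`, `fetchHtml` and `fakeFetch` are hypothetical and not part of the project: `fetchHtml` stands in for `requestPromise` (any function that takes a URL and returns a promise for the HTML), which lets the snippet run without a network call.

```javascript
// fetchHtml is a stand-in for requestPromise: a function that takes a URL
// and returns a promise that resolves to the page's HTML string.
const fetchPage = async (fetchHtml, url) => {
  try {
    // await pauses here until the promise resolves, like .then()
    return await fetchHtml(url);
  } catch (err) {
    // a rejected promise lands here, like .catch()
    console.log(err);
    return null;
  }
};

// Stub used only for this demo, so no request is actually sent.
const fakeFetch = async (url) => `<html><body>Fetched ${url}</body></html>`;

fetchPage(fakeFetch, 'https://cointelegraph.com/tags/cryptocurrencies')
  .then((html) => console.log(html));
```

In the real project you would pass `requestPromise` itself as the first argument instead of the stub.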

After getting the HTML code from the page, we need to parse it and extract the data we need.

Open the page we scraped in the Chrome browser, right-click the element you want to scrape, then click Inspect to view that element in the Chrome inspector.

![Chrome inspector open on a news title element](https://i.ibb.co/j9CrMzR/new.png)

Once we have inspected the element we want to scrape (in this case, the title of each news piece on the page), we can use `Cheerio` to parse the HTML for those titles and extract what we need.

Replace the code in `scrape.js` with the following code

```
const requestPromise = require('request-promise');
const cheerio = require('cheerio');
const url = 'https://cointelegraph.com/tags/cryptocurrencies';

const scraper = () => {
	requestPromise(url)
		.then(function(html){
			// success: load the HTML into Cheerio, then select the title spans
			const $ = cheerio.load(html)
			const newsHead = $('a > span.post-card-inline__title').toArray()
			const newsTitles = []

			for (let i = 0; i < newsHead.length; i++) {
				newsTitles.push({
					// the parent <a> of each span holds the link path
					newsLink: `https://www.cointelegraph.com${newsHead[i].parent.attribs.href}`,
					// the first child of the span is its text node
					newsTitle: `${newsHead[i].children[0].data}`
				})
			}

			console.log(newsTitles)
		})
		.catch(function(err){
			// handle error
			console.log(err)
		});
}

module.exports = scraper
```

The code above takes each element that we selected from the crypto news page and extracts two pieces of data, which are

- Link to the actual news content
- The news title
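The link is built by putting the site address in front of each `href`, which assumes every `href` is a relative path. As a hedged alternative, Node's built-in `URL` class can resolve the link and also cope with an `href` that is already absolute. The helper name `resolveLink` and the sample paths below are made up for illustration.

```javascript
const BASE_URL = 'https://www.cointelegraph.com';

// Resolve an href against the site address using the built-in URL class.
// Relative paths are joined to BASE_URL, absolute URLs are kept as they are.
const resolveLink = (href) => new URL(href, BASE_URL).href;

console.log(resolveLink('/news/example-article'));
// → https://www.cointelegraph.com/news/example-article

console.log(resolveLink('https://example.com/other'));
// → https://example.com/other
```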

We then store the data for each news piece in an object and the object is put into an array.
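To make the shape of that data clearer, here is a small sketch that runs without Cheerio. `sampleNode` is a plain object containing only the fields the loop reads from each node (`parent.attribs.href` and `children[0].data`), and its values are invented. `toNewsItem` is a hypothetical helper name, and `map` is used here in place of the `for` loop.

```javascript
// A plain object shaped like one node from the array the loop walks over.
// Only the fields the loop reads are included, and the values are made up.
const sampleNode = {
  parent: { attribs: { href: '/news/example-article' } },
  children: [{ data: 'Example headline' }]
};

// Build one news object from one node, the same way the loop body does.
const toNewsItem = (node) => ({
  newsLink: `https://www.cointelegraph.com${node.parent.attribs.href}`,
  newsTitle: `${node.children[0].data}`
});

// map() returns a new array with one news object per node.
const newsTitles = [sampleNode].map(toNewsItem);

console.log(newsTitles);
```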

If you run the server now and check the terminal, you should get a result like the image below, which displays an array listing each news object

![Terminal output showing the array of news objects](https://i.ibb.co/0F7W3sY/new.png)

That is how we can scrape data from a web page and use it for our own purposes.
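The scraper in this post only logs its result. To use the data somewhere else, for example in an Express route, the function has to return its promise. The sketch below shows one possible way to do that and is not part of the original project: `scrapeTitles`, `newsHandler` and the `fake*` stand-ins are hypothetical names, with `fetchHtml` standing in for `requestPromise` and `extractTitles` standing in for the Cheerio parsing step.

```javascript
// Return the promise so that callers receive the scraped array.
const scrapeTitles = (fetchHtml, extractTitles, url) =>
  fetchHtml(url).then((html) => extractTitles(html));

// Express-style handler: sends the array as JSON, or a 500 on failure.
const newsHandler = (scrape) => (req, res) =>
  scrape()
    .then((titles) => res.json(titles))
    .catch((err) => res.status(500).json({ error: err.message }));

// Stand-ins used only for this demo, so it runs without Express or a network.
const fakeFetch = async () => '<span>Example headline</span>';
const fakeExtract = (html) => [{ newsTitle: html.replace(/<[^>]+>/g, '') }];
const fakeRes = {
  status(code) { console.log('status', code); return this; },
  json(body) { console.log(JSON.stringify(body)); }
};

newsHandler(() => scrapeTitles(fakeFetch, fakeExtract, 'https://example.com'))({}, fakeRes);
```

In an Express app, the handler could be mounted with something like `app.get('/news', newsHandler(...))`, where the `/news` path is also just an example.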

You can use this approach to get data from any page whose content is present in the HTML the server returns. Try it out and share your opinions in the comments.
Published: 2020-06-29 00:31:57
@joelagbo ·
I love javascript, even though I'm only good at Reactjs and vanilla javascript I know javascript as a 'language of all possibilities' and this tutorial proved it once again. Bookmarked!
@gotgame ·
Thanks for dropping by, glad you love the piece.