How to Scrape Data from Web Pages Using Node.js/Express by gotgame

hive-175254 · @gotgame · Jun 29 '20

Web scraping with Node.js/JavaScript gets far less coverage than scraping with languages such as Python or PHP, which already have popular modules for the job.

This post is going to teach you exactly how you can scrape data from the web using Node.js/JavaScript.

We are going to use three packages to build our web scraping module, so we need to install them in our Node project.

The packages are:

- Cheerio
- Request
- Request Promise

Once you have set up a working Node.js server for your project, open a terminal in the project directory and install the three packages using this command

```
npm install request cheerio request-promise
```

[Cheerio](https://github.com/cheeriojs/cheerio) is a lean implementation of core jQuery built for the server. It lets us parse HTML and query it with jQuery-style selectors from the back-end.

[Request](https://www.npmjs.com/package/request) and [request-promise](https://github.com/request/request-promise) are Node.js packages that we will use to make HTTP requests.

Create a new file in the root directory of the project and name it `scrape.js` (any name works, as long as you use the same name when requiring the file later). In the file, add the following starter code as boilerplate

```
// Placeholder scraper: only logs a message for now
const scraper = () => {
	console.log('Scraping tool')
}

module.exports = scraper
```

In `app.js`, which is in our project root directory, add the code

```
// run scraper

const scrape = require('./scrape');
scrape()
```
below the line

```
app.use('/users', usersRouter);
```

Save all files and restart the server. You should see `Scraping tool` printed in your console, as in the following screenshot

![enter image description here](https://i.ibb.co/cTTfRrs/new.png)

Next, replace the contents of `scrape.js` with the following code

```
const requestPromise = require('request-promise');
const url = 'https://cointelegraph.com/tags/cryptocurrencies';

const scraper = () => {
	// requestPromise returns a promise that resolves to the page's HTML
	requestPromise(url)
		.then(function(html){
			// success: log the raw HTML
			console.log(html);
		})
		.catch(function(err){
			// the request failed: log the error
			console.log(err)
		});
}

module.exports = scraper
```

Run your server again and you should get something like this in your terminal

![enter image description here](https://i.ibb.co/4NCgvL4/new.png)

The code above uses the `request-promise` library that we installed earlier to fetch the HTML contents of a given URL and log them to the console.

In this case the URL is stored in the variable `url`. We call the library through `requestPromise`, which takes `url` as an argument and returns a promise that resolves to the HTML of [https://cointelegraph.com/tags/cryptocurrencies](https://cointelegraph.com/tags/cryptocurrencies), a page listing the latest crypto news on the Cointelegraph website.
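
The same promise flow can also be written with `async`/`await`. The sketch below is an illustration rather than part of the original module: `fetchPage` and `stubFetch` are hypothetical names, and `stubFetch` stands in for `requestPromise` so that the example runs without a network connection.

```javascript
// Hypothetical helper: `fetchHtml` is any function that takes a URL and
// returns a promise for the page's HTML, which is how requestPromise(url) behaves.
const fetchPage = async (fetchHtml, url) => {
	try {
		const html = await fetchHtml(url)
		return { ok: true, html }
	} catch (err) {
		// A rejected promise lands here, like the .catch() handler above
		return { ok: false, error: err.message }
	}
}

// Stand-in for requestPromise (an assumption for this sketch): resolves with
// fake HTML for https URLs and rejects for anything else.
const stubFetch = (url) =>
	url.startsWith('https://')
		? Promise.resolve(`<html><body>Fetched ${url}</body></html>`)
		: Promise.reject(new Error('unsupported URL'))

fetchPage(stubFetch, 'https://cointelegraph.com/tags/cryptocurrencies')
	.then((result) => console.log(result.ok, result.html.length))
```

In `scrape.js` you would pass `requestPromise` itself as the first argument instead of the stub.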

After getting the HTML from the page, we need to parse it and extract the data we want.

Open the page we scraped in Chrome, right-click the element you want to scrape, and click Inspect to view that element in the Chrome inspector.

![enter image description here](https://i.ibb.co/j9CrMzR/new.png)

Once we have inspected the element we want to scrape (in this case, the title of each news piece on the page), we can use `Cheerio` to parse the HTML for those titles and extract what we need.

Replace the code in `scrape.js` with the following code

```
const requestPromise = require('request-promise');
const cheerio = require('cheerio');
const url = 'https://cointelegraph.com/tags/cryptocurrencies';

const scraper = () => {
	requestPromise(url)
		.then(function(html){
			// success: load the HTML into Cheerio so we can query it with selectors
			const $ = cheerio.load(html)
			// every title <span> that is a direct child of a link
			const newsHead = $('a > span.post-card-inline__title').toArray()
			const newsTitles = []

			for (let i = 0; i < newsHead.length; i++) {
				newsTitles.push({
					// the parent <a> holds the relative link to the article
					newsLink: `https://www.cointelegraph.com${newsHead[i].parent.attribs.href}`,
					// the first child of the <span> is its text node
					newsTitle: `${newsHead[i].children[0].data}`
				})
			}

			console.log(newsTitles)
		})
		.catch(function(err){
			// the request failed: log the error
			console.log(err)
		});
}

module.exports = scraper
```

The code above takes each element that we selected from the crypto news page and extracts two pieces of data:

- Link to the actual news content
- The news title

We then store the data for each news piece in an object and the object is put into an array.
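
As a sketch of that step in isolation, the helper below builds the same array of objects from already-parsed nodes. `toNewsItems` and `sampleNodes` are hypothetical names introduced for illustration. The sample objects only mimic the `parent.attribs.href` and `children[0].data` shape used in the scraper, and `new URL()` is swapped in for the string concatenation so that relative and absolute links are both handled.

```javascript
// Hypothetical helper: turns parsed <span> nodes into { newsLink, newsTitle } objects.
const toNewsItems = (nodes, baseUrl) =>
	nodes
		// skip nodes whose parent link has no href
		.filter((node) => node.parent && node.parent.attribs && node.parent.attribs.href)
		.map((node) => ({
			// resolve the href against the site's base URL
			newsLink: new URL(node.parent.attribs.href, baseUrl).href,
			// first child is the text node; fall back to an empty title if it is missing
			newsTitle: (node.children[0]?.data ?? '').trim()
		}))

// Made-up sample data (not real scraped output)
const sampleNodes = [
	{ parent: { attribs: { href: '/news/example-story' } }, children: [{ data: ' Example story ' }] },
	{ parent: { attribs: {} }, children: [{ data: 'No link' }] }
]

console.log(toNewsItems(sampleNodes, 'https://cointelegraph.com'))
```

Keeping this step in its own function makes it possible to check the output shape without requesting the live page.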

If you run the server now and check the terminal, you should get a result like the image below, which displays an array listing each news object


![enter image description here](https://i.ibb.co/0F7W3sY/new.png)

That shows how we can scrape data from a web page and use it for our own purposes.

You can adapt this approach to other pages by changing the URL and the selector, as long as the data you want is present in the HTML the server returns. Try it out and share your opinions in the comments.

@gitplait-mod1 · Jun 29 '20

Thanks for sharing an amazing Javascript tutorial. We are looking for people like you in our platform.
<sub> Your post has been submitted to be curated with @gitplait community account because this is the kind of publications we like to see in our community. </sub>

Join our [Community on Hive](https://hive.blog/trending/hive-103590) and Chat with us on [Discord](https://discord.gg/CWCj3rw).

[[Gitplait-Team]](https://gitplait.tech/)

@joelagbo · Jun 29 '20

I love javascript, even though I'm only good at Reactjs and vanilla javascript I know javascript as a 'language of all possibilities' and this tutorial proved it once again. Bookmarked!

@gotgame · Jun 29 '20

Thanks for dropping by, glad you love the piece.
