Using Scrapy to automatically fetch Douban's weekly popular movies by aafeng

hive-105017 · @aafeng · Jun 29 '20 (edited)


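Since we started staying at home, my family has picked up a habit: every Friday evening we watch a movie together. Choosing the movie always takes too long, though. So I decided to write a small program that, every Friday afternoon, scrapes the week's popular movies from movie review sites and sends me a message or email that I can use as a shortlist for the evening. This could be done with urllib alone, but below I use Scrapy, a crawling framework for Python.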


## Installing Scrapy and creating the project

First, install Scrapy:

    pip install Scrapy

Next, create a project:

    scrapy startproject douban

## Adding the core code

### items.py

First, edit items.py:

    import scrapy

    class DoubanItem(scrapy.Item):
        name = scrapy.Field()

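DoubanItem is a subclass of scrapy.Item with a single field, name, which holds the movie title.

### Finding the path to the movie titles with Scrapy shell

To get the correct path to the movie titles, use the browser's developer tools together with Scrapy's built-in command-line tool, Scrapy shell.

Open https://movie.douban.com/ in a browser and inspect the page in the developer tools: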

![image.png](https://images.hive.blog/DQmPREtpb4dzg1KrsnmkqLhXPn2a3H2sADZwTCNaN92ivnV/image.png)

The core code below locates the movie titles through this chain of nested elements:

    <div class="billboard-bd">
    <td class="title">
    <a>
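
To see how that chain of elements translates into an XPath query, here is a minimal sketch that runs the same expression against a small, made-up HTML fragment. It uses the standard library's ElementTree as a stand-in for Scrapy's selectors; ElementTree supports only a subset of XPath and needs well-formed markup, so this illustrates the path only and does not replace Scrapy shell. The fragment, the `order` cells, and the titles "Movie A" and "Movie B" are assumptions, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the billboard markup on the page.
SAMPLE_HTML = """
<html><body>
  <div class="billboard-bd">
    <table>
      <tr><td class="order">1</td><td class="title"><a href="#">Movie A</a></td></tr>
      <tr><td class="order">2</td><td class="title"><a href="#">Movie B</a></td></tr>
    </table>
  </div>
  <div class="other">
    <table><tr><td class="title"><a href="#">Not in the billboard</a></td></tr></table>
  </div>
</body></html>
"""


def extract_titles(markup):
    """Return the link texts found under div.billboard-bd > td.title > a."""
    root = ET.fromstring(markup.strip())
    links = root.findall(".//div[@class='billboard-bd']//td[@class='title']/a")
    return [link.text.strip() for link in links]


print(extract_titles(SAMPLE_HTML))
```

The link inside the second `div` is skipped because its ancestor does not have the `billboard-bd` class, which is the reason for anchoring the path on that `div`.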


### DoubanSpider

Next, edit the file douban/spiders/douban_spider.py:

    import scrapy


    class DoubanSpider(scrapy.Spider):
        name = "douban"
        # allowed_domains takes bare domain names, not full URLs
        allowed_domains = ["movie.douban.com"]
        start_urls = [
            "https://movie.douban.com/"
        ]

        def parse(self, response):
            # Titles are the link texts inside the weekly billboard table
            movie_list = response.xpath(
                "//div[@class='billboard-bd']//td[@class='title']/a/text()"
            ).getall()

            print(movie_list)
            # Save the list (as its Python repr) for later use
            filename = "/var/tmp/movielist.txt"
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(str(movie_list))

DoubanSpider inherits from scrapy.Spider. The implementation above overrides the parse method to supply its own processing logic.

Try running it:

    scrapy crawl douban

The log shows that Douban returned a 403 error, which is caused by its anti-crawler mechanism.

Open douban/settings.py and add the following line:

    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0"
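
For comparison, the plain urllib route mentioned in the introduction would need the same browser-like header; the site presumably rejects clients whose User-Agent identifies them as a crawler. The following is a rough sketch under that assumption. The helper names `build_request` and `fetch` are hypothetical, and `fetch` performs a real network request, so it is defined here but not called.

```python
import urllib.request

# Same browser-like value as the USER_AGENT setting above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) "
    "Gecko/20100101 Firefox/77.0"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body; this contacts the network when called."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


request = build_request("https://movie.douban.com/")
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```

In Scrapy, the single `USER_AGENT` setting does the same job for every request the spider makes.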

Try again, and this time it succeeds! The output looks something like this:

    2020-06-26 15:49:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2020-06-26 15:49:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/robots.txt> (referer: None)
    2020-06-26 15:49:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/> (referer: None)
    ['默片解说员', '数电影的人', '拍拖故事', '若水', '知晓天空之蓝的人啊', '二十世纪', '房子的故事', '温德米尔儿童', '翻译疑云', '乳牙']
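
Because the spider writes `str(movie_list)` to the file, the file holds the text of a Python list literal instead of JSON or one title per line. A minimal sketch of reading it back is shown here; the helper name `load_movie_list` is hypothetical, and the demonstration uses a temporary file in place of /var/tmp/movielist.txt.

```python
import ast
import os
import tempfile


def load_movie_list(path):
    """Parse a file that contains the repr of a Python list of strings."""
    with open(path, encoding="utf-8") as f:
        movies = ast.literal_eval(f.read())
    if not isinstance(movies, list):
        raise ValueError("expected a list of titles")
    return movies


# Demonstration: write a list the same way the spider does, then read it back.
sample = ["若水", "乳牙"]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "movielist.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(str(sample))
    print(load_movie_list(sample_path))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for a file like this.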

This first version never uses the items.py defined earlier. To use it, change douban_spider.py to:

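    import scrapy

    from douban.items import DoubanItem


    class DoubanSpider(scrapy.Spider):
        name = "douban"
        # allowed_domains takes bare domain names, not full URLs
        allowed_domains = ["movie.douban.com"]
        start_urls = [
            "https://movie.douban.com/"
        ]

        def parse(self, response):
            titles = response.xpath(
                "//div[@class='billboard-bd']//td[@class='title']/a/text()"
            ).getall()
            for title in titles:
                # One DoubanItem per movie; yielded items go to the feed exporter
                yield DoubanItem(name=title)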

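Note that to get readable Chinese characters in the output, rather than `\uXXXX` escape sequences, you need to add this to settings.py: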

    FEED_EXPORT_ENCODING = 'utf-8'
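
The effect of this setting can be illustrated with the standard `json` module. The sketch below does not call Scrapy's exporter; it only mirrors the difference between ASCII-escaped and UTF-8 JSON output, using one title from the sample output as data.

```python
import json

item = {"name": "若水"}

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode to the same data; only the file's readability differs.
print(json.loads(escaped) == json.loads(readable))
```

Either form is valid JSON, so the setting matters mainly when a person is going to read the exported file.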

Run it again:

    scrapy crawl douban -o movies.json

The output is:

![image.png](https://images.hive.blog/DQmYXWujjXG8qc3Nv1ad9tftLH1VJNtPcNbcwwoVGj17Jv8/image.png)

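The results are also saved in the JSON file. From there, the movie information can be sent to yourself. The same approach can be used to collect data from other movie review sites, merge it, and send everything in one message.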
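
As a rough sketch of that last step, the exported JSON could be turned into an email message with the standard library alone. This assumes movies.json holds a JSON array of objects with a `name` key, as produced by the spider above. The helper names, the subject line, and the example.com addresses are all hypothetical.

```python
import json
from email.message import EmailMessage


def load_titles(json_text):
    """Extract the 'name' field from a JSON array of exported items."""
    return [entry["name"] for entry in json.loads(json_text)]


def build_message(titles, sender, recipient):
    """Compose a plain-text email that lists the titles, one per line."""
    message = EmailMessage()
    message["Subject"] = "This week's popular movies on Douban"
    message["From"] = sender
    message["To"] = recipient
    lines = [f"{i}. {title}" for i, title in enumerate(titles, start=1)]
    message.set_content("\n".join(lines))
    return message


# Inline sample standing in for the contents of movies.json.
sample_export = '[{"name": "若水"}, {"name": "乳牙"}]'
msg = build_message(load_titles(sample_export), "me@example.com", "me@example.com")
print(msg["Subject"])
print(msg.get_content())
```

Actually sending the message could then be done with `smtplib.SMTP(...).send_message(msg)`, given the host and credentials of whatever mail server you use; those details are left out here because they depend on your setup.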


