Soccer Predictions using Python (part 1) by stevencurrie

<html>
<p>I've seen many articles online describing how the Poisson distribution could potentially be used to predict soccer scores. &nbsp;However, I haven't seen much in the way of practical examples.</p>
<p>It seems to me that this might be an ideal subject for my first post on Steemit.</p>
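<p>For anyone who hasn't met the Poisson distribution before, here's a minimal sketch of the basic idea using only the standard library. &nbsp;The function name <code>poisson_pmf</code> and the figure of 1.5 expected goals are made up for illustration; they aren't taken from any real data.</p>

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# hypothetical figure: a team expected to score 1.5 goals per game on average
expected_goals = 1.5

# probability of the team scoring exactly 0, 1, 2, 3 or 4 goals
for goals in range(5):
    print(goals, round(poisson_pmf(goals, expected_goals), 3))
```

<p>The probabilities over all possible goal counts add up to 1, so each number can be read as the chance of that exact score for one team.</p>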
<p>Now, before you read any further, I'm a hobbyist, so don't expect high quality code! &nbsp;I've just recently started learning Python so this will be a learning process for me more than anything. &nbsp;I taught myself 6502 assembly language in my early teens and wrote a couple of (unpublished) games on the Commodore 64. &nbsp;I was also involved in the demo scene and coded many a fine starfield. &nbsp;Later, I moved to the Amiga and 68000 code, plus some C, before everything went PC and I ended up using Visual Basic.</p>
<p>I drifted away from programming for a long time but now that I'm using Linux, I've got the bug again, so I've been learning Python. &nbsp;Big shout out to <a href="https://pythonprogramming.net">pythonprogramming.net</a> &amp; <a href="https://www.youtube.com/user/sentdex">www.youtube.com/user/sentdex</a> for getting me this far.</p>
<p>But enough about me, let's get to work.</p>
<p>First thing we need is some data, so we're going to use Beautiful Soup to scrape some from the web.</p>
<p>There are plenty of places to find historical soccer results, some with ready-made .CSV files that can be downloaded, but I've decided to scrape raw data from the Soccer Punter website (<a href="https://www.soccerpunter.com">www.soccerpunter.com</a>). &nbsp;There are a couple of reasons for this choice, but the main one is that the available .CSVs usually seem to take a few days to be updated with the latest results.</p>
<p>I'm going to assume you all know how to install pandas, beautifulsoup and selenium (plus lxml and PhantomJS, which the code below also relies on) or are clever enough to find out how elsewhere. ;-)</p>
<p>Here's what I've come up with so far...</p>
<blockquote><strong>import </strong>pandas <strong>as </strong>pd<br>
<strong>from </strong>bs4 <strong>import </strong>BeautifulSoup <strong>as </strong>bs<br>
<strong>from </strong>selenium <strong>import </strong>webdriver<br>
<strong>import </strong>datetime<br>
<br>
<strong>def </strong>scrapeseason(country, comp, season):<br>
&nbsp;&nbsp;&nbsp;<em># output what the function is attempting to do.</em><br>
<em>&nbsp;&nbsp;&nbsp;</em>print(<strong>"Scraping:"</strong>, country, comp, str(season)+<strong>"-"</strong>+str(season+1))<br>
&nbsp;&nbsp;&nbsp;baseurl = <strong>"http://www.soccerpunter.com/soccer-statistics/"</strong><br>
<strong>&nbsp;&nbsp;&nbsp;</strong>scrapeaddress = (baseurl + country + <strong>"/" </strong>+ comp.replace(<strong>" "</strong>, <strong>"-"</strong>).replace(<strong>"/"</strong>, <strong>"-"</strong>) + <strong>"-"</strong><br>
<strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</strong>+ str(season) + <strong>"-" </strong>+ str(season + 1) + <strong>"/results"</strong>)<br>
&nbsp;&nbsp;&nbsp;print(<strong>"URL:"</strong>, scrapeaddress)<br>
&nbsp;&nbsp;&nbsp;print(<strong>""</strong>)<br>
<br>
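&nbsp;&nbsp;&nbsp;<em># scrape the page and create beautifulsoup object</em><br>
&nbsp;&nbsp;&nbsp;<em># note: PhantomJS support was removed in Selenium 4, so this line needs an older Selenium release</em><br>
<em>&nbsp;&nbsp;&nbsp;</em>sess = webdriver.PhantomJS()<br>
&nbsp;&nbsp;&nbsp;sess.get(scrapeaddress)<br>
&nbsp;&nbsp;&nbsp;page = bs(sess.page_source, <strong>"lxml"</strong>)<br>
&nbsp;&nbsp;&nbsp;<em># close the browser session now that we have the page source</em><br>
<em>&nbsp;&nbsp;&nbsp;</em>sess.quit()<br>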
<br>
&nbsp;&nbsp;&nbsp;<em># find the main data table within the page source</em><br>
<em>&nbsp;&nbsp;&nbsp;</em>maintable = page.find(<strong>"table"</strong>, <strong>"competitionRanking"</strong>)<br>
<br>
&nbsp;&nbsp;&nbsp;<em># separate the data table into rows</em><br>
<em>&nbsp;&nbsp;&nbsp;</em>games = maintable.find_all(<strong>"tr"</strong>)<br>
<br>
&nbsp;&nbsp;&nbsp;<em># create an empty pandas dataframe to store our data</em><br>
<em>&nbsp;&nbsp;&nbsp;</em>df = pd.DataFrame(columns=[<strong>"date"</strong>, <strong>"homeTeam"</strong>, <strong>"homeScore"</strong>, <strong>"awayScore"</strong>, <strong>"awayTeam"</strong>])<br>
<br>
&nbsp;&nbsp;&nbsp;idx = 0<br>
&nbsp;&nbsp;&nbsp;today = datetime.date.today()<br>
<br>
&nbsp;&nbsp;&nbsp;<strong>for </strong>game <strong>in </strong>games:<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em># these lines filter out any rows not containing game data, some competitions contain extra info.</em><br>
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</em><strong>try</strong>:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cls = game[<strong>"class"</strong>]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>except </strong>KeyError:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cls = <strong>"none"</strong><br>
<strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if </strong>(<strong>"titleSpace" not in </strong>cls <strong>and "compHeading" not in </strong>cls <strong>and</strong><br>
<strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"matchEvents" not in </strong>cls <strong>and "compSubTitle" not in </strong>cls <strong>and </strong>cls != <strong>"none"</strong>):<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;datestr = game.find(<strong>"a"</strong>).text<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;gamedate = datetime.datetime.strptime(datestr, <strong>"%d/%m/%Y"</strong>).date()<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em># filter out "extra time", "penalty shootout" and "neutral ground" markers</em><br>
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</em>hometeam = game.find(<strong>"td"</strong>, <strong>"teamHome"</strong>).text<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;hometeam = hometeam.replace(<strong>"[ET]"</strong>, <strong>""</strong>).replace(<strong>"[PS]"</strong>, <strong>""</strong>).replace(<strong>"[N]"</strong>, <strong>""</strong>).strip()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;awayteam = game.find(<strong>"td"</strong>, <strong>"teamAway"</strong>).text<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;awayteam = awayteam.replace(<strong>"[ET]"</strong>, <strong>""</strong>).replace(<strong>"[PS]"</strong>, <strong>""</strong>).replace(<strong>"[N]"</strong>, <strong>""</strong>).strip()<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em># if game was played before today, try and get the score</em><br>
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</em><strong>if </strong>gamedate &lt; today:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;scorestr = game.find(<strong>"td"</strong>, <strong>"score"</strong>).text<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em># if the string holding the scores doesn't contain " - " then it hasn't yet been updated</em><br>
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</em><strong>if " - " in </strong>scorestr:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;homescore, awayscore = scorestr.split(<strong>" - "</strong>)<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em># make sure the game wasn't cancelled, postponed or suspended</em><br>
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</em><strong>if </strong>homescore != <strong>"C" and </strong>homescore != <strong>"P" and </strong>homescore != <strong>"S"</strong>:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em># store game in dataframe</em><br>
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</em>df.loc[idx] = {<strong>"date"</strong>: gamedate.strftime(<strong>"%Y-%m-%d"</strong>),<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>"homeTeam"</strong>: hometeam,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>"homeScore"</strong>: int(homescore),<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>"awayScore"</strong>: int(awayscore),<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>"awayTeam"</strong>: awayteam}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em># update our index</em><br>
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</em>idx += 1<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>else</strong>:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em># it's a future game, so store it with scores of -1</em><br>
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</em>df.loc[idx] = {<strong>"date"</strong>: gamedate.strftime(<strong>"%Y-%m-%d"</strong>),<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>"homeTeam"</strong>: hometeam,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>"homeScore"</strong>: -1,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>"awayScore"</strong>: -1,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>"awayTeam"</strong>: awayteam}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;idx += 1<br>
<br>
&nbsp;&nbsp;&nbsp;<em># sort our dataframe by date</em><br>
<em>&nbsp;&nbsp;&nbsp;</em>df.sort_values([<strong>'date'</strong>, <strong>'homeTeam'</strong>], ascending=[<strong>True</strong>, <strong>True</strong>], inplace=<strong>True</strong>)<br>
&nbsp;&nbsp;&nbsp;df.reset_index(inplace=<strong>True</strong>, drop=<strong>True</strong>)<br>
&nbsp;&nbsp;&nbsp;<em># add a column containing the season, it'll come in handy later.</em><br>
<em>&nbsp;&nbsp;&nbsp;</em>df[<strong>"season"</strong>] = season<br>
&nbsp;&nbsp;&nbsp;<strong>return </strong>df<br>
<br>
<em># set which country and competition we want to use</em><br>
<em># others to try, "Scotland" &amp; "Premiership" or "Europe" &amp; "UEFA Champions League"</em><br>
country = <strong>"England"</strong><br>
competition = <strong>"Premier League"</strong><br>
lastseason = 2016<br>
thisseason = 2017<br>
<br>
lastseasondata = scrapeseason(country, competition, lastseason)<br>
thisseasondata = scrapeseason(country, competition, thisseason)<br>
<br>
<em># combine our data to one frame</em><br>
data = pd.concat([lastseasondata, thisseasondata])<br>
data.reset_index(inplace=<strong>True</strong>, drop=<strong>True</strong>)<br>
<br>
<em># save to file so we don't need to scrape multiple times</em><br>
data.to_csv(<strong>"data.csv"</strong>)<br>
</blockquote>
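<p>The URL building and score handling inside <code>scrapeseason</code> can also be pulled out into small functions that are easy to check without touching the website. &nbsp;This is only a sketch: the names <code>build_results_url</code> and <code>parse_score</code> are hypothetical, the sample score strings are invented, and <code>parse_score</code> is slightly stricter than the code above because it rejects anything that isn't a number rather than only "C", "P" and "S".</p>

```python
def build_results_url(country, comp, season,
                      baseurl="http://www.soccerpunter.com/soccer-statistics/"):
    """Return the results-page URL for one season, mirroring scrapeseason."""
    # spaces and slashes in the competition name become hyphens
    slug = comp.replace(" ", "-").replace("/", "-")
    return f"{baseurl}{country}/{slug}-{season}-{season + 1}/results"


def parse_score(scorestr):
    """Return (home, away) as ints, or None if there is no final score."""
    if " - " not in scorestr:
        return None  # score hasn't been filled in yet
    home, away = (part.strip() for part in scorestr.split(" - ", 1))
    if not (home.isdigit() and away.isdigit()):
        return None  # non-numeric markers such as "C", "P" or "S"
    return int(home), int(away)


print(build_results_url("England", "Premier League", 2016))
# invented inputs, just to exercise each branch
print(parse_score("2 - 1"), parse_score("P - P"), parse_score("vs"))
```

<p>Keeping these two pieces separate means a change to the site's URL scheme or score format only needs fixing in one place.</p>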
<p>Okay, that's enough for now. &nbsp;If you run this you'll have a file called data.csv. &nbsp;Load it up in a spreadsheet and confirm it looks OK, and I'll be back soon with some code to do something with our new data.</p>
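<p>If you'd rather check the file from Python than from a spreadsheet, something along these lines should do. &nbsp;It's a rough sketch: the two sample rows use made-up teams and scores and stand in for the real data.csv so the snippet runs on its own. &nbsp;To check the real file, pass <code>"data.csv"</code> to <code>pd.read_csv</code> instead of <code>sample</code>.</p>

```python
import io

import pandas as pd

# made-up rows standing in for data.csv; the real file has the same columns
sample = io.StringIO(
    ",date,homeTeam,homeScore,awayScore,awayTeam,season\n"
    "0,2016-08-13,Team A,2,1,Team B,2016\n"
    "1,2017-09-23,Team C,-1,-1,Team D,2017\n"
)

# index_col=0 picks up the unnamed index column that to_csv writes by default
data = pd.read_csv(sample, index_col=0, parse_dates=["date"])

played = data[data["homeScore"] >= 0]    # games with a real result
fixtures = data[data["homeScore"] < 0]   # future games stored with -1

print(len(played), "played,", len(fixtures), "still to play")
print(data.groupby("season").size())
```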
<p>In the meantime, if anyone has any questions, tips, advice or abuse they'd like to share, please do.</p>
<p><br></p>
</html>
Created: 2017-09-16 00:29:24