Python football predictions


I am currently updating Footy Predictor for the coming season, hope to release on 22nd June

I will leave this page here for reference purposes

Python Soccer Results Predictor

Updated big-time, see this post.

I have used the words “soccer” as well as “football” and “footy” to cater for everyone. It will probably cause more confusion than it avoids, but we will see.

The Simple Minded

For quite some time now I have been avoiding Python dictionaries, I have only just got my head around Python lists ffs.

The strange and annoying thing is that I had absolutely no problem understanding and using multi-dimension arrays in Amos Basic, and other early BASIC languages.

There isn’t that much difference between MDAs and Python dictionaries really. Amos always went out of its way to be very clear and simple with its syntax; it never used unnecessary braces and brackets to confuse the simple-minded, like me.

Morning Brainwave

Now that I feel I understand Python lists fairly well, it seemed the right time to try to code something using a Python dictionary, and what do you know, I had a brainwave just this morning for a project that would probably require the use of a dictionary, how convenient. Were the Python gods encouraging me, or laughing at me, again, I wondered?

My idea was for a fun little program that makes predictions of Premier League football matches for the coming season (topical or what, eh?). I was thinking along the lines of the team name and a weighting score. Perfect for a dictionary key and a value, or so I thought.

Weighting Allowance

The weighting isn’t exactly set arbitrarily or at random, it is set by my knowledge of how I rate each team’s chances for the 2019/2020 season. I have been watching football since the 1970s and I keep up to date with team form and transfers in the Premier League, so I feel I have something to offer here.

Obviously this one single rating awarded by myself does not cover the myriad of other possible variables that come into play during a football season, and of course the football match itself, though it does take into account home advantage, which is a big results modifier.

You Cannot Be Serious?

To make Footy Predictor more serious it would probably need to connect to some football site API, (see code snippet #126), and gather data from there and then collate it and turn it into ratings of some kind, that unfortunately is way out of my [premier] league, so I decided to make what amounts to a toy 😉

I meant to add a scoreline prediction as well, but I forgot. I will do that in the next update.

Here is what I came up with:
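The post originally showed the full program here. Below is a minimal sketch of the idea as described: a dictionary of team weightings plus a home-advantage bonus and a pinch of randomness. The ratings shown and the bonus value are illustrative assumptions, not the exact numbers from Footy Predictor.

```python
# Hedged sketch of the toy predictor: team-name -> weighting dictionary,
# home advantage bonus, and a little randomness. Values are illustrative.
import random

ratings = {
    "Man City": 6, "Liverpool": 6, "Tottenham": 5, "Man Utd": 4,
    "Arsenal": 4, "Chelsea": 4, "Leicester": 3, "Norwich": 0,
}

HOME_BONUS = 1  # home advantage is a big results modifier


def predict(home, away):
    """Compare weighted ratings, with a small random element for fun."""
    home_score = ratings[home] + HOME_BONUS + random.randint(0, 2)
    away_score = ratings[away] + random.randint(0, 2)
    if home_score > away_score:
        return "Home win"
    if home_score < away_score:
        return "Away win"
    return "Draw"


print(predict("Liverpool", "Norwich"))
```

With a big enough ratings gap the randomness never flips the outcome, which is roughly how a weighting-based toy behaves.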


And that is it, apart from a drop-down menu with “About”, “Visit Blog” and “Quit” in it.

A Cheap Shot

As I keep hinting, this program is not to be taken too seriously, okay? It’s just a cheap way to get footy fans visiting this blog as the excitement of the new season rises.

If you look at how naff my prediction “logic” works in the code you might be disappointed, but there is nothing to stop you writing your own prediction routines for it and sharing them with us math(s) noobs if you want to.

Historical Data

If you decide to have a go at writing a real football results predictor, then here are a few pointers to get you started. You can get a free .csv file of historical football results data at football-data.co.uk, and here is a brilliant article (link now dead) that shows you how to use it. I will try it out when I get time, maybe.
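To give a flavour of working with that kind of CSV: the football-data.co.uk files use columns like FTHG/FTAG (full-time home/away goals) and FTR (full-time result: H/D/A). This standard-library sketch computes a home-win rate from a tiny inline sample in that shape; the sample rows themselves are made up.

```python
# Parse a football-data.co.uk-style CSV sample and compute the home win rate.
# The rows here are invented for illustration.
import csv
import io

sample = """HomeTeam,AwayTeam,FTHG,FTAG,FTR
Liverpool,Norwich,4,1,H
West Ham,Man City,0,5,A
Leicester,Wolves,1,1,D
Burnley,Southampton,3,0,H
"""

rows = list(csv.DictReader(io.StringIO(sample)))
home_wins = sum(1 for r in rows if r["FTR"] == "H")
print(f"Home win rate: {home_wins / len(rows):.0%}")
```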

I Got Dict

Getting back to Python dictionaries, I did manage to use a dictionary in the code, after reading several blog posts on the subject.

It is in this dictionary where you can edit my weightings for each team. Zero for the lowest rated, and no limit to the upper rating.

For example, last year’s league winners, Man City, are rated 6, along with runners-up Liverpool (just 1pt behind in the table, but Liverpool won the Champions League as well), also on 6.

If you disagree with any of my ratings, it is easy to see where they are in the source code, so feel free to change them. But if you share the modified code or program, please make it known that you have made changes, both in the code and in any executable you create.

Future Ratings

As the season progresses things will evolve. For example, I am updating this post in October and Tottenham and Man Utd are having a nightmare of problems with players, form and hell knows what behind the scenes, and their ratings should be trimmed to 3 each, at least until things improve.

Conversely, teams like Leicester and Crystal Palace are having a brilliant start to the season and need their ratings increased by 2 each, in my opinion.

I will be looking to make an update to Footy Predictor soon, adding any features I can think of, and updating the ratings for the second half of the season, or better still I will add a built in updater that will get the latest ratings from this site to keep it up to date with current form trends etc.

Multi Platformer

Anyway, here is the code that I knocked up, see what you think. It is both Windows and Linux tested, and I am hopeful it will work on the Mac as well.



Footy Predictors Predictions

For the first week of the new Premier League season. Matches are played on the 9th, 10th and 11th Aug 2019. Home teams are on the left side.

Game                        Prediction        Result

Liverpool v Norwich         3-1, Home Win     4-1
West Ham v Man City         1-3, Away Win     0-5
Bournemouth v Sheff Utd     2-0, Home Win     1-1
Burnley v Southampton       2-0, Home Win     3-0
Crystal Pal v Everton       0-0, Draw         0-0
Watford v Brighton          1-0, Home Win     0-3
Tottenham v Aston Villa     3-1, Home Win     3-1
Leicester v Wolves          1-1, Draw         1-1
Newcastle v Arsenal         0-1, Away Win     0-1
Man Utd v Chelsea           1-0, Home Win     4-0

I will update this post with the results as and when they come in, no matter how much of a twit they make me look, I’m used to that 🙂

Back soon, COYS.

This post checked December 2019


Published by steve_shambles

I have 2 blogs here, one is for my personal memoirs, the other is about learning to program in Python. I am not a developer, programming is just a hobby for me. Any adverts you might see are from WordPress (use an Ad-Blocker); I make zero money from this site. Any and all support keeps me working hard: you can support me via likes, shares or comments, and they are gratefully received. View all posts by steve_shambles



Machine Learning Algorithms for Football Predictions

This article evaluates football/soccer result (victory, draw, loss) prediction in the Brazilian Football Championship using various machine learning models trained on real-world match data. The models were tested recursively and their average predictive results were compared. Logistic regression and the support vector machine yielded the best results, with superior average accuracy compared to the other classifiers (KNN and Random Forest): 49.77% accuracy for logistic regression, almost 17 percentage points better than a random decision (the benchmark), which has a 33% chance of success. In addition, a ranking of the features’ relative importance was produced to guide the use of the data.

Football/soccer is a sport that is very present in people’s lives: people watch it, play it, and also bet on it. Thinking about betting, we can clearly see that football is a very unpredictable sport, and it does not require serious research to prove that. In the Premier League 2015/2016 season we had a very unexpected champion, whose probability of winning the title at the beginning of the season was one in five thousand.

So, the primary objective of this project is to create a supervised machine learning algorithm that predicts football match results based on match statistics. This also makes it possible to evaluate how difficult the prediction problem is.

This project aims to:

  • Build a web-scraping robot to collect all the information about the matches
  • Automate the web-scraping process for all the season’s matches
  • Create a supervised machine learning model to predict the outcome of the matches
  • Evaluate the models

In classification problems it is common to use accuracy as an evaluation metric. As our outcome prediction is a multi-class problem, no other metrics are strictly necessary. Accuracy is defined as (TP + TN) / (TP + TN + FP + FN), where TP are the true positives, FP the false positives, TN the true negatives and FN the false negatives.
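To make the metric concrete, here is a quick worked example for a multi-class outcome (victory/draw/loss), where accuracy reduces to the fraction of matches whose predicted class equals the true class; the labels are made up.

```python
# Accuracy on a toy set of six predicted match outcomes.
y_true = ["Victory", "Draw", "Loss", "Victory", "Draw", "Loss"]
y_pred = ["Victory", "Loss", "Loss", "Victory", "Draw", "Victory"]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"accuracy = {accuracy:.2f}")  # 4 of 6 predictions are correct
```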


However, before exploring the collected data, it’s crucial to understand how this information was collected, so this part covers everything from the web-scraping robot up to the final, fully treated database. The first and second images below show the page from which the data was collected. So step number one is to web-scrape the main page, pick up the football data, and create a DataFrame with all the information combined.

Now that we have the code ready to pick up data from the matches, it is necessary to create another piece of code to collect the URLs of all the season’s matches, so that the whole task becomes an automated robot.

All the codes will be attached to the Git repository.
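The URL-collection step can be sketched with only the standard library. The real robot scrapes a live stats site; here we parse a static HTML snippet instead so the idea runs offline, and the class name, URL pattern and page contents are invented for illustration.

```python
# Hypothetical sketch: collect match-page links from a season page.
from html.parser import HTMLParser


class MatchLinkCollector(HTMLParser):
    """Collects href attributes of <a> tags that look like match pages."""

    def __init__(self):
        super().__init__()
        self.match_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "/match/" in href:  # assumed URL pattern for match pages
                self.match_urls.append(href)


season_page = """
<table>
  <tr><td><a href="/match/2020-round1-flamengo-santos">Flamengo v Santos</a></td></tr>
  <tr><td><a href="/match/2020-round1-gremio-bahia">Gremio v Bahia</a></td></tr>
  <tr><td><a href="/news/transfer-window">Transfer news</a></td></tr>
</table>
"""

collector = MatchLinkCollector()
collector.feed(season_page)
print(collector.match_urls)  # only the two /match/ links survive the filter
```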

Before the data were clean and ready for analysis, the following steps were carried out:

  • Select columns: keep only columns that did not have a large number of null values.
  • Remove aggregate rows: the collected data came with some “problem” rows, such as totals of all the team’s player statistics, and these lines had to be dropped from the data.
  • Group by match and team: because the data were collected per player, it was necessary to aggregate the players’ statistics up to the match and team level.
  • Append the result: because the data came from the players’ table, it did not carry the result of the match, so a step was added to append the result to the DataFrame.
  • Place: a step was needed to mark which team played at home and which away.
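The grouping, result-appending and place steps above can be sketched in pandas on a toy per-player table; the column names echo the article’s schema but the rows and the home/away assumption are invented.

```python
# Aggregate per-player rows up to one row per (match, team), then append
# the match result and the place. Toy data for illustration.
import pandas as pd

players = pd.DataFrame({
    "match": ["A v B"] * 4,
    "team":  ["A", "A", "B", "B"],
    "Gls":   [1, 1, 0, 1],
    "Sh":    [3, 2, 4, 1],
})

# One row per (match, team): player statistics are summed
team_stats = players.groupby(["match", "team"], as_index=False).sum()

# Append result and place (assumption: team A is the home team here)
goals = dict(zip(team_stats["team"], team_stats["Gls"]))
team_stats["Place"] = ["Home", "Away"]
team_stats["Resultado"] = [
    "Victory" if goals["A"] > goals["B"] else "Loss" if goals["A"] < goals["B"] else "Draw",
    "Victory" if goals["B"] > goals["A"] else "Loss" if goals["B"] < goals["A"] else "Draw",
]
print(team_stats)
```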


Now it’s time to explore the collected data so that a prediction model for the results can be built. These were the columns that our cleaned data had:

The meaning of all the variables:

  • Confronto: Match
  • Time: Team
  • Data: Date of the match
  • Gls: number of Goals in the match
  • Ast: Assists
  • PK: Penalty Kicks Made
  • PKatt: Penalty Kicks Attempted
  • Sh: Shots Total
  • SoT: Shots on target
  • CrdY: Yellow Cards
  • CrdR: Red Cards
  • Crs: Crosses
  • Fls: Fouls Committed
  • TklW: Tackles Won
  • Int: Interceptions
  • Fld: Fouls Drawn
  • OG: Own Goals
  • Off: Offsides
  • Resultado: Result of the match (Victory, Loss, Draw)
  • Place: Home/ Away
  • Torcida: Crowd

Observation: every match has two lines, one for the home team and another for the away team.

It’s good to check how these variables relate to each other, so two approaches were created: Attack and Defense. The code also provides a function that applies each approach per team and per place played (Home/Away).

For the attacking approach, four variables were selected: “SoT”, “Sh”, “Gls”, “Torcida”. Looking at the data it’s possible to see how difficult this problem is, because no obvious patterns emerge. But it is possible to conjecture some hypotheses different from the usual, such as:

  • A high Torcida (football crowd) might not be related to a more significant number of goals.
  • A low Torcida (football crowd) might be related to a high number of SoT (shots on target).

With these graphs it’s possible to see how the data and its columns relate to each other; looking at three variables at the same time gives a sense of how hard this data is to separate. For the defensive approach, four variables were selected: “TklW”, “Fls”, “CrdY”, “Torcida”. And, as in the attack analysis, no patterns could be found in the defense data either.

However, looking at the columns-info figure, there are more variables available than those shown in the graphs, and it’s clear that at least 17 of the collected variables can influence the match outcome, and therefore the performance of the prediction. But because football is a lot more complex, more variables were needed for a good prediction of a match outcome, so the next part generates some additional variables.

Data Pre-Processing

As the statistics of a game are not available before the match is played, it’s necessary to create new variables that would be available before the games. To solve this dilemma, a mean is generated for every variable over all the games before the corresponding game: when a team plays on September 18, the code provides, for each variable, the mean over all the games played before that exact game.

Some variables were also created to show the model each team’s sequence of points over the last 5 games, the last 3 games, and the last game. For every victory the code summed 3 points, for a draw 1 point, and for a loss 0 points. In this way it becomes possible to see whether the team is coming off victories, draws or losses.
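Both pre-match features described above can be sketched in pandas on toy data: an expanding mean over all previous games, and a points total over the last N games, each shifted by one so only information available before the match is used. Column names are illustrative.

```python
# Two leak-free pre-match features for one team's game history (toy data).
import pandas as pd

team = pd.DataFrame({
    "Gls":       [2, 0, 1, 3],
    "Resultado": ["Victory", "Loss", "Draw", "Victory"],
})

# (1) Mean of 'Gls' over the games strictly before each game
team["Gls_pre_mean"] = team["Gls"].expanding().mean().shift(1)

# (2) Points per result (3/1/0), summed over the last 3 previous games
points = team["Resultado"].map({"Victory": 3, "Draw": 1, "Loss": 0})
team["form_last3"] = points.rolling(3, min_periods=1).sum().shift(1)

print(team)
```

The `shift(1)` is the crucial detail: without it, each row would include its own match statistics, leaking the result into the features.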

With these treatments done, a way was found to inform the machine of the “place” of the match. The championship has two legs, so if the first meeting of two teams is at one team’s home, the second necessarily has to be away, in other words not in their stadium, and in football this is a very important variable. The way designed to encode it was: take all the created variables for the home team (Place = Home) of every match and subtract the same variables for the visiting team (Place = Away). This produces a database where the data basically says:

  • If a variable is negative: the visiting team has had better past performances in this variable than the home team. For example, take the variable “average of shots”: if the home team has value P and the visiting team has value X, then the output of the subtraction (HOME - AWAY) is positive when P is higher than X, and negative when P is lower than X, showing that the home team has not performed better on that variable in past games.


  1. Problem: the variables are only available after the match, while the model needs all feature data before the corresponding match. This was solved by creating a season average for every variable, plus some moving averages for each variable of the dataset.
  2. Problem: insert the sequence variables. This was solved by summing all the points the team had collected over the past 3 games, the past 5 games, and the last game (3 points per victory, 1 per draw, 0 per loss).
  3. Problem: show the machine which is the home and which is the away team. To solve this, the visiting team’s variables were subtracted from the home team’s variables for every match, showing in this way whether the home team is in any way superior or inferior to the visiting team.
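The home-minus-away differencing in step 3 can be sketched in a few lines of pandas; the feature names and values below are invented for illustration, with one row per match.

```python
# Subtract the away team's pre-match features from the home team's,
# producing one signed value per feature per match. Toy data.
import pandas as pd

home = pd.DataFrame({"match": [1, 2], "avg_shots": [12.0, 8.0], "form_last3": [7, 4]})
away = pd.DataFrame({"match": [1, 2], "avg_shots": [9.0, 11.0], "form_last3": [5, 6]})

features = home.set_index("match").subtract(away.set_index("match"))
print(features)
# Positive values: the home team has been stronger on that feature.
# Negative values: the visiting team has.
```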

The final Database had 41 columns and 380 lines, and looked like this:

At this point our database has finally been treated, and it’s now possible to run some models and check their performance.


To run the models we dropped the first four variables and the “Gls” variable, and also applied MinMaxScaler to all the feature variables. As described in the Sklearn documentation, this estimator scales and translates each feature individually such that it lies in the given range on the training set, e.g. between zero and one. Moreover, these four models were used:

  • Support Vector Machine: a supervised learning model that aims to maximise the distance between the points and the separating boundary, i.e. SVM separates the data by maximising the margin between the classes.
  • Random Forest: as the name says, random forest creates several decision trees and groups them into one “forest”, taking a bootstrap sample of size m of the columns that represent the explanatory variables when partitioning the tree at each node. The final decision is given by majority vote for classification problems (and by the average in a regression problem).
  • KNN: a simple model that classifies a point based on the classes of its k nearest points according to a distance metric.
  • Logistic Regression: basically, logistic regression is a multiple linear regression whose result is “squeezed” into the interval [0, 1] using the sigmoid function.
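The MinMaxScaler step mentioned above can also be written out by hand. This numpy sketch shows the formula that estimator applies per feature (column): x_scaled = (x - min) / (max - min); the matrix is a made-up example.

```python
# Per-column min-max scaling, the transformation MinMaxScaler performs.
import numpy as np

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])

col_min = X.min(axis=0)          # per-feature minimum over the rows
col_max = X.max(axis=0)          # per-feature maximum over the rows
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)                  # every column now spans [0, 1]
```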

In order to run all these models we split the database randomly using scikit-learn’s train_test_split, holding out 30% of the data for testing. The whole procedure was then repeated 1000 times with random splits, for all four models, which made it possible to check the mean accuracy and the standard deviation of each model. All models were used with their default values; the parameters of each specific model were not explored.
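The evaluation protocol can be sketched as below, on synthetic data rather than the article’s database, with only two of the four models and far fewer than 1000 repeats to keep it fast; everything here is illustrative, not the article’s exact setup.

```python
# Repeated random train/test splits, accumulating test accuracy per model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Synthetic 3-class target loosely driven by the first feature
y = np.digitize(X[:, 0] + rng.normal(scale=0.5, size=300), [-0.5, 0.5])

scores = {"logreg": [], "knn": []}
for seed in range(20):  # the article repeats this 1000 times
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                        ("knn", KNeighborsClassifier())]:
        model.fit(X_tr, y_tr)
        scores[name].append(accuracy_score(y_te, model.predict(X_te)))

for name, s in scores.items():
    print(f"{name}: mean={np.mean(s):.3f} std={np.std(s):.3f}")
```

Averaging over many random splits is what lets the article report both a mean accuracy and a standard deviation per model.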

Model Evaluation and Validation

As can be seen in the table below, the results weren’t too impressive. The algorithm achieved almost 50% accuracy; but considering that a random prediction’s probability of success is 33% (victory/draw/loss), this is a good sign, showing that the algorithm was able to identify some patterns after all.

After all 1000 runs, logistic regression had the best results, with one of the lowest standard deviations and the highest accuracy of all four models. SVM performed better than the remaining models but had the highest standard deviation, which may indicate that it tends to vary more than the others. A feature ranking was also created using random forest feature importances. As can be seen in the figure below, the football crowd over the last 3 matches, the number of yellow cards, and the football crowd over the whole season had the greatest influence on the prediction outcome.


It’s odd to see logistic regression perform so much better than the other models: as we saw, the data showed no clear patterns and this is a complex problem, yet it was best handled by a “linear” approach, perhaps the simplest one here. One reason may be the parameters of each model, which were not explored in this approach; since SVM offers many kernels, trying different kernels might be a way to reach better performance for that model. Another strange point is that the “points sequence” was among the last variables to influence the prediction. In my view this variable should not be ignored, and there may be better ways to expose these patterns to the machine.


This article shows that football prediction is still a very hard task and that more variables are needed to help predict the results. However, we can also see that a machine learning algorithm can already “think” about which team to bet on, and can be more accurate than people who do not know the games, with an advantage of almost 17 percentage points over the probability of a random prediction.


For the future, I suggest investigating and finding more variables that could be useful, such as injuries, or more details about the players of each team; maybe FIFA or Pro Evolution game data could bring more information into the base. Another thing that could be done in the future is predicting the number of goals for each team. This is more complex because the goals must be consistent with the predicted result: for example, it could not be two goals for the home team and two for the away team if the result was predicted as a home win. So, maybe, this article can be a source of inspiration for the creation of better and more complex models in the future.


Supervised learning models used to predict football matches outcomes

Results - You can see the results on the html file in src/
Disclaimer - This repository was created for educational purposes and we do not take any responsibility for anything related to its content. You are free to use any code or algorithm you find, but do so at your own risk.

Notebook by Martim Pinto da Silva, LuisRamos, Francisco Gonçalves

Supported by Luis Paulo Reis

Faculdade de Engenharia da Universidade do Porto

It is recommended to view this notebook in nbviewer for the best overall experience

You can also execute the code on this notebook using Jupyter Notebook or Binder (no local installation required)

Table of contents

        • Step 2: Classification & Results Interpretation



          In the most recent years there's been a major influx of data. In response to this situation, Machine Learning alongside the field of Data Science have come to the forefront, representing the desire of humans to better understand and make sense of the current abundance of data in the world we live in.

          In this notebook, we aim to use Supervised Learning models to harness a dataset of around 25k football matches in order to predict the outcome of other matchups according to a set of classes (win, draw, loss, etc.)

          Required libraries and models



          If you don't have Python on your computer, you can use the Anaconda Python distribution to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.

          This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

          NumPy: Provides a fast numerical array structure and helper functions.

          pandas: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.

          scikit-learn: The essential Machine Learning package for a variety of supervised learning models, in Python.

          tensorflow: The essential Machine Learning package for deep learning, in Python.

          matplotlib: Basic plotting library in Python; most other Python plotting libraries are built on top of it.


          Regarding the supervised learning models, we are using:

          # Primary libraries
          from time import time
          import numpy as np
          import pandas as pd
          import sqlite3
          import matplotlib.pyplot as plt

          # Models
          from sklearn.naive_bayes import GaussianNB
          from sklearn.neighbors import KNeighborsClassifier
          from sklearn.tree import DecisionTreeClassifier
          from sklearn.svm import SVC
          from xgboost import XGBClassifier

          # Neural Networks
          from tensorflow import keras
          from keras.models import Sequential
          from keras.layers import Dense
          from keras.layers import Flatten
          from keras.layers import Input
          from keras.models import Model
          from keras.utils import np_utils

          # Measures
          from sklearn.preprocessing import Normalizer
          from sklearn.preprocessing import StandardScaler
          from sklearn.model_selection import train_test_split
          from sklearn.metrics import classification_report, accuracy_score
          from sklearn import metrics
          from sklearn.model_selection import KFold
          from sklearn.preprocessing import LabelEncoder

          The problem domain


          The first step to any data analysis project is to define the question or problem we're looking to solve, and to define a measure (or set of measures) for our success at solving that task. The data analysis checklist has us answer a handful of questions to accomplish that, so let's work through those questions.

          Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?

          We are trying to design a predictive model capable of accurately predicting whether the home team will win, lose or draw, i.e., predict the outcome of football matches based on a set of measurements, including player ratings, team ratings, team average stats (possession, corners, shots), team style (pressing, possession, defending, counter-attacking, speed of play, ..) and team match history (previous games).

          Did you define the metric for success before beginning?

          Let's do that now. Since we're performing classification, we can use accuracy - the fraction of correctly classified matches - to quantify how well our model is performing. Knowing that most bookmakers predict matches with an accuracy of 50%, we will try to match or beat that value. We will also use a confusion matrix, and analyse the precision, recall and f1-score.

          Did you consider whether the question could be answered with the available data?

          The data provided has information about more than 25k matches across multiple leagues. Even though the usability isn't great, after some processing and cleansing of the data we will be able to predict matches with great confidence. To answer the question: yes, we have more than enough data to analyse football matches.

          Step 1: Data analysis


          The first step is to look at the data and, after extracting it, analyse it. We know that most datasets contain minor issues, so we have to search for possible null or undefined values, and decide how to proceed if we find them: do we remove an entire row of a DataFrame, or just clean and substitute its value? This analysis is done below.

          Before analysing the data, we need to first extract it. For that we use multiple methods to keep the code cleaner.

          Extracting data from the database

          with sqlite3.connect("../dataset/database.sqlite") as con:
              matches = pd.read_sql_query("SELECT * from Match", con)
              team_attributes = pd.read_sql_query("SELECT distinct * from Team_Attributes", con)
              player = pd.read_sql_query("SELECT * from Player", con)
              player_attributes = pd.read_sql_query("SELECT * from Player_Attributes", con)


          We start by cleaning the match data and defining some methods for the data extraction and the labels

          ''' Derives a label for a given match. '''
          def get_match_outcome(match):
              # Define variables
              home_goals = match['home_team_goal']
              away_goals = match['away_team_goal']
              outcome = pd.DataFrame()
              outcome.loc[0, 'match_api_id'] = match['match_api_id']
              # Identify match outcome
              if home_goals > away_goals:
                  outcome.loc[0, 'outcome'] = "Win"
              if home_goals == away_goals:
                  outcome.loc[0, 'outcome'] = "Draw"
              if home_goals < away_goals:
                  outcome.loc[0, 'outcome'] = "Defeat"
              # Return outcome
              return outcome.loc[0]

          ''' Get the last x matches of a given team. '''
          def get_last_matches(matches, date, team, x=10):
              # Filter team matches from matches
              team_matches = matches[(matches['home_team_api_id'] == team) | (matches['away_team_api_id'] == team)]
              # Filter x last matches from team matches
              last_matches = team_matches[team_matches.date < date].sort_values(by='date', ascending=False).iloc[0:x, :]
              # Return last matches
              return last_matches

          ''' Get the last team stats of a given team. '''
          def get_last_team_stats(team_id, date, teams_stats):
              # Filter team stats
              all_team_stats = teams_stats[teams_stats['team_api_id'] == team_id]
              # Filter last stats from team
              last_team_stats = all_team_stats[all_team_stats.date < date].sort_values(by='date', ascending=False)
              if last_team_stats.empty:
                  last_team_stats = all_team_stats[all_team_stats.date > date].sort_values(by='date', ascending=True)
              # Return last stats
              return last_team_stats.iloc[0:1, :]

          ''' Get the last x matches of two given teams against each other. '''
          def get_last_matches_against_eachother(matches, date, home_team, away_team, x=10):
              # Find matches of both teams
              home_matches = matches[(matches['home_team_api_id'] == home_team) & (matches['away_team_api_id'] == away_team)]
              away_matches = matches[(matches['home_team_api_id'] == away_team) & (matches['away_team_api_id'] == home_team)]
              total_matches = pd.concat([home_matches, away_matches])
              # Get last x matches
              try:
                  last_matches = total_matches[total_matches.date < date].sort_values(by='date', ascending=False).iloc[0:x, :]
              except:
                  last_matches = total_matches[total_matches.date < date].sort_values(by='date', ascending=False).iloc[0:total_matches.shape[0], :]
              # Check for error in data
              if last_matches.shape[0] > x:
                  print("Error in obtaining matches")
              # Return data
              return last_matches

          ''' Get the goals [home & away] of a specific team from a set of matches. '''
          def get_goals(matches, team):
              home_goals = int(matches.home_team_goal[matches.home_team_api_id == team].sum())
              away_goals = int(matches.away_team_goal[matches.away_team_api_id == team].sum())
              total_goals = home_goals + away_goals
              return total_goals

          ''' Get the goals [home & away] conceded by a specific team from a set of matches. '''
          def get_goals_conceided(matches, team):
              home_goals = int(matches.home_team_goal[matches.away_team_api_id == team].sum())
              away_goals = int(matches.away_team_goal[matches.home_team_api_id == team].sum())
              total_goals = home_goals + away_goals
              return total_goals

          ''' Get the number of wins of a specific team from a set of matches. '''
          def get_wins(matches, team):
              # Find home and away wins
              home_wins = int(matches.home_team_goal[(matches.home_team_api_id == team) & (matches.home_team_goal > matches.away_team_goal)].count())
              away_wins = int(matches.away_team_goal[(matches.away_team_api_id == team) & (matches.away_team_goal > matches.home_team_goal)].count())
              total_wins = home_wins + away_wins
              return total_wins

          ''' Create match specific features for a given match. '''
          def get_match_features(match, matches, teams_stats, x=10):
              # Define variables
              date = match.date
              home_team = match.home_team_api_id
              away_team = match.away_team_api_id
              # Get home and away team stats
              home_team_stats = get_last_team_stats(home_team, date, teams_stats)
              away_team_stats = get_last_team_stats(away_team, date, teams_stats)
              # Get last x matches of home and away team
              matches_home_team = get_last_matches(matches, date, home_team, x=5)
              matches_away_team = get_last_matches(matches, date, away_team, x=5)
              # Get last x matches of both teams against each other
              last_matches_against = get_last_matches_against_eachother(matches, date, home_team, away_team, x=3)
              # Create goal variables
              home_goals = get_goals(matches_home_team, home_team)
              away_goals = get_goals(matches_away_team, away_team)
              home_goals_conceided = get_goals_conceided(matches_home_team, home_team)
              away_goals_conceided = get_goals_conceided(matches_away_team, away_team)
              # Define result data frame
              result = pd.DataFrame()
              # Define ID features
              result.loc[0, 'match_api_id'] = match.match_api_id
              result.loc[0, 'league_id'] = match.league_id
              # Create match features and team stats
              if not home_team_stats.empty:
                  result.loc[0, 'home_team_buildUpPlaySpeed'] = home_team_stats['buildUpPlaySpeed'].values[0]
                  result.loc[0, 'home_team_buildUpPlayPassing'] = home_team_stats['buildUpPlayPassing'].values[0]
                  result.loc[0, 'home_team_chanceCreationPassing'] = home_team_stats['chanceCreationPassing'].values[0]
                  result.loc[0, 'home_team_chanceCreationCrossing'] = home_team_stats['chanceCreationCrossing'].values[0]
                  result.loc[0, 'home_team_chanceCreationShooting'] = home_team_stats['chanceCreationShooting'].values[0]
                  result.loc[0, 'home_team_defencePressure'] = home_team_stats['defencePressure'].values[0]
                  result.loc[0, 'home_team_defenceAggression'] = home_team_stats['defenceAggression'].values[0]
                  result.loc[0, 'home_team_defenceTeamWidth'] = home_team_stats['defenceTeamWidth'].values[0]
                  result.loc[0, 'home_team_avg_shots'] = home_team_stats['avg_shots'].values[0]
                  result.loc[0, 'home_team_avg_corners'] = home_team_stats['avg_corners'].values[0]
                  result.loc[0, 'home_team_avg_crosses'] = home_team_stats['avg_crosses'].values[0]
              if not away_team_stats.empty:
                  result.loc[0, 'away_team_buildUpPlaySpeed'] = away_team_stats['buildUpPlaySpeed'].values[0]
                  result.loc[0, 'away_team_buildUpPlayPassing'] = away_team_stats['buildUpPlayPassing'].values[0]
                  result.loc[0, 'away_team_chanceCreationPassing'] = away_team_stats['chanceCreationPassing'].values[0]
                  result.loc[0, 'away_team_chanceCreationCrossing'] = away_team_stats['chanceCreationCrossing'].values[0]
                  result.loc[0, 'away_team_chanceCreationShooting'] = away_team_stats['chanceCreationShooting'].values[0]
                  result.loc[0, 'away_team_defencePressure'] = away_team_stats['defencePressure'].values[0]
                  result.loc[0, 'away_team_defenceAggression'] = away_team_stats['defenceAggression'].values[0]
                  result.loc[0, 'away_team_defenceTeamWidth'] = away_team_stats['defenceTeamWidth'].values[0]
                  result.loc[0, 'away_team_avg_shots'] = away_team_stats['avg_shots'].values[0]
                  result.loc[0, 'away_team_avg_corners'] = away_team_stats['avg_corners'].values[0]
                  result.loc[0, 'away_team_avg_crosses'] = away_team_stats['avg_crosses'].values[0]
              result.loc[0, 'home_team_goals_difference'] = home_goals - home_goals_conceided
              result.loc[0, 'away_team_goals_difference'] = away_goals - away_goals_conceided
              result.loc[0, 'games_won_home_team'] = get_wins(matches_home_team, home_team)
              result.loc[0, 'games_won_away_team'] = get_wins(matches_away_team, away_team)
              result.loc[0, 'games_against_won'] = get_wins(last_matches_against, home_team)
              result.loc[0, 'games_against_lost'] = get_wins(last_matches_against, away_team)
              result.loc[0, 'B365H'] = match.B365H
              result.loc[0, 'B365D'] = match.B365D
              result.loc[0, 'B365A'] = match.B365A
              # Return match features
              return result.loc[0]

          ''' Create and aggregate features and labels for all matches. '''
          def get_features(matches, teams_stats, fifa, x=10, get_overall=False):
              # Get fifa stats features
              fifa_stats = get_overall_fifa_rankings(fifa, get_overall)
              # Get match features for all matches
              match_stats = matches.apply(lambda i: get_match_features(i, matches, teams_stats, x=10), axis=1)
              # Create dummies for league ID feature
              dummies = pd.get_dummies(match_stats['league_id']).rename(columns=lambda x: 'League_' + str(x))
              match_stats = pd.concat([match_stats, dummies], axis=1)
              match_stats.drop(['league_id'], inplace=True, axis=1)
              # Create match outcomes
              outcomes = matches.apply(get_match_outcome, axis=1)
              # Merge features and outcomes into one frame
              features = pd.merge(match_stats, fifa_stats, on='match_api_id', how='left')
              features = pd.merge(features, outcomes, on='match_api_id', how='left')
              # Drop NA values
              features.dropna(inplace=True)
              # Return preprocessed data
              return features

          def get_overall_fifa_rankings(fifa, get_overall=False):
              ''' Get overall fifa rankings from fifa data. '''
              temp_data = fifa
              # Check if only overall player stats are desired
              if get_overall == True:
                  # Get overall stats
                  data = temp_data.loc[:, (fifa.columns.str.contains('overall_rating'))]
                  data.loc[:, 'match_api_id'] = temp_data.loc[:, 'match_api_id']
              else:
                  # Get all stats except for stat date
                  cols = fifa.loc[:, (fifa.columns.str.contains('date_stat'))]
                  temp_data = fifa.drop(cols.columns, axis=1)
                  data = temp_data
              # Return data
              return data

Looking at the match data we can see that most columns have 25,979 values, which means we are analysing that many matches from the database. We can then look at the bookkeeper (betting odds) data. The number of matches with odds differs for each bookkeeper, so we start by selecting the bookkeeper with the most prediction data available.

viable_matches = matches.sample(n=5000)

b365 = viable_matches.dropna(subset=['B365H', 'B365D', 'B365A'], inplace=False)
b365.drop(['BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA',
           'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA',
           'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH', 'BSD', 'BSA'],
          inplace=True, axis=1)

bw = viable_matches.dropna(subset=['BWH', 'BWD', 'BWA'], inplace=False)
bw.drop(['B365H', 'B365D', 'B365A', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA',
         'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA',
         'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH', 'BSD', 'BSA'],
        inplace=True, axis=1)

iw = viable_matches.dropna(subset=['IWH', 'IWD', 'IWA'], inplace=False)
iw.drop(['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'LBH', 'LBD', 'LBA',
         'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA',
         'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH', 'BSD', 'BSA'],
        inplace=True, axis=1)

lb = viable_matches.dropna(subset=['LBH', 'LBD', 'LBA'], inplace=False)
lb.drop(['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA',
         'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA',
         'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH', 'BSD', 'BSA'],
        inplace=True, axis=1)

          Football Match Predictor using Machine Learning


          I worked on this project as part of the finals for my Artificial Intelligence class. This project uses Machine Learning to predict the outcome of a football match when given some stats from half time.

          Demo Link

          You can check out the demo here:

          Link to Code

          You can check out the source code here:

          How I built it

          For this project, I decided to use Python since I was very familiar with it, and also because it had a lot of awesome tools for machine learning.

Firstly, in order for this match prediction to work, I needed some good datasets. After looking around for a while, I found a site that contained structured datasets for a variety of football competitions, ranging from national leagues to world cups. To keep things simple, I decided to select the datasets for the top 5 European leagues, containing the match results for the last 9 years.

          Even though the data was already structured, I had to clean up some missing/misleading data. You can check out the repo for more info on this process.

          After cleaning up the data, I performed a wide variety of data analysis techniques such as Chi-Squared analysis, and calculating the Variance Inflation Factor, to extract the most important features.
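A chi-squared independence check of the kind mentioned can be sketched with pandas and SciPy. The column names (ht_leader, result) and the toy data below are made up for illustration; they are not taken from the project:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_feature_test(df, feature, target):
    # Build a contingency table and run the chi-squared test of independence
    table = pd.crosstab(df[feature], df[target])
    stat, p, dof, _ = chi2_contingency(table)
    return stat, p

# Toy half-time stats: hypothetical "which team leads at half time" feature
df = pd.DataFrame({
    "ht_leader": ["home", "home", "away", "none", "home", "away"],
    "result":    ["H",    "H",    "A",    "D",    "H",    "A"],
})
stat, p = chi2_feature_test(df, "ht_leader", "result")
```

A small p-value would suggest the feature and the outcome are not independent, i.e. the feature is worth keeping.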

          With this processed data, I trained 3 different models, namely:

          • Naive Bayes
          • Random Forest
          • Logistic Regression (One vs Rest)

After further tweaks and adjustments, both the Logistic Regression and the Random Forest models had the best performance at 70% accuracy, while the Naive Bayes model reached around 65%.

          I really wanted to show off this model to my friends and professors. So, I decided to deploy it on a remote server.

          To do this, I exported the trained model into a file using a python package called 'joblib'. Then, I created a simple Django Web Server with a REST API that uses this trained model, and makes the prediction. You can check out the final result here:

          NOTE: For a more detailed description of the process, check out the Readme in the repo.

          Additional Thoughts / Feelings / Stories

          Initially, I did not think I was gonna get 70% accuracy with these models. But it is really cool to see it in action. Analysing the dataset, preprocessing it, and selecting the right features were the most stressful portions of this project. But looking back, it was all totally worth it.

Some things I'd like to add to this project in the future are:

          Team skill & strategy

          One of the drawbacks at the moment is that the teams don't have a huge impact on the outcome. But in practice, that plays a huge role. A first-division team has a much higher chance of winning a game against a third-division team, even if the match was played at the third-division team's home ground.

The other thing I want the model to take into account is a team's ability to bounce back. There are certain teams in football that play defensively in the first half and more aggressively in the second, or vice versa.

          In order for the model to take these things into account, I plan to pre-compute these values for each team and store them locally. I can re-train the models with these features and during prediction, I can use the respective team's pre-computed values as supplemental features which should help it make better predictions.
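The planned lookup could be sketched like this; TEAM_FEATURES, its keys and its values are entirely hypothetical placeholders for the pre-computed statistics:

```python
# Hypothetical local store of pre-computed per-team features
TEAM_FEATURES = {
    "Team A": {"comeback_rate": 0.30, "second_half_goal_share": 0.58},
    "Team B": {"comeback_rate": 0.12, "second_half_goal_share": 0.44},
}

def supplement(features, home, away, store=TEAM_FEATURES):
    # Append both teams' pre-computed values to the half-time feature vector
    default = {"comeback_rate": 0.0, "second_half_goal_share": 0.5}
    h = store.get(home, default)
    a = store.get(away, default)
    return features + [h["comeback_rate"], h["second_half_goal_share"],
                       a["comeback_rate"], a["second_half_goal_share"]]
```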

          Team Roster / Player Skills

          I'd like the model to also take the players on the pitch into consideration when making the prediction. In practice, a team has a higher chance of winning the game when its star players are on the pitch.

          Live Prediction

          This is more of a long shot. As of now, the model makes the prediction based on the half-time stats. Eventually I'd like the model to predict the results for a live match all the way from minute 0 to minute 90. To do this, it must learn to account for the current match time. But training the model to account for this is going to be extremely hard.



          Prediction of Football Match Result by Python Machine Learning

Football is one of the most popular sports in the world, and the World Cup is often the most exciting time for fans. During the tournament, besides the hard-core fans who stay up late to watch the games, predicting match results becomes the most popular topic after dinner; even people who never cared much about football quietly catch up on the game, hoping to profit from a few good guesses. Today we will introduce how to use machine learning to predict the result of a football match.

This article uses the Python programming language, with the AI modeling platform Mo as the online development environment. Using Premier League match data covering the 19 seasons from 2000 to 2018, we predict match results with three supervised-learning models: logistic regression, support vector machines and XGBoost.

          Let’s take a look at the machine learning steps to predict the results of the Premier League.

          Main process steps

1. Acquire and read the data
2. Clean and preprocess the data
3. Feature engineering
4. Build machine learning models and predict
5. Summary and outlook

1. Acquiring and reading the data

First we go to the Mo workbench, create a blank project, and click "Start development" to enter the Notebook development environment with built-in JupyterLab.

          Then we need to upload data sets in the project.

The Premier League season runs every year from August to May of the following year. There are 20 teams playing a home-and-away double round-robin: each team plays 38 matches per season (19 home and 19 away), with 10 matches per round, so every Premier League season contains 380 matches.

          • Data Set Address
          • Feature Description Documents in Data Sets

          If you have built a new project on MO platform, you can import data sets directly on the platform. The process is as follows:

1.1 The CSV-reading interface

• A summary of the pandas APIs for reading and writing data

pandas.read_csv():

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None)

• filepath_or_buffer: file path
• sep: the delimiter, which defaults to a comma
• delimiter: alternative delimiter (overrides sep if specified)
• usecols: the column names to read, as a list

Let's get started:

          1.2 Time List

After obtaining the data for each year, place each season in the time_list list:
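The construction of time_list is not shown in the text; a minimal sketch, assuming one CSV file per season named after the season label, might look like:

```python
import pandas as pd

# Hypothetical layout: one CSV per season, e.g. data/2000-01.csv ... data/2018-19.csv
time_list = [f"{year}-{str(year + 1)[-2:]}" for year in range(2000, 2019)]

def read_seasons(folder="data"):
    # One DataFrame per season, in chronological order
    return [pd.read_csv(f"{folder}/{season}.csv") for season in time_list]
```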


          1.3 Use Pandas.read_csv() interface to read data

When reading, each file's data is matched one-to-one with the element names in res_name.

          1.4 Delete null values for specific files

After finding null values in row 381 of the 15th file, we delete that row.

1.4.1 The interface for deleting null values
• pandas.DataFrame.dropna(axis=0, how='any')

  • axis: 0 means rows; 1 means columns
  • how: 'any' removes a row or column if it contains any missing value; 'all' removes it only if all of its values are missing.
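A toy illustration of the difference between how='any' and how='all':

```python
import pandas as pd

df = pd.DataFrame({
    "HomeTeam": ["Hull", "Man City", None],
    "FTHG":     [0.0,    None,       None],
})
any_dropped = df.dropna(axis=0, how="any")  # drops every row with a missing value
all_dropped = df.dropna(axis=0, how="all")  # drops only rows that are entirely missing
```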
          1.4.2 Interface Application
(sample of the cleaned file's last rows, 24/05/15: Hull 0-0 Man United, Man City 2-0 Southampton, Newcastle 2-0 West Ham, with results and odds columns)

          5 rows × 68 columns

1.5 Deleting files that do not have 380 rows

Considering that there are 20 teams in the Premier League and each plays 38 matches, every season file should have 380 rows; we delete files whose row count is not 380, checking the original CSV files one by one.

1.6 Viewing the first n rows of a data set

• DataFrame.head(n)

  • n: defaults to 5; pass the number of rows of data you want.

          Read the first five lines of data:

(first rows, 19/08/00: Charlton 4-0 Man City, Chelsea 4-2 West Ham, with odds columns)

          5 rows × 45 columns

The first ten rows:

(rows from 19/08/00-20/08/00, e.g. Charlton 4-0 Man City, Chelsea 4-2 West Ham, Leicester 0-0 Aston Villa, Man United 2-0 Newcastle, with odds columns)

          10 rows × 45 columns
          Read the last five lines:

(last rows, 19/05/01: Man City 1-2 Chelsea, Middlesbrough 2-1 West Ham, Newcastle 3-0 Aston Villa, Tottenham 3-1 Man United, with odds columns)

          5 rows × 45 columns

          Read the last four lines:

(last four rows, 19/05/01: Middlesbrough 2-1 West Ham, Newcastle 3-0 Aston Villa, Tottenham 3-1 Man United, with odds columns)

          4 rows × 45 columns

          1.8 Get the name of the home team for a year

1.9 Analysing the data set headers

Each data set has a fixed number of rows, generally 380, but the number of columns may change from year to year, and what we care most about is the column headers. Since the data sets are small, you can inspect the number of columns directly, or write code to find the data set with the most columns and then use its header for a general interpretation.

We see the data includes Date, HomeTeam, AwayTeam, FTHG, HTHG, FTR and so on; for more information about the features, refer to the data set's feature description document.

          2. Data cleaning and preprocessing

We select HomeTeam, AwayTeam, FTHG, FTAG and FTR as our original feature data, and then construct some new features based on these.

2.1 Selecting the columns of interest

• HomeTeam: home team name
• AwayTeam: away team name
• FTHG: full-time goals scored by the home team
• FTAG: full-time goals scored by the away team
• FTR: full-time result (H = home win, D = draw, A = away win)

          2.2 Analysis of raw data

We first predict that the home team wins every match, then that the away team wins every match, and compare the two results.

          2.2.1 Accuracy of all home teams winning
          2.2.2 Accuracy of all away teams winning

To sum up, we can see that a home win is more likely than a draw or an away win.
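The two baseline accuracies can be computed in a couple of lines; the toy result series below is illustrative:

```python
import pandas as pd

def baseline_accuracy(ftr, predicted):
    # Accuracy of always predicting the same full-time result
    return (ftr == predicted).mean()

results = pd.Series(["H", "H", "A", "D", "H", "A", "H", "D", "H", "A"])
home_acc = baseline_accuracy(results, "H")  # always predict a home win
away_acc = baseline_accuracy(results, "A")  # always predict an away win
```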

2.3 How did Arsenal perform at home? How do we find the total number of goals they scored across all 2005-06 matches?

We know that the data for 2005-06 is in play_statistics[2]:

2.4 How well do teams perform at home in general?

First, try to find the total number of goals scored by each team across all 2005-06 matches.
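One way to total home goals per team, assuming the HomeTeam/FTHG columns described above (toy data for illustration):

```python
import pandas as pd

matches = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Arsenal", "Arsenal"],
    "FTHG":     [2, 1, 3, 0],
})
# Sum full-time home goals per home team
home_goals = matches.groupby("HomeTeam")["FTHG"].sum()
arsenal_home_goals = home_goals["Arsenal"]
```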

3. Feature Engineering

Feature engineering is the process of transforming raw data into training data for a model. Its purpose is to obtain better training features and thus a better model. Feature engineering can improve model performance, and sometimes achieves good results even with simple models. It plays a very important role in machine learning and generally includes feature construction, feature extraction and feature selection.

3.1 Constructing features

Because the league is played one season at a time, in sequence, we can compute cumulative statistics up to each match. For every week of every season we count, for both the home and the away team, the difference between goals scored and goals conceded up to that week, i.e. the cumulative goal difference.
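A minimal sketch of this cumulative goal-difference feature; the helper and the toy data are illustrative, while HTGD/ATGD match the feature names used later in the article:

```python
import pandas as pd

def add_goal_diff_features(season):
    # Record each team's cumulative goal difference *before* every match
    diff = {}  # team -> goal difference accumulated so far this season
    htgd, atgd = [], []
    for row in season.itertuples(index=False):
        htgd.append(diff.get(row.HomeTeam, 0))
        atgd.append(diff.get(row.AwayTeam, 0))
        diff[row.HomeTeam] = diff.get(row.HomeTeam, 0) + row.FTHG - row.FTAG
        diff[row.AwayTeam] = diff.get(row.AwayTeam, 0) + row.FTAG - row.FTHG
    out = season.copy()
    out["HTGD"], out["ATGD"] = htgd, atgd
    return out

# toy three-match season between teams A and B
toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "A"],
    "AwayTeam": ["B", "A", "B"],
    "FTHG": [2, 0, 1],
    "FTAG": [0, 1, 1],
})
features = add_goal_diff_features(toy)
```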

3.1.1 Calculating each team's cumulative goal difference per week

We can inspect the processed data by looking at a particular year, for example the last five rows of 2005-06.

(last rows of 2005-06, e.g. row 376: Man United 4-0 Charlton, H, home goal difference 34, away goal difference -10; row 379: West Ham 2-1 Tottenham, H, -4 and 16)

From row 376 we can see that before this match, Manchester United's cumulative goal difference at home was 34, while Charlton's away figure was -10.

3.1.2 Cumulative points of the home and away teams up to the current match week

We count the points each team has accumulated over the season up to the current match week: three points for a win, one for a draw, zero for a loss, based only on the matches played before that week. We again look at the last five rows of 2005-06:

(row 376: Man United 4-0 Charlton, H, goal differences 34 and -10, points 80 and 47; row 379: West Ham 2-1 Tottenham, H, -4 and 16, points 52 and 65)

We add HTP (the home team's accumulated points up to this week) and ATP (the away team's accumulated points up to this week).
Looking at row 376: by the final week of the season, Manchester United had 80 points and Charlton 47.

3.1.3 A team's results in its last three matches

The features constructed above reflect a team's overall form this season. Now let's look at its last three results.
We use:

HM1: the home team's most recent result (win, draw or loss).

AM1: the away team's most recent result.

Similarly, HM2 and AM2 are the second-most-recent results, and HM3 and AM3 the third-most-recent.

We continue with the last five rows of 2005-06 after processing:

(row 376: Man United 4-0 Charlton, H, 34, -10, 80, 47, recent form DLL / LWW; row 379: West Ham 2-1 Tottenham, H, -4, 16, 52, 65, form WWL / DLL)
3.1.4 The match-week feature

Next we add the match week, i.e. the week of the season in which the match takes place.
After constructing this feature, we can again look at the last five rows of 2005-06.

(row 376: Man United 4-0 Charlton, week 38; row 379: West Ham 2-1 Tottenham, week 38)
3.1.5 Merging the match information

We merge the match information of all the data sets into a single table, then divide the points and goal-difference features we just computed by the number of weeks to obtain weekly averages. You can view the last five rows of the data set after constructing these features.

(e.g. row 5696: Southampton 0-1 Man City, A, weekly-average features, form WWDDWW, week 38; row 5699: West Ham 3-1 Everton, H, form DDWWLW, week 38)

The index of the last row is 5699 and the first is 0, i.e. 5,700 rows in total. We gathered 15 years of data with 380 matches per year, so the size of the data set checks out.

3.2 Dropping some data

We constructed many features from the initial ones, and some of them are intermediate features that we need to discard. Because there is not enough win/loss history for each team in the first three weeks of a season, we also discard the first three weeks of data.

3.3 Analysing the constructed data

Earlier we computed the home and away win rates for each year. Now, looking at the valid data: are there more home wins or away wins?

The statistics show a home win rate of 46.69%, consistent with the raw-data analysis in section 2.2.1. This suggests the features we constructed are sound and close to reality.

3.4 Handling class imbalance

From the constructed features we find that home wins account for close to 50% of results, so the labels of this three-class problem are imbalanced.

We therefore simplify it to two classes (will the home team win or not), which is also one way of dealing with the imbalanced label proportions.
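The two-class simplification is a one-liner; "NH" (not a home win) is an arbitrary label choice:

```python
import pandas as pd

ftr = pd.Series(["H", "A", "D", "H", "D"])
# "H" if the home team won, "NH" otherwise
y = ftr.where(ftr == "H", "NH")
```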

3.5 Splitting the data into feature values and label values

3.6 Data normalization and standardization

We apply min-max normalization to HTP and the other match features.
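Min-max normalization with scikit-learn, on toy HTGD/ATGD values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({"HTGD": [-10.0, 0.0, 34.0], "ATGD": [-3.0, 1.0, 5.0]})
# Rescale each column to the [0, 1] range
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
```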

3.7 Converting feature data types


          5 rows × 22 columns

3.8 Pearson correlation heatmap

We plot a correlation heatmap for some of the features to see how strongly they are related to each other. We use the Seaborn package, which makes heatmaps very easy to draw:

From the figure above we can see that the correlation between HTP and HTGD is very strong, as is the correlation between ATP and ATGD, which indicates multicollinearity. This is easy to understand: the higher a team's average points per week, the higher its average goal difference per week. Since these variables give almost the same information, there is effectively multicollinearity, so we will consider dropping HTP and ATP and keeping HTGD and ATGD. Pearson heatmaps are very well suited to detecting this situation and are an indispensable tool in feature engineering. We can also see that the teams' recent results have little impact on the outcome of the current game; we keep those features anyway.

• Considering that the correlations (HTP with HTGD, ATP with ATGD) are above 90%, we delete the features HTP and ATP.
• Look at the 10 features most correlated with FTR.

We can see that the most relevant feature is HTGD: the higher a team's average home goal difference per week, the greater its probability of winning.
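The correlation computation behind the heatmap can be sketched as follows. The numbers are made up so that HTGD is an exact linear function of HTP, mimicking the multicollinearity described above:

```python
import pandas as pd

df = pd.DataFrame({
    "HTP":    [2.1, 1.0, 0.4, 2.5, 1.7],
    "HTGD":   [1.0, -0.1, -0.7, 1.4, 0.6],  # exactly HTP - 1.1
    "DiffLP": [3.0, -1.0, 2.0, 0.0, 1.0],
})
corr = df.corr()
# with seaborn imported, the heatmap itself would be: sns.heatmap(corr, annot=True)
```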

4. Building machine learning models and forecasting

4.1 Splitting the data

The data set is randomly split into a training set and a test set, returning the split training/test samples and labels. We use scikit-learn's train_test_split interface directly.

4.1.1 The train_test_split API

• X_train, X_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.3, random_state=0)
• Parameters:

  • train_data: the sample features to split
  • train_target: the sample labels to split
  • test_size: a float between 0 and 1 gives the proportion of samples; an integer gives the number of samples
  • random_state: the random number seed
• Return values:

  • X_train: training-set features
  • X_test: test-set features
  • y_train: training-set labels
  • y_test: test-set labels

Random number seed: this is effectively the identifier of a particular sequence of random numbers. When experiments need to be repeated, it guarantees that the same sequence is produced. For example, if you pass 1 every time (with the other parameters unchanged), you get the same random split; if you pass nothing, the split differs on every run. Random number generation depends on the seed, following two rules: different seeds produce different random numbers, and the same seed produces the same random numbers even across different instances.
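A minimal example of the split; note that in modern scikit-learn train_test_split lives in sklearn.model_selection (older versions had it in cross_validation):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
# 30% of the samples go to the test set; random_state fixes the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```

Running the call twice with the same random_state reproduces exactly the same split.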

4.1.2 Splitting the data in code

4.2 The models and their interfaces

Next we use three different models, logistic regression, support vector machine and XGBoost, and compare their performance. First we define some helper functions that record each model's training and evaluation time and compute its accuracy and F1 score. Let's first introduce the three models, how they relate and differ, and their interfaces.

4.2.1 Introduction to logistic regression

The logistic regression model assumes the data follow a Bernoulli distribution and solves for the parameters by maximizing the likelihood function via gradient descent, achieving binary classification. The model's main advantages are good interpretability, very good performance when feature engineering is done well, relatively fast training, and easily adjustable output. Its shortcomings are also notable: the accuracy is not very high, it struggles with imbalanced data, and so on.

4.2.2 The logistic regression model interface

API: sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

• Main parameters:

  • penalty: regularization type, l1 or l2; default: l2
  • C: inverse of the regularization strength; default: 1.0
  • fit_intercept: whether to fit an intercept; default: True
  • solver: optimization algorithm for the loss function, one of {newton-cg, lbfgs, liblinear, sag}; default: liblinear
  • multi_class: multi-class strategy, generally {ovr, multinomial}; default: ovr
  • class_weight: class weight parameter; default: None
  • random_state: random number seed; default: None
  • tol: tolerance for the iteration stopping criterion
  • n_jobs: number of parallel jobs; -1 uses all CPU cores; default: 1

          The above is a simple analysis of the main parameters, if you want to know more, you can refer to the official website.

4.2.3 Introduction to support vector machines

A Support Vector Machine (SVM) is a binary classification model. Its basic form is a linear classifier that finds the separating hyperplane with the maximum margin in feature space.

(1) When the training samples are linearly separable, a linear classifier is learned by maximizing the hard margin: the linearly separable support vector machine.
(2) When the training data are approximately linearly separable, slack variables are introduced and a linear classifier is learned by maximizing the soft margin: the linear support vector machine.
(3) When the training data are linearly inseparable, the kernel trick is used together with soft-margin maximization to learn a non-linear support vector machine.

          4.2.4 Support Vector Machine Classification Model API


• Main parameters:

  • C: the C-SVC penalty parameter; default: 1.0. A large C penalizes the slack variables heavily, pushing them towards 0 and fitting the training set more closely, which raises training accuracy but weakens generalization. A small C reduces the penalty for misclassification, tolerating some points as noise, which gives stronger generalization.
  • kernel: defaults to rbf; can be 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed':

    • 0 - linear: u'·v
    • 1 - polynomial: (gamma·u'·v + coef0)^degree
    • 2 - RBF: exp(-gamma·|u - v|^2)
    • 3 - sigmoid: tanh(gamma·u'·v + coef0)
  • degree: the degree of the poly kernel; default: 3; ignored for other kernels.
  • gamma: kernel coefficient for rbf, poly and sigmoid; by default 'auto', which selects 1/n_features.
  • coef0: the constant term of the kernel function; used by poly and sigmoid.
  • max_iter: maximum number of iterations; -1 means unlimited.
  • decision_function_shape: 'ovo', 'ovr' or None; default: None.

          The main parameters are: C, kernel, degree, gamma, coef0; for detailed parameters, please refer to the official website.

4.2.5 The idea behind XGBoost

XGBoost is one of the Boosting algorithms. The idea of boosting is to combine many weak classifiers into one strong classifier. The basic principle is that the input of the next decision tree depends on the training and prediction of the previous one. XGBoost can be seen as a boosted tree model: it combines many tree models, namely CART regression trees, into a strong classifier.

4.2.6 The XGBoost interface

xgboost.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='reg:linear', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)

• Main parameters:

  • booster: the model type, mainly gbtree or gblinear; default: gbtree
  • nthread: number of CPU threads; -1 (the default) uses all CPUs in parallel, while 1 uses a single CPU
  • scale_pos_weight: the weight of the positive class. In binary classification with imbalanced classes, setting this improves the model; for example, with a positive:negative ratio of 1:10, set scale_pos_weight=10
  • n_estimators: the total number of boosting iterations, i.e. the number of decision trees
  • early_stopping_rounds: stop training early if the validation score has not improved after n rounds
  • max_depth: tree depth; default: 6; typical values: 3-10
  • min_child_weight: larger values make underfitting easier, smaller values make overfitting easier (a large value prevents the model from learning from a few special samples); default: 1
  • learning_rate: the step size applied to the weight updates at each iteration; default: 0.1 in this wrapper
  • gamma: the minimum loss reduction required to split a node
  • reg_alpha: L1 regularization coefficient; default: 0
  • reg_lambda: L2 regularization coefficient; default: 1
  • seed: random seed

          If you want to learn the API in detail, you can refer to the official website.

4.3 Building and evaluating the machine learning models

4.3.1 Modeling
4.3.2 Initializing, training and evaluating the models
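The helper described above (timing training, then scoring F1 and accuracy) might look like this sketch; the toy data and the "H"/"NH" labels are illustrative:

```python
from time import time
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def train_and_evaluate(clf, X_train, y_train, X_test, y_test, pos_label="H"):
    # Fit the model (timing the fit), then score F1 and accuracy on the test set
    start = time()
    clf.fit(X_train, y_train)
    elapsed = time() - start
    preds = clf.predict(X_test)
    return {
        "train_time": elapsed,
        "f1": f1_score(y_test, preds, pos_label=pos_label),
        "accuracy": accuracy_score(y_test, preds),
    }

# toy, perfectly separable data standing in for the real features
X_train = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y_train = ["NH"] * 4 + ["H"] * 4
report = train_and_evaluate(LogisticRegression(), X_train, y_train,
                            [[1.5], [12.5]], ["NH", "H"])
```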

          From the run results, we found that:

          • Training time: logistic regression takes the shortest time and XGBoost the longest, at more than 2 seconds.
          • Prediction time: logistic regression takes the shortest time and the support vector machine the longest.
          • F1 score on the training set: XGBoost scores highest and the support vector machine lowest, though the gap is not large.
          • Accuracy on the training set: XGBoost scores highest and logistic regression lowest.
          • F1 score on the test set: logistic regression is best; the other two models are roughly equal and somewhat lower.
          • Accuracy on the test set: logistic regression and the support vector machine are roughly equal, slightly higher than XGBoost.
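The comparison above can be reproduced in outline with Scikit-Learn. This is a minimal sketch on a synthetic data set, since the engineered match features are not reproduced here; `xgboost.XGBClassifier()` can be added to the dict the same way when XGBoost is installed:

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the engineered match features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    # "XGBoost": xgboost.XGBClassifier(),  # add when xgboost is installed
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)          # time the training step
    elapsed = time.time() - start
    pred = model.predict(X_test)
    print(f"{name}: train {elapsed:.3f}s, "
          f"F1 {f1_score(y_test, pred):.3f}, "
          f"accuracy {accuracy_score(y_test, pred):.3f}")
```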

          4.4 Hyperparameter Tuning

          We use Scikit-Learn's GridSearchCV for hyperparameter tuning.
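As an illustration of the pattern (the original tunes XGBoost; here a logistic regression on synthetic data stands in):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Illustrative grid; with XGBoost you would sweep max_depth,
# learning_rate, n_estimators, etc. in the same way.
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)  # exhaustively tries every grid point with 5-fold CV
print(search.best_params_, search.best_score_)
```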

          4.5 Saving and Loading the Model

          Then we can save the model for future use.

          From the above, we randomly selected five samples from the test set, and four of the five predicted values matched the actual values. Given our modest accuracy, we were lucky to get this result.
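A minimal save-and-load sketch with Joblib (the file name and model here are placeholders, not the names used in the original project):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in training data and model.
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "football_model.pkl")    # save for future use
loaded = joblib.load("football_model.pkl")  # load in a later session

# The loaded model predicts exactly as the original, e.g. on five samples:
print(loaded.predict(X[:5]))
```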

          5. Summary and outlook:

          Through this article, you should now be familiar with the process of data mining, analysis and machine learning; understand the basic ideas of the logistic regression, support vector machine and XGBoost models in supervised learning; and be familiar with the basic use of the machine learning libraries Pandas, Scikit-Learn, Seaborn, XGBoost and Joblib. Note that if you do not use the Mo platform, you may also need to install XGBoost, Scikit-Learn and other third-party libraries. The Mo platform already has the commonly used machine learning libraries installed, which saves you the time of setting up a development environment, and the data set has been made public on the platform and can be imported directly. We summarize the relevant information for mainstream machine learning libraries as follows:

          • Python installation

            • Anaconda: Download address
            • IDE: Pycharm Download Address
            • Anaconda + Jupyter notebook + Pycharm: Installation tutorial
          • Machine learning tool information:

            • Numpy: Official Documentation
            • Numpy: Chinese Documentation
            • Pandas: Official Documentation
            • Pandas: Chinese Documentation
            • Matplotlib: Official Documentation
            • Matplotlib: Chinese Documentation
            • Scikit-Learn: Official Documentation
            • Scikit-Learn: Chinese Documentation

          At present, the accuracy of our model is not very high, and we can further improve our model. Here we provide some solutions:

          • 1. Get more data or use more features;
          • 2. Cross-validate the data sets;
          • 3. Process the above models further, or use model-fusion techniques;
          • 4. Analyse the players' technical and fitness information;
          • 5. Use a more comprehensive model-evaluation mechanism: at present we only consider accuracy and the F1 score; metrics such as the ROC curve and AUC could also be considered.
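Points 2 and 5 can be combined in a few lines: k-fold cross-validation scored with ROC AUC (sketched here on synthetic data rather than the match features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation, scored with the area under the ROC curve.
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(auc_scores.mean())
```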

          We have organized the above contents into a practical machine learning course. On the website you can choose the training-camp course Supervised Learning – Analysis and Prediction of Football Match Results to practise. In the process of learning, you can contact us whenever you find our mistakes or encounter difficulties.

          Mo is a Python-enabled artificial intelligence online modeling platform that can help you develop, train and deploy models quickly.

          The Mo Artificial Intelligence Club is a club initiated by the website's R&D and product design team, dedicated to lowering the threshold of AI development and use. The team has experience in big data processing, analysis, visualization and data modeling, has undertaken intelligent projects in multiple domains, and has the ability to design and develop the full stack from back end to front end. Its main research directions are big data management and analysis and artificial intelligence technology, and promoting data-driven scientific research.

          At present, the club holds an offline Technology Salon in Hangzhou every Saturday with the theme of machine learning, and carries out paper sharing and academic exchanges from time to time. I hope to gather friends from all walks of life who are interested in AI, exchange and grow together, and promote the democratization and popularization of AI.

          Football Predictions Today 12/10/2021 - Soccer Predictions - Betting Strategy #freepicks

          Evening sports fans.  Hope everyone's having a wonderful weekend.

          Before we get to the code, I'm happy to say that 7 out of 10 predictions were correct and the 3 that were wrong were draws!

          If we had put £1 single bets on each game, then for our £10 stake, we'd have had £12.86 back.  Only time will tell if this 28.6% ROI will continue.
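As a quick check of that figure:

```python
stake = 10.0      # ten £1 single bets
returned = 12.86  # total money back from the winning bets

# Return on investment: profit as a percentage of the stake.
roi = (returned - stake) / stake * 100
print(f"{roi:.1f}%")  # 28.6%
```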

          In Part 3, I spoke about limiting how far back the system would look when making its predictions and chose 100 games as a default limit.  I've now added a function to backtest different values for this.

          The updated code is available on my GitHub.

          What I've done is take 60 days of games from a year before the current date and backtest with values from 50 to 500 games, outputting the most successful value.

          I've also added a cutoff value for the predicted probability to decide if the game is worth betting on.  So the code also sweeps through values for this from 40 to 95.

          There was a problem with this approach initially in that it would get to 100% accuracy but only suggest betting on 1 game out of 100.  In other words only games that were pretty much foregone conclusions and therefore not worth betting on.

          So I've now limited this to advise betting on at least 1 game in 10.  It reports somewhere in the region of 70-90% accuracy during the backtest.

          Now this is a pretty naive form of machine learning, basically a brute force scan through what could be called our hyperparameters, so there's likely to be a danger of curve fitting.  To rule this out, I also added a function to test the parameters found during the scan on the next 60 days of games.  If the reported accuracy still looks good then we're golden.
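The scan described above boils down to something like this sketch. `backtest` here is a stand-in for the real function in the GitHub repo; all it must do is return an accuracy and a bet count for a 60-day window of games under the given settings:

```python
def scan(backtest, n_games=100, min_bet_rate=0.1):
    """Brute-force sweep over (history, cutoff); keep the most accurate
    pair that still advises betting on at least 1 game in 10."""
    best = None
    for history in range(50, 501, 50):    # games of history to consider
        for cutoff in range(40, 96, 5):   # minimum predicted probability (%)
            accuracy, n_bets = backtest(history, cutoff)
            if n_bets < n_games * min_bet_rate:
                continue  # too few bets: only "foregone conclusion" games
            if best is None or accuracy > best[0]:
                best = (accuracy, history, cutoff)
    return best  # (accuracy, history, cutoff)


# Purely illustrative stand-in for the real backtest: pretend accuracy
# rises with both parameters and 20 bets are always suggested.
def fake_backtest(history, cutoff):
    return history / 500 * cutoff / 95, 20


best = scan(fake_backtest)
print(best)  # (1.0, 500, 95)
```

The winning pair would then be re-run on the next 60 days of games, exactly as the validation step above describes, to guard against curve fitting.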

          New command line options are "-t" or "--test" to scan through the values, and "-b" or "--cutoff" to have the program print out predictions with predicted probabilities above that value.

          Running the following command line will find the best values to use for the Scottish Premiership.

          This returns with values of 450 for history and 70 for cutoff with 100% accuracy for 7 predictions out of 70 games.  Sounds too good to be true, I know!  However, it also returns 100% accuracy in the validation test.

          Running the tests on the English Premier League returns 400 & 70 with 71% accuracy for 7 games from 70.  The validation test returns 93%.

          I've only tested with the English Premier League, English Championship and the Scottish Premiership so far but as the predictions the code made in Part 3a show, it appears to be working pretty well.

          Hey, I know the code isn't pretty, efficient, elegant or any of the things it would be if a professional programmer had written it but who the hell cares if it works eh?  I'll be continuing to test it and hope some of you guys give it a try too.  Feel free to use or change the code in any way you want and if you've any ideas for improvements or fixes please share them here.  

          Maybe we can all stick it to the bookies. hehehehe.

