Soccer Database

(original source on Kaggle: https://www.google.com/url?q=https://www.kaggle.com/hugomathien/soccer&sa=D&ust=1522946578113000)

This soccer database comes from Kaggle and is well suited for data analysis and machine learning. It contains data for soccer matches, players, and teams from several European countries from 2008 to 2016. This dataset is quite extensive, and we encourage you to read more about it here.

In this notebook, I am going to explore the dataset and address the following questions which have been explored accordingly.

In [1]:
# import Statements need for importing Data

import numpy as np
import pandas as pd
import sqlite3 ##Used to read the sqlite database within this notebook
import matplotlib.pyplot as plt

print("-------------------->>  Imports successful")
-------------------->>  Imports successful
In [2]:
#Setting up Path of the Database
path = "data/"
database = path + 'database.sqlite'

#Connection to Database
conn = sqlite3.connect(database)
print("-------------------->>Debug: Database connected")

#Command used to check tables in DB from existing database 'sqlite_sequence'
tables = pd.read_sql("""SELECT * FROM sqlite_sequence""",conn)
tables
-------------------->>Debug: Database connected
Out[2]:
name seq
0 Team 103916
1 Country 51958
2 League 51958
3 Match 51958
4 Player 11075
5 Player_Attributes 183978
6 Team_Attributes 1458
In [3]:
# The following SQL Query does the following things:
# Select relevant Data CountryName, Leauge name, season,
#count of the number of stages, count of Home Team Leauge,
# Average Goals of home team, away team, 
# The diffrence of home team goals and away team goals
# Avg Total Goals
# Sum of total goals per match
#It also Joins relevant connecting tables and selects countries 'Spain', 'Germany', 'France', 'Italy', 'England'


leages_by_season = pd.read_sql("""SELECT Country.name AS country_name, 
                                        League.name AS league_name, 
                                        season,
                                        count(distinct stage) AS number_of_stages,
                                        count(distinct HT.team_long_name) AS number_of_teams,
                                        avg(home_team_goal) AS avg_home_team_scors, 
                                        avg(away_team_goal) AS avg_away_team_goals, 
                                        avg(home_team_goal-away_team_goal) AS avg_goal_diff,
                                        avg(away_team_goal-home_team_goal) AS avg_goal_diff_rev,
                                        avg(home_team_goal+away_team_goal) AS avg_goals, 
                                        sum(home_team_goal+away_team_goal) AS total_goals                                       
                                FROM Match
                                JOIN Country on Country.id = Match.country_id
                                JOIN League on League.id = Match.league_id
                                LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
                                LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
                                WHERE country_name in ('Spain', 'Germany', 'France', 'Italy', 'England')
                                GROUP BY Country.name, League.name, season
                                HAVING count(distinct stage) > 10
                                ORDER BY Country.name, League.name, season DESC
                                ;""", conn)

Average Goals Per Game

In [4]:
#Here we find avg goals per season and seperate into country verticals
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns = leages_by_season['country_name'].unique())


df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goals'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']== 'Spain','avg_goals'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goals'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']== 'France','avg_goals'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goals'])
df.plot(figsize=(12,5), title='Average Goals per Game', grid=True, legend=True);

Average goals per game Analysis

Looking at the average goals per season, we can see that the team which improved the most in terms of goals scored is Germany. In the third section performed very well but were unable to manage that throughout the data,

The most consistent performing team would have to be Spain, which started a little higher than what they started out at.

Home vs Out Score Diffrence

In [5]:
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns = leages_by_season['country_name'].unique())

df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goal_diff'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']== 'Spain','avg_goal_diff'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goal_diff'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']== 'France','avg_goal_diff'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goal_diff'])
df.plot(figsize=(12,5), title='Home vs Out -#1', grid=True, legend=True);
In [6]:
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns = leages_by_season['country_name'].unique())

df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goal_diff_rev'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']== 'Spain','avg_goal_diff_rev'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goal_diff_rev'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']== 'France','avg_goal_diff_rev'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goal_diff_rev'])
df.plot(figsize=(12,5), title='Out vs Home-#2', grid=True, legend=True);

Home vs Out Score Diffrence

Here I have plotted two graphs Home vs Out -#1 and 'Out vs Home-#2' which will be referred to #1 and #2 respectively. Both #1 and #2 are mirror images of each other.

What the graphs show us is that the team which plays best at home is Spain and the team which plays best away(and not at home) is Germany.

Penalty Scores

In [7]:
index = 0
columns = "player_id", "p count"
players = pd.read_sql("SELECT * from Player_Attributes",conn)
penalties = pd.DataFrame(players['id'],players['penalties'])
In [8]:
penalty = pd.read_sql(""" 
SELECT 
Player_Attributes.id,
Player_Attributes.penalties,
Player.player_name as name

FROM Player_Attributes
JOIN Player on Player.id = Player_Attributes.id
""",conn)
In [9]:
penalty = penalty.sort_values(by='penalties', ascending=False)
penalty.head(10)
Out[9]:
id penalties name
2958 2962 94.0 Emanuele Belardi
2959 2963 94.0 Emanuele Calaio
2957 2961 94.0 Emanuel Sarki
2956 2960 94.0 Emanuel Rivas
7686 7697 93.0 Moritz Leitner
4913 4918 93.0 Jesus Fernandez
7685 7696 93.0 Moritz Hartmann
4914 4919 93.0 Jesus Gamez
4912 4917 93.0 Jesus Datolo
6523 6533 92.0 Manu Torres
In [10]:
penalty.hist(column='penalties');

Penalty Scores Analysis

As asked by the question I have displayed a list of people who have scored the maximum penalties, maximum number ranges from 92-94 shots.

Also the average number of penalties per season being about 55