(original source on Kaggle: https://www.google.com/url?q=https://www.kaggle.com/hugomathien/soccer&sa=D&ust=1522946578113000)
This soccer database comes from Kaggle and is well suited for data analysis and machine learning. It contains data for soccer matches, players, and teams from several European countries from 2008 to 2016. This dataset is quite extensive, and we encourage you to read more about it here.
In this notebook, I am going to explore the dataset and address the following questions which have been explored accordingly.
# import Statements need for importing Data
import numpy as np
import pandas as pd
import sqlite3 ##Used to read the sqlite database within this notebook
import matplotlib.pyplot as plt
print("-------------------->> Imports successful")
#Setting up Path of the Database
path = "data/"
database = path + 'database.sqlite'
#Connection to Database
conn = sqlite3.connect(database)
print("-------------------->>Debug: Database connected")
#Command used to check tables in DB from existing database 'sqlite_sequence'
tables = pd.read_sql("""SELECT * FROM sqlite_sequence""",conn)
tables
# The following SQL Query does the following things:
# Select relevant Data CountryName, Leauge name, season,
#count of the number of stages, count of Home Team Leauge,
# Average Goals of home team, away team,
# The diffrence of home team goals and away team goals
# Avg Total Goals
# Sum of total goals per match
#It also Joins relevant connecting tables and selects countries 'Spain', 'Germany', 'France', 'Italy', 'England'
leages_by_season = pd.read_sql("""SELECT Country.name AS country_name,
League.name AS league_name,
season,
count(distinct stage) AS number_of_stages,
count(distinct HT.team_long_name) AS number_of_teams,
avg(home_team_goal) AS avg_home_team_scors,
avg(away_team_goal) AS avg_away_team_goals,
avg(home_team_goal-away_team_goal) AS avg_goal_diff,
avg(away_team_goal-home_team_goal) AS avg_goal_diff_rev,
avg(home_team_goal+away_team_goal) AS avg_goals,
sum(home_team_goal+away_team_goal) AS total_goals
FROM Match
JOIN Country on Country.id = Match.country_id
JOIN League on League.id = Match.league_id
LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
WHERE country_name in ('Spain', 'Germany', 'France', 'Italy', 'England')
GROUP BY Country.name, League.name, season
HAVING count(distinct stage) > 10
ORDER BY Country.name, League.name, season DESC
;""", conn)
#Here we find avg goals per season and seperate into country verticals
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns = leages_by_season['country_name'].unique())
df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goals'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']== 'Spain','avg_goals'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goals'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']== 'France','avg_goals'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goals'])
df.plot(figsize=(12,5), title='Average Goals per Game', grid=True, legend=True);
Average goals per game Analysis
Looking at the average goals per season, we can see that the team which improved the most in terms of goals scored is Germany. In the third section performed very well but were unable to manage that throughout the data,
The most consistent performing team would have to be Spain, which started a little higher than what they started out at.
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns = leages_by_season['country_name'].unique())
df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goal_diff'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']== 'Spain','avg_goal_diff'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goal_diff'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']== 'France','avg_goal_diff'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goal_diff'])
df.plot(figsize=(12,5), title='Home vs Out -#1', grid=True, legend=True);
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns = leages_by_season['country_name'].unique())
df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goal_diff_rev'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']== 'Spain','avg_goal_diff_rev'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goal_diff_rev'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']== 'France','avg_goal_diff_rev'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goal_diff_rev'])
df.plot(figsize=(12,5), title='Out vs Home-#2', grid=True, legend=True);
Home vs Out Score Diffrence
Here I have plotted two graphs Home vs Out -#1 and 'Out vs Home-#2' which will be referred to #1 and #2 respectively. Both #1 and #2 are mirror images of each other.
What the graphs show us is that the team which plays best at home is Spain and the team which plays best away(and not at home) is Germany.
index = 0
columns = "player_id", "p count"
players = pd.read_sql("SELECT * from Player_Attributes",conn)
penalties = pd.DataFrame(players['id'],players['penalties'])
penalty = pd.read_sql("""
SELECT
Player_Attributes.id,
Player_Attributes.penalties,
Player.player_name as name
FROM Player_Attributes
JOIN Player on Player.id = Player_Attributes.id
""",conn)
penalty = penalty.sort_values(by='penalties', ascending=False)
penalty.head(10)
penalty.hist(column='penalties');
Penalty Scores Analysis
As asked by the question I have displayed a list of people who have scored the maximum penalties, maximum number ranges from 92-94 shots.
Also the average number of penalties per season being about 55