A Small Note: Welcome to my final Data Analysis Project. I have tried my level best to analyze and ask questions which will help answer not only the initial question asked, but also provide an insight into the demographics of the patients that attend health services. I have generated graphics where required and tried to think of some intresting questions, although not very useful, but indeed intresting to know. Thank you for taking the time to go through my DA project, any feedback is welcome.
-- Arjun Sabharwal
I have chosen the following dataset because firstly i find medical problems very intriguing. And although not this one exactly, it kind of makes me realise the potential of data analysis not to only make profit but also save lives.
#Global Import Statements
#These Statements allow packages to be called seamlessly throughout the whole notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
# 'magic word' so that visualizations are plotted Inline
%matplotlib inline
#Importing Data and doing some basic level exploration of the data
df = pd.read_csv('./data/noshowappointments-kagglev2-may-2016.csv')
df.head(3)
df.shape
df.info()
df.describe()
sum(df.duplicated())
The data set has 110527 rows and 14 columns.
They are also no duplicates and no missing values in the data.
DataType | No of Columns |
---|---|
Float64 | 1 |
int64 | 8 |
String object | 5 |
Problems with the Data Set:
- The patient ID uses an E notation (After futher investigation, i realised that the E for only display purposes, the actual number is present below and unique, so this is not a problem)
- 'ScheduledDay' and 'AppointmentDay' use a non-standard time format
- time for both above-mentioned value is not needed
- no show could be a boolean to make it simpler
# After discussing the structure of the data and any problems that need to be
# cleaned, perform those cleaning steps in the second part of this section.
#Introductory Information
df['PatientId'].describe()
type(df['ScheduledDay'][0])
#Converting 'ScheduledDay' and 'AppointmentDay' to Date time Objects
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'],format="%Y-%m-%dT%H:%M:%SZ")
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'],format="%Y-%m-%dT%H:%M:%SZ")
#Doing steps to make the 'No-Show' Column make sense
#Refrence : Be careful about the encoding of the last column:
#it says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
df['No-show'].replace({"Yes": 0, "No": 1},inplace=True)
#Renaming colomn to something more appropriate
df.rename(index=str, columns={'No-show': 'Attendance'}, inplace=True)
#Checking result
df.head()
Tip: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.
# ## Finding the Frequency of Patients attending
# p_id.columns = ['patient_id', 'count']
p_id = df['PatientId'].value_counts()
## Finding the mean count of patients repeating.
p_mean = round(p_id.mean(),2)
#Graph to plot first 500 Values
graph = p_id.iloc[:500].plot(kind="bar", figsize=(10,10), title="Frequency of Repeated Visits");
# plt.xlabel("Patient ID", fontsize=18)
plt.ylabel("Frequency of Vists",fontsize=18);
#Code used to hide XAsis Labels
ax1 = plt.axes();
x_axis = ax1.axes.get_xaxis();
x_axis.set_visible(False);
plt.show();
plt.close();
#Answers to Questions
print("On an Average Patients visit {} times for Health Services.".format(p_mean))
print("The maximum any one patient has visited is {}.".format(p_id.max()))
print("Although for the whole data set the visit frequency sharply falls off. ")
gen = df['Gender'].value_counts()
gen.plot(kind="bar", title="Gender Demograph");
plt.xlabel("Gender",fontsize=14);
plt.ylabel("Count",fontsize=14);
#Calculating Percentile for Gender
f_per = gen[0]*100 / (gen[0]+gen[1])
f_per = round(f_per,2)
print("Female Percentile is : {}%".format(f_per))
Summary for Question 2
Out of the total population of the patients attending, the gender demograph shows that 65% is occupied by females and the remaining 35% by male.
Here is this question, we are finding out age group of patients that request the maximum medical services. We have separated the entire age data into bins of 5 from 0 to 115. Also, there is some invalid data from ages -5 to 0 which we have dropped.
#Caluclating age count and seperating into bins of 5
age = df['Age'].value_counts(bins=(-5,0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,105,110,115), sort=False)
#Dropping First column (from -5 to 0 ) as invalid data
age.drop(df.index[0])
#plotting Bar Graph
age.plot(kind='bar', animated=True, title = "Age Demograph");
plt.xlabel("Age Span",fontsize=14);
plt.ylabel("Count",fontsize=14);
age.head()
The Maximum Patients that attend are from the Age of 0-5 Years of age
Closely followed by 50-55 Years of Age
# gen = df.groupby(['Gender','PatientId'])['Age'].mean()
# agex = df.groupby(['Age'])['PatientId'].mean()
#Seprating whole DataSet into Male and Female
male = df.query('Gender== "M"')
female = df.query('Gender== "F"')
#Dropping Unessary Colmns from Both Data Sets
male = male.drop(['AppointmentID','ScheduledDay','AppointmentDay','Neighbourhood','Scholarship','Hipertension','Diabetes','Alcoholism','Handcap','SMS_received','Attendance','Gender'], axis=1)
female = female.drop(['AppointmentID','ScheduledDay','AppointmentDay','Neighbourhood','Scholarship','Hipertension','Diabetes','Alcoholism','Handcap','SMS_received','Attendance','Gender'], axis=1)
#Plotting Both Graphs
male.plot(x='PatientId', y='Age', kind="hist", title="Male Age Demograph");
female.plot(x='PatientId', y='Age', kind="hist", color="orange", title="Male Age Demograph");
#Finding count of Males.
male['Age'].value_counts(bins=(-5,0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,105,110,115)).head()
#Finding count of females
female['Age'].value_counts(bins=(-5,0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,105,110,115)).head()
Here we can see the following from the two graphs, firstly the male population is present in great quantity during the intervals of 0-5 years of age. This is also present for the female demographic but it is less pronounced.
After that spike, female admittance rate is overall higher throughout the range. The maximum value of the female population is at the age of 50-55 Years.
popular = df['Neighbourhood'].value_counts()
popular.iloc[:20].plot(kind="bar", title="Popularity Index");
plt.xlabel("Hospitals",fontsize=14);
plt.ylabel("Count",fontsize=14);
popular.head()
Most popular hospitals in Order are
Hospital Name | Case Counts |
---|---|
JARDIM CAMBURI | 7717 |
MARIA ORTIZ | 5805 |
RESISTÊNCIA | 4431 |
JARDIM DA PENHA | 3877 |
ITARARÉ | 3514 |
CENTRO | 3334 |
These hospitals get the most visits from the given data set.
#Taking SheduleDay and stripping the time
ScheduleDate = pd.to_datetime(df['ScheduledDay'].dt.date)
#Taking AppointmentDay and stripping the time
AppointmentDate = pd.to_datetime(df['AppointmentDay'].dt.date)
#Getting Attendance records
confirm = df['Attendance']
#Calculatuing Diffrence
diff = AppointmentDate - ScheduleDate
#Creating new DataFrame to store values.
ans = pd.DataFrame()
#Assigning columns one by one
ans = ans.assign(Diffrence=diff.values)
ans = ans.assign(Attendance=confirm.values)
#Checking values of new DF
ans.head()
#Finding value counts of diffrence
gph_pr = ans['Diffrence'].value_counts()
gph_pr.head()
#plotting bar graph
gph_pr[:25].plot(kind="barh", figsize=(10,10), title="Chances of Cancellation to Days booked Prior");
plt.xlabel("Count",fontsize=14);
plt.ylabel("Days",fontsize=14);
gph_pr.head()
In this question we try to understand if the duration between between booking time and shedule appoitment time makes any diffrence to the attendance of a particular patient.
If we see the graph we have plotted, it can be observed that patients that book on the same day (proably morning) have the higest percentange of showing up.
This is then followed by patients that book two days and four days respectively.
Days Prior Appointment Taken | Successful Admission |
---|---|
0 days | 38563 |
2 days | 6725 |
4 days | 5290 |
1 days | 5213 |
7 days | 4906 |
6 days | 4037 |
5 days | 3277 |
14 days | 2913 |
Scholarship = df['Scholarship'].value_counts()
Hipertension = df['Hipertension'].value_counts()
Diabetes = df['Diabetes'].value_counts()
Alcoholism = df['Alcoholism'].value_counts()
Handcap = df['Handcap'].value_counts()
Scholarship.replace({0: False, 1: True},inplace=True)
sch_per = Scholarship[0]*100 / (Scholarship[0]+Scholarship[1])
Scholarship.plot(kind="pie", title="Distribution of Scholarship in population");
hyper_per = Hipertension[0]*100 / (Hipertension[0]+Hipertension[1])
Hipertension.plot(kind="pie", title="Distribution of Hypertention in population");
dia_per = Diabetes[0]*100/(Diabetes[0]+Diabetes[1])
Diabetes.plot(kind="pie", title="Distribution of Diabetes in population");
Alcohol_per = Alcoholism[0]*100 / (Alcoholism[0]+Alcoholism[1])
Alcoholism.plot(kind="pie", title="Distribution of Alcoholism in population" );
Handcap.plot(kind="bar",title="Distribution of handicap in population" );
print("Scholorship Percentage {}%".format(sch_per))
print("HyperTention Percentage {}%".format(hyper_per))
print("Diabities Percentage {}%".format(dia_per))
print("Alcoholism Percentage {}%".format(Alcohol_per))
In this question we are trying to find the percentages of people who have a certain condition vs the general population. We have also considered if the patients have health care (Scholarship) for the same
Attribute | Yes | No |
---|---|---|
Scholorship | 9.83% | 90.17% |
HyperTention | 19.72% | 80.28% |
Diabities | 7.19% | 92.81% |
Alcholism | 3.04% | 96.96% |
The above stats show that people who have diseases are rather limited for the data set and only about 10% if the population have a scholarship. Also, HyperTention seems to be the most prevalent disease with about 20% People having it.
#Declare new Data Frame
sms_check = pd.DataFrame()
#Store only values which have recived sms
sms_sent = df.query('SMS_received == 1')
#Load Attendance Values
sms_appr = sms_sent['Attendance']
#Load Attendance Values in new DF
sms_check = sms_check.assign(Attendance=sms_appr.values)
#find Count of people who attended
sms_gph = sms_check['Attendance'].value_counts()
#Plot a bar graph
sms_gph.plot(kind="bar", title="Success Rate of SMS Reminder Service");
plt.ylabel("Count",fontsize=14);
sms_per = round(sms_gph[1]*100/(sms_gph[0]+sms_gph[1]),2)
print("The SMS Service is a success with {}% amongst subscribers and is a great success and should be kept".format(sms_per))
#making a copy of the original DataSet as we will need to drop values
df_final = df.copy()
df.head(0)
#Dropping some columns which are not relevant to this question.
df_final.drop(['AppointmentID','AppointmentDay','Gender','ScheduledDay','Neighbourhood','Age'],axis=1, inplace=True)
df_final.head()
# For the following values we first make sure that they are present and then we sort accoording to Attendance
# This way We know for diffrent attendance, what are the demographs.
#Values = Scholarship,Hipertension,Diabetes,Alcoholism,Handcap
#Plotting for Scholarship Present
df_schr = df_final.query("Scholarship == '1'")
df_schr.groupby(['Attendance','Scholarship'])['PatientId'].count().plot(kind="barh",title="Attendance for Scholarship");
#Plotting for Scholarship not Present (Absent)
df_schr_no = df_final.query("Scholarship == '0'")
df_schr_no.groupby(['Attendance','Scholarship'])['PatientId'].count().plot(kind="barh",title="Attendance for Non Scholarship");
#Plotting for Diabities Present
df_dia = df_final.query("Diabetes == '1'")
df_dia.groupby(['Attendance','Diabetes'])['PatientId'].count().plot(kind="barh",title="Attendance for Diabetes");
#Plotting for Alcoholism Present
df_acl = df_final.query("Alcoholism == '1'")
df_acl.groupby(['Attendance','Alcoholism'])['PatientId'].count().plot(kind="barh",title="Attendance for Alcoholism");
#Plotting for any Handicap Present
df_hnd = df_final.query("Handcap == '1'")
df_hnd.groupby(['Attendance','Handcap'])['PatientId'].count().plot(kind="barh",title="Attendance for Handcap");
##We use this mean value to vertiy if the above bar charts are indeed correct.
df_final.groupby(['Attendance']).mean()
In the next step, we are going to do analysis on the patient attendance of the different days of the week, this will allow us to gauge that does the day of the week make any difference to missed appointments.
#Find day of week
days = pd.DataFrame()
apptDate = pd.to_datetime(df['AppointmentDay'],format="%Y-%m-%dT%H:%M:%SZ")
attn = df['Attendance']
#Converting to WeekDay Name
days['dayNo'] = pd.to_datetime(df['AppointmentDay'],format="%Y-%m-%dT%H:%M:%SZ").dt.weekday_name
days = days.assign(attn=attn.values)
#Seperating into two Data Sets
days_true = days.query("attn==1") #Data for people who came for their appointments
days_false = days.query("attn==0") #Data for people who did not come for their appointments
days_true['dayNo'].value_counts().plot(kind="bar",title="Days Appointments are Successful");
days_false['dayNo'].value_counts().plot(kind="bar",title="Days Appointments are Unsuccessful");
The final question was a challenging one, and I have found a few points to draw conclusions.
#Percentage of Patients attended
attn_percent = df['Attendance'].value_counts()
percent = attn_percent[1]*100 /(attn_percent[0]+attn_percent[1])
percent = round(percent,2)
print("The Attendance Percentage is {}%".format(percent))
#Graphically Visualated
df['Attendance'].value_counts().plot(kind="bar");
For the final visualization, i have decided to plot a line graph of all the different parameters. They are two distinct states for the data ( yes an no). This breaks up the whole data set and first starts with total attendance and amount of people which have a certain disorder or condition. Scholarship has been also included while visualization the data
df['Attendance'].value_counts().plot(legend=True, label = "Total Attendance")
df_schr_no = df_final.query("Scholarship == '0'")
df_schr_no.groupby(['Attendance'])['Scholarship'].count().plot(legend=True)
df_schr = df_final.query("Scholarship == '1'")
df_schr.groupby(['Attendance'])['Scholarship'].count().plot(legend=True)
df_hyper_no = df_final.query("Hipertension == '0'")
df_hyper_no.groupby(['Attendance'])['Hipertension'].count().plot(legend=True, label = "hyperTention No");
df_hyper = df_final.query("Hipertension == '1'")
df_hyper.groupby(['Attendance'])['Hipertension'].count().plot(legend=True, label = "hyperTention Yes");
df_dia_no = df_final.query("Diabetes == '0'")
df_dia_no.groupby(['Attendance'])['Diabetes'].count().plot(legend=True, label = "Dia No");
df_dia = df_final.query("Diabetes == '1'")
df_dia.groupby(['Attendance'])['Diabetes'].count().plot(legend=True, label = "Dia Yes");
df_acl_no = df_final.query("Alcoholism == '0'")
df_acl_no.groupby(['Attendance'])['Alcoholism'].count().plot(legend=True, label = "Alcohol No");
df_acl = df_final.query("Alcoholism == '1'")
df_acl.groupby(['Attendance'])['Alcoholism'].count().plot(legend=True, label = "Alcohol Yes");
df_hnd_no = df_final.query("Handcap == '0'")
df_hnd_no.groupby(['Attendance'])['Handcap'].count().plot(legend=True, label = "Handi No");
df_hnd = df_final.query("Handcap == '1'")
df_hnd.groupby(['Attendance'])['Handcap'].count().plot(legend=True, label = "Handi Yes", figsize=(10,10));
Conclusions from Graph
This dataset posed a few limitations to me