A Small Note: Welcome to my final Data Analysis Project. I have tried my level best to analyze and ask questions which will help answer not only the initial question asked, but also provide an insight into the demographics of the patients that attend health services. I have generated graphics where required and tried to think of some intresting questions, although not very useful, but indeed intresting to know. Thank you for taking the time to go through my DA project, any feedback is welcome.

-- Arjun Sabharwal

Project: No-show appointments

Table of Contents

Introduction

I have chosen the following dataset because firstly i find medical problems very intriguing. And although not this one exactly, it kind of makes me realise the potential of data analysis not to only make profit but also save lives.

In [1]:
#Global Import Statements
#These Statements allow packages to be called seamlessly throughout the whole notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
# 'magic word' so that visualizations are plotted Inline
%matplotlib inline

Data Wrangling

Tip: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

General Properties

In [2]:
#Importing Data and doing some basic level exploration of the data
df = pd.read_csv('./data/noshowappointments-kagglev2-may-2016.csv')
df.head(3)
Out[2]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
In [3]:
df.shape
Out[3]:
(110527, 14)
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
In [5]:
df.describe()
Out[5]:
PatientId AppointmentID Age Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received
count 1.105270e+05 1.105270e+05 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000
mean 1.474963e+14 5.675305e+06 37.088874 0.098266 0.197246 0.071865 0.030400 0.022248 0.321026
std 2.560949e+14 7.129575e+04 23.110205 0.297675 0.397921 0.258265 0.171686 0.161543 0.466873
min 3.921784e+04 5.030230e+06 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4.172614e+12 5.640286e+06 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 3.173184e+13 5.680573e+06 37.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 9.439172e+13 5.725524e+06 55.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 9.999816e+14 5.790484e+06 115.000000 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000
In [6]:
sum(df.duplicated())
Out[6]:
0

Initial Findings of Dataset

The data set has 110527 rows and 14 columns.

They are also no duplicates and no missing values in the data.

DataType No of Columns
Float64 1
int64 8
String object 5

Problems with the Data Set:

  • The patient ID uses an E notation (After futher investigation, i realised that the E for only display purposes, the actual number is present below and unique, so this is not a problem)
  • 'ScheduledDay' and 'AppointmentDay' use a non-standard time format
  • time for both above-mentioned value is not needed
  • no show could be a boolean to make it simpler

Data Cleaning (Replace this with more specific notes!)

In [7]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.

#Introductory Information
df['PatientId'].describe()
type(df['ScheduledDay'][0]) 
Out[7]:
str
In [8]:
#Converting 'ScheduledDay' and 'AppointmentDay' to Date time Objects
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'],format="%Y-%m-%dT%H:%M:%SZ")
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'],format="%Y-%m-%dT%H:%M:%SZ")
In [9]:
#Doing steps to make the 'No-Show' Column make sense
#Refrence : Be careful about the encoding of the last column:
#it says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
df['No-show'].replace({"Yes": 0, "No": 1},inplace=True)
In [10]:
#Renaming colomn to something more appropriate
df.rename(index=str, columns={'No-show': 'Attendance'}, inplace=True)
In [11]:
#Checking result
df.head()
Out[11]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received Attendance
0 2.987250e+13 5642903 F 2016-04-29 18:38:08 2016-04-29 62 JARDIM DA PENHA 0 1 0 0 0 0 1
1 5.589978e+14 5642503 M 2016-04-29 16:08:27 2016-04-29 56 JARDIM DA PENHA 0 0 0 0 0 0 1
2 4.262962e+12 5642549 F 2016-04-29 16:19:04 2016-04-29 62 MATA DA PRAIA 0 0 0 0 0 0 1
3 8.679512e+11 5642828 F 2016-04-29 17:29:31 2016-04-29 8 PONTAL DE CAMBURI 0 0 0 0 0 0 1
4 8.841186e+12 5642494 F 2016-04-29 16:07:23 2016-04-29 56 JARDIM DA PENHA 0 1 1 0 0 0 1

Exploratory Data Analysis

Tip: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

Questions to ask

  • How frequently to patients come for followup appointments after their initial appointment,
  • Gender demographics of patients attending
  • Age Demographics of Patients attending
  • Age v/s Gender demographics
  • Most popular neighbourhood
  • How many days prior an appointment is booked and the chances of cancellation
  • Percentage of patients that have the scholarship, hypertension, diabetes, alcoholism and physical handicap
  • Percentage of SMS sent to people who showed up for their appointment and is it viable to yet keep this service
  • What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?

Question 1

How frequently to patients come for followup appointments after their initial appointment ?

In [12]:
# ## Finding the Frequency of Patients attending
# p_id.columns = ['patient_id', 'count']

p_id = df['PatientId'].value_counts()


## Finding the mean count of patients repeating.
p_mean = round(p_id.mean(),2)
In [13]:
#Graph to plot first 500 Values
graph = p_id.iloc[:500].plot(kind="bar", figsize=(10,10), title="Frequency of Repeated Visits");
# plt.xlabel("Patient ID", fontsize=18)
plt.ylabel("Frequency of Vists",fontsize=18);

#Code used to hide XAsis Labels
ax1 = plt.axes();
x_axis = ax1.axes.get_xaxis();
x_axis.set_visible(False);
plt.show();
plt.close();
D:\Anaconda3\lib\site-packages\matplotlib\cbook\deprecation.py:106: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance.  In a future version, a new instance will always be created and returned.  Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  warnings.warn(message, mplDeprecation, stacklevel=1)
In [14]:
#Answers to Questions
print("On an Average Patients visit {} times for Health Services.".format(p_mean))
print("The maximum any one patient has visited is {}.".format(p_id.max()))
print("Although for the whole data set the visit frequency sharply falls off. ")
On an Average Patients visit 1.77 times for Health Services.
The maximum any one patient has visited is 88.
Although for the whole data set the visit frequency sharply falls off. 

Question 2

Gender demographics of patients attending healthcare (Overall)

In [62]:
gen = df['Gender'].value_counts()
gen.plot(kind="bar", title="Gender Demograph");
plt.xlabel("Gender",fontsize=14);
plt.ylabel("Count",fontsize=14);
In [64]:
#Calculating Percentile for Gender
f_per = gen[0]*100 / (gen[0]+gen[1])
f_per = round(f_per,2)
print("Female Percentile is : {}%".format(f_per))
Female Percentile is : 65.0%

Summary for Question 2

Out of the total population of the patients attending, the gender demograph shows that 65% is occupied by females and the remaining 35% by male.

Question 3

Age Demographics of Patients attending healthcare (Overall)

Here is this question, we are finding out age group of patients that request the maximum medical services. We have separated the entire age data into bins of 5 from 0 to 115. Also, there is some invalid data from ages -5 to 0 which we have dropped.

In [17]:
#Caluclating age count and seperating into bins of 5
age = df['Age'].value_counts(bins=(-5,0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,105,110,115), sort=False)
#Dropping First column (from -5 to 0 ) as invalid data
age.drop(df.index[0])
#plotting Bar Graph
age.plot(kind='bar', animated=True, title = "Age Demograph");
plt.xlabel("Age Span",fontsize=14);
plt.ylabel("Count",fontsize=14);
In [65]:
age.head()
Out[65]:
(-5.001, 0.0]    3540
(0.0, 5.0]       8192
(5.0, 10.0]      7018
(10.0, 15.0]     5719
(15.0, 20.0]     7380
Name: Age, dtype: int64

Summary for Question 3

The Maximum Patients that attend are from the Age of 0-5 Years of age

Closely followed by 50-55 Years of Age

Question 4

Age v/s Gender demographics

In [20]:
# gen = df.groupby(['Gender','PatientId'])['Age'].mean()
# agex = df.groupby(['Age'])['PatientId'].mean()

#Seprating whole DataSet into Male and Female
male = df.query('Gender== "M"')
female = df.query('Gender== "F"')

#Dropping Unessary Colmns from Both Data Sets
male = male.drop(['AppointmentID','ScheduledDay','AppointmentDay','Neighbourhood','Scholarship','Hipertension','Diabetes','Alcoholism','Handcap','SMS_received','Attendance','Gender'], axis=1)
female = female.drop(['AppointmentID','ScheduledDay','AppointmentDay','Neighbourhood','Scholarship','Hipertension','Diabetes','Alcoholism','Handcap','SMS_received','Attendance','Gender'], axis=1)
In [21]:
#Plotting Both Graphs
male.plot(x='PatientId', y='Age', kind="hist", title="Male Age Demograph");
female.plot(x='PatientId', y='Age', kind="hist", color="orange", title="Male Age Demograph");
In [79]:
#Finding count of Males.
male['Age'].value_counts(bins=(-5,0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,105,110,115)).head()
Out[79]:
(0.0, 5.0]      4309
(5.0, 10.0]     3758
(10.0, 15.0]    2621
(50.0, 55.0]    2472
(55.0, 60.0]    2457
Name: Age, dtype: int64
In [80]:
#Finding count of females 
female['Age'].value_counts(bins=(-5,0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,105,110,115)).head()
Out[80]:
(50.0, 55.0]    5447
(35.0, 40.0]    5323
(55.0, 60.0]    5285
(25.0, 30.0]    5233
(30.0, 35.0]    5203
Name: Age, dtype: int64

Summary to Question 4

Here we can see the following from the two graphs, firstly the male population is present in great quantity during the intervals of 0-5 years of age. This is also present for the female demographic but it is less pronounced.

After that spike, female admittance rate is overall higher throughout the range. The maximum value of the female population is at the age of 50-55 Years.

Question 5

In [22]:
popular = df['Neighbourhood'].value_counts()
popular.iloc[:20].plot(kind="bar", title="Popularity Index");
plt.xlabel("Hospitals",fontsize=14);
plt.ylabel("Count",fontsize=14);
In [72]:
popular.head()
Out[72]:
JARDIM CAMBURI     7717
MARIA ORTIZ        5805
RESISTÊNCIA        4431
JARDIM DA PENHA    3877
ITARARÉ            3514
Name: Neighbourhood, dtype: int64

Most popular hospitals in Order are

Hospital Name Case Counts
JARDIM CAMBURI 7717
MARIA ORTIZ 5805
RESISTÊNCIA 4431
JARDIM DA PENHA 3877
ITARARÉ 3514
CENTRO 3334

These hospitals get the most visits from the given data set.

Question 6

How many days prior an appointment is booked and the chances of cancellation

In [24]:
#Taking SheduleDay and stripping the time
ScheduleDate = pd.to_datetime(df['ScheduledDay'].dt.date)
#Taking AppointmentDay and stripping the time
AppointmentDate  = pd.to_datetime(df['AppointmentDay'].dt.date)
#Getting Attendance records
confirm = df['Attendance']
#Calculatuing Diffrence
diff = AppointmentDate - ScheduleDate
In [25]:
#Creating new DataFrame to store values.
ans = pd.DataFrame()
#Assigning columns one by one
ans = ans.assign(Diffrence=diff.values)
ans = ans.assign(Attendance=confirm.values)
#Checking values of new DF
ans.head()
Out[25]:
Diffrence Attendance
0 0 days 1
1 0 days 1
2 0 days 1
3 0 days 1
4 0 days 1
In [26]:
#Finding value counts of diffrence
gph_pr = ans['Diffrence'].value_counts()
gph_pr.head()
Out[26]:
0 days    38563
2 days     6725
4 days     5290
1 days     5213
7 days     4906
Name: Diffrence, dtype: int64
In [27]:
#plotting bar graph
gph_pr[:25].plot(kind="barh", figsize=(10,10), title="Chances of Cancellation to Days booked Prior");
plt.xlabel("Count",fontsize=14);
plt.ylabel("Days",fontsize=14);
In [81]:
gph_pr.head()
Out[81]:
0 days    38563
2 days     6725
4 days     5290
1 days     5213
7 days     4906
Name: Diffrence, dtype: int64

Conclusions for Question 6

In this question we try to understand if the duration between between booking time and shedule appoitment time makes any diffrence to the attendance of a particular patient.

If we see the graph we have plotted, it can be observed that patients that book on the same day (proably morning) have the higest percentange of showing up.

This is then followed by patients that book two days and four days respectively.


Days Prior Appointment Taken Successful Admission
0 days 38563
2 days 6725
4 days 5290
1 days 5213
7 days 4906
6 days 4037
5 days 3277
14 days 2913

Question 7

Percentage of patients that have the scholarship, hypertension, diabetes, alcoholism and physical handicap

In [29]:
Scholarship = df['Scholarship'].value_counts()
Hipertension = df['Hipertension'].value_counts()
Diabetes = df['Diabetes'].value_counts()
Alcoholism = df['Alcoholism'].value_counts()
Handcap = df['Handcap'].value_counts()
Scholarship.replace({0: False, 1: True},inplace=True)
In [91]:
sch_per = Scholarship[0]*100 / (Scholarship[0]+Scholarship[1])
Scholarship.plot(kind="pie", title="Distribution of Scholarship in population");
In [92]:
hyper_per = Hipertension[0]*100 / (Hipertension[0]+Hipertension[1])
Hipertension.plot(kind="pie", title="Distribution of Hypertention in population");
In [93]:
dia_per = Diabetes[0]*100/(Diabetes[0]+Diabetes[1])
Diabetes.plot(kind="pie", title="Distribution of Diabetes in population");
In [94]:
Alcohol_per = Alcoholism[0]*100 / (Alcoholism[0]+Alcoholism[1])
Alcoholism.plot(kind="pie", title="Distribution of Alcoholism in population" );
In [34]:
Handcap.plot(kind="bar",title="Distribution of handicap in population" );
In [95]:
print("Scholorship Percentage {}%".format(sch_per))
print("HyperTention Percentage {}%".format(hyper_per))
print("Diabities Percentage {}%".format(dia_per))
print("Alcoholism Percentage {}%".format(Alcohol_per))
Scholorship Percentage 90.17344178345563%
HyperTention Percentage 80.27540781890397%
Diabities Percentage 92.81352067820532%
Alcoholism Percentage 96.96001881893112%

Question 7 Conclusion

In this question we are trying to find the percentages of people who have a certain condition vs the general population. We have also considered if the patients have health care (Scholarship) for the same

Attribute Yes No
Scholorship 9.83% 90.17%
HyperTention 19.72% 80.28%
Diabities 7.19% 92.81%
Alcholism 3.04% 96.96%

The above stats show that people who have diseases are rather limited for the data set and only about 10% if the population have a scholarship. Also, HyperTention seems to be the most prevalent disease with about 20% People having it.

Question 8

Percentage of SMS sent to people who showed up for their appointment and is it viable to yet keep this service

In [35]:
#Declare new Data Frame
sms_check = pd.DataFrame()

#Store only values which have recived sms
sms_sent = df.query('SMS_received == 1')
#Load Attendance Values
sms_appr = sms_sent['Attendance']
#Load Attendance Values in new DF
sms_check = sms_check.assign(Attendance=sms_appr.values)
#find Count of people who attended
sms_gph = sms_check['Attendance'].value_counts()
In [36]:
#Plot a bar graph
sms_gph.plot(kind="bar", title="Success Rate of SMS Reminder Service");
plt.ylabel("Count",fontsize=14);
In [37]:
sms_per = round(sms_gph[1]*100/(sms_gph[0]+sms_gph[1]),2)
print("The SMS Service is a success with {}% amongst subscribers and is a great success and should be kept".format(sms_per))
The SMS Service is a success with 72.43% amongst subscribers and is a great success and should be kept

Question 9

What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?

In [38]:
#making a copy of the original DataSet as we will need to drop values
df_final = df.copy()
df.head(0)
Out[38]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received Attendance
In [39]:
#Dropping some columns which are not relevant to this question.
df_final.drop(['AppointmentID','AppointmentDay','Gender','ScheduledDay','Neighbourhood','Age'],axis=1, inplace=True)
In [40]:
df_final.head()
Out[40]:
PatientId Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received Attendance
0 2.987250e+13 0 1 0 0 0 0 1
1 5.589978e+14 0 0 0 0 0 0 1
2 4.262962e+12 0 0 0 0 0 0 1
3 8.679512e+11 0 0 0 0 0 0 1
4 8.841186e+12 0 1 1 0 0 0 1
In [41]:
# For the following values we first make sure that they are present and then we sort accoording to Attendance
# This way We know for diffrent attendance, what are the demographs.
#Values = Scholarship,Hipertension,Diabetes,Alcoholism,Handcap

#Plotting for Scholarship Present
df_schr = df_final.query("Scholarship == '1'")
df_schr.groupby(['Attendance','Scholarship'])['PatientId'].count().plot(kind="barh",title="Attendance for Scholarship");
In [42]:
#Plotting for Scholarship not Present (Absent)
df_schr_no = df_final.query("Scholarship == '0'")
df_schr_no.groupby(['Attendance','Scholarship'])['PatientId'].count().plot(kind="barh",title="Attendance for Non Scholarship");
In [44]:
#Plotting for Diabities Present
df_dia = df_final.query("Diabetes == '1'")
df_dia.groupby(['Attendance','Diabetes'])['PatientId'].count().plot(kind="barh",title="Attendance for Diabetes");
In [45]:
#Plotting for Alcoholism Present
df_acl = df_final.query("Alcoholism == '1'")
df_acl.groupby(['Attendance','Alcoholism'])['PatientId'].count().plot(kind="barh",title="Attendance for Alcoholism");
In [46]:
#Plotting for any Handicap Present
df_hnd = df_final.query("Handcap == '1'")
df_hnd.groupby(['Attendance','Handcap'])['PatientId'].count().plot(kind="barh",title="Attendance for Handcap");
In [47]:
##We use this mean value to vertiy if the above bar charts are indeed correct.
df_final.groupby(['Attendance']).mean()
Out[47]:
PatientId Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received
Attendance
0 1.467523e+14 0.115507 0.169004 0.064071 0.030333 0.020297 0.438371
1 1.476845e+14 0.093903 0.204392 0.073837 0.030417 0.022742 0.291334

In the next step, we are going to do analysis on the patient attendance of the different days of the week, this will allow us to gauge that does the day of the week make any difference to missed appointments.

In [49]:
#Find day of week
days = pd.DataFrame()

apptDate = pd.to_datetime(df['AppointmentDay'],format="%Y-%m-%dT%H:%M:%SZ")
attn = df['Attendance']
#Converting to WeekDay Name
days['dayNo'] =  pd.to_datetime(df['AppointmentDay'],format="%Y-%m-%dT%H:%M:%SZ").dt.weekday_name


days = days.assign(attn=attn.values)
#Seperating into two Data Sets

days_true = days.query("attn==1") #Data for people who came for their appointments
days_false = days.query("attn==0") #Data for people who did not come for their appointments
In [50]:
days_true['dayNo'].value_counts().plot(kind="bar",title="Days Appointments are Successful");
In [51]:
days_false['dayNo'].value_counts().plot(kind="bar",title="Days Appointments are Unsuccessful");

Question 9 Summary

The final question was a challenging one, and I have found a few points to draw conclusions.

  • Patients that are suffering from hypertension, Diabetes, Alcoholism and Handicap have a higher probability of showing up for scheduled appointments.
  • The section of the population that has a medical scholarship has a slightly higher chance of not missing their appointment
  • The Day overall where most appointments are missed is Tuesday followed by Wednesday.
  • The maximum success rate in terms of booking time in advance is the same day and then two and four days.

Conclusions

In [52]:
#Percentage of Patients attended
attn_percent = df['Attendance'].value_counts()
percent = attn_percent[1]*100 /(attn_percent[0]+attn_percent[1])
percent = round(percent,2)

print("The Attendance Percentage is {}%".format(percent))
The Attendance Percentage is 79.81%
In [53]:
#Graphically Visualated 
df['Attendance'].value_counts().plot(kind="bar");

For the final visualization, i have decided to plot a line graph of all the different parameters. They are two distinct states for the data ( yes an no). This breaks up the whole data set and first starts with total attendance and amount of people which have a certain disorder or condition. Scholarship has been also included while visualization the data

In [54]:
df['Attendance'].value_counts().plot(legend=True, label = "Total Attendance")

df_schr_no = df_final.query("Scholarship == '0'")
df_schr_no.groupby(['Attendance'])['Scholarship'].count().plot(legend=True)
df_schr = df_final.query("Scholarship == '1'")
df_schr.groupby(['Attendance'])['Scholarship'].count().plot(legend=True)
                                                               
df_hyper_no = df_final.query("Hipertension == '0'")
df_hyper_no.groupby(['Attendance'])['Hipertension'].count().plot(legend=True, label = "hyperTention No");                                                               
df_hyper = df_final.query("Hipertension == '1'")
df_hyper.groupby(['Attendance'])['Hipertension'].count().plot(legend=True, label = "hyperTention Yes");
                                                               
df_dia_no = df_final.query("Diabetes == '0'")
df_dia_no.groupby(['Attendance'])['Diabetes'].count().plot(legend=True, label = "Dia No");
df_dia = df_final.query("Diabetes == '1'")
df_dia.groupby(['Attendance'])['Diabetes'].count().plot(legend=True, label = "Dia Yes");

df_acl_no = df_final.query("Alcoholism == '0'")
df_acl_no.groupby(['Attendance'])['Alcoholism'].count().plot(legend=True, label = "Alcohol No");
df_acl = df_final.query("Alcoholism == '1'")
df_acl.groupby(['Attendance'])['Alcoholism'].count().plot(legend=True, label = "Alcohol Yes");
                                                               
                                                               
df_hnd_no = df_final.query("Handcap == '0'")
df_hnd_no.groupby(['Attendance'])['Handcap'].count().plot(legend=True, label = "Handi No");
df_hnd = df_final.query("Handcap == '1'")
df_hnd.groupby(['Attendance'])['Handcap'].count().plot(legend=True, label = "Handi Yes", figsize=(10,10));

Conclusions from Graph

  • We can see in the first figure that the attendance is about 75% + irrespective of any condition which is very good.
  • The Second Line graph also shows that no disability cripples any patients from frequently missing appointments
  • Other than that I have explored the data set and answered as many questions as possible.

Limitations

This dataset posed a few limitations to me

  • It was very hard to draw solid conclusions, and although i was able to plot the data well. Finding drastic changes was not very easy.
  • There was rather limited data about the patients and that limited to understand the effect of diseases on the attendance of pateints. More personal information about the patients would have also been benifitial.