Table of Contents¶

Introduction
Posing Questions
Data Collection and Wrangling
- Condensing the Trip Data
Exploratory Data Analysis
- Statistics
- Visualizations
Performing Your Own Analysis
Conclusions

Introduction¶

Tip: Quoted sections like this will provide helpful instructions on how to navigate and use a Jupyter notebook.

Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles for short trips, typically 30 minutes or less. Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used.

In this project, you will perform an exploratory analysis on data provided by Motivate, a bike-share system provider for many major cities in the United States. You will compare the system usage between three large cities: New York City, Chicago, and Washington, DC. You will also see if there are any differences within each system for those users that are registered, regular users and those users that are short-term, casual users.

Posing Questions¶

Before looking at the bike sharing data, you should start by asking questions you might want to understand about the bike share data. Consider, for example, if you were working for Motivate. What kinds of information would you want to know about in order to make smarter business decisions? If you were a user of the bike-share service, what factors might influence how you would want to use the service?

Question 1: Write at least two questions related to bike sharing that you think could be answered by data.

Answer: I think the following questions can be answered using the given data set

which stations are most popular in terms of usage ?
which days of the week/ month are the most busy ? and when should scheduled maintainence be planned
What age group of people ride bikes ? and how could you target the other age groups
When seeing the data which gender of riders is more ? and accordingly how can this influence buying of new vehicles

Tip: If you double click on this cell, you will see the text change so that all of the formatting is removed. This allows you to edit this block of text. This block of text is written using Markdown, which is a way to format text using headers, links, italics, and many other options using a plain-text syntax. You will also use Markdown later in the Nanodegree program. Use Shift + Enter or Shift + Return to run the cell and show its rendered form.

Data Collection and Wrangling¶

Now it's time to collect and explore our data. In this project, we will focus on the record of individual trips taken in 2016 from our selected cities: New York City, Chicago, and Washington, DC. Each of these cities has a page where we can freely download the trip data.:

New York City (Citi Bike): Link
Chicago (Divvy): Link
Washington, DC (Capital Bikeshare): Link

If you visit these pages, you will notice that each city has a different way of delivering its data. Chicago updates with new data twice a year, Washington DC is quarterly, and New York City is monthly. However, you do not need to download the data yourself. The data has already been collected for you in the /data/ folder of the project files. While the original data for 2016 is spread among multiple files for each city, the files in the /data/ folder collect all of the trip data for the year into one file per city. Some data wrangling of inconsistencies in timestamp format within each city has already been performed for you. In addition, a random 2% sample of the original data is taken to make the exploration more manageable.

Question 2: However, there is still a lot of data for us to investigate, so it's a good idea to start off by looking at one entry from each of the cities we're going to analyze. Run the first code cell below to load some packages and functions that you'll be using in your analysis. Then, complete the second code cell to print out the first trip recorded from each of the cities (the second line of each data file).

Tip: You can run a code cell like you formatted Markdown cells above by clicking on the cell and using the keyboard shortcut Shift + Enter or Shift + Return. Alternatively, a code cell can be executed using the Play button in the toolbar after selecting it. While the cell is running, you will see an asterisk in the message to the left of the cell, i.e. In [*]:. The asterisk will change into a number to show that execution has completed, e.g. In [1]. If there is output, it will show up as Out [1]:, with an appropriate number to match the "In" number.

## import all necessary packages and functions.
import csv # read and write csv files
import decimal #using to round off numbers 
from datetime import datetime # operations to parse dates
from pprint import pprint # use to print data structures like dictionaries in
                          # a nicer way than the base print function.

def print_first_point(filename):
    """
    This function prints and returns the first data point (second row) from
    a csv file that includes a header row.
    """
    # print city name for reference
    city = filename.split('-')[0].split('/')[-1]
#     print('\nCity: {}'.format(city))
    
    with open(filename, 'r', newline='') as f_in:
        ## TODO: Use the csv library to set up a DictReader object. ##
        ## see https://docs.python.org/3/library/csv.html           ##
        ## Use fileIn to 
        trip_reader = csv.DictReader(f_in)

        
        def fetchData(readerObj):
            output = readerObj.__next__()
#             pprint(output)
            return output
        
        
        ## TODO: Use a function on the DictReader object to read the     ##
        ## first trip from the data file and store it in a variable.     ##
        ## see https://docs.python.org/3/library/csv.html#reader-objects ##
        first_trip = fetchData(trip_reader)
        

        ## TODO: Use the pprint library to print the first trip. ##
        ## see https://docs.python.org/3/library/pprint.html     ##
        pprint("The Data is from the city for {}".format(city))
        pprint(first_trip)
        print("\n")
    # output city name and first trip for later testing
    return (city, first_trip )

# list of files for each city
data_files = ['./data/NYC-CitiBike-2016.csv',
              './data/Chicago-Divvy-2016.csv',
              './data/Washington-CapitalBikeshare-2016.csv',]

# print the first trip from each file, store in dictionary
example_trips = {}
for data_file in data_files:
    city, first_trip = print_first_point(data_file)
    example_trips[city] = first_trip
    
#print(example_trips.keys())

'The Data is from the city for NYC'
OrderedDict([('tripduration', '839'),
             ('starttime', '1/1/2016 00:09:55'),
             ('stoptime', '1/1/2016 00:23:54'),
             ('start station id', '532'),
             ('start station name', 'S 5 Pl & S 4 St'),
             ('start station latitude', '40.710451'),
             ('start station longitude', '-73.960876'),
             ('end station id', '401'),
             ('end station name', 'Allen St & Rivington St'),
             ('end station latitude', '40.72019576'),
             ('end station longitude', '-73.98997825'),
             ('bikeid', '17109'),
             ('usertype', 'Customer'),
             ('birth year', ''),
             ('gender', '0')])


'The Data is from the city for Chicago'
OrderedDict([('trip_id', '9080545'),
             ('starttime', '3/31/2016 23:30'),
             ('stoptime', '3/31/2016 23:46'),
             ('bikeid', '2295'),
             ('tripduration', '926'),
             ('from_station_id', '156'),
             ('from_station_name', 'Clark St & Wellington Ave'),
             ('to_station_id', '166'),
             ('to_station_name', 'Ashland Ave & Wrightwood Ave'),
             ('usertype', 'Subscriber'),
             ('gender', 'Male'),
             ('birthyear', '1990')])


'The Data is from the city for Washington'
OrderedDict([('Duration (ms)', '427387'),
             ('Start date', '3/31/2016 22:57'),
             ('End date', '3/31/2016 23:04'),
             ('Start station number', '31602'),
             ('Start station', 'Park Rd & Holmead Pl NW'),
             ('End station number', '31207'),
             ('End station', 'Georgia Ave and Fairmont St NW'),
             ('Bike number', 'W20842'),
             ('Member Type', 'Registered')])

If everything has been filled out correctly, you should see below the printout of each city name (which has been parsed from the data file name) that the first trip has been parsed in the form of a dictionary. When you set up a DictReader object, the first row of the data file is normally interpreted as column names. Every other row in the data file will use those column names as keys, as a dictionary is generated for each row.

This will be useful since we can refer to quantities by an easily-understandable label instead of just a numeric index. For example, if we have a trip stored in the variable row, then we would rather get the trip duration from row['duration'] instead of row[0].

Condensing the Trip Data¶

It should also be observable from the above printout that each city provides different information. Even where the information is the same, the column names and formats are sometimes different. To make things as simple as possible when we get to the actual exploration, we should trim and clean the data. Cleaning the data makes sure that the data formats across the cities are consistent, while trimming focuses only on the parts of the data we are most interested in to make the exploration easier to work with.

You will generate new data files with five values of interest for each trip: trip duration, starting month, starting hour, day of the week, and user type. Each of these may require additional wrangling depending on the city:

Duration: This has been given to us in seconds (New York, Chicago) or milliseconds (Washington). A more natural unit of analysis will be if all the trip durations are given in terms of minutes.
Month, Hour, Day of Week: Ridership volume is likely to change based on the season, time of day, and whether it is a weekday or weekend. Use the start time of the trip to obtain these values. The New York City data includes the seconds in their timestamps, while Washington and Chicago do not. The datetime package will be very useful here to make the needed conversions.
User Type: It is possible that users who are subscribed to a bike-share system will have different patterns of use compared to users who only have temporary passes. Washington divides its users into two types: 'Registered' for users with annual, monthly, and other longer-term subscriptions, and 'Casual', for users with 24-hour, 3-day, and other short-term passes. The New York and Chicago data uses 'Subscriber' and 'Customer' for these groups, respectively. For consistency, you will convert the Washington labels to match the other two.

Question 3a: Complete the helper functions in the code cells below to address each of the cleaning tasks described above.

def duration_in_mins(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the trip duration in units of minutes.
    
    Remember that Washington is in terms of milliseconds while Chicago and NYC
    are in terms of seconds. 
    
    HINT: The csv module reads in all of the data as strings, including numeric
    values. You will need a function to convert the strings into an appropriate
    numeric type when making your transformations.
    see https://docs.python.org/3/library/functions.html
    """

    
#     print(city)
#     print("Printing datum \n")
#     print(datum)
   
#     print("Trip Duration {}".format(times))
#     print(datum)
    # YOUR CODE HERE
    
    times = 0 # A temp variable which will be used to calculate the duration
    if(city == 'NYC') or (city == 'Chicago'):
        times = duration = datum['tripduration']
        duration = float(times)/60 # here we divide by 60 as we are converting seconds to minutes
    elif(city == 'Washington'):
        times = duration = datum['Duration (ms)']
        duration = float(times)/(60*1000) # Here we divide by 60 x 1000 for MilliSeconds
    #Rounding off values to 4 Decimal Points
    duration = round(duration,4)
    #Debug Statement
#     print("The time duration for the city of {} is {} seconds" .format(city,duration))
    
    return duration


# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': 13.9833,
         'Chicago': 15.4333,
         'Washington': 7.1231}

for city in tests:
    assert abs(duration_in_mins(example_trips[city], city) - tests[city]) < .001

from datetime import datetime
from datetime import date
from datetime import time
def time_of_trip(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the month, hour, and day of the week in
    which the trip was made.
    
    Remember that NYC includes seconds, while Washington and Chicago do not.
    
    HINT: You should use the datetime module to parse the original date
    strings into a format that is useful for extracting the desired information.
    see https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
    """
    # Days Dict declared to map values of days to the strings
    days = ['Monday','Tuesday','Wednesday','Thursday', 'Friday','Saturday','Sunday'] 
    
    #Sample time : 3/31/2016 22:57
    # Converting current data into date time Object depending on city the Date format chnages.
    
    if(city == 'NYC'):
        start = datum['starttime']
        start = datetime.strptime(start, '%m/%d/%Y %H:%M:%S') #Additional Seconds field present 
    elif(city=='Chicago'):
        start = datum['starttime']
        start = datetime.strptime(start, '%m/%d/%Y %H:%M')
    elif(city=='Washington'):
        start = datum['Start date'] #Column name is different for Washington
        start = datetime.strptime(start, '%m/%d/%Y %H:%M')
        
    month = start.month
    hour = start.hour
    day_of_week = days[start.weekday()]

    # YOUR CODE HERE
    #Debug Print Statement - To Remove
#     print("Month = {} || Hour of Day = {} || Day of Week = {}".format(month,hour,day_of_week))
    return (month, hour, day_of_week)


# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': (1, 0, 'Friday'),
         'Chicago': (3, 23, 'Thursday'),
         'Washington': (3, 22, 'Thursday')}

for city in tests:
    assert time_of_trip(example_trips[city], city) == tests[city]

def type_of_user(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the type of system user that made the
    trip.
    
    Remember that Washington has different category names compared to Chicago
    and NYC. 
    """
    
    user_type = '' # This var will hold the usertype
    if(city == 'NYC') or (city == 'Chicago'):
        user_type = datum['usertype']
    elif(city == 'Washington'):
        temp = datum['Member Type']
        if(temp == 'Registered'):
            user_type = 'Subscriber'
        elif(temp == 'Casual'):
            user_type = 'Customer'
    
    #Debug Statement - To Comment
#     print("The UserType is {} " .format(user_type))
        
    return user_type


# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': 'Customer',
         'Chicago': 'Subscriber',
         'Washington': 'Subscriber'}

for city in tests:
    assert type_of_user(example_trips[city], city) == tests[city]

Question 3b: Now, use the helper functions you wrote above to create a condensed data file for each city consisting only of the data fields indicated above. In the /examples/ folder, you will see an example datafile from the Bay Area Bike Share before and after conversion. Make sure that your output is formatted to be consistent with the example file.

def condense_data(in_file, out_file, city):
    """
    This function takes full data from the specified input file
    and writes the condensed data to a specified output file. The city
    argument determines how the input file will be parsed.
    
    HINT: See the cell below to see how the arguments are structured!
    """
    
    with open(out_file, 'w') as f_out, open(in_file, 'r') as f_in:
        # set up csv DictWriter object - writer requires column names for the
        # first row as the "fieldnames" argument
        out_colnames = ['duration', 'month', 'hour', 'day_of_week', 'user_type']        
        trip_writer = csv.DictWriter(f_out, fieldnames = out_colnames)
        trip_writer.writeheader()
        
        ## TODO: set up csv DictReader object ##
        trip_reader = csv.DictReader(f_in)
        count = 0;

        # collect data from and process each row
        for row in trip_reader:
            # set up a dictionary to hold the values for the cleaned and trimmed
            # data point = new_point
            new_point = {}
            #duration calculated in minutes
            new_point['duration'] = duration_in_mins(row, city)
            #Temp variable taken to parse the list having needed values
            temp = time_of_trip(row, city)
            
            new_point['month'] = temp[0]
            new_point['hour'] = temp[1]
            new_point['day_of_week'] = temp[2]
            new_point['user_type'] = type_of_user(row, city)
            #print(new_point) #Debug - Please comment
            trip_writer.writerow(new_point)


            ## TODO: use the helper functions to get the cleaned data from  ##
            ## the original data dictionaries.                              ##
            ## Note that the keys for the new_point dictionary should match ##
            ## the column names set in the DictWriter object above.         ##
            

            ## TODO: write the processed information to the output file.     ##
            ## see https://docs.python.org/3/library/csv.html#writer-objects ##

# Run this cell to check your work
city_info = {'Washington': {'in_file': './data/Washington-CapitalBikeshare-2016.csv',
                            'out_file': './data/Washington-2016-Summary.csv'},
             'Chicago': {'in_file': './data/Chicago-Divvy-2016.csv',
                         'out_file': './data/Chicago-2016-Summary.csv'},
             'NYC': {'in_file': './data/NYC-CitiBike-2016.csv',
                     'out_file': './data/NYC-2016-Summary.csv'}}

for city, filenames in city_info.items():
    condense_data(filenames['in_file'], filenames['out_file'], city)
    print_first_point(filenames['out_file'])

'The Data is from the city for Washington'
OrderedDict([('duration', '7.1231'),
             ('month', '3'),
             ('hour', '22'),
             ('day_of_week', 'Thursday'),
             ('user_type', 'Subscriber')])


'The Data is from the city for Chicago'
OrderedDict([('duration', '15.4333'),
             ('month', '3'),
             ('hour', '23'),
             ('day_of_week', 'Thursday'),
             ('user_type', 'Subscriber')])


'The Data is from the city for NYC'
OrderedDict([('duration', '13.9833'),
             ('month', '1'),
             ('hour', '0'),
             ('day_of_week', 'Friday'),
             ('user_type', 'Customer')])

Tip: If you save a jupyter Notebook, the output from running code blocks will also be saved. However, the state of your workspace will be reset once a new session is started. Make sure that you run all of the necessary code blocks from your previous session to reestablish variables and functions before picking up where you last left off.

Exploratory Data Analysis¶

Now that you have the data collected and wrangled, you're ready to start exploring the data. In this section you will write some code to compute descriptive statistics from the data. You will also be introduced to the matplotlib library to create some basic histograms of the data.

Statistics¶

First, let's compute some basic counts. The first cell below contains a function that uses the csv module to iterate through a provided data file, returning the number of trips made by subscribers and customers. The second cell runs this function on the example Bay Area data in the /examples/ folder. Modify the cells to answer the question below.

Question 4a: Which city has the highest number of trips? Which city has the highest proportion of trips made by subscribers? Which city has the highest proportion of trips made by short-term customers?

Answer: The city with the highest number of trips is New York City; Totaling 276798 users

City	NYC	Chicago	Washington
Proportion of Subscribers	88.84 %	76.23 %	78.03 %
Proportion of Customers	11.16 %	23.77 %	21.97 %

The City with the highest Proportion of Subscribers Trips in Percentile is : NYC at 88.84%

The City with the highest Proportion of Customers Trips in Percentile is : Chicago at 23.77 %

def number_of_trips(filename):
    """
    This function reads in a file with trip data and reports the number of
    trips made by subscribers, customers, and total overall.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
#         print(city)
        
        # initialize count variables
        n_subscribers = 0
        n_customers = 0
        # Percentile Proportion calculation of Subscribers & Customers to Total Trips.
        P_sub = 0 
        p_cus = 0
        
        # tally up ride types
        for row in reader:
            if row['user_type'] == 'Subscriber':
                n_subscribers += 1
            else:
                n_customers += 1
        
        # compute total number of rides and proportion percentiles.
        n_total = n_subscribers + n_customers
        p_sub = round((n_subscribers*100)/n_total,2)
        p_cus = round((n_customers*100)/n_total,2)
    
        # return tallies as a tuple
        print("For the city of {} ".format(city))
        print("Subscribers Users : {}, Short Term Users :  {},  Total Users : {}".format(n_subscribers, n_customers, n_total))
        print('The Proportion of Subscribers to the total amount is {} %'.format(p_sub))
        print('The Proportion of Customers to the total amount is {} %'.format(p_cus))
        print("\n")
        
        return(n_subscribers, n_customers, n_total)

## Modify this and the previous cell to answer Question 4a. Remember to run ##
## the function on the cleaned data files you created from Question 3.      ##

# data_file = './examples/BayArea-Y3-Summary.csv'
data_file = [
    './data/NYC-2016-Summary.csv',
    './data/Chicago-2016-Summary.csv',
    './data/Washington-2016-Summary.csv']

for rows in data_file:
    (number_of_trips(rows))

For the city of NYC 
Subscribers Users : 245896, Short Term Users :  30902,  Total Users : 276798
The Proportion of Subscribers to the total amount is 88.84 %
The Proportion of Customers to the total amount is 11.16 %


For the city of Chicago 
Subscribers Users : 54982, Short Term Users :  17149,  Total Users : 72131
The Proportion of Subscribers to the total amount is 76.23 %
The Proportion of Customers to the total amount is 23.77 %


For the city of Washington 
Subscribers Users : 51753, Short Term Users :  14573,  Total Users : 66326
The Proportion of Subscribers to the total amount is 78.03 %
The Proportion of Customers to the total amount is 21.97 %

Tip: In order to add additional cells to a notebook, you can use the "Insert Cell Above" and "Insert Cell Below" options from the menu bar above. There is also an icon in the toolbar for adding new cells, with additional icons for moving the cells up and down the document. By default, new cells are of the code type; you can also specify the cell type (e.g. Code or Markdown) of selected cells from the Cell menu or the dropdown in the toolbar.

Now, you will write your own code to continue investigating properties of the data.

Question 4b: Bike-share systems are designed for riders to take short trips. Most of the time, users are allowed to take trips of 30 minutes or less with no additional charges, with overage charges made for trips of longer than that duration. What is the average trip length for each city? What proportion of rides made in each city are longer than 30 minutes?

Answer:

Following is the structured form of data through which the average trip length along with the percentile of trips exceeding 30 mins.

City Name	Average time(Mins)	Excess Trip %
NYC	15.81	7.317 %
Chicago	16.56	8.347 %
Washington	18.93	10.839 %

The most rides exceeding 30 mins are made in the city of Washington accounting to 10.839 % from the total rides taken.

## Use this and additional cells to answer Question 4b.                 ##
##                                                                      ##
## HINT: The csv module reads in all of the data as strings, including  ##
## numeric values. You will need a function to convert the strings      ##
## into an appropriate numeric type before you aggregate data.          ##
## TIP: For the Bay Area example, the average trip length is 14 minutes ##
## and 3.5% of trips are longer than 30 minutes.                        ##
def avg_trip_len(filename):
    """
    This function calculates average trip duration for the diffrent cities.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
        
        #init variables 
        counter = 0 #Count Var
        excessTrip = 0 # Counter for trips Greater than 30 mins
        r_sum = 0 #Sum var        
        for row in reader:
            duration = float(row['duration'])
            r_sum += duration  #Reading duration variable and adding to sum for avg
            counter += 1 # Incrementing counter by one
            #Checks if trip is more than 30 mins and increments counter
            if (duration >= 30):
                excessTrip +=1
            
            
            
    
    #Debug print Statements
    print("Sum = {}".format(r_sum))
    print("Count = {}".format(counter))
    print("Trips over 30 Mins = {}".format(excessTrip))
    
    ##Calculation Avergae (Sum / Count )
    avg = float(r_sum/counter) 
    #Rounding off Average to two decimals
    avg = round(avg,2)
    
    ##Calculation of Excess Trip Percentage
    excess = float((excessTrip*100)/counter)
    #Rounding off  to two decimals
    excess = round(excess,3)
    
    print("City Name : {}".format(city))
    print("Average time is {} Minutes ".format(avg))
    print("Excess Trip Percentage is {} %".format(excess))
    print("\n")
    
    return (avg,excess)

##Testing Code Block 

data_file = [
    './data/NYC-2016-Summary.csv',
    './data/Chicago-2016-Summary.csv',
    './data/Washington-2016-Summary.csv']

for file in data_file:
    avg_trip_len(file)

Sum = 4376894.130100047
Count = 276798
Trips over 30 Mins = 20253
City Name : NYC
Average time is 15.81 Minutes 
Excess Trip Percentage is 7.317 %


Sum = 1194751.1489000123
Count = 72131
Trips over 30 Mins = 6021
City Name : Chicago
Average time is 16.56 Minutes 
Excess Trip Percentage is 8.347 %


Sum = 1255741.7745999938
Count = 66326
Trips over 30 Mins = 7189
City Name : Washington
Average time is 18.93 Minutes 
Excess Trip Percentage is 10.839 %

Question 4c: Dig deeper into the question of trip duration based on ridership. Choose one city. Within that city, which type of user takes longer rides on average: Subscribers or Customers?

Answer:

For the City of NYC - Average Subscriber Trip Duration : 13.68 minutes & the Average Customer Trip Duration 32.98 Minutes.

Customers are observed to take longer trips for NYC compared to Subscribers.

## Use this and additional cells to answer Question 4c. If you have    ##
## not done so yet, consider revising some of your previous code to    ##
## make use of functions for reusability.                              ##
##                                                                     ##
## TIP: For the Bay Area example data, you should find the average     ##
## Subscriber trip duration to be 9.5 minutes and the average Customer ##
## trip duration to be 54.6 minutes. Do the other cities have this     ##
## level of difference?                                                ##
        
def longer_rides(filename):
    """
    This function calculates advanced user statistics trip details regarding duration of riding for different customers.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
        
        #init variables 
        subCounter = 0 #Subscriber Counter
        cusCounter = 0 #Customer Counter
        
        subSUM = 0 #Sum of Subscribers to calculate average
        cusSUM = 0 #Sum of Customers to calculate average
        
        subAVG = 0 # Average Value of Subscribers
        cusAVG = 0 # Average Value of Customers
        
        counter = 0

        for row in reader:
            #Converting duration to float point value
            duration = float(row['duration'])
            
            if(row['user_type']=='Subscriber'):
                subCounter += 1
                subSUM += duration
                ## If user_type is 'Subscriber' increment counter and add duration to 'subSUM' Variable
                
            elif(row['user_type']== 'Customer'):
                cusCounter += 1
                cusSUM += duration
                #If user_type is 'Customer' increment counter and add duration to 'cusSUM' Variable
    
            counter += counter+1
    
    #Debug print Statements
#     print("subCounter = {}, subSUM = {} ".format(subCounter, subSUM))
#     print("cusCounter = {}, cusSUM ={}  ".format(cusCounter, cusSUM))


    subAVG = round(subSUM/subCounter,2) #A
    cusAVG = round(cusSUM/cusCounter,2) #B
    
    print("For the City of {},\n Average Subscriber Duration : {} Minutes,\n Average Customer Duration {} Minutes. \n" .format(city,subAVG,cusAVG))
    
    return (subAVG,cusAVG)

data_file = [
    './data/NYC-2016-Summary.csv',
    './data/Chicago-2016-Summary.csv',
    './data/Washington-2016-Summary.csv']

longer_rides(data_file[0])

For the City of NYC,
 Average Subscriber Duration : 13.68 Minutes,
 Average Customer Duration 32.98 Minutes.

(13.68, 32.98)

Visualizations¶

The last set of values that you computed should have pulled up an interesting result. While the mean trip time for Subscribers is well under 30 minutes, the mean trip time for Customers is actually above 30 minutes! It will be interesting for us to look at how the trip times are distributed. In order to do this, a new library will be introduced here, matplotlib. Run the cell below to load the library and to generate an example plot.

# load library
import matplotlib.pyplot as plt

# this is a 'magic word' that allows for plots to be displayed
# inline with the notebook. If you want to know more, see:
# http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline 

# example histogram, data taken from bay area sample
data = [ 7.65,  8.92,  7.42,  5.50, 16.17,  4.20,  8.98,  9.62, 11.48, 14.33,
        19.02, 21.53,  3.90,  7.97,  2.62,  2.67,  3.08, 14.40, 12.90,  7.83,
        25.12,  8.30,  4.93, 12.43, 10.60,  6.17, 10.88,  4.78, 15.15,  3.53,
         9.43, 13.32, 11.72,  9.85,  5.22, 15.10,  3.95,  3.17,  8.78,  1.88,
         4.55, 12.68, 12.38,  9.78,  7.63,  6.45, 17.38, 11.90, 11.52,  8.63,]
plt.hist(data)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (m)')
plt.show()

In the above cell, we collected fifty trip times in a list, and passed this list as the first argument to the .hist() function. This function performs the computations and creates plotting objects for generating a histogram, but the plot is actually not rendered until the .show() function is executed. The .title() and .xlabel() functions provide some labeling for plot context.

You will now use these functions to create a histogram of the trip times for the city you selected in question 4c. Don't separate the Subscribers and Customers for now: just collect all of the trip times and plot them.

import csv # read and write csv files
import matplotlib.pyplot as plt
%matplotlib inline
def plotHist(filename):
    data = []
    with open(filename, 'r') as f_in:
        reader = csv.DictReader(f_in)
    
        for row in reader:
            data.append(row['duration'])
            
    #Plotting Graph for NYC Trip Duration        
    plt.hist(data)
    plt.title('NYC Trip Durations')
    plt.xlabel('Duration (mins)')
    plt.ylabel('Count')
    plt.show()
    
plotHist('./data/NYC-2016-Summary.csv')

If you followed the use of the .hist() and .show() functions exactly like in the example, you're probably looking at a plot that's completely unexpected. The plot consists of one extremely tall bar on the left, maybe a very short second bar, and a whole lot of empty space in the center and right. Take a look at the duration values on the x-axis. This suggests that there are some highly infrequent outliers in the data. Instead of reprocessing the data, you will use additional parameters with the .hist() function to limit the range of data that is plotted. Documentation for the function can be found [here].

Question 5: Use the parameters of the .hist() function to plot the distribution of trip times for the Subscribers in your selected city. Do the same thing for only the Customers. Add limits to the plots so that only trips of duration less than 75 minutes are plotted. As a bonus, set the plots up so that bars are in five-minute wide intervals. For each group, where is the peak of each distribution? How would you describe the shape of each distribution?

Answer: The City I have chosen is NYC. The peak distribution for the Subscriber User Type is between 0-10 (about 8). The peak distribution for the Customer User Type is between 20-30 (20 being the peak). Both these distributions are right-skewed, this shows that most people are using the vehicles for short trips.

Although, it can be observed that they are longer trips by the customers are compared to the Subscribers.

def extractData(filename='./data/NYC-2016-Summary.csv'):
    """
    This function extracts data from the the data set, it seprates the data into two lists for use by the graphing function.
    One for Subscribers and one for Customers
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
        
        subData = [] #Subscriber Data List init
        custData = [] #Customer Data List init
        
        for row in reader:
            #Converting duration to float point value
            duration = float(row['duration'])
            
            if(row['user_type']=='Subscriber'):
                subData.append(duration)
                ## If user_type is 'Subscriber' add to 'SubData' List
                
            elif(row['user_type']== 'Customer'):
                custData.append(duration)
                #If user_type is 'Customer' add to 'custData' List
       
    
    return (subData,custData)

import matplotlib.pyplot as plt
%matplotlib inline

val = [] #Empty List init to store data set
val = extractData('./data/NYC-2016-Summary.csv')

#Reassining Data Sets
subscriberTrip = val[0] 
customerTrip = val[1]

rangeVal =[] # Var made to store steps of Graph
count = 0 #Counter

# Loop to generate steps of graph, can be made dynamic if needed
while(True):
    rangeVal.append(count)
    if(count >= 75):
        break;
    count +=5


#Plot Graph 1 -NYC Subscriber Data
plt.hist(subscriberTrip,bins=rangeVal)
plt.title('NYC Subscriber Data')
plt.xlabel('Duration (mins)')
plt.ylabel('Count in Nos.')
plt.show()

#plot Graph 2 - NYC Customer Data
plt.hist(customerTrip,bins=rangeVal)
plt.title('NYC Customer Data')
plt.xlabel('Duration (mins)')
plt.ylabel('Count in Nos.')
plt.show()

Performing Your Own Analysis¶

So far, you've performed an initial exploration into the data available. You have compared the relative volume of trips made between three U.S. cities and the ratio of trips made by Subscribers and Customers. For one of these cities, you have investigated differences between Subscribers and Customers in terms of how long a typical trip lasts. Now it is your turn to continue the exploration in a direction that you choose. Here are a few suggestions for questions to explore:

How does ridership differ by month or season? Which month / season has the highest ridership? Does the ratio of Subscriber trips to Customer trips change depending on the month or season?
Is the pattern of ridership different on the weekends versus weekdays? On what days are Subscribers most likely to use the system? What about Customers? Does the average duration of rides change depending on the day of the week?
During what time of day is the system used the most? Is there a difference in usage patterns for Subscribers and Customers?

If any of the questions you posed in your answer to question 1 align with the bullet points above, this is a good opportunity to investigate one of them. As part of your investigation, you will need to create a visualization. If you want to create something other than a histogram, then you might want to consult the Pyplot documentation. In particular, if you are plotting values across a categorical variable (e.g. city, user type), a bar chart will be useful. The documentation page for .bar() includes links at the bottom of the page with examples for you to build off of for your own use.

Question 6: Continue the investigation by exploring another question that could be answered by the data available. Document the question you want to explore below. Your investigation should involve at least two variables and should compare at least two groups. You should also use at least one visualization as part of your explorations.

Answer: Replace this text with your responses and include a visualization below!

Questions that need to be answered¶

Theirs¶

How does ridership differ by month or season? Which month / season has the highest ridership? Does the ratio of Subscriber trips to Customer trips change depending on the month or season?
Is the pattern of ridership different on the weekends versus weekdays? On what days are Subscribers most likely to use the system? What about Customers? Does the average duration of rides change depending on the day of the week?
During what time of day is the system used the most? Is there a difference in usage patterns for Subscribers and Customers?

Mine¶

Which stations are most popular in terms of usage ?
What age group of people ride bikes ? and how could you target the other age groups
When seeing the data which gender of riders is more ? and accordingly how can this influence buying of new vehicles

Question A:¶

How does ridership differ by month or season? Which month / season has the highest ridership? Does the ratio of Subscriber trips to Customer trips change depending on the month or season?

Answer¶

This is data for all cities, so i will be going over it in Sections

NYC :¶

The Maximum Ride were for the Month of September The Minumum Ride were for the Month of January The Subscriber to Customer Trip Ratio was Highest in December followed by January

Chicago :¶

The Maximum Ride were for the Month of September The Minumum Ride were for the Month of January The Subscriber to Customer Trip Ratio was Highest in January followed by December

Washington¶

The Maximum Ride were for the Month of September The Minumum Ride were for the Month of January The Subscriber to Customer Trip Ratio was Highest in January followed by Febuary

import csv # read and write csv files
from datetime import datetime
from datetime import date
from datetime import time
from pprint import pprint
import calendar
import matplotlib.pyplot as plt

def seasonData(filename):
    """
    This function will help us calculate riding figures for each month of the year,
    it also shall divide data into customer and subscribers and then help us get a ratio of the two.
    Visualization for the same is also provided.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
        #Init 4 values
        #Total Months is used to store count of each trip irrespective of UserType
        Totalmonths = {'1':0,
                  '2':0,
                  '3':0,
                  '4':0,
                  '5':0,
                  '6':0,
                  '7':0,
                  '8':0,
                  '9':0,
                  '10':0,
                  '11':0,
                  '12':0}
        #Custmonths is used to store trip details of Customer Trips monthwise
        Custmonths = {'1':0,
                  '2':0,
                  '3':0,
                  '4':0,
                  '5':0,
                  '6':0,
                  '7':0,
                  '8':0,
                  '9':0,
                  '10':0,
                  '11':0,
                  '12':0}
        #Submonths is used to store trip details of Subscriber Trips monthwise
        Submonths = {'1':0,
                  '2':0,
                  '3':0,
                  '4':0,
                  '5':0,
                  '6':0,
                  '7':0,
                  '8':0,
                  '9':0,
                  '10':0,
                  '11':0,
                  '12':0}
        #Ratio is used to hold the Ratio of Subscriber to Customer Trips
        ratio ={'1':0,
                  '2':0,
                  '3':0,
                  '4':0,
                  '5':0,
                  '6':0,
                  '7':0,
                  '8':0,
                  '9':0,
                  '10':0,
                  '11':0,
                  '12':0}

        for row in reader:
            val = int(row['month'])
            val = str(val)
            
            #If condition used to calculate which type of user type is using the system
            if(row['user_type'] == 'Customer'):
                temp = Custmonths[val]+1 
                Custmonths[val] = temp
            elif(row['user_type'] == 'Subscriber'):
                temp = Submonths[val]+1 
                Submonths[val] = temp
                
            #Adding total users variable after incrementing
            temp = Totalmonths[val]+1 
            Totalmonths[val] = temp
        
        i=1;
        #For loop used to calculate Ratio of Subs/Customers and round it off to two decimal Places
        for val in Submonths:
            i = str(i)
            ratio[i] = round(Submonths[i]/Custmonths[i],2)
            i = int(i)+1
            
            
    totalMax = calendar.month_name[int(max(Totalmonths))] #Max Value for Months
    totalMin = calendar.month_name[int(min(Totalmonths))] #Min Value of Months
    
    
    #Print Statements - Answers & Debugging
    print("For the City of : {}.".format(city))
    print("The Maximum Ride were for the Month of {} \nThe Minumum Ride were for the Month of {}".format(totalMax,totalMin))
    pprint("Total Months are now  {} \n ".format(Totalmonths))
    pprint("Custmonths  are now  {} \n ".format(Custmonths))
    pprint("Submonths  are now  {} \n ".format(Submonths))
    pprint("Ratio of Sub to cust are as follows \n {}".format(ratio))
    print("\n\n")
    
    #Visualization Code Blocks
    %matplotlib inline
    
    #Plotting Graph for Total Trip Data
    plt.bar(range(len(Totalmonths)), list(Totalmonths.values()), align='center')
    plt.xticks(range(len(Totalmonths)), list(Totalmonths.keys()))
    plt.title('Total Travel Figures for {}'.format(city))
    plt.xlabel('Month No')
    plt.ylabel('Count in Nos.')
    plt.show()
    
    #Plotting Graph for Ratio of Subscribers to Customers
    plt.bar(range(len(ratio)), list(ratio.values()), align='center')
    plt.xticks(range(len(ratio)), list(ratio.keys()))
    plt.title('Ratio of Sub / Cust for  {}'.format(city))
    plt.xlabel('Sub')
    plt.ylabel('Cust')
    plt.show()

#Testing Code Block With all Data Sets
data_file = [
    './data/NYC-2016-Summary.csv',
    './data/Chicago-2016-Summary.csv',
    './data/Washington-2016-Summary.csv']

for file in data_file:
    seasonData(file)

For the City of : NYC.
The Maximum Ride were for the Month of September 
The Minumum Ride were for the Month of January
("Total Months are now  {'1': 10180, '2': 11170, '3': 18413, '4': 20160, '5': "
 "24455, '6': 29242, '7': 27522, '8': 31104, '9': 32699, '10': 31519, '11': "
 "24148, '12': 16186} \n"
 ' ')
("Custmonths  are now  {'1': 488, '2': 569, '3': 1878, '4': 2632, '5': 3209, "
 "'6': 3136, '7': 3977, '8': 4412, '9': 4393, '10': 3006, '11': 1799, '12': "
 '686} \n'
 ' ')
("Submonths  are now  {'1': 9692, '2': 10601, '3': 16535, '4': 17528, '5': "
 "21246, '6': 26106, '7': 23545, '8': 26692, '9': 28306, '10': 28139, '11': "
 "22109, '12': 15397} \n"
 ' ')
('Ratio of Sub to cust are as follows \n'
 " {'1': 19.86, '2': 18.63, '3': 8.8, '4': 6.66, '5': 6.62, '6': 8.32, '7': "
 "5.92, '8': 6.05, '9': 6.44, '10': 9.36, '11': 12.29, '12': 22.44}")

For the City of : Chicago.
The Maximum Ride were for the Month of September 
The Minumum Ride were for the Month of January
("Total Months are now  {'1': 1901, '2': 2394, '3': 3719, '4': 4567, '5': "
 "7211, '6': 9794, '7': 10286, '8': 9810, '9': 8700, '10': 7160, '11': 4811, "
 "'12': 1778} \n"
 ' ')
("Custmonths  are now  {'1': 62, '2': 228, '3': 565, '4': 1017, '5': 2012, "
 "'6': 2612, '7': 3323, '8': 2757, '9': 2354, '10': 1492, '11': 667, '12': "
 '60} \n'
 ' ')
("Submonths  are now  {'1': 1839, '2': 2166, '3': 3154, '4': 3550, '5': 5199, "
 "'6': 7182, '7': 6963, '8': 7053, '9': 6346, '10': 5668, '11': 4144, '12': "
 '1718} \n'
 ' ')
('Ratio of Sub to cust are as follows \n'
 " {'1': 29.66, '2': 9.5, '3': 5.58, '4': 3.49, '5': 2.58, '6': 2.75, '7': "
 "2.1, '8': 2.56, '9': 2.7, '10': 3.8, '11': 6.21, '12': 28.63}")

For the City of : Washington.
The Maximum Ride were for the Month of September 
The Minumum Ride were for the Month of January
("Total Months are now  {'1': 2434, '2': 2854, '3': 5571, '4': 5602, '5': "
 "5768, '6': 7320, '7': 7341, '8': 7198, '9': 6878, '10': 6792, '11': 5214, "
 "'12': 3354} \n"
 ' ')
("Custmonths  are now  {'1': 222, '2': 283, '3': 1188, '4': 1192, '5': 1248, "
 "'6': 1707, '7': 2186, '8': 1806, '9': 1674, '10': 1560, '11': 1075, '12': "
 '432} \n'
 ' ')
("Submonths  are now  {'1': 2212, '2': 2571, '3': 4383, '4': 4410, '5': 4520, "
 "'6': 5613, '7': 5155, '8': 5392, '9': 5204, '10': 5232, '11': 4139, '12': "
 '2922} \n'
 ' ')
('Ratio of Sub to cust are as follows \n'
 " {'1': 9.96, '2': 9.08, '3': 3.69, '4': 3.7, '5': 3.62, '6': 3.29, '7': "
 "2.36, '8': 2.99, '9': 3.11, '10': 3.35, '11': 3.85, '12': 6.76}")

Question B¶

Is the pattern of ridership different on the weekends versus weekdays? On what days are Subscribers most likely to use the system? What about Customers? Does the average duration of rides change depending on the day of the week?

Answer B¶

Weekdays indeed see more Riding upto 50% more
The Average duration of rides is much longer on Weekends across all data sets

City	NYC	Chicago	Washington
Maximum Commute	Wednesday	Monday	Wednesday
Max Use for Subscribers	Saturday	Wednesday	Sunday
Max use for Customers	Wednesday	Tuesday	Wednesday

Weekends often have the longest ride times averaging about '17-22.5 mins' across all cities.
Shortest ride times are between Tuesday - Wednesday

import csv # read and write csv files
from datetime import datetime
from datetime import date
from datetime import time
import calendar

def dayData(filename):
    """
    This function calculates usage figures of different days of the week for overall users, subscribers and customers.
    It also maintains a record for weekdays and weekends.
    The Data is further visualized below.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
        
        #Varible Declerations and init
        dayName = calendar.day_name #Storing all 7 Days of the week
        days = {} # Normal Days Counter
        subDays = {} # Subscriber Days Counter
        cusDays = {} #Customer Days counter
        avgSum = {} #Sum of all values of days to calculate avg
        avgCount ={} # Count also used to calculate avg
        avgDuration={} # Variable to hold average of duration
        weekOrEnd = {'Week Day' : 0,
                    'Week End' : 0
                    }
        
        counter = 0
        #Loop used to init all above loops with value 0
        while(True):
            days[dayName[counter]] = 0
            subDays[dayName[counter]] = 0
            cusDays[dayName[counter]] = 0
            avgSum[dayName[counter]] = 0
            avgCount[dayName[counter]] = 0
            avgDuration[dayName[counter]] = 0
            counter+=1
            if(counter>=7):
                break
        
        #Loops for adding sum, count, usertype data and lastly weekday or weekend.
        for row in reader:
            temp = row['day_of_week']
           
            avgSum[temp] = avgSum[temp] + float(row['duration'])
            avgCount[temp]+=1
            
            days[temp]+=1
                
            if(row['user_type'] == 'Customer'):
                subDays[temp]+=1   
            elif(row['user_type'] == 'Subscriber'):
                 cusDays[temp]+=1
                    
            if(temp == 'Saturday') or (temp == 'Sunday'):
                weekOrEnd['Week End']+=1
            else:
                weekOrEnd['Week Day']+=1
                
        #Loop for calulating avergae
        for val in avgDuration:
            avgDuration[val] = float(avgSum[val])/float(avgCount[val])
        
        #Finding max values from Dict of days, subscriberData, CustomerData and avergae    
        maxUse = max(days, key=days.get)
        subMaxUse = max(subDays, key=subDays.get)
        custMaxUse = max(cusDays, key=cusDays.get)
        maxAvg = max(avgDuration, key=avgDuration.get)
       
        #Result Print Statements
        print("For the city of {}".format(city))
        print("The Maximum Usage is on {}  Overall".format(maxUse))
        print("The Maximum Usage for Subscribers is on {} ".format(subMaxUse))
        print("The Maximum Usage for Customers is on  {} ".format(custMaxUse))
        
        #Visualtions Code Blocks
        %matplotlib inline
        
        #Plotting Graph to see diffrent days Usage figures
        plt.bar(range(len(days)), list(days.values()), align='center')
        plt.xticks(range(len(days)), list(days.keys()))
        plt.title('Universal Usage for {} city'.format(city))
        plt.xlabel('Day of Week')
        plt.ylabel('Count in Nos.')
        plt.show()
        
        #Plotting Graph to See which is used more weekend or weekday
        plt.bar(range(len(weekOrEnd)), list(weekOrEnd.values()), align='center')
        plt.xticks(range(len(weekOrEnd)), list(weekOrEnd.keys()))
        plt.title('Weekend Vs. WeekDay Usage for {}'.format(city))
        plt.xlabel('Type of Day')
        plt.ylabel('Count in Nos.')
        plt.show()
        
        #Plotting Graph to see which day of the week avg riding period is more.
        plt.bar(range(len(avgDuration)), list(avgDuration.values()), align='center')
        plt.xticks(range(len(avgDuration)), list(avgDuration.keys()))
        plt.title('Average Daily Traffic'.format(city))
        plt.xlabel('Days')
        plt.ylabel('Count in Nos.')
        plt.show()

data_file = [
    './data/NYC-2016-Summary.csv',
    './data/Chicago-2016-Summary.csv',
    './data/Washington-2016-Summary.csv']

for file in data_file:
    dayData(file)

For the city of NYC
The Maximum Usage is on Wednesday  Overall
The Maximum Usage for Subscribers is on Saturday 
The Maximum Usage for Customers is on  Wednesday

For the city of Chicago
The Maximum Usage is on Monday  Overall
The Maximum Usage for Subscribers is on Sunday 
The Maximum Usage for Customers is on  Tuesday

For the city of Washington
The Maximum Usage is on Wednesday  Overall
The Maximum Usage for Subscribers is on Saturday 
The Maximum Usage for Customers is on  Wednesday

Question C¶

During what time of day is the system used the most? Is there a difference in usage patterns for Subscribers and Customers?

Answer : The System is used the most during the following time periods.

City	NYC	Chicago	Washington
Time of Day, system is used the most (Overall)	17 HRS	17 HRS	17 HRS
Time of Day, system is used the most (Subscriber)	15 HRS	14 HRS	17 HRS
Time of Day, system is used the most (Customers)	17 HRS	17 HRS	17 HRS

It can be observed that 17:00 Hours is the popular time to use the system, which coincides with ending of office hours. There is a definite time difference in usage patterns for subscribers and customers as seen above.

import csv # read and write csv files
from pprint import pprint
import matplotlib.pyplot as plt


def hourData(filename):
    """
    This function calculates usage figures on a hourly basis for rides.
    It then finds the max for three catogories (overall users, subscribers and Customers)
    Visualizations are also provided for clarity.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
        
        #Init Dicts
        timeofDay= {}
        subTime = {}
        custTime = {}
        
        #Initing each with keys from 0 to 24 and value = 0
        counter = 0
        while(True):
            timeofDay[counter] = 0
            subTime[counter] = 0
            custTime[counter] = 0
            counter+=1
            if(counter>24):
                break
        
        for row in reader:
            #Adding to General Dict named timeofDay
            val = row['hour']
            val = int(val)
            timeofDay[val] = int(timeofDay[val])+1
            
            #Conditional staements for checking Customer type and adding data to Dict
            if(row['user_type'] == 'Customer'):
                val = row['hour']
                val = int(val)
                subTime[val] = int(subTime[val])+1
            elif(row['user_type'] == 'Subscriber'):
                val = row['hour']
                val = int(val)
                custTime[val] = int(custTime[val])+1
                
#         pprint(timeofDay)
        maxUse = max(timeofDay, key=timeofDay.get)
        maxSub = max(subTime, key=subTime.get)
        maxCus = max(custTime, key=custTime.get)
        
        print("City is {}".format(city))
        print("The time of Day the System is used most is {} Hrs".format(maxUse))
        print("The time of Day the System is used most by Subscribers is {} Hrs".format(maxSub))
        print("The time of Day the System is used most by Customers is {} Hrs".format(maxCus))
        
        %matplotlib inline
        ##Plotting Graohs
        
        #Plotting for Total Usage in timeofDay (Usage Per hour of Day)
        plt.bar(range(len(timeofDay)), list(timeofDay.values()), align='center')
        plt.xticks(range(len(timeofDay)), list(timeofDay.keys()))
        plt.title('Total Hourly Usage for {}'.format(city))
        plt.xlabel('Month No')
        plt.ylabel('Count in Nos.')
        plt.show()
        
        #Plotting for Subscribers
        plt.bar(range(len(subTime)), list(subTime.values()), align='center')
        plt.xticks(range(len(subTime)), list(subTime.keys()))
        plt.title('{} Subscriber Hourly Usage'.format(city))
        plt.xlabel('Time in 24hrs')
        plt.ylabel('Count in Nos.')
        plt.show()
        
        #plotting for Customers
        plt.bar(range(len(custTime)), list(custTime.values()), align='center')
        plt.xticks(range(len(custTime)), list(custTime.keys()))
        plt.title('{} Customer Hourly Usage'.format(city))
        plt.xlabel('Time in 24hrs')
        plt.ylabel('Count in Nos.')
        plt.show()
        print("--------------------------------------------------------------------------------------------------")
        print("\n\n\n")

data_file = [
    './data/NYC-2016-Summary.csv',
    './data/Chicago-2016-Summary.csv',
    './data/Washington-2016-Summary.csv']

for file in data_file:
    hourData(file)

City is NYC
The time of Day the System is used most is 17 Hrs
The time of Day the System is used most by Subscribers is 15 Hrs
The time of Day the System is used most by Customers is 17 Hrs

--------------------------------------------------------------------------------------------------


City is Chicago
The time of Day the System is used most is 17 Hrs
The time of Day the System is used most by Subscribers is 14 Hrs
The time of Day the System is used most by Customers is 17 Hrs

--------------------------------------------------------------------------------------------------


City is Washington
The time of Day the System is used most is 17 Hrs
The time of Day the System is used most by Subscribers is 14 Hrs
The time of Day the System is used most by Customers is 17 Hrs

--------------------------------------------------------------------------------------------------

Question D¶

Which stations are most popular in terms of usage ?

Answer :¶

The Maximum Traffic for NYC is from Station No : 519

The Maximum Traffic for Chicago is from Station No : 35

The Maximum Traffic for Washington is from Station No : 31623

import csv # read and write csv files
from pprint import pprint
import matplotlib.pyplot as plt
def popularStation(filename):
     with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
        #Init for StationList whihc will hold all the station names
        stationList = {}
        #conditonal statements to check City as each city has a diffrent Column Name.
        ##For the entire if's i have used try and except blocks, what this basically does is that
        ## it will try to get the row with station ID value and increment it by one.
        ## If it is not found an error is generated and it will go to the excpet block where it will be
        ##Initied with the value of 1. This way the dict is dynamically generated on spot.
        if(city == 'NYC'):
            for row in reader:
                stID = int(row['start station id'])
                try:
                    stationList[stID]+=1
                except:
                     stationList[stID] = 1
        elif(city == 'Chicago'):
            for row in reader:
                stID = int(row['from_station_id'])
                try:
                    stationList[stID]+=1
                except:
                     stationList[stID] = 1
                        
        if(city == 'Washington'):
            for row in reader:
                stID = int(row['Start station number'])
                try:
                    stationList[stID]+=1
                except:
                     stationList[stID] = 1
                        
        #Check which station has the most traffic
        stationMax = max(stationList, key=stationList.get)
        print("The Maximum Traffic for {} is from Station No : {}".format(city,stationMax))
        
        #Plot Graph for Stations (X axis will be too Crowded)
        plt.bar(range(len(stationList)), list(stationList.values()), align='center')
        plt.xticks(range(len(stationList)), list(stationList.keys()))
        plt.title('Popular Stations for {}'.format(city))
        plt.xlabel('Station No')
        plt.ylabel('Count in Nos.')
        plt.show()
        print("\n\n\n")

##Please note applicable data sets for this question is : All (Unformatted)

data_file = [
    './data/NYC-CitiBike-2016.csv',
    './data/Chicago-Divvy-2016.csv',
    './data/Washington-CapitalBikeshare-2016.csv']

for file in data_file:
    popularStation(file)

The Maximum Traffic for NYC is from Station No : 519



The Maximum Traffic for Chicago is from Station No : 35



The Maximum Traffic for Washington is from Station No : 31623

Question E¶

What age group of people ride bikes ? and how could you target the other age groups

Answer :¶

Across Both data Sets The maximum Age group of Cylist are between the age of '30-39' years. Other Age groups maybe targetted by

Advertising in schools and colleges with health benefits explained to them
Discount schemes for other age groups eg. Senior citizens
Promo Coupons for completing certain goals. eg Certian trips completed, or minutes done.

import csv # read and write csv files
from pprint import pprint
import matplotlib.pyplot as plt
import calendar
from datetime import datetime
from datetime import date
from datetime import time

def ageGroup(filename):
     with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        city = filename.split('-')[0].split('/')[-1]
        
        #Init today's Year. I could have taken this from the DateTime object but was unsuccessful.
        today = 2017
        #Missing shall record the missing entiries. Where no Age is provided.
        missing = 0
        
        #Initing the age breakup,could have done it by a loop but there were few values so it seemed okay.
        
        ageGrp = {
            '0-9':0,
            '10-19':0,
            '20-29':0,
            '30-39':0,
            '40-49':0,
            '50-59':0,
            '60-69':0,
            '70+':0            
        }
        
        #Dict to Check Mail or Female
        sex = {
            'M':0,
            'F':0
        }
        
        ##Function AgeCat checks the age provided and add's one to the count of the age groups.
        def ageCat(val):
            if(val >= 0 and val <=9):
                ageGrp['0-9']+=1
            elif(val >= 10 and val <=19):
                ageGrp['10-19']+=1
            if(val >= 20 and val <=29):
                ageGrp['20-29']+=1
            elif(val >= 30 and val <=39):
                ageGrp['30-39']+=1
            if(val >= 40 and val <=49):
                ageGrp['40-49']+=1
            elif(val >= 50 and val <=59):
                ageGrp['50-59']+=1
            if(val >= 60 and val <=69):
                ageGrp['60-69']+=1
            elif(val >= 70):
                ageGrp['70+']+=1
        
        #Two Conditional statements to check city as the key changes from dataSet.
        if(city == 'NYC'):
            for row in reader:
                try:
                    birthYear = int(row['birth year'])
                    age = today - birthYear
                    ageCat(age)
                except:
                      missing+=1
                
                
                if(int(row['gender']) == 1):
                    sex['M']+=1
                if(int(row['gender']) == 2):
                    sex['F']+=1
                

        elif(city == 'Chicago'):
            for row in reader:

                gender = row['gender']
                if(gender == 'Male'):
                    sex['M']+=1
                elif(gender == 'Female'):
                    sex['F']+=1
                    
                try:
                    birthYear = int(row['birthyear'])
                    age = today - birthYear
                    ageCat(age)
                except:
                      missing+=1
                
                
                    

        #Find Maximum age from Data Set        
        maxAge = max(ageGrp, key=ageGrp.get)
        
        #Find Percetage of Male to Female.
        percentage = round(float(int(sex['M']*100)/(int(sex['M'])+int(sex['F']))),2)
        
        ##Final Print Statements.
        print("The Max Age group of Cylist are {}".format(maxAge))
        print("Missing Values =  {}".format(missing))
        print("Percentiles of Males to females is {} %".format(percentage))
        
        #Plot Graph for Age Range
        plt.bar(range(len(ageGrp)), list(ageGrp.values()), align='center')
        plt.xticks(range(len(ageGrp)), list(ageGrp.keys()))
        plt.title('Age classification for  {}'.format(city))
        plt.xlabel('Age Groups')
        plt.ylabel('Count in Nos.')
        plt.show()
        
        #Plot Graph for Sex Discrimination
        plt.bar(range(len(sex)), list(sex.values()), align='center')
        plt.xticks(range(len(sex)), list(sex.keys()))
        plt.title('Gender classification for  {}'.format(city))
        plt.xlabel('Gender')
        plt.ylabel('Count in Nos.')
        plt.show()
        print("\n\n\n")

##Please note applicable data sets for this question is : NYC, Chicago
data_file = [
    './data/NYC-CitiBike-2016.csv',
    './data/Chicago-Divvy-2016.csv']

for file in data_file:
    ageGroup(file)

The Max Age group of Cylist are 30-39
Missing Values =  31661
Percentiles of Males to females is 75.54 %



The Max Age group of Cylist are 30-39
Missing Values =  17145
Percentiles of Males to females is 74.93 %

Question F¶

When seeing the data which gender of riders is more ? and accordingly how can this influence buying of new vehicles

Answer :¶

According to the previous graphs males are significantly higher bike riders than females. Approx 75% More riders (observing graph)

Sales Purchase Advice:

If female demograph needs to be targetted; purchashing color options like pink or purple will seem more attractive to them.
Light weight cycles, with a lower height than male cycle model's would be preffered.
A universal color like black would be often suited to both male and female demograph.

A Small Note¶

Hello to the reviewer of my US bike share project. This is the first time I have ever worked with jupyter notebooks at this scale; and have been honestly confused from the start. I am a web developer and am very used to having debugging break points and IDE or text editor support at best. There are somethings lacking in this project which I wish I could improve, but as of writing this my course deadline has passed and time is limited.

Here are things I feel I could work on.¶

Spell Check - For some reason my browsers extentions really don't work here, I am dyslexic to some extent so some spelling errors maybe there in the comments.
Answers could be more descriptive; yes I am aware that I have written really short answers and I feel I could improve on that
Code reusability - I am very confused on how variable & function scoping works in jupyter, and hence I have assumed each code block to be like an independent file. I have put return statements everywhere except where I have data visualization. If given a chance I would like to submit later just code typed in a (.py) file, which would help me to make everything reusable and just one call method.
Finally I am also not familiar with markdown and for some reason linebreaks don't seem to work always, it's either two line breaks or one. So formatting might be a little off.

Sorry for these shortcommings in this project, and I actually had fun processing it. Saw my i7(6th Gen) spin up quite a bit while doing the calculations, something I have never seen in my time as a web developer.

Thank you !

Conclusions¶

Congratulations on completing the project! This is only a sampling of the data analysis process: from generating questions, wrangling the data, and to exploring the data. Normally, at this point in the data analysis process, you might want to draw conclusions about the data by performing a statistical test or fitting the data to a model for making predictions. There are also a lot of potential analyses that could be performed on the data which are not possible with only the data provided. For example, detailed location data has not been investigated. Where are the most commonly used docks? What are the most common routes? As another example, weather has potential to have a large impact on daily ridership. How much is ridership impacted when there is rain or snow? Are subscribers or customers affected more by changes in weather?

Question 7: Putting the bike share data aside, think of a topic or field of interest where you would like to be able to apply the techniques of data science. What would you like to be able to learn from your chosen subject?

Answer: Data Science is one topic where I see limitless potential, one of the main reasons I chose data science is that firstly every problem is unique, and comes with it's own challenges, unlike web development which ends up after a while being all the same.

Data Analysis can be done on anything and everything, it would intrest me to solve some problems which I see everyday and maybe optimise some things around me. Some things I would apply this too.

Figuring out what is the optimum time to keep street lights on after installing motion trackers in few places ( to save electricity)
For a company finding out what words and string of words help in sealing inword inquires, also which sales memeber is better for what kind of task
In some parts of my country (not a lot where I live); power cuts are a frequent problem, help the electricity board optimise thier layout better so cuts happen when usage is least (like 4am) and in a more equal pattern.
In a a business sense I could help potential shop's know what products shall sell at what time of the month/year, and with enough data be able to exactly predict stock shortage, and using price data even know what is the optimum price say a tennis ball will sell, which will help them when purchasing new stock.
I could also apply this to the field of medicine, with data from local blood testing clinics to see which area's of my city are healthier and what can be done to help others lead a healthier life.

The possibilities are limitless, and each question has a unique problem under it.

Tip: If we want to share the results of our analysis with others, we aren't limited to giving them a copy of the jupyter Notebook (.ipynb) file. We can also export the Notebook output in a form that can be opened even for those without Python installed. From the File menu in the upper left, go to the Download as submenu. You can then choose a different format that can be viewed more generally, such as HTML (.html) or PDF (.pdf). You may need additional packages or software to perform these exports.

If you are working on this project via the Project Notebook page in the classroom, you can also submit this project directly from the workspace. Before you do that, you should save an HTML copy of the completed project to the workspace by running the code cell below. If it worked correctly, the output code should be a 0, and if you click on the jupyter icon in the upper left, you should see your .html document in the workspace directory. Alternatively, you can download the .html copy of your report following the steps in the previous paragraph, then upload the report to the directory (by clicking the jupyter icon).

Either way, once you've gotten the .html report in your workspace, you can complete your submission by clicking on the "Submit Project" button to the lower-right hand side of the workspace.

from subprocess import call
call(['python', '-m', 'nbconvert', 'Bike_Share_Analysis.ipynb'])

0

2016 US Bike Share Activity Snapshot¶

Table of Contents¶

Introduction¶

Posing Questions¶

Data Collection and Wrangling¶

Condensing the Trip Data¶

Exploratory Data Analysis¶

Statistics¶

Visualizations¶

Performing Your Own Analysis¶

Questions that need to be answered¶

Theirs¶

Mine¶

Question A:¶

Answer¶

NYC :¶

Chicago :¶

Washington¶

Question B¶

Answer B¶

Question C¶

Question D¶

Answer :¶

Question E¶

Answer :¶

Question F¶

Answer :¶

A Small Note¶

Here are things I feel I could work on.¶

Conclusions¶