Data Wrangling (Gathering)

This phase of the project contains the following tasks, which need to be done programmatically.

In [1]:
#Importing basic packages needed to get Data 
import pandas as pd
import requests
import os
import tweepy
import json

1. Download the Data Manually and Read It In to Check

In [2]:
archive = pd.read_csv('data/twitter-archive-enhanced.csv')
archive.head()
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="" r... This is Franklin. He would like you to stop ca... NaN NaN NaN 12 10 Franklin None None None None

2. Programmatically download data from a URL

In [3]:
# Here we have the URL provided by UDACITY
url = ""
folderName = 'data'
fName = url.split('/')[-1]
#Creating Folder Named Data 
if not os.path.exists(folderName):
    os.makedirs(folderName)
In [ ]:
#fetching Data and saving to disk.
r = requests.get(url)
In [ ]:
#Writing Data to file 
with open(os.path.join(folderName,fName),mode = 'wb') as file:
    file.write(r.content)
In [ ]:
#Reading in Downloaded Data to check if working.
img = pd.read_csv('data/image-predictions.tsv', sep='\t')
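
The download steps above can be sketched as one small helper. This is only a sketch under assumptions: `save_download` is a hypothetical name, and it takes the file name from the path part of the URL so that a query string does not end up in the name. The network call is left as a comment, so the block runs without internet access.

```python
import os
from urllib.parse import urlparse


def save_download(content: bytes, url: str, folder: str = "data") -> str:
    """Write downloaded bytes to <folder>/<last path segment of url>."""
    # Use only the path component so "?x=1" style suffixes are ignored.
    file_name = os.path.basename(urlparse(url).path)
    if not file_name:
        raise ValueError("URL has no file name component: %r" % url)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, file_name)
    with open(path, "wb") as fh:
        fh.write(content)
    return path


# Intended use with requests (not executed here):
#   r = requests.get(url)
#   r.raise_for_status()   # stop early on HTTP errors such as 404
#   save_download(r.content, url)
```

Calling `raise_for_status()` before writing is one way to avoid saving an HTML error page under the `.tsv` name.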

3. Downloading Twitter API Data For the Required Values

In [4]:
#Extracting tweet IDs from the archive DataFrame.
tweet_id = archive['tweet_id']
In [5]:
#Twitter Auth Data (Remove before submission)

consumer_key = 'VjFpwyCsbShxMv2ECEDWu71Uo'
consumer_secret = 'tLKupsqpJlJbGAE595oLptb4zVgyTVe5cGRaRQHOfnDt06w29e'
access_token = '2981974992-nCKD9ib35SsdrNN0HuMHKUNqpBCPvzWYZYtd0PR'
access_token_secret = 'msZMlp6w3mAjAxmiiqhIwgwntJPlyXMHHgX2wc5xgKMOg'

# consumer_key = ''
# consumer_secret = ''
# access_token = ''
# access_token_secret = ''
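
One way to avoid having to remove the keys before submission is to read them from environment variables. The sketch below is an assumption, not part of the original notebook: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical.

```python
import os

# Hypothetical environment variable names; any naming scheme would work.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Return the four Twitter credentials from the environment as a dict."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}


# Intended use in the notebook (not executed here):
#   creds = load_credentials()
#   consumer_key = creds["consumer_key"]
```

With this approach the notebook itself never contains the secrets, so nothing has to be blanked out by hand.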
In [7]:
#Tweepy Auth

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
In [8]:
#List created to capture the index IDs of errors raised by the Twitter API
api_error = []

Writing a loop to download additional data from the Twitter API and save it as text for use in the later steps.

Folder Name Being saved to : tweets/

File name: tweet_[ID OF TWEET].txt

In [6]:
#Using a try/except block here to access the Twitter API; the errors are logged in the api_error variable if needed later.
counter = 0
for t in tweet_id:
    try:
        counter = counter+1
        fileName = 'data/tweet_json.txt'
        tweet = api.get_status(id=t, tweet_mode='extended')
        with open(fileName, 'a') as outfile:  
            #Assumed body: write one JSON object per line
            json.dump(tweet._json, outfile)
            outfile.write('\n')
    except Exception:
        api_error.append(counter)
        print(str(counter)+" ERROR ERROR ERROR")
In [ ]:
#Indices of the tweets that raised errors; they have not been added to the DataFrame.
errors = [2056,1993,1945,1865,1836,1616,1310]
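
The download loop and the later loading step can be sketched together as below. This is a sketch under assumptions rather than the notebook's exact code: `collect_tweets` and `load_tweets` are hypothetical helpers, the file is assumed to hold one JSON object per line, and `fetch` stands in for a call such as `api.get_status(id=t, tweet_mode='extended')._json` so the block runs without Twitter access.

```python
import json


def collect_tweets(tweet_ids, fetch, out_path):
    """Append one JSON object per line to out_path; return the IDs that failed."""
    failed = []
    with open(out_path, "a", encoding="utf-8") as outfile:
        for tweet_id in tweet_ids:
            try:
                tweet_json = fetch(tweet_id)
            except Exception as exc:  # e.g. deleted or protected tweets
                failed.append(tweet_id)
                print(f"{tweet_id}: {exc}")
                continue
            json.dump(tweet_json, outfile)
            outfile.write("\n")
    return failed


def load_tweets(path):
    """Read the line-delimited JSON file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Returning the failed IDs from the loop means the error list does not have to be copied by hand from the printed output.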
In [9]:
#Loading in JSON File 
with open('data/tweet_json.txt') as f:
    #Assuming the file holds one JSON object per line
    data = [json.loads(line) for line in f]
In [29]:
#Using this Dumped Data (first Instance Only) to parse the JSON using the following tool : \
#and understand the structure of the JSON. 
In [65]:
#TEST CODE - Used to only check first Data

#Checking Queries on first Data Value printing them to the console.
Tid = data[0]['id_str']
full_text = data[0]['full_text']
retweet_count = data[0]['retweet_count']
fav_count = data[0]['favorite_count']
url = data[0]["extended_entities"]["media"][0]["url"]
index = data[0]['full_text'].index('/')
numerator = int(data[0]['full_text'][index-2:index])
denominator = int(data[0]['full_text'][index+1:index+3])
name = (data[0]['full_text'].split('.')[0].split(" ")[-1])
if 'doggo' in full_text:
    dog = 'doggo'
elif 'pupper' in full_text:
    dog = 'pupper'
elif 'puppo' in full_text:
    dog = 'puppo'
elif 'floofer' in full_text:
    dog = 'floofer'
else:
    dog = None
print(full_text)
This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10
In [11]:
#Another Check for parsing the JSON (it has quite a complicated Schema.)

Here we extract the data that will be used to solve various data quality issues.
These are the dog name, numerator, denominator and the type of dog (doggo, floofer, etc.).

Creating the Final Dataframe : api

In [82]:
#Extracting required fields from JSON and making a new DataFrame api
df_list = []
for val in data:
    Tid = val['id_str']
    full_text = val['full_text']
    retweet_count = val['retweet_count']
    fav_count = val['favorite_count']
    index = val['full_text'].index('/')
    numerator = full_text[index-2:index]
    denominator = full_text[index+1:index+3]
    name = full_text.split('.')[0].split(" ")[-1]
    if 'doggo' in full_text:
        dog = 'doggo'
    elif 'pupper' in full_text:
        dog = 'pupper'
    elif 'puppo' in full_text:
        dog = 'puppo'
    elif 'floofer' in full_text:
        dog = 'floofer'
    else:
        dog = None
    df_list.append({'tweet_id': int(Tid),
                    'full_text': full_text,
                    'retweet_count': int(retweet_count),
                    'fav_count' : int(fav_count),
                    'numerator' : numerator, #[Q#1] 
                    'denominator': denominator, #[Q#3]
                    'pet_name' : name, #[Q#2]
                    'dog' : dog #[T#1]
                    })

api = pd.DataFrame(data=df_list)
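
The fixed-width slices around the first `/` keep only two characters on each side, so a numerator with three or more digits would be cut short. As an alternative to the slicing used above (not the notebook's method), a regular expression can capture the whole number. `extract_rating` is a hypothetical helper, and taking the first match is an assumption about where the rating appears in the text.

```python
import re

# One or more digits, a slash, one or more digits, e.g. "13/10".
RATING_RE = re.compile(r"\b(\d+)/(\d+)\b")


def extract_rating(text):
    """Return (numerator, denominator) from the first rating-like match, else None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = ("This is Phineas. He's a mystical boy. "
          "Only ever appears in the hole of a donut. 13/10")
print(extract_rating(sample))  # (13, 10)
```

Returning `None` when nothing matches also avoids the `ValueError` that `str.index('/')` raises for a tweet without a slash.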
In [70]:
#Checking our newly created DataFrame.
In [ ]:
#Rearranging the DF columns to make more sense when read in.
api = api[['tweet_id', 'full_text', 'fav_count','retweet_count', 'pet_name', 'dog', 'numerator', 'denominator']]
In [ ]:
#Saving Twitter Data extracted from API as CSV

The test code below shows how I reached the solutions above.

In [55]:
#Sample Test Block for the loop used above.
df_api = []
for Tid, val in zip(api['tweet_id'], api['full_text']):
    #Code for finding numerator and denominator
    index = val.index('/')
    rating_numerator = val[index-2:index]
    rating_denominator = val[index+1:index+3]
    name = (val.split('.')[0].split(" ")[-1])
    if 'doggo' in val:
        dog = 'doggo'
    elif 'pupper' in val:
        dog = 'pupper'
    elif 'puppo' in val:
        dog = 'puppo'
    elif 'floofer' in val:
        dog = 'floofer'
    else:
        dog = None

    df_api.append({
        'tweet_id' : Tid,
        'name' : name,
        'rating_numerator' : rating_numerator,
        'rating_denominator' : rating_denominator,
        'dog' : dog
    })
df_api_pd = pd.DataFrame(data=df_api)
In [52]:
#Sample code used for Calculating Numerator, Denominator and Pet Name
index = val.index('/')
print(val.split('.')[0].split(" ")[-1])
In [53]:
#Checking the value of variable val which contains full_text from the twitter api
val
'Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet'
In [54]:
#Reinitialising val with a different data value (without a dog name)
val = api["full_text"][12]
val
"Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10"
In [ ]:
if 'doggo' in val:
    dog = 'doggo'
elif 'pupper' in val:
    dog = 'pupper'
elif 'puppo' in val:
    dog = 'puppo'
elif 'floof' in val:  # also matches 'floofer'
    dog = 'floofer'
else:
    dog = "None"
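
The name and dog-type checks explored above can be gathered into two small functions. This is a sketch under assumptions: `extract_name` and `extract_stage` are hypothetical helpers, and the name rule assumes that a named dog is introduced as "This is <Name>." at the start of the tweet, which the two sample texts above suggest but do not prove for the whole data set.

```python
import re

NAME_RE = re.compile(r"This is ([A-Z][\w'-]*)[.,]")
STAGE_RE = re.compile(r"\b(doggo|pupper|puppo|floofer)s?\b")


def extract_name(text):
    """Return the name from a leading 'This is <Name>.' pattern, else None."""
    match = NAME_RE.match(text)
    return match.group(1) if match else None


def extract_stage(text):
    """Return the first dog stage mentioned in the text, else None."""
    lowered = text.lower()
    match = STAGE_RE.search(lowered)
    if match:
        return match.group(1)
    # Treat a bare "floof" as the floofer stage.
    return "floofer" if "floof" in lowered else None


no_name = ("Here's a puppo that seems to be on the fence about something "
           "haha no but seriously someone help her. 13/10")
print(extract_name(no_name), extract_stage(no_name))  # None puppo
```

Unlike taking the last word of the first sentence, this returns `None` for tweets that do not name the dog, such as the "Japanese Irish Setter" example.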