Data Wrangling (Cleaning)

In [1]:
#Importing Basic Packages
import pandas as pd
import numpy as np
import math
In [2]:
#Assigning agreed upon variable names (Original Data)
img = pd.read_csv('data/image-predictions.tsv', sep='\t')
api = pd.read_csv('data/twitter_archive_api.csv')
archive = pd.read_csv('data/twitter-archive-enhanced.csv')
In [3]:
#Creating Backups and Working on the *_clean Data.

img_clean = img.copy()
api_clean = api.copy()
archive_clean = archive.copy()

Quality Issues

  1. archive: The rating numerator needs to be recalculated; it can be taken from the api table (which itself needs cleaning)
  2. archive: Dog names are incorrect and need to be re-extracted
  3. archive: Dog ratings are incorrect and need to be re-extracted
  4. archive: Remove retweeted data
  5. img: The values in columns p1, p2 and p3 use underscores to separate words; these should be replaced with spaces
  6. img: If p1_dog, p2_dog and p3_dog are all False, the tweet is invalid (it cannot be processed by the neural network) and must be removed
  7. api: Remove the stray count column ('Unnamed: 0')

Tidiness Issues

  1. archive and api: Combine doggo, pupper etc. into one column
  2. Reduce to two tables by merging api into the archive dataset

Solving Quality Issues

#1. archive The numerator column needs to be recalculated as mentioned.

Define

  • This issue is recorded here, but it was easier to fix in the gathering stage, while the api data was being built.
  • I will leave a note there marking the quality issue as solved. [Q#1]
  • The extra columns will be dropped later, in Other Misc Operations.

Code

Test

#2. archive Dog names are incorrect, need to re-extract

Define

  • Solved in the Data Gathering Stage and columns will be Dropped Later [Q#2]
  • The Extra Columns will be Dropped Later in the Other Misc Operations.

Code

Test

#3. archive Dog Ratings are incorrect, need to re-extract

Define

  • Solved in the Data Gathering Stage and columns will be Dropped Later [Q#3]
  • The Extra Columns will be Dropped Later in the Other Misc Operations.

Code

Test

#4. archive Remove retweeted Data

Define

Looking at the column names, 'retweeted_status_id' stands out as de facto proof that a tweet is a retweet. We therefore keep only the rows where this value is NaN.

  • Remove any rows containing a value other than NaN in the retweeted_status_id column

Code

In [4]:
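#Filtering the working copy; .copy() gives an independent DataFrame rather than a view
archive_clean = archive_clean[archive_clean.retweeted_status_id.isnull()].copy()
archive_clean.head()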
Out[4]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None

Test

  • The result should be empty, which means all the retweeted data has been removed.
In [5]:
archive_clean[archive_clean.retweeted_status_id.notnull()]
Out[5]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
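
The visual check above can also be written as an assertion. The following is a minimal sketch on a made-up table (toy_archive and its values are hypothetical stand-ins for archive, not real rows):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the archive table; the values are made up for illustration.
toy_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [np.nan, 8.9e17, np.nan, 7.7e17],
})

# Keep only original tweets: rows whose retweeted_status_id is missing.
toy_clean = toy_archive[toy_archive["retweeted_status_id"].isnull()].copy()

# Programmatic version of the "should be empty" check.
remaining_retweets = int(toy_clean["retweeted_status_id"].notnull().sum())
assert remaining_retweets == 0
```

An assertion fails loudly if a later edit to the filter lets retweets back in, whereas an empty table is easy to overlook.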

#5. img The columns p1,p2,p3 have underscores separating their names

Define

  • Replace underscores with spaces to increase readability
In [6]:
img.head(1)
Out[6]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True

Code

In [7]:
#Replacing underscores with spaces using str.replace()
img_clean.p1 = img_clean.p1.str.replace('_',' ')
img_clean.p2 = img_clean.p2.str.replace('_',' ')
img_clean.p3 = img_clean.p3.str.replace('_',' ')

Test

In [8]:
#Checking if the above solution worked.
img_clean.head()
Out[8]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh springer spaniel 0.465074 True collie 0.156665 True Shetland sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature pinscher 0.074192 True Rhodesian ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian ridgeback 0.408143 True redbone 0.360687 True miniature pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

The p1, p2 & p3 columns now use spaces instead of underscores. - Success
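
As a sketch of how the same replacement and check could be done without repeating a line per column, here is a loop over the three columns on a small made-up frame (toy_img is hypothetical; the breed names are copied from the output above):

```python
import pandas as pd

# Toy stand-in for img_clean with only the three prediction-name columns.
toy_img = pd.DataFrame({
    "p1": ["Welsh_springer_spaniel", "redbone"],
    "p2": ["collie", "miniature_pinscher"],
    "p3": ["Shetland_sheepdog", "Rhodesian_ridgeback"],
})
name_cols = ["p1", "p2", "p3"]

# regex=False treats "_" as a literal character rather than a pattern.
for col in name_cols:
    toy_img[col] = toy_img[col].str.replace("_", " ", regex=False)

# Count any underscores that survived, across all three columns.
underscores_left = int(
    sum(toy_img[col].str.contains("_", regex=False).sum() for col in name_cols)
)
assert underscores_left == 0
```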

#6. img if p1_dog, p2_dog and p3_dog are all False.

Define

  • The tweet is invalid if p1_dog, p2_dog & p3_dog are all False: it cannot be processed by the neural network and hence must be removed.

Code

In [9]:
#looking at the original DF
img.head()
Out[9]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [10]:
#Counting the rows in the original DF, before filtering
img.count()
Out[10]:
tweet_id    2075
jpg_url     2075
img_num     2075
p1          2075
p1_conf     2075
p1_dog      2075
p2          2075
p2_conf     2075
p2_dog      2075
p3          2075
p3_conf     2075
p3_dog      2075
dtype: int64
In [11]:
#Dropping rows where all three predictions are False (keeping rows with at least one True)
img_clean = img_clean[~((img_clean.p1_dog == False) & (img_clean.p2_dog == False) & (img_clean.p3_dog == False))]
img_clean.count()
Out[11]:
tweet_id    1751
jpg_url     1751
img_num     1751
p1          1751
p1_conf     1751
p1_dog      1751
p2          1751
p2_conf     1751
p2_dog      1751
p3          1751
p3_conf     1751
p3_dog      1751
dtype: int64

Test

In [12]:
img_clean[(img_clean.p1_dog == False) & (img_clean.p2_dog == False) & (img_clean.p3_dog == False)]
Out[12]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
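
The filter used here, "not (all three False)", is equivalent to "at least one True". The sketch below shows both masks side by side on a made-up frame (toy_img and its rows are hypothetical), assuming the three columns hold plain booleans as in the output above:

```python
import pandas as pd

# Toy stand-in for img_clean; row 2 has no dog prediction at all.
toy_img = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1_dog": [True, False, False],
    "p2_dog": [True, False, True],
    "p3_dog": [True, False, False],
})

# Mask for the rows to drop: every prediction is a non-dog.
all_false = (~toy_img.p1_dog) & (~toy_img.p2_dog) & (~toy_img.p3_dog)

# Mask for the rows to keep: at least one prediction is a dog.
any_dog = toy_img.p1_dog | toy_img.p2_dog | toy_img.p3_dog

# By De Morgan's law the two masks are exact complements.
assert (any_dog == ~all_false).all()

kept = toy_img[any_dog]
```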

#7. api remove stray count column ['Unnamed: 0']

Define

  • When the api CSV was imported it contained a stray column, 'Unnamed: 0'. We must drop it, as it will cause issues later while merging.

In [13]:
api.head(0)
Out[13]:
Unnamed: 0 tweet_id full_text fav_count retweet_count pet_name dog numerator denominator

Code

In [14]:
#Dropping Stray Column
api_clean.drop(columns=['Unnamed: 0'], inplace=True)

Test

In [15]:
api_clean.head()
Out[15]:
tweet_id full_text fav_count retweet_count pet_name dog numerator denominator
0 892420643555336193 This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas NaN 13 10
1 892177421306343426 This is Tilly. She's just checking pup on you.... 33105 6282 Tilly NaN 13 10
2 891815181378084864 This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie NaN 12 10
3 891689557279858688 This is Darla. She commenced a snooze mid meal... 42022 8670 Darla NaN 13 10
4 891327558926688256 This is Franklin. He would like you to stop ca... 40172 9422 Franklin NaN 12 10
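
A column named 'Unnamed: 0' typically appears when a DataFrame is written with its index and then read back. Whether that is what happened in the gathering stage is an assumption, since that code is not shown here. A minimal sketch with an in-memory buffer and a made-up frame (toy_api is hypothetical):

```python
import io

import pandas as pd

toy_api = pd.DataFrame({"tweet_id": [11, 22], "fav_count": [5, 7]})

# Default to_csv writes the row index as a first column with an empty header.
with_index = io.StringIO()
toy_api.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False leaves the index out, so no stray column comes back.
without_index = io.StringIO()
toy_api.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```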

Solving Tidiness Issues

#2. Merge into two Tables (get rid of api merge into archive dataset)

Define

As per the final project specification, we need to merge these two tables to eliminate redundant data.
These tables are:

  • archive
  • api

Code

In [16]:
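#Merge `api` into `archive`, leaving two tables overall (`final` and `img_clean`)
#how='right' keeps every row of api_clean, including tweets that were
#filtered out of archive_clean; those rows get NaN in the archive columns
#Saving new DF as var name 'final'

final = pd.merge(archive_clean,api_clean,how='right',on='tweet_id')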

Test

In [17]:
print(final.columns)
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'full_text', 'fav_count', 'retweet_count', 'pet_name', 'dog',
       'numerator', 'denominator'],
      dtype='object')
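
The choice of how= decides which tweets survive the merge. The sketch below compares a right join with an inner join on two made-up tables (the toy_ frames and their values are hypothetical); indicator=True is a standard pd.merge option that adds a _merge column showing where each row came from:

```python
import pandas as pd

# Toy stand-ins: tweet 3 exists only on the api side.
toy_archive_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01", "2017-07-31"],
})
toy_api_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "fav_count": [10, 20, 0],
})

# Right join: every api row is kept, matched or not.
right = pd.merge(toy_archive_clean, toy_api_clean,
                 how="right", on="tweet_id", indicator=True)

# Inner join: only tweets present in both tables are kept.
inner = pd.merge(toy_archive_clean, toy_api_clean,
                 how="inner", on="tweet_id")

# Rows that came from the api side only have NaN in the archive columns.
api_only = right[right["_merge"] == "right_only"]
print(api_only)
```

With a right join, any tweet removed from the left table beforehand comes back with empty left-hand columns, so the row count of the result is worth checking against both inputs.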

#1. archive and api : Combine doggo, pupper etc into one column

Define

  • We extracted this column directly in the gathering stage, while working with the Twitter JSON
  • The value is named dog in the api table, so we can replace the columns floofer, pupper, puppo and doggo with the single column dog
  • The remaining columns may be dropped.
  • See [T#1] for the operations done during gathering
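
The gathering code that built the dog column is not shown in this notebook. As a rough sketch of an alternative that works from the four archive columns directly, the stages could be combined as below. It assumes those columns hold the literal string 'None' when no stage applies, as the head() output suggests; toy and combine_stages are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the four stage columns of the archive table.
toy = pd.DataFrame({
    "doggo":   ["None", "doggo", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "None", "pupper", "pupper"],
    "puppo":   ["None", "None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [value for value in row if value != "None"]
    return ", ".join(found) if found else None


# One tidy column instead of four; rows with two stages keep both.
toy["dog"] = toy[stage_cols].apply(combine_stages, axis=1)
print(toy["dog"])
```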

Code

In [18]:
#Solving Tidiness Issue #1
#1. `archive` and `api` : Combine doggo, pupper etc into one column
final.drop(columns=['floofer','pupper','puppo','doggo'], inplace=True)

Test

In [19]:
final.head()
Out[19]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name full_text fav_count retweet_count pet_name dog numerator denominator
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13.0 10.0 Phineas This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas NaN 13 10
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13.0 10.0 Tilly This is Tilly. She's just checking pup on you.... 33105 6282 Tilly NaN 13 10
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12.0 10.0 Archie This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie NaN 12 10
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13.0 10.0 Darla This is Darla. She commenced a snooze mid meal... 42022 8670 Darla NaN 13 10
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12.0 10.0 Franklin This is Franklin. He would like you to stop ca... 40172 9422 Franklin NaN 12 10

Other Misc Operations

Define

  • Dropping the extra name column, which was re-extracted earlier [Q#2]
  • Dropping rating_numerator and rating_denominator [Q#1] [Q#3]
  • Dropping the repeated column text

Code

In [20]:
final.drop(columns=['name','rating_numerator','rating_denominator','text'], inplace=True)
#[Q#2] Solved.
#[Q#1], [Q#3] Solved.

We need to clean the numerator column

Define

The extraction code we wrote produced some erroneous values, so we shall clean these manually.

In [21]:
#Checking which values should not be there
final.numerator.value_counts()
Out[21]:
12       553
11       463
10       461
13       346
 9       152
 8       100
 7        54
14        52
 5        33
 6        32
 3        18
 4        17
 1         9
 2         8
.9         3
20         3
.5         2
60         2
15         2
44         2
\r\n9      2
75         2
 0         2
(8         1
24         1
50         1
-5         1
84         1
43         1
76         1
04         1
\r\n5      1
21         1
ry         1
st         1
65         1
.8         1
27         1
 w         1
17         1
07         1
82         1
99         1
;2         1
45         1
88         1
80         1
26         1
66         1
Name: numerator, dtype: int64
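
Stray values such as '.9', '(8' or 'ry' could be avoided at extraction time with a regular expression that only accepts digits around the slash. This is a sketch only, not the gathering code used for this project; toy_text holds made-up tweet texts and rating_pattern is a hypothetical name.

```python
import pandas as pd

# Made-up tweet texts for illustration.
toy_text = pd.Series([
    "This is a mystical boy. 13/10",
    "Her smile made you smile. 13.5/10 for her",
    "Needs a haircut (8/10)",
    "No rating in this one",
])

# Digits, an optional decimal part, a slash, then digits.
rating_pattern = r"(\d+(?:\.\d+)?)/(\d+)"

# str.extract returns one column per capture group, using the first match.
ratings = toy_text.str.extract(rating_pattern)
ratings.columns = ["numerator", "denominator"]
ratings = ratings.astype(float)
print(ratings)
```

Because only the first match is taken, a tweet containing another fraction before the rating would still need a manual look.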

Code

In [22]:
#Making a list of which rows have errors - using value_counts() for reference.
numerator_error = []
numerator_error.append(final[final.numerator == '.9']['tweet_id'])
numerator_error.append(final[final.numerator == '.5']['tweet_id'])
numerator_error.append(final[final.numerator == '\r\n9']['tweet_id'])
numerator_error.append(final[final.numerator == 'ry']['tweet_id'])
numerator_error.append(final[final.numerator == '.8']['tweet_id'])
numerator_error.append(final[final.numerator == '.9']['tweet_id']) #repeat of the first '.9' check, so these rows are listed twice
numerator_error.append(final[final.numerator == ';2']['tweet_id'])
numerator_error.append(final[final.numerator == '(8']['tweet_id'])
numerator_error.append(final[final.numerator == 'st']['tweet_id'])
numerator_error.append(final[final.numerator == '\r\n5']['tweet_id'])
numerator_error.append(final[final.numerator == '-5']['tweet_id'])
numerator_error.append(final[final.numerator == ' w']['tweet_id'])
numerator_error
Out[22]:
[847     746369468511756288
 1192    702217446468493312
 1430    685532292383666176
 Name: tweet_id, dtype: int64, 42      883482846933004288
 1509    681340665377193984
 Name: tweet_id, dtype: int64, 1492    682389078323662849
 2082    667538891197542400
 Name: tweet_id, dtype: int64, 2329    760153949710192640
 Name: tweet_id, dtype: int64, 1255    697259378236399616
 Name: tweet_id, dtype: int64, 847     746369468511756288
 1192    702217446468493312
 1430    685532292383666176
 Name: tweet_id, dtype: int64, 2066    667878741721415682
 Name: tweet_id, dtype: int64, 1473    683462770029932544
 Name: tweet_id, dtype: int64, 1635    676613908052996102
 Name: tweet_id, dtype: int64, 1912    670782429121134593
 Name: tweet_id, dtype: int64, 2343    667550882905632768
 Name: tweet_id, dtype: int64, 1069    711306686208872448
 Name: tweet_id, dtype: int64]
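
Instead of one comparison per stray value, the suspect rows could be flagged in a single pass. The sketch below assumes the valid entries are digit strings with at most one leading space, which is how they appear in the value_counts() output; toy_numerator is a made-up sample and the pattern is an assumption.

```python
import pandas as pd

# Made-up sample mixing valid entries with stray ones seen above.
toy_numerator = pd.Series(["12", " 9", ".9", "\r\n9", "ry", "(8", "-5", "13"])

# Valid: an optional single leading space followed by digits only.
is_plain_integer = toy_numerator.str.fullmatch(r" ?\d+")

# Everything else needs a manual look.
suspect = toy_numerator[~is_plain_integer]
print(suspect.tolist())
```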

Here we start to manually clean the data.

Steps

  • First we check which rows hold the erroneous data and look at each row's full_text
  • Using our judgement, we manually change the values
  • Although this is time consuming, it is doable since there are only a few errors.
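
The manual fixes can also be collected in one place. This sketch uses a dictionary of corrections applied with .loc; toy_final and corrections are hypothetical names, row label 900 is made up, and the other labels and values mirror three of the fixes made in this notebook.

```python
import pandas as pd

# Toy stand-in for final, indexed by the same row labels used in this notebook.
toy_final = pd.DataFrame(
    {"numerator": [".9", "12", ";2", "(8"]},
    index=[847, 900, 2066, 1473],
)

# Row label -> value decided after reading the tweet text.
# Stored as strings so the column keeps a single type.
corrections = {847: "9", 2066: "2", 1473: "8"}

for row_label, value in corrections.items():
    toy_final.loc[row_label, "numerator"] = value

print(toy_final)
```

Keeping the decisions in one dictionary makes them easy to review and to rerun on a fresh copy of the data.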
In [23]:
final[final.numerator == '.9'].full_text
final.loc[[847,1192,1430],'numerator'] = 9
In [24]:
final[final.numerator == '.5']
Out[24]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
42 883482846933004288 NaN NaN 2017-07-08 00:28:19 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/883482846... This is Bella. She hopes her smile made you sm... 45778 9964 Bella NaN .5 10
1509 681340665377193984 6.813394e+17 4.196984e+09 2015-12-28 05:07:27 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN NaN I've been told there's a slight possibility he... 1748 303 mirror NaN .5 10
In [25]:
final.loc[42].full_text
final.loc[[42],'numerator'] = 13.5
In [26]:
final.loc[1509].full_text
final.loc[[1509],'numerator'] = 9.5
In [27]:
final[final.numerator == '\r\n9']
Out[27]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1492 682389078323662849 NaN NaN 2015-12-31 02:33:29 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/682389078... Meet Brody. He's a Downton Abbey Falsetto. Add... 1780 509 Brody NaN \r\n9 10
2082 667538891197542400 NaN NaN 2015-11-20 03:04:08 +0000 <a href="http://twitter.com" rel="nofollow">Tw... NaN NaN NaN https://twitter.com/dog_rates/status/667538891... This is a southwest Coriander named Klint. Hat... 208 69 Klint NaN \r\n9 10
In [28]:
final.loc[1492].full_text
final.loc[[1492,2082],'numerator'] = 9
In [29]:
final[final.numerator == 'ry']
Out[29]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
2329 760153949710192640 NaN NaN NaN NaN NaN NaN NaN NaN RT @hownottodraw: The story/person behind @dog... 0 36 af NaN ry pe
In [30]:
final.loc[2329].full_text
final.loc[[2329],'numerator'] = 11
In [31]:
final[final.numerator == '.8']
Out[31]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1255 697259378236399616 NaN NaN 2016-02-10 03:22:44 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/697259378... Please stop sending in saber-toothed tigers. T... 3500 1086 tigers NaN .8 10
In [32]:
final.loc[1255].full_text
final.loc[[1255],'numerator'] = 8
In [33]:
final[final.numerator == '.9']
Out[33]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
In [34]:
final[final.numerator == ';2']
Out[34]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
2066 667878741721415682 NaN NaN 2015-11-21 01:34:35 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/667878741... This is Tedrick. He lives on the edge. Needs s... 403 123 Tedrick NaN ;2 10
In [35]:
final.loc[2066].full_text
final.loc[[2066],'numerator'] = 2
In [36]:
final[final.numerator == '(8']
Out[36]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1473 683462770029932544 NaN NaN 2016-01-03 01:39:57 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/683462770... "Hello forest pupper I am house pupper welcome... 2598 729 https://t pupper (8 10
In [37]:
final.loc[1473].full_text
final.loc[[1473],'numerator'] = 8
In [38]:
final[final.numerator == 'st']
Out[38]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1635 676613908052996102 NaN NaN 2015-12-15 04:05:01 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/676613908... This is the saddest/sweetest/best picture I've... 1147 209 sent NaN st sw
In [39]:
final.loc[1635].full_text
final.loc[[1635],'numerator'] = 12
In [40]:
final[final.numerator == '\r\n5']
Out[40]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1912 670782429121134593 NaN NaN 2015-11-29 01:52:48 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/670782429... This dude slaps your girl's ass what do you do... 1627 820 https://t NaN \r\n5 10
In [41]:
final.loc[1912].full_text
final.loc[[1912],'numerator'] = 5
In [42]:
final[final.numerator == '-5']
Out[42]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
2343 667550882905632768 NaN NaN NaN NaN NaN NaN NaN NaN RT @dogratingrating: Unoriginal idea. Blatant ... 0 33 idea NaN -5 10
In [43]:
final[final.numerator == ' w']
Out[43]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1069 711306686208872448 NaN NaN 2016-03-19 21:41:44 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/711306686... What hooligan sent in pictures w/out a dog in ... 3516 795 af NaN w ou
In [44]:
final.loc[1069].full_text
final.loc[[1069],'numerator'] = 3

Test

  • All stray string values eliminated (the -5 in retweet row 2343 was inspected but left unchanged)
In [45]:
print(final.numerator.value_counts())
12      553
11      463
10      461
13      346
 9      152
 8      100
 7       54
14       52
 5       33
 6       32
 3       18
 4       17
 1        9
 2        8
9         5
20        3
75        2
8         2
 0        2
60        2
15        2
44        2
24        1
3         1
07        1
43        1
11        1
2         1
12        1
13.5      1
84        1
99        1
04        1
66        1
5         1
88        1
-5        1
50        1
45        1
9.5       1
17        1
76        1
27        1
21        1
82        1
26        1
80        1
65        1
Name: numerator, dtype: int64
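
Some values, such as 9, now appear on more than one line of the counts because the column mixes the original strings with the numbers assigned by hand. A sketch of how the column could be brought to one numeric type, on a made-up sample (mixed is hypothetical):

```python
import pandas as pd

# Made-up sample: strings from extraction next to numbers assigned manually.
mixed = pd.Series(["12", " 9", 9, "04", 13.5, "-5"], dtype=object)

# " 9" and 9 are different objects, so they are counted separately.
counts_before = mixed.value_counts()

# Convert everything to text, trim whitespace, then parse as numbers.
unified = pd.to_numeric(mixed.astype(str).str.strip())

print(len(counts_before))
print(unified.value_counts())
```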

We need to clean the denominator column as well

Define

The extraction code we wrote produced some erroneous values here too, so we shall clean them manually.
We need to eliminate the non-numeric string values ('pe', 'ou', 'sw').

Code

In [46]:
final.denominator.value_counts()
Out[46]:
10    2319
11       3
50       3
80       2
20       2
15       2
13       1
pe       1
ou       1
90       1
12       1
00       1
7        1
40       1
17       1
16       1
sw       1
70       1
2        1
Name: denominator, dtype: int64
In [47]:
final[final.denominator == 'pe']
Out[47]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
2329 760153949710192640 NaN NaN NaN NaN NaN NaN NaN NaN RT @hownottodraw: The story/person behind @dog... 0 36 af NaN 11 pe
In [48]:
final.loc[2329].full_text
final.loc[[2329],'denominator'] = 10
In [49]:
final[final.denominator == 'ou']
Out[49]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1069 711306686208872448 NaN NaN 2016-03-19 21:41:44 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/711306686... What hooligan sent in pictures w/out a dog in ... 3516 795 af NaN 3 ou
In [50]:
final.loc[1069].full_text
final.loc[[1069],'denominator'] = 10
In [51]:
final[final.denominator == 'sw']
Out[51]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1635 676613908052996102 NaN NaN 2015-12-15 04:05:01 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/676613908... This is the saddest/sweetest/best picture I've... 1147 209 sent NaN 12 sw
In [52]:
final.loc[1635].full_text
final.loc[[1635],'denominator'] = 10

Test

  • All String Values eliminated
In [53]:
final.denominator.value_counts()
Out[53]:
10    2319
11       3
10       3
50       3
80       2
20       2
15       2
13       1
90       1
12       1
00       1
7        1
40       1
17       1
16       1
70       1
2        1
Name: denominator, dtype: int64
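
The "no strings left" check can be made explicit, and the same pass can list denominators other than 10 for a second look. A minimal sketch on a made-up sample (toy_denominator is hypothetical; its values are a subset of those in the counts above):

```python
import pandas as pd

# Made-up sample: string entries plus one manually assigned integer.
toy_denominator = pd.Series(["10", 10, "00", "7", "50"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_number = pd.to_numeric(
    toy_denominator.astype(str).str.strip(), errors="coerce"
)

# Zero here means every entry is numeric.
non_numeric_left = int(as_number.isna().sum())

# Entries that parse but are not 10 may deserve a manual check.
needs_review = toy_denominator[as_number != 10]
print(non_numeric_left, needs_review.tolist())
```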

Test

In [54]:
final.columns
Out[54]:
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'full_text', 'fav_count',
       'retweet_count', 'pet_name', 'dog', 'numerator', 'denominator'],
      dtype='object')

Final Code to Save

Uncomment and run this cell last.

In [55]:
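#Save Files to CSV
#index=False keeps the row index out of the files, so no 'Unnamed: 0' column appears on reload

#final.to_csv('data/final/twitter_archive_master.csv', index=False)
#img_clean.to_csv('data/final/image_predictions.csv', index=False)
#print("Saved Successfully")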
Saved Successfully