#Importing Basic Packages
import pandas as pd
import numpy as np
import math
#Assigning agreed upon variable names (Original Data)
# img: neural-network image predictions (tab-separated file),
# api: extra tweet data gathered via the Twitter API,
# archive: the enhanced WeRateDogs tweet archive.
img = pd.read_csv('data/image-predictions.tsv', sep='\t')
api = pd.read_csv('data/twitter_archive_api.csv')
archive = pd.read_csv('data/twitter-archive-enhanced.csv')
#Creating Backups and Working on the *_clean Data.
# The raw frames stay untouched; all cleaning happens on the *_clean copies.
img_clean = img.copy()
api_clean = api.copy()
archive_clean = archive.copy()
# Display the archive frame for an initial visual assessment.
archive
Quality issues:
- archive: the numerator needs to be recalculated as instructed — it can be taken from the api data (which needs to be cleaned).
- archive: dog names are incorrect; need to re-extract.
- archive: dog ratings are incorrect; need to re-extract.
- archive: remove retweeted data.
- img: the columns p1, p2, p3 have underscores separating words in their names; we should replace them with spaces.
- img: if p1_dog, p2_dog and p3_dog are all False, the tweet is invalid (it cannot be processed by the neural network) and must be removed.
- api: remove the stray count column ('Unnamed: 0').

Tidiness issues:
- archive and api: combine doggo, pupper, etc. into one column.
- api: merge into the archive dataset.
archive — The numerator column needs to be recalculated as mentioned.
(Define / Code / Test: handled in the cleaning sections below.)
archive — Dog names are incorrect; need to re-extract.
(Define / Code / Test: handled in the cleaning sections below.)
archive — Dog ratings are incorrect; need to re-extract.
(Define / Code / Test: handled in the cleaning sections below.)
archive — Remove retweeted data.
Define: looking at the column names, 'retweeted_status_id' stands out as the de facto proof that a tweet is a retweet. We therefore keep only the rows where the 'retweeted_status_id' column is null (NaN).
Code:
# Keep only original tweets: a non-null retweeted_status_id marks a retweet,
# so retain the rows where it is null.
# Fix: filter the *_clean copy (the original filtered the raw `archive`,
# silently discarding the backup/working-copy split) and take an explicit
# .copy() so later assignments do not trigger SettingWithCopyWarning on a
# slice view.
archive_clean = archive_clean[archive_clean.retweeted_status_id.isnull()].copy()
archive_clean.head()
Test
# Test: should return an empty frame — no rows with a retweet id may remain.
archive_clean[archive_clean.retweeted_status_id.notnull()]
img — The columns p1, p2, p3 have underscores separating words in their names.
Define:
# Peek at one row to see the current p1/p2/p3 values (underscore-separated).
img.head(1)
Code
# Replace the underscores in the p1/p2/p3 prediction-name columns with spaces.
for prediction_col in ('p1', 'p2', 'p3'):
    img_clean[prediction_col] = img_clean[prediction_col].str.replace('_', ' ')
Test
#Checking if the above solution worked.
# Test: p1/p2/p3 should now show spaces instead of underscores.
img_clean.head()
Test result: the p1, p2 & p3 columns now contain whitespace instead of underscores — success.

img — Remove rows where p1_dog, p2_dog and p3_dog are all False.
Define: if p1_dog, p2_dog and p3_dog are all FALSE, the image is not a dog according to the neural network and cannot be used, so the row must be removed.
Code:
#looking at the original DF
img.head()
# Row count before filtering, for comparison with the cleaned frame below.
img.count()
# Keep only the rows where at least one of the three predictions is a dog;
# when p1_dog, p2_dog and p3_dog are all False the image is not a dog and the
# row is dropped. Uses the boolean columns directly instead of the old
# `== False` chain (equivalent for boolean dtype, and idiomatic pandas);
# the commented-out earlier attempt was removed as dead code.
img_clean = img_clean[img_clean.p1_dog | img_clean.p2_dog | img_clean.p3_dog]
img_clean.count()
Test
# Test: no rows should remain where all three dog flags are False
# (De Morgan form of the original `== False` conjunction).
img_clean[~(img_clean.p1_dog | img_clean.p2_dog | img_clean.p3_dog)]
api — Remove the stray count column ('Unnamed: 0').
Define: when the API CSV was imported there was a stray index column; we must drop it, as it will cause issues later while merging.
# head(0) shows just the column names — confirms the stray column exists.
api.head(0)
Code
#Dropping Stray Column
# 'Unnamed: 0' is the leftover index written when the API CSV was saved.
# errors='ignore' makes this cell safe to re-run once the column is gone
# (the original raised KeyError on a second execution).
api_clean.drop(columns=['Unnamed: 0'], inplace=True, errors='ignore')
Test
# Test: 'Unnamed: 0' should no longer appear among the columns.
api_clean.head()
api — Merge into the archive dataset.
Define: per the final project specification we need to merge these two tables to eliminate redundant data. The tables are joined on tweet_id.
Code:
#Merge into two Tables (get rid of `api` merge into `archive` dataset)
#Saving new DF as var name 'final'
# Right join on tweet_id: every row that has API data is kept.
final = archive_clean.merge(api_clean, how='right', on='tweet_id')
Test
# Test: list the merged frame's columns.
print(final.columns)
archive and api — Combine doggo, pupper, etc. into one column.
Define: since the dog stage is already captured by the single `dog` column in the api table, we can replace the columns floofer, pupper, puppo and doggo with that one column.
Code:
# Tidiness issue #1 — archive and api: combine doggo, pupper etc. into one
# column. Per the notes above, the api table's single dog-stage column makes
# these four separate stage columns redundant, so they are dropped.
stage_columns = ['floofer', 'pupper', 'puppo', 'doggo']
final.drop(columns=stage_columns, inplace=True)
Test
# Test: the four stage columns should be gone.
final.head()
Define: drop the archive-side columns superseded by the api data — name [Q#2], rating_numerator and rating_denominator [Q#1] [Q#3], and text.
Code:
# Drop the archive-side columns that the api data supersedes (the ratings
# and names will be re-extracted from the api text — see [Q#1]-[Q#3]).
redundant_columns = ['name', 'rating_numerator', 'rating_denominator', 'text']
final.drop(columns=redundant_columns, inplace=True)
#[Q#2] Solved.
#[Q#1], [Q#3] Solved,
Recalculate the numerator column.
Define: the extraction code produced some errors, so we shall clean these manually.
#We Need to check what values should be not there
# value_counts surfaces the malformed, non-numeric strings that the
# extraction left in `numerator` (e.g. '.9', 'ry', '\r\n9').
final.numerator.value_counts()
Code
#Making a list of which rows have errors - using value_counts() for reference.
# Every malformed value observed in numerator.value_counts(). The original
# code appended '.9' twice, duplicating those tweet_ids in the result;
# listing each value once fixes that and removes the repetitive appends.
bad_numerators = ['.9', '.5', '\r\n9', 'ry', '.8', ';2',
                  '(8', 'st', '\r\n5', '-5', ' w']
numerator_error = [final[final.numerator == value]['tweet_id']
                   for value in bad_numerators]
numerator_error
Here we start to manually clean the data.
Steps: for each malformed value, inspect the affected rows' full_text and assign the rating actually stated in the tweet.
# Manual cleaning: for each malformed numerator value, look up the affected
# rows, read the tweet's full_text, then write back the correct rating.
# The row labels (847, 42, ...) are the frame's current index values.
final[final.numerator == '.9'].full_text
final.loc[[847,1192,1430],'numerator'] = 9
# '.5' rows carry genuine decimal ratings, so floats are written back.
# NOTE(review): this makes the numerator column mixed int/float/str (object
# dtype) until a final conversion — confirm downstream code handles that.
final[final.numerator == '.5']
final.loc[42].full_text
final.loc[[42],'numerator'] = 13.5
final.loc[1509].full_text
final.loc[[1509],'numerator'] = 9.5
final[final.numerator == '\r\n9']
final.loc[1492].full_text
final.loc[[1492,2082],'numerator'] = 9
final[final.numerator == 'ry']
final.loc[2329].full_text
final.loc[[2329],'numerator'] = 11
final[final.numerator == '.8']
final.loc[1255].full_text
final.loc[[1255],'numerator'] = 8
# Re-check '.9' — should be empty after the fix above.
final[final.numerator == '.9']
final[final.numerator == ';2']
final.loc[2066].full_text
final.loc[[2066],'numerator'] = 2
final[final.numerator == '(8']
final.loc[1473].full_text
final.loc[[1473],'numerator'] = 8
final[final.numerator == 'st']
final.loc[1635].full_text
final.loc[[1635],'numerator'] = 12
final[final.numerator == '\r\n5']
final.loc[1912].full_text
final.loc[[1912],'numerator'] = 5
# NOTE(review): '-5' is only inspected, never reassigned — presumably the
# query returned no rows at this point; confirm before relying on it.
final[final.numerator == '-5']
final[final.numerator == ' w']
final.loc[1069].full_text
final.loc[[1069],'numerator'] = 3
Test
# Test: re-check the distribution; the malformed values above should be gone.
print(final.numerator.value_counts())
Clean the denominator column as well.
Define: the extraction code produced some errors — we need to eliminate the string values manually.
Code:
# Inspect the denominator distribution to find the non-numeric leftovers.
final.denominator.value_counts()
# 'pe', 'ou' and 'sw' appear to be text fragments mis-extracted as
# denominators; each affected tweet's full_text is inspected and the value
# reset to the standard /10.
final[final.denominator == 'pe']
final.loc[2329].full_text
final.loc[[2329],'denominator'] = 10
final[final.denominator == 'ou']
final.loc[1069].full_text
final.loc[[1069],'denominator'] = 10
final[final.denominator == 'sw']
final.loc[1635].full_text
final.loc[[1635],'denominator'] = 10
Test
# Test: only numeric denominators should remain.
final.denominator.value_counts()
Test
# Final sanity check on the merged, cleaned frame's columns.
final.columns
RUN LAST AND UNCOMMENT
#Save Files to CSV
# Deliberately commented out — uncomment and run last, after all cleaning.
# NOTE(review): consider passing index=False, otherwise to_csv writes a
# stray index column (the same 'Unnamed: 0' problem cleaned out of the api
# CSV above) — confirm before saving.
#final.to_csv('data/final/twitter_archive_master.csv')
#img_clean.to_csv('data/final/image_predictions.csv')
#print("Saved Successfully")