Data Wrangling (Gathering)

This phase of the project consists of the following tasks, which need to be done programmatically.

In [1]:
# Import the packages needed to gather the data
import pandas as pd
import requests
import os
import tweepy
import json

1. Download the data manually and read it in to check

In [2]:
archive = pd.read_csv('data/twitter-archive-enhanced.csv')
archive.head()
Out[2]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None

2. Programmatically download data from a URL

In [3]:
# URL provided by Udacity
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
# Fetch the data; it is saved to disk in the cells below.
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
In [4]:
folderName = 'data'
fName = url.split('/')[-1]
# Create the 'data' folder if it does not already exist
if not os.path.exists(folderName):
    os.makedirs(folderName)
In [5]:
# Write the downloaded content to a file
with open(os.path.join(folderName, fName), mode='wb') as file:
    file.write(r.content)
In [6]:
# Read the downloaded data back in to check that the download worked.
img = pd.read_csv('data/image-predictions.tsv', sep='\t')
img.head()
Out[6]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
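
The cells above (derive the file name from the URL, create the folder, write the bytes) can be wrapped in one small helper. The following is a minimal sketch using only the standard library; the function name `save_download` is hypothetical, and `content` is assumed to be the `r.content` bytes returned by the request above.

```python
import os


def save_download(content: bytes, url: str, folder: str = "data") -> str:
    """Write downloaded bytes to folder/<last URL segment> and return the path."""
    # Use the last path segment of the URL as the file name, as in the cells above.
    file_name = url.split("/")[-1]
    # exist_ok=True makes the call safe when the folder is already there.
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    with open(path, mode="wb") as file:
        file.write(content)
    return path
```

With the variables from this notebook, the call would be `save_download(r.content, url)`, and the returned path can be passed straight to `pd.read_csv(..., sep='\t')`.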

3. Download the required values from the Twitter API

In [7]:
# Extract the tweet IDs from the archive DataFrame.
tweet_id = archive['tweet_id']
In [8]:
# Twitter auth data: read from environment variables so that no secrets are stored in the notebook.
# The variable names below are illustrative; set them in your shell before starting Jupyter.
consumer_key = os.environ.get('TWITTER_CONSUMER_KEY', '')
consumer_secret = os.environ.get('TWITTER_CONSUMER_SECRET', '')
access_token = os.environ.get('TWITTER_ACCESS_TOKEN', '')
access_token_secret = os.environ.get('TWITTER_ACCESS_TOKEN_SECRET', '')
In [9]:
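# Tweepy auth

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# wait_on_rate_limit makes Tweepy sleep until the rate-limit window resets instead of raising an error.
# Note: wait_on_rate_limit_notify is a Tweepy 3.x argument; it was removed in Tweepy 4.0.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)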
In [10]:
# List that records the positions (1-based loop counters) of tweets whose API request failed
api_error = []

The loop below downloads additional data for each tweet from the Twitter API and appends the tweet's JSON to a single text file, to be used in later steps.

File being written to: data/tweet_json.txt

Each record is written as one JSON object followed by a comma and a newline.

In [11]:
# Query the Twitter API for each tweet ID inside a try/except block.
# The 1-based positions of failed requests are logged in api_error in case they are needed later.
fileName = 'data/tweet_json.txt'
for counter, t in enumerate(tweet_id, start=1):
    try:
        tweet = api.get_status(id=t, tweet_mode='extended')
        # Append this tweet's JSON to the output file, followed by a comma and a newline.
        with open(fileName, 'a') as outfile:
            json.dump(tweet._json, outfile)
            outfile.write(',' + '\n')
        print(str(counter) + " Success")
    except Exception:
        print(str(counter) + " ERROR ERROR ERROR")
        api_error.append(counter)
1 Success
2 Success
3 Success
4 Success
5 Success
6 Success
7 Success
8 Success
9 Success
10 Success
11 Success
12 Success
13 Success
14 Success
15 Success
16 Success
17 Success
18 Success
19 Success
20 ERROR ERROR ERROR
21 Success
22 Success
23 Success
24 Success
25 Success
26 Success
27 Success
28 Success
29 Success
30 Success
31 Success
32 Success
33 Success
34 Success
35 Success
36 Success
37 Success
38 Success
39 Success
40 Success
41 Success
42 Success
43 Success
44 Success
45 Success
46 Success
47 Success
48 Success
49 Success
50 Success
51 Success
52 Success
53 Success
54 Success
55 Success
56 Success
57 Success
58 Success
59 Success
60 Success
61 Success
62 Success
63 Success
64 Success
65 Success
66 Success
67 Success
68 Success
69 Success
70 Success
71 Success
72 Success
73 Success
74 Success
75 Success
76 Success
77 Success
78 Success
79 Success
80 Success
81 Success
82 Success
83 Success
84 Success
85 Success
86 Success
87 Success
88 Success
89 Success
90 Success
91 Success
92 Success
93 Success
94 Success
95 Success
96 ERROR ERROR ERROR
97 Success
98 Success
99 Success
100 Success
101 Success
102 Success
103 Success
104 Success
105 Success
106 Success
107 Success
108 Success
109 Success
110 Success
111 Success
112 Success
113 Success
114 Success
115 Success
116 Success
117 Success
118 Success
119 ERROR ERROR ERROR
120 Success
121 Success
122 Success
123 Success
124 Success
125 Success
126 Success
127 Success
128 Success
129 Success
130 Success
131 Success
132 Success
133 ERROR ERROR ERROR
134 Success
135 Success
136 Success
137 Success
138 Success
139 Success
140 Success
141 Success
142 Success
143 Success
144 Success
145 Success
146 Success
147 Success
148 Success
149 Success
150 Success
151 Success
152 Success
153 Success
154 Success
155 Success
156 ERROR ERROR ERROR
157 Success
158 Success
159 Success
160 Success
161 Success
162 Success
163 Success
164 Success
165 Success
166 Success
167 Success
168 Success
169 Success
170 Success
171 Success
172 Success
173 Success
174 Success
175 Success
176 Success
177 Success
178 Success
179 Success
180 Success
181 Success
182 Success
183 Success
184 Success
185 Success
186 Success
187 Success
188 Success
189 Success
190 Success
191 Success
192 Success
193 Success
194 Success
195 Success
196 Success
197 Success
198 Success
199 Success
200 Success
201 Success
202 Success
203 Success
204 Success
205 Success
206 Success
207 Success
208 Success
209 Success
210 Success
211 Success
212 Success
213 Success
214 Success
215 Success
216 Success
217 Success
218 Success
219 Success
220 Success
221 Success
222 Success
223 Success
224 Success
225 Success
226 Success
227 Success
228 Success
229 Success
230 Success
231 Success
232 Success
233 Success
234 Success
235 Success
236 Success
237 Success
238 Success
239 Success
240 Success
241 Success
242 Success
243 Success
244 Success
245 Success
246 Success
247 Success
248 ERROR ERROR ERROR
249 Success
250 Success
251 Success
252 Success
253 Success
254 Success
255 Success
256 Success
257 Success
258 Success
259 Success
260 Success
261 ERROR ERROR ERROR
262 Success
263 Success
264 Success
265 Success
266 Success
267 Success
268 Success
269 Success
270 Success
271 Success
272 Success
273 Success
274 Success
275 Success
276 Success
277 Success
278 Success
279 Success
280 Success
281 Success
282 Success
283 Success
284 Success
285 Success
286 Success
287 Success
288 Success
289 Success
290 Success
291 Success
292 Success
293 Success
294 Success
295 Success
296 Success
297 Success
298 Success
299 ERROR ERROR ERROR
300 Success
301 Success
302 Success
303 Success
304 Success
305 Success
306 Success
307 Success
308 Success
309 Success
310 Success
311 Success
312 Success
313 Success
314 Success
315 Success
316 Success
317 Success
318 Success
319 Success
320 Success
321 Success
322 Success
323 Success
324 Success
325 Success
326 Success
327 Success
328 Success
329 Success
330 Success
331 Success
332 Success
333 Success
334 Success
335 Success
336 Success
337 Success
338 Success
339 Success
340 Success
341 Success
342 Success
343 Success
344 Success
345 Success
346 Success
347 Success
348 Success
349 Success
350 Success
351 Success
352 Success
353 Success
354 Success
355 Success
356 Success
357 Success
358 Success
359 Success
360 Success
361 Success
362 Success
363 Success
364 Success
365 Success
366 Success
367 Success
368 Success
369 Success
370 Success
371 Success
372 Success
373 Success
374 Success
375 Success
376 Success
377 Success
378 Success
379 Success
380 Success
381 Success
382 Success
383 ERROR ERROR ERROR
384 Success
385 Success
386 Success
387 Success
388 Success
389 Success
390 Success
391 Success
392 Success
393 Success
394 Success
395 Success
396 Success
397 Success
398 Success
399 Success
400 Success
401 Success
402 Success
403 Success
404 Success
405 Success
406 Success
407 Success
408 Success
409 Success
410 Success
411 Success
412 Success
413 Success
414 Success
415 Success
416 Success
417 Success
418 Success
419 Success
420 Success
421 Success
422 Success
423 Success
424 Success
425 Success
426 Success
427 Success
428 Success
429 Success
430 Success
431 Success
432 Success
433 Success
434 Success
435 Success
436 Success
437 Success
438 Success
439 Success
440 Success
441 Success
442 Success
443 Success
444 Success
445 Success
446 Success
447 Success
448 Success
449 Success
450 Success
451 Success
452 Success
453 Success
454 Success
455 Success
456 Success
457 Success
458 Success
459 Success
460 Success
461 Success
462 Success
463 Success
464 Success
465 Success
466 Success
467 Success
468 Success
469 Success
470 Success
471 Success
472 Success
473 Success
474 Success
475 Success
476 Success
477 Success
478 Success
479 Success
480 Success
481 Success
482 Success
483 Success
484 Success
485 Success
486 Success
487 Success
488 Success
489 Success
490 Success
491 Success
492 Success
493 Success
494 Success
495 Success
496 Success
497 Success
498 Success
499 Success
500 Success
501 Success
502 Success
503 Success
504 Success
505 Success
506 Success
507 Success
508 Success
509 Success
510 Success
511 Success
512 Success
513 Success
514 Success
515 Success
516 Success
517 Success
518 Success
519 Success
520 Success
521 Success
522 Success
523 Success
524 Success
525 Success
526 Success
527 Success
528 Success
529 Success
530 Success
531 Success
532 Success
533 Success
534 Success
535 Success
536 Success
537 Success
538 Success
539 Success
540 Success
541 Success
542 Success
543 Success
544 Success
545 Success
546 Success
547 Success
548 Success
549 Success
550 Success
551 Success
552 Success
553 Success
554 Success
555 Success
556 Success
557 Success
558 Success
559 Success
560 Success
561 Success
562 Success
563 Success
564 Success
565 Success
566 Success
567 ERROR ERROR ERROR
568 Success
569 Success
570 Success
571 Success
572 Success
573 Success
574 Success
575 Success
576 Success
577 Success
578 Success
579 Success
580 Success
581 Success
582 Success
583 Success
584 Success
585 Success
586 Success
587 Success
588 Success
589 Success
590 Success
591 Success
592 Success
593 Success
594 Success
595 Success
596 Success
597 Success
598 Success
599 Success
600 Success
601 Success
602 Success
603 Success
604 Success
605 Success
606 Success
607 Success
608 Success
609 Success
610 Success
611 Success
612 Success
613 Success
614 Success
615 Success
616 Success
617 Success
618 Success
619 Success
620 Success
621 Success
622 Success
623 Success
624 Success
625 Success
626 Success
627 Success
628 Success
629 Success
630 Success
631 Success
632 Success
633 Success
634 Success
635 Success
636 Success
637 Success
638 Success
639 Success
640 Success
641 Success
642 Success
643 Success
644 Success
645 Success
646 Success
647 Success
648 Success
649 Success
650 Success
651 Success
652 Success
653 Success
654 Success
655 Success
656 Success
657 Success
658 Success
659 Success
660 Success
661 Success
662 Success
663 Success
664 Success
665 Success
666 Success
667 Success
668 Success
669 Success
670 Success
671 Success
672 Success
673 Success
674 Success
675 Success
676 Success
677 Success
678 Success
679 Success
680 Success
681 Success
682 Success
683 Success
684 Success
685 Success
686 Success
687 Success
688 Success
689 Success
690 Success
691 Success
692 Success
693 Success
694 Success
695 Success
696 Success
697 Success
698 Success
699 Success
700 Success
701 Success
702 Success
703 Success
704 Success
705 Success
706 Success
707 Success
708 Success
709 Success
710 Success
711 Success
712 Success
713 Success
714 Success
715 Success
716 Success
717 Success
718 Success
719 Success
720 Success
721 Success
722 Success
723 Success
724 Success
725 Success
726 Success
727 Success
728 Success
729 Success
730 Success
731 Success
732 Success
733 Success
734 Success
735 Success
736 Success
737 Success
738 Success
739 Success
740 Success
741 Success
742 Success
743 Success
744 Success
745 Success
746 Success
747 Success
748 Success
749 Success
750 Success
751 Success
752 Success
753 Success
754 Success
755 Success
756 Success
757 Success
758 Success
759 Success
760 Success
761 Success
762 Success
763 Success
764 Success
765 Success
766 Success
767 Success
768 Success
769 Success
770 Success
771 Success
772 Success
773 Success
774 Success
775 Success
776 Success
777 Success
778 Success
779 Success
780 Success
781 Success
782 Success
783 Success
784 Success
785 ERROR ERROR ERROR
786 Success
787 Success
788 Success
789 Success
790 Success
791 Success
792 Success
793 Success
794 Success
795 Success
796 Success
797 Success
798 Success
799 Success
800 Success
801 Success
802 Success
803 Success
804 Success
805 Success
806 Success
807 Success
808 Success
809 Success
810 Success
811 Success
812 Success
813 Success
814 Success
815 Success
816 ERROR ERROR ERROR
817 Success
818 Success
819 ERROR ERROR ERROR
820 Success
821 Success
822 Success
823 Success
824 Success
825 Success
826 Success
827 Success
828 Success
829 Success
830 Success
831 Success
832 Success
833 Success
834 Success
835 Success
836 Success
837 Success
838 Success
839 Success
840 Success
841 Success
842 Success
843 Success
844 Success
845 Success
846 Success
847 Success
848 Success
849 Success
850 Success
851 Success
852 Success
853 Success
854 Success
855 Success
856 Success
857 Success
858 Success
859 Success
860 Success
861 Success
862 Success
863 Success
864 Success
865 Success
866 Success
867 Success
868 Success
869 Success
870 Success
871 Success
872 Success
873 Success
874 Success
875 Success
876 Success
877 Success
878 Success
879 Success
880 Success
881 Success
882 Success
883 Success
884 Success
885 Success
886 Success
887 Success
888 Success
889 Success
890 Success
891 Success
892 Success
893 Success
894 Success
895 Success
896 Success
897 Success
898 Success
899 Success
900 Success
901 Success
902 Success
903 Success
904 Success
905 Success
906 Success
907 Success
908 Success
909 Success
910 Success
911 Success
912 Success
913 Success
914 Success
915 Success
916 Success
917 Success
918 Success
919 Success
920 Success
921 Success
922 Success
923 Success
924 Success
925 Success
926 Success
927 Success
928 Success
929 Success
930 Success
931 Success
932 Success
933 ERROR ERROR ERROR
934 Success
935 Success
936 Success
Rate limit reached. Sleeping for: 58
937 Success
938 Success
939 Success
940 Success
941 Success
942 Success
943 Success
944 Success
945 Success
946 Success
947 Success
948 Success
949 Success
950 Success
951 Success
952 Success
953 Success
954 Success
955 Success
956 Success
957 Success
958 Success
959 Success
960 Success
961 Success
962 Success
963 Success
964 Success
965 Success
966 Success
967 Success
968 Success
969 Success
970 Success
971 Success
972 Success
973 Success
974 Success
975 Success
976 Success
977 Success
978 Success
979 Success
980 Success
981 Success
982 Success
983 Success
984 Success
985 Success
986 Success
987 Success
988 Success
989 Success
990 Success
991 Success
992 Success
993 Success
994 Success
995 Success
996 Success
997 Success
998 Success
999 Success
1000 Success
1001 Success
1002 Success
1003 Success
1004 Success
1005 Success
1006 Success
1007 Success
1008 Success
1009 Success
1010 Success
1011 Success
1012 Success
1013 Success
1014 Success
1015 Success
1016 Success
1017 Success
1018 Success
1019 Success
1020 Success
1021 Success
1022 Success
1023 Success
1024 Success
1025 Success
1026 Success
1027 Success
1028 Success
1029 Success
1030 Success
1031 Success
1032 Success
1033 Success
1034 Success
1035 Success
1036 Success
1037 Success
1038 Success
1039 Success
1040 Success
1041 Success
1042 Success
1043 Success
1044 Success
1045 Success
1046 Success
1047 Success
1048 Success
1049 Success
1050 Success
1051 Success
1052 Success
1053 Success
1054 Success
1055 Success
1056 Success
1057 Success
1058 Success
1059 Success
1060 Success
1061 Success
1062 Success
1063 Success
1064 Success
1065 Success
1066 Success
1067 Success
1068 Success
1069 Success
1070 Success
1071 Success
1072 Success
1073 Success
1074 Success
1075 Success
1076 Success
1077 Success
1078 Success
1079 Success
1080 Success
1081 Success
1082 Success
1083 Success
1084 Success
1085 Success
1086 Success
1087 Success
1088 Success
1089 Success
1090 Success
1091 Success
1092 Success
1093 Success
1094 Success
1095 Success
1096 Success
1097 Success
1098 Success
1099 Success
1100 Success
1101 Success
1102 Success
1103 Success
1104 Success
1105 Success
1106 Success
1107 Success
1108 Success
1109 Success
1110 Success
1111 Success
1112 Success
1113 Success
1114 Success
1115 Success
1116 Success
1117 Success
1118 Success
1119 Success
1120 Success
1121 Success
1122 Success
1123 Success
1124 Success
1125 Success
1126 Success
1127 Success
1128 Success
1129 Success
1130 Success
1131 Success
1132 Success
1133 Success
1134 Success
1135 Success
1136 Success
1137 Success
1138 Success
1139 Success
1140 Success
1141 Success
1142 Success
1143 Success
1144 Success
1145 Success
1146 Success
1147 Success
1148 Success
1149 Success
1150 Success
1151 Success
1152 Success
1153 Success
1154 Success
1155 Success
1156 Success
1157 Success
1158 Success
1159 Success
1160 Success
1161 Success
1162 Success
1163 Success
1164 Success
1165 Success
1166 Success
1167 Success
1168 Success
1169 Success
1170 Success
1171 Success
1172 Success
1173 Success
1174 Success
1175 Success
1176 Success
1177 Success
1178 Success
1179 Success
1180 Success
1181 Success
1182 Success
1183 Success
1184 Success
1185 Success
1186 Success
1187 Success
1188 Success
1189 Success
1190 Success
1191 Success
1192 Success
1193 Success
1194 Success
1195 Success
1196 Success
1197 Success
1198 Success
1199 Success
1200 Success
1201 Success
1202 Success
1203 Success
1204 Success
1205 Success
1206 Success
1207 Success
1208 Success
1209 Success
1210 Success
1211 Success
1212 Success
1213 Success
1214 Success
1215 Success
1216 Success
1217 Success
1218 Success
1219 Success
1220 Success
1221 Success
1222 Success
1223 Success
1224 Success
1225 Success
1226 Success
1227 Success
1228 Success
1229 Success
1230 Success
1231 Success
1232 Success
1233 Success
1234 Success
1235 Success
1236 Success
1237 Success
1238 Success
1239 Success
1240 Success
1241 Success
1242 Success
1243 Success
1244 Success
1245 Success
1246 Success
1247 Success
1248 Success
1249 Success
1250 Success
1251 Success
1252 Success
1253 Success
1254 Success
1255 Success
1256 Success
1257 Success
1258 Success
1259 Success
1260 Success
1261 Success
1262 Success
1263 Success
1264 Success
1265 Success
1266 Success
1267 Success
1268 Success
1269 Success
1270 Success
1271 Success
1272 Success
1273 Success
1274 Success
1275 Success
1276 Success
1277 Success
1278 Success
1279 Success
1280 Success
1281 Success
1282 Success
1283 Success
1284 Success
1285 Success
1286 Success
1287 Success
1288 Success
1289 Success
1290 Success
1291 Success
1292 Success
1293 Success
1294 Success
1295 Success
1296 Success
1297 Success
1298 Success
1299 Success
1300 Success
1301 Success
1302 Success
1303 Success
1304 Success
1305 Success
1306 Success
1307 Success
1308 Success
1309 Success
1310 Success
1311 Success
1312 Success
1313 Success
1314 Success
1315 Success
1316 Success
1317 Success
1318 Success
1319 Success
1320 Success
1321 Success
1322 Success
1323 Success
1324 Success
1325 Success
1326 Success
1327 Success
1328 Success
1329 Success
1330 Success
1331 Success
1332 Success
1333 Success
1334 Success
1335 Success
1336 Success
1337 Success
1338 Success
1339 Success
1340 Success
1341 Success
1342 Success
1343 Success
1344 Success
1345 Success
1346 Success
1347 Success
1348 Success
1349 Success
1350 Success
1351 Success
1352 Success
1353 Success
1354 Success
1355 Success
1356 Success
1357 Success
1358 Success
1359 Success
1360 Success
1361 Success
1362 Success
1363 Success
1364 Success
1365 Success
1366 Success
1367 Success
1368 Success
1369 Success
1370 Success
1371 Success
1372 Success
1373 Success
1374 Success
1375 Success
1376 Success
1377 Success
1378 Success
1379 Success
1380 Success
1381 Success
1382 Success
1383 Success
1384 Success
1385 Success
1386 Success
1387 Success
1388 Success
1389 Success
1390 Success
1391 Success
1392 Success
1393 Success
1394 Success
1395 Success
1396 Success
1397 Success
1398 Success
1399 Success
1400 Success
1401 Success
1402 Success
1403 Success
1404 Success
1405 Success
1406 Success
1407 Success
1408 Success
1409 Success
1410 Success
1411 Success
1412 Success
1413 Success
1414 Success
1415 Success
1416 Success
1417 Success
1418 Success
1419 Success
1420 Success
1421 Success
1422 Success
1423 Success
1424 Success
1425 Success
1426 Success
1427 Success
1428 Success
1429 Success
1430 Success
1431 Success
1432 Success
1433 Success
1434 Success
1435 Success
1436 Success
1437 Success
1438 Success
1439 Success
1440 Success
1441 Success
1442 Success
1443 Success
1444 Success
1445 Success
1446 Success
1447 Success
1448 Success
1449 Success
1450 Success
1451 Success
1452 Success
1453 Success
1454 Success
1455 Success
1456 Success
1457 Success
1458 Success
1459 Success
1460 Success
1461 Success
1462 Success
1463 Success
1464 Success
1465 Success
1466 Success
1467 Success
1468 Success
1469 Success
1470 Success
1471 Success
1472 Success
1473 Success
1474 Success
1475 Success
1476 Success
1477 Success
1478 Success
1479 Success
1480 Success
1481 Success
1482 Success
1483 Success
1484 Success
1485 Success
1486 Success
1487 Success
1488 Success
1489 Success
1490 Success
1491 Success
1492 Success
1493 Success
1494 Success
1495 Success
1496 Success
1497 Success
1498 Success
1499 Success
1500 Success
1501 Success
1502 Success
1503 Success
1504 Success
1505 Success
1506 Success
1507 Success
1508 Success
1509 Success
1510 Success
1511 Success
1512 Success
1513 Success
1514 Success
1515 Success
1516 Success
1517 Success
1518 Success
1519 Success
1520 Success
1521 Success
1522 Success
1523 Success
1524 Success
1525 Success
1526 Success
1527 Success
1528 Success
1529 Success
1530 Success
1531 Success
1532 Success
1533 Success
1534 Success
1535 Success
1536 Success
1537 Success
1538 Success
1539 Success
1540 Success
1541 Success
1542 Success
1543 Success
1544 Success
1545 Success
1546 Success
1547 Success
1548 Success
1549 Success
1550 Success
1551 Success
1552 Success
1553 Success
1554 Success
1555 Success
1556 Success
1557 Success
1558 Success
1559 Success
1560 Success
1561 Success
1562 Success
1563 Success
1564 Success
1565 Success
1566 Success
1567 Success
1568 Success
1569 Success
1570 Success
1571 Success
1572 Success
1573 Success
1574 Success
1575 Success
1576 Success
1577 Success
1578 Success
1579 Success
1580 Success
1581 Success
1582 Success
1583 Success
1584 Success
1585 Success
1586 Success
1587 Success
1588 Success
1589 Success
1590 Success
1591 Success
1592 Success
1593 Success
1594 Success
1595 Success
1596 Success
1597 Success
1598 Success
1599 Success
1600 Success
1601 Success
1602 Success
1603 Success
1604 Success
1605 Success
1606 Success
1607 Success
1608 Success
1609 Success
1610 Success
1611 Success
1612 Success
1613 Success
1614 Success
1615 Success
1616 Success
1617 Success
1618 Success
1619 Success
1620 Success
1621 Success
1622 Success
1623 Success
1624 Success
1625 Success
1626 Success
1627 Success
1628 Success
1629 Success
1630 Success
1631 Success
1632 Success
1633 Success
1634 Success
1635 Success
1636 Success
1637 Success
1638 Success
1639 Success
1640 Success
1641 Success
1642 Success
1643 Success
1644 Success
1645 Success
1646 Success
1647 Success
1648 Success
1649 Success
1650 Success
1651 Success
1652 Success
1653 Success
1654 Success
1655 Success
1656 Success
1657 Success
1658 Success
1659 Success
1660 Success
1661 Success
1662 Success
1663 Success
1664 Success
1665 Success
1666 Success
1667 Success
1668 Success
1669 Success
1670 Success
1671 Success
1672 Success
1673 Success
1674 Success
1675 Success
1676 Success
1677 Success
1678 Success
1679 Success
1680 Success
1681 Success
1682 Success
1683 Success
1684 Success
1685 Success
1686 Success
1687 Success
1688 Success
1689 Success
1690 Success
1691 Success
1692 Success
1693 Success
1694 Success
1695 Success
1696 Success
1697 Success
1698 Success
1699 Success
1700 Success
1701 Success
1702 Success
1703 Success
1704 Success
1705 Success
1706 Success
1707 Success
1708 Success
1709 Success
1710 Success
1711 Success
1712 Success
1713 Success
1714 Success
1715 Success
1716 Success
1717 Success
1718 Success
1719 Success
1720 Success
1721 Success
1722 Success
1723 Success
1724 Success
1725 Success
1726 Success
1727 Success
1728 Success
1729 Success
1730 Success
1731 Success
1732 Success
1733 Success
1734 Success
1735 Success
1736 Success
1737 Success
1738 Success
1739 Success
1740 Success
1741 Success
1742 Success
1743 Success
1744 Success
1745 Success
1746 Success
1747 Success
1748 Success
1749 Success
1750 Success
1751 Success
1752 Success
1753 Success
1754 Success
1755 Success
1756 Success
1757 Success
1758 Success
1759 Success
1760 Success
1761 Success
1762 Success
1763 Success
1764 Success
1765 Success
1766 Success
1767 Success
1768 Success
1769 Success
1770 Success
1771 Success
1772 Success
1773 Success
1774 Success
1775 Success
1776 Success
1777 Success
1778 Success
1779 Success
1780 Success
1781 Success
1782 Success
1783 Success
1784 Success
1785 Success
1786 Success
1787 Success
1788 Success
1789 Success
1790 Success
1791 Success
1792 Success
1793 Success
1794 Success
1795 Success
1796 Success
1797 Success
1798 Success
1799 Success
1800 Success
1801 Success
1802 Success
1803 Success
1804 Success
1805 Success
1806 Success
1807 Success
1808 Success
1809 Success
1810 Success
1811 Success
1812 Success
1813 Success
1814 Success
1815 Success
1816 Success
1817 Success
1818 Success
1819 Success
1820 Success
1821 Success
1822 Success
1823 Success
1824 Success
1825 Success
1826 Success
1827 Success
1828 Success
1829 Success
1830 Success
1831 Success
1832 Success
1833 Success
1834 Success
1835 Success
1836 Success
Rate limit reached. Sleeping for: 66
1837 Success
1838 Success
1839 Success
1840 Success
1841 Success
1842 Success
1843 Success
1844 Success
1845 Success
1846 Success
1847 Success
1848 Success
1849 Success
1850 Success
1851 Success
1852 Success
1853 Success
1854 Success
1855 Success
1856 Success
1857 Success
1858 Success
1859 Success
1860 Success
1861 Success
1862 Success
1863 Success
1864 Success
1865 Success
1866 Success
1867 Success
1868 Success
1869 Success
1870 Success
1871 Success
1872 Success
1873 Success
1874 Success
1875 Success
1876 Success
1877 Success
1878 Success
1879 Success
1880 Success
1881 Success
1882 Success
1883 Success
1884 Success
1885 Success
1886 Success
1887 Success
1888 Success
1889 Success
1890 Success
1891 Success
1892 Success
1893 Success
1894 Success
1895 Success
1896 Success
1897 Success
1898 Success
1899 Success
1900 Success
1901 Success
1902 Success
1903 Success
1904 Success
1905 Success
1906 Success
1907 Success
1908 Success
1909 Success
1910 Success
1911 Success
1912 Success
1913 Success
1914 Success
1915 Success
1916 Success
1917 Success
1918 Success
1919 Success
1920 Success
1921 Success
1922 Success
1923 Success
1924 Success
1925 Success
1926 Success
1927 Success
1928 Success
1929 Success
1930 Success
1931 Success
1932 Success
1933 Success
1934 Success
1935 Success
1936 Success
1937 Success
1938 Success
1939 Success
1940 Success
1941 Success
1942 Success
1943 Success
1944 Success
1945 Success
1946 Success
1947 Success
1948 Success
1949 Success
1950 Success
1951 Success
1952 Success
1953 Success
1954 Success
1955 Success
1956 Success
1957 Success
1958 Success
1959 Success
1960 Success
1961 Success
1962 Success
1963 Success
1964 Success
1965 Success
1966 Success
1967 Success
1968 Success
1969 Success
1970 Success
1971 Success
1972 Success
1973 Success
1974 Success
1975 Success
1976 Success
1977 Success
1978 Success
1979 Success
1980 Success
1981 Success
1982 Success
1983 Success
1984 Success
1985 Success
1986 Success
1987 Success
1988 Success
1989 Success
1990 Success
1991 Success
1992 Success
1993 Success
1994 Success
1995 Success
1996 Success
1997 Success
1998 Success
1999 Success
2000 Success
2001 Success
2002 Success
2003 Success
2004 Success
2005 Success
2006 Success
2007 Success
2008 Success
2009 Success
2010 Success
2011 Success
2012 Success
2013 Success
2014 Success
2015 Success
2016 Success
2017 Success
2018 Success
2019 Success
2020 Success
2021 Success
2022 Success
2023 Success
2024 Success
2025 Success
2026 Success
2027 Success
2028 Success
2029 Success
2030 Success
2031 Success
2032 Success
2033 Success
2034 Success
2035 Success
2036 Success
2037 Success
2038 Success
2039 Success
2040 Success
2041 Success
2042 Success
2043 Success
2044 Success
2045 Success
2046 Success
2047 Success
2048 Success
2049 Success
2050 Success
2051 Success
2052 Success
2053 Success
2054 Success
2055 Success
2056 Success
2057 Success
2058 Success
2059 Success
2060 Success
2061 Success
2062 Success
2063 Success
2064 Success
2065 Success
2066 Success
2067 Success
2068 Success
2069 Success
2070 Success
2071 Success
2072 Success
2073 Success
2074 Success
2075 Success
2076 Success
2077 Success
2078 Success
2079 Success
2080 Success
2081 Success
2082 Success
2083 Success
2084 Success
2085 Success
2086 Success
2087 Success
2088 Success
2089 Success
2090 Success
2091 Success
2092 Success
2093 Success
2094 Success
2095 Success
2096 Success
2097 Success
2098 Success
2099 Success
2100 Success
2101 Success
2102 Success
2103 Success
2104 Success
2105 Success
2106 Success
2107 Success
2108 Success
2109 Success
2110 Success
2111 Success
2112 Success
2113 Success
2114 Success
2115 Success
2116 Success
2117 Success
2118 Success
2119 Success
2120 Success
2121 Success
2122 Success
2123 Success
2124 Success
2125 Success
2126 Success
2127 Success
2128 Success
2129 Success
2130 Success
2131 Success
2132 Success
2133 Success
2134 Success
2135 Success
2136 Success
2137 Success
2138 Success
2139 Success
2140 Success
2141 Success
2142 Success
2143 Success
2144 Success
2145 Success
2146 Success
2147 Success
2148 Success
2149 Success
2150 Success
2151 Success
2152 Success
2153 Success
2154 Success
2155 Success
2156 Success
2157 Success
2158 Success
2159 Success
2160 Success
2161 Success
2162 Success
2163 Success
2164 Success
2165 Success
2166 Success
2167 Success
2168 Success
2169 Success
2170 Success
2171 Success
2172 Success
2173 Success
2174 Success
2175 Success
2176 Success
2177 Success
2178 Success
2179 Success
2180 Success
2181 Success
2182 Success
2183 Success
2184 Success
2185 Success
2186 Success
2187 Success
2188 Success
2189 Success
2190 Success
2191 Success
2192 Success
2193 Success
2194 Success
2195 Success
2196 Success
2197 Success
2198 Success
2199 Success
2200 Success
2201 Success
2202 Success
2203 Success
2204 Success
2205 Success
2206 Success
2207 Success
2208 Success
2209 Success
2210 Success
2211 Success
2212 Success
2213 Success
2214 Success
2215 Success
2216 Success
2217 Success
2218 Success
2219 Success
2220 Success
2221 Success
2222 Success
2223 Success
2224 Success
2225 Success
2226 Success
2227 Success
2228 Success
2229 Success
2230 Success
2231 Success
2232 Success
2233 Success
2234 Success
2235 Success
2236 Success
2237 Success
2238 Success
2239 Success
2240 Success
2241 Success
2242 Success
2243 Success
2244 Success
2245 Success
2246 Success
2247 Success
2248 Success
2249 Success
2250 Success
2251 Success
2252 Success
2253 Success
2254 Success
2255 Success
2256 Success
2257 Success
2258 Success
2259 Success
2260 Success
2261 Success
2262 Success
2263 Success
2264 Success
2265 Success
2266 Success
2267 Success
2268 Success
2269 Success
2270 Success
2271 Success
2272 Success
2273 Success
2274 Success
2275 Success
2276 Success
2277 Success
2278 Success
2279 Success
2280 Success
2281 Success
2282 Success
2283 Success
2284 Success
2285 Success
2286 Success
2287 Success
2288 Success
2289 Success
2290 Success
2291 Success
2292 Success
2293 Success
2294 Success
2295 Success
2296 Success
2297 Success
2298 Success
2299 Success
2300 Success
2301 Success
2302 Success
2303 Success
2304 Success
2305 Success
2306 Success
2307 Success
2308 Success
2309 Success
2310 Success
2311 Success
2312 Success
2313 Success
2314 Success
2315 Success
2316 Success
2317 Success
2318 Success
2319 Success
2320 Success
2321 Success
2322 Success
2323 Success
2324 Success
2325 Success
2326 Success
2327 Success
2328 Success
2329 Success
2330 Success
2331 Success
2332 Success
2333 Success
2334 Success
2335 Success
2336 Success
2337 Success
2338 Success
2339 Success
2340 Success
2341 Success
2342 Success
2343 Success
2344 Success
2345 Success
2346 Success
2347 Success
2348 Success
2349 Success
2350 Success
2351 Success
2352 Success
2353 Success
2354 Success
2355 Success
2356 Success
In [12]:
# Positions of the tweets whose requests raised errors; these have not been added to the DataFrame.
errors = [2056,1993,1945,1865,1836,1616,1310]
api_error
Out[12]:
[20, 96, 119, 133, 156, 248, 261, 299, 383, 567, 785, 816, 819, 933]
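The bookkeeping behind a list of failed positions like the one above can be sketched as follows. This is a minimal illustration, not the notebook's actual download loop: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for the real API call.

```python
def fetch_all(tweet_ids, fetch):
    """Call fetch(tweet_id) for every ID and record the positions that fail."""
    results, failed = [], []
    for position, tweet_id in enumerate(tweet_ids):
        try:
            results.append(fetch(tweet_id))
            print(position, "Success")
        except Exception as exc:  # a real client would catch its own error type
            failed.append(position)
            print(position, "Failed:", exc)
    return results, failed


# Hypothetical stand-in for the API call: IDs divisible by 3 "fail".
def fake_fetch(tweet_id):
    if tweet_id % 3 == 0:
        raise LookupError("tweet not found")
    return {"id": tweet_id}


results, failed = fetch_all([1, 2, 3, 4, 6], fake_fetch)
```

Keeping the failed positions in a separate list makes it easy to retry them later or to exclude them when joining with the other tables.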
In [15]:
# Loading the JSON file
with open('data/tweet_json.txt') as f:
    data = json.load(f)
In [16]:
data[0]
# The first record is dumped here so it can be pasted into http://json.parser.online.fr/
# to understand the structure of the JSON.
Out[16]:
{'contributors': None,
 'coordinates': None,
 'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'display_text_range': [0, 85],
 'entities': {'hashtags': [],
  'media': [{'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'sizes': {'large': {'h': 528, 'resize': 'fit', 'w': 540},
     'medium': {'h': 528, 'resize': 'fit', 'w': 540},
     'small': {'h': 528, 'resize': 'fit', 'w': 540},
     'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
    'type': 'photo',
    'url': 'https://t.co/MgUWQ76dJU'}],
  'symbols': [],
  'urls': [],
  'user_mentions': []},
 'extended_entities': {'media': [{'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'sizes': {'large': {'h': 528, 'resize': 'fit', 'w': 540},
     'medium': {'h': 528, 'resize': 'fit', 'w': 540},
     'small': {'h': 528, 'resize': 'fit', 'w': 540},
     'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
    'type': 'photo',
    'url': 'https://t.co/MgUWQ76dJU'}]},
 'favorite_count': 38625,
 'favorited': False,
 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'geo': None,
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'place': None,
 'possibly_sensitive': False,
 'possibly_sensitive_appealable': False,
 'retweet_count': 8541,
 'retweeted': False,
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'truncated': False,
 'user': {'contributors_enabled': False,
  'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
  'default_profile': False,
  'default_profile_image': False,
  'description': 'Your Only Source for Pawfessional Dog Ratings STORE: @ShopWeRateDogs | IG, FB & SC: WeRateDogs | MOBILE APP: @GoodDogsGame Business: dogratingtwitter@gmail.com',
  'entities': {'description': {'urls': []},
   'url': {'urls': [{'display_url': 'weratedogs.com',
      'expanded_url': 'http://weratedogs.com',
      'indices': [0, 23],
      'url': 'https://t.co/N7sNNHSfPq'}]}},
  'favourites_count': 135413,
  'follow_request_sent': False,
  'followers_count': 7087411,
  'following': False,
  'friends_count': 9,
  'geo_enabled': True,
  'has_extended_profile': True,
  'id': 4196983835,
  'id_str': '4196983835',
  'is_translation_enabled': False,
  'is_translator': False,
  'lang': 'en',
  'listed_count': 4667,
  'location': '𝓶𝓮𝓻𝓬𝓱 ↴      DM YOUR DOGS',
  'name': 'WeRateDogs™',
  'notifications': False,
  'profile_background_color': '000000',
  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_tile': False,
  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1525830435',
  'profile_image_url': 'http://pbs.twimg.com/profile_images/948761950363664385/Fpr2Oz35_normal.jpg',
  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/948761950363664385/Fpr2Oz35_normal.jpg',
  'profile_link_color': 'F5ABB5',
  'profile_sidebar_border_color': '000000',
  'profile_sidebar_fill_color': '000000',
  'profile_text_color': '000000',
  'profile_use_background_image': False,
  'protected': False,
  'screen_name': 'dog_rates',
  'statuses_count': 7341,
  'time_zone': None,
  'translator_type': 'none',
  'url': 'https://t.co/N7sNNHSfPq',
  'utc_offset': None,
  'verified': True}}
In [18]:
# TEST CODE - checks the first record only

# Run each lookup on the first record and print the results to the console.
Tid = data[0]['id_str']
full_text = data[0]['full_text']
retweet_count = data[0]['retweet_count']
fav_count = data[0]['favorite_count']
url = data[0]["extended_entities"]["media"][0]["url"]
index = data[0]['full_text'].index('/')
numerator = int(data[0]['full_text'][index-2:index])
denominator = int(data[0]['full_text'][index+1:index+3])
name = (data[0]['full_text'].split('.')[0].split(" ")[-1])
val = data[0]['full_text']
if 'doggo' in val:
    dog = 'doggo'
elif 'pupper' in val:
    dog = 'pupper'
elif 'puppo' in val:
    dog = 'puppo'
elif 'floofer' in val:
    dog = 'floofer'
else:
    dog = None
print(Tid)
print(full_text)
print(retweet_count)
print(fav_count)
print(url)
print(index)
print(numerator)
print(denominator)
print(name)
892420643555336193
This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
8541
38625
https://t.co/MgUWQ76dJU
82
13
10
Phineas
In [19]:
# Another check of the JSON parsing (the schema is quite complicated).
data[0]["extended_entities"]["media"][0]["url"]
Out[19]:
'https://t.co/MgUWQ76dJU'
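Chained lookups such as the one above raise a `KeyError` as soon as one level is missing. Assuming that some tweets have no `extended_entities` block (for example tweets without media), a small helper can return a default instead. `dig` is a hypothetical name introduced only for this sketch.

```python
def dig(obj, *path, default=None):
    """Follow keys/indexes through nested dicts and lists.

    Returns `default` as soon as any step is missing.
    """
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Two cut-down records: one with media, one without.
tweet_with_media = {"extended_entities": {"media": [{"url": "https://t.co/MgUWQ76dJU"}]}}
tweet_without_media = {"entities": {"urls": []}}

url_present = dig(tweet_with_media, "extended_entities", "media", 0, "url")
url_missing = dig(tweet_without_media, "extended_entities", "media", 0, "url")
```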

Here we extract the fields that will be used to resolve several data quality issues:
the dog name, the rating numerator and denominator, and the type of dog (doggo, floofer, etc.).
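The slicing used in this notebook takes two characters on either side of the first '/', which yields ' 8' for a single-digit rating and raises `ValueError` when the text contains no '/'. A regular expression is one possible alternative; the sketch below is an assumption about how that could look, not the method used in the cells that follow. `RATING` and `extract_rating` are hypothetical names.

```python
import re

# One or more digits (optionally with a decimal part), a slash, then digits.
RATING = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) from the first 'N/D' pattern, or None."""
    match = RATING.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


phineas = ("This is Phineas. He's a mystical boy. Only ever appears in the "
           "hole of a donut. 13/10 https://t.co/MgUWQ76dJU")
setter = ("Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). "
          "Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj")

rating_phineas = extract_rating(phineas)
rating_setter = extract_rating(setter)
rating_none = extract_rating("No rating in this text")
```

Because the pattern requires digits on both sides of the slash, the slashes inside the trailing URL are not mistaken for a rating.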

Creating the Final DataFrame: api

In [20]:
# Extract the required fields from the JSON and build a new DataFrame, api
df_list = []
for val in data:
    Tid = val['id_str']
    full_text = val['full_text']
    retweet_count = val['retweet_count']
    fav_count = val['favorite_count']
    index = val['full_text'].index('/')
    numerator = full_text[index-2:index]
    denominator = full_text[index+1:index+3]
    
    name = full_text.split('.')[0].split(" ")[-1]
    
    if 'doggo' in full_text:
        dog = 'doggo'
    elif 'pupper' in full_text:
        dog = 'pupper'
    elif 'puppo' in full_text:
        dog = 'puppo'
    elif 'floofer' in full_text:
        dog = 'floofer'
    else:
        dog = None
    
    df_list.append({'tweet_id': int(Tid),
                    'full_text': full_text,
                    'retweet_count': int(retweet_count),
                    'fav_count' : int(fav_count),
                    'numerator' : numerator, #[Q#1] 
                    'denominator': denominator, #[Q#3]
                    'pet_name' : name, #[Q#2]
                    'dog' : dog #[T#1]
                   })
    


api = pd.DataFrame(data=df_list)
In [25]:
#Checking our newly created DataFrame.
api.head()
Out[25]:
tweet_id full_text fav_count retweet_count pet_name dog numerator denominator
0 892420643555336193 This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas None 13 10
1 892177421306343426 This is Tilly. She's just checking pup on you.... 33105 6282 Tilly None 13 10
2 891815181378084864 This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie None 12 10
3 891689557279858688 This is Darla. She commenced a snooze mid meal... 42022 8670 Darla None 13 10
4 891327558926688256 This is Franklin. He would like you to stop ca... 40172 9422 Franklin None 12 10
In [22]:
# Rearranging the DataFrame columns so they read more naturally.
api = api[['tweet_id', 'full_text', 'fav_count','retweet_count', 'pet_name', 'dog', 'numerator', 'denominator']]
api.head()
Out[22]:
tweet_id full_text fav_count retweet_count pet_name dog numerator denominator
0 892420643555336193 This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas None 13 10
1 892177421306343426 This is Tilly. She's just checking pup on you.... 33105 6282 Tilly None 13 10
2 891815181378084864 This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie None 12 10
3 891689557279858688 This is Darla. She commenced a snooze mid meal... 42022 8670 Darla None 13 10
4 891327558926688256 This is Franklin. He would like you to stop ca... 40172 9422 Franklin None 12 10
In [ ]:
#Saving Twitter Data extracted from API as CSV
#api.to_csv('data/twitter_archive_api.csv')

The test code is below; it shows how I reached the solutions above.

In [24]:
# Sample test block for the loop used above.
df_api = []
# Pair each tweet ID with its text so every row gets a single ID, not the whole column.
for Tid, val in zip(api['tweet_id'], api['full_text']):
    # Code for finding numerator and denominator
    index = val.index('/')
    rating_numerator = val[index-2:index]
    rating_denominator = val[index+1:index+3]
    name = (val.split('.')[0].split(" ")[-1])
    if 'doggo' in val:
        dog = 'doggo'
    elif 'pupper' in val:
        dog = 'pupper'
    elif 'puppo' in val:
        dog = 'puppo'
    elif 'floofer' in val:
        dog = 'floofer'
    else:
        dog = None

    
    df_api.append({
        'tweet_id' : Tid,
        'name' : name,
        'rating_numerator' : rating_numerator,
        'rating_denominator' : rating_denominator,
        'dog' : dog
    })
    
df_api_pd = pd.DataFrame(data=df_api)
df_api_pd.head()
Out[24]:
dog name rating_denominator rating_numerator tweet_id
0 None Phineas 10 13 892420643555336193
1 None Tilly 10 13 892177421306343426
2 None Archie 10 12 891815181378084864
3 None Darla 10 13 891689557279858688
4 None Franklin 10 12 891327558926688256
In [26]:
#Sample code used for Calculating Numerator, Denominator and Pet Name
index = val.index('/')
print(val[index-2:index])
print(val[index+1:index+3])
print(val.split('.')[0].split(" ")[-1])
 8
10
Setter
In [27]:
#Checking the value of variable val which contains full_text from the twitter api
val
Out[27]:
'Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj'
In [28]:
# Re-initialising val with a different data value (one without a dog name)
val = api["full_text"][12]
val
Out[28]:
"Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm"
In [29]:
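# Classify the type of dog mentioned in val.
if 'doggo' in val:
    dog = 'doggo'
elif 'pupper' in val:
    dog = 'pupper'
elif 'puppo' in val:
    dog = 'puppo'
# `'floofer' or 'floof' in val` would always be true, so test one substring:
# 'floof' also matches 'floofer'.
elif 'floof' in val:
    dog = 'floofer'
else:
    dog = None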
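The same classification can be written once as a small function instead of repeating the if/elif chain in several cells. This is a sketch under stated assumptions: `DOG_TYPES` and `find_dog_type` are hypothetical names, and lower-casing the text first is a design choice that the cells above do not make. As a general Python note, an expression of the form `'a' or 'b' in s` is evaluated as `'a' or ('b' in s)` and is therefore always truthy, which is why each word is tested separately here.

```python
# Checked in this order, so a text mentioning several words gets the first one.
DOG_TYPES = ("doggo", "pupper", "puppo", "floofer")


def find_dog_type(text):
    """Return the first word from DOG_TYPES found in text, else None."""
    lowered = text.lower()
    for dog_type in DOG_TYPES:
        if dog_type in lowered:
            return dog_type
    return None


puppo_text = ("Here's a puppo that seems to be on the fence about something "
              "haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm")
plain_text = "This is Phineas. He's a mystical boy."

found = find_dog_type(puppo_text)
not_found = find_dog_type(plain_text)
```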
    

Data Wrangling(Assessing)

In [30]:
#Importing basic packages needed to get Data 
import pandas as pd
import requests
import os
import tweepy
import json
In [31]:
# Assigning agreed upon variable names
archive = pd.read_csv('data/twitter-archive-enhanced.csv')
img = pd.read_csv('data/image-predictions.tsv', sep='\t')
api = pd.read_csv('data/twitter_archive_api.csv')
In [32]:
archive.head()
Out[32]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None
In [33]:
val = archive.text[12]
val.split('.')[0].split(" ")[-1]
# print(val.split('/')[1])
val
Out[33]:
"Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm"
In [34]:
index = val.index('/')
print(val[index-2:index])
print(val[index+1:index+3])
13
10
In [35]:
# Code for finding numerator and denominator
index = val.index('/')
rating_numerator = val[index-2:index]
rating_denominator = val[index+1:index+3]
In [ ]:
# index = val.index('This is')
# index
In [36]:
list(archive.columns.values)
Out[36]:
['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']
In [37]:
archive.count()
Out[37]:
tweet_id                      2356
in_reply_to_status_id           78
in_reply_to_user_id             78
timestamp                     2356
source                        2356
text                          2356
retweeted_status_id            181
retweeted_status_user_id       181
retweeted_status_timestamp     181
expanded_urls                 2297
rating_numerator              2356
rating_denominator            2356
name                          2356
doggo                         2356
floofer                       2356
pupper                        2356
puppo                         2356
dtype: int64
In [38]:
img.head(20)
Out[38]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
5 666050758794694657 https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg 1 Bernese_mountain_dog 0.651137 True English_springer 0.263788 True Greater_Swiss_Mountain_dog 0.016199 True
6 666051853826850816 https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg 1 box_turtle 0.933012 False mud_turtle 0.045885 False terrapin 0.017885 False
7 666055525042405380 https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg 1 chow 0.692517 True Tibetan_mastiff 0.058279 True fur_coat 0.054449 False
8 666057090499244032 https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg 1 shopping_cart 0.962465 False shopping_basket 0.014594 False golden_retriever 0.007959 True
9 666058600524156928 https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg 1 miniature_poodle 0.201493 True komondor 0.192305 True soft-coated_wheaten_terrier 0.082086 True
10 666063827256086533 https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg 1 golden_retriever 0.775930 True Tibetan_mastiff 0.093718 True Labrador_retriever 0.072427 True
11 666071193221509120 https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg 1 Gordon_setter 0.503672 True Yorkshire_terrier 0.174201 True Pekinese 0.109454 True
12 666073100786774016 https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg 1 Walker_hound 0.260857 True English_foxhound 0.175382 True Ibizan_hound 0.097471 True
13 666082916733198337 https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg 1 pug 0.489814 True bull_mastiff 0.404722 True French_bulldog 0.048960 True
14 666094000022159362 https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg 1 bloodhound 0.195217 True German_shepherd 0.078260 True malinois 0.075628 True
15 666099513787052032 https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg 1 Lhasa 0.582330 True Shih-Tzu 0.166192 True Dandie_Dinmont 0.089688 True
16 666102155909144576 https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg 1 English_setter 0.298617 True Newfoundland 0.149842 True borzoi 0.133649 True
17 666104133288665088 https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg 1 hen 0.965932 False cock 0.033919 False partridge 0.000052 False
18 666268910803644416 https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg 1 desktop_computer 0.086502 False desk 0.085547 False bookcase 0.079480 False
19 666273097616637952 https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg 1 Italian_greyhound 0.176053 True toy_terrier 0.111884 True basenji 0.111152 True
In [39]:
img.sample(50)
Out[39]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
113 667915453470232577 https://pbs.twimg.com/media/CUTpj-GWcAATc6A.jpg 1 leatherback_turtle 0.452517 False boxer 1.966550e-01 True terrapin 1.609830e-01 False
972 706644897839910912 https://pbs.twimg.com/ext_tw_video_thumb/70664... 1 space_heater 0.137871 False Chihuahua 1.329280e-01 True cougar 1.138660e-01 False
12 666073100786774016 https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg 1 Walker_hound 0.260857 True English_foxhound 1.753820e-01 True Ibizan_hound 9.747050e-02 True
133 668480044826800133 https://pbs.twimg.com/media/CUbrDWOWcAEyMdM.jpg 1 Arctic_fox 0.119243 False Labrador_retriever 9.996480e-02 True pug 8.671650e-02 True
1018 710117014656950272 https://pbs.twimg.com/media/CdrXp9dWoAAcRfn.jpg 2 toy_poodle 0.802092 True miniature_poodle 1.116470e-01 True cocker_spaniel 6.286620e-02 True
1619 802624713319034886 https://pbs.twimg.com/media/CsrjryzWgAAZY00.jpg 1 cocker_spaniel 0.253442 True golden_retriever 1.628500e-01 True otterhound 1.109210e-01 True
347 672475084225949696 https://pbs.twimg.com/media/CVUchRHXAAE4rtp.jpg 1 terrapin 0.879286 False cockroach 4.525240e-02 False box_turtle 1.640380e-02 False
2004 877316821321428993 https://pbs.twimg.com/media/DCza_vtXkAQXGpC.jpg 1 Saluki 0.509967 True Italian_greyhound 9.049730e-02 True golden_retriever 7.940580e-02 True
948 704819833553219584 https://pbs.twimg.com/media/CcgF5ovW8AACrEU.jpg 1 guinea_pig 0.994776 False hamster 4.068790e-03 False wood_rabbit 2.058690e-04 False
1133 728409960103686147 https://pbs.twimg.com/media/ChvU_DwWMAArx5L.jpg 1 Siamese_cat 0.478278 False Saint_Bernard 9.424560e-02 True king_penguin 8.215670e-02 False
1131 728046963732717569 https://pbs.twimg.com/media/ChqK2cVWMAAE5Zj.jpg 1 Newfoundland 0.255971 True groenendael 1.755830e-01 True German_shepherd 1.641350e-01 True
399 673686845050527744 https://pbs.twimg.com/media/CVlqi_AXIAASlcD.jpg 1 Pekinese 0.185903 True guinea_pig 1.729510e-01 False pug 1.661830e-01 True
106 667866724293877760 https://pbs.twimg.com/media/CUS9PlUWwAANeAD.jpg 1 jigsaw_puzzle 1.000000 False prayer_rug 1.011300e-08 False doormat 1.740170e-10 False
1521 788039637453406209 https://pbs.twimg.com/media/Cu-t20yWEAAFHXi.jpg 1 beach_wagon 0.362925 False minivan 3.047590e-01 False limousine 1.017020e-01 False
751 688064179421470721 https://pbs.twimg.com/media/CYx-tGaUoAAEXV8.jpg 1 Eskimo_dog 0.240602 True Norwegian_elkhound 1.803690e-01 True Siberian_husky 9.073880e-02 True
370 672975131468300288 https://pbs.twimg.com/media/CVbjRSIWsAElw2s.jpg 1 pug 0.836421 True Brabancon_griffon 4.466780e-02 True French_bulldog 3.657050e-02 True
170 668992363537309700 https://pbs.twimg.com/media/CUi9ARGWUAEyWqo.jpg 1 lynx 0.287506 False tabby 2.060480e-01 False koala 8.141930e-02 False
1437 773985732834758656 https://pbs.twimg.com/media/Cr2_6R8WAAAUMtc.jpg 4 giant_panda 0.451149 False fur_coat 1.480010e-01 False pug 1.095700e-01 True
706 684959798585110529 https://pbs.twimg.com/media/CYF3TSlWMAAaoG5.jpg 1 llama 0.379624 False triceratops 1.627610e-01 False hog 8.425150e-02 False
1965 867421006826221569 https://pbs.twimg.com/media/DAmyy8FXYAIH8Ty.jpg 1 Eskimo_dog 0.616457 True Siberian_husky 3.813300e-01 True malamute 1.670220e-03 True
1273 750026558547456000 https://pbs.twimg.com/media/CmieRQRXgAA8MV3.jpg 1 standard_poodle 0.258732 True teddy 1.307600e-01 False toy_poodle 7.172630e-02 True
602 679828447187857408 https://pbs.twimg.com/media/CW88XN4WsAAlo8r.jpg 3 Chihuahua 0.346545 True dalmatian 1.662460e-01 True toy_terrier 1.175020e-01 True
1097 720340705894408192 https://pbs.twimg.com/media/Cf8qDFbWwAEf8M3.jpg 1 alp 0.320126 False lawn_mower 8.080770e-02 False viaduct 6.532100e-02 False
1032 711652651650457602 https://pbs.twimg.com/media/CeBMT6-WIAA7Qqf.jpg 1 llama 0.856789 False Arabian_camel 9.872700e-02 False neck_brace 1.637720e-02 False
1130 728035342121635841 https://pbs.twimg.com/media/ChqARqmWsAEI6fB.jpg 1 handkerchief 0.302961 False Pomeranian 2.486640e-01 True Shih-Tzu 1.110150e-01 True
1975 870063196459192321 https://pbs.twimg.com/media/DBMV3NnXUAAm0Pp.jpg 1 comic_book 0.534409 False envelope 2.807220e-01 False book_jacket 4.378550e-02 False
1221 744234799360020481 https://pbs.twimg.com/ext_tw_video_thumb/74423... 1 Labrador_retriever 0.825333 True ice_bear 4.468080e-02 False whippet 1.844220e-02 True
1581 796484825502875648 https://pbs.twimg.com/media/Cw2uty8VQAAB0pL.jpg 1 cocker_spaniel 0.116924 True seat_belt 1.075110e-01 False Australian_terrier 9.984340e-02 True
2056 888554962724278272 https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg 3 Siberian_husky 0.700377 True Eskimo_dog 1.665110e-01 True malamute 1.114110e-01 True
445 674646392044941312 https://pbs.twimg.com/media/CVzTUGrW4AAirJH.jpg 1 flat-coated_retriever 0.837448 True groenendael 8.616650e-02 True Labrador_retriever 1.605220e-02 True
217 670069087419133954 https://pbs.twimg.com/media/CUyQRzHWoAAhF1D.jpg 1 boathouse 0.313829 False birdhouse 1.383310e-01 False ashcan 4.567320e-02 False
1350 759793422261743616 https://pbs.twimg.com/media/CotUFZEWcAA2Pku.jpg 2 golden_retriever 0.985876 True Labrador_retriever 1.947770e-03 True kuvasz 1.751740e-03 True
1283 750429297815552001 https://pbs.twimg.com/media/CmoPdmHW8AAi8BI.jpg 1 golden_retriever 0.964929 True Labrador_retriever 1.158370e-02 True refrigerator 7.498620e-03 False
714 685532292383666176 https://pbs.twimg.com/media/CYN_-6iW8AQhPu2.jpg 1 white_wolf 0.318524 False dingo 2.154360e-01 False collie 9.580520e-02 True
1307 753420520834629632 https://pbs.twimg.com/ext_tw_video_thumb/75342... 1 balloon 0.267961 False lakeside 8.576370e-02 False rapeseed 4.080890e-02 False
693 684225744407494656 https://pbs.twimg.com/media/CX7br3HWsAAQ9L1.jpg 2 golden_retriever 0.203249 True Samoyed 6.795810e-02 True Great_Pyrenees 6.532750e-02 True
758 688789766343622656 https://pbs.twimg.com/media/CY8SocAWsAARuyh.jpg 1 American_Staffordshire_terrier 0.599660 True Staffordshire_bullterrier 3.809760e-01 True bull_mastiff 3.889020e-03 True
744 687480748861947905 https://pbs.twimg.com/media/CYpsFmIWAAAYh9C.jpg 1 English_springer 0.472273 True English_setter 1.668620e-01 True Brittany_spaniel 1.634110e-01 True
1535 790337589677002753 https://pbs.twimg.com/media/CvfX2AnWYAAQTay.jpg 1 Pembroke 0.658808 True Cardigan 1.530960e-01 True toy_terrier 1.022990e-01 True
495 675740360753160193 https://pbs.twimg.com/ext_tw_video_thumb/67574... 1 golden_retriever 0.800495 True kuvasz 9.775640e-02 True Saluki 6.841460e-02 True
969 706516534877929472 https://pbs.twimg.com/media/Cc4NCQiXEAEx2eJ.jpg 1 golden_retriever 0.772685 True Labrador_retriever 7.166530e-02 True golfcart 2.099310e-02 False
1348 759557299618865152 https://pbs.twimg.com/media/Cop9VVUXgAAhX9u.jpg 2 golden_retriever 0.763333 True Chesapeake_Bay_retriever 1.942510e-01 True Labrador_retriever 1.222540e-02 True
1334 757741869644341248 https://pbs.twimg.com/media/CoQKNY7XYAE_cuX.jpg 1 skunk 0.609715 False Old_English_sheepdog 1.288990e-01 True Siberian_husky 1.907610e-02 True
1746 823269594223824897 https://pbs.twimg.com/media/C2kzTGxWEAEOpPL.jpg 1 Samoyed 0.585441 True Pomeranian 1.936540e-01 True Arctic_fox 7.164760e-02 False
163 668960084974809088 https://pbs.twimg.com/media/CUifpn4WUAAS5X3.jpg 1 shower_curtain 0.226309 False Chesapeake_Bay_retriever 1.658780e-01 True bathtub 5.672610e-02 False
2055 888202515573088257 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 5.495000e-02 True beagle 3.891480e-02 True
883 698907974262222848 https://pbs.twimg.com/media/CbMFFssWIAAyuOd.jpg 3 German_short-haired_pointer 0.983131 True bluetick 5.557720e-03 True curly-coated_retriever 3.322210e-03 True
527 676617503762681856 https://pbs.twimg.com/media/CWPUB9TWwAALPPx.jpg 1 Chihuahua 0.841084 True Pomeranian 1.205300e-01 True Pekinese 6.600340e-03 True
1280 750132105863102464 https://pbs.twimg.com/media/CmkBKuwWgAAamOI.jpg 1 toy_poodle 0.478018 True miniature_poodle 2.074580e-01 True croquet_ball 8.587890e-02 False
203 669749430875258880 https://pbs.twimg.com/media/CUttjYtWcAAdPgI.jpg 1 washbasin 0.245794 False toilet_seat 1.094200e-01 False paper_towel 1.056640e-01 False
In [40]:
list(img.columns.values)
Out[40]:
['tweet_id',
 'jpg_url',
 'img_num',
 'p1',
 'p1_conf',
 'p1_dog',
 'p2',
 'p2_conf',
 'p2_dog',
 'p3',
 'p3_conf',
 'p3_dog']
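To make the prediction columns easier to use, one option is to pick the highest-ranked prediction that is actually a dog. The sketch below assumes, from the column names and sample values only, that `p1` to `p3` are ranked predictions, `pN_conf` is the confidence and `pN_dog` says whether the label is a dog breed. `best_dog_breed` is a hypothetical helper, and the rows are plain dicts copied from rows 8 and 6 of the table above.

```python
def best_dog_breed(row):
    """Return (label, confidence) of the first prediction flagged as a dog, or None."""
    for n in (1, 2, 3):
        if row[f"p{n}_dog"]:
            return row[f"p{n}"], row[f"p{n}_conf"]
    return None


# Row 8: only the third prediction is a dog.
row_cart = {
    "p1": "shopping_cart", "p1_conf": 0.962465, "p1_dog": False,
    "p2": "shopping_basket", "p2_conf": 0.014594, "p2_dog": False,
    "p3": "golden_retriever", "p3_conf": 0.007959, "p3_dog": True,
}
# Row 6: none of the predictions is a dog.
row_turtle = {
    "p1": "box_turtle", "p1_conf": 0.933012, "p1_dog": False,
    "p2": "mud_turtle", "p2_conf": 0.045885, "p2_dog": False,
    "p3": "terrapin", "p3_conf": 0.017885, "p3_dog": False,
}

breed_cart = best_dog_breed(row_cart)
breed_turtle = best_dog_breed(row_turtle)
```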
In [41]:
api.head(30)
Out[41]:
Unnamed: 0 tweet_id full_text fav_count retweet_count pet_name dog numerator denominator
0 0 892420643555336193 This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas NaN 13 10
1 1 892177421306343426 This is Tilly. She's just checking pup on you.... 33105 6282 Tilly NaN 13 10
2 2 891815181378084864 This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie NaN 12 10
3 3 891689557279858688 This is Darla. She commenced a snooze mid meal... 42022 8670 Darla NaN 13 10
4 4 891327558926688256 This is Franklin. He would like you to stop ca... 40172 9422 Franklin NaN 12 10
5 5 891087950875897856 Here we have a majestic great white breaching ... 20140 3118 coast NaN 13 10
6 6 890971913173991426 Meet Jax. He enjoys ice cream so much he gets ... 11807 2076 Jax NaN 13 10
7 7 890729181411237888 When you watch your owner call another dog a g... 65260 18929 boy NaN 13 10
8 8 890609185150312448 This is Zoey. She doesn't want to be one of th... 27682 4275 Zoey NaN 13 10
9 9 890240255349198849 This is Cassie. She is a college pup. Studying... 31823 7434 Cassie doggo 14 10
10 10 890006608113172480 This is Koda. He is a South Australian decksha... 30563 7353 Koda NaN 13 10
11 11 889880896479866881 This is Bruno. He is a service shark. Only get... 27679 4980 Bruno NaN 13 10
12 12 889665388333682689 Here's a puppo that seems to be on the fence a... 47935 10083 her puppo 13 10
13 13 889638837579907072 This is Ted. He does his best. Sometimes that'... 27070 4551 Ted NaN 12 10
14 14 889531135344209921 This is Stuart. He's sporting his favorite fan... 15040 2241 Stuart puppo 13 10
15 15 889278841981685760 This is Oliver. You're witnessing one of his m... 25190 5430 Oliver NaN 13 10
16 16 888917238123831296 This is Jim. He found a fren. Taught him how t... 28972 4508 Jim NaN 12 10
17 17 888804989199671297 This is Zeke. He has a new stick. Very proud o... 25482 4353 Zeke NaN 13 10
18 18 888554962724278272 This is Ralphus. He's powering up. Attempting ... 19824 3588 Ralphus NaN 13 10
19 19 888078434458587136 This is Gerald. He was just told he didn't get... 21677 3500 Gerald NaN 12 10
20 20 887705289381826560 This is Jeffrey. He has a monopoly on the pool... 30067 5405 Jeffrey NaN 13 10
21 21 887517139158093824 I've yet to rate a Venezuelan Hover Wiener. Th... 46065 11693 Wiener NaN 14 10
22 22 887473957103951883 This is Canela. She attempted some fancy porch... 68851 18259 Canela NaN 13 10
23 23 887343217045368832 You may not have known you needed to see this ... 33548 10422 today NaN 13 10
24 24 887101392804085760 This... is a Jubilant Antarctic House Bear. We... 30425 5975 This NaN 12 10
25 25 886983233522544640 This is Maya. She's very shy. Rarely leaves he... 35026 7791 Maya NaN 13 10
26 26 886736880519319552 This is Mingus. He's a wonderful father to his... 12015 3302 Mingus NaN 13 10
27 27 886680336477933568 This is Derek. He's late for a dog meeting. 13... 22325 4477 Derek NaN 13 10
28 28 886366144734445568 This is Roscoe. Another pupper fallen victim t... 21112 3203 Roscoe pupper 12 10
29 29 886267009285017600 @NonWhiteHat @MayhewMayhem omg hello tanner yo... 116 4 caution NaN 12 10
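The pet_name column above contains values such as coast, boy, her, today and This, because the last word of the first sentence is taken even when the tweet names no dog. A stricter rule is sketched below as one possible approach: it only accepts a capitalised word after "This is" or "Meet", the two phrasings visible in the rows above. `NAME` and `extract_name` are hypothetical names, and other phrasings in the full data set would be missed.

```python
import re

# Accept a capitalised word directly after "This is" or "Meet" at the start.
NAME = re.compile(r"^(?:This is|Meet) ([A-Z][a-z]+)\b")


def extract_name(text):
    """Return the dog's name if the tweet opens with a known phrasing, else None."""
    match = NAME.match(text)
    return match.group(1) if match else None


name_phineas = extract_name("This is Phineas. He's a mystical boy.")
name_jax = extract_name("Meet Jax. He enjoys ice cream so much")
name_puppo = extract_name("Here's a puppo that seems to be on the fence")
name_bear = extract_name("This... is a Jubilant Antarctic House Bear.")
```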
In [42]:
# Listing all columns before dropping
list(api.columns.values)
Out[42]:
['Unnamed: 0',
 'tweet_id',
 'full_text',
 'fav_count',
 'retweet_count',
 'pet_name',
 'dog',
 'numerator',
 'denominator']
In [44]:
#Dropping the first column, which only repeats the row index
api.drop(columns=['Unnamed: 0'], inplace=True)
#Listing all columns after dropping
list(api.columns.values)
Out[44]:
['tweet_id',
 'full_text',
 'fav_count',
 'retweet_count',
 'pet_name',
 'dog',
 'numerator',
 'denominator']
In [43]:
api.head(20)
Out[43]:
Unnamed: 0 tweet_id full_text fav_count retweet_count pet_name dog numerator denominator
0 0 892420643555336193 This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas NaN 13 10
1 1 892177421306343426 This is Tilly. She's just checking pup on you.... 33105 6282 Tilly NaN 13 10
2 2 891815181378084864 This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie NaN 12 10
3 3 891689557279858688 This is Darla. She commenced a snooze mid meal... 42022 8670 Darla NaN 13 10
4 4 891327558926688256 This is Franklin. He would like you to stop ca... 40172 9422 Franklin NaN 12 10
5 5 891087950875897856 Here we have a majestic great white breaching ... 20140 3118 coast NaN 13 10
6 6 890971913173991426 Meet Jax. He enjoys ice cream so much he gets ... 11807 2076 Jax NaN 13 10
7 7 890729181411237888 When you watch your owner call another dog a g... 65260 18929 boy NaN 13 10
8 8 890609185150312448 This is Zoey. She doesn't want to be one of th... 27682 4275 Zoey NaN 13 10
9 9 890240255349198849 This is Cassie. She is a college pup. Studying... 31823 7434 Cassie doggo 14 10
10 10 890006608113172480 This is Koda. He is a South Australian decksha... 30563 7353 Koda NaN 13 10
11 11 889880896479866881 This is Bruno. He is a service shark. Only get... 27679 4980 Bruno NaN 13 10
12 12 889665388333682689 Here's a puppo that seems to be on the fence a... 47935 10083 her puppo 13 10
13 13 889638837579907072 This is Ted. He does his best. Sometimes that'... 27070 4551 Ted NaN 12 10
14 14 889531135344209921 This is Stuart. He's sporting his favorite fan... 15040 2241 Stuart puppo 13 10
15 15 889278841981685760 This is Oliver. You're witnessing one of his m... 25190 5430 Oliver NaN 13 10
16 16 888917238123831296 This is Jim. He found a fren. Taught him how t... 28972 4508 Jim NaN 12 10
17 17 888804989199671297 This is Zeke. He has a new stick. Very proud o... 25482 4353 Zeke NaN 13 10
18 18 888554962724278272 This is Ralphus. He's powering up. Attempting ... 19824 3588 Ralphus NaN 13 10
19 19 888078434458587136 This is Gerald. He was just told he didn't get... 21677 3500 Gerald NaN 12 10

Programmatic Assessment

In [45]:
archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [46]:
archive.describe()
Out[46]:
tweet_id in_reply_to_status_id in_reply_to_user_id retweeted_status_id retweeted_status_user_id rating_numerator rating_denominator
count 2.356000e+03 7.800000e+01 7.800000e+01 1.810000e+02 1.810000e+02 2356.000000 2356.000000
mean 7.427716e+17 7.455079e+17 2.014171e+16 7.720400e+17 1.241698e+16 13.126486 10.455433
std 6.856705e+16 7.582492e+16 1.252797e+17 6.236928e+16 9.599254e+16 45.876648 6.745237
min 6.660209e+17 6.658147e+17 1.185634e+07 6.661041e+17 7.832140e+05 0.000000 0.000000
25% 6.783989e+17 6.757419e+17 3.086374e+08 7.186315e+17 4.196984e+09 10.000000 10.000000
50% 7.196279e+17 7.038708e+17 4.196984e+09 7.804657e+17 4.196984e+09 11.000000 10.000000
75% 7.993373e+17 8.257804e+17 4.196984e+09 8.203146e+17 4.196984e+09 12.000000 10.000000
max 8.924206e+17 8.862664e+17 8.405479e+17 8.874740e+17 7.874618e+17 1776.000000 170.000000
In [47]:
api.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2344 entries, 0 to 2343
Data columns (total 8 columns):
tweet_id         2344 non-null int64
full_text        2344 non-null object
fav_count        2344 non-null int64
retweet_count    2344 non-null int64
pet_name         2334 non-null object
dog              398 non-null object
numerator        2344 non-null object
denominator      2344 non-null object
dtypes: int64(3), object(5)
memory usage: 146.6+ KB
In [48]:
img.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [49]:
api.head()
Out[49]:
tweet_id full_text fav_count retweet_count pet_name dog numerator denominator
0 892420643555336193 This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas NaN 13 10
1 892177421306343426 This is Tilly. She's just checking pup on you.... 33105 6282 Tilly NaN 13 10
2 891815181378084864 This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie NaN 12 10
3 891689557279858688 This is Darla. She commenced a snooze mid meal... 42022 8670 Darla NaN 13 10
4 891327558926688256 This is Franklin. He would like you to stop ca... 40172 9422 Franklin NaN 12 10
In [51]:
#Boolean indexing uses square brackets, and missing values are tested with notnull()
#rather than by comparing against 'NaN' or None.
#Rows with a retweeted_status_id are retweets:
archive[archive['retweeted_status_id'].notnull()]
In [52]:
# Checking for duplicated column names across the three data sets


all_columns = pd.Series(list(archive) + list(api) + list(img))
all_columns[all_columns.duplicated()]
Out[52]:
17    tweet_id
25    tweet_id
dtype: object

Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook.

In [53]:
archive.duplicated(['tweet_id']).sum()
Out[53]:
0
In [54]:
img.duplicated(['tweet_id']).sum()
Out[54]:
0

Data Wrangling (Cleaning)

In [55]:
#Importing Basic Packages
import pandas as pd
import numpy as np
import math
In [56]:
#Assigning agreed upon variable names (Original Data)
img = pd.read_csv('data/image-predictions.tsv', sep='\t')
api = pd.read_csv('data/twitter_archive_api.csv')
archive = pd.read_csv('data/twitter-archive-enhanced.csv')
In [57]:
#Creating Backups and Working on the *_clean Data.

img_clean = img.copy()
api_clean = api.copy()
archive_clean = archive.copy()

Quality Issues

  1. archive The numerator needs to be recalculated as instructed - it can be taken from the api column (which itself needs to be cleaned)
  2. archive Dog names are incorrect and need to be re-extracted
  3. archive Dog ratings are incorrect and need to be re-extracted
  4. archive Remove retweeted data
  5. img The values in columns p1, p2 and p3 use underscores to separate words. These should be replaced with spaces
  6. img If p1_dog, p2_dog and p3_dog are all False, the tweet is invalid, cannot be processed by the neural network, and hence must be removed.
  7. api Remove the stray count column (Unnamed: 0)

Tidiness Issues

  1. archive and api : Combine doggo, pupper etc. into one column
  2. Merge into two tables (get rid of api by merging it into the archive dataset)

Solving Quality Issues

#1. archive The numerator column needs to be recalculated as mentioned.

Define

  • This issue is noted here, although it was easier to fix in the gathering stage while the api data was being built.
  • A marker [Q#1] is left where the quality issue is resolved.
  • The extra columns will be dropped later, under Other Misc Operations.

Code

Test

#2. archive Dog names are incorrect, need to re-extract

Define

  • Solved in the Data Gathering Stage and columns will be Dropped Later [Q#2]
  • The Extra Columns will be Dropped Later in the Other Misc Operations.

Code

Test

#3. archive Dog Ratings are incorrect, need to re-extract

Define

  • Solved in the Data Gathering Stage and columns will be Dropped Later [Q#3]
  • The Extra Columns will be Dropped Later in the Other Misc Operations.

Code

Test

#4. archive Remove retweeted Data

Define

The column 'retweeted_status_id' stands out as the de facto proof that a tweet is a retweet, since it is populated only for retweets. We therefore keep only the rows where this value is NaN.

  • Remove any rows containing a value other than NaN in the retweeted_status_id column

Code

In [58]:
archive_clean = archive[archive.retweeted_status_id.isnull()]
archive_clean.head()
Out[58]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None

Test

  • Should be blank - This means we have removed all the retweeted Data.
In [59]:
archive_clean[archive_clean.retweeted_status_id.notnull()]
Out[59]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
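
The filter and its test can also be sketched on a small stand-in table. The frame toy_archive and its values below are made up for illustration and are not taken from the real archive.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the archive table (hypothetical values)
toy_archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 7.0, np.nan, 9.0],
})

# Keep only original tweets: rows whose retweeted_status_id is missing
toy_clean = toy_archive[toy_archive.retweeted_status_id.isnull()].copy()

# The test mirrors the filter: no remaining row may carry a retweet id
remaining_retweets = toy_clean.retweeted_status_id.notnull().sum()
assert remaining_retweets == 0
```

Turning the visual "should be blank" check into an assert makes the cell fail loudly if the filter is ever changed.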

#5. img The columns p1,p2,p3 have underscores separating their names

Define

  • Replace underscores with spaces to increase readability
In [60]:
img.head(1)
Out[60]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True

Code

In [61]:
#Replacing underscores with spaces using str.replace()
img_clean.p1 = img_clean.p1.str.replace('_',' ')
img_clean.p2 = img_clean.p2.str.replace('_',' ')
img_clean.p3 = img_clean.p3.str.replace('_',' ')

Test

In [62]:
#Checking if the above solution worked.
img_clean.head()
Out[62]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh springer spaniel 0.465074 True collie 0.156665 True Shetland sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature pinscher 0.074192 True Rhodesian ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian ridgeback 0.408143 True redbone 0.360687 True miniature pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

p1, p2 & p3 columns now use spaces instead of underscores. - Success
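
As a possible alternative to checking head() by eye, the same replacement and a programmatic test could look like the sketch below. The frame toy_img is a small hypothetical stand-in that reuses a few breed names from the output above.

```python
import pandas as pd

# Toy stand-in for the image predictions table
toy_img = pd.DataFrame({
    'p1': ['Welsh_springer_spaniel', 'redbone'],
    'p2': ['collie', 'miniature_pinscher'],
    'p3': ['Shetland_sheepdog', 'Rhodesian_ridgeback'],
})

prediction_cols = ['p1', 'p2', 'p3']

# Replace the literal underscore in each prediction column
for col in prediction_cols:
    toy_img[col] = toy_img[col].str.replace('_', ' ', regex=False)

# Test: no value in any prediction column may still contain an underscore
has_underscore = (
    toy_img[prediction_cols]
    .apply(lambda s: s.str.contains('_', regex=False))
    .any()
    .any()
)
assert not has_underscore
```

This checks every row rather than only the first five.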

#6. img If p1_dog, p2_dog and p3_dog are all False.

Define

  • The tweet is invalid if p1_dog, p2_dog & p3_dog are all False. It cannot be processed by the neural network and hence must be removed.

Code

In [63]:
#looking at the original DF
img.head()
Out[63]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [64]:
#Row counts before filtering
img.count()
Out[64]:
tweet_id    2075
jpg_url     2075
img_num     2075
p1          2075
p1_conf     2075
p1_dog      2075
p2          2075
p2_conf     2075
p2_dog      2075
p3          2075
p3_conf     2075
p3_dog      2075
dtype: int64
In [65]:
#Keeping only rows where at least one prediction is a dog
img_clean = img_clean[~((img_clean.p1_dog == False) & (img_clean.p2_dog == False) & (img_clean.p3_dog == False))]
img_clean.count()
Out[65]:
tweet_id    1751
jpg_url     1751
img_num     1751
p1          1751
p1_conf     1751
p1_dog      1751
p2          1751
p2_conf     1751
p2_dog      1751
p3          1751
p3_conf     1751
p3_dog      1751
dtype: int64

Test

In [66]:
img_clean[(img_clean.p1_dog == False) & (img_clean.p2_dog == False) & (img_clean.p3_dog == False)]
Out[66]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
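
The condition "not all three are False" is the same as "at least one is True", which may read more directly with any(axis=1). The sketch below uses a hypothetical toy_img frame and compares both masks.

```python
import pandas as pd

# Toy stand-in for the image predictions table (hypothetical values)
toy_img = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'p1_dog': [True, False, False],
    'p2_dog': [True, False, True],
    'p3_dog': [True, False, False],
})

dog_cols = ['p1_dog', 'p2_dog', 'p3_dog']

# True for rows where at least one of the three predictions is a dog
is_any_dog = toy_img[dog_cols].any(axis=1)
kept = toy_img[is_any_dog]

# The mask used in the cell above, written out for comparison
original_mask = ~((toy_img.p1_dog == False) &
                  (toy_img.p2_dog == False) &
                  (toy_img.p3_dog == False))
assert (is_any_dog == original_mask).all()
```

Both masks select the same rows. The any(axis=1) form just avoids the double negation.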

#7 api remove stray count column ['unnamed 0']

Define

  • When the api CSV was imported there was a stray column. We must drop it as it will cause issues later while merging

In [67]:
api.head(0)
Out[67]:
Unnamed: 0 tweet_id full_text fav_count retweet_count pet_name dog numerator denominator

Code

In [68]:
#Dropping Stray Column
api_clean.drop(columns=['Unnamed: 0'], inplace=True)

Test

In [69]:
api_clean.head()
Out[69]:
tweet_id full_text fav_count retweet_count pet_name dog numerator denominator
0 892420643555336193 This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas NaN 13 10
1 892177421306343426 This is Tilly. She's just checking pup on you.... 33105 6282 Tilly NaN 13 10
2 891815181378084864 This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie NaN 12 10
3 891689557279858688 This is Darla. She commenced a snooze mid meal... 42022 8670 Darla NaN 13 10
4 891327558926688256 This is Franklin. He would like you to stop ca... 40172 9422 Franklin NaN 12 10

Solving Tidiness Issues

#2. Merge into two Tables (get rid of api merge into archive dataset)

Define

As per the final project specification, we need to merge these two tables to eliminate redundant data.
These tables are

  • archive
  • api

Code

In [70]:
#Merge into two Tables (get rid of `api` merge into `archive` dataset)
#Saving new DF as var name 'final'

final = pd.merge(archive_clean,api_clean,how='right',on='tweet_id')

Test

In [71]:
print(final.columns)
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'full_text', 'fav_count', 'retweet_count', 'pet_name', 'dog',
       'numerator', 'denominator'],
      dtype='object')
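
One thing worth checking with how='right': every row of api_clean is kept, including tweet_ids that no longer exist in archive_clean, and those rows get NaN in all archive columns. The sketch below shows the effect on two hypothetical toy frames, using the indicator option of pd.merge to label where each row came from.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): tweet 2 was removed from the archive side
toy_archive_clean = pd.DataFrame({
    'tweet_id': [1, 3],
    'timestamp': ['2017-08-01', '2017-07-31'],
})
toy_api_clean = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'fav_count': [10, 0, 30],
})

# Right join keeps all api rows. indicator=True adds a '_merge' column
right = pd.merge(toy_archive_clean, toy_api_clean,
                 how='right', on='tweet_id', indicator=True)

# Inner join keeps only tweet_ids present in both tables
inner = pd.merge(toy_archive_clean, toy_api_clean,
                 how='inner', on='tweet_id')

# Rows that exist only on the api side have no archive data
unmatched_ids = right.loc[right['_merge'] == 'right_only', 'tweet_id'].tolist()
print(unmatched_ids)
```

If retweets removed from archive should stay removed in the merged table, an inner join is one way to get that. Which join is wanted here is a design choice for the notebook.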

#1. archive and api : Combine doggo, pupper etc into one column

Define

  • We Extracted this column directly in the gathering stage while working with the Twitter JSON
  • This value is named dog in the api table. We can replace the columns floofer, pupper, puppo and doggo with the single column dog
  • The remaining columns may be dropped.
  • See [T#1] for the operations done during gathering

Code

In [72]:
#Solving Tidiness Issue #1
#1. `archive` and `api` : Combine doggo, pupper etc into one column
final.drop(columns=['floofer','pupper','puppo','doggo'], inplace=True)

Test

In [73]:
final.head()
Out[73]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name full_text fav_count retweet_count pet_name dog numerator denominator
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13.0 10.0 Phineas This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas NaN 13 10
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13.0 10.0 Tilly This is Tilly. She's just checking pup on you.... 33105 6282 Tilly NaN 13 10
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12.0 10.0 Archie This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie NaN 12 10
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13.0 10.0 Darla This is Darla. She commenced a snooze mid meal... 42022 8670 Darla NaN 13 10
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12.0 10.0 Franklin This is Franklin. He would like you to stop ca... 40172 9422 Franklin NaN 12 10
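
In this notebook the dog column came from the gathering stage. For comparison, a minimal sketch of building such a column directly from the four stage columns is shown below. It assumes missing stages are stored as the string 'None', as in the archive output earlier. The frame toy and the helper combine_stages are hypothetical.

```python
import pandas as pd

# Toy stand-in for the four stage columns (hypothetical values)
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [value for value in row if value != 'None']
    return ', '.join(found) if found else None

# One tidy column replaces the four wide columns
toy['dog'] = toy[stage_cols].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_cols)
```

Joining with a comma keeps tweets that mention more than one stage instead of silently picking one.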

Other Misc Operations

Define

  • Dropping the name column, which was re-extracted earlier as pet_name [Q#2]
  • Dropping rating_numerator and rating_denominator [Q#1] [Q#3]
  • Dropping the repeated column text

Code

In [74]:
final.drop(columns=['name','rating_numerator','rating_denominator','text'], inplace=True) 
#[Q#2] Solved.
#[Q#1], [Q#3] Solved,

We need to clean the numerator column

Define

The extraction code we wrote produced some erroneous values. We will clean these manually.

In [75]:
#Checking which values should not be there
final.numerator.value_counts()
Out[75]:
12       553
11       463
10       461
13       346
 9       152
 8       100
 7        54
14        52
 5        33
 6        32
 3        18
 4        17
 1         9
 2         8
.9         3
20         3
.5         2
44         2
 0         2
15         2
75         2
\r\n9      2
60         2
ry         1
21         1
82         1
 w         1
-5         1
80         1
26         1
88         1
84         1
27         1
17         1
.8         1
45         1
;2         1
24         1
99         1
07         1
43         1
04         1
(8         1
st         1
66         1
50         1
\r\n5      1
65         1
76         1
Name: numerator, dtype: int64
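
Stray values such as '.9' or ';2' look like fragments cut from around the rating. One possible way to avoid them, sketched here as an assumption rather than as the extraction code used in the gathering stage, is a regular expression that captures the full number on each side of the slash. The texts in toy_text are invented for illustration.

```python
import pandas as pd

# Invented example texts, not real tweets
toy_text = pd.Series([
    "Toy tweet with a decimal rating 13.5/10",
    "Toy tweet sent w/out a dog in it 3/10",
    "Toy tweet with no rating at all",
], name='full_text')

# Group 1: digits with an optional decimal part. Group 2: digits after the slash
rating = toy_text.str.extract(r'(\d+(?:\.\d+)?)/(\d+)')
rating.columns = ['numerator', 'denominator']
print(rating)
```

The pattern matches the first digits/digits pair in the text, so something like a date would also match. The extracted values still need a check such as value_counts().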

Code

In [76]:
#Making a list of which rows have errors - using value_counts() for reference.
numerator_error = []
numerator_error.append(final[final.numerator == '.9']['tweet_id'])
numerator_error.append(final[final.numerator == '.5']['tweet_id'])
numerator_error.append(final[final.numerator == '\r\n9']['tweet_id'])
numerator_error.append(final[final.numerator == 'ry']['tweet_id'])
numerator_error.append(final[final.numerator == '.8']['tweet_id'])
numerator_error.append(final[final.numerator == '.9']['tweet_id'])
numerator_error.append(final[final.numerator == ';2']['tweet_id'])
numerator_error.append(final[final.numerator == '(8']['tweet_id'])
numerator_error.append(final[final.numerator == 'st']['tweet_id'])
numerator_error.append(final[final.numerator == '\r\n5']['tweet_id'])
numerator_error.append(final[final.numerator == '-5']['tweet_id'])
numerator_error.append(final[final.numerator == ' w']['tweet_id'])
numerator_error
Out[76]:
[847     746369468511756288
 1192    702217446468493312
 1430    685532292383666176
 Name: tweet_id, dtype: int64, 42      883482846933004288
 1509    681340665377193984
 Name: tweet_id, dtype: int64, 1492    682389078323662849
 2082    667538891197542400
 Name: tweet_id, dtype: int64, 2329    760153949710192640
 Name: tweet_id, dtype: int64, 1255    697259378236399616
 Name: tweet_id, dtype: int64, 847     746369468511756288
 1192    702217446468493312
 1430    685532292383666176
 Name: tweet_id, dtype: int64, 2066    667878741721415682
 Name: tweet_id, dtype: int64, 1473    683462770029932544
 Name: tweet_id, dtype: int64, 1635    676613908052996102
 Name: tweet_id, dtype: int64, 1912    670782429121134593
 Name: tweet_id, dtype: int64, 2343    667550882905632768
 Name: tweet_id, dtype: int64, 1069    711306686208872448
 Name: tweet_id, dtype: int64]
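
Instead of listing each bad value by hand, the suspect rows could be flagged in one step with a pattern that describes a valid entry. The sketch below assumes a valid numerator is an optional single leading space followed only by digits, which matches the values accepted above. The frame toy_final is a hypothetical stand-in.

```python
import pandas as pd

# Toy stand-in for the merged table (hypothetical tweet_ids)
toy_final = pd.DataFrame({
    'tweet_id': [11, 12, 13, 14, 15],
    'numerator': ['12', ' 9', '.9', 'ry', '\r\n5'],
})

# Valid entries: optional single space, then digits up to the end of the string
is_clean = toy_final.numerator.astype(str).str.match(r' ?\d+\Z')

# Everything else needs a manual look at its full_text
suspect = toy_final[~is_clean]
print(suspect)
```

This avoids both missed values and repeated lookups, since each row is tested exactly once.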

Here we start to Manually Clean the data.

Steps

  • First we find the rows with erroneous values and read each row's full_text
  • Using our judgement, we manually change the values
  • Although this is time consuming, it is doable since there are only a few errors.
In [77]:
final[final.numerator == '.9'].full_text
final.loc[[847,1192,1430],'numerator'] = 9
In [78]:
final[final.numerator == '.5']
Out[78]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
42 883482846933004288 NaN NaN 2017-07-08 00:28:19 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/883482846... This is Bella. She hopes her smile made you sm... 45778 9964 Bella NaN .5 10
1509 681340665377193984 6.813394e+17 4.196984e+09 2015-12-28 05:07:27 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN NaN I've been told there's a slight possibility he... 1748 303 mirror NaN .5 10
In [79]:
final.loc[42].full_text
final.loc[[42],'numerator'] = 13.5
In [80]:
final.loc[1509].full_text
final.loc[[1509],'numerator'] = 9.5
In [81]:
final[final.numerator == '\r\n9']
Out[81]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1492 682389078323662849 NaN NaN 2015-12-31 02:33:29 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/682389078... Meet Brody. He's a Downton Abbey Falsetto. Add... 1780 509 Brody NaN \r\n9 10
2082 667538891197542400 NaN NaN 2015-11-20 03:04:08 +0000 <a href="http://twitter.com" rel="nofollow">Tw... NaN NaN NaN https://twitter.com/dog_rates/status/667538891... This is a southwest Coriander named Klint. Hat... 208 69 Klint NaN \r\n9 10
In [82]:
final.loc[1492].full_text
final.loc[[1492,2082],'numerator'] = 9
In [83]:
final[final.numerator == 'ry']
Out[83]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
2329 760153949710192640 NaN NaN NaN NaN NaN NaN NaN NaN RT @hownottodraw: The story/person behind @dog... 0 36 af NaN ry pe
In [84]:
final.loc[2329].full_text
final.loc[[2329],'numerator'] = 11
In [85]:
final[final.numerator == '.8']
Out[85]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1255 697259378236399616 NaN NaN 2016-02-10 03:22:44 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/697259378... Please stop sending in saber-toothed tigers. T... 3500 1086 tigers NaN .8 10
In [86]:
final.loc[1255].full_text
final.loc[[1255],'numerator'] = 8
In [87]:
final[final.numerator == '.9']
Out[87]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
In [88]:
final[final.numerator == ';2']
Out[88]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
2066 667878741721415682 NaN NaN 2015-11-21 01:34:35 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/667878741... This is Tedrick. He lives on the edge. Needs s... 403 123 Tedrick NaN ;2 10
In [89]:
final.loc[2066].full_text
final.loc[[2066],'numerator'] = 2
In [90]:
final[final.numerator == '(8']
Out[90]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1473 683462770029932544 NaN NaN 2016-01-03 01:39:57 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/683462770... "Hello forest pupper I am house pupper welcome... 2598 729 https://t pupper (8 10
In [91]:
final.loc[1473].full_text
final.loc[[1473],'numerator'] = 8
In [92]:
final[final.numerator == 'st']
Out[92]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1635 676613908052996102 NaN NaN 2015-12-15 04:05:01 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/676613908... This is the saddest/sweetest/best picture I've... 1147 209 sent NaN st sw
In [93]:
final.loc[1635].full_text
final.loc[[1635],'numerator'] = 12
In [94]:
final[final.numerator == '\r\n5']
Out[94]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1912 670782429121134593 NaN NaN 2015-11-29 01:52:48 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/670782429... This dude slaps your girl's ass what do you do... 1627 820 https://t NaN \r\n5 10
In [95]:
final.loc[1912].full_text
final.loc[[1912],'numerator'] = 5
In [96]:
final[final.numerator == '-5']
Out[96]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
2343 667550882905632768 NaN NaN NaN NaN NaN NaN NaN NaN RT @dogratingrating: Unoriginal idea. Blatant ... 0 33 idea NaN -5 10
In [97]:
final[final.numerator == ' w']
Out[97]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1069 711306686208872448 NaN NaN 2016-03-19 21:41:44 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/711306686... What hooligan sent in pictures w/out a dog in ... 3516 795 af NaN w ou
In [98]:
final.loc[1069].full_text
final.loc[[1069],'numerator'] = 3

Test

  • All stray string values eliminated (the -5 row was inspected but left unchanged)
In [99]:
print(final.numerator.value_counts())
12      553
11      463
10      461
13      346
 9      152
 8      100
 7       54
14       52
 5       33
 6       32
 3       18
 4       17
 1        9
 2        8
9         5
20        3
44        2
75        2
 0        2
8         2
60        2
15        2
82        1
26        1
3         1
13.5      1
-5        1
5         1
17        1
11        1
76        1
21        1
12        1
80        1
88        1
84        1
27        1
45        1
65        1
24        1
99        1
50        1
66        1
9.5       1
04        1
43        1
07        1
2         1
Name: numerator, dtype: int64
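
The counts above list some ratings twice (for example 9 appears both with 152 and with 5) because the column now mixes the original strings with the numbers assigned during manual cleaning. A minimal sketch of one way to unify them, on a hypothetical toy Series, is to strip whitespace and convert everything to float.

```python
import pandas as pd

# Toy stand-in: a mix of strings and numbers, as in the cleaned column
toy = pd.Series(['12', ' 9', 9, 13.5, '07'], name='numerator')

# Normalise to text, strip padding, then convert to a numeric dtype
as_float = toy.astype(str).str.strip().astype(float)

# Equal ratings are now counted together
counts = as_float.value_counts()
print(counts)
```

The conversion would fail on any remaining non-numeric string, which also makes it a useful final check.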

We need to clean the denominator column as well

Define

The extraction code we wrote produced some erroneous values. We will clean these manually.
We need to eliminate the string values.

Code

In [100]:
final.denominator.value_counts()
Out[100]:
10    2319
11       3
50       3
15       2
80       2
20       2
sw       1
2        1
17       1
16       1
90       1
7        1
12       1
40       1
ou       1
pe       1
13       1
00       1
70       1
Name: denominator, dtype: int64
In [101]:
final[final.denominator == 'pe']
Out[101]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
2329 760153949710192640 NaN NaN NaN NaN NaN NaN NaN NaN RT @hownottodraw: The story/person behind @dog... 0 36 af NaN 11 pe
In [102]:
final.loc[2329].full_text
final.loc[[2329],'denominator'] = 10
In [103]:
final[final.denominator == 'ou']
Out[103]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1069 711306686208872448 NaN NaN 2016-03-19 21:41:44 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/711306686... What hooligan sent in pictures w/out a dog in ... 3516 795 af NaN 3 ou
In [104]:
final.loc[1069].full_text
final.loc[[1069],'denominator'] = 10
In [105]:
final[final.denominator == 'sw']
Out[105]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
1635 676613908052996102 NaN NaN 2015-12-15 04:05:01 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/676613908... This is the saddest/sweetest/best picture I've... 1147 209 sent NaN 12 sw
In [106]:
final.loc[1635].full_text
final.loc[[1635],'denominator'] = 10

Test

  • All String Values eliminated
In [107]:
final.denominator.value_counts()
Out[107]:
10    2319
50       3
11       3
10       3
20       2
15       2
80       2
13       1
12       1
40       1
00       1
7        1
90       1
16       1
17       1
2        1
70       1
Name: denominator, dtype: int64
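
The value 10 is listed twice in the output above. value_counts() typically shows this when a column holds both the string '10' and the integer 10, and the three rows fixed by hand were assigned the integer. A minimal sketch of one way to unify the column, using made-up toy values in place of `final.denominator`:

```python
import pandas as pd

# Toy column mixing strings with integers assigned by hand (illustrative values)
denominator = pd.Series(['10', '10', '50', 10, 10], dtype=object)
print(denominator.value_counts())   # '10' and 10 are counted separately

# Convert everything to a numeric dtype so equal values are grouped together
denominator_clean = pd.to_numeric(denominator)
print(denominator_clean.value_counts())
```

After the conversion the column has a numeric dtype, so the test for remaining string values can be made on the dtype rather than by eye.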

Test

In [ ]:
final.columns

Final Code to Save

RUN LAST AND UNCOMMENT

In [ ]:
#Save Files to CSV

#final.to_csv('data/final/twitter_archive_master.csv')
#img_clean.to_csv('data/final/image_predictions.csv')
#print("Saved Successfully")

Data Wrangling (Visualizations)

In [108]:
#Import Statements
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
In [109]:
#importing Final Data found in Data/final Folder

archive = pd.read_csv('data/final/twitter_archive_master.csv',index_col=0)
img = pd.read_csv('data/final/image_predictions.csv', index_col=0)

print("Import Successful")
Import Successful

1. Analysing fav_count from archive

In [110]:
fav_mean = archive.fav_count.mean()
fav_median = archive.fav_count.median()
fav_max = archive.fav_count.max()
fav_sum = archive.fav_count.sum()
In [111]:
archive.fav_count.count()
Out[111]:
2344
In [112]:
print("Mean Favourite Value is : {}".format(fav_mean))
print("Median Favourite Value is : {}".format(fav_median))
print("Max Favourite Value for a tweet is : {}".format(fav_max))
print("Total Favourites secured for All Tweets : {}".format(fav_sum))
Mean Favourite Value is : 8031.052901023891
Median Favourite Value is : 3520.5
Max Favourite Value for a tweet is : 142720
Total Favourites secured for All Tweets : 18824788
In [113]:
fav_plt = archive.fav_count.hist(alpha=0.8,figsize=(8,8))
plt.xlabel("Fav Counts");
plt.ylabel("Count of Tweets");
plt.title("Favorited Tweets");
plt.savefig('Docs/Viz/1.png');
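
The four statistics above can also be gathered in a single call. This is a minimal sketch, assuming a toy Series in place of `archive.fav_count` (the numbers are made up):

```python
import pandas as pd

# Toy stand-in for archive.fav_count (illustrative values)
fav_count = pd.Series([0, 120, 3500, 3541, 38625])

# One call returns all four statistics as a labelled Series
summary = fav_count.agg(['mean', 'median', 'max', 'sum'])
print(summary)

# A mean well above the median suggests a right-skewed distribution
is_right_skewed = summary['mean'] > summary['median']
```

The same comparison applies to the real data, where the mean of about 8031 sits well above the median of 3520.5.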

2. Analysing retweet_count from archive

In [114]:
archive
Out[114]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls full_text fav_count retweet_count pet_name dog numerator denominator
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... This is Phineas. He's a mystical boy. Only eve... 38625 8541 Phineas NaN 13.0 10
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... This is Tilly. She's just checking pup on you.... 33105 6282 Tilly NaN 13.0 10
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... This is Archie. He is a rare Norwegian Pouncin... 24922 4161 Archie NaN 12.0 10
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... This is Darla. She commenced a snooze mid meal... 42022 8670 Darla NaN 13.0 10
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... This is Franklin. He would like you to stop ca... 40172 9422 Franklin NaN 12.0 10
5 891087950875897856 NaN NaN 2017-07-29 00:08:17 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/891087950... Here we have a majestic great white breaching ... 20140 3118 coast NaN 13.0 10
6 890971913173991426 NaN NaN 2017-07-28 16:27:12 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://gofundme.com/ydvmve-surgery-for-jax,ht... Meet Jax. He enjoys ice cream so much he gets ... 11807 2076 Jax NaN 13.0 10
7 890729181411237888 NaN NaN 2017-07-28 00:22:40 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/890729181... When you watch your owner call another dog a g... 65260 18929 boy NaN 13.0 10
8 890609185150312448 NaN NaN 2017-07-27 16:25:51 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/890609185... This is Zoey. She doesn't want to be one of th... 27682 4275 Zoey NaN 13.0 10
9 890240255349198849 NaN NaN 2017-07-26 15:59:51 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/890240255... This is Cassie. She is a college pup. Studying... 31823 7434 Cassie doggo 14.0 10
10 890006608113172480 NaN NaN 2017-07-26 00:31:25 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/890006608... This is Koda. He is a South Australian decksha... 30563 7353 Koda NaN 13.0 10
11 889880896479866881 NaN NaN 2017-07-25 16:11:53 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/889880896... This is Bruno. He is a service shark. Only get... 27679 4980 Bruno NaN 13.0 10
12 889665388333682689 NaN NaN 2017-07-25 01:55:32 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/889665388... Here's a puppo that seems to be on the fence a... 47935 10083 her puppo 13.0 10
13 889638837579907072 NaN NaN 2017-07-25 00:10:02 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/889638837... This is Ted. He does his best. Sometimes that'... 27070 4551 Ted NaN 12.0 10
14 889531135344209921 NaN NaN 2017-07-24 17:02:04 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/889531135... This is Stuart. He's sporting his favorite fan... 15040 2241 Stuart puppo 13.0 10
15 889278841981685760 NaN NaN 2017-07-24 00:19:32 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/889278841... This is Oliver. You're witnessing one of his m... 25190 5430 Oliver NaN 13.0 10
16 888917238123831296 NaN NaN 2017-07-23 00:22:39 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/888917238... This is Jim. He found a fren. Taught him how t... 28972 4508 Jim NaN 12.0 10
17 888804989199671297 NaN NaN 2017-07-22 16:56:37 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/888804989... This is Zeke. He has a new stick. Very proud o... 25482 4353 Zeke NaN 13.0 10
18 888554962724278272 NaN NaN 2017-07-22 00:23:06 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/888554962... This is Ralphus. He's powering up. Attempting ... 19824 3588 Ralphus NaN 13.0 10
19 888078434458587136 NaN NaN 2017-07-20 16:49:33 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/888078434... This is Gerald. He was just told he didn't get... 21677 3500 Gerald NaN 12.0 10
20 887705289381826560 NaN NaN 2017-07-19 16:06:48 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/887705289... This is Jeffrey. He has a monopoly on the pool... 30067 5405 Jeffrey NaN 13.0 10
21 887517139158093824 NaN NaN 2017-07-19 03:39:09 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/887517139... I've yet to rate a Venezuelan Hover Wiener. Th... 46065 11693 Wiener NaN 14.0 10
22 887473957103951883 NaN NaN 2017-07-19 00:47:34 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/887473957... This is Canela. She attempted some fancy porch... 68851 18259 Canela NaN 13.0 10
23 887343217045368832 NaN NaN 2017-07-18 16:08:03 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/887343217... You may not have known you needed to see this ... 33548 10422 today NaN 13.0 10
24 887101392804085760 NaN NaN 2017-07-18 00:07:08 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/887101392... This... is a Jubilant Antarctic House Bear. We... 30425 5975 This NaN 12.0 10
25 886983233522544640 NaN NaN 2017-07-17 16:17:36 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/886983233... This is Maya. She's very shy. Rarely leaves he... 35026 7791 Maya NaN 13.0 10
26 886736880519319552 NaN NaN 2017-07-16 23:58:41 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://www.gofundme.com/mingusneedsus,https:/... This is Mingus. He's a wonderful father to his... 12015 3302 Mingus NaN 13.0 10
27 886680336477933568 NaN NaN 2017-07-16 20:14:00 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/886680336... This is Derek. He's late for a dog meeting. 13... 22325 4477 Derek NaN 13.0 10
28 886366144734445568 NaN NaN 2017-07-15 23:25:31 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN https://twitter.com/dog_rates/status/886366144... This is Roscoe. Another pupper fallen victim t... 21112 3203 Roscoe pupper 12.0 10
29 886267009285017600 8.862664e+17 2.281182e+09 2017-07-15 16:51:35 +0000 <a href="http://twitter.com/download/iphone" r... NaN NaN NaN NaN @NonWhiteHat @MayhewMayhem omg hello tanner yo... 116 4 caution NaN 12.0 10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2314 775898661951791106 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: Like father (doggo), like son (... 0 17240 (pupper) doggo 12.0 10
2315 773336787167145985 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: Meet Fizz. She thinks love is a... 0 5675 Fizz NaN 11.0 10
2316 772615324260794368 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is Gromit. He's pupset bec... 0 3743 Gromit NaN 10.0 10
2317 771171053431250945 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is Frankie. He's wearing b... 0 8399 Frankie NaN 11.0 10
2318 771004394259247104 NaN NaN NaN NaN NaN NaN NaN NaN RT @katieornah: @dog_rates learning a lot at c... 0 245 https://t pupper 12.0 10
2319 770743923962707968 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: Here's a doggo blowing bubbles.... 0 50572 bubbles doggo 13.0 10
2320 770093767776997377 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is just downright precious... 0 3374 af doggo 12.0 10
2321 769335591808995329 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: Ever seen a dog pet another dog... 0 8523 scene NaN 13.0 10
2322 768909767477751808 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: When it's Janet from accounting... 0 3004 chocolate pupper 10.0 10
2323 768554158521745409 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is Nollie. She's waving at... 0 6462 Nollie NaN 12.0 10
2324 766864461642756096 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: We only rate dogs... this is a ... 0 6284 dogs NaN 10.0 10
2325 766078092750233600 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is Colby. He's currently r... 0 2873 Colby NaN 12.0 10
2326 763167063695355904 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: Meet Eve. She's a raging alcoho... 0 3348 Eve pupper 8.0 10
2327 761750502866649088 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: "Tristan do not speak to me wit... 0 4373 Xbox NaN 10.0 10
2328 761371037149827077 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: Oh. My. God. 13/10 magical af h... 0 19821 Oh NaN 13.0 10
2329 760153949710192640 NaN NaN NaN NaN NaN NaN NaN NaN RT @hownottodraw: The story/person behind @dog... 0 36 af NaN 11.0 10
2330 759566828574212096 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This... is a Tyrannosaurus rex.... 0 23398 This NaN 10.0 10
2331 759159934323924993 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: AT DAWN...\r\r\nWE RIDE\r\r\n\r... 0 1297 DAWN NaN 11.0 10
2332 757729163776290825 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is Chompsky. He lives up t... 0 8952 Chompsky NaN 11.0 10
2333 757597904299253760 NaN NaN NaN NaN NaN NaN NaN NaN RT @jon_hill987: @dog_rates There is a cunning... 0 322 least pupper 11.0 10
2334 754874841593970688 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is Rubio. He has too much ... 0 8840 Rubio NaN 11.0 10
2335 753298634498793472 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is Carly. She's actually 2... 0 6368 Carly NaN 12.0 10
2336 752701944171524096 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: HEY PUP WHAT'S THE PART OF THE ... 0 3178 https://t NaN 11.0 10
2337 752309394570878976 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: Everyone needs to watch this. 1... 0 18361 this NaN 13.0 10
2338 747242308580548608 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This pupper killed this great w... 0 3158 battle pupper 13.0 10
2339 746521445350707200 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: This is Shaggy. He knows exactl... 0 1076 Shaggy NaN 10.0 10
2340 743835915802583040 NaN NaN NaN NaN NaN NaN NaN NaN RT @dog_rates: Extremely intelligent dog here.... 0 2289 here NaN 10.0 10
2341 711998809858043904 NaN NaN NaN NaN NaN NaN NaN NaN RT @twitter: @dog_rates Awesome Tweet! 12/10. ... 0 136 12/10 NaN 12.0 10
2342 667550904950915073 NaN NaN NaN NaN NaN NaN NaN NaN RT @dogratingrating: Exceptional talent. Origi... 0 35 talent NaN 12.0 10
2343 667550882905632768 NaN NaN NaN NaN NaN NaN NaN NaN RT @dogratingrating: Unoriginal idea. Blatant ... 0 33 idea NaN -5.0 10

2344 rows × 16 columns

In [115]:
# Retweet counts sorted from highest to lowest
retweets_desc = np.sort(archive.retweet_count)[::-1]
In [116]:
retweets_desc
Out[116]:
array([76934, 60743, 50572, ...,     2,     2,     0], dtype=int64)
In [117]:
retweet_mean = archive.retweet_count.mean()
retweet_median = archive.retweet_count.median()
retweet_max = archive.retweet_count.max()
retweet_sum = archive.retweet_count.sum()
In [118]:
archive.retweet_count.hist(alpha=0.8,figsize=(8,8),color = "green")
plt.xlabel("Retweet Counts");
plt.ylabel("Count of Tweets");
plt.title("Re-Tweeted Tweets");
plt.savefig('Docs/Viz/2.png');
In [119]:
print("Mean Retweets Value is : {}".format(retweet_mean))
print("Median Retweets Value is : {}".format(retweet_median))
print("Max Retweets Value for a tweet is : {}".format(retweet_max))
print("Total Retweets secured for All Tweets : {}".format(retweet_sum))
Mean Retweets Value is : 3006.8877986348125
Median Retweets Value is : 1400.5
Max Retweets Value for a tweet is : 76934
Total Retweets secured for All Tweets : 7048145

3. Analysing Dog Stages from archive

In [120]:
pie = archive.dog.value_counts()
pie.plot(kind="pie");
plt.savefig('Docs/Viz/3.png');
In [121]:
dog_val = archive.dog.value_counts()

# value_counts() sorts in descending order, so positions 0-3 hold pupper, doggo, puppo and floofer
name_sum = dog_val.iloc[0]+dog_val.iloc[1]+dog_val.iloc[2]+dog_val.iloc[3]
pupper_per = (dog_val.iloc[0]/name_sum)*100
doggo_per = (dog_val.iloc[1]/name_sum)*100
puppo_per = (dog_val.iloc[2]/name_sum)*100
floofer_per = (dog_val.iloc[3]/name_sum)*100

print("The percentage of Pupper among dogs with a stage is {}%".format(pupper_per))
print("The percentage of Doggo among dogs with a stage is {}%".format(doggo_per))
print("The percentage of Puppo among dogs with a stage is {}%".format(puppo_per))
print("The percentage of Floofer among dogs with a stage is {}%".format(floofer_per))
The percentage of Pupper among dogs with a stage is 65.57788944723619%
The percentage of Doggo among dogs with a stage is 24.623115577889447%
The percentage of Puppo among dogs with a stage is 8.793969849246231%
The percentage of Floofer among dogs with a stage is 1.0050251256281406%
In [122]:
dog_val.sum()
Out[122]:
398
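
The percentages above can also be obtained without indexing each position by hand. A minimal sketch, assuming a toy Series in place of `archive.dog` (the values are made up):

```python
import pandas as pd

# Toy stand-in for archive.dog (None means no stage was recorded)
dog = pd.Series(['pupper', 'doggo', 'pupper', None, 'puppo', 'pupper', None])

# normalize=True returns fractions of the non-missing values;
# multiply by 100 to get percentages
stage_pct = dog.value_counts(normalize=True) * 100
print(stage_pct.round(2))
```

Because missing values are dropped by default, the percentages are relative to the tweets that have a stage, which matches the 398-row total used above.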

4. Analysing numerators & denominators from archive

In [123]:
num = archive.numerator
dom = archive.denominator

num_mean = num.mean()
num_median = num.median()
num_max = archive.numerator.max()

dom_mean = dom.mean()
dom_median = dom.median()
In [124]:
print("The mean value of all the numerator of the ratings given is : {}".format(num_mean))
print("The median value of all the numerator of the ratings given is : {}".format(num_median))
print("The mean value of all the denominators of the ratings given is : {}".format(dom_mean))
print("The median value of all the denominators of the ratings given is : {}".format(dom_median))
print("".format())
The mean value of all the numerator of the ratings given is : 11.125853242320819
The median value of all the numerator of the ratings given is : 11.0
The mean value of all the denominators of the ratings given is : 10.196245733788396
The median value of all the denominators of the ratings given is : 10.0

In [125]:
num_plt = num.plot(figsize=(10,10), kind='hist', color="#ff9960");
plt.ylabel("Tweets")
plt.xlabel("Numerator Values")
plt.title("Numerator Histogram");
# num_plt.axes.get_yaxis().set_visible(False)

plt.savefig('Docs/Viz/4.png');
In [126]:
print("The Maximum Rating Numerator given is {}".format(num_max))
The Maximum Rating Numerator given is 99.0
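
Since not every denominator is 10 (the mean above is about 10.2), numerators alone are not strictly comparable across tweets. One possible way to put them on a common scale, which is a suggestion and not part of the analysis above, is to divide each numerator by its denominator. The toy `ratings` frame and the `rating_out_of_10` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings (illustrative values, not taken from the archive)
ratings = pd.DataFrame({'numerator': [13.0, 12.0, 84.0, 5.0],
                        'denominator': [10, 10, 70, 0]})

# Treat a zero denominator as missing so the division does not produce inf
safe_denominator = ratings.denominator.replace(0, float('nan'))

# Rescale every rating to an "out of 10" value
ratings['rating_out_of_10'] = ratings.numerator / safe_denominator * 10
print(ratings)
```

The guard against zero matters here because a denominator of 00 appears once in the cleaned counts.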

5. Analysing posting habits from archive

In [127]:
time_dt = pd.to_datetime(archive.timestamp).dt.date
time_hr = pd.to_datetime(archive.timestamp).dt.hour
time_yr = pd.to_datetime(archive.timestamp).dt.year
In [128]:
time_hr.value_counts()
time_hr.plot(figsize=(8,6), kind='hist');
plt.title("Hourly Posting Graph");
plt.savefig('Docs/Viz/5.png');
In [129]:
time_dt = time_dt.value_counts()
time_dt.plot(figsize=(15,10));
plt.title("Daily Posting Graph");
plt.savefig('Docs/Viz/6.png');
In [130]:
time_dt.value_counts()
Out[130]:
2     174
1     138
3     119
4      53
5      24
7      22
6      20
8       6
10      6
21      4
18      4
17      4
9       4
16      3
14      3
13      3
26      2
11      2
15      2
20      2
25      1
23      1
24      1
12      1
Name: timestamp, dtype: int64
In [131]:
time_yr.value_counts().plot(kind='bar');
plt.title("Yearly Graph Figure");
plt.savefig('Docs/Viz/7.png');
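
The timestamp column only needs to be parsed once, and counting tweets per hour directly avoids the default 10-bin histogram merging neighbouring hours. A minimal sketch, assuming a toy Series in place of `archive.timestamp` (retweet rows in the archive have no timestamp, which the None entry imitates):

```python
import pandas as pd

# Toy stand-in for archive.timestamp (illustrative values)
timestamp = pd.Series(['2017-08-01 16:23:56 +0000',
                       '2017-07-31 00:18:03 +0000',
                       None,
                       '2015-12-15 04:05:01 +0000'])

# Parse once and reuse; missing or unparseable entries become NaT
ts = pd.to_datetime(timestamp, errors='coerce', utc=True)

# Count tweets per hour of day and per year, ordered by hour/year
tweets_per_hour = ts.dt.hour.value_counts().sort_index()
tweets_per_year = ts.dt.year.value_counts().sort_index()
print(tweets_per_hour)
print(tweets_per_year)
```

Either result can be passed to `.plot(kind='bar')` to get one bar per hour or per year.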

Analyzing the img DataSet

In [132]:
#Viewing The DataSet
img.head()
Out[132]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh springer spaniel 0.465074 True collie 0.156665 True Shetland sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature pinscher 0.074192 True Rhodesian ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian ridgeback 0.408143 True redbone 0.360687 True miniature pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

6. Calculating the mean number of images uploaded per post from img

In [133]:
#Calculating the mean and median number of images uploaded per post
img_uploaded_mean = img.img_num.mean()
img_uploaded_median = img.img_num.median()
In [134]:
print("The average number of pictures uploaded per tweet is : {}".format(img_uploaded_mean))
print("The median number of pictures uploaded per tweet is : {}".format(img_uploaded_median))
The average number of pictures uploaded per tweet is : 1.214734437464306
The median number of pictures uploaded per tweet is : 1.0

Neural Network Analysis

  • We can gauge the efficiency of the algorithm by looking at its prediction strength.
  • This is a confidence value from 0.00 to 1.
  • Each stage of the neural network analysis (p1, p2, p3) has a confidence value (p1_conf, p2_conf, p3_conf) and a hit ratio, and these tell us a lot about how the network works.
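
The per-stage quantities described above can be collected in one loop. This is a minimal sketch, assuming a toy DataFrame `img_toy` with the same column names as `img` (the values are made up) and hypothetical output column names `mean_conf` and `dog_pct`:

```python
import pandas as pd

# Toy stand-in for the img DataFrame (illustrative values)
img_toy = pd.DataFrame({
    'p1_conf': [0.60, 0.50, 0.90], 'p1_dog': [True, True, False],
    'p2_conf': [0.20, 0.10, 0.05], 'p2_dog': [True, False, False],
    'p3_conf': [0.10, 0.05, 0.01], 'p3_dog': [True, True, True],
})

rows = {}
for stage in ('p1', 'p2', 'p3'):
    rows[stage] = {
        # average confidence reported for this stage
        'mean_conf': img_toy[stage + '_conf'].mean(),
        # the mean of a boolean column is the share of True values
        'dog_pct': img_toy[stage + '_dog'].mean() * 100,
    }

# One row per stage, one column per statistic
stage_summary = pd.DataFrame(rows).T
print(stage_summary)
```

Keeping both numbers side by side makes it easier to compare the stages than reading separate print statements.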

7. Finding the efficiency of the prediction in p1_conf, p2_conf & p3_conf

In [135]:
# Calculating the mean values of the prediction confidence variables
Cp1_mean = img.p1_conf.mean()
Cp2_mean = img.p2_conf.mean()
Cp3_mean = img.p3_conf.mean()
In [136]:
print("Calculating the efficiency of the neural network on the different stages p1, p2, p3")
print("The average efficiency of stage P1 is {}".format(Cp1_mean))
print("The average efficiency of stage P2 is {}".format(Cp2_mean))
print("The average efficiency of stage P3 is {}".format(Cp3_mean))
Calculating the efficiency of the neural network on the different stages p1, p2, p3
The average efficiency of stage P1 is 0.6042066042261568
The average efficiency of stage P2 is 0.1377151198694462
The average efficiency of stage P3 is 0.06161188353832655

8. Finding the hit rate of the neural network through the different stages

In [137]:
#Using counts to get the True/False values of the data
prediction_p1 = img.p1_dog.value_counts()
prediction_p2 = img.p2_dog.value_counts()
prediction_p3 = img.p3_dog.value_counts()

#Finding the percentage of True values in each, looked up by label rather than by position
p1_per = prediction_p1.loc[True] / prediction_p1.sum() * 100
p2_per = prediction_p2.loc[True] / prediction_p2.sum() * 100
p3_per = prediction_p3.loc[True] / prediction_p3.sum() * 100
In [138]:
#Printing the percentages found above.
print("P1 Stage Success Hit Rate is {} %".format(p1_per))
print("P2 Stage Success Hit Rate is {} %".format(p2_per))
print("P3 Stage Success Hit Rate is {} %".format(p3_per))
P1 Stage Success Hit Rate is 87.49286122215877 %
P2 Stage Success Hit Rate is 88.692175899486 %
P3 Stage Success Hit Rate is 85.6082238720731 %
In [139]:
#Analyzing which dogs are the most popular through the different stages of the neural network.
d_p1 = img.p1.value_counts()
d_p2 = img.p2.value_counts()
d_p3 = img.p3.value_counts()
In [140]:
#Dumping the data to read and analyze
print("Finding out the most popular Dogs for Each Stage \n")
#print("P1")
print("The Top Popular Dogs for Stage P1 Are :\n{}  \n".format(d_p1.head()))
#print("P2")
print("The Top Popular Dogs for Stage P2 Are :\n{}  \n".format(d_p2.head()))
#print("P3")
print("The Top Popular Dogs for Stage P3 Are :\n{}  \n".format(d_p3.head()))
Finding out the most popular Dogs for Each Stage 

The Top Popular Dogs for Stage P1 Are :
golden retriever      150
Labrador retriever    100
Pembroke               89
Chihuahua              83
pug                    57
Name: p1, dtype: int64  

The Top Popular Dogs for Stage P2 Are :
Labrador retriever    104
golden retriever       92
Cardigan               73
Chihuahua              44
Pomeranian             42
Name: p2, dtype: int64  

The Top Popular Dogs for Stage P3 Are :
Labrador retriever    79
Chihuahua             58
golden retriever      48
Eskimo dog            38
kelpie                35
Name: p3, dtype: int64  

In [141]:
# Merging all the data into one Series so I can get a better picture for joint analysis
all_dogs = pd.concat([d_p1, d_p2, d_p3])
d_all = all_dogs.groupby(all_dogs.index).sum()
d_all = d_all.sort_values(ascending=False)
print("The Top 10 Dogs overall through all the stages in our DataSet are \n\n{}\n".format(d_all.head(10)))
The Top 10 Dogs overall through all the stages in our DataSet are 

golden retriever      290
Labrador retriever    283
Chihuahua             185
Pembroke              143
Cardigan              115
Pomeranian            109
toy poodle            105
pug                    97
chow                   96
cocker spaniel         95
dtype: int64
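
The combined ranking above can also be produced by stacking the three prediction columns first and counting once. A minimal sketch, assuming a toy DataFrame `img_toy` with the same p1, p2 and p3 columns as `img` (the values are made up):

```python
import pandas as pd

# Toy stand-in for the img DataFrame (illustrative values)
img_toy = pd.DataFrame({
    'p1': ['golden retriever', 'pug', 'Chihuahua'],
    'p2': ['Labrador retriever', 'golden retriever', 'pug'],
    'p3': ['Chihuahua', 'golden retriever', 'kelpie'],
})

# Stack the three prediction columns into one Series, then count once;
# value_counts() already returns the result sorted in descending order
overall = pd.concat([img_toy.p1, img_toy.p2, img_toy.p3]).value_counts()
print(overall.head(10))
```

This should give the same totals as concatenating the per-stage counts and summing by label, with one step fewer.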

Thank You for Reading :)