Introduction to APIs with Python
rpi.analyticsdojo.com
11. Introduction to APIs
This is adapted from Mining the Social Web, 2nd Edition. Copyright (c) 2013, Matthew A. Russell. All rights reserved.
This work is licensed under the Simplified BSD License.
11.1. Before you Begin #1
If you are working locally or on Colab, this exercise requires the twitter package and the ruamel.yaml package. YAML files are structured text files that are useful for storing configuration.
!pip install twitter ruamel.yaml
#See if it worked by importing the twitter package and some other packages we will use.
from twitter import *
import datetime, traceback
import json
import time
import sys
11.2. Before you Begin #2
Download the sample screen names file and the twitlab helper module.
!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/04-viz-api-scraper/screen_names.csv && wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/04-viz-api-scraper/twitlab.py
11.3. Download the Authorization File (see Webex Teams for the file)
11.4. Step 1. Loading Authorization Data
Here we are going to store the authorization data in a .yaml file rather than directly in the notebook.
We have also added config.yaml to the .gitignore file so we won't accidentally commit our sensitive data to the repository. You should generally keep sensitive data out of all git repositories (public or private), but definitely out of public ones.
If you ever accidentally commit credentials to a public repository, you must consider them compromised.
A .yaml file is a common way to store configuration data, but it is not really secure.
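If you do not yet have a config.yaml, the sketch below writes a template with the key names this notebook later reads from cf_t (consumer_key, consumer_secret, access_token, access_token_secret, sleep_interval, data, and file). The placeholder values, and the choices for sleep_interval, data, and file, are assumptions; replace them with your own credentials and paths.
#A minimal sketch that writes a template config.yaml.
#All values below are placeholders/assumptions; replace them with your own.
import os
template = """\
consumer_key: 'YOUR_CONSUMER_KEY'
consumer_secret: 'YOUR_CONSUMER_SECRET'
access_token: 'YOUR_ACCESS_TOKEN'
access_token_secret: 'YOUR_ACCESS_TOKEN_SECRET'
sleep_interval: 15          # seconds to wait between API calls (assumed value)
data: 'data'                # directory where .json output will be written (assumed)
file: 'screen_names.csv'    # the .csv of screen names downloaded above (assumed)
"""
if not os.path.exists('config.yaml'):   #don't overwrite an existing config
    with open('config.yaml', 'w') as yaml_out:
        yaml_out.write(template)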
#This will import some required libraries.
import sys
import ruamel.yaml #Library for reading .yaml configuration files
#This is your configuration file.
twitter_yaml='config.yaml'
with open(twitter_yaml, 'r') as yaml_t:
cf_t=ruamel.yaml.round_trip_load(yaml_t, preserve_quotes=True)
#You can check your config was loaded by printing, but you should not commit this.
cf_t
11.5. Using the cat command to look at files
We can use the cat command to look at the structure of our files.
!cat config.yaml
11.6. Create Some Relevant Functions
First we will create a Twitter object that handles authorization.
Then we will get profiles.
Finally we will get some tweets.
Don't worry if you don't understand all of the code. Here we are just introducing some more complex functions.
#@title
def create_twitter_auth(cf_t):
    """Function to create a twitter object
       Args: cf_t is the configuration dictionary.
       Returns: Twitter object.
    """
    # When using the twitter API you must authorize.
    # These tokens are necessary for user authentication.
    auth = OAuth(cf_t['access_token'], cf_t['access_token_secret'], cf_t['consumer_key'], cf_t['consumer_secret'])
    try:
        # create twitter API object
        twitter = Twitter(auth = auth)
    except TwitterHTTPError:
        traceback.print_exc()
        time.sleep(cf_t['sleep_interval'])
    return twitter

def get_profiles(twitter, names, cf_t):
    """Function that writes profiles to a file of the form *YYYY-MM-DD-user-profiles.json*
       Args: twitter is the Twitter object
             names is a list of screen names
             cf_t is the configuration dictionary
       Returns: the name of the file written.
    """
    # file name for daily tracking
    dt = datetime.datetime.now()
    fn = cf_t['data']+'/profiles/'+dt.strftime('%Y-%m-%d-user-profiles.json')
    with open(fn, 'w') as f:
        for name in names:
            print("Searching twitter for User profile: ", name)
            try:
                # create a subquery, looking up information about these users
                # twitter API docs: https://dev.twitter.com/docs/api/1/get/users/lookup
                profiles = twitter.users.lookup(screen_name = name)
                sub_start_time = time.time()
                for profile in profiles:
                    print("User found. Total tweets:", profile['statuses_count'])
                    # now save user info
                    f.write(json.dumps(profile))
                    f.write("\n")
                # Sleep so we don't exceed the API rate limit.
                sub_elapsed_time = time.time() - sub_start_time
                if sub_elapsed_time < cf_t['sleep_interval']:
                    time.sleep(cf_t['sleep_interval'] + 1 - sub_elapsed_time)
            except TwitterHTTPError:
                traceback.print_exc()
                time.sleep(cf_t['sleep_interval'])
                continue
    return fn
11.7. Load Twitter Handles From CSV
This is a .csv file that lists the individuals we want to collect data on.
Go ahead and follow AnalyticsDojo.
import pandas as pd
df=pd.read_csv(cf_t['file'])
df
11.8. Create Twitter Object
import twitlab
#Create Twitter Object
twitter = twitlab.create_twitter_auth(cf_t)
Running the code above generates a twitter object that we will use to call the API.
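As a sketch of how the pieces fit together, we can now pass the screen names from the .csv to the get_profiles function defined above. Because the column name in screen_names.csv isn't shown here, the sketch assumes the names are in the first column; it also creates the profiles directory that get_profiles writes into.
#A minimal sketch (assumes the screen names are in the first column of the .csv).
import os
os.makedirs(cf_t['data'] + '/profiles', exist_ok=True)   #get_profiles writes into this directory
names = df.iloc[:, 0].tolist()                           #list of screen names from the dataframe
profile_file = get_profiles(twitter, names, cf_t)        #function defined earlier in this notebook
print("Profiles written to:", profile_file)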
11.9. Step 2. Getting Help
# We can get some help on how to use the twitter api with the following.
help(twitter)
Go ahead and take a look at the twitter docs.
# The Yahoo! Where On Earth ID for the entire world is 1.
# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/
WORLD_WOE_ID = 1
US_WOE_ID = 23424977
# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.
world_trends = twitter.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter.trends.place(_id=US_WOE_ID)
print (world_trends)
print (us_trends)
11.10. Step 3. Displaying API responses as pretty-printed JSON
import json
print (json.dumps(world_trends, indent=1))
print (json.dumps(us_trends, indent=1))
Take a look at the api docs for the /trends/place call made above.
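As a small example of navigating this nested response, the sketch below pulls out just the trend names. It assumes the structure visible in the pretty-printed output above: a one-element list whose 'trends' key holds a list of dictionaries with a 'name' field.
#A sketch that extracts just the trend names from the nested response shown above.
world_trend_names = [trend['name'] for trend in world_trends[0]['trends']]
us_trend_names = [trend['name'] for trend in us_trends[0]['trends']]
print(world_trend_names[:10])
print(us_trend_names[:10])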
11.11. Step 4. Collecting search results for a targeted hashtag
# Import unquote to prevent url encoding errors in next_results
#from urllib.parse import unquote
#This can be any trending topic, but let's focus on a hashtag that is relevant to the class.
q = '#analytics'
count = 100
# See https://dev.twitter.com/rest/reference/get/search/tweets
search_results = twitter.search.tweets(q=q, count=count)
#This selects out the list of tweets (statuses) from the search results.
statuses = search_results['statuses']
# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
        print("next_results", next_results)
    except KeyError: # No more results when next_results doesn't exist
        break
    # Create a dictionary from next_results, which has the following form:
    # ?max_id=313519052523986943&q=NCAA&include_entities=1
    kwargs = dict([ kv.split('=') for kv in next_results[1:].split("&") ])
    print(kwargs)
    search_results = twitter.search.tweets(**kwargs)
    statuses += search_results['statuses']
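The loop above builds kwargs by splitting the query string on '&' and '=' by hand. As an illustration of an alternative design choice (not required for the lab), the sketch below parses the same kind of string with the standard library, which also handles URL decoding.
#A sketch using urllib.parse.parse_qsl instead of splitting the string by hand.
from urllib.parse import parse_qsl
example_next_results = '?max_id=313519052523986943&q=NCAA&include_entities=1'
kwargs = dict(parse_qsl(example_next_results.lstrip('?')))
print(kwargs)   #{'max_id': '313519052523986943', 'q': 'NCAA', 'include_entities': '1'}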
# Show one sample search result by slicing the list...
print (json.dumps(statuses[0], indent=1))
#Print several
print (json.dumps(statuses[0:5], indent=1))
11.12. Step 5. Extracting text, screen names, and hashtags from tweets
#We can access an individual tweet like so:
statuses[1]['text']
statuses[1]['entities']
#Notice the nested relationships. We have to account for this nesting to access the data further down.
statuses[1]['entities']['hashtags']
status_texts = [ status['text']
                 for status in statuses ]
screen_names = [ user_mention['screen_name']
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]
hashtags = [ hashtag['text']
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]
urls = [ url['url']
         for status in statuses
             for url in status['entities']['urls'] ]
# Compute a collection of all words from all tweets
words = [ w
          for t in status_texts
              for w in t.split() ]
# Explore the first 5 items for each...
print (json.dumps(status_texts[0:5], indent=1))
print (json.dumps(screen_names[0:5], indent=1))
print (json.dumps(hashtags[0:5], indent=1))
print (json.dumps(words[0:5], indent=1))
11.13. Step 6. Creating a basic frequency distribution from the words in tweets
from collections import Counter
for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print(c.most_common()[:10]) # top 10
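For a more readable display, the same counts can be put into a dataframe (pandas was already imported above as pd). This is just a sketch of one way to present the frequency distribution.
#A sketch that shows the top 10 word frequencies as a dataframe.
word_counts = Counter(words).most_common(10)
freq_df = pd.DataFrame(word_counts, columns=['word', 'count'])
freq_df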