Notes: Analyzing Social Media Data
Analyzing Twitter Data
Collecting data through the Twitter API
Streaming API
Real-time tweets
The connection stays open until you close it
Two Endpoints
Filter endpoint searches:
- Keywords
- User IDs
- Location
Sample endpoint:
- Random sample
Twitter returns a random sample of roughly 1% of all tweets
tweepy package
- collects data from Twitter
- requires an object called SListener which tells it how to handle incoming data.
SListener
- SListener object inherits from a general Stream class included with tweepy
- opens a new timestamped file to store tweets
- takes an optional API argument
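from tweepy import API
from tweepy import Stream
import time
class SListener(Stream):
    def __init__(self, api = None):
        # Open a new timestamped file to store incoming tweets
        self.output = open('tweets_%s.json' %
                           time.strftime('%Y%m%d-%H%M%S'), 'w')
        # Use the API object passed in, or fall back to a default one
        self.api = api or API()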
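The listener above only opens the output file; it does not yet say what to do with incoming data. A minimal sketch of that step follows, written without tweepy so it can run on its own. It assumes each incoming message arrives as one raw JSON string. The class name TweetFileWriter and the made-up messages are hypothetical; the method name on_data mirrors the name of tweepy's callback for raw stream data.

```python
import json
import os
import tempfile
import time


class TweetFileWriter:
    """Stand-alone stand-in for a stream listener (hypothetical, not tweepy)."""

    def __init__(self, directory="."):
        # Timestamped file name, following the pattern used by SListener
        name = "tweets_%s.json" % time.strftime("%Y%m%d-%H%M%S")
        self.path = os.path.join(directory, name)
        self.output = open(self.path, "w")
        self.count = 0

    def on_data(self, data):
        # Assumption: `data` is one raw JSON string per incoming message
        message = json.loads(data)
        if "text" not in message:
            # Skip anything that does not look like a tweet
            return True
        # Store one tweet per line so the file is easy to read back
        self.output.write(json.dumps(message) + "\n")
        self.count += 1
        return True

    def close(self):
        self.output.close()


# Usage with two made-up messages instead of a live stream
writer = TweetFileWriter(tempfile.mkdtemp())
writer.on_data('{"id": 1, "text": "hello #python"}')
writer.on_data('{"limit": {"track": 5}}')
writer.close()

with open(writer.path) as f:
    saved = [json.loads(line) for line in f]
print(saved)
```

Writing one JSON object per line is a design choice in this sketch: it lets the file be read back line by line with json.loads.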
tweepy authentication
The Twitter API uses OAuth for authentication
- requires four tokens
- tokens obtained from the Twitter developer site
Tokens
- consumer key
- consumer secret
- access token
- access token secret
Authenticating
- pass the consumer key and consumer secret to OAuthHandler
- set the access token and access token secret
- lastly pass the auth object to the tweepy API object
from tweepy import OAuthHandler
from tweepy import API
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = API(auth)
Collecting data with tweepy
Sample endpoint
To take a random sample of all of Twitter:
- instantiate an SListener object
- instantiate a stream object
- call the sample method to begin collecting data
from tweepy import Stream
listen = SListener(api)
stream = Stream(auth, listen)
stream.sample()
Exercise 1
Setting up tweepy authentication
- Import OAuthHandler and API from the tweepy module.
- Pass consumer_key and consumer_secret to OAuthHandler.
- Set the access tokens with access_token and access_token_secret.
- Pass the auth object to the API.
from tweepy import OAuthHandler
from tweepy import API
# Consumer key authentication
auth = OAuthHandler(consumer_key, consumer_secret)
# Access key authentication
auth.set_access_token(access_token, access_token_secret)
# Set up the API with the authentication handler
api = API(auth)
Exercise 2
Collecting data on keywords
- Import Stream from tweepy.
- Set keywords_to_track to a list containing #rstats and #python.
- Pass the auth and listen objects to Stream.
- Set the keyword argument track equal to keywords_to_track.
from tweepy import Stream
# Set up words to track
keywords_to_track = ['#rstats', '#python']
# Instantiate the SListener object
listen = SListener(api)
# Instantiate the Stream object
stream = Stream(auth, listen, access_token, access_token_secret)
# Begin collecting data
stream.filter(track = keywords_to_track)
Understanding Twitter JSON
JSON is a combination of dictionaries and lists
Contents of Twitter JSON
- text
- creation datetime
- unique tweet ID
- how many retweets or favorites
- language
- if it's a reply - which tweet it's replying to and which user
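As a rough illustration of the fields listed above, here is a made-up tweet dictionary. The key names follow Twitter's v1.1 tweet object; every value is invented for the example.

```python
from datetime import datetime

# Made-up tweet for illustration only; the values are not real data
tweet = {
    "created_at": "Mon Jan 01 12:00:00 +0000 2018",
    "id": 1234567890,
    "text": "@example_user Learning to analyze tweets",
    "retweet_count": 3,
    "favorite_count": 7,
    "lang": "en",
    "in_reply_to_status_id": 1234567000,
    "in_reply_to_screen_name": "example_user",
}

# Twitter stores the creation time as a string; parse it into a datetime
created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")

# The reply fields are null (None in Python) when the tweet is not a reply
is_reply = tweet["in_reply_to_status_id"] is not None

# Retweets and favorites are plain integer counts
engagement = tweet["retweet_count"] + tweet["favorite_count"]

print(tweet["id"], tweet["lang"], created.isoformat(), is_reply, engagement)
```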
Child JSON objects
Important ones:
- user - name, handle, bio, location, verification status
- place
- extended_tweet
- retweeted_status
- quoted_status
Places, retweets/quoted tweets, and 140+ tweets
place and coordinates
- contain geolocation data
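A minimal sketch of reading the geolocation fields, assuming the v1.1 layout in which place carries a full_name and coordinates holds a GeoJSON point (longitude first). The helper get_location and the sample values are hypothetical.

```python
# Made-up tweet fragment for illustration only
tweet = {
    "text": "Hello from the park",
    "place": {"full_name": "Manhattan, NY", "country_code": "US"},
    "coordinates": {"type": "Point", "coordinates": [-73.97, 40.78]},
}


def get_location(tweet):
    """Return (place name, (lat, lon)); either part is None when missing."""
    place = tweet.get("place")
    name = place["full_name"] if place else None

    point = tweet.get("coordinates")
    if point:
        # GeoJSON order is [longitude, latitude]
        lon, lat = point["coordinates"]
        latlon = (lat, lon)
    else:
        latlon = None
    return name, latlon


print(get_location(tweet))
```

Most tweets carry no geolocation, so both fields are often null; the helper returns None for those parts instead of raising an error.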
extended_tweet
- present for tweets over 140 characters
- contains the full text of the tweet
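A short sketch of how the full text might be retrieved, assuming the streaming layout in which extended_tweet holds a full_text key and the top-level text is truncated. The helper get_full_text and the sample tweets are hypothetical.

```python
def get_full_text(tweet):
    """Prefer the untruncated text when an extended_tweet object is present."""
    if "extended_tweet" in tweet:
        return tweet["extended_tweet"]["full_text"]
    return tweet["text"]


# Made-up examples: one short tweet, one long tweet quoting another tweet
short_tweet = {"text": "Short tweet"}
long_tweet = {
    "text": "A long tweet that gets cut off\u2026",
    "extended_tweet": {"full_text": "A long tweet that gets cut off in the text field"},
    "quoted_status": {"text": "The tweet being quoted"},
}

print(get_full_text(short_tweet))
print(get_full_text(long_tweet))

# quoted_status (and retweeted_status) are themselves tweet objects,
# so the same helper works on them
print(get_full_text(long_tweet["quoted_status"]))
```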
Accessing JSON
- open() and read() methods load the JSON file as a string
- the json package and its loads method convert the JSON into a Python dictionary
- access values via keys
import json
tweet_json = open('tweet-example.json', 'r').read()
tweet = json.loads(tweet_json)
tweet['text']
Child tweet JSON
Accessed as nested dictionary
# user object
tweet['user']['screen_name'] # user handle
tweet['user']['name'] # user display name
tweet['user']['created_at'] # when account was created
Exercise 3
Loading and accessing tweets
- Import the json module.
- Convert the tweet JSON stored in tweet_json from JSON to a Python object using json's .loads() method.
- Print the tweet text and id using the appropriate keys.
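One possible sketch of these steps. In the exercise, tweet_json is already provided; the string below is a made-up stand-in so the snippet runs on its own.

```python
import json

# Hypothetical stand-in for the tweet_json string supplied by the exercise
tweet_json = '{"id": 1234567890, "text": "Learning #python", "user": {"screen_name": "example_user"}}'

# Convert from JSON to a Python dictionary
tweet = json.loads(tweet_json)

# Print the tweet text
print(tweet["text"])

# Print the tweet id
print(tweet["id"])
```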