Analyzing Twitter Data

Collecting data through the Twitter API

Streaming API

Real-time tweets

The connection stays open until you close it.

Two endpoints

Filter endpoint searches by:

  • Keywords
  • User IDs
  • Location

Sample endpoint:

  • Random sample

The sample endpoint returns roughly a 1% random sample of all public tweets

tweepy package

  • collects data from Twitter
  • requires an object called SListener which tells it how to handle incoming data.

SListener

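  • SListener object inherits from the general StreamListener class included with tweepy (versions before 4.0)
  • opens a new timestamped file to store tweets
  • takes an optional API argument

from tweepy import API
from tweepy.streaming import StreamListener
import time

class SListener(StreamListener):
    def __init__(self, api = None):
        # Open a new timestamped file to store incoming tweets
        self.output = open('tweets_%s.json' %
                          time.strftime('%Y%m%d-%H%M%S'), 'w')
        # Fall back to a default API object if none is passed in
        self.api = api or API()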
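
How a listener handles incoming data can be sketched without a live connection. The class below is a hypothetical stand-in and not part of tweepy: it assumes each incoming message arrives as a raw JSON string and writes every valid message as one line to a file-like object, which is roughly what the timestamped tweets file would end up holding.

```python
import io
import json


class LineWriter:
    """Hypothetical stand-in for a stream listener (not a tweepy class)."""

    def __init__(self, output):
        # Any writable text file object works, e.g. open('tweets.json', 'w')
        self.output = output
        self.count = 0

    def on_data(self, raw):
        # Parse first so that malformed messages are skipped, not written
        try:
            json.loads(raw)
        except ValueError:
            return False
        self.output.write(raw.strip() + '\n')
        self.count += 1
        return True


buffer = io.StringIO()  # in-memory file used only for this demonstration
writer = LineWriter(buffer)
writer.on_data('{"id": 1, "text": "hello"}')  # made-up message
writer.on_data('not json')                    # skipped
```

Writing one JSON object per line keeps the output easy to read back later, one tweet at a time.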

tweepy authentication

The Twitter API uses OAuth for authentication

  • requires four tokens
  • tokens obtained from the Twitter developer site

Tokens

  • consumer key
  • consumer secret
  • access token
  • access token secret

Authenticating

  • pass the consumer key and consumer secret to OAuthHandler
  • set the access token and access token secret
  • lastly, pass the auth object to the tweepy API object

from tweepy import OAuthHandler
from tweepy import API

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = API(auth)
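
The snippet above assumes the four tokens already exist as Python variables. One way to supply them without hard-coding secrets is to read them from environment variables. This is a minimal sketch: the variable names in TOKEN_NAMES and the helper load_tokens are assumptions made for illustration, not something tweepy or Twitter prescribes.

```python
import os

# Hypothetical environment variable names for the four tokens
TOKEN_NAMES = ('CONSUMER_KEY', 'CONSUMER_SECRET',
               'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')


def load_tokens(env=os.environ):
    """Return the four tokens as a dict, or raise if any is missing."""
    missing = [name for name in TOKEN_NAMES if not env.get(name)]
    if missing:
        raise KeyError('missing tokens: ' + ', '.join(missing))
    return {name: env[name] for name in TOKEN_NAMES}
```

The returned values can then be passed on, for example tokens['CONSUMER_KEY'] and tokens['CONSUMER_SECRET'] to OAuthHandler.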

Collecting data with tweepy

Sample endpoint - To take a random sample of all of Twitter

  • instantiate an SListener object
  • instantiate a stream object
  • call the sample method to begin collecting data
from tweepy import Stream

listen = SListener(api)
stream = Stream(auth, listen)
stream.sample()

Exercise 1

Setting up tweepy authentication

  • Import OAuthHandler and API from the tweepy module.
  • Pass consumer_key and consumer_secret to OAuthHandler.
  • Set the access tokens with access_token and access_token_secret.
  • Pass the auth object to the API.
from tweepy import OAuthHandler
from tweepy import API

# Consumer key authentication
auth = OAuthHandler(consumer_key, consumer_secret)

# Access key authentication
auth.set_access_token(access_token, access_token_secret)

# Set up the API with the authentication handler
api = API(auth)

Exercise 2

Collecting data on keywords

  • Import Stream from tweepy.
  • Set keywords_to_track to a list containing #rstats and #python.
  • Pass the auth and listen objects to Stream.
  • Set the keyword argument track equal to keywords_to_track.
from tweepy import Stream

# Set up words to track
keywords_to_track = ['#rstats', '#python']

# Instantiate the SListener object 
listen = SListener(api)

# Instantiate the Stream object
stream = Stream(auth, listen)

# Begin collecting data
stream.filter(track = keywords_to_track)

Understanding Twitter JSON

JSON is a combination of dictionaries and lists

Contents of Twitter JSON

  • text
  • creation datetime
  • unique tweet ID
  • how many retweets or favorites
  • language
  • if it's a reply - which tweet it's replying to and which user

Child JSON objects

important ones

  • user - name, handle, bio, location, verification status
  • place
  • extended_tweet
  • retweeted_status
  • quoted_status

Places, retweets/quoted tweets, and 140+ tweets

place and coordinates
  • contain geolocation
extended_tweet
  • present for tweets over 140 characters; holds the full, untruncated text
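
Because extended_tweet only appears on longer tweets, code that wants the complete text has to check for it. The helper below is a minimal sketch: get_full_text is a hypothetical name, the two sample dictionaries are made up, and it assumes the full text sits under the full_text key of extended_tweet.

```python
def get_full_text(tweet):
    """Return the untruncated text when extended_tweet is present."""
    extended = tweet.get('extended_tweet')
    if extended and 'full_text' in extended:
        return extended['full_text']
    # Short tweets carry their whole text in the top-level field
    return tweet['text']


# Made-up examples containing only the keys this sketch needs
short_tweet = {'text': 'a short tweet'}
long_tweet = {
    'text': 'a long tweet that was cut...',
    'extended_tweet': {'full_text': 'a long tweet that was cut off here'},
}
```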

Accessing JSON

  • open() and read() load the contents of the JSON file into a string
  • the json package's loads() method converts that JSON string into a Python dictionary
  • values are then accessed via keys
import json

with open('tweet-example.json', 'r') as f:
    tweet_json = f.read()
tweet = json.loads(tweet_json)
tweet['text']
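
The example above loads a file holding a single tweet. A file collected from the stream usually holds many. The sketch below assumes the file contains one JSON object per line, which depends on how the listener wrote it. The helper name load_tweets is hypothetical.

```python
import json


def load_tweets(path):
    """Read one JSON object per non-empty line and return a list of dicts."""
    tweets = []
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between tweets
                tweets.append(json.loads(line))
    return tweets
```

Each element of the returned list can then be accessed by key in the same way as the single tweet dictionary.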

Child tweet JSON

Accessed as nested dictionary

# user object
tweet['user']['screen_name'] # user handle
tweet['user']['name'] # user display name
tweet['user']['created_at'] # when account was created
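
The other child objects, such as retweeted_status and quoted_status, can be reached with the same nested-key pattern, but they are only present on some tweets. The sketch below uses a made-up tweet dictionary and a hypothetical helper, original_author, and assumes each child object carries its own user object.

```python
# Made-up tweet dictionary with only the keys needed for illustration
tweet = {
    'text': 'RT @original: hello',
    'user': {'screen_name': 'retweeter'},
    'retweeted_status': {
        'text': 'hello',
        'user': {'screen_name': 'original'},
    },
}


def original_author(tweet):
    """Return the handle of the retweeted or quoted user, else None."""
    for key in ('retweeted_status', 'quoted_status'):
        child = tweet.get(key)
        if child:
            return child['user']['screen_name']
    return None
```

Using .get() avoids a KeyError on tweets that are neither retweets nor quote tweets.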

Exercise 3

Loading and accessing tweets

  • Import the json module.
  • Convert the tweet JSON stored in tweet_json from JSON to a Python object using json's .loads() method.
  • Print the tweet text and id using the appropriate keys.