Notes: Analyzing Social Media Data

Analyzing Twitter Data

Collecting data through the Twitter API

Streaming API

Real-time tweets

The connection stays open until you close it.

Two endpoints

Filter endpoint searches by:

Keywords
User IDs
Locations

Sample endpoint:

Random sample
Twitter returns a roughly 1% sample of all tweets

tweepy package

collects data from Twitter
requires an object called SListener which tells it how to handle incoming data.

SListener

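The SListener object inherits from a general StreamListener class included with tweepy (versions before 4.0)
opens a new timestamped file to store tweets
takes an optional API argument

from tweepy import API
from tweepy.streaming import StreamListener
import time

class SListener(StreamListener):
    def __init__(self, api=None):
        # Open a new output file named after the current timestamp
        self.output = open('tweets_%s.json' %
                           time.strftime('%Y%m%d-%H%M%S'), 'w')
        # Fall back to a default API object if none is passed
        self.api = api or API()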
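
To make the file-naming step concrete, here is a minimal sketch of the timestamped name the constructor above builds. The helper name `timestamped_filename` is hypothetical and not part of tweepy; only the standard `time` module is used.

```python
import time

def timestamped_filename(prefix='tweets'):
    # Hypothetical helper: builds a name like tweets_YYYYMMDD-HHMMSS.json
    return '%s_%s.json' % (prefix, time.strftime('%Y%m%d-%H%M%S'))

filename = timestamped_filename()
print(filename)
```

Because the name includes the time down to the second, each new listener writes to its own file rather than overwriting an earlier collection.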

tweepy authentication

The Twitter API uses OAuth for authentication

requires four tokens
tokens are obtained from the Twitter developer site

Tokens

consumer key
consumer secret
access token
access token secret

Authenticating

pass the consumer key and consumer secret to OAuthHandler
set the access token and access token secret
lastly, pass the auth object to the tweepy API object

from tweepy import OAuthHandler
from tweepy import API

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = API(auth)

Collecting data with tweepy

Sample endpoint - To take a random sample of all of Twitter

instantiate an SListener object
instantiate a stream object
call the sample method to begin collecting data

from tweepy import Stream

listen = SListener(api)
stream = Stream(auth, listen)
stream.sample()

Exercise 1

Setting up tweepy authentication

Import OAuthHandler and API from the tweepy module.
Pass consumer_key and consumer_secret to OAuthHandler.
Set the access tokens with access_token and access_token_secret.
Pass the auth object to the API.

from tweepy import OAuthHandler
from tweepy import API

# Consumer key authentication
auth = OAuthHandler(consumer_key, consumer_secret)

# Access key authentication
auth.set_access_token(access_token, access_token_secret)

# Set up the API with the authentication handler
api = API(auth)

Exercise 2

Collecting data on keywords

Import Stream from tweepy.
Set keywords_to_track to a list containing #rstats and #python.
Pass the auth and listen objects to Stream.
Set the keyword argument track equal to keywords_to_track.

from tweepy import Stream

# Set up words to track
keywords_to_track = ['#rstats', '#python']

# Instantiate the SListener object 
listen = SListener(api)

# Instantiate the Stream object
stream = Stream(auth, listen)

# Begin collecting data
stream.filter(track = keywords_to_track)

Understanding Twitter JSON

JSON is a combination of dictionaries and lists
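
As a small illustration of that structure, the sketch below parses a made-up JSON string (not real Twitter data) and shows that JSON objects become Python dictionaries and JSON arrays become lists.

```python
import json

# Made-up JSON: an object (dictionary) holding a list of objects
raw = '{"name": "example", "tags": [{"text": "python"}, {"text": "rstats"}]}'
data = json.loads(raw)

print(type(data))               # JSON objects become dictionaries
print(type(data['tags']))       # JSON arrays become lists
print(data['tags'][0]['text'])  # index into the list, then use a key
```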

Contents of Twitter JSON

text
creation datetime
unique tweet ID
how many retweets or favorites
language
if it's a reply - which tweet it's replying to and which user
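
The fields listed above might look like this once a tweet is parsed into a dictionary. This is a hand-written sketch: every value is invented, and the key names (`retweet_count`, `favorite_count`, `lang`, `in_reply_to_status_id`, `in_reply_to_user_id`) are assumed from Twitter's v1.1 tweet object rather than taken from these notes.

```python
# Hand-written stand-in for a parsed tweet; all values are invented
tweet = {
    'text': 'Learning to analyze tweets with #python',
    'created_at': 'Mon Jan 01 12:00:00 +0000 2018',
    'id': 1234567890,
    'retweet_count': 3,
    'favorite_count': 5,
    'lang': 'en',
    'in_reply_to_status_id': None,  # None when the tweet is not a reply
    'in_reply_to_user_id': None,
}

# A tweet is a reply when it points at another tweet's ID
is_reply = tweet['in_reply_to_status_id'] is not None
print(tweet['id'], tweet['lang'], tweet['retweet_count'], is_reply)
```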

Child JSON objects

important ones

user - name, handle, bio, location, verification status
place
extended_tweet
retweeted_status
quoted_status

Places, retweets/quoted tweets, and 140+ tweets

`place` and `coordinates`

contain geolocation data

`extended_tweet`

present for tweets over 140 characters; contains the full text of the tweet
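
One way to handle both short and long tweets is sketched below. It assumes the full text sits under a `full_text` key inside `extended_tweet`, which is not stated in these notes; the helper name `get_full_text` and the sample dictionaries are hypothetical.

```python
def get_full_text(tweet):
    # Prefer the untruncated text when the tweet carries an extended_tweet
    if 'extended_tweet' in tweet:
        return tweet['extended_tweet']['full_text']
    # Otherwise the regular text field already holds the whole tweet
    return tweet['text']

# Invented examples standing in for parsed tweets
short_tweet = {'text': 'A short tweet'}
long_tweet = {
    'text': 'A long tweet that was cut off...',
    'extended_tweet': {
        'full_text': 'A long tweet that was cut off in the text field',
    },
}

print(get_full_text(short_tweet))
print(get_full_text(long_tweet))
```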

Accessing JSON

the open() and read() methods load the JSON file into a string
the json package's loads() method converts the JSON string into a Python dictionary
values are accessed via keys

import json

tweet_json = open('tweet-example.json', 'r').read()
tweet = json.loads(tweet_json)
tweet['text']

Child tweet JSON

Accessed as nested dictionary

# user object
tweet['user']['screen_name'] # user handle
tweet['user']['name'] # user display name
tweet['user']['created_at'] # when account was created
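
The same nested access should carry over to the other child objects, such as retweeted_status and quoted_status, on the assumption that each holds a complete tweet with its own user object. The sketch below uses invented data, and the helper name `original_author` is hypothetical.

```python
def original_author(tweet):
    # Hypothetical helper: handle of the user whose tweet was retweeted or quoted
    for key in ('retweeted_status', 'quoted_status'):
        if key in tweet:
            return tweet[key]['user']['screen_name']
    # Plain tweets carry neither child object
    return None

# Invented examples standing in for parsed tweets
retweet = {
    'text': 'RT @someone: original words',
    'user': {'screen_name': 'retweeter'},
    'retweeted_status': {
        'text': 'original words',
        'user': {'screen_name': 'someone'},
    },
}
plain_tweet = {'text': 'just a tweet', 'user': {'screen_name': 'retweeter'}}

print(original_author(retweet))
print(original_author(plain_tweet))
```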

Exercise 3

Loading and accessing tweets

Import the json module.
Convert the tweet JSON stored in tweet_json from JSON to a Python object using json's .loads() method.
Print the tweet text and id using the appropriate keys.
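
A possible solution is sketched below. In the exercise, tweet_json is already provided; the string defined here is an invented stand-in so the snippet runs on its own.

```python
import json

# Stand-in for the tweet_json string the exercise provides; values are invented
tweet_json = '{"id": 1234567890, "text": "Hello from #python"}'

# Convert from JSON to a Python object
tweet = json.loads(tweet_json)

# Print tweet text
print(tweet['text'])

# Print tweet id
print(tweet['id'])
```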
