(Very) simple Twitter user similarity

Posted: February 24th, 2010 | Author: Alex | Filed under: Twitter, python

In this post I combine basic web data extraction with ideas and Python code from Toby Segaran’s Programming Collective Intelligence to show a (very) simple Twitter user similarity mechanism.

Generating a list of users

There are many ways to put together a list of Twitter users. If you’re on Twitter, you could use the list of your followers or of the people you follow. You could also extract user names from search results, the public timeline or a Twitter directory. The following code uses a regular expression to extract the user names from a wefollow page.

import re
import urllib
 
def getWefollowTwitterUsers(category = "tech"):
        # Fetch the wefollow page for the given category
        url = "http://wefollow.com/twitter/" + category
        html = urllib.urlopen(url).read()
        # A user name is the link text between nofollow"> and </a></strong>
        users = re.findall(r"""nofollow">(.*?)</a></strong>""", html)
        return users
 
if __name__ == "__main__":
        print getWefollowTwitterUsers()
 
Output:
['kevinrose', 'google', 'LeoLaporte', 'mashable', 'TechCrunch', 'Veronica', 
'alexalbrecht', 'ev', 'patricknorton', 'Scobleizer', 'woot', 'ijustine', 'timoreilly', 
'guykawasaki', 'engadget', 'CaliLewis', 'chrispirillo', 'wired', 'ryan', 'sarahlane', 
'ambermac', 'ginatrapani', 'tferriss', 'fforward', 'mollywood']
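
To see what the regular expression captures without hitting the network, here is a minimal sketch (Python 3 syntax) that runs the same pattern against a made-up HTML fragment. The markup in SAMPLE_HTML and the name extract_users are assumptions for illustration only; the real wefollow page may be structured differently.

```python
import re

# Hypothetical markup, shaped only so that the pattern from the post matches it.
SAMPLE_HTML = """
<li><strong><a href="http://twitter.com/kevinrose" rel="nofollow">kevinrose</a></strong></li>
<li><strong><a href="http://twitter.com/google" rel="nofollow">google</a></strong></li>
<li><a href="http://example.com/about" rel="nofollow">About</a></li>
"""

# Same pattern as in the post, compiled once.
USER_PATTERN = re.compile(r'nofollow">(.*?)</a></strong>')

def extract_users(html):
    """Return the captured link texts in page order."""
    return USER_PATTERN.findall(html)

users = extract_users(SAMPLE_HTML)
print(users)
```

The third link is not wrapped in a strong element, so the pattern skips it. A pattern like this is tied to the exact markup and will stop matching as soon as the page layout changes.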

Retrieving a list of messages for each user

Each user’s messages are available as an RSS feed at http://twitter.com/statuses/user_timeline/[user].rss?count=[1..200], where [user] is the user name. The count parameter is optional and controls the maximum number of messages contained in the feed. The following code uses Universal Feed Parser to extract the entries from the feed.

import feedparser
 
# [...]
 
def getUserMessages(user):
        url = "http://twitter.com/statuses/user_timeline/" + user + ".rss?count=200"
        feed_data = feedparser.parse(url)
        return feed_data.get("entries", [])
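
The count parameter could also be made configurable. The following is a small sketch (Python 3 syntax) of a hypothetical helper, build_timeline_url, that keeps the value inside the 1..200 range mentioned above. The helper is an assumption for illustration and not part of the original code.

```python
def build_timeline_url(user, count=200):
    """Build the RSS timeline URL for a user (hypothetical helper)."""
    # Keep count inside the 1..200 range described in the text.
    count = max(1, min(200, int(count)))
    return "http://twitter.com/statuses/user_timeline/%s.rss?count=%d" % (user, count)

url = build_timeline_url("abendig", 500)
print(url)
```

The resulting string can be passed to feedparser.parse() in the same way as the hard-coded URL in getUserMessages().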

Generating keyword scores

The following code goes through a user’s messages, splits them into words, strips surrounding punctuation, skips links and blacklisted words, and counts how often each remaining word occurs. Words that occur only once are dropped.

def getKeywordScores(user, messages):
        keywords = {}
        blacklist = ["a", "an", "by", "on", "that", "the", "these", "this", "those", "to"]
        # and many more words
        # words are lowercased below, so blacklist the lowercased user name
        blacklist.append(user.lower())
        for message in messages:
                tweet = message["summary"]
                words = tweet.split()
                for word in words:
                        # strip leading and trailing non-word characters
                        word = re.sub(r"^\W*", "", word)
                        word = re.sub(r"\W*$", "", word)
                        word = word.lower()
                        # skip links
                        if word.startswith("http://"):
                                continue
                        if word in blacklist:
                                continue
                        if not word:
                                continue
                        count = keywords.get(word, 0)
                        keywords[word] = count + 1
        # keep only the words that occur more than once
        final_keywords = {}
        for k in keywords:
                if keywords[k] > 1:
                        final_keywords[k] = keywords[k]
        return final_keywords
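
For a quick feel of the resulting scores, here is a compact sketch (Python 3 syntax) of the same idea using collections.Counter on plain strings instead of feed entries. The names keyword_scores, STOPWORDS and min_count, as well as the sample messages, are made up for illustration.

```python
import re
from collections import Counter

# Short illustrative stop list; a real one would be much longer.
STOPWORDS = {"a", "an", "by", "on", "that", "the", "these", "this", "those", "to"}

def keyword_scores(user, texts, min_count=2):
    """Count words across texts, keeping those seen at least min_count times."""
    stop = STOPWORDS | {user.lower()}
    counts = Counter()
    for text in texts:
        for token in text.split():
            # Strip surrounding punctuation, then normalize case.
            word = re.sub(r"^\W+|\W+$", "", token).lower()
            if not word or word.startswith("http://") or word in stop:
                continue
            counts[word] += 1
    return {w: c for w, c in counts.items() if c >= min_count}

sample = [
    "abendig: Reading about Python and Twitter http://example.com/a",
    "abendig: More Python, more Twitter!",
    "abendig: Lunch.",
]
scores = keyword_scores("abendig", sample)
print(sorted(scores.items()))
# [('more', 2), ('python', 2), ('twitter', 2)]
```

Words such as "lunch" that appear only once are dropped, which keeps the dictionaries small but also discards a lot of signal for users who post rarely.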

Computing similarities

The code to compute similarity scores and the ideas behind it are presented in Programming Collective Intelligence. The source code for the book is available online. The relevant pieces are in chapter2/recommendations.py: sim_distance() (Euclidean distance), sim_pearson() (Pearson coefficient) and topMatches(). The last of these compares one user to all others and returns the list of the n most similar users along with their respective similarity scores.
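
For readers without the book at hand, the following is a minimal sketch (Python 3 syntax) of what such functions can look like when applied to dictionaries of keyword counts. It is not the book's code: the function names, the handling of edge cases and the sample data are assumptions for illustration. Both measures only look at keywords that the two users share.

```python
from math import sqrt

def euclidean_similarity(prefs, a, b):
    """Map the Euclidean distance over shared keywords into (0, 1]."""
    shared = [k for k in prefs[a] if k in prefs[b]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[a][k] - prefs[b][k]) ** 2 for k in shared)
    return 1.0 / (1.0 + sqrt(sum_sq))

def pearson_similarity(prefs, a, b):
    """Pearson correlation of the counts for shared keywords."""
    shared = [k for k in prefs[a] if k in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][k] for k in shared]
    ys = [prefs[b][k] for k in shared]
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant counts
    return cov / sqrt(var_x * var_y)

def top_matches(prefs, user, n=3, similarity=pearson_similarity):
    """Return the n (score, other_user) pairs with the highest scores."""
    scored = [(similarity(prefs, user, other), other)
              for other in prefs if other != user]
    scored.sort(reverse=True)
    return scored[:n]

# Made-up keyword counts for three made-up users.
prefs = {
    "alice": {"python": 4, "twitter": 2, "data": 3},
    "bob": {"python": 8, "twitter": 4, "data": 6},
    "carol": {"python": 1, "twitter": 5, "data": 2},
}
best = top_matches(prefs, "alice", n=1)
print(best)
```

In this toy data bob uses the same words as alice in the same proportions, just twice as often, so the Pearson score ranks him first while the Euclidean measure penalizes the difference in volume. That is one reason a correlation-based measure can be a better fit for users with very different posting frequencies.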

Similar users

The following code brings it all together and demonstrates how we can show users that are similar to a specific one, given the computed dictionary of keyword scores.

from recommendations import sim_pearson, sim_distance, topMatches
# [...]
if __name__ == "__main__":
        users = getWefollowTwitterUsers()
        # add my own
        users.append("abendig")
        print users
        user_keywords = {}
        for user in users:
                print "processing data for:", user
                messages = getUserMessages(user = user)
                user_keywords[user] = getKeywordScores(user = user, messages = messages)
 
        # Similarity between the first user and three others
        print sim_pearson(user_keywords, users[0], users[1])
        print sim_pearson(user_keywords, users[0], users[2])
        print sim_pearson(user_keywords, users[0], users[3])
 
        # My top three matches
        print topMatches(user_keywords, "abendig", n = 3, similarity = sim_pearson)

Here is the output that this produces (at the time of this writing):

['kevinrose', 'google', 'LeoLaporte', 'mashable', 'TechCrunch', 'Veronica', 'alexalbrecht', 
'ev', 'patricknorton', 'Scobleizer', 'woot', 'ijustine', 'timoreilly', 'guykawasaki', 
'engadget', 'CaliLewis', 'chrispirillo', 'sarahlane', 'ryan', 'wired', 'ambermac', 
'ginatrapani', 'tferriss', 'fforward', 'mollywood', 'abendig']
processing data for: kevinrose
processing data for: google
processing data for: LeoLaporte
processing data for: mashable
processing data for: TechCrunch
processing data for: Veronica
processing data for: alexalbrecht
processing data for: ev
processing data for: patricknorton
processing data for: Scobleizer
processing data for: woot
processing data for: ijustine
processing data for: timoreilly
processing data for: guykawasaki
processing data for: engadget
processing data for: CaliLewis
processing data for: chrispirillo
processing data for: sarahlane
processing data for: ryan
processing data for: wired
processing data for: ambermac
processing data for: ginatrapani
processing data for: tferriss
processing data for: fforward
processing data for: mollywood
processing data for: abendig
0.693852667302
0.57137732992
0.350957713398
[(0.85762813072101673, 'ginatrapani'), 
(0.81973579573386002, 'CaliLewis'), 
(0.81455896587667598, 'timoreilly')]

The results suggest that ginatrapani, CaliLewis and timoreilly are the users most similar to abendig based on the available data, and so may be worth following.

Next

This post showed an example of directly applying code and ideas from the book Programming Collective Intelligence to Twitter users and their message streams. The approach is of course heavily simplified, but user similarity is an interesting problem.

There are lots of ways to make this more useful. The realtime nature of the message streams should be taken into account. Users’ posting frequency may matter. Also, people’s interests certainly change. Overall similarity is useful, but similarity based on time ranges could also be interesting.

URLs that are included in the messages are currently mostly ignored. It would make a lot of sense to include them, so that the similarity can take into account that several people may be talking about the same articles. Don’t forget to deduplicate the various URL shortener versions of the same URL.
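
One possible shape for this, as a rough sketch (Python 3 syntax): count URLs per user after mapping each one to a canonical form. The resolve callable is an assumption; in practice it would have to follow the shortener's redirects, while the stand-in below just looks up a made-up table so that the example runs offline.

```python
from collections import Counter

def url_scores(texts, resolve):
    """Count canonical URLs across texts; resolve maps a URL to its canonical form."""
    counts = Counter()
    for text in texts:
        for token in text.split():
            if token.startswith("http://"):
                counts[resolve(token)] += 1
    return dict(counts)

# Made-up shortener mappings standing in for real redirect lookups.
KNOWN_EXPANSIONS = {
    "http://short.example/abc": "http://example.com/article",
    "http://tiny.example/xyz": "http://example.com/article",
}

def fake_resolve(url):
    """Hypothetical resolver: return the known expansion, or the URL itself."""
    return KNOWN_EXPANSIONS.get(url, url)

texts = [
    "Great read http://short.example/abc",
    "Worth a look: http://tiny.example/xyz",
    "Original: http://example.com/article",
]
link_scores = url_scores(texts, fake_resolve)
print(link_scores)
```

The resulting dictionary has the same shape as the keyword scores, so it could be merged into them or compared separately with the same similarity functions.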

Simple keyword counts are pretty crude. Semantic analysis of the messages would be useful to get an indicator of whether two people are talking about similar things even though they use different words, whether their opinions are similar, and so forth.

Oh, and scale it up to include millions of users.

