(Very) simple Twitter user similarity

Posted: February 24th, 2010 | Filed under: python, Twitter

In this post I use basic web data extraction, combined with ideas and Python code from Toby Segaran’s Programming Collective Intelligence, to show a (very) simple Twitter user similarity mechanism.

Generating a list of users

There are lots of ways of putting together a list of Twitter users. If you’re on Twitter, you could use the list of your followers or the list of those you are following. You could also extract user names from a list of search results, the public timeline or a Twitter directory. The following code uses a regular expression to extract the user names from a wefollow page.

import re
import urllib
 
def getWefollowTwitterUsers(category = "tech"):
        # fetch the wefollow directory page for the given category
        url = "http://wefollow.com/twitter/" + category
        html = urllib.urlopen(url).read()
        # each user name is the link text right before </a></strong>
        users = re.findall(r'nofollow">(.*?)</a></strong>', html)
        return users
 
if __name__ == "__main__":
        print getWefollowTwitterUsers()
 
Output:
['kevinrose', 'google', 'LeoLaporte', 'mashable', 'TechCrunch', 'Veronica', 
'alexalbrecht', 'ev', 'patricknorton', 'Scobleizer', 'woot', 'ijustine', 'timoreilly', 
'guykawasaki', 'engadget', 'CaliLewis', 'chrispirillo', 'wired', 'ryan', 'sarahlane', 
'ambermac', 'ginatrapani', 'tferriss', 'fforward', 'mollywood']
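Since the wefollow page can change or disappear, here is a minimal offline sketch of what the regular expression does. The HTML fragment is a made-up stand-in for the page markup, and the helper name extract_users is hypothetical. It also drops duplicates, which the function above does not do.

```python
import re

# Hypothetical fragment shaped like the markup the pattern expects;
# the real wefollow page may have looked different.
SAMPLE_HTML = (
    '<strong><a href="http://twitter.com/kevinrose" rel="nofollow">kevinrose</a></strong>'
    '<span>some other markup</span>'
    '<strong><a href="http://twitter.com/google" rel="nofollow">google</a></strong>'
)

# The non-greedy (.*?) stops at the first </a></strong> after each match start.
USER_PATTERN = re.compile(r'nofollow">(.*?)</a></strong>')


def extract_users(html):
    """Return the user names in page order, without duplicates."""
    seen = set()
    users = []
    for name in USER_PATTERN.findall(html):
        if name not in seen:
            seen.add(name)
            users.append(name)
    return users


print(extract_users(SAMPLE_HTML))  # prints ['kevinrose', 'google']
```

A regular expression like this is tied to the exact markup, so it is worth checking for an empty result before continuing.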

Retrieving a list of messages for each user

Each user’s messages are available in an RSS feed of the format http://twitter.com/statuses/user_timeline/[user].rss?count=[1..200]. The count parameter is optional and controls the maximum number of messages contained in the feed. The following code uses Universal Feed Parser to extract the entries from the data feed.

import feedparser
 
# [...]
 
def getUserMessages(user):
        url = "http://twitter.com/statuses/user_timeline/" + user + ".rss?count=200"
        feed_data = feedparser.parse(url)
        return feed_data.get("entries", [])
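To experiment without network access, the same kind of entry list can be built from an RSS string using only the standard library. This is a sketch under assumptions: the sample feed is invented, parse_messages is a hypothetical helper, and it swaps Universal Feed Parser for xml.etree.ElementTree. It stores each item's description under the key "summary" so the result has the shape the keyword code expects.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample; a real user timeline feed had more fields.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Twitter / example_user</title>
    <item>
      <title>example_user: reading about python</title>
      <description>example_user: reading about python</description>
    </item>
    <item>
      <title>example_user: more python tonight</title>
      <description>example_user: more python tonight</description>
    </item>
  </channel>
</rss>"""


def parse_messages(rss_text):
    """Return one dict per <item>, with the description stored as 'summary'."""
    root = ET.fromstring(rss_text)
    entries = []
    for item in root.iter("item"):
        entries.append({"summary": item.findtext("description", default="")})
    return entries


messages = parse_messages(SAMPLE_RSS)
print(len(messages))  # prints 2
```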

Generating keyword scores

The following code goes through a user’s messages, splits them into words, strips surrounding punctuation, skips URLs and common words, and counts how often each remaining word occurs. Words that occur only once are dropped.

def getKeywordScores(user, messages):
        keywords = {}
        blacklist = ["a", "an", "by", "on", "that", "the", "these", "this", "those", "to"]
        # and many more words
        # words are lowercased before the lookup, so lowercase the user name too
        blacklist.append(user.lower())
        for message in messages:
                tweet = message["summary"]
                words = tweet.split()
                for word in words:
                        # strip leading and trailing punctuation
                        word = re.sub(r"^\W*", "", word)
                        word = re.sub(r"\W*$", "", word)
                        if word.startswith("http://"):
                                continue
                        word = word.lower()
                        if word in blacklist:
                                continue
                        if not word:
                                continue
                        count = keywords.get(word, 0)
                        keywords[word] = count + 1
        final_keywords = {}
        for k in keywords:
                if keywords[k] > 1:
                        final_keywords[k] = keywords[k]
        return final_keywords
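For comparison, the same counting idea can be written more compactly with collections.Counter. This is a sketch rather than a drop-in replacement: keyword_scores, STOP_WORDS and min_count are hypothetical names, the function takes plain strings instead of feed entries, and the sample messages are invented.

```python
import re
from collections import Counter

# A deliberately short stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "by", "on", "that", "the", "these", "this", "those", "to"}


def keyword_scores(user, summaries, min_count=2):
    """Count words across messages and keep those seen at least min_count times."""
    ignored = STOP_WORDS | {user.lower()}
    counts = Counter()
    for text in summaries:
        for token in text.split():
            # Skip links before punctuation stripping touches them.
            if token.lower().startswith(("http://", "https://")):
                continue
            # Strip leading and trailing non-word characters, then lowercase.
            word = re.sub(r"^\W+|\W+$", "", token).lower()
            if word and word not in ignored:
                counts[word] += 1
    return {word: n for word, n in counts.items() if n >= min_count}


sample = [
    "example_user: Reading about Python on the train http://example.com/a",
    "example_user: more python, and more coffee",
    "example_user: Coffee!",
]
print(keyword_scores("example_user", sample))
```

With this sample, only python, more and coffee survive, each with a count of 2. Everything else occurs once or is filtered out.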

Computing similarities

The code to compute similarity scores, and the ideas behind it, are presented in Programming Collective Intelligence. The source code for the book is available online. The relevant pieces are in chapter2/recommendations.py: sim_distance() (Euclidean distance), sim_pearson() (Pearson correlation coefficient) and topMatches(). The latter compares one user to all others and returns the n most similar users along with their similarity scores.
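To make the two measures concrete, here is a minimal sketch written from their standard definitions, not copied from the book. The names euclidean_similarity, pearson_similarity and top_matches and the sample data are made up for illustration, and details such as returning 0 when two users share no keywords are assumptions that may differ from recommendations.py.

```python
from math import sqrt


def shared_keys(prefs, a, b):
    """Keywords that both users have a count for."""
    return [k for k in prefs[a] if k in prefs[b]]


def euclidean_similarity(prefs, a, b):
    """Map Euclidean distance over shared keywords into the range (0, 1]."""
    shared = shared_keys(prefs, a, b)
    if not shared:
        return 0.0
    sum_sq = sum((prefs[a][k] - prefs[b][k]) ** 2 for k in shared)
    return 1.0 / (1.0 + sqrt(sum_sq))


def pearson_similarity(prefs, a, b):
    """Pearson correlation of the two users' counts over shared keywords."""
    shared = shared_keys(prefs, a, b)
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [float(prefs[a][k]) for k in shared]
    ys = [float(prefs[b][k]) for k in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant counts
    return cov / sqrt(var_x * var_y)


def top_matches(prefs, person, n=3, similarity=pearson_similarity):
    """Return the n (score, user) pairs most similar to person."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)
    return scores[:n]


# Invented keyword counts for three users.
prefs = {
    "alice": {"python": 5, "twitter": 3, "coffee": 1},
    "bob": {"python": 4, "twitter": 2, "coffee": 1, "cars": 7},
    "carol": {"python": 1, "twitter": 2, "coffee": 6},
}
print(top_matches(prefs, "alice", n=2))
```

In this sample, bob ranks first for alice because their counts rise and fall together, while carol's counts move in the opposite direction and yield a negative correlation.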

Similar users

The following code brings it all together and demonstrates how we can show users that are similar to a specific one, given the computed dictionary of keyword scores.

from recommendations import sim_pearson, sim_distance, topMatches
# [...]
if __name__ == "__main__":
        users = getWefollowTwitterUsers()
        # add my own
        users.append("abendig")
        print users
        user_keywords = {}
        for user in users:
                print "processing data for:", user
                messages = getUserMessages(user = user)
                user_keywords[user] = getKeywordScores(user = user, messages = messages)
 
        # Similarity between the first user and three others
        print sim_pearson(user_keywords, users[0], users[1])
        print sim_pearson(user_keywords, users[0], users[2])
        print sim_pearson(user_keywords, users[0], users[3])
 
        # My top three matches
        print topMatches(user_keywords, "abendig", n = 3, similarity = sim_pearson)

Here is the output that this produces (at the time of this writing):

['kevinrose', 'google', 'LeoLaporte', 'mashable', 'TechCrunch', 'Veronica', 'alexalbrecht', 
'ev', 'patricknorton', 'Scobleizer', 'woot', 'ijustine', 'timoreilly', 'guykawasaki', 
'engadget', 'CaliLewis', 'chrispirillo', 'sarahlane', 'ryan', 'wired', 'ambermac', 
'ginatrapani', 'tferriss', 'fforward', 'mollywood', 'abendig']
processing data for: kevinrose
processing data for: google
processing data for: LeoLaporte
processing data for: mashable
processing data for: TechCrunch
processing data for: Veronica
processing data for: alexalbrecht
processing data for: ev
processing data for: patricknorton
processing data for: Scobleizer
processing data for: woot
processing data for: ijustine
processing data for: timoreilly
processing data for: guykawasaki
processing data for: engadget
processing data for: CaliLewis
processing data for: chrispirillo
processing data for: sarahlane
processing data for: ryan
processing data for: wired
processing data for: ambermac
processing data for: ginatrapani
processing data for: tferriss
processing data for: fforward
processing data for: mollywood
processing data for: abendig
0.693852667302
0.57137732992
0.350957713398
[(0.85762813072101673, 'ginatrapani'), 
(0.81973579573386002, 'CaliLewis'), 
(0.81455896587667598, 'timoreilly')]

The results suggest that, based on the available data, ginatrapani, CaliLewis and timoreilly are the users most similar to abendig and thus may be worth following.

Next

This post showed an example of directly applying code and ideas from the book Programming Collective Intelligence to Twitter users and their message streams. The approach is, of course, heavily simplified, but user similarity is an interesting problem.

There are lots of ways to make this more useful. The realtime nature of the message streams should be taken into account. Users’ posting frequency may matter. Also, people’s interests certainly change. Overall similarity is useful, but similarity based on time ranges could also be interesting.
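One way to let recent messages count more than old ones is to weight each word occurrence by the age of its message. The following is only a sketch of that idea: decayed_keyword_scores is a hypothetical function, the 30-day half-life is an arbitrary assumption, and the input is a list of (timestamp, word) pairs that the code in this post does not currently produce.

```python
from datetime import datetime, timedelta


def decayed_keyword_scores(dated_words, now, half_life_days=30.0):
    """Sum word occurrences, halving each one's weight every half_life_days."""
    scores = {}
    for posted_at, word in dated_words:
        age_days = (now - posted_at).total_seconds() / 86400.0
        age_days = max(age_days, 0.0)  # treat future timestamps as brand new
        weight = 0.5 ** (age_days / half_life_days)
        scores[word] = scores.get(word, 0.0) + weight
    return scores


now = datetime(2010, 2, 24)
dated_words = [
    (now, "python"),                        # weight 1.0
    (now - timedelta(days=30), "python"),   # weight 0.5
    (now - timedelta(days=60), "perl"),     # weight 0.25
]
print(decayed_keyword_scores(dated_words, now))
```

The resulting dictionary has the same shape as the plain keyword counts, so it could be passed to the same similarity functions. Restricting the input to a date range instead would give similarity for that time range only.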

URLs that are included in the messages are currently mostly ignored. It would of course make a lot of sense to include them (don’t forget to deduplicate the various URL shortener versions of the same URL) to be able to take into account that several people may be talking about the same articles.
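A rough sketch of the deduplication step could look like the following. Resolving shortened links requires network requests, so the sketch assumes that step has already happened and its results are stored in a dictionary. The names canonical_url, shared_links and resolved, as well as all URLs, are made up for illustration.

```python
def canonical_url(url, resolved):
    """Follow known short-to-long mappings until no further mapping exists."""
    seen = set()
    while url in resolved and url not in seen:
        seen.add(url)  # guards against redirect loops
        url = resolved[url]
    return url


def shared_links(links_by_user, resolved):
    """Map each canonical URL to the users who posted it, if more than one did."""
    readers = {}
    for user, links in links_by_user.items():
        for link in links:
            readers.setdefault(canonical_url(link, resolved), set()).add(user)
    return {url: users for url, users in readers.items() if len(users) > 1}


# Assumed output of a separate step that follows each shortener's redirect.
resolved = {
    "http://short.example/abc": "http://example.com/article",
    "http://tiny.example/xyz": "http://example.com/article",
}
links_by_user = {
    "alice": ["http://short.example/abc"],
    "bob": ["http://tiny.example/xyz", "http://example.com/other"],
    "carol": ["http://example.com/article"],
}
print(shared_links(links_by_user, resolved))
```

Here all three users end up grouped under http://example.com/article, even though each posted a different string. The canonical URLs could then be counted like keywords.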

Simple keyword counts are pretty crude. Semantic analysis of the messages would be useful to get an indicator of whether two people are talking about similar things even though they are using different words, if their opinions are similar, and so forth.

Oh, and scale it up to include millions of users.


Reaching the right people

Posted: February 17th, 2010 | Filed under: Artificial Intelligence, email, Search, Twitter

Imagine this situation: A company has hundreds (maybe thousands) of employees. All of them have their own skills and areas of expertise. There is probably lots of overlap; however, no one person will know everyone in the larger group who has a particular skill set. If someone is working on a project and needs assistance to overcome some technical hurdle, it could be very helpful if they could communicate with the people who also have experience in that area. Those people might be located in entirely different parts of the company.

Semantic email addressing [PDF] aims to solve this problem:

Email addresses are a means to an end. The goal is usually not to send an email to a particular address, but to a particular person. You want to say hello to your friend Steve or send a message to the VP of marketing at Microsoft or to the head caterer for your wedding. Ideally, you could send a message to a person just by entering his or her name, position, or some other descriptive attribute. If a person’s email address changes, the email system should send to the new address automatically. If the person matching a description differs over time, the email system should send to the person currently matching that description.

In the given example, the user would be able to get answers to his or her questions by reaching out to the people with the fitting skill sets, without having known those people beforehand: The email system decides who the most appropriate recipients of the messages are.

I cannot help thinking that Aardvark was at least a little inspired by the ideas behind semantic email addressing. Their process is simple: Users send in questions (using email, Twitter, IM, etc.), Aardvark routes each question to another user who is (hopefully) qualified to answer it, and the asker eventually receives a response, often just a few minutes later. In this social search approach, Aardvark accomplishes the job of finding information by finding the right people who can provide it. The service has received very good press and was recently acquired by Google.

Twitter seems like it might be a good platform for this problem area. If someone has a public Twitter feed, they are essentially broadcasting their updates to the open stream, where anyone can see them. It is probably safe to assume that they are at least open to the idea of talking to strangers and responding to messages from people they do not already know.

How could one go about finding the best people to message though? One method is certainly to search the message stream for specific keywords and basically manually look for people who might be active in areas of interest. You can also search in and add yourself to one of the many directories that are being developed.

But, if I simply need to talk to someone and ask them “May I ask you a question about XYZ?” then clearly, a) broadcasting my question hoping that someone will answer could be very inefficient and b) first researching who the best person might be for my question(s) puts all the burden on me.

What if the user could simply send out the question and the system would ensure that the most appropriate people see it?

The basic idea here is this: The user submits the question (along with a set of keywords) to his or her software. The software has analyzed other users’ message streams, extracted keywords, etc. and generated a knowledge base. If the query can be confidently matched to another user, a message is generated and sent to that user. The message will be visible to that user as a regular name mention and they can choose whether to engage in that conversation.
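As a very rough sketch of the matching step, assume the knowledge base is a dictionary that maps each user to keyword counts. The names route_question, mention_message and min_coverage are hypothetical, and the confidence rule (at least half of the query keywords must appear in a user's stream) is an arbitrary assumption.

```python
def route_question(query_keywords, knowledge_base, min_coverage=0.5):
    """Return the best-matching user, or None if no match is confident enough."""
    terms = {k.lower() for k in query_keywords}
    if not terms:
        return None
    best = None  # (coverage, mentions, user)
    for user, keywords in knowledge_base.items():
        hits = [t for t in terms if t in keywords]
        coverage = len(hits) / float(len(terms))   # share of query terms matched
        mentions = sum(keywords[t] for t in hits)  # tie-breaker: how often used
        candidate = (coverage, mentions, user)
        if best is None or candidate > best:
            best = candidate
    if best is None or best[0] < min_coverage:
        return None
    return best[2]


def mention_message(user, question):
    """Format the question as a regular name mention."""
    return "@{0} {1}".format(user, question)


# Invented knowledge base built from users' keyword counts.
knowledge_base = {
    "alice": {"python": 5, "regex": 2},
    "bob": {"cars": 7, "python": 1},
}
target = route_question(["Python", "regex"], knowledge_base)
if target is not None:
    print(mention_message(target, "May I ask you a question about regex in Python?"))
```

Returning None when nothing clears the threshold matters here: sending no message is better than mentioning someone who is unlikely to be able to help.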

Some of the obvious challenges:

  • Generating meaningful keywords/subject areas based on a person’s message stream.
  • Successful matching of queries with users.
  • Establishing an effective communication protocol that does not easily lend itself to abuse, i.e. spam.

A lot of web-based social networks are great at helping you connect with people you already know. Twitter makes it easy to connect with new people. The outlined approach (or a variation thereof) might be a good way of further supporting creation of those new connections, based on areas of interest.


Letting them understand us better

Posted: February 10th, 2010 | Filed under: Affective Computing, Artificial Intelligence, Human Computer Interaction

As you first start reading Can your computer make you happy?, scenes from 2001: A Space Odyssey or the more recent (and well done) Moon may readily come to mind. The author appears to foresee that reaction.

In sci-fi films, when anyone gives a computer emotions, it all goes horribly wrong. The computer becomes vain, doubtful and irrational and Armageddon by wayward technology is only narrowly avoided.

This is not surprising – science fiction has been informing us and becoming part of our culture for a while. It is increasingly all around us: We Are Living in a Sci-Fi World.

Affective computing is an intriguing concept though:

Affective computing is a branch of the study and development of artificial intelligence that deals with the design of systems and devices that can recognize, interpret, and process human emotions. It is an interdisciplinary field spanning computer sciences, psychology, and cognitive science.

Imagine educational software that modifies its teaching style depending on the user’s mood. Cars could warn other drivers if their own driver is angry, intoxicated or talking on the phone. Music players could adjust their playlists based on whether the listener is frowning, smiling or otherwise expressing a mood. Email clients could disable the send button if the user is clearly upset and about to send an email he or she may regret later.

A lot of different uses are conceivable here and this could contribute to much more personalized computing experiences.

Modern laptops and desktop computers are typically already equipped with microphones and cameras. Future operating systems may well feature a mood evaluation component and search engines may take information from that component as part of the search query. Similar scenarios are conceivable for other types of web-enabled applications.

Imagine logging in to Facebook some evening and finding a notification “John has been having a bad day. Check in with him to make sure he’s okay.” Intriguing.

And at least a little bit eerie.


Talking, questions and learning

Posted: February 3rd, 2010 | Filed under: Artificial Intelligence

In How Pair Programming Really Works [PDF], Stuart Wray discusses four mechanisms that contribute to successful pair programming practice. The author uses findings from cognitive psychology and neuroscience to provide evidence for his conclusions. There are some follow-up discussions at Computing Now, Reddit and Hacker News.

I found particularly interesting the discussion around talking to develop understanding:

Around 1980, as computer science undergraduate students at the University of Cambridge, my friends and I noticed a strange phenomenon that we called expert programmer theory. When one of us had trouble getting our programs to work, we’d describe the nonfunctioning state of our code to each other over coffee. Quite often, we’d realize in a flash what was wrong and how to solve it. These epiphanies were quite independent of the other person having any real understanding of our problems—the listener often seemed little wiser about the subject.

I have experienced similar scenarios and this can be both relieving (finally solved the problem!) and frustrating (why didn’t I think of this a few minutes ago?).

Explaining something to another person, or even to an object, can improve the explainer’s own understanding. Wray points out that it is helpful if we can talk to an expert, even if that expertise is largely a matter of perception. The main reason seems to be that such a person would be more likely to ask us deep questions that we can ponder or that may influence our thinking.

The ability to ask questions that are most appropriate for the given situation seems most valuable: Questions that don’t require too large a leap, but rather motivate the person to advance just a little further – questions that stimulate thinking.

What if software that we use daily asked us questions?

Lots of scenarios are conceivable, but here is one example. Imagine a news website that attaches to each article a module that contains at least one interesting question, such as “Do you think this policy change will effectively solve problem XYZ?”, “What do you think of senator X’s position on Y?”, “What if the economic situation in Y would change in Z way?” and so forth. These would be meaningful questions, based on the content of the article and meant to stimulate intelligent discourse (readers could leave responses and discuss amongst themselves). These questions would also ideally be automatically generated.

If we can accept that good questions at the right time can help our understanding and that deeper understanding is generally a good thing, then I think we will benefit from giving software more of an ability to ask questions – for our own benefit.