Pivot from MS Live Labs

Posted: March 3rd, 2010 | Author: Alex | Filed under: Data Mining

Connections and patterns often become clear only when the larger set is observed, not just individual pieces of data. Microsoft Live Labs Pivot shows an interesting approach to visualizing online data. Gary Flake presents the project in the following video (from TED).

The software is available for download and experimentation. I would love to try it out, but judging by the requirements, that will have to wait until I can come up with a Windows machine.

This looks very interesting though.


Finding frequent items in a data stream

Posted: November 13th, 2009 | Author: Alex | Filed under: Algorithms, Data Mining, python

In Finding the Frequent Items in Streams of Data [PDF], Graham Cormode and Marios Hadjieleftheriou discuss the frequent items problem and some of the algorithms that are used to solve it:

The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. In this paper, we describe the most important algorithms for this problem in a common framework. We place the different solutions in their historical context, and describe the connections between them, with the aim of clarifying some of the confusion that has surrounded their properties.

Two things make this problem interesting: the data stream can easily contain millions (or billions) of items, and the algorithm typically gets only one look at each item as it comes up in the stream.
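To make the problem statement concrete, here is a minimal exact baseline in Python. It is only a sketch for illustration: the name `frequent_items_exact` and the parameter `phi` (the fraction threshold) are my own, not taken from the paper. It looks at each item once, but it keeps one counter per distinct item, which is exactly the memory cost the streaming algorithms try to avoid.

```python
from collections import Counter


def frequent_items_exact(stream, phi):
    """Return items occurring more than phi * n times, with exact counts.

    Hypothetical baseline: one pass over the stream, but memory grows
    with the number of distinct items.
    """
    counts = Counter()
    n = 0
    for item in stream:
        counts[item] += 1
        n += 1
    # Keep only items whose count exceeds the given fraction of the stream.
    return {item: c for item, c in counts.items() if c > phi * n}
```

For example, `frequent_items_exact("abacada", 0.5)` keeps only `"a"`, which accounts for 4 of the 7 items.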

Space-Saving

In this post I focus on the Space-Saving algorithm and provide an implementation in Python.
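As a rough illustration of the idea, here is a minimal sketch of Space-Saving as it is commonly described, not necessarily the implementation from the full post. The function name `space_saving`, the parameter `k` (the number of counters) and the `errors` dictionary are assumptions made for this example. The algorithm keeps at most `k` counters; when an unmonitored item arrives and the table is full, it evicts the item with the smallest count and lets the newcomer inherit that count plus one.

```python
def space_saving(stream, k):
    """Track at most k items from a stream in a single pass.

    Returns (counts, errors): counts[item] is an upper bound on the true
    frequency, and errors[item] is the largest possible overestimate.
    """
    counts = {}
    errors = {}
    for item in stream:
        if item in counts:
            # Already monitored: just increment.
            counts[item] += 1
        elif len(counts) < k:
            # Free slot available: start monitoring with an exact count.
            counts[item] = 1
            errors[item] = 0
        else:
            # Table full: evict the item with the smallest count and let
            # the new item take over that count plus one.
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            del errors[victim]
            counts[item] = floor + 1
            errors[item] = floor
    return counts, errors
```

With `k` counters and a stream of length `n`, any item that occurs more than `n / k` times is guaranteed to be among the monitored items, and `counts[item] - errors[item]` is a lower bound on its true count. The linear scan for the minimum keeps the sketch short; a heap or a linked structure sorted by count would make eviction cheaper.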