Connections and patterns often become clear only when the larger set is observed, not just individual pieces of data. Microsoft Live Labs Pivot shows an interesting approach to visualizing online data. Gary Flake presents the project in the following video (from TED).
The software is available for download and experimentation. I would love to try it out, but the requirements make it sound like that will have to wait until I can come up with a Windows machine.
The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. In this paper, we describe the most important algorithms for this problem in a common framework. We place the different solutions in their historical context, and describe the connections between them, with the aim of clarifying some of the confusion that has surrounded their properties.
What makes this interesting is that the stream can easily contain millions (or billions) of items, and the algorithm typically gets only one look at each item as it arrives.
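To make the problem statement concrete, here is a minimal exact baseline in Python. It is only a sketch for illustration: the function name `exact_frequent_items` and the parameter `phi` (the "given fraction") are my own choices, not taken from the paper. It looks at each item once, but it keeps one counter per distinct item, which is exactly what becomes impractical on a stream with billions of items.

```python
from collections import Counter


def exact_frequent_items(stream, phi):
    """Return {item: count} for items occurring more than phi * n times.

    Single pass over the stream, but memory grows with the number of
    distinct items, so this is a reference baseline, not a streaming
    algorithm.
    """
    counts = Counter()
    n = 0
    for item in stream:
        counts[item] += 1
        n += 1
    threshold = phi * n
    return {item: c for item, c in counts.items() if c > threshold}


if __name__ == "__main__":
    # Toy stream of 8 items; with phi = 0.3 the threshold is 2.4 occurrences.
    print(exact_frequent_items(list("abacabad"), 0.3))
```

The streaming algorithms discussed in the paper trade this exactness for a fixed, small amount of memory.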
Space-Saving
In this post I focus on the Space-Saving algorithm and provide an implementation in Python.
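As a rough idea of what such an implementation can look like, here is a minimal sketch of Space-Saving. It is not the implementation from the full post. The names `space_saving`, `counts`, and `errors` are hypothetical, and it assumes at most `k` items are monitored at any time. When an unmonitored item arrives and all `k` slots are taken, the item with the smallest counter is evicted and the newcomer inherits that count plus one. For simplicity this sketch finds the minimum with a linear scan, whereas a production version would use a smarter structure to do it in constant time.

```python
def space_saving(stream, k):
    """Approximate item counts over a stream using at most k counters.

    Returns (counts, errors): counts[item] is an upper bound on the true
    count, and counts[item] - errors[item] is a lower bound.
    """
    counts = {}  # item -> estimated count (never an underestimate)
    errors = {}  # item -> maximum possible overestimation
    for item in stream:
        if item in counts:
            # Already monitored: just bump its counter.
            counts[item] += 1
        elif len(counts) < k:
            # Free slot available: start monitoring with an exact count.
            counts[item] = 1
            errors[item] = 0
        else:
            # Evict the item with the smallest counter (linear scan, O(k)).
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            del errors[victim]
            # The newcomer inherits the evicted count as its error bound.
            counts[item] = floor + 1
            errors[item] = floor
    return counts, errors


if __name__ == "__main__":
    counts, errors = space_saving(list("abacabad"), k=2)
    for item in counts:
        print(item, counts[item], errors[item])
```

The counters always sum to the number of items seen, so any item occurring more than n/k times in a stream of length n is guaranteed to be among the monitored items. Items with a non-zero error may be overestimated, so candidates should be checked against `counts[item] - errors[item]` before being reported as frequent.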