<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>not just random &#187; Data Mining</title>
	<atom:link href="http://www.notjustrandom.com/category/data-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.notjustrandom.com</link>
	<description></description>
	<lastBuildDate>Tue, 08 Nov 2011 00:38:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Pivot from MS Live Labs</title>
		<link>http://www.notjustrandom.com/2010/03/03/pivot-from-ms-live-labs/</link>
		<comments>http://www.notjustrandom.com/2010/03/03/pivot-from-ms-live-labs/#comments</comments>
		<pubDate>Wed, 03 Mar 2010 20:23:55 +0000</pubDate>
		<dc:creator>Alex</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://www.notjustrandom.com/?p=1472</guid>
		<description><![CDATA[Connections and patterns often become clear only when the larger set is observed, not just individual pieces of data. Microsoft Live Labs Pivot shows an interesting approach to visualizing online data. Gary Flake presents the project in the following video (from TED). The software is available for download and experimentation. I would love to [...]]]></description>
			<content:encoded><![CDATA[<p>Connections and patterns often become clear only when the larger set is observed, not just individual pieces of data. <a href="http://livelabs.com/">Microsoft Live Labs</a> <a href="http://www.getpivot.com">Pivot</a> shows an interesting approach to visualizing online data. <a href="http://flakenstein.net/">Gary Flake</a> presents the project in the following video (from <a href="http://www.ted.com/talks/lang/eng/gary_flake_is_pivot_a_turning_point_for_web_exploration.html">TED</a>).</p>
<p><!--copy and paste--><object width="446" height="326"><param name="movie" value="http://video.ted.com/assets/player/swf/EmbedPlayer.swf"></param><param name="allowFullScreen" value="true" /><param name="wmode" value="transparent"></param><param name="bgColor" value="#ffffff"></param><param name="flashvars" value="vu=http://video.ted.com/talks/dynamic/GaryFlake_2010-medium.flv&#038;su=http://images.ted.com/images/ted/tedindex/embed-posters/GaryFlake-2010.embed_thumbnail.jpg&#038;vw=432&#038;vh=240&#038;ap=0&#038;ti=783&#038;introDuration=16500&#038;adDuration=4000&#038;postAdDuration=2000&#038;adKeys=talk=gary_flake_is_pivot_a_turning_point_for_web_exploration;year=2010;theme=what_s_next_in_tech;theme=a_taste_of_ted2010;theme=new_on_ted_com;event=TED2010;&#038;preAdTag=tconf.ted/embed;tile=1;sz=512x288;" /><embed src="http://video.ted.com/assets/player/swf/EmbedPlayer.swf" pluginspace="http://www.macromedia.com/go/getflashplayer" type="application/x-shockwave-flash" wmode="transparent" bgColor="#ffffff" width="446" height="326" allowFullScreen="true" flashvars="vu=http://video.ted.com/talks/dynamic/GaryFlake_2010-medium.flv&#038;su=http://images.ted.com/images/ted/tedindex/embed-posters/GaryFlake-2010.embed_thumbnail.jpg&#038;vw=432&#038;vh=240&#038;ap=0&#038;ti=783&#038;introDuration=16500&#038;adDuration=4000&#038;postAdDuration=2000&#038;adKeys=talk=gary_flake_is_pivot_a_turning_point_for_web_exploration;year=2010;theme=what_s_next_in_tech;theme=a_taste_of_ted2010;theme=new_on_ted_com;event=TED2010;"></embed></object></p>
<p>The software is available for <a href="http://www.getpivot.com">download and experimentation</a>. I would love to try it out, but the <a href="http://www.getpivot.com/download/">requirements</a> suggest that this will have to wait until I can come up with a Windows machine.</p>
<p>This looks very interesting though.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.notjustrandom.com/2010/03/03/pivot-from-ms-live-labs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finding frequent items in a data stream</title>
		<link>http://www.notjustrandom.com/2009/11/13/finding-frequent-items-in-a-data-stream/</link>
		<comments>http://www.notjustrandom.com/2009/11/13/finding-frequent-items-in-a-data-stream/#comments</comments>
		<pubDate>Fri, 13 Nov 2009 16:44:59 +0000</pubDate>
		<dc:creator>Alex</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.notjustrandom.com/?p=1016</guid>
		<description><![CDATA[In Finding the Frequent Items in Streams of Data [PDF], Graham Cormode and Marios Hadjieleftheriou discuss the frequent items problem and some of the algorithms that are used to solve it: The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://portal.acm.org/citation.cfm?id=1562789&#038;dl=GUIDE&#038;coll=GUIDE&#038;CFID=61620557&#038;CFTOKEN=15416114">Finding the Frequent Items in Streams of Data</a> [<a href="http://dimacs.rutgers.edu/~graham/pubs/papers/freqcacm.pdf">PDF</a>], <a href="http://dimacs.rutgers.edu/~graham/">Graham Cormode</a> and <a href="http://www2.research.att.com/~marioh/">Marios Hadjieleftheriou</a> discuss the frequent items problem and some of the algorithms that are used to solve it:</p>
<blockquote><p>
The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. In this paper, we describe the most important algorithms for this problem in a common framework. We place the different solutions in their historical context, and describe the connections between them, with the aim of clarifying some of the confusion that has surrounded their properties.
</p></blockquote>
<p>What makes this interesting is that a data stream can easily contain millions (or billions) of items, and the algorithm typically gets only one look at each item as it comes up in the stream.</p>
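<p>To make the problem statement concrete, here is a minimal exact baseline, assuming the whole stream fits in memory. The function name <code>exact_frequent_items</code> and the threshold parameter <code>phi</code> are just illustrative choices. It keeps one counter per distinct item, which is exactly the memory cost that the streaming algorithms discussed here try to avoid.</p>

```python
from collections import Counter


def exact_frequent_items(stream, phi):
    """Return the items whose share of the stream is strictly above phi."""
    counts = Counter(stream)        # one counter per distinct item
    total = sum(counts.values())    # number of items seen in the stream
    return {item: n for item, n in counts.items() if n > phi * total}


stream = ["a", "b", "a", "c", "a", "b", "d", "a"]
print(exact_frequent_items(stream, 0.2))  # {'a': 4, 'b': 2}
```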
<p><strong>Space-Saving</strong></p>
<p>In this post I focus on the Space-Saving algorithm and provide an implementation in Python. <span id="more-1016"></span>The algorithm was originally described in <strong>Efficient Computation of Frequent and Top-k Elements in Data Streams</strong> [<a href="http://www.cs.ucsb.edu/~dsl/publications/2005/ICDT2005-metwally.pdf">PDF</a>] by <a href="http://www.cs.ucsb.edu/~metwally/">Ahmed Metwally</a>, <a href="http://www.cs.ucsb.edu/~agrawal/">Divyakant Agrawal</a>, and <a href="http://www.cs.ucsb.edu/~amr/">Amr El Abbadi</a>:</p>
<blockquote><p>
We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-<em>k</em> and frequent elements with tight guarantees on errors. For general data distributions, our top-<em>k</em> algorithm can return a set of <em>k&#8217;</em> elements, where <em>k&#8217;</em> &asymp; <em>k</em>, which are guaranteed to be the top-<em>k&#8217;</em> elements; and we use minimal space for calculating frequent elements. For realistic Zipfian data, our space requirement for the frequent elements problem decreases dramatically with the parameter of the distribution; and for top-<em>k</em> queries, we ensure that only the top-<em>k</em> elements, in the correct order, are reported. Our experiments show significant space reductions with no loss in accuracy.
</p></blockquote>
<p>The algorithm basically works like this: The stream is processed one item at a time. A collection of <em>k</em> distinct items and their associated counters is maintained. If a new item is encountered and fewer than <em>k</em> items are in the collection, then the item is added and its counter is set to 1. If the item is already in the collection, its counter is increased by 1. If the item is not in the collection and the collection already has a size of <em>k</em>, then the item with lowest counter is removed and the new item is added, with its counter set to one larger than the previous minimum counter.</p>
<p>Here is some pseudocode to make this clearer:</p>

<div class="wp_syntax"><div class="code"><pre class="pseudo" style="font-family:monospace;">SpaceSaving(k, stream):
    collection = empty collection
    for each element in stream:
        if element in collection:
            collection[element] += 1
        else if length of collection &lt; k:
            add element to collection, collection[element] = 1
        else:
            current_minimum_element = element with lowest count value in collection
            current_minimum = collection[current_minimum_element]
            remove current_minimum_element from collection
            collection[element] = current_minimum + 1</pre></div></div>

<p><strong>The straightforward approach</strong></p>
<p>A first, easy implementation would use a simple hashtable, such as in the following piece of code:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> space_saving_frequent_k1<span style="color: black;">&#40;</span>k, stream, debug=<span style="color: #008000;">False</span><span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">def</span> get_smallest_key<span style="color: black;">&#40;</span>d<span style="color: black;">&#41;</span>:
                <span style="color: #483d8b;">&quot;&quot;&quot;
                Given dictionary d, returns the key associated with
                the lowest value in the dictionary.
                &quot;&quot;&quot;</span>
                min_key = <span style="color: #008000;">None</span>
                <span style="color: #ff7700;font-weight:bold;">for</span> key <span style="color: #ff7700;font-weight:bold;">in</span> d:
                        <span style="color: #ff7700;font-weight:bold;">if</span> min_key <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #008000;">None</span> <span style="color: #ff7700;font-weight:bold;">or</span> d<span style="color: black;">&#91;</span>key<span style="color: black;">&#93;</span> <span style="color: #66cc66;">&lt;</span> d<span style="color: black;">&#91;</span>min_key<span style="color: black;">&#93;</span>:
                                min_key = key
                <span style="color: #ff7700;font-weight:bold;">return</span> min_key
&nbsp;
        counters = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> element <span style="color: #ff7700;font-weight:bold;">in</span> stream:
                <span style="color: #ff7700;font-weight:bold;">if</span> element <span style="color: #ff7700;font-weight:bold;">in</span> counters:
                        counters<span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span> = counters<span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span>
                <span style="color: #ff7700;font-weight:bold;">elif</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>counters<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&lt;</span> k:
                        counters<span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span> = <span style="color: #ff4500;">1</span>
                <span style="color: #ff7700;font-weight:bold;">else</span>:
                        current_minimum_key = get_smallest_key<span style="color: black;">&#40;</span>counters<span style="color: black;">&#41;</span>
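                        <span style="color: #ff7700;font-weight:bold;">if</span> current_minimum_key <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">None</span>: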
                                counters<span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span> = counters<span style="color: black;">&#91;</span>current_minimum_key<span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span>
                                <span style="color: #ff7700;font-weight:bold;">del</span> counters<span style="color: black;">&#91;</span>current_minimum_key<span style="color: black;">&#93;</span>
                        <span style="color: #ff7700;font-weight:bold;">else</span>:
                                counters<span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span> = <span style="color: #ff4500;">1</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> counters</pre></div></div>

<p>This works for smaller data sets, and particularly well when the stream contains no more than <em>k</em> distinct items, so that the smallest element never has to be found. Otherwise, every eviction scans all <em>k</em> counters to retrieve the element with the minimum count, which remains comparatively costly when it happens repeatedly.</p>
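<p>The counts produced this way are upper bounds: an element that enters by eviction inherits the evicted counter. The following is a rough Python 3 sketch (not the code above) that additionally records, per element, how much of its count may be inherited. The function name and the <code>errors</code> dictionary are my own naming, and the eviction still uses a linear scan via <code>min</code>.</p>

```python
def space_saving_with_errors(k, stream):
    """Return {element: (count, error)} after one pass over the stream."""
    if k < 1:
        raise ValueError("k must be at least 1")
    counters = {}  # element -> estimated count (never below the true count)
    errors = {}    # element -> part of the count inherited from an eviction
    for element in stream:
        if element in counters:
            counters[element] += 1
        elif len(counters) < k:
            counters[element] = 1
            errors[element] = 0
        else:
            # Linear scan over all k counters: this is the costly step.
            victim = min(counters, key=counters.get)
            floor = counters.pop(victim)
            del errors[victim]
            counters[element] = floor + 1
            errors[element] = floor
    return {e: (counters[e], errors[e]) for e in counters}


stream = ["a", "b", "a", "c", "a", "b", "a"]
print(space_saving_with_errors(2, stream))  # {'a': (4, 0), 'b': (3, 2)}
```

<p>For each reported element, the true frequency lies between <code>count - error</code> and <code>count</code>. In the run above, "b" really occurs twice, which falls between 1 and 3.</p>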
<p><strong>Stream-Summary</strong></p>
<p>When describing the Space-Saving algorithm, the authors also introduced the Stream-Summary data structure (inspired by work in <a href="http://portal.acm.org/citation.cfm?id=740658">Frequency Estimation of Internet Packet Streams with Limited Space</a> [<a href="http://erikdemaine.org/papers/NetworkStats_ESA2002/paper.pdf">PDF</a>]), which groups elements with equal values together (in buckets) and allows quick retrieval of the element with the lowest count.</p>
<p>Here is a diagram of this structure, using three buckets and a total of six elements (E1-E6).</p>
<p><img src="http://www.notjustrandom.com/wp-content/uploads/2009/11/frequent_items.jpg" alt="frequent_items" title="Stream-Summary" style="border: 1px solid black;" width="431" height="171" class="alignnone size-full wp-image-1116" /></p>
<p>Buckets are stored in a list sorted by the buckets&#8217; respective values. Each bucket keeps track of its associated elements. Each element in turn points to its bucket; this mapping is kept in a simple hashtable. If an element&#8217;s count needs to be increased, the element is removed from its current bucket and added to the neighboring bucket whose value is one greater. If no such bucket exists, a new one is inserted into the bucket list. Empty buckets are removed.</p>
<p>The Python implementation using the Stream-Summary data structure may then look like this:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">class</span> Bucket<span style="color: black;">&#40;</span><span style="color: #008000;">object</span><span style="color: black;">&#41;</span>:
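        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, value = <span style="color: #ff4500;">1</span>, elements = <span style="color: #008000;">None</span><span style="color: black;">&#41;</span>:
                <span style="color: #008000;">self</span>.<span style="color: black;">value</span> = value
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span> = <span style="color: #008000;">list</span><span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">if</span> elements <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">None</span> <span style="color: #ff7700;font-weight:bold;">else</span> <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>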
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__str__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
                <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;%s: %s&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">value</span>, <span style="color: #008000;">str</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> append<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, element<span style="color: black;">&#41;</span>:
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> first_element<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
                <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">elements</span>:
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                <span style="color: #ff7700;font-weight:bold;">else</span>:
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">None</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> has_elements<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
                <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #ff4500;">0</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> remove<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, element<span style="color: black;">&#41;</span>:
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span>.<span style="color: black;">remove</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> StreamSummary<span style="color: black;">&#40;</span><span style="color: #008000;">object</span><span style="color: black;">&#41;</span>:
        <span style="color: #483d8b;">&quot;&quot;&quot;
        Maintains a dictionary of elements and a list of buckets. Each element
        points to a (parent) bucket.
        The bucket list is sorted based on the buckets' values. Each bucket also
        maintains a list of elements.
        This has the effect of grouping elements with equal values in buckets.
        &quot;&quot;&quot;</span>
        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span> = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span> = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__len__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
                <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__str__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
                result = <span style="color: #483d8b;">&quot;&quot;</span>
                <span style="color: #ff7700;font-weight:bold;">for</span> b <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span>:
                        result += <span style="color: #008000;">str</span><span style="color: black;">&#40;</span>b<span style="color: black;">&#41;</span> + <span style="color: #483d8b;">&quot; &quot;</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> result
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> add_element<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, element<span style="color: black;">&#41;</span>:
                <span style="color: #483d8b;">&quot;&quot;&quot;
                Adds an element and ensures it's assigned to the correct bucket.
                &quot;&quot;&quot;</span>
                <span style="color: #ff7700;font-weight:bold;">if</span> element <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">elements</span>:
                        <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span> <span style="color: #ff7700;font-weight:bold;">or</span> <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">value</span> <span style="color: #66cc66;">!</span>= <span style="color: #ff4500;">1</span>:
                                <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span>.<span style="color: black;">insert</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, Bucket<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                        <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span> = <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                        <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">elements</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> increase_element<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, element<span style="color: black;">&#41;</span>:
                <span style="color: #483d8b;">&quot;&quot;&quot;
                Increasing an element's value also means assigning it to the
                correct bucket. That can result in creating a new bucket and/or
                removing an empty one.
                &quot;&quot;&quot;</span>
                current_bucket = <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span>
                bucket_index = <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span>.<span style="color: black;">index</span><span style="color: black;">&#40;</span>current_bucket<span style="color: black;">&#41;</span>
&nbsp;
                <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">buckets</span><span style="color: black;">&#41;</span> == bucket_index + <span style="color: #ff4500;">1</span>:
                        <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>Bucket<span style="color: black;">&#40;</span>value = current_bucket.<span style="color: black;">value</span> + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">elif</span> <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span><span style="color: black;">&#91;</span>bucket_index + <span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">value</span> <span style="color: #66cc66;">&gt;</span> current_bucket.<span style="color: black;">value</span> + <span style="color: #ff4500;">1</span>:
                        <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span>.<span style="color: black;">insert</span><span style="color: black;">&#40;</span>bucket_index + <span style="color: #ff4500;">1</span>,
                                            Bucket<span style="color: black;">&#40;</span>value = current_bucket.<span style="color: black;">value</span> + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                current_bucket.<span style="color: black;">remove</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span> = <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span><span style="color: black;">&#91;</span>bucket_index + <span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>
                <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> current_bucket.<span style="color: black;">has_elements</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
                        <span style="color: #ff7700;font-weight:bold;">del</span> <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span><span style="color: black;">&#91;</span>bucket_index<span style="color: black;">&#93;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>element<span style="color: black;">&#93;</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> has_element<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, element<span style="color: black;">&#41;</span>:
                <span style="color: #ff7700;font-weight:bold;">return</span> element <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">elements</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> get_minimum<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
                <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span>:
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">buckets</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">first_element</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">else</span>:
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">None</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> replace_element<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, old_element, new_element<span style="color: black;">&#41;</span>:
                <span style="color: #483d8b;">&quot;&quot;&quot;
                Replaces an existing element with an entirely new element in
                the old element's bucket.
                &quot;&quot;&quot;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>new_element<span style="color: black;">&#93;</span> = <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>old_element<span style="color: black;">&#93;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>new_element<span style="color: black;">&#93;</span>.<span style="color: black;">remove</span><span style="color: black;">&#40;</span>old_element<span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>new_element<span style="color: black;">&#93;</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>new_element<span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">del</span> <span style="color: #008000;">self</span>.<span style="color: black;">elements</span><span style="color: black;">&#91;</span>old_element<span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> space_saving_frequent_k<span style="color: black;">&#40;</span>k, stream<span style="color: black;">&#41;</span>:
        summary = StreamSummary<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> element <span style="color: #ff7700;font-weight:bold;">in</span> stream:
                <span style="color: #ff7700;font-weight:bold;">if</span> summary.<span style="color: black;">has_element</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>:
                        summary.<span style="color: black;">increase_element</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">elif</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>summary<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&lt;</span> k:
                        summary.<span style="color: black;">add_element</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">else</span>:
                        current_minimum_key = summary.<span style="color: black;">get_minimum</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
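                        <span style="color: #ff7700;font-weight:bold;">if</span> current_minimum_key <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">None</span>: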
                                summary.<span style="color: black;">replace_element</span><span style="color: black;">&#40;</span>current_minimum_key, element<span style="color: black;">&#41;</span>
                                summary.<span style="color: black;">increase_element</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
                        <span style="color: #ff7700;font-weight:bold;">else</span>:
                                summary.<span style="color: black;">add_element</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> summary</pre></div></div>

<p>For larger data sets, where <em>k</em> is noticeably smaller than the number of distinct elements in the set, the Stream-Summary data structure proves advantageous, because finding the element with the lowest count no longer requires a scan over all counters.</p>
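<p>As a variation on the same idea, here is a sketch that keeps the buckets in a dictionary keyed by count instead of a sorted list. This is a simplification under my own assumptions, not the Stream-Summary structure from the paper: the class <code>CountBuckets</code> and its method names are hypothetical. It relies on the fact that counts only ever grow by one, so when the lowest bucket empties, the element that just left it is sitting in the bucket one higher.</p>

```python
from collections import defaultdict


class CountBuckets:
    """Groups elements by count so the minimum is found without a scan."""

    def __init__(self):
        self.count_of = {}                # element -> current count
        self.members = defaultdict(dict)  # count -> elements (dict keeps order)
        self.min_count = 0                # smallest count in use, 0 if empty

    def __len__(self):
        return len(self.count_of)

    def __contains__(self, element):
        return element in self.count_of

    def _place(self, element, count):
        self.count_of[element] = count
        self.members[count][element] = None

    def _unplace(self, element):
        count = self.count_of.pop(element)
        bucket = self.members[count]
        del bucket[element]
        if not bucket:
            del self.members[count]  # empty buckets are removed
        return count

    def add(self, element):
        """Insert a new element with count 1, the lowest possible count."""
        self._place(element, 1)
        self.min_count = 1

    def increase(self, element):
        count = self._unplace(element)
        self._place(element, count + 1)
        # If the minimum bucket just emptied, the moved element now sits
        # one bucket higher, so the new minimum is exactly count + 1.
        if count == self.min_count and count not in self.members:
            self.min_count = count + 1

    def replace_minimum(self, element):
        """Evict one lowest-count element; the newcomer gets that count + 1."""
        victim = next(iter(self.members[self.min_count]))
        count = self._unplace(victim)
        self._place(element, count + 1)
        if count not in self.members:
            self.min_count = count + 1


def space_saving(k, stream):
    if k < 1:
        raise ValueError("k must be at least 1")
    summary = CountBuckets()
    for element in stream:
        if element in summary:
            summary.increase(element)
        elif len(summary) < k:
            summary.add(element)
        else:
            summary.replace_minimum(element)
    return summary.count_of


print(space_saving(2, list("abacaba")))
```

<p>Every operation here is a constant number of dictionary lookups, at the price of no longer having the buckets available in sorted order.</p>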
<p><strong>Onward</strong></p>
<p>There is a lot of ongoing research in this problem area, and this article offers only a small (and simplified) glimpse. Explore the research, and find out which real-world applications use some version of this as part of their problem-solving approach. Applications can be found in web access log processing, search applications, mining of real-time message streams, and so forth.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.notjustrandom.com/2009/11/13/finding-frequent-items-in-a-data-stream/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

