Background: This is a challenge in our Algorithm Analysis class by Dr. YANG Tong.
Problem
In case the reader is not familiar with data stream algorithms, here is a simple description of the standard Top-$K$ problem:
- You are given a (huge) data stream of $n$ items. The set of distinct items is denoted by $U$. The elements arrive online, meaning that your algorithm must read the stream one item at a time and process it on the fly. The stream is too large to be stored.
- You need to write an algorithm with a low memory usage ($m$ units) to process the input.
- After the stream is completed, you need to output the top-$K$ elements of the stream by frequency.
- In some settings, you also need to output them in the correct order.
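For concreteness, the sketch below pins down the input/output convention with an exact (and memory-hungry) baseline: it simply counts every distinct item, which is precisely what the memory constraint forbids. The function name and signature are illustrative, not part of the problem statement.

```python
from collections import Counter
from typing import Hashable, Iterable, List

def topk_exact(stream: Iterable[Hashable], k: int) -> List[Hashable]:
    """Exact Top-K by counting everything.

    This ignores the memory constraint (it keeps one counter per distinct
    item), so it only serves as a reference for what a correct answer looks
    like; the challenge is to match its output using far less memory.
    """
    counts = Counter(stream)                     # one counter per distinct item
    return [item for item, _ in counts.most_common(k)]
```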
Obviously, when the input is an arbitrary sequence, the problem cannot be solved within such a small memory budget.
However, real-world data streams are not arbitrary. We can model real-world data stream distributions as Zipfian:
- The $i$-th most frequent item in the stream has a frequency proportional to $i^{-\alpha}$; i.e. the $i$-th most frequent item appears in the stream roughly $C\,i^{-\alpha}$ times, for a given Zipf parameter $\alpha$ and the corresponding normalizing constant $C$. (Throughout the post, we ignore rounding for simplicity.) However, the order of the items can still be arbitrary.
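As an illustration, here is one way to generate such a stream. The parameter names `n_distinct`, `alpha`, and `C`, and the choice to shuffle the arrival order, are mine and not part of the problem statement.

```python
import random
from typing import List

def zipfian_stream(n_distinct: int, alpha: float, C: float, seed: int = 0) -> List[int]:
    """Build a stream in which item i (1-indexed) occurs about C * i**(-alpha) times.

    The multiset of frequencies follows the (rounded) Zipfian law; the arrival
    order is then shuffled, since the problem allows it to be arbitrary.
    """
    stream: List[int] = []
    for i in range(1, n_distinct + 1):
        stream.extend([i] * max(1, round(C * i ** (-alpha))))
    random.Random(seed).shuffle(stream)          # arbitrary (here: random) order
    return stream
```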
Furthermore, we assume the algorithm to be deterministic. This completes the definition of the deterministic Zipfian Top-$K$ problem.
Previous works
The Space Saving algorithm by Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi can be used to solve the Top-$K$ problem.
The original Space Saving algorithm is deterministic. On Zipfian data, it solves Top-$K$ with a space complexity far below the trivial cost of tracking every distinct item.
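A sketch of the algorithm's update rule, as I understand it from the original paper: keep $m$ counters; increment the counter of a tracked item, and when an untracked item arrives while all counters are occupied, evict an item with the minimum counter and give the newcomer that minimum value plus one. (The class below is illustrative; the paper's Stream-Summary structure finds the minimum counter in $O(1)$ time, which this plain-dictionary sketch does not bother with.)

```python
from typing import Dict, Hashable, List

class SpaceSaving:
    """Space Saving with m counters (simplified sketch)."""

    def __init__(self, m: int):
        self.m = m
        self.counts: Dict[Hashable, int] = {}

    def process(self, item: Hashable) -> None:
        if item in self.counts:                  # tracked: just count it
            self.counts[item] += 1
        elif len(self.counts) < self.m:          # a counter is still free
            self.counts[item] = 1
        else:                                    # evict a minimum-count item
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def topk(self, k: int) -> List[Hashable]:
        """Report the k tracked items with the largest counters."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:k]
```

With enough counters relative to the skew of the stream, the reported items coincide with the true top-$K$; how many counters are enough is exactly what the upper and lower bounds discussed here are about.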
Below, we will prove a lower bound on the space complexity of any deterministic algorithm for this problem.
Assumptions
Here we state the assumptions that the proof needs.
- We assume that the algorithm keeps track of $m$ items of $U$. The collection of tracked items is denoted by $T$. This means that the algorithm has a space complexity of $\Theta(m)$. We can further assume that the set of tracked items always has exactly $m$ elements. (The algorithm will not perform worse if more items are tracked.)
- We assume that the algorithm only uses equality comparisons on items. This means that the algorithm cannot compute hash values of items.
  - Space Saving (and other algorithms based on it) uses a hash table to map items to its internal data structure. It can be rewritten to iterate through the data structure and compare the new item with each tracked one, sacrificing time; a sketch of this comparison-only variant is given after this list.
- As a corollary of assumption 2, the algorithm can’t distinguish between two items if neither of them is tracked. This means that if the algorithm reads an untracked item $e$ and does something, and we replace that item with a different untracked item $e'$, the algorithm must do the same thing.
- We assume that the algorithm can’t create an untracked item out of nowhere. That is to say, if the algorithm tracks the set of items $T$ before an item $e$ is read, then the set it tracks afterwards must be a subset of $T \cup \{e\}$.
- We assume that $m$ is small compared to the number of distinct items;
more precisely, $m \le c\,|U|$ for a positive constant $c < 1$.
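To illustrate the remark under assumption 2, here is a hedged sketch of the same Space Saving update rewritten to stay inside the comparison-only model: instead of a hash lookup, each arrival is compared for equality against every tracked entry, trading $O(m)$ time per update for not hashing.

```python
from typing import Any, List, Tuple

class ComparisonOnlySpaceSaving:
    """Space Saving variant that never hashes items: every lookup is a linear
    scan using equality comparisons only, at a cost of O(m) time per update."""

    def __init__(self, m: int):
        self.m = m
        self.entries: List[Tuple[Any, int]] = []   # (tracked item, counter)

    def process(self, item: Any) -> None:
        for idx, (tracked, cnt) in enumerate(self.entries):
            if tracked == item:                  # equality comparison only
                self.entries[idx] = (tracked, cnt + 1)
                return
        if len(self.entries) < self.m:           # a slot is still free
            self.entries.append((item, 1))
        else:                                    # replace a minimum-count entry
            idx = min(range(len(self.entries)), key=lambda i: self.entries[i][1])
            self.entries[idx] = (item, self.entries[idx][1] + 1)
```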
Adversary construction
In order to make a Top-$K$ algorithm in this model fail, we construct the input stream adversarially: since the algorithm is deterministic, the adversary can simulate it and always knows the current tracked set $T$.
What will happen when the algorithm reads an untracked item? It can do one of two things when an untracked item $e$ arrives: it can start tracking $e$ (evicting some currently tracked item, since $T$ always keeps exactly $m$ elements), or it can reject $e$ and leave $T$ unchanged; by assumption 4, these are the only options.
Let’s first consider the case where none of the untracked arrivals is rejected, i.e., the algorithm starts tracking every untracked item it reads, each time evicting some currently tracked item. (As a running example, one can fix small concrete values of $K$ and $m$ and follow the construction with the tracked set $T$ written out explicitly.)

Let’s say the algorithm ends up tracking some particular set of items after this phase. We can replace every other item (every item that was untracked at the moment it arrived) by a different untracked item without changing the algorithm’s behavior, by the corollary of assumption 2; in particular, the adversary may decide only after the fact which identities those arrivals actually carried, as the sketch below illustrates.
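Here is a tiny self-contained check of that relabeling step, using the same comparison-only update rule as above; the streams and the value $m = 2$ are made up for illustration. Replacing the single untracked arrival 3 by a fresh item 99 leaves every counter value unchanged and only renames the corresponding tracked entry.

```python
from typing import Any, List, Tuple

def ss_update(entries: List[Tuple[Any, int]], item: Any, m: int) -> None:
    """One comparison-only Space Saving update (same rule as the sketch above)."""
    for idx, (tracked, cnt) in enumerate(entries):
        if tracked == item:
            entries[idx] = (tracked, cnt + 1)
            return
    if len(entries) < m:
        entries.append((item, 1))
    else:
        idx = min(range(len(entries)), key=lambda i: entries[i][1])
        entries[idx] = (item, entries[idx][1] + 1)

def run(stream: List[Any], m: int) -> List[Tuple[Any, int]]:
    entries: List[Tuple[Any, int]] = []
    for item in stream:
        ss_update(entries, item, m)
    return entries

m = 2
stream_a = [1, 1, 2, 3, 1]    # 3 is untracked when it arrives (T = {1, 2})
stream_b = [1, 1, 2, 99, 1]   # ...and is replaced by the fresh item 99

res_a, res_b = run(stream_a, m), run(stream_b, m)
assert sorted(c for _, c in res_a) == sorted(c for _, c in res_b)   # same counters
assert {x for x, _ in res_a} ^ {x for x, _ in res_b} <= {3, 99}     # only 3 <-> 99 renamed
```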
What if the algorithm rejects some of the untracked arrivals? By the corollary of assumption 2 again, rejected arrivals are interchangeable with any other untracked items, so the adversary remains free to decide later which identities they carried. Note that during the process above, each item’s total number of occurrences is unchanged by such a relabeling, so the frequency profile of the stream, and in particular the Zipfian constraint, is preserved.

The remaining problem is to construct a sufficiently long untracked sequence, i.e., a long run of arrivals each of which is untracked at the moment it appears.
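Since the algorithm is deterministic, the adversary can simulate this whole interaction. The skeleton below is purely illustrative; the `process` and `tracked` callbacks are an interface I made up, not anything defined in the post. It feeds fresh items and records which arrivals were rejected, i.e. the interchangeable identities the adversary can reuse.

```python
from typing import Callable, Hashable, Iterable, List, Set

def find_rejected(process: Callable[[Hashable], None],
                  tracked: Callable[[], Set[Hashable]],
                  fresh_items: Iterable[Hashable]) -> List[Hashable]:
    """Feed never-seen-before items to a deterministic algorithm and return the
    arrivals it rejected (did not start tracking).  Because the algorithm only
    uses equality comparisons, these rejected identities are interchangeable,
    and the adversary may later decide which of them were the frequent items."""
    rejected: List[Hashable] = []
    for item in fresh_items:
        process(item)
        if item not in tracked():    # the arrival was not tracked, i.e. rejected
            rejected.append(item)
    return rejected
```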
Construction of the untracked sequence
Set aside a pool of fresh items that have not appeared in the stream so far. We first put each of these fresh items into the stream once; by assumption 4 the algorithm can never track an item it has not yet read, so each of these first occurrences is the arrival of an untracked item, which yields the long untracked sequence we need.
Conclusion
The above adversary construction completes the proof of the space complexity lower bound for deterministic Zipfian Top-$K$.