Background: This is a challenge in our Algorithm Analysis class by Dr. YANG Tong.
Problem
In case the reader is not familiar with data stream algorithms, a simple description of the standard Top-\(k\) Problem is given below. \(\def\eps{\varepsilon}\DeclareMathOperator{\plog}{ProductLog}\)
- You are given a (huge) data stream of \(n\) items. The set of distinct items is denoted by \(A\). The elements are given online, meaning that your algorithm needs to input the stream one by one and process it on the fly. The stream is too large to be stored.
- You need to write an algorithm with a low memory usage (\(m\) units) to process the input.
- After the stream is completed, you need to output the top-\(k\) elements of the stream by frequency.
- In some settings, you need to output them in the correct order.
Obviously, when the input is an arbitrary sequence, \(\Omega(n)\) space is needed even if \(k=1\).
However, real-world data streams are not arbitrary. We can estimate real-world data stream distributions with Zipfian:
- The \(i\)-th most frequent item in the stream has a frequency \(\propto \frac{1}{i}\); i.e. the \(i\)-th most frequent item appeared in the stream for \(\frac{X}{i}\) times for a given \(X\). (Throughout the post, we ignore rounding for simplicity.) However, the order of the items can still be arbitrary.
Furthermore, we assume the algorithm to be deterministic. This completes the definition of deterministic Zipfian Top-\(k\) Problem.
Previous works
The Space Saving algorithm by Ahmed Metwally, Divyakant Agrawal and Amr El Abbadi can be used in the Top-\(k\) problems. Some other works optimized the algorithm and achieved better practical performance, which are not listed here due to the laziness of the author.
The original Space Saving Algorithm is deterministic. On a Zipfian data, it has a space complexity of \(O(k^2\log{|A|})\).
Below, we will prove an \(O(k^2\log{|A|})\) lower bound of deterministic Zipfian Top-\(k\) Problem (under several assumptions on the algorithm) with an adversary construction, showing that Space Saving is optimal.
Assumptions
Here we will show the assumptions the proof needs.
- We assume that the algorithm keeps track of \(m\) items of \(A\). The collection of tracked items is denoted as \(B\). This means that the algorithm has a space complexity of \(\Theta(m)\). We can further assume that the set of tracked items \(B\subseteq A\) always has \(m\) elements. (The algorithm will not perform worse if more items are tracked.)
- We assume that the algorithm only uses equality comparison on items.
This means that the algorithm cannot compute hash values of items.
- Space Saving (and other algorithms based on it) uses a hash table to map items to its internal data structure. It can be rewritten to iterating through the data structure and compare the new item with each one, sacrificing time.
- As a corollary of assumption 2, the algorithm can’t distinguish between two items if neither of them are tracked. This means that, if the algorithm inputs an untracked item \(x_1\) and does something, and if we replace the item with a different untracked item \(x_2\), the algorithm should do the same thing.
- We assume that the algorithm can’t create an untracked item from nowhere. That is to say, if the algorithm tracks the items \(B_0\subseteq A\) before the item \(x\) is inputted, then the set the algorithm tracks afterwards \(B_1\) should be a subset of \(B_0\cup\{x\}\).
- We assume that \(k^2\ll |A|\). More precisely, \(k^2<|A|^{1-\eps}\) for a positive constant \(\eps\).
Adversary construction
In order to make a top-\(k\) algorithm make mistake, the easiest way is to confuse it between the \(k\)-th and the \((k+1)\)-th most frequent item. For simplicity, we let \(A=\{1,2,\cdots,|A|\}\) and \(i\in A\) is the \(i\)-th most frequent item. The following construction of input inputs \(k\) into the algorithm, and then makes it forget about \(k\). If this can be done \(\frac{X}{k(k+1)}\) times, and \(k\) is forgotten afterwards, the algorithm will be unable to distinguish between \(k\) and \(k+1\), thus making a mistake.
What will happen if the algorithm inputs \(w\) consecutive items \(s_1, \cdots, s_w\), and each item is untracked at the moment it is inputted?
It can do two things when an untracked item \(x=s_i\) arrived: either accept and start tracking \(x\) (and kicks a previously tracking item), or reject and drop \(x\).
Let’s first consider the case when none of \(s_i\) is dropped. By assumption 1 and 4, the insertion of \(s_i\) into \(B\) will kick another item out. Denote this item by \(f(i)\), and let \(g_i=\max_{j<i,\,s_j=f(i)}{j}\), or \(g_i=0\) if such \(j\) does not exist. Define \[ h_i=\begin{cases}i, &\mbox{if }g_i=0\\h_{g_i},&\mbox{if }g_i>0 \end{cases} \] All the items with a same \(h_i\) will form a directed path on the digraph with edges \((i, g_i)\) because each item can only be kicked once after each insertion. With pigeonhole principle, we can show that there must exist such a path with length \(\ge \frac{w}{m}\).
As an example, if \(m=3,\{s_1, \cdots, s_w\}=\{1,2,3,4,5,6,7\}\) and \(4\) kicks \(2\), \(5\) kicks \(4\), \(6\) kicks \(3\), \(7\) kicks \(1\), we have
- \(\{g_1,\cdots,g_w\}=\{0,0,0,2,4,3,1\}\),
- \(\{h_1,\cdots,h_w\}=\{1,2,3,2,2,3,1\}\), and
- \(\{i\mid h_i=2\}={2,4,5}\) has \(3\ge\frac 73\) elements.
Let’s say \(I=\{i\mid h_i=i_0\}=\{i_0,\cdots,i_t\}\) has \(|I|\ge \frac{w}{m}\) items.
We can replace every other item (\(s_{i_0},s_{i_2},s_{i_4},\cdots\)) with \(k\). We are sure that each \(s_{i_{2j}}\) is inputted when a previous item \(s_{i_{2j-1}}\) has kicked \(s_{i_{2j-2}}\). When each \(s_{i_{2j}}\) is replaced with \(k\), we are sure that each \(s'_{i_{2j}}=k\) is inputted when a previous item \(s'_{i_{2j-1}}\neq k\) has kicked \(s'_{i_{2j-2}}=k\), so the property “each item is untracked at the moment it is inputted” remains. With assumption 3, the algorithm should not change in behavior, so with sequence \(s'\), we inputted \(k\) into the algorithm for \(\frac{w}{2m}\) times and made it forget about \(k\) afterwards.
What if the algorithm rejects some \(s_i\)? It’s easier: we just replace every rejected item with \(k\). If a \(r\) percentage of items are rejected, we can input \(\left((1-r)\frac{1}{2m}+r\right)w\ge\frac{w}{2m}\) occurrences of \(k\).
Note that during the process above, each item \(i\) appears \(c_i\leq\frac Xi\) times. To make the whole sequence Zipfian, we just append all \(\frac Xi-c_i\) occurrences of all \(i\) in arbitrary order; if the algorithm happens to pick \(k\) rather than \(k+1\), we just swap all the occurrences of those two.
The problem remains to construct a large \(w\).
Construction of the untracked sequence
Set \(z=\left\lfloor\frac{|A|}{2m}\right\rfloor-1\), and let \[ \begin{aligned}S_1=&\{2m+1,\cdots,2m+2m\},\\S_2=&\{4m+1,\cdots,4m+2m\},\\\cdots&,\\S_j=&\{2jm+1,\cdots,2jm+2m\}\end{aligned} \] By Zipfian distribution, each element in \(S_i\) appears at least \(\frac{X}{2(j+1)m}\) times.
We first put each of \(S_1\) into the stream, and observe the tracked set \(B\). There must be at least \(m\) elements in \(S_1\setminus B\), and we can append them in the stream. We can repeat the process for \(\frac{X}{2(j+1)m}\) times, and this can be done for each \(j\). Concatenating the sequence for each \(j\), we have our final \(s_1,\cdots,s_w\) with a total length of \[ \begin{aligned}w\ge&\sum_{1\le j\le z}\frac{X}{2(j+1)m}\times m\\=&\frac{X}{2}\left(H_{z+1}-1\right)\\\approx&\frac X2 \left(\ln{|A|}-\ln m -\ln 2-1\right)\end{aligned} \] As long as \(\frac{w}{2m}>\frac{X}{k(k+1)}\approx\frac{X}{k^2}\), we can perform our adversary. A little bit of simplification leads us to \(4m\ge k^2(\ln{|A|}-\ln m+\Theta(1))\). Solving the inequality, we have our lower bound: if the algorithm makes no mistake, it must use space \(m\ge\Theta\left(k^2\plog \left(\frac{|A|}{k^2}\right)\right)\). When \(\mu:=\frac{|A|}{k^2}\gg 1\), \(\plog(\mu)\approx\ln\mu>\eps\ln{|A|}\), hence \(m\ge\Theta\left(\eps k^2\log \left(|A|\right)\right)\).
Conclusion
The above adversary proved the space complexity lower bound of \(\Omega\left(k^2\plog \left(\frac{|A|}{k^2}\right)\right)\) for deterministic algorithm for the Zipfian top-\(k\) algorithm, or simply \(\Omega\left(k^2\log\left(|A|\right)\right)\) when \(|A|\gg k^2\).