Small-to-Large

The Vanilla Method

Let's take the following problem:

We have a collection of sets U = {S1, S2, S3, ..., Sn}, where U is the "container" collection. Initially each set contains a single element; after each operation a set may contain several numbers. We define the cost of a set as the number of distinct elements in it. For each query, we merge two of the sets and then report the sum of the costs of all the sets.

The most brute-force approach is, for each query, to take all the numbers in one of the sets and insert them into the other set. A single merge can take up to O(n log n) time, and since there can be O(n) merges (after at most n-1 merges only one set is left), the total running time is O(n^2 log n). For the report part, we simply maintain the running sum of the costs: only the cost of the newly merged set changes while the costs of all other sets stay the same, so updating the answer is O(1).
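For concreteness, here is a minimal C++ sketch of one such brute-force merge step; the helper name bruteMerge, the running totalCost variable, and the small demo in main are illustrative, not part of the original problem statement.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Moves every element of 'from' into 'into', regardless of which set is smaller,
// and updates the running sum of costs (cost = number of distinct elements).
void bruteMerge(set<int>& into, set<int>& from, long long& totalCost) {
    totalCost -= (long long)into.size() + from.size();
    for (int x : from) into.insert(x);   // up to O(n log n) for a single merge
    from.clear();
    totalCost += into.size();            // only the merged set's cost changed
}

int main() {
    set<int> a = {1, 2, 3}, b = {3, 4};
    long long totalCost = a.size() + b.size();   // 5
    bruteMerge(a, b, totalCost);
    cout << totalCost << '\n';                   // 4 distinct elements remain
}
```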

This is obviously a poor algorithm for such a seemingly easy problem, but how to improve it isn't obvious... is it?

Concept

The only thing we do differently when merging is this: we always iterate over the elements of the smaller set and add them to the larger set. Done! That is the crux of small-to-large merging.

This seems to have the same complexity, right? A single merge can still move O(n) elements, albeit with a smaller constant factor. But here we analyse the time complexity differently: instead of bounding each merge on its own, we count how many times an element is accessed, or in this case, added to a different set. Each such addition is what costs the O(log n) factor, so if we can bound the number of additions we get a bound on the total complexity.

The main observation is that each element can be added to a new set at most log n times. Each time an element is moved, it belonged to the smaller of the two sets, so the set it is added to is at least as large as its old set; after the merge, the element's containing set is therefore at least twice the size of its old one. Since a set's size can never exceed n, this doubling can happen only log n times (because 2^(log n) = n), so each element is moved at most log n times. Hence the total number of additions is bounded by n log n, and the total complexity of the original problem is O(n log^2 n), which is significantly better than the bound we had earlier.
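Putting the whole solution together, here is a rough C++ sketch. It assumes a simple input format (n initial values, then q queries, each naming two currently non-empty sets by index); in a real problem you would also track which set each original index has been merged into, e.g. with a DSU. The only difference from the brute force above is the swap that makes sure we always move the smaller set.

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, q;
    cin >> n >> q;
    vector<set<int>> s(n + 1);
    long long totalCost = 0;              // running sum of costs of all sets
    for (int i = 1; i <= n; i++) {
        int a; cin >> a;
        s[i].insert(a);                   // each set starts with a single element
        totalCost += 1;
    }
    while (q--) {
        int u, v;
        cin >> u >> v;                    // merge the sets with indices u and v
        if (s[u].size() < s[v].size()) swap(u, v);   // small-to-large: s[v] is the smaller set
        totalCost -= (long long)s[u].size() + s[v].size();
        for (int x : s[v]) s[u].insert(x);           // each value moves at most log n times overall
        s[v].clear();
        totalCost += s[u].size();         // cost of the merged set
        cout << totalCost << '\n';
    }
}
```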

More Variety

This same concept, merging the smaller set into the larger one so that each element is moved at most log n times, can be used in other circumstances.

DSU without Path Compression


When we use a DSU without path compression, for example in scenarios like DSU with rollbacks, we still need to keep the time complexity low. This is achieved by a technique similar to small-to-large, usually called union by size: when merging two components, we set the parent of the smaller component's root to the root of the larger component. Then, during a findParent() operation, every edge we traverse towards the root takes us into a subtree that is at least twice as large as the one we came from (think why!), so a find takes at most log n steps.
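A possible sketch of such a union-by-size DSU without path compression is below; the struct, the method names, and the rollback bookkeeping are my own illustration. unite() returns which root was attached so that the union can later be undone.

```cpp
#include <bits/stdc++.h>
using namespace std;

struct DSU {
    vector<int> parent, sz;
    DSU(int n) : parent(n), sz(n, 1) { iota(parent.begin(), parent.end(), 0); }

    // No path compression: we only walk up. Each step up lands in a subtree
    // at least twice as large, so find() takes at most log n steps.
    int find(int v) {
        while (parent[v] != v) v = parent[v];
        return v;
    }

    // Attach the smaller root under the larger one. Returns the root that was
    // attached (or -1 if nothing changed) so the caller can roll it back.
    int unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return -1;
        if (sz[a] < sz[b]) swap(a, b);
        parent[b] = a;
        sz[a] += sz[b];
        return b;
    }

    void rollback(int attached) {          // undo a previous unite()
        if (attached == -1) return;
        sz[parent[attached]] -= sz[attached];
        parent[attached] = attached;
    }
};

int main() {
    DSU d(5);
    int op = d.unite(0, 1);                      // components: {0,1} {2} {3} {4}
    cout << (d.find(0) == d.find(1)) << '\n';    // 1
    d.rollback(op);
    cout << (d.find(0) == d.find(1)) << '\n';    // 0
}
```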

Heavy-Light Decomposition

Given a tree, first preprocess the subtree size of each node. For each node, its heavy child is the child whose subtree size is maximum; the other children are called light children. The advantage? Well, each time we move from a light child to its parent, the subtree size at least doubles. This is because the heavy child's subtree is, by definition, at least as large as the light child's, and the parent's subtree contains both of them.
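A short sketch of this preprocessing follows; the names subtreeSize and heavyChild, and the input format in main, are illustrative assumptions.

```cpp
#include <bits/stdc++.h>
using namespace std;

vector<vector<int>> adj;
vector<int> subtreeSize, heavyChild;

// Computes subtree sizes and records, for every node, the child with the
// largest subtree (-1 for leaves).
void dfsSize(int v, int parent) {
    subtreeSize[v] = 1;
    heavyChild[v] = -1;
    for (int to : adj[v]) {
        if (to == parent) continue;
        dfsSize(to, v);
        subtreeSize[v] += subtreeSize[to];
        if (heavyChild[v] == -1 || subtreeSize[to] > subtreeSize[heavyChild[v]])
            heavyChild[v] = to;
    }
}

int main() {
    int n; cin >> n;                        // n nodes, then n-1 edges, rooted at 1
    adj.assign(n + 1, {});
    subtreeSize.assign(n + 1, 0);
    heavyChild.assign(n + 1, -1);
    for (int i = 0; i < n - 1; i++) {
        int u, v; cin >> u >> v;
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    dfsSize(1, 0);
}
```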

This has two prominent applications:
  • Heavy Light Decomposition of the tree
  • DSU on tree
We won't be discussing the first topic in this post.

DSU on Tree

Problem: Given a tree where each node has a color, find, for each node, the number of distinct colors in its subtree.

First mark the heavy and light children. Now, if at each node we had the set of all colors in its subtree, the answer for that node would simply be the size of that set, since a set by definition contains only distinct elements. So, given these sets for all children of a node, how do we build the set for the node itself? Simple: merge them, and, just as we have been doing all along, merge the smaller sets into the larger one. Since we have already marked the heavy and light children, we know which set is the larger one and which are the smaller ones to merge into it. Here again, every time an element moves, the set it lands in is at least twice the size of the set it came from.

Note that each node initially has a set of size 1, and when we merge we never add anything other than the colors of nodes we have already processed; in other words, the subtree size is always an upper bound on the size of the set. When we add the color of a node, we can think of it as adding the node itself, and by the earlier argument a node can be moved to a new set at most log n times. Hence the total number of additions is bounded by n log n, and the overall complexity is n log n times the cost of a single addition.

What is usually done is that, in an initial DFS, we reorder each adjacency list so that the heavy child sits at its head. Then, during the actual DFS that computes the answers, we always merge the sets of the remaining children into the set of the first (heavy) child, and once we are done merging we return a pointer to that set to our parent. If instead we copied the heavy child's set, the complexity (both time and space) would become quadratic, because we would be processing the elements of the heavy child's set all over again, which defeats the whole point of small-to-large merging. A sketch of this is given below.
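Here is a rough sketch of that solution. Instead of literally reordering the adjacency list, it keeps a pointer to each child's set and simply retains whichever set is larger, which gives the same small-to-large bound; the input format in main and names such as col and ans are assumptions of this sketch.

```cpp
#include <bits/stdc++.h>
using namespace std;

vector<vector<int>> adj;
vector<int> col, ans;

// Returns a pointer to the set of distinct colors in v's subtree.
set<int>* dfs(int v, int parent) {
    set<int>* s = new set<int>{col[v]};
    for (int to : adj[v]) {
        if (to == parent) continue;
        set<int>* child = dfs(to, v);
        if (child->size() > s->size()) swap(s, child);  // keep the larger set
        for (int c : *child) s->insert(c);              // move the smaller one
        delete child;
    }
    ans[v] = s->size();   // number of distinct colors in v's subtree
    return s;
}

int main() {
    int n; cin >> n;                        // n nodes, their colors, then n-1 edges
    adj.assign(n + 1, {});
    col.assign(n + 1, 0);
    ans.assign(n + 1, 0);
    for (int i = 1; i <= n; i++) cin >> col[i];
    for (int i = 0; i < n - 1; i++) {
        int u, v; cin >> u >> v;
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    delete dfs(1, 0);                       // root the tree at node 1
    for (int i = 1; i <= n; i++) cout << ans[i] << " \n"[i == n];
}
```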
