Tuesday 3 May 2011

top k from n (k << n)

Bhai was asked this problem in one of the interviews he faced.

Given n objects and associated scores, find top k objects s.t. k is much smaller compared to n.

The most trivial approach is of course to sort n objects based on their scores and pick top k. The best known comparison based sort will do this in O(n log n) with additional pass to pick top k. And of course, this is not what is expected! We need complexity to be lower than O(n log n).

Let's focus on last part of the problem statement. What additional benefit does it give us to know that k << n? One obvious thing that comes to mind is, we are interested in correct ordering of just top k objects, remaining n-k could be unordered at the benefit of having lower complexity. Thinking in the same direction, one can try to pick k such objects such that remaining n-k are virtually ignored! But, there has to be some mechanism to pick top k. The same thing considered in opposite way will give us better insights - i.e. we get lower complexity at the expense of neglecting ordering of non-top n-k objects; meaning we don't care about them, specifically their position in ordered list.

So, is there any way where we can selectively pick elements s.t. the highest score is picked first then the second highest scored and so on? well.. this suggests none other than a max-heap! a max-heap allows you to get maximum element one at a time with delete operation. Let's see the algorithm.

  1. Build heap of n objects given.
  2. for i=1 to k do
    1. get max element (delete from heap)
Let's see the complexity if we are doing any better than O(n log n). Step 1 above takes O(n). Refer Complexity of building heap for analysis. Step 2.1 takes O(log n) and thus step 2 takes O(k log n). In first read, one might think the complexity is O(n) + O(k log n). Let's analyze more with examples.
  • n = 1024, k = 10
    • n+k log n ~ 1024 + 10 * 10 ~ 1024 -> O(n)
  • n = 1024, k = 200
    • n+k log n ~ 1024 + 200 * 10 ~ 2000 -> O(k log n) which will approach O(n log n) as k -> n
Of course, if k -> n, this algorithm will degenerate to O(n log n)  but in given setting where k << n, case 1 above will result in linear time algorithm.
:)

1 comment: