Median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value. So the median is the mean of the two middle value.
Design a data structure that supports the following two operations:
addNum(num: int) -> None
Add a integer number from the data stream to the data structure.
findMedian() -> double
Return the median of all elements so far.
Example 1
Input: [2,3,4]
Output: 3
Example 2
Input: [2,3]
Output: (2 + 3) / 2 = 2.5
Example 3
For large scale systeam that stream a lot of data, we’d often want to keep track of the median observed. There are a lot of use cases for this! A simple one could be anomaly detection. So we could be monitoring a specific data type coming into the system and looking for large spikes or drops.
Also, fun story — this question was on one of my finals durng my undergrad degree. You can imagine our reactions…
In the example we can see a distinction in which the input is even or odd length.
Odd length - When the input is odd length, finding the median is easy, because our list has a middle.
Even length - When the input is even length, finding the median is simply the average of the “middle-most” items.
What would be the most naive way to do this?
We can store all items in a list, and simply sort it when getMedian
is called. If the list is odd we return the middle elements, otherwise we return the average of the “two middle elements”.
Time Complexity
addNum -> O(1)
findMedian -> O(n* log(n))
Space Complexity
O(n)
Sorted the list every time we call getMedian
, seems sub-optimal to say the least. What can we observe from the naive implementation above? Well, we only need to really keep track of the middle elements!
Keeping track of middle elements… hmm.. sorta sounds similar to keeping track of min and max elements. This should remind us about using a heap!
But heaps are for min
and max
elements. So how can manipulate heaps to get the middle elements.
We have heaps for max and min. And we want the middle element.. So let’s keep two heaps!
1) lowerHalf
will be a maxHeap to store the smaller numbers
2) higherHalf
will be a minHeap to store the larger nuumbers
NOTE
Python only supports min heaps natively
To achieve a maxheap we’ll need to multiply each number by -1
on its way in and out! This is a bit hacky, but hey, life is short
Our general apporach will be to always add an incoming number to the lowerHalf
heap and then to check our validity conditions:
lowerHalf
< all elements in higherHalf
When getMedian
is invoked we can simply return the average between the largest element in the maxHeap
and the smallest element in the minHeap
! If the amount of items observed until now is odd, than we can return the top
element off the heap with the bigger size!
Quick Links
Legal Stuff