295. Find Median from Data Stream

Median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value. So the median is the mean of the two middle value.

For example,

[2,3,4], the median is3

[2,3], the median is(2 + 3) / 2 = 2.5

Design a data structure that supports the following two operations:

  • void addNum(int num) - Add a integer number from the data stream to the data structure.
  • double findMedian() - Return the median of all elements so far.

Example:

addNum(1)
addNum(2)
findMedian() -> 1.5
addNum(3) 
findMedian() -> 2

Follow up:

  1. If all integer numbers from the stream are between 0 and 100, how would you optimize it?
  2. If 99% of all integer numbers from the stream are between 0 and 100, how would you optimize it? (Segment Trees)
  3. how to improve it and make sure the method is thread-safe if the two heaps are shared - Use read and write locks.

Thoughts:

  1. Having two queues, one (right part) puts smallest number on the top, and one (left part) keeps largest element on the top.

  2. Using Binary Search Tree: As the input numbers are random, so the height of the binary search tree is O(logN) We maintain every single node's children's size and it's easy to implement because it just has add operation.

  3. Further Thoughts (Analysis written by @babhishek21)

    There are so many ways around this problem, that frankly, it is scary. Here are a few more that I came across:

    • Buckets!If the numbers in the stream are statistically distributed, then it is easier to keep track of buckets where the median would land, than the entire array. Once you know the correct bucket, simply sort it find the median. If the bucket size is significantly smaller than the size of input processed, this results in huge time saving.@mitbbs8080 has an interesting implementation here.

    • Reservoir Sampling. Following along the lines of using buckets: if the stream is statistically distributed, you can rely on Reservoir Sampling. Basically, if you could maintain just one good bucket (or reservoir) which could hold a representative sample of the entire stream, you could estimate the median of the entire stream from just this one bucket. This means good time and memory performance. Reservoir Sampling lets you do just that. Determining a"good"size for your reservoir?_Now, that's a whole other challenge. A good explanation for this can be found in this StackOverflow answer.

    • Segment Trees are a great data structure if you need to do a lot of insertions or a lot of read queries over a limited range of input values. They allow us to do all such operations fast and in roughly the same amount of time, always.The only problem is that they are far from trivial to implement. Take a look at my introductory article on Segment Trees if you are interested.

    • Order Statistic Trees are data structures which seem to be tailor-made for this problem. They have all the nice features of a BST, but also let you find the k^{th} ​​order element stored in the tree. They are a pain to implement and no standard interview would require you to code these up. But they are fun to use if they are already implemented in the language of your choice.5

  4. Code

class MedianFinder {

    /** initialize your data structure here. */
    private Queue<Long> left;
    private Queue<Long> right;

    public MedianFinder() {
            left = new PriorityQueue();
            right = new PriorityQueue();
    }

    public void addNum(int num) {
        right.add((long) num);
        left.add(-right.poll());
        if(right.size() < left.size())
            right.add(-left.poll());
    }

    public double findMedian() {
        return right.size() > left.size()
            ?right.peek()
            :(right.peek() - left.peek())/2.0; 
    }
}

/**
 * Your MedianFinder object will be instantiated and called as such:
 * MedianFinder obj = new MedianFinder();
 * obj.addNum(num);
 * double param_2 = obj.findMedian();
 */

Java: reverse ordering

class MedianFinder {

    /** initialize your data structure here. */
    private Queue<Integer> l;
    private Queue<Integer> r;
    boolean even;
    public MedianFinder() {
            l = new PriorityQueue();
            r = new PriorityQueue(Collections.reverseOrder());
            even = true;
    }

    public void addNum(int num) {
             if(even){
                l.add(num);
                r.add(l.poll());
            }
            else{
                r.add(num);
                l.add(r.poll());
            }
        even=!even;
    }

    public double findMedian() {
         return even?(r.peek() + l.peek()) / 2.0:r.peek();
    }
}

/**
 * Your MedianFinder object will be instantiated and called as such:
 * MedianFinder obj = new MedianFinder();
 * obj.addNum(num);
 * double param_2 = obj.findMedian();
 */

C++: max_heap

class MedianFinder {
    priority_queue<long> left, right;// max_heap
public:
    /** initialize your data structure here. */

    MedianFinder() {

    }

    void addNum(int num) {
        left.push(num);
        right.push(-left.top()); left.pop();
        if(left.size() < right.size()){
            left.push(-right.top()); right.pop();
        }
    }

    double findMedian() {
        return left.size() > right.size()
            ? left.top()
            : (left.top() - right.top())/2.0;
    }
};

/**
 * Your MedianFinder object will be instantiated and called as such:
 * MedianFinder obj = new MedianFinder();
 * obj.addNum(num);
 * double param_2 = obj.findMedian();
 */

Python

from heapq import *
class MedianFinder(object):

    def __init__(self):
        """
        initialize your data structure here.
        """
        self.heaps = [], []


    def addNum(self, num):
        """
        :type num: int
        :rtype: void
        """
        left, right = self.heaps # min heap
        heappush(left, -heappushpop(right, num))
        if len(right) < len(left):
            heappush(right, -heappop(left))

    def findMedian(self):
        """
        :rtype: float
        """
        left, right = self.heaps
        if len(right) > len(left):
            return float(right[0])
        return (right[0] - left[0])/2.0        


# Your MedianFinder object will be instantiated and called as such:
# obj = MedianFinder()
# obj.addNum(num)
# param_2 = obj.findMedian()

Code: BST (Even if the incoming data ins't random, we can still use red-black tree so that finding median value is super fast.)

struct BST {
    struct node {
        int val;
        int size;
        node* left, *right;
        node(int v) : size(1), val(v) {};
    } *Null, *root;

    BST() {
        Null = new node(0);
        Null -> size = 0;
        root = Null;
    }

    void add(int val, node*& R) {
        if(R == Null) {
            R = new node(val);
            R -> left = R -> right = Null;
            return;
        }
        if(R->val <= val) add(val, R->left);
        else add(val, R->right);
        R->size = R->left->size + R->right->size + 1;

    }

    int rank(int k) {
        node* t = root;
        while(true) {
            int leftSize =  t -> left -> size;
            if(leftSize == k) return t -> val;
            if(leftSize > k) {
                t = t -> left;
            } else {
                k = k - leftSize - 1;
                t = t -> right;
            }
        }
        return -1;
    }
};




class MedianFinder {
public:
    BST* bst;
    MedianFinder() {
        bst = new BST();
    }
    // Adds a number into the data structure.
    void addNum(int num) {
        bst->add(num, bst->root);
    }

    // Returns the median of current data stream
    double findMedian() {
        int sz = bst -> root -> size;
        if(sz % 2 == 0) {
            return 1.0 * (bst -> rank(sz / 2) + bst -> rank(sz / 2 - 1)) / 2;
        } else return bst->rank(sz / 2);

    }

Binary Search: O(n) (Add function may cost o(n).Insert function costs o(n))

private List<double> m_listNum = new List<double>();

public void AddNum(double num) {

    int lintPos = m_listNum.BinarySearch( num ) ;
    if (lintPos >= 0) {
        m_listNum.Insert(lintPos, num);
    } else {
        lintPos = ~lintPos;
        if (lintPos == m_listNum.Count) {
            m_listNum.Add(num);
        } else {
            m_listNum.Insert(lintPos, num);
        }
    }
}

// return the median of current data stream

public double FindMedian() {

    int lintCount = m_listNum.Count ;

    if (lintCount == 0 ) throw new Exception("array is empty");

    if (lintCount % 2 == 0)
        return (m_listNum[lintCount / 2 - 1] + m_listNum[lintCount / 2]) / 2;
    else 
        return m_listNum[lintCount / 2];
}

Reference: StefanPochmann's post and lgn's post

results matching ""

    No results matching ""