Sunday, May 8, 2011

Distorting a Binary Tree

We know that there are three different ways to traverse a binary tree: InOrder, PreOrder, and PostOrder. (Let us ignore other traversals like LevelOrder for now.) Each traversal has its own significance, and you can never think of one traversal technique as a replacement for another. I am going to talk here about which traversal suits best when the tree is being distorted, using two examples.

The links to the left and right subtrees are as important to a tree as the next pointer is to a linked list. These are the pointers/references that keep the structure intact; once they are lost, there is no way to recover them. So when it comes to distorting a tree, I would always prefer PostOrder traversal, where you don't touch the current node until you are done with its subtrees.

Deleting a Binary tree
Consider, for example, the problem of deallocating an entire binary tree. You can't start with the root node: if you deallocate the root first, you have no way to reach the left and right subtrees, so you can't proceed any further. Similarly, with an InOrder traversal, the right subtree would still be unvisited when you deallocate the root. So the most suitable procedure deallocates the left and right subtrees first and comes to the root last.
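To make the order concrete, here is a minimal sketch in Java. Since Java is garbage-collected, the sketch does not literally free memory; instead it records the order in which nodes would be deallocated. The Node class and the delete helper are illustrative names of my own:

```java
import java.util.ArrayList;
import java.util.List;

class TreeDeleteDemo {
    static class Node {
        int value;
        Node left, right;
        Node(int value, Node left, Node right) {
            this.value = value; this.left = left; this.right = right;
        }
    }

    // PostOrder "deallocation": both subtrees are handled before the
    // current node is touched, so no subtree is ever left unreachable.
    static void delete(Node node, List<Integer> freed) {
        if (node == null) return;
        delete(node.left, freed);      // deallocate the left subtree first
        delete(node.right, freed);     // then the right subtree
        node.left = node.right = null; // finally release the node itself
        freed.add(node.value);
    }

    public static void main(String[] args) {
        //       1
        //      / \
        //     2   3
        Node root = new Node(1, new Node(2, null, null), new Node(3, null, null));
        List<Integer> freed = new ArrayList<>();
        delete(root, freed);
        System.out.println(freed); // prints [2, 3, 1]
    }
}
```

Note how the root is always released last, after both of its subtrees.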

Binary tree to Doubly Linked List
Now consider the problem of converting a binary tree into a doubly linked list, where the left pointer of each node should be made to point to its InOrder predecessor and the right pointer to its InOrder successor.
Though the problem talks about an InOrder arrangement, you should not start tackling it with an InOrder traversal. As we discussed for the delete operation, any operation that distorts the tree is best done with a PostOrder traversal.
So, you need to write a recursive procedure to convert a tree to a Doubly Linked List. You can recursively apply the procedure to the left sub-tree and get a leftDll and separately to the right sub-tree to get a rightDll.
Now, you need to form a single DLL from the leftDll, the current root node, and the rightDll. So you need to attach the last node of the leftDll to the root, and then attach the root to the rightDll's head. However, finding the last node of the leftDll is an O(n) operation. To avoid that, instead of returning a plain DLL, always return a circular DLL, so that once you have access to its head, getting its tail is just an O(1) operation.
Now, we have established that a circular DLL is the best data structure to return. But exactly which node should you return? Remember that you expect the left subtree to return a circular DLL arranged InOrder. To preserve the InOrder property, you attach the current root to the end of that circular DLL. Similarly, you link in the circular DLL obtained from the right subtree to form a single circular DLL. You should then return the head node, in other words, the node returned by the left subtree's recursion if present, or else the current node, to maintain the invariant.
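The whole conversion can be sketched as below. This is a minimal illustration, assuming a simple Node class whose left/right fields are reused as the prev/next pointers of the circular DLL; the helper names (join, toCircularDll) are mine:

```java
import java.util.ArrayList;
import java.util.List;

class TreeToDllDemo {
    static class Node {
        int value;
        Node left, right; // reused as prev/next once converted
        Node(int value) { this.value = value; }
    }

    // Joins two circular DLLs and returns the head of the combined list.
    static Node join(Node a, Node b) {
        if (a == null) return b;
        if (b == null) return a;
        Node aTail = a.left, bTail = b.left; // head.left is the tail: O(1)
        aTail.right = b; b.left = aTail;
        bTail.right = a; a.left = bTail;
        return a;
    }

    // PostOrder conversion: handle both subtrees before touching the root.
    static Node toCircularDll(Node root) {
        if (root == null) return null;
        Node leftList = toCircularDll(root.left);
        Node rightList = toCircularDll(root.right);
        root.left = root.right = root; // the root alone is a one-node circular DLL
        return join(join(leftList, root), rightList);
    }

    public static void main(String[] args) {
        Node root = new Node(2);
        root.left = new Node(1);
        root.right = new Node(3);
        Node head = toCircularDll(root);
        List<Integer> values = new ArrayList<>();
        Node cur = head;
        do { values.add(cur.value); cur = cur.right; } while (cur != head);
        System.out.println(values); // prints [1, 2, 3], the InOrder sequence
    }
}
```

Because join returns its first non-null argument, the head of the combined list is the left subtree's head when it exists, or else the current node, exactly the invariant described above.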

Friday, May 6, 2011

Batch calls

Batch Processing :

Consider the two services mentioned below:
UserService - As mentioned in the previous post, it is used to get a social network's user details.
FeedService - Used to get feed information for a given user. It just queries the "feeds" table.

Each feed has a feed owner (the user who posted that feed). So, in order to form complete feed information, the FeedService has to call the UserService to get the feed owner's information (name, profile URL, photo, etc.) for each feed.

Problem:
Since FeedService pings UserService for each and every feed, we ourselves put a lot of load on UserService.

Solution:
For each call to UserService, send a batch (say, 100) of feeds and get the user information for all of those feeds at once (batch processing).

Pros:
1. Load on UserService is reduced.
2. The number of network round trips is reduced by a factor of the batch size (here, 100x).

Cons:
1. We won't always get 100 feeds; sometimes it might be just 10 feeds. So there is a possibility that we are not making the call to UserService efficiently.
2. Extra care must be taken during service shutdown to ensure that all remaining feeds in the cache get processed before shutting down.

The solution can be implemented as follows:
Maintain a local cache of feeds in FeedService. Whenever the cache becomes full, make a call to UserService and get the user information. Here, the cache size is nothing but the batch size.

Sometimes a few feeds can stay in the cache for a long time because FeedService is waiting for the cache to fill. To avoid this scenario, we can set a threshold time within which feeds may reside in the cache. If the cache has not filled even after the threshold time, make the call to UserService irrespective of the number of feeds available in the cache.
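A minimal sketch of this cache-and-flush idea, with a hypothetical FeedBatcher class and a Consumer standing in for the real UserService client (the threshold timer is left out for brevity, but it would simply call flush() periodically):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class FeedBatcher {
    private final int batchSize;
    private final List<String> cache = new ArrayList<>();
    private final Consumer<List<String>> userServiceCall; // stand-in for the real client

    FeedBatcher(int batchSize, Consumer<List<String>> userServiceCall) {
        this.batchSize = batchSize;
        this.userServiceCall = userServiceCall;
    }

    // Cache the feed; when the cache reaches the batch size, flush in one call.
    synchronized void add(String feedId) {
        cache.add(feedId);
        if (cache.size() >= batchSize) flush();
    }

    // Also called by a threshold timer and on shutdown, so no feed is stranded.
    synchronized void flush() {
        if (cache.isEmpty()) return;
        userServiceCall.accept(new ArrayList<>(cache));
        cache.clear();
    }

    public static void main(String[] args) {
        List<List<String>> calls = new ArrayList<>();
        FeedBatcher batcher = new FeedBatcher(3, calls::add);
        for (int i = 1; i <= 7; i++) batcher.add("feed-" + i);
        batcher.flush(); // shutdown: drain the remaining feeds
        System.out.println(calls.size()); // prints 3 (two full batches + one partial)
    }
}
```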

Though this problem looks simple, in the real world care should be taken to identify and use batch processing wherever possible.

Thursday, May 5, 2011

Push and Pull patterns

After exploring the SOA environment at Amazon, I would like to present a few patterns (you could say principles) observed in SOA that everybody should be aware of before designing the actual web services in your architecture.

Pull pattern
 Consider a typical environment in Facebook, which has the two web services below.
 1.  UserService - Gets the Facebook user details like id, name, public profile url etc. The job of this service is to fetch user details from the underlying tables. Remember that the details need not be present in a single table but could be spread across multiple tables. This service aggregates all the details and provides a single API call to get all the details of a particular user.
 2.  BotService - This service takes responsibility for updating the search engine indexes periodically whenever a user changes his profile information. It contacts the UserService periodically to find the users who have updated their profiles in the recent past, and based on their profile information the search indexes are updated.
Current Approach
BotService follows the Pull pattern in the model discussed above. It is the responsibility of the BotService to continuously get user updates from the UserService; the UserService takes no pains to send updates to the BotService.

Problem with Pull Pattern
The general problem with the Pull model is that there may be unnecessary fetches by the pulling service, particularly when the number of updates coming from the source is very small. In the services above, assuming that a user will very rarely change his profile details, BotService adds unnecessary traffic to the UserService by continuously pulling, since most of the time the UserService has no updates to send at all.

Going with the Push pattern - "Don't call me, I will call you".
Instead of the UserService being called by the BotService, let the BotService be called by the UserService. Since the UserService knows whenever a user updates his public profile URL, it calls the BotService, informing it that the particular user has updated his/her public profile URL. The UserService can be made intelligent enough to batch its calls to the BotService when pushing updates. In case latency is an issue, a timeout can be configured in the source, by the end of which it is supposed to send its updates.
The rule of thumb is that the rate at which you get the data and the timeliness with which the data should be processed should guide you on which one to choose - the Push or the Pull.
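A minimal sketch of the push arrangement, using an in-process listener interface as a stand-in for real service-to-service calls (all names here are illustrative, not an actual Facebook API):

```java
import java.util.ArrayList;
import java.util.List;

// Push sketch: UserService owns the data and calls registered listeners
// when a profile changes; BotService never polls.
class PushDemo {
    interface ProfileUpdateListener {
        void onProfileUpdated(String userId);
    }

    static class UserService {
        private final List<ProfileUpdateListener> listeners = new ArrayList<>();
        void register(ProfileUpdateListener l) { listeners.add(l); }
        void updateProfileUrl(String userId, String newUrl) {
            // ... persist the change, then push the notification ...
            for (ProfileUpdateListener l : listeners) l.onProfileUpdated(userId);
        }
    }

    static class BotService implements ProfileUpdateListener {
        final List<String> reindexQueue = new ArrayList<>();
        @Override public void onProfileUpdated(String userId) {
            reindexQueue.add(userId); // re-index only when told to
        }
    }

    public static void main(String[] args) {
        UserService users = new UserService();
        BotService bot = new BotService();
        users.register(bot);
        users.updateProfileUrl("u42", "https://example.com/u42");
        System.out.println(bot.reindexQueue); // prints [u42]
    }
}
```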

Many more interesting SOA patterns to come.
Cheers,
Inba

Saturday, April 30, 2011

Iterative processes

There is a fundamental difference between a recursive process and a
recursive procedure.

A recursive procedure is one where the procedure calls itself (directly or indirectly) at some point.
In contrast, a recursive process is one which has to remember its history of execution.

For eg.
; Clojure code
(defn sum [a b]
  (if (> a b)
    0
    (+ a (sum (inc a) b))))

This is a small piece of code, written in Clojure (a Lisp dialect), which sums up all the integers from a to b. The logic is quite straightforward.

If a > b, then the answer is 0 because there are no elements in the range.
Otherwise, find the sum of all numbers from (a+1) to b and then add 'a' to it.

Typically a function call gets its own stack frame, which stays alive while the function executes. After the function returns (potentially with a value), the stack frame is popped.

So what would (sum 1 5) be?

Assuming the stack grows from left to right, this is how the process would execute.

(sum 1 5)
(+ 1 (sum 2 5))
(+ 1 (+ 2 (sum 3 5)))
(+ 1 (+ 2 (+ 3 (sum 4 5))))
(+ 1 (+ 2 (+ 3 (+ 4 (sum 5 5)))))
(+ 1 (+ 2 (+ 3 (+ 4 (+ 5 (sum 6 5))))))
(+ 1 (+ 2 (+ 3 (+ 4 (+ 5 0)))))
(+ 1 (+ 2 (+ 3 (+ 4 5))))
(+ 1 (+ 2 (+ 3 9)))
(+ 1 (+ 2 12))
(+ 1 14)
15

Take your time to understand why the process looks like this.
[Substitute the parameters passed to the function sum at every point into its definition and you will be able to figure it out.
This model of working out answers is called the substitution model.]

In the above example, what role did the stack play? Looking at the process's structure, we understand that the stack keeps track of the pending operations at any point.
That is, if there were no stack, the process would have looked like this:

(sum 1 5)
(sum 2 5)
(sum 3 5)
(sum 4 5)
(sum 5 5)
(sum 6 5)
0 -> result

We got '0' because nobody remembers that an addition operation is still to be done after each call returns.

But is it absolutely necessary to always have a stack?
If you say
"Yes, how else would I remember what operations are left to be done after the function call returns?", then you have correctly identified the problem.

Now look at this code
(defn sum [a b acc] ; acc should be 0 initially
  (if (> a b)
    acc
    (sum (inc a) b (+ acc a))))

[Clojure's comments start with ';']

In the above procedure we have an extra parameter, an accumulator, which collects
the answer up to that point. Here, using the substitution model, we arrive at a
process structure that looks like the following.

Assuming 'no-op' is used to signify nothing is left to be done.

(sum 1 5 0)
(no-op (sum 2 5 1))
(no-op (no-op (sum 3 5 3)))
(no-op (no-op (no-op (sum 4 5 6))))
(no-op (no-op (no-op (no-op (sum 5 5 10)))))
(no-op (no-op (no-op (no-op (no-op (sum 6 5 15))))))
(no-op (no-op (no-op (no-op (no-op 15)))))
(no-op (no-op (no-op (no-op 15))))
(no-op (no-op (no-op 15)))
(no-op (no-op 15))
(no-op 15)
15

We see that once the stage (sum 6 5 15) arrives, we have essentially found the answer.
All that remains is for the result '15' to be returned by every stack frame.
In this case, since the result is carried forward, there is no real need to maintain
the stack, because no operation is pending.

Note that as the stack depth increases, the memory consumption of the process increases,
so stacks can't be ignored.
Languages like Lisp, Haskell, etc. do a bit of 'magic' called Tail Call Optimization:
when recursion is the last step of the procedure, they
reuse the current stack frame.

So (sum 1 5 0) would look like

(sum 1 5 0)
(sum 2 5 1)
(sum 3 5 3)
(sum 4 5 6)
(sum 5 5 10)
(sum 6 5 15)
15

All the operations happen in the same stack frame, so memory consumption
is nearly constant.

Other languages like C++, Java, etc. make use of the 'for' and 'while' keywords and the
like to arrive at similar characteristics.
This type of process is called an iterative process: stack usage is constant and all the state information is remembered through parameters alone.

[Note that if we had to do any further operation with the result (15 in this case) of a function call at any point, then that information could be maintained only on a stack.]
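For illustration, here is the same accumulator idea written as a loop in Java, where the parameters of the recursive version simply become loop variables and no call stack builds up:

```java
class IterativeSum {
    // The accumulator version translated to a plain loop: (a, acc) are
    // loop variables, updated exactly as the tail-recursive call updated
    // its parameters, so stack usage stays constant.
    static int sum(int a, int b) {
        int acc = 0;
        while (a <= b) {
            acc += a;
            a++;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sum(1, 5)); // prints 15
    }
}
```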

Wednesday, April 27, 2011

Architecture of Recommendation Systems

About?
I recently read a paper which talks about the high-level architecture of recommendation systems. The topic was discussed in a very generic way, not tied to any specific domain. Details on how the database interaction happens, or how the collected statistics are represented in the database, were not discussed. However, the paper was good enough to cover the components of any recommendation system. As expected, the architecture was very clean: each component had a purpose and did not mess with the other components, and that's the reason I am discussing the architecture here.

What is it?
A recommendation system is one which identifies patterns in user behavior and makes predictions based on them. For example, how does Facebook know who is likely to be your friend? How does Amazon know which product you are likely to buy? And how does Google know which ad to show you? All of these are recommendation systems which try to understand your behavior. Sometimes the rules of a recommendation system are hard-coded and hence static, and sometimes they are very intelligent and change the rules dynamically.

Architecture of a Recommendation System
The topmost layer of a recommendation system is always the user interface along with the privacy/security layer, and the bottom layer is the database that persists your findings. What matters is the layer in between. If you observe carefully, you can easily identify three different components working together.

1. Watcher: This component takes care of collecting patterns from user behavior. It might do so visibly, by posting surveys and getting feedback, or it might hide its identity and silently do its job by tracking page hit counts or saving the links the user clicks. The Watcher should come up with an optimal way to represent the data in the database. It need not hit the DB all the time, as the loss of some data is tolerable in such systems, so it can cache the data on its local machine and update the central database periodically.

2. Learner: This component continuously polls the database for updates from the Watcher and interprets the data. Remember that though the user behavior is captured by the Watcher, the Learner is the one that makes sense out of it. Learners can be very complex, having multiple heuristics, and can rank/weigh each of them differently. A Learner can give more weight to the most recent data; it can also infer the user's level of expertise from the user's behavior and weigh different levels of expertise differently.

3. Advisor: This component is visible to the user, most likely a GUI rendered in HTML or Swing, and it shows all the recommendations the Learner has predicted. The Advisor need not always be a dumb display; it can also collect details on the quality of the recommendations, such as the number of times the user accepted a recommendation and the number of times the user said a recommendation was useless (again, this can be either visible or invisible to the user). It can provide this input to the Learner, which in turn can adjust the weights of its heuristics and hence keep learning.

Sunday, April 24, 2011

Algorithms based on majority element

What is majority element in an array?
In an array of size n, if a specific element repeats itself more than half the times (> n/2 times), then it is called the majority element. If no such element exists, then the array is said to have no majority element in it.
Examples:
[3,2,2,3,3] -> 3 is the majority element
[4,4,5,7,3,3,3,4,4] -> 4 is the majority element
[1,2,3,4,5] -> no majority element at all.
[1,1,1,2,2,2] -> no majority element as the majority element has to repeat more than half the times (i.e. 4 times at least in this array)

How to check whether a number is majority element or not?
It is very simple and obvious. Given a number X and an array A, run through the array and count the occurrences of X. If the count is more than A.size()/2, then X is the majority element.

Logic to find majority element?
Suppose X is the majority element of an array. If you find an element Y in the same array which is not equal to X, you can cancel one occurrence of Y and one occurrence of X from the array and still the resultant array will have X as the majority element. This cancellation will reduce the size of both the parts of the array, i.e. the part (may not be contiguous) having majority element, and the one having rest of the elements.
Example: [3,1,2,3,3] -> In this array, 3 is the majority element. If you are able to find one non-majority element (say 1), you can cancel one occurrence of both 1 and 3 and still 3 will be the majority element. i.e. [2,3,3] obtained after cancelling one 3 and one 1 also has 3 as the majority element.

Similarly, take 2 elements which are not majority elements, say Y and Z. If you cancel both of them, the majority element will still be the same. i.e. in the above example, if you cancel two non majority elements 1 and 2, the resultant array will have [3,3,3] in which 3 is majority element once again.

In other words, you can pick any 2 unequal elements from the array and discard them both; the majority element of the remaining array stays the same.


Solution:
Given an array A[0..n-1], your assignment is to find the majority element in it.
Let counter be the variable which records the number of occurrences of the majority element. Initially it will be 0.
Whenever the counter is zero, take the next element in A as the majority candidate. Increment the counter when you see the same element and decrement it when you see a different element. If the counter becomes zero again, pick a new candidate in the same way.

counter = 0
majority_index = -1
// Find the majority candidate
for i in 0..Arr.size()-1
  if counter == 0
    majority_index = i
  end
  counter += (Arr[i] == Arr[majority_index]) ? 1 : -1
end


// Check the correctness
majority_count = 0
for j in 0..Arr.size()-1
  if Arr[j] == Arr[majority_index]
    majority_count ++
  end
end
if (majority_count > Arr.size()/2)
  return Arr[majority_index]
else
  throw "No majority element found"
end

E.g. 1,2,4,2,5,6,3,4,4,2,4,4,4,4,4
counter = 0
Start with 1 as majority element.
The next element is 2, which is not 1, so discard both.
Now 4 becomes the candidate and the counter will be 1.
When 2 comes again, the counter becomes 0, so no candidate survives up to this point.
Next, 5 and 6 get discarded similarly, as they are different.
Likewise, 3 and 4 get discarded, and then 2 and 4 get discarded.
Now the only part left contributing is [4,4,4,4,4], and the counter climbs to 5.
Now check whether 4 is the real majority element in the array. If so, you can return 4.
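The pseudocode above translates directly to Java. This sketch (the technique is often known as the Boyer-Moore voting algorithm) returns null when no majority element exists:

```java
class MajorityElement {
    // Candidate selection followed by a verification pass.
    static Integer majority(int[] arr) {
        int counter = 0, candidate = 0;
        for (int x : arr) {
            if (counter == 0) candidate = x;      // adopt a new candidate
            counter += (x == candidate) ? 1 : -1; // pair off mismatches
        }
        int count = 0;
        for (int x : arr) if (x == candidate) count++;
        return count > arr.length / 2 ? candidate : null; // verify the candidate
    }

    public static void main(String[] args) {
        System.out.println(majority(new int[]{3, 2, 2, 3, 3}));    // prints 3
        System.out.println(majority(new int[]{1, 1, 1, 2, 2, 2})); // prints null
        System.out.println(majority(
            new int[]{1, 2, 4, 2, 5, 6, 3, 4, 4, 2, 4, 4, 4, 4, 4})); // prints 4
    }
}
```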

Problem 2 "Find element which occurs at least half the times"
Good, you now know how to find the majority element. Now, given that you need to find the element which occurs at least half the time in the array, how will you proceed? Remember, this is a different problem: a majority element must occur more than half the time, but the modified problem asks for the element which repeats at least half the time.
Solution:
We can use the majority algorithm discussed above in all cases except when the element to be found occurs exactly half the time in the array. We can change the solution a bit to catch that case as well. First check whether the first element of the array occurs at least half the time; if so, return it. Otherwise, discard the first element. After discarding, the element we seek is guaranteed to occur more than half the time in the remaining array, so the normal majority-element algorithm described above applies.

Monday, April 18, 2011

Thread safety


Definition: A class is thread safe if it behaves correctly when accessed from multiple threads, regardless of the scheduling or interleaving of the execution of those threads by the runtime environment, with no additional synchronization or other coordination on the part of the calling code.
An object in general represents the state of something (forget about behavior for now). Any thread working on that object tries to change its state. It is always desirable to move the object from one consistent state to another. However, when two or more threads operate on it, the object might go into an inconsistent state. Informally, any program that doesn't let the object get into such an inconsistent state is considered thread-safe. As the simplest case, any stateless object is always thread-safe. Stateless objects are not difficult to find: design patterns like Strategy, Singleton, Adaptor, Flyweight, etc. often use stateless objects, and so they are almost always thread-safe.

Atomicity
Any operation that happens as a single unit is said to be atomic. There is no intermediate state during an atomic operation; the operation either happens fully or does not happen at all. In reality, many operations which look atomic are not. For example, ++j looks like a single atomic operation, but it is not: the processor executes a series of instructions to achieve the result, and a thread switch can happen at any intermediate stage, which makes the statement non-atomic. A race condition occurs when the correctness of a computation depends on the relative timing or interleaving of multiple threads at runtime. In general, race conditions can happen any time two threads simultaneously access a shared variable. Compound actions are actions performed in multiple steps, like check-then-act (e.g. lazy initialization) and read-modify-write (the increment operation), and they are problem areas when it comes to multithreading.
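To see why ++ is not atomic, here is a small Java sketch that increments one counter without synchronization and another under a lock. The unsynchronized total can come up short when the threads interleave, while the locked one is always exact:

```java
class CounterDemo {
    static int unsafeCount = 0;
    static int safeCount = 0;
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                unsafeCount++;                       // read-modify-write: not atomic
                synchronized (lock) { safeCount++; } // atomic under the lock
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // unsafeCount may be anything up to 200000 due to lost updates;
        // safeCount is exactly 200000 every run.
        System.out.println(safeCount);
    }
}
```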

Lock
To preserve state consistency, update related state variables in a single atomic operation. Use a synchronized block to achieve atomicity. Every Java object can act as a lock for the purpose of synchronization; these built-in locks are called intrinsic locks or monitor locks. In Java, locks are acquired on a per-thread basis rather than a per-invocation basis; i.e., a thread holding a lock can make any number of invocations to acquire the same lock again and they will all succeed immediately (this is called reentrancy). This is necessary because, for example, a synchronized method of an object calling another synchronized method on the same object (so the lock is taken on the same object) would deadlock if there were no reentrancy.

Guarding state with locks
It is a common mistake to assume that synchronization is needed only when writing to shared variables. For each mutable state variable that may be accessed by more than one thread, all accesses to that variable must be performed with the same lock held. In this case, we say that the variable is guarded by that lock. Every shared, mutable variable should be guarded by exactly one lock.
A common locking convention is to encapsulate all mutable state within an object and to protect it from concurrent access by synchronizing every code path that accesses the mutable state using the object's intrinsic lock. For every invariant that involves more than one variable, all the variables in that invariant must be guarded by the same lock. Use the @GuardedBy annotation for easy debugging.

Performance
Make the synchronized block as small as possible without breaking the required atomicity; otherwise it will become a performance bottleneck. Also note that acquiring and releasing a lock is costly. Avoid holding locks during lengthy computations or operations at risk of not completing quickly, such as network or console I/O.

I have taken these points from the book "Java Concurrency In Practice", which has a very high user rating on Amazon.