Chewbacca

CHord Enhanced With Basic Algorithm Corrections and Concurrent Activation

Proposed Chord Enhancements

There are a few enhancements and optimizations we can make to our system, once the basic thing is working.

LAST UPDATED Wed Dec 3 11:35:31 EST 2003


Basic Optimizations

These optimizations are fairly basic, and some are discussed in the paper.

Identifier Lookup Caching

In a filesystem, some files are frequently accessed. If a file does not belong to a client's local node, then that file will be repeatedly searched for by the local node.

Caching should be done by the Chord Client, where a hashmap should be maintained locally with fileid and node used for the lookup. At such point that a new node returns the file, the cache should be updated. This will save many messages in the system.

Benefit: Eliminating chord lookups on already-fetched files.

Impact: Straightforward implementation in the chord client - does not affect algorithms.

Finger Table Copies

A newly joining node copies the finger table of the predecessor (or successor), as discussed in the paper. This will reduce some number of messages also discussed in the paper.

Benefit: Actually, none, really. Doing this in place of init_finger_table may result in incorrect fingers, because a new node might have an incorrect finger[m] entry. Someone check me on this?

Impact: Modifications to join. Still need to update your farthest fingers as they may be different than your predecessors. [What were they thinking...]


Protocol Optimizations

These changes affect the core protocol, but have a larger payoff and are a bit more involved. They are not listed in any order; refer to the todo list for correct ordering of when they should get implemented.

Balanced Join

As pointed out in group, the hash of the node has nothing to do with the hash of the files it maintains. As such, each node could explicitly maintain a map of the other nodes in the system, and use this to explicitly position newly joining nodes so as to maintain near-perfect spacing between nodes.

Changes: in init_finger_table, when a new node gets the successor and predecessor, it changes its id to the predecessor plus half the difference to the successor. In other words, node N being inserted between nodes with id=2 and id=6 will set its own id=4.

Benefit: A fair attempt at balancing as nodes are inserted. Won't be perfect, but will be pretty good. Note that the SHA hash is supposed to do this probabilistically, but when running on small numbers of nodes (N=256), we have observed "bunching". This is because any slot is equally likely to be the hash point for a given node, resulting in an imbalanced cloud.

Impact: Changes in init_finger_table only. No impact on file server code. Fairly easy change.

Dynamic Load Balancing via "Backsliding"

A problem with the protocol is that load balancing is not addressed. This would be fairly easy to do and beneficial with some assumptions: Backsliding is a technique that whenever you have too many files, reset your ID to a vacant slot behind you that is halfway between your successor and your predecessor, and give half of your files to your successor.

It is important that you slide back halfway between your succ and pred, and not halfway between yourself and your pred, or else the algorithm may never terminate in a busy state.

Leave-triggered backsliding, which doesn't change the algorithm: A leaving node triggers backsliding explicitly in its predecessor, and only if not backsliding itself. The predecessor checks if the distance between itself and its (new) predecessor exceeds some threshold (say N/2 for instance); if so, it leaves (NOT propagating the backslide message), reassigns its slot number to the slot halfway between its new predecessor and successor, and rejoins. Files will be transferred automatically upon rejoin.

The test case would be a cloud (N=8) with nodes 0, 2, 5 initially, then 5 leaves - we want 0 to backslide to position 6 (halfway between succ=2 and pred=2).

It is "safe" because only one node leaves at a time, and only its successor tries a backslide with a temporary increase in its successor's load.

Backsliding could also be done with a new method, "rejoin" to trigger the fileserver code to activate the feature as well as handlers for correctly transferring files to the new node.

It makes sense to recheck your successor's position when you are notified of a change in its position - ultimately via a join/leave or a rejoin operation.

Benefit: Perfect load balancing. Nodes will eventually settle into an even distribution. Leave-triggered backsliding is near-perfect, locally optimal and eventually leaves a near-optimal cloud.

Impact: New algorithm method rejoin(), triggers in fileserver, handlers for correctly transferring files to successor. Note that Leave-triggered backsliding doesn't change the algorithm at all.

Deferred Finger Table Updates - "Lazy Updates"

The idea is to only update the finger tables of a newly joining node and its predecessor when joining the cloud. Other nodes will slowly "discover" the new node by propagating vector-clock timestamped finger tables around the cloud whenever a find operation is performed on files handled by the new node. The benefit is eliminating messages, and also improved responsiveness by eliminating the message spike or flurry of messages around the cloud that occurs when a new node is added.

     Lazy Update News - Wed Nov 26 08:42:54 EST 2003

     This might not have an easy implementation for node deletes.

     For freshen_finger_table(np) - Two cases: np has one that I
     don't have, I have one that np doesn't have.
     
     If np's finger is new to me, and its timestamp is greater than
     my finger's timestamp - new node (on delete, the removed finger's
     slot gets a greater timestamp).
     
     If I have a node that np should have but doesn't, and my finger's
     timestamp is less than np's finger's timestamp - deleted node
     
     PROBLEM: during a find, what if I think a node is there, I have
     an old finger table, and the node is in fact deleted?  I need to
     be able to find someone with more recent knowledge, and I don't
     want to "n-walk."

     Larger issue.  In a partitioning, any node might not be there.
     Need to be able to detect this and conclude node death.  In the event
     of a rejoin, lazy adds will rejoin the node properly... unless it
     is your finger 1 node.  Then, finger[1] consistency is violated.

     If your finger[1] node is gone, need to periodically probe it and
     send it your finger table and a request for new files.  Know that
     your finger 1 and the missing node's finger 1 will both appear
     missing to respective owners.  Rejoin is when finger 1's "come back",
     and it should retrieve any new files and merge them.

     During a find, need to probe the finger; if fails, update finger
     table to next finger (or self), update timestamps, fall back to
     previous finger and probe it, etc.

     Finger table updates should be bidirectional in the lazy case.
     That is the only way a rejoin after a partitioning is going to work.

     Solution is likely we need explicit delete like we have.  If a node
     is detected as missing, the detecting node initiates a cloud delete
     of that node.  If it was legitimate delete, at some point the
     successor or predecessor of that node will be notified, and can
     return timestamped finger tables reflecting the proper delete.  If
     they concur that it was a surprise, their finger tables are left
     intact, and the predecessor begins a probe loop for the lost node.

Some changes to the algorithm are needed.

First change: When joining, only update the finger tables of the new node and its predecessor. Maintain a vector timestamp on finger tables: a timestamp is generated when the new node joins, and the new node and predecessor timestamp their finger tables. This would mean that only some nodes in the cloud have correct finger tables.

Inductively, it can be shown that there are always at least 2 nodes with correct finger tables. Consider the first two nodes in the cloud, then add a third, and so on. Further, finger tables are finger[1] consistent: a predecessor's finger[1] entry always points to its correct successor. (If it didn't, then that means a new node has been inserted in between with no update of the predecessor, and that contradicts the proposed fact that a predecessor is always updated.)

Second change: find_predecessor and closest_preceeding_finger return timestamped finger tables with every call.

Third change: find_successor calls find_predecessor. If the returned node contains a newer finger table than its own, scan the new finger table for any new nodes, and update the existing finger table with new data. While the returned node is the finger[1] entry of the calling node, and the local finger table was just updated, repeat the call to find_predecessor. When updating the existing finger table, note that the new finger table is relative to the returned node, so offsets will need to be adjusted.

Fourth change: update_others (log N messages) is no longer needed. Instead, join calls a new method update_predecessor which needs exactly 1 message.

Why is this better? It still is log^2(N), but only in the worst case. In the worst case, it is log(N) messages during initial find_successor, plus additional message for each of log(N) imtermediate nodes assuming they all have bad finger tables. In the best case, the intermediate finger tables are all good, and it's log(N) messages for a join. In practice, the actual time will be somewhere in between.

The result is that the cloud is mostly correct, by virtue that join() involves calls to find_successor that will update any intermediate out of date nodes.

Benefit: Reduction in messages needed for join, and finger tables are corrected on a "need to know" basis.

Impact: Modifications to algorithm, update_predecessor, timestamping of finger tables. No impact on file system code.

Concurrent Join/Leave via Distributed Mutex

A big problem with the paper is that they suggest periodic polling to overcome the problem of systemwide state disparity that may occur if concurrent joins/leaves are taking place. The problem with polling is twofold: A better alternative is to use a distributed mutual exclusion protocol to lock the cloud while a join/leave is taking place, resulting in concurrent join/leave attempts becoming serialized over the entire system.

A suggestion is to use Raymond's tree-based token algorithm using the chord fingers as an approximation of the binary tree. Token passing will work fine given the main project assumptions:

A more resilient method would be to use voting protocols, possibly El-Abaddi quorum-based, where the quorum could be derived from the tree-like arrangement of nodes in the finger table of the locking node (further thought is needed on what the quorum should be). The benefit of a voting mutex would be correct state during partitions, but would require use of the "lazy" finger table update technique to tolerate partition rejoins. This could work.

Distributed mutex will enable us to have the benefits of explicit state maintenance (the section 4 algorithms in the paper) and yet permit simultaneous joins/leaves.

Benefit: Concurrent join/leave handling without polling.

Impact: Implementation of Raymond Tree distributed mutex algorithm in the join/leave message manager components, no impact to file system semantics. Hard to prototype using current prototype framework, since it is single-threaded, but could simulate the token passing part of it.