LAST UPDATED Wed Dec 3 11:35:31 EST 2003
Caching should be done by the Chord Client, where a hashmap should be maintained locally with fileid and node used for the lookup. At such point that a new node returns the file, the cache should be updated. This will save many messages in the system.
Benefit: Eliminating chord lookups on already-fetched files.
Impact: Straightforward implementation in the chord client - does not affect algorithms.
Benefit: Actually, none, really. Doing this in place of init_finger_table may result in incorrect fingers, because a new node might have an incorrect finger[m] entry. Someone check me on this?
Impact: Modifications to join. Still need to update your farthest fingers as they may be different than your predecessors. [What were they thinking...]
Changes: in init_finger_table, when a new node gets the successor and predecessor, it changes its id to the predecessor plus half the difference to the successor. In other words, node N being inserted between nodes with id=2 and id=6 will set its own id=4.
Benefit: A fair attempt at balancing as nodes are inserted. Won't be perfect, but will be pretty good. Note that the SHA hash is supposed to do this probabilistically, but when running on small numbers of nodes (N=256), we have observed "bunching". This is because any slot is equally likely to be the hash point for a given node, resulting in an imbalanced cloud.
Impact: Changes in init_finger_table only. No impact on file server code. Fairly easy change.
It is important that you slide back halfway between your succ and pred, and not halfway between yourself and your pred, or else the algorithm may never terminate in a busy state.
Leave-triggered backsliding, which doesn't change the algorithm: A leaving node triggers backsliding explicitly in its predecessor, and only if not backsliding itself. The predecessor checks if the distance between itself and its (new) predecessor exceeds some threshold (say N/2 for instance); if so, it leaves (NOT propagating the backslide message), reassigns its slot number to the slot halfway between its new predecessor and successor, and rejoins. Files will be transferred automatically upon rejoin.
The test case would be a cloud (N=8) with nodes 0, 2, 5 initially, then 5 leaves - we want 0 to backslide to position 6 (halfway between succ=2 and pred=2).
It is "safe" because only one node leaves at a time, and only its successor tries a backslide with a temporary increase in its successor's load.
Backsliding could also be done with a new method, "rejoin" to trigger the fileserver code to activate the feature as well as handlers for correctly transferring files to the new node.
It makes sense to recheck your successor's position when you are notified of a change in its position - ultimately via a join/leave or a rejoin operation.
Benefit: Perfect load balancing. Nodes will eventually settle into an even distribution. Leave-triggered backsliding is near-perfect, locally optimal and eventually leaves a near-optimal cloud.
Impact: New algorithm method rejoin(), triggers in fileserver, handlers for correctly transferring files to successor. Note that Leave-triggered backsliding doesn't change the algorithm at all.
Lazy Update News - Wed Nov 26 08:42:54 EST 2003
This might not have an easy implementation for node deletes.
For freshen_finger_table(np) - Two cases: np has one that I
don't have, I have one that np doesn't have.
If np's finger is new to me, and its timestamp is greater than
my finger's timestamp - new node (on delete, the removed finger's
slot gets a greater timestamp).
If I have a node that np should have but doesn't, and my finger's
timestamp is less than np's finger's timestamp - deleted node
PROBLEM: during a find, what if I think a node is there, I have
an old finger table, and the node is in fact deleted? I need to
be able to find someone with more recent knowledge, and I don't
want to "n-walk."
Larger issue. In a partitioning, any node might not be there.
Need to be able to detect this and conclude node death. In the event
of a rejoin, lazy adds will rejoin the node properly... unless it
is your finger 1 node. Then, finger[1] consistency is violated.
If your finger[1] node is gone, need to periodically probe it and
send it your finger table and a request for new files. Know that
your finger 1 and the missing node's finger 1 will both appear
missing to respective owners. Rejoin is when finger 1's "come back",
and it should retrieve any new files and merge them.
During a find, need to probe the finger; if fails, update finger
table to next finger (or self), update timestamps, fall back to
previous finger and probe it, etc.
Finger table updates should be bidirectional in the lazy case.
That is the only way a rejoin after a partitioning is going to work.
Solution is likely we need explicit delete like we have. If a node
is detected as missing, the detecting node initiates a cloud delete
of that node. If it was legitimate delete, at some point the
successor or predecessor of that node will be notified, and can
return timestamped finger tables reflecting the proper delete. If
they concur that it was a surprise, their finger tables are left
intact, and the predecessor begins a probe loop for the lost node.
Some changes to the algorithm are needed.First change: When joining, only update the finger tables of the new node and its predecessor. Maintain a vector timestamp on finger tables: a timestamp is generated when the new node joins, and the new node and predecessor timestamp their finger tables. This would mean that only some nodes in the cloud have correct finger tables.
Inductively, it can be shown that there are always at least 2 nodes with correct finger tables. Consider the first two nodes in the cloud, then add a third, and so on. Further, finger tables are finger[1] consistent: a predecessor's finger[1] entry always points to its correct successor. (If it didn't, then that means a new node has been inserted in between with no update of the predecessor, and that contradicts the proposed fact that a predecessor is always updated.)
Second change: find_predecessor and closest_preceeding_finger return timestamped finger tables with every call.
Third change: find_successor calls find_predecessor. If the returned node contains a newer finger table than its own, scan the new finger table for any new nodes, and update the existing finger table with new data. While the returned node is the finger[1] entry of the calling node, and the local finger table was just updated, repeat the call to find_predecessor. When updating the existing finger table, note that the new finger table is relative to the returned node, so offsets will need to be adjusted.
Fourth change: update_others (log N messages) is no longer needed. Instead, join calls a new method update_predecessor which needs exactly 1 message.
Why is this better? It still is log^2(N), but only in the worst case. In the worst case, it is log(N) messages during initial find_successor, plus additional message for each of log(N) imtermediate nodes assuming they all have bad finger tables. In the best case, the intermediate finger tables are all good, and it's log(N) messages for a join. In practice, the actual time will be somewhere in between.
The result is that the cloud is mostly correct, by virtue that join() involves calls to find_successor that will update any intermediate out of date nodes.
Benefit: Reduction in messages needed for join, and finger tables are corrected on a "need to know" basis.
Impact: Modifications to algorithm, update_predecessor, timestamping of finger tables. No impact on file system code.
A suggestion is to use Raymond's tree-based token algorithm using the chord fingers as an approximation of the binary tree. Token passing will work fine given the main project assumptions:
Distributed mutex will enable us to have the benefits of explicit state maintenance (the section 4 algorithms in the paper) and yet permit simultaneous joins/leaves.
Benefit: Concurrent join/leave handling without polling.
Impact: Implementation of Raymond Tree distributed mutex algorithm in the join/leave message manager components, no impact to file system semantics. Hard to prototype using current prototype framework, since it is single-threaded, but could simulate the token passing part of it.