Kademlia: A Design Specification
- Network Characterization
- The Node
- Implementation Suggestions
- Possible Problems with Kademlia
Kademlia is a communications protocol for peer-to-peer networks. It is one of many versions of a DHT, a Distributed Hash Table.
A Kademlia network is characterized by three constants, which we call alpha, B, and k. The first and last are standard terms. The second is introduced because some Kademlia implementations use a different key length.
- alpha is a small number representing the degree of parallelism in network calls, usually 3
- B is the size in bits of the keys used to identify nodes and store and retrieve data; in basic Kademlia this is 160, the length of an SHA1 digest (hash)
- k is the maximum number of contacts stored in a bucket; this is normally 20
It is also convenient to introduce several other constants not found in the original Kademlia papers.
- tExpire = 86400s, the time after which a key/value pair expires; this is a time-to-live (TTL) from the original publication date
- tRefresh = 3600s, after which an otherwise unaccessed bucket must be refreshed
- tReplicate = 3600s, the interval between Kademlia replication events, when a node is required to publish its entire database
- tRepublish = 86400s, the time after which the original publisher must republish a key/value pair
A Kademlia network consists of a number of cooperating nodes that communicate with one another and store information for one another. Each node has a nodeID, a quasi-unique binary number that identifies it in the network.
Within the network, a block of data, a value, can also be associated with a binary number of the same fixed length B, the value's key.
A node needing a value searches for it at the nodes it considers closest to the key. A node needing to save a value stores it at the nodes it considers closest to the key associated with the value.
NodeIDs are binary numbers of length B = 160 bits. In basic Kademlia, each node chooses its own ID by some unspecified quasi-random procedure. It is important that nodeIDs be uniformly distributed; the network design relies upon this.
While the protocol does not mandate this, there are possibleadvantages to the node's using the same nodeID whenever it joins the network, rather than generating a new, session-specific nodeID.
Data being stored in or retrieved from a Kademlia network must also have a key of length B. These keys should also be uniformly distributed. There are several ways to guarantee this; the most common is to take a hash, such as the 160 bit SHA1 digest, of the value.
Distance: the Kademlia Metric
Kademlia's operations are based upon the use of exclusive OR, XOR, as a metric. The distance between any two keys or nodeIDs x and y is defined as
distance(x, y) = x ^ y
where ^ represents the XOR operator. The result is obtained by taking the bytewise exclusive OR of each byte of the operands.
A Kademlia node organizes its contacts, other nodes known to it, in buckets which hold a maximum of k contacts. These are known as k-buckets.
The buckets are organized by the distance between the node and the contacts in the bucket. Specifically, for bucket j, where 0 <= j < k, we are guaranteed that
2^j <= distance(node, contact) < 2^(j+1)
Given the very large address space, this means that bucket zero has only one possible member, the key which differs from the nodeID only in the high order bit, and for all practical purposes is never populated, except perhaps in testing. On other hand, if nodeIDs are evenly distributed, it is very likely that half of all nodes will lie in the range of bucket B-1 = 159.
The Kademlia paper says that k is set to a value such that it is very unlikely that in a large network all contacts in any one bucket will have disappeared within an hour. Anyone attempting to calculate this probability should take into consideration policies that lead to long-lived contacts being kept in the table in preference to more recent contacts.
A contact is at least a triple:
- the bigendian nodeID for the other node
- its IP address
- its UDP port address
The IP address and port address should also be treated as bigendian numbers.
Kademlia's designers do not appear to have taken into consideration the use of IPv6 addresses or TCP/IP instead of UDP or the possibility of a Kademlia node having multiple IP addresses.
Within buckets contacts are sorted by the time of the most recent communication, with those which have most recently communicated at the end of the list and those which have least recently communicated at the front, regardless of whether the node or the contact initiated the sequence of messages.
Whenever a node receives a communication from another, it updates the corresponding bucket. If the contact already exists, it is moved to the end of the bucket. Otherwise, if the bucket is not full, the new contact is added at the end. If the bucket is full, the node pings the contact at the head of the bucket's list. If that least recently seen contact fails to respond in an (unspecified) reasonable time, it is dropped from the list, and the new contact is added at the tail. Otherwise the new contact is ignored for bucket updating purposes.
Experience has shown that nodes tend to group into two clearly distinguished categories, the transient and the long-lived. This update policy gives strong preference to the long-lived and so promotes network stability. It also provides a degree of protection from certain types of denial of service (DOS) attacks, including, possibly, Sybil attacks, discussed below.
The original Kademlia paper, maymo02, says that the Kademlia protocol consists of four remote procedure calls ("RPCs") but then goes on to specify procedures that must be followed in executing these as well as certain other protocols. It seems best to add these procedures and other protocols to what we call here the Kademlia protocol.
This RPC involves one node sending a PING message to another, which presumably replies with a PONG.
This has a two-fold effect: the recipient of the PING must update the bucket corresponding to the sender; and, if there is a reply, the sender must update the bucket appropriate to the recipient.
All RPC packets are required to carry an RPC identifier assigned by the sender and echoed in the reply. This is a quasi-random number of length B (160 bits).
If the identifiers are initialized to zero or are generated by the same random number generator with the same seed, the probability will be very high indeed.
It must be possible to piggyback PINGs onto RPC replies to force or permit the originator, the sender of the RPC, to provide additional information to its recipient. This might be a different IP address or a preferred protocol for future communications.
The sender of the STORE RPC provides a key and a block of data and requires that the recipient store the data and make it available for later retrieval by that key.
This is a primitive operation, not an iterative one.
The reply to any RPC should also contain an indication of the result of the operation. For example, in a STORE while no maximum data length has been specified, it is clearly possible that the receiver might not be able to store the data, either because of lack of space or because of an I/O error.
The FIND_NODE RPC includes a 160-bit key. The recipient of the RPC returns up to k triples (IP address, port, nodeID) for the contacts that it knows to be closest to the key.
The recipient must return k triples if at all possible. It may only return fewer than k if it is returning all of the contacts that it has knowledge of.
This is a primitive operation, not an iterative one.
The recipient of a FIND_NODE should never return a triple containing the nodeID of the requestor. If the requestor does receive such a triple, it should discard it. A node must never put its own nodeID into a bucket as a contact.
A FIND_VALUE RPC includes a B=160-bit key. If a corresponding value is present on the recipient, the associated data is returned. Otherwise the RPC is equivalent to a FIND_NODE and a set of k triples is returned.
This is a primitive operation, not an iterative one.
This section describes the algorithm that Kademlia uses for locating the k nodes nearest to a key. It must be understood that these are not necessarily closest in a strict sense. Also, the algorithm is iterative although the paper describes it as recursive.
The search begins by selecting alpha contacts from the non-empty k-bucket closest to the bucket appropriate to the key being searched on. If there are fewer than alpha contacts in that bucket, contacts are selected from other buckets. The contact closest to the target key, closestNode, is noted.
The first alpha contacts selected are used to create a shortlist for the search.
The node then sends parallel, asynchronous FIND_* RPCs to the alpha contacts in the shortlist. Each contact, if it is live, should normally return k triples. If any of the alpha contacts fails to reply, it is removed from the shortlist, at least temporarily.
The node then fills the shortlist with contacts from the replies received. These are those closest to the target. From the shortlist it selects another alpha contacts. The only condition for this selection is that they have not already been contacted. Once again a FIND_* RPC is sent to each in parallel.
Each such parallel search updates closestNode, the closest node seen so far.
The sequence of parallel searches is continued until either no node in the sets returned is closer than the closest node already seen or the initiating node has accumulated k probed and known to be active contacts.
If a cycle doesn't find a closer node, if closestNode is unchanged, then the initiating node sends a FIND_* RPC to each of the k closest nodes that it has not already queried.
At the end of this process, the node will have accumulated a set of k active contacts or (if the RPC was FIND_VALUE) may have found a data value. Either a set of triples or the value is returned to the caller.
Alpha and Parallelism
Kademlia uses a value of 3 for alpha, the degree of parallelism used. It appears that (see stutz06) this value is optimal.
There are at least three approaches to managing parallelism. The first is to launch alpha probes and wait until all have succeeded or timed out before iterating. This is termed strict parallelism. The second is to limit the number of probes in flight to alpha; whenever a probe returns a new one is launched. We might call this bounded parallelism. A third is to iterate after what seems to be a reasonable delay (duration unspecified), so that the number of probes in flight is some low multiple of alpha. This is loose parallelism and the approach used by Kademlia.
This is the Kademlia store operation. The initiating node does an iterativeFindNode, collecting a set of k closest contacts, and then sends a primitive STORE RPC to each.
iterativeStores are used for publishing or replicating data on a Kademlia network.
This is the basic Kademlia node lookup operation. As described above, the initiating node builds a list of k "closest" contacts using iterative node lookup and the FIND_NODE RPC. The list is returned to the caller.
This is the Kademlia search operation. It is conducted as a node lookup, and so builds a list of k closest contacts. However, this is done using the FIND_VALUE RPC instead of the FIND_NODE RPC. If at any time during the node lookup the value is returned instead of a set of contacts, the search is abandoned and the value is returned. Otherwise, if no value has been found, the list of k closest contacts is returned to the caller.
When an iterativeFindValue succeeds, the initiator must store the key/value pair at the closest node seen which did not return the value.
If no node lookups have been performed in any given bucket's range for tRefresh (an hour in basic Kademlia), the node selects a random number in that range and does a refresh, an iterativeFindNode using that number as key.
A node joins the network as follows:
- if it does not already have a nodeID n, it generates one
- it inserts the value of some known node c into the appropriate bucket as its first contact
- it does an iterativeFindNode for n
- it refreshes all buckets further away than its closest neighbor, which will be in the occupied bucket with the lowest index.
If the node saved a list of good contacts and used one of these as the "known node" it would be consistent with this protocol.
- Data is stored using an iterativeStore, which has the effect of replicating it over the k nodes closest to the key.
- Each node republishes each key/value pair that it contains at intervals of tReplicate seconds (every hour). The republishing node must not be seen as the original publisher of the key/value pair.
- The original publisher of a key/value pair republishes it every tRepublish seconds (every 24 hours).
- When an iterativeFindValue succeeds, the initiator must store the key/value pair at the closest node seen which did not return the value.
- All key/value pairs expire tExpire seconds (24 hours) after the original publication.
- All key/value pairs are assigned an expiration time which is "exponentially inversely proportional to the number of nodes between the current node and the node whose ID is closest to the key", where this number is "inferred from the bucket structure of the current node".
The writer would calculate the expiration time when the key/value pair is stored using something similar to the following:
- find the index j of the bucket corresponding to the key
- count the total number of contacts Ca in buckets 0..j-1
- count the number of contacts Cb in bucket j closer than the key
- if C = Ca + Cb, then the interval to the
expiration time is
- 24 hours if C > k
- 24h * exp( k / C ) otherwise
In any case, the system is required to mark the stored key/value pair with an original publication timestamp. If this is to be accurate, the timestamp must be set by the publisher, which means that clocks must be at least loosely synchronized across the network.
It would seem sensible to mark the key/value pair with a time to live (TTL) from the arrival of the data, tExpire (one day) or a fraction thereof.
It would seem useful to add to the Contact data structure at least:
- an RTT (round trip time) value or a set of such values, measured in ms
- more IP addresses, together with perhaps
- protocol used (TCP/IP, UDP)
- NAT information, if applicable
- whether the address is local and so reachable by broadcast
Adding an RTT or set of RTTs to the Contact data structure would enable better decisions to be made when selecting which to use.
The round trip time (RTT) to the contact could be as measured using a PING RPC or using a conventional Internet network ping.
Possible Convoy Effects
Implementors should take care to avoid convoy effects. These occur when a number of processes need to use a resource in turn. There is a tendency for such bursts of activity to drift towards synchronization, which can be disasterous. In Kademlia all nodes are requird to republish their contents every hour (tReplicate). A convoy effect might lead to this being synchronized across the network, which would appear to users as the network dying every hour.
Random Number Generation
Implementors should remember that random number generators are usually not re-entrant and so access from different threads needs to be synchronized.
Also, beware of clock granularity: it is possible that where the clock is used to seed the random number generator, successive calls could use the same seed.
For efficiency, the STORE RPC should be two-phase. In the first phase the initiator sends a key and possibly length and the recipient replies with either something equivalent to OK or a code signifying that it already has the value or some other status code. If the reply was OK, then the initiator may send the value.
Some consideration should also be given to the development of methods for handling hierarchical data. Some values will be small and will fit in a UDP datagram. But some messages will be very large, over say 5 GB, and will need to be chunked. The chunks themselves might be very large relative to a UDP packet, typically on the order of 128 KB, so these chunks will have to be shredded into individual UDP packets.
As noted earlier, the requirement that tExpire and tRepublish have the same value introduces a race condition: data will frequently be republished immediately after expiration. It would be sensible to make the expiration interval tExpire somewhat greater than the republication interval tRepublish. The protocol should certainly also allow the recipient of a STORE RPC to reply that it already has the data, to save on expensive network bandwidth.
Possible Problems with Kademlia
The Sybil Attack
A paper by John Douceur, douceur02, describes a network attack in which attackers select nodeIDs whose values enable them to position themselves in the network in patterns optimal for disrupting operations. For example, to remove a data item from the network, attackers might cluster around its key, accept any attempts to store the key/value pair, but never return the value when presented with the key.
A Sybil variation is the Spartacus attack, where an attacker joins the network claiming to have the same nodeID as another member. As specified, Kademlia has no defense. In particular, a long-lived node can always steal a short-lived node's nodeID.
Douceur's solution is a requirement that all nodes get their nodeIDs from a central server which is responsible at least for making sure that the distribution of nodeIDs is even.
A weaker solution would be to require that nodeIDs be derived from the node's network address or some other quasi-unique value.