Computer Science Individual Practical Handout 1 - Week Beginning 15 October 2001
Last modified: October 14 2001

Peer-to-Peer File Sharing

This is an individual project. In term 1 you are required to do some design work and exploratory programming. In term 2 you will be required to provide a prototype implementation of your system. Well in advance of each submission deadline you will be supplied with a pro-forma detailing the assessment process. This is intended to guide you in the allocation of effort to different parts of the project. The deadlines for submission are:
Term 1 deadline: Mon 18 Nov at 17:00
Term 2 deadline: Mon 17 Feb at 17:00

Keep an eye on the CSIP FAQ for updates and clarifications.

Introduction

Your job is to design and implement a peer-to-peer file sharing application. Here a "peer" means some computer system within a network of computer systems. Peer-to-peer is used to stress the idea that communication in the system takes place on an equal basis. This is in contrast with many distributed systems (e.g. a client-server system) where certain computer systems assume a special role in the overall distributed system. Your design should allow Internet users to cooperate in pooling free disk space and network bandwidth for use in serving public-access data. The first part of the project will involve some experimentation and familiarisation with the facilities of the Java platform and the design of the system, the second part will mainly be devoted to developing the implementation.

Many Internet users have useful and popular data, but do not have the server or network resources required to satisfy the demand for that data. For example, an open source developer might wish to publish a software distribution; the average demand for that distribution may be low, but the instantaneous demand after a new release may be too high for the developer's server or network connection to handle. Your design should allow such users to share disk space and bandwidth in a way that spreads the total load over all the participants. Your design should avoid the common requirement that each publisher of data own a network link with capacity equal to the peak demand for that data.

The system you are designing is intended to support large numbers of peers (at least thousands, perhaps millions). In the second part of the project you will be provided with an infrastructure upon which to develop the system. For the moment you should:

Your task

Here is a brief description of the system you are being asked to design and provide a prototype implementation for: Internet users, perhaps thousands or millions of them, will volunteer their machines for use as servers in your system. Publishers of data should be able to insert named documents into your system, causing them to be stored (in whatever manner you determine) on the servers. A publisher should be able to update a document without changing its name. A consumer of data should be able to retrieve a document from your system using its name, and verify the authenticity of the resulting data. You cannot assume that all the participating servers are trustworthy and reliable.

Your design should be oriented to providing the implementation to an interface along the following lines:

class InsertException extends Exception{}
class GetException extends Exception {}

interface Publish {
    public void insert(byte[] document, byte[] name) throws InsertException;
    public byte[] get(byte[] name) throws GetException;
    public void leave();
}

interface FileSystem{
    public Publish join();
}
The idea here is that we can join the file system which provides us with a Publish object that allows us to insert named documents and get the document corresponding to a given name. We have avoided some complication by choosing byte arrays for the document and name data structures but you might want to choose something more complex.

Your task in part one of the practical is twofold. The first task involves some small programming tasks to (re-)acquaint you with some aspects of the Java platform. Each of these considers some aspects of Java that could be useful in carrying out part 2 of the practical.

Part 1a - Using Java

In this part of the project you will review some aspects of the Java platform. Before commencing on this part of the project it is worth reviewing some material on the Java platform e.g. Chapter 4 of Java in a Nutshell, Third Edition.

Threads and Streams

Before commencing this task, please consult lecture notes 3, 4 and 5 from the Advanced Java Programming thread in CS2. This covers I/O and threads. In addition the threads tutorial provides lots of useful information on threads.

One aspect of the design you will need to consider is how to get round the potential for very slow or broken links to other peers. One approach to this is to use threads to permit simultaneous reading from many different input streams. In this small exercise you should implement the following interface:

class ReadThreadException extends Exception {}

import java.io.*;

interface Message {
    public void setMessageSize(int n);
    public void readThread(DataInputStream in, int start, int length)
	throws ReadThreadException;
    public boolean dataValid(int n);
    public byte[] getData(int start, int length);
}
Here, the idea is that we can set the length of the message we want to read and then use readThread(in,start,len) starts off a thread to read len bytes from in storing it in the message beginning at byte start. dataValid tellus us if a particular byte has been read from some file and getData just gets some part of the message starting at start of length length. Some issues you might want to consider are:
  1. We should be capable of running several threads at once - how careful should we be that they do not interfere?
  2. Suppose we can't read all the data from a particular file because it is too short - what do we do?
  3. Suppose the file just hangs up - how do we ensure we can continue?

Using Sockets

Before commencing this task, consider your first year notes on Client-Server systems and sockets in Java.

The aim in this exercise is to build a small Client-Server system where the servers keep (message,id) pairs and the clients recover a message on the basis of the id. The aim here is to give the Client a choice of servers by providing a number of different options for the server via the addServer method. If a server fails to deliver in a reasonable time or explicitly fails for any reason, the client can use an alternative server to recover the message.

class SetupServerException extends Exception {}

interface SetupServer{
    public void establish(int port) throws SetupServerException;
    public void addPair(byte[] message, byte[] id);
}

interface Client {
    public void addServer(String Addr, int port);
    public byte[] getMessage(byte[] id);
}
The issues here are:
  1. How can we detect when a server has failed?
  2. How do we decide which server to try next?

Signatures and Digests

Before commencing this task, please have a look at the generating and verifying signatures tutorial. This provides a good overview of some of the security features of the Java platform. In addition, the exchanging files tutorial includes a good demonstration of how to use the keytool to manage pairs of keys. Chapter 4 of Java in a Nutshell is particularly useful here.

The aim here is just to package some standard code in a couple of classes. You will aso need to use the keytool to create some public and private key pairs. The main objective is for you top read and understand the basis of security. In particular the use of the Secure Hash Algorithm (SHA) to generate digests.

Your goal in this exercise is to implement the interface:

class GetFailedException extends Exception {}

interface PrepareSignedMessage {
    public void getMessageFromFile(String file) throws GetFailedException;
    public byte[] getDigest();
    public byte[] signature(String keyFile, String filePass,
                            String signer, String signerPass);
}

interface CheckSignature {
    public void getMessageFromFile(String file) throws GetFailedException;
    public byte[] getDigest();
    public boolean checkSignature(byte[] signature, String filePass,
                                  String Signer);
}
Here PrepareSignedMessage generates the signature for a message given the names of a keyfile and the appropriate passwords. The other class checks that a signature is correct given the message and the signature. In both cases we include the method getDigest that will calculate an SHA digest of the message.

Part 1b - Your Design

Your task in part 1b is to supply the following deliverables:
  1. Identify the most important non-functional requirements of the system and identify the level to which your system will implement the non-functional requirements. Identify how you propose to measure the extent to which your system achieves these non-functional requirements.
  2. Provide a detailed design for the system that is sufficiently detailed that it is possible to predict the likely extent to which your system meets the non-functional requirements outlined above. Your design should try to address all the non-functional requirements you have identified.
  3. Provide detailed arguments that your system meets all the requirements. Your arguments should be as convincing as possible and may include some code that simulates the behaviour of the full system. Notice that this is significantly simpler than the full task because you do not need to consider how to handle the distribution in the system. Your simulation should illustrate how data is distributed around the nodes in the system and how this evolves as the network of peers grows.
  4. User-level documentation that explains the functionality, limitations and performance of the system

Notice that at the moment there is no provision to delete documents added to the system. You should consider approaches to deal with old content.

The challenge of this project is to design it so that the entire system is homogeneous. In particular you do not want any participant to have special powers or for it to depend on some central resource. You should also assume that the users of the system want to devote no effort to running the system so it should not require significant set-up or maintenance time. The software running on each of the participants in the system should be the same. You should think about how such a system will grow. It is likely that there will many small islands of use of the system that will slowly merge into one very big system as the islands merge together. It may also be the case that some participants are only connected intermittently and that through failures of network connections etc nodes in the system may become disconnected.

In constructing your models you will need to consider:

The document naming scheme is an important part of the system. The idea of this kind of system is that access to content will probably be via a web like interface that actually hides the name from the consumer of data (most URLs are not particularly memorable, but you never see them because they are disguised as links on pages). This means you can have a scheme that uses quite unmemorable names. You could consider using either content hashes (a cryptographic hash of the document's content) or public keys as document names. In the latter case, you could demand that publishers include a digital signature in each document; consumers could check the signature against the name/key, and your servers could reject inserts (including updates) for which the name/key did not match the signature. One place to look for naming ideas is the keys and searching section of the FreeNet paper.

Chord Interface

A distributed hash system, like Chord, provides a way of mapping that maps keys to nodes. This can be useful in ensuring even allocation of keys to nodes in the system Keys are arbitrary 160-bit values; you'll probably want to use SHA to generate them in some way based on your document names (if your names are digests then they are also 160bits long). You might use Chord to map a document name to the server that stores the document.

Chord is described on a separate page. You don't have to understand its inner workings in order to use it, so you can safely skip the description. However, you may find that understanding Chord will give you better opportunities for optimisation. The interface we give here is in Java in order to simplify your design work.

Chord assigns every participating node a 160-bit ID. Chord maps key k to the node with the smallest ID that follows k; that node is called k's successor. A Chord node does not keep a complete list of all other Chord nodes in order to perform this mapping. Instead, each node knows about a few other nodes, and sends "find successor" messages to those other nodes to help it map keys to nodes.

The API we use for our Chord-like system looks like this:

interface Location {
    public byte[] id();
    public String addr();
}

interface Chord {
    public Location lookup(byte[] key);
    public void leave();
}

interface Change{
    public void left(Location l);
    public void joined(Location l);
}

interface StartChord {
    public Chord join(Change c);
}
The Location class provides the Chord id and IP address of a Chord node. In the Change class lookup(key) Returns the Chord ID and IP address of key's successor Chord node. This typically involves sending find successor messages to other Chord nodes. The leave allows the node to leave the Chord system. The join function causes the node to join the Chord system. At this point two additional functions are included, one that is called when a node close to the current one leaves the system and the other when a node close to the current one joins the system. In both cases, the location parameter is the location of the node leaving or joining.

In your design You can change the details of this interface if you need to, as long as the change is consistent with the Chord implementation. If you change the interface, your report should include an outline of how to implement your changes.

Other peer-to-peer systems

Peer-to-peer systems are particularly interesting because they offer an interesting way of structuring services. In particular, with careful design they offer one approach to providing high availability and scalability from large collections of commodity hardware and software components. Here are some examples of peer-to-peer systems:

Napster allows people to share music files across the Internet. It involves a central server, containing a database of everybody's music files. You ask the central server to perform a lookup for you, which returns the IP addresses of a few Napster users who have the music you want. Then you directly contact one of those users' computers and ask for the file. The main drawback of Napster is the central server, which is a potential performance and reliability problem. Napster retrieves documents using keyword search, which you don't need to implement; you should identify documents using unique IDs or names.

GNUtella provides distributed search, as well as direct peer to peer transfer of the resulting files. Each GNUtella node knows about a few other neighbour nodes; this results in a mesh of nodes. When a node wants to find a file, it floods a query for the file over the whole mesh, starting with the nodes it knows about. When a node receives a query, it tries to match that query against locally-stored files. If there's a match, it returns the file to the original querier. Otherwise the node forwards the queries to its neighbours. This approach eliminates the central component of Napster. However, every node is involved in every lookup, so GNUtella doesn't scale well to large numbers of nodes.

FreeNet provides an interface similar to the one we are asking for in this assignment. Each document has an identifier, clients ask for documents using identifiers, and FreeNet figures out which server node is likely to store a document based on the identifier. You may find FreeNet a useful source of ideas about naming and authenticity checking. FreeNet provides anonymity for both consumers and publishers of data, and makes malicious deletion of data difficult. However, FreeNet's focus on anonymity causes it to sometimes fail to find documents that exist. You are not required to implement anonymity.

The Eternity Service is a distributed storage service that guarantees that documents will never disappear. Its guarantees are quite strong; it can even defend against legal attacks by governments. You are not required to provide eternal storage.

Your report

Your report should be structured according to the requested set of deliverables. You will be provided with a self assessment document well before the submission date that will help in identifying particular points we would like to see covered in the design process.

In constructing your report you should check that the reader will be able answer (and justify their answer) to the following questions about your system.

Resources

This section contains a collection of resources that you will find useful in completing this practical. The links provided point to basic introductions to a range of topic relevant to the practical.