What makes Clay codes so important, and what are their benefits?
Any erasure code is useful to society or industry if we can show it is optimal within known bounds, and Clay codes are optimal in four respects. First, they have the smallest possible storage overhead, because they are MDS codes. Second, they have the smallest possible repair bandwidth, because they are regenerating codes. Third, when one node fails, you can call upon the other helper nodes to fix it. Finally, industry is also concerned with how much you read versus how much you download: you may download little yet still read a lot of data, and that reading still disturbs the machine. A code that reads no more than it downloads is called optimal-access, and Clay codes allow us to be optimal in that respect as well.
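To make the repair-bandwidth advantage concrete, here is a minimal sketch comparing naive Reed-Solomon repair (download k full blocks) against the MSR cut-set bound that regenerating codes such as Clay codes achieve (each of d helpers sends only a fraction of its block). The parameters (n, k, d) and the block size are illustrative, not from the interview.

```python
# Hedged sketch: repair bandwidth of plain Reed-Solomon (RS) repair
# versus the minimum-storage-regenerating (MSR) cut-set bound that
# Clay codes meet. All parameter values below are illustrative.

def rs_repair_bandwidth(k: int, alpha: int) -> int:
    """RS repair: download k full blocks of size alpha each."""
    return k * alpha

def msr_repair_bandwidth(k: int, d: int, alpha: int) -> float:
    """MSR cut-set bound: each of d helpers sends alpha/(d - k + 1)."""
    return d * alpha / (d - k + 1)

n, k, d = 14, 10, 13   # d helper nodes among the n - 1 survivors
alpha = 4096           # block (node) size in symbols, illustrative

print(rs_repair_bandwidth(k, alpha))      # 40960 symbols
print(msr_repair_bandwidth(k, d, alpha))  # 13312.0 symbols
```

Here MSR repair moves roughly a third of the traffic that RS repair does; the gap widens as the block size and code length grow.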
What do you see on the horizon that is interesting in regenerating and erasure codes? Are you working on any interesting problem domains?
Two things. First, in regenerating codes there is still one hole that needs to be filled, namely the number of layers. If you want the minimum possible repair bandwidth, which is the least amount of network traffic, Clay codes are good, but they require a certain number of layers: for a block length of 20, it is 1024 layers of an RS code. Industry may say, "I am only willing to do 16 layers," and for that nobody has an answer. So the question is: if I limit the number of layers because I want to work with smaller chunks, what is the best you can do for me? I understand you can give me everything a Clay code can, but can you give me a halfway point that gets me the best of both worlds? A former student of mine, Rashmi, who is now at CMU, together with another student, Nihar, published a paper called Piggybacking in which they took only two layers of Reed-Solomon codes, and that already got attention because it reduces the bandwidth significantly. So the natural next question is: Piggybacking does something with two layers, and the Clay code does all of this with many layers – what is the trade-off in between? What can you do to reduce the number of layers? That is an open problem.
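The 1024 figure mentioned above falls out of a simple formula: for a Clay code with r = n − k parity nodes, the number of layers (the sub-packetization) is r raised to the power ⌈n/r⌉. A one-function sketch, assuming the (20, 16) parameters from the discussion:

```python
# Hedged sketch: number of layers (sub-packetization) of a Clay code
# with r = n - k parity nodes, namely r ** ceil(n / r).

import math

def clay_layers(n: int, k: int) -> int:
    """Sub-packetization of a Clay (MSR) code with r = n - k parities."""
    r = n - k
    return r ** math.ceil(n / r)

print(clay_layers(20, 16))  # 1024, the figure cited for block length 20
```

This exponential growth in n/r is exactly why industry balks at large block lengths, and why limited-layer constructions such as Piggybacking matter.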
We are really in the business of using erasure coding wherever it is applicable. One of my students is working on streaming codes. Nowadays data is streamed: you have a stream of packets coming in, and packets may occasionally get dropped, so you have to build redundancy into the stream so that losses do not become unacceptable. You also want to do this with very low latency, meaning that as streaming packets arrive and you want to decode a particular packet, you are willing to delay only by a few packets – maybe three or four. If this is the fifth packet, then by the time the ninth packet comes in, I want you to have decoded the fifth. That puts a constraint on the erasure coding, because it means I cannot build a long code, wait for the entire codeword to arrive, and then decode; I have to do it on the fly. There was a group at MIT which developed a class of such codes, very nicely built using Galois-field arithmetic. My student Nikhil has come up with a construction that solves the general problem with a field size that is the square of the block length – so if your block is 9 packets, its square is 81, and the Galois field must have 81 elements. That is not a problem, because in finite-field arithmetic a field of size 256 is very common. But in many cases we can reduce the field size to just the block length, 9, and you cannot do it with less. That is a very recent result; he is just writing it up for a paper. These are called streaming codes.
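The decoding-delay constraint can be illustrated with a deliberately trivial scheme – not the MIT or Nikhil construction, just a toy delayed-repetition code: each transmitted slot carries the current packet plus a copy of the packet sent T slots earlier, so any single erased packet is recovered within a delay of T.

```python
# Hedged toy sketch, NOT the constructions discussed above: a trivial
# delayed-repetition streaming code. Slot i carries (s_i, s_{i-T}),
# so an erased packet i is recovered from slot i + T, within delay T.

T = 4  # decoding-delay budget: packet i must be decoded by slot i + T

def transmit(source, erased):
    """Return received slots; erased slot indices become None."""
    slots = []
    for i, s in enumerate(source):
        echo = source[i - T] if i >= T else None
        slots.append(None if i in erased else (s, echo))
    return slots

def decode(slots):
    """Recover each packet directly, or from its echo T slots later."""
    out = {}
    for i, slot in enumerate(slots):
        if slot is not None:
            out[i] = slot[0]
            if slot[1] is not None:
                out.setdefault(i - T, slot[1])  # late copy fills an erasure
    return out

received = transmit(list(range(10)), erased={5})
print(decode(received)[5])  # 5, recovered from the echo in slot 9
```

Real streaming codes achieve far better rates than this rate-1/2 repetition; the point is only that decoding must finish within the delay window rather than waiting for a whole codeword.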
Isn't it expensive to run erasure codes on a channel – say, your streaming code, which tries to optimize redundancy for fault-tolerant transmission of packets?
We have been looking into this aspect. First, it depends on the block length of the code. For many settings there is a channel model that says we are building the code to handle certain types of erasures. Take a window over the stream of packets. One case is a burst erasure, where a burst of consecutive packets drops; the other is a smaller number of random erasures. If you make the block length small, the algorithm to handle a burst is very simple, but for random erasures we have to do the equivalent of decoding an RS code – though again, for short block lengths we can improvise. So far we have only shown that such a code exists and that the field size is small; what lies ahead is to minimize or optimize the field size.
We are addressing one channel model just now, and there are a lot of things we still need to work on.
Another thing we are interested in is private information retrieval, which means retrieving information from a database without letting the database know what you are retrieving. Computer scientists have shown how to do it. But a coding theorist says: the computer scientists' schemes are good, but they take up a lot of storage; we can use erasure coding to reduce the storage and still do everything the computer scientists allow you to do. These are called private information retrieval codes.
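The classical (uncoded) two-server PIR idea that the coded schemes improve upon can be sketched in a few lines: send one server a random subset S of indices and the other server S with index i toggled; XORing the two answers yields record i, while each server alone sees only a uniformly random subset. This is the textbook scheme, not the coded variant discussed above.

```python
# Hedged sketch of classical two-server PIR (the uncoded baseline):
# each server XORs the records at the requested indices; the two
# request sets differ only in the secret index i.

import random
from functools import reduce

db = [0b1010, 0b0111, 0b1100, 0b0001]  # toy 4-record database

def answer(indices):
    """A server XORs together the requested records."""
    return reduce(lambda x, y: x ^ y, (db[j] for j in indices), 0)

def pir_fetch(i, n):
    s1 = {j for j in range(n) if random.random() < 0.5}
    s2 = s1 ^ {i}  # symmetric difference: differs from s1 only at i
    return answer(s1) ^ answer(s2)

print(pir_fetch(2, len(db)) == db[2])  # True
```

The storage cost the coding theorist objects to is visible here: both servers hold a full replica of the database, and PIR codes reduce exactly that replication.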
Then there is something called coding for computation, which brings us to the MapReduce operation on large datasets, say, in a Hadoop cluster. Suppose you are doing a word count on a huge file: you break it up into a hundred fragments, count the word occurrences in each, and add them all up to get the total. The Map phase is where you compute the number of occurrences, and the Reduce phase is where you collect them together. In between Map and Reduce there is the Shuffle phase, and erasure coding can cut the shuffle time by duplicating the computation – bringing the operation down from 900 seconds to 270 seconds in one experimental instance. That is the kind of return one can expect. There is a USC professor who has done a nice job of showing how erasure coding can be applied here. We are looking at that.
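For readers unfamiliar with the three phases, here is a minimal sketch of the plain (uncoded) word-count flow described above; the coded variant trades extra Map computation for a shorter Shuffle phase, but only the baseline flow is shown.

```python
# Hedged sketch of uncoded MapReduce word count: split the file into
# fragments, count per fragment (Map), route counts to per-word
# reducers (Shuffle), and sum them (Reduce).

from collections import Counter, defaultdict

text = "the cat sat on the mat the end"
fragments = [text.split()[i::3] for i in range(3)]  # 3 "machines"

# Map: each fragment independently counts its words
mapped = [Counter(frag) for frag in fragments]

# Shuffle: route every (word, count) pair to that word's reducer
shuffled = defaultdict(list)
for counter in mapped:
    for word, c in counter.items():
        shuffled[word].append(c)

# Reduce: sum the partial counts per word
totals = {word: sum(cs) for word, cs in shuffled.items()}
print(totals["the"])  # 3
```

In the coded scheme, each machine maps extra fragments so that shuffle messages can be combined (coded) before transmission, which is where the 900-to-270-second reduction comes from.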
After distributed storage, these are some of the novel lines of work that are going on.
How can someone get into erasure coding?
I gave my NPTEL lectures on erasure coding some years ago. For regenerating codes, we wrote a 45-page survey which was published in Science China Information Sciences; it just appeared in October of this year. It gives an overview of the field and is the most detailed survey of it. I believe it is accessible online.
There is another angle: the complexity angle. If you are willing to give enough computational resources, computational complexity can be minimized, and if you can come up with an algorithm, someone can come up with an implementation. For that, you should know the various thresholds and bounds; you should be aware of the trade-offs and find the codes that operate at them.