## Authors

### M. Nikhil Krishnan

### Myna Vajha

### Vinayak Ramkumar

### Birenjith Sasidharan

### S. B. Balaji

### P. Vijay Kumar

Indian Institute of Science, Bengaluru

## Abstract

This article deals with the reliable and efficient storage of ‘Big Data’, by which is meant the vast quantities of data that are stored in data centers worldwide. Given that storage units are prone to failure, to protect against data loss, data pertaining to a data file is stored in distributed and redundant fashion across multiple storage units. While replication was and continues to be commonly employed, the explosive growth in amount of data that is generated on a daily basis, has forced the industry to increasingly turn to erasure codes such as the Reed-Solomon code. The reason for this is that erasure codes have the potential to keep to a minimum, the storage overhead required to ensure a given level of reliability. There is also need for storing data such that the system can recover efficiently from the failure of a single storage unit. Conventional erasure-coding techniques are inefficient in this respect. To address this situation, coding theorists have come up with two new classes of erasure codes known respectively as regenerating codes and locally recoverable codes. These codes have served both to address the needs of industry as well as enrich coding theory by adding two new branches to the discipline. This article provides an overview of these exciting new developments, from the (somewhat biased) perspective of the authors.

## 1 Introduction

The setting of the work on developing erasure codes for the storage of Big Data is that of a large data center. The total amount of data stored in 2018 across data centers worldwide, is estimated to be in excess of $1400$ exabytes [1]. These centers are very expensive to build and operate. For example, the NSA data center in the US is estimated to have cost several billion dollars to build, consume about 65MW of power each year and use over a million gallons of water per day [2]. Thus while arguably, the most important consideration in data storage is that of protection against data loss, given the explosive growth in the amount of data generated and the costs involved in storing such data, minimizing storage overhead is an important second consideration. Yet another consideration, that has recently risen in importance, is that of efficiently handling the commonplace occurrence of the failure of an individual storage unit. The focus of this article is on identifying efficient means of storing data while keeping all three considerations in mind. We note as a disclaimer, that the article is not intended to be an unbiased survey of the discipline, as the article emphasizes those aspects of the discipline to which the authors have had greater contribution. A more detailed and balanced coverage of the topic can be found in the recent survey article, also by the authors [3].

### 1.1 Replication Versus Erasure Coding

The key strategy adopted to protect against data loss, given that individual storage units are prone to failure, is to store data pertaining to a single file in distributed and redundant fashion across multiple storage units [3]. The simplest means of introducing redundancy is replication of the data file, with triple replication being in common use [4], see Figure 1.

A more efficient option is to use an $[n,k]$ erasure code. Figure 2 shows the procedure for encoding data using an

erasure code. In an $[n,k]$ erasure code, the data file is first split into $k$ fragments. To this, an additional $m=(n-k)$ redundant fragments are added making for a total of $n$ fragments. Each fragment is stored on a different storage unit. Within the class of erasure codes, maximum distance separable codes (MDS) are the most efficient in terms of offering reliability for a given amount of storage overhead. An $[n,k]$ MDS code has the following defining property. The entire data file can be recovered if one has access to *any* collection of $k$ fragments. We will refer to this as the ‘any $k$ of $n$’ property. Thus, an $[n,k]$ MDS code can recover from the failure of any $\leq (n-k)$ fragments. To protect against the failure of any $\ell$ nodes, a replication code must create $(\ell+1)$ replicas, resulting in a storage overhead of $(\ell+1)$. In contrast, the storage overhead of an MDS code that is resilient against $\ell$ failures has overhead $\frac{n}{(n-\ell)}$. For example, with $\ell=2$ and $n=6$, the storage overheads of the two schemes, replication and erasure coding, are respectively given by $3$ and $1.5$.

#### Finite Fields

The best known of all MDS codes is the Reed-Solomon (RS) code [5]. The symbol alphabet of an RS code is a finite field $\mathbb{F}_q$ (e.g. Ch.3 [6]). A finite field $\mathbb{F}_q$ is a collection of $q$ elements together with two operations, addition and multiplication that obey the rules we are accustomed to such as $a(bc)=(ab)c=abc$, $a+b=b+a$, $a(b+c)=ab+ac$ etc. As an example, a finite field $\mathbb{F}_3$ of size $q=3$ is composed of the elements $\{0,1,2\}$ along with two operations: addition $\! \! \pmod{3}$ and multiplication $\! \! \pmod{3}$. The corresponding addition and multiplication tables are presented in Figure 3.

Similar addition and multiplication tables can be generated for finite fields of size $q$ where $q$ is a prime number such as $2,3,5,7,\cdots$. In general, finite fields of size $q$ exist whenever $q$ can be expressed as power of a prime number $p$, i.e., $q=p^e$ for some positive integer $e$. However, the arithmetic there is more involved. For our purposes, it suffices to imagine that we are working in some suitably large finite field. The explanation from here on is agnostic to the inner workings of operations in the finite field.

#### Reed-Solomon Code

We explain in brief, the construction of an $[n,k]$ RS code. Let the symbols $(a_0,a_1,\cdots,a_{k-1})$, each taking on values in a finite field $\mathbb{F}_q$, represent the $k$ message symbols. Let $(x_0,x_1,\cdots,x_{n-1})$ be an arbitrary collection of $n$ distinct elements from $\mathbb{F}_q$. Let the polynomial $f(x)$ be defined by:

\begin{eqnarray*}

f(x) & = & \sum_{i=0}^{k-1} a_i \ \prod^{k-1}_{\begin{array}{c} j=0 \\ j \neq i \end{array}} \frac{(x-x_j)}{(x_i-x_j)} \ \ := \ \sum_{i=0}^{k-1} b_i x^i .

\end{eqnarray*}

Then clearly, $f$ is a polynomial of degree $(k-1)$ such that

\begin{eqnarray*}

f(x_i) \ = \ a_i, & & 0 \leq i \leq (k-1).

\end{eqnarray*}

The $n$ code symbols in the RS codeword corresponding to message vector $(a_0,\cdots,a_{k-1})$ are precisely the $n$ values $(f(x_0),f(x_1),\cdots,f(x_{n-1}))$. The $k$ message symbols are the values $f(x_j)$, of $f$ when $f$ is evaluated at $(x_0,x_1,\cdots,x_{k-1})$. The $(n-k)$ redundant symbols of an RS code are the values $\{ f(x_j) \mid k \leq j \leq (n-1) \}$.

The RS code derives its ‘any $k$ of $n$’ property from the fact that the polynomial $f$ (and hence the message symbols $\{a_i =f(x_i) \mid i=0,1,\cdots,k-1\}$) can be determined from knowledge of any $k$ evaluations, simply by solving a nonsingular set of $k$ equations in the $k$ unknown coefficients $\{b_i\}_{i=0}^{k-1}$ as shown below

\begin{eqnarray*}

\left[ \begin{array}{c}

f(x_{i_1}) \\ f(x_{i_2}) \\ \vdots \\ f(x_{i_k}) \end{array} \right] & = &

\underbrace{\left[ \begin{array}{cccc}

1 & x_{i_1}& \cdots & x_{i_1}^{k-1} \\

1 & x_{i_2} & \cdots & x_{i_2}^{k-1} \\

\vdots & \vdots & \vdots & \vdots \\

1 & x_{i_k} & \cdots & x_{i_k}^{k-1}

\end{array} \right] }_{\begin{array}{c} \text{a Vandermonde matrix} \\ \text{and therefore invertible} \end{array} }

\left[ \begin{array}{c}

b_0\\ \\ \vdots \\ b_{k-1} \end{array} \right],

\end{eqnarray*}

where $i_1, \cdots, i_k$ are $k$ distinct indices in $\{0, \cdots, n-1\}$.

### 1.2 Node Failures

A fairly frequent occurrence in a data center is the failure of a single node (i.e., of a single storage unit). Figure 5 shows the number of single-node failures in a Facebook data center containing $3000$ nodes in all. RS codes are efficient in terms of providing the least possible value of storage overhead. However, the conventional means of recovering from a single-node failure in an MDS code is inefficient (see Figure 6). This is illustrated in Figure 7 which shows the $[n=14,k=10]$ RS code employed by Facebook.

In Figure 7, in order to repair the code symbol in failed node $1$ (similarly, for any other node), the replacement node for node $1$ will contact $10$ other storage units, use their contents in conjunction with the ‘any $10$ out of $14$’ property to reconstruct the RS codeword, and thereby the lost contents of node $1$. This is inefficient in $2$ respects: firstly in terms of the number of nodes contacted (which is $10$ here) and secondly in terms of the total amount of data that is downloaded to restore the contents of a single, failed node. Figure 7 shows that if each failed node stores $100$MB of data, then the total download needed to recover the node is $1$TB, which clearly, is inefficient.