### How many things are there?

Stated in a simple way, the poser is “what number when divided by 3, 5 and 7 gives the remainders 2, 3, 2?”.

In short, for constructing a RNS, we should choose few mutually prime numbers of similar size whose product is equal to the dynamic range we need.

As an illustration, for achieving 32 bit dynamic range, we can choose 2047, 2048, 2049 as the three moduli which are mutually prime. The product of these three primes is 8589932544 <233. A typical RNS based system architecture is presented in Fig.1.

#### Fig.1. A RNS based processor architecture

We need to have first a front-end converter to find the residues of the input numbers mod these three moduli. By residue, we mean the remainder obtained by dividing the given number by the modulus. As an example, 8000 mod 2047 =1859. We have thrown the quotient 3 and taken only the remainder. Thus for the three moduli system, we are considering, we represent it in RNS by the three residues (1859, 1856, 1853). Note that this representation is unique.

We therefore need to process such smaller 11 bit numbers in several processors in parallel in a Single instruction multiple data (SIMD) architecture and after our computation using several moduli processors, we need to convert it back to normal number. Thus problem given in the beginning of this section is reverse conversion problem. The number of moduli can be many so

that word length of the individual processors can be reduced. For example using eight four-bit or five-bit mutually prime moduli, we can construct a 32 bit system. Example {9, 11, 13, 17, 19, 23, 25, 29} can realize a 32 bit RNS. Thus our eight processors are of word length 4 and 5 bits instead of 32 bit.We consider next, various operations in RNS.

We consider modulo addition, modulo subtraction and modulo multiplication, scaling, comparison and sign detection in some detail.

### 1. Addition and subtraction

Consider finding (A+B) mod M = (5 + 7) mod 11. We add first 5 and 7 to get 12.Since 12 is greater than 11, we subtract 11 from 12 to get 1. Thus one addition and one conditional subtraction are needed. Similarly, consider subtracting 7 from 5. We get -2 and since the answer is negative, we need to add 11 to obtain 9. Thus, two instructions or two operations are needed. In hardware, we can speed up by computing both (A+B) and (A+B-M) using two adders in parallel and selecting the correct result using a 2:1 multiplexer by looking at thesign of the second adder computing(A+B-M).

### 2. Modulo Multiplication

Next let us consider multiplication operation (5×7) mod 11. We get 35 by simple multiplication and we divide 35 by 11 to obtain the remainder as 2. We observe that the product is double the length in bits of the given residues.Then, we need to perform division. Division in conventional arithmetic is quite complicated compared to multiplication. One line of thinking is as follows: When we do not need the quotient, why should we follow this method? Moreover, some designers do not like the data path width to become double. Thus lot of effort has gone into the problem of modulo multiplication in the past thirty years.

We will demonstrate this technique using an example given in Fig. 2. We start the computation with MSB of the multiplier. In every step we multiply the 8 previous result by 2 and add A if bi is 1.If bi is zero, no addition of A is needed. We then reduce the result mod M. Note that the term in the brackets S, the intermediate sum, can be at most 3(M-1) and we reduce it mod M by subtracting M or 2M and depending on the sign of the results, we select either S or S-M

or S-2M. Thus, in each step, we need a conditional addition, one or two subtractions and one selection. This operation needs to be done as many times as the number of bits in B. We can use

Fast adders to perform the addition/subtraction in the loop.

Consider finding A×B mod M = 1001×1011mod 1101(in decimal (9×11) mod 13 = 8).We initialize the accumulator S to zero.Denoting B in binary as b3b2b1b0, we start with MSB b3 of B. We add A since it is 1. S = 0+1001 = 9. We find 9 mod 13 = 9. Next we double 9 and since b2 = 0, we do not add A. Tus, S = 18 mod 13 = 5. We double 5 and add to A since b1 = 1 to get S = 10+9 = 19 mod 13 = 6. We double 6 and add A since b0 is 1 to get S = 21 mod 13 = 8. This is the result.

The technique can be extended by considering more bits of operand B at a time which is known as high radix modulo multiplication. Note that each time we need to multiply the previous output by 2r where r is the number of bits of B we are taking at a time.We consider the same example given above taking two bits of B at a time. We need only two steps.

Consider finding A×B mod M = 1001×1011 mod 1101 We Initialize the accumulator S to zero. Denote B in binary as b3b2b1b0. We start with Most significant digit (b3b2)2 = (2)10 of B.

(The subscript 2 indicates binary and 10 indicates decimal). We add 2A since this digit is 2. Hence, S = 0+2×1001 = 18.We find 18 mod 13 = 5. Next we multiply by 4 to get 20 and since (b1b0)2= (3)10, we add 3×1001 = 27 to get S = 20+27 = 47 mod 13 = 8.

The price paid is that size of S has increased and 0 < S < 7(M-1) and hence mod M reduction is slightly involved compared to the earlier case of taking one bit of B at a time in which case 0 < S < 3(M-1). Note that, we perform hand calculation by taking one digit at a time (known as radix 10). Today, hardware and software designs consider 4 bits (or even 32 bits) of B at a time thus needing 256 (or 32) steps to perform one modulo multiplication of 1024 bits (A×B) mod M with A, B and M of 1024 bits) for example for use in RSA algorithm.

We consider another two techniques for modulo multiplication. In one method known as Barrett reduction [9], the product A×B is first found which obviously has word length twice that of

A and B assuming both A and B are of the same word length. Here, we find the approximate quotient q first and then subtract q×M from X = A×B. The remainder is our answer with some

correction (one subtraction of M) may be required. We need to pre-compute first. Then, we effectively compute the quotient as . Note that many operations such as division by 2n+1 can be carried out simple shifts (omitting bits) in this technique. We will illustrate with

an example.

As an illustration, consider finding (121) mod 13. Evidently n = 4. We obtain and Thus, remainder r1 = 121-13×8 = 17. Since r1 > 13, we have r1 = 17 mod 13 = 4.