4. C-Class Micro-Architecture
In this section, we describe the micro-architecture of the open-sourced SHAKTI C-Class processor. This processor has also been taped out on Intel's 22nm FFL technology node [Intel Corporation]. Figure-3 shows a high-level block diagram of the C-Class processor.
Fig. 3. A High-level block diagram of the SHAKTI C-Class processor
The following subsections describe the function of each stage in detail. Please note that this is a walk-through of one particular implementation of the SHAKTI C-Class processor. Because the core is under continuous development, some of these details may be obsolete relative to the current publicly available online source code.
4.1 Fetch Stage
This is the very first stage of the processor pipeline. The fetch stage is responsible for generating a new program counter (PC) and sending it to the instruction memory subsystem. The fetch unit also includes a small Return-Address-Stack (RAS) to help reduce delay penalties on function returns.
4.1.1 Branch Predictor and RAS. Within the memory subsystem, the program counter is sent to three modules simultaneously: the branch predictor, the instruction cache and the instruction TLB (Translation Look-aside Buffer). The branch predictor checks whether the current program counter points to a conditional branch instruction. If so, it estimates the likelihood that the branch will be taken and also provides the target address of the jump. If the branch is predicted not-taken, the predictor simply returns the next sequential PC value.
The data structure within the branch predictor that maintains the taken/not-taken likelihood is known as the prediction table, and the data structure that holds the target address is known as the branch target buffer (BTB). In the RISC-V ISA, a conditional branch has a range of ±4KiB relative to the PC. This means the BTB only needs to hold a 12-bit offset value, which is added to the current PC to generate the branch target, thereby reducing the storage requirements of the branch predictor.
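The prediction-and-target mechanism above can be sketched in software. The following is a minimal illustration (not the actual BSV source) assuming a classic 2-bit saturating counter per prediction-table entry and a BTB entry holding a signed 12-bit offset; all function names and the counter width are illustrative choices.

```python
def update_counter(counter, taken):
    """Move a 2-bit saturating counter (0..3) toward taken/not-taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

def predict(counter, pc, btb_offset):
    """Predict taken (counter >= 2): PC plus the sign-extended 12-bit
    BTB offset; otherwise fall through to the next sequential PC."""
    if counter >= 2:
        # sign-extend the 12-bit two's-complement stored offset
        offset = btb_offset - (1 << 12) if btb_offset & (1 << 11) else btb_offset
        return pc + offset
    return pc + 4
```

A mispredicted branch would later train the counter via `update_counter` when the execute stage resolves the actual direction.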
The RISC-V ISA also includes unconditional jump instructions (JAL and JALR), which are typically used for function calls and returns. Since a function call and its return follow a structured pattern, an RAS data structure is used to track them rather than training the branch predictor on these instructions. The ISA also provides hints within the instruction encoding that allow quick detection of a call or return operation in the fetch stage itself. A call operation pushes the return address onto the top of the RAS, and a return operation pops the return address from the stack and sends it to the instruction memory subsystem to fetch the relevant instruction. Maintaining the RAS reduces the number of collisions within the branch predictor that would otherwise occur if it were trained with call and return instructions.
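The push/pop behaviour of the RAS can be sketched as follows. This is an illustrative model only; the stack depth of 8 is an assumption, not a figure from the source, and a real RAS would typically overwrite the oldest entry on overflow as modelled here.

```python
class ReturnAddressStack:
    """Illustrative RAS model: calls push, returns pop."""

    def __init__(self, depth=8):
        self.depth = depth      # assumed depth, not from the source
        self.stack = []

    def on_call(self, pc, insn_len=4):
        """A call pushes the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)   # overwrite the oldest entry on overflow
        self.stack.append(pc + insn_len)

    def on_return(self):
        """A return pops the predicted target; None means the RAS is
        empty and the fetch unit must fall back to other prediction."""
        return self.stack.pop() if self.stack else None
```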
4.1.2 Instruction Cache and TLB. Since the SHAKTI C-Class supports virtual memory, the instruction cache is a VIPT (Virtually Indexed and Physically Tagged) structure. The BSV code of the cache has been designed to generate a parameterized cache which can range from 16 to 64KiB. The cache is blocking in nature, i.e. until a previous request is served (whether hit or miss), the next request is stalled. The C-Class processor also holds a 16-entry fully-associative TLB which is accessed in parallel with the cache. The TLB stores the physical page number of the page present in memory, and the TLB look-up takes a single cycle. In the next cycle, the TLB-provided physical page number is compared with the tag from the cache. If they match, a hit is indicated and the respective instruction is fetched from the data array. If they do not match, the cache initiates a new line-request from the next memory level. The line-request from the lower level of the memory hierarchy should ensure that the critical word is retrieved first so that the fetch request can be served as soon as possible.
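The VIPT lookup sequence can be sketched roughly as below. The parameters here (4KiB pages, 64-byte lines, 64 sets, direct-mapped) are illustrative assumptions chosen so that the index bits fall within the page offset; they are not taken from the C-Class source. The set index comes from virtual-address bits, while the tag comparison uses the physical page number returned by the TLB.

```python
PAGE_SHIFT = 12   # assumed 4 KiB pages
LINE_SHIFT = 6    # assumed 64-byte cache lines
NUM_SETS   = 64   # assumed, so index bits stay within the page offset

def vipt_lookup(vaddr, tlb, tag_array):
    """tlb maps virtual page number -> physical page number;
    tag_array maps set index -> stored physical tag."""
    vpn = vaddr >> PAGE_SHIFT
    if vpn not in tlb:
        return "tlb-miss"                     # triggers a page-table walk
    ppn = tlb[vpn]
    index = (vaddr >> LINE_SHIFT) % NUM_SETS  # index from virtual bits
    if tag_array.get(index) == ppn:
        return "hit"                          # fetch word from data array
    return "miss"                             # initiate a line request
```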
4.2 Decode and Operand Fetch Stage
This stage receives the instruction from the fetch unit and decodes it into crucial fields such as the operand addresses, destination address, operation type and immediate field. Once the operand addresses are available, the register file (which holds 32 64-bit integer registers and 32 64-bit floating point registers) is accessed to fetch the respective operands. Since all base-ISA integer instructions require at most 2 source operands and update only a single destination register, the integer register file requires 2 read-ports and 1 write-port. Floating point instructions like FMADD, FMSUB, etc. require 3 operands, and thus the floating point register file requires 3 read-ports and 1 write-port. Other fields of the instruction are re-coded to a format which is more convenient for the next stage to execute.
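The field extraction described above follows the fixed positions of the RISC-V base-ISA encoding, which can be sketched as below. The field positions are those of the ISA specification; the function itself is an illustration, not the C-Class BSV decode logic.

```python
def decode(insn):
    """Split a 32-bit RISC-V instruction into its fixed fields."""
    return {
        "opcode": insn & 0x7F,          # major operation type
        "rd":     (insn >> 7)  & 0x1F,  # destination register address
        "funct3": (insn >> 12) & 0x07,  # minor operation selector
        "rs1":    (insn >> 15) & 0x1F,  # first operand address
        "rs2":    (insn >> 20) & 0x1F,  # second operand address
        "funct7": (insn >> 25) & 0x7F,  # R-type operation modifier
    }
```

For example, `add x3, x1, x2` encodes as `0x002081B3`, from which the decoder recovers `rd=3`, `rs1=1`, `rs2=2`.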
Another important activity of this stage is to capture all interrupts and exceptions. The CSRs (Control and Status Registers) are looked up in this stage to indicate whether an interrupt has occurred. If so, the trap field of the next pipeline buffer is set with the relevant information, enabling the processor to handle the interrupt. Other exceptions, such as illegal instructions and illegal CSR accesses, are also captured here before the instruction proceeds.
4.3 Execute Stage
This stage holds the ALU which performs the arithmetic operations required by the instructions. To achieve this, it holds the integer arithmetic unit, a variable-cycle multiplier with an early-out mechanism, a non-restoring sequential divider and a floating point unit which supports IEEE-754 single and double precision arithmetic. This is also the stage where the prediction of the branch predictor is validated. An instruction whose operands were not available in the decode stage is stalled in this stage until the operand forwarding logic provides the necessary operands. The operand forwarding logic receives values from the next pipeline buffer for arithmetic operations and from the memory stage for values produced by load instructions.
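The forwarding choice can be modelled as a simple priority selection: for each source register, prefer the in-flight value over the stale register-file read. The following sketch assumes this priority ordering; all names are illustrative, not from the BSV source.

```python
def select_operand(src_reg, regfile, exec_result, mem_result):
    """Pick the freshest value for src_reg.
    exec_result / mem_result are (dest_reg, value) tuples for the
    instruction in the next pipeline buffer and a completing load,
    or None if no such result is in flight."""
    if src_reg == 0:
        return 0                       # x0 is hard-wired to zero
    if exec_result and exec_result[0] == src_reg:
        return exec_result[1]          # forwarded ALU result
    if mem_result and mem_result[0] == src_reg:
        return mem_result[1]           # forwarded load data
    return regfile[src_reg]            # fall back to register file
```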
Unless designed carefully and meticulously, the ALU has the potential of being not only the most delay-sensitive path but also the most area-burdened unit. The C-Class, being an in-order core, employs sequential multiplier and divider circuits which share a significant amount of hardware. The multiplier has an early-out mechanism: the result can be available anywhere between 1 and 8 cycles, terminating as soon as one of the shifted operands becomes 0. The divider implements a non-restoring algorithm which requires up to 64 cycles for the RV64 ISA. The floating point unit is also designed to have low area overhead and thus implements a sequential algorithm for fused multiply-add operations, re-using a significant amount of hardware compared to a pipelined version. For memory operations, the address generated by the ALU is directly latched into the data memory subsystem to initiate the load/store process. If, however, there is a trap from the write-back stage, the pipeline is flushed and the load/store address is not latched, thereby preventing corruption of memory.
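The non-restoring scheme used by the divider can be illustrated in software as below: one loop iteration corresponds to one hardware cycle, which is why an RV64 division takes up to 64 cycles. This is a textbook unsigned formulation for illustration, not the BSV implementation.

```python
def nonrestoring_div(dividend, divisor, width=64):
    """Textbook non-restoring division: one iteration per cycle.
    Returns (quotient, remainder) for unsigned operands."""
    assert divisor != 0
    rem, quo = 0, dividend
    mask = (1 << width) - 1
    for _ in range(width):
        # shift the (remainder, quotient) pair left by one bit
        rem = (rem << 1) | (quo >> (width - 1))
        quo = (quo << 1) & mask
        # add or subtract the divisor instead of restoring
        rem = rem - divisor if rem >= 0 else rem + divisor
        if rem >= 0:
            quo |= 1        # quotient bit set when remainder non-negative
    if rem < 0:
        rem += divisor      # single final restoring correction
    return quo, rem
```

The key saving over a restoring divider is that a negative partial remainder is corrected in the next iteration (by adding the divisor) rather than in an extra restoring step each cycle.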
The execute stage also checks whether a load/store address or a conditional branch target is misaligned; if so, it raises a suitable exception and forwards the instruction to the next stage.
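The misalignment condition itself is a simple modulus check against the access size, which can be sketched as below. The helper and its access names are illustrative; the 4-byte branch-target requirement assumes the compressed ('C') extension is absent.

```python
ACCESS_SIZE = {
    "lb": 1, "lh": 2, "lw": 4, "ld": 8,   # loads: byte..double-word
    "sb": 1, "sh": 2, "sw": 4, "sd": 8,   # stores: byte..double-word
    "branch": 4,                          # assumes no 'C' extension
}

def is_aligned(addr, access):
    """True when addr is a multiple of the access size."""
    return addr % ACCESS_SIZE[access] == 0
```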
4.4 Memory Stage
All non-load/store/atomic operations simply bypass this stage and proceed to the write-back stage. In the case of a load/store instruction, this stage captures the result of the memory access. For a load operation, the data is forwarded to the execution unit for instructions that depend on the loaded value. All exceptions related to loads and stores, such as access faults, page faults, etc., are captured here. The data-memory subsystem works in the same manner as the instruction memory subsystem described earlier. A page table walk is initiated if there is a TLB miss.
4.5 Write-Back Stage
This is the stage where the instruction retires. This stage first checks if the instruction has been tagged with any traps (interrupts or exceptions) by the previous stages. If a trap is to be taken, the entire pipeline is flushed and the first instruction of the trap handler is fetched. If no trap is present, the instruction simply updates the destination register with the new value. All CSR-related operations also take effect here.