Computer Architecture Lab/Winter2006/JeitMossFrühRamb/ThreeMicroDiscussion
The Nios II processor family was created by Altera and consists of 3 main cores:
- Nios II/e - "economy" core
- designed to achieve the smallest possible core size
- 32-bit RISC processor
- Nios II/s - "standard" core
- designed for small size while maintaining performance
- 32-bit RISC processor + Instruction Cache, Branch Prediction, Hardware Multiply, Hardware Divide
- Nios II/f - "fast" core
- designed for fast performance
- 32-bit RISC processor + Instruction Cache, Branch Prediction, Hardware Multiply, Hardware Divide, Barrel Shifter, Data Cache, Dynamic Branch Prediction
Each core offers further configuration options in order to fine-tune the processor for its field of application.
System Basics 
The Nios II processor is a general-purpose RISC processor core, providing:
- Full 32-bit instruction set, data path, and address space
- 32 external interrupt sources
- Single-instruction 32 x 32 multiply and divide producing a 32-bit result
- Dedicated instructions for computing 64-bit and 128-bit products of multiplication
- Floating-point instructions for single-precision floating-point operations
- Access to a variety of on-chip peripherals, and interfaces to off-chip memories and peripherals
- Instruction set architecture compatible across all Nios II processor systems
- Performance up to 250 DMIPS @ 200 MHz
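The relationship between the 32-bit multiply result and the wider products can be illustrated with a small model. This is a minimal sketch, not Altera's implementation: the Python functions borrow the mnemonics `mul` and `mulxuu` from the instruction list, and it is assumed that `mul` keeps the low 32 bits of the product while `mulxuu` returns the high 32 bits of the unsigned product. The helper `full_product_unsigned` is hypothetical.

```python
MASK32 = 0xFFFFFFFF  # keeps a value within 32 bits


def mul(a: int, b: int) -> int:
    """Low 32 bits of the 32 x 32 product (assumed behaviour of mul)."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mulxuu(a: int, b: int) -> int:
    """High 32 bits of the unsigned 64-bit product (assumed behaviour of mulxuu)."""
    return ((a & MASK32) * (b & MASK32)) >> 32


def full_product_unsigned(a: int, b: int) -> int:
    """Hypothetical helper: combine both halves into the 64-bit product."""
    return (mulxuu(a, b) << 32) | mul(a, b)
```

Two instructions are therefore enough to obtain a full 64-bit unsigned product on a 32-bit data path.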
Instruction Set 
The Nios II instruction set can be categorized by type of operation performed:
- Data Transfer Instructions
- 32-bit word, half-word or byte access to/from memory or peripherals
- 16 instructions
- Arithmetic & Logical Instructions
- logical instructions support and, or, xor and nor operations
- arithmetic instructions support addition, subtraction, multiplication and division operations
- 21 instructions
- Move Instructions
- copy the value of a register or an immediate value to another register
- 5 instructions
- Comparison Instructions
- instructions implement all the equality and relational operators of the C programming language
- 20 instructions
- Shift & Rotate Instructions
- 9 instructions
- Program Control Instructions
- unconditional jump and call instructions
- conditional-branch instructions
- 15 instructions
- Other Control Instructions
- 12 instructions
- Custom Instructions
- depends on the custom functionality added to the Nios II ALU
- No-Operation Instruction
- 1 instruction
- Potential Unimplemented Instruction
- some cores do not implement all instructions in hardware; executing an unimplemented instruction makes the processor generate an exception
- mul, muli, mulxss, mulxsu, mulxuu, div, divu
- 7 instructions
The Nios II architecture supports a flat register file, consisting of thirty-two 32-bit general-purpose integer registers and six 32-bit control registers. The architecture supports supervisor and user modes that allow system code to protect the control registers from errant applications. Moreover, the architecture allows for the future addition of floating-point registers.
Pipelining, among other features, depends on the core. Nios II/e is not pipelined, while Nios II/s uses a 5-stage and Nios II/f a 6-stage pipeline, which has an enormous effect on instruction performance. Without pipelining, instructions need at least 6 cycles to complete, whereas on the pipelined cores most of them need only one cycle. Instructions that flush the pipeline (trap, break, flushp, ...) need 4 cycles to complete. On the 6-stage pipeline, jmp, ret, and callr need one cycle less, and call two cycles less, than on the 5-stage pipeline.
Nios II/f core pipeline stages: Fetch (F), Decode (D), Execute (E), Memory (M), Align (A), Writeback (W)
Nios II/s core pipeline stages: Fetch (F), Decode (D), Execute (E), Memory (M), Writeback (W)
In contrast to the "standard" core the "fast" core has a 2-bit branch history table which is used for dynamic branch prediction.
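How a 2-bit branch history table behaves can be sketched with saturating counters. This is an illustrative model only: the source states just that the table is 2-bit, so the table size, the indexing by program counter, and the initial counter state used here are assumptions, and the class name `TwoBitPredictor` is hypothetical.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters (values 0..3).

    Counter values 0 and 1 predict "not taken", 2 and 3 predict "taken".
    """

    def __init__(self, entries: int = 256):
        self.entries = entries          # assumed table size
        self.table = [1] * entries      # assumed start: weakly not taken

    def _index(self, pc: int) -> int:
        # Assumed indexing: drop the two byte-offset bits, then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

With two bits per entry, a single mispredicted iteration (for example a loop exit) does not immediately flip the prediction for a branch that is usually taken.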
SPARC V8 
The SPARC architecture is an Instruction Set Architecture (ISA) that derives from a reduced instruction set computer (RISC) lineage. Its principal data types are 32-bit integers and 32-, 64-, and 128-bit IEEE Standard 754 floating-point values. SPARC is a big-endian architecture.
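What big-endian means for a 32-bit word can be shown in a few lines. The sketch uses Python's standard `struct` module; the function name `big_endian_bytes` and the sample value are illustrative assumptions.

```python
import struct


def big_endian_bytes(word: int) -> bytes:
    """Return the four bytes of a 32-bit word in big-endian memory order."""
    return struct.pack(">I", word & 0xFFFFFFFF)


def little_endian_bytes(word: int) -> bytes:
    """Same word in little-endian order, for comparison only."""
    return struct.pack("<I", word & 0xFFFFFFFF)
```

On a big-endian machine the most significant byte of the word is stored at the lowest address, so `big_endian_bytes(0x12345678)` starts with the byte `0x12`.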
SPARC System Components 
The SPARC architecture allows for different I/O, MMU, and cache system sub-architectures. It is assumed that these components are best defined by the requirements of the particular system being built. In most cases, they are invisible to application programs.
SPARC Features 
- linear, 32-bit address space
- Few and simple instruction formats
- Few addressing modes
- Triadic register addresses
- A large "windowed" register file
- separate floating-point registers
- optional Coprocessor
There are 3 different types of registers: general-purpose integer registers, floating-point registers and special registers.
The general-purpose integer registers are located in the Integer Unit (IU). The IU may contain from 40 to 520 general-purpose integer registers. This corresponds to 8 global registers and a circular stack of 2 to 32 sets of 16 registers, also known as register windows.
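The register counts quoted above follow directly from the window layout. A minimal sketch of the arithmetic, with the function name `sparc_integer_registers` chosen here for illustration:

```python
def sparc_integer_registers(windows: int) -> int:
    """Total integer registers: 8 globals plus 16 registers per window."""
    if not 2 <= windows <= 32:
        raise ValueError("an implementation has between 2 and 32 register windows")
    return 8 + 16 * windows
```

The minimum of 2 windows gives 8 + 2 * 16 = 40 registers, and the maximum of 32 windows gives 8 + 32 * 16 = 520.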
The FPU has thirty-two 32-bit floating-point registers. As described in the Overview, SPARC is also capable of using 64- and 128-bit floating-point values. Therefore, two 32-bit floating-point registers are used to store a double-precision floating-point value (64 bits) and four registers are used to store a quad-precision floating-point value (128 bits).
The special registers are implemented in the optional coprocessor. The number of registers is up to the coprocessor's designer, but nominally it is some number of 32-bit registers.
The instruction set can be divided into 6 classes of instructions:
1) Load/Store operations
2) Arithmetic/logical/shift (Integer Arithmetic) operations
3) Control transfer operations
4) Read/write control register operations
5) Floating-point operations
6) Coprocessor operations
There exist 72 basic instruction operations. All of them are encoded in 3 major 32-bit wide instruction formats. The load/store instructions address a linear, 2^32-byte address space.
The load/store instructions are the only ones that access memory. They use either two integer registers or one integer register and a signed 13-bit immediate to calculate the 32-bit, byte-aligned memory address. 8-, 16-, 32-, and 64-bit accesses are supported.
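The address calculation described above can be sketched as follows. The function and parameter names (`sign_extend`, `effective_address`, `rs1_value`, `rs2_value`, `simm13`) are illustrative assumptions; the model only captures the two operand forms and the wrap to a 32-bit address.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def effective_address(rs1_value: int, rs2_value: int = 0, simm13=None) -> int:
    """Register + register, or register + signed 13-bit immediate."""
    offset = rs2_value if simm13 is None else sign_extend(simm13, 13)
    return (rs1_value + offset) & 0xFFFFFFFF  # wrap to the 32-bit address space
```

A 13-bit signed immediate covers offsets from -4096 to 4095 bytes around the base register.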
Integer Arithmetic Instructions are generally triadic-register-address instructions. Those operations use two source operands (2 registers or 1 register and a signed 13-bit immediate) and write the result either in a register or discard it (write to register 0).
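The triadic form and the effect of using register 0 as destination can be modelled with a tiny register file. This is a sketch under assumptions: the class `IntegerRegisters` and the function `add` are hypothetical names, the model ignores register windows, and it assumes that register 0 always reads as zero.

```python
class IntegerRegisters:
    """Simplified view of 32 integer registers with register 0 fixed at zero."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n: int) -> int:
        return 0 if n == 0 else self._regs[n]

    def write(self, n: int, value: int) -> None:
        if n != 0:  # a write to register 0 discards the result
            self._regs[n] = value & 0xFFFFFFFF


def add(regs: IntegerRegisters, rd: int, rs1: int, operand2: int) -> None:
    """Triadic add: rd <- rs1 + operand2, where operand2 is already a value."""
    regs.write(rd, regs.read(rs1) + operand2)
```

Naming register 0 as the destination is useful when only the side effects of an operation matter and the computed value itself is not needed.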
The control transfer instructions manipulate the next program counter (nPC). They can in turn be divided into 5 types: conditional branch, call and link, jump and link, return from trap, and trap.
Read/Write Control Register Instructions are used to write or read the program-visible state and status registers.
The floating-point instructions are, like the integer arithmetic instructions, generally triadic-register-address instructions. Only the floating-point convert and floating-point compare instructions are two-register-address instructions.
Coprocessor operate instructions are executed by the attached coprocessor, if there is one. They are interpreted by the coprocessor itself; the specification defines only 5 bits of the instruction code, which mark it as a coprocessor instruction.
The Xilinx MicroBlaze is a 32-bit soft processor. It implements a 32-bit Harvard RISC architecture. The basic architecture consists of 32 general-purpose registers, an Arithmetic Logic Unit (ALU), a shift unit, and two levels of interrupt. It has an optional floating-point unit. Instruction and data caches can also be included in the design as options.
MicroBlaze has an orthogonal instruction set architecture (any instruction can use data of any type via any addressing mode). It has thirty-two 32-bit general purpose registers and up to seven 32-bit special purpose registers, depending on configured options.
Special Purpose Registers 
The Program Counter holds the 32-bit address of the instruction being executed. It can only be read.
The Machine Status Register contains control and status bits for the processor. It includes information about (only some are mentioned): Exception In Progress, Arithmetic Carry, Division by Zero, Break in Progress, Interrupt Enable, Buslock Enable
The Exception Address Register stores the full load/store address that caused the exception.
The Exception Status Register contains status bits for the processor e.g. the Exception Cause.
The Floating Point Status Register contains status bits for the floating point unit.
All MicroBlaze instructions are 32 bits wide and are defined as either Type A or Type B. Type A instructions have up to two source register operands and one destination register operand. Type B instructions have one source register, a 16-bit immediate operand, and a single destination register operand. The immediate can be extended to 32 bits by preceding the Type B instruction with an IMM instruction. Immediate operands are constants encoded directly in the instruction word, so they are available immediately without a separate data memory access. Instructions are provided in the following functional categories: arithmetic, logical, branch, load/store, and special.
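The effect of an IMM prefix on a Type B operand can be sketched as below. This is a simplified model, not the hardware definition: the function name `type_b_operand` and its parameters are hypothetical, and it assumes that without a preceding IMM the 16-bit immediate is sign-extended to 32 bits, while with an IMM the prefix supplies the upper 16 bits.

```python
def type_b_operand(imm16: int, imm_prefix=None) -> int:
    """Return the 32-bit operand value used by a Type B instruction."""
    imm16 &= 0xFFFF
    if imm_prefix is None:
        # Assumption: no IMM prefix, so sign-extend the 16-bit immediate.
        if imm16 & 0x8000:
            return imm16 | 0xFFFF0000
        return imm16
    # IMM prefix present: it provides the upper half of the operand.
    return ((imm_prefix & 0xFFFF) << 16) | imm16
```

For example, an IMM carrying `0x1234` followed by a Type B instruction with immediate `0x5678` yields the full 32-bit constant `0x12345678`.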
Some important instructions:
|add rD, rA, rB||Arithmetic add: rD receives the sum of rA and rB.|
|and rD, rA, rB||Logical AND: rD receives the bitwise AND of rA and rB.|
|beq rA, rB||Branch if equal: branches to the address PC + rB if rA is zero.|
|sw rD, rA, rB||Stores the contents of register rD into the word-aligned memory location that results from adding the contents of registers rA and rB.|
|lw rD, rA, rB||Loads a word (32 bits) from the word-aligned memory location that results from adding the contents of registers rA and rB. The data is placed in register rD.|
MicroBlaze instruction execution is pipelined. The pipeline is divided into five stages: Fetch (IF), Decode (OF), Execute (EX), Access Memory (MEM), and Writeback (WB).
For most instructions, each stage takes one clock cycle to complete. Consequently, it takes five clock cycles for a specific instruction to complete, and one instruction is completed on every cycle. A few instructions require multiple clock cycles in the execute stage to complete. This is achieved by stalling the pipeline.
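The difference between latency (five cycles per instruction) and throughput (one instruction per cycle) can be checked with a short calculation. The function `total_cycles` and its `stall_cycles` parameter are illustrative assumptions; the model treats the pipeline as ideal apart from the stalls passed in.

```python
def total_cycles(instructions: int, stages: int = 5, stall_cycles: int = 0) -> int:
    """Cycles to run a straight-line sequence on an otherwise ideal pipeline."""
    if instructions <= 0:
        return 0
    # The first instruction needs `stages` cycles to pass through the pipeline;
    # every further instruction completes one cycle later. Stalls add directly.
    return stages + (instructions - 1) + stall_cycles
```

A single instruction takes 5 cycles, but 100 instructions take 104 cycles rather than 500, which is where the benefit of pipelining comes from.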
Memory architecture 
MicroBlaze is implemented with a Harvard memory architecture, i.e. instruction and data accesses are done in separate address spaces. Each address space has a 32-bit range (i.e. handles up to 4 GByte of instruction and data memory respectively). The instruction and data memory ranges can be made to overlap by mapping them both to the same physical memory, which is useful e.g. for software debugging. Both the instruction and data interfaces of MicroBlaze are 32 bits wide and use big-endian, bit-reversed format. MicroBlaze supports word, halfword, and byte accesses to data memory. Data accesses must be aligned (i.e. word accesses must be on word boundaries, halfword accesses on halfword boundaries), unless the processor is configured to support unaligned exceptions. All instruction accesses must be word aligned. MicroBlaze does not distinguish between data accesses to I/O and memory (i.e. it uses memory-mapped I/O).
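The alignment rule for data accesses can be expressed as a simple check. The names `ACCESS_SIZES` and `is_aligned` are chosen here for illustration, and the sketch only tests the address; it does not model what the processor does on a misaligned access.

```python
# Access sizes in bytes for the three supported data access widths.
ACCESS_SIZES = {"byte": 1, "halfword": 2, "word": 4}


def is_aligned(address: int, access: str) -> bool:
    """True if the address is a multiple of the access size."""
    return address % ACCESS_SIZES[access] == 0
```

Byte accesses are always aligned, halfword accesses need an even address, and word accesses need an address that is a multiple of four.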