Pipeline Register

Microarchitecture

David Money Harris, Sarah L. Harris, in Digital Design and Computer Architecture (Second Edition), 2013

Solving Data Hazards with Stalls

Forwarding is sufficient to solve RAW data hazards when the result is computed in the Execute stage of an instruction, because its result can then be forwarded to the Execute stage of the next instruction. Unfortunately, the lw instruction does not finish reading data until the end of the Memory stage, so its result cannot be forwarded to the Execute stage of the next instruction. We say that the lw instruction has a two-cycle latency, because a dependent instruction cannot use its result until two cycles later. Figure 7.51 shows this problem. The lw instruction receives data from memory at the end of cycle 4. But the and instruction needs that data as a source operand at the beginning of cycle 4. There is no way to solve this hazard with forwarding.

Figure 7.51. Abstract pipeline diagram illustrating trouble forwarding from lw

The alternative solution is to stall the pipeline, holding up operation until the data is available. Figure 7.52 shows stalling the dependent instruction (and) in the Decode stage. and enters the Decode stage in cycle 3 and stalls there through cycle 4. The subsequent instruction (or) must remain in the Fetch stage during both cycles as well, because the Decode stage is full.

Figure 7.52. Abstract pipeline diagram illustrating stall to solve hazards

In cycle 5, the result can be forwarded from the Writeback stage of lw to the Execute stage of and. Also in cycle 5, source $s0 of the or instruction is read directly from the register file, with no need for forwarding.

Notice that the Execute stage is unused in cycle 4. Likewise, Memory is unused in cycle 5 and Writeback is unused in cycle 6. This unused stage propagating through the pipeline is called a bubble, and it behaves like a nop instruction. The bubble is introduced by zeroing out the Execute stage control signals during a Decode stall so that the bubble performs no action and changes no architectural state.

In summary, stalling a stage is performed by disabling the pipeline register, so that the contents do not change. When a stage is stalled, all previous stages must also be stalled, so that no subsequent instructions are lost. The pipeline register directly after the stalled stage must be cleared to prevent bogus information from propagating forward. Stalls degrade performance, so they should only be used when necessary.

Figure 7.53 modifies the pipelined processor to add stalls for lw data dependencies. The hazard unit examines the instruction in the Execute stage. If it is lw and its destination register (rtE) matches either source operand of the instruction in the Decode stage (rsD or rtD), that instruction must be stalled in the Decode stage until the source operand is ready.

Figure 7.53. Pipelined processor with stalls to solve lw data hazard

Stalls are supported by adding enable inputs (EN) to the Fetch and Decode pipeline registers and a synchronous reset/clear (CLR) input to the Execute pipeline register. When a lw stall occurs, StallD and StallF are asserted to force the Decode and Fetch stage pipeline registers to hold their old values. FlushE is also asserted to clear the contents of the Execute stage pipeline register, introducing a bubble.

The MemtoReg signal is asserted for the lw instruction. Hence, the logic to compute the stalls and flushes is

lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE

StallF = StallD = FlushE = lwstall
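
The stall logic above is simple enough to sketch behaviorally. The following Python model (register designators as plain strings, a hypothetical encoding chosen here for illustration) mirrors the two equations:

```python
def lw_stall(rs_d, rt_d, rt_e, memtoreg_e):
    # lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
    return ((rs_d == rt_e) or (rt_d == rt_e)) and memtoreg_e

# lw $s0, 40($0) in Execute (rtE = s0); and $t0, $s0, $s1 in Decode -> stall
stall = lw_stall(rs_d="s0", rt_d="s1", rt_e="s0", memtoreg_e=True)
# StallF = StallD = FlushE = lwstall
stall_f = stall_d = flush_e = stall
```

Note that MemtoRegE gates the comparison: a match against the destination of a non-load instruction in Execute is handled by forwarding, not by a stall.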

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123944245000070

Two-Operand Addition

Miloš D. Ercegovac, Tomás Lang, in Digital Arithmetic, 2004

2.9 Pipelined Adders

The throughput of the adder can be increased by pipelining. To do this, pipeline registers are introduced to shorten the worst-case carry path. For example, a CRA adder is divided into groups of bits and latches are introduced, as shown in Figure 2.26. Most latches are used to synchronize inputs and outputs since different parts (groups) are processed at different cycles. The throughput R is determined by the delay of one group; that is,

FIGURE 2.26. Pipelined carry-ripple adder (for group size of 1 and n = 4).

(2.73) R = 1 / t_group

Adders such as conditional-sum and prefix adders are pipelined by introducing registers between the stages. In this case, no latches are required for synchronization of inputs and outputs.
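
As a behavioral illustration (a hypothetical model, not tied to Figure 2.26's exact latch placement), the Python sketch below splits an n-bit carry-ripple addition into groups; each cycle, every group stage works on a different operand pair, so one sum completes per cycle once the pipeline is full:

```python
def pipelined_cra(pairs, width=8, group=2):
    # width // group pipeline stages, each adding `group` bits per cycle
    stages = width // group
    mask = (1 << group) - 1
    inflight, results, cycles = [], [], 0
    queue = list(pairs)
    while queue or inflight:
        for item in inflight:           # all stages operate in parallel
            s = item["stage"]
            ga = (item["a"] >> (s * group)) & mask
            gb = (item["b"] >> (s * group)) & mask
            total = ga + gb + item["carry"]
            item["sum"] |= (total & mask) << (s * group)
            item["carry"] = total >> group
            item["stage"] += 1
        # sums leaving the last stage are complete (modulo 2**width)
        results += [i["sum"] for i in inflight if i["stage"] == stages]
        inflight = [i for i in inflight if i["stage"] < stages]
        if queue:                       # latch the next operand pair
            a, b = queue.pop(0)
            inflight.append({"a": a, "b": b, "sum": 0, "carry": 0, "stage": 0})
        cycles += 1
    return results, cycles

sums, cycles = pipelined_cra([(3, 5), (250, 6), (100, 27)])
# sums == [8, 0, 127] (250 + 6 wraps modulo 2**8); cycles == stages + pairs
```

The throughput matches (2.73): after a latency of width/group cycles, one result is retired per group delay.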


URL:

https://www.sciencedirect.com/science/article/pii/B978155860798950004X

Microarchitecture

Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022

7.5.1 Pipelined Datapath

The pipelined datapath is formed by chopping the single-cycle datapath into five stages separated by pipeline registers. Figure 7.49(a) shows the single-cycle datapath stretched out to leave room for the pipeline registers. Figure 7.49(b) shows the pipelined datapath formed by inserting four pipeline registers to separate the datapath into five stages. The stages and their boundaries are indicated in blue. Signals are given a suffix (F, D, E, M, or W) to indicate the stage in which they reside.

Figure 7.49. Datapaths: (a) single-cycle and (b) pipelined

The register file is peculiar because it is read in the Decode stage and written in the Writeback stage. So, although the register file is drawn in the Decode stage, its write address and write data come from the Writeback stage. This feedback will lead to pipeline hazards, which are discussed in Section 7.5.3. The register file in the pipelined processor writes on the falling edge of CLK so that it can write a result in the first half of a cycle and read that result in the second half of the cycle for use in a subsequent instruction.

One of the subtle but critical issues in pipelining is that all signals associated with a particular instruction must advance through the pipeline in unison. Figure 7.49(b) has an error related to this issue. Can you find it?

The error is in the register file write logic, which should operate in the Writeback stage. The data value comes from ResultW, a Writeback stage signal. But the destination register comes from RdD (InstrD 11:7), which is a Decode stage signal. In the pipeline diagram of Figure 7.48, during cycle 5, the result of the lw instruction would be incorrectly written to s5 rather than s2.

Figure 7.50 shows a corrected datapath, with the modification in blue. The Rd signal is now pipelined through the Execute, Memory, and Writeback stages, so it remains in sync with the rest of the instruction. RdW and ResultW are fed back together to the register file in the Writeback stage.

Figure 7.50. Corrected pipelined datapath
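
The effect of the bug and of the fix can be demonstrated with a toy model (a hypothetical encoding, not the book's exact datapath): each instruction record carries its own Rd through the D/E/M/W pipeline registers, and the register-file write address is taken either from the Writeback stage (correct) or from whatever instruction currently sits in Decode (the error in Figure 7.49(b)):

```python
def run(program, use_rdw=True):
    regs, pipe = {}, [None, None, None, None]   # D, E, M, W registers
    for instr in list(program) + [None] * 4:    # trailing bubbles drain the pipe
        wb = pipe[3]                            # instruction now in Writeback
        if wb is not None:
            # correct: write address pipelined with the instruction (RdW);
            # buggy: write address sampled from the Decode stage (RdD)
            decode = pipe[0]
            rd = wb["rd"] if use_rdw or decode is None else decode["rd"]
            regs[rd] = wb["result"]
        pipe = [instr] + pipe[:3]               # clock edge: shift one stage
    return regs

prog = [{"rd": "s2", "result": 42},             # e.g. a load destined for s2
        {"rd": "t0", "result": 1},
        {"rd": "t1", "result": 2},
        {"rd": "s5", "result": 4}]
# run(prog) writes 42 to s2; run(prog, use_rdw=False) misdirects it to s5
```

With the buggy address path, the first result lands in the register named by the instruction that happens to occupy Decode four cycles later, and s2 is never written.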

The astute reader may note that the logic to produce PCF' (the next PC) is also problematic because it could be updated with either a Fetch or an Execute stage signal (PCPlus4F or PCTargetE). This control hazard will be fixed in Section 7.5.3.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128200643000076

Clocking of Synchronous Circuits

In Top-Down Digital VLSI Design, 2015

7.5.1 Traditional feedback-type registers with enable

In most designs, some of the flip-flops operate at a lower rate than others. Just compare a pipeline register against a mode register that preserves its state for millions of clock cycles or even until power down. Other registers occupy positions in between. This situation suggests the usage of enable signals to control register operation as shown in the code fragments below.

.....

process (Clk_C)
begin
   -- activities triggered by rising edge of clock
   if Clk_C'event and Clk_C='1' then
      if Ena_S='1' then           -- control signal that governs
         State_P <= State_N;      -- register update
      end if;
   end if;
end process;

.....

.....

always_ff @(posedge Clk_C)
   // activities triggered by rising edge of clock
   if (Ena_S)                     // control signal that governs
      State_P <= State_N;         // register update

.....

When presented with such code, synthesizers call upon a multiplexer to close and open a feedback loop around a basic D-type flip-flop under control of the enable signal as shown in the example of fig.7.26b. As the resulting circuit is simple, robust, and compliant with the rules of synchronous design, this is a safe and often also a reasonable choice. Some libraries indeed offer an E-type flip-flop that combines a flip-flop and a mux in a single cell.

Figure 7.26. Register with enable. Icon (a) and safe E-type flip-flop circuit (b).
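
The feedback-type register can also be modeled behaviorally. This Python sketch (a hypothetical rendering of the circuit in fig.7.26b) recirculates the stored value through the mux whenever the enable is low:

```python
class EFlipFlop:
    def __init__(self):
        self.q = 0                      # stored state
    def clock(self, d, ena):
        # feedback mux: capture new data when enabled, else hold
        self.q = d if ena else self.q
        return self.q

ff = EFlipFlop()
ff.clock(d=1, ena=True)     # rising edge, enabled: captures 1
ff.clock(d=0, ena=False)    # rising edge, disabled: still holds 1
```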

On the negative side, this approach takes one fairly expensive multiplexer per bit and does not use energy efficiently. This is because any toggling of the clock input of a disabled flip-flop wastes energy in discharging and recharging the associated node capacitances for nothing. Note that the capacitance of the Clk_C input is not the only contribution as any clock edge causes further nodes to toggle within the flip-flop itself. Also, the wider a register, the larger the overhead in terms of energy and transistor count.

Most of the energy could be conserved by selectively turning off the clock within those circuit regions that are currently disabled. Any such conditional clocking or clock gating scheme must be implemented very carefully, however, as safe circuit operation is otherwise compromised.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128007303000071

Performance improvement methods

Shigeyuki Takano , in Thinking Machines, 2021

Abstract

Deeper pipelining, a popular approach to improving performance, achieves a higher clock frequency by subdividing a module, but it consumes more power because the added pipeline registers dissipate additional energy. This chapter introduces techniques that do not rely on deeper pipelining.

Moving larger amounts of data across a chip consumes more energy, and external memory accesses cost even more in both energy and latency; energy efficiency therefore degrades quickly when performance is bound by external memory access. CPUs and GPUs share this problem, but rather than reducing external memory accesses they simply adopt high-performance external memory interfaces, and rather than attacking energy consumption directly they simply apply quantization with an 8-bit or 16-bit data width. This chapter introduces methods for improving hardware performance or energy consumption that are commonly used in machine learning hardware.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128182796000165

From Algorithms to Architectures

In Top-Down Digital VLSI Design, 2015

3.4.3 Pipelining

Pipelining aims at increasing throughput by cutting combinational depth into several separate stages of approximately uniform computational delays by inserting registers in between. The combinational logic between two subsequent pipeline registers is designed and optimized to compute one specific subfunction. As an ensemble, the various stages cooperate like specialist workers on an assembly line. Fig.3.21 sketches a functional unit for f subdivided into p = 3 pipeline stages by p − 1 extra registers. Note the absence of any control hardware.

Figure 3.21. Pipelining. DDG (a) and hardware configuration for p = 3 (b).

Performance and cost analysis

Assumptions:

1.

The combinational logic for implementing function f is not affected by the number of pipeline stages introduced. Its overall size A_f, therefore, remains constant.

2.

The pipeline is lossless and balanced, i.e., similarly to decomposition, it is always possible to partition the logic into p stages such that all have identical delays t_f / p.

3.

The size penalty of pipelining can be expressed by an additive term A_reg for each register, accounting for the silicon area occupied by storage elements.

4.

At each pipeline stage, a performance penalty results from introducing a register delay t_reg, which includes the delay caused by the storage element.

Pipelining changes performance and cost figures as follows:

(3.23) A(p) = A_f + p · A_reg

(3.24) Γ(p) = 1

(3.25) t_lp(p) ≈ t_f / p + t_reg

(3.26) AT(p) ≈ p · A_reg t_reg + (A_reg t_f + A_f t_reg) + (1/p) · A_f t_f

(3.27) L(p) = p

(3.28) E(p) ≈ E_f (coarse grain), E(p) ≈ E_f + E_reg (fine grain)

Both performance and size grow monotonically with pipeline depth. The same holds true for latency. More interestingly, a modest number of pipeline stages, each of substantial depth, dramatically lowers the AT-product due to (3.16). This regime is referred to as coarse grain pipelining.
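
The cost model can be evaluated numerically; the figures below are illustrative placeholders, not data from any particular library:

```python
def area(A_f, A_reg, p):
    return A_f + p * A_reg              # (3.23)

def t_lp(t_f, t_reg, p):
    return t_f / p + t_reg              # (3.25)

# coarse grain pipelining: a few stages already slash the AT-product
A_f, t_f, A_reg, t_reg = 1000.0, 100.0, 20.0, 2.0
at = {p: area(A_f, A_reg, p) * t_lp(t_f, t_reg, p) for p in (1, 2, 4, 8)}
# at[1] == 104040.0 while at[4] == 29160.0: area grows mildly, AT drops sharply
```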

Example

Equation (3.25) relates combinational delay to register delay. Another popular way to quantify the degree of pipelining is to express the delay on the longest path as a multiple of fanout-of-4 (FO4) inverter delays.

Continuing along this line, one may want to insert more and more pipeline registers. However, (3.25) reveals that the benefit fades when the combinational delay per stage t_f / p approaches the register delay t_reg. For large values of p, the area-delay product gets dominated by the register delay rather than by the payload function. A natural question for this type of deep or fine grain pipelining is to ask

"What is the maximum computation rate for which a pipeline can be built?"

The fastest logic gates that can do any data processing are 2-input nand and nor gates. Even if we are prepared to profoundly redesign a pipeline's logic in an attempt to minimize the longest path t_lp, we must leave room for at least one such gate between two consecutive registers.

(3.29) T_cp ≥ min(t_lp) = min(t_gate) + t_reg = min(t_nand, t_nor) + t_su,ff + t_pd,ff

thus represents a lower bound for (3.25). Practical applications that come close to this theoretical minimum are limited to tiny subcircuits, however, mainly because of the disproportionate number of registers required, but also because meeting assumptions 1 and 2 is difficult with fine grained pipelines. Even in high-performance datapaths, energy efficiency and economic reasons preclude pipelining below 12 or so FO4 inverter delays per stage. As the above numbers illustrate, Intel has in fact reversed the historical trend towards ever deeper pipelines when transiting from the Pentium 4 to the Core architecture to reclaim energy efficiency [39]. Fig.3.22 confirms that this is indeed a general industry trend.

Figure 3.22. Evolution of pipeline depth over the years

(source Stanford CPU database).

Equation (3.29) further indicates that register delay is critical in high speed design. In fact, a typical relation is t_reg ≈ 3…5 min(t_gate). As a consequence, it takes twenty or so levels of logic between subsequent registers before flip-flop delays are relegated to insignificant proportions. A high-speed cell library must, therefore, not only include fast combinational functions but also provide bistables with minimum insertion delays.

Example

Plugging into (3.29) typical numbers for a 2-input nor gate and a D-type flip-flop with no reset from a 130 nm CMOS standard cell library, one obtains T_cp ≥ t_NOR2D1 + t_DFFPB1 = 18 ps + 249 ps ≈ 267 ps, which corresponds to a maximum computation rate of about 3.7 GHz.
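
The arithmetic of this example is easy to reproduce (delay figures taken from the text):

```python
t_nor2 = 18e-12      # fastest 2-input gate delay, 130 nm library
t_ff   = 249e-12     # flip-flop setup plus propagation delay
T_cp_min = t_nor2 + t_ff          # lower bound of (3.29): about 267 ps
f_max = 1.0 / T_cp_min            # maximum computation rate, about 3.7 GHz
```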

"How many stages yield optimum pipeline efficiency?"

Optimum hardware efficiency means minimum size-time product

(3.30) AT(p) → min

which is obtained for

(3.31) p_0 = √( A_f t_f / (A_reg t_reg) )

Beyond this point, adding more pipeline registers causes the size-time product to deteriorate even though performance is still pushed further. It also becomes evident from (3.31) that, in search of an economic solution, the more complex a function, the more pipelining it supports. In practice, efficiency is likely to degrade before p_0 is reached because our initial assumptions 1 and 2 cannot be entirely satisfied. Also, shallow logic just a few gates deep is more exposed to on-chip variations (OCV).
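
The optimum of (3.31) can be checked numerically (illustrative numbers again): at p_0 the register term p·A_reg·t_reg exactly balances the payload term A_f·t_f/p, and the size-time product worsens on either side.

```python
def at_product(A_f, t_f, A_reg, t_reg, p):
    return (A_f + p * A_reg) * (t_f / p + t_reg)    # (3.23) times (3.25)

def p_opt(A_f, t_f, A_reg, t_reg):
    return (A_f * t_f / (A_reg * t_reg)) ** 0.5     # (3.31)

params = dict(A_f=1000.0, t_f=100.0, A_reg=20.0, t_reg=2.0)
p0 = p_opt(**params)       # 50.0 for these numbers
# at_product is minimal at p0 and deteriorates for both p0/2 and 2*p0
```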

"How does pipelining affect energy efficiency?"

The additional registers suggest that any pipelined datapath dissipates more energy than the reference architecture does. This is certainly true for fine grain pipelines where the energy wasted by the switching of all those extra subcircuits becomes the dominant contribution.

For coarse grain designs, the situation is more fortunate. Experience shows that pipeline registers tend to reduce the unproductive switching activity associated with glitching in deep combinational networks, a beneficial side effect neglected in a simple additive model.

Interestingly, our finding that throughput is greatly increased makes it possible to take advantage of coarse grain pipelining for improving energy efficiency, albeit indirectly. Recall that the improved throughput results from cutting the longest path while preserving a processing rate of one data item per computation cycle. The throughput of the isomorphic architecture is thus readily matched by a pipelined datapath implemented in a slower yet more energy-efficient technology, e.g. by operating CMOS logic from a lower supply voltage or by using mostly minimum size transistors. Our model cannot reflect this opportunity because we have decided to establish energy figures under the assumption of identical operating conditions and cell libraries. Another highly welcome property of pipelining is the absence of energy-dissipating control logic.

Pipelining in the presence of multiple feedforward paths

Although pipelining can be applied to arbitrary feedforward computations, there is a reservation of economic nature when a DDG includes many parallel paths. In order to preserve overall functionality, any latency introduced into one of the signal propagation paths must be balanced by inserting an extra register into each of its parallel paths. Unless those shimming registers help cut combinational depth there, they bring about substantial size and energy penalties, especially for deep pipelines where p is large.

Example

With simplifications, fig.3.23a reproduces the block cipher IDEA. Variable k stands for the block index, l(k) and r(k) each denote half of a 64 bit plaintext block while v(k) and w(k) do the same for a 64 bit ciphertext block. u(k) and u′(k) stand for the keys used during enciphering and deciphering operations respectively. Provided the two keys are the same, i.e. u′(k) = u(k), the net result is l′(k) = l(k) and r′(k) = r(k), which implies that the plaintext is recovered after calling g twice. Note that this involution property is totally independent of function c which, therefore, can be designed such as to maximize cryptographic security.

Figure 3.23. Involutory cipher algorithm. DDG before (a) and after pipelining (b).

Extensive pipelining seems a natural way to reconcile the computational complexity of c with ambitious performance goals. Yet, as a consequence of the two paths bypassing c , every pipeline register entails two shimming registers, effectively tripling the costs of pipelining, see fig.3.23b. This is the reason why pipeline depth had to be limited to eight stages per round in a VLSI implementation of the IDEA cipher in spite of stringent throughput requirements [52].


URL:

https://www.sciencedirect.com/science/article/pii/B9780128007303000034

Flash Memories: Structure and Constraints

Jalil Boukhobza , Pierre Olivier , in Flash Memory Integration, 2017

2.1.4.2.1 Advanced operations at the level of a chip

We can distinguish the following operations:

Read and write in cache mode [MIC 04, MIC 06] are supported in certain chips that contain an additional register of the size of a page, the cache buffer, situated between the NAND transistor array and the page buffer (illustrated on the right of Figure 2.3, page 20). This extra register pipelines (temporal overlapping) the transmission of a page to be read from the NAND array into the cache buffer, with a transfer via the I/O bus of a page read beforehand in the NAND array and moved from the cache buffer to the page buffer. The opposite operation is also possible for a write operation;

Multi-plane operations make it possible to send several commands of the same type (read, write or erase) in parallel to different planes of the same chip. This is not a true parallelism, because the planes within a chip share the same I/O bus and therefore data transfers are interleaved;

The copy-back operation moves data stored in a flash page to another flash page situated in the same plane as the source page. This is achieved by means of the page buffer, without any transfer through the I/O bus.


URL:

https://www.sciencedirect.com/science/article/pii/B9781785481246500037

Case Study: A Pipelined Multiplier Accumulator

Peter J. Ashenden , in The Designer's Guide to VHDL (Third Edition), 2008

10.1 Algorithm Outline

A complex MAC operates on two sequences of complex numbers, {x_i} and {y_i}. The MAC multiplies corresponding elements of the sequences and accumulates the sum of the products. The result is

s = x_1 × y_1 + x_2 × y_2 + ⋯ + x_N × y_N

where N is the length of the sequences. Each complex number is represented in Cartesian form, consisting of a real and an imaginary part. If we are given two complex numbers x and y, their product is a complex number p, calculated as follows:

p_real = x_real × y_real − x_imag × y_imag
p_imag = x_real × y_imag + x_imag × y_real

The sum of x and y is a complex number s calculated as follows:

s_real = x_real + y_real
s_imag = x_imag + y_imag

Our MAC calculates its result by taking successive pairs of complex numbers, one each from the two input sequences, forming their complex product and adding it to an accumulator register. The accumulator is initially cleared to zero and is reset after each pair of sequences has been processed.

If we count the operations required for each pair of input numbers, we see that the MAC must perform four multiplications to form partial products, then a subtraction and an addition to form the full product and finally two additions to accumulate the result. This is shown diagrammatically at the top of Figure 10.1. Since the operations must be performed in this order, the time taken to complete processing one pair of inputs is the sum of the delays for the three steps. In a high-performance digital signal processing application, this delay may cause the bandwidth of the system to be reduced below a required minimum.

Figure 10.1. Dataflow diagrams showing order of operations performed by the MAC. Top: combinatorial organization. Bottom: pipelined organization.

We can avoid the delay by pipelining the MAC, that is, organizing it like an assembly line, as shown at the bottom of Figure 10.1. The first pair of input numbers is stored in the input register on the first clock edge. During the first clock cycle, the multipliers calculate the partial products, while the system prepares the next pair of inputs. On the second clock edge, the partial products are stored in the first pipeline register, and the next pair of inputs is entered into the input register. During the second clock cycle, the subtracter and adder produce the full product for the first input pair, the multipliers produce the partial products for the second input pair and the system prepares the third input pair. On the third clock edge, these are stored, respectively, in the second pipeline register, the first pipeline register and the input register. Then in the third clock cycle, the adders accumulate the product of the first pair with the previous sum, and the preceding stages operate on the second and third pairs, while the system prepares the fourth pair. The sum in the accumulator is updated on the fourth clock edge. Thus, three clock cycles after the first pair of numbers was entered into the input latch, the sum including this pair is available at the output of the MAC. Thereafter, successive sums are available each clock cycle. The advantage of this approach is that the clock period can be reduced to the delay of the slowest pipeline stage, rather than the total of their delays.
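
The three-stage schedule just described can be sketched in Python (plain floats rather than 16-bit fixed point; a behavioral model, not the VHDL design developed in this chapter):

```python
def complex_mac(xs, ys):
    pp = full = None                  # the two pipeline registers
    acc = (0.0, 0.0)                  # accumulator register
    for x, y in list(zip(xs, ys)) + [(None, None)] * 2:   # 2 flush cycles
        # stage 3: add last cycle's full product to the accumulator
        if full is not None:
            acc = (acc[0] + full[0], acc[1] + full[1])
        # stage 2: combine last cycle's four partial products
        full = (pp[0] - pp[1], pp[2] + pp[3]) if pp is not None else None
        # stage 1: partial products of the pair latched on this edge
        if x is not None:
            (xr, xi), (yr, yi) = x, y
            pp = (xr * yr, xi * yi, xr * yi, xi * yr)
        else:
            pp = None
    return acc

# (1 + 2j)(3 + 4j) + (0.5 + 0j)(2 + 0j) = (-5 + 10j) + (1 + 0j)
s = complex_mac([(1, 2), (0.5, 0)], [(3, 4), (2, 0)])
```

The loop body reads the downstream registers before overwriting them, which models all three stages updating simultaneously on the same clock edge; the two trailing bubbles drain the pipeline after the last input pair.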

One detail we have yet to consider is initializing and restarting the pipeline. We need to do this to accumulate sums of products of a number of input sequences, one after another. The simplest approach is to include a "clear" input to the accumulator register that forces its content to zero on the next clock edge. For each pair of sequences to be multiplied and accumulated, we start entering numbers into the input registers on successive clock edges. Then, two clock cycles after we have entered the first pair of numbers, we assert the clear input. This causes the accumulator to reset at the same time as the product of the first pair of numbers reaches the second pipeline register. On the following cycle, this product will be added to the zero value forced into the accumulator. After the last pair in the input sequences has been entered, we must wait three clock cycles until the final sum appears at the output of the MAC. We must separate successive input sequences by at least one idle cycle and reset the accumulator between summations.

The final issue in this outline of the MAC algorithm is the representation of the data. We use a 16-bit, two's-complement, fixed-point binary representation. Each of the real and imaginary parts of the two complex inputs and the complex output of the MAC uses the format shown in Figure 10.2. Bit 0 is the sign bit, and the binary point is assumed to be between bits 0 and –1. Using this format, we can represent numbers in the range –1 (inclusive) to +1 (exclusive), with a resolution of 2^−15. This raises the possibility of overflow occurring while summing a sequence of numbers, so we include an overflow status signal in our design. Overflow can occur in two cases. First, intermediate partial sums may fall outside of the range –1 to +1. We can reduce the likelihood of this happening by expanding the range used to represent intermediate results to –16 to +16. However, if an intermediate sum falls outside of the expanded range, the summation for the entire sequence is in error, so the overflow signal must be set. It remains set until the accumulator is cleared, indicating the end of the summation.

Figure 10.2. The format of a 16-bit, two's-complement, fixed-point binary number.

The second overflow case occurs if the final sum falls outside the range of values representable by the MAC output. This may be a transient condition, since a subsequent product, when added to the sum, may bring the sum back in range. We assert the overflow signal only during a cycle in which the final sum is out of range, rather than latching the overflow until the end of summation.
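
The two overflow cases can be mimicked with plain floats (hypothetical thresholds standing in for the two 16-bit formats; the real design works on fixed-point bit vectors):

```python
def accumulate(products, lo=-16.0, hi=16.0):
    acc, sticky = 0.0, False
    for p in products:
        acc += p
        if not (lo <= acc < hi):
            sticky = True     # intermediate sum left the extended range:
                              # the whole summation is in error (latched)
    transient = not (-1.0 <= acc < 1.0)   # final sum not representable now
    return acc, sticky, transient

# 0.9 + 0.3 exceeds +1 only transiently; -0.5 brings the sum back in range
acc, sticky, transient = accumulate([0.9, 0.3, -0.5])
```

The `sticky` flag models the latched overflow of the first case; `transient` models the second case, asserted only while the final sum is out of range.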

Now that we have described the requirements and the algorithm to be performed by the MAC, we can specify its interface. This is defined by the entity declaration:

library ieee;
use ieee.std_logic_1164.all, ieee.fixed_pkg.all;

entity mac is
  port ( clk, reset : in std_ulogic;
         x_real : in u_sfixed(0 downto -15);
         x_imag : in u_sfixed(0 downto -15);
         y_real : in u_sfixed(0 downto -15);
         y_imag : in u_sfixed(0 downto -15);
         s_real : out u_sfixed(0 downto -15);
         s_imag : out u_sfixed(0 downto -15);
         ovf : out std_ulogic );
end entity mac;

The clk port is used to synchronize operation of the MAC. All data transfers into registers in the pipeline are done on the rising edge (from '0' to '1') of this signal. The reset port causes the accumulator registers to be cleared to zero and the overflow condition to be reset. The data ports all use the signed data type from the standard fixed-point package described in Section 9.2.4. The index ranges correspond to the format shown in Figure 10.2. The ports x_real, x_imag, y_real and y_imag are the real and imaginary parts of the two input data sequences. These input ports, as well as the reset port, are sampled synchronously on the rising edge of the clk signal. The ports s_real and s_imag are the real and imaginary parts of the accumulated sum, and the ovf port is the overflow flag, set as described above. These output ports become valid after each rising edge of the clk signal.


URL:

https://www.sciencedirect.com/science/article/pii/B9780120887859000101

Implementation Issues

Bruce Jacob , ... David T. Wang , in Memory Systems, 2008

5.4.2 Processor Interfacing

Figure 5.55 shows two typical ways of interfacing a microprocessor to a data cache. Figure 5.55(a) shows where the cache could connect to an in-order processor pipeline. It shows the address, control, and data being supplied to the cache stage by pipeline registers, and cache access is assumed to fit in one processor cycle, providing result signals (like HIT) and the data, in case of a load. The result signals are then used by the pipeline control circuitry to decide whether to stall the pipe depending on whether the access is a hit or a miss.

FIGURE 5.55. Cache-processor interface for an (a) in-order and (b) an out-of-order processor.

Alternatively, Figure 5.55(b) shows where and how the cache could fit in an out-of-order execution pipeline. The figure shows the internals of the load/store unit, and with the assumption that the proper structures exist outside this unit that allow for non-dependent instructions to execute out of order, this unit operates independently of the other functional units, and cache misses do not need to stall the processor core (except in extreme cases where internal data structures within the load/store unit have filled up and cannot accept new instructions).

The load/store unit typically includes a load/store queue used as a holding tank for memory instructions. When a load or store is cleared to access the cache (i.e., it has all its needed operands and performing the access will not cause any memory hazards or inconsistencies), information is retrieved from the load/store instruction and used to access the cache. Processor structures are updated with the results of the access. In the case of a miss, the load/store instruction is made to wait and retry the access, and in some implementations, the instruction is transferred to another structure inside the load/store unit that holds accesses that missed.


URL:

https://www.sciencedirect.com/science/article/pii/B9780123797513500072

Virtual Memory

Bruce Jacob , ... David T. Wang , in Memory Systems, 2008

Pipeline Modifications

A RiSC-16 pipeline that handles interrupts and exceptions precisely is shown in Figure 31.27. Note that a solid square represents concatenation in the figure. The pipeline diagram reflects the following visible modifications:

FIGURE 31.27. RiSC-16 extended pipeline with support for precise interrupts and exceptions.

1.

Support for detecting and handling exceptions and interrupts has been added by the creation of an exception register (labeled EXC in the figure) in pipeline registers IF/ID through MEM/WB. Also, the instruction's PC is maintained all the way to the MEM/WB register. If a stage's incoming EXC register is non-zero, the corresponding instruction is interpreted as having caused an exception. The pipeline uses these values to ensure that all instructions following an exceptional instruction become NOPs: if there is an exception in the write-back stage, all other instructions in the pipe should be squashed. If a stage detects a non-zero EXC value, the associated instruction is disabled (its corresponding OP value is turned into something without side effects, like ADD, and its target register rT is set to be non-writable).

2.

TLB access has been added to instruction-fetch and data access stages. The choice of a single TLB versus split I- and D-TLBs does not affect the substance of the implementation. Note, however, that the write-back stage must be able to distinguish between an I-TLB miss and a D-TLB miss for purposes of generating the PTE address (this is similar to the MIPS mechanism [Kane & Heinrich 1992] and is described later). This can be done with a 16-bit register in MEMWB to hold the faulting address (which holds either EXMEM.aluout if the D-TLB causes a miss or EXMEM.pc otherwise) or a simple status bit in MEMWB which would be set similarly.

Each pipeline stage must suspend normal operation if its instruction has caused an exception and the stage modifies machine state (for example, the memory stage must not write to memory if the instruction caused a privilege violation in an earlier stage). Each stage must forward a non-zero incoming exception code to the following pipeline stage. If the instruction has not already caused an exception but does so during the stage in question, the EXC field in the following pipeline register must be set appropriately. Finally, any stage whose "normal" operation can modify state must also suspend that operation whenever there is an exceptional instruction in the write-back stage. For example, if the MEMWB.exc register is non-zero, indicating an exceptional instruction in the write-back stage, the memory stage must not allow the instruction it holds to read or write the memory system. Otherwise, pipeline operation proceeds as normal.

In the simplest case, when an exceptional instruction reaches the write-back stage, the hardware performs the following steps:

1.

Either the PC of the exceptional instruction or the PC of the instruction after the exceptional instruction (PC+1) is saved in a safe place, for instance, a control register built for just such a purpose, typically called something like the exceptional PC (EPC) register. The choice of which value to save is based on the opcode of the exceptional instruction and the type of exception raised: some exception-raising instructions should be retried at the end of a handler's execution (e.g., a load or store instruction that causes a TLB miss), while others should be jumped over (e.g., TRAP instructions that invoke the operating system—jumping back to a TRAP instruction would simply re-invoke the trap and would cause an endless loop). If the exceptional instruction should be retried, the handler returns to PC; if the exceptional instruction should not be re-executed or retried, the handler returns to PC + 1.

2.

The exception type is used as an index into the interrupt vector table (IVT), located at a known physical address, and the vector corresponding to the exception type is loaded into the program counter. This is known as vectoring to the exception/interrupt handler.

3.

Some exceptions cause the hardware to perform additional steps before vectoring to the handler. For instance, when handling a TLB-miss exception, before vectoring to the handler, the hardware might create an address for the handler to use in searching the page table. Most architectures that use software-managed TLBs provide such a feature, and ours is described in detail later.
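Steps 1 and 2 above can be sketched as a single function. The IVT layout and exception-type names here are assumptions for illustration: retryable exceptions (such as a TLB miss) save the faulting PC itself, while exceptions that must be jumped over (such as a TRAP) save PC + 1, and the handler address is looked up in the interrupt vector table.

```python
# Exception types that should be retried after the handler returns;
# anything else (e.g., a TRAP) resumes at the following instruction.
RETRY_TYPES = {"tlb_miss"}

def take_exception(pc, exc_type, ivt):
    """Commit an exception at write-back.
    Returns (epc, new_pc): the return PC saved in the EPC register,
    and the handler address vectored to via the IVT."""
    epc = pc if exc_type in RETRY_TYPES else pc + 1
    new_pc = ivt[exc_type]     # vector to the exception/interrupt handler
    return epc, new_pc
```

Saving PC for a TRAP instead would re-invoke the trap on return, producing the endless loop described above.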

The general form of an exception/interrupt handler, a short piece of software deep in the operating system, looks like the following (note that, in this architecture, hardware places the machine into a privileged operating mode and saves the previous operating mode):

1.

Save the EPC in a safe location. This is done in case another exception or interrupt occurs before the handler has completed execution, which would cause the EPC register to be overwritten.

2.

Handle the exception/interrupt.

3.

Reload the EPC.

4.

Return the processor to the previous operating mode, e.g., user or privileged, and jump to the EPC.

Most architectures have a facility ("return-from-exception," or similar name) that performs step 4 in an atomic manner.
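The four handler steps can be sketched in simulation form. Machine state is modeled as a plain dict and all names are illustrative; the point of the sketch is step 1: because the EPC is copied to a safe location first, a nested exception that clobbers the EPC register does not corrupt the eventual return address.

```python
def handle_exception(machine, do_handle):
    """Software side of exception handling, per the four steps above."""
    saved_epc = machine["epc"]       # 1. save EPC in a safe location
    do_handle(machine)               # 2. handle the exception/interrupt
    machine["epc"] = saved_epc       # 3. reload EPC
    # 4. atomically restore the previous mode and jump to the EPC
    #    (the "return-from-exception" facility)
    machine["mode"] = machine["prev_mode"]
    machine["pc"] = machine["epc"]
    return machine
```

Passing a `do_handle` that overwrites `machine["epc"]` (as a nested exception would) shows that control still returns to the original PC.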

URL: https://www.sciencedirect.com/science/article/pii/B9780123797513500333