authors: Tiago Mück
last edited: 2023-01-11 16:05:55 +0000

CHI

The CHI Ruby protocol provides a single cache controller that can be reused at multiple levels of the cache hierarchy and configured to model multiple instances of MESI and MOESI cache coherency protocols. This implementation is based on Arm’s AMBA 5 CHI specification and provides a scalable framework for the design space exploration of large SoC designs.

CHI overview and terminology

CHI (Coherent Hub Interface) provides a component architecture and transaction-level specification to model MESI and MOESI cache coherency. CHI defines three main components as shown in the figure below:

CHI components

An HNF is the point of coherency (PoC) and point of serialization (PoS) for a specific address range. The HNF is responsible for issuing any required snoop requests to RNFs or memory access requests to SNFs in order to complete a transaction. The HNF can also encapsulate a shared last-level cache and include a directory for targeted snoops.

The CHI specification also defines specific types of nodes for non-coherent requesters (RNI) and non-coherent address ranges (HNI and SNI), e.g., memory ranges belonging to IO components. In Ruby, IO accesses don’t go through the cache coherency protocol, so only CHI’s fully coherent node types are implemented. In this documentation we use the terms RN / RNF, HN / HNF, and SN / SNF interchangeably. We also use the terms upstream and downstream to refer to components in the previous (i.e., towards the CPU) and next (i.e., towards memory) levels in the memory hierarchy, respectively.

Protocol overview

The CHI protocol implementation consists mainly of two controllers:

To allow fully flexible cache hierarchies, Cache_Controller can be configured to model any cache level (e.g., L1D, private L2, shared L3) within both request and home nodes. It also supports multiple features not available in other Ruby protocols:

The implementation defines the following cache states:

The figure below gives an overview of the state transitions when the controller is configured as an L1 cache:

L1 cache state machine

Transitions are annotated with the incoming request from the CPU (or generated internally, e.g., Replacements) and the resulting outgoing request sent downstream. For simplicity, the figure omits requests that do not change states (e.g., cache hits) and invalidating snoops (final state is always I), and it shows only the typical state transitions in a MOESI protocol. In CHI the final state is ultimately determined by the type of data returned by the responder (e.g., a requester may receive UD or UC data in response to a ReadShared).

The figures below show the transitions for an intermediate-level cache controller (e.g., private L2, shared L3, HNF, etc.):

Intermediate cache state machine

Intermediate cache directory states

As in the previous case, cache hits are omitted for simplicity. In addition to the cache states, the following directory states are defined to track lines present in an upstream cache:

When the line is present both in the local cache and upstream caches the following combined states are possible:

The RUSC and RUSD states (omitted in the figures above) are used to keep track of lines for which the controller still has exclusive access permissions without having the line in its local cache. This is possible in a non-inclusive cache, where a local block can be deallocated without back-invalidating upstream copies.

When a cache controller is an HNF (home node), the state transitions are basically the same as for an intermediate-level cache, except for these differences:

For more information on DCT and DMT transactions, see Sections 1.7 and 2.3.1 in the CHI specification. DMT and DCT are CHI features that allow the data source for a request to send data directly to the original requester. On a DMT request, the SN sends data directly to the RN (instead of sending it first to the HN, which would then forward it to the RN), while with DCT, the HN requests that an RN being snooped (the snoopee) send a copy of the line directly to the original requester. With DCT enabled, the HN may also request that the snoopee send the data to both the HN and the original requester, so the HN can also cache the data. This depends on the allocation policy defined by the configuration parameters. Notice that the allocation policy also changes the cache state transitions. For simplicity, the figure above illustrates an inclusive cache.
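The data paths described above can be summarized with a small Python sketch. It is illustrative only: the function name data_path and the node labels are assumptions made for this example, and it lists only the hops taken by the data, not the full CHI message exchange.

```python
def data_path(source, direct_transfer, hn_also_caches=False):
    """Return the (sender, receiver) hops taken by the data for one read.

    source: "SN" when memory supplies the data, "snoopee" when a snooped RN does.
    direct_transfer: True when DMT (source "SN") or DCT (source "snoopee") is used.
    hn_also_caches: with DCT, the HN may ask the snoopee for its own copy too.
    """
    if source not in ("SN", "snoopee"):
        raise ValueError("unknown data source: " + source)
    if not direct_transfer:
        # Data is sent to the HN first, which then forwards it to the requester.
        return [(source, "HN"), ("HN", "RN")]
    hops = [(source, "RN")]
    if source == "snoopee" and hn_also_caches:
        # Hypothetical flag standing in for the HN allocation policy.
        hops.append((source, "HN"))
    return hops


dmt_hops = data_path("SN", direct_transfer=True)
no_dmt_hops = data_path("SN", direct_transfer=False)
dct_hops = data_path("snoopee", direct_transfer=True, hn_also_caches=True)
```

With DMT or DCT the data reaches the requester in one hop instead of two, which is the latency benefit these features model.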

The following is a list of the main configuration parameters of the cache controller that affect the protocol behavior (please refer to the protocol SLICC specification for details and a full list of parameters):

These parameters affect the cache controller performance:

Section Protocol implementation gives an overview of the protocol implementation, while Section Supported CHI transactions describes the implemented subset of the AMBA 5 CHI spec. The next sections refer to specific files in the protocol source code and include SLICC snippets of the protocol. Some snippets were slightly simplified compared to the actual SLICC specification.

Protocol implementation

The figure below gives an overview of the cache controller implementation.

Cache controller architecture

In Ruby, a cache controller is implemented by defining a state machine using the SLICC language. Transitions in the state machine are triggered by messages arriving at input queues. In our particular implementation, separate incoming and outgoing message queues are defined for each CHI channel. Incoming request and snoop messages that start a new transaction go through the same Request allocation process, where we allocate a transaction buffer entry (TBE) and move the request or snoop to an internal queue of transactions that are ready to be initiated. If the transaction buffer is full, the request is rejected and a retry message is sent.

The actions to be performed for a message dequeued from the input / rdy queues depend on the state of the target cache line. The data state of the line is stored in the cache if the line is cached locally, while the directory state is stored in a directory entry if the line is present in any upstream cache. For lines with outstanding requests, the transient state is kept in the TBE and copied back to the cache and/or directory when the transaction finishes. The figure below describes the phases in the transaction lifetime and the interactions between the main components in the cache controller (input/output ports, TBETable, Cache, Directory and the SLICC state machine). The phases are described in more detail in the subsequent sections.

Transaction lifetime

Transaction allocation

The code snippet below shows how an incoming request in the reqIn port is handled. The reqIn port receives incoming messages from CHI’s request channel:

in_port(reqInPort, CHIRequestMsg, reqIn) {
  if (reqInPort.isReady(clockEdge())) {
    peek(reqInPort, CHIRequestMsg) {
      if (in_msg.allowRetry) {
        trigger(Event:AllocRequest, in_msg.addr, 
              getCacheEntry(in_msg.addr), getCurrentActiveTBE(in_msg.addr));
      } else {
        trigger(Event:AllocRequestWithCredit, in_msg.addr,
              getCacheEntry(in_msg.addr), getCurrentActiveTBE(in_msg.addr));
      }
    }
  }
}

The allowRetry field indicates messages that can be retried. Requests that cannot be retried are only sent by a requester that previously received credit (see RetryAck and PCrdGrant in the CHI specification). The transition triggered by Event:AllocRequest or Event:AllocRequestWithCredit executes a single action, which either reserves space in the TBE table for the request and moves it to the reqRdy queue, or sends a RetryAck message:

action(AllocateTBE_Request) {
  if (storTBEs.areNSlotsAvailable(1)) {
    // reserve a slot for this request
    storTBEs.incrementReserved();
    // Move request to rdy queue
    peek(reqInPort, CHIRequestMsg) {
      enqueue(reqRdyOutPort, CHIRequestMsg, allocation_latency) {
        out_msg := in_msg;
      }
    }
  } else {
    // we don't have resources to track this request; enqueue a retry
    peek(reqInPort, CHIRequestMsg) {
      enqueue(retryTriggerOutPort, RetryTriggerMsg, 0) {
        out_msg.addr := in_msg.addr;
        out_msg.event := Event:SendRetryAck;
        out_msg.retryDest := in_msg.requestor;
        retryQueue.emplace(in_msg.addr,in_msg.requestor);
      }
    }
  }
  reqInPort.dequeue(clockEdge());
}

Notice that we don’t create and send a RetryAck message directly from this action. Instead, we create a separate trigger event in the internal retryTrigger queue. This is necessary to prevent resource stalls from halting this action. Section Performance modeling below explains resource stalls in more detail.
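The allocate-or-retry decision can be illustrated with a minimal Python model. This is a sketch under stated assumptions, not the protocol code: the TBEStorage class, the dictionary message layout, and the queue names are hypothetical stand-ins that mirror the SLICC snippet above.

```python
from collections import deque


class TBEStorage:
    """Hypothetical stand-in for the TBE table: it only counts reservations."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.reserved = 0

    def are_n_slots_available(self, n):
        return self.reserved + n <= self.num_entries

    def increment_reserved(self):
        self.reserved += 1


def allocate_tbe_request(req_in, stor_tbes, req_rdy, retry_trigger):
    """Consume the request at the head of req_in, mirroring AllocateTBE_Request."""
    msg = req_in.popleft()
    if stor_tbes.are_n_slots_available(1):
        # Reserve a slot and move the request to the ready queue.
        stor_tbes.increment_reserved()
        req_rdy.append(msg)
    else:
        # No resources: queue an internal trigger instead of sending RetryAck here.
        retry_trigger.append(
            {"addr": msg["addr"], "event": "SendRetryAck", "retryDest": msg["requestor"]}
        )


stor_tbes = TBEStorage(num_entries=1)
req_in = deque([
    {"addr": 0x40, "requestor": "rn0"},
    {"addr": 0x80, "requestor": "rn1"},
])
req_rdy, retry_trigger = deque(), deque()
while req_in:
    allocate_tbe_request(req_in, stor_tbes, req_rdy, retry_trigger)
```

With a single TBE slot, the first request is moved to the ready queue and the second one produces a retry trigger for rn1. In both cases the request is removed from the input queue, so the input port never blocks on a full TBE table.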

Incoming requests from a Sequencer object (typically connected to a CPU when the controller is used as an L1 cache) and snoop requests arrive through the seqIn and snpIn ports and are handled similarly, except for:

Transaction initialization

Once a request has been allocated a TBE and moved to the reqRdy queue, an event is triggered to initiate the transaction. We trigger a different event for each different request type:

in_port(reqRdyPort, CHIRequestMsg, reqRdy) {
  if (reqRdyPort.isReady(clockEdge())) {
    peek(reqRdyPort, CHIRequestMsg) {
      CacheEntry cache_entry := getCacheEntry(in_msg.addr);
      TBE tbe := getCurrentActiveTBE(in_msg.addr);
      trigger(reqToEvent(in_msg.type), in_msg.addr, cache_entry, tbe);
    }
  }
}

Each request requires different initialization actions depending on the initial state of the line. To illustrate this process, let’s use as an example a ReadShared request for a line in the SC_RSC state (shared clean in the local cache and shared clean in an upstream cache):

transition(SC_RSC, ReadShared, BUSY_BLKD) {
  Initiate_Request;
  Initiate_ReadShared_Hit;
  Profile_Hit;
  Pop_ReqRdyQueue;
  ProcessNextState;
}

Initiate_ReadShared_Hit is defined as follows:

action(Initiate_ReadShared_Hit) {
  tbe.actions.push(Event:TagArrayRead);
  tbe.actions.push(Event:ReadHitPipe);
  tbe.actions.push(Event:DataArrayRead);
  tbe.actions.push(Event:SendCompData);
  tbe.actions.push(Event:WaitCompAck);
  tbe.actions.pushNB(Event:TagArrayWrite);
}

tbe.actions stores the list of events that need to be triggered in order to complete the transaction. In this particular case, TagArrayRead, ReadHitPipe, and DataArrayRead introduce delays to model the cache controller pipeline latency and the latency of reading the cache/directory tag array and the cache data array (see Section Performance modeling). SendCompData sets up and sends the data responses for the ReadShared request, and WaitCompAck sets up the TBE to expect the completion acknowledgement from the requester. Finally, TagArrayWrite introduces the delay of updating the directory state to track the new sharer.

Transaction execution

After initialization, the line transitions to the BUSY_BLKD state, as shown in transition(SC_RSC, ReadShared, BUSY_BLKD). BUSY_BLKD is a transient state indicating that the line now has an outstanding transaction. In this state, the transaction is driven either by incoming response messages in the rspIn and datIn ports or by trigger events defined in tbe.actions.

The ProcessNextState action is responsible for checking tbe.actions and enqueuing trigger event messages into actionTriggers at the end of all transitions to the BUSY_BLKD state. ProcessNextState first checks for pending response messages. If there are no pending messages, it enqueues a message to actionTriggers in order to trigger the event at the head of tbe.actions. If there are pending responses, then ProcessNextState does nothing, as the transaction will proceed once all expected responses are received.

Pending responses are tracked by the expected_req_resp and expected_snp_resp fields in the TBE. For instance, the ExpectCompAck action, executed from the transition triggered by WaitCompAck, is defined as follows:

action(ExpectCompAck) {
  tbe.expected_req_resp.addExpectedRespType(CHIResponseType:CompAck);
  tbe.expected_req_resp.addExpectedCount(1);
}

This causes the transaction to wait until a CompAck response is received.

Some actions can be allowed to execute while the transaction has pending responses. These actions are enqueued using tbe.actions.pushNB (i.e., push / non-blocking). In the example above, tbe.actions.pushNB(Event:TagArrayWrite) models a tag write being performed while the transaction waits for the CompAck response.

Transaction finalization

The transaction ends when it has no more pending responses and tbe.actions is empty. ProcessNextState checks for this condition and enqueues a “finalizer” trigger message into actionTriggers. When handling this event, the current cache line state and sharing/ownership information determine the final stable state of the line. Data and state information are updated in the cache and directory, if necessary, and the TBE is deallocated.
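The execution and finalization rules described above can be sketched with a small Python model of the action list. This is a simplified illustration rather than the SLICC implementation: the TBE class, the single expected_responses counter, and the "Finalize" event name are assumptions made for this example.

```python
from collections import deque


class TBE:
    """Hypothetical TBE holding an ordered action list and a response counter."""

    def __init__(self):
        self.actions = deque()        # entries are (event, non_blocking)
        self.expected_responses = 0

    def push(self, event):
        self.actions.append((event, False))

    def push_nb(self, event):
        self.actions.append((event, True))


def process_next_state(tbe):
    """Return the next event to trigger, "Finalize", or None to keep waiting."""
    if tbe.actions:
        event, non_blocking = tbe.actions[0]
        # Blocking actions wait for all pending responses; non-blocking ones do not.
        if tbe.expected_responses == 0 or non_blocking:
            tbe.actions.popleft()
            return event
        return None
    if tbe.expected_responses == 0:
        return "Finalize"
    return None


tbe = TBE()
for ev in ["TagArrayRead", "ReadHitPipe", "DataArrayRead", "SendCompData", "WaitCompAck"]:
    tbe.push(ev)
tbe.push_nb("TagArrayWrite")

trace = []
while True:
    event = process_next_state(tbe)
    if event is None:
        # Transaction is waiting: model the arrival of the CompAck response.
        tbe.expected_responses -= 1
        trace.append("recv CompAck")
        continue
    trace.append(event)
    if event == "WaitCompAck":
        tbe.expected_responses += 1
    if event == "Finalize":
        break
```

In the resulting trace, TagArrayWrite is triggered right after WaitCompAck even though the CompAck has not arrived yet, and the finalizer is only produced once the response has been received and the action list is empty.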

Hazard handling

Each controller allows only one active transaction per cache line. If a new request or snoop arrives while the cache line is in a transient state, this creates a hazard as defined in the CHI standard. We handle hazards as follows:

Request hazards: a TBE is allocated as described previously, but the new transaction initialization is delayed until the current transaction finishes and the line is back to a stable state. This is done by moving the request message from reqRdy to a separate stall buffer. All stalled messages are added back to reqRdy when the current transaction finishes and are handled in their original order of arrival.

Snoop hazards: the CHI spec does not allow snoops to be stalled by an existing request. If a transaction is waiting on a response for a request sent downstream (e.g. we sent a ReadShared and are waiting for the data response) we must accept and handle the snoop. The snoop can be stalled only if the request has already been accepted by the responder and is guaranteed to complete (e.g. a ReadShared with pending data but already acked with a RespSepData response). To distinguish between these conditions we use the BUSY_INTR transient state.

BUSY_INTR indicates the transaction can be interrupted by a snoop. When a snoop arrives for a line in this state, a snoop TBE is allocated as described previously and its state is initialized based on the currently active TBE. The snoop TBE then becomes the currently active TBE. Any cache state and sharing/ownership changes caused by the snoop are copied back to the original TBE before deallocating the snoop TBE. When a snoop arrives for a line in the BUSY_BLKD state, we stall the snoop until the current transaction either finishes or transitions to BUSY_INTR.
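The hazard rules above can be condensed into a short Python sketch. The function and class names, the "stable" label for non-transient states, and the returned strings are hypothetical; the sketch only captures the decision logic and the ordering guarantee of the stall buffer.

```python
from collections import deque


def classify(kind, line_state):
    """Decide how a new message for a line is handled.

    kind: "request" or "snoop"
    line_state: "stable", "BUSY_BLKD" or "BUSY_INTR"
    """
    if line_state == "stable":
        return "start"
    if kind == "request":
        return "stall"          # requests always wait for the active transaction
    if line_state == "BUSY_INTR":
        return "interrupt"      # the snoop TBE becomes the active TBE
    return "stall"              # snoop on BUSY_BLKD waits for finish or BUSY_INTR


class StallBuffer:
    """Holds stalled requests for one line, preserving arrival order."""

    def __init__(self):
        self._stalled = deque()

    def stall(self, msg):
        self._stalled.append(msg)

    def wake_all(self, req_rdy):
        # Stalled messages go back to the ready queue in their original order.
        while self._stalled:
            req_rdy.append(self._stalled.popleft())


decisions = [
    classify("request", "BUSY_INTR"),
    classify("snoop", "BUSY_INTR"),
    classify("snoop", "BUSY_BLKD"),
]

buf, req_rdy = StallBuffer(), deque()
buf.stall("ReadShared from rn0")
buf.stall("ReadUnique from rn1")
buf.wake_all(req_rdy)
```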

Performance modeling

As described previously, the cache line state is known immediately when a transaction is initialized and the cache line can be read and written without any latency. This makes it easier to implement the functional aspects of the protocol. To model timing we use explicit actions to introduce latency to a transaction. For example, in the ReadShared code snippet:

action(Initiate_ReadShared_Hit) {
  tbe.actions.push(Event:TagArrayRead);
  tbe.actions.push(Event:ReadHitPipe);
  tbe.actions.push(Event:DataArrayRead);
  tbe.actions.push(Event:SendCompData);
  tbe.actions.push(Event:WaitCompAck);
  tbe.actions.pushNB(Event:TagArrayWrite);
}

TagArrayRead, ReadHitPipe, DataArrayRead, and TagArrayWrite don’t have any functional significance. They are there to introduce latencies that would exist in a real cache controller pipeline, in this case: tag read latency, hit pipeline latency, data array read latency, and tag update latency. The latency introduced by these actions is defined by configuration parameters.

In addition to explicitly added latencies, SLICC has the concept of resource stalls to model resource contention. Given a set of actions executed during a transition, the SLICC compiler automatically generates code that checks whether all resources needed by those actions are available. If any resource is unavailable, a resource stall is generated and the transition is not executed. A message that causes a resource stall remains in the input queue and the protocol attempts to trigger the transition again in the next cycle.
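The following Python sketch illustrates the resource stall behavior just described. It is a simplified model, not the code generated by the SLICC compiler: the try_transition function and the rsp_out_slots resource name are assumptions made for this example.

```python
from collections import deque


def try_transition(in_queue, required, available):
    """Try to execute the transition for the message at the head of in_queue.

    required: dict mapping a resource name to the amount the actions need.
    available: dict mapping a resource name to the amount free this cycle.
    Returns True if the transition executed, False on a resource stall.
    """
    if not in_queue:
        return False
    if any(available.get(res, 0) < amount for res, amount in required.items()):
        # Resource stall: nothing executes and the message stays at the head.
        return False
    for res, amount in required.items():
        available[res] -= amount
    in_queue.popleft()
    return True


in_queue = deque(["ReadShared@0x40"])
required = {"rsp_out_slots": 1}
available = {"rsp_out_slots": 0}

first_cycle = try_transition(in_queue, required, available)
queue_len_after_stall = len(in_queue)

available["rsp_out_slots"] = 1    # a slot frees up before the next cycle
second_cycle = try_transition(in_queue, required, available)
```

The key property is that a stalled transition has no side effects: either all actions execute or none do, and the message is retried unchanged.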

Resources are detected by the SLICC compiler in different ways:

  1. Implicitly. This is the case for output ports. If an action enqueues new messages, the availability of the output port is automatically checked.
  2. Adding the check_allocate statement to an action.
  3. Annotating the transition with a resource type.

We use (2) to check availability of TBEs. See the snippet below:

action(AllocateTBE_Snoop) {
  // No retry for snoop requests; just create resource stall
  check_allocate(storSnpTBEs);
  ...
}

This signals the SLICC compiler to check if the storSnpTBEs structure has a TBE slot available before executing any transition that includes the AllocateTBE_Snoop action.

The snippet below exemplifies (3):

transition({BUSY_INTR,BUSY_BLKD}, DataArrayWrite) {DataArrayWrite} {
  ...
}

The DataArrayWrite annotation signals the SLICC compiler to check for availability of the DataArrayWrite resource type. Resource request types used in these annotations must be explicitly defined by the protocol, as well as how to check them. In our protocol we defined the following types to check for the availability of banks in the cache tag and data arrays:

enumeration(RequestType) {
  TagArrayRead;
  TagArrayWrite;
  DataArrayRead;
  DataArrayWrite;
}

void recordRequestType(RequestType request_type, Addr addr) {
  if (request_type == RequestType:DataArrayRead) {
    cache.recordRequestType(CacheRequestType:DataArrayRead, addr);
  }
  ...
}

bool checkResourceAvailable(RequestType request_type, Addr addr) {
  if (request_type == RequestType:DataArrayRead) {
    return cache.checkResourceAvailable(CacheResourceType:DataArray, addr);
  }
  ...
}

Implementations of checkResourceAvailable and recordRequestType are required by the SLICC compiler when we use annotations on transitions.
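To show how such a check/record pair can model bank contention, here is a minimal Python sketch. It is an assumption-laden illustration: the BankedArray class, the address-to-bank mapping, and the busy-until bookkeeping are hypothetical and may differ from the arrays used by Ruby's cache memory.

```python
class BankedArray:
    """Hypothetical banked array: each bank is busy for a fixed access latency."""

    def __init__(self, num_banks, access_latency, block_bytes=64):
        self.num_banks = num_banks
        self.access_latency = access_latency
        self.block_bytes = block_bytes
        self.busy_until = [0] * num_banks

    def bank_of(self, addr):
        # Assumed mapping: consecutive blocks are interleaved across banks.
        return (addr // self.block_bytes) % self.num_banks

    def check_resource_available(self, addr, now):
        return self.busy_until[self.bank_of(addr)] <= now

    def record_request(self, addr, now):
        self.busy_until[self.bank_of(addr)] = now + self.access_latency


data_array = BankedArray(num_banks=2, access_latency=3)
data_array.record_request(0x00, now=0)

same_bank_free = data_array.check_resource_available(0x80, now=1)
other_bank_free = data_array.check_resource_available(0x40, now=1)
free_later = data_array.check_resource_available(0x80, now=3)
```

In this model the check plays the role of checkResourceAvailable (it decides whether the transition can execute), while the record call plays the role of recordRequestType (it accounts for the access once the transition does execute).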

Cache block allocation and replacement modeling

Consider the following transaction initialization code for a ReadShared miss:

action(Initiate_ReadShared_Miss) {
  tbe.actions.push(Event:ReadMissPipe);
  tbe.actions.push(Event:TagArrayRead);
  tbe.actions.push(Event:SendReadShared);
  tbe.actions.push(Event:SendCompData);
  tbe.actions.push(Event:WaitCompAck);
  tbe.actions.push(Event:CheckCacheFill);
  tbe.actions.push(Event:TagArrayWrite);
}

All transactions that modify a cache line, or that receive cache line data as a result of a snoop or a request sent downstream, use the CheckCacheFill action trigger event. This event triggers a transition that performs the following actions:

When a replacement is performed, a new transaction is initialized to keep track of any WriteBack or Evict request sent downstream and/or snoops for back-invalidation (if the cache controller is configured to enforce inclusivity). Depending on the configuration parameters, the TBE for the replacement uses resources from a dedicated TBETable or reuses the same resources of the TBE that triggered the replacement. In both cases, the transaction that triggered the replacement completes without waiting for the replacement process.

Notice that CheckCacheFill does not actually write data to the cache block. It only ensures a cache block is allocated if needed, triggers replacements, and models the cache fill latencies. As described previously, TBE data is copied to the cache if needed during the transaction finalization.
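The behavior of CheckCacheFill can be illustrated with a rough Python model. Everything here is an assumption made for illustration: the dictionary-based cache, the needs_block flag, the event tuples, and the oldest-first victim selection are hypothetical and do not reflect the replacement policy or data structures of the actual controller.

```python
def check_cache_fill(cache, addr, needs_block, capacity):
    """Ensure a block is allocated for addr without writing any data.

    cache: dict mapping an address to its data (None until finalization).
    Returns the list of events generated, in order.
    """
    events = []
    if not needs_block or addr in cache:
        return events
    if len(cache) >= capacity:
        # Placeholder policy: evict the oldest inserted block.
        victim = next(iter(cache))
        del cache[victim]
        # The replacement is tracked by its own transaction; we do not wait for it.
        events.append(("Replacement", victim))
    cache[addr] = None   # block allocated; data is written at finalization
    events.append(("Allocated", addr))
    return events


cache = {0x00: "line A", 0x40: "line B"}
events = check_cache_fill(cache, 0x80, needs_block=True, capacity=2)
repeat = check_cache_fill(cache, 0x80, needs_block=True, capacity=2)
```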

Supported CHI transactions

All transactions are implemented as described in the AMBA 5 CHI Issue D specification. The next sections provide a more detailed explanation of the implementation-specific choices not fixed by the public document.

Supported requests

The following incoming requests are supported:

When receiving any request, the clusivity configuration parameters are evaluated during the transaction initialization and the doCacheFill and dataToBeInvalid flags are set in the transaction buffer entry allocated for the request. doCacheFill indicates we should keep any valid copy of the line in the local cache; dataToBeInvalid indicates we must invalidate the local copy when completing the transaction.

When receiving ReadShared or ReadUnique, if the data is present in the local cache in the required state (e.g., UC or UD for ReadUnique), a CompData response is sent to the requester. The response type depends on the value of dataToBeInvalid.

When receiving a ReadOnce, CompData_I is always sent if the data is present at the local cache. For WriteUniquePtl handling see below.

If there is a cache miss, multiple actions may be performed depending on whether doCacheFill is set and dataToBeInvalid is false, and on whether DCT or DMT is enabled:

Supported snoops

The cache controller issues and accepts the following snoops:

The snoop response is generated according to the current state of the line as defined in the specification. Whether data is returned with the snoop response depends on the data state and the value of retToSrc set by the snooper. If retToSrc is set, the snoop response always includes data.

If the snoopee has sharers in any state, the same request is sent upstream to all sharers. For SnpSharedFwd/SnpNotSharedDirtyFwd and SnpUniqueFwd, a SnpShared/SnpNotSharedDirty or SnpUnique is sent, respectively. For a received SnpOnce, a SnpOnce is sent upstream only if the line is not present locally. In this particular implementation, there is always a directory entry for upstream caches that have the line. Snoops are never sent to caches that do not have the line.
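The upstream snoop selection can be written down as a short Python sketch. The function name, the sharer labels, and the sorted ordering are assumptions made for this example; the mapping uses the non-forwarding snoop names defined in the CHI specification.

```python
# Forwarding snoops are replaced by their non-forwarding counterpart upstream.
UPSTREAM_SNOOP = {
    "SnpSharedFwd": "SnpShared",
    "SnpNotSharedDirtyFwd": "SnpNotSharedDirty",
    "SnpUniqueFwd": "SnpUnique",
}


def upstream_snoops(snoop, sharers, present_locally):
    """Return (sharer, snoop type) pairs to send upstream for a received snoop."""
    if not sharers:
        # The directory tracks all upstream copies, so no sharers means no snoops.
        return []
    if snoop == "SnpOnce" and present_locally:
        # The local copy is enough to answer a SnpOnce.
        return []
    forwarded = UPSTREAM_SNOOP.get(snoop, snoop)
    return [(sharer, forwarded) for sharer in sorted(sharers)]


fwd = upstream_snoops("SnpSharedFwd", {"l1_0", "l1_1"}, present_locally=True)
once_hit = upstream_snoops("SnpOnce", {"l1_0"}, present_locally=True)
once_miss = upstream_snoops("SnpOnce", {"l1_0"}, present_locally=False)
```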

Writeback and evictions

A writeback is triggered internally by the controller when a cache line needs to be evicted due to capacity reasons (cache maintenance operations are currently not supported). See Section Cache block allocation and replacement modeling for more information on replacements. These internal events are generated depending on the configuration parameters of the controller:

First we deallocate the local cache block (so the request that caused the eviction can allocate a new block and finish). For GlobalEviction, a SnpCleanInvalid is sent to all upstream caches. Once all snoop responses are received (possibly with dirty data), a LocalEviction is performed. The LocalEviction is done by issuing the appropriate request as follows:

For an HNF configuration the behavior changes slightly: WriteNoSnp to the SNF is used instead of WriteBackFull, and no requests are issued if the line is clean.

The WriteBack* and Evict requests are handled at the downstream cache as follows:

Hazards

A request for a line that currently has an outstanding transaction is always stalled until the transaction completes. Snoops received while there is an outstanding request are handled following the requirements in the specification:

Multiple snoops may be received while there is an outstanding transaction. In this particular implementation, a SnpShared or SnpSharedFwd may be followed by a SnpUnique or SnpCleanInvalid. However, it’s not possible to have concurrent snoops coming from the downstream cache.

Both incoming requests and snoops require the allocation of a TBE. To prevent deadlocks when transaction buffers are full, a separate buffer is used to allocate snoop TBEs. Snoops do not allow retry, so if the snoop TBE table is full, messages in the snpIn port are stalled, potentially causing severe congestion in the snoop channel in the interconnect.

Other implementation notes

Protocol table
