GALS System Design:
Side Channel Attack Secure Cryptographic Accelerators

Chapter 5:
Designing GALS Systems

Frank Kagan Gürkaynak

<kgf@ieee.org>

Disclaimer:
This is the www enabled version of my thesis. This has been converted from the sources of the original file by using TTH, some perl and some hand editing. There is also a PDF. This is essentially as it is, but includes formatting for A4, and some of the color pictures from the presentation.

1 Introduction
2 GALS System Design
3 Cryptographic Accelerators
4 Secure AES Implementation Using GALS
5 Designing GALS Systems
    5.1 Design Automation Issues
    5.2 Designing Asynchronous Finite State
Machines
        5.2.1 Port Controllers in Acacia
        5.2.2 Data Exchange between David and Goliath
        5.2.3 Data Exchange between Goliath and Synchronous Interface
    5.3 Testing Acacia
    5.4 Adapting Modules for GALS
    5.5 Related Research Directions
        5.5.1 Network-on-Chip Systems
        5.5.2 Dynamic Voltage and Frequency Scaling
        5.5.3 Latency-Insensitive Design
6 Conclusion
A 'Guessing' Effort for Keys
B List of Abbreviations
B Bibliography
B Footnotes

Chapter 5
Designing GALS Systems

The main advantage of the GALS design methodology over other self-timed design methods is that only a well-defined, small fraction of a GALS system contains self-timed circuits. This has two important consequences:

The majority of a GALS system can be designed using synchronous design methods.
The problems commonly related to self-timed design are limited in complexity and can be practically solved with non-optimal methods.

As a result, the design of a GALS system does not differ significantly from a standard synchronous design. There are obviously several GALS specific issues that have to be addressed while designing GALS systems. This chapter discusses these issues, and basically describes the differences between standard synchronous design flow and GALS design flow.

5.1 Design Automation Issues

Over the years many EDA companies have developed powerful tools to support a well established synchronous design flow. Self-timed design methodologies on the other hand, are not used as widely, especially for industrial designs. The EDA industry has therefore not invested in tools that support self-timed design flows⁴⁰. As a result, small research groups have been left to develop tools for self-timed design. These efforts have been further hampered by the fact that there is no unified design methodology. Every research group has developed its own approach to design self-timed circuits.

As with every newly proposed design methodology, the industrial acceptance of the GALS design methodology depends mainly on how well suited it is to design automation. Fortunately, up to 99% of a GALS system consists of standard synchronous design, and for the most part a standard design flow can be used. A design automation solution has to solve the following issues:

Develop a library with self-timed port controllers
The port controllers are the only real self-timed elements in the GALS system. They are realized as asynchronous finite state machines (AFSMs) and, depending on the approach used, can be synthesized using a variety of tools like Petrify [CKK⁺97], 3D [YD99a] or Minimalist [FNT⁺99]. The end result is a gate level netlist that typically consists of 5 to 20 standard cells of the target process. If the port controllers can be standardized ,it would also be possible to use transistor level optimized port controllers that are realized as standard cells.
These port controllers are (usually) not design specific. They can be designed, optimized and verified separately. In addition, generally only a very limited set of port controllers is used. As an example, Acacia uses only three different port controllers. A detailed description of how the ports were developed for Acacia can be found in section 5.2.
Develop the local clock generator
The local clock generator is a critical element of the GALS system. The clock generator relies on a special MutEx element that needs to be provided as part of the standard cell library (see section 2.2.2 for details). The clock period is determined by using a delay line. Depending on the resolution required, a number of different methods can be used to realize the delay line[OVG⁺02]. Similar to the self-timed port controllers, the local clock generator is not design specific and can be designed and optimized separately.
Providing a partitioning method
Functionality, separate clock domains, and the gate count can all be used as parameters that determine the partitioning. Ideally, a GALS partition needs to be able to work on its own for longer periods of time and should consist of a single clock domain. In addition, the LS island should be of reasonable size. It should be sufficiently large to justify the overhead, but should not be overly large ⁴¹.
This is probably the only part of the GALS design methodology that is not yet supported by design automation tools. It requires manual skill and experience to determine a partition that will result in an efficient GALS system.
Assembling individual GALS modules
A GALS module is created by encapsulating the LS island by a self timed wrapper that contains the local clock generator and the asynchronous port controllers. By itself, this is not a very difficult task and can be easily automated. Some elements of the self-timed wrapper require special attention from certain design tools. For example, the structure of the port controllers may not be optimized by the synthesis tool. Similarly, the standard cells making up the delay line elements within the local clock generator should not be placed randomly. These are, comparatively, minor issues and can easily be added to standard design scripts.
Verifying the timing constraints of the asynchronous connections
The asynchronous port controllers need to meet certain timing constraints to guarantee proper operation. For most circuits, it is only possible to determine the final timing after the placement and routing of the system has been completed. The timing of all ports controllers must be verified at this stage. Standard timing analyzers usually require a synchronous clock to perform the timing analysis, and can not automatically be used to verify self-timed port controllers.
However, almost all industrial timing analyzers can be programmed to measure worst case and best case propagation delays through any given path. Since the asynchronous port controllers contain few standard cells, all relevant path delays can be calculated with reasonable effort. These results can then be processed to determine whether or not any timing assumptions have been violated [GOV⁺03b].
Interestingly, most of the timing violations occur through fast connections. Resolving such conflicts, even at later stages of a design is very easy. Additional buffers are placed in the signal path. It is also possible to design the port controllers more conservatively from the beginning. In this way, the timing verification effort can be reduced significantly as well.
Assembling the GALS system
The GALS system basically consists of interconnected GALS modules. In a synchronous design methodology, the back-end design is a difficult task that requires considerable effort which increases with increasing circuit size.
In contrast, the top-level design for a GALS system is very straightforward. The modules are simply placed and the interconnections are made. The GALS design methodology is especially suited to a hierarchical design methodology, where parts of the design are placed and routed independently.

The design flow used for Shir-Khan, a fairly large GALS system consisting of 25 different GALS modules, can be seen in figure 5.1. There are five different design levels at Shir-Khan:

Figure 5.1: Design flow used for the Shir-Khan system.

The Self-Timed Library
Shir-Khan was designed to investigate different multi-point interconnect architectures for GALS [Vil05]. This required a large collection of experimental port controllers. Contrary to a more traditional GALS design, the self-timed library used in Shir-Khan contains 57 different port controllers. All of the ports were synthesized by the 3D tool. The generated boolean equations were mapped to standard cells of the target technology by the help of a custom tool called eqn2gate ⁴².
The Local Clock Generator
An important requirement of the local clock generator was a high period resolution. This required the design of an additional standard cell to be able to control the delay line with sub-gate delay accuracy. To achieve optimum performance, the local clock generator was designed as a hierarchical module. The design was synthesized, placed and routed separately and instantiated as a macro cell by the GALS modules.
The SIMD Micro-Controller
The LS island of all GALS modules consisted of a specialized 4-bit micro-controller named port processor. The port processor was designed specifically to activate any combination of 4 input and output ports simultaneously. It was used to generate different traffic patterns on the multi-point interconnects. The port processor was designed using a standard synchronous design flow.
GALS Modules
Shir-Khan has 25 GALS modules and a total of 181 port controller instantiations. A special design automation script called moduleassembler was written to automate the design process. moduleassembler used a textual description of the GALS system, and automatically generated the VHDL code for each GALS module. Furthermore, it generated tool-specific scripts to complete the design flow. All GALS modules in Shir-Khan were assembled automatically by scripts and source code generated by moduleassembler. Similar to the local clock generator each GALS module was designed as a hierarchical module.
GALS System
The remaining tasks of the design was to instantiate, place and route the GALS modules at the top-level⁴³. The source code of the top-level instance was made manually. The GALS modules were placed on the chip and a power routing was devised. The signal interconnections were made using standard routing tools.

The scripts developed for Shir-Khan could have been modified to support the design flow for Acacia as well. However, since Acacia is much smaller in comparison and uses only two unique GALS modules with only three port instantiations, it was more practical to conduct the design flow manually.

The main difference between the two designs is in the back-end design flow. In Shir-Khan, Silicon Ensemble from Cadence Design Systems was used. However, this older tool does not directly support a hierarchical design flow. Rather complex design scripts which were automatically generated by moduleassembler had to be used to emulate a hierarchical design flow. Acacia used SOC-Encounter from Cadence Design Systems that inherently supports a hierarchical design flow. This design flow is well suited for GALS and made it considerably easier to design the chip.

5.2 Designing Asynchronous Finite State
Machines

The basic design principle between synchronous and asynchronous finite state machines (AFSMs) are very similar. Both types of state machines preserve their state until certain conditions are met. The input signals and the present state of the machine is used to determine the next state. In a synchronous finite state machine, the next state change occurs after the next active clock event, whereas an AFSM moves to the next state as soon as the conditions to change the state are met. This results in extremely fast state transitions. Since decisions are sudden, all input signals that are used for these decisions have to be stable. AFSMs are very sensitive to glitches at their inputs⁴⁴. The synthesis of AFSMs is therefore more involved than their synchronous counterparts.

There are different classes of self-timed circuits [SF01]. Each class has its own set of assumptions that results in slightly different realizations. The most robust and general class of self-timed circuits are delay-insensitive (DI) circuits. These circuits will function correctly regardless of the gate and wire delays in the circuit. Unfortunately, only few practical DI circuits can be realized. A more practical class of self-timed circuits is obtained from DI circuits by assuming that two wires that split from a common wire have the same unbounded delay. The class of self-timed circuits that function correctly under this isochronic-fork assumption are known as Quasi-delay-insensitive (QDI) circuits. Even less assumptions are made for Speed-independent (SI) circuits that only assume bounded gate delays but no wire delays. At first sight this assumption seems unrealistic for modern integrated circuits. However, circuits whose wire delays are lumped into gate delays can still be considered speed-independent.

Before an AFSM can be synthesized, it needs to be described in a way that is convenient for the developer and understandable for the tools. A popular method for expressing AFSMs is using signal transition graphs (STG), that is essentially a simplified form of the Petri-Nets. An example STG can be seen in figure 5.4. In this graph, the boxes represent a signal transition. The signal name followed by a '+' represents a raising transition of the signal, and similarly a '-' represents a falling transition. Solid and dotted lines are used to represent outputs and inputs respectively. The large dot is called a "token" and represents the current state of the system. The STG can also be represented in machine readable form. The following is a textual description of the STG seen in in figure 5.4.

: .model d2g
.inputs Pen Ack Ai
.outputs Req Ri Ta
.graph
Pen+ Req+
Req+ Ack+
Ack+ Ri+
Ri+ Ai+
Ai+ Req- Ta+
Ta+ Pen-
Req- Ack-
Ack- Ri-
Ri- Ai-
Ai- Pen-
Pen- Ta-
Ta- Pen+
.marking{<Ta-,Pen+>}
.end

An AFSM synthesis tool is able to convert this description into a set of boolean equations, or even a gate-level netlist. Depending on the self-timed circuit style, the AFSMs are synthesized using certain timing assumptions. In the final circuit implementation, these assumptions must still hold true. For modern IC technologies the interconnection delays can vary significantly depending on the placement and routing of the final circuit. It is important to verify the correctness of the AFSMs after final placement and routing.

5.2.1 Port Controllers in Acacia

Although a library of port controllers developed for earlier GALS projects was available at the start of the project, a new set of port controllers were specifically developed for Acacia. The main reasons for the additional effort are the following:

Level sensitive port enable
In an effort to enable data transfers at every clock cycle, the port controllers developed by Muttersbach used an edge sensitive port enable signal. This practically doubles the number of states in the AFSM and results in slightly larger realizations. The GALS modules in Acacia were explicitly designed not to transfer data during consecutive clock cycles. Therefore, this overhead was deemed to be unnecessary.
Different transfer acknowledge
The transfer acknowledge signal (Ta) as implemented by Muttersbach uses a flip-flop that could be set during data transfer and would be cleared by the first active clock edge. The most important problem with this approach is that the Ta signal must be sampled immediately since the first clock cycle will clear the information. In addition, the clock signal used in the port controller must be balanced with respect to that used in the LS island. Acacia uses a more "relaxed" Ta signal that remains active until a new data transfer starts.
One-sided port
The interface between Goliath and the synchronous environment is special since only the local clock of Goliath can be paused for synchronization. A special port that is able to transfer data safely between the synchronous environment and a GALS module was designed for this purpose. The port is called one-sided since it can only influence one side of the data transfer.

Mainly due to the level sensitive input to enable the port controllers, earlier GALS ports designed by Muttersbach used the "extended burst mode" circuit description [YD99a] and were synthesized using the 3D tool. The port controllers in Acacia are speed-independent AFSMs that are synthesized from signal transition graphs using a tool called Petrify [CKK⁺97].

The idea to use three independently clocked GALS modules is a key part of the DPA countermeasures implemented in Acacia. The pausable clock generator is used to ensure data integrity during data transfers between GALS modules. If the clock is paused for longer durations, an attacker could potentially determine the time when two modules exchange data, and could use this information to refine the attack. To deny the attacker any such opportunity, the port controllers in Acacia were designed to reduce the synchronization time as much as possible. These ports work similar to P-type ports developed by Muttersbach. The port controller pauses the local clock only momentarily when its communication partner has signaled that it is ready for data transfer.

5.2.2 Data Exchange between David and Goliath

Figure 5.2: The interface between David and Goliath.

The block diagram in figure 5.2 shows the data communication channel between David and Goliath. There are two separate port controllers, d2g on the David side and g2d on the Goliath side. Unlike earlier GALS implementations, the port controllers in Acacia are bi-directional. Once both GALS modules are synchronized, both modules exchange data.

Figure 5.3: Timing diagram for the d2g port. Notice that this timing is substantially different from the GALS port timing shown in figure .

The STG and the gate-level circuit diagram of d2g can be seen in figure 5.4. The corresponding timing diagram is given in figure 5.3 The port, once activated by Pen (A), immediately activates the Req signal (B) and waits for Ack+ (C). The local clock is only paused after the Ack signal is received. After the clock is paused (D), the data can be safely sampled. At this point, the data transfer is effectively concluded and the Ta signal is activated (E). Afterwards, the handshake signals are returned to their idle states and the clock pause request signal Ri is deactivated (F). The Ta signal remains active (G) as long as the Pen signal remains active (H). This is an important change from older port controllers designed by Muttersbach, where the Ta signal was only available at the first active clock edge.

Figure 5.4: The state transition graph and the resulting gate-level schematic of the d2g port controller. This port is used for the communication between David and Goliath on the David side.

The g2d port controller seen in figure 5.5 is very similar to the d2g. The Ri+ transition, that will instruct the local clock generator to pause the clock, can only fire after both Pen+ (coming from d2g) and Req+ (coming from Goliath) are received. At this point, David is ready to transfer data, the local clock is paused immediately, and the Ack signal is generated. Once the Req- signal is received, data transfer has been completed and the local clock is released again. Similar to d2g, the Ta signal is set immediately after pausing the clock and remains active until Pen- is received.

Figure 5.5: The state transition graph and the resulting gate-level schematic of the g2d port controller. This port is used for the communication between David and Goliath on the Goliath side.

5.2.3 Data Exchange between Goliath and Synchronous Interface

The block diagram in figure 5.6 shows the data channel between Goliath and the synchronous interface. This is a specialized interface as the synchronization effort is only on the Goliath side of the channel. Note that, in this organization, safe data transfers are only possible under specific timing assumptions. In Acacia the external clock used in the interface is chosen to be slower than the local clock generator of Goliath. The port controller g2s ensures that Goliath can synchronize within one clock period of the external clock signal.

Figure 5.6: The interface between Goliath and the synchronous interface. The one-sided port g2s is responsible for safe data transfers between two domains.

The STG and the gate-level circuit schematic for the port controller g2s is given in figure 5.7 ⁴⁵. The Muller-C elements shown in figure 5.6 are used to change the handshake signals synchronous to the external clock.

Figure 5.7: The state transition graph and the resulting gate-level schematic of the g2s port controller. This port is used for the communication between Goliath and the synchronous interface.

The synchronous interface initiates the data transfer by activating the Enable signal, which is propagated to the Req signal that triggers the g2s port. As soon as the port controller is enabled by Goliath, Ri is activated to pause the local clock. At this moment, it is safe to transfer data between two modules and the Ack signal is activated. The Muller-C element ensures that the Ack signal generated by g2s is only propagated during the first half of the clock period. In this way, the Done signal can safely be sampled by the synchronous interface. The synchronous interface then deactivates the Enable signal. The Muller-C element allows Req to be deactivated within the first half of the clock period only. The local clock of Goliath remains paused until the arrival of the Req signal. Contrary to the data transfers between David and Goliath, there is no need to disguise the data transfers between Goliath and the synchronous interface for DPA security.

No latches are required for the data transfer between the synchronous interface and Goliath. The output registers of the synchronous register are stable before the data transfer is initiated by the Enable signal. The local clock of Goliath is paused as soon as it starts the data transfer. The clock is paused until the Enable signal is deactivated. This only happens after an active clock edge on the synchronous interface. By this time the same clock edge will safely sample the data inputs.

5.3 Testing Acacia

As soon as a design using self-timed circuit techniques is mentioned, the first question that pops up is:

"How are you going to test this circuit ?"

Self-timed circuits have always been regarded as being difficult to test. A good overview of the self-timed testing problem is given in a survey conducted by Hulgaard et al [HBB95]. There are two main properties of self-timed circuits that make traditional testing approaches infeasible:

It is not possible to hold the state of a self-timed circuit using a global signal.
In synchronous systems, once the clock is halted, the state of the system is frozen. It can be observed and manipulated with ease. There are well established methods (i.e. scan based testing) and proven tools to support this approach. It is possible to support similar functionality for self-timed circuits as well. But the required test functionality must be part of the circuit definition from the beginning⁴⁶.
Self-timed circuits are (in principle) sensitive to all transitions of their inputs.
This increases the amount of failure sources. In synchronous systems, as long as all nodes have the correct value at the time of a clock transition, the system will function correctly. Parasitic transitions of intermediate nodes have no negative effect on functionality⁴⁷ in a synchronous system. In self-timed circuits, such glitches can have terminal consequences. Not only that, but signal transitions that are faster or slower than normal may result in the circuit malfunctioning.

The problem is also exaggerated by the fact that there are many different flavors of self-timed design methodologies, each with its own special requirements.

Testing is an essential part of the IC manufacturing process. Since there are very few self-timed circuits that have been manufactured in the industry [Ron99], the quality of test solutions for self-timed circuits, when compared to their synchronous counterparts, is clearly lacking .

The self-timed test problem is tackled in two main directions:

Using a functional approach
Certain faults in self-timed circuits lead to clearly observable behavior, usually such circuits stop functioning altogether. Several approaches have been proposed that rely mainly on such 'self-checking' behavior [MH91,Wie95, GVO⁺02].
Modifying the self-timed circuit to support scan-based testing
Since scan based testing is well-known and effective, most self-timed test methodologies try to add scan capability to state holding elements [PF95, KB95, BPtB02]. Unfortunately, full-scan based self-timed test methods incur a very large area penalty, at times doubling the circuit area. In order to keep the overhead at acceptable values, partial-scan methods have been suggested as well.

The GALS methodology developed by Muttersbach used a synchronous fall back method, in which, for test purposes, all AFSMs were bypassed and synchronous state machines were used instead. The resulting system was a fully synchronous system that can be tested normally. This method was more of an emergency solution and several alternatives were considered that actually tested the AFSMs as well. The AFSMs used in the GALS system are very limited in complexity. Therefore, instead of devising a general method that is capable of testing any given AFSM, practical methods that ensure adequate test coverage for the AFSMs used in the GALS methodology were investigated

At first, a method that added scanable elements to the asynchronous connections was considered. This method, similar to the one presented in [BPtB02] introduces a very large area overhead. Such an arrangement also interrupts the asynchronous handshake signals between GALS modules, slowing the communication and reducing the throughput. Since all AFSMs are tested in isolation, this method fails to detect delay faults that occur because signal transitions between two AFSMs are either too slow or too fast. On top of that, it was shown that the stuck-at test coverage of this approach is not above 90%.

In a GALS system, only a very small portion of all stuck-at faults are in the AFSMs. As an example, in Acacia, the total number of stuck-at faults is 154,604. Only 182 (0.118%) of these faults are within the AFSMs. This was the main motivation to develop a functional approach to test the AFSMs within the GALS system. This approach [GVO⁺02] adds a Test Extension Element (TEE) to the each GALS module. The TEE is clocked by a synchronous test clock. During test mode, the TEE is able to decouple the Pen signal from the LS island and initiate data transfers on all self-timed connections. In this way, all data connections within the GALS system can be tested individually. This idea is very similar in principle to the IEEE P1500 standard proposed for testing embedded cores [MZK⁺99]: Inter-module communication is tested by initiating data transfers between modules.

Figure 5.8: Testing GALS systems with the help of a test extension element. The self-timed wrapper includes additional scan registers to insert and observe data transfer between GALS modules. The centralized test controller can be realized on-chip, or externally as a test program on automated test hardware

Individual TEEs are controlled by a centralized test controller as seen in figure 5.8. The centralized test controller could be implemented in hardware, giving the system a built-in self-test capability, or implemented as a test program on automated test equipment. The TEE also enables the test controller to access the LS island through a JTAG interface. In this methodology, all LS islands are assumed to have their own test solution.

Two GALS designs have been implemented after the test extension methodology was developed. Shir-Khan [GOV03a] was designed primarily as a test platform for various multi-point interconnection schemes for GALS [Vil05]. Shir-Khan consists of twenty five identical 4-bit micro-controllers with large buffers at their inputs and outputs. These so-called port processors were specifically designed to test the self-timed connections between the GALS modules. A special local clock generator developed by Stephan Oetiker [OVG⁺02] enabled switching between the locally generated clock and an external synchronous clock for configuration. In a way, the functionality of the TEE was integrated into the LS island and the local clock generator.

The local clock generator for Acacia uses a similar method to switch between the locally generated clock and an external synchronous test clock. During test mode, all GALS modules are run with the same synchronous test clock. This allows standard tools to be applied to generate the ATPG patterns. All LS islands are fully tested using this method. However, the stuck-at faults within the AFSMs, the local clock generators, and the data interface including the latches used for retaining data, can not be detected with these patterns. Figure 5.9 shows the configuration that has been used in Acacia. The shaded areas represent the portions of the system that contain stuck-at faults that can not directly be detected using the scan chains.

Figure 5.9: Simplified schematic of the scan chain connection used in Acacia. Once the ScanEn signal is active, the GALS system is in test mode. All scan chains of individual LS islands are connected and standard ATPG test vectors can be used to test the circuit. The shaded areas represent the portions of the circuit that is not covered by these tests. Functional test vectors are used to detect the stuck-at faults in these areas.

The final gate-level netlist of Acacia was analyzed using Synopsys Tetramax. All faults regarding the local clock generators⁴⁸ and reset signals⁴⁹ were removed from the fault library. The tool reported a test coverage of 96.2%. A further analysis yielded the following:

2,900 faults were classified as 'possibly detected'. Tetramax classifies a fault as possibly detected if the fault simulation does not result in a logic-1 or logic-0, but an unknown. By default, 50% credit is given for such faults. Although this increases the test coverage by few percentage points, these faults need to be investigated further
2,529 faults were listed as 'ATPG Undetectable'. Faults classified as ATPG undetectable are tied to a constant value during the fault simulation. Most of the faults in this class are a result of the test mode signals that enable the circuit to be run synchronously.
351 faults were listed as 'Undetectable'. There are a number of reasons why a fault can be placed in this class. Most common reasons are not connected outputs or inputs that are tied to a constant value. Most of these are plain design errors and could be eliminated during the design phase. Tetramax automatically removes these faults from the statistics.
1,897 faults were not detected.

After this analysis, all questionable faults were collected. Several of the faults were equivalent. In addition, for some nodes both the stuck-at-1 and stuck-at-0 faults could not be detected. A total of 3,089 unique nodes were determined by parsing the fault reports. A gate-level simulation was performed and all of these 3,089 nodes were observed for the duration of a simple encryption and decryption. Each node that was observed to have changed its value more than 4 times during this simulation was considered to be detectable for stuck-at-1 and stuck-at-0⁵⁰. This method reported 2,796 (90.5%) of the specially analyzed nodes detectable for stuck-at faults. Once these nodes were mapped back to the individual faults, only 175 faults remained undetected in the entire design. This results in a respectable test coverage of 99.89%.

5.4 Adapting Modules for GALS

In principle, any standard synchronous design can be converted into a GALS system. However, the advantages offered by the GALS methodology can only be exploited if the system is designed with GALS in mind from the beginning [GOK⁺05]. In a GALS system, the synchronous design will be partitioned into several LS islands that will eventually become individually clocked GALS modules. Each of these LS islands have to be adapted to, or better yet designed according to, the requirements of the GALS design. The better the synchronous system is adapted to a GALS methodology, the better the performance will be. However, this may result in a design that is noticeably different from a system that is optimized for synchronous clocking.

As an example, consider the AES algorithm presented in this thesis. An efficient synchronous design would most probably use a block diagram like the one presented in figure 3.10. However, a direct GALS implementation of this block diagram would not be able to offer the same kind of DPA countermeasures as Acacia does. Similarly, a designer would not come up with the transformation presented in figure 4.1 without thinking about a GALS implementation.

The following is a brief discussion of several key issues that must be considered when designing GALS-friendly LS islands:

Additional registers at inputs and outputs
The interface between the (locally) synchronous domain and the self-timed domain is always prone to timing violations. The self-timed port controllers ensure safe data transfers under the assumption that output data is available when the port is activated and input data can be sampled by the first active clock edge that follows the data transmission. This can be achieved by using registers at both inputs and outputs. Note that these registers are not strictly necessary for correct operation, but they eliminate the costly timing verification effort at the interface.
The obvious side effect of this approach is additional latency in the LS island. However, this additional latency is in local clock cycles and may not translate to an overall latency in the system. By isolating LS islands from the rest of the system, it may be possible to run individual LS islands at a higher rate than what it would be possible in a synchronous system. As an example in Acacia, David can be clocked much faster than Goliath as can be seen in table 4.3. As a net result, although the GALS implementation requires 30% more clock cycles, the time required to complete an AES round does not change significantly.
Limited interaction between modules
Basically each data transfer between GALS modules has a chance to slow the participating modules down. Therefore, a successful GALS partitioning aims to limit the rate of data transfers between GALS modules as much as possible. This requires fairly independent modules that can operate on a set of data over several clock cycles before requiring data exchange.
In synchronous systems, the actual state of a module can be easily communicated across the system by several status bits. This can be used to improve performance in the synchronous system. Communicating such signals between GALS modules can be costly and should be avoided. Again taking Acacia as an example: David requires several cycles to process its input. Once David is finished, Goliath needs to make a decision as to what to do next. In the synchronous reference realization of Acacia the current state of Matthias can be made available to Schoggi at all times. This enables the decision process in Schoggi to be triggered as soon as Matthias starts processing the last step. By the time Matthias is finished, Schoggi has already reached a decision. In the GALS implementation, communicating the actual state of David to Goliath would require continuous data transfer between the modules and is therefore not very practical.
The LS islands of a GALS system must be designed to operate using one-time or burst-type parallel data/control signal transfers rather than a continuous interaction. This may require re-engineering of several control structures and result in performance loss in local clock cycles.
Proper handshake signals
There are various forms of handshake protocols that are available to designers. However, a significant part of designers that are used to synchronous systems do not apply them properly. Many designs use protocols that are not complete and make a lot of (hidden) assumptions on the system. Strobe signals are a good example for this. A strobe signal basically qualifies an accompanying data signal as valid during a clock cycle. This is equivalent to a request signal used in standard handshake protocols. Such a signal assumes that the system is ready to accept data at any clock cycle. Whether or not the data has been accepted by the receiver is never really checked.
The LS islands of a GALS system needs to support proper handshaking between the LS island and the port controllers. The exact handshake protocol depends on the design of the specific port controllers used in the system.
More complex state machines
A GALS system consists of multiple GALS modules, each of which uses its own local clock. When seen from a LS island that is part of a GALS system, a data transaction to a connected GALS module may take an undetermined amount of clock cycles. LS islands that have connections to more than one GALS module are at times unable to make any assumptions on the completion order of the data transactions.
Consider the controller in Goliath that schedules 32-bit operations on two separate David units. Assume that initially both David modules, David Perels and David Tschopp, are able to accept data. Goliath will now wait for the modules to finish processing, and schedule new operations on both units. At any given clock cycle, both, any, or none of the two David modules may return a result. Again, assume for a moment that David Perels has returned a result and was assigned a third 32-bit data. Even now, it is still possible that David Perels returns a result prior to David Tschopp⁵¹.
Although it is very difficult to quantify, a controller within a GALS module needs to consider more cases than a purely synchronous design, and as a result is more complex.
Reducing clock domains
One of the main challenges in modern SoC design is to handle a multitude of clock domains. This requires considerable effort in synchronization, and more often than not results in large FIFO structures between blocks.
Most of the clock domains are introduced to the system in order to support a specific standard communication protocol. Different clock domains are also used when parts of the SoC can not match certain clock rates that are used throughout the system.
GALS is a method for synchronizing different clock domains. Although it might sound strange, the goal of a GALS design must be to reduce the number of clock domains. Each LS island will eventually be a separate clock domain, and GALS provides a method to synchronize these domains. If the synchronous system includes several clock domains, the GALS system will have to be designed in a way to encapsulate each clock domain in a separate GALS module. This may severely restrict the partitioning and result in sub-optimum designs.
The data connection between two GALS modules can, in a way, be considered as a single-stage distributed FIFO. At least theoretically, additional FIFO buffers between individual modules should be unnecessary.
For an efficient SoC implementation, most of the tricks that were added to the system in hope of supporting a synchronous design flow need to be undone first. This may require at times radical changes to the system.
Real-time processing
Since GALS modules use a pausable local clock, it is very difficult to guarantee that a certain operation can be processed in a fixed number of clock cycles. At first sight, this makes it difficult to interface external signals that have fixed clock rate.
However, most external interfaces use a decidedly slower clock rate when compared to internal clock rates. Depending on the clock rates used, it may still be possible to design interface modules that can finish processing input and output in time for a fixed interface clock. These interfaces have additional timing constraints set on them, and especially with modules that have more than one connection, the conditions to guarantee safe operation is not well defined.
Acacia uses such a port controller to communicate with a fixed-clock interface. The particular interface described in section 4.3.3 uses a handshake protocol and does not expect results in a fixed number of clock cycles. As long as the throughput requirement can be met by the GALS modules, it would have been possible to meet fixed clock cycle demands as well.

5.5 Related Research Directions

This thesis primarily presented an application of GALS to secure cryptographic hardware design. There are other interesting fields where GALS could have a potential impact. The following is a short description of three active research fields where GALS could be used.

5.5.1 Network-on-Chip Systems

As SoCs grow in size, supporting a global communication that connects all sub-modules becomes an increasingly difficult task. A solution to this problem might be adapting network solutions commonly used for computer communications within the SoC. Such systems are generally referred to as Network-on-Chip (NoC) [JT03].

A classical NoC system, consists of several resources, which are regular users of the network. Each resource is connected to the network by a switch that is able to route data packets to and from the resource. The NoC system clearly separates function from communication. The functionality is provided by the resources, and the communication between resources is handled by the switches.

Most of the topologies presented in the literature make use of the two dimensional structure of an integrated circuit. Some architectures like Nostrum [MNT⁺04] use a homogeneous mesh-based system, while others like Xpipes [BB04] use a heterogeneous approach where the geometry and size of resources is determined by the functionality. The problem is very similar to the partitioning problem presented for GALS systems in section 4.1. It is easier to resolve the timing issues for homogeneous systems since all connections between switches will have approximately the same size, on the other hand it is difficult to imagine practical systems where all resources are of identical size.

NoC solves one important problem of large system designs. Instead of using overly long interconnections between sub-modules, data transmission over longer distances is routed over switches in the network. Still the problem of distributing a global clock to the entire system, and synchronizing between different clock domains remains to be a challenge.

There are many parallels between NoC system design and GALS design. First of all, both methods separate function from communication. As mentioned earlier, the partitioning problem is very similar for both approaches. Furthermore, both methods were developed to address problems of large SoC designs. Combining both methods could potentially be mutually beneficial. In fact, there are several recent publications that talk about GALS-based NoC architectures [BCV⁺05, RVFG05].

A successful GALS-based NoC should be able to manage implementing the resource as the LS island and realizing the switch as part of the self-timed wrapper of a single GALS module. The switch should not interfere with the LS island, as long as there is no data transfer between the network and the resource. Note that, while the resource is not receiving or sending data, the switch still needs to be able to route traffic to and from connected switches. Depending on the NoC system, this might require a more complex switch. In case the network switch can not be realized easily using a self-timed design method, it might be necessary to implement the switch as a second LS island with an independent local clock generator, which would increase the overhead of the system.

GALS would definitely address the problems of distributing the clock and synchronizing between different clock domains. However, one of the most requested features from a NoC system is a way to provide Quality-of-Service (QoS). This guarantees a specific bandwidth between selected resources under all circumstances. On the positive side, the operation speed of the switches in a GALS-based NoC can be made independent from the resource, adding more throughput and thus more flexibility to the network. However, defining an upper limit for the time that is required to transfer data between two GALS modules is difficult⁵².

Overall, GALS and NoC promise to be two very compatible technologies. Combining both could potentially help overcome several serious problems that both technologies are facing at the moment.

5.5.2 Dynamic Voltage and Frequency Scaling

The more tightly circuits can be integrated, the more energy has to be dissipated over the same area. Coupled with the demands of mobile applications, where the power supply is one of the limiting design parameters, this has strongly motivated the designers to find ways to reduce the power consumption of integrated circuits over the last years.

As outlined briefly in section 3.5.2, the power consumption of an IC has a dynamic and a static part. While decreasing technology size has increased the ratio of static power consumption significantly, most of the power consumed by the integrated circuits remains to be dynamic power consumption⁵³ given by the following well-known equation:

P_dyn » a·C·f·V_dd²

(5.1)

The total switched capacitance C is determined mainly by the circuit netlist, and the activity factor a is determined by the function and input data. Even if the circuit is not changed in any way (and C and a remain constant), the dynamic power consumption can be reduced by both the operating frequency f and the supply voltage V_dd.

The throughput of the circuit is directly determined by the operating frequency, and reducing f will reduce the throughput of the circuit. There are however some synchronous designs where the operating frequency is chosen as a compromise to satisfy different requirements of the circuit. In this case, some sub-blocks of the circuit may remain idle over several clock cycles. Similarly, the throughput requirement of the circuit may change throughout the operation of the circuit. Peak throughput might only be required for short time intervals.

Reducing the supply voltage reduces the dynamic power consumption even more decidedly. However, the power supply can not be reduced indefinitely, and the circuits will slow down as the supply voltage is reduced [CSB92].

Dynamically changing the frequency and the supply voltage for sub-blocks to reduce power consumption has been successfully implemented for high-performance micro-processors[NCM⁺02]. The so-called Dynamic Voltage and Frequency Scaling (DVFS) methods are very attractive for the micro-processors, as they are well-known for their excessive power consumption, and their performance requirements strongly depend on the program they are executing. Most DVFS applications are rather coarse grained and adjustments are made after thousands or even millions of clock cycles.

There is no reason why not also large SoC systems could benefit from DVFS. There are some important differences between two design styles. Unlike micro-processors, that have a relatively fixed architecture, SoC architectures are more varied. Consequently, algorithms that are developed to control DVFS for micro-processors are not always well suited for SoC designs.

GALS offers several interesting advantages that could be exploited to realize DVFS systems. The GALS design methodology already enables modules to be clocked at different clock rates. At least theoretically, it would be possible to extend this idea to support different voltages as well. One important issue in DVFS is monitoring or predicting the activity of the module to be controlled. In a GALS system that uses D-type port controllers which pause the clock until data transfer is completed, both the Req-Ack signal pairs or the duty cycle of the local clock can be used to determine how active a module is. If the duty cycle is

50%: the module is running without being interrupted by data transfers. This means that the environment is faster than (or at least as fast as) the module itself. Potentially, the environment could interact even with a faster module. Therefore, if it is possible, increasing the operating frequency and the voltage of the module could result in improved throughput.
(50-d)%: the module is running just as fast as the environment. This is the ideal operating condition. The d value is required, as it shows that the environment is just keeping up with the module, and the module has to be paused occasionally.
25%: the module is spending half the time waiting for the environment. The module is producing results too fast. The supply voltage of the module and the operating frequency of the local clock can be reduced to save energy.

An important problem that needs to be addressed is to avoid cyclic dependencies, where in the end all modules in a GALS system end up slowing each other down until there is no activity at all. Also, instead of relying solely on the environment to speed up operation, individual GALS modules could be "warned" in advance about changes in activity⁵⁴. These problems could be addressed by designing a centralized power controller to monitor and control the DVFS effort in the GALS system.

A second problem is communication between modules with different power supplies. Special level converters, or independent power supplies for input and output registers could be used for this purpose. Unfortunately, modern technologies have dramatically reduced power supplies to allow small transistors. This reduces the available margin to adjust the power supply of the module somewhat ⁵⁵.

There is fair amount of interest and research already taking place in adapting DVFS for GALS. However, published results are far away from practical realizations. There are early theoretical studies on the performance gains of GALS-based micro-processors using DVFS [IM02]. More recently published papers on GALS, like the GALDS approach by Chattopadhyay et al. [CZ05], mention DVFS as a possible application without providing concrete solutions.

5.5.3 Latency-Insensitive Design

Latency-insensitive circuits developed by Carloni et al. [CMSV01] formally addresses the problem of designing systems whose inputs may arrive with different latencies due to interconnection delays. Rather than trying to find methods where all inputs arrive at the same time, Carloni suggests adding relay stations on long interconnections. The question is:

"Is it possible to have a functionally equivalent circuit under these circumstances ?"

For the formal description of the system, a tagged signal model is used. In this model, all signals are represented by a set of value-tag pairs. The tag is essentially a timestamp that tells when the signal has the given value. Using this notation, Carloni was able to prove that, as long as the system consists of patient processes, it is indeed possible to have a functionally equivalent system. Such systems are called latency insensitive. A patient process is described as a stallable process which can be halted until all data inputs required to generate the next output are present.

Latency insensitive design was developed with synchronous systems in mind. Most of the publications in this field try to address practical issues on how relay stations can be added to the system. Functional blocks are encapsulated by a shell that converts the system into a patient system, mostly by adequate clock gating. This approach is remarkably similar to a GALS system that consists of functional LS islands encapsulated by a self-timed wrapper, which contains a pausable local clock. The analysis methods used in latency-insensitive design may be applied to GALS systems and can be used to address formal aspects of GALS system design.

File translated from T_EX by T_TH, version 3.77.
On 20 Dec 2006, 15:44.

GALS System Design: Side Channel Attack Secure Cryptographic Accelerators

Chapter 5: Designing GALS Systems

Frank Kagan Gürkaynak <kgf@ieee.org>

Contents

Chapter 5 Designing GALS Systems