GALS System Design:
Side Channel Attack Secure Cryptographic Accelerators

Chapter 5:
Designing GALS Systems

Frank Kagan Gürkaynak
 
<kgf@ieee.org>

 
Disclaimer:
This is the www enabled version of my thesis. This has been converted from the sources of the original file by using TTH, some perl and some hand editing. There is also a PDF. This is essentially as it is, but includes formatting for A4, and some of the color pictures from the presentation.

Contents

1  Introduction
2  GALS System Design
3  Cryptographic Accelerators
4  Secure AES Implementation Using GALS
5  Designing GALS Systems
    5.1  Design Automation Issues
    5.2  Designing Asynchronous Finite State Machines
        5.2.1  Port Controllers in Acacia
        5.2.2  Data Exchange between David and Goliath
        5.2.3  Data Exchange between Goliath and Synchronous Interface
    5.3  Testing Acacia
    5.4  Adapting Modules for GALS
    5.5  Related Research Directions
        5.5.1  Network-on-Chip Systems
        5.5.2  Dynamic Voltage and Frequency Scaling
        5.5.3  Latency-Insensitive Design
6  Conclusion
A  'Guessing' Effort for Keys
B  List of Abbreviations
B  Bibliography
B  Footnotes

Chapter 5
Designing GALS Systems

The main advantage of the GALS design methodology over other self-timed design methods is that only a well-defined, small fraction of a GALS system contains self-timed circuits. This has two important consequences:
  1. The majority of a GALS system can be designed using synchronous design methods.
  2. The problems commonly related to self-timed design are limited in complexity and can be practically solved with non-optimal methods.
As a result, the design of a GALS system does not differ significantly from a standard synchronous design. There are, however, several GALS-specific issues that have to be addressed while designing GALS systems. This chapter discusses these issues and describes the differences between the standard synchronous design flow and the GALS design flow.

5.1   Design Automation Issues

Over the years, many EDA companies have developed powerful tools to support a well-established synchronous design flow. Self-timed design methodologies, on the other hand, are not used as widely, especially for industrial designs. The EDA industry has therefore not invested in tools that support self-timed design flows (footnote 40). As a result, small research groups have been left to develop tools for self-timed design. These efforts have been further hampered by the fact that there is no unified design methodology: every research group has developed its own approach to designing self-timed circuits.
As with every newly proposed design methodology, the industrial acceptance of the GALS design methodology depends mainly on how well suited it is to design automation. Fortunately, up to 99% of a GALS system consists of standard synchronous design, and for the most part a standard design flow can be used. A design automation solution has to solve the following issues:
The design flow used for Shir-Khan, a fairly large GALS system consisting of 25 different GALS modules, can be seen in figure 5.1. There are five different design levels at Shir-Khan:
Figure 5.1: Design flow used for the Shir-Khan system.
  1. The Self-Timed Library
    Shir-Khan was designed to investigate different multi-point interconnect architectures for GALS [Vil05]. This required a large collection of experimental port controllers. Unlike a more traditional GALS design, the self-timed library used in Shir-Khan contains 57 different port controllers. All of the ports were synthesized with the 3D tool. The generated boolean equations were mapped to standard cells of the target technology with the help of a custom tool called eqn2gate (footnote 42).
  2. The Local Clock Generator
    An important requirement of the local clock generator was a high period resolution. This required the design of an additional standard cell to be able to control the delay line with sub-gate delay accuracy. To achieve optimum performance, the local clock generator was designed as a hierarchical module. The design was synthesized, placed and routed separately and instantiated as a macro cell by the GALS modules.
  3. The SIMD Micro-Controller
    The LS island of all GALS modules consisted of a specialized 4-bit micro-controller named port processor. The port processor was designed specifically to activate any combination of 4 input and output ports simultaneously. It was used to generate different traffic patterns on the multi-point interconnects. The port processor was designed using a standard synchronous design flow.
  4. GALS Modules
    Shir-Khan has 25 GALS modules and a total of 181 port controller instantiations. A special design automation script called moduleassembler was written to automate the design process. moduleassembler used a textual description of the GALS system, and automatically generated the VHDL code for each GALS module. Furthermore, it generated tool-specific scripts to complete the design flow. All GALS modules in Shir-Khan were assembled automatically by scripts and source code generated by moduleassembler. Similar to the local clock generator each GALS module was designed as a hierarchical module.
  5. GALS System
    The remaining task of the design was to instantiate, place and route the GALS modules at the top level (footnote 43). The source code of the top-level instance was written manually. The GALS modules were placed on the chip and a power routing was devised. The signal interconnections were made using standard routing tools.
The scripts developed for Shir-Khan could have been modified to support the design flow for Acacia as well. However, since Acacia is much smaller in comparison and uses only two unique GALS modules with only three port instantiations, it was more practical to conduct the design flow manually.
The main difference between the two designs is in the back-end design flow. In Shir-Khan, Silicon Ensemble from Cadence Design Systems was used. However, this older tool does not directly support a hierarchical design flow. Rather complex design scripts, which were automatically generated by moduleassembler, had to be used to emulate a hierarchical design flow. Acacia used SOC-Encounter from Cadence Design Systems, which inherently supports a hierarchical design flow. This design flow is well suited for GALS and made it considerably easier to design the chip.

5.2   Designing Asynchronous Finite State Machines

The basic design principles of synchronous and asynchronous finite state machines (AFSMs) are very similar. Both types of state machines preserve their state until certain conditions are met. The input signals and the present state of the machine are used to determine the next state. In a synchronous finite state machine, the state changes only at the next active clock event, whereas an AFSM moves to the next state as soon as the conditions to change the state are met. This results in extremely fast state transitions. Since decisions are taken immediately, all input signals that are used for these decisions have to be stable. AFSMs are very sensitive to glitches at their inputs (footnote 44). The synthesis of AFSMs is therefore more involved than that of their synchronous counterparts.
There are different classes of self-timed circuits [SF01]. Each class has its own set of assumptions that results in slightly different realizations. The most robust and general class of self-timed circuits are delay-insensitive (DI) circuits. These circuits will function correctly regardless of the gate and wire delays in the circuit. Unfortunately, only a few practical DI circuits can be realized. A more practical class of self-timed circuits is obtained from DI circuits by assuming that two wires that split from a common wire have the same unbounded delay. The class of self-timed circuits that function correctly under this isochronic-fork assumption are known as quasi-delay-insensitive (QDI) circuits. Stronger assumptions are made for speed-independent (SI) circuits, which tolerate bounded but otherwise arbitrary gate delays and assume that wires have no delay. At first sight this assumption seems unrealistic for modern integrated circuits. However, circuits whose wire delays are lumped into gate delays can still be considered speed-independent.
Before an AFSM can be synthesized, it needs to be described in a way that is convenient for the developer and understandable for the tools. A popular method for expressing AFSMs is the signal transition graph (STG), which is essentially a simplified form of a Petri net. An example STG can be seen in figure 5.4. In this graph, each box represents a signal transition. The signal name followed by a '+' represents a rising transition of the signal, and similarly a '-' represents a falling transition. Solid and dotted lines are used to represent outputs and inputs respectively. The large dot is called a "token" and represents the current state of the system. The STG can also be represented in machine-readable form. The following is a textual description of the STG seen in figure 5.4.

.model d2g 
.inputs Pen Ack Ai 
.outputs Req Ri Ta 
.graph 
Pen+ Req+ 
Req+ Ack+ 
Ack+ Ri+ 
Ri+ Ai+ 
Ai+ Req- Ta+ 
Ta+ Pen-
Req- Ack- 
Ack- Ri- 
Ri- Ai- 
Ai- Pen- 
Pen- Ta- 
Ta- Pen+ 
.marking{<Ta-,Pen+>} 
.end 

An AFSM synthesis tool is able to convert this description into a set of boolean equations, or even a gate-level netlist. Depending on the self-timed circuit style, the AFSMs are synthesized using certain timing assumptions. In the final circuit implementation, these assumptions must still hold true. For modern IC technologies the interconnection delays can vary significantly depending on the placement and routing of the final circuit. It is important to verify the correctness of the AFSMs after final placement and routing.
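
The token game behind the textual STG description above can be sketched in a few lines of Python. This is an illustration only and not part of the Acacia tool flow: the helper names parse_stg, enabled, fire and run_cycle are hypothetical, every arc between two transitions is treated as an implicit place, and the net is assumed to be safe (at most one token per place). Input transitions, which the environment would produce in a real circuit, are fired by the same loop as output transitions.

```python
import re

# Textual STG of the d2g port controller, copied from the listing above.
D2G_STG = """\
.model d2g
.inputs Pen Ack Ai
.outputs Req Ri Ta
.graph
Pen+ Req+
Req+ Ack+
Ack+ Ri+
Ri+ Ai+
Ai+ Req- Ta+
Ta+ Pen-
Req- Ack-
Ack- Ri-
Ri- Ai-
Ai- Pen-
Pen- Ta-
Ta- Pen+
.marking{<Ta-,Pen+>}
.end
"""


def parse_stg(text):
    """Return (arcs, marking); each arc (src, dst) acts as an implicit place."""
    arcs, marking, in_graph = set(), set(), False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(".marking"):
            # Tokens are written as <src,dst> pairs naming an arc.
            marking = set(re.findall(r"<([^,>]+),([^,>]+)>", line))
            in_graph = False
        elif line.startswith("."):
            in_graph = (line == ".graph")
        elif in_graph:
            src, *dsts = line.split()
            arcs.update((src, dst) for dst in dsts)
    return arcs, marking


def enabled(arcs, marking):
    """Transitions whose incoming arcs all carry a token (sorted by name)."""
    transitions = {t for arc in arcs for t in arc}
    return sorted(t for t in transitions
                  if all(a in marking for a in arcs if a[1] == t))


def fire(arcs, marking, transition):
    """Consume the tokens on incoming arcs and produce tokens on outgoing arcs."""
    consumed = {a for a in arcs if a[1] == transition}
    if not consumed <= marking:
        raise ValueError(f"{transition} is not enabled")
    produced = {a for a in arcs if a[0] == transition}
    return (marking - consumed) | produced


def run_cycle(arcs, marking, max_steps=100):
    """Fire the first enabled transition repeatedly until the marking recurs."""
    trace, current = [], set(marking)
    for _ in range(max_steps):
        choices = enabled(arcs, current)
        if not choices:
            break  # deadlock
        current = fire(arcs, current, choices[0])
        trace.append(choices[0])
        if current == marking:
            break
    return trace, current


arcs, marking = parse_stg(D2G_STG)
trace, final = run_cycle(arcs, marking)
print(" ".join(trace))
```

Under these assumptions the loop returns to the initial marking after twelve firings, one per signal transition in the graph. After Ai+ fires, both Req- and Ta+ are enabled at the same time, which is the concurrency expressed by the line "Ai+ Req- Ta+"; the sketch simply picks the alphabetically first candidate.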

5.2.1  Port Controllers in Acacia

Although a library of port controllers developed for earlier GALS projects was available at the start of the project, a new set of port controllers was specifically developed for Acacia. The main reasons for the additional effort are the following:
Mainly due to the level sensitive input to enable the port controllers, earlier GALS ports designed by Muttersbach used the "extended burst mode" circuit description [YD99a] and were synthesized using the 3D tool. The port controllers in Acacia are speed-independent AFSMs that are synthesized from signal transition graphs using a tool called Petrify [CKK+97].
The idea to use three independently clocked GALS modules is a key part of the DPA countermeasures implemented in Acacia. The pausable clock generator is used to ensure data integrity during data transfers between GALS modules. If the clock is paused for longer durations, an attacker could potentially determine the time when two modules exchange data, and could use this information to refine the attack. To deny the attacker any such opportunity, the port controllers in Acacia were designed to reduce the synchronization time as much as possible. These ports work similarly to the P-type ports developed by Muttersbach. The port controller pauses the local clock only momentarily, once its communication partner has signaled that it is ready for data transfer.

5.2.2  Data Exchange between David and Goliath

Figure 5.2: The interface between David and Goliath.
The block diagram in figure 5.2 shows the data communication channel between David and Goliath. There are two separate port controllers, d2g on the David side and g2d on the Goliath side. Unlike earlier GALS implementations, the port controllers in Acacia are bi-directional. Once both GALS modules are synchronized, both modules exchange data.
Figure 5.3: Timing diagram for the d2g port. Notice that this timing is substantially different from the GALS port timing shown in figure .
The STG and the gate-level circuit diagram of d2g can be seen in figure 5.4. The corresponding timing diagram is given in figure 5.3. The port, once activated by Pen (A), immediately activates the Req signal (B) and waits for Ack+ (C). The local clock is only paused after the Ack signal is received. After the clock is paused (D), the data can be safely sampled. At this point, the data transfer is effectively concluded and the Ta signal is activated (E). Afterwards, the handshake signals are returned to their idle states and the clock pause request signal Ri is deactivated (F). The Ta signal remains active (G) as long as the Pen signal remains active (H). This is an important change from older port controllers designed by Muttersbach, where the Ta signal was only available at the first active clock edge.
Figure 5.4: The signal transition graph and the resulting gate-level schematic of the d2g port controller. This port is used for the communication between David and Goliath on the David side.
The g2d port controller seen in figure 5.5 is very similar to the d2g. The Ri+ transition, that will instruct the local clock generator to pause the clock, can only fire after both Pen+ (coming from d2g) and Req+ (coming from Goliath) are received. At this point, David is ready to transfer data, the local clock is paused immediately, and the Ack signal is generated. Once the Req- signal is received, data transfer has been completed and the local clock is released again. Similar to d2g, the Ta signal is set immediately after pausing the clock and remains active until Pen- is received.
Figure 5.5: The signal transition graph and the resulting gate-level schematic of the g2d port controller. This port is used for the communication between David and Goliath on the Goliath side.

5.2.3  Data Exchange between Goliath and Synchronous Interface

The block diagram in figure 5.6 shows the data channel between Goliath and the synchronous interface. This is a specialized interface as the synchronization effort is only on the Goliath side of the channel. Note that, in this organization, safe data transfers are only possible under specific timing assumptions. In Acacia the external clock used in the interface is chosen to be slower than the local clock generator of Goliath. The port controller g2s ensures that Goliath can synchronize within one clock period of the external clock signal.
Figure 5.6: The interface between Goliath and the synchronous interface. The one-sided port g2s is responsible for safe data transfers between two domains.
The STG and the gate-level circuit schematic for the port controller g2s are given in figure 5.7 (footnote 45). The Muller-C elements shown in figure 5.6 are used to change the handshake signals synchronously to the external clock.
Figure 5.7: The signal transition graph and the resulting gate-level schematic of the g2s port controller. This port is used for the communication between Goliath and the synchronous interface.
The synchronous interface initiates the data transfer by activating the Enable signal, which is propagated to the Req signal that triggers the g2s port. As soon as the port controller is enabled by Goliath, Ri is activated to pause the local clock. At this moment, it is safe to transfer data between two modules and the Ack signal is activated. The Muller-C element ensures that the Ack signal generated by g2s is only propagated during the first half of the clock period. In this way, the Done signal can safely be sampled by the synchronous interface. The synchronous interface then deactivates the Enable signal. The Muller-C element allows Req to be deactivated within the first half of the clock period only. The local clock of Goliath remains paused until the arrival of the Req signal. Contrary to the data transfers between David and Goliath, there is no need to disguise the data transfers between Goliath and the synchronous interface for DPA security.
No latches are required for the data transfer between the synchronous interface and Goliath. The output registers of the synchronous interface are stable before the data transfer is initiated by the Enable signal. The local clock of Goliath is paused as soon as it starts the data transfer. The clock is paused until the Enable signal is deactivated. This only happens after an active clock edge on the synchronous interface. By this time, the same clock edge will safely sample the data inputs.
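
The role of a Muller-C element in such an interface can be illustrated with a short behavioural model. This is a generic sketch of the standard two-input C-element and not the circuit used in Acacia: the class name MullerC is hypothetical, and pairing the Ack signal with a clock that is high during the first half of its period is an assumption made purely for illustration.

```python
class MullerC:
    """Behavioural model of a two-input Muller C-element.

    The output follows the inputs when they agree and holds its
    previous value when they differ.
    """

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


# Assumed wiring for illustration: one input is Ack, the other is a
# clock that is 1 during the first half of its period.
gate = MullerC()
stimulus = [(0, 0), (1, 0), (1, 1), (1, 0), (0, 0)]  # (ack, clk) pairs
done = [gate.update(ack, clk) for ack, clk in stimulus]
print(done)  # [0, 0, 1, 1, 0]
```

In this sequence a rising Ack that arrives while the clock is low is held back until the clock goes high, and the output then stays high until both inputs have returned to zero.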

5.3   Testing Acacia

As soon as a design using self-timed circuit techniques is mentioned, the first question that pops up is:
"How are you going to test this circuit?"
Self-timed circuits have always been regarded as being difficult to test. A good overview of the self-timed testing problem is given in a survey conducted by Hulgaard et al. [HBB95]. There are two main properties of self-timed circuits that make traditional testing approaches infeasible:
  1. It is not possible to hold the state of a self-timed circuit using a global signal.
    In synchronous systems, once the clock is halted, the state of the system is frozen. It can be observed and manipulated with ease. There are well-established methods (e.g. scan-based testing) and proven tools to support this approach. It is possible to support similar functionality for self-timed circuits as well, but the required test functionality must be part of the circuit definition from the beginning (footnote 46).
  2. Self-timed circuits are (in principle) sensitive to all transitions of their inputs.
    This increases the number of failure sources. In synchronous systems, as long as all nodes have the correct value at the time of a clock transition, the system will function correctly. Parasitic transitions of intermediate nodes have no negative effect on functionality (footnote 47) in a synchronous system. In self-timed circuits, such glitches can have fatal consequences. Moreover, signal transitions that are faster or slower than normal may cause the circuit to malfunction.
The problem is exacerbated by the fact that there are many different flavors of self-timed design methodologies, each with its own special requirements.
Testing is an essential part of the IC manufacturing process. Since there are very few self-timed circuits that have been manufactured in the industry [Ron99], the quality of test solutions for self-timed circuits, when compared to their synchronous counterparts, is clearly lacking.
The self-timed test problem is tackled in two main directions:
  1. Using a functional approach
    Certain faults in self-timed circuits lead to clearly observable behavior; usually such circuits stop functioning altogether. Several approaches have been proposed that rely mainly on such 'self-checking' behavior [MH91, Wie95, GVO+02].
  2. Modifying the self-timed circuit to support scan-based testing
    Since scan based testing is well-known and effective, most self-timed test methodologies try to add scan capability to state holding elements [PF95, KB95, BPtB02]. Unfortunately, full-scan based self-timed test methods incur a very large area penalty, at times doubling the circuit area. In order to keep the overhead at acceptable values, partial-scan methods have been suggested as well.
The GALS methodology developed by Muttersbach used a synchronous fall-back method, in which, for test purposes, all AFSMs were bypassed and synchronous state machines were used instead. The resulting system was a fully synchronous system that can be tested normally. This method was more of an emergency solution, and several alternatives were considered that actually tested the AFSMs as well. The AFSMs used in the GALS system are very limited in complexity. Therefore, instead of devising a general method that is capable of testing any given AFSM, practical methods that ensure adequate test coverage for the AFSMs used in the GALS methodology were investigated.
At first, a method that added scannable elements to the asynchronous connections was considered. This method, similar to the one presented in [BPtB02], introduces a very large area overhead. Such an arrangement also interrupts the asynchronous handshake signals between GALS modules, slowing the communication and reducing the throughput. Since all AFSMs are tested in isolation, this method fails to detect delay faults that occur because signal transitions between two AFSMs are either too slow or too fast. On top of that, it was shown that the stuck-at test coverage of this approach is not above 90%.
In a GALS system, only a very small portion of all stuck-at faults are in the AFSMs. As an example, in Acacia, the total number of stuck-at faults is 154,604. Only 182 (0.118%) of these faults are within the AFSMs. This was the main motivation to develop a functional approach to test the AFSMs within the GALS system. This approach [GVO+02] adds a Test Extension Element (TEE) to each GALS module. The TEE is clocked by a synchronous test clock. During test mode, the TEE is able to decouple the Pen signal from the LS island and initiate data transfers on all self-timed connections. In this way, all data connections within the GALS system can be tested individually. This idea is very similar in principle to the IEEE P1500 standard proposed for testing embedded cores [MZK+99]: inter-module communication is tested by initiating data transfers between modules.
Figure 5.8: Testing GALS systems with the help of a test extension element. The self-timed wrapper includes additional scan registers to insert and observe data transfer between GALS modules. The centralized test controller can be realized on-chip, or externally as a test program on automated test hardware
Individual TEEs are controlled by a centralized test controller as seen in figure 5.8. The centralized test controller could be implemented in hardware, giving the system a built-in self-test capability, or implemented as a test program on automated test equipment. The TEE also enables the test controller to access the LS island through a JTAG interface. In this methodology, all LS islands are assumed to have their own test solution.
Two GALS designs have been implemented after the test extension methodology was developed. Shir-Khan [GOV03a] was designed primarily as a test platform for various multi-point interconnection schemes for GALS [Vil05]. Shir-Khan consists of twenty five identical 4-bit micro-controllers with large buffers at their inputs and outputs. These so-called port processors were specifically designed to test the self-timed connections between the GALS modules. A special local clock generator developed by Stephan Oetiker [OVG+02] enabled switching between the locally generated clock and an external synchronous clock for configuration. In a way, the functionality of the TEE was integrated into the LS island and the local clock generator.
The local clock generator for Acacia uses a similar method to switch between the locally generated clock and an external synchronous test clock. During test mode, all GALS modules are run with the same synchronous test clock. This allows standard tools to be applied to generate the ATPG patterns. All LS islands are fully tested using this method. However, the stuck-at faults within the AFSMs, the local clock generators, and the data interface including the latches used for retaining data, can not be detected with these patterns. Figure 5.9 shows the configuration that has been used in Acacia. The shaded areas represent the portions of the system that contain stuck-at faults that can not directly be detected using the scan chains.
Figure 5.9: Simplified schematic of the scan chain connection used in Acacia. Once the ScanEn signal is active, the GALS system is in test mode. All scan chains of individual LS islands are connected and standard ATPG test vectors can be used to test the circuit. The shaded areas represent the portions of the circuit that are not covered by these tests. Functional test vectors are used to detect the stuck-at faults in these areas.
The final gate-level netlist of Acacia was analyzed using Synopsys Tetramax. All faults regarding the local clock generators (footnote 48) and reset signals (footnote 49) were removed from the fault library. The tool reported a test coverage of 96.2%. A further analysis yielded the following:
After this analysis, all questionable faults were collected. Several of the faults were equivalent. In addition, for some nodes both the stuck-at-1 and stuck-at-0 faults could not be detected. A total of 3,089 unique nodes were determined by parsing the fault reports. A gate-level simulation was performed and all of these 3,089 nodes were observed for the duration of a simple encryption and decryption. Each node that was observed to have changed its value more than 4 times during this simulation was considered to be detectable for stuck-at-1 and stuck-at-0 (footnote 50). This method reported 2,796 (90.5%) of the specially analyzed nodes detectable for stuck-at faults. Once these nodes were mapped back to the individual faults, only 175 faults remained undetected in the entire design. This results in a respectable test coverage of 99.89%.
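
The toggle-count criterion and the percentages quoted in this section can be reproduced with a small sketch. The function names and the example waveforms are hypothetical, and the final coverage calculation assumes that the 154,604 stuck-at faults reported earlier form the denominator.

```python
def count_toggles(samples):
    """Number of value changes in a sampled node waveform."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def considered_detectable(samples, threshold=4):
    """Criterion from the text: more than `threshold` value changes."""
    return count_toggles(samples) > threshold


def percentage(part, total):
    return 100.0 * part / total


# Hypothetical node waveforms from a gate-level simulation.
busy_node = [0, 1, 0, 1, 1, 0, 1]   # 5 changes
quiet_node = [0, 1, 0, 0, 1]        # 3 changes
print(considered_detectable(busy_node), considered_detectable(quiet_node))

# Figures taken from the text.
afsm_share = round(percentage(182, 154604), 3)          # faults inside AFSMs
node_share = round(percentage(2796, 3089), 1)           # nodes judged detectable
coverage = round(100.0 - percentage(175, 154604), 2)    # final test coverage
print(afsm_share, node_share, coverage)
```

With these inputs the three rounded values agree with the 0.118%, 90.5% and 99.89% stated in the text.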

5.4  Adapting Modules for GALS

In principle, any standard synchronous design can be converted into a GALS system. However, the advantages offered by the GALS methodology can only be exploited if the system is designed with GALS in mind from the beginning [GOK+05]. In a GALS system, the synchronous design will be partitioned into several LS islands that will eventually become individually clocked GALS modules. Each of these LS islands has to be adapted to, or better yet designed according to, the requirements of the GALS design. The better the synchronous system is adapted to a GALS methodology, the better the performance will be. However, this may result in a design that is noticeably different from a system that is optimized for synchronous clocking.
As an example, consider the AES algorithm presented in this thesis. An efficient synchronous design would most probably use a block diagram like the one presented in figure 3.10. However, a direct GALS implementation of this block diagram would not be able to offer the same kind of DPA countermeasures as Acacia does. Similarly, a designer would not come up with the transformation presented in figure 4.1 without thinking about a GALS implementation.
The following is a brief discussion of several key issues that must be considered when designing GALS-friendly LS islands:

5.5  Related Research Directions

This thesis primarily presented an application of GALS to secure cryptographic hardware design. There are other interesting fields where GALS could have a potential impact. The following is a short description of three active research fields where GALS could be used.

5.5.1  Network-on-Chip Systems

As SoCs grow in size, supporting a global communication that connects all sub-modules becomes an increasingly difficult task. A solution to this problem might be adapting network solutions commonly used for computer communications within the SoC. Such systems are generally referred to as Network-on-Chip (NoC) [JT03].
A classical NoC system consists of several resources, which are regular users of the network. Each resource is connected to the network by a switch that is able to route data packets to and from the resource. The NoC system clearly separates function from communication. The functionality is provided by the resources, and the communication between resources is handled by the switches.
Most of the topologies presented in the literature make use of the two-dimensional structure of an integrated circuit. Some architectures like Nostrum [MNT+04] use a homogeneous mesh-based system, while others like Xpipes [BB04] use a heterogeneous approach where the geometry and size of resources is determined by the functionality. The problem is very similar to the partitioning problem presented for GALS systems in section 4.1. It is easier to resolve the timing issues for homogeneous systems, since all connections between switches will have approximately the same size. On the other hand, it is difficult to imagine practical systems where all resources are of identical size.
NoC solves one important problem of large system designs. Instead of using overly long interconnections between sub-modules, data transmission over longer distances is routed over switches in the network. Still, distributing a global clock to the entire system and synchronizing between different clock domains remain a challenge.
There are many parallels between NoC system design and GALS design. First of all, both methods separate function from communication. As mentioned earlier, the partitioning problem is very similar for both approaches. Furthermore, both methods were developed to address problems of large SoC designs. Combining both methods could potentially be mutually beneficial. In fact, there are several recent publications that talk about GALS-based NoC architectures [BCV+05, RVFG05].
A successful GALS-based NoC should be able to manage implementing the resource as the LS island and realizing the switch as part of the self-timed wrapper of a single GALS module. The switch should not interfere with the LS island, as long as there is no data transfer between the network and the resource. Note that, while the resource is not receiving or sending data, the switch still needs to be able to route traffic to and from connected switches. Depending on the NoC system, this might require a more complex switch. In case the network switch can not be realized easily using a self-timed design method, it might be necessary to implement the switch as a second LS island with an independent local clock generator, which would increase the overhead of the system.
GALS would definitely address the problems of distributing the clock and synchronizing between different clock domains. However, one of the most requested features from a NoC system is a way to provide Quality-of-Service (QoS). This guarantees a specific bandwidth between selected resources under all circumstances. On the positive side, the operation speed of the switches in a GALS-based NoC can be made independent from the resource, adding more throughput and thus more flexibility to the network. However, defining an upper limit for the time that is required to transfer data between two GALS modules is difficult (footnote 52).
Overall, GALS and NoC promise to be two very compatible technologies. Combining both could potentially help overcome several serious problems that both technologies are facing at the moment.

5.5.2  Dynamic Voltage and Frequency Scaling

The more tightly circuits can be integrated, the more energy has to be dissipated over the same area. Coupled with the demands of mobile applications, where the power supply is one of the limiting design parameters, this has strongly motivated the designers to find ways to reduce the power consumption of integrated circuits over the last years.
As outlined briefly in section 3.5.2, the power consumption of an IC has a dynamic and a static part. While decreasing technology size has increased the ratio of static power consumption significantly, most of the power consumed by integrated circuits is still dynamic power consumption (footnote 53), given by the following well-known equation:

$$P_{dyn} \approx a \cdot C \cdot f \cdot V_{dd}^{2} \qquad (5.1)$$
The total switched capacitance C is determined mainly by the circuit netlist, and the activity factor a is determined by the function and input data. Even if the circuit is not changed in any way (and C and a remain constant), the dynamic power consumption can be reduced by lowering either the operating frequency f or the supply voltage Vdd.
The throughput of the circuit is directly determined by the operating frequency, and reducing f will reduce the throughput of the circuit. There are however some synchronous designs where the operating frequency is chosen as a compromise to satisfy different requirements of the circuit. In this case, some sub-blocks of the circuit may remain idle over several clock cycles. Similarly, the throughput requirement of the circuit may change throughout the operation of the circuit. Peak throughput might only be required for short time intervals.
Reducing the supply voltage reduces the dynamic power consumption even more strongly, since the dependence on Vdd is quadratic. However, the supply voltage cannot be reduced indefinitely, and the circuits will slow down as the supply voltage is reduced [CSB92].
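The scaling behaviour described by equation (5.1) can be illustrated with a short numerical sketch. The operating point used here (activity factor, capacitance, frequency and supply voltage) is purely hypothetical and was chosen only to make the ratios visible; it is not taken from any circuit discussed in this work.

```python
def dynamic_power(a, c, f, vdd):
    """Approximate dynamic power following equation (5.1): a * C * f * Vdd^2."""
    return a * c * f * vdd ** 2


# Hypothetical baseline operating point (illustrative values only).
base = dynamic_power(a=0.1, c=1e-9, f=100e6, vdd=1.8)

# Halving only the operating frequency: power scales linearly with f.
half_f = dynamic_power(a=0.1, c=1e-9, f=50e6, vdd=1.8)

# Halving only the supply voltage: power scales with the square of Vdd.
half_v = dynamic_power(a=0.1, c=1e-9, f=100e6, vdd=0.9)

print("relative power at f/2:  ", half_f / base)
print("relative power at Vdd/2:", half_v / base)
```

The sketch ignores the fact that a lower supply voltage also lowers the maximum achievable frequency, so in practice the two parameters cannot be chosen independently.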
Dynamically changing the frequency and the supply voltage for sub-blocks to reduce power consumption has been successfully implemented for high-performance micro-processors [NCM+02]. The so-called Dynamic Voltage and Frequency Scaling (DVFS) methods are very attractive for micro-processors, as these are well known for their excessive power consumption, and their performance requirements strongly depend on the program they are executing. Most DVFS applications are rather coarse grained, and adjustments are made after thousands or even millions of clock cycles.
There is no reason why large SoC systems could not benefit from DVFS as well. There are, however, some important differences between the two design styles. Unlike micro-processors, which have a relatively fixed architecture, SoC architectures are more varied. Consequently, algorithms that are developed to control DVFS for micro-processors are not always well suited for SoC designs.
GALS offers several interesting advantages that could be exploited to realize DVFS systems. The GALS design methodology already enables modules to be clocked at different clock rates. At least theoretically, it would be possible to extend this idea to support different voltages as well. One important issue in DVFS is monitoring or predicting the activity of the module to be controlled. In a GALS system that uses D-type port controllers, which pause the clock until the data transfer is completed, either the Req-Ack signal pairs or the duty cycle of the local clock can be used to determine how active a module is. If the duty cycle is
  50%: the module is running without being interrupted by data transfers. This means that the environment is faster than (or at least as fast as) the module itself. Potentially, the environment could even interact with a faster module. Therefore, if it is possible, increasing the operating frequency and the voltage of the module could result in improved throughput.
  (50-d)%: the module is running just as fast as the environment. This is the ideal operating condition. The small margin d is required, as it shows that the environment is just keeping up with the module, and the module has to be paused occasionally.
  25%: the module is spending half the time waiting for the environment. The module is producing results too fast. The supply voltage of the module and the operating frequency of the local clock can be reduced to save energy.
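The three cases above can be sketched as a simple decision rule. The function names, the default margin of 5 percentage points for d, and the assumption that the local clock is always paused in its low phase (so that pausing lowers the duty cycle) are illustrative choices made for this sketch; they are not part of any implementation described here.

```python
def duty_cycle(active_time, paused_time):
    """Duty cycle of a pausable clock, assuming it is paused in its low phase.

    While running, the clock is high for half of the active time; while
    paused, it stays low. Both arguments use the same (arbitrary) time unit.
    """
    total = active_time + paused_time
    if total <= 0:
        raise ValueError("observation window must be non-empty")
    return 0.5 * active_time / total


def dvfs_action(duty, margin=0.05):
    """Suggest a DVFS action from the observed duty cycle.

    margin is the assumed width of the 'ideal' band below 50% (the d value).
    """
    if not 0.0 <= duty <= 0.5:
        raise ValueError("duty cycle of a pausable clock must lie in [0, 0.5]")
    if duty >= 0.5:
        return "speed_up"      # never paused: the environment may support more
    if duty >= 0.5 - margin:
        return "hold"          # occasional pauses: module matches environment
    return "slow_down"         # frequent pauses: module is too fast


# A module that waits as long as it runs ends up at a 25% duty cycle.
print(dvfs_action(duty_cycle(active_time=1.0, paused_time=1.0)))
```

A practical controller would additionally need hysteresis and a suitable observation window, so that short bursts of activity do not trigger constant voltage and frequency changes.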
An important problem that needs to be addressed is to avoid cyclic dependencies, where in the end all modules in a GALS system end up slowing each other down until there is no activity at all. Also, instead of relying solely on the environment to speed up operation, individual GALS modules could be "warned" in advance about changes in activity54. These problems could be addressed by designing a centralized power controller to monitor and control the DVFS effort in the GALS system.
A second problem is communication between modules with different power supplies. Special level converters, or independent power supplies for input and output registers, could be used for this purpose. Unfortunately, modern technologies have dramatically reduced supply voltages to allow small transistors. This somewhat reduces the available margin for adjusting the power supply of the module55.
There is already a fair amount of interest and research in adapting DVFS for GALS. However, published results are far from practical realizations. There are early theoretical studies on the performance gains of GALS-based micro-processors using DVFS [IM02]. More recently published papers on GALS, like the GALDS approach by Chattopadhyay et al. [CZ05], mention DVFS as a possible application without providing concrete solutions.

5.5.3  Latency-Insensitive Design

Latency-insensitive design, developed by Carloni et al. [CMSV01], formally addresses the problem of designing systems whose inputs may arrive with different latencies due to interconnection delays. Rather than trying to find methods where all inputs arrive at the same time, Carloni suggests adding relay stations on long interconnections. The question is:
"Is it possible to have a functionally equivalent circuit under these circumstances?"
For the formal description of the system, a tagged signal model is used. In this model, all signals are represented by a set of value-tag pairs. The tag is essentially a timestamp that tells when the signal has the given value. Using this notation, Carloni was able to prove that, as long as the system consists of patient processes, it is indeed possible to have a functionally equivalent system. Such systems are called latency insensitive. A patient process is described as a stallable process which can be halted until all data inputs required to generate the next output are present.
Latency-insensitive design was developed with synchronous systems in mind. Most of the publications in this field try to address practical issues on how relay stations can be added to the system. Functional blocks are encapsulated by a shell that converts the system into a patient system, mostly by adequate clock gating. This approach is remarkably similar to a GALS system that consists of functional LS islands encapsulated by a self-timed wrapper, which contains a pausable local clock. The analysis methods used in latency-insensitive design may be applied to GALS systems and can be used to address formal aspects of GALS system design.
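The idea of a patient process inside a shell can be illustrated with a small behavioural sketch. It assumes that a cycle without valid data is marked by a special stall value, and that two signals count as equivalent when they carry the same sequence of values once these stall cycles are removed. The class name, the stall marker and the unbounded input queues are simplifications introduced for this sketch only; real relay stations have finite storage and rely on back-pressure.

```python
from collections import deque

STALL = None  # assumed marker for "no valid data in this clock cycle"


class PatientAdder:
    """Shell around a two-input adder that stalls until both operands exist."""

    def __init__(self):
        self.queues = (deque(), deque())

    def step(self, a, b):
        """Advance one clock cycle and return either a sum or STALL."""
        for queue, value in zip(self.queues, (a, b)):
            if value is not STALL:
                queue.append(value)
        if all(self.queues):  # both queues hold at least one operand
            return self.queues[0].popleft() + self.queues[1].popleft()
        return STALL


def run(xs, ys):
    """Feed two input signals cycle by cycle through a fresh shell."""
    shell = PatientAdder()
    return [shell.step(a, b) for a, b in zip(xs, ys)]


def strip_stalls(signal):
    """Keep only the informative values of a signal."""
    return [v for v in signal if v is not STALL]


# Inputs arriving together, and the same data arriving with different latencies.
ideal = run([1, 2, 3], [10, 20, 30])
delayed = run([1, STALL, 2, 3, STALL, STALL],
              [STALL, STALL, 10, STALL, 20, 30])

print(strip_stalls(ideal) == strip_stalls(delayed))
```

In this sketch the delayed system needs more clock cycles, but the sequence of results it produces is the same as that of the system without interconnection delays.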


File translated from TEX by TTH, version 3.77.
On 20 Dec 2006, 15:44.