GALS System Design:
Side Channel Attack Secure Cryptographic Accelerators

Chapter 6:

Frank Kagan Gürkaynak

This is the www enabled version of my thesis. This has been converted from the sources of the original file by using TTH, some perl and some hand editing. There is also a PDF. This is essentially as it is, but includes formatting for A4, and some of the color pictures from the presentation.


1  Introduction
2  GALS System Design
3  Cryptographic Accelerators
4  Secure AES Implementation Using GALS
5  Designing GALS Systems
6  Conclusion
    6.1  Cryptographic Hardware Design
    6.2  GALS Design Methodology
    6.3  Final Words
A  'Guessing' Effort for Keys
B  List of Abbreviations
B  Bibliography
B  Footnotes

Chapter 6

This thesis essentially combines two different research areas: cryptographic hardware design, and the GALS design methodology. Separate conclusions can be drawn for both areas:

6.1  Cryptographic Hardware Design

A hardware designer's view of implementing cryptographic hardware is presented in this work. As a result of the experience obtained during the design of six different ASICs with different constraints on operation speed, area and side-channel security, an extensive analysis of hardware implementations of the popular Advanced Encryption Standard is given.
As part of this work, a completely new implementation of the AES with improved countermeasures against the differential power analysis (a particularly efficient form of side-channel attacks) has been developed. This implementation, called Acacia, is based on the GALS design methodology and utilizes several levels of countermeasures.
Several of the implemented countermeasures, like adding noise generators or inserting 'dummy' operations, are well-known. Some of the countermeasures are however new, and were only made possible as a result of using the GALS design methodology. Acacia consists of three datapath units that are clocked independently. In addition, Acacia is able to change the period of individual clock cycles randomly. Combined with the aforementioned countermeasures, this results in a very unpredictable operation order and is expected to present a formidable challenge to attackers.
Almost all countermeasures incur some sort of performance penalty. Additional pseudo random number generators used for generating additional noise increase circuit area, and not surprisingly, power consumption. Dummy operations used to interrupt the regular operation flow increase the time required to complete a cryptographic operation, and reduce the throughput. In addition, the AES datapath needs to be modified in a GALS friendly way, which results in lower performance when compared to an optimized implementation.
Acacia is able to dynamically change its security effort during an operation. The so-called policy can be programmed to increase the effort for countermeasures during the initial and final rounds of an AES operation, where it is more vulnerable. During middle rounds, where it is more difficult to stage a DPA attack, the security effort is reduced. Measurement results have shown that, when run using this policy setting, the throughput of Acacia is only 15% less than that of the synchronous reference design and consumes only 20% more energy per data item.
This thesis provides a fresh idea on how to implement DPA countermeasures. The evaluation of the efficiency of the proposed countermeasures requires the help of the cryptanalysis community, and is beyond the scope of this thesis.
There are many practical problems associated with determining the quality of the countermeasures. For instance, to be able to verify whether or not a proposed countermeasure results in 10x improvement in DPA security, the design must be attacked with no countermeasures first. Then with the countermeasures activated, it must be shown that an attack with comparable certainty can only be obtained with at least 10x more effort56.
The successful DPA attack on Fastcore shown in figure 3.13 required more than three days of automated measurements. Unless the measurement setup is refined to reduce the attack time dramatically, it is impractical to demonstrate the efficiency of the countermeasures using the current measurement setup. Even if such measurements could be performed in reasonable time, they would not offer conclusive proof that the proposed countermeasures are effective against side-channel attacks of similar nature.

6.2  GALS Design Methodology

Up to now, GALS implementations have been limited to demonstrator circuits or large-scale testbeds. The Acacia design presented in this thesis is the first circuit where GALS has been applied to address a specific problem. The GALS design flow has been proven to produce results comparable to what can be expected from more established industrial design flows. The design effort was comparable to that of a synchronous design of the same complexity, and all major steps of the design flow were performed using industry-standard design tools.
Designs that employ self-timed circuits are often criticized to have poor testability. Similar concerns have been raised over GALS applications over the years. By using a combination of scan-based test and a simple functional test, a stuck-at fault coverage of more than 99.8% has been obtained for Acacia.
Since GALS-based designs have pausable local clock generators, it is considered to be hard to interface to external sources that use synchronous clocking. While a generic solution for this problem has not been formulated, it was shown that under certain timing assumptions, it is indeed possible to transfer data between a GALS module and a standard synchronous design reliably.
The main design criteria behind Acacia was to improve the resistance against differential power analysis attacks. Such countermeasures invariably add penalties to system parameters such as circuit area, operation speed, and power consumption. However, even with significant countermeasures, the GALS-based Acacia achieves throughput and power figures within 20% of those from a fully synchronous version without any countermeasures. This shows that it is indeed possible to design GALS systems that have similar or even better performance metrics than their synchronous counterparts.
As can be seen from table 4.3, Acacia is more than two times larger than the reference design. However, the overhead required for the self-timed wrapper is less than 5% of the total area of Acacia. Most of the additional area in Acacia can be attributed to the countermeasures and the pseudo random number generators.
The most critical aspect of GALS remains to be partitioning the design into GALS modules. The partitioning has more influence on the performance of the system than all other factors combined. A well-defined methodology to determine the partitioning for GALS designs has yet to be developed.
In an attempt to make GALS design attractive to a broader audience, it has been often suggested that a synchronous design can be easily converted to a GALS design in a process called GALSification. While it is possible to design working GALS systems using this method, more efficient systems can only be realized if it is designed with GALS in mind in the first place.
This will be more apparent for larger SoCs that are expected to benefit significantly from GALS-based design. Present SoC designs require several tens of clock domains. Several of these domains are introduced to enable system wide communication protocols between blocks with different operating speeds. If such SoC circuits were designed with GALS in mind, several of such clock domains would not be required. Moreover, especially for inter-module communication, inherently asynchronous communication protocols would be favored over synchronous versions that are more difficult to implement in a GALS system.

6.3  Final Words

This work shows that the GALS approach is indeed a relatively mature design methodology that can be safely applied to design digital systems. As long as the system has been designed with GALS in mind, a designer using GALS should not expect a notable performance loss or an increased design effort.
The main advantage of the GALS design methodology is that it reduces the effort required for integrating multiple blocks on a large System-on-Chip design. However, in the example design described in this thesis, GALS has been applied to address a completely different problem. By implementing a common cryptographic algorithm using GALS, it was shown that completely new countermeasures against common side-channel attacks can be developed.

File translated from TEX by TTH, version 3.77.
On 20 Dec 2006, 15:44.