GALS System Design:
Side Channel Attack Secure Cryptographic Accelerators


1  Introduction
2  GALS System Design
3  Cryptographic Accelerators
4  Secure AES Implementation Using GALS
5  Designing GALS Systems
6  Conclusion
A  'Guessing' Effort for Keys
B  List of Abbreviations
B  Bibliography
B  Footnotes



1Some references make a clear distinction between asynchronous and self-timed circuits. Even according to these descriptions, for the circuits that are discussed in this thesis, the two terms could be used interchangeably.
2The following list is from a keynote speech at the 2002 Asynchronous Circuit Design Conference in Manchester, UK.
3Actually, a Ri request during the high phase of the local clock is not acknowledged, even if the outstanding Ri request would effectively prevent the next rising transition of the local clock. While this slows the handshaking protocol somewhat, it is very simple to implement and it avoids many timing problems.
4This case is more common than one would normally expect. Imagine the following scenario: Module A is waiting for Module B to acknowledge the data transfer. Assume that Module B is not immediately available. In this case, the next clock edge of the Module A is practically waiting at the Muller-C element shown in figure 2.3. Once Module B enables its input port, the data transfer quickly takes place. As the data transfer is completed Module A releases its local clock. The active clock edge appears immediately. On Module B, assuming the data transfer has taken place within a fraction of the nominal clock period, the next active clock edge will appear later. By this time the output of Module A might already have changed.
5To achieve this an accurate completion detection is required. This is not always trivial. In practice most self-timed circuits use a delay line that is matched to the worst-case delay of the circuit. Such circuits also have worst-case performance.
6Typically in an asynchronous data connection, fast signals cause timing violations. Such violations are easy to resolve even at later stages of the design. The fast signals are delayed by inserting additional buffers.
7Unfortunately, up to now no digital information has been preserved for more than 50 years, simply because these methods were not known earlier. Storing information to last a long period of time involves challenges that have not yet been mastered. The main problem in storing digital information over longer periods of time, is that the technology used to store the information changes very rapidly.
An interesting example is described in detail at the URL In this project, an old English book written in 1086 was digitized in an effort to preserve the book in the year 1986. In less than 15 years, the hardware required for the digital copy was simply outdated and practically could not be used. The original book, however, can still be used.
8The Alice and Bob After Dinner Speech given at the Zurich Seminar, April 1984 by John Gordon by invitation of Professor James Massey, found (among other places) at, tells a humorous account of Alice and Bob.
9Remarkably, by using a simple XOR gate one can create an unconditionally secure system, provided an endless stream of totally random cipherkey bits can be provided. The problem in the so-called 'one-time pad' lies in the generation of the random cipherkey bits.
10Every service provided by cryptographic systems has a value to its user. It is the relation between this value and the cost of breaking the cryptographic algorithm that is important. A malicious attacker is unlikely to invest 10 million dollars in breaking a cryptographic system that protects information that is worth 1 million dollars.
11In the 20 years that DES was used, the computation power has increased by a factor of more than 10,000 according to Moore's Law. This alone corresponds to 13 bits (log210,000 13.28). In other words, attacking a 56-bit cipherkey in the year 1996 is equivalent to attacking a 43-bit cipherkey of the year 1976, if the improvement in computer technology is accounted for.
12The original specification by NIST required the candidates to support data block sizes of 128, 192 and 256 bits. However, the official AES algorithm announced by NIST is only defined for 128 bits.
13For the AES algorithm, the irreducible polynomial of degree 8 is taken to be m(x)=x8+x4+x3+x+1 for multiplication in GF(28).
14The expectations are based on preliminary simulation and layout studies for a full-custom mask ROM.
15The wiring overhead for the ShiftRows permutation is not much different from a straight interconnection.
16High throughput can also be important for other applications. One example from the industry used a hashing function to authenticate the operating system of a mobile device. Such a hashing function processes the entire ROM where the operating system is located and generates a checksum. The device only works if the checksum calculated from the ROM matches a hard-coded value. This is used to prevent 'alternative' operating systems from being used in the device. As the capabilities of the mobile devices increased so did the size of the ROM required to hold the operating system and the time required to calculate the checksum. Although the device did not process encrypted data at high rates, it still required a high throughput cryptographic solution to calculate the checksum in a time that is not noticeable by the users.
17The remaining standard modes, require the output of a complete cryptographic process prior to a new input. These operation modes could actually just as well use the outputs from previous cycles. It has been proven that these so-called interleaved feedback modes to be just as secure as the original modes. Using interleaved feedback modes would enable a number of architectural transformations like pipelining and parallelization to be applied to cryptographic accelerators. Alas, these modes are not supported by the official FIPS.
18The example is based on real systems, but is not necessarily 100% accurate in how certain problems are addressed in state-of-the-art systems. An excellent discussion on this topic can be found in [And93].
19Cryptographic hardware includes dedicated hardware that is designed specifically to implement cryptographic algorithms and general purpose programmable micro-controller/-processor based systems.
20Most of the side channels, like electromagnetic radiation and temperature, can directly be related to the power consumption as well.
21Extracting the cipher key using DPA attack is a rather difficult task. However when compared to attacking an exhaustively by trying all combinations of the cipher key, it is tens of orders of magnitude easier.
22For DPA attacks it is important to understand the nature of the power consumption, and not necessarily its absolute value. The short-circuit power Psc is a result of a switching activity that will manifest itself in the Pdynas well. The static leakage power Psta will differ for different states of the output. Similarly, a change of the output state will also result in dynamic power consumption Pdyn. Therefore, these two power components can be neglected without adverse results.
23Such attacks are independent and can be repeated until all subkeys have been revealed.
24The DPA attack does not depend on this model. In most attacks, models based on hamming weights (number of logic ones in a vector) or hamming distances (number of bit-flips between consecutive clock cycles) are used. It is possible to construct more elaborate and accurate models if details of the architecture are known by the attacker. But the DPA attack does not require models of such complexity to be successful.
25Since the attack requires considerable resources, it was repeated only twice. In the results shown above 3,000 measurements gave a sufficiently confident indication to the correct subkey. The other attack required more than 5,000 measurements to come up with a similar result.
26Using a post layout analysis, the switching activity of a large LFSR circuit was determined to be 0.4410, which is close to the theoretical maximum of 0.5. The difference stems from nets like reset, that are not directly in the data propagation path, but are essential for operation.
27Adapting a datapath so that it can continuously perform uncorrelated operations requires certain modifications, most notably a source for random data bits.
28David can be configured for both encryption and decryption. During the last round of an AES operation or during roundkey generation, the MixColumns operation needs to be skipped as well.
29It is pretty difficult to talk of a clock period, since the clock signal is for all practical purposes not periodic any more. It is not only randomly interrupted for data transfers but each clock edge is separated by a random time interval.
30As an example the Fastcore chip was clocked at 2 MHz so that 512 values per clock cycle could be sampled by a 1 Gb/s sampling oscilloscope. If Fastcore were to be clocked at its nominal operating frequency of 166 MHz, only 6 values would have been sampled per clock cycle.
31While the output of an 8-bit SubBytes operation in the first round depends on only 8-bits of the plaintext and the 8-bits of the cipherkey, the output of any 8-bit SubBytes operation in the second round depends on 32-bits of the plaintext and 32-bits of the cipherkey.
32Goliath does not keep track of the target count that it has assigned to the David units, as soon as it assigns one, it forgets about it.
33The decision of which multiplicative inverse unit to use will still be random.
34In all fairness it must be noted that the reference design that includes Matthias was optimized for small area. If both designs were optimized similarly, the area overhead of David would decrease slightly to around 80%.
35This excellent resource is also available online at the URL
36It is tempting to derive inputs for seemingly unrelated datapath elements from the same random bits. After analyzing the cost of the PRNG for the additional bits, it was decided to use random bits only once. This is an overly conservative approach and it may be indeed safe to re-use several random bits.
37Late in the design phase several random bits were no longer needed. The LFSRs were kept as they were, some random bits are simply not used. In fact, David uses only 69 bits and Goliath 234 bits.
38The original David and Goliath names are directly related to their bit-widths. Since two instances of David are used, in the code they were named after two colleagues at the IIS David Perels and David Tschopp. It was intended to name the reference design Vanilla, to show that it did not include any countermeasures and was a plain implementation. The name Matthias was used for the reference David implementations, since, just like the name David, there were two members at the IIS that were named Matthias: Matthias Braendli and Matthias Streiff. At this point the name Vanilla was changed to Schoggi (Swiss German for chocolate), since Matthias Braendli does not like vanilla flavor but prefers chocolate instead.
39The MPW service of the Europractice foundation offers only 25 mm2 modules for this particular technology. Upon request, these modules were divided into four equal parts. The price is still calculated for a full 25 mm2 module. This arrangement is frequently used to manufacture student designs. The worst case in this scenario is when one more than a multiple of four designs need to be manufactured. In such cases, a reasonable effort is made to merge two smaller designs.
40This is a typical Catch-22. The micro-electronics industry claims the lack of tool support as the main reason for not using self-timed circuit techniques, and the EDA industry does not develop such tools, since the industry does not use them.
There are several companies which have developed self-timed design flows like Handshake Solutions (formerly part of Philips) and Theseus Logic. However the products of these companies have had little impact on the acceptance of self-timed circuits.
41"A reasonably sized block" is sometimes defined as a design for which the the entire design flow from source code to final layout information can be completed using a standard workstation in a day.
42The mapping is by no means optimal. For the most part, the resulting netlist could be used directly. In some cases, several optimizations were performed manually.
43Shir-Khan also contained several additional test structures that were not part of the GALS system. These were placed and routed during this stage as well.
44Not all glitches will cause an erroneous state change in an AFSM. They are not as error prone to glitches as some publications lead people to believe. However, certain unwanted input transitions will lead to errors. Because it is not easy to differentiate or identify harmful glitches from harmless glitches, it is common practice to try to avoid all of them.
45Although the STGs of both g2s and g2d look very similar, there are differences in the way internal state variables are handled. Initially, both ports were very different. Only in the later stages of the design, the handshake protocol between Goliath and the synchronous interface was changed, resulting in the present design.
46A designer working on a standard synchronous circuit, does not need to instantiate scanable flip-flops, or connect the scan chain manually. This is all done automatically by the test automation tools.
47Glitches are not desirable as they increase the switching activity and thus the dynamic power consumption of the circuit. Their presence, however, does not affect the functionality.
48The local clock generators are practically disabled during the scan test mode. A simple functional test was devised, and all local clock generator outputs were made observable.
49During fault simulation, the reset signal has to be held high (inactive). As a result the ATPG algorithm reports all reset/set inputs of the flip-flops to be untestable against stuck-at-1 faults. This can be easily verified by special test vectors, and therefore, these faults have been removed from the fault library
50A proper fault simulation would have been much more conclusive in this regard. While it is trivial to modify the simulations to perform actual fault simulations, it has not been performed due to time constraints.
51In Acacia, the situation is exaggerated even further as the number of clock cycles required for a David operation varies randomly.
52All known synchronizer circuits basically need to avoid metastability. No matter what approach is chosen, there is either a probability that the circuit will fail, or the time required to produce an output is unbounded. In the GALS methodology developed by Muttersbach the basic element that resolves metastability, the MutEx element, falls into the latter category.
53For the UMC 0.25 m technology used to implement the designs presented here, even for extreme low power designs, the static power consumption is barely measurable (much less than %1).
54As an example, as soon as the keypad of a mobile device is activated, computational modules could be "brought to life" in anticipation of activity as a result of user interaction.
55Most digital circuits designed for a 2.5V technology would be able to operate safely (albeit much slower) with a supply of 1.2V. However, running a circuit designed for a 1.2V technology with 0.6V may not be possible at all.
56A proper analysis would require more than one attack, since individual measurements may differ significantly. As an example, for the two separate attacks on Fastcore, the first attack required around 6,000 plaintexts while the second attack (of which the results are shown in figure 3.13) required around 3,500.

File translated from TEX by TTH, version 3.77.
On 20 Dec 2006, 15:44.