GALS System Design:
Side Channel Attack Secure Cryptographic Accelerators
Contents
1 Introduction
2 GALS System Design
3 Cryptographic Accelerators
4 Secure AES Implementation Using GALS
5 Designing GALS Systems
6 Conclusion
A 'Guessing' Effort for Keys
B List of Abbreviations
Bibliography
Footnotes
Footnotes
1. Some references make a clear distinction between asynchronous and
self-timed circuits. Even according to these descriptions, for the
circuits that are discussed in this thesis, the two terms could be
used interchangeably.
2. The following list is from a keynote speech at the 2002 Asynchronous
Circuit Design Conference in Manchester, UK.
3. Actually, an Ri request during the high phase of
the local clock is not acknowledged, even if the outstanding Ri
request would effectively prevent the next rising transition of the
local clock. While this slows the handshaking protocol somewhat, it
is very simple to implement and avoids many timing problems.
4. This case is more common than one would normally expect. Imagine the
following scenario: Module A is waiting for Module B
to acknowledge the data transfer. Assume that Module B is not
immediately available. In this case, the next clock edge of Module A
is practically waiting at the Muller-C element shown in figure
2.3. Once Module B enables its
input port, the data transfer quickly takes place. As soon as the data
transfer is completed, Module A releases its local clock and the active
clock edge appears immediately. On Module B, assuming the data
transfer has taken place within a fraction of the nominal clock period,
the next active clock edge will appear later. By this time the output
of Module A might already have changed.
5. To achieve this, accurate completion detection is required, which
is not always trivial. In practice, most self-timed circuits use a
delay line that is matched to the worst-case delay of the circuit.
Such circuits also have worst-case performance.
6. Typically, in an asynchronous data connection, fast signals cause timing
violations. Such violations are easy to resolve even at later stages
of the design: the fast signals are delayed by inserting additional
buffers.
7. Unfortunately, up to now no digital information has been preserved
for more than 50 years, simply because these methods were not known
earlier. Storing information to last a long period of time involves
challenges that have not yet been mastered. The main problem in storing
digital information over longer periods of time is that the technology
used to store the information changes very rapidly.
An interesting example is described in detail at the URL
http://www.atsf.co.uk/dottext/domesday.html.
In this project, an old English book written in 1086 was digitized
in 1986 in an effort to preserve it. In less than 15
years, the hardware required to read the digital copy was simply outdated
and practically could not be used. The original book, however, can
still be used.
8. The Alice and Bob After Dinner Speech, given at the Zurich Seminar
in April 1984 by John Gordon at the invitation of Professor James Massey,
and found (among other places) at
http://downlode.org/Etext/alicebob.html,
tells a humorous account of Alice and Bob.
9. Remarkably, by using a simple XOR gate one can create an unconditionally
secure system, provided an endless stream of truly random cipherkey
bits is available. The problem with the so-called 'one-time pad'
lies in the generation of the random cipherkey bits.
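As an illustrative sketch (not part of the thesis; the function names are hypothetical), the one-time pad reduces to a single bitwise XOR, and decryption is the same operation:

```python
import secrets

def otp_encrypt(plaintext: bytes, key: bytes) -> bytes:
    # One-time pad: XOR each plaintext byte with a fresh key byte.
    # Unconditional security requires the key to be truly random,
    # at least as long as the message, and never reused.
    assert len(key) >= len(plaintext)
    return bytes(p ^ k for p, k in zip(plaintext, key))

msg = b"attack at dawn"
key = secrets.token_bytes(len(msg))  # the hard part: truly random bits
ct = otp_encrypt(msg, key)
# Decryption is the same XOR, since (p ^ k) ^ k == p.
assert otp_encrypt(ct, key) == msg
```

The sketch makes the footnote's point concrete: the XOR itself is trivial; everything difficult is hidden in producing `key`.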
10. Every service provided by cryptographic systems has a value to its
user. It is the relation between this value and the cost of breaking
the cryptographic algorithm that is important. A malicious attacker
is unlikely to invest 10 million dollars in breaking a cryptographic
system that protects information that is worth 1 million dollars.
11. In the 20 years that DES was used, computation power increased
by a factor of more than 10,000 according to Moore's Law. This alone
corresponds to 13 bits (log2(10,000) ≈ 13.28). In other words,
attacking a 56-bit cipherkey in the year 1996 is equivalent to attacking
a 43-bit cipherkey of the year 1976, if the improvement in computer
technology is accounted for.
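The arithmetic behind this footnote can be checked directly; this short sketch (not from the thesis, variable names are illustrative) reproduces it:

```python
import math

# Increase in computation power over the 20 years of DES use.
speedup = 10_000

# Bits of effective key strength gained by attackers from faster hardware.
equivalent_bits = math.log2(speedup)  # about 13.29

# A 56-bit cipherkey in 1996 offers roughly the resistance
# a (56 - 13)-bit cipherkey offered in 1976.
effective_bits_1976 = 56 - round(equivalent_bits)
```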
12. The original specification by NIST required the candidates to support
data block sizes of 128, 192 and 256 bits. However, the official AES
algorithm announced by NIST is only defined for 128 bits.
13. For the AES algorithm, the irreducible polynomial of degree 8 is taken
to be m(x) = x^8 + x^4 + x^3 + x + 1 for multiplication in GF(2^8).
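As a sketch of how multiplication modulo this polynomial can be carried out in software (the function name is illustrative, not from the thesis), note that m(x) corresponds to the bit pattern 0x11B:

```python
def gf_mult(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) modulo m(x) = x^8+x^4+x^3+x+1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:          # reduce whenever the degree reaches 8
            a ^= 0x11B
    return result
```

For instance, the worked example in the AES specification, {57} * {83} = {c1}, is reproduced by `gf_mult(0x57, 0x83) == 0xC1`.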
14. The expectations are based on preliminary simulation and layout studies
for a full-custom mask ROM.
15. The wiring overhead for the ShiftRows permutation is not much
different from a straight interconnection.
16. High throughput can also be important for other applications. One
example from industry used a hashing function to authenticate
the operating system of a mobile device. The hashing function processes
the entire ROM where the operating system is located and generates
a checksum. The device only works if the checksum calculated from
the ROM matches a hard-coded value. This is used to prevent 'alternative'
operating systems from being used in the device. As the capabilities
of mobile devices increased, so did the size of the ROM required
to hold the operating system, and with it the time required to calculate
the checksum. Although the device did not process encrypted data at high
rates, it still required a high-throughput cryptographic solution
to calculate the checksum in a time that is not noticeable to the
user.
17. The remaining standard modes require the output of a complete cryptographic
process prior to a new input. These operation modes could actually
just as well use the outputs from previous cycles. These so-called
interleaved feedback modes have been proven to be just as secure
as the original modes. Using interleaved feedback modes would enable
a number of architectural transformations, like pipelining and parallelization,
to be applied to cryptographic accelerators. Alas, these modes are
not supported by the official FIPS.
18. The example is based on real systems, but is not necessarily 100%
accurate in how certain problems are addressed in state-of-the-art
systems. An excellent discussion on this topic can be found in [And93].
19. Cryptographic hardware includes dedicated hardware that is designed
specifically to implement cryptographic algorithms as well as general-purpose
programmable micro-controller/-processor based systems.
20. Most of the side channels, like electromagnetic radiation and temperature,
can be directly related to the power consumption as well.
21. Extracting the cipherkey using a DPA attack is a rather difficult task.
However, compared to an exhaustive attack that tries all combinations
of the cipherkey, it is tens of orders of magnitude easier.
22. For DPA attacks it is important to understand the nature of the power
consumption, not necessarily its absolute value. The short-circuit
power Psc is the result of switching activity that will manifest
itself in Pdyn as well. The static leakage power Psta
will differ for different states of the output. Similarly, a change
of the output state will also result in dynamic power consumption
Pdyn. Therefore, these two power components can be neglected
without adverse results.
23. Such attacks are independent and can be repeated until all subkeys
have been revealed.
24. The DPA attack does not depend on this model. In most attacks, models
based on Hamming weights (number of logic ones in a vector) or Hamming
distances (number of bit flips between consecutive clock cycles) are
used. It is possible to construct more elaborate and accurate models
if details of the architecture are known to the attacker, but the
DPA attack does not require models of such complexity to be successful.
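The two models mentioned here are simple to state in code; this sketch (function names are illustrative, not from the thesis) shows both:

```python
def hamming_weight(v: int) -> int:
    # Number of logic ones in the value.
    return bin(v).count("1")

def hamming_distance(prev: int, curr: int) -> int:
    # Number of bit flips between two consecutive register states:
    # the weight of the XOR of the two values.
    return hamming_weight(prev ^ curr)
```

In a DPA model, `hamming_weight` would be applied to a predicted intermediate value, and `hamming_distance` to the predicted states of a register in consecutive clock cycles.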
25. Since the attack requires considerable resources, it was repeated
only twice. In the results shown above, 3,000 measurements gave a sufficiently
confident indication of the correct subkey. The other attack required
more than 5,000 measurements to come up with a similar result.
26. Using a post-layout analysis, the switching activity of a large LFSR
circuit was determined to be 0.4410, which is close to the theoretical
maximum of 0.5. The difference stems from nets like reset, which are
not directly in the data propagation path but are essential for operation.
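A small behavioral simulation illustrates why a maximal-length LFSR approaches the theoretical activity of 0.5. This is a sketch, not the post-layout analysis referred to above; the taps and parameters are illustrative:

```python
def lfsr_switching_activity(taps, width=16, cycles=10_000, seed=1):
    # Fibonacci-style LFSR: the feedback bit is the XOR of the tapped bits.
    mask = (1 << width) - 1
    state = seed
    toggles = 0
    for _ in range(cycles):
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        nxt = ((state << 1) | fb) & mask
        toggles += bin(state ^ nxt).count("1")  # flip-flops that toggled
        state = nxt
    # Average toggles per flip-flop per clock cycle.
    return toggles / (cycles * width)

# Taps of a maximal-length 16-bit LFSR (x^16 + x^14 + x^13 + x^11 + 1).
activity = lfsr_switching_activity(taps=(15, 13, 12, 10))
```

Because consecutive states of a well-chosen LFSR are essentially uncorrelated bit patterns, each flip-flop toggles with probability close to 0.5; the behavioral figure excludes nets like reset, which pull the post-layout number slightly lower.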
27. Adapting a datapath so that it can continuously perform uncorrelated
operations requires certain modifications, most notably a source of
random data bits.
28. David can be configured for both encryption and decryption.
During the last round of an AES operation or during roundkey generation,
the MixColumns operation needs to be skipped as well.
29. It is difficult to speak of a clock period, since the clock
signal is for all practical purposes no longer periodic. It is
not only randomly interrupted for data transfers, but consecutive clock
edges are also separated by random time intervals.
30. As an example, the Fastcore chip was clocked at 2 MHz so that
512 values per clock cycle could be sampled by a 1 GS/s sampling oscilloscope.
If Fastcore were clocked at its nominal operating frequency
of 166 MHz, only 6 values could have been sampled per clock cycle.
31. While the output of an 8-bit SubBytes operation in the first
round depends on only 8 bits of the plaintext and 8 bits of the
cipherkey, the output of any 8-bit SubBytes operation in the
second round depends on 32 bits of the plaintext and 32 bits of the
cipherkey.
32. Goliath does not keep track of the target count that it has
assigned to the David units; as soon as it assigns one, it
forgets about it.
33. The decision of which multiplicative inverse unit to use will still
be random.
34. In all fairness it must be noted that the reference design that includes
Matthias was optimized for small area. If both designs were
optimized similarly, the area overhead of David would decrease
slightly, to around 80%.
35. This excellent resource is also available online at the URL http://www.cacr.math.uwaterloo.ca/hac/
36. It is tempting to derive inputs for seemingly unrelated datapath elements
from the same random bits. After analyzing the cost of the PRNG for
the additional bits, it was decided to use each random bit only once.
This is an overly conservative approach, and it may indeed be safe
to re-use several random bits.
37. Late in the design phase, several random bits were no longer needed.
The LFSRs were kept as they were; some random bits are simply not
used. In fact, David uses only 69 bits and Goliath 234
bits.
38. The original David and Goliath names are directly related
to their bit-widths. Since two instances of David are used,
in the code they were named after two colleagues at the IIS, David
Perels and David Tschopp. It was intended to name the reference design
Vanilla, to show that it did not include any countermeasures
and was a plain implementation. The name Matthias was used
for the reference David implementations, since, just like the
name David, there were two members at the IIS named
Matthias: Matthias Braendli and Matthias Streiff. At this point
the name Vanilla was changed to Schoggi (Swiss German
for chocolate), since Matthias Braendli does not like vanilla flavor
but prefers chocolate instead.
39. The MPW service of the Europractice foundation offers only 25 mm2
modules for this particular technology. Upon request, these modules
were divided into four equal parts. The price is still calculated
for a full 25 mm2 module. This arrangement is frequently used
to manufacture student designs. The worst case in this scenario occurs
when the number of designs to be manufactured is one more than a
multiple of four. In such cases, a reasonable effort is made to merge
two smaller designs.
40. This is a typical Catch-22. The micro-electronics industry cites
the lack of tool support as the main reason for not using self-timed
circuit techniques, and the EDA industry does not develop such tools
since the industry does not use them.
There are several companies that have developed self-timed design
flows, like Handshake Solutions (formerly part of Philips) and Theseus
Logic. However, the products of these companies have had little impact
on the acceptance of self-timed circuits.
41. "A reasonably sized block" is sometimes defined as a design
for which the entire design flow from source code to final layout
information can be completed using a standard workstation in a day.
42. The mapping is by no means optimal. For the most part, the resulting
netlist could be used directly. In some cases, several optimizations
were performed manually.
43. Shir-Khan also contained several additional test structures that were
not part of the GALS system. These were placed and routed during this
stage as well.
44. Not all glitches will cause an erroneous state change in an AFSM.
AFSMs are not as prone to glitch-induced errors as some publications
lead people to believe. However, certain unwanted input transitions will
lead to errors. Because it is not easy to differentiate harmful
glitches from harmless ones, it is common practice to try to avoid
all of them.
45. Although the STGs of both g2s and g2d look very similar,
there are differences in the way internal state variables are handled.
Initially, both ports were very different. Only in the later stages
of the design was the handshake protocol between Goliath and the
synchronous interface changed, resulting in the present design.
46. A designer working on a standard synchronous circuit does not need
to instantiate scannable flip-flops or connect the scan chain manually.
This is all done automatically by the test automation tools.
47. Glitches are not desirable, as they increase the switching activity
and thus the dynamic power consumption of the circuit. Their presence,
however, does not affect the functionality.
48. The local clock generators are practically disabled during the scan
test mode. A simple functional test was devised, and all local clock
generator outputs were made observable.
49. During fault simulation, the reset signal has to be held high (inactive).
As a result, the ATPG algorithm reports all reset/set inputs of the
flip-flops as untestable against stuck-at-1 faults. These faults can
be easily verified by special test vectors, and therefore they have
been removed from the fault library.
50. A proper fault simulation would have been much more conclusive in
this regard. While it would be trivial to modify the simulations to
perform actual fault simulations, this has not been done due to time
constraints.
51. In Acacia, the situation is exaggerated even further, as the
number of clock cycles required for a David operation varies
randomly.
52. All known synchronizer circuits basically need to deal with metastability.
No matter what approach is chosen, there is either a probability that
the circuit will fail, or the time required to produce an output is
unbounded. In the GALS methodology developed by Muttersbach, the basic
element that resolves metastability, the MutEx element, falls into
the latter category.
53. For the UMC 0.25 μm technology used to implement the designs
presented here, even for extreme low-power designs, the static power
consumption is barely measurable (much less than 1%).
54. As an example, as soon as the keypad of a mobile device is activated,
computational modules could be "brought to life" in anticipation
of activity as a result of user interaction.
55. Most digital circuits designed for a 2.5 V technology would be able
to operate safely (albeit much slower) with a supply of 1.2 V. However,
running a circuit designed for a 1.2 V technology with 0.6 V may not
be possible at all.
56. A proper analysis would require more than one attack, since individual
measurements may differ significantly. As an example, for the two
separate attacks on Fastcore, the first attack required around
6,000 plaintexts while the second attack (of which the results are
shown in figure 3.13) required around 3,500.