GALS System Design:
Side Channel Attack Secure Cryptographic Accelerators
Contents
1 Introduction
2 GALS System Design
3 Cryptographic Accelerators
4 Secure AES Implementation Using GALS
5 Designing GALS Systems
6 Conclusion
A 'Guessing' Effort for Keys
B List of Abbreviations
Bibliography
Footnotes
Footnotes
1. Some references make a clear distinction between asynchronous and
self-timed circuits. Even according to these descriptions, for the
circuits that are discussed in this thesis, the two terms could be
used interchangeably.
2. The following list is from a keynote speech at the 2002 Asynchronous
Circuit Design Conference in Manchester, UK.
3. Actually, an Ri request during the high phase of
the local clock is not acknowledged, even if the outstanding Ri
request would effectively prevent the next rising transition of the
local clock. While this slows the handshaking protocol somewhat, it
is very simple to implement and avoids many timing problems.
4. This case is more common than one would normally expect. Imagine the
following scenario: Module A is waiting for Module B
to acknowledge the data transfer. Assume that Module B is not
immediately available. In this case, the next clock edge of Module A
is practically waiting at the Muller-C element shown in figure
2.3. Once Module B enables its
input port, the data transfer quickly takes place. As soon as the data
transfer is completed, Module A releases its local clock and the active
clock edge appears immediately. On Module B, assuming the data
transfer has taken place within a fraction of the nominal clock period,
the next active clock edge will appear later. By this time the output
of Module A might already have changed.
5. To achieve this, accurate completion detection is required, which
is not always trivial. In practice, most self-timed circuits use a
delay line that is matched to the worst-case delay of the circuit.
Such circuits also have worst-case performance.
6. Typically, in an asynchronous data connection, fast signals cause timing
violations. Such violations are easy to resolve even at later stages
of the design: the fast signals are delayed by inserting additional
buffers.
7. Unfortunately, up to now no digital information has been preserved
for more than 50 years, simply because these methods were not known
earlier. Storing information to last a long period of time involves
challenges that have not yet been mastered. The main problem in storing
digital information over longer periods of time is that the technology
used to store the information changes very rapidly.
An interesting example is described in detail at the URL
http://www.atsf.co.uk/dottext/domesday.html.
In this project, an old English book written in 1086 was digitized
in 1986 in an effort to preserve it. In less than 15
years, the hardware required to read the digital copy was simply outdated
and practically could not be used. The original book, however, can
still be used.
8. The Alice and Bob After Dinner Speech, given at the Zurich Seminar
in April 1984 by John Gordon at the invitation of Professor James Massey,
and found (among other places) at
http://downlode.org/Etext/alicebob.html,
tells a humorous account of Alice and Bob.
9. Remarkably, by using a simple XOR gate one can create an unconditionally
secure system, provided an endless stream of truly random cipherkey
bits is available. The problem with the so-called 'one-time pad'
lies in the generation of the random cipherkey bits.
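As an illustrative sketch (not part of the thesis; the function names are hypothetical), the one-time pad reduces to a single bitwise XOR, and decryption is the same operation:

```python
import secrets

def otp_encrypt(plaintext: bytes, key: bytes) -> bytes:
    # One-time pad: XOR each plaintext byte with a fresh key byte.
    # Unconditional security requires the key to be truly random,
    # at least as long as the message, and never reused.
    assert len(key) >= len(plaintext)
    return bytes(p ^ k for p, k in zip(plaintext, key))

msg = b"attack at dawn"
key = secrets.token_bytes(len(msg))  # the hard part: truly random bits
ct = otp_encrypt(msg, key)
# Decryption is the same XOR, since (p ^ k) ^ k == p.
assert otp_encrypt(ct, key) == msg
```

The sketch makes the footnote's point concrete: the XOR itself is trivial; everything difficult is hidden in producing `key`.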
10. Every service provided by cryptographic systems has a value to its
user. It is the relation between this value and the cost of breaking
the cryptographic algorithm that is important. A malicious attacker
is unlikely to invest 10 million dollars in breaking a cryptographic
system that protects information that is worth 1 million dollars.
11. In the 20 years that DES was used, computation power increased
by a factor of more than 10,000 according to Moore's Law. This alone
corresponds to 13 bits (log2(10,000) ≈ 13.28). In other words,
attacking a 56-bit cipherkey in the year 1996 is equivalent to attacking
a 43-bit cipherkey of the year 1976, if the improvement in computer
technology is accounted for.
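The arithmetic behind this footnote can be checked directly; this short sketch (not from the thesis, variable names are illustrative) reproduces it:

```python
import math

# Increase in computation power over the 20 years of DES use.
speedup = 10_000

# Bits of effective key strength gained by attackers from faster hardware.
equivalent_bits = math.log2(speedup)  # about 13.29

# A 56-bit cipherkey in 1996 offers roughly the resistance
# a (56 - 13)-bit cipherkey offered in 1976.
effective_bits_1976 = 56 - round(equivalent_bits)
```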
12. The original specification by NIST required the candidates to support
data block sizes of 128, 192 and 256 bits. However, the official AES
algorithm announced by NIST is only defined for 128 bits.
13. For the AES algorithm, the irreducible polynomial of degree 8 is taken
to be m(x) = x^8 + x^4 + x^3 + x + 1 for multiplication in GF(2^8).
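As a sketch of how multiplication modulo this polynomial can be carried out in software (the function name is illustrative, not from the thesis), note that m(x) corresponds to the bit pattern 0x11B:

```python
def gf_mult(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) modulo m(x) = x^8+x^4+x^3+x+1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:          # reduce whenever the degree reaches 8
            a ^= 0x11B
    return result
```

For instance, the worked example in the AES specification, {57} * {83} = {c1}, is reproduced by `gf_mult(0x57, 0x83) == 0xC1`.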
14. The expectations are based on preliminary simulation and layout studies
for a full-custom mask ROM.
15. The wiring overhead for the ShiftRows permutation is not much
different from a straight interconnection.
16. High throughput can also be important for other applications. One
example from industry used a hashing function to authenticate
the operating system of a mobile device. The hashing function processes
the entire ROM where the operating system is located and generates
a checksum. The device only works if the checksum calculated from
the ROM matches a hard-coded value. This is used to prevent 'alternative'
operating systems from being used in the device. As the capabilities
of mobile devices increased, so did the size of the ROM required
to hold the operating system, and with it the time required to calculate
the checksum. Although the device did not process encrypted data at high
rates, it still required a high-throughput cryptographic solution
to calculate the checksum in a time that is not noticeable to the
user.
17. The remaining standard modes require the output of a complete cryptographic
process prior to a new input. These operation modes could actually
just as well use the outputs from previous cycles. These so-called
interleaved feedback modes have been proven to be just as secure
as the original modes. Using interleaved feedback modes would enable
a number of architectural transformations, like pipelining and parallelization,
to be applied to cryptographic accelerators. Alas, these modes are
not supported by the official FIPS.
18. The example is based on real systems, but is not necessarily 100%
accurate in how certain problems are addressed in state-of-the-art
systems. An excellent discussion on this topic can be found in [And93].
19. Cryptographic hardware includes dedicated hardware that is designed
specifically to implement cryptographic algorithms as well as general-purpose
programmable micro-controller/-processor based systems.
20. Most of the side channels, like electromagnetic radiation and temperature,
can be directly related to the power consumption as well.
21. Extracting the cipherkey using a DPA attack is a rather difficult task.
However, compared to an exhaustive attack that tries all combinations
of the cipherkey, it is tens of orders of magnitude easier.
22. For DPA attacks it is important to understand the nature of the power
consumption, not necessarily its absolute value. The short-circuit
power Psc is the result of switching activity that will manifest
itself in Pdyn as well. The static leakage power Psta
will differ for different states of the output. Similarly, a change
of the output state will also result in dynamic power consumption
Pdyn. Therefore, these two power components can be neglected
without adverse results.
23. Such attacks are independent and can be repeated until all subkeys
have been revealed.
24. The DPA attack does not depend on this model. In most attacks, models
based on Hamming weights (number of logic ones in a vector) or Hamming
distances (number of bit flips between consecutive clock cycles) are
used. It is possible to construct more elaborate and accurate models
if details of the architecture are known to the attacker, but the
DPA attack does not require models of such complexity to be successful.
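The two models mentioned here are simple to state in code; this sketch (function names are illustrative, not from the thesis) shows both:

```python
def hamming_weight(v: int) -> int:
    # Number of logic ones in the value.
    return bin(v).count("1")

def hamming_distance(prev: int, curr: int) -> int:
    # Number of bit flips between two consecutive register states:
    # the weight of the XOR of the two values.
    return hamming_weight(prev ^ curr)
```

In a DPA model, `hamming_weight` would be applied to a predicted intermediate value, and `hamming_distance` to the predicted states of a register in consecutive clock cycles.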
25. Since the attack requires considerable resources, it was repeated
only twice. In the results shown above, 3,000 measurements gave a sufficiently
confident indication of the correct subkey. The other attack required
more than 5,000 measurements to come up with a similar result.
26. Using a post-layout analysis, the switching activity of a large LFSR
circuit was determined to be 0.4410, which is close to the theoretical
maximum of 0.5. The difference stems from nets like reset, which are
not directly in the data propagation path but are essential for operation.
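A small behavioral simulation illustrates why a maximal-length LFSR approaches the theoretical activity of 0.5. This is a sketch, not the post-layout analysis referred to above; the taps and parameters are illustrative:

```python
def lfsr_switching_activity(taps, width=16, cycles=10_000, seed=1):
    # Fibonacci-style LFSR: the feedback bit is the XOR of the tapped bits.
    mask = (1 << width) - 1
    state = seed
    toggles = 0
    for _ in range(cycles):
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        nxt = ((state << 1) | fb) & mask
        toggles += bin(state ^ nxt).count("1")  # flip-flops that toggled
        state = nxt
    # Average toggles per flip-flop per clock cycle.
    return toggles / (cycles * width)

# Taps of a maximal-length 16-bit LFSR (x^16 + x^14 + x^13 + x^11 + 1).
activity = lfsr_switching_activity(taps=(15, 13, 12, 10))
```

Because consecutive states of a well-chosen LFSR are essentially uncorrelated bit patterns, each flip-flop toggles with probability close to 0.5; the behavioral figure excludes nets like reset, which pull the post-layout number slightly lower.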
27. Adapting a datapath so that it can continuously perform uncorrelated
operations requires certain modifications, most notably a source of
random data bits.
28. David can be configured for both encryption and decryption.
During the last round of an AES operation or during roundkey generation,
the MixColumns operation needs to be skipped as well.
29. It is difficult to speak of a clock period, since the clock
signal is for all practical purposes no longer periodic. It is
not only randomly interrupted for data transfers, but consecutive clock
edges are also separated by random time intervals.
30. As an example, the Fastcore chip was clocked at 2 MHz so that
512 values per clock cycle could be sampled by a 1 GS/s sampling oscilloscope.
If Fastcore were clocked at its nominal operating frequency
of 166 MHz, only 6 values could have been sampled per clock cycle.
31. While the output of an 8-bit SubBytes operation in the first
round depends on only 8 bits of the plaintext and 8 bits of the
cipherkey, the output of any 8-bit SubBytes operation in the
second round depends on 32 bits of the plaintext and 32 bits of the
cipherkey.
32. Goliath does not keep track of the target count that it has
assigned to the David units; as soon as it assigns one, it
forgets about it.
33. The decision of which multiplicative inverse unit to use will still
be random.
34. In all fairness it must be noted that the reference design that includes
Matthias was optimized for small area. If both designs were
optimized similarly, the area overhead of David would decrease
slightly, to around 80%.
35. This excellent resource is also available online at the URL http://www.cacr.math.uwaterloo.ca/hac/
36. It is tempting to derive inputs for seemingly unrelated datapath elements
from the same random bits. After analyzing the cost of the PRNG for
the additional bits, it was decided to use each random bit only once.
This is an overly conservative approach, and it may indeed be safe
to re-use several random bits.
37. Late in the design phase, several random bits were no longer needed.
The LFSRs were kept as they were; some random bits are simply not
used. In fact, David uses only 69 bits and Goliath 234
bits.
38. The original David and Goliath names are directly related
to their bit-widths. Since two instances of David are used,
in the code they were named after two colleagues at the IIS, David
Perels and David Tschopp. It was intended to name the reference design
Vanilla, to show that it did not include any countermeasures
and was a plain implementation. The name Matthias was used
for the reference David implementations, since, just like the
name David, there were two members at the IIS named
Matthias: Matthias Braendli and Matthias Streiff. At this point
the name Vanilla was changed to Schoggi (Swiss German
for chocolate), since Matthias Braendli does not like vanilla flavor
but prefers chocolate instead.
39. The MPW service of the Europractice foundation offers only 25 mm2
modules for this particular technology. Upon request, these modules
were divided into four equal parts. The price is still calculated
for a full 25 mm2 module. This arrangement is frequently used
to manufacture student designs. The worst case in this scenario occurs
when the number of designs to be manufactured is one more than a
multiple of four. In such cases, a reasonable effort is made to merge
two smaller designs.
40. This is a typical Catch-22. The micro-electronics industry cites
the lack of tool support as the main reason for not using self-timed
circuit techniques, and the EDA industry does not develop such tools
since the industry does not use them.
There are several companies that have developed self-timed design
flows, like Handshake Solutions (formerly part of Philips) and Theseus
Logic. However, the products of these companies have had little impact
on the acceptance of self-timed circuits.
41. "A reasonably sized block" is sometimes defined as a design
for which the entire design flow from source code to final layout
information can be completed using a standard workstation in a day.
42. The mapping is by no means optimal. For the most part, the resulting
netlist could be used directly. In some cases, several optimizations
were performed manually.
43. Shir-Khan also contained several additional test structures that were
not part of the GALS system. These were placed and routed during this
stage as well.
44. Not all glitches will cause an erroneous state change in an AFSM.
AFSMs are not as prone to glitch-induced errors as some publications
lead people to believe. However, certain unwanted input transitions will
lead to errors. Because it is not easy to differentiate harmful
glitches from harmless ones, it is common practice to try to avoid
all of them.
45. Although the STGs of both g2s and g2d look very similar,
there are differences in the way internal state variables are handled.
Initially, both ports were very different. Only in the later stages
of the design was the handshake protocol between Goliath and the
synchronous interface changed, resulting in the present design.
46. A designer working on a standard synchronous circuit does not need
to instantiate scannable flip-flops or connect the scan chain manually.
This is all done automatically by the test automation tools.
47. Glitches are not desirable, as they increase the switching activity
and thus the dynamic power consumption of the circuit. Their presence,
however, does not affect the functionality.
48. The local clock generators are practically disabled during the scan
test mode. A simple functional test was devised, and all local clock
generator outputs were made observable.
49. During fault simulation, the reset signal has to be held high (inactive).
As a result, the ATPG algorithm reports all reset/set inputs of the
flip-flops as untestable against stuck-at-1 faults. These faults can
be easily verified by special test vectors, and therefore they have
been removed from the fault library.
50. A proper fault simulation would have been much more conclusive in
this regard. While it would be trivial to modify the simulations to
perform actual fault simulations, this has not been done due to time
constraints.
51. In Acacia, the situation is exaggerated even further, as the
number of clock cycles required for a David operation varies
randomly.
52. All known synchronizer circuits basically need to deal with metastability.
No matter what approach is chosen, there is either a probability that
the circuit will fail, or the time required to produce an output is
unbounded. In the GALS methodology developed by Muttersbach, the basic
element that resolves metastability, the MutEx element, falls into
the latter category.
53. For the UMC 0.25 μm technology used to implement the designs
presented here, even for extreme low-power designs, the static power
consumption is barely measurable (much less than 1%).
54. As an example, as soon as the keypad of a mobile device is activated,
computational modules could be "brought to life" in anticipation
of activity as a result of user interaction.
55. Most digital circuits designed for a 2.5 V technology would be able
to operate safely (albeit much slower) with a supply of 1.2 V. However,
running a circuit designed for a 1.2 V technology with 0.6 V may not
be possible at all.
56. A proper analysis would require more than one attack, since individual
measurements may differ significantly. As an example, for the two
separate attacks on Fastcore, the first attack required around
6,000 plaintexts while the second attack (of which the results are
shown in figure 3.13) required around 3,500.