GALS System Design:
Side Channel Attack Secure Cryptographic Accelerators

Chapter 4:
Secure AES Implementation Using GALS

Frank Kagan Gürkaynak
 
<kgf@ieee.org>

 
Disclaimer:
This is the web-enabled version of my thesis, converted from the sources of the original document using TTH, some Perl and some hand editing. There is also a PDF version; it is essentially the same, but is formatted for A4 and includes some of the color pictures from the presentation.

Contents

1  Introduction
2  GALS System Design
3  Cryptographic Accelerators
4  Secure AES Implementation Using GALS
    4.1  Partitioning
    4.2  DPA Countermeasures
        4.2.1  Noise Generation
        4.2.2  Operation Re-Ordering
        4.2.3  GALS Modules
        4.2.4  Variable Clock Periods
        4.2.5  Security Effort
    4.3  Realization and Results
        4.3.1  David
        4.3.2  Goliath
        4.3.3  Interface
        4.3.4  Random Number Generation
        4.3.5  Reference Design
        4.3.6  Physical Implementation
        4.3.7  Simulation Results
        4.3.8  Measurement Results
5  Designing GALS Systems
6  Conclusion
A  'Guessing' Effort for Keys
B  List of Abbreviations
Bibliography
Footnotes

Chapter 4
Secure AES Implementation Using GALS

The GALS design methodology described in chapter 2 was developed to combine the advantages of asynchronous design with the convenience of a well-supported digital design methodology. A significant portion of present-day asynchronous circuit design is geared towards secure cryptographic hardware. The power spectrum of an asynchronous circuit does not contain large peaks at multiples of a global clock frequency and is therefore believed to reveal less information about the circuit operation. With this in mind, we have asked the following question:
"How can we use the GALS methodology to improve the security of cryptographic hardware?"
To answer this question the AES block cipher was implemented by utilizing new possibilities offered by the GALS methodology. The new architecture, called Acacia, presents additional challenges to attackers, especially to those who use DPA. The main contribution of the GALS methodology to DPA security comes from partitioning the AES block cipher into multiple LS islands that are clocked independently. Even though clock signals are present (and visible in the power trace), the attacker cannot easily attribute a given clock signal to the correct LS island. This task is made even more difficult by varying the clock period of each LS island randomly. The remainder of this chapter discusses the Acacia architecture in detail.

4.1   Partitioning

Partitioning a design into several LS islands remains an inexact science within the GALSification process. There is an inevitable overhead both in circuit area and in communication when a design is partitioned into multiple GALS modules. Experience has shown that small GALS modules suffer noticeably from such overhead.
Two separate approaches to partitioning are common. A functional partitioning scheme determines the LS islands according to functionality. In this way, parts of the circuit that exchange data often are located within the same LS island and the communication overhead can be reduced significantly. However, the circuit size of the resulting LS islands cannot be kept at an optimum level in this way. The second alternative takes the circuit area into account. In this method, the size of an LS island is frequently taken as the size of the circuit that can go through the entire design flow (synthesis, placement and routing, verification) using contemporary tools and computers within one day. While this approach is more suited for an automatic partitioning process, it creates LS islands that suffer from excessive communication overhead.
The security concept developed for Acacia requires multiple LS islands that are clocked independently. The additional security offered by this approach is the increased effort required on part of the attacker to determine the state of the operation. This has two main consequences:
  1. There must be multiple LS islands
  2. The LS islands should not exchange data too frequently, as the local clocks of two modules are synchronized to each other during data transfers. If two LS islands were to exchange data at each cycle, the two clocks would remain synchronous to each other, which would take away any advantage gained by using independent clocks.
It can be seen from previous designs (see section 3.4.3) that even a fully parallel AES cipher can be realized using around 100,000 gates. This is not a large amount, even for a relatively mature technology like the 0.25 μm technology used in this project. The second constraint listed above basically rules out any parallel implementation of AES, and calls for an iterative architecture. To achieve a suitable partitioning that satisfies the above mentioned constraints, a new architecture must be developed. After a careful analysis of the basic structure of the AES algorithm seen in figure 3.2, the base operations of AES are divided into two groups.
The first group consists of the operations ShiftRows and AddRoundKey and is designed to operate on 128-bit data. The ShiftRows operation can be realized without significant hardware overhead if it operates on all 128 bits in parallel. The AddRoundKey operation consists of a simple bitwise XOR operation, and does not add significant hardware overhead even if it is implemented for 128 bits. Since the roundkey is used only by the AddRoundKey operation, the roundkey generation is also placed into this group.
The second group consists of the remaining operations MixColumns and SubBytes. As mentioned in section 3.3.2, the SubBytes operation can be further divided into two parts: a multiplicative inverse and an affine transformation. In essence MixColumns is a 32 bit operation and SubBytes is an 8-bit operation. In reference to the bit widths used within the two groups the first group (using 128-bits) is called Goliath, and the second group (using 32-bits) is called David.
One round of AES transformations is equivalent to running the operations in Goliath once, and the operations in David four times. Similarly, to complete the operations within David, one 32-bit MixColumns operation and four 8-bit SubBytes operations are required. As described earlier in section 3.5.3 parallel running datapath units contribute to DPA security by generating additional noise. To take advantage of this, David has been designed to include two parallel SubBytes units and similarly Goliath is designed to interface to two identical David units. Figure 4.1 shows the architectural transformation used in Acacia. To simplify the drawing only encryption operators are shown. Acacia supports decryption as well.
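As an illustration, the division of labour between Goliath and David can be sketched in Python. The function names and the identity placeholder bodies below are purely hypothetical; only the call structure (Goliath once per round, David four times, four S-box lookups per David invocation) reflects the decomposition described above.

```python
# Illustrative sketch of the Goliath/David decomposition. The S-box,
# MixColumns and ShiftRows bodies are identity placeholders; only the call
# structure (Goliath once per round, David four times) reflects Acacia.

def sub_bytes_8(byte):
    return byte                        # placeholder for the 8-bit S-box

def mix_columns_32(column):
    return column                      # placeholder for 32-bit MixColumns

def shift_rows(state):
    return state                       # placeholder 128-bit permutation

def david(column):
    # One David invocation: four 8-bit SubBytes, then one 32-bit MixColumns.
    return mix_columns_32([sub_bytes_8(b) for b in column])

def goliath_round(state, round_key):
    # One AES round in Goliath: ShiftRows and AddRoundKey operate on all
    # 128 bits; the four 32-bit columns are handed to David in between.
    state = shift_rows(state)
    columns = [state[4 * i:4 * i + 4] for i in range(4)]
    state = [b for col in columns for b in david(col)]
    return [s ^ k for s, k in zip(state, round_key)]   # AddRoundKey (XOR)
```

With the real S-box and MixColumns substituted in, the same call structure would compute an actual AES round; here it only makes the partitioning explicit.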
Figure compose2
Figure 4.1: Transforming the AES encryption round into Goliath and David.
The new AES architecture is more suitable to partitioning for GALS. Figure 4.2 shows the block diagram of Acacia. The system consists of two identical instances of David and a single instance of Goliath. These three instances are realized as separate GALS modules and contain their own independent local clock generators. The system includes an additional block that is clocked synchronously and is used as an interface to the outside world.
Figure acacia_gals_3
Figure 4.2: The block diagram of Acacia showing three GALS modules and the synchronous interface

4.2   DPA Countermeasures

Acacia was specifically designed to incorporate several different layers of DPA countermeasures. As with all such designs, it is a combination of different 'DPA aware' design decisions that contribute to the overall security of the system. The GALS design methodology employed in Acacia has been instrumental in implementing a number of efficient countermeasures. However, using GALS by itself is far from sufficient to provide this security.
It is assumed that, although small, a portion of the power consumption of Acacia will contain a contribution that directly depends on the cipher key. Rather than finding methods to completely eliminate this power consumption, the security effort is concentrated on making the detection of this power consumption impractical.
A typical attack on the AES algorithm would target the output of one of the 8-bit SubBytes operations. In a standard AES encryption there are a total of 160 different SubBytes operations. Consider an architecture that is able to perform a single 8-bit SubBytes operation at a time. Ideally the attacker would like to observe the circuit while it is processing just the targeted SubBytes operation and nothing else. For this, the attacker must be able to precisely determine when the targeted operation is taking place. If the attacker is not able to perform measurements at a precise time, the measurement will contain the contribution of several SubBytes operations at the same time, just like in an architecture where multiple instances of 8-bit SubBytes operations are executed in parallel. In the worst case, the attacker will be able to make just a single measurement over the entire encryption operation.
In Acacia the attacker is given as little information as possible about the state of the AES operation. The GALS methodology used in Acacia presents additional means to increase the temporal uncertainty of a specific operation. This is further augmented by well-known design practices that increase the DPA resistance. The following subsections explain individual aspects of the countermeasures in more detail.

4.2.1   Noise Generation

Adding noise and executing random operations are well-known countermeasures to improve the DPA security of a cryptographic system as explained in section 3.5.3. Acacia incorporates both of these methods in several layers. The architecture of Acacia consists of multiple parallel datapaths. As a design principle, all datapath units in Acacia are constantly fed with data, regardless of the status of the cryptographic operation. The necessary input data is generated by pseudo random number generators. Goliath uses as much as 242 bits of random data per clock cycle and each David unit uses an additional 76 bits (for details on random number generation in Acacia refer to section 4.3.4).
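The pseudo-random bit budget quoted above can be made concrete with a generic linear feedback shift register sketch. The register width and tap positions below are illustrative stand-ins, not Acacia's actual generator polynomials (those are covered in section 4.3.4).

```python
def lfsr_bits(state, taps, width, n):
    # Generic Fibonacci LFSR sketch: emit n bits, feeding the XOR of the
    # tapped bit positions back into the most significant bit.
    # The taps and width are illustrative, not Acacia's real polynomials.
    out = []
    for _ in range(n):
        out.append(state & 1)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (width - 1))
    return out, state

# Random bits consumed per clock cycle by the always-active datapaths:
GOLIATH_BITS = 242
DAVID_BITS = 76                              # per David instance (two exist)
TOTAL_BITS = GOLIATH_BITS + 2 * DAVID_BITS   # 394 bits per cycle in total
```

In hardware these bits come from several generators running in parallel; the single sequential loop above only shows the principle.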
Power consumed by parallel operations on uncorrelated data appears as a noise component PN in a DPA analysis. First, by keeping the datapaths constantly occupied, the exact state of the operations cannot be observed externally. Second, the power consumed by these additional dummy operations adds to the DPA-relevant PN. Third, the generation of pseudo-random numbers adds further activity to the circuit.
Since all datapath units within Acacia are constantly active, it is not possible to distinguish which parts of the datapath are processing data that is part of the cryptographic operation. In fact, at any given time Acacia might be processing only randomly generated data. In this way, dummy operations can be inter-mixed with real cryptographic operations seamlessly without an apparent change in behavior.
As an example consider one of the David units. It is assigned a 32-bit value by Goliath. It will continue to process this data without really knowing whether or not the data is part of the cryptographic operation or a randomly generated bit stream. After data has been processed Goliath will either continue using the output of David, or will silently ignore it in case the input data was dummy.
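The dispatch of real and dummy words can be sketched as follows. The XOR datapath stands in for David's actual transformations, and the 50% dummy-insertion rate is an arbitrary illustration; only the asymmetry of knowledge (Goliath tracks authenticity, David never sees it) reflects the text.

```python
import random

def goliath_assign(real_words, rng):
    # Interleave real 32-bit words with randomly generated dummy words.
    # Only Goliath records the is_real flag; David never sees it.
    schedule = []
    for word in real_words:
        while rng.random() < 0.5:                 # 0 or more dummy words
            schedule.append((rng.getrandbits(32), False))
        schedule.append((word, True))
    return schedule

def david_process(word):
    # David applies the same transformation regardless of authenticity.
    return word ^ 0xDEADBEEF                      # placeholder datapath

rng = random.Random(42)
schedule = goliath_assign([0x01234567, 0x89ABCDEF], rng)
results = [(david_process(w), real) for w, real in schedule]
kept = [r for r, real in results if real]   # Goliath silently drops the rest
```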

4.2.2   Operation Re-Ordering

An AES architecture that does not process all 128 bits in parallel has to execute the same AES operation multiple times. In a standard implementation, the data is more often than not processed in a fixed order, e.g. the least significant portion of the data first, followed by higher-order bits. Any operation targeted by a DPA attack would then always occur at the same time. If there is no data dependency between the operations, their execution order can be changed randomly to make it more difficult to predict when a certain operation is performed.
Consider the MixColumns operation in David. To be able to compute a 32-bit MixColumns operation, the results of four 8-bit SubBytes operations are needed as described in section 3.3.4. Using the same notation as in eq. 3.1, let us denote the outputs of the SubBytes operations S0, S1, S2 and S3 (neglecting the column index c for clarity). In David there are two parallel datapath units that can be used to calculate Si and Sj at the same time. Theoretically David could calculate the four outputs in two clock cycles. Since the operations are independent, they can be calculated in any order. While there are 24 possible orders in which the operations can be processed, the probability that a given SubBytes operation will be processed in a given cycle is 0.5.
David has been modified in a way that enables any combination of the two datapath units to process random data. This adds more uncertainty as to when a specific SubBytes operation will be processed. At any given cycle David can choose between several methods (processing two pending SubBytes operations, processing only one, or processing dummy data alone) to determine what to do next.
David does not decide on a fixed method, but rather chooses between methods as appropriate at runtime. During normal operation each time David is assigned data to process, a target cycle count is determined. This target cycle count represents a loose guideline for the duration of the operation. The target count is continuously decremented, and the method decision is affected by the urgency to complete the operation. David is more likely to decide to use a permutation with two data words if the target cycle count is one and none of the four SubBytes operations have been completed, than choosing only one randomly. A simple register keeps track of the processed operations. The MixColumns operation will be executed only after all four SubBytes operations are processed.
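A minimal sketch of the re-ordering, assuming the simplest case where both units always process real data: the four SubBytes indices are issued in a random permutation, two per cycle, which reproduces the 24 possible orders and the 0.5-per-cycle probability quoted above. The function name is illustrative.

```python
import random

def schedule_subbytes(rng):
    # Issue the four SubBytes operations (indices 0..3) to the two parallel
    # units in a random order, two per cycle: 4! = 24 possible orders.
    pending = [0, 1, 2, 3]
    rng.shuffle(pending)
    return [pending[0:2], pending[2:4]]   # [cycle-1 ops, cycle-2 ops]

# Each operation lands in the first cycle with probability 0.5.
rng = random.Random(1)
hits = sum(0 in schedule_subbytes(rng)[0] for _ in range(10000))
```

Mixing in dummy-only cycles, as David actually does, stretches this two-cycle schedule and further flattens the timing distribution.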
Figure reorder_timing
Figure 4.3: Example operation flow of the three datapaths in Acacia. The figure shows one round of AES operations. Four possible re-ordering schemes are shown on the two David units.
In Goliath a very similar method is used. During one AES round four MixColumns operations have to be completed in any order before a new AddRoundKey operation can be performed. There are two instances of David that can process these MixColumns (and associated SubBytes) operations. Goliath assigns 32-bit data blocks to these David instances as soon as they are available. A similar target operation count method is used to choose between the three ordering methods described above. Goliath keeps track of which operations process cryptographic data and which ones process dummy data. However, David is not notified about the authenticity of the data it processes.
There is an important difference in the way this re-ordering is handled in Goliath. In David all sub operations are synchronous, and once an operation is assigned, the result is available within the next clock cycle. In Goliath this is not possible since the run time of David is not deterministic. This can lead to interesting situations where a David unit finishes ahead of the other David unit even though it was assigned new data later. This results in a slightly more complex control logic to accommodate for such cases.
Depending on the target operation count, Goliath can also assign several dummy AddRoundKey operations. This is also different from David, where the MixColumns operation is performed directly after all SubBytes operations are processed. The re-ordering operations are demonstrated in figure 4.3. In this simplified timing diagram, all three datapath units are shown performing one round of AES encryption. The diagram shows all four combinations of operation re-ordering on both David units (named David Perels and David Tschopp). The fastest results (and the least uncertainty) are obtained by using a permutation-based re-ordering with both datapaths. In normal operation Acacia does not use a single mode for re-ordering, but combines different methods.

4.2.3   GALS Modules

A successful DPA attack needs to identify and measure a specific operation within the AES algorithm. In a standard synchronous design, the power measurement traces clearly identify the clock signal. The attacker is therefore able to identify individual clock cycles easily. In most systems, the attacker is also able to reduce the clock rate (or manipulate the clock signal in different ways) to obtain better measurements.
As described in section 2.2, each GALS module contains an independent local clock generator that supplies the clock signal for its LS island. In a system that uses multiple GALS modules, there are multiple clocks that have no clear frequency phase relationship. This can be exploited to make DPA attacks more difficult.
The main problem with such an approach results from the data transfers between GALS modules. Two (or more) GALS modules are essentially synchronized while exchanging data with each other. Two GALS modules that need to exchange data at every clock cycle are practically synchronous with each other. To be resistant against DPA attacks, the GALS modules must be allowed to run without transferring data for several clock cycles. This aspect was the main criterion that determined the partitioning described in section 4.1.
As mentioned in section 2.2, there are two basic synchronization modes during data transfer between two GALS modules. When D-type port controllers are used, the LS island is suspended as soon as it attempts a data transfer and remains suspended until the transfer is completed. This can significantly reduce the dynamic power consumption of the module, but on the other hand presents the attacker with clear information on the state of the AES operation. If, on the other hand, P-type port controllers are used, the local clock is only briefly halted, if at all, once all communication partners are ready to transfer data. From a DPA security point of view, this is the more desirable situation.
In total, three new port controllers were designed for Acacia. A detailed description of these ports and their design criteria can be found in section 5.2.

Communication between Goliath and David

During an AES operation Goliath assigns a 32-bit part of its current state for processing to David. To reduce the number of data transfers between two modules, both modules exchange data at the same time. David transfers its last output to Goliath and records the new 32-bit input data.
Goliath uses a 3-bit control signal to configure David for different modes of operation²⁸. An additional 9-bit signal is used to determine the policy (see section 4.2.5) that will be used for calculating the next data.
During a normal AES round, each David requires at least 4 clock cycles to process data. While this is not overwhelmingly large, it offers a fair compromise between performance and security.

Communication between Goliath and Interface

Acacia provides a standard synchronous interface to communicate with the environment. Once the interface has received 128 bits of data from the environment it requests a data transfer. A specialized port controller has been developed to enable safe data transfers between Goliath and the interface. This port controller synchronizes the local clock of Goliath to that of the synchronous clock for the duration of the data transfer. Goliath and the synchronous interface exchange 128 bits of data simultaneously.
The interface provides a 7-bit control signal to Goliath to set the cipherkey length (128, 192 or 256 bits), the operation mode (encryption or decryption), and the security effort.

4.2.4   Variable Clock Periods

The local clock generator is an integral part of the GALS methodology developed at the IIS [OVG+02]. The clock generator is basically a ring-oscillator structure where the feedback path can be controlled by the asynchronous port controllers as seen in figure 4.4.
Figure local_clock
Figure 4.4: Block diagram of the local clock generator, with the detailed schematic of the delay line. The Muller-C element has been realized by four 2-input NAND gates. The configuration interface is not shown in the figure.
The local clock generators designed for previous projects [MVF00, OGV+03] have a programmable delay line so that the system could be tuned for optimum performance. The delay line as seen in figure 4.4 consists of a series of identical delay slices. As long as a delay slice is enabled by setting the DelCntrlxSI signal high, the clock signal propagates through the AND gate. As soon as one delay slice is disabled by setting DelCntrlxSI low, the forward propagation is halted and the reverse path is activated by the AND-OR gate. The fastest clock frequency (the shortest delay through the delay line) is obtained by disabling all delay slices, and the slowest clock frequency is obtained by enabling all slices; in this case the input signal propagates through all delay elements and returns back. In all of these designs the clock frequency was configured during startup and was not changed dynamically.
Stephan Oetiker has modified the local clock generator for Acacia in a way that the clock period can be changed every cycle by reading a new value into the configuration register that controls the switches²⁹. Both David and Goliath use the same local clock generator type that has a delay line consisting of 15 identical delay elements. Each delay element adds roughly 550 ps of delay when activated. There is an additional fine delay element, seen on the far left side of figure 4.4, that can add 250 ps of delay to the delay line. This configuration allows for 30 different delay settings, and the delay line is programmable between 4 and 12.5 ns with nearly 0.25 ns accuracy.
A 6-bit random number is used by David and Goliath to determine the next clock period. Under no circumstances should the configuration result in a period that is less than the critical path of the module. The exact value for the critical path of the circuit is typically known very late in the design process. Furthermore, in Acacia, the same clock generator type is used for two datapaths that have different critical paths. An individual minimum value for the clock delay is stored in a configuration register of each module. The random value obtained from the LFSR is added to this minimum value to determine the clock period.
As described in section 4.2.5, Acacia is able to trade-off between throughput and security. The variation of the clock period is one of the parameters that is controlled by the security effort. The high order bits of the random number can be masked to reduce the random distribution of the clock period. In the most extreme case, the random number is completely masked and the clock period is equal to the minimum period as determined by the configuration register.
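The period selection can be modelled as follows. The 4 ns base period, 0.25 ns step and 12.5 ns ceiling are taken from the delay-line figures above; the exact encoding of the configuration register and the clamping behaviour are assumptions for the sake of the sketch.

```python
def next_clock_period(min_setting, rand6, mask_bits,
                      base_ns=4.0, step_ns=0.25, max_ns=12.5):
    # Mask the high-order bits of the 6-bit random value according to the
    # security effort, add the result to the module's configured minimum
    # delay setting, and map the setting to a clock period in nanoseconds.
    mask = (1 << mask_bits) - 1          # mask_bits = 0 kills all randomness
    setting = min_setting + (rand6 & mask)
    return min(base_ns + step_ns * setting, max_ns)

# Fully masked random value: the period is pinned at the configured minimum.
pinned = next_clock_period(8, 0b111111, 0)
```

Setting `mask_bits` between 0 and 6 trades temporal uncertainty against throughput, which is exactly the lever the security effort pulls.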
The local clock generator has an additional configuration interface controlled by an external synchronous clock that allows a number of parameters to be set during operation. Most notable is the ability to switch the local clock to the external synchronous clock, effectively converting the GALS system to a fully synchronous system for test purposes. This mode is intended only for scan testing and not for normal operation.
In most attacks against cryptographic hardware, the attacker is able to control the clock signal of the system. This allows the attacker to run the system at a frequency at which measurements can be made more easily³⁰. An additional advantage of using local clock generators is that the attacker is not given any chance to influence the operation speed of the device.
Figures 4.5 to 4.12 illustrate the countermeasures step by step. One round of AES encryption is given in figure 4.5. All operations (with the exception of ShiftRows) are executed in a separate clock cycle. For illustration purposes it is assumed that the attacker is targeting the ninth SubBytes operation. In this case, the attacker will always target the twelfth clock cycle for his attacks. Note that for a successful attack, the attacker needs to make thousands of measurements.
Inserting dummy operations as shown in figure 4.6 changes the occurrence of the targeted operation. However, the occurrence is only delayed. In our example, the probability of observing the operation in cycle twelve has been reduced. The operation clearly cannot be observed before the twelfth clock cycle, but depending on the number of additional dummy operations inserted before it executes, it can be delayed arbitrarily. If dummy operations are inserted with the same probability everywhere, operations that come later in the dataflow will be delayed more. This is not really helpful, as there is no telling which operation the attacker will target; in other words, the security depends on the weakest link (in this case the first operation). Reordering operations as shown in figure 4.7 helps in this regard, since the occurrence of the targeted operation can be moved to practically any location within the round.
Parallelization as seen in figures 4.8 and 4.9 helps increase the activity within the circuit, but the real improvement comes with the modification shown in figure 4.10: all operational units are continuously supplied with data, and when the output of one unit is not required, random input data is used. Figure 4.11 shows the partitioning into GALS modules, while figure 4.12 illustrates the random clock period approach.
Figure timing_solo
Figure 4.5: Simplified timing diagram showing one round of the AES. To demonstrate the efficiency of the countermeasures, one SubBytes operation has been targeted.
Figure timing_dummy
Figure 4.6: Inserting dummy operations randomly will confuse the attacker by delaying the occurrence of the targeted operation. However, the run time will increase.
Figure timing_reorder
Figure 4.7: Independent operations can be reordered. In this example, both the MixColumns operations within one round and the SubBytes operations required for a particular MixColumns operation are reordered.
Figure timing_parallel1
Figure 4.8: It is possible to parallelize the MixColumns operations. In this case, two MixColumns operations can be executed simultaneously, as soon as their input dependencies are resolved.
Figure timing_parallel2
Figure 4.9: The parallelization can be taken one step further. In this example, for each MixColumns operation, the data is provided by two SubBytes units executing in parallel. In total the system has four SubBytes units.
Figure timing_parallel3
Figure 4.10: Instead of using the parallel datapath units only when they are needed, the system can be modified so that it executes dummy operations when a result is not scheduled.
Figure timing_gals3
Figure 4.11: The operations as they are organized in Acacia. The MixColumns/SubBytes operations are executed in two David units that run in parallel, while a single Goliath unit executes the AddRoundKey and ShiftRows operations. Each unit is a stand-alone GALS module.
Figure timing_gals4
Figure 4.12: The operation is further randomized by changing the local clock generator period from cycle to cycle.

4.2.5   Security Effort

The countermeasures designed to introduce temporal uncertainty in Acacia are able to do so at the expense of the overall throughput. The variation of the clock period in individual GALS modules can only be achieved by making the module operate slower (which reduces the throughput), since the clock period must be at all times longer than the critical path in the module. Similarly, dummy operations reduce the throughput as they are performed at the expense of cryptographic operations. There is an inherent trade-off between throughput and the amount of effort undertaken to thwart DPA attacks.
All operations in an AES process are vulnerable to a DPA attack, but it is far easier to attack the first and the last round. The reasoning is simple: a successful DPA attack requires a hypothetical power model for all permutations of the attacked subkey. Developing the model for later rounds is increasingly difficult, as these depend on the previous rounds³¹. The complexity of the power model increases with each round until the halfway point. After that, a model based on the outputs of the system instead of the inputs can be used, and the complexity of the model decreases with increasing round number. Attacking the last round is (almost) as easy as attacking the first round.
Acacia uses a simple policy system to tune its DPA effort according to the round. Using this system, the first round is executed with all DPA countermeasures at their maximum settings to foil attacks as much as possible. As the rounds progress, the effort is reduced: fewer dummy operations are performed and the clock period is varied less than before. During the middle of the process, the GALS modules are clocked at their maximum frequency and no dummy operations are executed. The DPA-related effort is then ramped up again to protect the later stages of the process. The last round effectively has the same level of countermeasures as the initial round.
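The round-dependent effort, high at both ends and minimal in the middle, can be sketched with an illustrative linear ramp. The actual policy values used in Acacia are not reproduced here; only the shape of the schedule follows the text.

```python
def security_effort(round_idx, n_rounds, max_effort=3):
    # Illustrative ramp: full effort at the first and last round, none at
    # the midpoint, linear in between. Shape only; values are assumptions.
    half = (n_rounds - 1) / 2.0
    return round(max_effort * abs(round_idx - half) / half)

profile = [security_effort(r, 10) for r in range(10)]   # e.g. 10-round AES-128
```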
Mode  Description                             Run time [ns]   norm.
00    As fast as possible                           114,400   1
01    Using slightly random countermeasures         165,320   1.44
10    Using highly random countermeasures           616,240   5.38
11    Using pre-programmed policy                   203,840   1.78
Table 4.1: Simulation results comparing the run time of Acacia using different operation modes. The simulation uses a 50 MHz interface clock and measures the time required to compute a mix of 100 en/decryptions.
Acacia uses four basic operation modes. The first one is a high-speed mode that turns off any throughput reducing DPA countermeasures. There are two modes that use the same level of random countermeasures for all rounds and a fourth mode that effectively uses the policy system described above. The 'security profile' of Acacia can be changed using the configuration interface during initialization. Table 4.1 presents simulation results that show the runtime required for a wide range of en/decryption operations. The pre-programmed policy has the same level of random operations during its initial and last stages as the highly random configuration, and is therefore just as secure, but is more than 3 times faster.
Figure hist03
Figure 4.13: Distribution of a specific SubBytes operation of the first round of an AES encryption over 2,000 encryptions using different policies: a) as fast as possible, b) small random variation, c) large random variation, d) pre-programmed policy.
A typical DPA attack would target one of the SubBytes operations within the first round of an AES encryption. In Acacia, depending on the operational mode used, it may be executed at different times. Figure 4.13 shows the distribution of a specific SubBytes operation over 2,000 different encryptions for the different operation modes.
When configured to run as fast as possible, no dummy operations are inserted and the GALS modules are configured to run at their fastest rate. There are two distinct peaks visible in this operation mode (figure 4.13a). This is the result of the operation re-ordering described in section 4.2.2, which evenly distributes the 32-bit operations among the two available David units. Since each David unit is capable of processing 32 bits at a time, there are two distinct time slots available for any given operation. The additional uncertainty is a result of communication overhead between independently clocked domains.
The operational mode where all encryption rounds execute slightly random countermeasures shows a much better distribution as seen in figure 4.13b. An even better distribution can be obtained if all rounds execute highly random-countermeasures (figure 4.13c). However, as shown in table 4.1, this mode increases the run-time of a single encryption significantly. The last distribution shown in figure 4.13d is a result of the pre-programmed policy that dynamically adjusts the security effort between encryption rounds. In this mode, the distribution is similar to the one obtained from executing highly random countermeasures, but the run time is comparable to that of slightly random countermeasures.

4.3   Realization and Results

4.3.1   David

David is essentially a 32-bit datapath that consists of the SubBytes and MixColumns operations (and their inverses) of the AES cipher. SubBytes is by far the most demanding operation of AES, in terms of both area and operation speed. In an architecture supporting both encryption and decryption, it is more efficient to divide the operation into an affine transformation and a multiplicative inverse in GF(2^8), as described in section 3.3.2. In David, operations are divided into two groups. One group consists of two instances of 8-bit look-up tables that realize the multiplicative inverse function (shown on the left in figure 4.14), and the other group combines the MixColumns operation and the affine transformation (shown on the right in figure 4.14). This organization creates a more balanced critical path through both groups.
When used as part of an encryption round, David uses at least two clock cycles to calculate four multiplicative inverse operations and then, in a single clock cycle performs the affine transformation and the MixColumns operation. In the decryption mode the same operations are executed in reverse order. David processes the InvMixColumns and the inverse affine transformation in the first clock cycle, and requires at least two clock cycles to process four multiplicative inverse operations.
The last round of an AES transformation does not include the MixColumns or its inverse operation. David is designed to skip the MixColumns and its inverse upon request. This is equivalent to performing a 32-bit SubBytes operation. This operation is required during the key expansion as well.
The DPA countermeasure effort of David is mainly determined by the Policy input. This input consists of two parts:
A four-bit register is used to keep track of which multiplicative inverse operations have been completed. When the TargetCount is high, random operation modes are more likely to be executed. As the TargetCount decreases, depending on how many multiplicative inverses are still left, more deterministic decisions are taken. As an example, if the TargetCount indicates that there are two cycles left, and only the third multiplicative inverse has not yet been calculated, David will schedule the third byte with 50% probability. However, if it has still not been calculated by the next cycle, the TargetCount will indicate that there is only one cycle left, and the third multiplicative inverse will definitely be scheduled in this cycle33.
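The decision process described above can be sketched as a small software model. This is an illustration of the policy, not the thesis RTL; the function name pick_inverse and its exact interface are assumptions, while the 50% probability and the deadline behavior follow the example in the text.

```python
# Illustrative sketch of David's TargetCount-guided scheduling:
# with slack remaining, real operations are scheduled randomly;
# once TargetCount drops to the number of outstanding inverses,
# choices become deterministic so every byte meets the deadline.
import random

def pick_inverse(done, target_count):
    """done: 4 flags marking completed multiplicative inverses.
    Returns the byte index to schedule, or None for a dummy operation."""
    pending = [i for i in range(4) if not done[i]]
    if not pending:
        return None                      # all real work finished: dummy op
    if target_count <= len(pending):
        return pending[0]                # deadline pressure: deterministic
    # slack remains: schedule a real byte only with 50% probability
    return random.choice(pending) if random.random() < 0.5 else None
```

With two cycles left and only byte 3 outstanding, the model schedules it with 50% probability; with one cycle left, it is scheduled unconditionally, matching the example above.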
David includes an 89-bit LFSR to provide random data for DPA related operations. The circuit area of David (183,007 µm²) is nearly twice that of Matthias (93,131 µm²), which is built in the same way but serves as a reference and does not include any DPA countermeasures34.
Figure david
Figure 4.14: Block diagram of David. The control structure is simplified for clarity.

4.3.2   Goliath

Goliath is essentially a 128-bit parallel implementation of the AES cipher that outsources the SubBytes and MixColumns operations to two David datapaths. The ShiftRows and AddRoundKey operations are the least demanding operations of the AES cipher and can be realized without significant overhead. The block diagram of Goliath shown in figure 4.15 is very similar to that of David shown in figure 4.14.
The datapath is divided into two main groups of operations. The part shown on the left hand side of figure 4.15 is used to communicate with the two David instances. A 32-bit part of the current state is selected and transferred to an output register so that it can be sent to David for further processing. The input from David is routed back to the state holding 128-bit register. The second part, seen in the middle of figure 4.15, is used to execute the ShiftRows and AddRoundKey operations. During a given clock cycle only one of these groups is active.
Figure goliath
Figure 4.15: Block diagram of Goliath. The control structure is simplified for clarity.
The main operation of Goliath is determined in a similar manner to David. For each cryptographic operation, a 2-bit value determines the overall Policy. For each round of operation, a round policy is derived from the round number and the overall policy. This round policy in turn sets the mask value that controls the period of the local clock, and assigns policies to the two David units.
Comparable to the TargetCount in David, two separate counters are employed in Goliath. AddCount is used to add a random number of dummy cycles prior to calculating the AddRoundKey and ShiftRows operations. Unlike David, where the start of an operation is not easy to determine, the initial operations of Goliath are easy to detect. In order to protect the first few operations, a random number of dummy operations is inserted. The ColsCount counter is similar to the TargetCount of David and determines the scheduling of data on the two David units. The ColsCount is not decremented every cycle, but after every successful transaction between David and Goliath.
The scheduling control for Goliath is somewhat more involved as well. Once a 32-bit word is scheduled on a David unit, Goliath has no reliable information on when to expect the result back. Therefore, two separate 4-bit registers are implemented. The first register keeps track of which 32-bit words have been scheduled, and the second keeps track of the received results. One round is concluded as soon as all four 32-bit words have been processed.
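The bookkeeping described above can be sketched as a small model. The class name RoundTracker is hypothetical; colassigned and coldone are the register names that appear in the simulation waveforms of section 4.3.7.

```python
# Sketch of Goliath's round bookkeeping: one 4-bit register records
# which 32-bit words were handed to a David unit, another records
# which results came back; the round ends when all four results
# have been received.
class RoundTracker:
    def __init__(self):
        self.colassigned = 0b0000   # words sent to a David unit
        self.coldone = 0b0000       # results received back

    def schedule(self, col):
        """Mark 32-bit word `col` (0..3) as handed to a David unit."""
        self.colassigned |= 1 << col

    def receive(self, col):
        """Mark the result for word `col` as received."""
        self.coldone |= 1 << col

    def round_complete(self):
        return self.coldone == 0b1111
```

Two registers are needed because results may return in any order and after an unpredictable number of cycles, so "assigned" and "done" cannot be tracked by a single counter.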
Goliath also includes the key generator that implements the key expansion algorithm. This block seen on the right side of figure 4.15 provides the roundkeys used by the AddRoundKey operation. Acacia supports all three standard AES cipher key lengths of 128, 192 and 256 bits. Rather than calculating the roundkeys on the fly, the roundkeys are calculated and stored in memory. In this way, additional side-channel information leakage due to continuous roundkey calculation is prevented.
A 32-bit wide SRAM memory serves as a roundkey storage due to practical limitations. Access bandwidth is not a problem during the key expansion phase where only 32-bit operations are used (see section 3.3.5). However, during normal AES operations a 128-bit roundkey is required. The roundkey generator requires four clock cycles to 'assemble' this roundkey. Once Goliath consumes a roundkey, the next roundkey is automatically requested. This is sufficient to fetch a new roundkey before the next AddRoundKey operation for all but the last round. Once the last round operation is performed, it is assumed that the AES mode will not change from encryption to decryption (or vice versa) for the next operation and the corresponding roundkey is fetched. If a new cipherkey is used, or the AES mode is changed between operations, the roundkey generator will be caught with the wrong roundkey, and a new roundkey will have to be fetched. Operation is stalled until the correct roundkey is available.
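The speculative roundkey prefetch can be summarized in a short sketch. The function and parameter names are illustrative; only the four-cycle assembly time over the 32-bit SRAM port is taken from the text.

```python
# Sketch of the speculative roundkey prefetch: after the last round,
# the key for the same AES mode is fetched in advance. A mode change
# or a new cipherkey invalidates the prefetched key, and the datapath
# stalls while the correct 128-bit roundkey is assembled from the
# 32-bit SRAM (four accesses).
def stall_cycles(prefetched_mode, next_mode, key_changed, assemble_cycles=4):
    """Cycles the datapath stalls before the first AddRoundKey of the
    next operation, under the prefetch policy described above."""
    if key_changed or prefetched_mode != next_mode:
        return assemble_cycles   # wrong key prefetched: refetch and wait
    return 0                     # speculation correct: no stall
```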
The key expansion algorithm that generates the roundkeys, and the actual AES operations are not performed at the same time. It was decided to use one David instance to realize the 32-bit SubBytes function that is required by the key expansion routine. While this organization saves circuit area by re-using existing datapath elements, it requires a more complex selection hardware to be able to route the signal to and from the David datapath as well. It was later discovered that the chosen solution was far from ideal.

4.3.3   Interface

The AES algorithm operates on 128-bits of data and therefore requires a large number of inputs and outputs. However, the I/O bandwidth can be reduced in cases where an output is not calculated every clock cycle. The interface of Acacia uses separate 16-bit data inputs and outputs to:
An 8-bit control bus is implemented to select between the above mentioned operations. The interface also acts as a buffer between the asynchronous communication of the GALS core and the synchronous environment. A specialized port controller is used to ensure safe data transfers between the interface and Goliath. As a result, the asynchronous behavior of the GALS modules is not directly observable. Depending on the specific interface clock rate and the selected operation mode, the result may appear at the output after a different number of (interface) clock cycles. Therefore, all input and output words have dedicated handshake signals.

4.3.4   Random Number Generation

Most DPA countermeasures rely on the availability of random numbers to introduce unpredictable changes in the circuit behavior. Generating truly random numbers in hardware is an involved task that will not be explored in this thesis. Contrary to 'real' random number generators, pseudo random number generators (PRNG) are relatively simple circuits that can be built using standard digital design techniques. In essence, such PRNGs generate a very long sequence of numbers that show properties similar to that of a true random generator.
Figure lfsr
Figure 4.16: LFSR that realizes the polynomial P(x) = x^5 + x^2 + 1. This LFSR has a period of 2^5 - 1 = 31.
The Linear Feedback Shift Register (LFSR) is probably the best known and most widely used PRNG as it has reasonable properties and is extremely easy to implement in hardware. An LFSR of degree m is a shift register that consists of m flip-flops. The next input value of this shift register is obtained from the present outputs by linear XOR operations as seen in figure 4.16. The feedback function is frequently expressed as a polynomial in the form

P(x) = x^m + c_{m-1}x^{m-1} + ... + c_1x + 1
(4.1)
where c_i ∈ {0,1}. The period of the sequence generated by the LFSR depends directly on the polynomial used. LFSRs that use 'primitive polynomials' generate maximum length sequences with a period of 2^m - 1. An excellent description of the LFSR and a list of primitive polynomials for different values of m can be found in the Handbook of Applied Cryptography [MvOV96]35.
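A minimal software model of the LFSR of figure 4.16 illustrates the maximum-length property; this is a sketch for illustration, not hardware from the thesis.

```python
# Sketch: the LFSR of figure 4.16 for P(x) = x^5 + x^2 + 1.
# Register cells D1..D5 shift by one position per step; the new D1
# is D2 XOR D5, matching the recurrence shown in table 4.2.
def lfsr_step(state):
    """state is a list of 5 bits [D1..D5]; returns the next state."""
    d1, d2, d3, d4, d5 = state
    return [d2 ^ d5, d1, d2, d3, d4]

def period(seed):
    """Count steps until the register revisits its seed value."""
    state, steps = lfsr_step(seed), 1
    while state != seed:
        state = lfsr_step(state)
        steps += 1
    return steps

print(period([1, 0, 0, 0, 0]))  # 31 = 2^5 - 1: a maximum-length sequence
```

Any nonzero seed yields the same period of 31, since the primitive polynomial guarantees a single maximum-length cycle through all nonzero states.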
In this form the LFSR generates a single output bit per clock cycle. There are several alternatives to increase the number of random bits per clock cycle. Implementing n LFSR units in parallel, or clocking a single LFSR n times faster, would result in n random bits. Both of these solutions are costly in practice. Since the feedback function is linear, it is also possible to compute the state the LFSR would reach after n cycles directly, within a single clock cycle. Table 4.2 shows the first five iterations of the LFSR shown in figure 4.16. The circuit corresponding to P(x)^n is more complex than that of P(x), but it produces n output bits per clock cycle. Since more bits are read out in parallel, the maximum length sequence decreases by a factor of n as well.
          Cell 1          Cell 2         Cell 3      Cell 4   Cell 5
Initial   D1              D2             D3          D4       D5
P(x)      D2⊕D5           D1             D2          D3       D4
P(x)^2    D1⊕D4           D2⊕D5          D1          D2       D3
P(x)^3    D2⊕D3⊕D5        D1⊕D4          D2⊕D5       D1       D2
P(x)^4    D1⊕D2⊕D4        D2⊕D3⊕D5       D1⊕D4       D2⊕D5    D1
P(x)^5    D1⊕D2⊕D3⊕D5     D1⊕D2⊕D4       D2⊕D3⊕D5    D1⊕D4    D2⊕D5
Table 4.2: Result of the first five iterations of P(x) = x^5 + x^2 + 1 in terms of the initial values of the LFSR. In the last case, P(x)^5, five bits can be read out during each clock cycle.
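The iteration in table 4.2 can be reproduced by stepping the same recurrence symbolically, tracking each register cell as the set of initial bits whose XOR gives its current value. This is a sketch for illustration.

```python
# Sketch: iterate the LFSR of figure 4.16 symbolically to reproduce
# table 4.2. Each cell holds a frozenset of initial bits {D1..D5};
# XOR of expressions corresponds to the symmetric difference of the
# sets, and stepping applies the recurrence new D1 = D2 XOR D5.
def step_symbolic(cells):
    d1, d2, d3, d4, d5 = cells
    return [d2 ^ d5, d1, d2, d3, d4]  # ^ on frozensets = symmetric difference

cells = [frozenset([f"D{i}"]) for i in range(1, 6)]
for n in range(1, 6):
    cells = step_symbolic(cells)
    print(f"P(x)^{n}:", [sorted(c) for c in cells])
```

Each printed row matches the corresponding row of table 4.2; the hardware for P(x)^n simply implements these XOR sums as combinational logic.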
A careful examination of the iteration reveals the following: primitive polynomials whose middle terms have high exponents incur less overhead during iteration. As an example consider the polynomial P(x) = x^203 + x^8 + x^7 + x + 1: when iterated to produce a parallel output of 200 bits, the synthesized circuit in a 0.25 µm technology occupies an area of 313,275 µm². Less than 13% of this area is occupied by the flip-flops; the rest is occupied by the combinational circuit required for the iteration. If the polynomial P(x) = x^212 + x^105 + 1, which is of a higher degree, is iterated to generate a 200-bit output, the resulting circuit area is only 50,355 µm², where 65% of the area is occupied by flip-flops. This second implementation is more than a factor of 6 smaller.
There are a number of places where random numbers are required in Acacia. Both David and Goliath consist of parallel datapath units of which only one is active in any clock cycle. In order to conceal which part of the datapath is processing the original cryptographic data, all other portions of the datapath are supplied with random data. This accounts for most of the random bits required during each clock cycle. In addition, a limited number of random bits is required by the state machines to make flow control decisions.
In order to get completely uncorrelated data, the random bits are only used once in Acacia36. Goliath requires 242 bits of random information per clock cycle. Two separate LFSRs are implemented to generate these bits. The first one uses the polynomial P(x) = x^124 + x^37 + 1 and is iterated for 124 bits, and the second one uses the polynomial P(x) = x^118 + x^33 + 1 and is iterated to produce an output of 118 bits. David requires 76 bits37 and uses a single LFSR that is based on the polynomial P(x) = x^89 + x^38 + 1. All LFSRs can be initialized to a given value during startup.

4.3.5   Reference Design

A standard synchronous reference design implementing the same architecture as Acacia was designed and included on the ASIC as well. The AES architecture in Acacia is different from the standard architectures used in other designs. For a purely synchronous implementation it is probably not the optimal architecture, as the arrangement requires several registers which would not necessarily be needed in an optimized design. Nevertheless, the same architecture has been realized in the reference design, in order to be able to compare the overhead required by the DPA countermeasures more directly. Several near-optimal implementations of the AES algorithm in the same 0.25 µm technology exist for comparing the efficiency of the reference design if needed.
The reference design consists of three modules, similar to the Acacia design. The reference equivalent of David is named Matthias, and the reference implementation of Goliath is named Schoggi38. They do not contain any additional logic for DPA countermeasures, and the interface between modules is optimized to reduce latency.
Schoggi has the same connections to the interface as Goliath. The user selects the cryptographic engine through the CmdxDI input. The clock for Schoggi is the same one that drives the interface.
An extensive comparison of the two designs can be found in table 4.3. The first observation is that the area overhead of GALS, even for this fine-grained partitioning, is rather small (around 5%). The reference design has an area that is less than half of that of the DPA version. However, this number is slightly misleading, as the synchronous reference design had to be optimized for a smaller area due to area constraints on the final chip. A speed-optimized synthesis run results in an area of 554,837 µm² for the reference design. The LFSR units used for DPA countermeasures account for roughly 15% of the total area (127,336 µm²). The remaining area difference between the two implementations is a result of the additional circuitry required to support the different DPA countermeasures.
                              Synchronous           GALS + DPA
                           Matthias   Schoggi     David    Goliath
Area (µm²) - LS              93,123   207,031   183,007    551,194
Area (µm²) - ClockGen          N.A.      N.A.     7,579      7,626
Area (µm²) - Ports             N.A.      N.A.     6,225     11,412
Area (µm²) - TOTAL              393,277             963,855
Critical path (ns)             5.43      5.84      3.98       5.27
Latency (cycles)                  3         1         4          2
Clock frequency (MHz)            170.96           250.8      189.6
Encryption round (cycles)             7               8          2
Encryption round (ns)           40.88               42.38
Table 4.3: Comparison of the GALS and synchronous implementations of the AES partitioning used in Acacia. All numbers given are synthesis results for a 0.25 µm CMOS technology. The GALS version was configured to run without delay inducing DPA countermeasures.
The second notable difference between the two implementations is that the GALSified datapath elements require an additional clock cycle (David requires 4 clock cycles compared to the 3 cycles of Matthias). In the best case this results in 30% more clock cycles per encryption round, where one Goliath (Schoggi) and two David (Matthias) operations are required. In the synchronous system, the clock period is determined by the longer of the critical paths of the two design units (in this case 5.84 ns). In contrast, in a GALS system it is possible to run both datapaths with different clocks. Theoretically (without taking the synchronization time losses into account) the GALS system is able to complete an encryption within 104% of the time required by the synchronous solution. This also has consequences for synthesis. Note that in the synchronous solution Matthias and Schoggi both have critical paths of similar length, while the critical path of David is noticeably shorter than that of Goliath. The top-down approach used in the synchronous solution finds no advantage in reducing the critical path of Matthias, since the critical path of Schoggi determines the operating frequency. In the GALS solution, a bottom-up approach is used, as each locally synchronous island is optimized separately. This approach results in better optimized sub-units.
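The 104% figure can be checked with a back-of-the-envelope calculation from the synthesis numbers in table 4.3, reading the per-round cycle counts as 7 cycles for the synchronous pair and 8 David plus 2 Goliath cycles for the GALS version (an interpretation of the table, with synchronization losses ignored).

```python
# Back-of-the-envelope check of the ~104% figure using the critical
# paths from table 4.3, ignoring synchronization losses.
sync_round_ns = 7 * 5.84               # Schoggi + two Matthias ops, 5.84 ns clock
gals_round_ns = 8 * 3.98 + 2 * 5.27    # David and Goliath at their own clocks
print(round(sync_round_ns, 2), round(gals_round_ns, 2))  # 40.88 42.38
print(round(gals_round_ns / sync_round_ns, 3))           # 1.037, i.e. about 104%
```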

4.3.6   Physical Implementation

Acacia was manufactured in a 0.25 µm 5-metal CMOS technology. While it is not the most recent technology, it is reasonably priced and is well supported at the IIS. As a result of a special arrangement with the MPW manufacturer, the usable die size for Acacia was fixed to 3.56 mm². This also placed a hard limit of 64 on the available I/O pads. In addition, it was decided to implement Acacia together with a second, unrelated student design for cost reasons39.
Figure acacia_photo
Figure 4.17: Photomicrograph of the chip. The asynchronous port controllers are placed between the LS islands. The student design on the left is unrelated to Acacia. The glue logic to share the pins is placed at the bottom of the chip.
A traditional back-end design flow does not have a sense of hierarchy: all components of the netlist are treated equally. In a hierarchical design flow, the design is partitioned into a hierarchy of sub-designs. A global floorplan is used for the entire system, but apart from that, each sub-design is treated separately. This allows parts of the system to be redesigned without affecting the rest of the system. Such a hierarchical design flow was used for Acacia. The floorplan of the design is shown in figure 4.17 and the design hierarchy is given in figure 4.18. At the top level of the hierarchy two designs, Acacia and the student design, are instantiated. This top-level hierarchy also contains the glue logic required to share the inputs and outputs of both designs. Acacia itself contains a further level of hierarchy that instantiates the GALS modules and the synchronous part, which includes both the reference design and the synchronous interface. Each of the GALS modules contains the self-timed wrapper and an additional hierarchy level that instantiates the stand-alone LS island. In this arrangement, the self-timed circuit elements are always on a different hierarchy level than the LS island. As seen in the floorplan, a small amount of space is reserved for the self-timed wrapper, so that the asynchronous port controllers, the local clock generator and the necessary glue logic can be placed and routed. The hierarchical design style is well suited to GALS design, and a commercial placement and routing tool was used without any difficulties.
Figure hierarchy
Figure 4.18: The design hierarchy.

4.3.7  Simulation Results

In this section, simulation results will be used to explain the operation of Acacia in detail. Three different simulation waveforms, each showing a different level of detail, are presented:
Figure modelsim_encryption_ann
Figure 4.19: Modelsim simulation showing handshake signals of Acacia during one round of encryption. The pre-programmed security profile can be clearly observed.
In the upper parts of the waveforms in figure 4.19, the handshake signals of the synchronous interface can be seen. In this simulation, the interface is clocked by a 50 MHz clock. The result of the last encryption is stored in the interface at the beginning. The interface supports 16-bit data transfers, and the result is clocked out in 8 steps. Directly after that, data for the next operation is loaded in 8 steps. At this point, the interface raises the g2s_Req signal and waits until Goliath is finished processing the current operation.
Once Goliath finishes processing the current operation, it acknowledges the g2s_Req signal by activating g2s_Ack. To be able to safely transfer data, the local clock of Goliath is halted briefly. This is the only time when one of the local clocks in Acacia is halted noticeably.
The simulation shows an encryption operation using 128-bit keys. This operation requires 10 encryption rounds. Although all rounds are identical in nature, by observing the Enc.Round signal it can be seen that the first and last few rounds of the encryption require a much longer time than the rounds in the middle. This is a result of the pre-programmed security effort described in section 4.2.5. The beginning and end of an AES operation are more vulnerable to DPA attacks, and the pre-programmed security effort increases the amount of random operations during this time.
The waveform in figure 4.19 also shows all three local clocks in Acacia. The average clock frequency can be observed to change between 100 MHz and 200 MHz together with the security effort. However, individual local clocks are not visibly influenced by the data transfers between modules.
The total time required for the encryption shown in the simulation is 1,100 ns (with reference to the synchronous clock), which corresponds to a throughput of over 116 Mb/s.
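The quoted throughput follows directly from the 128-bit block size and the measured encryption time:

```python
# Throughput of one AES encryption: 128 bits processed in 1,100 ns.
block_bits = 128
encryption_ns = 1100
throughput_mbps = block_bits / (encryption_ns * 1e-9) / 1e6
print(round(throughput_mbps, 1))  # 116.4
```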
Figure modelsim_one_round_ann
Figure 4.20: Modelsim simulation showing the main signals of Goliath and the handshake signals between David units during one round of encryption. The highlighted data items represent 'real' cryptographic operations.
The second waveform, in figure 4.20, shows only one round of encryption. The 128-bit register in Goliath holding the state is shown as four separate 32-bit values for clarity. One encryption round is made up of a random number of dummy addition cycles, followed by a real addition cycle, and a number of clock cycles in which 32-bit values of the current state are assigned to David units for further processing. The four-bit register colassigned is used to keep track of which parts of the 128-bit state have been assigned for processing. As Goliath receives the processed 32-bit words, it updates another four-bit register named coldone.
In the simulation, both David units are assigned one value at the beginning of the round. However, Goliath receives the result of David Tschopp one cycle prior to David Perels. The remaining two 32-bit words are assigned to David Perels, while David Tschopp executes dummy operations. These dummy operations require a random number of clock cycles. At the end of these cycles, the unit signals that it is ready for further processing, but continues to process random data until the request is acknowledged.
Figure modelsim_one_david_ann
Figure 4.21: Modelsim simulation showing two consecutive David operations. The first one is a rather fast operation whereas the second one is a much slower operation. The pink shaded areas represent random operations on the SubBytes function.
The third waveform shown in figure 4.21 shows two consecutive operations of a David unit and all signals of the associated d2g port controller in detail. The local clock of Goliath is shown for reference.
Initially, the David unit can be observed in a state where it is ready to accept data. Pen is enabled and the Req signal is active. As soon as Goliath is ready to accept data, it activates the Ack signal. At this time, the local clock needs to be paused. However, the clock pause request signal Ri arrives at the local clock generator just after the rising clock edge. Therefore, the clock pause acknowledge Ai is delayed until the falling edge. Since the handshake with Goliath finishes shortly after, the local clock is not stretched at all. The remaining two data transfers likewise complete without stretching the clock.
A David operation for a typical encryption round starts in a state where the unit awaits the completion of the data transfer. This is followed by a random number of cycles in which the 32-bit state is processed by two parallel 8-bit multiplicative inverse functions. After all multiplicative inverses have been calculated, a single cycle is used to compute the affine transformation and the MixColumns operation.
David uses a method similar to Goliath's in order to determine the ordering of the multiplicative inverse operations. The four-bit register imuldone keeps track of which part of the 32-bit state has already been processed. The signal imulremaining shows the number of remaining inverse multiplication operations. The targetcount is assigned by Goliath and, together with imulremaining, is used to guide the decisions on how to schedule the operations. The signal selop displays the actual decision taken during each clock cycle. In the first operation shown in figure 4.21, the inverse multiplication is finished in only three clock cycles. In the first clock cycle, two real operations are scheduled (two0). In the second clock cycle David chooses two operations randomly (rnd2). Coincidentally the two operations chosen had already been completed, so the result is not stored. Since the targetcount has already reached 0 by the third clock cycle, the remaining two operations are scheduled for the third clock cycle (two2).
The second operation takes much longer, both because more clock cycles are used and because the average clock period is longer. Since the targetcount is relatively high, David occasionally assigns a dummy operation. In this case, the inputs of both multiplicative inverse units are determined randomly. Note that the targetcount is not always met exactly: in the second operation the execution finishes one clock cycle earlier. This is not a bug; the targetcount is meant as a guide, not a deterministic target that has to be met.

4.3.8  Measurement Results

A total of 20 samples were received from manufacturing. One of the samples was damaged mechanically prior to any measurements. Extensive tests showed that all of the remaining 19 samples were functional.
Figure schoggi_power
Figure 4.22: Power consumption of Schoggi, the synchronous reference design, as a function of operating frequency (Vdd=2.5V).
First tests were conducted with the synchronous reference design called Schoggi. All samples were operational up to a clock frequency of 150 MHz. The power consumption of Schoggi for different operating frequencies is given in figure 4.22. The reference design consumes 160 mW while operating at a clock rate of 150 MHz. The throughput while running the AES-128 mode at this frequency is 164 Mb/s. Encrypting a 128-bit block of data requires an energy of 125 nJ.
The local clock generators used in Acacia contain a programmable delay line. The local clock frequency can be reduced by enabling an increasing number of delay slices, as explained in section 4.2.4. Measurement results of the local clock generator used in Goliath from two different chip samples are compared to post-layout simulation in figure 4.23. As can be seen from the figure, the local clock period can be programmed between 4.5 and 12.5 ns in linear steps of 0.25 ns.
The chip number 14 is the fastest and chip number 1 is the slowest of the manufactured samples. While the fastest sample matches the simulation results closely, the slowest sample is within 5% of the simulations.
Figure clockgen_per_best
Figure 4.23: Local clock generator clock period as a function of activated delay slices. The lines compare the expected simulation results with the measurement results from the fastest chip (#14) and the slowest chip (#1) (Vdd=2.5V).
Acacia contains three different local clock generators. Figure 4.24 shows the operating frequency of all three local clock generators of chip 1 (the slowest chip) as a function of the number of activated delay slices. Contrary to earlier GALS designs, the local clock generators in Acacia are not placed and routed manually. The standard cells that make up the local clock generator in Goliath are distributed over a larger area than those of David. This explains the frequency difference between the local clock generators of David and Goliath.
Figure clockgen_freq_jer
Figure 4.24: Local clock generator frequencies for all three GALS modules as a function of activated number of delay slices for chip #1 (Vdd=2.5V).
As part of the DPA countermeasures, the GALS modules in Acacia are able to determine a new clock period every clock cycle. However, the clock period cannot be reduced below the critical path of the LS island. The minimum allowable clock period for each local clock generator in Acacia can be set during the configuration phase. The measurement result shown in figure 4.25 is obtained by changing this setting for the minimum allowable clock period. Acacia is operated in mode 00, where the local clock generator is not slowed down by DPA countermeasures. The clock frequency given for the GALS modules is an approximate value, as individual GALS modules have slightly different clock frequencies. Furthermore, local clocks are occasionally paused during communication, which effectively slows them down. The synchronous I/O clock is set to 20 MHz for this measurement. Running with this clock, the synchronous interface consumes around 35 mW of power.
Figure acacia_power_vs_freq
Figure 4.25: Power consumption of Acacia as a function of maximum local clock frequency. The I/O clock is at 20 MHz, and Acacia is configured to run in mode 00 (as fast as possible). Note that the synchronous interface contributes around 35 mW to the power consumption (Vdd=2.5V).
The four different operating modes of Acacia (see section 4.2.5) result in different power consumption characteristics, as seen in figure 4.26. Increased DPA effort reduces the average clock frequency, which in turn results in lower power consumption. The figure also shows the effect of the I/O clock. A certain amount of power is consumed by the synchronous interface. The actual power consumption of the GALS modules is unaffected by changes of the I/O clock period. Therefore, it is possible to extrapolate the contribution of the synchronous interface from the curves in figure 4.26.
Figure acacia_power_opmode
Figure 4.26: Power consumption of Acacia running different modes as a function of the I/O clock (Vdd=2.5V).
A summary of measurement results is given in table 4.4. The measurements for the GALS modes show peak performance values; average case performance is around 10% lower. It can be seen that the pre-programmed policy represents a good compromise between performance and security. The pre-programmed policy, denoted as mode 11, is able to distribute operations over a wide range as shown in figure 4.13, comparable to that obtained by mode 10. Distributing operations over time increases the security against known DPA attacks. While mode 10 reduces the throughput and increases the energy per data item significantly, the pre-programmed policy has penalties comparable to mode 01, which introduces only limited countermeasures.
Mode          Clock [MHz]   Number of Cycles   Max. Time [ns]   Throughput [Mb/s]   Energy [mJ/Mb]
Acacia - 00            50                 36            720.0               177.7            1.232
Acacia - 01            50                 44            880.0               145.4            1.362
Acacia - 10            50                112          2,440.0                57.1            2.704
Acacia - 11            50                 46            920.0               139.1            1.198
Schoggi               150                117            779.2               164.2            0.976
Table 4.4: Throughput measurement results for different operational modes. The numbers for Acacia modes represent peak performance. Energy figures include only the cryptographic core, not the interface.



File translated from TEX by TTH, version 3.77.
On 20 Dec 2006, 15:44.