SHA-1 Engine: A dedicated FPGA-based machine for SHA-1 collision search
To better understand and demonstrate how the potential vulnerabilities of the SHA-1 algorithm can be exploited, we designed a dedicated hardware engine to speed up the collision search process.
The engine consists of a large number of inexpensive devices built around Field-Programmable Gate Array (FPGA) chips. Thanks to a parallel, highly optimized hardware architecture, each FPGA chip reaches a considerable computational power density, yielding a much higher speed/cost ratio than standard general-purpose platforms.
As shown in the figure below, the SHA-1 Engine is essentially a cluster of FPGAs stacked on top of one another. The FPGAs are divided into slave nodes and master nodes. Each master node contains a microprocessor-based system coordinating the operation of a subset of processing FPGAs. These, in turn, contain a controller along with a matrix of dedicated hardware cores, highly optimized for speed and low footprint, running the sub-characteristic search job.
A small-scale prototype of the cluster enabled us to quickly find a real collision for a 72-round reduced version of the hash function, the most advanced result towards breaking the full 80-round SHA-1 at the time of its discovery.
The cluster is currently still at the prototype stage, although we expect to demonstrate a fully fledged large-scale implementation very soon.
Details of the cluster architecture and application flow
The cluster builds directly on the advantages that modern reconfigurable technologies offer for cryptanalytic applications. In particular, we exploited FPGAs for algorithm and architecture exploration and for input-dependent system specialization, relying on the automated generation of Hardware Description Language (HDL) files from the input SHA-1 search characteristic.
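As a rough illustration of this characteristic-to-HDL step, the hypothetical Python sketch below turns a toy differential characteristic (encoded as per-round condition strings, with 'x' marking a bit position carrying a difference) into a VHDL package of constant masks. The encoding, function name, and package layout are illustrative assumptions, not the actual generator used in the cluster.

```python
# Hypothetical sketch of the automated HDL generation step: a toy
# characteristic (list of 32-character condition strings, one per round)
# is turned into a VHDL package of constant difference masks.
# The 'x'/'-' condition encoding is an illustrative assumption.

def characteristic_to_vhdl(characteristic, pkg_name="sha1_char_pkg"):
    lines = [f"package {pkg_name} is"]
    for rnd, cond in enumerate(characteristic):
        # A '1' marks a bit position that must carry a difference.
        mask = "".join("1" if c == "x" else "0" for c in cond)
        lines.append(
            f'  constant DIFF_MASK_{rnd:02d} : '
            f'std_logic_vector(31 downto 0) := "{mask}";'
        )
    lines.append(f"end package {pkg_name};")
    return "\n".join(lines)

toy_char = ["x" + "-" * 31, "-" * 32]
print(characteristic_to_vhdl(toy_char))
```

Generating the HDL from the concrete input characteristic is what lets the synthesized cores hard-wire the characteristic's conditions instead of evaluating them at run time.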
The figure below summarizes the application flow and the architecture of the cluster.
Based on the software tools already used for the collision search at the BSC, we first produce a differential characteristic, which is then analyzed by a module for the automated generation of HDL code (namely, VHDL), leading to the configuration of the different components of the cluster architecture. This automated generation applies a number of architectural optimizations, mostly depending on the probabilistic behavior of the characteristic.

The cluster architecture has three levels. A top-level Master node analyzes the first few rounds (e.g. the first thirteen) in order to produce more constrained characteristics, which are then sent to the Slave nodes. Most of the workload is concentrated beyond the first rounds, so the concurrent jobs dispatched to the slaves achieve an almost complete parallelization of the search process. The jobs, moreover, involve very little communication overhead, since only the initial search state for each job and any colliding messages need to be exchanged over the bus.

Within each slave node (a whole FPGA), a Controller component acts as the second level in the architecture. It performs further enumeration of intermediate rounds (e.g. round 14) and distributes the remaining search workload to a set of SHA-1 collision cores, which constitute the third level. We leave only round 15 to the enumeration on the third-level cores, provided that the available degrees of freedom ensure an execution time long enough to hide the synchronization overhead. This simplification allows further speed and area optimizations, enabling high throughput with relatively small-footprint cores.
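The three-level work split can be sketched schematically as nested enumeration, with only hits reported back up the hierarchy. The round indices, state widths, and hit predicate below are toy stand-ins, not the real search parameters.

```python
# A schematic (and heavily simplified) model of the three-level work split:
# the Master enumerates constrained states for the early rounds, each Slave
# controller refines one intermediate round, and each core exhausts the last
# free round. All sizes and the predicate are toy values.

def master_jobs(n_states):
    # One job per constrained characteristic produced by the Master.
    for state in range(n_states):
        yield state

def slave_subjobs(job, n_round14):
    # The slave Controller refines each job over an intermediate round.
    for v14 in range(n_round14):
        yield (job, v14)

def core_search(subjob, n_round15, predicate):
    # Each core exhausts the remaining freedom; only hits are reported
    # back, which keeps the communication overhead minimal.
    job, v14 = subjob
    return [(job, v14, v15) for v15 in range(n_round15)
            if predicate(job, v14, v15)]

# Toy run: "collisions" are triples whose sum is divisible by 7.
hits = []
for job in master_jobs(3):
    for sub in slave_subjobs(job, 4):
        hits.extend(core_search(sub, 5, lambda a, b, c: (a + b + c) % 7 == 0))
print(len(hits))  # → 7
```

The point of the split is visible even in this toy: the levels exchange only compact job descriptors downward and rare hits upward, so the cores run essentially uninterrupted.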
For the current prototype implementation of the cluster, we used a set of 20 inexpensive commercial off-the-shelf boards, namely Digilent Nexys-2 boards, each equipped with a Xilinx Spartan XC3S1200E FPGA. The interconnection is realized through an ad hoc inter-board bus. The Master node is implemented as a microprocessor system based on the Xilinx MicroBlaze core, while the Slave nodes are completely custom-made.
We paid particular attention to the modularity of the cluster, which supports easy extension with additional hardware. The extension port allows the hot-plug of additional modules and the software-supported reorganization of the search partitioning as new modules are plugged in.
This feature addresses the large run-times of the collision search process, by providing the possibility of dynamically extending the computing power of the cluster.
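One simple way to picture the software-supported repartitioning is an even re-split of the pending job space over the current set of boards; the sketch below is illustrative only and does not reflect the actual firmware logic.

```python
# An illustrative sketch of repartitioning on hot-plug: the pending job
# indices are re-split into contiguous, nearly equal ranges across the
# current set of nodes. Not the actual cluster firmware logic.

def partition(total_jobs, n_nodes):
    # Split job indices [0, total_jobs) into n_nodes contiguous chunks,
    # distributing the remainder one extra job per leading node.
    base, extra = divmod(total_jobs, n_nodes)
    ranges, start = [], 0
    for i in range(n_nodes):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

print(partition(10, 3))  # three boards
print(partition(10, 4))  # after one more board is plugged in
```

Because each job carries only its initial search state, re-dispatching ranges like this stays cheap even as boards come and go.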
The Controller also interacts with an on-board non-volatile memory for the checkpointing of the search jobs. An I/O interface managed by the Master node allows the external user to interact with the cluster during the search operation.
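A minimal checkpointing sketch, assuming a simple fixed-size record per job (job id plus last explored state); the real Controller writes to the on-board non-volatile memory, for which a plain byte string stands in here.

```python
# A minimal checkpointing sketch: each in-flight job is serialized as a
# fixed-size (job_id, last_state) record. The record layout is an assumed
# illustration; a bytes blob stands in for the non-volatile memory.

import struct

RECORD = struct.Struct("<II")  # (job_id, last_state), little-endian

def save_checkpoint(jobs):
    # Serialize the progress of every in-flight job.
    return b"".join(RECORD.pack(jid, state) for jid, state in jobs.items())

def load_checkpoint(blob):
    # Rebuild the job table so an interrupted search resumes where it
    # left off instead of restarting from scratch.
    return {jid: state
            for jid, state in (RECORD.unpack(blob[i:i + RECORD.size])
                               for i in range(0, len(blob), RECORD.size))}

progress = {0: 1234, 1: 99}
assert load_checkpoint(save_checkpoint(progress)) == progress
```

Fixed-size records keep the write path trivial for a hardware controller: resuming a job only needs its id and the last state reached.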
Compared to other hardware- and software-based solutions in the technical literature, our prototype cluster offers the highest speed/cost ratio currently available for SHA-1 collision search.