## Approach

### The main idea

The purpose of a hash collision search program is to find a pair of messages producing the same hash value. A naive approach would consist, of course, in searching the whole space of message pairs until a colliding one is found. With an n-bit hash function, this would require, roughly, 2^{(n/2)} hash evaluations (*birthday paradox*), which is of course infeasible with any current hash function. Instead of testing arbitrary pairs, existing approaches exploit the internal structure of the hash function, locating a special subset of message pairs which (1) are considerably more likely to collide than random pairs, and (2) can be efficiently enumerated.

### SHA-1

SHA-1 is a standard defined by the NIST in document FIPS 180-2. It is one of the most popular, and thus cryptanalized, existing cryptographic hash functions.

We refer to a restructured version of the SHA-1 algorithm which is especially suitable for cryptanalysis,
as described in **[1]**,**[2]**. In fact, during the computation the hash function mantains an internal state made of five variables, A, B, C, D, and E, that are combined together by means of a state-update function f:

Because of the particular "rotating" structure of f, this relationship can be replaced with a different function f’, allowing us to represent the state by means of a single variable A:

### Differential characteristics

The essential idea behind the attack to SHA is to constrain the values of the messages and registers in order to reach a collision with a certain probability. Such set of bit-level constraints on the difference of two messages is called "differential characteristic". A differential characteristic is organized into two sections. The *W* part defines the constraints imposed between the bits of the two messages in each position, while the *A* part contains the constraints forced on the internal registers during the hashing process. The constraints are expressed by a set of symbols: "1" and "0" indicate that both bits in the two messages must take on the values "1" and "0", respectively, "u" and "n" indicate a signed difference (1/0 or 0/1, respectively), "x" an unspecified difference, "-" two unspecified equal values.

The definition of suitable, high-probability characteristics is fundamental for the success of the attack.

### Two-block attacks

While the first attacks against SHA were performed using a single message block (512 bits), a common technique used today relies on a two-block approach in order to obtain better overall probabilities of finding a collision. The basic idea is that, instead of being constrained to converge to a zero difference, the messages can lead to a small difference *+d* after the first block, and an opposite *-d* after the second one. Because of the cumulative summing structure of the hashing process, this produces a final zero difference.

### The collision search

Once a characteristic has been built, the actual message search can be performed. This proceeds by trying pairs of messages whose difference complies with the constraints imposed in the fixed characteristic. The two messages in each pair are hashed until an incompatibility with the differential characteristic is found, or until a collision is reached at the last round. It is essential to enumerate all the available message pairs in the search space, to evaluate them exhaustively, possibly optimizing the search by pruning large subspaces as soon as it is possible to determine that they do not contain collisions.

To summarize, the overall collision search process can be divided into two main phases:

**Phase 1)**definition of a differential characteristic, ensuring a suitable search space and a certain value of probability for the success of the attack**Phase 2)**enumeration of message pairs compliant with the differential characteristic, until a collision is found

### Implementation

The first phase can be completely automated, but as of now, since the process is mostly heuristic, it appears that better results can be obtained if an expert manually controls the generation process to evaluate and re-focus the search on the most promising semi-finished characteristic. Moreover, compared to the second phase, the first is much less critical. So, a mere increase in computing power does not necessarily yield an improved characteristic.

The second phase, on the other hand, is highly critical in performance and can be automated by finding out efficient ways to enumerate the message pairs in the subset defined by the characteristic, and by testing them rapidly. Thus, this phase is very suitable to be implemented on high performance computing architectures. Also, because of its non-sequantial nature, this process can successfully exploit a high level of parallelism.

In particular, we found the Cell B.E. architecture to be a very good support for this phase. Using the *MariCel* cluster, located at BSC in Barcelona, we were able to obtain new promising results, improving on the state of the art.

### The Maricel Cluster

The Maricel cluster, hosted at the Barcelona Supercomputing Center, is a heterogeneous multi-core cluster based on 72 QS22 blades. Each QS22 includes 2x Cell/B.E processors at @3.2Ghz and 8 GBytes of RAM. All the QS22 are interconnected by a 4x DDR Infiniband network. The total number of cores is 1296 for a peak performance of 10 TFlops.

The Cell/B.E. is a heterogeneous processor composed of one main general purpose processor (PPU) and eight specialized cores (SPUs) with software-managed local stores. Each SPU only has direct access to its own local storage, so before any data computation may take place the involved data must be explicitly transfered from main memory to a local store. The data is transfered between local storages and main memory with asynchronous DMA operations managed by a specialized memory flow controller (MFC), thus the SPU is able to keep computing while up to 16 DMA operations are in-fly. The ability to issue asynchronous DMA operations between the local stores and the main memory enables an efficient overlapping of computation time and data transfer time, but at the cost of higher software complexity. The most widely used techniques to exploit this processor feature are the double buffering and the multi-threading schemes.

The figure shows the three basic components of the Cell processor. First, the PowerPC Processor Element (PPE), which is primarily intended to manage global OS resources. Second, the Synergistic Processing Elements (SPEs) that are specialized vector processors with a private local storage and a DMA engine, which can perform up to 16 asynchronous DMA transfers between the local storage and main memory at the same time. Finally, the communication between the PPE, the SPEs, main memory, and external devices is realized through an Element Interconnect Bus (EIB).

### New techniques

As mentioned above, the complexity of the problem relies almost entirely on performing the second phase efficiently. Therefore, several techniques have emerged during the last years, both to build better probability characteristics and to perform the search more efficiently.

In addition to the techniques already known, such as the early pruning of empty subspaces and the Amplified Boomerang Attack, we were able to introduce two new techniques (**[3]**), which significantly reduce the complexity of the problem. In particular:

**Constraint relaxation.**We discovered that in some cases the standard constraints that are forced in the characteristic are actually*too much*constraining. In fact, sometimes, a message pair could follow another - slightly different and similarly probable - path and yield equally valid results. This can also be seen as an extension of the original characteristic, that includes the original plus other "parallel" ones.**Inter-bit Constraints.**We found in our tests that past about the 70th round the usual technique for building differential characteristics starts to cause problems. Namely, it tends to not converge to an acceptable characteristic, i.e. one having good probability and also defining a space large enough to likely contain at least one solution. We found that it is convenient to integrate the characteristic-building process with new specifically aimed constraints among*groups*of bits, and not only equally positioned bits.