MD5 Collision Attack — SEED Security Labs

Swetha
12 min readAug 22, 2023

--

Hey all! It has been some months since I last posted:)
This blog is about a hash collision attack on MD5 and how to create two different programs that have the same MD5 hash value.

MD5 collision attack

In the early 1990s, the MD5 (Message Digest Algorithm 5) hash function emerged as a beacon of hope for digital security. Developed by Ronald Rivest, MD5 promised to provide a swift and reliable way to generate fixed-size hash values from arbitrary data, making it ideal for data integrity checks, digital signatures, and various authentication mechanisms. For a time, MD5 served its purpose dutifully, becoming an integral part of digital systems worldwide.

However, as technology advanced, chinks in MD5’s armor began to surface. Researchers unearthed vulnerabilities in the algorithm, ones that questioned its resilience against determined adversaries. Among these vulnerabilities lay the unsettling discovery of the MD5 Collision Attack.

The MD5 Collision Attack Lab stands as a virtual crucible, where researchers, security enthusiasts, and learners can delve into the heart of this cryptographic anomaly. Through this lab, one can understand the mechanisms that underlie the generation of colliding data pairs. With tools like “fastcoll, md5collgen” the lab simulates real-world scenarios, allowing participants to witness firsthand how vulnerabilities within the MD5 algorithm are exploited to create colliding files.

At the heart of this exploration lies a crucial lesson:

cryptographic systems are only as robust as the algorithms that underpin them.

LAB TASKS

Task 1: Generating Two Different Files with the Same MD5 Hash

The MD5 Collision Attack Lab involves creating a controlled scenario to demonstrate how an attacker can generate two different files with the same MD5 hash value, thus highlighting the vulnerability of MD5 to collision attacks. Here’s a step-by-step breakdown of the process:

  1. Create a prefix file using this command:
echo "hello" > prefix.txt

2. Run the fastcoll tool to create two bin files with the same hash. You can find the fastcoll tool here: https://github.com/brimstone/fastcoll

./fastcoll -p prefix.txt -o out1.bin out2.bin

3. Now, check the md5 checksums of the 2 files that were generated by the tool:

md5sum out1.bin out2.bin

We can see that the md5 checksums are the same. Now, let us look at what the contents of the 2 files are using this command:

xxd out1.bin > output1_hex
xxd out2.bin > output2_hex

diff output1_hex output2_hex

This attack creates two files with the same data but with different padding. The two files share the same prefix i.e., the remaining part of the two files are not the same. They are generated by the collision tool “fastcoll”. This tool generates two different files with the same prefix and same md5 checksums.

Some questions:

– Question 1. If the length of your prefix file is not multiple of 64, what is going to happen?

When working with hash functions like MD5, the input data is processed in fixed-size blocks. For MD5, each block is 512 bits (64 bytes) long. If the length of the input data, in this case, the prefix file, is not a multiple of 64 bytes, some form of padding is required to ensure that the data can be divided into these fixed-size blocks for processing. This padding is a crucial aspect of maintaining the integrity and consistency of the hashing process.

In the scenario where the length of the prefix file is not a multiple of 64 bytes, zeros will be appended to the remaining part of the data to fill the last block. This ensures that the data is extended to a length that is divisible by 64, aligning with the block-based processing of MD5.

— Question 2. Create a prefix file with exactly 64 bytes, and run the collision tool again, and see what happens.

With a 64-byte prefix file, no additional zero-bytes are appended for padding. The 64-byte file aligns perfectly with the MD5 block size of 512 bits (64 bytes), which means that there’s no need for any extra padding to ensure that the data fits into blocks for processing.

Upon running the collision tool, the behavior is as follows:

Prefix File: The 64-byte prefix file serves as the initial content for both of the generated colliding files. This prefix remains unchanged in both files, forming the basis for their similarity.

Random Data for Collision: To create the collision, the collision tool generates additional data that follows the 64-byte prefix. This generated data, which totals 128 bytes, is distinct for each of the colliding files. The inclusion of random data after the prefix is a crucial element in the collision attack, as it creates differences between the two files while still ensuring they share the same MD5 hash.

Appending and Hashing: The collision tool appends the generated random data to each of the colliding files, creating two files with different content beyond the initial 64-byte prefix. Despite these differences, the random data is structured in a way that exploits the vulnerabilities in the MD5 algorithm to ensure that the resulting hash values match.

— Question 3. Are the data (128 bytes) generated by md5collgen completely different for the two output files? Please identify all the bytes that are different.

While the differences between the two generated output files can vary based on the specifics of the collision technique, they usually involve small, strategically placed variations within the generated data. It could be just a few bits or bytes that are adjusted in a way that ensures the MD5 hash remains the same while the rest of the content remains similar.

Task 2: Understanding MD5’s Property

In this task, we will try to understand some of the properties of the MD5 algorithm.

How the MD5 algorithm works
  1. We need to create a suffix file using the command:
echo "suffix text" > suffix.txt

2. Append suffix to the binary file 1 and get the checksum:

cat out1_64.bin suffix.txt | md5sum
cat out2_64.bin suffix.txt | md5sum

The idea behind this task is to append separate suffixes to the two colliding files and subsequently observe the hash values produced by the MD5 algorithm. This experiment delves into the question of whether the carefully constructed collisions hold up even when additional data is introduced.

After appending the suffixes to the colliding files, the MD5 hash values are calculated for both modified files. The focus here is to observe whether the hash values still match despite the introduction of distinct suffixes.

Task 3: Generating Two Executable Files with the Same MD5 Hash

Nowadays, many online resources still use a MD5 checksum for integrity checking. However, due 2 to the broken collision-resistance property, an attacker can change these online resources without the awareness of the user. In this task, you will receive two PNG files and their corresponding checksums. Your goal is to change these files in such a way that a user who executes integrity checking cannot identify these changes. In other words, you should modify both files such that they have the same MD5 checksum. To this end, we will consider two scenarios for the execution of the attack. •

Scenario 1: The attacker can modify exactly one of both files. In this scenario, we assume that the attacker knows only one PNG file and can modify the picture without any restrictions beside that the (visible) content of the PNG should not change. On the other hand, the adversary does not have any chance to modify the other PNG file. The second PNG file can only be read and executed by the attacker. Thus, he knows its MD5 checksum.

Scenario 2: The attacker can modify both files. This scenario is common for binary executable files, when the attacker provides a developed program for security auditing before he gets a certificate/allowance to publish the software. Thus, the adversary can pass the audit and change the program to a malicious version afterwards. In this scenario, the attacker is allowed to modify the binary of both files to achieve his goal of having the same MD5 checksum. However, the (visible) content of both files should not be modified.

a) Which property of a secure hash function should be broken in order to be able to execute the attack? Elaborate on this question for each of the scenarios described above.

**Breaking the Preimage Resistance:**
The “preimage resistance” property stands as a fundamental pillar of secure hash functions, aiming to ensure that reversing the process of hashing to retrieve the original input is exceedingly difficult and practically infeasible. This property reinforces the concept that a hash value should not leak any information about the original input. In the context of the first scenario, the attacker already knows the MD5 hash value of one PNG file. To break the preimage resistance, the attacker embarks on a computational journey to uncover an input that, when hashed, produces the known MD5 hash. Upon discovering this elusive “preimage,” the attacker gains the power to manipulate another PNG file, utilizing the same preimage to ensure that both files yield the same hash. This manipulation, rooted in the preimage breakthrough, underscores the potency of the attack and the significance of the preimage resistance property.

  • *Breaking the Collision Resistance:**
    The “collision resistance” property addresses the challenge of finding two different inputs that result in the same hash output. A secure hash function should make it incredibly hard for an attacker to engineer such input pairs that collide, strengthening the overall integrity of the hash function. In scenario 2, the attacker’s aim is to shatter this collision resistance property. By modifying both files in a carefully orchestrated manner, the attacker seeks to generate a collision — a situation where both files, while distinct in content, share the same MD5 hash. The artistry of this attack lies in the ability to manipulate the files such that their differences are specifically designed to deceive the MD5 algorithm’s collision avoidance mechanisms. Importantly, the attacker ensures that the modifications do not alter the visible content of the files, making it even more challenging to detect the manipulation.

b) Is this property broken in case of the MD5 algorithm? Elaborate on this question for each of the scenarios described above.

The MD5 algorithm is vulnerable to preimage attacks that allow an attacker to find a preimage for a given hash.

The collision resistance property of the MD5 algorithm is considered to be broken. This means that it is possible to find two different inputs that produce the same MD5 hash output, which is known as a collision.

c) Elaborate on limitations and possibilities of executing such an attack for each of the scenarios.

Scenario 1: The limitations are that the attacker cannot change the visible content of the first PNG file, which means that the attack is limited to changing the hidden data in the file. Also, the attacker can only modify one PNG file, which limits the attack further. Another limitation is that the attacker must find a collision in the MD5 function which produces the same hash for both PNGs.

Scenario 2: The limitations of this attack are similar to Scenario 1. The attacker cannot change the visible content of the PNG files, which means that the attack is limited to changing the hidden data in the files. Additionally, the attacker must find a collision in the MD5 hash function that produces the same hash output for both PNG files.

I chose to go with scenario 2. Here is some information about the PNG file format.

File header: The PNG file format starts with an 8-byte header that identifies the file as a PNG file. The header consists of the following bytes:

137 80 78 71 13 10 26 10

The first byte is a magic number that identifies the file as a PNG file. The remaining bytes are control characters that ensure that the file is properly formatted.

Image data: The image data in a PNG file is stored as a series of pixels. Each pixel contains information about the color and transparency of the image.

Chunks: Chunks are optional sections of the PNG file that provide additional image information. There are several types of chunks, including:

● IHDR (Image Header) chunk: This chunk provides image information, such as width, height, and color depth.

● PLTE (Palette) chunk: This chunk contains a color palette that can be used to represent the image.

● IDAT (Image Data) chunk: This chunk contains the compressed image data.

● tEXt (Textual Data) chunk: This chunk contains text information about the image, such as title or author.

● tIME (Time) chunk: This chunk contains the time that the image was last modified.

Each chunk has a 4-byte type code that identifies the type of chunk, followed by a data section and a 4-byte CRC (Cyclic Redundancy Check) value to verify the integrity of the chunk. In Scenario 2, the attacker can modify certain parts of the PNG file format. For example, the attacker can modify the metadata chunks (such as tEXt or tIME) or the unused or padding areas of the file to insert malicious code. The attacker can also modify the CRC values of the modified chunks to ensure that the PNG file still has the same MD5 checksum as the original file.

This python code creates the collision between the two PNG files:

#!/usr/bin/env python3

import sys
import struct
import glob
import shutil
import os
import hashlib


def get_data(args):
fn1, fn2 = args

with open(fn1, "rb") as f:
d1 = f.read()
with open(fn2, "rb") as f:
d2 = f.read()

assert d1.startswith(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
assert d2.startswith(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
# make sure the header of the two files match
assert d1[:0x21] == d2[:0x21]
# make sure the first 21 bytes of the 2 PNG files are the same
return d1, d2


d1, d2 = get_data(sys.argv[1:3])

hash = hashlib.sha256(d1[:0x21]).hexdigest()[:8]

print("Header hash: %s" % hash)

if not glob.glob("png1-%s.bin" % hash):
print("Not found! Launching computation...")

# make the complete prefix
with open("prefix", "wb") as f:
f.write(b"".join([
# 00-20 - original common header
d1[:0x21],
# 21-46 - padding chunk
b"\0\0\0\x1a", b"aNGE", b"ha IS A PRO HEHEHEHEHEHEHE", b"ROFL",

# 47-C7 - collision chunk

# 47-4F
# this is the end of the collision prefix,
# => lengths of 0x75 and 0x175
b"\0\0\0\x75", b"mARC", b"!",

# the end of the collision blocks if they're not computed
# 50-BF
# " " * 0x70,
]))

os.system("../hashclash/scripts/poc_no.sh prefix")

shutil.copyfile("collision1.bin", "png1-%s.bin" % hash)
shutil.copyfile("collision2.bin", "png2-%s.bin" % hash)

with open("png1-%s.bin" % hash, "rb") as f:
block1 = f.read()
with open("png2-%s.bin" % hash, "rb") as f:
block2 = f.read()

assert len(block1) == 0xC0
assert len(block2) == 0xC0
# make sure both the block lengths are 192 bytes
ascii_art = b"""
/==============\\
|* *|
| PNG IMGS |
| with |
| identical |
| -prefixes |
| |
| |
| by |
| Swetha |
| |
| |
| |
|* *|
\\==============/
BRK!
""".replace(b"\n", b"").replace(b"\r", b"")

assert len(ascii_art) == 0xF4
# assert the length of the ascii art is 244 bytes
suffix = b"".join([
# C0-C7
b"RealHash",
# the remaining of the chunk mARC

# C8-1C3 the tricky fake chunk

# the length, the type and the data should all take 0x100
struct.pack(">I", 0x100 - 4 * 2 + len(d2[0x21:])),
b"jUMP",
# it will cover the data chunks of d2,
# and the 0x100 buffer
ascii_art,
b"\xDE\xAD\xBE\xEF",
# fake cyclic redundancy check for mARC

# 1C8 - Img2 + 4
d2[0x21:],
b"\x5E\xAF\x00\x0D",
# fake cyclic redundancy check for jUMP after d2's IEND
d1[0x21:],
])

with open("%s-1.png" % hash, "wb") as f:
f.write(b"".join([
block1,
suffix
]))

with open("%s-2.png" % hash, "wb") as f:
f.write(b"".join([
block2,
suffix
]))

A Big thanks to Wenliang Du for creating the SEED Labs!❤️
You can find the Seed Security project on GitHub here:
https://github.com/seed-labs

The Lab Tasks can be found in this PDF: https://seedsecuritylabs.org/Labs_20.04/Files/Crypto_MD5_Collision/Crypto_MD5_Collision.pdf

--

--