Bitzuma

Bitcoin from Scratch: Hash Functions

By Rich Apodaca | Updated

The security of an electronic cash system increases by linking messages such as transactions together through unique IDs. Centralized registries, such as the kind operated by various governments, offer one way to issue unique IDs. However, the potential for corruption and service outage — not to mention bureaucratic inefficiency — makes centralized registries unsuitable for electronic cash. A better solution is available through hash functions.

This post is part of the extended series Bitcoin from Scratch.

A hash function accepts an arbitrary message as input and produces a numerical hash value as output. The message can be composed of any binary data of any length, including text, images, audio, and video. The hash value is deterministically selected from a large range and appears to be random to all observers. As such, hash functions can generate a unique identifier for any piece of digital data. Hash values are most commonly expressed in hexadecimal notation.

SVG Image
Hash Function. A message of any length (left) is converted to a numerical hash value (right) via an imaginary hash function (center).

Hash functions can be understood more concretely by thinking in terms of a random oracle. A random oracle consists of a box containing a gremlin, a book, a pencil, a stack of index cards, and a metal coin. Into one side of the box is cut an input slot through which a user can slide a message written on a piece of paper. Opposite this slot is cut another slot through which the gremlin slides a result written on one of the index cards. Each page of the book is divided with a vertical line; to the left is the heading “Message” and to the right is the heading “Result.”

SVG Image
Random Oracle. A box that generates random output for new messages and consistent output for previously-seen messages. Inside the box is a gremlin that uses coins tosses to generate random output and a book of previous messages to ensure consistency.

The oracle’s purpose is to produce what appears to be random output for new messages, but the same output whenever a message is resubmitted. To this end, the gremlin uses the following procedure. When a new message is received through the input slot, the gremlin consults the book. For each page in the book, the Message column is scanned for a match to the message that was just received. If a match is found, the gremlin transcribes the corresponding entry in the “Result” column to an index card, and pushes the card through the output slot. If the message is not found, the gremlin opens the book and creates a new entry after the last. Under the “Message” column, the gremlin transcribes the message. Then the gremlin takes a new index card and begins a series of 16 coin tosses. For each heads toss, the gremlin appends a “1” to the card. For each tails toss, the gremlin appends a “0”. After completing its coin tosses, the gremlin transcribes the result from the index card into the book under the “Result” column, and pushes the card through the output slot.

Imagine that Alice wants to test the oracle with a series of challenges. First, she writes the message, “To the moon!” and inserts it through the input slot. The oracle responds with an index card on which is written a sequence of 16 ones and zeroes. Using hexadecimal notation, Alice converts the result to the 16-bit integer 0xf1c3. Every time Alice submits the “To the moon!” message, she receives the same result from the oracle: 0xf1c3. Alice then writes a second message, “We set sail on this new sea.” and inserts it through the input slot. The oracle responds with an index card on which is written a binary sequence equivalent to the 16-bit integer 0x2a06. Each new message results in a different result, but Alice can decipher no pattern to the output.

Like the random oracle, hash functions offer two properties that make them attractive for issuing unique IDs:

  • Consistency. The same combination of hash function and message will always produce the same output. This ensures that a given piece of digital data always produces the same hash value output.
  • Fixed-width output. A hash function will always produce output within a constant numerical range. As such, the storage requirements for a hash value are always known.

Taken together, these two properties allow a hash function to assign a permanent, fixed-width identifier to any piece of digital data.

Although the random oracle appears in many interesting and useful thought experiments, it exists as fantasy only. What’s needed is a way to mathematically replicate the behavior of the random oracle. Fortunately, the last few decades of cryptographic research have produced several hash functions to choose from. Even better, the principles underlying hash functions can be ignored for most intents and purposes. Those interested in the details might consult this excellent reference.

SVG Image
Hash Collision. Two different messages yield the same hash value.

When used to issue unique IDs, a hash function must avoid generating collisions. A collision occurs when two different messages yield the same hash value. Collisions break the fundamental requirement of ID uniqueness, weakening the security of the electronic cash system.

SVG Image
Range. Increased range (right) gives more possible hash values, and in principle, a lower collision frequency.

One way to minimize hash value collisions is to increase the output range. Range refers to the largest value that a hash function can produce, typically measured in bits. For example, a hash function capable of 16-bit output can produce at most 65,536 (216) hash values. Although widening the output range can decrease the collision rate, adding bits increases storage and transmission costs. The best hash function offers a good tradeoff between storage costs and collision rate.

SVG Image
Uniformity. A uniform hash function (right) spreads hash values evenly over the available range.

Another way to reduce a hash function’s collision rate is to maximize uniformity. Uniformity refers to how evenly distributed hash value are. For example, a hash function capable of 32-bit output that consistently produced a single value would have very poor uniformity despite a large range. Any two messages would produce a collision 100% of the time, even though a much lower collision rate is feasible. To take full advantage of its output range, a good hash function ensures the widest possible distribution of values.

Regardless of the quality of a hash function’s output, all are subject to limitations imposed by the birthday paradox. Specifically, an ideal hash function such as the random oracle will generate a collision between two random messages in roughly one out of every 2n/2 attempts, where n is the number of bits in the output value. In the case of a hash function returning 16-bit output, a collision can be expected once every 216/2 (256) attempts. At best, adding one bit to the output of a hash function decreases collision frequency by a factor of √2.

In addition to collision resistance, a good hash function resists preimage attacks. Here, an attacker tries to find a message whose hash value matches that of a known message. An attacker for whom a preimage attack becomes practical can rewrite messages linked together via their hash values. Whereas collision attacks benefit from the birthday paradox, preimage attacks do not. In other words, an ideal hash function producing 16-bit output will require around 65,536 attempts before succeeding.

SVG Image
Preimage Attack. A message chain can be forged by finding a new message (B’, red) with the same hash value ID as one of the chain’s members (B, white). An outside observer will consider chains A-B-C and A-B’-C to be equivalent.

Examples of hash functions generally regarded as secure include SHA-256 and RIPEMD-160. SHA-256, developed by the US National Security Agency (NSA), yields hash values 256 bits (32 bytes) wide. RIPEMD-160, developed by an academic group from Belgium, yields hash values with a width of 160 bits (20 bytes). For this reason, both SHA-256 and RIPEMD-160 are widely-used to issue unique IDs in electronic cash systems.

A bit width of only 256 may at first appear woefully too narrow to support unique identifiers. It helps to consider the magnitude of this number in relation to a familiar reference point. The value 2256 is equal to approximately 1077 when expressed in decimal format, or the number “1” followed by 77 zeroes. This number is so vast that just counting that high with an extremely efficient computer would consume the combined energy output of the sun for many centuries. In other words, it’s impossible to even enumerate the values between one and 1077, much less execute a hash function that many times.

A good way to understand how hash functions work is to experiment with them interactively. One resource for doing so is the SHA-256 Online calculator. This tool supports a wide range of hash functions, including SHA-256 and RIPEMD-160.

For simplicity, the rest of this series will use short decimal numbers when discussing hash-based identifiers. Keep in mind, however, that these identifiers actually represent the output of a secure hash function.