Notes on Error Correction and Detection

11 minutes read over 4 years ago over 4 years ago

Terminology

Forward Error Correction (FEC) - a way to transmit data reliably, up to a given error rate, over an unreliable medium by redundantly ecoding the data using an Error Correction Code (ECC).
Error Detection - the ability to detect up to a certain number of errors has occurred in the transmission of data.
- When an error is detected the data needs to be requested and retransmitted to recover.
Error Correction - the ability to detect and correct up to a certain number of errors in the transmission of data.
- The size of data transmitted is increased by a fixed amount for every piece of data.
Bit Error - the bit received does not match the bit transmitted.
Bit Erasure - the bit transmitted is not received.
Block Code - ECC code which work on a fixed size block of data.
Convolutional Code - ECC code which work on a sliding window of prior data.
Code Rate - |data bits| / |total bits| => |data bits| / (|data bits| + |redundancy bits|); Higher, closer to one, is better for a given error rate.
Maximum Chanel Capacity / Shannon Limit / Noisy-channel Coding Theorem - “the maximum rate of error-free data that can theoretically be transferred over the channel if the link is subject to random data transmission errors, [Gaussian noise], for a particular noise level” [src].
Hamming Distance - the number of substitutions needed to transform one sequence into another sequence. In this context is it used to measure bit error rates due to only bit errors, and NOT bit erasures or bit insertions. Most FEC codes only support bit errors.
Levenshtein Distance / Edit Distance - the minimum number of insertions, deletions, or substitutions needed to transform one sequence into another sequence. In this context is it used to measure bit error rates due to bit errors, bit erasures, and bit insertions.
Code distance / minimum distance / d - the minimum number of changes (Hamming Distance or Levenshtein Distance) required for one code word to be transformed into another code word. Therefore any block code with a distance of d can detect up to d - 1 errors. And any block code with a distance of d can recover up to (d - 1) / 2 errors since only one possible code word will be withing that distance.
Block Length / n - size of the encoded data
Message length / k - size of the unencoded data
Alphabet size / q - the number of symbols in the alphabet (ex: binary |[1, 0]| => 2)

Overview

Redundant bits are inserted into data so that if some of that data, up to a given number of bits, are flipped (and possibly missing or inserted) then this corruption can be detected and possibly corrected. Most error correction codes (ECC) work on fixed sized blocks of data while some work on sliding windows of streams of data. Some are more compact in the amount of redundant bits introduced while some are more efficient to encode or decode or are simpler to implement.

Most ECCs assume a consisent random error distribution over the transferred bits, known as a “memoryless” channel. This means that burst errors, errors which are higher than the channel average for a section of data, may still lead to unrecoverable data corruption with ECCs designed to handle up to the average chanel error rate. To mitigate this the blocks of data can be interleaved and transmitted so that when they are received and deinterleaved the error rate is more uniform across the blocks. This however increases the time it takes to fully receive a block and thus introduces increased latency.

Error Detection

None of these methods can correct errors however they can determine if errors may have occurred. If data retransmission is cheap, error rates are low, or data corruption is manageable then error detection may be preferred over error correction which requires a greater block size and computation overhead for each message.

Parity Check

One relatively simple approach is to use a single bit to mark the occurrence of an even or odd number of set bits, aka 1s, in the binary representation of the data. This parity bit is calculated by the sender and sent along side the data. The receiver can in turn calculate the expected parity bit for the data received and compare this against the parity bit that was sent.

If there is a mismatch between the expected and received parity bit then it is known that the data received is corrupt (d = 1) and one or an odd number of bits have been flipped. Conversely if the parity bits match then the data may not be corrupt and zero or an even number of bits have been flipped.

Pros:

Quick, especially in hardware, to determine if the number of set bits are odd by using a chain of XOR gates.
Small transmission / storage overhead of a single bit.

Cons:

Can only detect an odd number of bit flips.

Checksum / Message Digest (Hash)

The data is summed or hashed in some manner and this value is used to ensure data entegrity (ex: md5sum, shasum, etc).

Cons:

Hash collisions can occur although this is often very unlikely if a sufficiently long hash size is used.

Error Correction (ECC)

Classical block codes have error detection and correction guarantees while many modern block codes do not provide these guarantees and instead bit error rates are used to determine their performance. Many of these codes only handle bit errors so for these the Hamming Distance is used to calculate bit error rates. For error correction rates of bit errors, erasures, and insertions the Levenshtein distance is used.

Repetition Codes / Duplicate Data

The most basic forrm of error correction is to send the data multiple times, say three times, and take the majority bit value for each bit. This allows up to one bit to be flipped or up to two bits to be missing and still allow for the error to be detected and corrected.

Pros:

Simple to understand and implement.

Cons:

Not very space efficient.
If the duplicate data is repeated in the same order each time then errors which occur at a fixed interval may affect the same bit positions each transmission and leave the message unrecoverable.

Parity Bits

Similar to a single parity bit except 1) multiple parity bits are used and 2) the data is divided in half and each bit in each half is XORed against each other to produce the parity bits. This allows error detection if one (or three) out of the three corresponding bits are flipped and recovery if the corrupted bit is known (ex: an erasure when one drive fails in a RAID-3 array).

Example:

data = [A, B, C, D, E, F]
data[0:2] = [A, B, C]
data[3:5] = [D, E, F]
parity = [A^D, B^E, C^F]

This can be extended to dividing the data into n number of pieces and XORing each bit in each piece together to produce the parity bits. Still error detection can only occur if one (or an odd number of bits) out of the n bits are flipped and recovery if the corrupted bit position is known.

Commonly used in RAID where in RAID-3 mutiple drives store the data and another drive stores the parity.

Pros:

Simple and fast to implement in hardware.
Handles a 33% error rate when data is divided in two.

Cons:

Not near the channel capacity
Uncorrectable unless the currupted bit position is known.
Results in a 50% increase in data size when data is divided in two.

Hamming Code

Wikipedia

An “perfect”, as in minimal distance, code word count, and code word length, binary block code which can detect up to two bit errors, without correction, and correct up to one bit error (d = 3). While the message and block size are configurable they are only at discrete, and not continuous, sizes. (m = 2^r - 1 - r and n = 2^r - 1 where r represents the number of parity bits and r >= 2. Ex: See this table.

Also some two bit errors look like a single bit error and when error correction is used this results in an incorrect “correction”. To mitigate this another bit can be added to the d = 3 hamming code to make it a d = 4 hamming code known as Single Error Correction Double Error Detection (SECDED), however this is no longer a “perfect” code rate. This pushes the same issue out to where some three bit errors are detected as two bit errors and if error correction is used the incorrect bits are “corrected”.

Commonly used in NAND flash memory. Single bit error correction, two bit error detection.

Pros:

“Perfect” for d = 3 on a 2 letter alphabet (ex: binary).
Can be extended to handle messages of various sizes.
Fast using masks and XORs.

Cons:

Not near channel capacity.
Only supports single bit error correction for optimal hamming code (d = 3).
Low code rates at small message sizes: 73% at 11 bits, 57% at 7 bits, 33% at 1 bit.
Discrete, and not continious, message and block sizes.
Only supports a binary alphabet.

Reed-Solomon Code

Wikipedia Vivint - Introduction to Reed-Solomon

A family of maximally distant block codes with a configurable distance of d = n - k + 1.

In the original form each letter in the message is mapped onto each coefficient in a polynomial function of degree < k. Example: p(x) = A + Bx + Cx^2 + Dx^3 + ... where A = message[0], B = message[1], .... And to obtain the encoded message n predefined points are evaluated across p(x). To decode the message k of the n values sent form a system of equations which must be solved to recover the original k coefficients. Alternatively the message can be mapped onto the values of p(x) and there by reverse the steps required for encoding and decoding as described previously. This allows the message to appear in the encoded data along with seperate redundancy symbols.

In the more common form (BCH) the message is mapped in one of the two ways mentioned prior and then tranformed into another polynomial s(x) of degree < n using a polynomial g(x) shared by both the encoder and decoder.

Commonly used in CDs, DVDs, HDs, QR codes, RAID-6, …

Pros:

Maximally distant.
Configurable correction and detection limits.
Handles burst errors well.

Cons:

Not near channel capacity.
Complex.

Capacity-Approaching Codes

Turbo Codes

Wikipedia

A class of soft decision (probabilistic) codes which are near the channel capacity. Two equally sized but different sets of parity bits are produced by two encoders and sent alongside the message data. Two decoders decode the two sets of parity bits, each working together in an iterative fashion until both agree on the value and likelihood for each message bit.

Used in 3G, 4G, and satellite communications.

Pros:

Near channel capacity.
Performs better than LDPC at lower code rates.

Cons:

Complex.
Multiple cycles to decode (“typically in 15 to 18 cycles”).

Low Density Parity Check (LDPC) Codes

Wikipedia

A class of codes which are near the channel capacity with linear decoding time. The encoded data is composed of the message bits and parity bits. Finding an optimal LDPC code is not practical (NP-complete) but heuristics or other approaches are able to produce high performance ones.

Used in 802.11n, 10GBASE-T, DOCSIS 3.1, …

Pros:

Near channel capacity.
Linear time decoding WRT n.
Better “error floor and performance” than turbo codes when using higher code rates.
Decoding can be done on low performance (ex: microcontroller) hardware by using a lookup table.

Cons:

Complex.

Terminology

Overview

Error Detection

Parity Check

Checksum / Message Digest (Hash)

Error Correction (ECC)

Repetition Codes / Duplicate Data

Parity Bits

Hamming Code

Reed-Solomon Code

Capacity-Approaching Codes

Turbo Codes

Low Density Parity Check (LDPC) Codes

References