AES Rundown

Introduction to Ciphers

In today's world, there is a monumental effort to secure data that we wish to remain private. Information like our social security numbers, credit card data, and passwords should all be hidden from the general public (and for good reason). Despite this, we seldom ask ourselves how this data is actually hidden. If the title hasn't already given it away, the answer is encryption.

If you're in the vast majority of the population, you probably don't encounter block ciphers in your day-to-day life. This shouldn't be an issue though, as this article is meant to be understood by anyone with a bit of coding experience.

The first week that I started writing ciphers for NIST, my mentor came up to me and gave a helpful analogy: a block cipher is like an old-fashioned safe. The cipher itself is the safe, while our data is the valuable locked inside. The catch is that our key is still called a 'key', but it's really just a big number of our choosing used for calculations.

A visual depiction of how a cipher is similar to a safe

Why AES?

Great question. It would take greater minds than mine to answer that question fully.

The simple answer is that AES has remained very resilient through years of scrutiny without sacrificing too much speed. This does not mean AES is necessarily efficient as a lightweight cipher (see PRINCE); nonetheless, would you rather keep your money in a national bank or in a safe under your bed?

For those interested in the security aspect, my previous analogy begins to break down when it comes to the motive of a hacker / thief. Just as a thief is interested in the money inside a safe, one would expect that the motive of a hacker is to steal passwords / sensitive information; however, it is truly the motive of a hacker to obtain our key (I'm interchanging hacker and cryptanalyst a bit loosely — in reality I think most hackers would be perfectly happy with a few hundred passwords). Therefore, the objective of a cipher is to make it as hard as possible to retrieve the key.

To give a little more detail, all computer information is stored in binary, so our key is just a fixed number of bits (128, 192, or 256 for AES). Every bit that we add doubles the number of possible keys, and therefore doubles our key space. In order to break a cipher, one must find an algorithm that discovers the key more efficiently than a brute-force search (though I have not gone into the details of AES yet, this is the rationale behind adding extra rounds for larger key sizes).
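To put numbers on the doubling argument, here is a small sketch. The guessing rate is a made-up figure chosen only for illustration, not a measurement of any real hardware.

```python
def key_space(bits):
    """Number of distinct keys for a key of the given bit length."""
    return 2 ** bits

# Hypothetical attacker speed, assumed purely for illustration.
GUESSES_PER_SECOND = 10 ** 12
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

for bits in (128, 192, 256):
    # Worst case: the attacker has to try every key in the key space.
    years = key_space(bits) // (GUESSES_PER_SECOND * SECONDS_PER_YEAR)
    print(f"AES-{bits}: {key_space(bits):.3e} keys, about {years:.3e} years to try them all")
```

Even at a trillion guesses per second, the smallest key size is far out of reach of exhaustive search, which is why attacks have to be smarter than brute force to matter.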

To my knowledge, the only non-side-channel attack on the full cipher is a biclique attack, which offers only a theoretical improvement from $2^{128}$ to roughly $2^{126}$ operations for AES-128 (theoretical meaning this attack is still not practical for modern computers). Though there are other ciphers that are nearly as strong as AES (e.g. Serpent and Twofish), the attention given to AES over the past few decades has allowed for speedups and security updates to patch the commonly-known side-channel attacks.

The Cipher

Now that you have all the background, it's time to discuss the actual algorithm of AES.

AES is what cryptographers call a Substitution-Permutation Network (SPN): every round / iteration of the algorithm takes our chunk of data, breaks it into smaller chunks that are fed through a substitution box, and then swaps the bits around according to some sort of permutation.
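To make the substitute-then-permute idea concrete, here is a toy round on a 16-bit block. Everything in it (the 4-bit S-box, the bit permutation, the name `toy_round`) is an assumption invented for this sketch; it is not AES and offers no security.

```python
# Toy 4-bit S-box: an arbitrary permutation of 0..15 picked for this sketch.
TOY_SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
            0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]

def toy_round(block, round_key):
    """One round of a toy 16-bit SPN: substitute, permute, then mix in the key."""
    # Substitution: push each 4-bit chunk through the S-box.
    substituted = 0
    for n in range(4):
        nibble = (block >> (4 * n)) & 0xF
        substituted |= TOY_SBOX[nibble] << (4 * n)

    # Permutation: bit i moves to position (4 * i) % 15; bit 15 stays put.
    permuted = 0
    for i in range(16):
        bit = (substituted >> i) & 1
        target = 15 if i == 15 else (4 * i) % 15
        permuted |= bit << target

    # Key mixing: XOR with the round key.
    return permuted ^ round_key

print(format(toy_round(0x0000, 0x0000), '04x'))
```

AES follows the same pattern at a larger scale: SubBytes() is the substitution, ShiftRows() and MixColumns() together spread the bytes around, and AddRoundKey() mixes in the key.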

In order to truly understand what is going on in AES, the reader would first have to become familiar with the basics of polynomial rings and Galois fields (also referred to as finite fields). I will leave a section at the end to cover these things for those who are interested.

As I briefly mentioned in an aside earlier, the AES algorithm is somewhat dependent on key length (which is either 128 bits, 192 bits, or 256 bits); however, the size of the data we wish to encrypt is fixed at 128 bits (16 bytes). The reason I say 'somewhat' is that AES is essentially a collection of functions (SubBytes(), ShiftRows(), MixColumns(), and AddRoundKey()) that are repeated; each iteration of these functions is referred to as a round. AES-128, AES-192, and AES-256 all execute the same code; it's merely the number of rounds that differs for each variant. In particular, AES-128 executes 10 rounds, AES-192 executes 12 rounds, and AES-256 executes 14 rounds. Increasing the number of rounds is meant to ensure that the best known cryptanalysis takes about as long as a brute-force search.
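The round counts above follow a simple pattern: the number of rounds is the key length in 4-byte words plus 6. A minimal sketch, where the function name `rounds_for_key` is my own and not part of any library:

```python
def rounds_for_key(key_bits):
    """Number of AES rounds for a 128-, 192-, or 256-bit key."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192, or 256 bits long")
    key_words = key_bits // 32   # key length in 4-byte words (4, 6, or 8)
    return key_words + 6         # 10, 12, or 14 rounds

for bits in (128, 192, 256):
    print(f"AES-{bits}: {rounds_for_key(bits)} rounds")
```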

With that said, here's a breakdown of the functions that are executed in each round of the algorithm.

Key Schedule

At this point, you're likely wondering how the key actually fits into the cipher: is there some sort of lock hidden in the code where the key mysteriously fits in, maybe some hash function that checks authenticity? Nope.

As I briefly touched upon earlier, the key is basically a large number used for arithmetic operations in $GF(2^8)$; but even that isn't entirely true, as the key itself is only used in the first round for key-whitening. For later operations, the key is actually used to generate what are called round keys in a key schedule. Many ciphers utilize the concept of a key schedule, as it greatly improves upon the security of a key. For example, if a cryptanalyst were to gain access to one of our round keys through extensive attacks, the strength of our key schedule would determine whether that is enough information to compute the original key (in the case of AES, it is not but allows the cryptanalyst to gain access to other round keys which collectively can accomplish the task).

The AES key schedule is built from two subroutines, SubWord() and RotWord(), whose output is added in $GF(2^8)$ to a round constant. The gist of the key schedule is that the previous round key is added to the current round key so that, as we continue along the key schedule, we experience a similar avalanche effect to that of SHA. Implementing the subroutines is incredibly straightforward: RotWord() cyclically rotates the bytes of the current round key as if they were a wheel, and SubWord() simply breaks the round key up into individual bytes and sends them through the SBox (which is explained in the next section). Here is the code for each:

def RotWord(word):
    return [word[(i + 1) % 4] for i in range(4)]

def SubWord(word):
    return [SBox[word[i]] for i in range(4)] #SBox is provided in next section

static void RotWord(unsigned char* word) {
     // assume that word is a char array of length 4
     unsigned char temp = word[0];
     word[0] = word[1];
     word[1] = word[2];
     word[2] = word[3];
     word[3] = temp;
}
static void SubWord(unsigned char* word) {
     // SBox is provided in next section
     word[0] = SBox[word[0]];
     word[1] = SBox[word[1]];
     word[2] = SBox[word[2]];
     word[3] = SBox[word[3]];
}
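To see the two subroutines in action, here is a worked example on the last word of the FIPS test key used later in this article (09 cf 4f 3c). To keep the sketch short it carries only the four S-box entries it needs, copied from the table in the SubBytes() section; the lowercase function names are mine.

```python
# Only the four S-box entries this example needs.
SBOX_SUBSET = {0x09: 0x01, 0xcf: 0x8a, 0x4f: 0x84, 0x3c: 0xeb}

def rot_word(word):
    """Rotate a 4-byte word one byte to the left."""
    return word[1:] + word[:1]

def sub_word(word, sbox):
    """Send each byte of the word through the substitution table."""
    return [sbox[b] for b in word]

last_key_word = [0x09, 0xcf, 0x4f, 0x3c]
rotated = rot_word(last_key_word)
substituted = sub_word(rotated, SBOX_SUBSET)

print(' '.join(format(b, '02x') for b in rotated))      # cf 4f 3c 09
print(' '.join(format(b, '02x') for b in substituted))  # 8a 84 eb 01
```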

Since the size of data that we wish to encrypt is fixed at 16 bytes (hence the term "block" cipher: we only encrypt one 16-byte block at a time), we choose to represent this data as a 4x4 matrix of bytes which we will refer to from now on as the state: $$\begin{pmatrix} p_0 & p_4 & p_8 & p_{12} \\\\ p_1 & p_5 & p_9 & p_{13} \\\\ p_2 & p_6 & p_{10} & p_{14} \\\\ p_3 & p_7 & p_{11} & p_{15} \end{pmatrix}$$ I'm using the variable $p_i $ for $ 0 \leq i \leq 15 $ here to denote splitting of plaintext into bytes. Our goal is to add 4 round keys to the state each round (since a round key is only a word and the state is 4 words). In order to do this, we tend to think of the state as an array of 4 columns, and perform such operations column by column.

Now the first four round keys are simply the key itself; however, all subsequent round keys are added to the previous four round keys after they have passed through the SubWord() and RotWord() subroutines. Since our first four round keys are used for key-whitening, we actually have one additional round key which is used for the last round. Thus, for AES-128, there need to be 4 round keys for each of 10 + 1 rounds; for AES-192, there need to be 4 keys for each of 12 + 1 rounds; and for AES-256, there need to be 4 keys for each of 14 + 1 rounds. I am confident you can do the math.
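For anyone who would rather not do the math by hand, the count works out as follows. This is a small sketch; `round_key_words` is a name I made up for it.

```python
def round_key_words(n_rounds):
    """Number of 4-byte round keys needed: four per round, plus four for key-whitening."""
    return 4 * (n_rounds + 1)

for name, n_rounds in (("AES-128", 10), ("AES-192", 12), ("AES-256", 14)):
    words = round_key_words(n_rounds)
    print(f"{name}: {words} round keys ({4 * words} bytes)")
```

The byte counts are the sizes the RoundKey array has to be allocated with for each variant.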

Here's an example of the full key schedule:

def KeySchedule(key, nRounds):
    # RoundKey is assumed to be a preallocated global list of
    # 16 * (nRounds + 1) bytes; this snippet covers AES-128 (16-byte key)

    # The first four round keys are the key itself
    for i in range(16):
        RoundKey[i] = key[i]

    for i in range(4, 4 * (nRounds + 1)):
        # copy the previous round key (one 4-byte word)
        tempWord = [RoundKey[4 * (i - 1) + j] for j in range(4)]

        # since the state is 4x the size of a RoundKey, we only apply our
        # subroutines on every fourth round key
        if i % 4 == 0:
            tempWord = RotWord(tempWord)
            tempWord = SubWord(tempWord)
            tempWord[0] ^= Rcon[i // 4]

        for j in range(4):
            # current round key is our calculation above added to the
            # round key four positions back
            RoundKey[4*i + j] = RoundKey[4*(i-4) + j] ^ tempWord[j]

#define NUM_ROUNDS 10 // AES-128
                           
static void KeySchedule(const unsigned char * key) {
  
    unsigned char i;
    unsigned char tempWord[4];
    
    // The first round key is the key itself.
    for (i = 0; i < 4; ++i)
    {
        RoundKey[4 * i] = key[4 * i];
        RoundKey[4 * i + 1] = key[4 * i + 1];
        RoundKey[4 * i + 2] = key[4 * i + 2];
        RoundKey[4 * i + 3] = key[4 * i + 3];
    }
    // All other round keys are found from the previous round keys.
    for (; i < 4 * (NUM_ROUNDS + 1); ++i)
    {
        // copies the previous round key (one 4-byte word) into tempWord
        tempWord[0] = RoundKey[4 * (i - 1)];
        tempWord[1] = RoundKey[4 * (i - 1) + 1];
        tempWord[2] = RoundKey[4 * (i - 1) + 2];
        tempWord[3] = RoundKey[4 * (i - 1) + 3];
        
        // since the state is 4x the size of a RoundKey, we only apply our
        // subroutines on necessary rounds            
        if (i % 4 == 0)
        {
            RotWord(tempWord);
            SubWord(tempWord);
            
            tempWord[0] ^= Rcon[i / 4]; // Rcon is only one byte, so the other bytes of the word are XORed with zero (the identity)
        }

        // current round key is our calculation above added to previous round key
        RoundKey[i * 4] = RoundKey[(i - 4) * 4 ] ^ tempWord[0];
        RoundKey[i * 4 + 1] = RoundKey[(i - 4) * 4 + 1] ^ tempWord[1];
        RoundKey[i * 4 + 2] = RoundKey[(i - 4) * 4 + 2] ^ tempWord[2];
        RoundKey[i * 4 + 3] = RoundKey[(i - 4) * 4 + 3] ^ tempWord[3];
    }
}

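The round constants denoted by Rcon are the successive powers of 2 in $GF(2^8)$: for $i \geq 1$, Rcon[i] is the byte $2^{i-1}$. The entry at index 0 is a placeholder that the key schedule never reads. I'll provide the constants in the array below: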



        0x8d, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36, 0x6c, 0xd8, 0xab, 0x4d, 0x9a
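The constants above do not have to be memorised; they can be regenerated by repeatedly doubling in $GF(2^8)$. A minimal sketch, where `make_rcon` is a helper name of my own choosing:

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by the AES polynomial 0x11b."""
    b <<= 1
    if b & 0x100:       # the shift overflowed past 8 bits
        b ^= 0x11b      # subtract (XOR) x^8 + x^4 + x^3 + x + 1
    return b

def make_rcon(count):
    """Return [Rcon[1], Rcon[2], ..., Rcon[count]]."""
    constants = []
    value = 0x01
    for _ in range(count):
        constants.append(value)
        value = xtime(value)
    return constants

print(', '.join(format(c, '#04x') for c in make_rcon(15)))
```

The output matches the array above from its second entry onward; the jump from 0x80 to 0x1b is where the doubling first overflows a byte and gets reduced.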

`SubBytes()`

The first item on our list of round functions is that pesky "substitution" part of our whole substitution-permutation network. The reason our substitution layer is so important is that it provides a non-linear layer in $GF(2^8)$, so that cryptanalysis cannot reduce the cipher to a linear system of equations (which would allow the cipher to be easily broken using GPUs and a sufficient knowledge of finite field arithmetic).

Now let $ b $ be an arbitrary byte from our state above (i.e. $ b = p_i $ for some $ 0 \leq i \leq 15$ ) and let $ b_j $ denote its $ j^{th}$ bit for $ 0 \leq j \leq 7 $. Writing $ b^{-1} $ for the multiplicative inverse of $ b $ in $GF(2^8)$ (with $0$ mapped to itself), the transformation which the substitution layer applies to $b$ is $$ \tilde{b}_j = (b^{-1})_j \oplus (b^{-1})_{(j + 4)\mathrm{mod} 8} \oplus (b^{-1})_{(j + 5)\mathrm{mod} 8} \oplus (b^{-1})_{(j + 6)\mathrm{mod} 8} \oplus (b^{-1})_{(j + 7)\mathrm{mod} 8} \oplus c_j $$ where $c_j$ denotes the $j^{th}$ bit of the fixed constant $ 01100011$ (0x63, written with bit 7 first). Note that the symbol $\oplus $ here is the exclusive-or XOR operation on single bits (i.e. $0 \oplus 0 = 1 \oplus 1 = 0 $ and $1 \oplus 0 = 0 \oplus 1 = 1 $ ); addition in $GF(2^8)$ is that same XOR applied to all eight bits of a byte at once.

Now if we decided to apply that transformation bit by bit to a 128-bit state, we would be wasting a HUGE amount of computing power since inversion in $GF(2^8)$ is incredibly taxing in terms of clock cycles; instead, the National Institute of Standards and Technology (NIST) was gracious enough to provide a pre-computed lookup-table:


      0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76, 

      0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0, 

      0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15, 

      0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75, 

      0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84, 

      0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf, 

      0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8, 

      0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2, 

      0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73, 

      0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb, 

      0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79, 

      0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08, 

      0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a, 

      0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e, 

      0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf, 

      0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16
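For the curious, the table above can be regenerated from the formula. The sketch below is deliberately slow and simple: it finds each inverse by trial multiplication, which is exactly the cost the lookup table lets us avoid. The helper names are my own.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11b
        b >>= 1
    return result

def gf_inverse(b):
    """Multiplicative inverse in GF(2^8); 0 has no inverse and maps to 0."""
    if b == 0:
        return 0
    return next(c for c in range(1, 256) if gf_mul(b, c) == 1)

def sbox_entry(b):
    """Apply the bitwise transformation from the formula to the inverse of b."""
    inv = gf_inverse(b)
    result = 0
    for j in range(8):
        bit = (inv >> j) & 1
        for offset in (4, 5, 6, 7):
            bit ^= (inv >> ((j + offset) % 8)) & 1
        bit ^= (0x63 >> j) & 1      # the constant c
        result |= bit << j
    return result

sbox = [sbox_entry(b) for b in range(256)]
print(', '.join(format(v, '#04x') for v in sbox[:16]))
```

The printed row should agree with the first row of the table.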

Now that we have our lookup table, creating a function to send the state through the substitution layer is pretty simple:

def SubBytes():
    global state
    state = [[SBox[state[row][col]] for col in range(4)] for row in range(4)] # SBox is array above

static void SubBytes() {
    unsigned char row, col;
    for (row = 0; row < 4; row++)
        for (col = 0; col < 4; col++)
            (*state)[row][col] = SBox[(*state)[row][col]]; // SBox is array above
}

`ShiftRows()`

Next on our agenda is the ShiftRows() method, which does just what the name suggests. Using zero-indexed origin (i.e. for $0 \leq i \leq 3$), we cyclically shift each row in the state to the left by $i$ bytes: $$ \begin{pmatrix} p_0 & p_4 & p_8 & p_{12} \\\\ p_1 & p_5 & p_9 & p_{13} \\\\ p_2 & p_6 & p_{10} & p_{14} \\\\ p_3 & p_7 & p_{11} & p_{15} \end{pmatrix} \longmapsto_{ShiftRows()} \begin{pmatrix} p_0 & p_4 & p_8 & p_{12} \\\\ p_5 & p_9 & p_{13} & p_1 \\\\ p_{10} & p_{14} & p_2 & p_6 \\\\ p_{15} & p_{3} & p_{7} & p_{11} \end{pmatrix} $$

def ShiftRows():
    global state
    for i in range(1, 4):
        state[0][i], state[1][i], state[2][i], state[3][i] = \
        state[i][i], state[(i + 1) % 4][i], state[(i + 2) % 4][i], state[i - 1][i]

static void ShiftRows() {
    unsigned char temp;

    // Rotate row 1 (the second row) 1 column to the left
    temp           = (*state)[0][1];
    (*state)[0][1] = (*state)[1][1];
    (*state)[1][1] = (*state)[2][1];
    (*state)[2][1] = (*state)[3][1];
    (*state)[3][1] = temp;

    // Rotate row 2 (the third row) 2 columns to the left
    temp           = (*state)[0][2];
    (*state)[0][2] = (*state)[2][2];
    (*state)[2][2] = temp;
    temp           = (*state)[1][2];
    (*state)[1][2] = (*state)[3][2];
    (*state)[3][2] = temp;

    // Rotate row 3 (the fourth row) 3 columns to the left
    temp           = (*state)[0][3];
    (*state)[0][3] = (*state)[3][3];
    (*state)[3][3] = (*state)[2][3];
    (*state)[2][3] = (*state)[1][3];
    (*state)[1][3] = temp;
}
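A quick way to convince yourself that the shifts match the matrix shown earlier is to fill the state with the indices 0 through 15 and look at where they land. This sketch assumes the same layout as the rest of the code, a list of four columns so that `state[col][row]` addresses one byte, but returns a new state rather than modifying a global.

```python
def shift_rows(state):
    """Return a new state with row r rotated left by r positions."""
    return [[state[(col + row) % 4][row] for row in range(4)] for col in range(4)]

# Byte p_i sits in column i // 4, row i % 4.
state = [[4 * col + row for row in range(4)] for col in range(4)]
shifted = shift_rows(state)

# Print the state row by row, the way the matrix is drawn in the text.
for row in range(4):
    print([shifted[col][row] for col in range(4)])
```

Each printed row lists the subscripts of the corresponding row of the right-hand matrix.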

`MixColumns()`

Third on our list is the MixColumns() function, and this is where things start to get trickier. The reason things get tricky is that this function is a linear transform over $GF(2^8)$, and is thus heavily dependent on polynomial multiplication. I briefly mentioned in the SubBytes() section that inversion (i.e. division) in a Galois field is computationally taxing in terms of clock cycles; the same applies to multiplication, although it takes fewer steps than inversion. The exact rationale for MixColumns() is explained in the additional content for decryption. Overlooking the details, we have the transform: $$ \begin{pmatrix} 02 & 03 & 01 & 01 \\\\ 01 & 02 & 03 & 01 \\\\ 01 & 01 & 02 & 03 \\\\ 03 & 01 & 01 & 02 \end{pmatrix} \begin{pmatrix} p_0 & p_4 & p_8 & p_{12} \\\\ p_1 & p_5 & p_9 & p_{13} \\\\ p_2 & p_6 & p_{10} & p_{14} \\\\ p_3 & p_7 & p_{11} & p_{15} \end{pmatrix} $$

We have two options here: since multiplication isn't as hard as division in $GF(2^8)$, we could find an algorithm for how to multiply by small numbers. Alternatively, we could just do what we did before with the substitution layer and precompute everything into a lookup-table. Since each has its own advantage (space vs. time tradeoff), I'll go ahead and provide both.

For the explicit multiplication approach, recall that a 'carry-over' digit was used in elementary school multiplication when one digit overflowed into another. The idea doesn't change just because we have switched fields from $\mathbb{R}$ to $GF(2^8)$, but we must be cautious to use the right binary operator. Ultimately, the gmult() function below relays this idea using the proper XOR:

def xtime(x):
    return (x << 1) ^ (((x >> 7) & 0x01) * 0x1b) # 0x1b represents polynomial x^4 + x^3 + x + 1
        
def gmult(x, y):
    result = (y & 0x01) * x
    result ^= (y>>1 & 0x01) * xtime(x)
    result ^= (y>>2 & 0x01) * xtime(xtime(x))
    result ^= (y>>3 & 0x01) * xtime(xtime(xtime(x)))
    result ^= (y>>4 & 0x01) * xtime(xtime(xtime(xtime(x))))
    return result & 0xff    #mask to cut off any overflow

static unsigned char xtime(unsigned char x)
{
    return ((x<<1) ^ (((x>>7) & 0x01) * 0x1b)); // 0x1b represents polynomial x^4 + x^3 + x + 1
}

// Arguments are parenthesized so that expressions can be passed safely
#define gmult(x, y)                                          \
    (((((y) & 0x01) * (x)) ^                                 \
      ((((y) >> 1) & 0x01) * xtime(x)) ^                     \
      ((((y) >> 2) & 0x01) * xtime(xtime(x))) ^              \
      ((((y) >> 3) & 0x01) * xtime(xtime(xtime(x)))) ^       \
      ((((y) >> 4) & 0x01) * xtime(xtime(xtime(xtime(x)))))) \
     & 0xff)
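The same shift-and-add idea can be written as a loop that handles any 8-bit multiplier rather than only the low five bits. This is a sketch of the general technique, not the routine used elsewhere in this article; the name `gf_mul` is my own, and the values checked are the worked multiplication examples from the FIPS AES documentation.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add, where "add" means XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a     # add the current multiple of a
        a <<= 1             # multiply a by x (i.e. by 2)
        if a & 0x100:
            a ^= 0x11b      # carry out of bit 7: reduce by x^8 + x^4 + x^3 + x + 1
        b >>= 1
    return result

print(format(gf_mul(0x57, 0x02), '02x'))  # ae
print(format(gf_mul(0x57, 0x13), '02x'))  # fe
```

The second product decomposes as 0x57 times (0x01 + 0x02 + 0x10), that is 0x57 XOR 0xae XOR 0x07, which is 0xfe.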

Alternatively, we could simply precompute each number that we anticipate multiplying by and store the results in a lookup-table. As I mentioned with the space-time tradeoff, this clearly takes up more space (256 bytes for each table) but does reduce clock cycles.

Multiplication by 0x02:


                0x00,0x02,0x04,0x06,0x08,0x0a,0x0c,0x0e,0x10,0x12,0x14,0x16,0x18,0x1a,0x1c,0x1e, 

                0x20,0x22,0x24,0x26,0x28,0x2a,0x2c,0x2e,0x30,0x32,0x34,0x36,0x38,0x3a,0x3c,0x3e, 

                0x40,0x42,0x44,0x46,0x48,0x4a,0x4c,0x4e,0x50,0x52,0x54,0x56,0x58,0x5a,0x5c,0x5e, 

                0x60,0x62,0x64,0x66,0x68,0x6a,0x6c,0x6e,0x70,0x72,0x74,0x76,0x78,0x7a,0x7c,0x7e, 

                0x80,0x82,0x84,0x86,0x88,0x8a,0x8c,0x8e,0x90,0x92,0x94,0x96,0x98,0x9a,0x9c,0x9e, 

                0xa0,0xa2,0xa4,0xa6,0xa8,0xaa,0xac,0xae,0xb0,0xb2,0xb4,0xb6,0xb8,0xba,0xbc,0xbe, 

                0xc0,0xc2,0xc4,0xc6,0xc8,0xca,0xcc,0xce,0xd0,0xd2,0xd4,0xd6,0xd8,0xda,0xdc,0xde, 

                0xe0,0xe2,0xe4,0xe6,0xe8,0xea,0xec,0xee,0xf0,0xf2,0xf4,0xf6,0xf8,0xfa,0xfc,0xfe, 

                0x1b,0x19,0x1f,0x1d,0x13,0x11,0x17,0x15,0x0b,0x09,0x0f,0x0d,0x03,0x01,0x07,0x05, 

                0x3b,0x39,0x3f,0x3d,0x33,0x31,0x37,0x35,0x2b,0x29,0x2f,0x2d,0x23,0x21,0x27,0x25, 

                0x5b,0x59,0x5f,0x5d,0x53,0x51,0x57,0x55,0x4b,0x49,0x4f,0x4d,0x43,0x41,0x47,0x45, 

                0x7b,0x79,0x7f,0x7d,0x73,0x71,0x77,0x75,0x6b,0x69,0x6f,0x6d,0x63,0x61,0x67,0x65, 

                0x9b,0x99,0x9f,0x9d,0x93,0x91,0x97,0x95,0x8b,0x89,0x8f,0x8d,0x83,0x81,0x87,0x85, 

                0xbb,0xb9,0xbf,0xbd,0xb3,0xb1,0xb7,0xb5,0xab,0xa9,0xaf,0xad,0xa3,0xa1,0xa7,0xa5, 

                0xdb,0xd9,0xdf,0xdd,0xd3,0xd1,0xd7,0xd5,0xcb,0xc9,0xcf,0xcd,0xc3,0xc1,0xc7,0xc5, 

                0xfb,0xf9,0xff,0xfd,0xf3,0xf1,0xf7,0xf5,0xeb,0xe9,0xef,0xed,0xe3,0xe1,0xe7,0xe5

Multiplication by 0x03:


                0x00,0x03,0x06,0x05,0x0c,0x0f,0x0a,0x09,0x18,0x1b,0x1e,0x1d,0x14,0x17,0x12,0x11,

                0x30,0x33,0x36,0x35,0x3c,0x3f,0x3a,0x39,0x28,0x2b,0x2e,0x2d,0x24,0x27,0x22,0x21,

                0x60,0x63,0x66,0x65,0x6c,0x6f,0x6a,0x69,0x78,0x7b,0x7e,0x7d,0x74,0x77,0x72,0x71,

                0x50,0x53,0x56,0x55,0x5c,0x5f,0x5a,0x59,0x48,0x4b,0x4e,0x4d,0x44,0x47,0x42,0x41,

                0xc0,0xc3,0xc6,0xc5,0xcc,0xcf,0xca,0xc9,0xd8,0xdb,0xde,0xdd,0xd4,0xd7,0xd2,0xd1,

                0xf0,0xf3,0xf6,0xf5,0xfc,0xff,0xfa,0xf9,0xe8,0xeb,0xee,0xed,0xe4,0xe7,0xe2,0xe1,

                0xa0,0xa3,0xa6,0xa5,0xac,0xaf,0xaa,0xa9,0xb8,0xbb,0xbe,0xbd,0xb4,0xb7,0xb2,0xb1,

                0x90,0x93,0x96,0x95,0x9c,0x9f,0x9a,0x99,0x88,0x8b,0x8e,0x8d,0x84,0x87,0x82,0x81,

                0x9b,0x98,0x9d,0x9e,0x97,0x94,0x91,0x92,0x83,0x80,0x85,0x86,0x8f,0x8c,0x89,0x8a,

                0xab,0xa8,0xad,0xae,0xa7,0xa4,0xa1,0xa2,0xb3,0xb0,0xb5,0xb6,0xbf,0xbc,0xb9,0xba,

                0xfb,0xf8,0xfd,0xfe,0xf7,0xf4,0xf1,0xf2,0xe3,0xe0,0xe5,0xe6,0xef,0xec,0xe9,0xea,

                0xcb,0xc8,0xcd,0xce,0xc7,0xc4,0xc1,0xc2,0xd3,0xd0,0xd5,0xd6,0xdf,0xdc,0xd9,0xda,

                0x5b,0x58,0x5d,0x5e,0x57,0x54,0x51,0x52,0x43,0x40,0x45,0x46,0x4f,0x4c,0x49,0x4a,

                0x6b,0x68,0x6d,0x6e,0x67,0x64,0x61,0x62,0x73,0x70,0x75,0x76,0x7f,0x7c,0x79,0x7a,

                0x3b,0x38,0x3d,0x3e,0x37,0x34,0x31,0x32,0x23,0x20,0x25,0x26,0x2f,0x2c,0x29,0x2a,

                0x0b,0x08,0x0d,0x0e,0x07,0x04,0x01,0x02,0x13,0x10,0x15,0x16,0x1f,0x1c,0x19,0x1a
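Both tables can be generated in a few lines rather than typed in by hand. The names `MUL2` and `MUL3` are assumptions of this sketch; multiplying by 3 is just multiplying by 2 and adding (XORing) the original byte.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8)."""
    b <<= 1
    return b ^ 0x11b if b & 0x100 else b

MUL2 = [xtime(b) for b in range(256)]
MUL3 = [xtime(b) ^ b for b in range(256)]   # 3*b = 2*b + b

# Print the multiply-by-2 table in the same 16-per-row layout as above.
for start in range(0, 256, 16):
    print(','.join(format(v, '#04x') for v in MUL2[start:start + 16]))
```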

With Galois field multiplication for small numbers in our arsenal, the implementation of MixColumns() is now fairly straightforward. What we have is:

def MixColumns():
    global state
    for col in range(4):
        # temporary variables
        a = state[col][0]
        b = state[col][1]
        c = state[col][2]
        d = state[col][3]
        
        state[col][0] = gmult(a, 2) ^ gmult(b, 3) ^ c ^ d
        state[col][1] = a ^ gmult(b, 2) ^ gmult(c, 3) ^ d
        state[col][2] = a ^ b ^ gmult(c, 2) ^ gmult(d, 3)
        state[col][3] = gmult(a, 3) ^ b ^ c ^ gmult(d, 2)

static void MixColumns() {
    unsigned char a, b, c, d, i; // unsigned, so bytes >= 0x80 are not sign-extended
    for (i = 0; i < 4; i++)
    {
        a = (*state)[i][0];
        b = (*state)[i][1];
        c = (*state)[i][2];
        d = (*state)[i][3];
    
        (*state)[i][0] = gmult(a, 2) ^ gmult(b, 3) ^ c ^ d;
        (*state)[i][1] = a ^ gmult(b, 2) ^ gmult(c, 3) ^ d;
        (*state)[i][2] = a ^ b ^ gmult(c, 2) ^ gmult(d, 3);
        (*state)[i][3] = gmult(a, 3) ^ b ^ c ^ gmult(d, 2);
    }
}
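As a sanity check, here is MixColumns() applied to a single column. The input d4 bf 5d 30 is the first column of the state after the first ShiftRows() in the worked example of the FIPS AES documentation, and it should come out as 04 66 81 e5. The sketch works on one column passed in as a list, and the helper names are mine.

```python
def mul2(b):
    """Multiply a byte by 2 in GF(2^8)."""
    b <<= 1
    return b ^ 0x11b if b & 0x100 else b

def mul3(b):
    """Multiply a byte by 3 in GF(2^8): 2*b + b."""
    return mul2(b) ^ b

def mix_column(column):
    """Multiply one 4-byte column by the MixColumns matrix."""
    a, b, c, d = column
    return [
        mul2(a) ^ mul3(b) ^ c ^ d,
        a ^ mul2(b) ^ mul3(c) ^ d,
        a ^ b ^ mul2(c) ^ mul3(d),
        mul3(a) ^ b ^ c ^ mul2(d),
    ]

mixed = mix_column([0xd4, 0xbf, 0x5d, 0x30])
print(' '.join(format(b, '02x') for b in mixed))  # 04 66 81 e5
```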

`AddRoundKey()`

Again, it's just what it sounds like. If you didn't feel like reading over the Key Schedule section but want to actually understand why this routine is vital to the cipher, then I can't be of much help: the key schedule makes the round keys which are used by AddRoundKey().

Alright, so hopefully you understand that there are 4 times as many round keys as there are rounds (plus 4 used for key-whitening). If this is still unclear to you, we reached this number because the state is 4 times the size of the round key (or conversely, each round key is 1/4 the size of the state). In reality, if we were to mesh the round keys together four at a time, we would have a full 128-bit round key for each round. If you recall from the SubBytes section, addition in $GF(2^8)$ is the same as bitwise exclusive-or XOR so we wind up with the following code:

def AddRoundKey(round):
    global state
    state = [[state[row][col] ^ RoundKey[16 * round + 4 * row + col] for col in range(4)] for row in range(4)]

static void AddRoundKey(unsigned char round)
{
    unsigned char row, col;
    for (row = 0; row < 4; row++)
        for (col = 0; col < 4; col++)
            (*state)[row][col] ^= RoundKey[16 * round  + 4 * row + col]; // Adds RoundKey byte by byte to the state
}
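To watch the key-whitening step on real numbers, the sketch below XORs the FIPS test key into the FIPS test plaintext (the same data used in the full listing at the end). It takes the state and a 16-byte round key as arguments instead of using globals, which is a choice made for this example only.

```python
def add_round_key(state, round_key):
    """XOR a 16-byte round key into a column-major 4x4 state; returns a new state."""
    return [[state[col][row] ^ round_key[4 * col + row] for row in range(4)]
            for col in range(4)]

state = [[0x32, 0x43, 0xf6, 0xa8], [0x88, 0x5a, 0x30, 0x8d],
         [0x31, 0x31, 0x98, 0xa2], [0xe0, 0x37, 0x07, 0x34]]
key = [0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6,
       0xab, 0xf7, 0x15, 0x88, 0x09, 0xcf, 0x4f, 0x3c]

whitened = add_round_key(state, key)
print(' '.join(format(b, '02x') for b in whitened[0]))  # 19 3d e3 be

# XOR is its own inverse: adding the same round key again restores the state.
print(add_round_key(whitened, key) == state)            # True
```

That self-inverse property is what lets decryption undo each AddRoundKey() by simply applying it again with the same round key.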

Encryption Code

Now that we have the main components of AES down, we're ready to put it all together in a cogent, continuous block. For all variants of AES, we first perform what is called key-whitening on the state: the AddRoundKey() function applies our key to the initial data. Following that, we perform exactly one less than the total number of rounds in the following format:


        

        SubBytes()

        ShiftRows()

        MixColumns()

        AddRoundKey(round)

Lastly, we perform one final round where we merely take out the MixColumns() function — and that's it! That's the AES block cipher. I hope you enjoyed reading this almost as much as I didn't enjoy writing the HTML. I will provide the additional content on Galois fields below and a much briefer section on decryption. However, if your goal was to understand the current most popular cipher, this is effectively the end. Thanks for reading! 😁

from __future__ import print_function
            
# Precomputed substitution layer
_SBox = [
0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,
0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15,
0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75,
0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84,
0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf,
0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8,
0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2,
0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73,
0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb,
0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79,
0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08,
0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,
0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e,
0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf,
0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16
]

__Nr = 10 # AES-128 in this example
__Nk = 4
# Test data provided from the FIPS AES documentation

_Key = [ 0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88, 0x09, 0xcf, 0x4f, 0x3c]
_state=[[0x32, 0x43, 0xf6, 0xa8], [0x88, 0x5a, 0x30, 0x8d], [0x31, 0x31, 0x98, 0xa2], [0xe0, 0x37, 0x07, 0x34]]
_RoundKey = [0 for i in range(16 * (__Nr + 1) )]

def RotWord(word):
    """ Takes a word (4 bytes) and cyclically rotates it one byte to the
    left"""
    return [word[(i + 1) % 4] for i in range(4)]

def SubWord(word):
    """ Takes each individual byte from a word and sends it through the 
    substitution layer"""
    return [_SBox[word[i]] for i in range(4)]

def KeySchedule():
    """ Rijndael key schedule for generating all round keys
    based off the original key"""

    # Since round constants are only used for the key schedule, we
    # keep them local to the method
    Rcon = [0x8d, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36, 0x6c, 0xd8, 0xab, 0x4d, 0x9a]

    # The first round key is the key itself
    for i in range(4 * __Nk):
        _RoundKey[i] = _Key[i]

    for i in range(__Nk, 4 * (__Nr + 1)):
        tempWord = [_RoundKey[4 * (i - 1) + j] for j in range(4)]

        # RotWord, SubWord, and the round constant are applied only to the
        # first word of each group of __Nk words
        if i % __Nk == 0:
            tempWord = SubWord(RotWord(tempWord))
            tempWord[0] ^= Rcon[i // __Nk]
        # AES-256 (__Nk = 8) applies an extra SubWord halfway through each group
        elif __Nk > 6 and i % __Nk == 4:
            tempWord = SubWord(tempWord)

        # set current round key
        for j in range(4):
            _RoundKey[4*i + j] = _RoundKey[4*(i-__Nk) + j] ^ tempWord[j]

def SubBytes():
    """ Force all data in the state matrix through the substitution layer """
    global _state
    _state = [[_SBox[_state[row][col]] for col in range(4)] for row in range(4)]

def ShiftRows():
    """ Perform a cyclic shift on each row dependent on the depth of the row"""
    global _state
    for i in range(1, 4):
        _state[0][i], _state[1][i], _state[2][i], _state[3][i] = \
        _state[i][i], _state[(i + 1) % 4][i], _state[(i + 2) % 4][i], _state[i - 1][i]

def xtime(x):
    return (x << 1) ^ (((x >> 7) & 0x01) * 0x1b)

def gmult(x, y):
    result = (y & 0x01) * x
    result ^= (y>>1 & 1) * xtime(x)
    result ^= (y>>2 & 1) * xtime(xtime(x))
    result ^= (y>>3 & 1) * xtime(xtime(xtime(x)))
    result ^= (y>>4 & 1) * xtime(xtime(xtime(xtime(x))))
    return result & 0xff   # return only the first byte of the calculation

def MixColumns():
    """ Linear transform over the Rijndael field applied to each column of the state """
    global _state
    for col in range(4):
        # temporary variables
        a = _state[col][0]
        b = _state[col][1]
        c = _state[col][2]
        d = _state[col][3]

        _state[col][0] = gmult(a, 2) ^ gmult(b, 3) ^ c ^ d
        _state[col][1] = a ^ gmult(b, 2) ^ gmult(c, 3) ^ d
        _state[col][2] = a ^ b ^ gmult(c, 2) ^ gmult(d, 3)
        _state[col][3] = gmult(a, 3) ^ b ^ c ^ gmult(d, 2)

def AddRoundKey(round):
    """Adds the round key generated for that specific round to the state""" 
    global _state
    _state = [[_state[row][col] ^ _RoundKey[16 * round + 4 * row + col] for col in range(4)] for row in range(4)]

def AESCipher():
    """ The complete AES forward encryption performed through byte-array calculations """
    
    # key-whitening
    AddRoundKey(0)
    
    for round in range(1, __Nr):
        SubBytes()
        ShiftRows()
        MixColumns()
        AddRoundKey(round)

    # final round
    SubBytes()
    ShiftRows()
    AddRoundKey(__Nr)

def printState():
    for i in range(4):
        for j in range(4):
            print(format(_state[j][i], '02x'), end=' ')
        print()
    print()
if __name__ == "__main__":
    printState()
    KeySchedule()
    AESCipher()
    printState()

#include <stdio.h>

/*****************************************************************************/
/* Function Declarations:                                                    */
/*****************************************************************************/
static void RotWord(unsigned char*);
static void SubWord(unsigned char*);
static void AddRoundKey(unsigned char);
static void SubBytes();
static void ShiftRows();
static unsigned char xtime(unsigned char);
static void MixColumns();
static void AESCipher();
static void KeySchedule();


/*****************************************************************************/
/* Variables:                                                                */
/*****************************************************************************/

#define N_ROUNDS 10

typedef unsigned char state_t[4][4];

static state_t* state;
static unsigned char RoundKey[16 * (N_ROUNDS + 1)];
static const unsigned char* Key;

static const unsigned char SBox[256] =   {
    //0     1    2      3     4    5     6     7      8    9     A      B    C     D     E     F
    0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
    0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,
    0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15,
    0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75,
    0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84,
    0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf,
    0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8,
    0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2,
    0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73,
    0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb,
    0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79,
    0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08,
    0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,
    0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e,
    0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf,
    0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16 };

/*****************************************************************************/
/* Main method:                                                              */
/*****************************************************************************/

int main() {
    
    int i;

    const unsigned char key[] = {0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88, 0x09, 0xcf, 0x4f, 0x3c};
    unsigned char in[] = {0x6b, 0xc1, 0xbe, 0xe2, 0x2e, 0x40, 0x9f, 0x96, 0xe9, 0x3d, 0x7e, 0x11, 0x73, 0x93, 0x17, 0x2a};
    state = (state_t*) in;    
    Key = key;

    KeySchedule();
    
    printf("0x");
    for (i = 15; i >= 0; --i)
        printf("%02x", in[i]);
    printf("\n");

    AESCipher();
    
    printf("0x");
    for (i = 15; i >= 0; --i)
        printf("%02x", in[i]);
    printf("\n");

    return 0;
}

/**
 * The complete AES forward encryption performed through byte-array calculations
 */
static void AESCipher() {
    
    // Key-whitening
    AddRoundKey(0);

    unsigned char round;
    for (round =  1; round < N_ROUNDS; ++round) {
        SubBytes();
        ShiftRows();
        MixColumns();
        AddRoundKey(round);
    }

    // final round
    SubBytes();
    ShiftRows();
    AddRoundKey(round);
}

/**
 * Function which takes a word (4 bytes) and performs a cyclic left
 * shift by one byte
 */
static void RotWord(unsigned char* word) {
     // assume that word is a char array of length 4
     unsigned char temp = word[0];
     word[0] = word[1];
     word[1] = word[2];
     word[2] = word[3];
     word[3] = temp;
}

/**
 * Takes each individual byte from a word and sends it through the 
 * substitution layer
 */
static void SubWord(unsigned char* word) {
     word[0] = SBox[word[0]];
     word[1] = SBox[word[1]];
     word[2] = SBox[word[2]];
     word[3] = SBox[word[3]];
}

/**
 * Rijndael key schedule for generating all round keys based off
 * the original key
 */
static void KeySchedule() {
 
    const unsigned char Rcon[] = {0x8d, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36, 0x6c, 0xd8, 0xab, 0x4d, 0x9a};
    unsigned char i;
    unsigned char tempWord[4];
    
    // The first round key is the key itself.
    for (i = 0; i < 4; ++i)
    {
        RoundKey[4 * i] = Key[4 * i];
        RoundKey[4 * i + 1] = Key[4 * i + 1];
        RoundKey[4 * i + 2] = Key[4 * i + 2];
        RoundKey[4 * i + 3] = Key[4 * i + 3];
    }
    // All other round keys are found from the previous round keys.
    for (; i < 4 * (N_ROUNDS + 1); ++i)
    {
        // copy the previous word (4 bytes) of the expanded key into tempWord
        tempWord[0] = RoundKey[4 * (i - 1)];
        tempWord[1] = RoundKey[4 * (i - 1) + 1];
        tempWord[2] = RoundKey[4 * (i - 1) + 2];
        tempWord[3] = RoundKey[4 * (i - 1) + 3];
        
        // each round key is four words long, so RotWord, SubWord and Rcon
        // are only applied to the first word of every round key
        if (i % 4 == 0)
        {
            RotWord(tempWord);
            SubWord(tempWord);
            
            tempWord[0] ^= Rcon[i / 4]; // Rcon is only one byte, so all other bytes of the word are XORed with zero (the identity)
        }

        // set current round key                           
        RoundKey[i * 4] = RoundKey[(i - 4) * 4 ] ^ tempWord[0];
        RoundKey[i * 4 + 1] = RoundKey[(i - 4) * 4 + 1] ^ tempWord[1];
        RoundKey[i * 4 + 2] = RoundKey[(i - 4) * 4 + 2] ^ tempWord[2];
        RoundKey[i * 4 + 3] = RoundKey[(i - 4) * 4 + 3] ^ tempWord[3];
    }
}

/**
 * Force all data in the state matrix through the substitution layer
 */
static void SubBytes() {
    unsigned char row, col;
    for (row = 0; row < 4; row++)
        for (col = 0; col < 4; col++)
            (*state)[row][col] = SBox[(*state)[row][col]]; // SBox is array above
}

/**
 * Performs a cyclic shift on each row dependent on the depth
 * of the row
 */
static void ShiftRows() {
    unsigned char temp;

    // Rotate first row 1 columns to left
    temp           = (*state)[0][1];
    (*state)[0][1] = (*state)[1][1];
    (*state)[1][1] = (*state)[2][1];
    (*state)[2][1] = (*state)[3][1];
    (*state)[3][1] = temp;

    // Rotate second row 2 columns to left
    temp           = (*state)[0][2];
    (*state)[0][2] = (*state)[2][2];
    (*state)[2][2] = temp;
    temp       = (*state)[1][2];
    (*state)[1][2] = (*state)[3][2];
    (*state)[3][2] = temp;

    // Rotate third row 3 columns to left
    temp       = (*state)[0][3];
    (*state)[0][3] = (*state)[3][3];
    (*state)[3][3] = (*state)[2][3];
    (*state)[2][3] = (*state)[1][3];
    (*state)[1][3] = temp;
}

static unsigned char xtime(unsigned char x) {
    return ((x<<1) ^ (((x>>7) & 1) * 0x1b)); // 0x1b represents polynomial x^4 + x^3 + x + 1
}

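// Multiplies x by y (y < 32) in the Rijndael field; arguments are
// parenthesized so that expressions can be passed safely
#define gmult(x, y)                                      \
    (((((y) & 1) * (x)) ^                                \
      ((((y) >> 1) & 1) * xtime(x)) ^                    \
      ((((y) >> 2) & 1) * xtime(xtime(x))) ^             \
      ((((y) >> 3) & 1) * xtime(xtime(xtime(x)))) ^      \
      ((((y) >> 4) & 1) * xtime(xtime(xtime(xtime(x))))) \
     ) & 0xff)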

/**
 * Linear transformation (matrix multiplication) in the Rijndael field on each column of the state
 */
static void MixColumns() {
    unsigned char a,b,c,d, i;
    for (i = 0; i < 4; i++)
    {
        a = (*state)[i][0];
        b = (*state)[i][1];
        c = (*state)[i][2];
        d = (*state)[i][3];
    
        (*state)[i][0] = gmult(a, 2) ^ gmult(b, 3) ^ c ^ d;
        (*state)[i][1] = a ^ gmult(b, 2) ^ gmult(c, 3) ^ d;
        (*state)[i][2] = a ^ b ^ gmult(c, 2) ^ gmult(d, 3);
        (*state)[i][3] = gmult(a, 3) ^ b ^ c ^ gmult(d, 2);
    }
}

/**
 * Adds the round key generated for that specific round to the state
 */
static void AddRoundKey(unsigned char round) {
    unsigned char row, col;
    for (row = 0; row < 4; row++)
        for (col = 0; col < 4; col++)
            (*state)[row][col] ^= RoundKey[16 * round  + 4 * row + col]; // Adds RoundKey byte by byte to the state
}

Additional Content: $GF(2^8)$

Oh, it's you again.

Alright, you've made it this far so there's only a little bit left to understand: what is this $GF(2^8)$ symbol that we keep on seeing? Well, for starters, that is the Galois Field of order 256 — the number of values an unsigned byte (8 bits, hence the 8) can represent. But what is a field? For anyone pursuing pure mathematics, you can jump to the next paragraph. For everyone else, what I'm about to go over may be a bit of an abstract concept.

Imagine you have some set of objects and you want to know if you can perform standard mathematics on those objects in a similar fashion to the way most people use addition and subtraction on numbers. To find this out, you need to be a little more specific: what kind of mathematics do you want to do?

Addition and subtraction? You want a Group
Multiplication? A Ring
Multiplication AND Division? You'll need a Field (THIS IS US)
Integration? Measure Space
Differentiation? A ( $ C^1$ at minimum, but ideally smooth) Manifold
Fourier Analysis? A Hilbert Space

Okay back on topic — a Galois Field, which is also known as a finite field, is basically what the second name suggests. It is a finite set of objects that supports addition, subtraction, multiplication, and division. According to an important theorem, any Galois field must be of order $p^k$ for some prime $p$ and some positive integer $k$. Thus, the elements of our Galois field are essentially just numbers represented by some prime base. That said, all is well for our Rijndael field (the technical name for $GF(2^8)$ ) since 2 is prime and 8 is positive.

I'm not actually going to explain Galois fields only in the context of the Rijndael field $GF(2^8) $ in this section, but rather abstractly, for some prime $p$ and positive integer $k$.

Suppose $a \in GF(p^k) $. Then we can represent the element $a$ in base $p$ as follows: $$ a = a_{k-1}p^{k-1} \oplus a_{k-2}p^{k-2} \oplus \dots \oplus a_{1}p \oplus a_0$$ with coefficients $a_i \in \{0, 1, \dots, p-1 \} $ for $0 \leq i \leq k-1 $ (where $ \oplus $ is the binary operation replacing addition in this field). Now let's take a second to see what's going on here - think about binary, specifically a byte in our Rijndael field. Well, our byte is some collection of 1's and 0's... but that makes sense, as our prime is 2 so our coefficients can only be $a_i \in \{ 0, 1\}$! What about ternary numbers in the Galois field $ GF(3^4) $? In that scenario, our coefficients can only be 0, 1, or 2. Assuming our binary operation is just normal addition, how would we translate the element 2102 in $ GF(3^4) $? Well, going back to the rationale above, this is really $$ 2\cdot 3^{4-1} + 1 \cdot 3^{4-2} + 0 \cdot 3^{4-3} + 2\cdot3^{4-4} = 65 $$ And there you have it, 2102 is the representation of 65 in $ GF(3^4) $.
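As a quick sanity check of the base-$p$ bookkeeping above, here is a minimal Python sketch. The helper names `digits_to_int` and `int_to_digits` are made up for illustration (they are not part of the AES code in this article), and each digit is a single character, so this only covers $p \leq 10$.

```python
def digits_to_int(digits, p):
    """Interpret a string of base-p digits as an ordinary integer."""
    value = 0
    for ch in digits:
        d = int(ch)
        if not 0 <= d < p:
            raise ValueError("digit %d is not valid in base %d" % (d, p))
        value = value * p + d
    return value


def int_to_digits(value, p, k):
    """Write an integer as exactly k base-p digits, most significant first."""
    digits = []
    for _ in range(k):
        digits.append(str(value % p))
        value //= p
    return "".join(reversed(digits))


print(digits_to_int("2102", 3))   # 2*27 + 1*9 + 0*3 + 2 = 65
print(int_to_digits(65, 3, 4))    # 2102
```

The same two helpers with `p = 2` and `k = 8` give the familiar bit pattern of a byte.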

But why is this any better? Why can't we just represent 65 as 65 (in decimal) and be content with it? Well there are two main points as to why not:

Computers have historically not represented numbers that way. A bit is really an electrical charge in a capacitor, which is always in one of two states (i.e. 0 for low charge, 1 for high charge).

The second (and better) answer is what I'm about to explain in the next paragraph: binary operations in the Galois Field are not the same as they are over the integers. By binary operations, I'm referring to multiplication and addition (but in their modified format).

I'll begin with the easier of the two binary operations: addition in $GF(p^k)$. When we add two coefficients, say $a_i$ and $b_i$, we're actually adding them in the coefficient Group of order $p$. According to Lagrange's Theorem, since this group is of prime order it must be cyclic (a straightforward proof). Thus, addition of coefficients is much like modular arithmetic. For example, consider $GF(5^3) $ with the two numbers $ 413 $ and $ 232 $:

\[ \begin{array}{r} 413 \\ +\ 232 \\ \hline 140 \end{array} \]

At this point you may be telling yourself, "That's just objectively wrong: 3 + 2 is 5 not 0." Well, you'd be right if this were the integers. However, as I mentioned earlier, the addition is like modular arithmetic with respect to our base prime (which is 5). Thus, we are actually evaluating the expression (3 + 2) mod 5, which does happen to be equal to 0. The same rationale applies to each digit.
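The digit-by-digit rule can be sketched in a few lines of Python. `gf_add` is a hypothetical helper (not part of the AES code above) that takes two digit strings of equal length and a prime $p \leq 10$. Note that, unlike grade-school addition, nothing is ever carried into the next digit.

```python
def gf_add(a, b, p):
    """Add two equal-length base-p digit strings digit by digit, modulo p."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of digits")
    # each digit is handled on its own: no carry between positions
    return "".join(str((int(x) + int(y)) % p) for x, y in zip(a, b))


print(gf_add("413", "232", 5))  # 140
```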

For the sake of completeness, I mentioned earlier that addition in $ GF(2^8)$ is equivalent to the exclusive-or (XOR) operation. This is pretty easy to show — recall that XOR works in the following way:

P	Q	P ^ Q
TRUE(1)	TRUE(1)	FALSE(0)
TRUE(1)	FALSE(0)	TRUE(1)
FALSE(0)	TRUE(1)	TRUE(1)
FALSE(0)	FALSE(0)	FALSE(0)

Well whenever our base prime is 2 (i.e. $GF(2^k)$ ), we essentially have the same behavior since all coefficients are either 1 (true) or 0 (false) and $ 1 \oplus 1 = (1 + 1) \text{mod} 2 = 0 $.
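If you'd rather not take the truth table on faith, the claim is small enough to check exhaustively for bytes. The sketch below uses a made-up helper, `add_mod2_bitwise`, which adds the coefficients of two bytes one at a time modulo 2 and compares the result with Python's built-in `^` operator.

```python
def add_mod2_bitwise(a, b, k=8):
    """Add two k-bit values coefficient by coefficient, modulo 2."""
    result = 0
    for i in range(k):
        coefficient_sum = ((a >> i) & 1) + ((b >> i) & 1)
        result |= (coefficient_sum % 2) << i
    return result


# exhaustive check over every pair of bytes
assert all(add_mod2_bitwise(a, b) == a ^ b
           for a in range(256) for b in range(256))

print(hex(add_mod2_bitwise(0x57, 0x83)))  # 0xd4
```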

An even more helpful fact (which we build upon in the decryption stage) is that addition and subtraction are the same exact operation in $ GF(2^k) $! In any group-theoretical setting, subtraction is simply addition applied to the inverse element. For example, given some integer $n$, subtraction by a second integer $m$ is actually the same as addition by $-m$ (which is $m$'s inverse with respect to addition). For any group, we have the trivial identity $ g \ominus g = g \oplus g^{-1} = e $ where $e$ is the identity (which is 0 for the integers under addition). However, if you notice from the table above, every digit is its own inverse! Therefore, we also have

$$ g \ominus g = g \oplus g^{-1} = g \oplus g = e $$

In the more general setting of $GF(p^k)$, consider the numbers 12 and 34 in $ GF(5^2) $ and suppose we want to find the answer to $ y = 12 \ominus 34 $. Then by associativity, we also have that our answer must satisfy $ y \oplus 34 = 12 $. The catch here is that numbers wrap back around when they become too large, so we can reverse engineer the solution to be 33 (since $ 3 \oplus 3 = (3 + 3)\, \text{mod}\, 5 = 1 $ and $ 3 \oplus 4 = (3 + 4)\, \text{mod}\, 5 = 2$ ).
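Instead of reverse engineering the answer, we can subtract digit by digit and let the modulus do the wrapping. This is only a sketch: `gf_add` and `gf_sub` are hypothetical helpers working on digit strings with $p \leq 10$, not functions from the AES code in this article.

```python
def gf_add(a, b, p):
    """Digit-wise addition modulo p (no carries)."""
    return "".join(str((int(x) + int(y)) % p) for x, y in zip(a, b))


def gf_sub(a, b, p):
    """Digit-wise subtraction modulo p (no borrows)."""
    # Python's % always returns a value in 0..p-1, even for negative inputs
    return "".join(str((int(x) - int(y)) % p) for x, y in zip(a, b))


y = gf_sub("12", "34", 5)
print(y)                    # 33
print(gf_add(y, "34", 5))   # 12, so y really does satisfy y + 34 = 12

# with p = 2, subtracting and adding give the same answer
print(gf_sub("1011", "0110", 2) == gf_add("1011", "0110", 2))  # True
```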

Now comes the hard part: multiplication. Conceptually it's not much different, in that we choose a number called our irreducible polynomial (the irreducible polynomial in AES is $ x^8 + x^4 + x^3 + x + 1 $), which we use for the same purpose of allowing the result to wrap back around when it becomes large. The difficulty here lies in the fact that our coefficients collectively affect each coefficient of the other term. When we perform regular multiplication, we are only really working with two coefficients; on the other hand, when we multiply two polynomials, each term of the first polynomial affects each term of the second polynomial (what most middle-schoolers learn as FOIL). Galois Field multiplication works exactly the same way as polynomial multiplication, and for that reason many people actually choose to represent information in polynomial format when it comes time for calculations.

For example, let's consider multiplication in $ GF(3^3) $ with irreducible polynomial $1102 = x^3 + x^2 + 2 $. Take the numbers 201 and 11; translating them into polynomials, we get $ 2x^2 + 1 $ and $x + 1$. By the distributive property (i.e. FOIL), we get $$ (2x^2 + 1)(x + 1) = 2x^3 + 2x^2 + x + 1 $$ However, this answer is too large for $GF(3^3)$ so we must take the answer modulo $x^3 + x^2 + 2$ (using polynomial division) to get $$ (2x^3 + 2x^2 + x + 1) \text{mod} (x^3 + x^2 + 2) = x$$ which is 10 in ternary and translates to 3. Thus, we find that in $ GF(3^3) $ with irreducible polynomial $x^3 + x^2 + 2$, the equation $ 19 \otimes 4 = 3 $ holds (I translated from ternary back to base 10). Neat stuff, huh?
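The worked example above can be reproduced with a short Python sketch. Everything here is illustrative: `to_poly`, `from_poly` and `poly_mul_mod` are names I made up, polynomials are stored as coefficient lists with index $i$ holding the coefficient of $x^i$, and the modulus is assumed to be monic (leading coefficient 1).

```python
def to_poly(n, p):
    """Split an integer into base-p coefficients, lowest degree first."""
    coeffs = []
    while n:
        coeffs.append(n % p)
        n //= p
    return coeffs or [0]


def from_poly(coeffs, p):
    """Inverse of to_poly: evaluate the coefficient list at x = p."""
    return sum(c * p ** i for i, c in enumerate(coeffs))


def poly_mul_mod(a, b, modulus, p):
    """Multiply a and b over GF(p), then reduce modulo a monic polynomial."""
    # schoolbook (FOIL) multiplication with coefficients reduced mod p
    product = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            product[i + j] = (product[i + j] + ai * bj) % p

    # polynomial long division: cancel the highest terms one at a time
    deg = len(modulus) - 1
    for i in range(len(product) - 1, deg - 1, -1):
        factor = product[i]
        if factor:
            for j in range(deg + 1):
                product[i - deg + j] = (product[i - deg + j] - factor * modulus[j]) % p

    remainder = product[:deg]
    return remainder + [0] * (deg - len(remainder))


modulus = to_poly(38, 3)   # 1102 in ternary, i.e. x^3 + x^2 + 2
result = poly_mul_mod(to_poly(19, 3), to_poly(4, 3), modulus, 3)
print(result)                # [0, 1, 0], i.e. the polynomial x
print(from_poly(result, 3))  # 3
```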

As you can see, it is much more computationally intensive to do multiplication in $GF(2^8)$ than it is to do regular multiplication over the integers $\mathbb{Z}$. To see why the SBox from SubBytes() provides a level of non-linearity, recall that the SBox performs the calculation $$ \tilde{b_i} = (b^{-1})_i \oplus (b^{-1})_{(i + 4)\mathrm{mod} 8} \oplus (b^{-1})_{(i + 5)\mathrm{mod} 8} \oplus (b^{-1})_{(i + 6)\mathrm{mod} 8} \oplus (b^{-1})_{(i + 7)\mathrm{mod} 8} \oplus c_i $$ in $GF(2^8)$. After experiencing for yourself how much more complex multiplication is in a finite field, it shouldn't be hard to see that Galois Field arithmetic prevents cryptanalysts from representing the cipher as a system of even thousands of linear equations.
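To make the non-linearity a bit more tangible, here is a rough Python sketch of multiplication and inversion in $GF(2^8)$. The names `gf256_mul` and `gf256_inv` are hypothetical (the article's own routine is gmult()), and the inverse is found by brute force purely for readability. The point of the last lines is that inversion does not distribute over XOR, which is exactly what a linear map would have to do.

```python
def gf256_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # add the current multiple of a
        high_bit = a & 0x80
        a = (a << 1) & 0xff      # multiply a by x
        if high_bit:
            a ^= 0x1b            # wrap around using the irreducible polynomial
        b >>= 1
    return result


def gf256_inv(a):
    """Multiplicative inverse by exhaustive search; 0 is mapped to 0."""
    if a == 0:
        return 0
    return next(x for x in range(1, 256) if gf256_mul(a, x) == 1)


print(hex(gf256_mul(0x57, 0x83)))                 # 0xc1
print(hex(gf256_inv(0x01 ^ 0x02)))                # 0xf6
print(hex(gf256_inv(0x01) ^ gf256_inv(0x02)))     # 0x8c, not the same thing
```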

Additional Content: Decryption

It's likely become evident to you at this point, but for the reader who hasn't put too much thought into it: how do you plan to get back your data now that it has gone through a complex encryption? Sure, you have the key, but I never once told you that the cipher above is its own inverse (because it isn't). Building upon earlier analogies, I have given you a way to put your valuables in a safe, but have not told you anything about how to get them out of the safe. I chose to put this section off until the end due to the fact that many of the inverse operations in AES' decryption require knowledge of $ GF(2^8) $ from the previous section.

`InvSubBytes()`

Proceeding in the same order as encryption, the first routine that we wish to find an inverse for is SubBytes(). If you recall, the non-linearity of the SBox came from the transformation: $$ \tilde{b_i} = (b^{-1})_i \oplus (b^{-1})_{(i + 4)\mathrm{mod} 8} \oplus (b^{-1})_{(i + 5)\mathrm{mod} 8} \oplus (b^{-1})_{(i + 6)\mathrm{mod} 8} \oplus (b^{-1})_{(i + 7)\mathrm{mod} 8} \oplus c_i $$ which can be done in two steps by first taking the inverse in $ GF(2^8)$ and then applying the affine transformation $$ \tilde{b_i} = b_i \oplus b_{(i + 4)\mathrm{mod} 8} \oplus b_{(i + 5)\mathrm{mod} 8} \oplus b_{(i + 6)\mathrm{mod} 8} \oplus b_{(i + 7)\mathrm{mod} 8} \oplus c_i $$ To undo this mathematically, we would have to work backwards: first invert the affine transformation to recover the bits $b_{(i + j)\text{mod}8}$ (i.e. the byte as it looked right after the field inversion), and then take the inverse in $ GF(2^8) $ once more to recover the byte from before the substitution layer. That is doable, but it is a lot of arithmetic to repeat for every single byte.

Fortunately, we have a giant SBox which stores all the results of our SubBytes() mapping. Even more fortunate, our SBox has no collisions (i.e. it's injective) and maps to every possible byte value (i.e. it's surjective). For those who have a couple weeks of linear algebra under their belt, it should be clear that an inverse exists. To construct this inverse, just record, for each output byte, the input byte that maps to it. Luckily, someone already did the work for you:


      0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38, 0xbf, 0x40, 0xa3, 0x9e, 0x81, 0xf3, 0xd7, 0xfb,
      0x7c, 0xe3, 0x39, 0x82, 0x9b, 0x2f, 0xff, 0x87, 0x34, 0x8e, 0x43, 0x44, 0xc4, 0xde, 0xe9, 0xcb,
      0x54, 0x7b, 0x94, 0x32, 0xa6, 0xc2, 0x23, 0x3d, 0xee, 0x4c, 0x95, 0x0b, 0x42, 0xfa, 0xc3, 0x4e,
      0x08, 0x2e, 0xa1, 0x66, 0x28, 0xd9, 0x24, 0xb2, 0x76, 0x5b, 0xa2, 0x49, 0x6d, 0x8b, 0xd1, 0x25,
      0x72, 0xf8, 0xf6, 0x64, 0x86, 0x68, 0x98, 0x16, 0xd4, 0xa4, 0x5c, 0xcc, 0x5d, 0x65, 0xb6, 0x92,
      0x6c, 0x70, 0x48, 0x50, 0xfd, 0xed, 0xb9, 0xda, 0x5e, 0x15, 0x46, 0x57, 0xa7, 0x8d, 0x9d, 0x84,
      0x90, 0xd8, 0xab, 0x00, 0x8c, 0xbc, 0xd3, 0x0a, 0xf7, 0xe4, 0x58, 0x05, 0xb8, 0xb3, 0x45, 0x06,
      0xd0, 0x2c, 0x1e, 0x8f, 0xca, 0x3f, 0x0f, 0x02, 0xc1, 0xaf, 0xbd, 0x03, 0x01, 0x13, 0x8a, 0x6b,
      0x3a, 0x91, 0x11, 0x41, 0x4f, 0x67, 0xdc, 0xea, 0x97, 0xf2, 0xcf, 0xce, 0xf0, 0xb4, 0xe6, 0x73,
      0x96, 0xac, 0x74, 0x22, 0xe7, 0xad, 0x35, 0x85, 0xe2, 0xf9, 0x37, 0xe8, 0x1c, 0x75, 0xdf, 0x6e,
      0x47, 0xf1, 0x1a, 0x71, 0x1d, 0x29, 0xc5, 0x89, 0x6f, 0xb7, 0x62, 0x0e, 0xaa, 0x18, 0xbe, 0x1b,
      0xfc, 0x56, 0x3e, 0x4b, 0xc6, 0xd2, 0x79, 0x20, 0x9a, 0xdb, 0xc0, 0xfe, 0x78, 0xcd, 0x5a, 0xf4,
      0x1f, 0xdd, 0xa8, 0x33, 0x88, 0x07, 0xc7, 0x31, 0xb1, 0x12, 0x10, 0x59, 0x27, 0x80, 0xec, 0x5f,
      0x60, 0x51, 0x7f, 0xa9, 0x19, 0xb5, 0x4a, 0x0d, 0x2d, 0xe5, 0x7a, 0x9f, 0x93, 0xc9, 0x9c, 0xef,
      0xa0, 0xe0, 0x3b, 0x4d, 0xae, 0x2a, 0xf5, 0xb0, 0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61,
      0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26, 0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d
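If you would like to convince yourself where the table above comes from, the sketch below rebuilds the forward SBox from the two steps described earlier (field inverse, then the affine transformation with the constant $c$ = 0x63 from the AES specification) and then flips it around. All function names here are hypothetical, and the brute-force inverse is chosen for clarity rather than speed.

```python
def gf256_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        high_bit = a & 0x80
        a = (a << 1) & 0xff
        if high_bit:
            a ^= 0x1b
        b >>= 1
    return result


def gf256_inv(a):
    """Multiplicative inverse by exhaustive search; 0 is mapped to 0."""
    if a == 0:
        return 0
    return next(x for x in range(1, 256) if gf256_mul(a, x) == 1)


def affine(byte):
    """b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i, indices mod 8."""
    out = 0
    for i in range(8):
        bit = (byte >> i) & 1
        for shift in (4, 5, 6, 7):
            bit ^= (byte >> ((i + shift) % 8)) & 1
        bit ^= (0x63 >> i) & 1
        out |= bit << i
    return out


sbox = [affine(gf256_inv(b)) for b in range(256)]

# invert the mapping: for each output byte, remember which input produced it
inv_sbox = [0] * 256
for input_byte, output_byte in enumerate(sbox):
    inv_sbox[output_byte] = input_byte

print([hex(v) for v in inv_sbox[:4]])  # ['0x52', '0x9', '0x6a', '0xd5']
```

The first four entries match the start of the table, and `inv_sbox[sbox[b]] == b` holds for every byte `b`.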

`InvShiftRows()`

If this one isn't obvious, recall that our original ShiftRows() simply shifted the first row cyclically to the left by one, shifted the second row cyclically to the left by two, etc. Well, just replace left by right and you're golden.

def InvShiftRows():
    """Cyclically shifts row i of the state to the right by i columns"""
    global _state
    for i in range(1, 4):
        _state[0][i], _state[1][i], _state[2][i], _state[3][i] = \
        _state[(4-i)%4][i], _state[(5-i)%4][i], _state[(6-i)%4][i], _state[(7-i)%4][i]

static void InvShiftRows() {
    unsigned char temp;
    // Rotate first row 1 columns to right
    temp           = (*state)[3][1];
    (*state)[3][1] = (*state)[2][1];
    (*state)[2][1] = (*state)[1][1];
    (*state)[1][1] = (*state)[0][1];
    (*state)[0][1] = temp;

    // Rotate second row 2 columns to right
    temp           = (*state)[0][2];
    (*state)[0][2] = (*state)[2][2];
    (*state)[2][2] = temp;
    temp           = (*state)[1][2];
    (*state)[1][2] = (*state)[3][2];
    (*state)[3][2] = temp;

    // Rotate third row 3 columns to right (the same as 1 column to left)
    temp           = (*state)[0][3];
    (*state)[0][3] = (*state)[1][3];
    (*state)[1][3] = (*state)[2][3];
    (*state)[2][3] = (*state)[3][3];
    (*state)[3][3] = temp;
}

`InvMixColumns()`

After all that we've gone over, we're finally at what I consider to be the make-or-break point of understanding the AES cipher. That is, you either get why this works or you haven't a damn clue and you're just copying and pasting code at this point (no shame in it).

Our original MixColumns() function was simply matrix multiplication, so it makes sense that the inverse function is simply the inverse of the matrix! Easy right? Not so much. The original matrix multiplication was in $GF(2^8)$, so unfortunately I'm going to have to jump around a bit and leave the reader to some research on how the Extended Euclidean Algorithm fits in.

Before all that noise, it's time for a little more detail on our original MixColumns() since we have a basic knowledge of $GF(2^8) $. Suppose we have two polynomials $ a(x) = a_3x^3 + a_2x^2 + a_1x + a_0 $ and $ b(x) = b_3x^3 + b_2x^2 + b_1x + b_0 $ and we want to multiply them. From high school algebra, we would simply get $$ a(x) \otimes b(x) = a_3b_3 x^6 + (a_2b_3 + a_3b_2)x^5 + (a_1b_3 + a_3b_1 + a_2b_2)x^4 \\+ (a_0b_3 + a_3b_0 + a_1b_2 + a_2b_1)x^3 \\+ (a_2b_0 + a_1b_1 + a_0b_2)x^2 \\+ (a_1b_0 + a_0b_1)x + a_0b_0 $$ What the developers of the Rijndael cipher (more commonly known as AES) did was they chose the reduction polynomial $x^4 + 1$ for the MixColumns() step. It is not actually irreducible over $GF(2^8)$ (there, $x^4 + 1 = (x + 1)^4$), but it has the nice property that $x^i \text{mod}\,(x^4 + 1) = x^{i\, \text{mod} 4} $. Thus, our new polynomial multiplication simplifies to $$ a(x) \otimes b(x) = (a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3)x^3 + (a_2b_0 + a_1b_1 + a_0b_2 + a_3b_3)x^2 \\+ (a_1b_0 + a_0b_1 + a_3b_2 + a_2b_3)x \\+ (a_0b_0 + a_3b_1 + a_2b_2 + a_1b_3)$$ which we can represent by matrix multiplication as $$ \begin{pmatrix} a_0 & a_3 & a_2 & a_1 \\ a_1 & a_0 & a_3 & a_2 \\ a_2 & a_1 & a_0 & a_3 \\ a_3 & a_2 & a_1 & a_0 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} $$ Therefore, we come to the new conclusion that the MixColumns() step is actually multiplying each column by the polynomial $ a(x) = 3x^3 + x^2 + x + 2 $. In this case, it's no longer necessary to brute-force the inverse matrix in $ GF(2^8) $; we can instead find the inverse polynomial.
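The equivalence between "multiply by $a(x)$ modulo $x^4 + 1$" and "multiply by the circulant matrix" is easy to check numerically. The sketch below is only an illustration with made-up helper names (`gf256_mul`, `poly_mul_mod`, `circulant_mul`); coefficient lists are ordered so that index $i$ holds the coefficient of $x^i$, and the test column `db 13 53 45` is a commonly quoted MixColumns test vector.

```python
def gf256_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        high_bit = a & 0x80
        a = (a << 1) & 0xff
        if high_bit:
            a ^= 0x1b
        b >>= 1
    return result


def poly_mul_mod(a, b):
    """Multiply a(x) * b(x) modulo x^4 + 1 with coefficients in GF(2^8)."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            # x^(i+j) wraps around to x^((i+j) mod 4)
            out[(i + j) % 4] ^= gf256_mul(a[i], b[j])
    return out


def circulant_mul(a, b):
    """Multiply the column b by the circulant matrix built from a."""
    out = []
    for row in range(4):
        acc = 0
        for col in range(4):
            acc ^= gf256_mul(a[(row - col) % 4], b[col])
        out.append(acc)
    return out


a = [0x02, 0x01, 0x01, 0x03]        # a(x) = 3x^3 + x^2 + x + 2
column = [0xdb, 0x13, 0x53, 0x45]   # one column of the state

assert poly_mul_mod(a, column) == circulant_mul(a, column)
print([format(v, "02x") for v in poly_mul_mod(a, column)])  # ['8e', '4d', 'a1', 'bc']
```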

In order to find the inverse polynomial, you can either apply the Extended Euclidean Algorithm mentioned above, or use a nice little algebraic trick. The polynomials with coefficients in $ GF(2^8) $, taken modulo $ x^4 + 1 $, do not form a field (or even an integral domain, since $ x^4 + 1 = (x + 1)^4 $ over $ GF(2^8) $), but the invertible ones among them do form a finite group under multiplication. A polynomial is invertible exactly when its four coefficients do not XOR to zero; in other words, there are 256 degrees of freedom for any three of the coefficients and 255 degrees of freedom for the last coefficient. Now for any finite group of order $ r $, Lagrange's Theorem tells us that $ g^r = e$ where $ g $ is any element and $e$ is the identity. Thus, to find the inverse of $ g $ we utilize the simple fact that $g \cdot g^{r-1} = g^r = e$, so $g^{r-1}$ must be the multiplicative inverse. Therefore, (in our group) the inverse of any invertible polynomial $ g(x) $ is $ (g(x))^{4278190079} $ (since $255 \cdot 256^3 = 4278190080$). However, to use this approach, you would need to be somewhat savvy with a mathematical programming language such as Maple.

Hopefully you don't actually go and write a Maple program to compute this, because the AES document specifies that the inverse polynomial of $a(x) = 3x^3 + x^2 + x + 2 $ is $ a^{-1}(x) = 11x^3 + 13x^2 + 9x + 14 $. Really dodged a bullet there.
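For the curious, both claims can be checked without Maple. The following Python sketch (helper names `gf256_mul`, `poly_mul_mod` and `poly_pow` are my own, not from the AES document) first multiplies $a(x)$ by the stated inverse, and then computes $a(x)^{4278190079}$ by square-and-multiply, which only takes about 32 squarings. Coefficient lists hold the coefficient of $x^i$ at index $i$.

```python
def gf256_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        high_bit = a & 0x80
        a = (a << 1) & 0xff
        if high_bit:
            a ^= 0x1b
        b >>= 1
    return result


def poly_mul_mod(a, b):
    """Multiply a(x) * b(x) modulo x^4 + 1 with coefficients in GF(2^8)."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gf256_mul(a[i], b[j])
    return out


def poly_pow(g, exponent):
    """Raise g(x) to a non-negative power using square-and-multiply."""
    result = [1, 0, 0, 0]   # the constant polynomial 1
    base = list(g)
    while exponent:
        if exponent & 1:
            result = poly_mul_mod(result, base)
        base = poly_mul_mod(base, base)
        exponent >>= 1
    return result


a = [0x02, 0x01, 0x01, 0x03]       # 3x^3 + x^2 + x + 2
a_inv = [0x0e, 0x09, 0x0d, 0x0b]   # 11x^3 + 13x^2 + 9x + 14

print(poly_mul_mod(a, a_inv))                   # [1, 0, 0, 0]
print(poly_pow(a, 255 * 256 ** 3 - 1) == a_inv)  # True
```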

With that said, our InvMixColumns() step can now be represented by the transform: $$ \begin{pmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{pmatrix} \begin{pmatrix} p_0 & p_4 & p_8 & p_{12} \\ p_1 & p_5 & p_9 & p_{13} \\ p_2 & p_6 & p_{10} & p_{14} \\ p_3 & p_7 & p_{11} & p_{15} \end{pmatrix} $$ As before you can feel free to use a lookup table for the matrix instead of using the gmult() function (note, however, that this will use up 1 kB of ROM):

static const unsigned char Mult9[256] = {
    0x00,0x09,0x12,0x1b,0x24,0x2d,0x36,0x3f,0x48,0x41,0x5a,0x53,0x6c,0x65,0x7e,0x77,
    0x90,0x99,0x82,0x8b,0xb4,0xbd,0xa6,0xaf,0xd8,0xd1,0xca,0xc3,0xfc,0xf5,0xee,0xe7,
    0x3b,0x32,0x29,0x20,0x1f,0x16,0x0d,0x04,0x73,0x7a,0x61,0x68,0x57,0x5e,0x45,0x4c,
    0xab,0xa2,0xb9,0xb0,0x8f,0x86,0x9d,0x94,0xe3,0xea,0xf1,0xf8,0xc7,0xce,0xd5,0xdc,
    0x76,0x7f,0x64,0x6d,0x52,0x5b,0x40,0x49,0x3e,0x37,0x2c,0x25,0x1a,0x13,0x08,0x01,
    0xe6,0xef,0xf4,0xfd,0xc2,0xcb,0xd0,0xd9,0xae,0xa7,0xbc,0xb5,0x8a,0x83,0x98,0x91,
    0x4d,0x44,0x5f,0x56,0x69,0x60,0x7b,0x72,0x05,0x0c,0x17,0x1e,0x21,0x28,0x33,0x3a,
    0xdd,0xd4,0xcf,0xc6,0xf9,0xf0,0xeb,0xe2,0x95,0x9c,0x87,0x8e,0xb1,0xb8,0xa3,0xaa,
    0xec,0xe5,0xfe,0xf7,0xc8,0xc1,0xda,0xd3,0xa4,0xad,0xb6,0xbf,0x80,0x89,0x92,0x9b,
    0x7c,0x75,0x6e,0x67,0x58,0x51,0x4a,0x43,0x34,0x3d,0x26,0x2f,0x10,0x19,0x02,0x0b,
    0xd7,0xde,0xc5,0xcc,0xf3,0xfa,0xe1,0xe8,0x9f,0x96,0x8d,0x84,0xbb,0xb2,0xa9,0xa0,
    0x47,0x4e,0x55,0x5c,0x63,0x6a,0x71,0x78,0x0f,0x06,0x1d,0x14,0x2b,0x22,0x39,0x30,
    0x9a,0x93,0x88,0x81,0xbe,0xb7,0xac,0xa5,0xd2,0xdb,0xc0,0xc9,0xf6,0xff,0xe4,0xed,
    0x0a,0x03,0x18,0x11,0x2e,0x27,0x3c,0x35,0x42,0x4b,0x50,0x59,0x66,0x6f,0x74,0x7d,
    0xa1,0xa8,0xb3,0xba,0x85,0x8c,0x97,0x9e,0xe9,0xe0,0xfb,0xf2,0xcd,0xc4,0xdf,0xd6,
    0x31,0x38,0x23,0x2a,0x15,0x1c,0x07,0x0e,0x79,0x70,0x6b,0x62,0x5d,0x54,0x4f,0x46
};
static const unsigned char Mult11[256] = {
    0x00,0x0b,0x16,0x1d,0x2c,0x27,0x3a,0x31,0x58,0x53,0x4e,0x45,0x74,0x7f,0x62,0x69,
    0xb0,0xbb,0xa6,0xad,0x9c,0x97,0x8a,0x81,0xe8,0xe3,0xfe,0xf5,0xc4,0xcf,0xd2,0xd9,
    0x7b,0x70,0x6d,0x66,0x57,0x5c,0x41,0x4a,0x23,0x28,0x35,0x3e,0x0f,0x04,0x19,0x12,
    0xcb,0xc0,0xdd,0xd6,0xe7,0xec,0xf1,0xfa,0x93,0x98,0x85,0x8e,0xbf,0xb4,0xa9,0xa2,
    0xf6,0xfd,0xe0,0xeb,0xda,0xd1,0xcc,0xc7,0xae,0xa5,0xb8,0xb3,0x82,0x89,0x94,0x9f,
    0x46,0x4d,0x50,0x5b,0x6a,0x61,0x7c,0x77,0x1e,0x15,0x08,0x03,0x32,0x39,0x24,0x2f,
    0x8d,0x86,0x9b,0x90,0xa1,0xaa,0xb7,0xbc,0xd5,0xde,0xc3,0xc8,0xf9,0xf2,0xef,0xe4,
    0x3d,0x36,0x2b,0x20,0x11,0x1a,0x07,0x0c,0x65,0x6e,0x73,0x78,0x49,0x42,0x5f,0x54,
    0xf7,0xfc,0xe1,0xea,0xdb,0xd0,0xcd,0xc6,0xaf,0xa4,0xb9,0xb2,0x83,0x88,0x95,0x9e,
    0x47,0x4c,0x51,0x5a,0x6b,0x60,0x7d,0x76,0x1f,0x14,0x09,0x02,0x33,0x38,0x25,0x2e,
    0x8c,0x87,0x9a,0x91,0xa0,0xab,0xb6,0xbd,0xd4,0xdf,0xc2,0xc9,0xf8,0xf3,0xee,0xe5,
    0x3c,0x37,0x2a,0x21,0x10,0x1b,0x06,0x0d,0x64,0x6f,0x72,0x79,0x48,0x43,0x5e,0x55,
    0x01,0x0a,0x17,0x1c,0x2d,0x26,0x3b,0x30,0x59,0x52,0x4f,0x44,0x75,0x7e,0x63,0x68,
    0xb1,0xba,0xa7,0xac,0x9d,0x96,0x8b,0x80,0xe9,0xe2,0xff,0xf4,0xc5,0xce,0xd3,0xd8,
    0x7a,0x71,0x6c,0x67,0x56,0x5d,0x40,0x4b,0x22,0x29,0x34,0x3f,0x0e,0x05,0x18,0x13,
    0xca,0xc1,0xdc,0xd7,0xe6,0xed,0xf0,0xfb,0x92,0x99,0x84,0x8f,0xbe,0xb5,0xa8,0xa3
};

static const unsigned char Mult13[256] = {
    0x00,0x0d,0x1a,0x17,0x34,0x39,0x2e,0x23,0x68,0x65,0x72,0x7f,0x5c,0x51,0x46,0x4b,
    0xd0,0xdd,0xca,0xc7,0xe4,0xe9,0xfe,0xf3,0xb8,0xb5,0xa2,0xaf,0x8c,0x81,0x96,0x9b,
    0xbb,0xb6,0xa1,0xac,0x8f,0x82,0x95,0x98,0xd3,0xde,0xc9,0xc4,0xe7,0xea,0xfd,0xf0,
    0x6b,0x66,0x71,0x7c,0x5f,0x52,0x45,0x48,0x03,0x0e,0x19,0x14,0x37,0x3a,0x2d,0x20,
    0x6d,0x60,0x77,0x7a,0x59,0x54,0x43,0x4e,0x05,0x08,0x1f,0x12,0x31,0x3c,0x2b,0x26,
    0xbd,0xb0,0xa7,0xaa,0x89,0x84,0x93,0x9e,0xd5,0xd8,0xcf,0xc2,0xe1,0xec,0xfb,0xf6,
    0xd6,0xdb,0xcc,0xc1,0xe2,0xef,0xf8,0xf5,0xbe,0xb3,0xa4,0xa9,0x8a,0x87,0x90,0x9d,
    0x06,0x0b,0x1c,0x11,0x32,0x3f,0x28,0x25,0x6e,0x63,0x74,0x79,0x5a,0x57,0x40,0x4d,
    0xda,0xd7,0xc0,0xcd,0xee,0xe3,0xf4,0xf9,0xb2,0xbf,0xa8,0xa5,0x86,0x8b,0x9c,0x91,
    0x0a,0x07,0x10,0x1d,0x3e,0x33,0x24,0x29,0x62,0x6f,0x78,0x75,0x56,0x5b,0x4c,0x41,
    0x61,0x6c,0x7b,0x76,0x55,0x58,0x4f,0x42,0x09,0x04,0x13,0x1e,0x3d,0x30,0x27,0x2a,
    0xb1,0xbc,0xab,0xa6,0x85,0x88,0x9f,0x92,0xd9,0xd4,0xc3,0xce,0xed,0xe0,0xf7,0xfa,
    0xb7,0xba,0xad,0xa0,0x83,0x8e,0x99,0x94,0xdf,0xd2,0xc5,0xc8,0xeb,0xe6,0xf1,0xfc,
    0x67,0x6a,0x7d,0x70,0x53,0x5e,0x49,0x44,0x0f,0x02,0x15,0x18,0x3b,0x36,0x21,0x2c,
    0x0c,0x01,0x16,0x1b,0x38,0x35,0x22,0x2f,0x64,0x69,0x7e,0x73,0x50,0x5d,0x4a,0x47,
    0xdc,0xd1,0xc6,0xcb,0xe8,0xe5,0xf2,0xff,0xb4,0xb9,0xae,0xa3,0x80,0x8d,0x9a,0x97
};

static const unsigned char Mult14[256] = {
    0x00,0x0e,0x1c,0x12,0x38,0x36,0x24,0x2a,0x70,0x7e,0x6c,0x62,0x48,0x46,0x54,0x5a,
    0xe0,0xee,0xfc,0xf2,0xd8,0xd6,0xc4,0xca,0x90,0x9e,0x8c,0x82,0xa8,0xa6,0xb4,0xba,
    0xdb,0xd5,0xc7,0xc9,0xe3,0xed,0xff,0xf1,0xab,0xa5,0xb7,0xb9,0x93,0x9d,0x8f,0x81,
    0x3b,0x35,0x27,0x29,0x03,0x0d,0x1f,0x11,0x4b,0x45,0x57,0x59,0x73,0x7d,0x6f,0x61,
    0xad,0xa3,0xb1,0xbf,0x95,0x9b,0x89,0x87,0xdd,0xd3,0xc1,0xcf,0xe5,0xeb,0xf9,0xf7,
    0x4d,0x43,0x51,0x5f,0x75,0x7b,0x69,0x67,0x3d,0x33,0x21,0x2f,0x05,0x0b,0x19,0x17,
    0x76,0x78,0x6a,0x64,0x4e,0x40,0x52,0x5c,0x06,0x08,0x1a,0x14,0x3e,0x30,0x22,0x2c,
    0x96,0x98,0x8a,0x84,0xae,0xa0,0xb2,0xbc,0xe6,0xe8,0xfa,0xf4,0xde,0xd0,0xc2,0xcc,
    0x41,0x4f,0x5d,0x53,0x79,0x77,0x65,0x6b,0x31,0x3f,0x2d,0x23,0x09,0x07,0x15,0x1b,
    0xa1,0xaf,0xbd,0xb3,0x99,0x97,0x85,0x8b,0xd1,0xdf,0xcd,0xc3,0xe9,0xe7,0xf5,0xfb,
    0x9a,0x94,0x86,0x88,0xa2,0xac,0xbe,0xb0,0xea,0xe4,0xf6,0xf8,0xd2,0xdc,0xce,0xc0,
    0x7a,0x74,0x66,0x68,0x42,0x4c,0x5e,0x50,0x0a,0x04,0x16,0x18,0x32,0x3c,0x2e,0x20,
    0xec,0xe2,0xf0,0xfe,0xd4,0xda,0xc8,0xc6,0x9c,0x92,0x80,0x8e,0xa4,0xaa,0xb8,0xb6,
    0x0c,0x02,0x10,0x1e,0x34,0x3a,0x28,0x26,0x7c,0x72,0x60,0x6e,0x44,0x4a,0x58,0x56,
    0x37,0x39,0x2b,0x25,0x0f,0x01,0x13,0x1d,0x47,0x49,0x5b,0x55,0x7f,0x71,0x63,0x6d,
    0xd7,0xd9,0xcb,0xc5,0xef,0xe1,0xf3,0xfd,0xa7,0xa9,0xbb,0xb5,0x9f,0x91,0x83,0x8d
};
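In case you'd rather generate these tables than copy them, a short Python sketch along the following lines should do it. `gf256_mul` is a hypothetical stand-in for the gmult() routine, and the lower-case table names are chosen here only to mirror the C arrays above.

```python
def gf256_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        high_bit = a & 0x80
        a = (a << 1) & 0xff
        if high_bit:
            a ^= 0x1b
        b >>= 1
    return result


# one 256-entry table per constant in the InvMixColumns matrix
mult9 = [gf256_mul(x, 0x09) for x in range(256)]
mult11 = [gf256_mul(x, 0x0b) for x in range(256)]
mult13 = [gf256_mul(x, 0x0d) for x in range(256)]
mult14 = [gf256_mul(x, 0x0e) for x in range(256)]

# print the first row of Mult9 in the same style as the C listing
print(",".join("0x%02x" % v for v in mult9[:16]))
```

With the tables in hand, `gmult(a, 0x0e)` simply becomes `mult14[a]`, and likewise for the other three constants.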

def InvMixColumns():
    """Multiplies each column of the state by the inverse MixColumns matrix"""
    global _state
    for col in range(4):
        # temporary variables
        a = _state[col][0]
        b = _state[col][1]
        c = _state[col][2]
        d = _state[col][3]
        
        _state[col][0] = gmult(a, 0x0e) ^ gmult(b, 0x0b) ^ gmult(c, 0x0d) ^ gmult(d, 0x09)
        _state[col][1] = gmult(a, 0x09) ^ gmult(b, 0x0e) ^ gmult(c, 0x0b) ^ gmult(d, 0x0d)
        _state[col][2] = gmult(a, 0x0d) ^ gmult(b, 0x09) ^ gmult(c, 0x0e) ^ gmult(d, 0x0b)
        _state[col][3] = gmult(a, 0x0b) ^ gmult(b, 0x0d) ^ gmult(c, 0x09) ^ gmult(d, 0x0e)

static void InvMixColumns() {
    unsigned char a,b,c,d, i;
    for (i = 0; i < 4; i++)
    {
        a = (*state)[i][0];
        b = (*state)[i][1];
        c = (*state)[i][2];
        d = (*state)[i][3];
    
        (*state)[i][0] = gmult(a, 0x0e) ^ gmult(b, 0x0b) ^ gmult(c, 0x0d) ^ gmult(d, 0x09);
        (*state)[i][1] = gmult(a, 0x09) ^ gmult(b, 0x0e) ^ gmult(c, 0x0b) ^ gmult(d, 0x0d);
        (*state)[i][2] = gmult(a, 0x0d) ^ gmult(b, 0x09) ^ gmult(c, 0x0e) ^ gmult(d, 0x0b);
        (*state)[i][3] = gmult(a, 0x0b) ^ gmult(b, 0x0d) ^ gmult(c, 0x09) ^ gmult(d, 0x0e);
    }
}
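
If you'd like to convince yourself that these coefficients really do undo MixColumns(), here is a minimal sketch that round-trips a single column. It assumes the forward coefficients (2, 3, 1, 1) from FIPS 197, and it carries its own bit-by-bit gmult as a stand-in for the gmult used above; the names mix_column and inv_mix_column are made up for illustration.

```python
def gmult(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:               # lowest bit of b set: add (XOR) a into the product
            product ^= a
        carry = a & 0x80        # remember whether a overflows 8 bits when doubled
        a = (a << 1) & 0xff
        if carry:
            a ^= 0x1b           # reduce by the Rijndael polynomial
        b >>= 1
    return product

def mix_column(col):
    """Forward MixColumns on one 4-byte column (assumed FIPS 197 coefficients)."""
    a, b, c, d = col
    return [
        gmult(a, 2) ^ gmult(b, 3) ^ c ^ d,
        a ^ gmult(b, 2) ^ gmult(c, 3) ^ d,
        a ^ b ^ gmult(c, 2) ^ gmult(d, 3),
        gmult(a, 3) ^ b ^ c ^ gmult(d, 2),
    ]

def inv_mix_column(col):
    """Inverse MixColumns on one column, same coefficients as the listing above."""
    a, b, c, d = col
    return [
        gmult(a, 0x0e) ^ gmult(b, 0x0b) ^ gmult(c, 0x0d) ^ gmult(d, 0x09),
        gmult(a, 0x09) ^ gmult(b, 0x0e) ^ gmult(c, 0x0b) ^ gmult(d, 0x0d),
        gmult(a, 0x0d) ^ gmult(b, 0x09) ^ gmult(c, 0x0e) ^ gmult(d, 0x0b),
        gmult(a, 0x0b) ^ gmult(b, 0x0d) ^ gmult(c, 0x09) ^ gmult(d, 0x0e),
    ]

column = [0xdb, 0x13, 0x53, 0x45]   # arbitrary sample column
mixed = mix_column(column)
restored = inv_mix_column(mixed)
assert restored == column           # the inverse brings the column back
```

The same round trip should hold for any four bytes you pick, which is a handy smoke test if you ever swap gmult for lookup tables like the one above.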

`InvAddRoundKey()`

Just kidding, there is no InvAddRoundKey(). From the $ GF(2^8) $ section above you should understand that addition and subtraction are one and the same (both are a bytewise XOR), so you simply add the round key right back.
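
To see this in action, here is a tiny sketch with made-up data. The function add_round_key and the sample bytes are stand-ins for illustration, not the AddRoundKey() used in this article; the point is only that XORing the same key in twice leaves the block untouched.

```python
def add_round_key(block, key):
    """XOR each byte of the block with the matching byte of the key."""
    return [b ^ k for b, k in zip(block, key)]

block = list(range(16))                           # made-up 16-byte state
round_key = [(7 * i + 3) % 256 for i in range(16)]  # made-up 16-byte round key

once = add_round_key(block, round_key)    # "encrypt" step
twice = add_round_key(once, round_key)    # same step again undoes it

assert once != block
assert twice == block
```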

Alright, it's finally time to put the pieces of our decryption together. After the initial key whitening, the forward encryption for AES is described by applying the following functions every round (the last round skips MixColumns()):


        SubBytes()

        ShiftRows()

        MixColumns()

        AddRoundKey(round)

so it should be intuitive that the decryption for AES is described by applying


        AddRoundKey(round)

        InvMixColumns()

        InvShiftRows()

        InvSubBytes()

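to every round in reverse order. The one exception is the first decryption round (or the last encryption round, depending on how you think about it), which skips InvMixColumns(); the very last step then undoes the key whitening with AddRoundKey(0). In either case, here is the full decryption: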

# precomputed inverse substitution layer
_InvSBox = [0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38, 0xbf, 0x40, 0xa3, 0x9e, 0x81, 0xf3, 0xd7, 0xfb,
    0x7c, 0xe3, 0x39, 0x82, 0x9b, 0x2f, 0xff, 0x87, 0x34, 0x8e, 0x43, 0x44, 0xc4, 0xde, 0xe9, 0xcb,
    0x54, 0x7b, 0x94, 0x32, 0xa6, 0xc2, 0x23, 0x3d, 0xee, 0x4c, 0x95, 0x0b, 0x42, 0xfa, 0xc3, 0x4e,
    0x08, 0x2e, 0xa1, 0x66, 0x28, 0xd9, 0x24, 0xb2, 0x76, 0x5b, 0xa2, 0x49, 0x6d, 0x8b, 0xd1, 0x25,
    0x72, 0xf8, 0xf6, 0x64, 0x86, 0x68, 0x98, 0x16, 0xd4, 0xa4, 0x5c, 0xcc, 0x5d, 0x65, 0xb6, 0x92,
    0x6c, 0x70, 0x48, 0x50, 0xfd, 0xed, 0xb9, 0xda, 0x5e, 0x15, 0x46, 0x57, 0xa7, 0x8d, 0x9d, 0x84,
    0x90, 0xd8, 0xab, 0x00, 0x8c, 0xbc, 0xd3, 0x0a, 0xf7, 0xe4, 0x58, 0x05, 0xb8, 0xb3, 0x45, 0x06,
    0xd0, 0x2c, 0x1e, 0x8f, 0xca, 0x3f, 0x0f, 0x02, 0xc1, 0xaf, 0xbd, 0x03, 0x01, 0x13, 0x8a, 0x6b,
    0x3a, 0x91, 0x11, 0x41, 0x4f, 0x67, 0xdc, 0xea, 0x97, 0xf2, 0xcf, 0xce, 0xf0, 0xb4, 0xe6, 0x73,
    0x96, 0xac, 0x74, 0x22, 0xe7, 0xad, 0x35, 0x85, 0xe2, 0xf9, 0x37, 0xe8, 0x1c, 0x75, 0xdf, 0x6e,
    0x47, 0xf1, 0x1a, 0x71, 0x1d, 0x29, 0xc5, 0x89, 0x6f, 0xb7, 0x62, 0x0e, 0xaa, 0x18, 0xbe, 0x1b,
    0xfc, 0x56, 0x3e, 0x4b, 0xc6, 0xd2, 0x79, 0x20, 0x9a, 0xdb, 0xc0, 0xfe, 0x78, 0xcd, 0x5a, 0xf4,
    0x1f, 0xdd, 0xa8, 0x33, 0x88, 0x07, 0xc7, 0x31, 0xb1, 0x12, 0x10, 0x59, 0x27, 0x80, 0xec, 0x5f,
    0x60, 0x51, 0x7f, 0xa9, 0x19, 0xb5, 0x4a, 0x0d, 0x2d, 0xe5, 0x7a, 0x9f, 0x93, 0xc9, 0x9c, 0xef,
    0xa0, 0xe0, 0x3b, 0x4d, 0xae, 0x2a, 0xf5, 0xb0, 0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61,
    0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26, 0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d]

def InvSubBytes():
    """ Force all data in the state matrix through the inverse substitution layer """
    global _state
    _state = [[_InvSBox[_state[row][col]] for col in range(4)] for row in range(4)]

def InvShiftRows():
    """ Perform the reverse cyclic shift on each row dependent on the depth of the row"""
    global _state
    for i in range(1, 4):
        _state[0][i], _state[1][i], _state[2][i], _state[3][i] = \
        _state[(4-i)%4][i], _state[(5-i)%4][i], _state[(6-i)%4][i], _state[(7-i)%4][i]
        
def InvMixColumns():
    """ Inverse affine transform in the Rijndael field on the state """
    global _state
    for col in range(4):
        # temporary variables
        a = _state[col][0]
        b = _state[col][1]
        c = _state[col][2]
        d = _state[col][3]
        
        _state[col][0] = gmult(a, 0x0e) ^ gmult(b, 0x0b) ^ gmult(c, 0x0d) ^ gmult(d, 0x09)
        _state[col][1] = gmult(a, 0x09) ^ gmult(b, 0x0e) ^ gmult(c, 0x0b) ^ gmult(d, 0x0d)
        _state[col][2] = gmult(a, 0x0d) ^ gmult(b, 0x09) ^ gmult(c, 0x0e) ^ gmult(d, 0x0b)
        _state[col][3] = gmult(a, 0x0b) ^ gmult(b, 0x0d) ^ gmult(c, 0x09) ^ gmult(d, 0x0e)

def AESInvCipher():
    """ The complete AES decryption performed through byte-array calculations """
    AddRoundKey(__Nr)
    InvShiftRows()
    InvSubBytes()
    
    for round in range(__Nr - 1, 0, -1):
        AddRoundKey(round)
        InvMixColumns()
        InvShiftRows()
        InvSubBytes()
    
    # Reverse Key-whitening
    AddRoundKey(0)
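
As a sanity check on the ordering in AESInvCipher() above, the following sketch (not part of the cipher itself) writes out the encryption steps by name, reverses them, and swaps each step for its inverse. It assumes 10 rounds, as in AES-128, and the helper names are invented for illustration.

```python
def encryption_steps(nr):
    """List the forward AES steps by name for a cipher with nr rounds."""
    steps = ["AddRoundKey(0)"]              # key whitening
    for rnd in range(1, nr + 1):
        steps += ["SubBytes", "ShiftRows"]
        if rnd != nr:                       # the last round has no MixColumns
            steps.append("MixColumns")
        steps.append(f"AddRoundKey({rnd})")
    return steps

def decryption_steps(nr):
    """Reverse the forward steps and replace each one with its inverse."""
    inverse = {
        "SubBytes": "InvSubBytes",
        "ShiftRows": "InvShiftRows",
        "MixColumns": "InvMixColumns",
    }
    # AddRoundKey is its own inverse, so it keeps its name
    return [inverse.get(step, step) for step in reversed(encryption_steps(nr))]

steps = decryption_steps(10)   # assumption: Nr = 10 (AES-128)
for step in steps[:7]:
    print(step)
```

The first three names printed are the shortened first round (no InvMixColumns), followed by the start of the regular rounds, which is the same shape as the Python function above.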

// Precomputed inverse substitution layer
static const uint8_t InvSBox[256] =  {
    // 0     1     2     3     4     5     6     7     8     9     A     B     C     D     E     F
    0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38, 0xbf, 0x40, 0xa3, 0x9e, 0x81, 0xf3, 0xd7, 0xfb,
    0x7c, 0xe3, 0x39, 0x82, 0x9b, 0x2f, 0xff, 0x87, 0x34, 0x8e, 0x43, 0x44, 0xc4, 0xde, 0xe9, 0xcb,
    0x54, 0x7b, 0x94, 0x32, 0xa6, 0xc2, 0x23, 0x3d, 0xee, 0x4c, 0x95, 0x0b, 0x42, 0xfa, 0xc3, 0x4e,
    0x08, 0x2e, 0xa1, 0x66, 0x28, 0xd9, 0x24, 0xb2, 0x76, 0x5b, 0xa2, 0x49, 0x6d, 0x8b, 0xd1, 0x25,
    0x72, 0xf8, 0xf6, 0x64, 0x86, 0x68, 0x98, 0x16, 0xd4, 0xa4, 0x5c, 0xcc, 0x5d, 0x65, 0xb6, 0x92,
    0x6c, 0x70, 0x48, 0x50, 0xfd, 0xed, 0xb9, 0xda, 0x5e, 0x15, 0x46, 0x57, 0xa7, 0x8d, 0x9d, 0x84,
    0x90, 0xd8, 0xab, 0x00, 0x8c, 0xbc, 0xd3, 0x0a, 0xf7, 0xe4, 0x58, 0x05, 0xb8, 0xb3, 0x45, 0x06,
    0xd0, 0x2c, 0x1e, 0x8f, 0xca, 0x3f, 0x0f, 0x02, 0xc1, 0xaf, 0xbd, 0x03, 0x01, 0x13, 0x8a, 0x6b,
    0x3a, 0x91, 0x11, 0x41, 0x4f, 0x67, 0xdc, 0xea, 0x97, 0xf2, 0xcf, 0xce, 0xf0, 0xb4, 0xe6, 0x73,
    0x96, 0xac, 0x74, 0x22, 0xe7, 0xad, 0x35, 0x85, 0xe2, 0xf9, 0x37, 0xe8, 0x1c, 0x75, 0xdf, 0x6e,
    0x47, 0xf1, 0x1a, 0x71, 0x1d, 0x29, 0xc5, 0x89, 0x6f, 0xb7, 0x62, 0x0e, 0xaa, 0x18, 0xbe, 0x1b,
    0xfc, 0x56, 0x3e, 0x4b, 0xc6, 0xd2, 0x79, 0x20, 0x9a, 0xdb, 0xc0, 0xfe, 0x78, 0xcd, 0x5a, 0xf4,
    0x1f, 0xdd, 0xa8, 0x33, 0x88, 0x07, 0xc7, 0x31, 0xb1, 0x12, 0x10, 0x59, 0x27, 0x80, 0xec, 0x5f,
    0x60, 0x51, 0x7f, 0xa9, 0x19, 0xb5, 0x4a, 0x0d, 0x2d, 0xe5, 0x7a, 0x9f, 0x93, 0xc9, 0x9c, 0xef,
    0xa0, 0xe0, 0x3b, 0x4d, 0xae, 0x2a, 0xf5, 0xb0, 0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61,
    0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26, 0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d };

/**
 * The complete AES decryption performed through byte-array calculations
 */
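static void AESInvCipher() {
    // Undo the final encryption round, which has no MixColumns step
    AddRoundKey(N_ROUNDS);
    InvShiftRows();
    InvSubBytes();

    // Count down to round 1; an unsigned counter tested with `>= 0` would never terminate
    unsigned char round;
    for (round = N_ROUNDS - 1; round > 0; --round) {
        AddRoundKey(round);
        InvMixColumns();
        InvShiftRows();
        InvSubBytes();
    }

    // Reverse key whitening
    AddRoundKey(0);
}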

/**
 * Force all data in the state matrix through the inverse substitution layer
 */
static void InvSubBytes() {
    unsigned char row, col;
    for (row = 0; row < 4; row++)
        for (col = 0; col < 4; col++)
            (*state)[row][col] = InvSBox[(*state)[row][col]]; // InvSBox is the array above
}

/**
 * Performs a cyclic shift on each row dependent on the depth
 * of the row in the opposite direction of ShiftRows()
 */
static void InvShiftRows() {
    uint8_t temp;
    // Rotate row 1 one column to the right
    temp           = (*state)[3][1];
    (*state)[3][1] = (*state)[2][1];
    (*state)[2][1] = (*state)[1][1];
    (*state)[1][1] = (*state)[0][1];
    (*state)[0][1] = temp;

    // Rotate row 2 two columns to the right (swap opposite columns)
    temp           = (*state)[0][2];
    (*state)[0][2] = (*state)[2][2];
    (*state)[2][2] = temp;
    temp           = (*state)[1][2];
    (*state)[1][2] = (*state)[3][2];
    (*state)[3][2] = temp;

    // Rotate row 3 three columns to the right (same as one column to the left)
    temp           = (*state)[0][3];
    (*state)[0][3] = (*state)[1][3];
    (*state)[1][3] = (*state)[2][3];
    (*state)[2][3] = (*state)[3][3];
    (*state)[3][3] = temp;
}

/**
 * Inverse affine transformation in the Rijndael field on the state
 */
static void InvMixColumns() {
    uint8_t a, b, c, d, i; // unsigned bytes, so no sign extension sneaks into gmult
    for (i = 0; i < 4; i++)
    {
        a = (*state)[i][0];
        b = (*state)[i][1];
        c = (*state)[i][2];
        d = (*state)[i][3];
    
        (*state)[i][0] = gmult(a, 0x0e) ^ gmult(b, 0x0b) ^ gmult(c, 0x0d) ^ gmult(d, 0x09);
        (*state)[i][1] = gmult(a, 0x09) ^ gmult(b, 0x0e) ^ gmult(c, 0x0b) ^ gmult(d, 0x0d);
        (*state)[i][2] = gmult(a, 0x0d) ^ gmult(b, 0x09) ^ gmult(c, 0x0e) ^ gmult(d, 0x0b);
        (*state)[i][3] = gmult(a, 0x0b) ^ gmult(b, 0x0d) ^ gmult(c, 0x09) ^ gmult(d, 0x0e);
    }
}
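
Since the hand-unrolled rotations in InvShiftRows() are easy to get wrong, a quick round-trip test is worth having. The sketch below assumes the state is indexed as state[col][row], the way the listings above use it; shift_rows and inv_shift_rows are illustrative helpers, not the functions from this article.

```python
def shift_rows(state):
    """Rotate row r left by r positions; state is indexed state[col][row]."""
    return [[state[(col + row) % 4][row] for row in range(4)] for col in range(4)]

def inv_shift_rows(state):
    """Rotate row r right by r positions, undoing shift_rows."""
    return [[state[(col - row) % 4][row] for row in range(4)] for col in range(4)]

# Fill the state with 0..15 in column order so every byte is easy to track
state = [[4 * col + row for row in range(4)] for col in range(4)]

shifted = shift_rows(state)
restored = inv_shift_rows(shifted)

assert shifted != state
assert restored == state
```

Feeding the same numbered state through a C implementation and comparing against this output is one cheap way to catch a swapped index.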

AES

Advanced Encryption Standard (FIPS 197)