Using LSTM and Dense Keras Weights in C++

by 21zhouv

For an engineering project, I ran a Keras neural net on several microcontrollers (albeit very slowly) as a proof of concept. The neural network consisted of an LSTM layer and a Dense layer. Here, I'd like to share my findings on how I implemented a Keras model in C++.

Disclaimer: Use a microcontroller with at least 16 KB of SRAM. I made the mistake of using controllers with less, which caused massive headaches when the program randomly crashed.

Extract Weights From Keras Model

LSTM Layer

First, you must extract the weights from each layer. While the weights are accessible, the individual weights for each gate are concatenated into one big matrix. Thankfully, this Stack Overflow post provided the slices required to separate the individual weights. Make sure that the index of the layer corresponds to the actual layer that you want.

https://stackoverflow.com/questions/42861460/how-t...
units = int(model.layers[0].trainable_weights[0].shape[1]) // 4

# Layer 0 is my LSTM layer
W = model.layers[0].get_weights()[0]
U = model.layers[0].get_weights()[1]
b = model.layers[0].get_weights()[2]

W_i = W[:, :units]
W_f = W[:, units: units * 2]
W_c = W[:, units * 2: units * 3]
W_o = W[:, units * 3:]

U_i = U[:, :units]
U_f = U[:, units: units * 2]
U_c = U[:, units * 2: units * 3]
U_o = U[:, units * 3:]

b_i = b[:units]
b_f = b[units: units * 2]
b_c = b[units * 2: units * 3]
b_o = b[units * 3:]

Dense Layer

The Dense layer is incredibly simple to extract: the first matrix is the weight and the second matrix is the bias.

# Layer 1 is my Dense layer
W = model.layers[1].get_weights()[0]
b = model.layers[1].get_weights()[1]

As a sanity check, you can inspect the matrices' attributes to verify that the dimensions, size, and contents are correct.

print(W_i.shape) # Tuple of dimensions
print(W_i.nbytes) # Size of array in bytes
print(W_i) # Preview of array contents

(Optional) Test Forward Propagation

To ensure that the weights work and to familiarize yourself with the formulation of an LSTM cell, you can perform forward propagation manually. This website goes into depth about the process and includes mathematical representations.

https://fairyonice.github.io/Extract-weights-from-Keras's-LSTM-and-calcualte-hidden-and-cell-states.html

Select a test data set and run it through both the regular model and the manual forward propagation to make sure that both output the same values. Make sure that the dimensions of the data and states are correct, and adjust accordingly. This code only works for my layer, which has 50 units and expects 3 features.

import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

def tanh(x):
    return np.tanh(x)

def iterate(x_t, prev_h, prev_c):
    # Forget gate
    f_t = np.dot(np.transpose(W_f), x_t) + np.dot(np.transpose(U_f), prev_h) + b_f
    f_t = sigmoid(f_t)
    # Input gate
    i_t = np.dot(np.transpose(W_i), x_t) + np.dot(np.transpose(U_i), prev_h) + b_i
    i_t = sigmoid(i_t)
    # Candidate cell state
    _c = np.dot(np.transpose(W_c), x_t) + np.dot(np.transpose(U_c), prev_h) + b_c
    _c = tanh(_c)
    # New cell state
    c_t = np.multiply(f_t, prev_c) + np.multiply(i_t, _c)
    # Output gate
    o_t = np.dot(np.transpose(W_o), x_t) + np.dot(np.transpose(U_o), prev_h) + b_o
    o_t = sigmoid(o_t)
    # New hidden state
    h_t = np.multiply(o_t, tanh(c_t))
    return (h_t, c_t)

def lstm_forward_pass(x):
    # Initialize hidden state and cell state
    prev_h = np.zeros((50,))
    prev_c = np.zeros((50,))
    for i in range(10):
        # Slice one row of the data
        x_t = x[i, :]
        (h_t, c_t) = iterate(x_t, prev_h, prev_c)

        # Update states
        prev_h = h_t
        prev_c = c_t
    return prev_h

For a Dense layer, simply take the dot product of the weights and the data, then add the bias. Since my Dense layer has a sigmoid activation function, I pass the result through the sigmoid function.

def dense_forward_pass(x):
    return sigmoid(np.dot(np.transpose(W), x) + b)

Since my model consists of an LSTM layer and a Dense layer, this is how I would run it.

# Initialize dummy series data
x = np.ones((10, 3))

def run(x):
    h = lstm_forward_pass(x)
    y = dense_forward_pass(h)
    return y
run(x)

Exporting the Weights

To export the LSTM weights, I used np.savetxt() to save the numpy arrays as plain text files, iterating over paired lists of weights and names so each weight is saved under its corresponding name. Here is some code that also prints out the bytes and shape of each array as a sanity check.

weights_list = [W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, b_i, b_f, b_c, b_o]
names_list = ["W_i", "W_f", "W_c", "W_o", "U_i", "U_f", "U_c", "U_o", "b_i", "b_f", "b_c", "b_o"]

for weight, name in zip(weights_list, names_list):
    print(f"{name}, Bytes: {weight.nbytes}, Shape: {weight.shape}")
    np.savetxt(f"{name}.txt", weight)

I also exported the Dense weights.

np.savetxt("W.txt", W)
np.savetxt("b.txt", b)

(Optional) Formatting Weights

In my case, since I wanted to use the weights on a microcontroller, I needed to format them as array declarations in C++. In addition, I added the PROGMEM modifier to each declaration to store the arrays in flash and conserve RAM. To read array values back out of flash memory, I simply use the pgmspace.h library and the pgm_read_float_near() function.
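
As a minimal sketch of that pattern (the array here is a made-up example, not one of the real weights), storing and reading a float in flash looks like this:

#include <avr/pgmspace.h>

// Example array stored in flash instead of SRAM
const float example_w[3] PROGMEM = {0.031, -0.068, 0.126};

float read_weight(int i) {
  // pgm_read_float_near() takes the address of the flash-resident element
  return pgm_read_float_near(&example_w[i]);
}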

To automate the formatting, I wrote a Python (3.7) script to read the files and print out the declarations. The script is attached below and doesn't require any external libraries.

One thing to note is that the functions in the script have a parameter called decimals with a default value of 3. This parameter controls how many decimal places the weights are rounded to. I needed to round the weights to 3 decimal places to save flash memory, but you can pass a different value to change the precision.

To remove the rounding, you will need to find the lines that do the rounding.

number = round(float(number), decimals)

There are two such lines: one in the regular conversion function and one in the separate conversion function. Simply remove the round() call from each line.

number = float(number)

In addition, you will need to run the script in an IDE, like PyCharm, so you can edit the file paths and functionality. Here are example declarations of the W and b matrices for the forget gate:

const float W_f[3][50] PROGMEM = {{0.031,-0.068,0.126,0.128,-0.183,0.143,0.079,-0.038,0.011,0.027,-0.037,-0.057,0.227,-0.036,0.011,0.138,-0.001,0.119,0.042,0.189,0.101,-0.036,0.138,-0.141,0.04,0.133,-0.118,-0.027,-0.22,0.087,0.023,-0.002,0.024,-0.175,-0.097,0.06,0.179,0.166,0.018,-0.223,0.017,0.152,0.021,-0.068,0.099,0.15,-0.132,0.077,-0.141,-0.015},{-0.091,0.081,-0.034,0.023,-0.023,0.155,0.161,0.018,0.01,-0.042,-0.005,0.07,0.063,-0.104,0.143,0.061,0.048,-0.081,-0.124,0.095,-0.149,0.025,0.064,0.0,-0.104,-0.098,0.183,0.17,0.034,-0.106,0.018,0.093,0.255,-0.167,-0.113,-0.127,-0.017,-0.014,-0.022,-0.005,0.043,-0.008,0.073,0.042,-0.048,-0.061,0.092,-0.157,0.053,-0.041},{-0.05,0.151,0.082,0.127,-0.113,0.029,0.037,0.088,-0.0,-0.003,0.083,-0.106,0.17,0.018,-0.105,0.082,0.03,0.028,0.011,-0.059,-0.003,0.043,-0.008,-0.097,-0.004,-0.017,0.129,0.005,0.029,0.086,-0.004,0.081,0.043,0.002,0.023,0.001,-0.021,-0.024,0.0,-0.115,0.002,0.137,-0.002,-0.157,0.182,-0.075,-0.001,0.006,-0.157,-0.046}};
const float b_f[50][1] PROGMEM = {{1.095},{0.999},{1.002},{1.023},{1.071},{1.068},{1.039},{0.968},{1.107},{1.097},{1.053},{0.996},{0.94},{1.059},{1.017},{0.999},{1.015},{0.975},{1.065},{1.008},{1.04},{1.052},{1.115},{1.05},{1.027},{1.078},{0.964},{1.062},{1.044},{1.06},{1.089},{1.074},{0.966},{1.123},{1.038},{1.098},{1.136},{1.096},{1.168},{1.039},{1.037},{1.014},{1.078},{1.001},{0.963},{1.109},{1.132},{1.045},{0.992},{1.067}};

For the U matrices, I created a separate conversion function because I needed a separate array for each row of the U matrix. Then, I stored the address of each row array as an element of another array. Here is an example declaration of a U matrix for the forget gate:

const float U_f_0[50] PROGMEM = {-0.039,0.037,-0.022,0.159,-0.117,0.037,-0.054,-0.145,-0.049,-0.108,-0.071,0.043,-0.012,-0.168,0.009,0.007,-0.062,-0.007,-0.076,0.056,-0.216,0.098,-0.185,-0.173,-0.079,-0.176,-0.092,-0.124,-0.049,0.044,-0.07,0.148,0.027,-0.076,0.003,0.05,-0.093,-0.003,-0.147,0.067,-0.179,0.008,-0.093,0.001,0.103,0.083,-0.102,0.122,-0.241,0.067};
const float U_f_1[50] PROGMEM = {-0.005,-0.059,0.091,0.04,0.038,0.17,0.09,-0.042,-0.201,-0.232,0.048,0.011,0.112,-0.029,0.064,0.053,-0.077,-0.07,0.121,-0.124,0.08,-0.087,-0.083,-0.068,0.003,-0.109,-0.117,0.046,0.058,-0.008,0.074,-0.014,0.007,-0.042,0.226,0.03,0.005,-0.062,0.049,-0.18,0.117,0.026,0.069,-0.032,-0.029,-0.124,0.131,0.08,0.138,-0.049};
const float U_f_2[50] PROGMEM = {-0.147,-0.016,0.033,0.01,-0.055,-0.032,-0.001,0.091,-0.148,0.066,-0.08,0.077,0.096,-0.165,0.118,0.091,-0.147,0.032,-0.067,0.082,-0.068,-0.141,-0.179,-0.098,-0.125,-0.084,0.048,-0.088,0.061,0.063,-0.01,0.086,0.138,-0.177,-0.085,-0.113,0.008,0.193,-0.144,-0.027,-0.005,0.005,-0.026,-0.016,0.054,0.114,-0.032,0.062,-0.017,-0.189};
const float U_f_3[50] PROGMEM = {-0.107,0.086,0.061,0.008,-0.063,0.096,0.113,-0.002,0.047,0.015,-0.121,0.05,0.001,0.147,0.095,0.044,-0.06,-0.165,-0.068,-0.047,-0.05,-0.069,0.155,0.245,-0.057,0.061,-0.124,-0.14,0.028,0.09,-0.027,-0.065,-0.171,0.022,-0.077,0.152,-0.033,-0.012,0.031,0.129,-0.065,-0.035,-0.022,0.066,0.054,0.032,0.054,0.153,-0.082,-0.041};
...
const float(* const U_f[]) = {U_f_0,U_f_1,U_f_2,U_f_3,U_f_4,U_f_5,U_f_6,U_f_7,U_f_8,U_f_9,U_f_10,U_f_11,U_f_12,U_f_13,U_f_14,U_f_15,U_f_16,U_f_17,U_f_18,U_f_19,U_f_20,U_f_21,U_f_22,U_f_23,U_f_24,U_f_25,U_f_26,U_f_27,U_f_28,U_f_29,U_f_30,U_f_31,U_f_32,U_f_33,U_f_34,U_f_35,U_f_36,U_f_37,U_f_38,U_f_39,U_f_40,U_f_41,U_f_42,U_f_43,U_f_44,U_f_45,U_f_46,U_f_47,U_f_48,U_f_49};

Using the Weights and RAM-Saving Measures

For matrix algebra on the microcontrollers, I used the BasicLinearAlgebra library, as it is simple and lightweight. However, due to RAM constraints, I wrote slower, RAM-saving functions to perform the matrix algebra. Here are a couple of functions that you may find helpful.

Activation functions (Sigmoid and Tanh)

template <int n_row, int n_col>
void sigmoid(Matrix<n_row, n_col> * mat) {
  for (int i = 0; i < n_row; i++) {
    for (int j = 0; j < n_col; j++) {
      (*mat)(i, j) = 1/(1+exp(-(*mat)(i,j)));
    }
  }
}
template <int n_row, int n_col>
void tanh(Matrix<n_row, n_col> * mat) {
  for (int i = 0; i < n_row; i++) {
    for (int j = 0; j < n_col; j++) {
      (*mat)(i, j) = tanh((*mat)(i,j));
    }
  }
}

Fill matrix from PROGMEM

template <int n_row, int n_col>
void fill_matrix_pgm(Matrix<n_row, n_col> * mat, const float arr[][n_col]) {
  for(int i = 0; i < n_row; i++) {
    for(int j = 0; j < n_col; j++) {
      (*mat)(i,j) = pgm_read_float(&arr[i][j]);
    }
  } 
}

Element-wise multiplication

template <int n_row, int n_col>
void element_multiply(Matrix<n_row, n_col> * mat1, Matrix<n_row, n_col> * mat2) {
  for(int i = 0; i < n_row; i++) {
    for(int j = 0; j < n_col; j++) {
      (*mat1)(i,j) = (*mat1)(i,j) * (*mat2)(i,j);
    }
  } 
}
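
To give a sense of how these helpers compose, here is a quick fragment (assuming the PROGMEM array b_f[50][1] declared in the formatting step):

Matrix<50, 1> state;
Matrix<50, 1> bias;
state.Fill(1);                        // dummy data to operate on
fill_matrix_pgm<50, 1>(&bias, b_f);   // copy the forget-gate bias out of flash
sigmoid<50, 1>(&bias);                // squash the bias element-wise, in place
element_multiply(&state, &bias);      // state = state (.*) bias, stored in state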

transpose(W) * x_t

In my case, x_t is (3, 1) and W is (3, 50). Reading W out of flash one column at a time means only a (3, 1) slice ever needs to sit in RAM.

void populate_matrix_col(Matrix<3, 1> * mat, const float pgm_matrix[][50], int col) {
  for (int i = 0; i < 3; i++) {
    (*mat)(i, 0) = pgm_read_float(&pgm_matrix[i][col]);
  }
}
Matrix<50, 1> w_x(Matrix<3,1> operand, const float pgm_matrix[][50]) {
  Matrix<50, 1> ret;
  Matrix<3, 1> rowWeight;
  for (int i = 0; i < 50; i++) {   
    populate_matrix_col(&rowWeight, pgm_matrix, i);
    Matrix<1, 1> sum = (~operand * rowWeight);
    ret(i, 0) = sum(0, 0);
  }
  return ret;
}

transpose(U) * h_(t-1)

In my case, h_(t-1) is (50, 1) and U is (50, 50).

void populate_matrix_col(Matrix<50, 1> * mat, const float(* const pgm_matrix[]), int col) {
  for (int i = 0; i < 50; i++) {
    (*mat)(i, 0) = pgm_read_float(&pgm_matrix[i][col]);
  }
}
Matrix<50, 1> u_h(Matrix<50,1> operand, const float(* const pgm_matrix[])) {
  Matrix<50, 1> ret;
  Matrix<50, 1> rowWeight;
  for (int i = 0; i < 50; i++) {   
    populate_matrix_col(&rowWeight, pgm_matrix, i);
    Matrix<1, 1> sum = (~operand * rowWeight);
    ret(i, 0) = sum(0, 0);
  }
  return ret;
}

Furthermore, this Arduino forum post provides a function that reports how much free SRAM remains at the point where it is called. I used it to determine whether I needed to add optimizations or was in the safe zone in terms of RAM.

extern unsigned int __bss_end;
extern unsigned int __heap_start;
extern void *__brkval;
int freeMemory() {
  int free_memory;
  if((int)__brkval == 0)
    free_memory = ((int)&free_memory) - ((int)&__bss_end);
  else
    free_memory = ((int)&free_memory) - ((int)__brkval);
  
  return free_memory;
}
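
For example, you might print the value over serial right before a memory-hungry step (a hypothetical placement; adjust it to wherever your forward pass actually runs):

void setup() {
  Serial.begin(9600);
}
void loop() {
  Serial.print(F("Free SRAM: "));
  Serial.println(freeMemory());  // bytes left between the heap and the stack
  // ... run one forward-pass step here ...
  delay(1000);
}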

Theoretical Forward Propagation in C++

Here is what the forward propagation for an LSTM might look like. However, this is only theoretical because I couldn't run the code within the measly 2 KB of RAM on an ATtiny3216. In reality, I split all the functions across several microcontrollers and transmitted the arrays through I2C.

Gate Functions

Matrix<50, 1> forget_gate(Matrix<50, 1> b, Matrix<3, 1> x_t, Matrix<50, 1> prev_h) {
  Matrix<50, 1> f_t = w_x(x_t, W_f) + u_h(prev_h, U_f) + b;
  sigmoid<50, 1>(&f_t);
  return f_t;
}
Matrix<50, 1> input_gate(Matrix<50, 1> b, Matrix<3, 1> x_t, Matrix<50, 1> prev_h) {
  Matrix<50, 1> i_t = w_x(x_t, W_i) + u_h(prev_h, U_i) + b; 
  sigmoid<50, 1>(&i_t);
  return i_t;
}
Matrix<50, 1> candidate_cell_state(Matrix<50, 1> b, Matrix<3, 1> x_t, Matrix<50, 1> prev_h) {
  Matrix<50, 1> _c = w_x(x_t, W_c) + u_h(prev_h, U_c) + b;
  tanh<50, 1>(&_c);
  return _c;
}
Matrix<50, 1> output_gate(Matrix<50, 1> b, Matrix<3, 1> x_t, Matrix<50, 1> prev_h) {
  Matrix<50, 1> o_t = w_x(x_t, W_o) + u_h(prev_h, U_o) + b;
  sigmoid<50, 1>(&o_t);
  return o_t;
}

Iteration Function

void iterate(Matrix<3, 1> x_t, Matrix<50, 1> * prev_h, Matrix<50, 1> * prev_c) {
  Matrix<50, 1> b;
  fill_matrix_pgm<50, 1>(&b, b_f);
  Matrix<50, 1> f_t = forget_gate(b, x_t, (*prev_h));
  fill_matrix_pgm<50, 1>(&b, b_i);
  Matrix<50, 1> i_t = input_gate(b, x_t, (*prev_h));
  fill_matrix_pgm<50, 1>(&b, b_c);
  Matrix<50, 1> _c = candidate_cell_state(b, x_t, (*prev_h));
  element_multiply(&f_t, prev_c);
  element_multiply(&i_t, &_c);
  Matrix<50, 1> c_t = f_t + i_t;
  (*prev_c) = c_t;

  fill_matrix_pgm<50, 1>(&b, b_o);
  Matrix<50, 1> o_t = output_gate(b, x_t, (*prev_h));
  tanh<50, 1>(&c_t);
  element_multiply(&o_t, &c_t);

  (*prev_h) = o_t;
}

Full Forward Pass

Matrix<3, 1> slice(Matrix<10, 3> x, int n) {
  Matrix<3, 1> x_t;
  x_t(0, 0) = x(n, 0);
  x_t(1, 0) = x(n, 1);
  x_t(2, 0) = x(n, 2);
  return x_t;
}
Matrix<50, 1> run(Matrix<10, 3> x) {
  Matrix<50, 1> prev_h;
  Matrix<50, 1> prev_c;

  prev_h.Fill(0);
  prev_c.Fill(0);

  for (int i = 0; i < 10; i++) {
    iterate(slice(x, i), &prev_h, &prev_c);
  }
  return prev_h;
}
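
To complete the model, the Dense layer needs a C++ counterpart as well. Here is a minimal sketch under the same assumptions as the rest of this step (BasicLinearAlgebra, a 50-unit LSTM feeding a single sigmoid output); dense_W and dense_b are hypothetical names for PROGMEM declarations of the Dense weights exported earlier:

// Hypothetical PROGMEM exports of the Dense layer's weight and bias:
// const float dense_W[50][1] PROGMEM = {...};
// const float dense_b[1][1] PROGMEM = {...};
// Dense forward pass: y = sigmoid(transpose(W) * h + b)
float dense_forward_pass(Matrix<50, 1> h) {
  float sum = pgm_read_float(&dense_b[0][0]);
  for (int i = 0; i < 50; i++) {
    // Read one weight at a time out of flash to keep RAM usage minimal
    sum += pgm_read_float(&dense_W[i][0]) * h(i, 0);
  }
  return 1.0 / (1.0 + exp(-sum));  // sigmoid activation
}

Chaining it with the LSTM pass then mirrors the Python version: float y = dense_forward_pass(run(x));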