This blog details the inner workings of TensorFlow Lite for Microcontrollers and the role of Flatbuffers in them.
Story of TensorFlow Lite for Microcontrollers
Part 2/2: This blog extends my TensorFlow Lite for Microcontrollers tutorial. I was selected in Google Summer of Code, under TensorFlow, to work on building demos for TinyML, and when I read through the documentation, I was intrigued by some of the design decisions made in the TensorFlow Lite for Microcontrollers library. I then read through chapter 11 of the TinyML book, which details the inner workings of the TensorFlow Lite Micro framework, and this blog is my interpretation of it. I have also added some of my own suggestions to improve TensorFlow Lite for Microcontrollers. I hope this blog helps you better understand what's happening under the hood and appreciate the tiny details in the TensorFlow Lite for Microcontrollers library.
This blog on TensorFlow Lite for Microcontrollers in a nutshell:
1. Introduction
2. FlatBuffers
3. Schema
4. Suggestions to improve the TF Lite Micro Framework
5. Conclusion
1. Introduction to TensorFlow Lite for Microcontrollers
The format TensorFlow Lite uses to store its models has many advantages, but simplicity is not one of them.
1.a Neural network models
Neural network models are graphs of operations, each with inputs and outputs. An operation's inputs may be the outputs of previous operations, arrays of input values supplied by the application layer, or large arrays of learned values known as weights.
These application-supplied inputs could be accelerometer data, audio sample data, or image pixel data. After the model runs once, the final operations leave arrays of values in their outputs, often indicating things like classification predictions for various categories.
1.b Desktop machines vs Microcontrollers
Models are usually trained on desktop machines, so there’s a need to transfer them to other devices like phones or microcontrollers.
With TensorFlow, we do this using a converter that can take a trained model from Python and export it as a TensorFlow Lite file.
This exporting stage can be riddled with issues, because it's straightforward to build a model in TensorFlow that depends on desktop environment characteristics (such as the ability to run Python code snippets or use advanced operations) that are not available on less complex platforms.
Some transformations that need to be performed during export:
- Convert all the values that are variables in training, e.g. weights, into constants
- Remove operations that are needed only for gradient backpropagation
- Perform optimizations like fusing neighboring ops
- Fold costly operations like batch normalization into less expensive forms.
2. FlatBuffers
2.a Introduction
A FlatBuffer is a binary file and in-memory format consisting mostly of scalars of various sizes, all aligned to their own size. Each scalar is also always represented in little-endian format, which corresponds to all commonly used CPUs today. FlatBuffers will also work on big-endian machines but will be slightly slower because of additional byte-swap intrinsics.
To achieve cross-platform interoperability, the following conditions are expected to be met:
- The binary IEEE-754 format is used for floating-point numbers.
- The two's complement representation is used for signed integers.
- The endianness is the same for floating-point numbers as for integers.
2.b The benefits of using FlatBuffers in embedded systems
FlatBuffers is used as the serialization library in the TensorFlow Lite framework.
It was designed for applications for which performance is critical, so it’s a good fit for embedded systems.
- Its runtime in-memory representation is exactly the same as its serialized form, so models can be embedded directly into flash memory and accessed immediately, without parsing or copying (see the sketch after this list).
- This does mean that reading properties from the generated code classes involves a few layers of indirection, but the crucial data, such as weights, is stored directly as little-endian blobs that can be accessed like raw C arrays.
- There’s also very little wasted space, so you aren’t paying a size penalty by using FlatBuffers.
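In practice, a converted model is usually dumped as a C array (for example with xxd) and compiled straight into the firmware image. The sketch below follows the pattern used by the TFLM examples; the names are illustrative, not part of any API:

// Sketch: a .tflite file serialized by the converter, dumped as a C array
// (e.g. with `xxd -i model.tflite`) and linked into flash. The alignment
// matters because FlatBuffers expects its scalars to be aligned in memory.
alignas(16) const unsigned char g_model_data[] = {
    /* bytes of the .tflite file go here */
};
const unsigned int g_model_data_len = sizeof(g_model_data);

This is the same g_model_data pointer that later gets handed to ::tflite::GetModel() in the code shown later in this post.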
3. Schema
Note: The new schema_generated.h uses the struct datatype, and tables have been eliminated. I have explained the schema of an earlier version of TF Lite Micro.
FlatBuffers work using a schema that defines the data structures we want to serialize, together with a compiler that turns that schema into native C++ (or C, Python, Java, etc.) code for reading and writing the information.
Schema: A representation of a plan or theory in the form of an outline or model.
For TensorFlow Lite, the schema is in:
tensorflow/lite/schema/schema.fbs
The cached C++ accessor code is in:
tensorflow/lite/schema/schema_generated.h
Rather than storing the generated C++ code in source control, it could be regenerated on every new build. However, this would demand that the flatc compiler and the rest of the toolchain be present on every platform the library is built on, so a decision was made to forgo automatic generation in favor of portability.
3.a Root Type
At the very end of the schema, we see a line declaring that the root_type is Model:
root_type Model;
FlatBuffers need a single container object that acts as the root for the tree of other data structures held within the file. This statement tells us that the root of this format is going to be a Model.
3.b Table in FlatBuffers
To find out what root_type Model; means, we need to scroll up a few more lines to the definition of Model:
table Model {
Tables are the main way of defining objects in FlatBuffers and consist of a name and a list of fields. Each field has a name, a type, and optionally a default value.
This tells us that Model is what FlatBuffers calls a table. You can think of this like a Dict in Python or a struct in C or C++. It defines what properties an object can have, along with their names and types.
Structs are similar to tables. Use them for simple objects where you are very sure no changes will ever be made. Structs use less memory than tables and are even faster to access.
There’s also a less-flexible type in FlatBuffers called struct that’s more memory-efficient for arrays of objects, but we don’t currently use this in TensorFlow Lite.
In practice,
/* Map the model into a usable data structure. */
/* This doesn't involve any copying or parsing, it's a very lightweight operation. */
const tflite::Model* model = ::tflite::GetModel(g_model_data);
The g_model_data variable is a pointer to an area of memory containing a serialized TensorFlow Lite model, and the call to ::tflite::GetModel() is effectively just a cast to get a C++ object backed by that underlying memory. It doesn't require any memory allocation or walking of data structures, so it's a very quick and efficient call.
To understand how we can use it, look at the next operation we perform on the data structure:
if (model->version() != TFLITE_SCHEMA_VERSION) {
  error_reporter->Report(
      "Model provided is schema version %d not equal "
      "to supported version %d.\n",
      model->version(), TFLITE_SCHEMA_VERSION);
  return 1;
}
3.d Version Property
The specification of the version property to which this code refers may be found at the beginning of the Model definition in the schema:
// Version of the schema.
version:uint;
This informs us that the version property is a 32-bit unsigned integer, so the C++ code generated for model->version() returns that type of value. Here we’re just doing error checking to make sure the version is one that we can understand, but the same kind of accessor function is generated for all the properties that are defined in the schema.
3.e The MicroInterpreter class
To understand the more intricate parts of the file format, it's worth following the flow of the MicroInterpreter class as it loads a model and gets ready to run it. The constructor is given a pointer to a model held in memory. The first property it accesses is the list of buffers. Like the base Model object, the FlatBuffers Vector class is simply a read-only wrapper around the underlying memory and doesn't require any parsing or memory allocation to be created.
const flatbuffers::Vector<flatbuffers::Offset<Buffer>>* buffers = model->buffers();
3.f Schema definition
To understand more about what the buffers array represents, we need to look at the schema definition
// Table of raw data buffers (used for constant tensors). Referenced by tensors
// by index. The generous alignment accommodates mmap-friendly data structures.
table Buffer {
data:[ubyte] (force_align: 16);
}
Each buffer is defined as a raw array of unsigned 8-bit values, with the first value 16-byte-aligned in memory. All of the weight arrays stored in the graph are contained in this container type. This array only contains the raw bytes backing the data inside the arrays; the type and shape of the tensors are held separately. Tensors refer to these constant buffers by index into this top-level vector.
3.g Subgraphs
The next property we need to access is the list of subgraphs:
auto* subgraphs = model->subgraphs();
if (subgraphs->size() != 1) {
  error_reporter->Report("Only 1 subgraph is currently supported.\n");
  initialization_status_ = kTfLiteError;
  return;
}
subgraph_ = (*subgraphs)[0];
A subgraph is a set of operators, the connections between them, and the buffers, inputs, and outputs that they use.
3.h A closer look inside the schema: Part 1
To get a better idea of what’s in a subgraph, we need to take a look back at the schema:
// The root type, defining a subgraph, which typically represents an entire
// model.
table SubGraph {
// A list of all tensors used in this subgraph.
tensors:[Tensor];
// Indices of the tensors that are inputs into this subgraph. Note this is
// the list of non-static tensors that feed into the subgraph for inference.
inputs:[int];
// Indices of the tensors that are outputs out of this subgraph. Note this is
// the list of output tensors that are considered the product of the
// subgraph's inference.
outputs:[int];
// All operators, in execution order.
operators:[Operator];
// Name of this subgraph (used for debugging).
name:string;
}
The first property every subgraph has is a list of tensors, and the MicroInterpreter code accesses it like this:
tensors_ = subgraph_->tensors();
As mentioned earlier, the Buffer objects just hold raw values for weights, without any metadata about their types or shapes. Tensors are the place where this extra information is stored for constant buffers. The tensors also hold the same information for temporary arrays like inputs, outputs, or activation layers.
3.i A closer look inside the schema: Part 2
The metadata can be seen in the definition near the top of the schema file:
table Tensor {
// The tensor shape. The meaning of each entry is operator-specific but
// builtin ops use: [batch size, height, width, number of channels] (That's
// Tensorflow's NHWC).
shape:[int];
type:TensorType;
// An index that refers to the buffers table at the root of the model. Or,
// if there is no data buffer associated (i.e. intermediate results), then
// this is 0 (which refers to an always existent empty buffer).
//
// The data_buffer itself is an opaque container, with the assumption that the
// target device is little-endian. In addition, all builtin operators assume
// the memory is ordered such that if `shape` is [4, 3, 2], then index
// [i, j, k] maps to data_buffer[i*3*2 + j*2 + k].
buffer:uint;
name:string;
// For debugging and importing back into tensorflow.
quantization:QuantizationParameters;
// Optional.
is_variable:bool = false;
}
- The type is an enum mapping to the potential data types that are supported by TensorFlow Lite, whereas the shape is a list of numbers that represents the tensor’s dimensions.
- The buffer property identifies which Buffer in the root-level list holds the actual values backing this tensor; it is zero if the values are calculated dynamically (see the sketch after this list).
- The name is there to give a human-readable label for the tensor, which can help with debugging.
- The quantization property defines how to map low-precision values into real numbers.
- The is_variable member exists to support future training and other advanced applications.
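Putting the Tensor and Buffer pieces together, here is a minimal sketch (not the actual MicroInterpreter code; the index i is hypothetical) of reading a tensor's metadata and following its buffer index back to the raw weights, using the accessors generated from this older schema:

// Sketch: reading a tensor's metadata and the raw bytes behind it.
// `model` comes from ::tflite::GetModel(); `i` is a hypothetical tensor index.
const tflite::SubGraph* subgraph = (*model->subgraphs())[0];
const tflite::Tensor* tensor = (*subgraph->tensors())[i];

const flatbuffers::Vector<int32_t>* shape = tensor->shape();  // e.g. [1, 96, 96, 1]
tflite::TensorType type = tensor->type();                      // enum, e.g. INT8 or FLOAT32
uint32_t buffer_index = tensor->buffer();                      // 0 means no constant data

const tflite::Buffer* buffer = (*model->buffers())[buffer_index];
const uint8_t* raw_weights =
    buffer->data() ? buffer->data()->data() : nullptr;         // little-endian blob in flash

None of this copies any data; every accessor just follows offsets into the serialized model in flash.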
3.j Back to the MicroInterpreter class
Going back to the MicroInterpreter code, the second major property we pull from the subgraph is a list of operators. This list holds the graph structure of the model.
operators_ = subgraph_->operators();
3.k Schema definition of the operator
To understand how the operators are encoded, we need to go back to the schema definition of Operator:
// An operator takes tensors as inputs and outputs. The type of operation being
// performed is determined by an index into the list of valid OperatorCodes,
// while the specifics of each operations is configured using builtin_options
// or custom_options.
table Operator {
// Index into the operator_codes array. Using an integer here avoids
// complicate map lookups.
opcode_index:uint;
// Optional input and output tensors are indicated by -1.
inputs:[int];
outputs:[int];
builtin_options:BuiltinOptions;
custom_options:[ubyte];
custom_options_format:CustomOptionsFormat;
// A list of booleans indicating the input tensors which are being mutated by
// this operator.(e.g. used by RNN and LSTM).
// For example, if the "inputs" array refers to 5 tensors and the second and
// fifth are mutable variables, then this list will contain
// [false, true, false, false, true].
//
// If the list is empty, no variable is mutated in this operator.
// The list either has the same length as `inputs`, or is empty.
mutating_variable_inputs:[bool];
}
The opcode_index member is an index into the root-level operator_codes vector inside Model. Because some operators, like Conv2D, may appear more than once in a single graph and some require a string to specify them, keeping all of the op definitions in a single top-level array and referencing them indirectly from subgraphs reduces the size of serialization.
The inputs and outputs arrays define the connections between an operator and its neighbors in the graph. These are lists of integers that refer to the tensor array in the parent subgraph and may refer to constant buffers read from the model, inputs fed into the network by the application, the results of running other operations, or output destination buffers that the application will read after calculations have finished.
It’s crucial to note that the list of operators contained in the subgraph is always in topological order, which means that if you execute the operations in the array from start to finish, all inputs for each operation depending on earlier operations will already have been calculated.
This makes writing interpreters significantly simpler: the execution loop doesn't have to resolve the graph beforehand and can simply carry out the operations in the order they are listed, as the sketch below shows.
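To make that concrete, here is a rough sketch (not the actual MicroInterpreter code; InvokeKernelFor is a hypothetical dispatch helper) of what such an execution loop can look like:

// Sketch: because operators_ is already in topological order, one pass is enough.
for (size_t i = 0; i < operators_->size(); ++i) {
  const tflite::Operator* op = (*operators_)[i];

  // Resolve the op type through the root-level operator_codes vector in Model.
  const tflite::OperatorCode* opcode =
      (*model->operator_codes())[op->opcode_index()];

  // op->inputs() and op->outputs() hold indices into the subgraph's tensors,
  // whose values were either loaded from constant buffers or produced by
  // earlier iterations of this loop.
  InvokeKernelFor(opcode, op);  // hypothetical kernel dispatch
}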
3.l Operators
Operators also usually require parameters, like the shape and stride for the filters for a Conv2D kernel. The representation of these is unfortunately pretty complex.
TensorFlow Lite for microcontrollers supports two different families of operations:
- Builtin operations are the most common ops that are used in mobile applications. You can see a list in the schema. As of November 2019 there are about 122 ops.
- Custom operations are defined by a string name instead of a fixed enum like built-ins, so they can be added more easily without touching the schema. For built-in ops, the parameter structures are listed in the schema.
Here’s an example for Conv2D:
table Conv2DOptions {
padding:Padding;
stride_w:int;
stride_h:int;
fused_activation_function:ActivationFunctionType;
dilation_w_factor:int = 1;
dilation_h_factor:int = 1;
}
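For a builtin op, the generated code exposes a typed accessor for each member of the BuiltinOptions union, so reading these parameters is straightforward. A hedged sketch, assuming op points at an Operator whose opcode is CONV_2D:

// Sketch: reading builtin Conv2D parameters from an Operator entry.
const tflite::Conv2DOptions* conv_params = op->builtin_options_as_Conv2DOptions();
if (conv_params != nullptr) {
  int32_t stride_w = conv_params->stride_w();
  int32_t stride_h = conv_params->stride_h();
  tflite::Padding padding = conv_params->padding();
  tflite::ActivationFunctionType activation =
      conv_params->fused_activation_function();
  // ... pass these to the convolution kernel ...
}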
If the operator code turns out to be a custom operator, we can't use a generated options object, because the structure of its parameter list isn't known ahead of time. Instead, a FlexBuffer is used to hold the parameter information. FlexBuffers is the format the FlatBuffers library provides for encoding arbitrary data when you don't know the structure ahead of time. This means that the code implementing the operator must access the parameters by declaring what each type is, using messier syntax than builtins use.
Here's an example from some object detection code, showing how parameter data is read through a buffer pointer (buffer_t) that ultimately comes from the Operator table's custom_options member:

const flexbuffers::Map& m = flexbuffers::GetRoot(buffer_t, length).AsMap();
op_data->max_detections = m["max_detections"].AsInt32();
4. Suggestions to improve the TF Lite Micro Framework
4.a Remove the AllocateTensors() method and call it inside the interpreter's constructor
When constructing the interpreter, letting the constructor allocate tensors automatically would be easier than calling a separate AllocateTensors() method.
A bit more explanation:
The current way to build an interpreter and allocate tensors is given below:
- An interpreter is constructed
- It is assigned to a pointer
- The tensors are allocated in the memory
The code uses three different statements to perform the steps above, as shown in the sketch below.
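For reference, the current pattern, following the shape of the hello_world example setup code (names like resolver and kTensorArenaSize are illustrative), looks roughly like this:

// Sketch of the current three-step pattern the suggestion would collapse:
// 1. construct the interpreter, 2. take a pointer to it, 3. allocate tensors.
static tflite::MicroInterpreter static_interpreter(
    model, resolver, tensor_arena, kTensorArenaSize, error_reporter);
tflite::MicroInterpreter* interpreter = &static_interpreter;

TfLiteStatus allocate_status = interpreter->AllocateTensors();
if (allocate_status != kTfLiteOk) {
  error_reporter->Report("AllocateTensors() failed");
  return;
}

Folding the AllocateTensors() call into the constructor would collapse the last step into the first, so a freshly constructed interpreter is immediately ready to run.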
Source: In-depth: TensorFlow Lite for Microcontrollers – Part 2