Have you been using float or double variables to perform mathematical operations on embedded systems without a Floating-Point Unit (FPU)? You are doing it wrong! That's incredibly inefficient. Use fixed-point representation instead.
An FPU is a hardware block specially designed to carry out arithmetic operations on floating-point numbers. Even though your C/C++ code may work without an FPU, hardware designed for a specific purpose, like this one, is always much faster than a software implementation. Without an FPU, the compiler, knowing the hardware restrictions you have, falls back on that software implementation for you, but not in an efficient manner: it generates a lot of assembly code, greatly increasing the size of your program and the time required to complete each operation. Thus, if you don't have an FPU available and you still want to perform those arithmetic operations efficiently, you'll have to convert those numbers to fixed-point representation. Integers! But how? By scaling them. Let's see how that scaling value may be determined.
The scaling value, as well as the resulting scaled number (which is an integer), depends heavily on the bitness of the CPU architecture being used. You want to use values that fit in the available registers, which typically have the same width as the CPU buses. So, whether you are working with an 8, 16 or 32-bit architecture, the range of integer values x you can store in those registers, with b being the number of bits and numbers represented in two's complement, is given by:
$$-2^{b-1} \le x \le 2^{b-1} - 1$$
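As a quick sanity check of the two's complement range for b bits, which runs from -2^(b-1) to 2^(b-1) - 1, here is a minimal C sketch. The helper names `int_min_for_bits` and `int_max_for_bits` are made up for this illustration and are not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest value of a b-bit two's complement integer: -2^(b-1).
 * Hypothetical helper, limited here to b <= 32. */
int64_t int_min_for_bits(unsigned b)
{
    assert(b >= 1 && b <= 32);
    return -((int64_t)1 << (b - 1));
}

/* Largest value of a b-bit two's complement integer: 2^(b-1) - 1. */
int64_t int_max_for_bits(unsigned b)
{
    assert(b >= 1 && b <= 32);
    return ((int64_t)1 << (b - 1)) - 1;
}
```

For b = 8, 16 and 32 these helpers return the same limits that `<stdint.h>` exposes as `INT8_MIN`/`INT8_MAX`, `INT16_MIN`/`INT16_MAX` and `INT32_MIN`/`INT32_MAX`.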
Fixed-Point Representation
If one bit is used to represent the sign (and in this text we'll always consider signed numbers), the remaining ones may be used to represent the integer and fractional parts of the original floating-point number. We may textually represent this format as follows (denoted as Q-format):
Qm.n
Here, m is the number of bits available to represent the integer part and n is the number of bits available to represent the fractional part. If m is zero you may use just Qn. So, when you use a register to save both the integer and fractional parts (and the sign bit!), that is, b = m + n + 1, the value range is then given by:
$$-2^{m} \le x \le 2^{m} - 2^{-n}$$
(Note that the plain integer range for b bits is a particular case of this one, with n = 0 and m = b - 1.)
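To get a feel for what a given choice of m and n buys you, the sketch below computes the Q-format limits, from -2^m to 2^m - 2^-n in steps of 2^-n. The type `q_range_t` and the function `q_range` are hypothetical names used only for this example. The code uses `double`, so treat it as a design-time helper to run on your development machine, not on the FPU-less target.

```c
#include <assert.h>
#include <stdint.h>

/* Limits of a signed Qm.n format (1 sign bit + m integer bits + n fractional bits). */
typedef struct {
    double min;        /* -2^m          */
    double max;        /*  2^m - 2^-n   */
    double resolution; /*  2^-n, the smallest representable step */
} q_range_t;

/* Hypothetical helper: assumes the whole number fits in a 32-bit register. */
q_range_t q_range(unsigned m, unsigned n)
{
    q_range_t r;

    assert(m + n + 1 <= 32);

    r.resolution = 1.0 / (double)((uint64_t)1 << n);
    r.min = -(double)((uint64_t)1 << m);
    r.max = (double)((uint64_t)1 << m) - r.resolution;
    return r;
}
```

For example, `q_range(0, 15)` (that is, Q15 in a 16-bit register) gives a range from -1.0 to 0.999969482421875 with a step of 0.000030517578125, while `q_range(7, 0)` reduces to the 8-bit integer range from -128 to 127.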
It's up to you to decide how many bits are reserved for m and n (still, you should base your decision on good criteria: the more bits you give to n, the greater the precision you can achieve, while the more bits you give to m, the wider the range; more on this below). So, you are essentially fixing an imaginary point in your register that separates the integer and fractional parts. That's why it's called fixed-point.
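Fixing the point amounts to scaling by 2^n. Here is a minimal sketch of that scaling, assuming a 32-bit register split as Q15.16 (1 sign bit, m = 15, n = 16). The split, the type name `fixed_t` and the function names are illustrative choices, not a standard API. The conversions from and to `double` are meant for constants, tests or host-side tooling, since calling them at run time on a target without an FPU would bring the software floating-point routines right back.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 16                       /* n: assumed split, Q15.16 */
#define FIXED_ONE ((int32_t)1 << FRAC_BITS) /* the value 1.0 in fixed point */

typedef int32_t fixed_t;

/* double -> fixed: scale by 2^n and round to the nearest integer. */
fixed_t fixed_from_double(double x)
{
    double scaled = x * (double)FIXED_ONE;
    double rounded = (scaled >= 0.0) ? scaled + 0.5 : scaled - 0.5;

    /* The value must fit in the register, otherwise the cast below overflows. */
    assert(rounded >= (double)INT32_MIN && rounded <= (double)INT32_MAX);
    return (fixed_t)rounded;
}

/* fixed -> double: undo the scaling. */
double fixed_to_double(fixed_t f)
{
    return (double)f / (double)FIXED_ONE;
}

/* Integer -> fixed using integer arithmetic only. */
fixed_t fixed_from_int(int16_t i)
{
    return (fixed_t)i * FIXED_ONE;
}
```

With this split, 3.25 is stored as the integer 212992 (3.25 x 65536), and `fixed_to_double(212992)` gives back 3.25 exactly because 0.25 is a multiple of the 2^-16 step.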
For more detail: The Art of Representing Floating-Point Numbers as Integers