# The Art of Representing Floating-Point Numbers as Integers

Have you been using float or double variables to perform mathematical operations on embedded systems without a Floating-Point Unit (FPU)? You are doing it wrong! That's incredibly inefficient. Use fixed-point representation instead.

An FPU is a hardware block specially designed to carry out arithmetic operations on floating-point numbers. Even though C/C++ code may work without an FPU, it's always much faster to use hardware designed for that specific purpose than to rely on a software implementation. The compiler will provide one for you, knowing the hardware restrictions you have, but not in an efficient manner. Essentially, it will generate a lot of assembly code, greatly increasing the size of your program and the amount of time required to complete each operation. Thus, if you don't have an FPU available and you still want to perform those arithmetic operations efficiently, you'll have to convert those numbers to fixed-point representation. Integers! But how? By scaling them. Let's see how that scaling value may be determined.

The scaling value, as well as the resulting scaled number (which is an integer), depends heavily on the word size of the CPU architecture being used. You want to use values that fit in the available registers, which typically have the same width as the CPU's data bus. So, whether you are working with an 8, 16 or 32-bit architecture, the range of integer values you can store in those registers, with b being the number of bits and numbers represented in two's complement, is given by:

$$-2^{b-1} \le \text{value} \le 2^{b-1} - 1$$
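As a quick check of this range, here is a minimal sketch in C. The function names `twos_complement_min` and `twos_complement_max` are made up for this illustration, and the code assumes b is between 1 and 32.

```c
#include <stdint.h>

/* Smallest value of a b-bit two's complement integer: -2^(b-1).
 * Assumes 1 <= b <= 32. The result is widened to 64 bits so that
 * the b = 32 case does not overflow. */
int64_t twos_complement_min(unsigned b)
{
    return -((int64_t)1 << (b - 1));
}

/* Largest value of a b-bit two's complement integer: 2^(b-1) - 1.
 * Same assumption on b as above. */
int64_t twos_complement_max(unsigned b)
{
    return ((int64_t)1 << (b - 1)) - 1;
}
```

For b = 8 these return -128 and 127, the same limits that `<stdint.h>` exposes as `INT8_MIN` and `INT8_MAX`.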

### Fixed-Point Representation

If one bit is used to represent the sign (and in this text we'll always consider signed numbers), the remaining ones may be used to represent the integer and fractional parts of the floating-point number. We may textually represent this format as follows (denoted as Q-format):

Qm.n

Where m corresponds to the bits available to represent the integer part and n corresponds to the bits available to represent the fractional part. If m is zero you may use just Qn.

So, when you use a register to save both integer and fractional parts (and the sign bit!), the value range is then given by:

$$-2^{m} \le \text{value} \le 2^{m} - 2^{-n}$$

(Note that the earlier range for plain b-bit integers is a particular case of this one, for n = 0 and m = b - 1.)
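The Qm.n range can be explored with a small C sketch. The `qformat` struct and the `q_raw_min`, `q_raw_max` and `q_to_real` helpers are hypothetical names introduced here, and the code assumes m + n is at most 31. The conversion to `double` is meant for inspecting values on a development machine, not for running on an FPU-less target.

```c
#include <stdint.h>

/* Hypothetical descriptor of a signed Qm.n format:
 * 1 sign bit + m integer bits + n fractional bits. */
typedef struct {
    unsigned m; /* integer bits */
    unsigned n; /* fractional bits */
} qformat;

/* Smallest raw (scaled) integer: -2^(m+n). Assumes m + n <= 31. */
int64_t q_raw_min(qformat q)
{
    return -((int64_t)1 << (q.m + q.n));
}

/* Largest raw (scaled) integer: 2^(m+n) - 1. Assumes m + n <= 31. */
int64_t q_raw_max(qformat q)
{
    return ((int64_t)1 << (q.m + q.n)) - 1;
}

/* Real value represented by a raw integer: raw / 2^n.
 * Uses double, so this is a host-side inspection helper only. */
double q_to_real(qformat q, int64_t raw)
{
    return (double)raw / (double)((int64_t)1 << q.n);
}
```

For Q7.8 (m = 7, n = 8) the raw limits are -32768 and 32767, which correspond to the real values -128 and 127.99609375, that is, 2^7 - 2^-8.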

It's up to you to decide how many bits are reserved for m and n (still, you should base your decision on a good criterion: the more bits you give to n, the greater the precision you can achieve; more on this below). So, you are essentially fixing an imaginary point in your register that separates the integer and fractional parts. That's why it's called fixed-point.
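To make the imaginary point concrete, here is a minimal sketch in C that assumes a Q7.8 format stored in a 16-bit register, so the scaling value is 2^8 = 256. The `q7_8` type, the `Q_FRAC_BITS` macro and the two helper functions are hypothetical names chosen for this example. The conversions use `double`, so on an FPU-less target they would typically be done at build time or on a host machine, leaving only integer arithmetic for the device.

```c
#include <stdint.h>

#define Q_FRAC_BITS 8   /* n: assumed number of fractional bits (Q7.8) */

typedef int16_t q7_8;   /* 1 sign bit + 7 integer bits + 8 fractional bits */

/* Scale a real number by 2^n and round to the nearest integer.
 * No saturation is performed: the caller must keep x within
 * the Q7.8 range, -128 <= x <= 128 - 2^-8. */
q7_8 q7_8_from_double(double x)
{
    double scaled = x * (double)(1 << Q_FRAC_BITS);
    return (q7_8)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Undo the scaling: divide the stored integer by 2^n. */
double q7_8_to_double(q7_8 q)
{
    return (double)q / (double)(1 << Q_FRAC_BITS);
}
```

With these assumptions, 1.5 is stored as the integer 384 and -0.25 as -64. A value such as 3.14159 is stored as 804, which converts back to 3.140625, so the error stays below the Q7.8 resolution of 2^-8.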


## About The Author

### Ibrar Ayyub

I am an experienced technical writer holding a Master's degree in computer science from BZU Multan, Pakistan University. With a background spanning various industries, particularly in home automation and engineering, I have honed my skills in crafting clear and concise content. Proficient in leveraging infographics and diagrams, I strive to simplify complex concepts for readers. My strength lies in thorough research and presenting information in a structured and logical format.