# Perform Fixed-Point Arithmetic

This example shows how to perform basic fixed-point arithmetic operations.

Save the current state of all warnings before beginning.

warnstate = warning;

### Addition and Subtraction

When you add two unsigned fixed-point numbers, you may need a carry bit to correctly represent the result. For this reason, when adding two B-bit numbers with the same scaling, the resulting value has an extra bit compared to the two operands used.

a = fi(0.234375,0,4,6); c = a + a

c = 0.4688 DataTypeMode: Fixed-point: binary point scaling Signedness: Unsigned WordLength: 5 FractionLength: 6

a.bin

ans = '1111'

c.bin

ans = '11110'

With signed, two's-complement numbers, a similar scenario occurs because of the sign extension required to correctly represent the result.

a = fi(0.078125,1,4,6); b = fi(-0.125,1,4,6); c = a + b

c = -0.0469 DataTypeMode: Fixed-point: binary point scaling Signedness: Signed WordLength: 5 FractionLength: 6

a.bin

ans = '0101'

b.bin

ans = '1000'

c.bin

ans = '11101'

If you add or subtract two numbers with different precision, the radix point first needs to be aligned to perform the operation. The result is that there is a difference of more than one bit between the result of the operation and the operands, depending on how far apart the radix points are.

a = fi(pi,1,16,13); b = fi(0.1,1,12,14); c = a + b

c = 3.2416 DataTypeMode: Fixed-point: binary point scaling Signedness: Signed WordLength: 18 FractionLength: 14

### Further Considerations for Addition and Subtraction

The following pattern is **not** recommended. Because scalar additions are performed at each iteration in the `for`

loop, a bit is added to `temp`

during each iteration. As a result, instead of a bit growth of `ceil(log2(Nadds))`

, the bit growth is equal to `Nadds`

.

s = rng; rng('default'); b = fi(4*rand(16,1)-2,1,32,30); rng(s); % restore RNG state Nadds = length(b) - 1; temp = b(1); for n = 1:Nadds temp = temp + b(n+1); % temp has 15 more bits than b end

If the `sum`

command is used instead, the bit growth is curbed.

`c = sum(b) % c has 4 more bits than b`

c = 7.0059 DataTypeMode: Fixed-point: binary point scaling Signedness: Signed WordLength: 36 FractionLength: 30

### Multiplication

In general, a full-precision product requires a word length equal to the sum of the word lengths of the operands. In this example, the word length of the product `c`

is equal to the word length of `a`

plus the word length of `b`

. The fraction length of `c`

is also equal to the fraction length of `a`

plus the fraction length of `b`

.

a = fi(pi,1,20); b = fi(exp(1),1,16); c = a*b

c = 8.5397 DataTypeMode: Fixed-point: binary point scaling Signedness: Signed WordLength: 36 FractionLength: 30

### Assignment

When you assign a fixed-point value into a predefined variable, quantization might be involved. In such cases, the right-hand side of the expression is quantized by rounding to nearest and then saturating, if necessary, before assigning to the left-hand side.

N = 10; a = fi(2*rand(N,1)-1,1,16,15); b = fi(2*rand(N,1)-1,1,16,15); c = fi(zeros(N,1),1,16,14); for n = 1:N c(n) = a(n).*b(n); end

When the product `a(n).*b(n)`

is computed with full precision, an intermediate result with wordlength 32 and fraction length 30 is generated. That result is then quantized to a word length of 16 bits and a fraction length of 14 bits. The quantized value is then assigned to the element `c(n)`

.

### Quantize Results Explicitly

Often, it is not desirable to round to nearest or to saturate when quantizing a result because of the extra logic and computation required. It may also be undesirable to have to assign to a left-hand side value to perform the quantization. You can use the `quantize`

function for such purposes. A common case is a feedback-loop. If no quantization is introduced, unbounded bit growth will occur as more input data is provided.

a = fi(0.1,1,16,18); x = fi(2*rand(128,1)-1,1,16,15); y = fi(zeros(size(x)),1,16,14); for n = 1:length(x) z = y(n); y(n) = x(n) - quantize(a.*z,true,16,14,'Floor','Wrap'); end

In this example, the product `a.*z`

is computed with full precision and is then quantized to a wordlength of 16 bits and a fraction length of 14 bits. The quantization is done by rounding to floor (truncation) and allowing for wrapping if overflow occurs. Quantization still occurs at assignment because the expression `x(n) - quantize(a.*z,...)`

produces an intermediate result of 18 bits and `y`

is defined to have 16 bits.

To eliminate the quantization at assignment, you can introduce an additional explicit quantization so that no round to nearest or saturation logic is used. Because the left-hand side result has the same 16 bit word length and fraction length of 14 bits as `y(n)`

, no quantization is necessary.

a = fi(0.1,1,16,18); x = fi(2*rand(128,1)-1,1,16,15); y = fi(zeros(size(x)),1,16,14); T = numerictype(true,16,14); for n = 1:length(x) z = y(n); y(n) = quantize(x(n),T,'Floor','Wrap') - ... quantize(a.*z,T,'Floor','Wrap'); end

### Non-Full-Precision Sums

Full-precision sums are not always desirable and can result in complicated and inefficient generated C code. For example, the intermediate result `x(n) - quantize(...)`

in the above example has an 18-bit word length. Instead, it may be desirable to keep all results of addition and subtraction to 16 bits. You can use the `accumpos`

and `accumneg`

functions to keep the results of addition and subtraction to 16 bits.

a = fi(0.1,1,16,18); x = fi(2*rand(128,1)-1,1,16,15); y = fi(zeros(size(x)),1,16,14); T = numerictype(true, 16, 14); for n = 1:length(x) z = y(n); y(n) = quantize(x(n),T); % defaults: 'Floor','Wrap' y(n) = accumneg(y(n),quantize(a.*z, T)); % defaults: 'Floor','Wrap' end

### Model Accumulators

You can use the `accumpos`

and `accumneg`

functions to model accumulators. The behavior of `accumpos`

and `accumneg`

corresponds to the `+=`

and `-=`

operators in C, respectively. A common example is an FIR filter in which the coefficients and input data are represented with 16 bits. The multiplication is performed in full precision, yielding 32 bits, and an accumulator with 8 guard bits. In total, 40 bits are used to enable up to 256 accumulations without the possibility of overflow.

b = fi(1/256*[1:128,128:-1:1],1,16); % Filter coefficients x = fi(2*rand(300,1)-1,1,16,15); % Input data z = fi(zeros(256,1),1,16,15); % Used to store the states y = fi(zeros(size(x)),1,40,31); % Initialize Output data for n = 1:length(x) acc = fi(0,1,40,31); % Reset accumulator z(1) = x(n); % Load input sample for k = 1:length(b) acc = accumpos(acc,b(k).*z(k)); % Multiply and accumulate end z(2:end) = z(1:end-1); % Update states y(n) = acc; % Assign output end

### Matrix Arithmetic

To simplify syntax and shorten simulation time, you can use matrix arithmetic. For the FIR filter example, you can replace the inner loop with an inner product.

z = fi(zeros(256,1),1,16,15); % Used to store the states y = fi(zeros(size(x)),1,40,31); for n = 1:length(x) z(1) = x(n); y(n) = b*z; z(2:end) = z(1:end-1); end

The inner product `b*z`

is performed with full precision. Because this is a matrix operation, the bit growth is due to both the multiplication involved and the addition of the resulting products. Therefore, the bit growth depends on the length of the operands. In this example, `b`

and `z`

have length 256, resulting in an 8-bit growth due to the additions. The inner product results in 32 + 8 = 40 bits, with a fraction length of 31 bits. No quantization occurs in the assignment `y(n) = b*z`

because `y`

was initialized to this format.

If you had to perform an inner product for more than 256 coefficients, the bit growth would be more than 8 bits beyond the 32 needed for the product. If you only had a 40-bit accumulator, you could model the behavior by either introducing a quantizer, as in `y(n) = quantize(Q,b*z)`

, or you could use the `accumpos`

function.

### Model a Counter

You can use `accumpos`

to model a simple counter which wraps after reaching its maximum value. For example, you can model a 3-bit counter as follows.

c = fi(0,0,3,0); Ncounts = 20; % Number of times to count for n = 1:Ncounts c = accumpos(c,1); end

Because the 3-bit counter wraps back to 0 after reaching 7, the final value of the counter is `mod(20,8) = 4`

.

### Arithmetic with Other Built-In Data Types

In the C language, the result of an operation between an integer data type and a double data type promotes to a double. However, in MATLAB®, the result of an operation between a built-in integer data type and a double data type is an integer. In this respect, the `fi`

object behaves like the built-in integer data types in MATLAB. The result of an operation between a `fi`

and a `double`

is a `fi`

.

`fi * double`

When doing multiplication between `fi`

and `double`

, the `double`

is cast to a `fi`

with the same word length and signedness of the `fi`

, and best-precision fraction length. The result of the operation is a `fi`

.

a = fi(pi); b = 0.5 * a

b = 1.5708 DataTypeMode: Fixed-point: binary point scaling Signedness: Signed WordLength: 32 FractionLength: 28

`fi + double`

or `fi - double`

When doing addition or subtraction between `fi`

and `double`

, the double is cast to a `fi`

with the same `numerictype`

as the `fi`

. The result of the operation is a `fi`

.

a = fi(pi); b = a + 1

b = 4.1416 DataTypeMode: Fixed-point: binary point scaling Signedness: Signed WordLength: 17 FractionLength: 13

`fi * int8`

When doing arithmetic between `fi`

and one of the built-in integer data types, `[u]int[8,16,32,64]`

, the word length and signedness of the integer are preserved. The result of the operation is a `fi`

.

a = fi(pi); b = int8(2) * a

b = 6.2832 DataTypeMode: Fixed-point: binary point scaling Signedness: Signed WordLength: 24 FractionLength: 13

Restore warning states.

```
warning(warnstate);
%#ok<*NASGU,*NOPTS>
```