
Data representation (I): Numbers

16 Dec 2014   |   iker hurtado  
With this entry I intend, once and for all, to understand the binary encoding of numbers (mainly floating-point) in detail and in a practical way, and to write it down so I can come back to it later. For very detailed and complete information, go to the links at the end.

Integer number representation

Integer is the word that we use in computer programming to refer to whole number representations. The commonly used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation subsets for integers: unsigned integers (can represent zero and positive integers) and signed integers (can represent zero, positive and negative integers).

It is the task of the programmer to choose the bit-length and representation subset for integers depending on the application's requirements.

Unsigned integer representation

The binary representation of unsigned integers is very easy to understand. The value of an unsigned integer is interpreted as "the magnitude of its underlying binary pattern".

An n-bit unsigned integer can represent integers from 0 to 2^n - 1. Examples:

8-bit integer: 0 - 255
16-bit integer: 0 - 65535
32-bit integer: 0 - 4294967295
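The ranges above can be checked with a minimal C sketch (assuming the fixed-width types and limits from <stdint.h>):

/* Print the value ranges of the common unsigned integer widths. */
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    printf("8-bit : 0 - %" PRIu8  "\n", (uint8_t)UINT8_MAX);   /* 2^8  - 1 = 255 */
    printf("16-bit: 0 - %" PRIu16 "\n", (uint16_t)UINT16_MAX); /* 2^16 - 1 = 65535 */
    printf("32-bit: 0 - %" PRIu32 "\n", (uint32_t)UINT32_MAX); /* 2^32 - 1 = 4294967295 */
    return 0;
}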

Signed integer 2's complement representation

Computers use 2's complement to represent signed integers because this method achieves two important goals: there is a single representation for the number zero and, more importantly, positive and negative integers can be added and subtracted directly and very efficiently. Subtraction can be carried out using the same "addition logic".

In 2's complement representation the most significant bit is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.

The remaining n-1 bits represent the magnitude of the integer.

For positive integers, the absolute value of the integer is equal to the magnitude of the (n-1)-bit binary pattern. Example (n=8):

01000011B → abs(1000011B) → +67D → 67D

For negative integers, the absolute value of the integer is equal to the magnitude of the complement of the (n-1)-bit binary pattern plus one (hence called 2's complement). Example (n=8):

10000011B → - (1111100B + 1B) → - abs(1111101B) → -125D
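Both interpretations can be reproduced with a short C sketch. This is only a sketch: it assumes an ordinary 2's complement platform (the conversion of an out-of-range value to int8_t is implementation-defined before C23), and the manual magnitude is obtained by inverting the bits and adding one:

/* Reinterpret 8-bit patterns as 2's complement signed integers. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t pos_bits = 0x43; /* 01000011B */
    uint8_t neg_bits = 0x83; /* 10000011B */

    /* Casting to int8_t reads the pattern as a 2's complement value
       (implementation-defined before C23, but this is what common
       platforms do). */
    printf("01000011B -> %d\n", (int8_t)pos_bits); /*   67 */
    printf("10000011B -> %d\n", (int8_t)neg_bits); /* -125 */

    /* Manual 2's complement: invert the bits and add one to get the
       magnitude of the negative pattern. */
    uint8_t magnitude = (uint8_t)(~neg_bits + 1);  /* 0x7D = 125 */
    printf("magnitude of 10000011B = %d\n", magnitude);
    return 0;
}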

An n-bit 2's complement signed integer can represent integers from -2^(n-1) to 2^(n-1) - 1, like this:

8 bits: from -128 to 127
16 bits: from -32768 to 32767
32 bits: from -2147483648 to 2147483647
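A quick way to see these signed ranges in C, assuming the exact-width types from <stdint.h>:

/* Print the value ranges of the common signed integer widths. */
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    printf("8 bits : %" PRId8  " to %" PRId8  "\n", (int8_t)INT8_MIN,   (int8_t)INT8_MAX);
    printf("16 bits: %" PRId16 " to %" PRId16 "\n", (int16_t)INT16_MIN, (int16_t)INT16_MAX);
    printf("32 bits: %" PRId32 " to %" PRId32 "\n", (int32_t)INT32_MIN, (int32_t)INT32_MAX);
    return 0;
}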

Arithmetic issues

As said, arithmetic with 2's complement integers is very efficient, but it has some quirks. Let's see:

When a positive and a negative number are added and the result is positive, the carry-out bit has to be discarded:

65D - 5D → 65D + (-5D) → 01000001B + 11111011B = out(1)00111100B → 60D

When two negative numbers are added and the result is within the representable range (no underflow), a carry-out bit appears and has to be discarded:

-65D -5D → (-65D) + (-5D) → 10111111B + 11111011B = out(1)10111010B → -70D 

As seen, n-bit 2's complement signed integers represent a certain range of whole numbers. So, during addition (and subtraction), it is important to check whether the result exceeds this range, in other words, whether overflow or underflow has occurred.

Overflow example:

127D + 2D → 01111111B + 00000010B = 10000001B → -127D  WRONG RESULT

Underflow (and carry-out bit discarded) example:

-125D - 5D → 10000011B + 11111011B = out(1)01111110B → 126D WRONG RESULT 
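These four cases can be reproduced in C. Signed overflow is undefined behaviour in C, so the sketch below (add8 is just an illustrative helper name) performs the 8-bit addition on unsigned values, which wrap modulo 2^8 and discard the carry-out automatically, and then reinterprets the result as a signed byte:

/* 8-bit 2's complement addition: wrap modulo 2^8, discard carry-out. */
#include <stdint.h>
#include <stdio.h>

static int8_t add8(int8_t a, int8_t b)
{
    /* The conversion back to int8_t is implementation-defined before
       C23, but yields the expected 2's complement result on common
       platforms. */
    return (int8_t)(uint8_t)((uint8_t)a + (uint8_t)b);
}

int main(void)
{
    printf("  65 + (-5) = %d\n", add8(65, -5));   /*   60, correct   */
    printf(" -65 + (-5) = %d\n", add8(-65, -5));  /*  -70, correct   */
    printf(" 127 +   2  = %d\n", add8(127, 2));   /* -127, overflow! */
    printf("-125 + (-5) = %d\n", add8(-125, -5)); /*  126, underflow!*/
    return 0;
}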

Floating-Point number representation

Floating point is a method of representing real numbers that balances range and precision. Floating-point numbers have a different representation than integers and are processed differently (in a so-called floating-point processor).

A floating-point number can represent a very large positive (or negative) value (1.45×10^56) or a very small (close to zero) positive (or negative) value (1.45×10^-56).

A number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent. So, the floating-point representation can be thought of as a kind of scientific notation, of this form:

significand × base^exponent

Therefore, the representation of a floating-point number is not unique, but the fractional part (significand) can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, the decimal number 123.4567 can be normalized as 1.234567×10^2; the binary number 1010.1011B can be normalized as 1.0101011B×2^3.

It is worth noting that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (32-bit or 64-bit). This is because there are infinitely many real numbers, while an n-bit binary pattern can represent only a finite number (2^n) of distinct values.

In addition, the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers grows with the chosen scale.
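This non-uniform spacing is easy to observe; here is a short sketch (assuming C99 and nextafterf from <math.h>) that prints the gap to the next representable float around several magnitudes:

/* The gap between consecutive floats grows with the magnitude. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    float samples[] = { 1.0f, 1024.0f, 1048576.0f, 1e30f };
    for (int i = 0; i < 4; i++) {
        float v   = samples[i];
        float gap = nextafterf(v, INFINITY) - v; /* distance to the next float */
        printf("around %g the spacing is %g\n", v, gap);
    }
    return 0;
}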

Modern computers adopt IEEE 754 standard for representing floating-point numbers. There are two main representation schemes: 32-bit single-precision and 64-bit double-precision.


32-bit Single-Precision Floating-Point representation

This encoding is usually used to represent the "float" type in the C language family.

I'll use 1100 0000 1011 0000 0000 0000 0000 0000 as an example.

The binary representation is:

The most significant bit is the sign bit, with 0 for positive numbers and 1 for negative numbers.

In the example the sign value is 1, so it's a negative number.

The following 8 bits represent the exponent. The exponent encoding is called excess-127 or bias-127: the value is stored as an unsigned number that has a fixed "bias" added to it. With this scheme both positive and negative exponents can be represented, with the exponent value ranging from -127 to 128 (later we'll see the special values of all 0s and all 1s).

With the example value (10000001B = 129D):  
129 - 127 = 2 is the decimal value of the exponent. 

The remaining 23 bits represent the significand or fraction. The significand is normalized with an implicit leading 1, in the form 1.F.

In the example (011 0000 0000 0000 0000 0000):
the actual fraction is 1.011 0000 0000 0000 0000 0000 =
1 + 1×2^-2 + 1×2^-3 = 1.375 is the decimal value.

Finally, the number represented is:

-1.375 × 2^2 = -5.5
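The worked example can be verified with a small C sketch that reinterprets the bit pattern 0xC0B00000 as a float (assuming 32-bit IEEE 754 floats, as on virtually every current platform) and extracts the three fields with shifts and masks:

/* Decode 1100 0000 1011 0000 0000 0000 0000 0000 (0xC0B00000). */
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    uint32_t bits = 0xC0B00000u;
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bit pattern */
    printf("value = %g\n", f);     /* -5.5                        */

    uint32_t sign     = bits >> 31;           /* 1 -> negative        */
    uint32_t exponent = (bits >> 23) & 0xFFu; /* 129 -> 129 - 127 = 2 */
    uint32_t fraction = bits & 0x7FFFFFu;     /* 011 0000 ... -> 1.375 with the implicit 1 */
    printf("sign=%" PRIu32 " exponent=%" PRIu32 " fraction=0x%06" PRIX32 "\n",
           sign, exponent, fraction);
    return 0;
}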

De-Normalized Form

The previously explained normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot represent the number zero. The de-normalized form was devised to represent zero and other special numbers.

The exponent value 00000000 is reserved for the zeros and denormal or subnormal numbers. Now an implicit leading 0 (instead of 1) is used for the significand and the exponent value is always -126 (not -127 because of the radix point shift). The number zero is represented with exponent value 0 and significand value 0. Hence, there are two representations for zero: +0 with S=0 and -0 with S=1.

It is worth noting, as an advantage, that subnormal numbers in de-normalized form can represent very small (close to zero) positive and negative numbers.

The exponent value 11111111 is reserved for special values, such as ±INF (positive and negative infinity) and NaN (not a number).
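A small C sketch (assuming IEEE 754 single precision and the C99 INFINITY and NAN macros from <math.h>) shows these special encodings in action:

/* Zeros, a subnormal number, infinity and NaN in single precision. */
#include <math.h>
#include <float.h>
#include <stdio.h>

int main(void)
{
    float pos_zero  = 0.0f;
    float neg_zero  = -0.0f;            /* S=1, exponent=0, fraction=0    */
    float subnormal = FLT_MIN / 2.0f;   /* below FLT_MIN -> de-normalized */
    float inf       = INFINITY;         /* exponent all 1s, fraction=0    */
    float not_a_num = NAN;              /* exponent all 1s, fraction!=0   */

    printf("+0 == -0   : %d\n", pos_zero == neg_zero); /* 1: they compare equal */
    printf("subnormal  : %g\n", subnormal);
    printf("isinf(inf) : %d\n", isinf(inf));
    printf("isnan(nan) : %d\n", isnan(not_a_num));
    return 0;
}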

64-bit Double-Precision Floating-Point representation

The representation logic and scheme for 64-bit double-precision are analogous to those of 32-bit single-precision; only the number of bits allocated to each field changes: 1 bit for the sign, 11 bits for the exponent and 52 bits for the significand.

The value is calculated this way:

Normalized form:

For 1 ≤ exponent ≤ 2046, value = (-1)^sign × 1.significand × 2^(exponent-1023)

Denormalized form:

For exponent = 0, value = (-1)^sign × 0.significand × 2^(exponent-1022)

For exponent = 2047, it represents special values, such as ±INF (infinity), NaN (not a number).
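The same field extraction can be sketched for a double, assuming a 64-bit IEEE 754 representation with 1 sign bit, 11 exponent bits (bias 1023) and 52 fraction bits; decoding -5.5 again:

/* Decompose a 64-bit double into sign, exponent and fraction. */
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    double d = -5.5;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* reinterpret the 64-bit pattern */

    uint64_t sign     = bits >> 63;                 /* 1 -> negative           */
    uint64_t exponent = (bits >> 52) & 0x7FFu;      /* 1025 -> 1025 - 1023 = 2 */
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;  /* .375 -> 0x6000000000000 */

    printf("sign=%" PRIu64 " exponent=%" PRIu64 " fraction=0x%013" PRIX64 "\n",
           sign, exponent, fraction);
    return 0;
}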

More info about floating-point operations and floating-point types in C can be found in my post:

The Hw/Sw Interface. Section 2: Integer and Floating Point Numbers - ikerhurtado.com


For very detailed info:
