Integer number representation
Integer is the word that we use in computer programming to refer to whole number representations. The commonly used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation subsets for integers: unsigned integers (can represent zero and positive integers) and signed integers (can represent zero, positive and negative integers).
It is the task of the programmer to choose the bit-length and representation subset for his integers depending on his application's requirements.
Unsigned integer representation
The binary representation of unsigned integers is very easy to understand. The value of an unsigned integer is interpreted as "the magnitude of its underlying binary pattern".
An n-bit unsigned integer can represent integers from 0 to 2n-1. Examples:
8 bits integer: 0 - 255 16 bits integer: 0 - 65535 32 bits integer: 0 - 4294967295
Signed integer 2's complement representation
Computers use 2's complement to represent signed integers because with this method are achieved two important goals: one representation for the number zero and, more important, positive and negative integers can be computed directly -very efficiently- in addition and subtraction. Subtraction can be carried out using the "addition logic".
In 2's complement representation the most significant bit is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
The remaining n-1 bits represents the magnitude of the integer.
For positive ones, the absolute value of the integer is equal to the magnitude of the (n-1)-bit binary pattern. Example (n=8):
01000011B → abs(1000011B) → +67D → 67D
For negative integers, the absolute value of the integer is equal to the magnitude of the complement of the (n-1)-bit binary pattern plus one (hence called 2's complement). Example (n=8):
10000011B → - (1111100B + 1B) → - abs(1111101B) → -125D
An n-bit 2's complement signed integer can represent integers from -2n-1 to 2n-1-1, like this:
8 bits: form -128 to 127 16 bits: from -32,768 to 32,767 32 bits: from -2147483648 to 2147483647
As said, Arithmetic with 2's complement integers is very efficient but has some quirks. Let's see:
When a positive and a negative numbers are operated and the result is positive the carry-out bit has to be discarded:
65D - 5D → 65D + (-5D) → 01000001B + 11111011B = out(1)00111100B → 60D
When two negative numbers are operated and the result is in the right range of representation (not underflow) the carry-out bit appears and has to be discarded:
-65D -5D → (-65D) + (-5D) → 10111111B + 11111011B = out(1)10111010B → -70D
As seen, n-bit 2's complement signed integers represent a certain range of whole numbers. So, during addition (and subtraction), it is important to check whether the result exceeds this range, in other words, whether overflow or underflow has occurred.
127D + 2D → 01111111B + 00000010B = 10000001B → -127D WRONG RESULT
Underflow (and carry-out bit discarded) example:
-125D - 5D → 10000011B + 11111011B = out(1)01111110B → 126D WRONG RESULT
Floating-Point number representation
It is a method of representing real numbers that balances range and precision. Floating-point numbers have different nature of representation and are processed differently (in a so-called floating-point processor) than integers in computers.
A floating-point number can represent a very large positive (or negative) (1.45×1056) or a very small (close to zero) positive (or negative) (1.45×10-56) value.
A number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent. So, the floating-point representation can be thought of as a kind of scientific notation. Like this form:
significand × baseexponent
Therefore, representation of floating point number is not unique but the fractional part (significand) can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number 123.4567 can be normalized as 1.234567×102; binary number 1010.1011B can be normalized as 1.0101011B×23.
It worthy to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (32-bit or 64-bit). This is because there are infinite number of real numbers and a n-bit binary pattern can represent a finite distinct number (2n) of numbers.
In addition, the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers grows with the chosen scale.
Modern computers adopt IEEE 754 standard for representing floating-point numbers. There are two main representation schemes: 32-bit single-precision and 64-bit double-precision.
32-bit Single-Precision Floating-Point representation
This codification is usually used to represent the "float" type in the C language family.
I'll use 1100 0000 1011 0000 0000 0000 0000 0000 like example.
The binary representation is:
The most significant bit is the sign bit, with 0 for positive numbers and 1 for negative numbers.
In the example the sign value is 1, so it's a negative number.
The following 8 bits represent exponent. The exponent codification is called excess-127 or bias-127; the value is stored as an unsigned number that has a fixed "bias" added to it. With this scheme we achieve to represent both positive and negative exponent, ranging the exponent value from -127 to 128. (later we'll see the especial values of all 0s and all 1s)
With the example value (10000001B = 129D): 129 - 127 = 2 is the decimal value of the exponent.
The remaining 23 bits represents the significand or fraction. The significand is normalized with an implicit leading 1 in the form of 1.F.
In the example (011 0000 0000 0000 0000 0000): the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2-2 + 1×2-3 = 1.375 is the decimal value.
Finally, the number represented is:
- 1.375 × 22 = -5.5
The previous explained normalized form has a serious problem, with an implicit leading 1 for the fraction, it cannot represent the number zero. De-normalized form was devised to represent zero and other special numbers.
The exponent value 00000000 is reserved for the zeros and denormal or subnormal numbers. Now an implicit leading 0 (instead of 1) is used for the significand and the exponent value is always -126 (not -127 because of the radix point shift). The number zero is represented with exponent value 0 and significand value 0. Hence, there are two representations for zero: +0 with S=0 and -0 with S=1.
It's worthy to note that it's an advantage that subnormal numbers can represent very small (close to zero) positive and negative numbers in de-normalized form.
The exponent value 11111111 is reserved for special values, such as ±INF (positive and negative infinity) and NaN (not a number).
64-bit Double-Precision Floating-Point representation
The representation logic and scheme for 64-bit double-precision is analogous to the 32-bit single-precision; just change the number of bits allocated to each item (like showed in above figure).
The value is calculated this way:
For 1 ≤ exponent ≤ 2046, value = (-1)sign × 1.significand × 2exponent-1023
For exponent = 0, value = (-1)sign × 0.significand × 2exponent-1022
For exponent = 2047, it represents special values, such as ±INF (infinity), NaN (not a number).
More info about floating-point operations and floating-point types in C in my post:
For very detailed info: