Number Representation and Arithmetic Operations in Computer Systems


Basic Terms to Know

Before we dive into the details, let's establish a solid foundation with these key terms:

  • Binary: A base-2 number system using only 0 and 1 to represent all numbers. Example: 1010 in binary equals 10 in decimal.
  • Bit: The smallest unit of data in computing, representing a single binary digit (0 or 1). Example: A bit can store either 0 or 1.
  • Byte: A unit of digital information consisting of 8 bits. Example: 1 byte can store a character like 'A' in ASCII encoding.
  • Integer: Whole numbers without fractional parts, including positive, negative, and zero. Example: -5, 0, 42.
  • Floating-Point: A way to represent real numbers with fractional parts in computing systems. Example: 3.14159 or -0.001.
  • Fixed-Point: A number representation with a fixed number of digits after the decimal point. Example: 123.45 with two fixed decimal places.
  • Hexadecimal: A base-16 number system using digits 0-9 and letters A-F. Example: 0x1F represents 31 in decimal.
  • Octal: A base-8 number system using digits 0-7. Example: 075 in octal equals 61 in decimal.
  • Overflow: When a value exceeds the maximum limit of its data type. Example: Adding 1 to an 8-bit unsigned integer at 255 causes it to wrap around to 0.
  • Underflow: When a value is too small to be represented in its data type. Example: Subtracting 1 from an 8-bit unsigned integer at 0 wraps around to 255.
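These base conversions are easy to check in Python, whose integer literals support all four bases (a quick illustrative sketch):

```python
# Binary, octal, and hexadecimal literals all denote ordinary integers.
print(0b1010)  # 10  (binary 1010)
print(0o75)    # 61  (octal 75)
print(0x1F)    # 31  (hexadecimal 1F)

# Converting an integer back to a string in each base.
n = 61
print(bin(n), oct(n), hex(n))  # 0b111101 0o75 0x3d
```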

Integer Representation

Integers in computers are typically represented using a fixed number of bits, most commonly 8, 16, 32, or 64 bits. This representation determines the range and precision of integer values that can be stored and manipulated.

Unsigned Integers

Unsigned integers only represent non-negative whole numbers. The range for an $n$-bit unsigned integer is $0$ to $2^n - 1$.

Example: For an 8-bit unsigned integer, the range is $0$ to $255$ ($2^8 - 1$).

Use Case: Unsigned integers are commonly used for counting, indexing arrays, and scenarios where negative values are unnecessary.

Signed Integers (Two's Complement)

Two's complement is the most common method for representing signed integers. It simplifies arithmetic operations by ensuring that addition and subtraction work consistently for both positive and negative numbers.

Range for an $n$-bit signed integer: $-2^{n-1}$ to $2^{n-1} - 1$

Example: For an 8-bit signed integer, the range is $-128$ to $127$.

Steps to find two's complement:

  • Write the positive binary form.
  • Invert all bits (0s to 1s and vice versa).
  • Add 1 to the result.

Example: Represent $-42$ in 8-bit two's complement:

  • Positive 42 in binary: 00101010
  • Invert all bits: 11010101
  • Add 1: 11010110

Thus, $-42$ in 8-bit two's complement is 11010110.

Important Note: Two's complement eliminates ambiguity, ensuring a unique binary representation for each signed integer.
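The steps above translate directly into a short Python sketch. Python integers are arbitrary-precision, so the 8-bit width is simulated with a mask; `twos_complement` is an illustrative helper name, not a library function:

```python
def twos_complement(value: int, bits: int = 8) -> str:
    """Return the bit pattern of `value` in `bits`-wide two's complement."""
    mask = (1 << bits) - 1  # e.g. 0xFF for 8 bits
    # For negative values, masking performs the invert-and-add-1 steps.
    return format(value & mask, f"0{bits}b")

print(twos_complement(42))   # 00101010
print(twos_complement(-42))  # 11010110
```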

Bitwise Operations

Bitwise operations are fundamental in computer systems, enabling low-level manipulation of data. They are commonly used in areas like cryptography, compression, and hardware interfacing.

  • AND (&): Sets each bit to 1 if both bits are 1.
  • OR (|): Sets each bit to 1 if at least one of the bits is 1.
  • XOR (^): Sets each bit to 1 if exactly one of the bits is 1.
  • NOT (~): Inverts all bits (1 becomes 0, and 0 becomes 1).
  • Left Shift (<<): Shifts bits left, filling with 0s on the right. Equivalent to multiplying by powers of 2.
  • Right Shift (>>): Shifts bits right, filling with the sign bit for signed integers (arithmetic shift) or 0s for unsigned integers (logical shift).
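A quick demonstration of these operators in Python (the values are arbitrary; `& 0xF` simulates a 4-bit NOT, since Python integers have no fixed width):

```python
a, b = 0b1100, 0b1010
print(format(a & b, "04b"))    # 1000  : AND
print(format(a | b, "04b"))    # 1110  : OR
print(format(a ^ b, "04b"))    # 0110  : XOR
print(format(~a & 0xF, "04b")) # 0011  : NOT, masked to 4 bits
print(format(a << 1, "05b"))   # 11000 : left shift = multiply by 2
print(format(a >> 1, "04b"))   # 0110  : right shift = floor divide by 2
```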

Binary Arithmetic Operations

Addition

Binary addition works similarly to decimal addition, but a carry is generated whenever a column's sum reaches 2 (10 in binary).

Example: Add 19 (10011) and 13 (01101):

$$
\begin{array}{r}
\textcolor{blue}{1}\textcolor{blue}{1}\textcolor{blue}{1}\textcolor{blue}{1}\phantom{0} \\
10011 \\
\text{+ } 01101 \\
\hline
100000 \text{ (32 in decimal)}
\end{array}
$$

The blue 1s above indicate where carrying occurs.
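The worked example can be verified in Python using binary literals (illustrative only):

```python
# Verifying the worked example: 19 + 13 in binary.
a, b = 0b10011, 0b01101
print(bin(a + b), a + b)  # 0b100000 32
```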

Subtraction

Binary subtraction can be performed using two's complement. To subtract, we add the two's complement of the subtrahend to the minuend.

Example: Subtract 13 from 19 using 8-bit representation:

$$
\begin{array}{rcl}
19 &=& 00010011 \\
\text{Two's complement of 13:} & & \\
13 &=& 00001101 \\
\text{Invert} &=& 11110010 \\
\text{Add 1} &=& 11110011 \\
\hline
19 - 13 &=& 00010011 \\
&+& 11110011 \\
\hline
\text{Result} &=& \textcolor{blue}{1}00000110 \text{ (6 in decimal)}
\end{array}
$$

The final carry (blue 1) is discarded in 8-bit representation.
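The same computation in Python, with the 8-bit width simulated by masking (a sketch, not a hardware model):

```python
# Subtraction as addition of the two's complement, in simulated 8-bit width.
minuend, subtrahend = 19, 13
neg = (~subtrahend + 1) & 0xFF        # two's complement of 13 -> 0b11110011
result = (minuend + neg) & 0xFF       # the & 0xFF discards the final carry
print(format(result, "08b"), result)  # 00000110 6
```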

Multiplication

Binary multiplication is like decimal multiplication but simpler since you only multiply by 0 or 1.

Example: Multiply 5 (101) by 3 (011):

$$
\begin{array}{r}
101 \\
\text{× } 011 \\
\hline
101 \\
101\textcolor{blue}{0} \\
\textcolor{gray}{000}\textcolor{blue}{00} \\
\hline
001111 \text{ (15 in decimal)}
\end{array}
$$

The blue 0s are added for place-value alignment; the gray 000 row (multiplying by 0) is usually omitted in practice.
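This shift-and-add process is easy to express in code. Below is a minimal Python sketch (`binary_multiply` is an illustrative name, not a library function):

```python
def binary_multiply(a: int, b: int) -> int:
    """Shift-and-add: add a shifted copy of a for each 1 bit in b."""
    product = 0
    shift = 0
    while b:
        if b & 1:                 # current bit of the multiplier is 1
            product += a << shift
        b >>= 1
        shift += 1
    return product

print(binary_multiply(0b101, 0b011))  # 15
```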

Division

Binary division is similar to long division in decimal, using repeated subtraction and shifting.

Example: Divide 30 (11110) by 6 (110):

$$
\begin{array}{r}
101 \\
110 \enclose{longdiv}{11110} \\
\underline{110}\phantom{10} \\
0110 \\
\underline{110} \\
000
\end{array}
$$

Result: 101 (5 in decimal) with no remainder.
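A minimal Python sketch of restoring long division (`binary_divide` is an illustrative name; hardware dividers are more sophisticated):

```python
def binary_divide(dividend: int, divisor: int) -> tuple[int, int]:
    """Long division: shift the remainder left, bringing in one bit at a time."""
    quotient, remainder = 0, 0
    for i in range(dividend.bit_length() - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # bring down next bit
        quotient <<= 1
        if remainder >= divisor:  # divisor "fits": subtract and set quotient bit
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

print(binary_divide(0b11110, 0b110))  # (5, 0)
```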

Overflow and Underflow

In fixed-width binary arithmetic, overflow occurs when the result of an operation exceeds the maximum representable value, and underflow occurs when the result is smaller than the minimum representable value.

Example of Overflow: Adding 1 to the largest 8-bit unsigned integer:

$$
\begin{array}{rcl}
11111111 &=& 255 \\
\text{+ } 00000001 &=& 1 \\
\hline
\textcolor{red}{1}00000000 &=& 0 \text{ (overflow, result wraps around)}
\end{array}
$$

The red 1 indicates the overflow bit, which is lost in 8-bit representation.
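Python integers never overflow, so the wraparound must be simulated with a mask; this sketch mirrors 8-bit unsigned behavior:

```python
# Simulating 8-bit unsigned arithmetic with a mask.
value = 255
print((value + 1) & 0xFF)  # 0   -- overflow wraps around
print((0 - 1) & 0xFF)      # 255 -- unsigned underflow wraps the other way
```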

Floating-Point Representation

Floating-point numbers are used to represent real numbers, including those with fractional parts. Unlike integers, which are limited to whole numbers, floating-point numbers are designed to represent values across a wide range with varying degrees of precision. These numbers are especially important in fields like scientific computing, engineering, and graphics, where precise calculations are required.

Floating-point representation follows the IEEE 754 standard to ensure uniformity and reliability across platforms. This standard describes two primary formats: 32-bit (single precision) and 64-bit (double precision). Each format defines how numbers are stored in binary and how computations are performed.

IEEE 754 Structure

A floating-point number is represented using the formula:

$$(-1)^s \times (1 + f) \times 2^{e - \text{bias}}$$

In this representation:

  • $s$: The sign bit determines whether the number is positive (0) or negative (1).
  • $f$: The fraction (or mantissa) represents the precision bits of the number.
  • $e$: The exponent is used to scale the number and is stored with a bias to handle both positive and negative exponents.
  • bias: The bias ensures the exponent is always stored as a non-negative value (e.g., 127 for single precision, 1023 for double precision).

For example, a number like 5.75 in binary is represented with these components, making it easy to encode and manipulate.

Single Precision (32-bit) Format

In single precision, 32 bits are divided into three parts:

  • 1 bit for the sign (S), indicating the number's sign.
  • 8 bits for the exponent (E), which determines the scale or range.
  • 23 bits for the fraction (F), storing the number's precision.

$$\underbrace{S}_{1\text{ bit}}\ \underbrace{EEEEEEEE}_{8\text{ bits}}\ \underbrace{FFFF\ldots FFF}_{23\text{ bits}}$$

For instance, to represent 10.5:

  • Convert 10.5 to binary: 1010.1.
  • Normalize it: 1.0101 × 2^3.
  • Sign bit (S) = 0 (positive).
  • Exponent (E) = 3 + 127 = 130, represented as 10000010 in binary.
  • Fraction (F) = 01010000000000000000000.

The final representation is:

$$\underbrace{0}_{S}\ \underbrace{10000010}_{E}\ \underbrace{01010000000000000000000}_{F}$$
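This encoding can be checked with Python's `struct` module, which exposes the raw IEEE 754 bytes (`float_bits` is an illustrative helper):

```python
import struct

def float_bits(x: float) -> str:
    """Bit pattern of x in single precision, split as sign|exponent|fraction."""
    bits = format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(float_bits(10.5))  # 0 10000010 01010000000000000000000
```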

Double Precision (64-bit) Format

Double precision offers greater accuracy and range by using 64 bits, divided as follows:

  • 1 bit for the sign (S).
  • 11 bits for the exponent (E), allowing a broader range of values.
  • 52 bits for the fraction (F), enabling higher precision.

$$\underbrace{S}_{1\text{ bit}}\ \underbrace{EEEEEEEEEEE}_{11\text{ bits}}\ \underbrace{FFFF\ldots FFF}_{52\text{ bits}}$$

For example, to represent a small value like 0.0000123, the additional precision ensures accurate encoding.

Examples

Representing 12.25 in Single Precision

Steps to represent 12.25:

  • Convert 12.25 to binary: 1100.01.
  • Normalize: 1.10001 × 2^3.
  • S = 0 (positive).
  • E = 3 + 127 = 130 (10000010 in binary).
  • F = 10001000000000000000000.

Final representation:

$$\underbrace{0}_{S}\ \underbrace{10000010}_{E}\ \underbrace{10001000000000000000000}_{F}$$

Decoding -0.1 in Single Precision

To decode -0.1:

  • Binary for 0.1: 0.0001100110011... (repeating).
  • Normalize: 1.10011 × 2^-4.
  • S = 1 (negative).
  • E = -4 + 127 = 123 (01111011 in binary).
  • F = 10011001100110011001101 (rounded to nearest over 23 bits; the repeating pattern forces a round up in the last bit).

Final representation:

$$\underbrace{1}_{S}\ \underbrace{01111011}_{E}\ \underbrace{10011001100110011001101}_{F}$$
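The same decoding can be reproduced with `struct`, then reversed using the formula $(-1)^s \times (1+f) \times 2^{e-\text{bias}}$ (an illustrative sketch):

```python
import struct

# Raw single-precision bits of -0.1.
bits = struct.unpack(">I", struct.pack(">f", -0.1))[0]
print(format(bits, "032b"))
# 10111101110011001100110011001101
# sign = 1, exponent = 01111011, fraction = 10011001100110011001101

# Reverse the encoding: (-1)^s * (1 + f) * 2^(e - 127)
s = bits >> 31
e = (bits >> 23) & 0xFF
f = bits & 0x7FFFFF
print((-1) ** s * (1 + f / 2**23) * 2 ** (e - 127))
# -0.10000000149011612 -- the closest single-precision value to -0.1
```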

Special Values

IEEE 754 also defines special cases:

  • Zero: Exponent and fraction are all zeros. Sign bit differentiates +0 and -0.
  • Infinity: Exponent is all ones, fraction is all zeros. Sign bit indicates +∞ or -∞.
  • NaN: Exponent is all ones, fraction is non-zero. Used for undefined results like 0/0.
  • Denormalized Numbers: Exponent is zero, fraction is non-zero. These represent very small values with reduced precision.

For example, dividing 1 by 0 gives infinity, while 0/0 results in NaN.
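Python exposes these special values directly. Note that Python raises an exception for division by zero rather than returning infinity, so infinity is constructed explicitly in this sketch:

```python
import math

inf = float("inf")
nan = float("nan")
print(inf, -inf)              # inf -inf
print(nan == nan)             # False -- NaN compares unequal even to itself
print(math.isnan(0.0 * inf))  # True  -- 0 times infinity is undefined
print(1.0 / inf)              # 0.0
print(-0.0 == 0.0)            # True, though the bit patterns differ
```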

Precision and Range

The precision and range for floating-point formats vary:

  • Single precision provides about 7 decimal digits of accuracy and ranges from 1.2 × 10^-38 to 3.4 × 10^38.
  • Double precision offers about 15-17 decimal digits of accuracy and ranges from 2.2 × 10^-308 to 1.8 × 10^308.

For example, scientific simulations often rely on double precision for accuracy over extended computations.
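On most platforms Python's `float` is an IEEE 754 double, and `sys.float_info` reports exactly these double-precision limits:

```python
import sys

info = sys.float_info  # describes the platform's double-precision format
print(info.dig)        # 15         -- decimal digits of precision
print(info.min)        # ~2.2e-308  -- smallest normalized positive double
print(info.max)        # ~1.8e+308  -- largest finite double
```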

Floating-Point Arithmetic Operations

Addition and Subtraction

Floating-point addition and subtraction involve aligning the exponents of the numbers before performing the arithmetic. The steps are as follows:

  1. Align the exponents: shift the significand of the number with the smaller exponent to the right (increasing its exponent) until both exponents match.
  2. Add or subtract the significands (the digit parts that the exponents scale).
  3. Normalize the result if necessary (i.e., adjust the result to fit into scientific notation).
  4. Round the result to fit the available precision (this is crucial as floating-point numbers have limited precision).

Example: Add 123.45 and 67.89

We first express both numbers in scientific notation:

$$
\begin{array}{rcl}
123.45 &=& 1.2345 \times 10^2 \\
67.89 &=& 6.789 \times 10^1 = 0.6789 \times 10^2 \\
\hline
\text{Sum} &=& 1.9134 \times 10^2 = 191.34
\end{array}
$$

Here, we adjust the second number by shifting its exponent to match the first number, then add the significands. The sum is normalized, and the result is rounded to fit the available precision.
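The four steps can be sketched in Python using (significand, exponent) pairs in base 10. This is a simplification for illustration only (`fp_add` is a made-up helper; real floating-point units work in binary with fixed-width fields and explicit rounding):

```python
def fp_add(s1: float, e1: int, s2: float, e2: int) -> tuple[float, int]:
    """Add two (significand, base-10 exponent) pairs; a simplified sketch."""
    # Step 1: align exponents by scaling down the smaller-exponent significand.
    if e1 < e2:
        s1, e1, s2, e2 = s2, e2, s1, e1
    s2 /= 10 ** (e1 - e2)
    # Step 2: add the significands.
    s = s1 + s2
    # Step 3: normalize so the significand is in [1, 10) (overflow case only).
    while abs(s) >= 10:
        s /= 10
        e1 += 1
    return s, e1

print(fp_add(1.2345, 2, 6.789, 1))  # approximately (1.9134, 2), i.e. 1.9134 x 10^2
```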

Multiplication

Floating-point multiplication involves multiplying the significands and adding the exponents. The steps are as follows:

  1. Multiply the significands.
  2. Add the exponents of the numbers.
  3. Normalize the result if necessary (adjust to scientific notation).
  4. Round the result to fit the available precision.

Example: Multiply 1.5 by 2.5

We first express both numbers in scientific notation:

$$
\begin{array}{rcl}
1.5 &=& 1.5 \times 10^0 \\
2.5 &=& 2.5 \times 10^0 \\
\hline
\text{Product} &=& (1.5 \times 2.5) \times 10^0 = 3.75 \times 10^0 = 3.75
\end{array}
$$

The multiplication of the significands gives 3.75, and since the exponents are both 0, the result remains at 3.75. There is no need for normalization in this case.

Division

Floating-point division involves dividing the significands and subtracting the exponents. The steps are as follows:

  1. Divide the significands.
  2. Subtract the exponent of the denominator from the exponent of the numerator.
  3. Normalize the result if necessary (adjust to scientific notation).
  4. Round the result to fit the available precision.

Example: Divide 1.5 by 2.5

We first express both numbers in scientific notation:

$$
\begin{array}{rcl}
1.5 &=& 1.5 \times 10^0 \\
2.5 &=& 2.5 \times 10^0 \\
\hline
\text{Quotient} &=& (1.5 \div 2.5) \times 10^{0-0} = 0.6 \times 10^0 = 0.6
\end{array}
$$

The division of the significands gives 0.6, and since the exponents of both numbers are 0, the exponent of the result is also 0. Strictly speaking, $0.6 \times 10^0$ would be normalized to $6.0 \times 10^{-1}$, but the value is the same: 0.6.
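Both multiplication and division reduce to one operation on the significands and one on the exponents. A minimal combined sketch of this section and the multiplication above (`fp_mul` and `fp_div` are illustrative names; normalization and rounding are omitted):

```python
def fp_mul(s1: float, e1: int, s2: float, e2: int) -> tuple[float, int]:
    """Multiply significands, add exponents (normalization/rounding omitted)."""
    return s1 * s2, e1 + e2

def fp_div(s1: float, e1: int, s2: float, e2: int) -> tuple[float, int]:
    """Divide significands, subtract exponents (normalization/rounding omitted)."""
    return s1 / s2, e1 - e2

print(fp_mul(1.5, 0, 2.5, 0))  # (3.75, 0) -> 3.75
print(fp_div(1.5, 0, 2.5, 0))  # (0.6, 0)  -> 0.6, normalized to 6.0 x 10^-1
```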

Rounding Errors

Floating-point arithmetic can lead to rounding errors because of the limited precision of binary representation. This is especially evident with decimal fractions that cannot be exactly represented in binary format.

Example: Adding 0.1 and 0.2 in many programming languages:

$$
\begin{array}{rcl}
0.1 + 0.2 &\neq& 0.3 \\
0.1 + 0.2 &\approx& 0.30000000000000004
\end{array}
$$

This issue arises because the numbers 0.1 and 0.2 do not have exact binary equivalents. When performing the addition, the result cannot be precisely represented, leading to a small error. This rounding error can propagate in further calculations.
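This behavior is easy to reproduce, and Python offers standard ways to work around it (`math.isclose` for tolerant comparison, `decimal.Decimal` for exact decimal fractions):

```python
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Compare with a tolerance instead of exact equality:
import math
print(math.isclose(0.1 + 0.2, 0.3))  # True

# Or use decimal arithmetic when exact decimal fractions matter:
from decimal import Decimal
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```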
