Table of Contents
Binary Machine Numbers
-
- Representing Real Numbers:
- A 64-bit (binary digit) representation is used for a real number.
- The first bit is a sign indicator, denoted \(s\).
- Followed by an 11-bit exponent, \(c\), called the characteristic,
- and a 52-bit binary fraction, \(f\) , called the mantissa.
- The base for the exponent is 2.
-
- Floating-Point Number Form:
- \[(−1)^s\ \ 2^{c−1023} \ (1 + f)\]
- Smallest Normalized positive Number:
When: \(s = 0, \ \ c = 1,\ \\) and \(\ \ f = 0\).
Equivalent to: \(2^{−1022}\ \ \dot \ \ (1 + 0) \ \approx \ 0.22251 \ \dot \ 10^{−307}\). - Largest Normalized positive Number:
When: \(s = 0,\ \ c = 2046,\ \\) and \(\ \ f = 1 - 2^{-52}\).
Equivalent to: \(2^{1023}\ \dot \ (2 - 2^{-52}) \ \approx 0.17977 × 10^{309}\). -
- UnderFlow:
- When numbers occurring in calculations have a magnitude less than,
\(2^{-1022}\).
-
- OverFlow:
- When numbers occurring in calculations have a magnitude greater than,
\(2^{1023} \dot (2 - 2^{-52})\).
- Representing the Zero:
- There are Two Representations of the number zero:
- A positive 0: when \(s = 0, \ \ c = 0, \ \\) and \(\ f = 0\).
- A negative 0: when \(s = 1, \ \ c = 0, \ \\) and \(\ f = 0\).
- There are Two Representations of the number zero:
Decimal Machine Numbers
-
- What?
- We assume that machine numbers are represented in the normalized decimal floating-point form.
-
- (k-digit) Decimal Machine Numbers:
- \[±0.d_1d_2 ... d_k × 10^n , 1 \leq d_1 \leq 9, \text{and } 0 \leq d_i \leq 9,\]
- \(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for each } i = 2, ... , k\).
-
- Normalized Form:
- \[y = 0.d_1d_2 ... d_k × 10^n\]
- Floating-Point Form of a Decimal Machine Number:
The floating-point form of y, denoted \(f_l(y)\), is obtained by terminating the mantissa of \(y\) at k-decimal digits. - Termination:
There are two common ways of performing this termination:- Chopping: is to simply chop off the digits \(d_{k+1}d_{k+2}\).
This produces the floating-point form: \(f_l(y) = 0.d_1d_2 ... d_k × 10^n\)
- Rounding: adds \(5 × 10^{n−(k+1)}\) to \(y\) and then chops the result
This produces the floating-point form: \(f_l(y) = 0.\delta_1\delta_2 ... \delta_k × 10^n\).
For rounding, when \(d_{k+1} \geq 5\), we add \(1\) to \(d_k\) to obtain \(f_l(y)\); that is, we round up. When \(d_{k+1} < 5\), we simply chop off all but the first k digits; so we round down. If we round down, then \(\delta_i = d_i\), for each \(i = 1, 2, ... , k\).
However, if we round up, the digits (and even the exponent) might change.
- Chopping: is to simply chop off the digits \(d_{k+1}d_{k+2}\).
- Approximation Errors:
-
The Absolute Error: \(\ \ \ \ \ \ \ \|p − p^∗\|\).
-
The Relative Error: \(\ \ \ \ \ \ \ \dfrac{\|p − p^∗\|}{\|p\|}\).
-
-
Significant Digits:
- Error in using Floating-Point Repr.:
-
- Chopping:
- The Relative Error = \(|\dfrac{y - f_l(y)}{y}|\)
- The Machine Repr. [for k decimial digits] =
- \(y = 0.d_1d_2 ... d_kd_{k+1} ... × 10^n\).
\(\implies\)
Bound \(\ \ \ \implies \ \ |\dfrac{y - f_l(y)}{y}| \leq \dfrac{1}{0.1} \times 10^{-k} = 10^{-k+1}\). - Rounding:
In a similar manner, a bound for the relative error when using k-digit rounding arithmetic is
Bound \(\ \ \ \implies \ \ \|\dfrac{y - f_l(y)}{y}\| \leq 0.5 × 10^{−k+1}\).
-
- Distribution of Numbers:
The number of decimal machine numbers in \([10^n, 10^{n+1}]\) is constant for all integers \(n\).
Finite-Digit Arithmetic
-
- Values:
- \[x = f_l(x) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)\]
- \[y = f_l(y) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)\]
- Operations:
- Error-producing Calculations:
- Cancelation of significant digits due to the subtraction of nearly equal numbers.
- Dividing by a number with small magnitude / Multiplying by a number with large magnitude.
- Cancelation of significant digits due to the subtraction of nearly equal numbers.
- Avoiding Round-Off Error:
The loss of accuracy due to round-off error can often be avoided by a reformulation of the calculations.
We change the form of the quadratic formula by rationalizing the numerator:
\(\implies \ \ \ \ \ \ \ \ \ \ \ \ \ \ x_1 = \dfrac{−2c}{b + \sqrt{b^2 − 4ac}}, \ \ \ \ \ \ \ \ \ \ \ \ \ \ x_2 = \dfrac{−2c}{b - \sqrt{b^2 − 4ac}},\)
Nested Arithmetic
-
- What?
- Rearranging calculations to reduce the number of computations.
Remember that chopping (or rounding) is performed after each calculation.
-
Polynomials should always be expressed in nested form before performing an evaluation, because this form minimizes the number of arithmetic calculations.
-
- Why?
- Accuracy loss due to round-off error can also be reduced by rearranging calculations to reduce the number of computations.