Look-Up Table (LUT) Operations in C++
Copyright © 2000–2015 Carlos Moreno
Introduction

Look-Up Tables are a technique commonly used to accelerate numeric processing in applications with demanding timing requirements. They are often used in Image Processing and DSP (Digital Signal Processing) applications. The basic idea is to precompute the results of complex operations that are, or can be expressed as, a function of an integer value. The precomputed results are typically stored in an array, which is used at runtime instead of performing the whole, time-consuming operation. In this article, I present a C++ implementation of a wrapper class that provides LUT operations in a way that is completely transparent to the user of the class. I first introduce the concept and discuss the benefits and implications of its use, and then describe the implementation. The article requires previous knowledge of the basic tools of the C++ language, in particular class templates and operator overloading.

What is a Look-Up Table?

Simply put, a Look-Up Table (LUT) is an array that holds a set of precomputed results for a given operation. The array provides access to the results in a way that is faster than computing the result of the operation each time. LUT's are typically used in real-time data acquisition and processing systems (often embedded systems), since these types of systems impose demanding and strict timing restrictions. An important detail to consider is the fact that LUT's require a considerable amount of execution time to initialize the array (to precompute the results). In real-time systems, it is in general acceptable to have a delay during the initialization of the application (after all, the application will presumably be run right after boot, which takes a few seconds anyway). Let's take a look at a simple example to illustrate the idea of LUT's. Consider a real-time data acquisition and processing system in which data are sampled as 8-bit numbers, representing positive values from 0 to 255.
Suppose that the required processing involves computing the square root of every sample. In C, the use of a LUT to compute the square root would look as follows:
double LUT_sqrt [256]; /* presumably declared globally */

/* Somewhere in the initialization code */
int i;
for (i = 0; i < 256; i++)
{
    LUT_sqrt[i] = sqrt(i);  /* provided that <math.h> was #included */
}

/* During the normal execution of the application */
result = LUT_sqrt[sample];  /* instead of result = sqrt(sample); */
During the initialization phase, the application sacrifices a certain amount of time to compute all 256 possible results. The key point is that after that, when the system starts to read data in real time, it will be able to complete the required processing in the time available. Remember that data must be read at fixed intervals of time, and the processing of one datum must be completed before the interval of time until the next datum expires (otherwise, the system would collapse).

Can LUT's be Used for Operations on Non-integer Arguments?

Despite appearances, the short answer is yes. Given that the results are stored in an array, and the array subscript must be an integer number, it would seem that LUT's can only replace operations that are a function of an integer argument. Being a little more specific, we should say that they can only replace operations that can be *expressed* as a function of an integer argument. A typical example is found in some common Digital Signal Processing techniques, in which we often need to evaluate the functions sin(2πk/N) and cos(2πk/N), where N is a constant and k = 0, 1, 2, ..., N-1. The arguments of the functions sin and cos are obviously not integers, but both functions can be expressed as functions of the integer argument k. Thus, we could use a LUT to evaluate those functions, as shown in the example below:
#define pi 3.1415926536  /* or const double pi = 3.14..., if C++ */

double LUT_sin_2pi_N [N], LUT_cos_2pi_N [N];

/* Somewhere in the initialization code */
int k;
for (k = 0; k < N; k++)
{
    LUT_sin_2pi_N[k] = sin (2*pi*k/N);  /* provided that <math.h> was #included */
    LUT_cos_2pi_N[k] = cos (2*pi*k/N);
}

/* During the normal execution of the application, if we need to compute
   x[k] * sin(2*pi*k/N), we use the following: */
result = x[k] * LUT_sin_2pi_N [k];
In this case, we are using only one access to a memory element, instead of two multiplications, one division and one evaluation of a math function. This example is even nicer if we use C++'s standard library class complex to write code that better follows the standard math notation for these DSP techniques (using complex number exponentials, instead of trigonometric functions). We are in general interested in evaluating functions similar to: e^{j2πk/N} where j is the imaginary unit. The example in C++ to implement this could be as follows:
#include <complex>   // std::complex
using namespace std; // or qualify names with std::

typedef complex<double> Complex;

const double pi = 3.1415926536;
const Complex j(0,1); // Imaginary unit

Complex LUT_e_j2pi_N [N];

// Somewhere in the initialization code
for (int k = 0; k < N; k++)
{
    LUT_e_j2pi_N[k] = exp (j * (2*pi*k/N));
}

// During the normal execution of the application, if we need to compute
// x[k] * exp(j*2*pi*k/N), provided that 0 <= k < N, we use the following:
Complex result = x[k] * LUT_e_j2pi_N [k];
In the examples presented so far, I have used LUT's only for operations expressed as a function of non-negative integer arguments. That is not a restriction, as the following example shows:
#include <complex>   // std::complex
using namespace std; // or qualify names with std::

typedef complex<double> Complex;

const double pi = 3.1415926536;
const Complex j(0,1); // Imaginary unit

Complex LUT_space [2*N - 1];
Complex * const LUT_e_j2pi_N = LUT_space + N - 1;  // points to the middle element

// Somewhere in the initialization code
for (int k = -(N-1); k < N; k++)
{
    LUT_e_j2pi_N[k] = exp (j * (2*pi*k/N));  // negative k indexes into the first half
}

// During the normal execution of the application, if we need to compute
// x[k] * exp(j*2*pi*k/N), for any k such that -N < k < N, we use the following:
Complex result = x[k] * LUT_e_j2pi_N [k];


Encapsulating These Ideas

It would be convenient to encapsulate the LUT-related data and functionality. There are several reasons for this, most notably transparency of use and reusability. I am now ready to present the class definitions for the two types of LUT's: LUT's to perform multiplication of an integer number times a floating-point number, and LUT's to evaluate functions.

A Class to Represent Floating-Point Multiplication LUT's

This may sound a little strange: why would we need a LUT to precompute the results of an operation as simple as the multiplication of two numbers? Floating-point operations, in particular multiplications, have a high cost in terms of processor time. Recent processors that feature an embedded floating-point unit (e.g., Intel 486 and above) perform these operations very efficiently, maybe as fast as a single access to a variable in memory! In those cases, LUT's would be completely useless: they will not accelerate multiplications by floating-point numbers. But in older processors, like the Intel 386, 286, and 8086/8088, floating-point operations are much slower than a memory access. It is not uncommon to find single-board PC's for embedded applications that are based on these processors. Embedded applications are precisely the type of application that most commonly requires real-time processing with demanding timing restrictions, which makes it reasonable to think about using LUT's to accelerate multiplications by floating-point numbers. The class is defined as a template with a type argument to specify the result type of the multiplication, two non-type integer arguments to indicate the range of the integer operand, and a non-type argument to specify the scaling factor in the multiplication. The scaling factor is useful when we want to have the result as an integer type with increased resolution. This allows us to perform all the arithmetic operations in integer format with a considerably reduced rounding-error problem (if we use a scaling factor large enough). Listing 1 shows the class definition and implementation of class LUT_number (member functions are inlined, given that the main goal of LUT's is runtime efficiency).
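Listing 1 is not reproduced in this extract. The following is a minimal sketch of what class LUT_number might look like, consistent with the description above; the template parameter names, the truncating cast, and the exact layout are my assumptions, not the original listing:

```cpp
// Sketch only: parameter order matches the usage shown below
// (<from, to, result type, scale>); names are illustrative.
template <int from, int to, typename T, int scale = 1>
class LUT_number
{
public:
    explicit LUT_number (double value)
    {
        // Precompute value * i * scale for every i in [from, to]
        for (int i = from; i <= to; ++i)
            table[i - from] = static_cast<T>(value * i * scale);
    }

    // LUT * integer: a single table access at runtime
    T operator* (int operand) const
    {
        return table[operand - from];
    }

private:
    T table[to - from + 1];   // one entry per possible operand
};
```

With this sketch, a declaration such as `LUT_number<-128, 127, long int, 256> H1 = ...` precomputes 256 scaled products, and each later `H1 * x[k]` costs only one array access.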
The example below illustrates the use of this class to smooth a sequence of values by applying a moving weighted-average filter (i.e., for each value in the sequence, the output of the operation is the weighted average of that value and the four values surrounding it).
#include "lut.h"

// ... other definitions

signed char x [BUFFER_SIZE];

typedef LUT_number<-128, 127, long int, 256> coeff;

const coeff H0 = 0.6;
const coeff H1 = 0.3;
const coeff H2 = 0.1;

signed char moving_avg_filter (const signed char * x)
{
    return (H0*x[0] + H1*x[-1] + H1*x[1] + H2*x[-2] + H2*x[2]) / 256;
}
Notice that the operations are evaluated that way to avoid overflow problems: we write H1*x[-1] + H1*x[1] instead of H1 * (x[-1] + x[1]), since the sum of two samples could fall outside the -128..127 range covered by the LUT. The whole operation is done without performing a single floating-point operation (neither multiplications nor additions). Furthermore, the optimizer could decide to implement the division by 256 as an eight-bit shift, which is much more efficient.
A Class to Represent Function Evaluation LUT's

In this case, the class is defined as a template taking a type parameter specifying the result type of the operation, a type parameter specifying the data type of the argument of the function that is being "emulated" by the LUT, and two non-type parameters specifying the range of the integer argument. The class provides an overloaded () operator taking one integer parameter, to allow clients of the class to use a syntax equivalent to that of evaluating a function. Furthermore, the constructor receives as arguments a pointer to the function and an optional parameter indicating the factor that multiplies the argument of the function. Listing 2 shows the definition and implementation of class LUT_function. The header file lut.h shows both template class definitions. The example below shows how to use this class to evaluate the functions f(k) = sin(2πk/N) and g(k) = cos(2πk/N).
#include <cmath>
#include "lut.h"

const double pi = 3.1415926536;

LUT_function<0, N> Sin (sin, 2*pi/N),
                   Cos (cos, 2*pi/N);

// ...
// Somewhere in the code:
for (int k = 0; k <= N; k++)
{
    total += x[k] * Sin (k) + x[k] * Cos (k);
}
Use LUT's with Care

You have to be very careful before deciding to use LUT's to solve a particular problem. We are often addicted to tools or tricks that seem to provide faster execution; an example of this is the use of inline functions in C++, or the use of macros in C. Before using LUT's, we should ask ourselves (and try to find an accurate answer!) the following two questions: 1) Is there a high cost to pay in exchange for speed? (and if there is, is it worth paying that cost?), and 2) Is a LUT really going to accelerate the processing? The first question seems reasonable, and relatively easy to evaluate. For example, if we need to store precomputed results (of type double) for an operation that takes a 16-bit integer as an argument, we had better think twice before using the LUT: we would require a table holding 65536 elements of type double, at 8 bytes each! What is the answer in a situation like that? It depends on many factors; I cannot give a general answer or recipe. You have to evaluate and prioritize your resources and the requirements of the specific application. The second question seems stranger. LUT's are intended to (and supposed to) accelerate certain time-consuming operations, but that may not always be the case. In particular, it is not obvious that reading a number from memory will be faster than performing one floating-point multiplication. For processors featuring a hardware floating-point unit, it is very likely that performing the multiplication will be faster. Keep in mind that processors feature a clock multiplier for the internal hardware of the chip. This means that if the external circuits (including system memory!) run at 100 MHz, the on-chip circuitry may be running at 400 or 500 MHz. Accessing system memory is thus done at a clock speed four or five times slower than the clock speed controlling the internal operations (including multiplications).
Of course, the multiplication involves more steps than reading from memory; but on the other hand, there is a specialized hardware section inside the processor to perform floating-point arithmetic operations efficiently. The overall result is that floating-point multiplications are in general faster. That is not the case for older processor architectures. In particular, for the Intel 386 and older architectures, it is almost certain that the use of LUT's will tremendously accelerate multiplications. The use of these processors is very common in embedded systems (e.g., computer-based control, monitoring, data acquisition and processing, etc.), given that there are many single-board PC computers at prices that make them very attractive for developing devices that require software-based operations. In short, you should think carefully before using LUT's to solve a particular problem. I would even dare to recommend that, in general, you test and measure the performance of both methods before making a decision.

Avoid Non-Static LUT Objects

Another detail to be careful about is the fact that LUT's require a long initialization process and take a lot of memory. Thus, you should always try to use them only as global or local static objects. That way, they are initialized at startup and are always ready to use; also, we avoid using stack space to store the (possibly huge) array of precomputed results. Another detail may be very obvious, but I will mention it anyway: never instantiate on-the-fly or cast to LUT objects to accelerate an individual operation in the middle of your code! (Notice that implicit type conversion is disallowed in the class definition, by declaring the constructor explicit.)
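The earlier recommendation, to test and measure both methods before deciding, can be followed with a rough micro-benchmark along these lines. The structure and names here are only illustrative, and modern C++ timing facilities are used rather than anything from the original article; on a current desktop processor you will likely find the direct call competitive with, or faster than, the lookup:

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

static double LUT_sqrt[256];

static void init_lut ()
{
    for (int i = 0; i < 256; ++i)
        LUT_sqrt[i] = std::sqrt (static_cast<double>(i));
}

// Times `reps` table lookups against `reps` direct sqrt calls and prints both.
static void run_benchmark (int reps)
{
    volatile double sink = 0;   // keeps the loops from being optimized away

    auto t0 = std::chrono::steady_clock::now ();
    for (int n = 0; n < reps; ++n)
        sink = sink + LUT_sqrt[n & 0xFF];
    auto t1 = std::chrono::steady_clock::now ();

    for (int n = 0; n < reps; ++n)
        sink = sink + std::sqrt (static_cast<double>(n & 0xFF));
    auto t2 = std::chrono::steady_clock::now ();

    using us = std::chrono::microseconds;
    std::printf ("lookup: %ld us, direct sqrt: %ld us\n",
                 (long) std::chrono::duration_cast<us>(t1 - t0).count (),
                 (long) std::chrono::duration_cast<us>(t2 - t1).count ());
}
```

Call init_lut() once and then run_benchmark() with a large repetition count, compiled with the same optimization settings you intend to ship with; measuring an unoptimized build tells you very little.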

Conclusions

Look-Up Tables (LUT's) are a convenient tool to accelerate operations that can be expressed as functions of an integer argument. They use stored precomputed results that allow the program to obtain a result immediately, without performing the time-consuming operation again. However, we have to be very careful before deciding to use LUT's to solve a particular problem. First of all, we should carefully evaluate the cost associated with their use (in particular, the memory space required to store the precomputed results). Secondly, we should keep in mind that LUT's do not always accelerate the processing: depending on the processor, they may not help at all (in particular, the use of LUT's is often inappropriate for recent processors that perform floating-point operations very efficiently). In embedded systems, where it is common to find older processor architectures or microcontrollers, the use of LUT's is often appropriate. The use of C++ to implement this tool turns out to be appropriate: without any sacrifice in runtime efficiency, I presented a highly reusable class definition that provides the required LUT operations in a very transparent way, making it easier to develop efficient data processing algorithms.

Bibliography

The C++ Programming Language, 3rd Edition. Bjarne Stroustrup.
