The Impact of Compilation Flags and Choosing Single- or Double-Precision Variables in Linear Systems Solvers

ABSTRACT This paper intends to show the impact of compiler optimization flags and of variable precision on direct methods for solving linear systems. The six chosen methods are simple direct methods, so our work can serve as a study for new researchers in this field. The methods are LU Decomposition, LDU Decomposition, Gaussian Elimination, Gauss-Jordan Elimination, Cholesky Decomposition, and QR Decomposition using the Gram-Schmidt orthogonalization process. Our study showed a large difference in time between single and double precision in all methods, while the error encountered in single precision was not so high. Also, the best flags for these methods were '-O3' and '-Ofast'.


INTRODUCTION
Many areas of science want to find fast and correct solutions using computational models. At some point, most of the created models are turned into a system of linear equations. This system can be described as a set of m linear equations in n unknowns, in which each equation has the form a_{i1} x_1 + a_{i2} x_2 + · · · + a_{in} x_n = b_i. All the coefficients can be stored in a matrix A, and all constant terms in a column vector b. Both A and b can be stored in a single matrix M, called the "augmented matrix", considering m = n, as described in (1.1).
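As an illustration, a minimal C sketch of building the augmented matrix M = [A | b] could look as follows (the helper name and the small order N are ours, for readability; the paper's tests use n = 5000):

```c
#include <stdio.h>

#define N 3  /* small illustrative order; the experiments use n = 5000 */

/* Store A and b side by side in a single augmented matrix M = [A | b]. */
void build_augmented(const double A[N][N], const double b[N],
                     double M[N][N + 1]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            M[i][j] = A[i][j];
        M[i][N] = b[i];  /* the last column holds the constant terms */
    }
}
```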
In this paper, we focus on six direct methods: Gaussian Elimination [14, 17], Gauss-Jordan Elimination [17], LU Decomposition [10], LDU Decomposition [10], Cholesky Decomposition [17] and QR Decomposition using the Gram-Schmidt process [10]. All these methods are classified as direct methods, and they find the solution in a finite number of steps [6]. More precisely, in Table 1 we present the number of floating-point operations (FLOPs) for each method, as found in [8, 11, 15, 16].
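As a reference point for the simplest of these methods, a textbook Gaussian elimination with back substitution can be sketched in C as follows. This is our own sketch, not the paper's code: it omits pivoting (a simplifying assumption, reasonable here because the test matrices used later are well conditioned) and stores the matrix row-major in a flat array.

```c
/* Textbook Gaussian elimination without pivoting, followed by back
   substitution.  A is n x n, row-major; A and b are overwritten. */
void gauss_solve(int n, double *A, double *b, double *x) {
    for (int k = 0; k < n - 1; k++)             /* forward elimination */
        for (int i = k + 1; i < n; i++) {
            double m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; j++)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    for (int i = n - 1; i >= 0; i--) {          /* back substitution */
        double s = b[i];
        for (int j = i + 1; j < n; j++)
            s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
}
```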

Table 1: Number of floating-point operations needed by each method.

Method | # FLOPs | Reference
Gaussian Elimination | O(n^3/3) | (Trefethen, 1985)
| O(n^3/3) | (Higham, 2009) [8]
QR Decomposition (using GS process) | O(2n^3/3) | (Trefethen, 1985) [16]

Although these methods were proposed to solve different matrices, this paper focuses on measuring the difference between single-precision and double-precision variables on a specific type of matrix, to present a basic study for beginning researchers. The variable precision defines how many digits can be stored for each floating-point number. Usually, a single-precision variable uses one 32-bit word and a double-precision one uses two words of 32 bits each [12], which makes the computer take more time to fetch it from memory.
Besides, some well-defined (rational) numbers in the decimal numeral system turn into numbers with infinite repeating expansions in the binary numeral system, the one used by all computers. These misrepresented numbers, together with the propagation of errors from arithmetic operations, can lead to large numerical errors in the result. These are some of the motivations of the present work, and we compare the quality of the results against the execution time to see which precision is the better choice.
As our objective is to present a study for beginning researchers, we focused on executing the methods on common CPUs, in contrast to recent research on FPGAs [4].
The outline of this work is as follows. Section 2 summarizes all compiler options and optimization flags used here. Numerical experiments are reported in Section 3. Finally, some conclusions are given in Section 4.

PROGRAMMING LANGUAGE AND COMPILERS USED
In this paper, all methods were implemented in the C language, as it has static typing and allows manipulating memory at run time with dynamic allocation [3]. This language has two floating-point types, named float and double, with single and double precision, respectively.
C is a compiled language, which means that our code has to be analyzed and transformed into binary code by a compiler [3]. In [5], the authors compare two compilers for the C language and one for C and C++. They concluded that GCC, the GNU Compiler Collection (found at https://gcc.gnu.org/), was by far a better compiler than Microsoft Visual Studio. Therefore, two versions of GCC are used here: version 5.4.0 (https://gcc.gnu.org/onlinedocs/5.4.0/) and version 7.1.0 (https://gcc.gnu.org/onlinedocs/7.1.0/). The former was released in June 2016 and the latter in May 2017.

Compiler options used
The GCC compiler lets us choose options to optimize the compilation in many ways. These options are called compiler flags, and there are eight optimization ones: '-O0', '-O', '-O1', '-O2', '-O3', '-Ofast', '-Og' and '-Os'. Each of these corresponds to one level of optimization, meaning that each level performs all the optimizations of the previous level as well as new ones.
In [9], an algorithm is proposed to automatically choose the optimization options for programs. However, it is an iterative algorithm that increases compilation time, and the authors do not show the quality of the results. We now describe each flag, for each version of GCC. All flags are explained in detail in [1, 2].

Compiler flags for GCC version 5.4.0
The default optimization flag is '-O0', which reduces compilation time and allows debugging to show the expected results. The next level of optimization comprises the flags '-O' and '-O1', which perform the same optimizations. At this level, the compiler tries to minimize code size and execution time, but without time-consuming optimization options.
Continuing up the optimization levels, the '-O2' flag turns on almost all optimization options that do not involve a space-speed trade-off in the compilation process. One example of optimization done at the '-O2' level is aligning (in memory) the beginning of each function to a power of 2 greater than a given, usually machine-dependent, number. The next level of optimization is the '-O3' flag. At this level, optimizations are enabled that increase the code size in order to decrease the execution time. For example, the compiler copies the body of a function to the places where it finds calls to that function, making functions inline.
The last level of optimization is the '-Ofast' flag. Besides enabling all optimizations of the '-O3' flag, '-Ofast' performs some optimizations that are not valid for all standards-compliant programs. For example, it turns on mathematical optimizations that can lead to incorrect results in programs that depend on the standard rules for math functions.
The last two flags, '-Os' and '-Og', focus on optimizing the code size and code debugging, respectively. The '-Os' flag turns on all '-O2' optimizations that do not increase the final code size, plus some other specific optimizations to decrease the size of the binary code. The '-Og' flag, in turn, turns on only the optimizations that do not interfere with the debugging process.

Compiler flags for GCC version 7.1.0
All the flags in version 7.1.0 have the same optimization options as in version 5.4.0, plus some new ones. At the '-O' (or '-O1') level, one new optimization option is reordering blocks of instructions to avoid branches and improve code locality.
The '-O2' level has some new optimizations as well as the old ones. In this version, the compiler searches for stores to memory that are smaller than a memory word and tries to merge them into a single memory store. At the '-O3' level, one of the new optimizations is that the compiler tries to unroll loops that take few iterations, to decrease the time spent controlling the program flow.
The other three flags, '-Ofast', '-Os' and '-Og', do not have any new optimizations compared with the same flags in GCC version 5.4.0.

NUMERICAL TESTS AND RESULTS
In all experiments, the matrix A was created as A = CC^T + nI, with C a square matrix of order n filled with random numbers between 0 and 999. We forced the linear system to have the trivial solution, that is, b = A(1, 1, . . . , 1)^T, making it easier to compare the results obtained. This matrix A is well conditioned, as our objective is to compare the compilation flags for each method. Some background on error propagation and linear system sensitivity can be found in [7]. The machine used in the experiments has an Intel Core i7 processor at 2.80 GHz with 4 cores, 8 GB of memory and 8 MB of cache memory, running Ubuntu version 16.04.1.
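A C sketch of this test setup could look as follows (the helper name and storage layout are ours; the paper does not publish its code). It builds A = CC^T + nI and then b = A(1, ..., 1)^T, so the exact solution is the all-ones vector:

```c
#include <stdlib.h>

/* Build A = C*C^T + n*I with C filled with random integers in [0, 999],
   and b = A * (1,...,1)^T.  Matrices are stored row-major. */
void build_system(int n, double *A, double *b) {
    double *C = malloc((size_t)n * n * sizeof *C);
    for (int i = 0; i < n * n; i++)
        C[i] = rand() % 1000;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += C[i * n + k] * C[j * n + k];    /* (C C^T)_{ij} */
            A[i * n + j] = s + (i == j ? (double)n : 0.0);
        }
    for (int i = 0; i < n; i++) {                    /* b = A * ones */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i * n + j];
        b[i] = s;
    }
    free(C);
}
```

By construction A is symmetric positive definite, which is what makes Cholesky Decomposition applicable later on.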

Execution time considering all flags
Our first experiment was to execute all six methods with all flags to see the impact of each one. We tested a linear system with 5000 unknowns. For the time measurements, we executed each combination of flag and method at least five times to obtain an average with a standard deviation of less than 1 s. Table 2 presents the average execution time for LU Decomposition in GCC versions 5.4.0 and 7.1.0, respectively. Comparing both versions, we can see that the most recent version of GCC does not have faster execution times than the version from 2016. One reason for this is that GCC version 5.4.0 ships with the Ubuntu release used in the tests and may be optimized specifically for this version of the OS, whereas GCC version 7.1.0 we had to install ourselves. Another observation is that the fastest flag for LU Decomposition was '-O3'. When the '-Ofast' flag was used, the execution time in the newer version increased compared with '-O3', which was somewhat unexpected, since with '-Ofast' the compiler makes the program skip some checks in the math routines.
Table 3 shows the average execution time of LDU Decomposition in GCC versions 5.4.0 and 7.1.0. From this table we observe that the fastest flag for LDU Decomposition in both GCC versions is '-O3'. We can also observe that the '-Os' and '-Ofast' flags have the most meaningful time variations between versions, due to the different optimizations made in each version of the compiler. Comparing the times of LU and LDU Decomposition, we can see that they are very similar, as the algorithms are almost the same, LDU only having one more step in the final decomposition.
Table 4 shows the average execution time of Gaussian Elimination in both GCC versions, 5.4.0 and 7.1.0. For Gaussian Elimination, the fastest flag is not '-O3' or '-Ofast' but '-Og', the optimization level focused on debugging, which has the same time as the first level of optimization. As the '-O3' flag turns on optimizations that increase the code size, for example by inlining functions, it may increase the number of variables used as well as the memory accesses. Finally, the '-Os' flag, the optimization level focused on not increasing the final code size, performs poorly for Gaussian Elimination. The main part of this algorithm is a big loop, and since optimizing a loop requires increasing the size of the code, it is expected that the '-Os' optimization level takes more time than the other levels. In Table 5, we present the average execution time for Gauss-Jordan Elimination in GCC versions 5.4.0 and 7.1.0, respectively. These results show that this algorithm cannot be optimized as much as Gaussian Elimination, for example. We observe a decrease in the execution time of Gauss-Jordan Elimination at the first level of optimization, the '-O' flag, compared with the default optimization level ('-O0'), but we do not see any decrease at the other levels, as the Gauss-Jordan algorithm is too sequential and dependent.
Comparing Gaussian Elimination with Gauss-Jordan Elimination, we can see that Gaussian Elimination is the better option, as it consumes less time to execute and can be optimized a little more. From the no-optimization level ('-O0') to the first level of optimization ('-O'), Gaussian Elimination reduces its execution time by a factor of 3, while Gauss-Jordan Elimination reduces its time by a factor of 2.35.
Table 6 shows the average execution time of Cholesky Decomposition in GCC versions 5.4.0 and 7.1.0. This method is the one with the second largest gain across the optimization levels.
At the first level, the execution time is reduced to a third of the original time; with '-Ofast', the fastest optimization level for Cholesky Decomposition, it is reduced to a sixth of the original time. Table 7 shows the average execution time of QR Decomposition using the Gram-Schmidt process in both GCC versions, 5.4.0 and 7.1.0. The QR Decomposition method is the one that takes the most time, as it is the most complex algorithm. It has to iterate over the matrix three times: once to create the orthogonal matrix Q, a second time to multiply Q^T by b, and a last time to find the values of the unknowns. As it uses many mathematical operations throughout the algorithm [10], this method is the one that gains the most from the optimization process. With the '-Ofast' flag, its execution time is reduced by a factor of 12.
In summary, with all experiments in this section, we conclude that just the first level of optimization, the '-O' flag, already drops the execution time of all methods to at most half of the execution time with no optimization (flag '-O0'). Also, the flags '-O' and '-O1' yield similar execution times, as they represent the same optimization level.

Numerical solution considering all flags
The next test concerned the numerical results obtained with each flag. For this test, we used the same linear system from the previous test, took the computed results and calculated their absolute error.
Although the average execution times differed between the two versions of GCC, the numerical results were the same. Hence, we show the Euclidean norm of the errors only for GCC version 5.4.0.
Table 8 shows the Euclidean norm of the errors of the results obtained through LU Decomposition. We can see in this table that no flag changed the precision of the results. So, for LU Decomposition, we can choose the best optimization level looking only at the execution times in Table 2. That said, the best flag for LU Decomposition is '-Ofast' for GCC version 5.4.0 and '-O3' for GCC version 7.1.0.
Besides, the double-precision variables have an absolute error of order 10^-13 and the single-precision ones of order 10^-4. The latter absolute error may not be acceptable in some applications. For LDU Decomposition, the results in both precisions were the same as for LU Decomposition. The decision on the best flag is thus again made by looking at the execution times, in Table 3. For the 2016 version of GCC, the best flags are '-O3' and '-Ofast', as they have almost the same execution time. For the other version of GCC, the best flag is '-O3'. As the absolute error of LDU Decomposition is the same, its single-precision results, at order 10^-4, may again not be acceptable in some applications.
The next table (Table 9) shows the Euclidean norm of the errors of the results obtained through Gaussian elimination.
As we can see in this table, some flags could not compute the final result, probably due to the number of operations needed to find the unknowns, which caused an overflow (or underflow) and printed '-NaN' as the result. That said, the only four flags we can choose as the best ones for Gaussian elimination are '-O2', '-O3', '-Ofast' and '-Os'. Looking at the execution times in Table 4, we conclude that any of these four flags can be the best, as they all have similar execution times.
Besides, concerning precision, single precision again has an absolute error of order 10^-4, which can be bad for some applications. So the double-precision variables, with an error of order 10^-13, would be the best choice.
Table 10 shows the Euclidean norm of the errors of Gauss-Jordan elimination, and we can see a behaviour similar to that of Gaussian elimination (Table 9). The Gauss-Jordan elimination transforms the matrix M (shown in 1.1) into a canonical matrix, so the step of finding the unknowns has no division or multiplication operations. This explains why Gauss-Jordan elimination produces a result for all flags, while Gaussian elimination with double-precision variables does not. The best flags concerning precision are nevertheless the same as for Gaussian elimination: '-O2', '-O3', '-Ofast' and '-Os'. Looking at the execution times (Table 5), we conclude that any of these four flags can be used to optimize this method, as they have similar execution times. Single precision with the four flags mentioned above yields a Euclidean error norm of order 10^-4, which may not be acceptable in some applications. The best option for Gauss-Jordan elimination is therefore double precision. We can also observe that with the other four flags, the double-precision error norms are large and similar to each other, because the input data contains large numbers and, in the multiplication and division operations, the computer has to truncate or round quite often.
For Cholesky Decomposition, we observed that all flags computed the results with the same Euclidean error norm: 1.50 × 10^-4 in single precision and 3.55 × 10^-15 in double precision. So we can choose the best flags looking only at the execution times (Table 6). The best flag for Cholesky Decomposition is '-Ofast' for single-precision variables and '-O3' for double precision.
As for QR Decomposition, the Euclidean error norms observed were 2.12 × 10^-4 in single precision and 1.18 × 10^-13 in double precision for almost all flags; the exception is '-Ofast' (1.47 × 10^-4 in single precision and 1.02 × 10^-13 in double precision). This method is the one with the largest error norm, due to the number of multiplications and divisions performed, which leads to many truncation and rounding errors, as noted in [13].
Most flags do not change the error norms of QR Decomposition; only '-Ofast' does. With this in mind, and looking at the execution times in Table 7, we conclude that '-Ofast' is the best optimization choice for the QR Decomposition method.
To conclude these tests, we can see that for most of the methods either '-O3' or '-Ofast' is the best option for optimization. That said, we present our final test, with bigger matrices.

Comparison between double-and single-precision
Our next test was executed with the '-O3' and '-Ofast' flags and bigger matrices, with 5000, 10000, 15000 and 20000 unknowns. All the systems were created with the same methodology described at the beginning of Section 3. These tests were made with both versions of GCC, 5.4.0 and 7.1.0. Each combination of flag and method was executed at least 10 times, to reach a standard deviation of less than 5%.
Figure 1 shows the average execution time of LU Decomposition for GCC versions 5.4.0 and 7.1.0, respectively. As the input data increases in size, the absolute difference between single and double precision increases, but not linearly. Table 11 shows how much slower the double-precision execution is than the single-precision one. We computed this relation as E_d / E_s, where E_d is the execution time with double-precision variables and E_s the one with single-precision variables.
For the '-O3' flag, for example, Table 11 shows that with 5000 unknowns the double-precision time is 1.706 times longer in GCC version 5.4.0 and 1.546 times longer in version 7.1.0 than the single-precision time; with 10000 unknowns, the double-precision execution time is 1.832 times longer in the former version and 1.864 times in the latter. Figure 2 shows the average execution time of LDU Decomposition for GCC versions 5.4.0 and 7.1.0, respectively. As the input data increases in size, the absolute difference between single and double precision again increases, but not linearly. In the next table (Table 12) we can see how much slower the double-precision execution is than the single-precision one.
Table 12: Relation between double-and single-precision times of LDU Decomposition.
# of unknowns | Flag '-O3' (GCC v. 5.4.0, GCC v. 7.1.0) | Flag '-Ofast' (GCC v. 5.4.0, GCC v. 7.1.0)

The graphics in Figure 3 show the average execution time of Gaussian Elimination for GCC versions 5.4.0 and 7.1.0. As the input data increases in size, the absolute difference between single and double precision increases linearly. In the next table (Table 13) we can see how much slower the double-precision execution is than the single-precision one. For the '-O3' flag, for example, Table 13 shows that with 5000 unknowns the double-precision time is 1.546 times longer in GCC version 5.4.0 and 1.542 times longer in version 7.1.0; with 10000 unknowns, 1.557 and 1.559 times; with 15000 unknowns, 1.561 times in both versions; and with 20000 unknowns, 1.558 and 1.562 times, respectively. As we can see in Table 13, all double-precision times are around 1.5 times longer than the single-precision times.
The graphics in Figure 4 show the average execution time of Gauss-Jordan Elimination for GCC versions 5.4.0 and 7.1.0, respectively. As the input data increases in size, the absolute difference between single and double precision again increases linearly. In the next table (Table 14) we can see how much slower the double-precision execution is than the single-precision one.

Figure 5 shows the average execution time of Cholesky Decomposition for GCC versions 5.4.0 and 7.1.0, respectively. As the input data increases in size, the absolute difference between single and double precision increases, but not linearly. In the next table (Table 15) we can see how much slower the double-precision execution is than the single-precision one. For Cholesky Decomposition, there is a huge difference in the relation shown in Table 15 between the '-O3' and '-Ofast' times. With the '-O3' optimization flag, both precisions have similar times; in contrast, with the '-Ofast' flag, the double-precision times are around twice the single-precision times.
Figure 6 shows the average execution time of QR Decomposition for GCC versions 5.4.0 and 7.1.0, respectively. QR Decomposition is the most costly of all the methods; with its O(n^3) complexity, the time for 10000 unknowns is eight times that for 5000 unknowns. For that reason, the other two input sizes were not computed. In the next table (Table 16) we can see how much slower the double-precision execution is than the single-precision one. For the '-Ofast' flag, for example, Table 16 shows that with 5000 unknowns the double-precision time is 1.831 times longer in GCC version 5.4.0 and 1.942 times longer in version 7.1.0 than the single-precision time; with 10000 unknowns, the double-precision execution time is 1.748 times longer in the former version and 2.022 times in the latter.
Similar to Cholesky Decomposition, QR Decomposition shows big differences between the relations for the two flags studied. For the '-O3' flag, both precisions have similar execution times. However, for the '-Ofast' flag, the double-precision execution time is almost twice the single-precision execution time.

CONCLUSIONS
Our computational experiments show considerable differences in the execution times of some methods with double-precision and single-precision variables. Besides, we searched for the best GCC optimization flag for each of six different methods. The methods studied were: LU Decomposition, LDU Decomposition, Gaussian Elimination, Gauss-Jordan Elimination, Cholesky Decomposition and QR Decomposition using the Gram-Schmidt process. We showed execution times for all these methods with eight optimization flags of the GCC compiler: '-O0', '-O', '-O1', '-O2', '-O3', '-Ofast', '-Og' and '-Os'. Besides, the experiments were compiled with two versions of GCC: version 5.4.0, released in June 2016, and version 7.1.0, released in May 2017.
Our first conclusion is that the best optimization flag for the studied methods was '-O3' or '-Ofast', depending on the method. The '-O3' flag turns on all optimizations from the two previous levels, plus optimizations that spend more memory or compilation time in order to decrease execution time. The '-Ofast' flag turns on all optimizations from the '-O3' level (the ones mentioned before) and some optimizations related to mathematical functions.
Finally, between single- and double-precision variables, we observed that in some methods (like Gauss-Jordan elimination) the double-precision time can be more than 1.5 times the single-precision one, while the single-precision error is small (the largest error is around 10^-3) for the flags mentioned above. If the application in question is not real-time, or if it tolerates errors of 10^-3 or larger, it is better to use single-precision variables. However, if the application requires a smaller tolerance, it is better to use double-precision variables.

Figure 1 :
Figure 1: Average times of LU Decomposition.

Figure 3 :
Figure 3: Average times of Gaussian Elimination.

Table 2 :
Execution times of LU Decomposition -matrix with 5000 unknowns.

Table 3 :
Execution times of LDU Decomposition -matrix with 5000 unknowns.

Table 4 :
Execution times of Gaussian Elimination -matrix with 5000 unknowns.

Table 5 :
Execution times of Gauss-Jordan Elimination -matrix with 5000 unknowns.

Table 6 :
Execution times of Cholesky Decomposition - matrix with 5000 unknowns.

Table 7 :
Execution times of QR Decomposition -matrix with 5000 unknowns.

Table 8 :
Absolute error of results from LU Decomposition -matrix with 5000 unknowns.

Table 9 :
Absolute error of results from Gaussian Elimination -matrix with 5000 unknowns.

Table 10 :
Absolute error of results from Gauss-Jordan Elimination -matrix with 5000 unknowns.

Table 11 :
Relation between double- and single-precision times of LU Decomposition.

Table 13 :
Relation between double-and single-precision times of Gaussian Elimination.

Table 14 :
Relation between double- and single-precision times of Gauss-Jordan Elimination. (The double-precision time is 1.398 times longer in GCC version 5.4.0 and 1.376 times longer in GCC version 7.1.0 than the single-precision one.)
With 5000 unknowns the double-precision time is only 1.123 times longer in GCC version 5.4.0 and 1.104 times longer in version 7.1.0 than the single-precision time; with 10000 unknowns the double-precision execution time is 1.143 times longer in the former version and 1.156 times in the latter. For 15000 unknowns, the double-precision time is 1.191 times longer in the older version and 1.192 times longer in the newer version; for 20000 unknowns, it is 1.215 times longer in GCC version 5.4.0 and 1.201 times longer in GCC version 7.1.0.

Trends Comput. Appl. Math., 24, N. 2 (2023)