An Improved Vectorization Algorithm to Solve the d-MP Problem

ABSTRACT The d-minimal path (d-MP) problem is to find all the system state vectors (SSV) under which d units of data can be transmitted from a source node to a destination node in a stochastic-flow network (SFN). This problem has attracted considerable attention in recent decades, as the d-MPs allow one to compute the exact network reliability. Although several algorithms have been proposed in the literature to address the problem, the research continues because the problem is NP-hard. Since the number of d-MPs grows exponentially with the size of the network, the available algorithms in the literature are often impractical. Hence, we employ vectorization techniques to propose an improved algorithm for the problem. We conduct extensive experiments on the known benchmarks and on two hundred randomly generated SFNs in the sense of the performance profile introduced by Dolan and Moré. The experimental results show the vectorized algorithm to be considerably more efficient than the non-vectorized ones.


INTRODUCTION
A stochastic-flow network (SFN) is a flow network whose components, and consequently the network itself, can have more than two different states [14,20,28]. The components' states in an SFN are represented by a vector, the so-called system state vector (SSV), in which each element represents the state of a corresponding component of the SFN. There are three types of data transmission in an SFN: (1) two-terminal, in which the data is transmitted from a source node to a sink node, (2) k-terminal, in which the data is transmitted among k specified nodes, and (3) all-terminal, in which the data is transmitted among all the nodes in the network. The focus of this work is on the first case, where the network reliability at demand level d, denoted by R_d, is the probability of transmitting at least d units of data from a source node to a destination node in the network [15,19,31].
We focus here on the d-MP problem, which aims to find all the d-MPs.
A path is a sequence of adjacent arcs through which the data can be sent from a source node to a sink node, and an MP is a path with no proper subset being a path from the source node to the sink node.
Several authors have worked to improve the solution of the d-MP problem. Forghani-elahabad and Bonani [9] proposed an improved algorithm to address the problem and showed its efficiency in comparison with others in the literature by providing complexity and numerical results. Lin and Chen [26] proposed a new algorithm for the problem using maximal flow evaluation and a fast enumeration technique. Based on a breadth-first search technique, Chen et al. [4] proposed a recursive algorithm for the determination of all the d-MPs for all the demand values d. Considering a budget constraint, Forghani-elahabad and Kagan [11] proposed an improved algorithm for the determination of all the d-MPs within a limited budget. The authors showed their proposed algorithm's efficiency and explained how it could be adopted to assess the reliability of a multi-source multi-sink communication smart grid network. Balan and Traldi [2] first showed that listing the MPs by their sizes is not in fact an optimal choice for the sum-of-disjoint-products algorithms and then proposed a more efficient strategy. By employing heuristic and recursive techniques and using the state-space decomposition technique, Bai et al. [1] proposed an MP-based algorithm to address the problem. In [34], a novel technique was presented to identify the duplicate solutions, by using which an improved MP-based algorithm was proposed for reliability evaluation of SFNs. Pointing out some of the obstacles in the available algorithms in the literature, Yeh [37] proposed an improved addition-based algorithm to overcome those obstacles. Yeh and Zuo [39] proposed a subtraction-based algorithm to determine all the d-MPs for all the d values and showed its practical efficiency compared to the available algorithms in the literature. Forghani-elahabad et al. [12] developed an approximation algorithm based on the exact algorithms to address the problem. By presenting a simple method for checking the candidates and a new technique to remove the duplicates, Niu et al. [31] proposed an improved algorithm to address the problem. The authors also conducted a sensitivity analysis to explore the most effective arc in improving network reliability. Lamalem et al. [23] proposed an addition-based algorithm that solves the problem for all the d values by recognizing which i-MPs can lead to valid (i + 1)-MPs.
However, to the best of our knowledge, no vectorization algorithms have been published in the literature on this problem. Hence, in this work, we first state a slightly improved version of an available algorithm in the literature, based on which we propose a vectorization algorithm to address the problem efficiently. These are the main contributions of this work. We also show its practical efficiency by conducting several experiments on the known benchmarks and on two hundred randomly generated SFNs in the sense of the performance profile introduced by Dolan and Moré [6].
The remainder of the paper is organized as follows. Section 2 states the required notations, acronyms and assumptions, and also provides some preliminaries on the problem. Section 3 presents a background on the vectorization process. The experimental results are provided in Section 4, and the concluding remarks are given in Section 5.

Acronyms
MP Minimal path
SFN Stochastic-flow network
SSV System state vector
FFV Feasible flow vector

Assumptions
We consider the same assumptions as many other related works in the literature [1,4,9,14,25,26,28,30,31,35,39]. We note that assumptions 1 to 4 are required to have an integer SFN with reliable nodes, and assumption 5 is meant to highlight that finding the minimal paths is not considered a part of our algorithm.

1. The capacity of each arc is a nonnegative integer-valued random variable following a given probability distribution.
2. Every node is perfectly reliable.
3. The arcs' capacities are statistically independent one from the other.
4. Flow in the network satisfies the flow conservation law.
5. All the minimal paths of the network are given in advance.

The algorithm
Let G = G(N, A, M) be a stochastic-flow network (SFN), where N = {1, 2, ..., n} is the set of nodes with nodes 1 and n being the source and destination nodes, respectively, A = {a_i | 1 ≤ i ≤ m} is the set of arcs, and M = (M_1, M_2, ..., M_m) is the vector of the arcs' maximum capacities. For an SSV X = (x_1, x_2, ..., x_m), let V(X) be the maximum flow of the network from node 1 to node n under X, Z(X) = {a_i ∈ A | x_i > 0} be the set of arcs with positive capacity, and e_i = 0(a_i) be a system vector in which the capacity level is 1 for a_i and 0 for the other arcs. Let also P_1, P_2, ..., P_h be all the MPs in the network (so h is the number of all MPs) and K_j = min{M_r | a_r ∈ P_j} be the capacity of the jth MP, for j = 1, 2, ..., h.

Definition 2.1. Assuming f_j as the amount of flow on MP P_j, for j = 1, 2, ..., h, the vector F = (f_1, f_2, ..., f_h) is called a feasible flow vector (FFV) at demand level d, denoted by d-FFV here, when it satisfies the following system:

f_1 + f_2 + ... + f_h = d,
Σ_{j : a_i ∈ P_j} f_j ≤ M_i, for i = 1, 2, ..., m,        (2.1)
0 ≤ f_j ≤ K_j, with f_j an integer, for j = 1, 2, ..., h.
For example, it can be calculated that f_1 = f_2 = f_3 = 1 and f_4 = 0 satisfy the system (2.1) for d = 3 in the network given in Fig. 1, and hence F = (1, 1, 1, 0) is a 3-FFV.

Definition 2.2. An SSV X is a d-MP if

V(X) = d and V(X − e_i) < d, for every a_i ∈ Z(X).        (2.2)

For example, considering X = (2, 1, 1, 1, 2) in the network given in Fig. 1, we have V(X) = 3 and V(X − e_i) < 3 for every a_i ∈ Z(X). As a result, X is a 3-MP for this network.
With each d-FFV F = (f_1, f_2, ..., f_h), one can associate an SSV X = (x_1, x_2, ..., x_m) that satisfies the following equation:

x_i = Σ_{j : a_i ∈ P_j} f_j, for i = 1, 2, ..., m.        (2.3)

The d-MP candidate obtained from a d-FFV through Eq. (2.3) is called the associated d-MP candidate with that d-FFV. For instance, we know that F = (1, 1, 1, 0) is a 3-FFV in Fig. 1. Using Eq. (2.3) to calculate its associated SSV gives X = (2, 1, 1, 1, 2). Notably, this vector is also a (real) 3-MP. The following result, which is proven in [25], shows the relation between the (real) d-MPs and the candidates. Although one can use the condition given in Definition 2.2 to check each candidate for being a d-MP, Lin and his co-workers [25] proposed the following theorem, which can be employed to determine all the d-MPs among the candidates.
Theorem 2.2. Let Ψ_d be the set of all the d-MP candidates. Then, Ψ_d,min = {X | X is a minimal vector in Ψ_d} is the set of all the d-MPs.
We recall that a vector X ∈ Ψ_d is a minimal vector if there is no other vector Y ∈ Ψ_d such that Y < X; that is, X need not be less than all the vectors in Ψ_d, but no other vector in Ψ_d is less than it. Therefore, one can: (1) find all the d-FFVs by solving the Diophantine system (2.1); (2) calculate the associated d-MP candidates with the d-FFVs by using Eq. (2.3); and finally (3) remove the possible duplicates and the non-minimal vectors to determine the set of all the d-MPs.
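As an illustration of step (2), a minimal sketch of computing the associated candidate from a d-FFV via Eq. (2.3) might look like the following; the row-major incidence-matrix encoding of the MPs is our assumption for illustration, not a detail taken from the paper.

```c
#include <stdint.h>

/* Sketch of Eq. (2.3): x_i is the total flow through arc a_i.
 * inc is an h-by-m arc-incidence matrix stored row-major:
 * inc[j*m + i] == 1 iff arc a_i belongs to MP P_j (illustrative
 * encoding, assumed here). */
void associated_candidate(int h, int m, const uint8_t *inc,
                          const int *f, int *x)
{
    for (int i = 0; i < m; ++i) {
        x[i] = 0;
        for (int j = 0; j < h; ++j)
            if (inc[j * m + i])       /* MP P_j uses arc a_i */
                x[i] += f[j];         /* accumulate its flow */
    }
}
```

For instance, with two MPs P_1 = {a_1, a_2} and P_2 = {a_2, a_3} and flows f = (2, 1), the associated candidate is x = (2, 3, 1).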
The proposed algorithm in [25] does not remove the duplicate solutions first; in fact, it removes the duplicates and non-minimal vectors simultaneously. Hence, in the worst case, it needs to compare every pair of vectors. However, suppose one first removes the duplicates, which can be done efficiently using vectorization techniques, and subsequently removes the non-minimal vectors. In that case, there is less work to do in the latter stage, and, in practice, keeping a reduced number of items in memory can lead to better performance. As a result, Algorithm 1, shown below, which first removes the duplicates and then the non-minimal vectors, is a slightly improved version of the algorithm proposed by Lin et al. [25].

Algorithm 1
Step 1. Solve the system (2.1) to determine all the d-FFVs.
Step 2. Use Eq. (2.3) to calculate the associated d-MP candidate with each d-FFV obtained in Step 1.
Step 3. Remove the duplicate d-MP candidates.
Step 4. Remove the non-minimal d-MP candidates.
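For intuition, Step 1 amounts to a bounded enumeration of integer flow assignments. The routine below is a simplified sketch that only counts the solutions of f_1 + ... + f_h = d with 0 ≤ f_j ≤ K_j; a full implementation of system (2.1) would additionally check the arc-capacity constraints, which we omit here.

```c
/* Count the nonnegative integer solutions of
 * f_0 + f_1 + ... + f_{h-1} = d with f_j <= K[j], by recursing
 * over the MP index j.  Arc-capacity coupling from system (2.1)
 * is intentionally omitted in this sketch. */
int count_ffv(int h, const int *K, int d, int j)
{
    if (j == h)
        return d == 0;                /* valid iff demand exactly met */
    int total = 0;
    int top = d < K[j] ? d : K[j];    /* cannot exceed MP capacity */
    for (int f = 0; f <= top; ++f)
        total += count_ffv(h, K, d - f, j + 1);
    return total;
}
```

For example, with two MPs of capacity 2 each and d = 3, the only solutions are (1, 2) and (2, 1).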
Although some algorithms in the literature are more efficient than Algorithm 1, they are either not efficient enough for use in real-world systems or not directly optimizable by vectorization, parallelism, or cache-friendly approaches. On the other hand, the focus of this paper is to show how vectorization techniques can improve the performance of d-MP algorithms. Therefore, the choice of algorithm is somewhat arbitrary, as any of them would be adequate (though some are more convenient than others) for generating the experimental evaluation results shown in Section 4.

VECTORIZATION
Vectorization can be described as the process of converting the application of an operation to a single value (a scalar) into the application of the same operation to multiple values (a vector) at once [32]. It is, therefore, a parallelization technique that can improve the performance of many kinds of computational workloads. It should not be confused with standard parallelization techniques, which use multiple processors to achieve similar results, since vector operations happen in the context of a single processor and a single instruction [24].
Vector processors have existed in one form or another since the '70s, and virtually all modern processors provide some form of SIMD extension. Intel, for example, offers multiple versions of its SSE and AVX extensions, and other processor vendors such as ARM provide extensions such as Neon. More recently, GPUs have also begun to include vector instructions [21].
Invariably, all these approaches work using long registers that are usually several times the size of the regular registers used for scalar operations. While 128- and 256-bit registers are commonplace, 4096-bit registers are also available in some specialized architectures. For example, consider 128-bit vector registers: typical integers are stored in 32 bits, which means that the time it takes to sum one pair of integers using scalar operations is the same as the time to sum four pairs of integers using vector instructions.
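This four-pairs-at-once behavior can be sketched directly with Intel's SSE2 intrinsics; the helper below is an illustrative example (function name ours), loading four 32-bit integers per operand into one 128-bit register and summing them with a single vector addition:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum four pairs of 32-bit integers with one vector addition. */
void add4(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* a[0..3] */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* b[0..3] */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

The unaligned load/store intrinsics are used so the sketch places no alignment requirement on the input arrays.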
Most optimizing compilers and numeric computing suites such as Mathematica, MATLAB, and NumPy offer some support for SIMD operations.However, effectively and automatically vectorizing code that was not explicitly written to profit from these SIMD instructions is still a very active research problem in Computer Science.
In this work, we leverage vectorization explicitly, using 128-bit vectors to improve the algorithm's performance. In particular, for increased compatibility, we focus on Intel's x86 architecture with the SSE2 extensions. While more recent extensions such as AVX offer larger vector registers, they are only available in more recent processors.
To exemplify how vector instructions can be used, let us examine a simple example in which we want to check whether two vectors are equal, element-wise. A traditional implementation might look like this:

int vequals(int n, const uint8_t *v1, const uint8_t *v2)
{
    for (int i = 0; i < n; ++i)
        if (v1[i] != v2[i])
            return 0;
    return 1;
}
We compare each element of v1 to its respective element in v2 inside the loop. If any of these comparisons finds a pair of elements that are not equal, we stop the execution and return 0 (false, i.e., the vectors are not equal). If, after all the elements have been compared, we have not found a differing pair, we conclude the vectors are the same and return 1 (true).
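For illustration, a vectorized variant of vequals (our sketch, which assumes n is a multiple of 16) can compare 16 bytes per instruction with SSE2: the byte-wise comparison result is packed into a 16-bit mask, which equals 0xFFFF exactly when all 16 bytes matched.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* SSE2 element-wise equality check; assumes n is a multiple of 16. */
int vequals_sse2(int n, const uint8_t *v1, const uint8_t *v2)
{
    for (int i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(v1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(v2 + i));
        /* _mm_cmpeq_epi8 sets each byte to 0xFF where equal;
         * _mm_movemask_epi8 collects the 16 sign bits into an int. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) != 0xFFFF)
            return 0;  /* at least one byte differs */
    }
    return 1;
}
```

A production version would add a scalar tail loop for lengths that are not multiples of 16.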
EXPERIMENTAL RESULTS
In this section, we report experimental results on the known benchmarks and on randomly generated SFNs in the sense of the performance profile introduced by Dolan and Moré [6]. These results clearly show the practical efficiency of using vectorization techniques in solving the d-MP problem.

Comparing three versions on known benchmarks
Here, we make several numerical comparisons between three versions of Algorithm 1. The first one (A1) is Algorithm 1 itself. The second one (MA) is an improved version of Algorithm 1 in which Steps 2 and 3 are merged so that no duplicate candidate is generated. The third version (VMA) is the vectorized version of the second one.
The experiments were done on a computer with an Intel(R) Core(TM) i7-7500U CPU at 2.7 GHz and 16 GB of RAM. We employ three benchmark networks taken from the literature, given in Figs. 2, 3 and 4. The maximum capacities of the arcs in all of these benchmarks are set to 3, i.e., M_i = 3. All the generated results are summarized in Tables 1 and 2. The columns in the tables are d, the demand value; n_can, the number of candidates; and n_d-MP, the number of d-MPs. Also, t_A1, t_MA and t_VMA are respectively the running times of A1, MA and VMA. We note that Table 2 provides the results related to the general network in Fig. 4, whose size varies with u, and the first column in this table, u, lists the parameter used to generate the corresponding network. Comparing the running times of the three algorithms given in Table 1 clearly shows the practical efficiency of MA and VMA compared to A1. Hence, we did not run A1 on the general network topology, resulting in the absence of a column t_A1 in Table 2. The presented results in the tables are an average of at least ten executions, and the standard deviation of the samples is shown alongside the average.

Figure 4: The general network topology taken from the literature [9].

If Eq. (2.3) is naively applied during Step 2, duplicate candidates might be generated. Therefore, MA employs a radix-tree with a branching factor of M to merge Steps 2 and 3 of A1. During the generation of the candidates, MA performs a synchronized walk of the radix-tree containing all candidates already generated, and only generates values for each element of the under-construction candidate vector that have not yet been explored. Therefore, in O(log_M n_can) steps, it can determine whether a newly generated candidate was already known. Computationally, this is more efficient than generating all the candidates and then removing the duplicates, not due to lower algorithmic complexity, but because of the savings in memory usage for the temporary storage of duplicates, and thus a more efficient use of the processor caches. The use of the radix-tree, although beneficial for the reduction of total execution time, comes at a cost: some of the operations of the merged procedure are not directly suitable for vectorization. In fact, in any code in which control flow is present (which MA needs to perform the walk on the radix-tree), the application of regular vectorization techniques can be severely compromised [33]. Therefore, our code is a carefully chosen mix of regular and vector instructions intended to minimize execution time.
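As a rough illustration of the duplicate test (a simplified sketch, not the authors' implementation), a radix-tree insertion that reports whether a candidate vector is new might look like this, assuming M = 3 so that each vector element takes one of M + 1 = 4 capacity levels:

```c
#include <stdlib.h>

#define B 4  /* number of capacity levels 0..M, assuming M = 3 */

/* One radix-tree level per vector element; child[v] follows value v. */
struct node { struct node *child[B]; };

struct node *new_node(void)
{
    return calloc(1, sizeof(struct node));  /* error handling omitted */
}

/* Insert candidate x (length m); return 1 if it was new, 0 if it was
 * already present (a duplicate). */
int rt_insert(struct node *root, const int *x, int m)
{
    int fresh = 0;
    struct node *cur = root;
    for (int i = 0; i < m; ++i) {
        if (!cur->child[x[i]]) {
            cur->child[x[i]] = new_node();
            fresh = 1;  /* created a branch: not seen before */
        }
        cur = cur->child[x[i]];
    }
    return fresh;
}
```

Each lookup costs one pointer hop per vector element, independent of how many candidates are stored.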
The effect of this approach can be seen in the number of candidates, n_can, and in the execution times. For instance, in the network given in Fig. 2, considering d = 4, n_can is lower than the respective value for d = 5 (76530 vs. 180266). The results provided in Tables 1 and 2 clearly show the practical efficiency of VMA compared to the other versions. One notes that as the number of candidates or d-MPs increases, so does VMA's advantage, due to more efficient use of vectorization.
The performance bottleneck of the current implementation is the fourth step of Algorithm 1, i.e., the removal of the non-minimal candidates. More than 95% and 75% of the execution times of MA and VMA, respectively, are spent in this step. Currently, the complexity of this step is quadratic in the number of candidates. However, one may improve it by using multidimensional divide-and-conquer approaches such as the one proposed by Bentley [3], which we aim to investigate in future work.
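The quadratic step can be sketched as a pairwise dominance test over the (already de-duplicated) candidates; the layout below, k row-major vectors of length m with a keep flag per vector, is an illustrative simplification rather than the paper's data structure.

```c
/* Return 1 iff y < x elementwise (y <= x everywhere, strict somewhere). */
int strictly_less(const int *y, const int *x, int m)
{
    int strict = 0;
    for (int i = 0; i < m; ++i) {
        if (y[i] > x[i]) return 0;
        if (y[i] < x[i]) strict = 1;
    }
    return strict;
}

/* Quadratic filter: keep[i] is cleared when another candidate is
 * strictly smaller, so only the minimal vectors (the d-MPs, by
 * Theorem 2.2) remain marked.  cand holds k vectors of length m,
 * row-major, with duplicates already removed. */
void remove_nonminimal(const int *cand, int k, int m, int *keep)
{
    for (int i = 0; i < k; ++i) keep[i] = 1;
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < k; ++j)
            if (i != j && strictly_less(cand + j * m, cand + i * m, m))
                keep[i] = 0;  /* cand_j < cand_i: i is non-minimal */
}
```

The double loop makes the O(k^2 m) cost explicit, which is exactly where a divide-and-conquer approach could help.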

Comparing the two best versions on random networks
In addition to Tables 1 and 2, to have a more intuitive comparison between the two best versions of Algorithm 1, we solve two hundred randomly generated test problems with both versions.
To avoid overly complex test problems, we consider n = 6, 7, and 8 as the number of nodes in the randomly generated SFNs. The number of arcs in each network is a random integer between l = u − 4 and u = ⌈(n − 1)(n + 2)/4⌉, inclusive. We note that u is the average of the minimum and maximum possible numbers of arcs in a connected graph, i.e., n − 1 and n(n − 1)/2, respectively. The capacity of each arc in the randomly generated test problems is also set to M_i = 3, as for the benchmarks. This way, we consider the running times of the two best versions on these two hundred random test problems to produce the performance profile introduced by Dolan and Moré [6], in which the ratios of the running times of the considered algorithms to the best ones are used. Assuming t_i,MA and t_i,VMA, respectively, as the running times of the versions MA and VMA on the ith problem, for i = 1, 2, ..., 200, the performance ratios are r_i,s = t_i,s / min{t_i,s' : s' = MA, VMA}, for s = MA, VMA [6]. For each algorithm, the performance is calculated by Pr_s(T) = N_s/200, where N_s is the number of SFNs for which r_i,s ≤ T, i = 1, 2, ..., 200. Fig. 5 shows the resulting Dolan and Moré performance profile for the two versions of Algorithm 1. In this figure, at any τ on the horizontal axis, the difference between the diagrams shows the percentage of the test problems solved τ times faster by the algorithm whose diagram lies above the other one. Hence, the figure expresses that VMA solves 95% of the test problems faster than MA (see the difference between the diagrams at τ = 1). Observing τ = 2 in the figure, one can see that VMA solves around 25% of the test problems at least two times faster than MA. It also shows that VMA solves some test problems more than four times faster than MA. In this profile, the algorithm whose performance diagram lies above the other is preferred [6].
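The profile computation itself is straightforward; the sketch below, for the two-solver case, is our illustration (function and parameter names assumed), computing the fraction of problems a solver handles within factor T of the faster of the two.

```c
/* Dolan-Moré performance profile for two solvers: given running
 * times t_s and t_other on p problems, return the fraction of
 * problems where solver s is within factor T of the best time. */
double profile(const double *t_s, const double *t_other, int p, double T)
{
    int count = 0;
    for (int i = 0; i < p; ++i) {
        double best = t_s[i] < t_other[i] ? t_s[i] : t_other[i];
        if (t_s[i] / best <= T)     /* performance ratio r_{i,s} <= T */
            ++count;
    }
    return (double)count / p;
}
```

At T = 1 this gives the fraction of problems on which the solver is fastest, matching how the figure is read at τ = 1.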

Figure 1: A simple benchmark network example taken from the literature [9].

Theorem 2.1. Every d-MP is a d-MP candidate.

Figure 2: A benchmark network example with 14 arcs taken from the literature [9].

Table 1: The final results on the benchmarks given in Figs. 2 and 3.

Table 2: The final results on the general network topology given in Fig. 4.