---
title: Convert a C++ program
author: Timothy Kaler
date: 2022-07-20T16:22:55.620Z
attribution: true
---
OpenCilk can be used to add parallelism to existing serial code without changing the original program's semantics. Let us walk through the process of converting an existing serial C or C++ program to an OpenCilk parallel program and show how OpenCilk's suite of tools can be used to debug race conditions and scalability bottlenecks.
## General workflow
The typical process for adding parallelism to existing serial C or C++ programs using OpenCilk involves five steps:
1. **Debug serial code:** Verify that the original program is correct. It is good practice to write correct and well-tested serial code prior to attempting parallelization. Bugs that exist in the serial code will also exist after introducing parallelism, but they may be more difficult to debug.
2. **Identify parallelism:** Identify regions of the code that could benefit from parallel execution. Typically, operations that are relatively long-running and/or tasks that can be performed independently are prime candidates for parallelization.
3. **Annotate parallelism:** Introduce parallelism into the code using the OpenCilk keywords `cilk_for`, `cilk_spawn`, and `cilk_scope`. These keywords are described in more depth in ???. A summary of their semantics is as follows (a minimal usage sketch follows the list):
    * `cilk_for` identifies a loop for which all iterations can execute in parallel.
    * `cilk_spawn` indicates a call to a function (a "child") that can proceed in parallel with the caller (the "parent").
    * `cilk_scope` indicates that all spawned children within the scoped region must complete before proceeding.
4. **Compile:** Compile the code using the OpenCilk compiler. On Linux* OS, one invokes the OpenCilk compiler using the `clang` or `clang++` commands. Once compiled, the program can be run on the local machine to test for correctness and measure performance.
5. **Verify absence of races:** Use OpenCilk's ***Cilksan race detector*** to verify the absence of race conditions in the parallel program. If the parallelization of the original (correct) serial program contains no race conditions, then the parallel program will produce the same result as the serial program. With the help of OpenCilk's tools, one can identify and resolve race conditions through the use of ***reducers***, locks, and recoding.
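
To make the keyword semantics concrete, here is a minimal sketch showing each keyword in use. It is not taken from this guide's quicksort example; the function names and cutoff sizes are illustrative.

```cpp
#include <cilk/cilk.h>

// Illustrative only: scale every element of an array in parallel.
void scale_all(double* x, long n, double c) {
  cilk_for (long i = 0; i < n; ++i)   // all iterations may execute in parallel
    x[i] *= c;
}

// Illustrative only: recursively sum an array with two-way parallelism.
long sum(const long* a, long n) {
  if (n <= 1024) {                    // small inputs: sum serially
    long s = 0;
    for (long i = 0; i < n; ++i) s += a[i];
    return s;
  }
  long left = 0, right = 0;
  cilk_scope {                        // all work spawned in this scope completes before the scope exits
    left = cilk_spawn sum(a, n / 2);  // child may run in parallel with the continuation
    right = sum(a + n / 2, n - n / 2);
  }
  return left + right;
}
```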
## Example: Quicksort
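
The example used in this section is a simple quicksort. The guide's full listing also includes the necessary headers and an `int main(int argc, char* argv[])` driver that runs and checks the sort; the core serial routine looks roughly like the following sketch. The line numbers cited in the discussion below refer to the full listing, not to this excerpt.

```cpp
#include <algorithm>  // std::partition
#include <utility>    // std::swap

// Serial quicksort sketch: partition around the last element as the pivot,
// then recurse on the two disjoint subranges.
void sample_qsort(int* begin, int* end) {
  if (begin == end) return;
  --end;                                  // exclude the last element (the pivot)
  int pivot = *end;
  int* middle = std::partition(begin, end,
                               [pivot](int a) { return a < pivot; });
  std::swap(*end, *middle);               // move the pivot into its final position
  sample_qsort(begin, middle);            // sort the elements less than the pivot
  sample_qsort(++middle, ++end);          // sort the elements greater than or equal to it
}
```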
## Identify parallelism
The `sample_qsort` function is invoked recursively on two disjoint subarrays on lines 16 and 17. These independent tasks will be relatively long-running and are good candidates for parallelization. Parallelizing these recursive calls represents a typical divide-and-conquer strategy for parallelizing recursive algorithms. An intrepid reader might also notice that the partition algorithm invoked on line 13 could be parallelized for even greater scalability.
## Annotate parallelism
The next step is to actually introduce parallelism into our quicksort program. This can be accomplished through the judicious use of OpenCilk's three keywords for expressing parallelism: `cilk_for`, `cilk_spawn`, and `cilk_scope`.
In this example, we shall make use of just the `cilk_spawn` and `cilk_scope` keywords. The `cilk_spawn` keyword indicates that a function (the *child*) may be executed in parallel with the code that follows the `cilk_spawn` statement (the *parent*). Note that the keyword *allows* but does not *require* parallel operation. The OpenCilk scheduler will dynamically determine what actually gets executed in parallel when multiple processors are available. The `cilk_scope` statement indicates that execution may not proceed past the end of the scoped region until all `cilk_spawn` requests within that region have completed.
Let us walk through a version of the quicksort code that has been parallelized using OpenCilk.
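
The parallelized routine looks roughly like the following sketch; as before, the line numbers cited in the discussion below refer to the guide's full listing rather than to this excerpt.

```cpp
#include <algorithm>    // std::partition
#include <utility>      // std::swap
#include <cilk/cilk.h>  // cilk_spawn, cilk_scope

// Parallel quicksort sketch: the first recursive call is spawned, the second
// executes in the continuation, and the cilk_scope waits for both to finish.
void sample_qsort(int* begin, int* end) {
  if (begin == end) return;
  --end;                                  // exclude the last element (the pivot)
  int pivot = *end;
  int* middle = std::partition(begin, end,
                               [pivot](int a) { return a < pivot; });
  std::swap(*end, *middle);               // move the pivot into its final position
  cilk_scope {
    cilk_spawn sample_qsort(begin, middle);   // child: sort the left subrange
    sample_qsort(++middle, ++end);            // parent continues: sort the right subrange
  }
}
```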
In the example code above, the serial quicksort code has been converted into a parallel OpenCilk program by adding the `cilk_spawn` keyword on line 9 and defining the `cilk_scope` region to include lines 9 and 10. The `cilk_spawn` keyword on line 9 indicates that the function call `sample_qsort(begin, middle)` is allowed to execute in parallel with its ***continuation***, which includes the function call `sample_qsort(++middle, ++end)` on line 10.
The `cilk_spawn` keyword can be thought of as allowing the recursive invocation of `sample_qsort` on line 9 to execute asynchronously. Thus, when we call `sample_qsort` again on line 10, the call at line 9 might not have completed. The end of the `cilk_scope` region at line 11 indicates that this function will not continue until all `cilk_spawn` requests within that scoped region have completed. There is an implicit `cilk_scope` surrounding the body of every function, so that by the end of every function all tasks spawned in the function have returned.
The parallelization of quicksort provided in this example implements a typical divide-and-conquer strategy for parallelizing recursive algorithms. At each level of recursion we have two-way parallelism: the parent strand (line 10) continues executing the current function, while a child strand executes the other recursive call. In general, recursive divide-and-conquer algorithms can expose significant parallelism. In the case of quicksort, however, parallelizing according to the standard recursive structure of the serial algorithm exposes only limited parallelism. The reason is the substantial amount of work performed by the serial `partition` function invoked on line 5. The partition algorithm may be parallelized for better scalability, but we shall leave this task as an exercise for the intrepid reader.
## Compile
This quicksort code can be compiled using the OpenCilk C++ compiler by adding an `#include <cilk/cilk.h>` statement to the source file. The `cilk/cilk.h` header file contains declarations of the OpenCilk runtime API and the keywords used to specify parallel control flow. After adding the header file, one can compile the quicksort program using the OpenCilk compiler.
##### Linux* OS
```shell
> clang++ qsort.cpp -o qsort -O3 -fopencilk
```
### Build, execute, and test
For example, running the program to sort 10 million integers produces output like the following:

```
Sorting 10000000 integers
1.468 seconds Sort succeeded.
```
### Checking for race conditions using Cilksan
The Cilksan race detector can be used to check for race conditions in the parallelized quicksort code. To run Cilksan on our parallel quicksort routine, we must compile the program with Cilksan enabled and then execute the instrumented program.
The Cilksan race detector will report any race conditions present in the program and verify the absence of races in a race-free program. More detailed instructions about the use of Cilksan can be found [here](/doc/users-guide/getting-started/#using-cilksan).
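
For example, one might compile and run the program with Cilksan roughly as follows. This is a sketch of the typical invocation: the `-fsanitize=cilk` flag is the usual way to enable Cilksan instrumentation, and `-Og -g` improves the quality of the race reports; consult the linked instructions for the authoritative steps.

```shell
> clang++ qsort.cpp -o qsort_cilksan -Og -g -fopencilk -fsanitize=cilk
> ./qsort_cilksan        # run on a representative input
```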
### Measuring scalability using Cilkscale
Cilkscale can be used to benchmark and analyze the parallelism, in terms of work and span, of an OpenCilk program. These measurements can be used to predict performance when running on a varying number of parallel processors.
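
As a sketch of the typical workflow (assuming the `-fcilktool=cilkscale` instrumentation flag; see the Cilkscale documentation for the full set of options and the accompanying benchmarking scripts), one might compile and run an instrumented binary as follows:

```shell
> clang++ qsort.cpp -o qsort_cilkscale -O3 -fopencilk -fcilktool=cilkscale
> ./qsort_cilkscale        # prints work, span, and parallelism measurements
```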
*(Figure: plots illustrating the parallel execution time and speedup of the quicksort program.)*