
Commit fd47847

Update User's guide “convert-a-c++-program”
1 parent b884315 commit fd47847


src/doc/users-guide/convert-a-c++-program.md

Lines changed: 48 additions & 40 deletions
@@ -2,22 +2,23 @@
title: Convert a C++ program
author: Timothy Kaler
date: 2022-07-20T16:22:55.620Z
+attribution: true
---
-A common application of OpenCilk is the parallelization of existing serial code. Indeed, it is often advisable for programmers to prioritize writing correct and efficient serial code before attempting parallelization because of the notorious difficulty of writing correct parallel code. In this section, we shall walk through the process of converting an existing serial C or C++ code to an OpenCilk parallel program and show how OpenCilk's suite of tools can be used to debug race-conditions and scalability bottlenecks.
+OpenCilk can be used to add parallelism to existing serial code without changing the original program's semantics. Let us walk through the process of converting an existing serial C or C++ program to an OpenCilk parallel program and show how OpenCilk's suite of tools can be used to debug race conditions and scalability bottlenecks.

## General workflow

-One typically begins with an existing serial C or C++ program that implements the functions or algorithms that are relevant to one's application. Ideally, the serial code you start with will be well tested to verify it is correct and be performance engineered to achieve good performance when run serially. Any correctness bugs in the serial code will result in correctness bugs in the parallel program, but they will be more difficult to identify and fix! Similarly, inefficiency in the original sequential code will translate to inefficiency in its parallel equivalent.
+The typical process for adding parallelism to existing serial C or C++ programs using OpenCilk involves five steps:

-Next, one begins the process of introducing parallelism into the program. Typically one starts by identifying the regions of the program that will most benefit from parallel execution. Operations that are relatively long-running and/or tasks that can be performed independently are prime candidates for parallelization. The programmer can identify tasks in their code that can execute in parallel using the three OpenCilk keywords:
+1. **Debug serial code:** Verify that the original program is correct. It is good practice to write correct and well-tested serial code before attempting parallelization. Bugs that exist in the serial code will also exist after introducing parallelism, but they may be more difficult to debug.
+2. **Identify parallelism:** Identify regions of the code that could benefit from parallel execution. Typically, operations that are relatively long-running and/or tasks that can be performed independently are prime candidates for parallelization.
+3. **Annotate parallelism:** Introduce parallelism to the code using the OpenCilk keywords `cilk_for`, `cilk_spawn`, and `cilk_scope`. These keywords are described in more depth in ???. A summary of their semantics follows (see also the sketch after this list):

-* `cilk_spawn` indicates a call to a function (a "child") that can proceed in parallel with the caller (the "parent").
-* `cilk_sync` indicates that all spawned children must complete before proceeding.
-* `cilk_for` identifies a loop for which all iterations can execute in parallel.
-
-The parallel version of the code can be compiled and tested using the OpenCilk compiler. On **Linux* OS** one invokes the OpenCilk compiler using the `clang` or `clang++` commands. One compiled, the program can be run on the local machine to test for correctness and measure performance. If the parallelization of the original (correct) serial program contains no ***race conditions***, then the parallel program will produce the same result as the serial program.
-
-The OpenCilk tools can be used to debug race conditions and scalability bottlenecks in parallelized codes. Verifying the absence of race conditions is particularly important as such errors can lead to non-deterministic (and often buggy) behavior. Fortunately, OpenCilk provides the ***cilksan race detector*** which can identify all possible race conditions introduced by parallel operations when a program is run on a given input. With the help of OpenCilk's tools, one can identify and resolve race conditions through the use of **reducers**, locks, and recoding.
+   * `cilk_for` identifies a loop for which all iterations can execute in parallel.
+   * `cilk_spawn` indicates a call to a function (a "child") that can proceed in parallel with the caller (the "parent").
+   * `cilk_scope` indicates that all spawned children within the scoped region must complete before proceeding.
+4. **Compile:** Compile the code using the OpenCilk compiler. On Linux* OS, one invokes the OpenCilk compiler using the `clang` or `clang++` commands. Once compiled, the program can be run on the local machine to test for correctness and measure performance.
+5. **Verify absence of races:** Use OpenCilk's ***Cilksan race detector*** to verify the absence of race conditions in the parallel program. If the parallelization of the original (correct) serial program contains no ***race conditions***, then the parallel program will produce the same result as the serial program. With the help of OpenCilk's tools, one can identify and resolve race conditions through the use of ***reducers***, locks, and recoding.
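
Before turning to the quicksort example, here is a minimal, self-contained sketch of how the three keywords are written in OpenCilk C++ (an editorial addition, not part of the original guide; `sum_range` is a hypothetical helper used only for illustration):

```cpp
#include <cilk/cilk.h>
#include <cstdio>
#include <vector>

// Hypothetical helper used only for illustration: serially sum n elements.
static long sum_range(const long* a, long n) {
    long s = 0;
    for (long i = 0; i < n; ++i) s += a[i];
    return s;
}

int main() {
    const long n = 1000000;
    std::vector<long> a(n);

    // cilk_for: all iterations of this loop may execute in parallel.
    cilk_for (long i = 0; i < n; ++i) {
        a[i] = i;
    }

    long lo = 0, hi = 0;
    // cilk_scope: every child spawned inside must complete before the scope is exited.
    cilk_scope {
        lo = cilk_spawn sum_range(a.data(), n / 2);    // child: may run in parallel with the parent
        hi = sum_range(a.data() + n / 2, n - n / 2);   // parent: continues in the meantime
    }

    std::printf("total = %ld\n", lo + hi);
    return 0;
}
```

The sketch mirrors the two-way divide-and-conquer pattern used in the quicksort example below: the spawned child and the parent's continuation each handle one half of the array, and the end of the `cilk_scope` joins them. It compiles with the same `-fopencilk` flag shown in the Compile step.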

## Example: Quicksort

@@ -80,21 +81,17 @@ int main(int argc, char* argv[])
}
```

-### Compiling Quicksort with the OpenCilk compiler

-This quicksort code can be compiled using the OpenCilk C++ compiler by adding `#include <cilk.h>` statement to the source file. The `cilk.h` header file contains declarations of the OpenCilk runtime API and the keywords used to specify parallel control flow. After adding the `cilk.h` header file, one can compile the quicksort program using the OpenCilk compiler.

-##### Linux* OS
+## Identify parallelism

-```shell
-> clang++ qsort.cpp -o qsort –O3 -fopencilk
-```
+The `sample_qsort` function is invoked recursively on two disjoint subarrays on line 16 and line 17. These independent tasks will be relatively long-running and are good candidates for parallelization. This proposed parallelization of quicksort represents a typical divide-and-conquer strategy for parallelizing recursive algorithms. An intrepid reader might also notice that the partition algorithm invoked on line 13 may be parallelized for even greater scalability.

-### Add parallelism using `cilk_spawn`
+## Annotate parallelism

-The next step is to actually introduce parallelism into our quicksort program. This can be accomplished through the judicious use of OpenCilk's three keywords for expressing parallelism: `cilk_spawn`, `cilk_sync`, and `cilk_for`.
+The next step is to introduce parallelism into our quicksort program. This can be accomplished through the judicious use of OpenCilk's three keywords for expressing parallelism: `cilk_for`, `cilk_spawn`, and `cilk_scope`.

-In this example, we shall make use of just the `cilk_spawn` and `cilk_sync` keywords. The `cilk_spawn` keyword indicates that a function (the *child*) may be executed in parallel with the code that follows the `cilk_spawn` statement (the *parent*). Note that the keyword *allows* but does not *require* parallel operation. The OpenCilk scheduler will dynamically determine what actually gets executed in parallel when multiple processors are available. The `cilk_sync` statement indicates that the function may not continue until all `cilk_spawn` requests in the same function have completed. The `cilk_sync` instruction does not affect parallel strands spawned in other functions.
+In this example, we shall make use of just the `cilk_spawn` and `cilk_scope` keywords. The `cilk_spawn` keyword indicates that a function (the *child*) may be executed in parallel with the code that follows the `cilk_spawn` statement (the *parent*). Note that the keyword *allows* but does not *require* parallel operation. The OpenCilk scheduler dynamically determines what actually executes in parallel when multiple processors are available. The `cilk_scope` statement indicates that execution may not continue past the end of the scoped region until all `cilk_spawn` requests within that region have completed.

Let us walk through a version of the quicksort code that has been parallelized using OpenCilk.
@@ -104,20 +101,31 @@ void sample_qsort(int * begin, int * end)
  if (begin != end) {
    --end; // Exclude last element (pivot)
    int * middle = std::partition(begin, end,
-     std::bind2nd(std::less<int>(),*end));
+        std::bind2nd(std::less<int>(),*end));
    std::swap(*end, *middle); // pivot to middle
-   cilk_spawn sample_qsort(begin, middle);
-   sample_qsort(++middle, ++end); // Exclude pivot
-   cilk_sync;
+   cilk_scope {
+     cilk_spawn sample_qsort(begin, middle);
+     sample_qsort(++middle, ++end); // Exclude pivot
+   }
  }
}
```

-In the example code above, the serial quicksort code has been converted into a parallel OpenCilk code by adding the `cilk_spawn` keyword on line 8, and the `cilk_sync` keyword on line 10. The `cilk_spawn` keyword on line 8 indicates that the function call `sample_qsort(begin, middle)` is allowed to execute in-parallel with its ***continuation*** which includes the program instructions that are executed after the return of the function call on line 8 and the `cilk_sync` instruction on line 10.
+In the example code above, the serial quicksort code has been converted into parallel OpenCilk code by adding the `cilk_spawn` keyword on line 9 and defining the `cilk_scope` region to include lines 9 and 10. The `cilk_spawn` keyword on line 9 indicates that the function call `sample_qsort(begin, middle)` is allowed to execute in parallel with its ***continuation***, which includes the function call `sample_qsort(++middle, ++end)` on line 10.
+
+The `cilk_spawn` keyword can be thought of as allowing the recursive invocation of `sample_qsort` on line 9 to execute asynchronously. Thus, when we call `sample_qsort` again on line 10, the call on line 9 might not have completed. The end of the `cilk_scope` region at line 11 indicates that the function will not continue until all `cilk_spawn` requests within that scoped region have completed. There is an implicit `cilk_scope` surrounding the body of every function, so that at the end of every function all tasks spawned in the function have returned.
+
+The parallelization of quicksort provided in this example implements a typical divide-and-conquer strategy for parallelizing recursive algorithms. At each level of recursion we have two-way parallelism: the parent strand (line 10) continues executing the current function, while a child strand executes the other recursive call. In general, recursive divide-and-conquer algorithms can expose significant parallelism. In the case of quicksort, however, parallelizing according to the standard recursive structure of the serial algorithm exposes only limited parallelism. The reason is the substantial amount of work performed by the serial `partition` function invoked on line 5. The partition algorithm may be parallelized for better scalability, but we shall leave this task as an exercise to the intrepid reader.
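
To make the claim of limited parallelism concrete, here is a rough work/span estimate (an editorial addition, not part of the original guide, and assuming roughly balanced partitions). The serial `partition` step does Θ(n) work at each level, while the two recursive calls run in parallel, so

```latex
T_1(n) = 2\,T_1(n/2) + \Theta(n) = \Theta(n \log n), \qquad
T_{\infty}(n) = T_{\infty}(n/2) + \Theta(n) = \Theta(n),
```

giving parallelism T_1(n)/T_∞(n) = Θ(log n). Parallelizing the partition step is what shrinks the span and unlocks greater scalability.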

-The `cilk_spawn` keyword can be thought of as allowing the recursive invocation of `sample_qsort` on line 8 to execute asynchronously. Thus, when we call `sample_qsort` again in line 9, the call at line 8 might not have completed. The `cilk_sync` statement at line 10 indicates that this function will not continue until all `cilk_spawn` requests in the same function have completed. There is an implicit `cilk_sync` at the end of every function that waits until all tasks spawned in the function have returned, so the `cilk_sync` here is redundant, but written explicitly for clarity.
+## Compile

-The parallelization of quicksort provided in this example implements a typical divide-and-conquer strategy for parallelizing recursive algorithms. At each level of recursion, we have two-way parallelism; the parent strand (line 9) continues executing the current function, while a child strand executes the other recursive call. In general, recursive divide-and-conquer algorithms can expose significant parallelism. In the case of quicksort, however, parallelizing according to the standard recursive structure of the serial algorithm only exposes limited parallelism. The reason for this is due to the substantial amount of work performed by the serial `partition` function invoked on line 5. The partition algorithm may be parallelized for better scalability, but we shall leave this task as an exercise to the intrepid reader.
+This quicksort code can be compiled using the OpenCilk C++ compiler by adding an `#include <cilk/cilk.h>` statement to the source file. The `cilk/cilk.h` header file contains declarations of the OpenCilk runtime API and the keywords used to specify parallel control flow. After adding this header, one can compile the quicksort program using the OpenCilk compiler.
+
+##### Linux* OS
+
+```shell
+> clang++ qsort.cpp -o qsort -O3 -fopencilk
+```

### Build, execute, and test

@@ -157,6 +165,20 @@ Sorting 10000000 integers
1.468 seconds Sort succeeded.
```

+### Checking for race conditions using Cilksan
+
+The Cilksan race detector can be used to check for race conditions in the parallelized quicksort code. To run Cilksan on our parallel quicksort routine, we must compile the program with Cilksan enabled and then execute the instrumented program.
+
+```shell
+> clang++ qsort.cpp -o qsort -Og -g -fopencilk -fsanitize=cilk
+> ./qsort 10000000
+
+Cilksan detected 0 distinct races.
+Cilksan suppressed 0 duplicate race reports.
+```
+
+The Cilksan race detector will report any race conditions present in the program and verify the absence of races in a race-free program. More detailed instructions about the use of Cilksan can be found [here](/doc/users-guide/getting-started/#using-cilksan).
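
To show what a race looks like in practice, and one way to recode it away, here is a standalone sketch (an editorial addition, not part of the quicksort example; `sum_squares_racy` and `sum_squares_recoded` are hypothetical helpers). The first function contains a determinacy race that Cilksan would flag; the second restructures the computation so that every parallel iteration writes a distinct element. Reducers or locks are alternative fixes for shared updates like this.

```cpp
#include <cilk/cilk.h>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// RACY: parallel iterations all update the shared variable `sum`.
int sum_squares_racy(const std::vector<int>& v) {
    int sum = 0;
    cilk_for (std::size_t i = 0; i < v.size(); ++i) {
        sum += v[i] * v[i];   // determinacy race: concurrent read-modify-write of `sum`
    }
    return sum;
}

// RECODED: each iteration writes its own slot; the reduction is done serially afterward.
int sum_squares_recoded(const std::vector<int>& v) {
    std::vector<int> partial(v.size());
    cilk_for (std::size_t i = 0; i < v.size(); ++i) {
        partial[i] = v[i] * v[i];   // disjoint writes, so no race
    }
    return std::accumulate(partial.begin(), partial.end(), 0);
}

int main() {
    std::vector<int> v(1000);
    std::iota(v.begin(), v.end(), 1);
    std::printf("racy=%d recoded=%d\n", sum_squares_racy(v), sum_squares_recoded(v));
    return 0;
}
```

Compiling this sketch with the flags shown above (`-fopencilk -fsanitize=cilk`) and running it should cause Cilksan to report the race in `sum_squares_racy`, while `sum_squares_recoded` is race-free.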

### Measuring scalability using Cilkscale

Cilkscale can be used to benchmark and analyze the parallelism, in terms of work and span, of an OpenCilk program. These measurements can be used to predict performance when running on a varying number of parallel processors.
@@ -183,17 +205,3 @@ Plots illustrating the parallel execution time and speedup of the quicksort prog

![Cilkscale execution time for quicksort.](/img/cilkscale-qsort-execution-time.png "Quicksort execution time")

-### Checking for race conditions using Cilksan
-
-The Cilksan race detector can be used to check for race conditions in the parallelized quicksort code. To run Cilksan on our parallel quicksort routine, we must compile the program with Cilksan enabled and then execute the instrumented program.
-
-```shell
-> clang++ qsort.cpp -o qsort –Og -g -fopencilk -fsanitize=cilk
-./qsort 10000000
-
-Cilksan detected 0 distinct races.
-Cilksan suppressed 0 duplicate race reports.
-
-```
-
-The Cilksan race detector will report any race conditions present in the program and verify the absence of races in a race-free program. More detailed instructions about the use of Cilksan can be found [here](/doc/users-guide/getting-started/#using-cilksan).
