---
title: Convert a C++ program
author: Timothy Kaler
date: 2022-07-20T16:22:55.620Z
attribution: true
---
OpenCilk can be used to add parallelism to existing serial code without changing the original program's semantics. Let us walk through the process of converting an existing serial C or C++ program to an OpenCilk parallel program and show how OpenCilk's suite of tools can be used to debug race conditions and scalability bottlenecks.
## General workflow
The typical process for adding parallelism to existing serial C or C++ programs using OpenCilk involves five steps:
1. **Debug serial code:** Verify that the original program is correct. It is good practice to write correct and well-tested serial code prior to attempting parallelization. Bugs that exist in the serial code will also exist after introducing parallelism, but they may be more difficult to debug.
2. **Identify parallelism:** Identify regions of the code that could benefit from parallel execution. Typically, operations that are relatively long-running and/or tasks that can be performed independently are prime candidates for parallelization.
3. **Annotate parallelism:** Introduce parallelism into the code using the OpenCilk keywords `cilk_for`, `cilk_spawn`, and `cilk_scope`. These keywords are described in more depth in ???. A summary of their semantics is as follows (a minimal usage sketch follows the list):
    * `cilk_for` identifies a loop for which all iterations can execute in parallel.
    * `cilk_spawn` indicates a call to a function (a "child") that can proceed in parallel with the caller (the "parent").
    * `cilk_scope` indicates that all spawned children within the scoped region must complete before proceeding.
4. **Compile:** Compile the code using the OpenCilk compiler. On Linux* OS, one invokes the OpenCilk compiler using the `clang` or `clang++` commands. Once compiled, the program can be run on the local machine to test for correctness and measure performance.
5. **Verify absence of races:** Use OpenCilk's ***Cilksan race detector*** to verify the absence of race conditions in the parallel program. If the parallelization of the original (correct) serial program contains no race conditions, then the parallel program will produce the same result as the serial program. With the help of OpenCilk's tools, one can identify and resolve race conditions through the use of ***reducers***, locks, and recoding.
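
To make the keyword semantics concrete, here is a minimal sketch showing each keyword in use. It is not taken from this guide's quicksort example; the function names and cutoff sizes are illustrative.

```cpp
#include <cilk/cilk.h>

// Illustrative only: scale every element of an array in parallel.
void scale_all(double* x, long n, double c) {
  cilk_for (long i = 0; i < n; ++i)   // all iterations may execute in parallel
    x[i] *= c;
}

// Illustrative only: recursively sum an array with two-way parallelism.
long sum(const long* a, long n) {
  if (n <= 1024) {                    // small inputs: sum serially
    long s = 0;
    for (long i = 0; i < n; ++i) s += a[i];
    return s;
  }
  long left = 0, right = 0;
  cilk_scope {                        // all work spawned in this scope completes before the scope exits
    left = cilk_spawn sum(a, n / 2);  // child may run in parallel with the continuation
    right = sum(a + n / 2, n - n / 2);
  }
  return left + right;
}
```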
## Example: Quicksort
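
The example used in this section is a simple quicksort. The guide's full listing also includes the necessary headers and an `int main(int argc, char* argv[])` driver that runs and checks the sort; the core serial routine looks roughly like the following sketch. The line numbers cited in the discussion below refer to the full listing, not to this excerpt.

```cpp
#include <algorithm>  // std::partition
#include <utility>    // std::swap

// Serial quicksort sketch: partition around the last element as the pivot,
// then recurse on the two disjoint subranges.
void sample_qsort(int* begin, int* end) {
  if (begin == end) return;
  --end;                                  // exclude the last element (the pivot)
  int pivot = *end;
  int* middle = std::partition(begin, end,
                               [pivot](int a) { return a < pivot; });
  std::swap(*end, *middle);               // move the pivot into its final position
  sample_qsort(begin, middle);            // sort the elements less than the pivot
  sample_qsort(++middle, ++end);          // sort the elements greater than or equal to it
}
```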
## Identify parallelism
The `sample_qsort` function is invoked recursively on two disjoint subarrays on lines 16 and 17. These independent tasks will be relatively long-running and are good candidates for parallelization. Parallelizing these recursive calls represents a typical divide-and-conquer strategy for parallelizing recursive algorithms. An intrepid reader might also notice that the partition algorithm invoked on line 13 could be parallelized for even greater scalability.
## Annotate parallelism
The next step is to actually introduce parallelism into our quicksort program. This can be accomplished through the judicious use of OpenCilk's three keywords for expressing parallelism: `cilk_for`, `cilk_spawn`, and `cilk_scope`.
In this example, we shall make use of just the `cilk_spawn` and `cilk_scope` keywords. The `cilk_spawn` keyword indicates that a function (the *child*) may be executed in parallel with the code that follows the `cilk_spawn` statement (the *parent*). Note that the keyword *allows* but does not *require* parallel operation. The OpenCilk scheduler will dynamically determine what actually gets executed in parallel when multiple processors are available. The `cilk_scope` statement indicates that execution may not proceed past the end of the scoped region until all `cilk_spawn` requests within that region have completed.
Let us walk through a version of the quicksort code that has been parallelized using OpenCilk.
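
The parallelized routine looks roughly like the following sketch; as before, the line numbers cited in the discussion below refer to the guide's full listing rather than to this excerpt.

```cpp
#include <algorithm>    // std::partition
#include <utility>      // std::swap
#include <cilk/cilk.h>  // cilk_spawn, cilk_scope

// Parallel quicksort sketch: the first recursive call is spawned, the second
// executes in the continuation, and the cilk_scope waits for both to finish.
void sample_qsort(int* begin, int* end) {
  if (begin == end) return;
  --end;                                  // exclude the last element (the pivot)
  int pivot = *end;
  int* middle = std::partition(begin, end,
                               [pivot](int a) { return a < pivot; });
  std::swap(*end, *middle);               // move the pivot into its final position
  cilk_scope {
    cilk_spawn sample_qsort(begin, middle);   // child: sort the left subrange
    sample_qsort(++middle, ++end);            // parent continues: sort the right subrange
  }
}
```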
In the example code above, the serial quicksort code has been converted into a parallel OpenCilk program by adding the `cilk_spawn` keyword on line 9 and defining the `cilk_scope` region to include lines 9 and 10. The `cilk_spawn` keyword on line 9 indicates that the function call `sample_qsort(begin, middle)` is allowed to execute in parallel with its ***continuation***, which includes the function call `sample_qsort(++middle, ++end)` on line 10.
The `cilk_spawn` keyword can be thought of as allowing the recursive invocation of `sample_qsort` on line 9 to execute asynchronously. Thus, when we call `sample_qsort` again on line 10, the call at line 9 might not have completed. The end of the `cilk_scope` region at line 11 indicates that this function will not continue until all `cilk_spawn` requests within that scoped region have completed. There is an implicit `cilk_scope` surrounding the body of every function, so that by the end of every function all tasks spawned in the function have returned.
The parallelization of quicksort provided in this example implements a typical divide-and-conquer strategy for parallelizing recursive algorithms. At each level of recursion we have two-way parallelism: the parent strand (line 10) continues executing the current function, while a child strand executes the other recursive call. In general, recursive divide-and-conquer algorithms can expose significant parallelism. In the case of quicksort, however, parallelizing according to the standard recursive structure of the serial algorithm exposes only limited parallelism. The reason is the substantial amount of work performed by the serial `partition` function invoked on line 5. The partition algorithm may be parallelized for better scalability, but we shall leave this task as an exercise for the intrepid reader.
## Compile
This quicksort code can be compiled using the OpenCilk C++ compiler by adding an `#include <cilk/cilk.h>` statement to the source file. The `cilk/cilk.h` header file contains declarations of the OpenCilk runtime API and the keywords used to specify parallel control flow. After adding the header file, one can compile the quicksort program using the OpenCilk compiler.
##### Linux* OS
```shell
> clang++ qsort.cpp -o qsort -O3 -fopencilk
```
### Build, execute, and test
For example, running the program to sort 10 million integers produces output like the following:

```
Sorting 10000000 integers
1.468 seconds Sort succeeded.
```
### Checking for race conditions using Cilksan
The Cilksan race detector can be used to check for race conditions in the parallelized quicksort code. To run Cilksan on our parallel quicksort routine, we must compile the program with Cilksan enabled and then execute the instrumented program.
The Cilksan race detector will report any race conditions present in the program and verify the absence of races in a race-free program. More detailed instructions about the use of Cilksan can be found [here](/doc/users-guide/getting-started/#using-cilksan).
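
For example, one might compile and run the program with Cilksan roughly as follows. This is a sketch of the typical invocation: the `-fsanitize=cilk` flag is the usual way to enable Cilksan instrumentation, and `-Og -g` improves the quality of the race reports; consult the linked instructions for the authoritative steps.

```shell
> clang++ qsort.cpp -o qsort_cilksan -Og -g -fopencilk -fsanitize=cilk
> ./qsort_cilksan        # run on a representative input
```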
### Measuring scalability using Cilkscale
Cilkscale can be used to benchmark and analyze the parallelism, in terms of work and span, of an OpenCilk program. These measurements can be used to predict performance when running on a varying number of parallel processors.
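
As a sketch of the typical workflow (assuming the `-fcilktool=cilkscale` instrumentation flag; see the Cilkscale documentation for the full set of options and the accompanying benchmarking scripts), one might compile and run an instrumented binary as follows:

```shell
> clang++ qsort.cpp -o qsort_cilkscale -O3 -fopencilk -fcilktool=cilkscale
> ./qsort_cilkscale        # prints work, span, and parallelism measurements
```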
*(Figure: plots illustrating the parallel execution time and speedup of the quicksort program.)*