Requirements for the DASH Tasking Model

This is a list of features the tasking model in DASH/DART should support, mainly from the viewpoint of the runtime system. The C++ API will be discussed later.

1. Efficient scheduling of small tasks, i.e., tasks with a minimum amount of work carried out. The DASH tasking runtime might not efficiently schedule micro-tasks, e.g., as used by HPX to taskify single communication.

2. Implementation in C as the runtime will be integrated with DART. This excludes runtimes written in C++ as per the result of the discussion in Munich, October 2016.

3. Local synchronization between tasks could be possible in two ways:
1. async after is a way of direct task synchronization by explicitly specifying the dependencies between two tasks, i.e., building the task graph explicitly. This requires the runtime to return some form of reference to the caller that can later be used for synchronization. This is commonly used in programming models such as HPX and Chapel.

This form of synchronization requires manual work by the programmer for specifying the dependencies and can become tedious for complex patterns. However, it might be easier to implement for the runtime and can be the tool of choice for simple dependencies.

2. Data dependencies are a way of indirect task synchronization in which the dependencies between tasks are expressed implicitely through a description of the produced and consumed data elements of each task, i.e., WAW and RAW dependencies. The runtime has to keep track of the dependencies and release tasks that become runnable due to satisfied dependencies. Examples are OpenMP and OmpSs.

This form of synchronization appears to be a more natural way of defining dependencies between tasks as users usually know the input and output of tasks they define, especially if the tasks operate on shared data structures. Explicitly referencing tasks is usually not required but might be useful when used in combination with 3.1.

4. Remote synchronization between tasks should be possible as well to coordinate the production and consumption of data elements by tasks executed on different units. Remote data-dependencies seem to be a natural fit for the PGAS model since all tasks operate on a global address space and can produce and consume data elements without explicit synchronization (as seen in 2-sided communication).

[JS] Synchronizing remote tasks explicitly through handles might turn out to be more complex although I have not investigated this so far. It is also only valuable in combination with remote task invocation (see below).

5. Nested task creation allows the programmer to create tasks from within tasks, e.g., to increase the granularity of concurrency exposed to the runtime. The nested tasks commonly inherit certain properties from the parent tasks such as the list of dependent tasks. Nested tasking also requires more fine-grained task graph management as tasks should be able to wait for their child tasks to finish execution.

6. Priorities will be used to prioritize tasks that either perform communication or prepare data for later communication in order to overlap computation-only tasks with the communication.

7. Remote task invocation is a way of explicitly shifting work from one unit to another unit, e.g., for implementing load balancing.

However, real use-cases seem to be scarce so far. For common scientific applications, shifting work from one unit to another unit is done by moving data between the units, which will then operate on this data. For irregular applications, e.g., graph applications, remote task invocation might be worth investigating.

Interface Proposal

DART

The tasking runtime will be provided by DART, which might only be an abstraction for existing interfaces such as the LLVM OpenMP runtime. Hence, the API has to be defined such that the basic DASH tasking model described below can be implemented.

The interface might look like this:


dart_gptr_t gptr1 = *(dart_gptr_t*)data;
// do something with gptr1
}

// data structure for marshalling multiple arguments to task2
dart_gptr_t gptr1;
dart_gptr_t gptr2;
};

dart_gptr_t gptr1 = td->gptr1;
dart_gptr_t gptr2 = td->gptr2;
// here, data points to that contains gptr1 and gptr2
// consume gptr1 and fill gptr2
}

int main(int argc, char **argv) {

dart_gptr_t gptr[2]; // gptr1 and gptr2

int concurrency;

int srcunit = (unitid + 1) % team_size;

// first task to fill gptr1
deps[0].gptr = gptr[0];
deps[0].type = DART_DEP_OUT;

// second task to consume gptr1 and fill gptr2, won't run before task1 is completed
deps[0].gptr = gptr[0];
deps[0].gptr.unitid = srcunit;
deps[0].type = DART_DEP_IN;
deps[1].gptr = gptr[1];
deps[1].type = DART_DEP_OUT;
// use a struct to pass multiple arguments to task2
td->gptr1 = gptr[0];
td->gptr2 = gptr[1];

// Now we can either wait for the second task explicitely...
// ... or wait for all tasks defined so far...
// ... or wait for the task group (required for the finish statement introduced below:
}

The signature of dart_task_create would look as follows


// target unit
dart_unit_t target,
// function pointer for the action to be executed
void (*fn)(void*),
// the data argument to be passed to the action function
// the size of the task_data; data will be copied if data_size > 0
size_t data_size,
// task dependency definition, may be NULL
// number of dependencies specified
size_t ndeps,
dart_task_group_t tg);

A special version can be used to create detached tasks, i.e., without receiving a task handle:


dart_ret_t
// target unit
dart_unit_t target,
// function pointer for the action to be executed
void (*fn)(void*),
// the data argument to be passed to the action function
// the size of the task_data; data will be copied if data_size > 0
size_t data_size,
// task dependency definition, may be NULL
// number of dependencies specified
size_t ndeps,
dart_task_group_t tg);

DASH

The basic C++ API should allow to define tasks with actions provided via function pointers (combined with parameters), bound member functions, and lamda expression (basically any std::function object). The tasking model should provide ways for synchronizing tasks both directly (async after) and indirectly (data dependencies). Remote task invocation with return value communication could be provided optionally (async at). A finish construct can be used to wait for created at the end of a code block (similar to X10/UPC++). Explicit task finalization through an API call is also supported.

An example for direct task synchronization:

using namespace dash;

finish {
// create a task on unit u
future<...> f1 = async(u)([](){
printf("Hello world!");
return 42;
}
// wait for the result to become available
int result = f1.get();

// create a task chain (similar to UPC++/X10
// the future object could hold a reference to task1,

} // wait for all tasks to be complete

An example for indirect task synchronization through data dependencies:

using namespace dash;

Array<int> arr1(team_size);
Array<int> arr2(team_size);

// fill arr1 using detached tasks (no futures returned)
async_detached(unitid, {arr1[unitid], OUTDEP})([&](){
arr1.local[0] = unitid;
});
// cyclic left shift the content of arr1 into arr2
// creating the source value has been executed.
// Arbitrary numbers of dependencies can be specified (also dynamically)
int srcidx = (unitid + 1) % team_size;
async_detached(unitid, {arr1[srcidx], INDEP})([&](){
arr2.local[0] = arr1[srcidx];
});

// wait for all tasks to be complete
// this is similar to the finish statement above but on a global scale
task_complete();

These abstraction can then be used in other places, i.e., to create a parallel dash::for_each or async copying.

Existing Approaches

Feature Matrix

Below is an attempt to match existing approaches with the features outlined above. This is useful for comparing the DASH/DART runtime with related work.

Approach / Feature Language Async after Data dependency graphs Nested task creation Remote synchronization Priorities Remote task invocation
Skalar variables Arrays Array Subranges Excplicit Implicit
——————– ——————————– —————– —————— ——– ————— ———————- ———————— ———- ———— ————————
[HPX] C++ YES NO NO NO YES YES NO NO YES
[Chapel] Custom YES NO NO NO YES YES NO NO YES
[OpenMP] C/C++/Fortran (compiler, C RT) NO YES YES NO YES NO NO YES NO
[OmpSs] C/C++/Fortran (compiler, C++ RT) NO YES YES YES YES NO NO YES NO
[UPC++] C++ YES NO NO NO YES YES NO CHECK YES
[HabaneroUPC++] C++ YES NO NO NO CHECK CHECK NO CHECK YES
[RMA-Async] C++ YES NO NO NO CHECK CHECK NO NO YES
[MTAPI] C NO NO NO NO YES (YES) NO YES (on heterogeneous devs)
[EMBB] C++ YES NO NO NO YES (YES) NO YES (on heterogeneous devs)
[Intel TBB] C++ YES NO NO NO YES NO NO NO NO

References

Alpaka

[Chapel]: Website: http://chapel.cray.com/, Introductory paper from Programming Models for Parallel Computing

[OpenMP]: Website: http://openmp.org, OpenMP 4.5 standard available at http://www.openmp.org/mp-documents/openmp-4.5.pdf, LLVM runtime library http://openmp.llvm.org/

[OmpSs]: Website: Task-Based Programming with OmpSs and Its Application, CHECK for other publications

[UPC++]: UPC++: A PGAS Extension for C++

[HabaneroUPC++]: HabaneroUPC++: A Compiler-free PGAS Library

[RMA-Async]: Github project: https://github.com/swfrench/rma-async