Learning

Learning paths provide a curated set of training modules that help you learn specific tools and technologies. You study the code and practice on the Intel® DevCloud.



Introduction to JupyterLab* and Notebooks.


Last modified: 2022-02-21, teratec

A Jupyter Notebook is a sequence of boxes referred to as "cells". Each cell contains text, like this one, or C++ or Python code that may be executed as part of this tutorial. As you proceed, please note the following:

  • The active cell is indicated by the blue bar on the left. Click on a cell to select it.
  • Use the "run" ▶ button at the top or Shift+Enter to execute a selected cell, starting with this one.
    • Note: If you mistakenly press just Enter, you will enter the editing mode for the cell. To exit editing mode and continue, press Shift+Enter.
  • Unless stated otherwise, the cells containing code within this tutorial MUST be executed in sequence.
  • You may save the tutorial at any time, which will save the output, but not the state. Saved Jupyter Notebooks retain the execution sequence numbers, which may make a cell appear to have been executed even though it has not been run in the new session. Because state is not saved, re-opening or restarting a Jupyter Notebook requires re-executing all the executable steps, in order from the beginning.
  • If for any reason you need to restart the tutorial from the beginning, you may reset the state of the Jupyter Notebook and clear all output. Use the menu at the top to select Kernel -> "Restart Kernel and Clear All Outputs"
  • Cells containing Markdown can be executed and will render. However, there is no indication of execution, and it is not necessary to explicitly execute Markdown cells.
  • Cells containing executable code will have "[ ]:" to the left of the cell:
    • [ ] blank indicates that the cell has not yet been executed.
    • [*] indicates that the cell is currently executing.
    • Once a cell is done executing, a number appears in the small brackets, indicating the order in which the cell was executed in this session. Any output (e.g., from print()) from the code will appear below the cell.

Code editing, Compiling and Running in Jupyter Notebooks

This code shows a simple C++ "Hello World". Inspect the code; there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

[2]:

%%writefile src/hello.cpp
#include <iostream>
#define RESET   "\033[0m"
#define RED     "\033[31m"    /* Red */
#define BLUE    "\033[34m"    /* Blue */

int main(){
    std::cout << RED << "Hello World" << RESET << std::endl;
}
Overwriting src/hello.cpp

Build and Run

Select the cell below and click run ▶ to compile and execute the code above:[3]:

! chmod 755 q; chmod 755 run_hello.sh;if [ -x "$(command -v qsub)" ]; then ./q run_hello.sh; else ./run_hello.sh; fi
Job has been submitted to Intel(R) DevCloud and will execute soon.

 If you do not see the result in 60 seconds, please restart the Jupyter kernel:
 Kernel -> 'Restart Kernel and Clear All Outputs...' and then try again

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1857083.v-qsvr-1           ...ub-singleuser u126779         00:00:12 R jupyterhub     
1857089.v-qsvr-1           run_hello.sh     u126779                0 Q batch          

Waiting for Output ...

TimeOut 60 seconds: Job is still queued for execution, check for output file later (run_hello.sh.o1857089)

 Done⬇

########################################################################
#      Date:           Sun 27 Feb 2022 08:45:05 PM PST
#    Job ID:           1857089.v-qsvr-1.aidevcloud
#      User:           u126779
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################

## u126779 is compiling Welcome Module-- 1 of 1 hello.cpp
Hello World

########################################################################
# End of output for job 1857089.v-qsvr-1.aidevcloud
# Date: Sun 27 Feb 2022 08:45:23 PM PST
########################################################################

Job Completed in 60 seconds.

Introduction to DPC++


Last modified: 2022-02-23, teratec

Introduction to oneAPI and DPC++

Sections

Learning Objectives

  • Explain how the oneAPI programming model can solve the challenges of programming in a heterogeneous world
  • Use oneAPI projects to enable your workflows
  • Understand the Data Parallel C++ (DPC++) language and programming model
  • Become familiar with using Jupyter Notebooks for training throughout the course

oneAPI Programming Model Overview

The oneAPI programming model provides a comprehensive and unified portfolio of developer tools that can be used across hardware targets, including a range of performance libraries spanning several workload domains. The libraries include functions custom-coded for each target architecture so the same function call delivers optimized performance across supported architectures. DPC++ is based on industry standards and open specifications to encourage ecosystem collaboration and innovation.

oneAPI Distribution

Intel® oneAPI toolkits are available via multiple distribution channels:

  • Local product installation: install the oneAPI toolkits from the Intel® Developer Zone.
  • Install from containers or repositories: install the oneAPI toolkits from one of several supported containers or repositories.
  • Pre-installed in the Intel® DevCloud: a free development sandbox for access to the latest Intel® SVMS hardware and select oneAPI toolkits.

Programming Challenges for Multiple architectures

Currently, in the data-centric space, there is growth in specialized workloads. Each kind of data-centric hardware typically needs to be programmed using different languages and libraries; because there is no common programming language or set of APIs, this requires maintaining separate code bases. Developers have to learn a whole set of different tools, as tool support is inconsistent across platforms. Developing software for each hardware platform requires a separate investment, with little ability to reuse that work to target a different architecture. You also have to consider the requirements of the diverse set of data-centric hardware.

Introducing oneAPI

oneAPI is a solution that delivers a unified programming model to simplify development across diverse architectures. It includes a unified, simplified language and libraries for expressing parallelism, and it delivers uncompromised native high-level language performance across a range of hardware including CPUs, GPUs, and FPGAs. The oneAPI initiative is based on industry standards and open specifications and is interoperable with existing HPC programming models.


Simple Exercise

This exercise introduces DPC++ to the developer by way of a small simple code. In addition, it introduces the developer to the Jupyter notebook environment for editing and saving code; and for running and submitting programs to the Intel® DevCloud.

Editing the simple.cpp code

The Jupyter cell below with the gray background can be edited in-place and saved.

The first line of the cell contains the command %%writefile lab/simple.cpp. This tells the input cell to save its contents into a file named simple.cpp in the lab directory. As you edit the cell and run it in the Jupyter Notebook, your changes are saved into that file.

The code below is some simple DPC++ code to get you started in the DevCloud environment. Simply inspect the code - there are no modifications necessary. Run the first cell to create the file, then run the cell below it to compile and execute the code.

  1. Inspect the code cell below, then click run ▶ to save the code to a file
  2. Run ▶ the cell in the Build and Run section below the code snippet to compile and execute the code in the saved file
%%writefile lab/simple.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;
static const int N = 16;
int main(){
  //# define queue which has default device associated for offload
  queue q;
  std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

  //# Unified Shared Memory Allocation enables data access on host and device
  int *data = malloc_shared<int>(N, q);

  //# Initialization
  for(int i=0; i<N; i++) data[i] = i;

  //# Offload parallel computation to device
  q.parallel_for(range<1>(N), [=] (id<1> i){
    data[i] *= 2;
  }).wait();

  //# Print Output
  for(int i=0; i<N; i++) std::cout << data[i] << "\n";

  free(data, q);
  return 0;
}
Output: 
Overwriting lab/simple.cpp

Build and Run

Select the cell below and click Run ▶ to compile and execute the code above:[ ]:

! chmod 755 q; chmod 755 run_simple.sh;if [ -x "$(command -v qsub)" ]; then ./q run_simple.sh; else ./run_simple.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

SYCL

SYCL (pronounced ‘sickle’) represents an industry standardization effort that includes support for data-parallel programming for C++. It is summarized as “C++ Single-source Heterogeneous Programming for OpenCL.” The SYCL standard, like OpenCL*, is managed by the Khronos Group*.

SYCL is a cross-platform abstraction layer that builds on OpenCL. It enables code for heterogeneous processors to be written in a “single source” style using C++. This is not only useful to the programmers, but it also gives a compiler the ability to analyze and optimize across the entire program regardless of the device on which the code is to be run.

Unlike OpenCL, SYCL includes templates and lambda functions to enable higher-level application software to be cleanly coded with optimized acceleration of kernel code. Developers program at a higher level than OpenCL but always have access to lower-level code through seamless integration with OpenCL, as well as C/C++ libraries.

What is Data Parallel C++

oneAPI programs are written in Data Parallel C++ (DPC++). It takes advantage of modern C++ productivity benefits and familiar constructs, and incorporates the SYCL* standard for data parallelism and heterogeneous programming. DPC++ is a single-source language where host code and heterogeneous accelerator kernels can be mixed in the same source file. A DPC++ program is invoked on the host computer and offloads computation to an accelerator. Programmers use familiar C++ and library constructs with added functionality, such as a queue for work targeting, a buffer for data management, and parallel_for for parallelism, to direct which parts of the computation and data should be offloaded.

DPC++ extends SYCL 1.2.1

DPC++ enhances productivity: simple things are simple to express, with lower verbosity and programmer burden. It also enhances performance by giving programmers control over program execution and by enabling hardware-specific features. DPC++ is a fast-moving open collaboration feeding into the SYCL* standard, and an open source implementation with the goal of upstreaming the LLVM work and making the DPC++ extensions core SYCL* or Khronos* extensions.

HPC Single Node Workflow with oneAPI

Accelerated code can be written in either a kernel (DPC++) or a directive-based style. Developers can use the Intel® DPC++ Compatibility Tool to perform a one-time migration from CUDA to Data Parallel C++. Existing Fortran applications can use a directive-based style based on OpenMP. Existing C++ applications can choose either the kernel style or the directive-based style, and existing OpenCL applications can remain in the OpenCL language or migrate to Data Parallel C++.

Intel® Advisor is recommended to optimize the design for vectorization and memory (CPU and GPU) and to identify loops that are candidates for offload, projecting their performance on target accelerators.

The figure below shows the recommended approach of different starting points for HPC developers:

oneAPI Programming models

Platform Model

The platform model for oneAPI is based upon the SYCL* platform model. It specifies a host controlling one or more devices. A host is the computer, typically a CPU-based system executing the primary portion of a program, specifically the application scope and the command group scope.

The host coordinates and controls the compute work that is performed on the devices. A device is an accelerator, a specialized component containing compute resources that can quickly execute a subset of operations typically more efficiently than the CPUs in the system. Each device contains one or more compute units that can execute several operations in parallel. Each compute unit contains one or more processing elements that serve as the individual engine for computation.

The following figure provides a visual depiction of the relationships in the platform model. One host communicates with one or more devices. Each device can contain one or more compute units. Each compute unit can contain one or more processing elements. In this example, the CPU in a desktop computer is the host and it can also be made available as a device in a platform configuration.
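If you want to see this hierarchy on your own system, the short sketch below (an illustration, not part of the course sample code) walks the platforms and devices visible to the SYCL runtime and prints the number of compute units each device reports; it assumes the same CL/sycl.hpp header and sycl namespace used throughout this module.

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  // One or more platforms, each exposing one or more devices,
  // each device reporting its compute units.
  for (auto &p : platform::get_platforms()) {
    std::cout << "Platform: " << p.get_info<info::platform::name>() << "\n";
    for (auto &d : p.get_devices()) {
      std::cout << "  Device: " << d.get_info<info::device::name>()
                << " (compute units: "
                << d.get_info<info::device::max_compute_units>() << ")\n";
    }
  }
  return 0;
}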

Execution Model

The execution model is based upon the SYCL* execution model. It defines and specifies how code, termed kernels, execute on the devices and interact with the controlling host. The host execution model coordinates execution and data management between the host and devices via command groups. The command groups, which are groupings of commands like kernel invocation and accessors, are submitted to queues for execution.

Accessors, which are formally part of the memory model, also communicate ordering requirements of execution. A program employing the execution model declares and instantiates queues. Queues can execute with an in-order or out-of-order policy controllable by the program. In-order execution is an Intel extension.
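As a sketch of how the policy can be requested in DPC++ (assuming the property::queue::in_order queue property provided by the DPC++ compiler):

// Default construction gives an out-of-order queue: independent command groups
// may be reordered by the runtime as long as data dependencies are respected.
queue q_out_of_order;

// Requesting in-order execution (a DPC++/Intel extension): command groups
// execute in the order in which they were submitted.
queue q_in_order{property::queue::in_order()};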

The device execution model specifies how computation is accomplished on the accelerator. Computations, ranging from small one-dimensional data to large multidimensional data sets, are allocated across a hierarchy of ND-ranges, work-groups, sub-groups (an Intel extension), and work-items, all of which are specified when the work is submitted to the command queue.

It is important to note that the actual kernel code represents the work that is executed for one work-item. The code outside of the kernel controls just how much parallelism is executed; the amount and distribution of the work is controlled by specification of the sizes of the ND-range and work-group.

The following figure depicts the relationship between an ND-range, work-group, sub-group, and work-item. The total amount of work is specified by the ND-range size. The grouping of the work is specified by the work-group size. The example shows the ND-range size of X * Y * Z, work-group size of X’ * Y’ * Z’, and subgroup size of X’. Therefore, there are X * Y * Z work-items. There are (X * Y * Z) / (X’ * Y’ * Z’) work-groups and (X * Y * Z) / X’ subgroups.
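As a concrete, purely illustrative instance of those formulas, an ND-range of 8 x 8 x 8 work-items with a work-group size of 4 x 4 x 4 yields 512 work-items and (8*8*8)/(4*4*4) = 8 work-groups:

h.parallel_for(nd_range<3>(range<3>(8, 8, 8), range<3>(4, 4, 4)), [=](nd_item<3> item){
    auto global_id = item.get_global_id();      // position in the full ND-range
    auto group_id  = item.get_group().get_id(); // which work-group this item belongs to
    auto local_id  = item.get_local_id();       // position within the work-group
    // CODE THAT RUNS ON DEVICE (executed once per work-item)
});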

Memory Model

The memory model for oneAPI is based upon the SYCL* memory model. It defines how the host and devices interact with memory. It coordinates the allocation and management of memory between the host and devices. The memory model is an abstraction that aims to generalize across and be adaptable to the different possible host and device configurations.

In this model, memory resides upon and is owned by either the host or the device and is specified by declaring a memory object. There are two different types of memory objects, buffers and images. Interaction of these memory objects between the host and device is accomplished via an accessor, which communicates the desired location of access, such as host or device, and the particular mode of access, such as read or write.

Consider a case where memory is allocated on the host through a traditional malloc call. Once the memory is allocated on the host, a buffer object is created, which enables the host allocated memory to be communicated to the device. The buffer class communicates the type and number of items of that type to be communicated to the device for computation. Once a buffer is created on the host, the type of access allowed on the device is communicated via an accessor object, which specifies the type of access to the buffer.
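A minimal sketch of that sequence is shown below (sizes and names are illustrative only): the host allocation is wrapped in a buffer, the accessor declares how the device accesses it, and destroying the buffer at the end of the block synchronizes the results back to the host allocation.

constexpr int N = 1024;
int *host_data = static_cast<int *>(malloc(N * sizeof(int)));  // traditional host allocation
for (int i = 0; i < N; i++) host_data[i] = i;

queue q;
{
  // The buffer communicates the element type and count to the runtime
  buffer<int, 1> buf(host_data, range<1>(N));
  q.submit([&](handler &h) {
    // The accessor communicates where (device) and how (read/write by default)
    // the buffer will be accessed
    accessor acc(buf, h);
    h.parallel_for(range<1>(N), [=](id<1> i) { acc[i] *= 2; });
  });
}  // leaving this scope destroys the buffer and copies the data back to host_data

free(host_data);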

Kernel Programming Model

The kernel programming model for oneAPI is based upon the SYCL* kernel programming model. It enables explicit parallelism between the host and device. The parallelism is explicit in the sense that the programmer determines what code executes on the host and device; it is not automatic. The kernel code executes on the accelerator.

Programs employing the oneAPI programming model support single source, meaning the host code and device code can be in the same source file. However, there are differences between the source code accepted in the host code and the device code with respect to language conformance and language features.

The SYCL Specification defines in detail the required language features for host code and device code. The following is a summary that is specific to the oneAPI product.

How to Compile and Run a DPC++ Program

The three main steps of compiling and running a DPC++ program are:

  1. Initialize environment variables
  2. Compile the DPC++ source code
  3. Run the application

Compiling and Running on Intel® DevCloud:

For this training, we have written a script (q) to aid developers in developing projects on DevCloud. This script submits the run.sh script to a gpu node on DevCloud for execution, waits for the job to complete and prints out the output/errors. We will be using this command to run on DevCloud: ./q run.sh

Compiling and Running on a Local System:

If you have installed the Intel® oneAPI Base Toolkit on your local system, you can use the commands below to compile and run a DPC++ program:

source /opt/intel/inteloneapi/setvars.sh

dpcpp simple.cpp -o simple

./simple

Note: the run.sh script is a combination of the three steps listed above.

Lab Exercise: Simple Vector Increment to Vector Add

Code Walkthrough

DPC++ programs are standard C++. The program is invoked on the host computer, and offloads computation to the accelerator. You will use DPC++’s queue, buffer, device, and kernel abstractions to direct which parts of the computation and data should be offloaded.

The DPC++ compiler and the oneAPI libraries automate the tedious and error-prone aspects of compute and data offload, but still allow you to control how computation and data are distributed for best performance. The compiler knows how to generate code for both the host and the accelerator, how to launch computation on the accelerator, and how to move data back and forth.

In the program below you will use a data parallel algorithm with DPC++ to leverage the computational power of heterogeneous computers. The DPC++ platform model includes a host computer and a device. The host offloads computation to the device, which could be a GPU, FPGA, or a multi-core CPU.

As a first step in a DPC++ program, create a queue. Offload computation to a device by submitting tasks to a queue. You can choose CPU, GPU, FPGA, and other devices through the selector. This program uses the default q here, which means the DPC++ runtime selects the most capable device available at runtime by using the default selector. You will learn more about devices, device selectors, and the concepts of buffers, accessors and kernels in the upcoming modules, but here is a simple DPC++ program to get you started.

Device and host can either share physical memory or have distinct memories. When the memories are distinct, offloading computation requires copying data between host and device. DPC++ does not require you to manage the data copies. By creating Buffers and Accessors, DPC++ ensures that the data is available to host and device without any effort on your part. DPC++ also allows you explicit control over data movement to achieve the best performance.

In a DPC++ program, we define a kernel, which is applied to every point in an index space. For simple programs like this one, the index space maps directly to the elements of the array. The kernel is encapsulated in a C++ lambda function. The lambda function is passed a point in the index space as an array of coordinates. For this simple program, the index space coordinate is the same as the array index. The parallel_for in the below program applies the lambda to the index space. The index space is defined in the first argument of the parallel_for as a 1 dimensional range from 0 to N-1.

The parallel_for is nested inside another lambda function, which is passed as an argument when we submit to the queue in the program below. The DPC++ runtime invokes the lambda when the accelerator connected to the queue is ready. The handler argument to the lambda allows operations inside the lambda to define the data dependencies with other computations that may be executed on the host or devices. You will see more of this in later modules.

Finally, the program does a q.wait() on the queue. The earlier submit operation queues up an operation to be performed at a later time and immediately returns. If the host wants to see the result of the computation, it must wait for the work to complete with a wait. Sometimes the device will encounter an error; q.wait_and_throw() is a way for the host to capture and handle an error that has happened on the device.
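A hedged sketch of that pattern is shown below (assuming a range size N and the sycl namespace from the surrounding example): an asynchronous handler is passed to the queue, and wait_and_throw() blocks like wait() but additionally forwards any device-side errors to that handler.

// Handler invoked for errors the device reports asynchronously
auto async_handler_fn = [](exception_list elist) {
  for (auto &e : elist) {
    try { std::rethrow_exception(e); }
    catch (exception const &ex) {
      std::cerr << "Asynchronous SYCL exception: " << ex.what() << "\n";
    }
  }
};

queue q(async_handler_fn);
q.submit([&](handler &h) {
  h.parallel_for(range<1>(N), [=](id<1> i) { /* kernel code */ });
});
// Blocks like wait(), then invokes the handler above if the device reported errors
q.wait_and_throw();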

Lab Exercise

Vector increment is the “hello world” of data parallel computing. A vector is an array of data elements, and the program below performs the same computation on each element of the vector by adding 1. The code below shows Simple Vector Increment DPC++ code. You will change the program to create a new vector, then add the elements in the new vector to the existing vector using DPC++.

  1. Select the code cell below, follow the STEPS 1 to 6 in the code comments to change from vector-increment to vector-add and click run ▶ to save the code to a file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/simple-vector-incr.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <CL/sycl.hpp>
using namespace sycl;
//N is set to 2 for demonstration purposes. Note that only the first 2 elements are
//initialized below; if you make N larger than 2, the remaining elements are zero-initialized.
static const size_t N = 2;

// ############################################################
// work

void work(queue &q) {
  std::cout << "Device : "
            << q.get_device().get_info<info::device::name>()
            << "\n";
  // ### Step 1 - Inspect
  // The code presents one input array (vector1) for which SYCL buffer memory
  // is allocated. The associated accessor, vector1_accessor, set to read/write,
  // provides access to the contents of the buffer.
  int vector1[N] = {10, 10};
  auto R = range(N);
  
  std::cout << "Input  : " << vector1[0] << ", " << vector1[1] << "\n";

  // ### Step 2 - Add another input vector - vector2
  // Uncomment the following line to add input vector2
  //int vector2[N] = {20, 20};

  // ### Step 3 - Print out for vector2
  // Uncomment the following line
  //std::cout << "Input  : " << vector2[0] << ", " << vector2[1] << "\n";
  buffer vector1_buffer(vector1,R);

  // ### Step 4 - Add another Sycl buffer - vector2_buffer
  // Uncomment the following line
  //buffer vector2_buffer(vector2,R);
  q.submit([&](handler &h) {
    accessor vector1_accessor (vector1_buffer,h);

    // Step 5 - add an accessor for vector2_buffer
    // Uncomment the following line to add an accessor for vector 2
    //accessor vector2_accessor (vector2_buffer,h,read_only);

    h.parallel_for<class test>(range<1>(N), [=](id<1> index) {
      // ### Step 6 - Replace the existing vector1_accessor to accumulate
      // vector2_accessor 
      // Comment the following line
      vector1_accessor[index] += 1;

      // Uncomment the following line
      //vector1_accessor[index] += vector2_accessor[index];
    });
  });
  q.wait();
  host_accessor h_a(vector1_buffer,read_only);
  std::cout << "Output : " << vector1[0] << ", " << vector1[1] << "\n";
}

// ############################################################
// entry point for the program

int main() {  
  try {
    queue q;
    work(q);
  } catch (exception const &e) {
    std::cerr << "Exception: " << e.what() << "\n";
    std::terminate();
  } catch (...) {
    std::cerr << "Unknown exception" << "\n";
    std::terminate();
  }
}
Output:

Overwriting lab/simple-vector-incr.cpp

Build and Run

Select the cell below and click Run ▶ to compile and execute the code above:

! chmod 755 q; chmod 755 run_simple-vector-incr.sh; if [ -x "$(command -v qsub)" ]; then ./q run_simple-vector-incr.sh; else ./run_simple-vector-incr.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Solution

Summary

In this module you learned the following:

  • How oneAPI solves the challenges of programming in a heterogeneous world
  • Take advantage of oneAPI solutions to enable your workflows
  • Use the Intel® DevCloud to test-drive oneAPI tools and libraries
  • Basics of the DPC++ language and programming model
  • Become familiar with the use of Jupyter Notebooks by editing source code in context

Survey

Tell us how we did in this module with a short survey. We will use your feedback to improve the quality and impact of these learning materials. Thanks!

Reset Notebook

Should you be experiencing any issues with your notebook or just want to start fresh, run the cell below.
from IPython.display import display, Markdown, clear_output
import ipywidgets as widgets
button = widgets.Button(
    description='Reset Notebook',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='This will update this notebook, overwriting any changes.',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
out = widgets.Output()
def on_button_clicked(_):
      # "linking function with output"
      with out:
          # what happens when we press the button
          clear_output()
          !rsync -a --size-only /data/oneapi_workshop/oneAPI_Essentials/01_oneAPI_Intro/ ~/oneAPI_Essentials/01_oneAPI_Intro
          print('Notebook reset -- now click reload on browser.')
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
widgets.VBox([button,out])
Output:
Notebook reset -- now click reload on browser.

Resources

Check out these related resources

Intel® oneAPI Toolkit documentation

SYCL

Modern C++

DPC++ Program Structure


Last modified: 2022-02-23, teratec
Sections

Learning Objectives

  • Explain the SYCL fundamental classes
  • Use device selection to offload kernel workloads
  • Decide when to use basic parallel kernels and ND Range Kernels
  • Create a host Accessor
  • Build a sample DPC++ application through hands-on lab exercises

What is Data Parallel C++?

oneAPI programs are written in Data Parallel C++ (DPC++). It is based on modern C++ productivity benefits and familiar constructs, and it incorporates the SYCL standard for data parallelism and heterogeneous programming. DPC++ is a single-source language where host code and heterogeneous accelerator kernels can be mixed in the same source file. A DPC++ program is invoked on the host computer and offloads computation to an accelerator. Programmers use familiar C++ and library constructs with added functionality, such as a queue for work targeting, a buffer for data management, and parallel_for for parallelism, to direct which parts of the computation and data should be offloaded.

Device

The device class represents the capabilities of the accelerators in a system utilizing Intel® oneAPI Toolkits. The device class contains member functions for querying information about the device, which is useful for DPC++ programs where multiple devices are created.

  • The function get_info gives information about the device (see the extended query sketch after the snippet below):
    • Name, vendor, and version of the device
    • The local and global work item IDs
    • Width for built-in types, clock frequency, cache width and sizes, online or offline
queue q;
device my_device = q.get_device();
std::cout << "Device: " << my_device.get_info<info::device::name>() << "\n";

Device Selector

The device_selector class enables the runtime selection of a particular device to execute kernels based upon user-provided heuristics. The following code sample shows use of the standard device selectors (default_selector, cpu_selector, gpu_selector, …) and a derived device_selector.

default_selector selector;
// host_selector selector;
// cpu_selector selector;
// gpu_selector selector;
queue q(selector);
std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

The DPC++ code below shows different device selectors. Inspect the code; there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

[ ]:

%%writefile lab/gpu_sample.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// ==============================================================

#include <CL/sycl.hpp>

using namespace cl::sycl;

int main() {
//# Create a device queue with device selector
  gpu_selector selector;
//cpu_selector selector;
//default_selector selector;
//host_selector selector;

  queue q(selector);
  //# Print the device name
  std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:

! chmod 755 q; chmod 755 run_gpu.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu.sh; else ./run_gpu.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Queue

A queue submits command groups to be executed by the SYCL runtime. The queue is the mechanism through which work is submitted to a device. A queue maps to one device, and multiple queues can be mapped to the same device (see the second snippet below).

q.submit([&](handler& h) {
    //COMMAND GROUP CODE
});
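For example, the short sketch below (illustrative only) creates two independent queues bound to the same GPU device; each queue receives its own command groups, and both map to the single device chosen by the selector:

gpu_selector selector;
queue q1(selector);   // first queue mapped to the selected GPU device
queue q2(selector);   // second, independent queue mapped to the same device

q1.submit([&](handler &h) { /* command group A */ });
q2.submit([&](handler &h) { /* command group B */ });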

Kernel

The kernel class encapsulates methods and data for executing code on the device when a command group is instantiated. A kernel object is not explicitly constructed by the user; it is constructed when a kernel dispatch function, such as parallel_for, is called.

q.submit([&](handler& h) {
 h.parallel_for(range<1>(N), [=](id<1> i) {
   A[i] = B[i] + C[i];
 });
});

Choosing where device kernels run

Work is submitted to queues and each queue is associated with exactly one device (e.g. a specific GPU or FPGA). You can decide which device a queue is associated with (if you want) and have as many queues as desired for dispatching work in heterogeneous systems.

  • Create a queue targeting any device: queue()
  • Create a queue targeting a pre-configured class of devices: queue(cpu_selector{}); queue(gpu_selector{}); queue(INTEL::fpga_selector{}); queue(accelerator_selector{}); queue(host_selector{});
  • Create a queue targeting a specific device (custom criteria): class custom_selector : public device_selector {int operator()(…… // Any logic you want! … queue(custom_selector{});

DPC++ Language and Runtime

The DPC++ language and runtime consist of a set of C++ classes, templates, and libraries.

Application scope and command group scope:

  • Code that executes on the host
  • The full capabilities of C++ are available at application and command group scope

Kernel scope:

  • Code that executes on the device.
  • At kernel scope there are limitations in accepted C++ (see the annotated sketch below).
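The annotated sketch below (assuming the usual CL/sycl.hpp, <vector>, and <iostream> includes and the sycl namespace) marks where each scope begins; the kernel-scope restrictions noted in the comment follow the SYCL specification (for example, no recursion, no RTTI, no exception handling, no dynamic memory allocation).

int main() {
  // Application scope: ordinary host C++; the full language is available.
  std::vector<int> v(1024, 1);
  queue q;
  buffer buf(v);

  q.submit([&](handler &h) {
    // Command group scope: still host code; sets up accessors and launches the kernel.
    accessor a(buf, h);

    h.parallel_for(range<1>(1024), [=](id<1> i) {
      // Kernel scope: device code; restricted C++ (no recursion, no RTTI,
      // no exception handling, no dynamic memory allocation).
      a[i] += 1;
    });
  });
  return 0;
}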

Parallel Kernels

A parallel kernel allows multiple instances of an operation to execute in parallel. This is useful for offloading the parallel execution of a basic for-loop in which each iteration is completely independent and can run in any order. Parallel kernels are expressed using the parallel_for function. A simple 'for' loop in a C++ application is written as below:

for(int i=0; i < 1024; i++){
    a[i] = b[i] + c[i];
}

Below is how you can offload it to the accelerator:

h.parallel_for(range<1>(1024), [=](id<1> i){
    A[i] =  B[i] + C[i];
});

Basic Parallel Kernels

The functionality of basic parallel kernels is exposed via the range, id, and item classes. The range class is used to describe the iteration space of the parallel execution, and the id class is used to index an individual instance of a kernel in a parallel execution.

h.parallel_for(range<1>(1024), [=](id<1> i){
// CODE THAT RUNS ON DEVICE 
});

The above example is good if all you need is the index (id), but if you need the range value in your kernel code, you can use the item class instead of the id class and query it for the range as shown below. The item class represents an individual instance of a kernel function and exposes additional functions to query properties of the execution range.

h.parallel_for(range<1>(1024), [=](item<1> item){
    auto i = item.get_id();
    auto R = item.get_range();
    // CODE THAT RUNS ON DEVICE
});

ND-Range Kernels

Basic parallel kernels are an easy way to parallelize a for-loop but do not allow performance optimization at the hardware level. An ND-range kernel is another way to express parallelism; it enables low-level performance tuning by providing access to local memory and by mapping executions to compute units on the hardware. The entire iteration space is divided into smaller groups called work-groups; work-items within a work-group are scheduled on a single compute unit on the hardware.

Grouping kernel executions into work-groups allows control of resource usage and load-balancing of work distribution. The functionality of ND-range kernels is exposed via the nd_range and nd_item classes. The nd_range class represents a grouped execution range using the global execution range and the local execution range of each work-group. The nd_item class represents an individual instance of a kernel function and allows querying for the work-group range and index.

h.parallel_for(nd_range<1>(range<1>(1024),range<1>(64)), [=](nd_item<1> item){
    auto idx = item.get_global_id();
    auto local_id = item.get_local_id();
    // CODE THAT RUNS ON DEVICE
});

Buffer Model

Buffers encapsulate data in a SYCL application across both devices and the host. Accessors are the mechanism for accessing buffer data.

DPC++ Code Anatomy

Programs that utilize oneAPI require the include of CL/sycl.hpp. It is recommended to employ the namespace statement to save typing repeated references to the cl::sycl namespace.

#include <CL/sycl.hpp>
using namespace cl::sycl;

DPC++ programs are standard C++. The program is invoked on the host computer, and offloads computation to the accelerator. A programmer uses DPC++’s queue, buffer, device, and kernel abstractions to direct which parts of the computation and data should be offloaded.

As a first step in a DPC++ program we create a queue. We offload computation to a device by submitting tasks to a queue. The programmer can choose CPU, GPU, FPGA, and other devices through the selector. This program uses the default q here, which means DPC++ runtime selects the most capable device available at runtime by using the default selector. We will talk about the devices, device selectors, and the concepts of buffers, accessors and kernels in the upcoming modules but below is a simple DPC++ program for you to get started with the above concepts.

Device and host can either share physical memory or have distinct memories. When the memories are distinct, offloading computation requires copying data between host and device. DPC++ does not require the programmer to manage the data copies. By creating Buffers and Accessors, DPC++ ensures that the data is available to host and device without any programmer effort. DPC++ also allows the programmer explicit control over data movement when it is necessary to achieve the best performance.

In a DPC++ program, we define a kernel, which is applied to every point in an index space. For simple programs like this one, the index space maps directly to the elements of the array. The kernel is encapsulated in a C++ lambda function. The lambda function is passed a point in the index space as an array of coordinates. For this simple program, the index space coordinate is the same as the array index. The parallel_for in the below program applies the lambda to the index space. The index space is defined in the first argument of the parallel_for as a 1 dimensional range from 0 to N-1.

The code below shows Simple Vector addition using DPC++. Read through the comments addressed in step 1 through step 6.

void dpcpp_code(int* a, int* b, int* c, int N) {
  //Step 1: create a device queue
  //(developer can specify a device type via device selector or use default selector)
  auto R = range<1>(N);
  queue q;
  //Step 2: create buffers (represent both host and device memory)
  buffer buf_a(a, R);
  buffer buf_b(b, R);
  buffer buf_c(c, R);
  //Step 3: submit a command for (asynchronous) execution
  q.submit([&](handler &h){
  //Step 4: create buffer accessors to access buffer data on the device
  accessor A(buf_a,h,read_only);
  accessor B(buf_b,h,read_only);
  accessor C(buf_c,h,write_only);
  
  //Step 5: send a kernel (lambda) for execution
  h.parallel_for(range<1>(N), [=](auto i){
    //Step 6: write a kernel
    //Kernel invocations are executed in parallel
    //Kernel is invoked for each element of the range
    //Kernel invocation has access to the invocation id
    C[i] = A[i] + B[i];
    });
  });
}
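A hypothetical driver for the function above (not part of the course sample, and assuming <vector> and <iostream> are included) is sketched below; because the buffers are created inside dpcpp_code, their destruction when the function returns copies the results back into c before it is read.

int main() {
  constexpr int N = 1024;
  std::vector<int> a(N, 1), b(N, 2), c(N, 0);
  dpcpp_code(a.data(), b.data(), c.data(), N);
  // The buffers were destroyed when dpcpp_code returned, so c now holds a + b.
  std::cout << "c[0] = " << c[0] << "\n";  // expected: 3
  return 0;
}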

Implicit dependency with Accessors

  • Accessors create data dependencies in the SYCL graph that order kernel executions
  • If two kernels use the same buffer, the second kernel needs to wait for the completion of the first kernel to avoid race conditions.

The DPC++ code below demonstrates implicit dependencies with accessors. Inspect the code; there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/buffer_sample.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>

constexpr int num=16;
using namespace sycl;

int main() {
  auto R = range<1>{ num };
  //Create Buffers A and B
  buffer<int> A{ R }, B{ R };
  //Create a device queue
  queue Q;
  //Submit Kernel 1
  Q.submit([&](handler& h) {
    //Accessor for buffer A
    accessor out(A,h,write_only);
    h.parallel_for(R, [=](auto idx) {
      out[idx] = idx[0]; }); });
  //Submit Kernel 2
  Q.submit([&](handler& h) {
    //This kernel will wait until the first kernel (which also writes buffer A) completes
    accessor out(A,h,write_only);
    h.parallel_for(R, [=](auto idx) {
      out[idx] += idx[0]; }); });
  //Submit Kernel 3
  Q.submit([&](handler& h) { 
    //Accessor for Buffer B
    accessor out(B,h,write_only);
    h.parallel_for(R, [=](auto idx) {
      out[idx] = idx[0]; }); });
  //Submit task 4
  Q.submit([&](handler& h) {
   //This task will wait till kernel 2 and 3 are complete
   accessor in (A,h,read_only);
   accessor inout(B,h);
  h.parallel_for(R, [=](auto idx) {
    inout[idx] *= in[idx]; }); }); 
      
  // The following is back on the host: creating the host_accessor blocks until
  // the kernels above complete and buffer B's data is available to the host
  host_accessor result(B,read_only);
  for (int i=0; i<num; ++i)
    std::cout << result[i] << "\n";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:

! chmod 755 q; chmod 755 run_buffer.sh;if [ -x "$(command -v qsub)" ]; then ./q run_buffer.sh; else ./run_buffer.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Host Accessors

A host accessor is an accessor that uses the host buffer access target. It is created outside the scope of a command group, and the data it gives access to is available on the host. Host accessors are used to synchronize data back to the host; constructing the host accessor object triggers that synchronization. Buffer destruction is the other way to synchronize data back to the host.

Synchronization: Host Accessor

The buffer takes ownership of the data stored in the vector. Creating a host accessor is a blocking call: it returns only after all enqueued DPC++ kernels that modify the same buffer in any queue complete execution, at which point the data is available to the host via the host accessor.

The DPC++ code below demonstrates synchronization with a host accessor. Inspect the code; there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/host_accessor_sample.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  constexpr int N = 16;
  auto R = range<1>(N);
  std::vector<int> v(N, 10);
  queue q;
  // Buffer takes ownership of the data stored in vector.  
  buffer buf(v);
  q.submit([&](handler& h) {
    accessor a(buf,h);
    h.parallel_for(R, [=](auto i) { a[i] -= 2; });
  });
  // Creating host accessor is a blocking call and will only return after all
  // enqueued DPC++ kernels that modify the same buffer in any queue completes
  // execution and the data is available to the host via this host accessor.
  host_accessor b(buf,read_only);
  for (int i = 0; i < N; i++) std::cout << b[i] << " ";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:

! chmod 755 q; chmod 755 run_host_accessor.sh;if [ -x "$(command -v qsub)" ]; then ./q run_host_accessor.sh; else ./run_host_accessor.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Synchronization: Buffer Destruction

In the example below, buffer creation happens within a separate function scope. When execution advances beyond this function scope, the buffer destructor is invoked, which relinquishes ownership of the data and copies it back to host memory.

The DPC++ code below demonstrates synchronization with buffer destruction. Inspect the code; there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to a file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/buffer_destruction2.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <CL/sycl.hpp>
constexpr int N = 16;
using namespace sycl;

// Buffer creation happens within a separate function scope.
void dpcpp_code(std::vector<int> &v, queue &q) {
  auto R = range<1>(N);
  buffer buf(v);
  q.submit([&](handler &h) {
    accessor a(buf,h);
    h.parallel_for(R, [=](auto i) { a[i] -= 2; });
  });
}
int main() {
  std::vector<int> v(N, 10);
  queue q;
  dpcpp_code(v, q);
  // When execution advances beyond this function scope, buffer destructor is
  // invoked which relinquishes the ownership of data and copies back the data to
  // the host memory.
  for (int i = 0; i < N; i++) std::cout << v[i] << " ";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:

! chmod 755 q; chmod 755 run_buffer_destruction.sh;if [ -x "$(command -v qsub)" ]; then ./q run_buffer_destruction.sh; else ./run_buffer_destruction.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Custom Device Selector

The following code shows a derived device_selector that employs a device selector heuristic. The selector prioritizes a GPU device because the integer rating it returns is higher than the rating for a CPU or another accelerator.

The DPC++ code below demonstrates a custom device selector. Inspect the code; there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to a file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/custom_device_sample.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;
class my_device_selector : public device_selector {
public:
    my_device_selector(std::string vendorName) : vendorName_(vendorName){};
    int operator()(const device& dev) const override {
    int rating = 0;
    //A GPU device from the requested vendor gets the highest rating (3),
    //any other GPU device gets the second preference (2), and a CPU device
    //gets the third preference (1).
    if (dev.is_gpu() & (dev.get_info<info::device::name>().find(vendorName_) != std::string::npos))
        rating = 3;
    else if (dev.is_gpu()) rating = 2;
    else if (dev.is_cpu()) rating = 1;
    return rating;
    };
    
private:
    std::string vendorName_;
};
int main() {
    //pass in the name of the vendor for which the device you want to query 
    std::string vendor_name = "Intel";
    //std::string vendor_name = "AMD";
    //std::string vendor_name = "Nvidia";
    my_device_selector selector(vendor_name);
    queue q(selector);
    std::cout << "Device: "
    << q.get_device().get_info<info::device::name>() << "\n";
    return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:

! chmod 755 q; chmod 755 run_custom_device.sh;if [ -x "$(command -v qsub)" ]; then ./q run_custom_device.sh; else ./run_custom_device.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Lab Exercise: Complex Number Multiplication

The following is the definition of a custom class type that represents complex numbers.

  • The file Complex.hpp defines the Complex2 class.
  • The Complex2 class has two member variables, "real" and "imag", of type int.
  • The Complex2 class has a member function for performing complex number multiplication. The function complex_mul returns a Complex2 object containing the product of the two complex numbers.
  • We are going to call the complex_mul function from our DPC++ code (a hypothetical sketch of the class appears below).
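The actual definition ships in Complex.hpp alongside the sample; the sketch below is only a hypothetical illustration of the shape described by the bullets above (int members real and imag, a complex_mul member function, plus the comparison and stream operators the lab code relies on), assuming <ostream> is included.

// Hypothetical sketch: see Complex.hpp in the sample for the real definition.
class Complex2 {
 private:
  int real, imag;

 public:
  Complex2() : real(0), imag(0) {}
  Complex2(int r, int i) : real(r), imag(i) {}
  // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
  Complex2 complex_mul(const Complex2 &other) const {
    return Complex2(real * other.real - imag * other.imag,
                    real * other.imag + imag * other.real);
  }
  bool operator!=(const Complex2 &other) const {
    return real != other.real || imag != other.imag;
  }
  friend std::ostream &operator<<(std::ostream &os, const Complex2 &c) {
    return os << "(" << c.real << " : " << c.imag << "i)";
  }
};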

LAB EXERCISE

  • In this lab we provide the source code that computes the multiplication of two complex numbers, where the Complex2 class is the definition of a custom type that represents complex numbers.
  • In this example you will learn how to use a custom device selector to target the GPU or CPU of a specific vendor, and then pass a vector of custom Complex2 class objects to a kernel in parallel. You need to modify the source code to select an Intel® GPU as the first choice, then set up a write accessor and call the Complex2 class member function as the kernel to compute the multiplication.
  • Follow Steps 1, 2, and 3 in the code below.
  • The Complex2 class in the example below demonstrates how a custom class can be passed to DPC++ code; it is not meant to replicate the functionality of the standard complex class. You can use the std::complex library as-is in a DPC++ program.
  1. Select the code cell below, follow the STEPS 1 to 3 in the code comments, click run ▶ to save the code to file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/complex_mult.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <iomanip>
#include <vector>
// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities/<version>/include/dpc_common.hpp
#include "dpc_common.hpp"
#include "Complex.hpp"

using namespace sycl;
using namespace std;

// Number of complex numbers passing to the DPC++ code
static const int num_elements = 10000;

class CustomDeviceSelector : public device_selector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};
  int operator()(const device &dev) const override {
    int device_rating = 0;
    //A GPU device from the requested vendor gets the highest rating (3),
    //any other GPU device gets the second preference (2), and a CPU device
    //gets the third preference (1).
    //**************Step1: Uncomment the following lines where you are setting the rating for the devices********
    /*if (dev.is_gpu() & (dev.get_info<info::device::name>().find(vendorName_) !=
                        std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;*/
    return device_rating;
  };

 private:
  std::string vendorName_;
};

// in_vect1 and in_vect2 are the vectors with num_elements complex numbers and
// are inputs to the parallel function
void DpcppParallel(queue &q, std::vector<Complex2> &in_vect1,
                   std::vector<Complex2> &in_vect2,
                   std::vector<Complex2> &out_vect) {
  auto R = range(in_vect1.size());
  if (in_vect2.size() != in_vect1.size() || out_vect.size() != in_vect1.size()){ 
    std::cout << "ERROR: Vector sizes do not  match"<< "\n";
    return;
  }
  // Setup input buffers
  buffer bufin_vect1(in_vect1);
  buffer bufin_vect2(in_vect2);

  // Setup Output buffers 
  buffer bufout_vect(out_vect);

  std::cout << "Target Device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  // Submit Command group function object to the queue
  q.submit([&](auto &h) {
    // Accessors set as read mode
    accessor V1(bufin_vect1,h,read_only);
    accessor V2(bufin_vect2,h,read_only);
    // Accessor set to Write mode
    //**************STEP 2: Uncomment the below line to set the Write Accessor******************** 
    //accessor V3 (bufout_vect,h,write_only);
    h.parallel_for(R, [=](auto i) {
      //**************STEP 3: Uncomment the below line to call the complex_mul function that computes the multiplication
      //of the  complex numbers********************
      //V3[i] = V1[i].complex_mul(V2[i]);
    });
  });
  q.wait_and_throw();
}
void DpcppScalar(std::vector<Complex2> &in_vect1,
                 std::vector<Complex2> &in_vect2,
                 std::vector<Complex2> &out_vect) {
  if ((in_vect2.size() != in_vect1.size()) || (out_vect.size() != in_vect1.size())){
    std::cout<<"ERROR: Vector sizes do not match"<<"\n";
    return;
    }
  for (int i = 0; i < in_vect1.size(); i++) {
    out_vect[i] = in_vect1[i].complex_mul(in_vect2[i]);
  }
}
// Compare the results of the two output vectors from parallel and scalar. They
// should be equal
int Compare(std::vector<Complex2> &v1, std::vector<Complex2> &v2) {
  int ret_code = 1;
  if(v1.size() != v2.size()){
    ret_code = -1;
  }
  for (int i = 0; i < v1.size(); i++) {
    if (v1[i] != v2[i]) {
      ret_code = -1;
      break;
    }
  }
  return ret_code;
}
int main() {
  // Declare your Input and Output vectors of the Complex2 class
  vector<Complex2> input_vect1;
  vector<Complex2> input_vect2;
  vector<Complex2> out_vect_parallel;
  vector<Complex2> out_vect_scalar;

  for (int i = 0; i < num_elements; i++) {
    input_vect1.push_back(Complex2(i + 2, i + 4));
    input_vect2.push_back(Complex2(i + 4, i + 6));
    out_vect_parallel.push_back(Complex2(0, 0));
    out_vect_scalar.push_back(Complex2(0, 0));
  }

  // Initialize your Input and Output Vectors. Inputs are initialized as below.
  // Outputs are initialized with 0
  try {
    // Pass in the name of the vendor for which the device you want to query
    std::string vendor_name = "Intel";
    // std::string vendor_name = "AMD";
    // std::string vendor_name = "Nvidia";
    // queue constructor passed exception handler
    CustomDeviceSelector selector(vendor_name);
    queue q(selector, dpc_common::exception_handler);
    // Call the DpcppParallel with the required inputs and outputs
    DpcppParallel(q, input_vect1, input_vect2, out_vect_parallel);
  } catch (...) {
    // some other exception detected
    std::cout << "Failure" << "\n";
    std::terminate();
  }

  std::cout
      << "****************************************Multiplying Complex numbers "
         "in Parallel********************************************************"
      << "\n";
  // Print the outputs of the Parallel function
  int indices[]{0, 1, 2, 3, 4, (num_elements - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);

  for (int i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "] " << input_vect1[j] << " * " << input_vect2[j]
              << " = " << out_vect_parallel[j] << "\n";
  }
  // Call the DpcppScalar function with the required input and outputs
  DpcppScalar(input_vect1, input_vect2, out_vect_scalar);

  // Compare the outputs from the parallel and the scalar functions. They should
  // be equal

  int ret_code = Compare(out_vect_parallel, out_vect_scalar);
  if (ret_code == 1) {
    std::cout << "Complex multiplication successfully run on the device"
              << "\n";
  } else
    std::cout
        << "*********************************************Verification Failed. Results are "
           "not matched**************************"
        << "\n";

  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:

! chmod 755 q; chmod 755 run_complex_mult.sh; if [ -x "$(command -v qsub)" ]; then ./q run_complex_mult.sh; else ./run_complex_mult.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Solution


Summary

In this module you learned:

  • The fundamental SYCL classes
  • How to select a device to offload kernel workloads to
  • How to write a DPC++ program using buffers, accessors, a command group handler, and kernels
  • How to use host accessors and buffer destruction for synchronization

Survey

We would appreciate any feedback you’d care to give, so that we can improve the overall training quality and experience. Thanks!

DPC++ Unified Shared Memory


Last modified: 2022-02-23, teratec
Sections

Learning Objectives

  • Explain the SYCL fundamental classes
  • Use device selection to offload kernel workloads
  • Decide when to use basic parallel kernels and ND Range Kernels
  • Create a host Accessor
  • Build a sample DPC++ application through hands-on lab exercises

What is Data Parallel C++?

oneAPI programs are written in Data Parallel C++ (DPC++). It is based on modern C++ productivity benefits and familiar constructs, and incorporates the SYCL standard for data parallelism and heterogeneous programming. DPC++ is single-source: host code and heterogeneous accelerator kernels can be mixed in the same source files. A DPC++ program is invoked on the host computer and offloads the computation to an accelerator. Programmers use familiar C++ and library constructs with added functionalities like a queue for work targeting, a buffer for data management, and parallel_for for parallelism to direct which parts of the computation and data should be offloaded.

Device

The device class represents the capabilities of the accelerators in a system utilizing Intel® oneAPI Toolkits. The device class contains member functions for querying information about the device, which is useful for DPC++ programs where multiple devices are created.

  • The function get_info gives information about the device:
  • Name, vendor, and version of the device
  • The local and global work item IDs
  • Width for built-in types, clock frequency, cache width and sizes, online or offline
queue q;
device my_device = q.get_device();
std::cout << "Device: " << my_device.get_info<info::device::name>() << "\n";

Device Selector

The device_selector class enables the runtime selection of a particular device to execute kernels based upon user-provided heuristics. The following code sample shows use of the standard device selectors (default_selector, cpu_selector, gpu_selector…) and a derived device_selector.

default_selector selector;
// host_selector selector;
// cpu_selector selector;
// gpu_selector selector;
queue q(selector);
std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

The DPC++ code below shows different device selectors: Inspect code, there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

[ ]:





%%writefile lab/gpu_sample.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace cl::sycl;
int main() {
  //# Create a device queue with device selector
  
  gpu_selector selector;
  //cpu_selector selector;
  //default_selector selector;
  //host_selector selector;
  
  queue q(selector);
  //# Print the device name
  std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:





! chmod 755 q; chmod 755 run_gpu.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu.sh; else ./run_gpu.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Queue

A queue submits command groups to be executed by the SYCL runtime; it is the mechanism through which work is submitted to a device. A queue maps to one device, and multiple queues can be mapped to the same device (a short sketch of this follows the snippet below).

q.submit([&](handler& h) {
    //COMMAND GROUP CODE
});
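
As a minimal sketch of that last point, two queues can be bound to the same device; the empty single_task kernels below are placeholders standing in for real work:

queue q1;                        // default-selected device
queue q2(q1.get_device());       // a second queue explicitly bound to the same device

q1.submit([&](handler& h) { h.single_task([]() { /* work A */ }); });
q2.submit([&](handler& h) { h.single_task([]() { /* work B */ }); });
q1.wait();
q2.wait();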

Kernel

The kernel class encapsulates methods and data for executing code on the device when a command group is instantiated. Kernel object is not explicitly constructed by the user and is is constructed when a kernel dispatch function, such as parallel_for, is called

q.submit([&](handler& h) {
  h.parallel_for(range<1>(N), [=](id<1> i) {
    A[i] = B[i] + C[i];
  });
});

Choosing where device kernels run

Work is submitted to queues and each queue is associated with exactly one device (e.g. a specific GPU or FPGA). You can decide which device a queue is associated with (if you want) and have as many queues as desired for dispatching work in heterogeneous systems.

Target device → queue:

  • Create a queue targeting any device: queue()
  • Create a queue targeting a pre-configured class of devices: queue(cpu_selector{}); queue(gpu_selector{}); queue(INTEL::fpga_selector{}); queue(accelerator_selector{}); queue(host_selector{});
  • Create a queue targeting a specific device (custom criteria): derive a selector, e.g. class custom_selector : public device_selector { int operator()(…) … /* any logic you want */ … }; then queue(custom_selector{});


DPC++ Language and Runtime

The DPC++ language and runtime consist of a set of C++ classes, templates, and libraries.

Application scope and command group scope:

  • Code that executes on the host
  • The full capabilities of C++ are available at application and command group scope

Kernel scope:

  • Code that executes on the device.
  • At kernel scope there are limitations in accepted C++

Parallel Kernels

A parallel kernel allows multiple instances of an operation to execute in parallel. This is useful for offloading parallel execution of a basic for-loop in which each iteration is completely independent and can run in any order. Parallel kernels are expressed using the parallel_for function. A simple 'for' loop in a C++ application is written as below:

for(int i=0; i < 1024; i++){
    a[i] = b[i] + c[i];
}

Below is how you can offload it to the accelerator:

h.parallel_for(range<1>(1024), [=](id<1> i){
    A[i] =  B[i] + C[i];
});

Basic Parallel Kernels

The functionality of basic parallel kernels is exposed via the range, id, and item classes. The range class is used to describe the iteration space of parallel execution, and the id class is used to index an individual instance of a kernel in a parallel execution.

h.parallel_for(range<1>(1024), [=](id<1> i){
// CODE THAT RUNS ON DEVICE 
});

The above example is good if all you need is the index (id), but if you also need the range value in your kernel code, you can use the item class instead of the id class, which lets you query the range as shown below. The item class represents an individual instance of a kernel function and exposes additional functions to query properties of the execution range.

h.parallel_for(range<1>(1024), [=](item<1> item){
    auto i = item.get_id();
    auto R = item.get_range();
    // CODE THAT RUNS ON DEVICE
    
    
});

ND-Range Kernels

Basic parallel kernels are an easy way to parallelize a for-loop but do not allow performance optimization at the hardware level. An ND-range kernel is another way to express parallelism; it enables low-level performance tuning by providing access to local memory and by mapping executions to compute units on the hardware. The entire iteration space is divided into smaller groups called work-groups; work-items within a work-group are scheduled on a single compute unit on the hardware.

Grouping kernel executions into work-groups allows control of resource usage and load-balanced work distribution. The functionality of nd_range kernels is exposed via the nd_range and nd_item classes. The nd_range class represents a grouped execution range using the global execution range and the local execution range of each work-group. The nd_item class represents an individual instance of a kernel function and allows querying the work-group range and index.

h.parallel_for(nd_range<1>(range<1>(1024),range<1>(64)), [=](nd_item<1> item){
    auto idx = item.get_global_id();
    auto local_id = item.get_local_id();
    // CODE THAT RUNS ON DEVICE
});


Buffer Model

Buffers encapsulate data in a SYCL application across both the device and the host. Accessors are the mechanism for accessing buffer data.

DPC++ Code Anatomy

Programs which utilize oneAPI require including CL/sycl.hpp. It is recommended to employ a using-namespace statement to avoid repeatedly typing references into the cl::sycl namespace.

#include <CL/sycl.hpp>
using namespace cl::sycl;

DPC++ programs are standard C++. The program is invoked on the host computer, and offloads computation to the accelerator. A programmer uses DPC++’s queue, buffer, device, and kernel abstractions to direct which parts of the computation and data should be offloaded.

As a first step in a DPC++ program we create a queue. We offload computation to a device by submitting tasks to a queue. The programmer can choose CPU, GPU, FPGA, and other devices through the selector. This program uses the default q here, which means DPC++ runtime selects the most capable device available at runtime by using the default selector. We will talk about the devices, device selectors, and the concepts of buffers, accessors and kernels in the upcoming modules but below is a simple DPC++ program for you to get started with the above concepts.

Device and host can either share physical memory or have distinct memories. When the memories are distinct, offloading computation requires copying data between host and device. DPC++ does not require the programmer to manage the data copies. By creating Buffers and Accessors, DPC++ ensures that the data is available to host and device without any programmer effort. DPC++ also allows the programmer explicit control over data movement when it is necessary to achieve best performance.

In a DPC++ program, we define a kernel, which is applied to every point in an index space. For simple programs like this one, the index space maps directly to the elements of the array. The kernel is encapsulated in a C++ lambda function. The lambda function is passed a point in the index space as an array of coordinates. For this simple program, the index space coordinate is the same as the array index. The parallel_for in the below program applies the lambda to the index space. The index space is defined in the first argument of the parallel_for as a 1 dimensional range from 0 to N-1.

The code below shows Simple Vector addition using DPC++. Read through the comments addressed in step 1 through step 6.

void dpcpp_code(int* a, int* b, int* c, int N) {
  //Step 1: create a device queue
  //(developer can specify a device type via device selector or use default selector)
  auto R = range<1>(N);
  queue q;
  //Step 2: create buffers (represent both host and device memory)
  buffer buf_a(a, R);
  buffer buf_b(b, R);
  buffer buf_c(c, R);
  //Step 3: submit a command for (asynchronous) execution
  q.submit([&](handler &h){
  //Step 4: create buffer accessors to access buffer data on the device
  accessor A(buf_a,h,read_only);
  accessor B(buf_b,h,read_only);
  accessor C(buf_c,h,write_only);
  
  //Step 5: send a kernel (lambda) for execution
  h.parallel_for(range<1>(N), [=](auto i){
    //Step 6: write a kernel
    //Kernel invocations are executed in parallel
    //Kernel is invoked for each element of the range
    //Kernel invocation has access to the invocation id
    C[i] = A[i] + B[i];
    });
  });
}
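
The snippet above only defines dpcpp_code. A hypothetical driver for it might look like the sketch below (the array size and values are illustrative and not part of the sample; it assumes the snippet above is compiled in the same file with the usual #include <CL/sycl.hpp> and using namespace declarations). Because buf_a, buf_b and buf_c are destroyed when dpcpp_code returns, c already holds the results on the host when it is printed:

#include <iostream>
#include <vector>

int main() {
  constexpr int N = 1024;
  std::vector<int> a(N, 1), b(N, 2), c(N, 0);
  dpcpp_code(a.data(), b.data(), c.data(), N);   // offloads c[i] = a[i] + b[i]
  std::cout << "c[0] = " << c[0] << ", c[N-1] = " << c[N - 1] << "\n";  // expect 3 and 3
  return 0;
}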

Implicit dependency with Accessors

  • Accessors create data dependencies in the SYCL graph that order kernel executions
  • If two kernels use the same buffer, the second kernel needs to wait for the completion of the first kernel to avoid race conditions.


The DPC++ code below demonstrates Implicit dependency with Accessors: Inspect code, there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

[ ]:





%%writefile lab/buffer_sample.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
constexpr int num=16;
using namespace sycl;
  int main() {
  auto R = range<1>{ num };
  //Create Buffers A and B
  buffer<int> A{ R }, B{ R };
  //Create a device queue
  queue Q;
  //Submit Kernel 1
  Q.submit([&](handler& h) {
    //Accessor for buffer A
    accessor out(A,h,write_only);
    h.parallel_for(R, [=](auto idx) {
      out[idx] = idx[0]; }); });
  //Submit Kernel 2
  Q.submit([&](handler& h) {
    //This kernel will wait until the first kernel that writes to buffer A is complete
    accessor out(A,h,write_only);
    h.parallel_for(R, [=](auto idx) {
      out[idx] += idx[0]; }); });
  //Submit Kernel 3
  Q.submit([&](handler& h) { 
    //Accessor for Buffer B
    accessor out(B,h,write_only);
    h.parallel_for(R, [=](auto idx) {
      out[idx] = idx[0]; }); });
  //Submit task 4
  Q.submit([&](handler& h) {
   //This task will wait till kernel 2 and 3 are complete
   accessor in (A,h,read_only);
   accessor inout(B,h);
  h.parallel_for(R, [=](auto idx) {
    inout[idx] *= in[idx]; }); }); 
      
 // And the following is back to host code; the host_accessor synchronizes buffer B
 host_accessor result(B,read_only);
  for (int i=0; i<num; ++i)
    std::cout << result[i] << "\n";      
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:





! chmod 755 q; chmod 755 run_buffer.sh;if [ -x "$(command -v qsub)" ]; then ./q run_buffer.sh; else ./run_buffer.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Host Accessors

The Host Accessor is an accessor which uses host buffer access target. It is created outside of the scope of the command group and the data that this gives access to will be available on the host. These are used to synchronize the data back to the host by constructing the host accessor objects. Buffer destruction is the other way to synchronize the data back to the host.

Synchronization: Host Accessor

The buffer takes ownership of the data stored in the vector. Creating a host accessor is a blocking call: it returns only after all enqueued DPC++ kernels that modify the same buffer, in any queue, complete execution and the data is available to the host via this host accessor.

The DPC++ code below demonstrates Synchronization with Host Accessor: Inspect code, there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

[ ]:





%%writefile lab/host_accessor_sample.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;
int main() {
  constexpr int N = 16;
  auto R = range<1>(N);
  std::vector<int> v(N, 10);
  queue q;
  // Buffer takes ownership of the data stored in vector.  
  buffer buf(v);
  q.submit([&](handler& h) {
    accessor a(buf,h);
    h.parallel_for(R, [=](auto i) { a[i] -= 2; });
  });
  // Creating host accessor is a blocking call and will only return after all
  // enqueued DPC++ kernels that modify the same buffer in any queue completes
  // execution and the data is available to the host via this host accessor.
  host_accessor b(buf,read_only);
  for (int i = 0; i < N; i++) std::cout << b[i] << " ";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:





! chmod 755 q; chmod 755 run_host_accessor.sh;if [ -x "$(command -v qsub)" ]; then ./q run_host_accessor.sh; else ./run_host_accessor.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Synchronization: Buffer Destruction

In the example below, buffer creation happens within a separate function scope. When execution advances beyond this function scope, the buffer destructor is invoked, which relinquishes ownership of the data and copies it back to the host memory.

The DPC++ code below demonstrates Synchronization with Buffer Destruction: Inspect code, there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to a file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

[ ]:





%%writefile lab/buffer_destruction2.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
constexpr int N = 16;
using namespace sycl;
// Buffer creation happens within a separate function scope.
void dpcpp_code(std::vector<int> &v, queue &q) {
  auto R = range<1>(N);
  buffer buf(v);
  q.submit([&](handler &h) {
    accessor a(buf,h);
    h.parallel_for(R, [=](auto i) { a[i] -= 2; });
  });
}
int main() {
  std::vector<int> v(N, 10);
  queue q;
  dpcpp_code(v, q);
  // When execution advances beyond this function scope, buffer destructor is
  // invoked which relinquishes the ownership of data and copies back the data to
  // the host memory.
  for (int i = 0; i < N; i++) std::cout << v[i] << " ";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:





! chmod 755 q; chmod 755 run_buffer_destruction.sh;if [ -x "$(command -v qsub)" ]; then ./q run_buffer_destruction.sh; else ./run_buffer_destruction.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Custom Device Selector

The following code shows a derived device_selector that employs a device-selection heuristic. The selector prioritizes a GPU device because the integer rating returned for it is higher than for a CPU or another accelerator.

The DPC++ code below demonstrates Custom Device Selector: Inspect code, there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to a file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

[ ]:





%%writefile lab/custom_device_sample.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;
class my_device_selector : public device_selector {
public:
    my_device_selector(std::string vendorName) : vendorName_(vendorName){};
    int operator()(const device& dev) const override {
    int rating = 0;
    //A GPU from the requested vendor gets the highest rating (3), any other GPU
    //gets 2, and a CPU gets 1.
    if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) != std::string::npos))
        rating = 3;
        rating = 3;
    else if (dev.is_gpu()) rating = 2;
    else if (dev.is_cpu()) rating = 1;
    return rating;
    };
    
private:
    std::string vendorName_;
};
int main() {
    //pass in the name of the vendor for which the device you want to query 
    std::string vendor_name = "Intel";
    //std::string vendor_name = "AMD";
    //std::string vendor_name = "Nvidia";
    my_device_selector selector(vendor_name);
    queue q(selector);
    std::cout << "Device: "
    << q.get_device().get_info<info::device::name>() << "\n";
    return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:





! chmod 755 q; chmod 755 run_custom_device.sh;if [ -x "$(command -v qsub)" ]; then ./q run_custom_device.sh; else ./run_custom_device.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Lab Exercise: Complex Number Multiplication

The following is the definition of a custom class type that represents complex numbers.

  • The file Complex.hpp defines the Complex2 class.
  • The Complex2 class has two member variables, "real" and "imag", of type int.
  • The Complex2 class has a member function for performing complex number multiplication: complex_mul returns a Complex2 object holding the product of two complex numbers (a minimal sketch of such a class follows this list).
  • We are going to call the complex_mul function from our DPC++ code.
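
Complex.hpp ships with the sample and is not reproduced in this notebook. Purely for illustration, a minimal class with the members described above could look like the sketch below (the real header may differ in details such as the exact formatting of the output operator):

#include <iostream>

class Complex2 {
 public:
  Complex2(int r, int i) : real(r), imag(i) {}
  // (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
  Complex2 complex_mul(const Complex2 &other) const {
    return Complex2(real * other.real - imag * other.imag,
                    real * other.imag + imag * other.real);
  }
  bool operator!=(const Complex2 &other) const {
    return real != other.real || imag != other.imag;
  }
  friend std::ostream &operator<<(std::ostream &os, const Complex2 &c) {
    return os << "(" << c.real << ", " << c.imag << "i)";
  }
 private:
  int real, imag;
};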

LAB EXERCISE

  • In this lab we provide the source code that computes the multiplication of two complex numbers, where the Complex2 class is the definition of a custom type that represents complex numbers.
  • In this example the student will learn how to use a custom device selector to target the GPU or CPU of a specific vendor, and then pass a vector of custom Complex2 class objects to a kernel in parallel. The student needs to modify the source code to select an Intel® GPU as the first choice, then set up a write accessor and call the Complex2 class member function as the kernel to compute the multiplication.
  • Follow Steps 1, 2 and 3 in the code below.
  • The Complex2 class in this example demonstrates how a custom class can be passed into DPC++ code; it is not meant to replicate the functionality of std::complex. You can use the std::complex library as-is in a DPC++ program.
  1. Select the code cell below, follow the STEPS 1 to 3 in the code comments, click run ▶ to save the code to file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

[ ]:





%%writefile lab/complex_mult.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <iomanip>
#include <vector>
// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities/<version>/include/dpc_common.hpp
#include "dpc_common.hpp"
#include "Complex.hpp"
using namespace sycl;
using namespace std;
// Number of complex numbers passing to the DPC++ code
static const int num_elements = 10000;
class CustomDeviceSelector : public device_selector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};
  int operator()(const device &dev) const override {
    int device_rating = 0;
    //We are querying for the custom device specific to a Vendor and if it is a GPU device we
    //are giving the highest rating as 3 . The second preference is given to any GPU device and the third preference is given to
    //CPU device. 
    //**************Step1: Uncomment the following lines where you are setting the rating for the devices********
    /*if (dev.is_gpu() & (dev.get_info<info::device::name>().find(vendorName_) !=
                        std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;*/
    return device_rating;
  };
 private:
  std::string vendorName_;
};
// in_vect1 and in_vect2 are the vectors with num_elements complex numbers and
// are inputs to the parallel function
void DpcppParallel(queue &q, std::vector<Complex2> &in_vect1,
                   std::vector<Complex2> &in_vect2,
                   std::vector<Complex2> &out_vect) {
  auto R = range(in_vect1.size());
  if (in_vect2.size() != in_vect1.size() || out_vect.size() != in_vect1.size()){ 
    std::cout << "ERROR: Vector sizes do not  match"<< "\n";
    return;
  }
  // Setup input buffers
  buffer bufin_vect1(in_vect1);
  buffer bufin_vect2(in_vect2);
  // Setup Output buffers 
  buffer bufout_vect(out_vect);
  std::cout << "Target Device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  // Submit Command group function object to the queue
  q.submit([&](auto &h) {
    // Accessors set as read mode
    accessor V1(bufin_vect1,h,read_only);
    accessor V2(bufin_vect2,h,read_only);
    // Accessor set to Write mode
    //**************STEP 2: Uncomment the below line to set the Write Accessor******************** 
    //accessor V3 (bufout_vect,h,write_only);
    h.parallel_for(R, [=](auto i) {
      //**************STEP 3: Uncomment the below line to call the complex_mul function that computes the multiplication
      //of the  complex numbers********************
      //V3[i] = V1[i].complex_mul(V2[i]);
    });
  });
  q.wait_and_throw();
}
void DpcppScalar(std::vector<Complex2> &in_vect1,
                 std::vector<Complex2> &in_vect2,
                 std::vector<Complex2> &out_vect) {
  if ((in_vect2.size() != in_vect1.size()) || (out_vect.size() != in_vect1.size())){
    std::cout<<"ERROR: Vector sizes do not match"<<"\n";
    return;
    }
  for (int i = 0; i < in_vect1.size(); i++) {
    out_vect[i] = in_vect1[i].complex_mul(in_vect2[i]);
  }
}
// Compare the results of the two output vectors from parallel and scalar. They
// should be equal
int Compare(std::vector<Complex2> &v1, std::vector<Complex2> &v2) {
  int ret_code = 1;
  if(v1.size() != v2.size()){
    ret_code = -1;
  }
  for (int i = 0; i < v1.size(); i++) {
    if (v1[i] != v2[i]) {
      ret_code = -1;
      break;
    }
  }
  return ret_code;
}
int main() {
  // Declare your Input and Output vectors of the Complex2 class
  vector<Complex2> input_vect1;
  vector<Complex2> input_vect2;
  vector<Complex2> out_vect_parallel;
  vector<Complex2> out_vect_scalar;
  for (int i = 0; i < num_elements; i++) {
    input_vect1.push_back(Complex2(i + 2, i + 4));
    input_vect2.push_back(Complex2(i + 4, i + 6));
    out_vect_parallel.push_back(Complex2(0, 0));
    out_vect_scalar.push_back(Complex2(0, 0));
  }
  // Initialize your Input and Output Vectors. Inputs are initialized as below.
  // Outputs are initialized with 0
  try {
    // Pass in the name of the vendor for which the device you want to query
    std::string vendor_name = "Intel";
    // std::string vendor_name = "AMD";
    // std::string vendor_name = "Nvidia";
    // queue constructor passed exception handler
    CustomDeviceSelector selector(vendor_name);
    queue q(selector, dpc_common::exception_handler);
    // Call the DpcppParallel with the required inputs and outputs
    DpcppParallel(q, input_vect1, input_vect2, out_vect_parallel);
  } catch (...) {
    // some other exception detected
    std::cout << "Failure" << "\n";
    std::terminate();
  }
  std::cout
      << "****************************************Multiplying Complex numbers "
         "in Parallel********************************************************"
      << "\n";
  // Print the outputs of the Parallel function
  int indices[]{0, 1, 2, 3, 4, (num_elements - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);
  for (int i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "] " << input_vect1[j] << " * " << input_vect2[j]
              << " = " << out_vect_parallel[j] << "\n";
  }
  // Call the DpcppScalar function with the required input and outputs
  DpcppScalar(input_vect1, input_vect2, out_vect_scalar);
  // Compare the outputs from the parallel and the scalar functions. They should
  // be equal
  int ret_code = Compare(out_vect_parallel, out_vect_scalar);
  if (ret_code == 1) {
    std::cout << "Complex multiplication successfully run on the device"
              << "\n";
  } else
    std::cout
        << "*********************************************Verification Failed. Results are "
           "not matched**************************"
        << "\n";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:[ ]:





! chmod 755 q; chmod 755 run_complex_mult.sh; if [ -x "$(command -v qsub)" ]; then ./q run_complex_mult.sh; else ./run_complex_mult.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Solution


Summary

In this module you learned:

  • The fundamental SYCL Classes
  • How to select the device to offload kernel workloads to
  • How to write a DPC++ program using Buffers, Accessors, Command Group handler, and kernel
  • How to use the Host accessors and Buffer destruction to do the synchronization

Survey

We would appreciate any feedback you’d care to give, so that we can improve the overall training quality and experience. Thanks!

DPC++ Sub-Groups


마지막 수정일: 2022-02-23, teratec

Subgroups

Sections

Learning Objectives

  • Understand advantages of using Subgroups in Data Parallel C++ (DPC++)
  • Take advantage of Subgroup collectives in ND-Range kernel implementation
  • Use Subgroup Shuffle operations to avoid explicit memory operations

What are Subgroups?

On many modern hardware platforms, subsets of the work-items in a work-group are executed simultaneously or with additional scheduling guarantees. These subsets of work-items are called subgroups. Leveraging subgroups helps map execution to low-level hardware and may help in achieving higher performance.

Subgroups in ND-Range Kernel Execution

Parallel execution with the ND_RANGE Kernel helps to group work items that map to hardware resources. This helps to tune applications for performance.

The execution range of an ND-range kernel is divided into work-groups, subgroups and work-items as shown in picture below.

How a Subgroup Maps to Graphics Hardware

  • Work-item: represents an individual instance of a kernel function.
  • Work-group: the entire iteration space is divided into smaller groups called work-groups; work-items within a work-group are scheduled on a single compute unit on hardware.
  • Subgroup: a subset of work-items within a work-group that are executed simultaneously and may be mapped to vector hardware. (DPC++)

The picture below shows how work-groups and subgroups map to Intel® Gen11 Graphics Hardware.

Why use Subgroups?

  • Work-items in a sub-group can communicate directly using shuffle operations, without explicit memory operations.
  • Work-items in a sub-group can synchronize using sub-group barriers and guarantee memory consistency using sub-group memory fences (a minimal barrier sketch follows this list).
  • Work-items in a sub-group have access to sub-group functions and algorithms, providing fast implementations of common parallel patterns.
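
Shuffles and collectives are demonstrated in the labs that follow; sub-group barriers are not, so here is a minimal sketch of the barrier point, assuming a SYCL 2020-capable compiler and the same N, B and queue q used in the labs below:

q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
  auto sg = item.get_sub_group();
  // ... each work-item writes data visible to the rest of its sub-group ...
  group_barrier(sg);  // all work-items of this sub-group reach this point first
  // ... reads after the barrier observe the writes made before it ...
}).wait();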

sub_group class

The subgroup handle can be obtained from the nd_item using get_sub_group():

        sycl::sub_group sg = nd_item.get_sub_group();

                 OR

        auto sg = nd_item.get_sub_group();

Once you have the subgroup handle, you can query for more information about the subgroup, do shuffle operations or use collective functions.

Subgroup info

The subgroup handle can be queried to get other information like number of work-items in subgroup, or number of subgroups in a work-group which will be needed for developers to implement kernel code using subgroups:

  • get_local_id() returns the index of the work-item within its subgroup
  • get_local_range() returns the size of sub_group
  • get_group_id() returns the index of the subgroup
  • get_group_range() returns the number of subgroups within the parent work-group
    h.parallel_for(nd_range<1>(64,64), [=](nd_item<1> item){
      /* get sub_group handle */
      auto sg = item.get_sub_group();
      /* query sub_group and print sub_group info once per sub_group */
      if(sg.get_local_id()[0] == 0){
        out << "sub_group id: " << sg.get_group_id()[0]
            << " of " << sg.get_group_range()[0]
            << ", size=" << sg.get_local_range()[0]
            << "\n";
      }
    });

Lab Exercise: Subgroup Info

The DPC++ code below demonstrates subgroup query methods to print sub-group info: Inspect code, there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/sub_group_info.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 64; // global size
static constexpr size_t B = 64; // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  q.submit([&](handler &h) {
    //# setup sycl stream class to print standard output from device code
    auto out = stream(1024, 768, h);

    //# nd-range kernel
    h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
      //# get sub_group handle
      auto sg = item.get_sub_group();

      //# query sub_group and print sub_group info once per sub_group
      if (sg.get_local_id()[0] == 0) {
        out << "sub_group id: " << sg.get_group_id()[0] << " of "
            << sg.get_group_range()[0] << ", size=" << sg.get_local_range()[0]
            << "\n";
      }
    });
  }).wait();
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:
! chmod 755 q; chmod 755 run_sub_group_info.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_info.sh; else ./run_sub_group_info.sh; fi
If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Subgroup Size

When tuning applications for performance, the sub-group size may have to be set to a specific value. For example, Intel® GPUs support sub-group sizes of 8, 16 and 32; by default the compiler implementation will pick an optimal sub-group size, but it can also be forced to use a specific value.

The supported sub-group sizes for a GPU can be queried from device information as shown below:

auto sg_sizes = q.get_device().get_info<info::device::sub_group_sizes>();
                                                      ^

reqd_sub_group_size(S) allows setting a specific sub-group size to use for kernel execution; the specified value should be one of the supported sizes and must be a compile-time constant.

    q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item)[[intel::reqd_sub_group_size(16)]] {
                                                          ^
        // Kernel Code

    }).wait();

Lab Exercise: Subgroup Size

The code below shows how to query for supported sub-group sizes, and also how to set kernel to use a specific supported sub-group size.

The DPC++ code below demonstrates how to use reqd_sub_group_size() to make the kernel use a specified sub-group size; change the value 32 to 16 or 8 to change the sub-group size and check the output:

  1. Inspect the code cell below and click run ▶ to save the code to file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/sub_group_reqd_size.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 64; // global size
static constexpr size_t B = 64; // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  //# get all supported sub_group sizes and print
  auto sg_sizes = q.get_device().get_info<info::device::sub_group_sizes>();
  std::cout << "Supported Sub-Group Sizes : ";
  for (int i=0; i<sg_sizes.size(); i++) std::cout << sg_sizes[i] << " "; std::cout << "\n";
    
  //# find out maximum supported sub_group size
  auto max_sg_size = std::max_element(sg_sizes.begin(), sg_sizes.end());
  std::cout << "Max Sub-Group Size        : " << max_sg_size[0] << "\n";
    
  q.submit([&](handler &h) {
    //# setup sycl stream class to print standard output from device code
    auto out = stream(1024, 768, h);

    //# nd-range kernel with user specified sub_group size
    h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item)[[intel::reqd_sub_group_size(32)]] {
      //# get sub_group handle
      auto sg = item.get_sub_group();

      //# query sub_group and print sub_group info once per sub_group
      if (sg.get_local_id()[0] == 0) {
        out << "sub_group id: " << sg.get_group_id()[0] << " of "
            << sg.get_group_range()[0] << ", size=" << sg.get_local_range()[0]
            << "\n";
      }
    });
  }).wait();
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:

! chmod 755 q; chmod 755 run_sub_group_reqd_size.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_reqd_size.sh; else ./run_sub_group_reqd_size.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Subgroup Functions and Algorithms

The sub-group functions and algorithms expose functionality tied to work-items within a sub-group.

Providing these implementations as library functions, instead of leaving them for developers to write, increases developer productivity and gives implementations the ability to generate highly optimized code for individual target devices.

Below are some of the functions and algorithms available for sub-groups; they include useful functionality to perform shuffles, reductions, scans and votes (the shuffles, reductions, broadcasts and votes are exercised in the labs below, and a short scan sketch follows this list):

  • select_by_group
  • shift_group_left
  • shift_group_right
  • permute_group_by_xor
  • group_broadcast
  • reduce_over_group
  • exclusive_scan_over_group
  • inclusive_scan_over_group
  • any_of_group
  • all_of_group
  • none_of_group
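
The scans are the only item in this list not exercised in the labs below, so here is a short sketch of inclusive_scan_over_group, assuming the same N, B, queue q and shared data array used in the shuffle and reduce labs:

q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
  auto sg = item.get_sub_group();
  auto i = item.get_global_id(0);
  // each work-item gets the sum of data[] over itself and the
  // lower-indexed work-items of its sub-group (a prefix sum)
  data[i] = inclusive_scan_over_group(sg, data[i], plus<>());
}).wait();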

Subgroup Shuffle

One of the most useful features of subgroups is the ability to communicate directly between individual work-items without explicit memory operations.

Shuffle operations enable us to remove work-group local memory usage from our kernels and/or to avoid unnecessary repeated accesses to global memory.

Below are the different types of shuffle operations available for sub-groups:

  • select_by_group(sg, x, id)
  • shift_group_left(sg, x, delta)
  • shift_group_right(sg, x, delta)
  • permute_group_by_xor(sg, x, mask)

The code below uses permute_group_by_xor to swap the values of two work-items:

h.parallel_for(nd_range<1>(N,B), [=](nd_item<1> item){
      auto sg = item.get_sub_group();
      auto i = item.get_global_id(0);
      /* Shuffles */
      //data[i] = select_by_group(sg, data[i], 2);
      //data[i] = shift_group_left(sg, data[i], 1);
      //data[i] = shift_group_right(sg, data[i], 1);
      data[i] = permute_group_by_xor(sg, data[i], 1);
    });

Lab Exercise: Subgroup Shuffle

The code below uses a subgroup shuffle to swap items in a subgroup. You can try other shuffle operations, or change the fixed constant in the shuffle function, to express some common communication patterns using permute_group_by_xor.

The DPC++ code below demonstrates sub-group shuffle operations: it shows how permute_group_by_xor can be used to swap adjacent elements in a sub-group, and you can change the code to reverse the order of elements in a sub-group using a different mask.

  1. Inspect the code cell below and click run ▶ to save the code to file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/sub_group_shuffle.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 256; // global size
static constexpr size_t B = 64;  // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  //# initialize data array using usm
  int *data = malloc_shared<int>(N, q);
  for (int i = 0; i < N; i++) data[i] = i;
  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << "\n\n";

  q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto sg = item.get_sub_group();
    auto i = item.get_global_id(0);

    //# swap adjacent items in array using sub_group permute_group_by_xor
    data[i] = permute_group_by_xor(sg, data[i], 1);
      
    //# reverse the order of items in sub_group using permute_group_by_xor
    //data[i] = permute_group_by_xor(sg, data[i], sg.get_max_local_range()[0] - 1);
      
  }).wait();

  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << "\n";

  free(data, q);
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:

! chmod 755 q; chmod 755 run_sub_group_shuffle.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_shuffle.sh; else ./run_sub_group_shuffle.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Lab Exercise: Subgroup - Reduce

The code below uses subgroup reduce_over_group function to perform reduction for all items in a subgroup.

h.parallel_for(nd_range<1>(N,B), [=](nd_item<1> item){
      auto sg = item.get_sub_group();
      auto i = item.get_global_id(0);
      /* Reduction Collective on Sub-group */
      int result = reduce_over_group(sg, data[i], plus<>());
      //int result = reduce_over_group(sg, data[i], maximum<>());
      //int result = reduce_over_group(sg, data[i], minimum<>());
    });

The DPC++ code below demonstrates sub-group collectives: Inspect code, you can change the operator "plus" to "maximum" or "minimum" and check output:

  1. Inspect the code cell below and click run ▶ to save the code to file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/sub_group_reduce.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 256; // global size
static constexpr size_t B = 64;  // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  //# initialize data array using usm
  int *data = malloc_shared<int>(N, q);
  for (int i = 0; i < N; i++) data[i] = i;
  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << "\n\n";

  q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto sg = item.get_sub_group();
    auto i = item.get_global_id(0);

    //# Add all elements in sub_group using sub_group collectives
    int result = reduce_over_group(sg, data[i], plus<>());

    //# write sub_group sum in first location for each sub_group
    if (sg.get_local_id()[0] == 0) {
      data[i] = result;
    } else {
      data[i] = 0;
    }
  }).wait();

  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << "\n";

  free(data, q);
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:

! chmod 755 q; chmod 755 run_sub_group_reduce.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_reduce.sh; else ./run_sub_group_reduce.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Lab Exercise: Subgroup - Broadcast

The code below uses the sub-group group_broadcast collective function, which enables one work-item in a group to share the value of a variable with all other work-items in the group.

The DPC++ code below demonstrates sub-group broadcast function: Inspect code, there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/sub_group_broadcast.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 256; // global size
static constexpr size_t B = 64; // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  //# initialize data array using usm
  int *data = malloc_shared<int>(N, q);
  for(int i=0; i<N; i++) data[i] = i;
  for(int i=0; i<N; i++) std::cout << data[i] << " "; 
  std::cout << "\n\n";  

  //# use parallel_for and sub_groups
  q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto sg = item.get_sub_group();
    auto i = item.get_global_id(0);

    //# write sub_group item values to broadcast value at index 3
    data[i] = group_broadcast(sg, data[i], 3);

  }).wait();

  for(int i=0; i<N; i++) std::cout << data[i] << " "; 
  std::cout << "\n";
  
  free(data, q);
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:

! chmod 755 q; chmod 755 run_sub_group_broadcast.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_broadcast.sh; else ./run_sub_group_broadcast.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Lab Exercise: Subgroup - Votes

The any_of_group, all_of_group and none_of_group functions (henceforth referred to collectively as "vote" functions) enable work-items to compare the result of a Boolean condition across their group.

The DPC++ code below demonstrates sub-group collectives any_of_group, all_of_group and none_of_group functions: Inspect code, there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file.
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/sub_group_votes.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 32; // global size
static constexpr size_t B = 16; // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  //# initialize input and output array using usm
  auto input = malloc_shared<int>(N, q);
  auto all = malloc_shared<int>(N, q);
  auto any = malloc_shared<int>(N, q);
  auto none = malloc_shared<int>(N, q);
    
  //# initialize values for input array  
  for(int i=0; i<N; i++) { if (i< 10) input[i] = 0; else input[i] = i; }
  std::cout << "input:\n";
  for(int i=0; i<N; i++) std::cout << input[i] << " "; std::cout << "\n";  

  //# use parallel_for and sub_groups
  q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item)[[intel::reqd_sub_group_size(8)]] {
    auto sg = item.get_sub_group();
    auto i = item.get_global_id(0);

    //# write items with vote functions
    all[i] = all_of_group(sg, input[i]);
    any[i] = any_of_group(sg, input[i]);
    none[i] = none_of_group(sg, input[i]);

  }).wait();

  std::cout << "all_of:\n";
  for(int i=0; i<N; i++) std::cout << all[i] << " "; std::cout << "\n";
  std::cout << "any_of:\n";
  for(int i=0; i<N; i++) std::cout << any[i] << " "; std::cout << "\n";
  std::cout << "none_of:\n";
  for(int i=0; i<N; i++) std::cout << none[i] << " "; std::cout << "\n";
  
  free(input, q);
  free(all, q);
  free(any, q);
  free(none, q);
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code:

! chmod 755 q; chmod 755 run_sub_group_votes.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_votes.sh; else ./run_sub_group_votes.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Summary

Subgroups allow kernel programming that maps execution to low-level hardware and may help in achieving higher levels of performance.

Survey

We would appreciate any feedback you’d care to give, so that we can improve the overall training quality and experience. Thanks!

Reset Notebook

Should you be experiencing any issues with your notebook or just want to start fresh, run the cell below.
from IPython.display import display, Markdown, clear_output
import ipywidgets as widgets
button = widgets.Button(
    description='Reset Notebook',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='This will update this notebook, overwriting any changes.',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
out = widgets.Output()
def on_button_clicked(_):
      # "linking function with output"
      with out:
          # what happens when we press the button
          clear_output()
          !rsync -a --size-only /data/oneapi_workshop/oneAPI_Essentials/04_DPCPP_Sub_Groups/ ~/oneAPI_Essentials/04_DPCPP_Sub_Groups
          print('Notebook reset -- now click reload on browser.')
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
widgets.VBox([button,out])


Demonstration of Intel® Advisor


마지막 수정일: 2022-02-23, teratec

Intel® Advisor - Offload Advisor

These sections demonstrate how to collect and generate a roofline report using Intel® Advisor; below we will examine our "offload" report.

Sections

Learning Objectives

The goal of this notebook is to show how Intel® Advisor can help deciding what part of the code should or should not be offloaded on the GPU. At the end of this, you will be able:

  • To run Offload Advisor and generate a HTML report
  • To read and understand the metrics in the report
  • To get a performance estimation of your application on the target hardware
  • To decide which loops are good candidate for offload

What is Offload Advisor?

Offload Advisor allows you to collect performance predictor data in addition to the profiling capabilities of Intel® Advisor. View output files containing metrics and performance data such as total speedup, fraction of code accelerated, number of loops and functions offloaded, and a call tree showing offloadable and accelerated regions.

Offload Advisor Analysis

The HTML report below is live; click through its navigation to see the output.
Intel Advisor Offload report

View the Report

Select the cell below and click run ▶ to view the analysis.

import os
os.system('/bin/echo $(whoami) is running DPCPP_Essentials Module5 -- Intel Advisor - 1 of 2 offload.html')
from IPython.display import IFrame
IFrame(src='assets/offload.html', width=1024, height=1280)

Using Intel® Advisor to increase performance

Intel® Advisor is recommended for optimizing the design for vectorization and memory (CPU and GPU), and for identifying loops that are candidates for offload and projecting their performance on target accelerators.

Offload Advisor can help determine which kernels should be offloaded and can predict the speedup that can be expected.

Developers can use the Intel® DPC++ Compatibility Tool to perform a one-time migration from CUDA to Data Parallel C++. Existing Fortran applications can use a directive style based on OpenMP. Existing C++ applications can choose either the kernel style or the directive-based style option.

Once you write the DPC++ code, GPU roofline analysis helps to develop an optimization strategy and to see potential bottlenecks relative to target maximums.

Finally, the GPU analysis using VTune™ can help optimize for the target.

Intel® Advisor - Offload Advisor: Find code that can be profitably offloaded

From the figure below we can clearly observe that the workload was accelerated by 3.5x. You can see in the program metrics that the original workload ran in 18.51s and the accelerated workload ran in 5.45s.

Offload Advisor: Will Offload Increase Performance?

From the figure below we can clearly observe the good and bad candidates for offloading. You can also observe what your workload is bounded by.

Analysis of Top Offload Regions

This provides a detailed description of each loop that is of interest for offload. You can view the timings (total time, time on the accelerator, speedup), the offload metrics such as the offload tax and the data transfers, the memory traffic (DRAM, L3, L2, L1), and the trip count. It also highlights which part of the code should run on the accelerator.

What Kernels Should Not Be Offloaded?

The section below explains why Intel Advisor does not recommend a given loop for offload. Possible reasons include dependency issues, loops that are not profitable, or a total time that is too small.

Compare Acceleration on Different GPUs

The figure below compares acceleration on Gen9 and Gen11. You can observe that it is not efficient to offload on Gen9, whereas on Gen11 there is one offload with 98% of the code accelerated, for a 1.6x speedup.

What Is the Workload Bounded By?

The performance will ultimately have an upper bound based on your hardware's limitations. There are several limitations that Offload Advisor can indicate, but they generally come down to compute, memory and data transfer. Knowing what your application is bounded by is critical to developing an optimization strategy. In the example below, 95% of the workload is bounded by L3 bandwidth, but you may have several bottlenecks.

Program Tree

The program tree offers another view of the proportion of code that can be offloaded to the accelerator

Command line options

The application runs on a CPU and need not be threaded. For Intel® Offload Advisor, it doesn't matter if your code is already threaded. Advisor will run several analyses on your application to extract metrics such as the number of operations, the number of memory transfers, data dependencies and many more.
Remember that our goal here is to decide if some of our loops are good candidates for offload. In this section, we will generate the report assuming that we want to offload our computations to Gen Graphics (gen9), which is the hardware available on DevCloud.
Keep in mind that if you want Advisor to extract as much information as possible, you need to compile your application with debug information (-g with Intel compilers).

The easiest way to run Offload Advisor is to use the batch mode, which consists of running two scripts available in the folder $APM ($APM is available when Advisor is sourced).

  • collect.py: Used to collect data such as timing, flops, tripcounts and many more
  • analyze.py: Creating the report

To be more specific, collect.py runs the following analyses:

  • survey: Timing your application functions and loops, reading compiler diagnostics
  • tripcount: With flops and cache simulation to count the number of iterations in the loops as well as the number of operations and memory transfers
  • dependency: Check if you have data dependencies in your loops, which would prevent them from being good candidates for offloading or vectorization

Offload Advisor is currently run from the command-line as below. Once the run is complete you can view the generated report.html.

  • Clone the official GitHub samples repository:
    git clone https://github.com/oneapi-src/oneAPI-samples.git
  • Go into the matrix multiply Advisor sample directory:
    cd oneAPI-samples/Tools/Advisor/matrix_multiply_advisor/
  • Build the application and generate the matrix multiplication binary:
    cmake .
    make
  • Run the collection and analysis scripts:
    advixe-python $APM/collect.py advisor_project --config gen9 -- ./matrix.dpcpp
    advixe-python $APM/analyze.py advisor_project --config gen9 --out-dir ./analyze
%%writefile advisor_offload.sh
#!/bin/bash

advixe-python $APM/collect.py advisor_project --config gen9 -- ./matrix.dpcpp
advixe-python $APM/analyze.py advisor_project --config gen9 --out-dir ./analyze

Generating the HTML report

The last step is to generate our HTML report for offloading on gen9. This report will show us:

  • What is the expected speedup on Gen9
  • What will most likely be our bottleneck on Gen9
  • What are the good candidates for offload
  • What are the loops that should not be offloaded

Offload Advisor Output Overview

report.html: Main report in HTML format

report.csv and whole_app_metric.csv: Comma-separated CSV files

program_tree.dot: A graphical representation of the call tree showing the offloadable and accelerated regions

program_tree.pdf: A graphical representation of the call tree generated if the DOT(GraphViz*) utility is installed and a 1:1 conversion from the program_tree.dot file

JSON and LOG files that contain data used to generate the HTML report and logs, primarily used for debugging and reporting bugs and issues

Summary

  • Ran the Offload Advisor report.
  • Analyzed various outputs.
  • Learned about additional command line options and how to speed up collection time.

Survey

We would appreciate any feedback you’d care to give, so that we can improve the overall training quality and experience. Thanks!

Continue to Roofline Analysis

Roofline Analysis

Reset Notebook

Should you be experiencing any issues with your notebook or just want to start fresh, run the cell below.
from IPython.display import display, Markdown, clear_output
import ipywidgets as widgets
button = widgets.Button(
    description='Reset Notebook',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='This will update this notebook, overwriting any changes.',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
out = widgets.Output()
def on_button_clicked(_):
      # "linking function with output"
      with out:
          # what happens when we press the button
          clear_output()
          !rsync -a --size-only /data/oneapi_workshop/oneAPI_Essentials/05_Intel_Advisor/ ~/oneAPI_Essentials/05_Intel_Advisor
          print('Notebook reset -- now click reload on browser.')
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
widgets.VBox([button,out])


Intel® VTune™ Profiler on Intel® DevCloud


마지막 수정일: 2022-02-23, teratec

VTune™ Profiling on Intel® DevCloud

Sections

Learning Objectives

  • Profile a DPC++ application using the VTune™ profiling tool on Intel® DevCloud
  • Understand the basics of VTune™ command line options for collecting data and generating reports

What is VTune™ Profiler?

VTune™ Profiler provides DPC++ profiling capabilities so you can tune for CPU, GPU, and FPGA.

Analyze Data Parallel C++:
See the lines of DPC++ that consume the most time

Tune for CPU, GPU & FPGA:
Optimize for any supported hardware accelerator

Optimize Offload:
Tune OpenMP offload performance

Wide Range of Performance Profiles:
CPU, GPU, FPGA, threading, memory, cache, storage…

Most Popular Languages:
DPC++, C, C++, Fortran, Python, Go, Java, or a mix

VTune™ Command-line Options

Run and collect VTune™ data

vtune -collect gpu_hotspots -result-dir vtune_data a.out

Various types of profiling data can be collected, such as hotspots, memory-consumption, memory-access, and threading.

Use the command line help to find out more:

vtune --help -collect

Generate an HTML report for the collected VTune™ data:

vtune -report summary -result-dir vtune_data -format html -report-output $(pwd)/summary.html

Various types of reports can be generated, such as summary, top-down, and callstacks.

Use the command line help to find out more:

vtune --help -report

When to use VTune™ Command line

The VTune™ command line is useful when you are working on Intel® DevCloud or only have SSH access to the development system.

However, it is recommended to install the full version of VTune™ Profiler on a local system and use its rich UI experience.

Lab Exercise: VTune™ Profiling

  • Build and run the application, collect VTune™ data, and display the VTune™ summary when running on GPU and CPU.

Test Application: DPC++ implementation of iso3dfd

The DPC++ implementation of iso3dfd will be used to collect VTune™ data and analyze the generated result. The source files of the iso3dfd application (src/iso3dfd.cpp, src/utils.cpp, and src/iso3dfd_kernels.cpp) are built and run by the script below:

Build and Run

%%writefile run_iso3dfd.sh
#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh > /dev/null 2>&1

dpcpp src/iso3dfd.cpp src/utils.cpp src/iso3dfd_kernels.cpp -o iso3dfd

./iso3dfd 256 256 256 8 8 8 20 sycl gpu

STEP 1: Build and Run the iso3dfd app by running ▶ the command below:

! chmod 755 q; chmod 755 run_iso3dfd.sh; if [ -x "$(command -v qsub)" ]; then ./q run_iso3dfd.sh; else ./run_iso3dfd.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Analyze performance with VTune™

Use the VTune™ command line to analyze performance on GPU vs CPU and display the summary.

VTune™ Command Line for collecting and reporting

%%writefile vtune_collect.sh
#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh
/bin/echo "##" $(whoami) is compiling DPCPP_Essentials Module6 -- Intel Vtune profiler - 1 of 1 Vtune_Profiler
#vtune
#type=hotspots
#type=memory-consumption
#type=uarch-exploration
#type=memory-access
#type=threading
#type=hpc-performance
#type=system-overview
#type=graphics-rendering
#type=io
#type=fpga-interaction
#type=gpu-offload
type=gpu-hotspots
#type=throttling
#type=platform-profiler
#type=cpugpu-concurrency
#type=tsx-exploration
#type=tsx-hotspots
#type=sgx-hotspots

rm -r vtune_data

echo "Vtune Collect $type"
vtune -collect $type -result-dir vtune_data $(pwd)/iso3dfd 256 256 256 8 8 8 20 sycl gpu

echo "Vtune Summary Report"
vtune -report summary -result-dir vtune_data -format html -report-output $(pwd)/summary.html

Run VTune™ to Collect Hotspots and Generate Report

STEP 2: Collect VTune™ data and generate report by running ▶ the command below:

! chmod 755 vtune_collect.sh; if [ -x "$(command -v qsub)" ]; then ./q vtune_collect.sh; else ./vtune_collect.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Display VTune™ Summary

Display VTune™ summary report generated in html format

Display VTune™ Report for GPU

STEP 3: Display VTune™ summary report by running ▶ the command below

from IPython.display import IFrame
IFrame(src='summary.html', width=960, height=600)

Summary

The VTune™ command line is useful for quick analysis of DPC++ applications to get performance metrics and tune applications.

Survey

We would appreciate any feedback you’d care to give, so that we can improve the overall training quality and experience. Thanks!

Reset Notebook

Should you be experiencing any issues with your notebook or just want to start fresh, run the cell below.
from IPython.display import display, Markdown, clear_output
import ipywidgets as widgets
button = widgets.Button(
    description='Reset Notebook',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='This will update this notebook, overwriting any changes.',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
out = widgets.Output()
def on_button_clicked(_):
      # "linking function with output"
      with out:
          # what happens when we press the button
          clear_output()
          !rsync -a --size-only /data/oneapi_workshop/oneAPI_Essentials/06_Intel_VTune_Profiler/ ~/oneAPI_Essentials/06_Intel_VTune_Profiler
          print('Notebook reset -- now click reload on browser.')
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
widgets.VBox([button,out])


DPC++ Library Utilization


마지막 수정일: 2022-02-23, teratec

Intel® oneAPI DPC++ Library

Sections

Learning Objectives

  • Simplify DPC++ programming by using Intel® oneAPI DPC++ Library (oneDPL)
  • Use DPC++ Library algorithms for Heterogeneous Computing
  • Implement oneDPL algorithms using Buffers and Unified Shared Memory

What is Intel® oneAPI DPC++ Library?

The Intel® oneAPI DPC++ Library (oneDPL) is a companion to the Intel® oneAPI DPC++ Compiler and provides an alternative for C++ developers who create heterogeneous applications and solutions. Its APIs are based on familiar standards—C++ STL, Parallel STL (PSTL), and SYCL* — to maximize productivity and performance across CPUs, GPUs, and FPGAs.

oneDPL consists of the following components:

  • Standard C++ APIs
  • Parallel STL algorithms
  • Extensions APIs - additional set of library classes and functions

Why use oneDPL for DPC++ Heterogeneous Computing?

The Intel oneAPI DPC++ Library helps to maximize productivity and performance across CPUs, GPUs, and FPGAs.

Maximize performance by offloading computation to devices such as GPUs. For example, the code snippet below shows how existing functionality that executes on the CPU can be offloaded to devices like a GPU or FPGA using oneDPL.

Compute on CPU:

  std::sort(v.begin(), v.end());  

Compute on GPU with oneDPL:

  sycl::queue q(sycl::gpu_selector{});
  std::sort(oneapi::dpl::execution::make_device_policy(q), v.begin(), v.end());
                                    ^                  ^  

Maximize productivity by using oneDPL algorithms instead of writing DPC++ kernel code for algorithms that already exist in oneDPL. For example, the entire DPC++ kernel code in the example below can be replaced with one line of code when using a DPC++ Library algorithm.

#include<CL/sycl.hpp>
using namespace sycl;
constexpr int N = 4;

int main() {
  queue q;
  std::vector<int> v(N);
    
//==================================================================↓
  {
    buffer<int> buf(v.data(),v.size());
    q.submit([&](handler &h){
       auto V = buf.get_access<access::mode::read_write>(h);
       h.parallel_for(range<1>(N),[=] (id<1> i){ V[i] = 20; }); 
    });
  }
//==================================================================↑
    
  for(int i = 0; i < v.size(); i++) std::cout << v[i] << "\n";
  return 0;
}

The above code block can be accomplished with one line of code using oneDPL:

  std::fill(oneapi::dpl::execution::make_device_policy(q), v.begin(), v.end(), 20);

The above code creates a temporary SYCL buffer, computes the algorithm on the device, and copies the buffer back.

Simple oneDPL example

The example below shows how a single line of code with a Parallel STL algorithm can replace the DPC++ kernel code and produce the same results as the previous example.

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/dpl_simple.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include<CL/sycl.hpp>
using namespace sycl;
constexpr int N = 4;

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";
  std::vector<int> v(N);
    
  //# Parallel STL fill function with device policy
  std::fill(oneapi::dpl::execution::make_device_policy(q), v.begin(), v.end(), 20);
    
  for(int i = 0; i < v.size(); i++) std::cout << v[i] << "\n";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code above:

! chmod 755 q; chmod 755 run_dpl_simple.sh;if [ -x "$(command -v qsub)" ]; then ./q run_dpl_simple.sh; else ./run_dpl_simple.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again_

oneDPL Algorithms

  1. C++ standard APIs have been tested and function well within DPC++ kernels. To use them, include the corresponding C++ standard header files and use the std namespace. A list of the tested C++ standard APIs available for DPC++ can be found here for reference.

  2. Parallel STL, which offers efficient support for both parallel and vectorized execution of algorithms on Intel® processors, is extended with support for DPC++ compliant devices by introducing special DPC++ execution policies and functions. A list of the Parallel STL algorithms available for DPC++ can be found here for reference.

  3. Extension APIs are non-standard algorithms, utility classes, and iterators. A list of the extension APIs available for DPC++ can be found here for reference.

All oneDPL header files are in the oneapi/dpl directory. Depending on the algorithms you intend to use, include the appropriate subset of the following header files:
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/numeric>
#include <oneapi/dpl/memory>

oneDPL has its own namespace oneapi::dpl for all its extensions, including DPC++ execution policies, non-standard algorithms, special iterators, etc.
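
As a small illustration of the extension APIs, the sketch below (not one of this module's sample files; the size N and the use of std::copy are only choices made for this example) combines a Parallel STL algorithm and a device policy with the oneapi::dpl::counting_iterator extension iterator to fill a vector with the sequence 0..N-1:

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/iterator>
#include <CL/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;
constexpr int N = 8;

int main() {
  queue q;
  std::vector<int> v(N);

  //# counting_iterator is a oneDPL extension iterator that generates the
  //# sequence 0, 1, 2, ... on the fly without allocating any memory
  oneapi::dpl::counting_iterator<int> first(0);

  //# Parallel STL copy with a device policy writes the sequence into the vector
  std::copy(oneapi::dpl::execution::make_device_policy(q), first, first + N, v.begin());

  for (int i = 0; i < N; i++) std::cout << v[i] << "\n";
  return 0;
}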

DPC++ Execution Policy Usage

The DPC++ execution policy specifies where and how a Parallel STL algorithm runs. It inherits a standard C++ execution policy, encapsulates a SYCL* device or queue, and enables you to set an optional kernel name. DPC++ execution policies can be used with all standard C++ algorithms that support execution policies.

  1. Add #include <oneapi/dpl/execution> to your code.
  2. Create a policy object by providing a standard policy type, an optional class type for a unique kernel name as a template argument, and one of the following constructor arguments:
  • A SYCL queue
  • A SYCL device
  • A SYCL device selector
  • An existing policy object with a different kernel name
  3. The oneapi::dpl::execution::dpcpp_default object is a predefined object of the device_policy class, created with a default kernel name and a default queue. Use it to create customized policy objects, or pass it directly when invoking an algorithm.

Below is an example showing the usage of an execution policy with Parallel STL:

queue q;
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::fill(policy, v.begin(), v.end(), 20);
  • Parallel STL algorithms can be called with ordinary iterators.
  • A temporary SYCL buffer is created and the data is copied to this buffer.
  • After processing of the temporary buffer on a device is complete, the data is copied back to the host.
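
To make the constructor options above concrete, below is a minimal sketch (not one of the module's sample files; the kernel name class mykernel is only an illustrative assumption) showing a few of the ways a policy object can be created:

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <CL/sycl.hpp>
#include <vector>
using namespace sycl;
using namespace oneapi::dpl::execution;

class mykernel;   //# unique kernel name, used only for this example

int main() {
  std::vector<int> v(4);

  //# policy created from a SYCL queue
  queue q;
  auto policy_q = make_device_policy(q);

  //# policy created from a SYCL device (obtained here via a device selector)
  auto policy_d = make_device_policy(device{default_selector{}});

  //# policy created from the predefined dpcpp_default policy, with a unique kernel name
  auto policy_n = make_device_policy<mykernel>(dpcpp_default);

  //# any of the policies can then be passed to a Parallel STL algorithm
  std::fill(policy_q, v.begin(), v.end(), 20);
  return 0;
}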

Using multiple oneDPL algorithms

The code example below uses two algorithms: the input vector is doubled using the std::for_each algorithm and then sorted using the std::sort algorithm. Execute the code below, then consider whether this is the most efficient way to chain the two algorithms.

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/dpl_sortdouble.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include<CL/sycl.hpp>
using namespace sycl;
using namespace oneapi::dpl::execution;

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";
  std::vector<int> v{2,3,1,4};
    
  std::for_each(make_device_policy(q), v.begin(), v.end(), [](int &a){ a *= 2; });
  std::sort(make_device_policy(q), v.begin(), v.end());
    
  for(int i = 0; i < v.size(); i++) std::cout << v[i] << "\n";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code above:

! chmod 755 q; chmod 755 run_dpl_sortdouble.sh;if [ -x "$(command -v qsub)" ]; then ./q run_dpl_sortdouble.sh; else ./run_dpl_sortdouble.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again_

The above example works, but memory is copied to the device and back twice when vectors are passed directly to the oneDPL algorithms:

//# memory copied host -> device
std::for_each(make_device_policy(q), v.begin(), v.end(), [](int &a){ a *= 2; });
//# memory copied device -> host

//# memory copied host -> device
std::sort(make_device_policy(q), v.begin(), v.end());
//# memory copied device -> host

To avoid the memory being copied back and forth twice, we have to create a buffer and use buffer iterators, as explained below.

oneDPL with Buffer Iterators

oneapi::dpl::begin and oneapi::dpl::end are special helper functions that allow you to pass SYCL buffers to Parallel STL algorithms. These functions accept a SYCL buffer and return an object of an unspecified type. They require the following header file:

#include <oneapi/dpl/iterator>

Using buffer iterators ensures that memory is not copied back and forth between each algorithm execution on the device. The code example below shows how the example above is implemented using buffer iterators, which keep the memory on the device until the buffer is destroyed.

The code below shows simple oneDPL code. Inspect the code; there are no modifications necessary.

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/dpl_buffer.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/iterator>
#include <CL/sycl.hpp>
using namespace sycl;
using namespace oneapi::dpl::execution;


int main(){
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";
  std::vector<int> v{2,3,1,4};
    
  //# Create a buffer and use buffer iterators in Parallel STL algorithms
  {
    buffer buf(v);
    auto buf_begin = oneapi::dpl::begin(buf);
    auto buf_end   = oneapi::dpl::end(buf);

    std::for_each(make_device_policy(q), buf_begin, buf_end, [](int &a){ a *= 3; });
    std::sort(make_device_policy(q), buf_begin, buf_end);
  }
    
  for(int i = 0; i < v.size(); i++) std::cout << v[i] << "\n";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code above:

! chmod 755 q; chmod 755 run_dpl_buffer.sh;if [ -x "$(command -v qsub)" ]; then ./q run_dpl_buffer.sh; else ./run_dpl_buffer.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again_

oneDPL with Unified Shared Memory

The following examples demonstrate two ways to use the oneDPL algorithms with Unified Shared Memory (USM): either iterate with pointers directly, or iterate with vectors that use a USM allocator:

  • USM pointers
  • USM allocators

If the same memory is processed by several algorithms, explicitly wait for completion of each algorithm before passing the memory to the next one. Also wait for completion before accessing the data on the host.
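
For example, the minimal sketch below (assuming the same kind of default queue and shared allocation as the samples that follow; the values and operations are arbitrary) runs two algorithms on the same USM allocation with an explicit wait after each one:

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;
using namespace oneapi::dpl::execution;
const int N = 4;

int main() {
  queue q;
  int* data = malloc_shared<int>(N, q);
  for (int i = 0; i < N; i++) data[i] = N - i;   //# 4, 3, 2, 1

  //# first algorithm on the shared allocation
  std::for_each(make_device_policy(q), data, data + N, [](int &a){ a *= 2; });
  q.wait();   //# wait before handing the same memory to the next algorithm

  //# second algorithm on the same allocation
  std::sort(make_device_policy(q), data, data + N);
  q.wait();   //# wait before reading the results on the host

  for (int i = 0; i < N; i++) std::cout << data[i] << "\n";   //# prints 2 4 6 8
  free(data, q);
  return 0;
}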

oneDPL with USM Pointers

malloc_shared allocates memory that can be accessed on both the host and the device. This USM pointer can be used to iterate in a oneDPL algorithm by passing pointers to the start and end of the allocation.

The code below shows how oneDPL can be used with a USM pointer. Inspect the code; there are no modifications necessary.

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/dpl_usm_pointer.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
using namespace sycl;
using namespace oneapi::dpl::execution;
const int N = 4;

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";
    
  //# USM shared allocation, accessible on both host and device
  int* data = malloc_shared<int>(N, q);
    
  //# Parallel STL algorithm using USM pointer
  std::fill(make_device_policy(q), data, data + N, 20);
  q.wait();
    
  for (int i = 0; i < N; i++) std::cout << data[i] << "\n";
  free(data, q);
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code above:

! chmod 755 q; chmod 755 run_dpl_usm_pointer.sh;if [ -x "$(command -v qsub)" ]; then ./q run_dpl_usm_pointer.sh; else ./run_dpl_usm_pointer.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

oneDPL with USM Allocators

usm_allocator is a C++ allocator class for USM; it takes the data type and the kind of allocation as template parameters. This allocator is passed to the std::vector constructor, and the oneDPL algorithm can then use vector iterators.

The code below shows oneDPL with a USM allocator used in the vector declaration. Inspect the code; there are no modifications necessary.

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.
%%writefile lab/dpl_usm_alloc.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>

using namespace sycl;
using namespace oneapi::dpl::execution;

const int N = 4;

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";
    
  //# USM allocator 
  usm_allocator<int, usm::alloc::shared> alloc(q);
  std::vector<int, decltype(alloc)> v(N, alloc);
    
  //# Parallel STL algorithm with USM allocator
  std::fill(make_device_policy(q), v.begin(), v.end(), 20);
  q.wait();
    
  for (int i = 0; i < v.size(); i++) std::cout << v[i] << "\n";
  return 0;
}

Build and Run

Select the cell below and click run ▶ to compile and execute the code above:

! chmod 755 q; chmod 755 run_dpl_usm_alloc.sh;if [ -x "$(command -v qsub)" ]; then ./q run_dpl_usm_alloc.sh; else ./run_dpl_usm_alloc.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again

Summary

In this module you have learned the following:

  • What is Intel® oneAPI DPC++ Library and Why use it?
  • Usage of oneDPL for Heterogeneous Computing
  • Using oneDPL algorithm with Buffers and Unified Shared Memory

Reset Notebook

Should you be experiencing any issues with your notebook or just want to start fresh, run the cell below.
from IPython.display import display, Markdown, clear_output
import ipywidgets as widgets
button = widgets.Button(
    description='Reset Notebook',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='This will update this notebook, overwriting any changes.',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
out = widgets.Output()
def on_button_clicked(_):
      # "linking function with output"
      with out:
          # what happens when we press the button
          clear_output()
          !rsync -a --size-only /data/oneapi_workshop/oneAPI_Essentials/07_DPCPP_Library/ ~/oneAPI_Essentials/07_DPCPP_Library
          print('Notebook reset -- now click reload on browser.')
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
widgets.VBox([button,out])

Introduction to OpenMP Offload


마지막 수정일: 2022-02-25, teratec

Sections:

Learning Objectives

  • Explain how oneAPI can solve the challenges of programming in a heterogeneous world
  • Use oneAPI solutions to enable your workflows
  • Use OpenMP Offload directives to execute code on the GPU
  • Become familiar with using Jupyter notebooks for training throughout the course

Prerequisites

This course assumes general OpenMP knowledge for CPUs. If you are new to OpenMP, below are some great resources to get you started.

oneAPI Software Model Overview

The oneAPI software model provides a comprehensive and unified portfolio of developer tools that can be used across hardware targets, including a range of performance libraries spanning several workload domains. The libraries include functions custom-coded for each target architecture, so the same function call delivers optimized performance across supported architectures. The oneAPI initiative is based on industry standards and open specifications and is interoperable with existing HPC programming models.

<img src="Assets/oneapi2.png">

HPC Single-Node Workflow with oneAPI

Accelerated code can be written in either a kernel (DPC++) or a directive-based (OpenMP) style. Developers can use the Intel® DPC++ Compatibility Tool to perform a one-time migration from CUDA* to Data Parallel C++. Existing Fortran applications can use a directive-based style with OpenMP. Existing C++ applications can choose either the kernel style or the directive-based style, and existing OpenCL applications can remain in the OpenCL language or migrate to Data Parallel C++. Intel® Advisor is recommended to optimize the design for vectorization and memory (CPU and GPU), identify loops that are candidates for offload, and project the performance on target accelerators. The figure below shows the recommended approach for different starting points for HPC developers:

<img src="Assets/workflow.png">

OpenMP vs DPC++

Both OpenMP and DPC++ are open standards that can be used to accelerate algorithms on GPUs. As the workflow diagram shows, oneAPI supports both methodologies, and you should be able to achieve similar optimized performance with either option. The decision between the two likely depends on workflow requirements and ease of porting. When migrating from existing CUDA or OpenCL projects, DPC++ would likely make more sense; when migrating from existing C/Fortran applications with OpenMP, OpenMP offload would be the easier alternative.

OpenMP Offload

OpenMP Offload constructs are a set of directives for C++ and Fortran, introduced in OpenMP 4.0 and further enhanced in later versions, that allow developers to offload data and execution to target accelerators such as GPUs. OpenMP offload is supported in the Intel® oneAPI HPC Toolkit with the Intel® C++ Compiler and the Intel® Fortran Compiler.


Simple Exercise

This exercise introduces OpenMP offload to the developer by way of a small, simple code. In addition, it introduces the developer to the Jupyter notebook environment for editing and saving code, and for running and submitting programs to the Intel® oneAPI DevCloud. We start with a program that includes basic OpenMP constructs, including parallel and for. We will then add the target directive to offload part of the program to the GPU device. This simple program loops through all of the elements of the data array and multiplies each by 2.

Editing the simple.cpp code

The Jupyter cell below with the gray background can be edited in-place and saved. The first line of the cell contains the command %%writefile lab/simple.cpp, which tells the input cell to save its contents into the file lab/simple.cpp. As you edit the cell and run it, your changes are saved into that file. The code below shows the simple OpenMP code. Inspect the code; there are no modifications necessary:

  1. Inspect the code cell below and click run ▶ to save the code to file
  2. Next run ▶ the cell in the Build and Run section below the code to compile and execute the code.

%%writefile lab/simple.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <omp.h>
#include <iostream>

constexpr int N = 16;

int main() {
  int is_cpu;
  int *data = static_cast<int *>(malloc(N * sizeof(int)));

  // Initialization
  for (int i = 0; i < N; i++) data[i] = i;

  {
    is_cpu = omp_is_initial_device();

    // Use OpenMP to Parallelize Algorithm
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
      data[i] *= 2;
    }
  }

  // Print Output
  std::cout << "Running on " << (is_cpu ? "CPU" : "GPU") << "\n";
  for (int i = 0; i < N; i++) std::cout << data[i] << "\n";

  free(data);
  return 0;
}

Compiling and Running C/C++ Programs

Compiling and Running on DevCloud:

For training purposes, we have written a script (q) to simplify launching tasks on the DevCloud. The q script submits a script to a GPU node on the DevCloud for execution, waits for the job to complete, and prints out the output/errors. We will be using this command to run programs on the DevCloud: ./q <script>.sh

Compiling and Running on local system:

If you have installed the oneAPI HPC Toolkit on your local system, you can use the commands below to compile and run an OpenMP offload program:

source /opt/intel/inteloneapi/setvars.sh
icx -fiopenmp -fopenmp-targets=spir64 simple.cpp
./simple
Note: our scripts are a combination of the above three steps.

Using the icx or icpx compiler with the "-fiopenmp -fopenmp-targets=spir64" options enables OpenMP offload to the GPU.

Compile the code

To compile the code above, we'll be using the compile_c.sh script. This script sets up the compile environment and executes the Intel® C++ Compiler.

#Optional: Examine contents of compile_c.sh
%pycat compile_c.sh

Execute the following cell to submit the compile_c.sh script using the q script.

! chmod 755 compile_c.sh; ./compile_c.sh;

If the Jupyter cells are not responsive or if they error out when you compile the samples, please restart the Kernel and compile the samples again.

Running the code

To execute the compiled executable, we'll be using the run.sh script.

#Optional: Examine contents of run.sh
%pycat run.sh

Execute the following cell to submit the run.sh script using the q script.

! chmod 755 q; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

Target Directive

The omp target construct transfers control and data from the host to the device. The transfer of control is sequential and synchronous. In a multi-device environment, the device clause can optionally be used to denote a specific device. Each device is assigned an implementation-specific integer number. Map clauses can be used to control the direction of data flow. Map clauses will be discussed in detail in the next module.

Example:

...// Sequential Host Code

#pragma omp target     //Target Region Executed on the Device
{
    for (...) {
        ...;
    }
}
...// More Sequential Host Code
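
For instance, here is a minimal sketch (separate from the lab code that follows; the array contents and the summation are arbitrary choices for this example) that offloads a summation loop to device 0 and uses map clauses to control the data flow. With the icx options shown earlier, the region should execute on the GPU when one is available.

#include <omp.h>
#include <iostream>

constexpr int N = 8;

int main() {
  int data[N];
  int sum = 0;
  for (int i = 0; i < N; i++) data[i] = i;

  // Offload to device 0: data is copied to the device,
  // and sum is copied to the device and back when the region ends.
  #pragma omp target device(0) map(to: data[0:N]) map(tofrom: sum)
  {
    for (int i = 0; i < N; i++) sum += data[i];
  }

  std::cout << "sum = " << sum << "\n";   // prints 28
  return 0;
}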

Lab Exercise: Running an OpenMP program with the Target Directive

In the example below, add the #pragma omp target map(from:is_cpu) map(tofrom:data[0:N]) directive where indicated to offload execution to the GPU. We use the map clauses here to transfer data to and from the GPU, and also to copy the value of is_cpu back to the host so we can see whether our code actually executed on the GPU. The map clause will be discussed in detail in the next module.

%%writefile lab/simple.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <omp.h>
#include <iostream>

constexpr int N = 16;

int main() {
  int is_cpu = true;
  int *data = static_cast<int *>(malloc(N * sizeof(int)));

  // Initialization
  for (int i = 0; i < N; i++) data[i] = i;

  // Add the target directive here, including the map clause.

  {
    is_cpu = omp_is_initial_device();
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
      data[i] *= 2;
    }
  }

  // Print Output
  std::cout << "Running on " << (is_cpu ? "CPU" : "GPU") << "\n";
  for (int i = 0; i < N; i++) std::cout << data[i] << "\n";

  free(data);
  return 0;
}

Execute this cell to compile the code

! chmod 755 compile_c.sh; ./compile_c.sh;

Execute this cell to run the code

! chmod 755 q; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: "Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again.

Once execution completes, you should see a message indicating that the program ran on the GPU.

See the solution by running this cell

%pycat simple_solution.cpp

Summary

In this module you have learned the following:

  • How oneAPI can solve the challenges of programming in a heterogeneous world
  • How to use oneAPI solutions to enable your workflows
  • How to use OpenMP Offload directives to execute code on the GPU

Should you be experiencing any issues with your notebook or just want to start fresh, run the cell below.

Resources

Check out these related resources:

Intel® oneAPI

OpenMP

Manage Device Data


마지막 수정일: 2022-02-25, teratec
