GPU Computing in Julia

This session introduces GPU computing in Julia.

GPGPU

GPUs are ubiquitous in modern computers. The table below lists GPUs typical of today's computer systems.

                      Server                Desktop             Laptop
NVIDIA GPUs           Tesla K80             GTX 1080            GT 650M
                      Tesla M2090           GTX 580             GT 650M
Computers             servers, clusters     desktop             laptop
Main usage            scientific computing  daily work, gaming  daily work
Memory                24 GB                 8 GB                1 GB
Memory bandwidth      480 GB/sec            320 GB/sec          80 GB/sec
Number of cores       4992                  2560                384
Processor clock       562 MHz               1.6 GHz             0.9 GHz
Peak DP performance   2.91 TFLOPS           257 GFLOPS
Peak SP performance   8.73 TFLOPS           8228 GFLOPS         691 GFLOPS

GPU architecture vs CPU architecture.

  • GPUs contain hundreds of processing cores on a single card; several cards can fit in a desktop PC
  • Each core carries out the same operations in parallel on different input data: the single program, multiple data (SPMD) paradigm
  • Extremely high arithmetic intensity, if one can transfer the data onto, and the results off of, the processors quickly
[Figures: Intel i7 die vs. NVIDIA Fermi die; "Einstein vs. Rain Man" analogy]
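The SPMD idea maps naturally onto Julia's dot broadcasting: one scalar function applied independently to every element of an array. A minimal CPU sketch; the same syntax carries over unchanged to the GPU array types introduced below:

```julia
# "Single program": a scalar function.
f(x) = log(1 + sin(x))

# "Multiple data": a large array. Broadcasting applies f to each element
# independently -- exactly the work a GPU assigns to one thread per element.
a = rand(Float32, 10^6)
b = f.(a)
```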

GPGPU in Julia

GPU support in Julia is under active development. Check JuliaGPU for currently available packages.

There are at least three paradigms for programming GPUs in Julia.

  • CUDA is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: CuBLAS, CuRAND, CuSparse, CuSolver, CuDNN, ...

    The CuArrays.jl package allows defining arrays on Nvidia GPUs and overloads many common operations. CuArrays.jl supports Julia v1.0+.

  • OpenCL is a standard supported by multiple manufacturers (Nvidia, AMD, Intel, Apple, ...), but it lacks some libraries essential for statistical computing.

    The CLArrays.jl package allows defining arrays on OpenCL devices and overloads many common operations.
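A minimal sketch of the CLArrays.jl workflow, assuming an OpenCL driver and the package are installed. Names such as `CLArrays.devices()` follow the common GPUArrays interface and may differ across package versions:

```julia
using CLArrays

# list the available OpenCL devices
for dev in CLArrays.devices()
    @show dev
end

xd = CLArray(rand(Float32, 3, 3))   # transfer a host array to the device
yd = xd .+ 1f0                      # broadcasting runs on the device
y  = collect(yd)                    # copy the result back to host memory
```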

  • ArrayFire is a high-performance library that works on both the CUDA and OpenCL frameworks.

    The ArrayFire.jl package wraps the library for Julia.
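A minimal ArrayFire.jl sketch, assuming the ArrayFire library itself is installed; the backend (CUDA or OpenCL) is selected by the library at load time:

```julia
using ArrayFire

a = rand(AFArray{Float32}, 1024, 1024)  # array created directly on the device
b = a * a'                              # matrix multiplication runs on the device
c = Array(b)                            # copy the result back to the host
```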

  • Warning: The most recent Apple operating system, macOS 10.15 (Catalina), does not support CUDA yet.

I'll illustrate using CuArrays on my Linux box running CentOS 7. It has an NVIDIA GeForce RTX 2080 Ti OC with 11 GB of GDDR6 memory (14 Gbps) and 4352 cores.

In [1]:
versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Query GPU devices in the system

In [2]:
using CuArrays, CUDAdrv

# check available devices on this machine and show their capability
for device in CuArrays.devices()
    @show capability(device)
end
capability(device) = v"7.5.0"

Transfer data between main memory and GPU

In [3]:
# generate data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = CuArray(x)
Out[3]:
3×3 CuArray{Float32,2,Nothing}:
 0.384964  0.38627   0.175825
 0.376902  0.221037  0.921937
 0.674334  0.800958  0.985284
In [4]:
# generate array on GPU directly
yd = ones(CuArray{Float32}, 3, 3)
Out[4]:
3×3 CuArray{Float32,2,Nothing}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
In [5]:
# collect data from GPU to CPU
x = collect(xd)
Out[5]:
3×3 Array{Float32,2}:
 0.384964  0.38627   0.175825
 0.376902  0.221037  0.921937
 0.674334  0.800958  0.985284
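Transfers between host and device go over the PCIe bus and are often the bottleneck in GPU computing, so it is worth benchmarking them separately from the computation itself. A sketch, assuming the same packages as above:

```julia
using BenchmarkTools, CuArrays

n  = 1024
x  = rand(Float32, n, n)
xd = CuArray(x)

@benchmark CuArray($x)    # host-to-device copy
@benchmark collect($xd)   # device-to-host copy
```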

Linear algebra

In [6]:
using BenchmarkTools, LinearAlgebra

n = 1024
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = CuArray(x)
yd = CuArray(y)
zd = CuArray(z)

# SP matrix multiplication on GPU
@benchmark mul!($zd, $xd, $yd)
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.993 μs (0.00% GC)
  median time:      172.123 μs (0.00% GC)
  mean time:        166.267 μs (0.00% GC)
  maximum time:     201.820 μs (0.00% GC)
  --------------
  samples:          3340
  evals/sample:     9
In [7]:
# SP matrix multiplication on CPU
@benchmark mul!($z, $x, $y)
Out[7]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.710 ms (0.00% GC)
  median time:      8.822 ms (0.00% GC)
  mean time:        8.838 ms (0.00% GC)
  maximum time:     11.772 ms (0.00% GC)
  --------------
  samples:          566
  evals/sample:     1

We see a ~40-50x speedup in this matrix multiplication example.
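One caveat: CUDA kernel launches are asynchronous, so the 2.993 μs minimum time above largely reflects launch overhead rather than the multiplication itself. Forcing synchronization inside the benchmark gives a fairer timing; a sketch using the `CuArrays.@sync` macro:

```julia
using BenchmarkTools, CuArrays, LinearAlgebra

# block until the GPU has actually finished before the timer stops
@benchmark CuArrays.@sync mul!($zd, $xd, $yd)
```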

In [8]:
# cholesky on Gram matrix
xtxd = xd'xd + I
@benchmark cholesky($(Symmetric(xtxd)))
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays /home/huazhou/.julia/packages/GPUArrays/QDGmr/src/host/indexing.jl:43
Out[8]:
BenchmarkTools.Trial: 
  memory estimate:  3.30 KiB
  allocs estimate:  107
  --------------
  minimum time:     812.867 μs (0.00% GC)
  median time:      820.569 μs (0.00% GC)
  mean time:        822.205 μs (0.09% GC)
  maximum time:     7.234 ms (39.54% GC)
  --------------
  samples:          6074
  evals/sample:     1
In [9]:
xtx = collect(xtxd)
@benchmark cholesky($(Symmetric(xtx)))
Out[9]:
BenchmarkTools.Trial: 
  memory estimate:  4.00 MiB
  allocs estimate:  6
  --------------
  minimum time:     2.154 ms (0.00% GC)
  median time:      2.200 ms (0.00% GC)
  mean time:        2.365 ms (5.32% GC)
  maximum time:     8.892 ms (0.00% GC)
  --------------
  samples:          2111
  evals/sample:     1

GPU speedup of Cholesky on this example is moderate.
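The warning above comes from a scalar-indexing fallback: the `Symmetric` wrapper makes some operations fall back to element-by-element access on the GPU, which is extremely slow. As the warning suggests, one can disallow scalar indexing so that such silent slowdowns become errors instead; a sketch:

```julia
using CuArrays

CuArrays.allowscalar(false)   # scalar getindex/setindex! on a CuArray now throws
# xd[1]                       # would now raise an error instead of running slowly
```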

Elementwise operations on GPU

In [10]:
# elementwise function on GPU arrays
fill!(yd, 1)
@benchmark $zd .= log.($yd .+ sin.($xd))
Out[10]:
BenchmarkTools.Trial: 
  memory estimate:  3.66 KiB
  allocs estimate:  84
  --------------
  minimum time:     8.005 μs (0.00% GC)
  median time:      31.654 μs (0.00% GC)
  mean time:        30.654 μs (0.91% GC)
  maximum time:     1.469 ms (97.42% GC)
  --------------
  samples:          10000
  evals/sample:     3
In [11]:
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
@benchmark $z .= log.($y .+ sin.($x))
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.246 ms (0.00% GC)
  median time:      14.263 ms (0.00% GC)
  mean time:        14.282 ms (0.00% GC)
  maximum time:     14.463 ms (0.00% GC)
  --------------
  samples:          351
  evals/sample:     1

GPU brings a great speedup (~450x by median time) to the massive evaluation of elementary math functions.
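For operations not covered by broadcasting or library calls, one can write a custom kernel with CUDAnative.jl. A sketch of the same elementwise computation, assuming CUDAnative.jl is installed; device math functions are called through the `CUDAnative` module:

```julia
using CUDAnative, CuArrays

function logsin_kernel!(z, x, y)
    # global thread index: one thread handles one array element
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(z)
        @inbounds z[i] = CUDAnative.log(y[i] + CUDAnative.sin(x[i]))
    end
    return nothing
end

n = length(zd)
threads = 256
# launch enough blocks to cover all n elements
@cuda threads=threads blocks=cld(n, threads) logsin_kernel!(zd, xd, yd)
```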