Numerical linear algebra

In [1]:
versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Introduction

  • Topics in numerical algebra:

    • BLAS
    • solve linear equations $\mathbf{A} \mathbf{x} = \mathbf{b}$
    • regression computations $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$
    • eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$
    • generalized eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{B} \mathbf{x}$
    • singular value decompositions $\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T$
    • iterative methods for numerical linear algebra
  • Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).

  • Our major goals (or learning objectives) are to

    1. know the complexity (flop count) of each task
    2. be familiar with the BLAS and LAPACK functions (what they do)
    3. not re-invent the wheel by implementing these dense linear algebra subroutines yourself
    4. understand the need for iterative methods
    5. apply appropriate numerical algebra tools to various statistical problems
  • All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.

    • Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See documentation.

BLAS

  • BLAS stands for basic linear algebra subprograms.

  • See netlib for a complete list of standardized BLAS functions.

  • There are many implementations of BLAS.

    • Netlib provides a reference implementation.
    • Matlab uses Intel's MKL (Math Kernel Library). The MKL implementation is the gold standard on the market. It is not open source, but the compiled library is free for Linux and macOS.
    • Julia uses OpenBLAS. OpenBLAS is the best open source implementation.
  • There are 3 levels of BLAS functions.

| Level | Example Operation | Name | Dimension | Flops |
|---|---|---|---|---|
| 1 | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
| 1 | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | axpy | $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
| 2 | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | gaxpy | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
| 2 | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank-one update | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
| 3 | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | matrix multiplication | $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$ | $2mnp$ |
  • Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
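
    As a minimal sketch (not from the original notes), Julia exposes many of these routines directly through the LinearAlgebra.BLAS module; the precision letter is picked automatically from the element type (e.g. Float64 dispatches to the D routines):

      using LinearAlgebra
      x, y = randn(5), randn(5)
      # level-1 dot product: α ← xᵀ y
      α = LinearAlgebra.BLAS.dot(x, y)
      # level-1 axpy: y ← y + 2.5 x, overwriting y in place
      LinearAlgebra.BLAS.axpy!(2.5, x, y)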

Examples

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

Some operations appear to be level-3 but are in fact level-2 or even level-1.

Example 1. A common operation in statistics is column scaling or row scaling $$ \begin{eqnarray*} \mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\ \mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)}, \end{eqnarray*} $$ where $\mathbf{D}$ is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form
$$ \mathbf{X}^T \mathbf{W} \mathbf{X}, $$ where $\mathbf{W}$ is a diagonal matrix with the observation weights on its diagonal.

Column and row scalings are essentially level-2 operations!

In [2]:
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(123) # seed
n = 2000
A = rand(n, n) # n-by-n matrix
d = rand(n)  # n vector
D = Diagonal(d) # diagonal matrix with d as diagonal
Out[2]:
2000×2000 Diagonal{Float64,Array{Float64,1}}:
 0.140972   ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅        0.143596   ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅        0.612494   ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅        0.0480573      ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
 ⋮                                        ⋱                      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅            0.882415   ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅        0.450904   ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅        0.814614
In [3]:
Dfull = convert(Matrix, D) # convert to full matrix
Out[3]:
2000×2000 Array{Float64,2}:
 0.140972  0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.143596  0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.612494  0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0480573     0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 ⋮                                        ⋱                      
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.882415  0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.450904  0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.814614
In [4]:
# this is calling BLAS routine for matrix multiplication: O(n^3) flops
# this is SLOW!
@benchmark $A * $Dfull
Out[4]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     97.924 ms (0.00% GC)
  median time:      103.736 ms (0.00% GC)
  mean time:        104.157 ms (2.15% GC)
  maximum time:     132.963 ms (6.63% GC)
  --------------
  samples:          48
  evals/sample:     1
In [5]:
# dispatch to special method for diagonal matrix multiplication.
# columnwise scaling: O(n^2) flops
@benchmark $A * $D
Out[5]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  4
  --------------
  minimum time:     9.124 ms (0.00% GC)
  median time:      9.527 ms (0.00% GC)
  mean time:        11.734 ms (19.45% GC)
  maximum time:     18.956 ms (36.41% GC)
  --------------
  samples:          426
  evals/sample:     1
In [6]:
# in-place: avoids allocating space for the result
# rmul!: compute matrix-matrix product AB, overwriting A, and return the result.
@benchmark rmul!($A, $D)
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     4.819 ms (0.00% GC)
  median time:      9.262 ms (0.00% GC)
  mean time:        9.966 ms (0.00% GC)
  maximum time:     19.283 ms (0.00% GC)
  --------------
  samples:          502
  evals/sample:     1
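
Row scaling is analogous. A minimal sketch (assuming the same A, d, and D as above; not part of the original benchmarks): lmul!(D, A) scales the rows of A in place, and broadcasting gives equivalent O(n^2) alternatives for both scalings.

    # row scaling D * A in place: O(n^2) flops, no O(n^3) matrix multiplication
    lmul!(D, A)
    # broadcasting alternatives, also O(n^2) flops (each allocates a result matrix)
    rowscaled = d .* A             # same as D * A
    colscaled = A .* transpose(d)  # same as A * D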

Note: In R or Matlab, diag(d) creates a full matrix. Be cautious when using the diag function: do we really need a full diagonal matrix?

In [7]:
using RCall

R"""
d <- runif(5)
diag(d)
"""
Out[7]:
RObject{RealSxp}
          [,1]     [,2]      [,3]      [,4]      [,5]
[1,] 0.3255548 0.000000 0.0000000 0.0000000 0.0000000
[2,] 0.0000000 0.924516 0.0000000 0.0000000 0.0000000
[3,] 0.0000000 0.000000 0.5533861 0.0000000 0.0000000
[4,] 0.0000000 0.000000 0.0000000 0.3606721 0.0000000
[5,] 0.0000000 0.000000 0.0000000 0.0000000 0.1970953
In [8]:
using MATLAB

mat"""
d = rand(5, 1)
diag(d)
"""
>> >> >> 
d =

    0.8147
    0.9058
    0.1270
    0.9134
    0.6324

Out[8]:
5×5 Array{Float64,2}:
 0.814724  0.0       0.0       0.0       0.0
 0.0       0.905792  0.0       0.0       0.0
 0.0       0.0       0.126987  0.0       0.0
 0.0       0.0       0.0       0.913376  0.0
 0.0       0.0       0.0       0.0       0.632359

Example 2. The inner product between two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$ is often written as $$ \text{trace}(\mathbf{A}^T \mathbf{B}), \text{trace}(\mathbf{B} \mathbf{A}^T), \text{trace}(\mathbf{A} \mathbf{B}^T), \text{ or } \text{trace}(\mathbf{B}^T \mathbf{A}). $$ These appear to be level-3 operations (matrix multiplication with $O(m^2n)$ or $O(mn^2)$ flops).

In [9]:
Random.seed!(123)
n = 2000
A, B = randn(n, n), randn(n, n)

# slow way to evaluate this thing
@benchmark tr(transpose($A) * $B)
Out[9]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     106.165 ms (0.00% GC)
  median time:      117.704 ms (0.00% GC)
  mean time:        123.716 ms (2.00% GC)
  maximum time:     146.697 ms (5.48% GC)
  --------------
  samples:          41
  evals/sample:     1

But $\text{trace}(\mathbf{A}^T \mathbf{B}) = \langle \text{vec}(\mathbf{A}), \text{vec}(\mathbf{B}) \rangle$. The latter is a level-1 operation (a dot product) with only $O(mn)$ flops.

In [10]:
@benchmark dot($A, $B)
Out[10]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.125 ms (0.00% GC)
  median time:      2.509 ms (0.00% GC)
  mean time:        2.622 ms (0.00% GC)
  maximum time:     4.392 ms (0.00% GC)
  --------------
  samples:          1901
  evals/sample:     1
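
As a quick sanity check (a sketch, not part of the original benchmarks), the two expressions agree up to floating-point roundoff, since dot applied to matrices sums the elementwise products:

    tr(transpose(A) * B) ≈ dot(A, B)  # true up to roundoff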

Example 3. Similarly $\text{diag}(\mathbf{A}^T \mathbf{B})$ can be calculated in $O(mn)$ flops.

In [11]:
# slow way to evaluate this thing
@benchmark diag(transpose($A) * $B)
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  3
  --------------
  minimum time:     103.975 ms (0.00% GC)
  median time:      113.334 ms (0.00% GC)
  mean time:        115.301 ms (2.16% GC)
  maximum time:     156.616 ms (0.00% GC)
  --------------
  samples:          44
  evals/sample:     1
In [12]:
# smarter
@benchmark Diagonal(vec(sum($A .* $B, dims=1)))
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  6
  --------------
  minimum time:     9.216 ms (0.00% GC)
  median time:      9.344 ms (0.00% GC)
  mean time:        11.637 ms (19.71% GC)
  maximum time:     18.415 ms (43.65% GC)
  --------------
  samples:          430
  evals/sample:     1

To get rid of the intermediate array allocation altogether, we can write a double loop or apply the dot function column by column.

In [13]:
using LoopVectorization

function diag_matmul!(d, A, B)
    m, n = size(A)
    @assert size(B) == (m, n) "A and B should have same size"
    fill!(d, 0)
    @avx for j in 1:n, i in 1:m
        d[j] += A[i, j] * B[i, j]
    end
#     for j in 1:n
#         @views d[j] = dot(A[:, j], B[:, j])
#     end
    Diagonal(d)
end

d = zeros(eltype(A), size(A, 2))
@benchmark diag_matmul!($d, $A, $B)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     3.292 ms (0.00% GC)
  median time:      3.569 ms (0.00% GC)
  mean time:        3.578 ms (0.00% GC)
  maximum time:     6.176 ms (0.00% GC)
  --------------
  samples:          1395
  evals/sample:     1

Memory hierarchy and level-3 fraction

The key to high performance is effective use of the memory hierarchy. This is true on all architectures.

  • Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.

  • Numbers everyone should know:

| Operation | Time |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| Read 1 MB sequentially from memory | 250,000 ns |
| Read 1 MB sequentially from SSD | 1,000,000 ns |
| Read 1 MB sequentially from disk | 20,000,000 ns |

Source: https://gist.github.com/jboner/2841832

  • For example, the Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a maximum memory bandwidth of only 32 GB/s.

  • Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?
    Answer: use high-level BLAS as much as possible.

| BLAS | Dimension | Mem. Refs. | Flops | Ratio |
|---|---|---|---|---|
| Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |
| Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$ | $2n^2$ | 1:2 |
| Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{n \times n}$ | $4n^2$ | $2n^3$ | 2:n |
  • Higher-level BLAS (level 3 or 2) makes more effective use of the arithmetic logic units (ALUs) by keeping them busy (the surface-to-volume effect).
    See the Dongarra slides.

  • A distinction between LAPACK and LINPACK (older versions of R used LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart partitioning) to increase the so-called level-3 fraction.

  • To appreciate the efforts behind an optimized BLAS implementation such as OpenBLAS (which evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:

Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.
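
A small sketch (an aside, not from the original notes) of how to check which BLAS Julia is linked against and how many threads it uses; BLAS.vendor() is available in this Julia version, though later releases changed this interface:

    using LinearAlgebra
    LinearAlgebra.BLAS.vendor()           # :openblas64 for a default Julia build
    LinearAlgebra.BLAS.set_num_threads(4) # control the number of BLAS threads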

Effect of data layout

  • Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data in memory than to retrieve/write scattered data.

  • Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).

  • A cache line is the minimum unit of data that can be transferred between cache and memory.

    • x86 CPUs: 64 bytes
    • ARM CPUs: 32 bytes

  • Accessing a column-major stored matrix by rows ($ij$ looping) causes lots of cache misses; see the short summation sketch after this list for a simple illustration.

  • Take matrix multiplication as an example $$ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. $$ Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops.

    • jki or kji looping:
      # inner most loop
        for i = 1:m
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
    • ikj or kij looping:
      # inner most loop        
        for j = 1:n
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
    • ijk or jik looping:
      # inner most loop        
        for k = 1:p
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
  • We pay attention to the innermost loop, where the vector calculation occurs. The associated stride when accessing the three matrices in memory (assuming column-major storage) is:

| Variant | A Stride | B Stride | C Stride |
|---|---|---|---|
| $jki$ or $kji$ | Unit | 0 | Unit |
| $ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |
| $ijk$ or $jik$ | Non-Unit | Unit | 0 |

Clearly, the variants $jki$ or $kji$ are preferred.
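
Before the full matrix-multiplication comparison below, here is a minimal sketch (not from the original notebook) of the same stride effect on a simpler task: summing a column-major matrix down columns (unit stride) versus across rows (stride equal to the number of rows). The names sum_by_column and sum_by_row are hypothetical helpers for illustration:

    function sum_by_column(A)   # innermost loop walks down a column: unit stride
        s = zero(eltype(A))
        @inbounds for j in axes(A, 2), i in axes(A, 1)
            s += A[i, j]
        end
        s
    end

    function sum_by_row(A)      # innermost loop jumps across a row: stride size(A, 1)
        s = zero(eltype(A))
        @inbounds for i in axes(A, 1), j in axes(A, 2)
            s += A[i, j]
        end
        s
    end

    # @btime sum_by_column($A)  # expected to be noticeably faster on a large A
    # @btime sum_by_row($A)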

In [14]:
"""
    matmul_by_loop!(A, B, C, order)

Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.
"""
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    fill!(C, 0)
    
    if order == "jki"
        @inbounds for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        @inbounds for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ikj"
        @inbounds for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        @inbounds for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ijk"
        @inbounds for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "jik"
        @inbounds for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
end

using Random

Random.seed!(123)
m, p, n = 2000, 100, 2000
A = rand(m, p)
B = rand(p, n)
C = zeros(m, n);
  • $jki$ and $kji$ looping:
In [15]:
using BenchmarkTools

@benchmark matmul_by_loop!($A, $B, $C, "jki")
Out[15]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.909 ms (0.00% GC)
  median time:      72.492 ms (0.00% GC)
  mean time:        72.507 ms (0.00% GC)
  maximum time:     81.386 ms (0.00% GC)
  --------------
  samples:          70
  evals/sample:     1
In [16]:
@benchmark matmul_by_loop!($A, $B, $C, "kji")
Out[16]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     385.850 ms (0.00% GC)
  median time:      398.189 ms (0.00% GC)
  mean time:        407.959 ms (0.00% GC)
  maximum time:     450.873 ms (0.00% GC)
  --------------
  samples:          13
  evals/sample:     1
  • $ikj$ and $kij$ looping:
In [17]:
@benchmark matmul_by_loop!($A, $B, $C, "ikj")
Out[17]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.589 s (0.00% GC)
  median time:      2.590 s (0.00% GC)
  mean time:        2.590 s (0.00% GC)
  maximum time:     2.590 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
In [18]:
@benchmark matmul_by_loop!($A, $B, $C, "kij")
Out[18]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.745 s (0.00% GC)
  median time:      2.751 s (0.00% GC)
  mean time:        2.751 s (0.00% GC)
  maximum time:     2.758 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
  • $ijk$ and $jik$ looping:
In [19]:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
Out[19]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     464.136 ms (0.00% GC)
  median time:      504.428 ms (0.00% GC)
  mean time:        500.509 ms (0.00% GC)
  maximum time:     541.258 ms (0.00% GC)
  --------------
  samples:          10
  evals/sample:     1
In [20]:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
Out[20]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     463.230 ms (0.00% GC)
  median time:      474.855 ms (0.00% GC)
  mean time:        477.178 ms (0.00% GC)
  maximum time:     491.306 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1
  • Julia wraps the BLAS library for matrix multiplication. We see the BLAS library wins hands down (multi-threading, SIMD kernels, and a higher level-3 fraction via blocked outer products).
In [21]:
@benchmark mul!($C, $A, $B)
Out[21]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.255 ms (0.00% GC)
  median time:      7.418 ms (0.00% GC)
  mean time:        7.578 ms (0.00% GC)
  maximum time:     11.688 ms (0.00% GC)
  --------------
  samples:          659
  evals/sample:     1
In [22]:
# direct call of BLAS wrapper function
@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)
Out[22]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.265 ms (0.00% GC)
  median time:      7.516 ms (0.00% GC)
  mean time:        7.681 ms (0.00% GC)
  maximum time:     12.464 ms (0.00% GC)
  --------------
  samples:          651
  evals/sample:     1

Exercise: Annotate the triple loops in matmul_by_loop! with @avx and benchmark again.

BLAS in R

  • Tip for R users: the standard R distribution from CRAN uses a very outdated BLAS/LAPACK library.
In [23]:
using RCall

R"""
library(dplyr)
library(bench)
bench::mark($A %*% $B) %>%
  print(width = Inf)
""";
┌ Warning: RCall.jl: 
│ Attaching package: ‘dplyr’
│ 
│ The following objects are masked from ‘package:stats’:
│ 
│     filter, lag
│ 
│ The following objects are masked from ‘package:base’:
│ 
│     intersect, setdiff, setequal, union
│ 
└ @ RCall /Users/huazhou/.julia/packages/RCall/g7dhB/src/io.jl:113
# A tibble: 1 x 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 `#JL`$A %*% `#JL`$B    245ms    276ms      3.62    30.5MB     3.62     2     2
  total_time result                       memory           time    
    <bch:tm> <list>                       <list>           <list>  
1      552ms <dbl[,2000] [2,000 × 2,000]> <df[,3] [1 × 3]> <bch:tm>
  gc              
  <list>          
1 <tibble [2 × 3]>
┌ Warning: RCall.jl: Warning: Some expressions had a GC in every iteration; so filtering is disabled.
└ @ RCall /Users/huazhou/.julia/packages/RCall/g7dhB/src/io.jl:113
  • Re-building R from source against OpenBLAS or MKL will immediately boost linear algebra performance in R. Google "build R with MKL" to get started. Similarly, we can build Julia with MKL. (The snippet after the Matlab timing below shows one way to check which BLAS/LAPACK the current R installation links against.)

  • Matlab uses MKL. Usually it's very hard to beat Matlab in terms of linear algebra.

In [24]:
using MATLAB

mat"""
f = @() $A * $B;
timeit(f)
"""
Out[24]:
0.008039132633
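
As an aside (a sketch; recent R releases report this information via sessionInfo()), one can check which BLAS/LAPACK shared libraries the current R installation is linked against:

    R"""
    sessionInfo()   # the printout includes the BLAS and LAPACK library paths
    """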

Avoid memory allocation: some examples

  1. Transposing a matrix is an expensive memory operation.
    • In R, the command
      t(A) %*% x
      
      will first transpose A and then perform the matrix multiplication, causing an unnecessary memory allocation.
    • Julia is smart enough to avoid transposing the matrix when possible.
In [25]:
using Random, LinearAlgebra, BenchmarkTools
Random.seed!(123)

n = 1000
A = rand(n, n)
x = rand(n);
In [26]:
typeof(transpose(A))
Out[26]:
Transpose{Float64,Array{Float64,2}}
In [27]:
fieldnames(typeof(transpose(A)))
Out[27]:
(:parent,)
In [28]:
# same data in transpose(A) and the original matrix A
pointer(transpose(A).parent), pointer(A)
Out[28]:
(Ptr{Float64} @0x000000014732e000, Ptr{Float64} @0x000000014732e000)
In [29]:
# dispatch to BLAS
# does *not* actually transpose the matrix
@benchmark transpose($A) * $x
Out[29]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     85.293 μs (0.00% GC)
  median time:      120.712 μs (0.00% GC)
  mean time:        122.611 μs (0.00% GC)
  maximum time:     378.357 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [30]:
# pre-allocate result
out = zeros(size(A, 2))
@benchmark mul!($out, transpose($A), $x)
Out[30]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     81.323 μs (0.00% GC)
  median time:      115.555 μs (0.00% GC)
  mean time:        117.869 μs (0.00% GC)
  maximum time:     369.209 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [31]:
# or call BLAS wrapper directly
@benchmark LinearAlgebra.BLAS.gemv!('T', 1.0, $A, $x, 0.0, $out)
Out[31]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     86.052 μs (0.00% GC)
  median time:      116.507 μs (0.00% GC)
  mean time:        128.828 μs (0.00% GC)
  maximum time:     420.352 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  2. Broadcasting in Julia achieves vectorized code without creating intermediate arrays.

    Suppose we want to calculate the elementwise maximum of the absolute values of two large arrays. In R or Matlab, the command

    max(abs(X), abs(Y))
    

    will create two intermediate arrays and then one result array.

In [32]:
using RCall
Random.seed!(123)
X, Y = rand(1000, 1000), rand(1000, 1000)

R"""
library(dplyr)
library(bench)
bench::mark(max(abs($X), abs($Y))) %>%
  print(width = Inf)
""";
# A tibble: 1 x 13
  expression                           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 max(abs(`#JL`$X), abs(`#JL`$Y))   6.13ms   6.58ms      137.    15.3MB     76.4
  n_itr  n_gc total_time result    memory           time     gc               
  <int> <dbl>   <bch:tm> <list>    <list>           <list>   <list>           
1    36    20      262ms <dbl [1]> <df[,3] [2 × 3]> <bch:tm> <tibble [56 × 3]>

In Julia, dot operations are fused so no intermediate arrays are created.

In [33]:
# no intermediate arrays created, only result array created
@benchmark max.(abs.($X), abs.($Y))
Out[33]:
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.259 ms (0.00% GC)
  median time:      1.679 ms (0.00% GC)
  mean time:        2.095 ms (18.42% GC)
  maximum time:     10.123 ms (61.45% GC)
  --------------
  samples:          2382
  evals/sample:     1

Pre-allocating the result array gets rid of memory allocation altogether.

In [34]:
# no memory allocation at all!
Z = zeros(size(X)) # zero matrix of same size as X
@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!
Out[34]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.193 ms (0.00% GC)
  median time:      1.295 ms (0.00% GC)
  mean time:        1.335 ms (0.00% GC)
  maximum time:     2.113 ms (0.00% GC)
  --------------
  samples:          3726
  evals/sample:     1
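
Equivalently (an aside, not in the original notebook), the @. macro turns every call and assignment in an expression into a fused dot call, which is less error-prone than adding the dots by hand:

    @. Z = max(abs(X), abs(Y))   # same fused, allocation-free update as above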
  3. Views avoid creating an extra copy of matrix data.
In [35]:
Random.seed!(123)
A = randn(1000, 1000)

# sum entries in a sub-matrix
@benchmark sum($A[1:2:500, 1:2:500])
Out[35]:
BenchmarkTools.Trial: 
  memory estimate:  488.39 KiB
  allocs estimate:  2
  --------------
  minimum time:     63.992 μs (0.00% GC)
  median time:      239.525 μs (0.00% GC)
  mean time:        263.412 μs (11.57% GC)
  maximum time:     9.008 ms (96.22% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [36]:
# view avoids creating a separate sub-matrix
@benchmark sum(@view $A[1:2:500, 1:2:500])
Out[36]:
BenchmarkTools.Trial: 
  memory estimate:  80 bytes
  allocs estimate:  1
  --------------
  minimum time:     167.038 μs (0.00% GC)
  median time:      181.709 μs (0.00% GC)
  mean time:        186.781 μs (0.00% GC)
  maximum time:     384.111 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The @views macro can be useful in such operations.

In [37]:
@benchmark @views sum($A[1:2:500, 1:2:500])
Out[37]:
BenchmarkTools.Trial: 
  memory estimate:  80 bytes
  allocs estimate:  1
  --------------
  minimum time:     167.040 μs (0.00% GC)
  median time:      181.641 μs (0.00% GC)
  mean time:        184.829 μs (0.00% GC)
  maximum time:     399.530 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
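
Views also compose with in-place broadcasted assignment, so a sub-block can be updated without allocating a temporary sub-matrix. A minimal sketch:

    # with @views, both the destination and the right-hand side are views,
    # so no copy of the sub-block is allocated
    @views A[1:2:500, 1:2:500] .*= 2.0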