Writing high(ish) performance code with Clifford and Numba via Numpy

This document describes how to take algorithms developed in the clifford package, with notation that is close to the mathematics, and convert them into numerically efficient, fast-running code. To do this we will expose the underlying representation of a multivector as a numpy array of canonical basis blade coefficients and operate directly on these arrays in a manner that is conducive to JIT compilation with numba.

First, import the clifford library as well as numpy and numba:

In [1]:
import clifford as cf
import numpy as np
import numba

Choose a specific space

For this document we will use 3D Euclidean space embedded in the conformal framework, giving a Cl(4,1) algebra. We will also rename some of the variables to match the notation used by Lasenby et al. in “A Covariant Approach to Geometry using Geometric Algebra”.

In [2]:
from clifford import g3c
# Get the layout in our local namespace etc etc
layout = g3c.layout
locals().update(g3c.blades)

ep, en, up, down, homo, E0, ninf, no = (g3c.stuff["ep"], g3c.stuff["en"],
                                        g3c.stuff["up"], g3c.stuff["down"], g3c.stuff["homo"],
                                        g3c.stuff["E0"], g3c.stuff["einf"], -g3c.stuff["eo"])
# Define a few useful terms
E = ninf^(no)
I5 = e12345
I3 = e123

Performance of mathematically idiomatic Clifford algorithms

By default the Clifford library sacrifices performance for syntactic convenience.

Consider a function that applies a rotor to a multivector:

In [3]:
def apply_rotor(R,mv):
    return R*mv*~R

We will define a rotor that takes one line to another:

In [4]:
line_one = (up(0)^up(e1)^ninf).normal()
line_two = (up(0)^up(e2)^ninf).normal()
R = 1 + line_two*line_one

Check that this works

In [5]:
print(line_two)
print(apply_rotor(R,line_one).normal())
(1.0^e245)
(1.0^e245)

We would like to improve the speed of our algorithm, so first we will profile it and see where it spends its time:

In [6]:
#%%prun -s cumtime
#for i in range(1000000):
#    apply_rotor(R,line_one)

An example profile output from running this notebook on the author’s laptop is as follows:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000   66.290   66.290 {built-in method builtins.exec}
      1    0.757    0.757   66.290   66.290 <string>:2(<module>)
1000000    3.818    0.000   65.534    0.000 <ipython-input-13-70a01003bf51>:1(apply_rotor)
2000000    9.269    0.000   55.641    0.000 __init__.py:751(__mul__)
2000000    3.167    0.000   29.900    0.000 __init__.py:717(_checkOther)
2000000    1.371    0.000   19.906    0.000 __init__.py:420(__ne__)
2000000    6.000    0.000   18.535    0.000 numeric.py:2565(array_equal)
2000000   10.505    0.000   10.505    0.000 __init__.py:260(mv_mult)

We can see that the function spends almost all of its time in __mul__, and within __mul__ it spends most of its time in _checkOther. In fact, it only spends a small fraction of its time in mv_mult, which does the actual numerical multivector multiplication. To write more performant code we need to strip away the high-level abstractions and deal with the underlying representations of the blade component data.
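For quick checks between full profiling runs, the standard-library timeit module gives a per-call estimate with much less ceremony than %%prun. A minimal sketch of the measurement pattern, using a stand-in function since only the pattern matters here:

```python
import timeit

# Time a single callable; the stand-in below plays the role of
# apply_rotor, which needs the clifford setup to run.
def stand_in():
    return sum(range(100))

n_runs = 10000
total = timeit.timeit(stand_in, number=n_runs)
print('per call: {:.3e} s'.format(total/n_runs))
```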

Canonical blade coefficient representation in Clifford

In clifford a multivector is internally represented as a numpy array of the coefficients of the canonical basis blades, arranged in order of grade. So for our Cl(4,1) algebra the first element is the scalar part, the next 5 are the vector coefficients, the next 10 are the bivectors, the next 10 the trivectors, the next 5 the quadvectors, and the final value is the pseudoscalar coefficient.

In [7]:
(5.0*e1 - e2 + e12 + e135 + np.pi*e1234).value
Out[7]:
array([ 0.        ,  5.        , -1.        ,  0.        ,  0.        ,
        0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        1.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  3.14159265,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ])
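The grade block sizes follow directly from binomial coefficients: for a 5-dimensional algebra the grade-g block contains comb(5, g) entries. A small sketch computing the block sizes and start offsets of the 32-element value array:

```python
from math import comb
import numpy as np

# Block sizes and cumulative start offsets of each grade's slice of
# the Cl(4,1) value array: 1 scalar, 5 vectors, 10 bivectors,
# 10 trivectors, 5 quadvectors, 1 pseudoscalar.
sizes = [comb(5, g) for g in range(6)]
offsets = np.cumsum([0] + sizes)
print(sizes)             # [1, 5, 10, 10, 5, 1]
print(offsets.tolist())  # [0, 1, 6, 16, 26, 31, 32]
```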

Exploiting blade representation to write a fast function

We can rewrite our rotor application function using the functions that the layout exposes for operations on the numpy arrays themselves.

In [8]:
def apply_rotor_faster(R,mv):
    return cf.MultiVector( layout, layout.gmt_func(R.value,layout.gmt_func(mv.value,layout.adjoint_func(R.value))) )
In [9]:
#%%prun -s cumtime
#for i in range(1000000):
#    apply_rotor_faster(R,line_one)

This gives a much faster function

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000   19.567   19.567 {built-in method builtins.exec}
      1    0.631    0.631   19.567   19.567 <string>:2(<module>)
1000000    7.373    0.000   18.936    0.000 <ipython-input-35-6a5344d83bdb>:1(apply_rotor_faster)
2000000    9.125    0.000    9.125    0.000 __init__.py:260(mv_mult)
1000000    1.021    0.000    1.619    0.000 __init__.py:677(__init__)
1000000    0.819    0.000    0.819    0.000 __init__.py:244(adjoint_func)

We have successfully skipped past the higher level checks on the multivectors while maintaining exactly the same function signature. It is important to check that we still have the correct answer:

In [10]:
print(line_two)
print(apply_rotor_faster(R,line_one).normal())
(1.0^e245)
(1.0^e245)

The performance improvements gained by rewriting our function are significant, but they come at the cost of readability.

By loading the layout's gmt_func and adjoint_func into the global namespace before the function is defined, and separating the value operations from the multivector wrapper, we can make our code more concise.

In [11]:
gmt_func = layout.gmt_func
adjoint_func = layout.adjoint_func

def apply_rotor_val(R_val,mv_val):
    return gmt_func(R_val,gmt_func(mv_val,adjoint_func(R_val)))

def apply_rotor_wrapped(R,mv):
    return cf.MultiVector(layout,apply_rotor_val(R.value,mv.value))
In [12]:
#%%prun -s cumtime
#for i in range(1000000):
#    apply_rotor_wrapped(R,line_one)

The time cost is essentially the same; there is probably some minor overhead from the additional function call itself.

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000   19.621   19.621 {built-in method builtins.exec}
      1    0.557    0.557   19.621   19.621 <string>:2(<module>)
1000000    1.421    0.000   19.064    0.000 <ipython-input-38-a1e0b5c53cdc>:7(apply_rotor_wrapped)
1000000    6.079    0.000   16.033    0.000 <ipython-input-38-a1e0b5c53cdc>:4(apply_rotor_val)
2000000    9.154    0.000    9.154    0.000 __init__.py:260(mv_mult)
1000000    1.017    0.000    1.610    0.000 __init__.py:677(__init__)
1000000    0.800    0.000    0.800    0.000 __init__.py:244(adjoint_func)
In [13]:
print(line_two)
print(apply_rotor_wrapped(R,line_one).normal())
(1.0^e245)
(1.0^e245)

The additional advantage of splitting the function like this is that the numba JIT compiler can reason about the memory layout of numpy arrays in nopython mode, as long as no pure Python objects are operated upon within the function. This means we can JIT compile the function that operates on the value arrays directly.

In [14]:
@numba.njit
def apply_rotor_val_numba(R_val,mv_val):
    return gmt_func(R_val,gmt_func(mv_val,adjoint_func(R_val)))

def apply_rotor_wrapped_numba(R,mv):
    return cf.MultiVector(layout,apply_rotor_val_numba(R.value,mv.value))
In [15]:
#%%prun -s cumtime
#for i in range(1000000):
#    apply_rotor_wrapped_numba(R,line_one)

This gives a small improvement in performance, but more importantly it allows us to write larger functions that call the jitted apply_rotor_val_numba and are themselves jitted.

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.000    0.000   16.033   16.033 {built-in method builtins.exec}
       1    0.605    0.605   16.033   16.033 <string>:2(<module>)
 1000000    2.585    0.000   15.428    0.000 <ipython-input-42-1142126d93ca>:5(apply_rotor_wrapped_numba)
 1000000    8.606    0.000    8.606    0.000 <ipython-input-42-1142126d93ca>:1(apply_rotor_val_numba)
       1    0.000    0.000    2.716    2.716 dispatcher.py:294(_compile_for_args)
     7/1    0.000    0.000    2.716    2.716 dispatcher.py:554(compile)

Composing larger functions

By chaining together functions that operate on the value arrays of multivectors, it is easy to construct fast and readable code:

In [16]:
I5_val = I5.value
omt_func = layout.omt_func

def dual_mv(mv):
    return -I5*mv

def meet_unwrapped(mv_a,mv_b):
    return -dual_mv(dual_mv(mv_a)^dual_mv(mv_b))

@numba.njit
def dual_val(mv_val):
    return -gmt_func(I5_val,mv_val)

@numba.njit
def meet_val(mv_a_val,mv_b_val):
    return -dual_val( omt_func( dual_val(mv_a_val) , dual_val(mv_b_val)) )

def meet_wrapped(mv_a,mv_b):
    return cf.MultiVector(layout, meet_val(mv_a.value, mv_b.value))

sphere = (up(0)^up(e1)^up(e2)^up(e3)).normal()
print(sphere.meet(line_one).normal())
print(meet_unwrapped(sphere,line_one).normal())
print(meet_wrapped(line_one,sphere).normal())
(1.0^e14) - (1.0^e15) - (1.0^e45)
(1.0^e14) - (1.0^e15) - (1.0^e45)
(1.0^e14) - (1.0^e15) - (1.0^e45)
In [17]:
#%%prun -s cumtime
#for i in range(100000):
#    meet_unwrapped(sphere,line_one)
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   13.216   13.216 {built-in method builtins.exec}
        1    0.085    0.085   13.216   13.216 <string>:2(<module>)
   100000    0.418    0.000   13.131    0.000 <ipython-input-98-f91457c8741a>:7(meet_unwrapped)
   300000    0.681    0.000    9.893    0.000 <ipython-input-98-f91457c8741a>:4(dual_mv)
   300000    1.383    0.000    8.127    0.000 __init__.py:751(__mul__)
   400000    0.626    0.000    5.762    0.000 __init__.py:717(_checkOther)
   400000    0.270    0.000    3.815    0.000 __init__.py:420(__ne__)
   400000    1.106    0.000    3.544    0.000 numeric.py:2565(array_equal)
   100000    0.460    0.000    2.439    0.000 __init__.py:783(__xor__)
   800000    0.662    0.000    2.053    0.000 __init__.py:740(_newMV)
   400000    1.815    0.000    1.815    0.000 __init__.py:260(mv_mult)
In [18]:
#%%prun -s cumtime
#for i in range(100000):
#    meet_wrapped(sphere,line_one)
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.000    0.000    1.951    1.951 {built-in method builtins.exec}
       1    0.063    0.063    1.951    1.951 <string>:2(<module>)
  100000    0.274    0.000    1.888    0.000 <ipython-input-98-f91457c8741a>:18(meet_wrapped)
  100000    1.448    0.000    1.448    0.000 <ipython-input-98-f91457c8741a>:14(meet_val)
  100000    0.096    0.000    0.166    0.000 __init__.py:677(__init__)

Algorithms exploiting known sparseness of MultiVector value array

The standard multiplication generator function for two general multivectors is as follows:

In [19]:
def get_mult_function(mult_table,n_dims):
    '''
    Returns a function that implements the mult_table on two input multivectors
    '''
    non_zero_indices = mult_table.nonzero()
    k_list = non_zero_indices[0]
    l_list = non_zero_indices[1]
    m_list = non_zero_indices[2]
    mult_table_vals = np.array([mult_table[k,l,m] for k,l,m in np.transpose(non_zero_indices)],dtype=int)

    @numba.njit
    def mv_mult(value,other_value):
        output = np.zeros(n_dims)
        for ind,k in enumerate(k_list):
            l = l_list[ind]
            m = m_list[ind]
            output[l] += value[k]*mult_table_vals[ind]*other_value[m]
        return output
    return mv_mult
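To see what this kernel is doing, it helps to run the same pattern on a tiny algebra. The sketch below (with the numba decorator omitted so it runs standalone) builds the 2x2x2 multiplication table for the complex numbers, viewed as a two-element algebra with basis (1, i), and checks that the generated kernel reproduces complex multiplication:

```python
import numpy as np

# mult_table[k, l, m] is the coefficient of basis element l in the
# product of basis elements k and m, for the basis (1, i).
mult_table = np.zeros((2, 2, 2))
mult_table[0, 0, 0] = 1.0   # 1 * 1 = 1
mult_table[0, 1, 1] = 1.0   # 1 * i = i
mult_table[1, 1, 0] = 1.0   # i * 1 = i
mult_table[1, 0, 1] = -1.0  # i * i = -1

def get_mult_function(mult_table, n_dims):
    # Same pattern as above; @numba.njit dropped for a standalone sketch
    k_list, l_list, m_list = mult_table.nonzero()
    mult_table_vals = mult_table[k_list, l_list, m_list]

    def mv_mult(value, other_value):
        output = np.zeros(n_dims)
        for ind, k in enumerate(k_list):
            l = l_list[ind]
            m = m_list[ind]
            output[l] += value[k]*mult_table_vals[ind]*other_value[m]
        return output
    return mv_mult

complex_mult = get_mult_function(mult_table, 2)
# (1 + 2i)(3 + 4i) = -5 + 10i
print(complex_mult(np.array([1.0, 2.0]), np.array([3.0, 4.0])))  # [-5. 10.]
```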

There are, however, instances in which we might be able to use the known sparseness of the input value representations to speed up the operations. For example, in Cl(4,1) rotors only contain even-grade blades, so we can remove all the operations that access odd-grade coefficients.

In [20]:
def get_grade_from_index(index_in):
    if index_in == 0:
        return 0
    elif index_in < 6:
        return 1
    elif index_in < 16:
        return 2
    elif index_in < 26:
        return 3
    elif index_in < 31:
        return 4
    elif index_in == 31:
        return 5
    else:
        raise ValueError('Index is out of multivector bounds')

def get_sparse_mult_function(mult_table,n_dims,grades_a,grades_b):
    '''
    Returns a function that implements the mult_table on two input multivectors
    '''
    non_zero_indices = mult_table.nonzero()
    k_list = non_zero_indices[0]
    l_list = non_zero_indices[1]
    m_list = non_zero_indices[2]
    mult_table_vals = np.array([mult_table[k,l,m] for k,l,m in np.transpose(non_zero_indices)],dtype=int)

    # Now filter out the sparseness
    filter_mask = np.zeros(len(k_list), dtype=bool)
    for i in range(len(filter_mask)):
        if get_grade_from_index(k_list[i]) in grades_a:
            if get_grade_from_index(m_list[i]) in grades_b:
                filter_mask[i] = 1

    k_list = k_list[filter_mask]
    l_list = l_list[filter_mask]
    m_list = m_list[filter_mask]
    mult_table_vals = mult_table_vals[filter_mask]

    @numba.njit
    def mv_mult(value,other_value):
        output = np.zeros(n_dims)
        for ind,k in enumerate(k_list):
            l = l_list[ind]
            m = m_list[ind]
            output[l] += value[k]*mult_table_vals[ind]*other_value[m]
        return output
    return mv_mult
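The filtering step can be seen in isolation on a toy example. Below, the complex numbers serve as a hypothetical two-element algebra with basis (1, i), where index 0 plays the role of "grade 0" and index 1 of "grade 1"; if the left operand is known to be purely real, half of the table entries can be discarded before the kernel is built:

```python
import numpy as np

# Toy multiplication table for basis (1, i): mult_table[k, l, m] is the
# coefficient of element l in the product of elements k and m.
mult_table = np.zeros((2, 2, 2))
mult_table[0, 0, 0] = 1.0   # 1 * 1 = 1
mult_table[0, 1, 1] = 1.0   # 1 * i = i
mult_table[1, 1, 0] = 1.0   # i * 1 = i
mult_table[1, 0, 1] = -1.0  # i * i = -1

k_list, l_list, m_list = mult_table.nonzero()
grades_a = [0]       # left operand known to be purely real
grades_b = [0, 1]    # right operand fully general

# Keep only the entries that sparse inputs can actually reach
# (in this toy algebra the index is the grade)
filter_mask = np.array([k in grades_a and m in grades_b
                        for k, m in zip(k_list, m_list)])
print(len(k_list), int(filter_mask.sum()))  # 4 2
```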
In [21]:
left_rotor_mult = get_sparse_mult_function(layout.gmt,layout.gaDims,[0,2,4],[0,1,2,3,4,5])
right_rotor_mult = get_sparse_mult_function(layout.gmt,layout.gaDims,[0,1,2,3,4,5],[0,2,4])

@numba.njit
def sparse_apply_rotor_val(R_val,mv_val):
    return left_rotor_mult(R_val,right_rotor_mult(mv_val,adjoint_func(R_val)))

def sparse_apply_rotor(R,mv):
    return cf.MultiVector(layout,sparse_apply_rotor_val(R.value,mv.value))
In [22]:
#%%prun -s cumtime
#for i in range(1000000):
#    sparse_apply_rotor(R,line_one)
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.000    0.000    9.490    9.490 {built-in method builtins.exec}
       1    0.624    0.624    9.489    9.489 <string>:2(<module>)
 1000000    2.684    0.000    8.865    0.000 <ipython-input-146-f75aae3ce595>:8(sparse_apply_rotor)
 1000000    4.651    0.000    4.651    0.000 <ipython-input-146-f75aae3ce595>:4(sparse_apply_rotor_val)
 1000000    0.934    0.000    1.530    0.000 __init__.py:677(__init__)
 1000000    0.596    0.000    0.596    0.000 {built-in method numpy.core.multiarray.array}
In [23]:
print(line_two)
print(sparse_apply_rotor(R,line_one).normal())
(1.0^e245)
(1.0^e245)

We can do the same with the meet operation that we defined earlier, if we know what grade of objects we are meeting:

In [24]:
left_pseudo_mult = get_sparse_mult_function(layout.gmt,layout.gaDims,[5],[0,1,2,3,4,5])
sparse_omt_2_1 = get_sparse_mult_function(layout.omt,layout.gaDims,[2],[1])

@numba.njit
def dual_sparse_val(mv_val):
    return -left_pseudo_mult(I5_val,mv_val)

@numba.njit
def meet_sparse_3_4_val(mv_a_val,mv_b_val):
    return -dual_sparse_val( sparse_omt_2_1( dual_sparse_val(mv_a_val) , dual_sparse_val(mv_b_val)) )

def meet_sparse_3_4(mv_a,mv_b):
    return cf.MultiVector(layout, meet_sparse_3_4_val(mv_a.value, mv_b.value))

print(sphere.meet(line_one).normal())
print(meet_sparse_3_4(line_one,sphere).normal())
(1.0^e14) - (1.0^e15) - (1.0^e45)
(1.0^e14) - (1.0^e15) - (1.0^e45)
In [25]:
#%%prun -s cumtime
#for i in range(100000):
#    meet_sparse_3_4(line_one,sphere)
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.725    0.725 {built-in method builtins.exec}
     1    0.058    0.058    0.725    0.725 <string>:2(<module>)
100000    0.252    0.000    0.667    0.000 <ipython-input-156-f346d0563682>:12(meet_sparse_3_4)
100000    0.267    0.000    0.267    0.000 <ipython-input-156-f346d0563682>:8(meet_sparse_3_4_val)
100000    0.088    0.000    0.148    0.000 __init__.py:677(__init__)

Future work on performance

Investigate efficient operations on containers of large numbers of multivectors.

Possibly investigate http://numba.pydata.org/numba-doc/0.13/CUDAJit.html for larger algebras, or other areas in which GPU memory latency will not be such a large factor, i.e., lots of bulk parallel numerical operations.
