An experimental package to lower and execute Julia tensor operations to the IREE compiler stack using MLIR.
Coil exports only one function, Coil.compile(f), which returns a function that leverages MLIR and the IREE compiler stack to produce a (hopefully) faster version of f. The goals are the following:
- Perform whole model analysis and optimizations to fuse and re-order operations across function calls.
- Fold model hyperparameters by unrolling loops, control flow, etc.
- Evaluate on different hardware accelerators using the IREE runtime.
Note: Coil currently does not meet any of these goals and is also a way for me to learn about MLIR and IREE.
Coil.compile should return an equivalent and hopefully faster function. Note that, as with Julia itself, the function is compiled when it is first called.
julia> using Coil, Flux
julia> dense = Dense(3, 6, relu)
Dense(3 => 6, relu) # 24 parameters
julia> compiled_dense = Coil.compile(dense)
#23 (generic function with 1 method)
julia> x = randn(Float32,3,1);
julia> compiled_dense(x)
2.7212882f0
julia> compiled_dense(x)
2.7212882f0
julia> dense(x)
2.7212882f0
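The compile-on-first-call behaviour can be pictured with a plain-Julia sketch. The names `lazy_compile` and `compile_count` are hypothetical and not part of Coil, the "compile" step is a placeholder that returns the original function, and keying the cache on argument types is an assumption made for illustration (Coil itself traces with runtime values).

```julia
# Hypothetical illustration only: this is not Coil's implementation.
# `lazy_compile` defers an expensive "compile" step until the first call
# for a given tuple of argument types, then reuses the cached result.
function lazy_compile(f; compile_count = Ref(0))
    cache = Dict{Any,Any}()
    function wrapped(args...)
        key = map(typeof, args)
        compiled = get!(cache, key) do
            compile_count[] += 1   # stands in for tracing and lowering
            f                      # a real compiler would return new code here
        end
        return compiled(args...)
    end
    return wrapped
end

counter = Ref(0)
g = lazy_compile(x -> sum(abs2, x); compile_count = counter)
g(Float32[1, 2, 3])   # first call runs the "compile" step
g(Float32[4, 5, 6])   # same argument types: the cached version is reused
```

After both calls, `counter[]` is still 1, which is the pattern to expect from a compiled function: the first call pays the compilation cost and later calls with compatible arguments do not.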
Other niceties include the @code_mlir and @code_linalg macros.
julia> using Coil
julia> f(x) = sum(exp, x)
f (generic function with 1 method)
julia> @code_mlir f(Float32[1., 2., 3.])
MModule:
module {
func.func @f(%arg0: tensor<3xf32>) -> f32 {
%cst = arith.constant dense<0.000000e+00> : tensor<f32>
%reduced = linalg.reduce ins(%arg0 : tensor<3xf32>) outs(%cst : tensor<f32>) dimensions = [0]
(%in: f32, %init: f32) {
%1 = math.exp %in : f32
%2 = arith.addf %1, %init : f32
linalg.yield %2 : f32
}
%c0 = arith.constant 0 : index
%0 = stablehlo.reshape %reduced : (tensor<f32>) -> tensor<1xf32>
%extracted = tensor.extract %0[%c0] : tensor<1xf32>
return %extracted : f32
}
}
To trace functions, Coil leverages Umlaut.jl, which converts functions to linearized tapes. It then replaces the lowerable calls on this tape with MLIR operations. Since not all Julia calls can be replaced with MLIR operations (struct code, IO, etc.), the transformation produces a new tape where only tensor and arithmetic operations are lifted to MLIR dialects.
Consider this input tape of a Flux.Dense layer with bias and a relu activation:
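The idea of lifting only part of a tape can be sketched in plain Julia. This is a toy model, not Coil's or Umlaut.jl's data structures: the tape is reduced to a list of operation names, `split_tape` and its `lowerable` set are hypothetical, and the sketch assumes that no kept operation depends on the result of a lifted one, so the lifted operations can all be grouped at the end.

```julia
# Hypothetical sketch: a tape is modelled as a list of operation names,
# and `lowerable` lists the ones a tensor backend is assumed to express.
function split_tape(ops::Vector{String};
                    lowerable = Set(["*", "+", "broadcasted", "materialize"]))
    kept = filter(op -> !(op in lowerable), ops)    # stays as Julia code
    lifted = filter(op -> op in lowerable, ops)     # moves into one MLIR function
    # Replace all lifted operations with a single trailing "Call" entry.
    new_tape = isempty(lifted) ? kept : vcat(kept, "Call($(length(lifted)) ops)")
    return new_tape, lifted
end

new_tape, lifted = split_tape(
    ["getproperty", "*", "getproperty", "broadcasted", "broadcasted", "materialize"],
)
```

Here `new_tape` keeps the two `getproperty` entries and ends with one `Call` entry standing for the four lifted operations.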
julia> import Coil.Tracing
julia> dense = Dense(3, 6, relu)
Dense(3 => 6, relu) # 24 parameters
julia> x = randn(Float32,3,1);
julia> _, tape = Tracing.trace(dense, x; ctx=Tracing.Context(dense));
julia> tape
inp %1::Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}
inp %2::Matrix{Float32}
const %3 = fast_act::typeof(NNlib.fast_act)
%4 = getproperty(%1, :σ)::typeof(relu)
const %5 = nothing::Nothing
const %6 = +::typeof(+)
%7 = getproperty(%1, :weight)::Matrix{Float32}
%8 = *(%7, %2)::Matrix{Float32}
%9 = getproperty(%1, :bias)::Vector{Float32}
%10 = broadcasted(%6, %8, %9)::Broadcasted{}
%11 = broadcasted(%4, %10)::Broadcasted{}
%12 = materialize(%11)::Matrix{Float32}
julia> Tracing.compile_tape(tape, x; verbose=true)
[...]
Tape{Coil.Tracing.Context}
inp %1::Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}
inp %2::Matrix{Float32}
%3 = getproperty(%1, :weight)::Matrix{Float32}
%4 = getproperty(%1, :bias)::Vector{Float32}
%5 = Call(#= 3 => 1 =#)(%3, %2, %4)::Matrix{Float32}
where the Call struct calls into the following generated MLIR function:
julia> Coil.@code_mlir dense(x)
MModule:
module {
func.func @Dense(%arg0: tensor<6x3xf32>, %arg1: tensor<3x1xf32>, %arg2: tensor<6xf32>) -> tensor<6x1xf32> {
%c1_i64 = arith.constant 1 : i64
%c1_i64_0 = arith.constant 1 : i64
%0 = arith.addi %c1_i64, %c1_i64_0 : i64
%c2_i64 = arith.constant 2 : i64
%c1_i64_1 = arith.constant 1 : i64
%1 = arith.addi %c2_i64, %c1_i64_1 : i64
%2 = stablehlo.dot %arg0, %arg1 : (tensor<6x3xf32>, tensor<3x1xf32>) -> tensor<6x1xf32>
%3 = stablehlo.broadcast_in_dim %arg2, dims = [0] : (tensor<6xf32>) -> tensor<6x1xf32>
%4 = stablehlo.add %2, %3 : tensor<6x1xf32>
%cst = arith.constant dense<0.000000e+00> : tensor<6x1xf32>
%5 = arith.maximumf %4, %cst : tensor<6x1xf32>
return %5 : tensor<6x1xf32>
}
}
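As a reading aid, the body of the generated function can be transcribed into plain Julia. `dense_reference` is a hypothetical name and this is only a sketch of what the listing computes, not code that Coil emits; the two `arith.addi` results are unused in the listing and are left out.

```julia
# Plain-Julia reading of the generated @Dense function (illustrative only).
# Argument names mirror the MLIR block arguments.
function dense_reference(arg0::Matrix{Float32}, arg1::Matrix{Float32}, arg2::Vector{Float32})
    y = arg0 * arg1                    # stablehlo.dot: (6x3) * (3x1) -> 6x1
    b = repeat(arg2, 1, size(y, 2))    # stablehlo.broadcast_in_dim: 6 -> 6x1
    z = y + b                          # stablehlo.add
    return max.(z, 0f0)                # arith.maximumf against a zero constant (relu)
end

W = Float32[1 0 0; 0 1 0; 0 0 1; 1 1 1; -1 -1 -1; 0 0 0]
x = reshape(Float32[1, 2, 3], 3, 1)
b = Float32[0, 0, 0, 1, 1, -1]
out = dense_reference(W, x, b)   # 6x1 Matrix{Float32}
```

Note that `dims = [0]` in the MLIR listing is zero-based, so the bias is laid out along the first (row) dimension of the result.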
Due to its use of Umlaut.jl, control flow in the input function is resolved at trace time, using the first arguments the function is called with. This means that loops are unrolled and only the branches taken for those arguments are recorded on the linear tape.
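The effect of tracing on control flow can be reproduced with a toy tracer in plain Julia. Everything here is hypothetical (`Traced`, `step_n`, the string-based tape) and far simpler than Umlaut.jl; it only shows why a tape recorded from concrete values contains no loops or branches.

```julia
# Hypothetical toy tracer (not Umlaut.jl or Coil): each arithmetic call is
# recorded on a flat tape while the function runs on concrete values.
struct Traced
    value::Float64
    tape::Vector{String}
end

function Base.:+(a::Traced, b::Real)
    push!(a.tape, "add $(b)")
    return Traced(a.value + b, a.tape)
end

function Base.:*(a::Traced, b::Real)
    push!(a.tape, "mul $(b)")
    return Traced(a.value * b, a.tape)
end

# Comparisons use the concrete value and are not recorded.
Base.:>(a::Traced, b::Real) = a.value > b

# A function whose loop count and branch depend on runtime values.
function step_n(x, n)
    for _ in 1:n
        x = x + 1
    end
    return x > 2 ? x * 10 : x * 0.5
end

tape = String[]
step_n(Traced(0.0, tape), 3)
# `tape` now holds three "add 1" entries followed by "mul 10":
# the loop is unrolled and only the taken branch is recorded.
```

A tape recorded with `n = 3` is not valid for `n = 1`, which takes one loop iteration and the other branch. The same caveat presumably applies to any function whose control flow depends on its arguments.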
To build IREE to be used as a shared library callable from Julia, you need to use a custom fork:
git clone https://github.com/Pangoraw/iree
cd iree
git checkout build_coil2
git submodule update --init
cmake -GNinja -B ../iree-build/ -S . \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DIREE_ENABLE_ASSERTIONS=ON \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DIREE_HAL_DRIVER_VULKAN=on \
-DIREE_TARGET_BACKEND_VULKAN_SPIRV=on \
-DIREE_ENABLE_LLD=ON
cmake --build ../iree-build --target iree_runtime_runtime_shared
This will build the runtime library in the iree-build/ folder. The runtime library (lib_runtime_shared_shared) contains the bytecode interpreter and hardware drivers to run IREE programs.
The compiler library (libIREECompiler), which contains MLIR and IREE-specific passes, is downloaded from the official release artifacts (Linux x86_64 glibc only) when the package is instantiated.
Later, these libraries will be provided as _jll packages built using BinaryBuilder.jl.
This package is tested only on the Julia 1.9 release, so a special version of CompilerPluginTools.jl should be installed (see CompilerPluginTools.jl#9):
(Coil) pkg> add https://github.com/JuliaCompilerPlugins/CompilerPluginTools.jl#roger/fix-1.9
- ONNX.jl - Coil takes a very similar approach to ONNX.jl but lowers to MLIR modules instead of ONNX operations.
- XLA.jl - XLA.jl lowers from Julia IR down to XLA HLO and can execute on TPUs. Interestingly, its tensor shape inference is embedded in Julia's type system, whereas Coil uses the runtime values collected during tracing.