Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conflict between MKL_jll and libgomp in CompilerSupportLibraries_jll #700

Open
KristofferC opened this issue Feb 29, 2020 · 12 comments
Open

Comments

@KristofferC
Copy link
Member

KristofferC commented Feb 29, 2020

Today I was tracking down the test failure observed in JuliaMath/IntelVectorMath.jl#39 when moving to MKL_jll that occurred on a Mac.

After a while, with some help from @giordano, an MWE is:

# dlopen the libgomp inside CompilerSupportLibraries_jll (libgfortran 5). 
using Libdl
dlopen("/Users/kristoffercarlsson/.julia/artifacts/411c17f069dd12b3f79b19e4647949050a58223c/lib/libgomp.1.dylib")

using MKL_jll

gamma1(x) = (x = copy(x);    ccall((:vdTGamma, libmkl_rt), Cvoid, (Int32, Ptr{Float64}, Ptr{Float64}), length(x), x, x); x);
gamma2(x) = (y = similar(x); ccall((:vdTGamma, libmkl_rt), Cvoid, (Int32, Ptr{Float64}, Ptr{Float64}), length(x), x, y); y);
INPUT_2  = rand(1000);

gamma1(INPUT_2)[1:3]
gamma2(INPUT_2)[1:3]

the result from gamma1 and gamma2 should give the same answers, but they don't, as an example

julia> gamma1(INPUT_2)[1:3]
3-element Array{Float64,1}:
 Inf
 Inf
 Inf

julia> gamma2(INPUT_2)[1:3]
3-element Array{Float64,1}:
 16.209046961478283
  6.570129954697695
 41.08264424860374

(This doesn't seem to happen 100% of the time but the majority of the times).

This only happens if the length of the input vector is a certain size, cf

julia> gamma1(INPUT_2[1:602])[1:3] # ok
3-element Array{Float64,1}:
 1.0650782774656853
 1.4718294693439533
 1.1884483595544906

julia> gamma2(INPUT_2[1:602])[1:3] # ok
3-element Array{Float64,1}:
 1.0650782774656853
 1.4718294693439533
 1.1884483595544906

julia> gamma1(INPUT_2[1:603])[1:3] # bad
3-element Array{Float64,1}:
 0.9663910468243027
 0.8856476139287297
 0.9213177809196278

julia> gamma2(INPUT_2[1:603])[1:3] # ok
3-element Array{Float64,1}:
 1.0650782774656853
 1.4718294693439533
 1.1884483595544906

The cutoff seems to be when the length of the vector is 603. My guess is that MKL thinks this is a long enough vector to perhaps enable threading? Also, it is important that libgomp is loaded before MKL_jll for the bug to occur.

The versions used are:

  [e66e0078] CompilerSupportLibraries_jll v0.2.0+1
  [856f044c] MKL_jll v2020.0.166+0

and

julia> versioninfo()
Julia Version 1.4.0-rc2.0
Commit b99ed72c95 (2020-02-24 16:51 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.0.0)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
@KristofferC KristofferC changed the title Conflict between MKL_jll and CompilerSupportLibraries_jll Conflict between MKL_jll and libgomp in CompilerSupportLibraries_jll Feb 29, 2020
@KristofferC
Copy link
Member Author

Updated to reflect that the problem is the dlopen of libgomp in CompilerSupportLibraries_jll

@staticfloat
Copy link
Member

@tlienart this is almost certainly the issue that you brought up to me recently.

The fundamental issue is that MKL has its own OMP implementation (libiomp) that exports the same symbols as libgomp. If you import CompilerSupportLibraries_jll first, you get libgomp hogging those symbols, while if you load MKL_jll first, then libiomp gets its implementations bound to those symbols.

So one workaround that should work today; call using MKL_jll first. Why won't that break other applications? Because very few other packages are actually using libgomp. We include it in CSL_jll because there are a few (And it's important to provide all the libraries), but it's a little rare for a package to actually use it. We automatically open all libraries within these packages so that dependent packages can open those libraries. While we could prevent CSL_jll from loading libgomp automatically (via the dont_dlopen=true argument) that would break packages actually trying to use libgomp, as there's no way for those dependent packages to find where libgomp is living (from the dynamic linker's perspective; from Julia's perspective of course we can find it, but communicating that to the dynamic linker is the issue).

The basic problem is that we have two libraries that are mutually incompatible; libgomp and libiomp. If you have a single application with dependencies that rely on both, this will never work. However packages like MLJ probably don't really care about libgomp and it's just coming along for the ride via CSL. To fix this, I see a few possibilities:

  • Split libgomp out from CSL_jll, force packages that actually need libgomp to depend on that rather than the CSL_jll package itself. (I'm not really a fan of this, I'd like to keep it all together)

  • Switch libgomp to be dont_dlopen=true, tell downstream packages to always manually dlopen(libgomp) before trying to load other packages. (I don't like this either; we want to be more automatic, not less)

  • Teach BinaryBuilder to inspect all libraries within a prefix, build a tree of all dependencies, and if it finds a recursive dependency that was marked as dont_dlopen, manually dlopen it before trying to load the top-level dependency.

My favorite solution is the third, but of course it's also the most work. :P @giordano I'm open to suggestions.

@tlienart
Copy link

tlienart commented Apr 17, 2020

@staticfloat as a heads up, this is still bugging us and using MKL_jll does not fix the issue (see e.g. https://travis-ci.com/github/alan-turing-institute/MLJModels.jl/jobs/320357986). I'm still at loss why MLJ users would see this issue while ScikitLearn.jl would not.

The only way I've found so far to tell our users is to use my fork of 0.5.3+1 (instead of 0.5.3+3) https://github.com/tlienart/OpenSpecFun_jll.jl . Of course this is a shit workaround (and won't work on CI) but I couldn't use a compat bound on the package as the version string does not meet the standards (the +1 is throwing Pkg off)

Maybe here a 4th solution to your list would be to add a way to cap the version on such packages so that I can enforce 0.5.3+1 with which the problem does not happen.

@KristofferC
Copy link
Member Author

I wonder if JuliaPackaging/Yggdrasil#915 is related. It also has to do with libiomp on Mac. Does things work ok for you if you use a system installation of MKL / copy over the libmkl_intel_thread.dylib from the system installation to the artifact one?

@tlienart
Copy link

Hey @KristofferC thanks for this. The problem is now mostly with Travis (is the travis log linked to above, useful in narrowing things down?), @staticfloat tried to reproduce this locally and couldn't (thanks for trying!). However the fact that it fails on Travis is still bad as some users may have setups that would look like it.
We currently have a workaround with an fork of OpenSpecFun_jll which should work for users in the mean time but it's not ideal of course.

I'd be happy to try telling Travis to use this libmkl_intel_thread.dylib but not sure how that would work?

@ViralBShah
Copy link
Member

ViralBShah commented Dec 29, 2021

IMO this causes silent bugs. I would much rather favor correctness and go for the second option as the solution right away. Since such few packages use libgomp, it is ok to get them to manually dlopen it.

@ViralBShah
Copy link
Member

@tlienart can you point me to the fork of openspecfun_jll? I would like to see if we can remove the libgomp dependency in that package.

@giordano
Copy link
Member

Since such few packages use libgomp

I don't think this assesment is very accurate (and I don't think it was even accurate last year):

% find . -name '*.jl' | xargs grep -l CompilerSupportLibraries_jll | xargs grep -lv expand_gfortran
./R/raptor/build_tarballs.jl
./R/rr/build_tarballs.jl
./R/RadeonProRender/build_tarballs.jl
./U/UCX/build_tarballs.jl
./I/IpoptMKL/build_tarballs.jl
./I/IOAPI/build_tarballs.jl
./I/ITSOL_2/build_tarballs.jl
./N/NL2sol/build_tarballs.jl
./N/NetCDFF/build_tarballs.jl
./N/NOMAD/build_tarballs.jl
./N/nlminb/build_tarballs.jl
./N/normaliz/build_tarballs.jl
./G/glmnet/build_tarballs.jl
./G/Gettext/build_tarballs.jl
./G/gb/build_tarballs.jl
./G/Gingko/build_tarballs.jl
./Z/zfp/build_tarballs.jl
./Z/ZITSOL_1/build_tarballs.jl
./Z/ZPares/build_tarballs.jl
./T/tblis/build_tarballs.jl
./T/Torch/build_tarballs.jl
./T/Tesseract/build_tarballs.jl
./T/TMatrix/build_tarballs.jl
./T/Trilinos/build_tarballs.jl
./S/SCALAPACK/build_tarballs.jl
./S/SCIP/build_tarballs.jl
./S/SDPA/build_tarballs.jl
./S/SuperLU_MT/build_tarballs.jl
./S/SoXResampler/build_tarballs.jl
./S/Sundials/Sundials@5/build_tarballs.jl
./S/Sundials/Sundials32@5/build_tarballs.jl
./S/spglib/build_tarballs.jl
./S/SoapyRTLSDR/build_tarballs.jl
./S/StarPU/build_tarballs.jl
./S/SHTOOLS/build_tarballs.jl
./S/SCIP_PaPILO/build_tarballs.jl
./S/scopehal/build_tarballs.jl
./S/SPRAL/build_tarballs.jl
./S/SLATEC/build_tarballs.jl
./S/SuiteSparse/SSGraphBLAS/build_tarballs.jl
./A/Arpack/build_tarballs.jl
./A/AMReX/build_tarballs.jl
./A/ADIOS2/build_tarballs.jl
./A/AzStorage/build_tarballs.jl
./F/FastTransforms/build_tarballs.jl
./F/FLANN/build_tarballs.jl
./F/FGlT/build_tarballs.jl
./F/FMM3D/build_tarballs.jl
./F/finufft/build_tarballs.jl
./O/openPMD_api/build_tarballs.jl
./O/OpenAL/build_tarballs.jl
./O/OpenSpecFun/build_tarballs.jl
./O/OpenBLAS/common.jl
./O/ODEInterface/build_tarballs.jl
./O/OpenMPI/build_tarballs.jl
./O/OpenLSTO/build_tarballs.jl
./H/HOHQMesh/build_tarballs.jl
./H/HHsuite/build_tarballs.jl
./H/HiGHS/build_tarballs.jl
./M/MPICH/build_tarballs.jl
./M/Modflow6/build_tarballs.jl
./M/MPItrampoline/build_tarballs.jl
./M/MMseqs2/build_tarballs.jl
./M/muparser/build_tarballs.jl
./M/msolve/build_tarballs.jl
./M/MUMPS/MUMPS_seq_MKL/build_tarballs.jl
./M/MUMPS/MUMPS_seq@5/build_tarballs.jl
./M/MUMPS/MUMPS@5/build_tarballs.jl
./M/MUMPS/MUMPS_seq@4/build_tarballs.jl
./C/CovidSim/build_tarballs.jl
./C/CUTENSOR/build_tarballs.jl
./C/Coin-OR/BiCePS/build_tarballs.jl
./C/Coin-OR/Cgl/build_tarballs.jl
./C/Coin-OR/SYMPHONY/build_tarballs.jl
./C/Coin-OR/CSDP/build_tarballs.jl
./C/Coin-OR/Ipopt/build_tarballs.jl
./C/Coin-OR/CoinUtils/build_tarballs.jl
./C/Coin-OR/Cbc/build_tarballs.jl
./C/Coin-OR/Osi/build_tarballs.jl
./C/Coin-OR/CHiPPS_BLIS/build_tarballs.jl
./C/Coin-OR/MibS/build_tarballs.jl
./C/Coin-OR/Bonmin/build_tarballs.jl
./C/Coin-OR/SHOT/build_tarballs.jl
./C/Coin-OR/ALPS/build_tarballs.jl
./C/Coin-OR/Clp/build_tarballs.jl
./C/CvxCompress/build_tarballs.jl
./C/Chuffed/build_tarballs.jl
./C/CeresSolver/build_tarballs.jl
./C/CUTEst/build_tarballs.jl
./C/cilantro/build_tarballs.jl
./C/COSMA/build_tarballs.jl
./D/DASKR/build_tarballs.jl
./D/difmap/build_tarballs.jl
./D/Darknet/common.jl
./D/Dierckx/build_tarballs.jl
./V/ViennaRNA/build_tarballs.jl
./V/VMEC/build_tarballs.jl
./Q/qr_mumps/build_tarballs.jl
./Q/Qt/build_tarballs.jl
./Q/QuantReg/build_tarballs.jl
./Q/Qt5Base/build_tarballs.jl
./Q/QCDNUM/build_tarballs.jl
./Q/Qt6Base/build_tarballs.jl
./Q/qwtw/build_tarballs.jl
./Q/QuantumEspresso/build_tarballs.jl
./X/xfoil_light/build_tarballs.jl
./X/XGBoost/build_tarballs.jl
./X/Xyce/build_tarballs.jl
./B/Birch_Standard/build_tarballs.jl
./B/beefalo/build_tarballs.jl
./B/basiclu/build_tarballs.jl
./B/blis/build_tarballs.jl
./K/Kokkos/build_tarballs.jl
./L/libpolymake_julia/build_tarballs.jl
./L/libnabo/build_tarballs.jl
./L/LAPACK/build_tarballs.jl
./L/Libxc/build_tarballs.jl
./L/libgraphicsmagic/build_tarballs.jl
./L/libsharp2/build_tarballs.jl
./L/LibBirch/build_tarballs.jl
./L/LibRaw/build_tarballs.jl
./L/LAMMPS/build_tarballs.jl
./L/libsvm/build_tarballs.jl
./L/LoopTools/build_tarballs.jl
./L/LibAMVW/build_tarballs.jl
./P/polymake/build_tarballs.jl
./P/PCL/build_tarballs.jl
./P/PETSc/build_tarballs.jl
./P/primecount/build_tarballs.jl
./P/PGPLOT/build_tarballs.jl
./W/wannier90/build_tarballs.jl
./W/WaveFD/build_tarballs.jl
./W/Wigxjpf/build_tarballs.jl

I could find at least 133 recipes which reference CompilerSupportLibraries_jll which don't expand the libgfortran version: the main libraries for which we have to pull CSL are libgfortran and libgomp, if it isn't one, it's the other. 133 isn't that few packages and having to manually dlopen libraries isn't something easy to do.

@ViralBShah
Copy link
Member

But do they all need to dlopen libomp? It is definitely not a good option, but loading MKL.jl after SpecialFunctions.jl does not just error but silently gives wrong answers non-deterministically.

@giordano
Copy link
Member

They probably all dynamically link to libgomp, so yes, they do need it. As I said, it's either libgfortran or libgomp.

@ViralBShah
Copy link
Member

I added a note to MKL.jl for now about this: https://github.com/JuliaLinearAlgebra/MKL.jl/blob/master/README.md#usage

@tlienart
Copy link

openspecfun_jll

not sure if this is still relevant: https://github.com/tlienart/OpenSpecFun_jll.jl?organization=tlienart&organization=tlienart

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants