Skip to content

RFC: Plan for establishment of xsf library for special function scalar kernels #1

Open
@steppi

Description

@steppi

Some background and history

In August of 2023 Irwin Zaid (@izaid) made the following comment on the pull request scipy/scipy#19023,

Something tangential: I know this issue is about the array API, but it might be good to explore whether the C versions of the special functions can be made very easily "copy-and-pasteable" to include in other array libraries. This way, although the C implementation isn't a defined standard itself, it would help out a lot of other projects if SciPy has something other libraries can easily steal. Example: how much of CuPy's special implementation is just taken directly from SciPy's, with __device__ in front of everything? Could SciPy just have a lot of header files for special functions that other projects can easily include?

For context: the array API, short for the Python array API standard, was proposed to reduce fragmentation in the Python array computing world by offering a common API for an important subset of array computing functionality. The goal is that code written to the array API standard should work with any conforming array backend (e.g. NumPy, PyTorch, CuPy, Jax, etc.). Although the array API standard is restricted to a core set of important features, there can be optionalextensions, standardizing coherent sets of functionality. There are currently extensions for Linear Algebra and Fourier transform functions. It seems natural that there should be a special function extension1 codifying consistent naming and argument conventions for a set of the most widely used special functions, but due to inconsistent support for special functions across array libraries, an extension restricted only to functions which are currently uniformly available among the most widely used array libraries would be missing many useful functions. It is much easier for these array libraries to add a new special function when there is an existing implementation that can be re-used with at most small modifications.

At the time, SciPy could not provide header files for many of its special functions that other projects can easily include because SciPy special was a polyglot submodule with special function scalar kernels written in each of C, C++, Cython, and Fortran 77. Scalar kernels written in Cython and Fortran 77 would require translation to run on the C/C++ based CUDA platform for use in GPU aware array libraries, and we observed that it was precisely the functions with such scalar kernels that were most likely to be missing from other array libraries.

Replacing the Fortran 77 codebase in scipy.special had already been under active discussion because the Fortran scalar kernels, particularly those from Zhang and Jin's2 specfun codebase, were a source of a sizable number of bug reports, which were challenging to address due to the need to work with old, unstructured, and often low quality Fortran 77 code. My first major contributions to scipy.special had begun in June of 2021, when I began rewriting the Fortran scalar kernel for hyp2f1 into Cython2, which was the prefered language at the time for new scalar kernels in scipy.special. Through this work, I'd already begun to think that C++ might be a more suitable choice due to features like overloading on input type, template metaprogramming, better cross-platform support for complex numbers, its extensive standard library, because it can lead to significantly smaller binaries than Cython, and because of the challenges of calling scalar kernels written in Cython from compiled code not written in Cython.

In an October 2023 meeting between Irwin, myself, and fellow SciPy maintainer Matt Haberland (@mdhaber) to discuss the draft proposal for a special function array API extension Matt was working on, Irwin explained his vision and I suggested we start by translating special function scalar kernels that are written in Cython into C++. I posted an RFC, scipy/scipy#19404, asking for thoughts on rewriting these Cython scalar kernels in C++, which was well received. I followed up by translating lambertw from Cython into C++, scipy/scipy#19435, and then began working on translating other scalar kernels. Irwin in turn followed up with scipy/scipy#19601, adding CUDA support for this lambertw implementation. In January 2024 I submitted cupy/cupy#8140 to add lambertw to CuPy. Starting in November 2023, I was able to work on this project 18 hours per week with support from the 2020 NASA ROSES, Reinforcing the Foundations of Scientific Python 3 through my employer Quansight Labs, aiming to bring GPU support to as much of scipy.special as feasible. Funding for travel, hardware, cloud, and other miscellaneous costs in service of this project has been provided by the 2023 NumFOCUS small development grant Streamlined Special Function Development in SciPy.

Irwin has devoted a great amount time to seeing this project through on a volunteer basis when his work allows.

We are now in a position where it is within reach to add GPU support for all special functions in SciPy except those where the need to manage nontrivial amounts of state makes this unworkable. It was a great help that SciPy maintainer Ilhan Polat (@ilayn) translated all of the Fortran code in scipy.special into C as part of his general effort to free SciPy from outdated and difficult to maintain Fortran 77 code 4. Boost Math has recently begun to offer GPU support for special functions and internal tools used within special function code 5, which means we can build upon their work.

Over the past year, @izaid and I have frequently made comments on GitHub stating our goal to split the C++ special function code we've been working on into a separate library which could be a shared dependency (as a submodule) of SciPy, CuPy, and other array libraries, and there is support for this from the CuPy developers 6. This RFC is the first step towards really making this happen. The name xsf was chosen by @izaid, based on the name sf for the C++ standard library extension for mathematical special functions.

Concept

We propose splitting SciPy special C++ codebase @izaid and I have been working on into a header only C++ library of inline special function scalar kernels which could be included and used in either C++ or CUDA projects (either with NVCC or the NVRTC jit. The identification macro __CUDACC__ can be used to detect if a CUDA source file is being compiled, and __CUDACC_RTC__ can be used to detect if the NVRTC jit is being used. Conditional compilation can then be used to adapt the headers for the platform on which they are running.

The simplest adaption is to define a macro XSF_HOST_DEVICE which expands to __host__ __device__ when a CUDA source file is being compiled, and otherwise an empty string.

#ifdef __CUDACC__
#define XSF_HOST_DEVICE __host__ __device__
#else
#define XSF_HOST_DEVICE
#endif

Host refers to the CPU side of a system and device to the CUDA capable GPU side. Defining a function with both the __host__ and __device__ qualifiers tells the compiler the compile the function for both the CPU and GPU (having both is needed for some array libraries). Function signatures in the xsf library then look like:

XSF_HOST_DEVICE inline std::complex<double> lambertw(std::complex<double> z, long k, double tol)

Note that the inline keyword is used here not because we expect the compiler to inline the long and somewhat complex implementation of lambertw, but because this is what's needed in a header only C++ library to allow the same function to be used in multiple translation units under the One Definition Rule.

We rely heavily on NVIDIA's libcudacxx 7, an implementation of much of the C++ standard library that is part of the CCCL (CUDA core compute libraries) suite. Since we only implement scalar kernels, no explicit management of parallelism or other GPU specific programming techniques are needed; it's sufficient to write scalar kernels which use the subset of the standard library that is supported on NVRTC 8, and which do not make nontrivial memory allocations, use recursion, or use other features which are not compatible with the highly parallel but resource constrained nature of GPU computation 9.

Currently a header config.h is used whose essential purpose is to include the headers xsf uses from the standard library when compiling for CPU, but for CUDA, to instead inject the standard library functions, templates, and types xsf uses from the libcudacxx's cuda::std namespace directly into the std namespace. This was a convenient because it let us write our code as if it was just regular C++ code that uses the standard library, but it is generally not considered good practice to modify the std namespace, and doing so leads to undefined behavior according the C++ standard 10. Since these headers will no longer just be part of SciPy's (and CuPy's through copy and paste) internal codebase, we should adopt a more principled approach to aliasing to the correct standard library utilities for a given platform.

Scope

We hope to include scalar kernels for all scipy.special ufuncs and gufuncs in xsf, and to enable GPU support for as much of them as feasible. Since @mborland and @jzmaddock have also begun adding GPU support to mathematical functions in Boost Math, a question that could arise is why there needs to be a separate project doing this. I think it's important that we coordinate with them to avoid duplication of effort. I envision Boost Math being a dependency of xsf, and for us to wrap Boost Math when it provides a quality GPU capable implementation of a function. Currently scipy.special provides many functions which are not available in Boost Math and provides complex-valued versions of others for which Boost Math only provides real valued ones. There may be other functions for which the Boost implementation cannot be easily made GPU capable, but for which we would like to maintain a GPU capable implementation. Over time, I'd hope to see work done in xsf make it's way into Boost in some form when that makes sense, but

  1. @izaid and I are more familiar with the xsf (née SciPy special) codebase
  2. I don't expect it to be straightforward to graft most code from xsf into Boost due to differences in architecture and style
  3. We wouldn't want to overwhelm the Boost Math developers with a flood of work to review.

So for the time being, I think it makes more sense for the work on xsf to proceed independendly of Boost. We will be able to move more quickly this way. In the future I anticipate a natural delineation will form over what makes more sense to be implemented in Boost vs xsf based on differences in project goals and standards.

Migration Plan

The initial steps would be:

  1. Copy the folder scipy/special/xsf in the SciPy repo to include/xsf in the xsf repo, taking care to preserve the commit history of scipy/special/xsf, perhaps using steps from an answer to this relevant StackOverflow question.
  2. Add a test suite to xsf. The simplest thing would be to xsf's ufunc creation tools to create Python wrappers and test these through pytest. To get started, we could just copy over the relevant scipy.special tests, making the minor changes to them that would be needed to get them to work. In the future we would build out a more thorough and consistent test suite.
  3. Set up CI on GitHub over a range of platforms. We could start only with CPU CI, since the goal is just to replicate what we currently have in SciPy, but in the not too far future the goal would be to set up GPU CI as well, following the approach used in scikit-learn 11.
  4. Go through the process of setting up xsf as a submodule of SciPy and CuPy.

There are headers in scipy/special/xsf which are not GPU capable, or which are in principle, but use things from the standard library which are not yet available in CuPy (config.h as it currently exists on SciPy's main branch is one of them). Compilation with NVRTC will fail at runtime if one tries to call a CuPy ufunc with scalar kernel including one of these headers. This is something which will need to be solved before xsf can be included as a submodule of CuPy. Currently what we're doing is taking stuff like this out of the config.h that we copy into CuPy when submitting PRs to add functions there. Instead we should either avoid things that are not available in CuPy (or perhaps spend some time figuring out how to get them to work in CuPy).

Priorities for the rest of the year.

Very roughly, the current status of the scipy special ufunc codebase is:

  1. All of cephes has been translated into C++ 12, and most of it should work on GPU in principle (except things like distribution functions for the kolmogorov distribution which depend on double double arithmetic, and the real valued implementation of hyp2f1, which uses recursion), but we haven't tested all of it.
  2. All of amos and specfun have been translated into C++, but only a few things in specfun and none of amos, have been made GPU capable.
  3. An improved version of root finding infrastructure needed to get cdflib working on GPU has been finished, and gdtrib has been made GPU capable 13. I've translated much of cdflib's internal code (for functions that get inverted with the root finding tools), but have found that cdflib's implementations tend to be worse than the implementations in Boost and cephes, so none of this work translating cdflib has found it's way into SciPy.
  4. Scalar kernels for lambertw, wright_bessel, digamma, loggamma, hyp2f1, sici, sinpi, and cospi have been translated from Cython to C++, but a handful remain, the most complicated being orthogonal_eval.pxd for orthongal polynomial evaluation.
  5. Faddeeva (complex valued statistical distribution functions) is already implemented in C++, but these have not been moved into headers in xsf and made GPU capable.
  6. We have not yet attempted to use the scalar kernels SciPy takes from Boost on GPU.
  7. Elliptic integral implementations in ellint_carlson are written in C++, but we have not moved them into headers in xsf or made them GPU capable.

Through the end of the year I have funding for 287.89 hours of work from the 2020 NASA ROSES grant. There are 14 weeks remaining in the year, which means I am funded to spend 20.5 hours per week for the rest of the year. All of the work above should

For the remainder of the year, I'd like to prioritize the following things 14:

  1. Writing quality documentation. It's important to write quality documentation for how to use the xsf library, but moreso, I want to emphasize documentation for how to contribute to xsf, how to use it's internal tools, the development workflow, advice for special function implementors. Over the years, SciPy has had many people show hope who wanted to help fix things in special, but weren't able to due to the burden of working with, and maintainers needing to review updates to the messy, at the time mostly Fortran, special codebase. I think there's a lot of potential to find people who'd like to help out, as long as it's straightforward enough for them to get started.
  2. Development of a comprehensive test suite, with comparisons against arbitrary precision reference implementations 15, over test cases which are automatically sampled based on the domains of the functions and configuration information such as whether we want to sample in linear or logspace. Additional test cases for critical regions where a function is known to be difficult to compute could be added manually. Such a test suite will make it much easier to review code for xsf, which could help a prevent a review bottleneck from slowing our development pace and discouraging contribution.
  3. Not of lower priority than the previous two items, and something which can be done within a relatively short timeframe: figuring out how to get Boost working in CuPy, integrating Boost into xsf, and documenting how to use Boost in xsf.
  4. Working on low hanging fruit: Things which would be easy to get working on GPU, but which we just haven't gotten to yet. Examples include, remaining Cython scalar kernels and most of cdflib, parts of amos and specfun which don't rely on recursion or internal state. The goal here is essentially to maximize the number of functions made GPU capable by the end of the year.

Request for Comment

Does anyone have any objections to splitting the scipy.special scalar kernel codebase into a separate repo, xsf, which is managed under the SciPy organization? Does anyone anticipate any technical or non-technical difficulties that I may be unaware of? One thing I think is really important to get right is testing and CI, so I plan to make a separate RFC this week just to discuss these issues. The main thing is that requiring a full SciPy build for xsf CI for each platform would be very time consuming, but some things like cython_special can only be tested in SciPy. Another thing that needs discussion is what the xsf release cycle would look like, and the process for updating xsf in SciPy.

@izaid has agreed to be a maintainer of xsf, and is willing to spend time to review and merge my pull requests. This should allow development of xsf to move faster than special function development in SciPy, where lack of bandwidth from qualified reviewers has lead to a significant bottleneck. Any current SciPy maintainers who are interested in getting involved and want commit rights or triage rights should let me know.

Footnotes

  1. An RFC for such an extension was proposed by SciPy maintainer Matt Haberland (@mdhaber) in December 2023,
    https://github.com/data-apis/array-api/issues/725, written with input from Irwin and I.

  2. Shanjie Zhang, Jianming Jin, Computation of Special Functions, Wiley, 1996, ISBN: 0-471-11963-6, LC: QA351.C45. 2

  3. #80NSSC22K0405 - Reinforcing the Foundations of Scientific Python. Awarded jointly to NumPy, pandas, SciPy, and scikit-learn. https://numfocus.medium.com/numfocus-projects-receive-nasa-grants-deee374e7a57

  4. https://github.com/scipy/scipy/issues/18566

  5. https://github.com/boostorg/math/pull/1191

  6. https://github.com/cupy/cupy/issues/8303

  7. Just days ago, AMD published its own libhipcxx implementation of the C++ standard library, which should allow us to also support AMD GPUs with minimal effort on our end.

  8. NVRTC is more restrictive and supports a smaller subset of the standard library.

  9. There are some special functions for which managing dynamic state is needed for more accurate or efficient computation in some or all parameter regions (e.g. Spheroidal Wave Functions). In these cases, conditional compilation may be needed to swap out implementations based on the platform for which the code is compiled. We may also decide and document that certain functions or parameter regions won't be supported for all platforms.

  10. https://wiki.sei.cmu.edu/confluence/display/cplusplus/DCL58-CPP.+Do+not+modify+the+standard+namespaces

  11. https://github.com/scikit-learn/scikit-learn/issues/24491

  12. https://github.com/scipy/scipy/pull/20390

  13. https://github.com/scipy/scipy/pull/21454

  14. It will be difficult, but developing a GPU capabable implementation of Miller's recurrence algorithm to perform argument reduction with a recurrence in cases where the recurrence is only stable in one direction (and not the one we want), is needed for us to implement accurate Bessel and hypergeometric functions which work on the GPU. This is something I've been hoping to work on for a while now, but I don't think it should be priortized over the four items below.

  15. mdhaber began work on the mparray library, which could be used as a home for arbitrary precision reference implementations.

Metadata

Metadata

Assignees

No one assigned

    Labels

    rfcRequest for comment

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions