RFC: Plan for establishment of xsf library for special function scalar kernels

### Some background and history

In August of 2023 Irwin Zaid (@izaid) made the following [comment](https://github.com/scipy/scipy/pull/19023#issuecomment-1689050946) on the pull request https://github.com/scipy/scipy/pull/19023,

> Something tangential: I know this issue is about the array API, but it might be good to explore whether the C versions of the special functions can be made very easily "copy-and-pasteable" to include in other array libraries. This way, although the C implementation isn't a defined standard itself, it would help out a lot of other projects if SciPy has something other libraries can easily steal. Example: how much of CuPy's special implementation is just taken directly from SciPy's, with `__device__` in front of everything? Could SciPy just have a lot of header files for special functions that other projects can easily include?

For context: the array API, short for the [Python array API standard](https://data-apis.org/array-api/latest/), was proposed to reduce fragmentation in the Python array computing world by offering a common API for an important subset of array computing functionality. The goal is that code written to the array API standard should work with any conforming array backend (e.g. NumPy, PyTorch, CuPy, Jax, etc.). Although the array API standard is restricted to a core set of important features, there can be optional[extensions](https://data-apis.org/array-api/latest/extensions/index.html), standardizing *coherent sets of functionality*. There are currently extensions for [Linear Algebra](https://data-apis.org/array-api/latest/extensions/linear_algebra_functions.html) and [Fourier transform functions](https://data-apis.org/array-api/latest/extensions/fourier_transform_functions.html). It seems natural that there should be a special function extension[^1] codifying consistent naming and argument conventions for a set of the most widely used special functions, but due to inconsistent support for special functions across array libraries, an extension restricted only to functions which are currently uniformly available among the most widely used array libraries would be missing many useful functions. It is much easier for these array libraries to add a new special function when there is an existing implementation that can be re-used with at most small modifications.

At the time, SciPy could not provide header files for many of its special functions that other projects can easily include because SciPy special was a polyglot submodule with special function scalar kernels written in each of C, C++, Cython, and Fortran 77. Scalar kernels written in Cython and Fortran 77 would require translation to run on the C/C++ based CUDA platform for use in GPU aware array libraries, and we observed that it was precisely the functions with such scalar kernels that were most likely to be missing from other array libraries.

Replacing the Fortran 77 codebase in `scipy.special` had already been under active discussion because the Fortran scalar kernels, particularly those from Zhang and Jin's[^2] `specfun` codebase, were a source of a sizable number of bug reports, which were challenging to address due to the need to work with old, unstructured, and often low quality Fortran 77 code. My first major contributions to `scipy.special` had begun in  June of 2021, when I began rewriting the Fortran scalar kernel for `hyp2f1` into Cython[^2], which was the prefered language at the time for new scalar kernels in `scipy.special`. Through this work, I'd already begun to think that C++ might be a more suitable choice due to features like overloading on input type, template metaprogramming, better cross-platform support for complex numbers, its extensive standard library, because it can lead to significantly smaller binaries than Cython, and because of the challenges of calling scalar kernels written in Cython from compiled code not written in Cython.

In an October 2023 meeting between Irwin, myself, and fellow SciPy maintainer Matt Haberland (@mdhaber) to discuss the draft proposal for a special function array API extension Matt was working on, Irwin explained his vision and I suggested we start by translating special function scalar kernels that are written in Cython into C++. I posted an RFC, https://github.com/scipy/scipy/issues/19404, asking for thoughts on rewriting these Cython scalar kernels in C++, which was well received. I followed up by translating `lambertw` from Cython into C++, https://github.com/scipy/scipy/pull/19435, and then began working on translating other scalar kernels. Irwin in turn followed up with https://github.com/scipy/scipy/pull/19601, adding CUDA support for this `lambertw` implementation. In January 2024 I submitted https://github.com/cupy/cupy/pull/8140 to add `lambertw` to CuPy. Starting in November 2023, I was able to work on this project 18 hours per week with support from the 2020 NASA ROSES, *Reinforcing the Foundations of Scientific Python* [^4] through my employer [Quansight Labs](https://labs.quansight.org/), aiming to bring GPU support to as much of `scipy.special` as feasible. Funding for travel, hardware, cloud, and other miscellaneous costs in service of this project has been provided by the 2023 NumFOCUS small development grant *Streamlined Special Function Development in SciPy*.

Irwin has devoted a great amount time to seeing this project through on a volunteer basis when his work allows.

We are now in a position where it is within reach to add GPU support for all special functions in SciPy except those where the need to manage nontrivial amounts of state makes this unworkable. It was a great help that SciPy maintainer Ilhan Polat (@ilayn) translated **all** of the Fortran code in `scipy.special` into C as part of his general effort to free SciPy from outdated and difficult to maintain Fortran 77 code [^5]. Boost Math has recently begun to offer GPU support for special functions and internal tools used within special function code [^6], which means we can build upon their work.

Over the past year, @izaid and I have frequently made comments on GitHub stating our goal to split the C++ special function code we've been working on into a separate library which could be a shared dependency (as a submodule) of SciPy, CuPy, and other array libraries, and there is support for this from the CuPy developers [^7]. This RFC is the first step towards really making this happen. The name xsf was chosen by @izaid, based on the name [sf](https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2010/n3060.pdf) for the C++ standard library extension for mathematical special functions.

### Concept

We propose splitting SciPy special C++ codebase @izaid and I have been working on into a header only C++ library of inline special function scalar kernels which could be included and used in either C++ or CUDA projects (either with [NVCC](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html) or the [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html) jit. The identification macro `__CUDACC__` can be used to detect if a CUDA source file is being compiled, and `__CUDACC_RTC__` can be used to detect if the NVRTC jit is being used. Conditional compilation can then be used to adapt the headers for the platform on which they are running.

The simplest adaption is to define a macro `XSF_HOST_DEVICE` which expands to `__host__ __device__` when a CUDA source file is being compiled, and otherwise an empty string.

```C++
#ifdef __CUDACC__
#define XSF_HOST_DEVICE __host__ __device__
#else
#define XSF_HOST_DEVICE
#endif
```

Host refers to the CPU side of a system and device to the CUDA capable GPU side. Defining a function with both the `__host__` and `__device__` qualifiers tells the compiler the compile the function for both the CPU and GPU (having both is needed for some array libraries). Function signatures in the xsf library then look like:

```C++
XSF_HOST_DEVICE inline std::complex<double> lambertw(std::complex<double> z, long k, double tol)
```

Note that the `inline` keyword is used here not because we expect the compiler to inline the long and somewhat complex implementation of `lambertw`, but because this is what's needed in a header only C++ library to allow the same function to be used in multiple translation units under the [One Definition Rule](https://en.wikipedia.org/wiki/One_Definition_Rule).

We rely heavily on NVIDIA's libcudacxx [^8], an implementation of much of the C++ standard library that is part of the [CCCL](https://github.com/NVIDIA/cccl) (CUDA core compute libraries) suite. Since we only implement scalar kernels, no explicit management of parallelism or other GPU specific programming techniques are needed; it's sufficient to write scalar kernels which use the subset of the standard library that is supported on NVRTC [^9], and which do not make nontrivial memory allocations, use recursion, or use other features which are not compatible with the highly parallel but resource constrained nature of GPU computation [^10].

Currently a header [config.h](https://github.com/scipy/scipy/blob/7c7878b86c8aa0cba9b85821bb13209ded56e752/scipy/special/xsf/config.h) is used whose essential purpose is to include the headers xsf uses from the standard library when compiling for CPU, but for CUDA, to instead inject the standard library functions, templates, and types xsf uses from the `libcudacxx`'s `cuda::std` namespace directly into the `std` namespace. This was a convenient because it let us write our code as if it was just regular C++ code that uses the standard library, but it is generally not considered good practice to modify the `std` namespace, and doing so leads to undefined behavior according the C++ standard [^11]. Since these headers will no longer just be part of SciPy's (and CuPy's through copy and paste) internal codebase, we should adopt a more principled approach to aliasing to the correct standard library utilities for a given platform.

### Scope

We hope to include scalar kernels for all `scipy.special` ufuncs and gufuncs in xsf, and to enable GPU support for as much of them as feasible. Since @mborland and @jzmaddock have also begun adding GPU support to mathematical functions in Boost Math, a question that could arise is why there needs to be a separate project doing this. I think it's important that we coordinate with them to avoid duplication of effort. I envision Boost Math being a dependency of xsf, and for us to wrap Boost Math when it provides a quality GPU capable implementation of a function. Currently `scipy.special` provides many functions which are not available in Boost Math and provides complex-valued versions of others for which Boost Math only provides real valued ones. There may be other functions for which the Boost implementation cannot be easily made GPU capable, but for which we would like to maintain a GPU capable implementation. Over time, I'd hope to see work done in xsf make it's way into Boost in some form when that makes sense, but

1. @izaid and I are more familiar with the xsf (née SciPy special) codebase
2. I don't expect it to be straightforward to graft most code from xsf into Boost due to differences in architecture and style
3. We wouldn't want to overwhelm the Boost Math developers with a flood of work to review.

So for the time being, I think it makes more sense for the work on xsf to proceed independendly of Boost. We will be able to move more quickly this way. In the future I anticipate a natural delineation will form over what makes more sense to be implemented in Boost vs xsf based on differences in project goals and standards.

### Migration Plan

The initial steps would be:

1. Copy the folder `scipy/special/xsf` in the SciPy repo to `include/xsf` in the xsf repo, taking care to preserve the commit history of `scipy/special/xsf`, perhaps using steps from an answer to this relevant [StackOverflow question](https://stackoverflow.com/questions/41811986/git-move-directory-to-another-repository-while-keeping-the-history).
2. Add a test suite to xsf. The simplest thing would be to xsf's ufunc creation tools to create Python wrappers and test these through pytest. To get started, we could just copy over the relevant `scipy.special` tests, making the minor changes to them that would be needed to get them to work. In the future we would build out a more thorough and consistent test suite.
3. Set up CI on GitHub over a range of platforms. We could start only with CPU CI, since the goal is just to replicate what we currently have in SciPy, but in the not too far future the goal would be to set up GPU CI as well, following the approach used in scikit-learn [^12].
4. Go through the process of setting up `xsf` as a submodule of SciPy and CuPy.

There are headers in `scipy/special/xsf` which are not GPU capable, or which are in principle, but use things from the standard library which are not yet available in CuPy ([config.h](https://github.com/scipy/scipy/blob/7c7878b86c8aa0cba9b85821bb13209ded56e752/scipy/special/xsf/config.h) as it currently exists on SciPy's main branch is one of them). Compilation with NVRTC will fail at runtime if one tries to call a CuPy ufunc with scalar kernel including one of these headers. This is something which will need to be solved before `xsf` can be included as a submodule of CuPy. Currently what we're doing is taking stuff like this out of the [config.h that we copy into CuPy](https://github.com/cupy/cupy/pull/8620/files#diff-564995cb200aef90364c427cc88857286ff39af28103eac9fd31907463e59892) when submitting PRs to add functions there. Instead we should either avoid things that are not available in CuPy (or perhaps spend some time figuring out how to get them to work in CuPy). 

### Priorities for the rest of the year.

Very roughly, the current status of the scipy special ufunc codebase is:

1. All of cephes has been translated into C++ [^13], and most of it should work on GPU in principle (except things like distribution functions for the `kolmogorov` distribution which depend on `double double` arithmetic, and the real valued implementation of `hyp2f1`, which uses recursion), but we haven't tested all of it.
2. All of amos and specfun have been translated into C++, but only a few things in specfun and none of amos, have been made GPU capable.
3. An improved version of root finding infrastructure needed to get cdflib working on GPU has been finished, and `gdtrib` has been made GPU capable [^14]. I've translated much of cdflib's internal code (for functions that get inverted with the root finding tools), but have found that cdflib's implementations tend to be worse than the implementations in Boost and cephes, so none of this work translating cdflib has found it's way into SciPy. 
4. Scalar kernels for `lambertw`, `wright_bessel`, `digamma`, `loggamma`, `hyp2f1`, `sici`, `sinpi`, and `cospi` have been translated from Cython to C++, but a handful remain, the most complicated being `orthogonal_eval.pxd` for orthongal polynomial evaluation.
5. Faddeeva (complex valued statistical distribution functions) is already implemented in C++, but these have not been moved into headers in xsf and made GPU capable.
6. We have not yet attempted to use the scalar kernels SciPy takes from Boost on GPU.
7. Elliptic integral implementations in `ellint_carlson` are written in C++, but we have not moved them into headers in xsf or made them GPU capable.

Through the end of the year I have funding for 287.89 hours of work from the 2020 NASA ROSES grant. There are 14 weeks remaining in the year, which means I am funded to spend 20.5 hours per week for the rest of the year. All of the work above should

For the remainder of the year, I'd like to prioritize the following things [^15]:

1. Writing quality documentation. It's important to write quality documentation for how to use the xsf library, but moreso, I want to emphasize documentation for how to contribute to xsf, how to use it's internal tools, the development workflow, advice for special function implementors. Over the years, SciPy has had many people show hope who wanted to help fix things in `special`, but weren't able to due to the burden of working with, and maintainers needing to review updates to the messy, at the time mostly Fortran, `special` codebase. I think there's a lot of potential to find people who'd like to help out, as long as it's straightforward enough for them to get started.
2. Development of a comprehensive test suite, with comparisons against arbitrary precision reference implementations [^16], over test cases which are automatically sampled based on the domains of the functions and configuration information such as whether we want to sample in linear or logspace. Additional test cases for critical regions where a function is known to be difficult to compute could be added manually. Such a test suite will make it much easier to review code for `xsf`, which could help a prevent a review bottleneck from slowing our development pace and discouraging contribution.
3. Not of lower priority than the previous two items, and something which can be done within a relatively short timeframe: figuring out how to get Boost working in CuPy, integrating Boost into xsf, and documenting how to use Boost in xsf.
4. Working on low hanging fruit: Things which would be easy to get working on GPU, but which we just haven't gotten to yet. Examples include, remaining Cython scalar kernels and most of cdflib, parts of amos and specfun which don't rely on recursion or internal state. The goal here is essentially to maximize the number of functions made GPU capable by the end of the year.


### Request for Comment

Does anyone have any objections to splitting the `scipy.special` scalar kernel codebase into a separate repo, xsf, which is managed under the SciPy organization? Does anyone anticipate any technical or non-technical difficulties that I may be unaware of? One thing I think is really important to get right is testing and CI, so I plan to make a separate RFC this week just to discuss these issues. The main thing is that requiring a full SciPy build for xsf CI for each platform would be very time consuming, but some things like `cython_special` can only be tested in SciPy. Another thing that needs discussion is what the xsf release cycle would look like, and the process for updating xsf in SciPy.

@izaid has agreed to be a maintainer of xsf, and is willing to spend time to review and merge my pull requests. This should allow development of xsf to move faster than special function development in SciPy, where lack of bandwidth from qualified reviewers has lead to a significant bottleneck. Any current SciPy maintainers who are interested in getting involved and want commit rights or triage rights should let me know.

[^1]: An RFC for such an extension was proposed by SciPy maintainer Matt Haberland (@mdhaber) in December 2023,
https://github.com/data-apis/array-api/issues/725, written with input from Irwin and I.

[^2]: Shanjie Zhang, Jianming Jin, Computation of Special Functions, Wiley, 1996, ISBN: 0-471-11963-6, LC: QA351.C45.

[^3]: https://github.com/scipy/scipy/pull/14256. 

[^4]: #80NSSC22K0405 - Reinforcing the Foundations of Scientific Python. Awarded jointly to NumPy, pandas, SciPy, and scikit-learn. https://numfocus.medium.com/numfocus-projects-receive-nasa-grants-deee374e7a57

[^5]: https://github.com/scipy/scipy/issues/18566

[^6]: https://github.com/boostorg/math/pull/1191

[^7]: https://github.com/cupy/cupy/issues/8303

[^8]: Just days ago, AMD published its own [libhipcxx](https://github.com/ROCm/libhipcxx) implementation of the C++ standard library, which should allow us to also support AMD GPUs with minimal effort on our end.

[^9]: NVRTC is more restrictive and supports a smaller subset of the standard library.

[^10]: There are some special functions for which managing dynamic state is needed for more accurate or efficient computation in some or all parameter regions (e.g. Spheroidal Wave Functions). In these cases, conditional compilation may be needed to swap out implementations based on the platform for which the code is compiled. We may also decide and document that certain functions or parameter regions won't be supported for all platforms.

[^11]: https://wiki.sei.cmu.edu/confluence/display/cplusplus/DCL58-CPP.+Do+not+modify+the+standard+namespaces

[^12]: https://github.com/scikit-learn/scikit-learn/issues/24491

[^13]: https://github.com/scipy/scipy/pull/20390

[^14]: https://github.com/scipy/scipy/pull/21454

[^15]: It will be difficult, but developing a GPU capabable implementation of [Miller's recurrence algorithm](https://en.wikipedia.org/wiki/Miller%27s_recurrence_algorithm) to perform argument reduction with a recurrence in cases where the recurrence is only stable in one direction (and not the one we want), is needed for us to implement accurate Bessel and hypergeometric functions which work on the GPU. This is something I've been hoping to work on for a while now, but I don't think it should be priortized over the four items below.

[^16]: mdhaber began work on the [mparray](https://github.com/mdhaber/mparray) library, which could be used as a home for arbitrary precision reference implementations.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

RFC: Plan for establishment of xsf library for special function scalar kernels #1

Some background and history

Concept

Scope

Migration Plan

Priorities for the rest of the year.

Request for Comment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

RFC: Plan for establishment of xsf library for special function scalar kernels #1

Description

Some background and history

Concept

Scope

Migration Plan

Priorities for the rest of the year.

Request for Comment

Footnotes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions