|
4 | 4 | -- Copyright : [2009..2014] Trevor L. McDonell
|
5 | 5 | -- License : BSD
|
6 | 6 | --
|
7 |     | --- Top level bindings to CUDA driver API
| 7 | +-- This module defines an interface to the CUDA driver API. The Driver API |
| 8 | +-- is a lower-level interface to CUDA devices than that provided by the |
| 9 | +-- Runtime API. Using the Driver API, the programmer must deal explicitly |
| 10 | +-- with operations such as initialisation, context management, and loading |
| 11 | +-- (kernel) modules. Although more difficult to use initially, the Driver |
| 12 | +-- API provides more control over how CUDA is used. Furthermore, since it |
| 13 | +-- does not require compiling and linking the program with 'nvcc', the |
| 14 | +-- Driver API provides better inter-language compatibility. |
| 15 | +-- |
| 16 | +-- [/Using the Driver API/] |
| 17 | +-- |
| 18 | +-- Before any operation can be performed, the Driver API must be |
| 19 | +-- initialised: |
| 20 | +-- |
| 21 | +-- >>> import Foreign.CUDA.Driver |
| 22 | +-- >>> initialise [] |
| 23 | +-- |
| 24 | +-- Next, we must select a GPU that we will execute operations on. Each GPU |
| 25 | +-- is assigned a unique identifier (beginning at zero). We can get a handle |
| 26 | +-- to a compute device at a given ordinal using the 'device' operation. |
| 27 | +-- Given a device handle, we can query the properties of that device using |
| 28 | +-- 'props'. The number of available CUDA-capable devices is given via |
| 29 | +-- 'count'. For example: |
| 30 | +-- |
| 31 | +-- >>> count |
| 32 | +-- 1 |
| 33 | +-- >>> dev0 <- device 0 |
| 34 | +-- >>> props dev0 |
| 35 | +-- DeviceProperties {deviceName = "GeForce GT 650M", computeCapability = 3.0, ...} |
| 36 | +-- |
| 37 | +-- This package also includes the executable 'nvidia-device-query', which
| 38 | +-- displays the key properties of all available devices. See
| 39 | +-- "Foreign.CUDA.Driver.Device" for additional operations to query the |
| 40 | +-- capabilities or status of a device. |
| 41 | +-- |
| 42 | +-- Once you have chosen a device to use, the next step is to create a CUDA |
| 43 | +-- context. A context is associated with a particular device, and all |
| 44 | +-- operations, such as memory allocation and kernel execution, take place |
| 45 | +-- with respect to that context. For example, to 'create' a new execution |
| 46 | +-- context on CUDA device 0: |
| 47 | +-- |
| 48 | +-- >>> ctx <- create dev0 [] |
| 49 | +-- |
| 50 | +-- The second argument is a set of 'ContextFlag's which control how the |
| 51 | +-- context behaves in various situations, for example, whether or not the |
| 52 | +-- CPU should actively spin when waiting for results from the GPU |
| 53 | +-- ('SchedSpin'), or yield control to other threads instead
| 54 | +-- ('SchedYield'). |
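--
-- For example, we could instead have created the context so that the CPU
-- yields its thread while waiting:
--
-- >>> ctx <- create dev0 [SchedYield]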
| 55 | +-- |
| 56 | +-- The newly created context is now the /active/ context, and all |
| 57 | +-- subsequent operations take place within that context. More than one |
| 58 | +-- context can be created per device, but resources, such as memory |
| 59 | +-- allocated in the GPU, are unique to each context. The module |
| 60 | +-- "Foreign.CUDA.Driver.Context" contains operations for managing multiple |
| 61 | +-- contexts. |
| 62 | +-- |
| 63 | +-- Once the context is no longer needed, it should be 'destroy'ed in order |
| 64 | +-- to free up any resources that were allocated to it. |
| 65 | +-- |
| 66 | +-- >>> destroy ctx |
| 67 | +-- |
| 68 | +-- Each device also has a unique context which is used by the Runtime API. |
| 69 | +-- This context can be accessed using the module
| 70 | +-- "Foreign.CUDA.Driver.Context.Primary". |
| 71 | +-- |
| 72 | +-- |
| 73 | +-- [/Executing kernels on the GPU/]
| 74 | +-- |
| 75 | +-- Once the Driver API is initialised and an execution context is created |
| 76 | +-- on the GPU, we can begin to interact with it. |
| 77 | +-- |
| 78 | +-- As an example, we'll step through executing the CUDA equivalent of the
| 79 | +-- following Haskell function, which element-wise adds the elements of two |
| 80 | +-- arrays: |
| 81 | +-- |
| 82 | +-- >>> vecAdd xs ys = zipWith (+) xs ys |
| 83 | +-- |
| 84 | +-- The following CUDA kernel can be used to implement this on the GPU: |
| 85 | +-- |
| 86 | +-- > extern "C" __global__ void vecAdd(float *xs, float *ys, float *zs, int N) |
| 87 | +-- > { |
| 88 | +-- > int ix = blockIdx.x * blockDim.x + threadIdx.x; |
| 89 | +-- > |
| 90 | +-- > if ( ix < N ) { |
| 91 | +-- > zs[ix] = xs[ix] + ys[ix]; |
| 92 | +-- > } |
| 93 | +-- > } |
| 94 | +-- |
| 95 | +-- Here, the `__global__` keyword marks the function as a kernel that |
| 96 | +-- should be executed on the GPU in data-parallel fashion. When we execute this
| 97 | +-- function on the GPU, (at least) /N/ threads will execute /N/ individual |
| 98 | +-- instances of the kernel function `vecAdd`. Each thread will operate on |
| 99 | +-- a single element of each input array to create a single value in the |
| 100 | +-- result. See the CUDA programming guide for more details. |
| 101 | +-- |
| 102 | +-- We can save this to a file `vector_add.cu`, and compile it using `nvcc` |
| 103 | +-- into a form that we can then load onto the GPU and execute: |
| 104 | +-- |
| 105 | +-- > $ nvcc --ptx vector_add.cu |
| 106 | +-- |
| 107 | +-- The module "Foreign.CUDA.Driver.Module" contains functions for loading |
| 108 | +-- the resulting .ptx file (or .cubin files) into the running program. |
| 109 | +-- |
| 110 | +-- >>> mdl <- loadFile "vector_add.ptx" |
| 111 | +-- |
| 112 | +-- Once finished with the module, it is also a good idea to 'unload' it. |
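--
-- For example, once all of the functions it contains are no longer needed:
--
-- >>> unload mdl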
| 113 | +-- |
| 114 | +-- Modules may export kernel functions, global variables, and texture |
| 115 | +-- references. Before we can execute our function, we need to look it up in |
| 116 | +-- the module by name. |
| 117 | +-- |
| 118 | +-- >>> vecAdd <- getFun mdl "vecAdd" |
| 119 | +-- |
| 120 | +-- Given this reference to our kernel function, we are almost ready to |
| 121 | +-- execute it on the device using 'launchKernel', but first, we must create |
| 122 | +-- some data that we can execute the function on. |
| 123 | +-- |
| 124 | +-- |
| 125 | +-- [/Transferring data to and from the GPU/] |
| 126 | +-- |
| 127 | +-- GPUs typically have their own memory which is separate from the CPU's |
| 128 | +-- memory, and we need to explicitly copy data back and forth between these |
| 129 | +-- two regions. The module "Foreign.CUDA.Driver.Marshal" provides functions |
| 130 | +-- for allocating memory on the GPU, and copying data between the CPU and |
| 131 | +-- GPU, as well as directly between multiple GPUs. |
| 132 | +-- |
| 133 | +-- For simplicity, we'll use standard Haskell lists for our input and |
| 134 | +-- output data structures. Note, however, that this will have significantly
| 135 | +-- lower effective bandwidth than reading a single contiguous region of |
| 136 | +-- memory, so for most practical purposes you will want to use some kind of |
| 137 | +-- unboxed array. |
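--
-- As a sketch, copying from an unboxed 'Data.Vector.Storable.Vector' might
-- look like the following (the name and argument order of 'pokeArray' are
-- assumptions; check "Foreign.CUDA.Driver.Marshal" for the exact interface):
--
-- > import qualified Data.Vector.Storable as V
-- >
-- > -- Copy a storable vector into an already-allocated device array
-- > copyToDevice :: V.Vector Float -> DevicePtr Float -> IO ()
-- > copyToDevice v dptr = V.unsafeWith v $ \p -> pokeArray (V.length v) p dptr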
| 138 | +-- |
| 139 | +-- >>> let xs = [1..1024] :: [Float] |
| 140 | +-- >>> let ys = [2,4..2048] :: [Float] |
| 141 | +-- |
| 142 | +-- In CUDA, like C, all memory management is explicit, and arrays on the |
| 143 | +-- device must be explicitly allocated and freed. As mentioned previously, |
| 144 | +-- data transfer is also explicit. However, we do provide convenience |
| 145 | +-- functions for combined allocation and marshalling, as well as bracketed |
| 146 | +-- operations. |
| 147 | +-- |
| 148 | +-- >>> xs_dev <- newListArray xs |
| 149 | +-- >>> ys_dev <- newListArray ys |
| 150 | +-- >>> zs_dev <- mallocArray 1024 :: IO (DevicePtr Float) |
| 151 | +-- |
| 152 | +-- After executing the kernel (see next section), we transfer the result |
| 153 | +-- back to the host, and free the memory that was allocated on the GPU. |
| 154 | +-- |
| 155 | +-- >>> zs <- peekListArray 1024 zs_dev |
| 156 | +-- >>> free xs_dev |
| 157 | +-- >>> free ys_dev |
| 158 | +-- >>> free zs_dev |
| 159 | +-- |
| 160 | +-- |
| 161 | +-- [/Piecing it all together/] |
| 162 | +-- |
| 163 | +-- Finally, we have everything in place to execute our operation on the |
| 164 | +-- GPU. Launching a kernel on the GPU consists of creating many threads on |
| 165 | +-- the GPU which all execute the same function, and each thread has |
| 166 | +-- a unique identifier in the grid/block hierarchy which can be used to |
| 167 | +-- identify exactly which element this thread should process (the |
| 168 | +-- `blockIdx` and `threadIdx` parameters that we saw earlier, |
| 169 | +-- respectively). |
| 170 | +-- |
| 171 | +-- To execute our function, we will use a grid of 4 blocks, each containing |
| 172 | +-- 256 threads. Thus, a total of 1024 threads will be launched, which will |
| 173 | +-- each compute a single element of the output array (recall that our input
| 174 | +-- arrays each have 1024 elements). The module |
| 175 | +-- "Foreign.CUDA.Analysis.Occupancy" contains functions to help determine |
| 176 | +-- the ideal thread block size for a given kernel and GPU combination. |
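--
-- The block count follows directly from the array size; for example, with
-- 1024 elements and 256 threads per block:
--
-- >>> let threadsPerBlock = 256
-- >>> (1024 + threadsPerBlock - 1) `div` threadsPerBlock
-- 4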
| 177 | +-- |
| 178 | +-- >>> launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024] |
| 179 | +-- |
| 180 | +-- Note that kernel execution is asynchronous, so we should also wait for |
| 181 | +-- the operation to complete before attempting to read the results back. |
| 182 | +-- |
| 183 | +-- >>> sync |
| 184 | +-- |
| 185 | +-- And that's it! |
| 186 | +-- |
| 187 | +-- |
| 188 | +-- [/Next steps/] |
| 189 | +-- |
| 190 | +-- As mentioned at the end of the previous section, kernels on the GPU are |
| 191 | +-- executed asynchronously with respect to the host, and other operations |
| 192 | +-- such as data transfers can also be executed asynchronously. This allows |
| 193 | +-- the CPU to continue doing other work while the GPU is busy. |
| 194 | +-- 'Foreign.CUDA.Driver.Event.Event's can be used to check whether an |
| 195 | +-- operation has completed yet. |
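--
-- A minimal sketch ('create', 'record', and 'query' are assumptions about
-- the interface of "Foreign.CUDA.Driver.Event"):
--
-- >>> import qualified Foreign.CUDA.Driver.Event as Event
-- >>> ev <- Event.create []
-- >>> Event.record ev Nothing
-- >>> done <- Event.query ev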
| 196 | +-- |
| 197 | +-- It is also possible to execute multiple kernels or data transfers |
| 198 | +-- concurrently with each other, by assigning those operations to different |
| 199 | +-- execution 'Foreign.CUDA.Driver.Stream.Stream's. Used in conjunction with |
| 200 | +-- 'Foreign.CUDA.Driver.Event.Event's, operations will be scheduled |
| 201 | +-- efficiently only once all dependencies (in the form of |
| 202 | +-- 'Foreign.CUDA.Driver.Event.Event's) have been cleared. |
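--
-- For example, a sketch of launching our kernel into its own stream
-- ('create' here is an assumption about "Foreign.CUDA.Driver.Stream"; the
-- 'launchKernel' call itself is as shown above):
--
-- >>> import qualified Foreign.CUDA.Driver.Stream as Stream
-- >>> st <- Stream.create []
-- >>> launchKernel vecAdd (4,1,1) (256,1,1) 0 (Just st) [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
--
-- As before, remember to synchronise (for example, with 'sync') before
-- reading back the results.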
8 | 203 | --
|
9 | 204 | --------------------------------------------------------------------------------
|
10 | 205 |
|
|