forked from tmcdonell/cuda

Commit 8351f18

Add short tutorial on using the Driver API
1 parent 624a490 commit 8351f18

File tree

3 files changed: +206 -1 lines changed


CHANGELOG.markdown

+3
@@ -8,6 +8,9 @@

* Bug fixes, extra documentation, improve library coverage.

* Mac OS X no longer requires the DYLD_LIBRARY_PATH environment variable in
  order to compile or run programs that use this package.

0.6.7.0

* Add support for building on Windows (thanks to mwu-tow)

Foreign/CUDA/Driver.hs

+196 -1
@@ -4,7 +4,202 @@
-- Copyright : [2009..2014] Trevor L. McDonell
-- License : BSD
--
-- Top level bindings to CUDA driver API
-- This module defines an interface to the CUDA driver API. The Driver API
-- is a lower-level interface to CUDA devices than that provided by the
-- Runtime API. Using the Driver API, the programmer must deal explicitly
-- with operations such as initialisation, context management, and loading
-- (kernel) modules. Although more difficult to use initially, the Driver
-- API provides more control over how CUDA is used. Furthermore, since it
-- does not require compiling and linking the program with 'nvcc', the
-- Driver API provides better inter-language compatibility.
--
-- [/Using the Driver API/]
--
-- Before any operation can be performed, the Driver API must be
-- initialised:
--
-- >>> import Foreign.CUDA.Driver
-- >>> initialise []
--
-- Next, we must select a GPU that we will execute operations on. Each GPU
-- is assigned a unique identifier (beginning at zero). We can get a handle
-- to a compute device at a given ordinal using the 'device' operation.
-- Given a device handle, we can query the properties of that device using
-- 'props'. The number of available CUDA-capable devices is given via
-- 'count'. For example:
--
-- >>> count
-- 1
-- >>> dev0 <- device 0
-- >>> props dev0
-- DeviceProperties {deviceName = "GeForce GT 650M", computeCapability = 3.0, ...}
--
-- This package also includes the executable 'nvidia-device-query', which,
-- when executed, displays the key properties of all available devices. See
-- "Foreign.CUDA.Driver.Device" for additional operations to query the
-- capabilities or status of a device.
--
-- Once you have chosen a device to use, the next step is to create a CUDA
-- context. A context is associated with a particular device, and all
-- operations, such as memory allocation and kernel execution, take place
-- with respect to that context. For example, to 'create' a new execution
-- context on CUDA device 0:
--
-- >>> ctx <- create dev0 []
--
-- The second argument is a set of 'ContextFlag's which control how the
-- context behaves in various situations: for example, whether the CPU
-- should actively spin when waiting for results from the GPU
-- ('SchedSpin'), or yield control to other threads instead
-- ('SchedYield').
--
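-- For instance (an illustrative variation only, not required for this
-- tutorial), the context above could instead have been created so that the
-- CPU yields to other threads while waiting for the GPU:
--
-- >>> ctx <- create dev0 [SchedYield]
--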
-- The newly created context is now the /active/ context, and all
-- subsequent operations take place within that context. More than one
-- context can be created per device, but resources, such as memory
-- allocated on the GPU, are unique to each context. The module
-- "Foreign.CUDA.Driver.Context" contains operations for managing multiple
-- contexts.
--
-- Once the context is no longer needed, it should be 'destroy'ed in order
-- to free up any resources that were allocated to it.
--
-- >>> destroy ctx
--
-- Each device also has a unique context which is used by the Runtime API.
-- This context can be accessed with the module
-- "Foreign.CUDA.Driver.Context.Primary".
--
--
-- [/Executing kernels on the GPU/]
--
-- Once the Driver API is initialised and an execution context is created
-- on the GPU, we can begin to interact with it.
--
-- As an example, we'll step through executing the CUDA equivalent of the
-- following Haskell function, which element-wise adds the elements of two
-- arrays:
--
-- >>> vecAdd xs ys = zipWith (+) xs ys
--
-- The following CUDA kernel can be used to implement this on the GPU:
--
-- > extern "C" __global__ void vecAdd(float *xs, float *ys, float *zs, int N)
-- > {
-- >     int ix = blockIdx.x * blockDim.x + threadIdx.x;
-- >
-- >     if ( ix < N ) {
-- >         zs[ix] = xs[ix] + ys[ix];
-- >     }
-- > }
--
-- Here, the `__global__` keyword marks the function as a kernel that
-- should be executed on the GPU in a data-parallel fashion. When we
-- execute this function on the GPU, (at least) /N/ threads will execute
-- /N/ individual instances of the kernel function `vecAdd`. Each thread
-- will operate on a single element of each input array to create a single
-- value in the result. See the CUDA programming guide for more details.
--
-- We can save this to a file `vector_add.cu`, and compile it using `nvcc`
-- into a form that we can then load onto the GPU and execute:
--
-- > $ nvcc --ptx vector_add.cu
--
-- The module "Foreign.CUDA.Driver.Module" contains functions for loading
-- the resulting .ptx file (or .cubin files) into the running program.
--
-- >>> mdl <- loadFile "vector_add.ptx"
--
-- Once finished with the module, it is also a good idea to 'unload' it.
--
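-- For example, once all work that uses functions from the module has
-- finished (at the very end of this tutorial, in our case):
--
-- >>> unload mdl
--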
-- Modules may export kernel functions, global variables, and texture
-- references. Before we can execute our function, we need to look it up in
-- the module by name.
--
-- >>> vecAdd <- getFun mdl "vecAdd"
--
-- Given this reference to our kernel function, we are almost ready to
-- execute it on the device using 'launchKernel', but first, we must create
-- some data that we can execute the function on.
--
--
-- [/Transferring data to and from the GPU/]
--
-- GPUs typically have their own memory which is separate from the CPU's
-- memory, and we need to explicitly copy data back and forth between these
-- two regions. The module "Foreign.CUDA.Driver.Marshal" provides functions
-- for allocating memory on the GPU, and copying data between the CPU and
-- GPU, as well as directly between multiple GPUs.
--
-- For simplicity, we'll use standard Haskell lists for our input and
-- output data structures. Note, however, that this will have significantly
-- lower effective bandwidth than reading a single contiguous region of
-- memory, so for most practical purposes you will want to use some kind of
-- unboxed array.
--
-- >>> let xs = [1..1024] :: [Float]
-- >>> let ys = [2,4..2048] :: [Float]
--
-- In CUDA, as in C, all memory management is explicit, and arrays on the
-- device must be explicitly allocated and freed. As mentioned previously,
-- data transfer is also explicit. However, we do provide convenience
-- functions for combined allocation and marshalling, as well as bracketed
-- operations.
--
-- >>> xs_dev <- newListArray xs
-- >>> ys_dev <- newListArray ys
-- >>> zs_dev <- mallocArray 1024 :: IO (DevicePtr Float)
--
-- After executing the kernel (see next section), we transfer the result
-- back to the host, and free the memory that was allocated on the GPU.
--
-- >>> zs <- peekListArray 1024 zs_dev
-- >>> free xs_dev
-- >>> free ys_dev
-- >>> free zs_dev
--
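-- As a sketch of the bracketed style mentioned above (this assumes the
-- 'allocaArray' bracket from "Foreign.CUDA.Driver.Marshal", and uses the
-- 'launchKernel' call introduced in the next section), the output array
-- could instead be allocated, used, and freed in one step:
--
-- > zs <- allocaArray 1024 $ \zs_dev -> do
-- >         launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing
-- >                      [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
-- >         sync
-- >         peekListArray 1024 zs_dev :: IO [Float]
--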
--
-- [/Piecing it all together/]
--
-- Finally, we have everything in place to execute our operation on the
-- GPU. Launching a kernel on the GPU consists of creating many threads on
-- the GPU which all execute the same function, and each thread has
-- a unique identifier in the grid/block hierarchy which can be used to
-- identify exactly which element this thread should process (the
-- `blockIdx` and `threadIdx` parameters that we saw earlier,
-- respectively).
--
-- To execute our function, we will use a grid of 4 blocks, each containing
-- 256 threads. Thus, a total of 1024 threads will be launched, which will
-- each compute a single element of the output array (recall that our input
-- arrays each have 1024 elements). The module
-- "Foreign.CUDA.Analysis.Occupancy" contains functions to help determine
-- the ideal thread block size for a given kernel and GPU combination.
--
-- >>> launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
--
-- Note that kernel execution is asynchronous, so we should also wait for
-- the operation to complete before attempting to read the results back.
--
-- >>> sync
--
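-- To summarise, a complete program built from the pieces above might look
-- something like the following. This is only a sketch: error handling is
-- omitted, and it assumes the kernel from earlier has been compiled to
-- `vector_add.ptx` in the current directory.
--
-- > module Main where
-- >
-- > import Foreign.CUDA.Driver
-- >
-- > main :: IO ()
-- > main = do
-- >   initialise []
-- >   dev0   <- device 0
-- >   ctx    <- create dev0 []
-- >   mdl    <- loadFile "vector_add.ptx"
-- >   vecAdd <- getFun mdl "vecAdd"
-- >
-- >   let xs = [1..1024]   :: [Float]
-- >       ys = [2,4..2048] :: [Float]
-- >
-- >   xs_dev <- newListArray xs
-- >   ys_dev <- newListArray ys
-- >   zs_dev <- mallocArray 1024 :: IO (DevicePtr Float)
-- >
-- >   launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing
-- >                [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
-- >   sync
-- >
-- >   zs <- peekListArray 1024 zs_dev
-- >   print (take 5 zs)
-- >
-- >   free xs_dev
-- >   free ys_dev
-- >   free zs_dev
-- >   unload mdl
-- >   destroy ctx
--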
-- And that's it!
--
--
-- [/Next steps/]
--
-- As mentioned at the end of the previous section, kernels on the GPU are
-- executed asynchronously with respect to the host, and other operations
-- such as data transfers can also be executed asynchronously. This allows
-- the CPU to continue doing other work while the GPU is busy.
-- 'Foreign.CUDA.Driver.Event.Event's can be used to check whether an
-- operation has completed.
--
-- It is also possible to execute multiple kernels or data transfers
-- concurrently with each other, by assigning those operations to different
-- execution 'Foreign.CUDA.Driver.Stream.Stream's. Used in conjunction with
-- 'Foreign.CUDA.Driver.Event.Event's, operations will be scheduled
-- efficiently, executing only once all of their dependencies (in the form
-- of 'Foreign.CUDA.Driver.Event.Event's) have been cleared.
--
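-- As a rough sketch only (this assumes the 'create', 'record', 'block', and
-- 'destroy' operations of those two modules, imported qualified to avoid
-- name clashes), a kernel could be launched into its own stream and waited
-- for via an event as follows:
--
-- > import qualified Foreign.CUDA.Driver.Event  as Event
-- > import qualified Foreign.CUDA.Driver.Stream as Stream
-- >
-- > do st <- Stream.create []
-- >    ev <- Event.create []
-- >    launchKernel vecAdd (4,1,1) (256,1,1) 0 (Just st)
-- >                 [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
-- >    Event.record ev (Just st)
-- >    Event.block ev              -- wait until the launch has completed
-- >    Event.destroy ev
-- >    Stream.destroy st
--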
--------------------------------------------------------------------------------

cuda.cabal

+7
@@ -13,6 +13,13 @@ Description:
The configure script will look for your CUDA installation in the standard
places, and if the nvcc compiler is found in your PATH, relative to that.
.
This library provides bindings to both the CUDA Driver and Runtime APIs. To
get started, see one of:
.
* "Foreign.CUDA.Driver"
.
* "Foreign.CUDA.Runtime"
.
This release is tested with versions 6.0, 6.5, and 7.0 of the CUDA toolkit.
.
