Showing posts with the label CUDA.

Thursday, November 12, 2015

CUDA: Query Device Information With The Runtime API

since: 2015/11/12
update: 2015/11/12

reference:
1. Professional CUDA C Programming
2. I touchs: Get Started On NVIDIA Jetson TK1
3. I touchs: How to re-flash your Jetson TK1 Development Kit 
4. I touchs: Using the Jetson TK1 as a remote development environment for CUDA

5. CUDA Runtime API :: CUDA Toolkit Documentation

A. Write the following program in Nsight Eclipse Edition:
     // filename: checkDeviceInfor.cu
#include <cuda_runtime.h>
#include <math.h>   // pow()
#include <stdio.h>

/*
 * Display a variety of information on the first CUDA device in this system,
 * including driver version, runtime version, compute capability, bytes of
 * global memory, etc.
 */


// Wrap CUDA calls in a macro so that __FILE__/__LINE__ report the call
// site; inside a plain function they would always point at the function
// itself.
#define CHECK(call)                                                   \
{                                                                     \
    const cudaError_t error = (call);                                 \
    if (error != cudaSuccess)                                         \
    {                                                                 \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__);        \
        fprintf(stderr, "code: %d, reason: %s\n", error,              \
                cudaGetErrorString(error));                           \
    }                                                                 \
}

int main(int argc, char **argv)
{
    printf("\n%s Starting...\n\n", argv[0]);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    if (deviceCount == 0)
    {
        printf("There are no available device(s) that support CUDA\n");
    }
    else
    {
        printf("Detected %d CUDA Capable device(s)\n", deviceCount);
    }

    int dev = 0, driverVersion = 0, runtimeVersion = 0;
    CHECK(cudaSetDevice(dev));
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("Device %d: \"%s\"\n\n", dev, deviceProp.name);

    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);

    printf("  CUDA Driver Version / Runtime Version          %d.%d / %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    printf("  CUDA Capability Major/Minor version number:    %d.%d\n\n",
           deviceProp.major, deviceProp.minor);

    printf("  Total amount of global memory:                 %.2f GBytes (%llu "
           "bytes)\n", (float)deviceProp.totalGlobalMem / pow(1024.0, 3),
           (unsigned long long)deviceProp.totalGlobalMem);

    printf("  GPU Clock rate:                                %.0f MHz (%0.2f "
           "GHz)\n", deviceProp.clockRate * 1e-3f,
           deviceProp.clockRate * 1e-6f);

    printf("  Memory Clock rate:                             %.0f MHz\n",
           deviceProp.memoryClockRate * 1e-3f);

    printf("  Memory Bus Width:                              %d-bit\n",
           deviceProp.memoryBusWidth);

    if (deviceProp.l2CacheSize)
    {
        printf("  L2 Cache Size:                                 %d bytes\n\n",
               deviceProp.l2CacheSize);
    }

    printf("  Max Texture Dimension Size (x,y,z)             1D=(%d), "
           "2D=(%d,%d), 3D=(%d,%d,%d)\n", deviceProp.maxTexture1D,
           deviceProp.maxTexture2D[0], deviceProp.maxTexture2D[1],
           deviceProp.maxTexture3D[0], deviceProp.maxTexture3D[1],
           deviceProp.maxTexture3D[2]);

    printf("  Max Layered Texture Size (dim) x layers        1D=(%d) x %d, "
           "2D=(%d,%d) x %d\n\n", deviceProp.maxTexture1DLayered[0],
           deviceProp.maxTexture1DLayered[1], deviceProp.maxTexture2DLayered[0],
           deviceProp.maxTexture2DLayered[1],
           deviceProp.maxTexture2DLayered[2]);

    printf("  Total amount of constant memory:               %lu bytes\n",
           deviceProp.totalConstMem);

    printf("  Total amount of shared memory per block:       %lu bytes\n",
           deviceProp.sharedMemPerBlock);

    printf("  Total number of registers available per block: %d\n\n",
           deviceProp.regsPerBlock);

    printf("  Warp size:                                     %d\n\n",
           deviceProp.warpSize);

    printf("  Maximum number of threads per multiprocessor:  %d\n",
           deviceProp.maxThreadsPerMultiProcessor);

    printf("  Maximum number of threads per block:           %d\n",
           deviceProp.maxThreadsPerBlock);

    printf("  Maximum sizes of each dimension of a block:    %d x %d x %d\n",
           deviceProp.maxThreadsDim[0],
           deviceProp.maxThreadsDim[1],
           deviceProp.maxThreadsDim[2]);

    printf("  Maximum sizes of each dimension of a grid:     %d x %d x %d\n",
           deviceProp.maxGridSize[0],
           deviceProp.maxGridSize[1],
           deviceProp.maxGridSize[2]);

    printf("  Maximum memory pitch:                          %lu bytes\n\n",
           deviceProp.memPitch);

    exit(EXIT_SUCCESS);
}

-----------------------------------------------------------------------------------------------

B. Execution result:

-----------------------------------------------------------------------------------------------

C. Notes:

     The command-line tool nvidia-smi (System Management Interface) can also
     query device information, but the Jetson TK1 does not appear to support it.

Tuesday, October 20, 2015

CUDA: Trying to Measure the Kernel with a Timer

since: 2015/10/20
update: 2015/10/20

reference:
1. Professional CUDA C Programming
2. I touchs: Get Started On NVIDIA Jetson TK1
3. I touchs: How to re-flash your Jetson TK1 Development Kit 
4. I touchs: Using the Jetson TK1 as a remote development environment for CUDA


A. Timing With CPU Timer

     1. Write the following program in Nsight Eclipse Edition:
         // filename: sumArraysOnGPU-timer.cu
#include <cuda_runtime.h>
#include <math.h>      // fabs()
#include <stdio.h>
#include <stdlib.h>    // malloc(), rand(), srand()
#include <string.h>    // memset()
#include <sys/time.h>  // gettimeofday()
#include <time.h>      // time()


// check error
// Wrap CUDA calls in a macro so that __FILE__/__LINE__ report the call
// site; inside a plain function they would always point at the function
// itself.
#define CHECK(call)                                                   \
{                                                                     \
    const cudaError_t error = (call);                                 \
    if (error != cudaSuccess)                                         \
    {                                                                 \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__);        \
        fprintf(stderr, "code: %d, reason: %s\n", error,              \
                cudaGetErrorString(error));                           \
    }                                                                 \
}

 
// the timer on cpu
double seconds()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}


// verifying your kernel
void checkResult(float *hostRef, float *gpuRef, const int N)
{
    double epsilon = 1.0E-8;
    bool match = 1;

    for (int i = 0; i < N; i++)
    {
        if (fabs(hostRef[i] - gpuRef[i]) > epsilon)
        {
            match = 0;
            printf("Arrays do not match!\n");
            printf("host %5.2f gpu %5.2f at current %d\n", hostRef[i],
                   gpuRef[i], i);
            break;
        }
    }

    if (match) printf("Arrays match.\n\n");

    return;
}

 

// initial random data
void initialData(float *ip, int size)
{
    // generate different seed for random number
    time_t t;
    srand((unsigned) time(&t));

    for (int i = 0; i < size; i++)
    {
        ip[i] = (float)( rand() & 0xFF ) / 10.0f;
    }

    return;
}



// sum Arrays On Host

void sumArraysOnHost(float *A, float *B, float *C, const int N)
{
    for (int idx = 0; idx < N; idx++)
    {
        C[idx] = A[idx] + B[idx];
    }
}


// sum Arrays On GPU
__global__ void sumArraysOnGPU(float *A, float *B, float *C, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N) C[i] = A[i] + B[i];
}


// the main function

int main(int argc, char **argv)
{
    printf("\n%s Starting...\n", argv[0]);

    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("Using Device %d: %s\n", dev, deviceProp.name);
    CHECK(cudaSetDevice(dev));

    // set up data size of vectors
    int nElem = 1 << 24;
    printf("Vector size %d\n", nElem);

    // malloc host memory
    size_t nBytes = nElem * sizeof(float);

    float *h_A, *h_B, *hostRef, *gpuRef;
    h_A     = (float *)malloc(nBytes);
    h_B     = (float *)malloc(nBytes);
    hostRef = (float *)malloc(nBytes);
    gpuRef  = (float *)malloc(nBytes);

    double iStart, iElaps;

    // initialize data at host side
    iStart = seconds();
    initialData(h_A, nElem);
    initialData(h_B, nElem);
    iElaps = seconds() - iStart;
    printf("initialData Time elapsed %f sec\n", iElaps);
    memset(hostRef, 0, nBytes);
    memset(gpuRef,  0, nBytes);

    // add vector at host side for result checks
    iStart = seconds();
    sumArraysOnHost(h_A, h_B, hostRef, nElem);
    iElaps = seconds() - iStart;
    printf("sumArraysOnHost Time elapsed %f sec\n", iElaps);

    // malloc device global memory
    float *d_A, *d_B, *d_C;
    CHECK(cudaMalloc((float**)&d_A, nBytes));
    CHECK(cudaMalloc((float**)&d_B, nBytes));
    CHECK(cudaMalloc((float**)&d_C, nBytes));

    // transfer data from host to device
    CHECK(cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice));

    // invoke kernel at host side
    int iLen = 256;
    dim3 block (iLen);
    dim3 grid  ((nElem + block.x - 1) / block.x);


    iStart = seconds();
    sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);
    CHECK(cudaDeviceSynchronize()); // used just for debugging purpose
    iElaps = seconds() - iStart;
    printf("sumArraysOnGPU <<<  %d, %d  >>>  Time elapsed %f sec\n", grid.x,
           block.x, iElaps);


    // the testing result ....
    //

    // sumArraysOnGPU <<<  16384, 1024  >>>  Time elapsed 0.040817 sec
    // sumArraysOnGPU <<<  32768, 512  >>>  Time elapsed 0.031031 sec
    // sumArraysOnGPU <<<  65536, 256  >>>  Time elapsed 0.029070 sec
    // sumArraysOnGPU <<<  131072, 128  >>>  Time elapsed 0.028804 sec

    // sumArraysOnGPU <<<  262144, 64  >>>  Time elapsed 0.056073 sec
    // sumArraysOnGPU <<<  524288, 32  >>>  Time elapsed 0.098515 sec
    // sumArraysOnGPU <<<  1048576, 16  >>>  Time elapsed 0.190098 sec
    // sumArraysOnGPU <<<  2097152, 8  >>>  Time elapsed 0.375030 sec
    // sumArraysOnGPU <<<  4194304, 4  >>>  Time elapsed 0.803681 sec
    // sumArraysOnGPU <<<  8388608, 2  >>>  Time elapsed 1.835205 sec
    // sumArraysOnGPU <<<  16777216, 1  >>>  Time elapsed 3.515245 sec


    // check kernel error
    CHECK(cudaGetLastError());

    // copy kernel result back to host side
    CHECK(cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost));

    // check device results
    checkResult(hostRef, gpuRef, nElem);

    // free device global memory
    CHECK(cudaFree(d_A));
    CHECK(cudaFree(d_B));
    CHECK(cudaFree(d_C));

    // free host memory
    free(h_A);
    free(h_B);
    free(hostRef);
    free(gpuRef);

    return(0);
}


     2. Execution result:

-----------------------------------------------------------------------------------------------

B. Timing With nvprof (the NVIDIA profiling tool)
     1. Before this, don't forget to add the CUDA bin directory to your
         environment variables:
         => How to re-flash your Jetson TK1 Development Kit
         => G. Add CUDA bin Environment Variables


     2. Log in to the TK1:
         $ ssh ubuntu@192.168.0.106

     3. Go to the release directory:
        $ cd /home/ubuntu/RD/projects/cuda/Release

     4. Timing with nvprof:
         $ nvprof ./NSightCUDA

     5. Result:
./NSightCUDA Starting...
==2996== NVPROF is profiling process 2996, command: ./NSightCUDA
Using Device 0: GK20A
Vector size 16777216
initialData Time elapsed 4.632509 sec
sumArraysOnHost Time elapsed 0.060414 sec
sumArraysOnGPU <<<  65536, 256  >>>  Time elapsed 0.023884 sec
Arrays match.

==2996== Profiling application: ./NSightCUDA
==2996== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 57.86%  137.14ms         3  45.714ms  44.908ms  46.267ms  [CUDA memcpy HtoD]
 32.42%  76.837ms         1  76.837ms  76.837ms  76.837ms  [CUDA memcpy DtoH]
  9.73%  23.053ms         1  23.053ms  23.053ms  23.053ms  sumArraysOnGPU(float*, float*, float*, int)

==2996== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 54.19%  219.10ms         4  54.774ms  45.602ms  79.567ms  cudaMemcpy
 39.03%  157.79ms         3  52.598ms  1.0795ms  152.74ms  cudaMalloc
  5.87%  23.717ms         1  23.717ms  23.717ms  23.717ms  cudaDeviceSynchronize
  0.64%  2.5713ms         3  857.08us  661.42us  1.0873ms  cudaFree
  0.14%  582.58us        83  7.0190us  1.4160us  302.67us  cuDeviceGetAttribute
  0.08%  331.83us         1  331.83us  331.83us  331.83us  cudaGetDeviceProperties
  0.03%  119.25us         1  119.25us  119.25us  119.25us  cudaLaunch
  0.01%  45.583us         1  45.583us  45.583us  45.583us  cudaSetDevice
  0.00%  17.416us         2  8.7080us  4.5000us  12.916us  cuDeviceGetCount
  0.00%  17.333us         1  17.333us  17.333us  17.333us  cudaGetLastError
  0.00%  10.667us         1  10.667us  10.667us  10.667us  cuDeviceTotalMem
  0.00%  9.0000us         1  9.0000us  9.0000us  9.0000us  cudaConfigureCall
  0.00%  6.2500us         1  6.2500us  6.2500us  6.2500us  cuDeviceGetName
  0.00%  4.8340us         4  1.2080us     667ns  1.8340us  cudaSetupArgument
  0.00%  4.1670us         2  2.0830us  1.8330us  2.3340us  cuDeviceGet


  Note: the nvprof result is more accurate than the host-side timing result.

Monday, September 28, 2015

CUDA: Check Dimension

since: 2015/09/28
update: 2015/09/28

reference:
1. Professional CUDA C Programming
2. I touchs: Get Started On NVIDIA Jetson TK1
3. I touchs: How to re-flash your Jetson TK1 Development Kit 
4. I touchs: Using the Jetson TK1 as a remote development environment for CUDA


A. Write the following program in Nsight Eclipse Edition:
     // filename: checkDimension.cu
#include <cuda_runtime.h>
#include <stdio.h>


__global__ void checkIndex(const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N)
        printf("threadIdx:(%d, %d, %d), blockIdx:(%d, %d, %d), blockDim:(%d, %d, %d), gridDim:(%d, %d, %d)\n",
               threadIdx.x, threadIdx.y, threadIdx.z,
               blockIdx.x, blockIdx.y, blockIdx.z,
               blockDim.x, blockDim.y, blockDim.z,
               gridDim.x, gridDim.y, gridDim.z);
}

int main(int argc, char **argv)
{
    // define total data element
    int nElem = 5;
    printf("\n");
    printf(">> total data count = %d \n", nElem);

    // define grid and block structure
    dim3 block(3);
    dim3 grid((nElem + block.x - 1) / block.x);
    printf(">> define block dimension = %d \n", block.x);
    printf(">> we get grid dimension = %d \n", grid.x);

    // check grid and block dimension from host side
    printf("\n* check grid and block dimension from host side *\n");
    printf("grid.x = %d , grid.y = %d , grid.z = %d\n", grid.x, grid.y, grid.z);
    printf("block.x = %d , block.y = %d , block.z = %d\n", block.x, block.y, block.z);
    printf("\n");

    // check grid and block dimension from device side
    printf("* check grid and block dimension from device side *\n");
    checkIndex<<<grid, block>>>(nElem);

    // reset device before you leave
    cudaDeviceReset();

    return 0;
}


-----------------------------------------------------------------------------------------------

B. Execution result:

Saturday, September 5, 2015

CUDA: Hello World!

since: 2015/09/05
update: 2015/09/05

reference:
1. Professional CUDA C Programming
2. I touchs: Get Started On NVIDIA Jetson TK1
3. I touchs: How to re-flash your Jetson TK1 Development Kit 
4. I touchs: Using the Jetson TK1 as a remote development environment for CUDA


A. Write the following program in Nsight Eclipse Edition:
     // filename: hello.cu
#include <stdio.h>

// __global__:
// the function is called from the host (CPU) and executed on the device (GPU)

__global__ void helloFromGPU()
{
    // blockIdx.x for block index
    // threadIdx.x for thread index

    printf("Hello World from GPU! block %d thread %d \n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    // hello from cpu
    printf("\nHello World from CPU!\n\n");

    // kernel(GPU code) configuration: total threads = blockNum x threadNum
    int blockNum = 2;
    int threadNum = 3;

    // hello from gpu: call from the host thread to the code on the device side
    helloFromGPU <<<blockNum, threadNum >>>();
    cudaDeviceReset(); //destroy and clean up all resources
    //cudaDeviceSynchronize();

    return 0;
}


-----------------------------------------------------------------------------------------------

B. Execution result:


Saturday, August 1, 2015

Using the Jetson TK1 as a remote development environment for CUDA

since: 2015/08/01
update: 2015/08/01

reference:
1. How to use the Jetson TK1 as a remote development environment for CUDA
2. Remote application development using NVIDIA® Nsight™ Eclipse Edition
3. NVIDIA CUDA GETTING STARTED GUIDE FOR LINUX v6.5
4. CUDA 6.0 in Ubuntu 14.04.1 LTS - NVIDIA Developer Forums
5. CUDA Toolkit 6.5
6. CUDA Toolkit Documentation v7.0


A. Install the CUDA toolkit on the local machine (Mac)
     1. Install Xcode

     2. Install the Xcode Command Line Tools:
         $ xcode-select --install

     3. Go to CUDA Toolkit 6.5 and download CUDA Toolkit 6.5 for Mac OS X
         version: cuda_6.5.14_mac_64.pkg (latest 64-bit CUDA 6.5 package)

     4. Double-click to install. The CUDA Driver component can be skipped,
         unless you want to run CUDA-Z.

-----------------------------------------------------------------------------------------------

B. Install the CUDA toolkit on the remote machine (Jetson TK1)
     1. Verify You Have a CUDA-Capable GPU
         $ lspci | grep -i nvidia
         00:00.0 PCI bridge: NVIDIA Corporation Device 0e13 (rev a1)

     2. Go to CUDA Toolkit 6.5 and download CUDA Toolkit 6.5 for Linux ARM
         version: ONLY to be used for Jetson TK1
         cuda-repo-l4t-r21.2-6-5-prod_6.5-34_armhf.deb (L4T 21.2 DEB)
         (transfer the file to the /home/ubuntu/RD/software directory on the remote)

     3. Install:
          $ cd /home/ubuntu/RD/software
          $ sudo dpkg -i cuda-repo-l4t-r21.2-6-5-prod_6.5-34_armhf.deb
          $ sudo apt-get update
          $ sudo apt-get install cuda-toolkit-6-5

     4. Add the user to the video group
          $ sudo usermod -a -G video ubuntu

     5. Environment Setup ...
          To change the environment variables for 32-bit ARM operating systems:
          $ export PATH=/usr/local/cuda-6.5/bin:$PATH
          $ export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib:$LD_LIBRARY_PATH

     6. Check the L4T version with the following command:
         $ head -1 /etc/nv_tegra_release
            # R21 (release), REVISION: 4.0, GCID: 5650832, BOARD: ardbeg, EABI: hard, DATE: Thu Jun 25 22:38:59 UTC 2015

     7. install g++-4.6
         $ sudo apt-get install g++-4.6

     8. (Optional) Install Writable Samples
          $ cd /usr/local/cuda-6.5/bin
          $ cuda-install-samples-6.5.sh /home/ubuntu/RD/projects/
          $ cd NVIDIA_CUDA-6.5_Samples
          $ make

          The resulting binaries will be placed under:
          /home/ubuntu/RD/projects/NVIDIA_CUDA-6.5_Samples/bin


     9. After restarting the system, finally run one of the samples!
         $ cd /home/ubuntu/RD/projects/NVIDIA_CUDA-6.5_Samples/bin
         $ cd armv7l/linux/release/gnueabihf
         $ ./deviceQuery


         $ ./bandwidthTest

-----------------------------------------------------------------------------------------------

C. Configure Git
       To synchronize CUDA projects between local(Mac) and target(Jetson TK1) systems,
       you need to configure git on both the local and the target systems using these
       commands:


       $ git config --global user.email "lanli0210@gmail.com"
       $ git config --global user.name "lanli"

-----------------------------------------------------------------------------------------------

D. Create an Nsight project on the local machine (Mac)
     1. Open Nsight:
         YourMac > Developer/NVIDIA/CUDA-6.5/libnsight/nsight.app

     2. Select a workspace


     3. Create a new project
         File > New > CUDA C/C++ Project

     4. Project type: Empty Project

     5. Basic settings:

     6. Target system > Manage... > Remote Connections

     7. Configure CUDA Toolkit

     8. Target System(Project Path)

     9. Remove the Local System

    Result:

     10. Done:

-----------------------------------------------------------------------------------------------

E. Adding the source code for the CUDA project
     1. File > New > Source File

     2. Configure the source file (just use a template)

     3. Done:

     4. Press "Build"
        p.s. Note that the build command is /usr/local/cuda-6.5/bin/nvcc,
              which is the compiler on the Jetson, not on the MacBook.

-----------------------------------------------------------------------------------------------

F. Debugging the code in the CUDA project
     1. Debug Configurations

     2. Debugging ...
         p.s. By default the debugger will break in main() – this can be switched off
                in the debug configuration.