Showing posts with the label CUDA.

Thursday, November 12, 2015

CUDA: Query Device Information With The Runtime API

since: 2015/11/12
update: 2015/11/12

reference:
1. Professional CUDA C Programming
2. I touchs: Get Started On NVIDIA Jetson TK1
3. I touchs: How to re-flash your Jetson TK1 Development Kit 
4. I touchs: Using the Jetson TK1 as a remote development environment for CUDA

5. CUDA Runtime API :: CUDA Toolkit Documentation

A. Write the following program in Nsight Eclipse Edition:
     // filename: checkDeviceInfor.cu
#include <cuda_runtime.h>
#include <math.h>   // pow()
#include <stdio.h>

/*
 * Display a variety of information on the first CUDA device in this system,
 * including driver version, runtime version, compute capability, bytes of
 * global memory, etc.
 */


// Wrap CUDA calls in a macro so that __FILE__/__LINE__ report the call
// site; inside a plain function they would always point at the function
// itself.
#define CHECK(call)                                                   \
{                                                                     \
    const cudaError_t error = (call);                                 \
    if (error != cudaSuccess)                                         \
    {                                                                 \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__);        \
        fprintf(stderr, "code: %d, reason: %s\n", error,              \
                cudaGetErrorString(error));                           \
    }                                                                 \
}

int main(int argc, char **argv)
{
    printf("\n%s Starting...\n\n", argv[0]);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    if (deviceCount == 0)
    {
        printf("There are no available device(s) that support CUDA\n");
    }
    else
    {
        printf("Detected %d CUDA Capable device(s)\n", deviceCount);
    }

    int dev = 0, driverVersion = 0, runtimeVersion = 0;
    CHECK(cudaSetDevice(dev));
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("Device %d: \"%s\"\n\n", dev, deviceProp.name);

    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);

    printf("  CUDA Driver Version / Runtime Version          %d.%d / %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    printf("  CUDA Capability Major/Minor version number:    %d.%d\n\n",
           deviceProp.major, deviceProp.minor);

    printf("  Total amount of global memory:                 %.2f GBytes (%llu "
           "bytes)\n", (float)deviceProp.totalGlobalMem / pow(1024.0, 3),
           (unsigned long long)deviceProp.totalGlobalMem);

    printf("  GPU Clock rate:                                %.0f MHz (%0.2f "
           "GHz)\n", deviceProp.clockRate * 1e-3f,
           deviceProp.clockRate * 1e-6f);

    printf("  Memory Clock rate:                             %.0f MHz\n",
           deviceProp.memoryClockRate * 1e-3f);

    printf("  Memory Bus Width:                              %d-bit\n",
           deviceProp.memoryBusWidth);

    if (deviceProp.l2CacheSize)
    {
        printf("  L2 Cache Size:                                 %d bytes\n\n",
               deviceProp.l2CacheSize);
    }

    printf("  Max Texture Dimension Size (x,y,z)             1D=(%d), "
           "2D=(%d,%d), 3D=(%d,%d,%d)\n", deviceProp.maxTexture1D,
           deviceProp.maxTexture2D[0], deviceProp.maxTexture2D[1],
           deviceProp.maxTexture3D[0], deviceProp.maxTexture3D[1],
           deviceProp.maxTexture3D[2]);

    printf("  Max Layered Texture Size (dim) x layers        1D=(%d) x %d, "
           "2D=(%d,%d) x %d\n\n", deviceProp.maxTexture1DLayered[0],
           deviceProp.maxTexture1DLayered[1], deviceProp.maxTexture2DLayered[0],
           deviceProp.maxTexture2DLayered[1],
           deviceProp.maxTexture2DLayered[2]);

    printf("  Total amount of constant memory:               %lu bytes\n",
           deviceProp.totalConstMem);

    printf("  Total amount of shared memory per block:       %lu bytes\n",
           deviceProp.sharedMemPerBlock);

    printf("  Total number of registers available per block: %d\n\n",
           deviceProp.regsPerBlock);

    printf("  Warp size:                                     %d\n\n",
           deviceProp.warpSize);

    printf("  Maximum number of threads per multiprocessor:  %d\n",
           deviceProp.maxThreadsPerMultiProcessor);

    printf("  Maximum number of threads per block:           %d\n",
           deviceProp.maxThreadsPerBlock);

    printf("  Maximum sizes of each dimension of a block:    %d x %d x %d\n",
           deviceProp.maxThreadsDim[0],
           deviceProp.maxThreadsDim[1],
           deviceProp.maxThreadsDim[2]);

    printf("  Maximum sizes of each dimension of a grid:     %d x %d x %d\n",
           deviceProp.maxGridSize[0],
           deviceProp.maxGridSize[1],
           deviceProp.maxGridSize[2]);

    printf("  Maximum memory pitch:                          %lu bytes\n\n",
           deviceProp.memPitch);

    exit(EXIT_SUCCESS);
}

-----------------------------------------------------------------------------------------------

B. Execution result:

-----------------------------------------------------------------------------------------------

C. Notes:

     The command-line tool nvidia-smi (System Management Interface) can also
     query device information, but the Jetson TK1 does not appear to support it.

Tuesday, October 20, 2015

CUDA: Trying to Measure the Kernel with a Timer

since: 2015/10/20
update: 2015/10/20

reference:
1. Professional CUDA C Programming
2. I touchs: Get Started On NVIDIA Jetson TK1
3. I touchs: How to re-flash your Jetson TK1 Development Kit 
4. I touchs: Using the Jetson TK1 as a remote development environment for CUDA


A. Timing With CPU Timer

     1. Write the following program in Nsight Eclipse Edition:
         // filename: sumArraysOnGPU-timer.cu
#include <cuda_runtime.h>
#include <math.h>      // fabs()
#include <stdio.h>
#include <stdlib.h>    // malloc(), rand(), srand()
#include <string.h>    // memset()
#include <sys/time.h>  // gettimeofday()
#include <time.h>      // time()


// check error
// Wrap CUDA calls in a macro so that __FILE__/__LINE__ report the call
// site; inside a plain function they would always point at the function
// itself.
#define CHECK(call)                                                   \
{                                                                     \
    const cudaError_t error = (call);                                 \
    if (error != cudaSuccess)                                         \
    {                                                                 \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__);        \
        fprintf(stderr, "code: %d, reason: %s\n", error,              \
                cudaGetErrorString(error));                           \
    }                                                                 \
}

 
// the timer on cpu
double seconds()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}


// verifying your kernel
void checkResult(float *hostRef, float *gpuRef, const int N)
{
    double epsilon = 1.0E-8;
    bool match = 1;

    for (int i = 0; i < N; i++)
    {
        if (fabs(hostRef[i] - gpuRef[i]) > epsilon)
        {
            match = 0;
            printf("Arrays do not match!\n");
            printf("host %5.2f gpu %5.2f at current %d\n", hostRef[i],
                   gpuRef[i], i);
            break;
        }
    }

    if (match) printf("Arrays match.\n\n");

    return;
}

 

// initial random data
void initialData(float *ip, int size)
{
    // generate different seed for random number
    time_t t;
    srand((unsigned) time(&t));

    for (int i = 0; i < size; i++)
    {
        ip[i] = (float)( rand() & 0xFF ) / 10.0f;
    }

    return;
}



// sum Arrays On Host

void sumArraysOnHost(float *A, float *B, float *C, const int N)
{
    for (int idx = 0; idx < N; idx++)
    {
        C[idx] = A[idx] + B[idx];
    }
}


// sum Arrays On GPU
__global__ void sumArraysOnGPU(float *A, float *B, float *C, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N) C[i] = A[i] + B[i];
}


// the main function

int main(int argc, char **argv)
{
    printf("\n%s Starting...\n", argv[0]);

    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("Using Device %d: %s\n", dev, deviceProp.name);
    CHECK(cudaSetDevice(dev));

    // set up data size of vectors
    int nElem = 1 << 24;
    printf("Vector size %d\n", nElem);

    // malloc host memory
    size_t nBytes = nElem * sizeof(float);

    float *h_A, *h_B, *hostRef, *gpuRef;
    h_A     = (float *)malloc(nBytes);
    h_B     = (float *)malloc(nBytes);
    hostRef = (float *)malloc(nBytes);
    gpuRef  = (float *)malloc(nBytes);

    double iStart, iElaps;

    // initialize data at host side
    iStart = seconds();
    initialData(h_A, nElem);
    initialData(h_B, nElem);
    iElaps = seconds() - iStart;
    printf("initialData Time elapsed %f sec\n", iElaps);
    memset(hostRef, 0, nBytes);
    memset(gpuRef,  0, nBytes);

    // add vector at host side for result checks
    iStart = seconds();
    sumArraysOnHost(h_A, h_B, hostRef, nElem);
    iElaps = seconds() - iStart;
    printf("sumArraysOnHost Time elapsed %f sec\n", iElaps);

    // malloc device global memory
    float *d_A, *d_B, *d_C;
    CHECK(cudaMalloc((float**)&d_A, nBytes));
    CHECK(cudaMalloc((float**)&d_B, nBytes));
    CHECK(cudaMalloc((float**)&d_C, nBytes));

    // transfer data from host to device
    CHECK(cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice));

    // invoke kernel at host side
    int iLen = 256;
    dim3 block (iLen);
    dim3 grid  ((nElem + block.x - 1) / block.x);


    iStart = seconds();
    sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);
    CHECK(cudaDeviceSynchronize()); // used just for debugging purpose
    iElaps = seconds() - iStart;
    printf("sumArraysOnGPU <<<  %d, %d  >>>  Time elapsed %f sec\n", grid.x,
           block.x, iElaps);


    // the testing result ....
    //

    // sumArraysOnGPU <<<  16384, 1024  >>>  Time elapsed 0.040817 sec
    // sumArraysOnGPU <<<  32768, 512  >>>  Time elapsed 0.031031 sec
    // sumArraysOnGPU <<<  65536, 256  >>>  Time elapsed 0.029070 sec
    // sumArraysOnGPU <<<  131072, 128  >>>  Time elapsed 0.028804 sec

    // sumArraysOnGPU <<<  262144, 64  >>>  Time elapsed 0.056073 sec
    // sumArraysOnGPU <<<  524288, 32  >>>  Time elapsed 0.098515 sec
    // sumArraysOnGPU <<<  1048576, 16  >>>  Time elapsed 0.190098 sec
    // sumArraysOnGPU <<<  2097152, 8  >>>  Time elapsed 0.375030 sec
    // sumArraysOnGPU <<<  4194304, 4  >>>  Time elapsed 0.803681 sec
    // sumArraysOnGPU <<<  8388608, 2  >>>  Time elapsed 1.835205 sec
    // sumArraysOnGPU <<<  16777216, 1  >>>  Time elapsed 3.515245 sec


    // check kernel error
    CHECK(cudaGetLastError());

    // copy kernel result back to host side
    CHECK(cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost));

    // check device results
    checkResult(hostRef, gpuRef, nElem);

    // free device global memory
    CHECK(cudaFree(d_A));
    CHECK(cudaFree(d_B));
    CHECK(cudaFree(d_C));

    // free host memory
    free(h_A);
    free(h_B);
    free(hostRef);
    free(gpuRef);

    return(0);
}


     2. Execution result:

-----------------------------------------------------------------------------------------------

B. Timing With nvprof (the NVIDIA profiling tool)
     1. Before this, don't forget to add the CUDA bin directory to your
         environment variables:
         => How to re-flash your Jetson TK1 Development Kit
         => G. Add CUDA bin Environment Variables


     2. Log in to the TK1:
         $ ssh ubuntu@192.168.0.106

     3. Go to the release directory:
        $ cd /home/ubuntu/RD/projects/cuda/Release

     4. Timing with nvprof:
         $ nvprof ./NSightCUDA

     5. Result:
./NSightCUDA Starting...
==2996== NVPROF is profiling process 2996, command: ./NSightCUDA
Using Device 0: GK20A
Vector size 16777216
initialData Time elapsed 4.632509 sec
sumArraysOnHost Time elapsed 0.060414 sec
sumArraysOnGPU <<<  65536, 256  >>>  Time elapsed 0.023884 sec
Arrays match.

==2996== Profiling application: ./NSightCUDA
==2996== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 57.86%  137.14ms         3  45.714ms  44.908ms  46.267ms  [CUDA memcpy HtoD]
 32.42%  76.837ms         1  76.837ms  76.837ms  76.837ms  [CUDA memcpy DtoH]
  9.73%  23.053ms         1  23.053ms  23.053ms  23.053ms  sumArraysOnGPU(float*, float*, float*, int)

==2996== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 54.19%  219.10ms         4  54.774ms  45.602ms  79.567ms  cudaMemcpy
 39.03%  157.79ms         3  52.598ms  1.0795ms  152.74ms  cudaMalloc
  5.87%  23.717ms         1  23.717ms  23.717ms  23.717ms  cudaDeviceSynchronize
  0.64%  2.5713ms         3  857.08us  661.42us  1.0873ms  cudaFree
  0.14%  582.58us        83  7.0190us  1.4160us  302.67us  cuDeviceGetAttribute
  0.08%  331.83us         1  331.83us  331.83us  331.83us  cudaGetDeviceProperties
  0.03%  119.25us         1  119.25us  119.25us  119.25us  cudaLaunch
  0.01%  45.583us         1  45.583us  45.583us  45.583us  cudaSetDevice
  0.00%  17.416us         2  8.7080us  4.5000us  12.916us  cuDeviceGetCount
  0.00%  17.333us         1  17.333us  17.333us  17.333us  cudaGetLastError
  0.00%  10.667us         1  10.667us  10.667us  10.667us  cuDeviceTotalMem
  0.00%  9.0000us         1  9.0000us  9.0000us  9.0000us  cudaConfigureCall
  0.00%  6.2500us         1  6.2500us  6.2500us  6.2500us  cuDeviceGetName
  0.00%  4.8340us         4  1.2080us     667ns  1.8340us  cudaSetupArgument
  0.00%  4.1670us         2  2.0830us  1.8330us  2.3340us  cuDeviceGet


  Note: the nvprof result is more accurate than the host-side timing result.

Monday, September 28, 2015

CUDA: Check Dimension

since: 2015/09/28
update: 2015/09/28

reference:
1. Professional CUDA C Programming
2. I touchs: Get Started On NVIDIA Jetson TK1
3. I touchs: How to re-flash your Jetson TK1 Development Kit 
4. I touchs: Using the Jetson TK1 as a remote development environment for CUDA


A. Write the following program in Nsight Eclipse Edition:
     // filename: checkDimension.cu
#include <cuda_runtime.h>
#include <stdio.h>


__global__ void checkIndex(const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N)
        printf("threadIdx:(%d, %d, %d), blockIdx:(%d, %d, %d), blockDim:(%d, %d, %d), gridDim:(%d, %d, %d)\n",
               threadIdx.x, threadIdx.y, threadIdx.z,
               blockIdx.x, blockIdx.y, blockIdx.z,
               blockDim.x, blockDim.y, blockDim.z,
               gridDim.x, gridDim.y, gridDim.z);
}

int main(int argc, char **argv)
{
    // define total data element
    int nElem = 5;
    printf("\n");
    printf(">> total data count = %d \n", nElem);

    // define grid and block structure
    dim3 block(3);
    dim3 grid((nElem + block.x - 1) / block.x);
    printf(">> define block dimension = %d \n", block.x);
    printf(">> we get grid dimension = %d \n", grid.x);

    // check grid and block dimension from host side
    printf("\n* check grid and block dimension from host side *\n");
    printf("grid.x = %d , grid.y = %d , grid.z = %d\n", grid.x, grid.y, grid.z);
    printf("block.x = %d , block.y = %d , block.z = %d\n", block.x, block.y, block.z);
    printf("\n");

    // check grid and block dimension from device side
    printf("* check grid and block dimension from device side *\n");
    checkIndex<<<grid, block>>>(nElem);

    // reset device before you leave
    cudaDeviceReset();

    return 0;
}


-----------------------------------------------------------------------------------------------

B. Execution result:

Saturday, September 5, 2015

CUDA: Hello World!

since: 2015/09/05
update: 2015/09/05

reference:
1. Professional CUDA C Programming
2. I touchs: Get Started On NVIDIA Jetson TK1
3. I touchs: How to re-flash your Jetson TK1 Development Kit 
4. I touchs: Using the Jetson TK1 as a remote development environment for CUDA


A. Write the following program in Nsight Eclipse Edition:
     // filename: hello.cu
#include <stdio.h>

// __global__:
// the function is called from the host (CPU) and executed on the device (GPU)

__global__ void helloFromGPU()
{
    // blockIdx.x for block index
    // threadIdx.x for thread index

    printf("Hello World from GPU! block %d thread %d \n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    // hello from cpu
    printf("\nHello World from CPU!\n\n");

    // kernel(GPU code) configuration: total threads = blockNum x threadNum
    int blockNum = 2;
    int threadNum = 3;

    // hello from gpu: call from the host thread to the code on the device side
    helloFromGPU <<<blockNum, threadNum >>>();
    cudaDeviceReset(); //destroy and clean up all resources
    //cudaDeviceSynchronize();

    return 0;
}


-----------------------------------------------------------------------------------------------

B. Execution result:


Saturday, August 1, 2015

Using the Jetson TK1 as a remote development environment for CUDA

since: 2015/08/01
update: 2015/08/01

reference:
1. How to use the Jetson TK1 as a remote development environment for CUDA
2. Remote application development using NVIDIA® Nsight™ Eclipse Edition
3. NVIDIA CUDA GETTING STARTED GUIDE FOR LINUX v6.5
4. CUDA 6.0 in Ubuntu 14.04.1 LTS - NVIDIA Developer Forums
5. CUDA Toolkit 6.5
6. CUDA Toolkit Documentation v7.0


A. Install the CUDA toolkit on the local machine (Mac)
     1. Install Xcode

     2. Install the Xcode Command Line Tools:
         $ xcode-select --install

     3. Go to CUDA Toolkit 6.5 and download CUDA Toolkit 6.5 for Mac OS X
         version: cuda_6.5.14_mac_64.pkg (latest 64-bit CUDA 6.5 package)

     4. Double-click to install. The CUDA Driver component can be skipped,
         unless you want to run CUDA-Z.

-----------------------------------------------------------------------------------------------

B. Install the CUDA toolkit on the remote machine (Jetson TK1)
     1. Verify You Have a CUDA-Capable GPU
         $ lspci | grep -i nvidia
         00:00.0 PCI bridge: NVIDIA Corporation Device 0e13 (rev a1)

     2. Go to CUDA Toolkit 6.5 and download CUDA Toolkit 6.5 for Linux ARM
         version: ONLY to be used for Jetson TK1
         cuda-repo-l4t-r21.2-6-5-prod_6.5-34_armhf.deb (L4T 21.2 DEB)
         (transfer the file to the /home/ubuntu/RD/software directory on the remote)

     3. Install:
          $ cd /home/ubuntu/RD/software
          $ sudo dpkg -i cuda-repo-l4t-r21.2-6-5-prod_6.5-34_armhf.deb
          $ sudo apt-get update
          $ sudo apt-get install cuda-toolkit-6-5

     4. Add the user to the video group
          $ sudo usermod -a -G video ubuntu

     5. Environment Setup ...
          To change the environment variables for 32-bit ARM operating systems:
          $ export PATH=/usr/local/cuda-6.5/bin:$PATH
          $ export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib:$LD_LIBRARY_PATH

     6. Check the L4T version with the following command:
         $ head -1 /etc/nv_tegra_release
            # R21 (release), REVISION: 4.0, GCID: 5650832, BOARD: ardbeg, EABI: hard, DATE: Thu Jun 25 22:38:59 UTC 2015

     7. install g++-4.6
         $ sudo apt-get install g++-4.6

     8. (Optional) Install Writable Samples
          $ cd /usr/local/cuda-6.5/bin
          $ cuda-install-samples-6.5.sh /home/ubuntu/RD/projects/
          $ cd NVIDIA_CUDA-6.5_Samples
          $ make

          The resulting binaries will be placed under:
          /home/ubuntu/RD/projects/NVIDIA_CUDA-6.5_Samples/bin


     9. After restarting the system, finally run one of the samples!
         $ cd /home/ubuntu/RD/projects/NVIDIA_CUDA-6.5_Samples/bin
         $ cd armv7l/linux/release/gnueabihf
         $ ./deviceQuery


         $ ./bandwidthTest

-----------------------------------------------------------------------------------------------

C. Configure Git
       To synchronize CUDA projects between local(Mac) and target(Jetson TK1) systems,
       you need to configure git on both the local and the target systems using these
       commands:


       $ git config --global user.email "lanli0210@gmail.com"
       $ git config --global user.name "lanli"

-----------------------------------------------------------------------------------------------

D. Create an Nsight project on the local machine (Mac)
     1. Open Nsight:
         YourMac > Developer/NVIDIA/CUDA-6.5/libnsight/nsight.app

     2. Select a workspace


     3. Create a new project
         File > New > CUDA C/C++ Project

     4. Project type: Empty Project

     5. Basic settings:

     6. Target system > Manage... > Remote Connections

     7. Configure CUDA Toolkit

     8. Target System(Project Path)

     9. Remove the Local System

    Result:

     10. Done:

-----------------------------------------------------------------------------------------------

E. Adding the source code for the CUDA project
     1. File > New > Source File

     2. Configure the source file (just use a template)

     3. Done:

     4. Press "Build"
        p.s. Note that the build command is /usr/local/cuda-6.5/bin/nvcc,
              which is the compiler on the Jetson, not on the MacBook.

-----------------------------------------------------------------------------------------------

F. Debugging the code in the CUDA project
     1. Debug Configurations

     2. Debugging ...
         p.s. By default the debugger will break in main() – this can be switched off
                in the debug configuration.