

Optimize the code to use on-chip memory. You can improve performance significantly by buffering frequently accessed data in fast on-chip shared memory (be sure to run your code without the -l flag to ensure that it favors shared memory over cache). Modify the code to use shared memory, but try other performance programming techniques to improve performance still further. Ensure that your data accesses are coalescing. Your performance goal is 360-460 Gigaflops/second on matrices of size 256x256, 512x512, 1024x1024, and 2048x2048 (double precision). To realize your performance goals, explore the optimizations in this SC '08 paper by Volkov and Demmel: Benchmarking GPUs to tune dense linear algebra. Since the implementation discussed in the paper was designed for a pre-Kepler device, the required optimizations (or their details, i.e. tuning parameters) may differ for Kepler. Another good resource is Volkov's presentation at GTC 2010, which describes increasing parallelism without increasing thread count.

To build confidence in the correctness of your code, test it against different combinations of matrix size n and thread block geometry. The provided verification code, which resides in genMatrix.cpp, will enable you to establish correctness. This code verifies a matrix product that can be expressed in closed form and depends on a particular matrix set up in the matrix initialization routine (genMatrix()).

See Grading below for the specific experiments we want you to perform and report on. *** Your report must follow the format given here so we can easily find the different sections. Reports that require extensive time for us to read will be penalized. Document your work in a well-written, 5-10 page (including figures) report. *** (Up to -10 pts if the report is not organized well.) The report presents your results, analyzes them, and offers insights as to why things behaved the way they did. Describe how your program works (pseudo code is fine; do not include all of your code in the write-up). If you improved your code in a sequence of steps, document the process. What was your development process? What ideas did you try during development? What ideas worked well, what didn't work well, and why? Feel free to plot or chart results from experiments that did or did not end up in your final implementation and, as possible, provide evidence to support your theories.

For the problem sizes n=256, 512, and 1024, plot the performance of your code for a few (at least 3) different block sizes. If your code has limitations on block size, please state the reason for those limitations. Your report should explain the choice of optimal block sizes. For at least twenty values of n, plot your performance using the best block size you determined for n=1024 in step (2). For n=256, 512, 1024, and 2048, compare your best result with the naive implementation. Why are some sizes or geometries higher performance than others? Compare your results to the multi-core BLAS results in the table below; use at least the values in the table, but add other values too (for the values we gave you in the table; for other values, just report your CUDA numbers). Explain how the shape of your curve is similar to or different from the BLAS values and theorize as to why that might be. For the twenty or so values of performance, identify and explain unusual dips or irregularities in performance. You may refer to either this plot or the plot of speedup (5). Please include the following table in this format in your report, as well as the graph mentioned above (you must have these n values, but you should have more (20)).

The starter code is provided via a github repository. The starter code may be obtained at this link:

The naive version will perform very poorly. It is a variation of the naive code from Hwu and Kirk. Because it makes all accesses through global memory, it is memory bandwidth bound, as discussed in lecture. Your implementation will run significantly faster. The provided code reports various information about your device (see utils.cu:ReportDevice()).
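The shared-memory buffering and coalescing described above can be sketched as a standard tiled kernel. This is a minimal illustration, not the starter code: the kernel name, TILE width, and row-major double layout are assumptions, and it requires n to be a multiple of TILE (true for the assignment's power-of-two sizes).

```cuda
// Sketch of a shared-memory tiled multiply, C = A * B (n x n, row-major doubles).
#define TILE 16

__global__ void matMulTiled(int n, const double *A, const double *B, double *C) {
    __shared__ double As[TILE][TILE];   // tile of A staged in on-chip memory
    __shared__ double Bs[TILE][TILE];   // tile of B staged in on-chip memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    double sum = 0.0;

    // March the pair of tiles across the k dimension.
    for (int t = 0; t < n / TILE; t++) {
        // Adjacent threads (threadIdx.x) read adjacent addresses: coalesced.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                 // tiles fully loaded before use

        for (int k = 0; k < TILE; k++)   // inner product over the staged tile
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done with tiles before the next load
    }
    C[row * n + col] = sum;
}

// Launch sketch (names of the device pointers are illustrative):
//   dim3 block(TILE, TILE);
//   dim3 grid(n / TILE, n / TILE);
//   matMulTiled<<<grid, block>>>(n, dA, dB, dC);
```

Each element of A and B is loaded from global memory once per tile pass instead of once per multiply-add, which is exactly the reuse the shared-memory guidance above is after; the Volkov-style refinements (more work per thread, fewer threads) build on this skeleton.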
