
High Performance GPU Computing with Ruby


Formal Metadata

Title
High Performance GPU Computing with Ruby
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
The ArrayFire gem, a general-purpose GPU computing library, can be used for high performance computing in Ruby, be it statistical analysis of big data, image processing, linear algebra, or machine learning. ArrayFire offers outstanding performance compared to existing Ruby libraries that run on the CPU. The ArrayFire gem can also run on clusters and handle real-world problems by crunching huge datasets. The ease of using the ArrayFire gem makes Ruby a viable choice for high performance scientific computing.
Transcript: English (auto-generated)
Hi, I am Prasun Anand, and I am here to talk about high performance GPU computing with Ruby. I'm really glad to be here, and I thank the RubyConf organizers for having me. Very few people realize that even the modest computers of today have very powerful GPUs that can be used in parallel with the CPU, or in serial with the CPU, to deliver really impressive performance.
So, in this talk I would like to talk about two Ruby gems that I have created in the last year. One of them is the ArrayFire gem, and the other is the RbCuda gem. These libraries help you accelerate your code for number crunching or scientific programming, and gain performance improvements just by adding a few lines of code, maybe four or five per operation.
Before we delve into the topic, let me introduce myself. I am a SciRuby contributor; SciRuby stands for the Ruby Science Foundation, and what we do is create Ruby gems for scientific computing. I worked as a Google Summer of Code student for the Ruby Science Foundation in 2016 and 2017. Currently I'm associated with the GeneNetwork project, where we create tools for high performance genome scans on clusters, on GPUs such as Teslas, and even on Intel Xeon Phis. Recently I was awarded Ruby Grant 2017 by the Ruby Association to work on the RbCuda gem. These are the projects I have worked on: first, the JRuby port of NMatrix, a linear algebra library, which I ported to JRuby; then the ArrayFire gem, which I created for GSoC 2017; and currently I'm working on RbCuda.
On scientific computing: Ruby has been around for 25 years, but people still don't prefer it as a go-to tool for scientific computing, for solving these problems, or for number crunching. So in the last few years, the Ruby Science Foundation and others have created gems for scientific computing, where we handle very large sets of data for data analysis, machine learning, and so on. Currently, SciRuby has gems like NMatrix, Daru, and Nyaplot: NMatrix is for linear algebra, Daru is for data analysis, much like Pandas in Python, and Nyaplot is a plotting library. Also, since Python has a head start, we use Python for certain problems we can't yet solve directly in Ruby, so we have gems like pycall.rb that let you call Python from your Ruby code.
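As a rough illustration, assuming the pycall gem and a Python installation with NumPy are available (this snippet is illustrative and not taken from the talk's slides), calling into NumPy from Ruby looks something like this:

    require 'pycall/import'
    include PyCall::Import

    # Import NumPy into the Ruby process and use it directly.
    pyimport :numpy, as: :np

    a = np.array([[1, 2], [3, 4]])
    puts np.linalg.det(a)   # determinant computed by NumPy, called from Ruby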
Now, arrays and matrices. For any scientific problem, the data you have boils down, at the core, to an array or a matrix, and these arrays and matrices are huge. For example, with real-world data you can easily end up with a matrix of 5,000 rows and 5,000 columns, and that is on the smaller side. To handle such large arrays and matrices you need specialized libraries; for example, NMatrix helps you handle matrices on the CPU. These linear algebra libraries must be memory efficient and fast: you need fast loops to iterate through all the elements of a matrix or array, and you need to use memory carefully, because RAM is limited and you don't want to run out of it. For this we have the BLAS and LAPACK libraries. BLAS and LAPACK are Fortran libraries that help you do matrix computation by harnessing the multicore support of CPUs; whenever you need to do scientific computing or number crunching in C, you use BLAS and LAPACK, or the Eigen library, or Intel MKL. Since BLAS and LAPACK are Fortran libraries, we have C bindings for them, and NMatrix calls these C bindings for linear algebra. Similarly, Numo is another package that provides NArray, which does much the same thing; NMatrix and Numo::NArray provide almost the same functionality.
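As a quick sketch of these CPU-side libraries (my own example, assuming the nmatrix and numo-narray gems are installed; the APIs are written from memory), the same small matrix in both looks roughly like this:

    require 'nmatrix'
    require 'numo/narray'

    # NMatrix: a 2x2 double-precision matrix and a BLAS-backed product
    a = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)
    puts a.dot(a)

    # Numo::NArray: the same data and product with Numo
    x = Numo::DFloat[[1, 2], [3, 4]]
    puts x.dot(x)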
So now let's move on to GPU computing, because GPU computing is not easy. For a beginner trying to do GPU computing in C, you need to handle pointers yourself. Currently, GPU computing is done by writing kernel code, C-like code that lives in a .cu or .cl file. You compile these kernel files, inject that code into the GPU hardware, and then you have to manage the pointers you have created and perform operations on them. CUDA and OpenCL are the two platforms we currently use for GPU computing. CUDA is limited to NVIDIA hardware; it's a proprietary solution and can't run on GPUs from other vendors like AMD or Intel. OpenCL, which stands for Open Computing Language, runs across all GPU hardware regardless of the vendor. In my experience, though, CUDA has better performance than OpenCL, even on NVIDIA GPUs.
So here comes ArrayFire. ArrayFire is a C library used for general purpose GPU computing. It is an abstraction over an array: you create an AfArray that lives on the GPU device, and you don't have to bother about what kind of hardware you are using, whether the GPU is from NVIDIA, AMD, or Intel, or whether CUDA or OpenCL is better suited to your needs; it just tries to give you the best performance. ArrayFire also supports a CPU backend, so in case you don't have access to a decent GPU on your machine, the same code will automatically run on the CPU.
ArrayFire also has wrappers in Python, Go, Julia, and Rust, and what I did was create the Ruby wrapper for ArrayFire, which really makes our work easy. This is how you create an AfArray. An AfArray stores your array, which can have up to four dimensions. In this slide I show how to create a two-dimensional AfArray: the highlighted syntax shows that A is an AfArray with a dimension count of two; the next argument, [2, 2], is the size of the array, meaning two rows and two columns; and then come the elements, 1, 2, 3, 4. When you create a matrix, an AfArray, with this code, you get the elements shown below, in column-major format: one, then two, then three, then four. Next we add the array A to itself and store the result in B; the code on the slide shows it. It's pretty easy: 1, 2, 3, 4 added to itself gives you 2, 4, 6, 8.
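Reconstructed from that description (the class and constructor names here are a rough sketch and may differ slightly from the released arrayfire gem), the two slides boil down to something like:

    require 'arrayfire'

    # A two-dimensional AfArray: dimension count, shape [rows, cols],
    # and the elements in column-major order.
    a = ArrayFire::Af_Array.new(2, [2, 2], [1.0, 2.0, 3.0, 4.0])

    # Element-wise addition happens on the device.
    b = a + a    # elements become 2.0, 4.0, 6.0, 8.0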
Next is matrix multiplication. If you are familiar with data science, matrix multiplication is what we use most of the time in number-crunching code. Here I created two arrays called left and right; one has dimensions 3x3, while the other is 3x2, and the matrix multiplication itself is as simple as one call. As for how we implemented it: I create an AF struct type called AfArray, and in the next highlighted line of code I cast the values coming from the Ruby VM into the C double data type. Then I create an AfArray using the af_create_array API provided by ArrayFire, which copies the host array data onto the GPU. In GPU computing you can't access your data directly: you first create an array on the host device, the CPU, then copy that array from the CPU to the GPU, then on the GPU you run the kernel code that works on that array, and when you have the final result you copy the data back from the GPU to the CPU. With ArrayFire you don't have to worry about any of that, because it abstracts it away and makes it as simple as creating an AfArray. In the next example I take that pointer and do a matrix multiplication with it: in the first highlighted line we have the AF struct for left, we also create an AF struct for the result and allocate device memory for it, and then we call the af_matmul API, which takes the device pointers of left and right, does the multiplication, and stores it into the result.
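At the Ruby level that whole host-to-device dance is hidden; a sketch of the usage described above (the module and method names are my reconstruction of the gem's API rather than verified signatures) looks like:

    require 'arrayfire'

    # Host data is copied to the GPU when the AfArray is created
    # (af_create_array in the C extension).
    left  = ArrayFire::Af_Array.new(2, [3, 3], (1..9).map(&:to_f))
    right = ArrayFire::Af_Array.new(2, [3, 2], (1..6).map(&:to_f))

    # af_matmul runs on the device pointers and stores the product
    # in a freshly allocated device array.
    result = ArrayFire::BLAS.matmul(left, right)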
These are the BLAS and LAPACK functionalities. The BLAS functionalities are matrix multiplication and transpose, whereas the LAPACK functionalities are determinant calculation, inverse calculation, the Frobenius norm, and then QR factorization, Cholesky factorization, SVD, and LU (lower-upper) factorization. ArrayFire also provides APIs for calculating the mean, median, or variance along different dimensions of your matrix, provided by af_mean, af_median, and af_variance.
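To make that concrete, here is a hypothetical snippet in the same style as above; the module and method names are illustrative guesses and have not been checked against the released gem:

    require 'arrayfire'

    m = ArrayFire::Af_Array.new(2, [3, 3], (1..9).map(&:to_f))

    # LAPACK-style routines (namespacing assumed for illustration)
    det = ArrayFire::LAPACK.det(m)      # determinant
    lu  = ArrayFire::LAPACK.lu(m)       # lower-upper factorization

    # Statistics along a dimension (0 = down the columns)
    col_means = ArrayFire::Statistics.mean(m, 0)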
Next, let's come to the benchmarks: how does this actually give you high performance, how does it accelerate your code? I ran the benchmarks on an AMD FX-8350 processor and an NVIDIA GTX 750 Ti GPU, which is of the Maxwell architecture; the most recent one is Pascal, but it's still a decent GPU. We used the double dtype and the CUDA backend. In these graphs, the x-axis shows the number of elements in the matrix and the y-axis shows the computation time the operation took, so the lower the computation time, the better the performance.
We are comparing NMatrix-LAPACK on Ruby, NMatrix on JRuby, and ArrayFire. NMatrix-JRuby is the JRuby port of NMatrix that I created, and NMatrix-Ruby-LAPACK uses LAPACK for the matrix calculations. For the matrix determinant, NMatrix-LAPACK on Ruby takes around 12 seconds, whereas ArrayFire takes around two seconds, so we have an improvement of around 10x: ArrayFire is roughly ten times faster than NMatrix-LAPACK. So we did a nice job there. The same goes for matrix LU factorization: once you have an LU factorization, the next step is that you can calculate the determinant from the diagonal elements, so this benchmark looks exactly the same as the determinant one. For matrix addition, NMatrix-Ruby takes around six seconds, whereas ArrayFire takes around 0.0004 seconds, about 400 microseconds.
That's a performance improvement of roughly 10,000x. Matrix subtraction behaves the same as matrix addition, because both are element-wise operations; instead of adding the two elements I'm just subtracting them, so the figures are exactly the same.
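For anyone who wants to reproduce this kind of element-wise comparison, a minimal timing sketch (assuming the nmatrix and arrayfire gems from above; this is not the exact benchmark script used for the talk) would be:

    require 'benchmark'
    require 'nmatrix'
    require 'arrayfire'

    n        = 5_000
    elements = Array.new(n * n) { rand }

    cpu = NMatrix.new([n, n], elements, dtype: :float64)
    gpu = ArrayFire::Af_Array.new(2, [n, n], elements)

    # Element-wise addition on the CPU (NMatrix) versus the GPU (ArrayFire)
    puts Benchmark.realtime { cpu + cpu }
    puts Benchmark.realtime { gpu + gpu }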
Now comes matrix multiplication. At the crux of any scientific computing code is matrix multiplication; we call it a lot. NMatrix gives you two ways to call the BLAS routine for matrix multiplication: you can use NMatrix-BLAS-Ruby or plain NMatrix-Ruby. NMatrix-BLAS-Ruby is faster because it uses the Fortran BLAS, whereas plain NMatrix-Ruby runs C code. Here NMatrix-BLAS-Ruby takes around 31 seconds, whereas ArrayFire takes around 0.0006 seconds, roughly 620 microseconds, so the performance improvement is about 100,000x.
So, coming back to the big picture: when you use ArrayFire, you don't have to worry about what GPU hardware you are using. You just write your code without worrying about whether it will run on a CUDA platform or an OpenCL platform, on an NVIDIA GPU or an AMD GPU; it tries to give you the best performance out of the box, and you can also tune it yourself.

Next, since NVIDIA devices tend to have the best GPU performance, and since ArrayFire is an abstraction, I tried to create something even closer to the GPU hardware. For that I started another project called RbCuda, which runs only on NVIDIA devices. With ArrayFire life was easy, because we didn't have to worry about transferring data from the CPU to the GPU or vice versa. Here we need to handle everything ourselves: how the GPU array pointer is created, how the data is copied from the CPU to the GPU, and making sure the pointer isn't garbage collected from under us. So we created a generic pointer, a void pointer, that just stores the device array location in the Ruby VM, and then you copy memory from the CPU to the GPU. This has been interfaced with NMatrix and NArray: you add one line of code, a call such as to_gpu on an NMatrix, and you get a GPU pointer back. Similarly, we can do it with NArray, but that is still under development.
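In code, the interface described here would look something like the following; this is a hypothetical sketch of the described workflow, and the actual RbCuda method names may differ since the gem is still being developed:

    require 'nmatrix'
    require 'rbcuda'

    host = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)

    # One extra line: copy the matrix to the device and get back a generic
    # (void *) handle to the device array, as described above.
    device_ptr = host.to_gpu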
This is an example of CUDA kernel code. Once you have written your program and you think you can squeeze out more optimizations, you might want to run your own custom kernel code on the GPU. RbCuda helps you do that: you can't run custom kernels on the GPU directly through ArrayFire, but with RbCuda we have created a bridge that lets you run your custom kernel on the GPU. This is what a kernel looks like: blockIdx.x refers to an element within the blocks, and when we call it with two arrays, int *a and int *b, we add the two arrays element-wise and store the result in c. What makes RbCuda different from running these CUDA kernels directly from C is that you can run the kernel code online: you are sitting in the Ruby REPL and you can just inject this kernel code. What I do is take the kernel source, store it in a temp file, and compile it with the NVIDIA CUDA compiler; as a result I get a .ptx file, which can then be run on an NVIDIA GPU. The code that does this is admittedly hard to read, but it works.
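A rough reconstruction of that workflow in plain Ruby follows; the kernel is paraphrased from the slide, compiling to PTX only needs the CUDA toolkit's nvcc on the PATH, and loading and launching the PTX through RbCuda's driver-API bindings is not shown here:

    require 'tempfile'

    # Element-wise addition kernel, as sketched on the slide.
    KERNEL_SRC = <<~SRC
      extern "C" __global__ void vector_add(const int *a, const int *b, int *c)
      {
        int i = blockIdx.x;   // one block per element in this simple example
        c[i] = a[i] + b[i];
      }
    SRC

    # Write the kernel to a temporary .cu file and compile it to PTX with nvcc,
    # which is roughly what RbCuda automates behind the scenes.
    Tempfile.create(['vector_add', '.cu']) do |file|
      file.write(KERNEL_SRC)
      file.flush
      system('nvcc', '-ptx', file.path, '-o', 'vector_add.ptx') ||
        abort('nvcc failed -- is the CUDA toolkit installed?')
    end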
Running a custom kernel from Ruby was already possible with another gem, sgc-ruby-cuda, but what it lacked was support for the surrounding libraries: it had no support for cuBLAS, cuSOLVER, or cuRAND. In RbCuda we will have support for all of these, which means ready-made routines for BLAS and LAPACK, so you can do matrix multiplication and even matrix decompositions, and you can generate random numbers using the Mersenne Twister and other random engines. Now for the RbCuda benchmarks: again, they were run on the AMD FX-8350 octa-core processor, the GTX 750 Ti GPU, and the double dtype. For matrix multiplication you can see that the lowest line, RbCuda, is even faster: NMatrix-BLAS-Ruby takes around 31 seconds, ArrayFire takes around 0.0006 seconds, whereas RbCuda takes around 0.00004 seconds. So we have a performance improvement of roughly a million times.
Now for future work: ArrayFire, being a GPGPU library, a general purpose GPU computing library, provides ready-made routines for image processing, and it also helps you write classifiers and so on for machine learning, so I'll be working on exposing those APIs, and on indexers as well. Currently only the double data type is supported, so in the future we are going to add support for complex numbers, floats, et cetera. RbCuda is under active development and is currently funded by the Ruby Association.
Contributions are welcome. You can check out these repos, and the benchmark code can be found on GitHub at github.com/prasunanand, in the ArrayFire-Rb benchmarks repository, so you can try it on your own machine. Since I ran these benchmarks on the Maxwell architecture, the 750 Ti GPU, you can expect even better results on Pascal GPUs, the NVIDIA 1050 series, perhaps up to ten times more.
Now, the acknowledgements. I would like to thank my Google Summer of Code mentor, Pjotr Prins; he's involved with the BioRuby project and other projects in D and Scala. And Pradeep Garigipati, who is a core contributor to ArrayFire. I'd also like to thank SciRuby, Google Summer of Code, and the Ruby Association for helping me continue my work in the field of open source. Thank you.