
Using Ruby in data science


Formal Metadata

Title
Using Ruby in data science
Title of Series
Number of Parts
69
Author
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
I will talk about the current situation and the future of Ruby in the field of data science. Currently, Ruby can be used practically in data science. In the first half of this talk, I will perform some demonstrations to prove this. You will see that pandas, matplotlib, scikit-learn, and several deep learning frameworks are available from Ruby scripts in these demos. In the future, in order for Ruby to continue to be used in data science, we need to continue some efforts. In the latter half of this talk, we will introduce Red Data Tools project that plays an important role in this context.
Transcript: English(auto-generated)
Hello, everyone, thank you for coming. I'd like to talk about using Ruby in data science.
So how many people here are involved in data science? Hmm, not many, okay. So how many of you want to use Ruby for data science? Oh, so many people want to use Ruby for data science.
Me too. Even for us working with Ruby, the opportunities to be involved in data science will increase more and more. But now Ruby hinders us from becoming familiar
with data science because Ruby is difficult to use in data science. However, that was true only until recently. The situation is changing. Currently, Ruby is getting easier to use
in data science little by little. Did you know? No? No one knew it. But don't worry, no problem, I'll describe it in this talk.
So today I'll talk about how we can use Ruby in data science. Before that, let me introduce myself. My name is Kenta Murata. My handle name is mrkn. Please call me Kenta or by my nickname, Muraken.
I'm working at Speee as a full-time CRuby committer. So this is my company's logo. The company name Speee literally reads as the successor of "speed". It means that it's faster than what "speed" means.
In other words, the company iterates its business cycles at overwhelming speed. As I mentioned in the previous slide, my company employs me as a full-time CRuby committer.
I'm permitted to do any great things for the Ruby ecosystem. This year, I'm mostly working on making tools for data science that are used with applications written in Ruby.
My talk consists of these topics. First, I'll talk about the current situation of Ruby in data science. Next, I'll show you the patterns for using Ruby in data science. Then I'll explain my perspective on the future of Ruby in data science.
Finally, I'll conclude my presentation. So let's start the main topics. The first topic is the current situation. Now, there are three major projects for data science in Ruby. This diagram shows the relationships between them.
So let me describe these projects in more detail. The first is SciRuby. Do you know SciRuby? Not many. I think that SciRuby is the most famous project
for people outside of Japan. SciRuby is a set of many gem libraries that use NMatrix as their in-memory tensor data. It has many gem libraries, but many of them are dead
because their development has stopped. You can check all the gems under SciRuby on this webpage. Let's check it.
So, as you can see, there are a lot of libraries listed on this page, but unfortunately, the ones in red and yellow are not usable for now. So half of them are marked dead.
I think there are three benefits when you use SciRuby gems. The first benefit is that you only need Ruby. So, in other words, you don't need to prepare another programming language like Python.
The second benefit is that you can use sparse matrices with NMatrix, but the current implementation doesn't support linear algebraic operations such as PCA for sparse matrices. So you cannot use NMatrix for generic NLP,
natural language processing, tasks. And the last benefit is that you can use data frames with Daru. Do you know data frames? A data frame is a basic data structure
for manipulating and visualizing living data in data science. It is a two-dimensional data structure like a SQL table. In Ruby, Daru provides data frames, so we can use data frames with Daru.
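To make the "two-dimensional data structure like a SQL table" idea concrete, here is a minimal sketch in plain Ruby. It is not Daru's API: the class name TinyFrame, its methods, and the sample values are all hypothetical and exist only for illustration.

```ruby
# A toy column-oriented table: column name => array of values.
# TinyFrame is a hypothetical class for illustration, not part of Daru.
class TinyFrame
  attr_reader :columns

  def initialize(columns)
    lengths = columns.values.map(&:size).uniq
    raise ArgumentError, 'columns must have equal length' if lengths.size > 1

    @columns = columns
  end

  # Number of rows (0 when there are no columns).
  def row_count
    @columns.values.first&.size || 0
  end

  # Rows as hashes, one hash per row, like records of a SQL table.
  def rows
    Array.new(row_count) { |i| @columns.transform_values { |col| col[i] } }
  end

  # Keeps the rows for which the block is true, like a SQL WHERE clause.
  def where
    kept = rows.select { |row| yield row }
    TinyFrame.new(@columns.keys.map { |name| [name, kept.map { |r| r[name] }] }.to_h)
  end

  # Arithmetic mean of a numeric column.
  def mean(name)
    col = @columns.fetch(name)
    col.sum.to_f / col.size
  end
end

# Illustrative values only, not measurements from the talk.
frame = TinyFrame.new(
  method:  %w[inject sum inject sum],
  runtime: [0.004, 0.001, 0.006, 0.002]
)
sums = frame.where { |row| row[:method] == 'sum' }
puts sums.rows.inspect
puts sums.mean(:runtime)
```

A real data frame library adds indexing, missing-value handling, and plotting on top of this basic shape.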
As I'll describe later, you can also use data frames by using PyCall with a Python environment, but when you use Daru, you don't need to use Python together. SciRuby is usable, but it also has several drawbacks.
I think there are three important drawbacks in SciRuby gems. The first drawback is that NMatrix is extremely slow. So let me show you this problem in a demonstration.
Can you read the text? First, to show how NMatrix is slow,
the following code compares three different kinds of summation methods with 1,000-element arrays. One method is NMatrix's sum. The second is Array#sum, introduced in Ruby 2.4. And the last is Array#inject with the plus symbol argument.
At first, I should require the dependencies. And this code measures the run time of each method.
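The demo code itself is not in the transcript. A minimal sketch of this kind of measurement, using only Ruby's standard library, might look like the following. NMatrix's sum is left out so that the sketch runs without extra gems, and the repetition count and the helper name time_sum_methods are assumptions.

```ruby
require 'benchmark'

# Summation methods from the demo that need only plain Ruby.
# NMatrix's sum is omitted here because it requires the nmatrix gem.
SUM_METHODS = {
  'Array#sum'        => ->(ary) { ary.sum },
  'Array#inject(:+)' => ->(ary) { ary.inject(:+) }
}.freeze

# Returns { method name => elapsed seconds } for `repeat` calls on `ary`.
# The default repetition count is an assumption, not taken from the talk.
def time_sum_methods(ary, repeat: 1_000)
  SUM_METHODS.transform_values do |fn|
    Benchmark.realtime { repeat.times { fn.call(ary) } }
  end
end

values = Array.new(1_000) { |i| i + 1 } # 1,000 elements, as in the demo
time_sum_methods(values).each do |name, seconds|
  puts format('%-18s %.6f s', name, seconds)
end
```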
As you can see, NMatrix's resulting run time is very large. So, drawing a plot to visualize the result
by using the rbplot library, NMatrix's sum
consumes this amount of time. But the others cannot be seen, because NMatrix's run time is so large.
As you can see in this chart, NMatrix's sum is tremendously slow. This bug is filed as issue number 362. So if you want to investigate and fix it, please check this issue and please send your pull request.
Back to the slide. The second drawback is about Daru. Daru is usable for basic data manipulation,
but it lacks functions that pandas supports for practical data science tasks. So I strongly recommend you use Python or the R language if you need to do data mining in your business.
And the last drawback is that SciRuby is poorly documented. So it is hard to use when you are a beginner with SciRuby. The reason why there are these three drawbacks is the small population of developers and users,
so SciRuby always welcomes your contributions. The second project is Ruby/Numo. The founder of this project is Masahiro Tanaka. He is the developer of the old original NArray.
It is the first in-memory tensor library for Ruby. Almost all Rubyists who needed to manipulate tensor data in Ruby scripts used the original NArray in the ancient period. In 2016, Masahiro Tanaka started the Ruby/Numo project
to rewrite the old NArray to support the latest Ruby and external libraries like OpenBLAS, and to realize his new ideas for the implementation.
Like SciRuby, you only need Ruby, ah, I'm sorry. Ruby/Numo has some benefits and drawbacks. Like SciRuby, you only need Ruby to use Ruby/Numo.
And Numo::NArray is faster than NMatrix and pure Ruby, so in my opinion, it is the best library for manipulating in-memory numerical tensor data on CPUs. But Ruby/Numo does not support sparse matrices
and data frames. It means it is hard to use Ruby/Numo for NLP and data science tasks. And Ruby/Numo is as poorly documented as SciRuby is. If you're interested in Ruby/Numo and want to know the details of Ruby/Numo,
you can access the English slides and the video of the Japanese talk at RubyKaigi 2017 at this URL. Please check them. I think you may want to know which of SciRuby or Ruby/Numo is better.
The answer is case by case, I think. If you want to use Ruby for data science without any other languages, SciRuby is better because you need to use data frames, and SciRuby has Daru.
If you want to do just scientific computing, such as numerical simulations, or try to implement your own machine learning algorithms, Ruby/Numo is better than SciRuby because NArray is faster than NMatrix.
And the third project is Red Data Tools. This is based on Apache Arrow and its Ruby binding, Red Arrow. Red Data Tools is a very young project.
It started this February. But as of today, it has these five gems. The biggest benefit of Red Data Tools is that you can try to use Apache Arrow in Ruby. Additionally, the core developer of Red Data Tools,
Kouhei Sutou, is a member of Apache Arrow's Project Management Committee. And this means you can continue to use Apache Arrow in Ruby in the future too. But there are two drawbacks.
The first is that the gems of Red Data Tools are too young to use in production. So you should have a strong determination to employ them for your business products. The second drawback is that Apache Arrow is currently just a data format for in-memory and streaming IO.
So you cannot use it for manipulating data. Now you can only load, save, and convert data, but those are much faster than other ways. Apache Arrow has a plan to implement
data manipulation APIs, so this drawback will be resolved over time. As you can see so far, it is hard to do data science with Ruby alone. And almost all data scientists wouldn't want
to use Ruby in their jobs because they need the great power of standard data tools in Python and R, such as pandas and Spark, especially in exploratory data analysis. The exploratory data analysis phase is the most important
for data mining and machine learning. The existing tools in Ruby aren't very suitable for such use. But as you know, Ruby and Ruby on Rails are best for writing business web applications. So you should use Ruby and another language
like Python together. How to do it? I made PyCall for such use cases. So what is PyCall? Using PyCall, you can use Python libraries
from your Ruby code very naturally. PyCall consists of two parts. One part is the Ruby binding library of libpython. It is the core of the Python interpreter. The other part is a gateway between Ruby and Python
that translates their object systems. With the first part, PyCall gives us access to the functions of the Python interpreter. And with the second part, PyCall realizes a natural feeling for us Rubyists in the use of Python functions.
Let's look at a simple example of using PyCall.

I'm sorry, the font is too small. I'll skip this slide.

Don't worry, I prepared a demonstration for a later part. Currently, I have made wrappers for NumPy and matplotlib. But I also want to make wrappers for scikit-learn, seaborn, Bokeh, Keras, and so on.
I need help to increase the supported Python libraries. If you are interested in writing such wrappers, please write your own wrapper, publish it on GitHub, and tell me.
By the way, PyCall also provides features to use Python libraries without writing wrapper libraries. So you can just load Python libraries as modules and use the modules. Let me show you a demonstration of PyCall.
In this demonstration, we collect the results of a benchmark in a pandas data frame and visualize them with the seaborn library. At first, we need to prepare the requirements.
In this demonstration, I use NMatrix, Numo::NArray, pandas, and matplotlib.
And this is the benchmark code. In this benchmark code, I use 10,000-element arrays and calculate the summation 100 times for each method.
The first method is Array#inject. The second is a while expression. The third is Array#sum. The next is Enumerable#sum.
Then comes NMatrix's sum, and the last is Numo::NArray's sum. Okay? Let's go.
This code measures the runtime of each method with Benchmark.realtime and stores it directly into the pandas data frame.
So here I get the results in a pandas data frame. This is the result. Using the groupby method calculates statistics,
the summaries for each method. As you can see, NMatrix is the slowest,
and the second is inject. And then I visualize the data frame with seaborn's function. Using the PyCall.import_module method,
we can import Python libraries as modules. So this is the result. NMatrix's result is very large,
so we cannot see the other methods. I want to show only the others, excluding NMatrix's result, using the data frame's functions. Okay?
This is the bar plot of the results without NMatrix's. So the fastest method is Array#sum, and the second is NArray. This demonstration is finished.
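The notebook code is not in the transcript. As a rough plain-Ruby analog of this demo, the sketch below records one row per measurement and then groups the rows by method, which is what the pandas data frame and its groupby step do in the talk. It leaves out NMatrix, Numo::NArray, pandas, and seaborn so that it runs with the standard library alone. Treating "100 times" as 100 separate measurements, reaching Enumerable#sum through an enumerator, and the helper names are all assumptions.

```ruby
require 'benchmark'

# Summation with an explicit while loop.
def sum_with_while(ary)
  total = 0
  i = 0
  while i < ary.size
    total += ary[i]
    i += 1
  end
  total
end

# The plain-Ruby methods compared in the demo.
# `ary.each.sum` goes through an Enumerator, so it uses Enumerable#sum
# rather than the specialized Array#sum (one possible way to reach it).
BENCH_METHODS = {
  'inject'         => ->(ary) { ary.inject(:+) },
  'while'          => ->(ary) { sum_with_while(ary) },
  'Array#sum'      => ->(ary) { ary.sum },
  'Enumerable#sum' => ->(ary) { ary.each.sum }
}.freeze

# Collects one row per measurement, like the rows of a data frame.
def collect_rows(ary, trials: 100)
  BENCH_METHODS.flat_map do |name, fn|
    Array.new(trials) do
      { method: name, runtime: Benchmark.realtime { fn.call(ary) } }
    end
  end
end

# Groups rows by method and computes summary statistics per group.
def summarize(rows)
  rows.group_by { |row| row[:method] }.transform_values do |group|
    times = group.map { |row| row[:runtime] }
    { count: times.size, mean: times.sum / times.size,
      min: times.min, max: times.max }
  end
end

values = Array.new(10_000) { |i| i } # 10,000 elements, as in the demo
summarize(collect_rows(values)).each do |name, stats|
  puts format('%-15s mean=%.6fs min=%.6fs max=%.6fs',
              name, stats[:mean], stats[:min], stats[:max])
end
```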
So this demonstration is very simple, but there are other resources I used previously
to show PyCall examples. At RubyKaigi 2017, I demonstrated Keras examples and a Rails integration example. Also, at RubyKaigi 2017, I did a Ruby data workshop.
These resources and materials are available on GitHub. I've already uploaded these slides to my Speaker Deck, speakerdeck.com/mrkn. So you can access these URLs from the slides.
Please check it. And there are other example uses of PyCall. There are two blog posts about scikit-learn examples by Soren D.
The first article shows simple uses of scikit-learn, and the second is OCR with scikit-learn's random forest classifier. And in kv-ruby conference in this year,
May held the workshop about PyCall. You can see the workshop materials at this link. Please check them.
So as you can see in the demo, PyCall gives us access to the functions of Python data tools. So you can use all the following tools from your Ruby code. They are all standard tools in data science.
So far we've learned about the benefits and drawbacks of SciRuby, Ruby/Numo, and Red Data Tools. And we've learned what we can do with PyCall.
From here, I want to show you the current best patterns for using Ruby in data science. As I mentioned before, you should use Ruby and another language like Python together, because almost all data scientists wouldn't want to use Ruby in their jobs.
And they need the great power of standard data tools like pandas to do exploratory data analysis. Its most important task is to find valuable knowledge from the living data of a business. And also, we want to use Ruby and Ruby on Rails
for writing our business applications because it is best for us. So I propose three implementation patterns to integrate an application written in Ruby and a data processing system written in Python.
The first pattern is referring to the same database directly from both systems. This is very easy to implement, but changes in the application may affect the data processing side, especially data schema changes.
The second pattern is calling the functions of the data processing side from the application side. To implement this pattern, we need to serialize data to pass it from the application to the data processing system. So a large serialization cost can be incurred.
The last pattern is using PyCall to call the functions of the data processing system. We can write the driver code in Ruby, so we can share the Active Record models between the application and the data processing system.
Using PyCall, we can build pandas data frames directly from Ruby. So there is no serialization cost in this pattern. We need to choose the right way according to the situation.
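To show where the serialization cost of the second pattern comes from, here is a minimal sketch using only Ruby's standard library. JSON is just one possible wire format, chosen as an assumption, and the record fields and helper names are hypothetical. In a real system the receiving side would be the Python data processing system; it is shown in Ruby here so the round trip can be checked.

```ruby
require 'benchmark'
require 'json'

# Hypothetical application records; the field names are illustrative only.
def sample_records(count)
  Array.new(count) do |i|
    { 'id' => i, 'price' => i * 1.5, 'category' => "c#{i % 3}" }
  end
end

# Pattern 2: the application serializes its records before handing them over.
def serialize_for_processing(records)
  JSON.generate(records)
end

# What the data processing side has to do on receipt.
def deserialize(payload)
  JSON.parse(payload)
end

records = sample_records(10_000)
payload = nil
elapsed = Benchmark.realtime { payload = serialize_for_processing(records) }

puts "payload size: #{payload.bytesize} bytes"
puts format('serialization took %.6f s', elapsed)
puts "round trip ok: #{deserialize(payload) == records}"
```

Both the encoding and the decoding step grow with the amount of data, which is the cost the third pattern avoids by building the data frame in the same process.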
The last part of this talk is the future perspective of Ruby in data science. I want to explain two topics about the future. One is Apache Arrow, and the other is GPGPU and deep learning.
Apache Arrow is an efficient data format and is going to have efficient data manipulation operations. So I think it will be the core of almost all data tools in the future. It's already decided to replace the core part of pandas 2.0 with Apache Arrow,
and PySpark already uses Apache Arrow for exchanging data between Python and Spark. So the Red Data Tools project is important for the future of Ruby's data science system. If you're interested in Apache Arrow, I strongly recommend joining the Red Data Tools project.
You can access the Red Data Tools project at this URL. Next is GPGPU.
About GPGPU, as Prasun Anand presented yesterday, we already have ArrayFire for using the GPU. Moreover, two GPGPU projects were accepted by the Ruby Grant this year.
One is RbCUDA by Prasun. That is a binding of the CUDA runtime libraries. The other Ruby Grant project was proposed by sonots, one of the Ruby committers, who maintains Logger. He will make Cumo. That is a CuPy clone for Numo::NArray.
And for deep learning, we already have tensorflow.rb written by Arafat. Moreover, there are two work-in-progress projects that I know of. Red Chainer is a part of the Red Data Tools project.
It was started by hatappi to rewrite Chainer in Ruby. This project uses Numo::NArray for tensor data, so it will be able to use the GPU through Cumo. And now I'm working on writing a Ruby binding of MXNet.
I think it can be released in a few months. Finally, I'll conclude my presentation. In my presentation, I described three major projects in Ruby for data science:
SciRuby, Ruby/Numo, and Red Data Tools. Next, I demonstrated PyCall with an example of benchmark visualization. Moreover, I illustrated the three patterns to integrate an application written in Ruby and a data processing system written in Python. Finally, I talked about the future perspective
on Apache Arrow, GPGPU, and deep learning. And we prepared a Docker image that contains almost all the things needed to try Ruby's data science ecosystem. So you can run the Jupyter notebook from my demonstration
in a Docker container with this command. Please try it. That's all. Thank you.