Formal Metadata

Title: Apache
Part Number: 9
Number of Parts: 10
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Transcript: English (auto-generated)
So, today we discuss the technologies that allow us to build real big data and machine learning applications.
In the last lecture, we analyzed the definition of MapReduce and how it can be used for different tasks. In addition, I showed you an example implementation of MapReduce, but MapReduce is only one of the technologies, actually only one of the procedures, used in the big data stack. Today we discuss a collection of open source software utilities for big data, for MapReduce, for machine learning implementation and so on.
First of all, we need new technologies because, if you remember, one of the definitions of big data is that the data is so huge that we need new technologies and new algorithms to analyze it, and today's lecture is about these new technologies.
The first reason is the sheer amount of data: a lot of gigabytes, terabytes, petabytes and so on. On the other hand, we need new solutions for data storage and for access to this data.
We also need new algorithms because, if you remember, many machine learning algorithms cannot easily be parallelized, for example decision trees.
That is why we need new algorithms, or new implementations of machine learning algorithms, to analyze big amounts of data. And here you can see examples of where this big amount of data is collected. First of all, there is internet and online data: search results, mobile GPS locations. The next very important sector is healthcare and scientific computation. Then there is graph data: telecommunications, cyber security, computer networks, social network data analysis, the internet of things of course, and financial data too. And how big is this data actually? For example, the Large Hadron Collider produces about 30 petabytes of data per year. Of course, we cannot save all of this data without processing it, because it is simply too much.
In addition, here you can see the data produced by Facebook, YouTube and so on. For these reasons, we should use other technologies, such as data centers, for data collection and data storage.
Here we talk about cloud and distributed computing. In the next lecture, we will also discuss different paradigms, not only cloud computing but others as well. And this is actually the second trend that allows us to use new technologies for big data solutions.
I am talking about Hadoop. Because our data is stored on different computers, we need to organize a stable connection between these computers on one side. On the other side, we need a clear understanding of where exactly each part of the data is located and how it can be connected with the other parts.
Furthermore, our solutions must be dynamically scalable, because we do not know how much data we will receive tomorrow.
And this solution, of course, must be cheap, because every day we have more and more data. That is why the new challenge for big data technology is not only the huge amount of data, but also the specific storage of this data.
This specific storage is organized as distributed computational cloud solutions and so on. In this case, data processing and machine learning models can be used together for data storage and data analysis.
For example, if we talk about machine learning, first of all we talk about big data set processing and analysis. The classical tasks in machine learning, as you know, are classification, regression, clustering, and one more: collaborative filtering.
I will talk about collaborative filtering in the lecture about data preprocessing, where we have noisy data, data with uncertainty, and we would like to reduce this uncertainty.
That is an example of collaborative filtering. On the other hand, we need a strong understanding of how our data can be collected and stored. For this reason, we talk about distributed computing, or actually the implementation of the MapReduce paradigm, where we have a pipeline of methods for data processing.
These methods can be executed in different locations, on different processors and so on. We also talk about a huge number of sources.
These can be relational databases, non-relational databases, flat files and so on. That is why we need a strong understanding of the nature of each source and of how we can extract, transform and load the data.
That is why, for big data, we should talk about data warehouses too, because extract, transform and load (ETL) is one of the most important stages in data processing with data warehouse technologies.
So, from this point of view, in addition to big data, to the huge amount of data and to cloud and distributed solutions, we also need an understanding of our data.
That is why we talk about the traditional extract, transform, load, or traditional data warehouse approach. On the other hand, we need to take care of data storage. For example, we can use a specific data store, HBase.
We will talk about it today. Next, we need tools for processing streaming, multimedia and batch data, because we can analyze data not only from classical storage such as databases, but also from flat files and streams.
Streaming can be used, for example, for audio data analysis, for different recordings, for YouTube streams, or even for the analysis of data from IoT sensors.
We can also process different kinds of data together. In this case, we talk about multimodal data. For example, we process three streams together: the first one is video, the second one is audio, and the last one is text, a classical text file. Based on the intersection of these three streams, we can improve the accuracy and quality of our data analysis. But we need to understand how we can integrate this data, how we can process it, which machine learning models can be used for it, and so on. From this point of view, big data technology is the intersection of all of these elements: big data sets on one side, machine learning models for processing this data on another side, distributed computing for data storage and computation as the next part, and the data processing stage, or extract, transform, load, as the last part. That is why big data technology is the intersection of all of these elements together, and for this intersection we can use a specific set of technologies.
This is the Hadoop ecosystem. First of all, there is Apache Hadoop. This is a collection of open source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. From this definition, we understand that we are talking about cloud and distributed computing on one side, and about a massive, huge amount of data on the other side.
And we are talking about data processing and data preprocessing. This collection, this set of technologies, provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
At the core of Apache Hadoop we have the MapReduce model, with a clear understanding that our pipeline has two very important parts: the map part, the mapping of our data, and the reduce part, the aggregation of the map results. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use.
The core of Apache Hadoop consists of a storage part, which we will talk about, and a processing part, the MapReduce programming model. In addition, we have further possibilities for packaging this data, for processing it, and of course a parallel file system implementation for high speed networking. Based on Hadoop, we can process a huge amount and a huge variety of data types.
This is the fourth V of big data, variety. We can process text data, graphs, streaming data, images and so on. Next lecture, I will show you an implementation in Spark and the usage of one part of this Hadoop ecosystem.
In addition, we have a shared environment, a specific data storage architecture. The Hadoop ecosystem is organized as a layered system consisting of many layers.
At the lowest layer of this diagram we have a specific file system implementation, HDFS. The next layer can be used for data processing, data organization and so on.
And the top layer can be seen as specific databases and specific machine learning models for this data processing. So, when we talk about the Hadoop framework, it is composed of the following modules.
The first one is Hadoop Common: the libraries and utilities needed by the other Hadoop modules. The next is the Hadoop Distributed File System, or HDFS.
This is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. Next, we talk about Hadoop YARN. This is the platform responsible for managing computing resources in the cluster.
Based on YARN, we can find the real location of our data and understand how we can combine this data together. Next, we have Hadoop MapReduce, a programming model for large scale data processing.
And we have Hadoop Ozone, introduced in 2020, which is an object store for Hadoop. Based on these modules, we can use different technologies.
For example, Apache Pig, Apache Hive, Apache HBase and so on. We will discuss all of these technologies together. The Hadoop framework itself is mostly written in the Java programming language.
There is some native code in C, and there are command line utilities written as shell scripts. Although MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user's program.
This is very important. On the other hand, many separate modules in the Hadoop ecosystem are written in Scala. This is a declarative-style language that makes it easy to implement machine learning models.
And there are many ready-made modules for machine learning. In this layer we also have different databases implemented on top of Hadoop. We will talk about these databases too.
Here you can see the modules that I talked about before. First of all, the MapReduce module; the next one is the data module, implemented as Pig scripting and Hive SQL-like queries.
In addition, we can use other modules, which we will talk about later, including modules for data streaming, data analysis and so on.
We start from the first layer, or maybe the last, in any case the bottom layer of the Hadoop ecosystem: the Hadoop Distributed File System. This system was presented in 2004 and is based on the Google File System.
First of all, it serves as the distributed file system for most tools in the Hadoop ecosystem.
This file system is very scalable for big data sets. It means that we can store a lot of data and add more data without changing its structure. This is very important. Here you can see that, for example, a single Hadoop cluster can consist of approximately 5,000 servers and hold around 250 petabytes of data. And that is only one Hadoop cluster.
On the other hand, Hadoop is good for large files and streaming data. If you would like to use Hadoop for small files of maybe a few hundred megabytes, Hadoop is not a good choice, because it has a specific structure, a specifically organized connection between the different parts of the data. I will show this on the next slide. That is why it is very important to understand that Hadoop should be used only for huge amounts of data.
For huge data sources. One such data source can be huge databases, for example NoSQL databases, because SQL databases already have their own good servers for data management.
It can be NoSQL, for example MongoDB with a lot of collections inside, and streaming data: video streams, audio streams and so on. That is why Hadoop should be used only for large files.
First of all, the design of the Hadoop Distributed File System. HDFS has a layered structure too. First of all, we have a master-slave organization. The master is a single name node for managing metadata.
Based on the master, we can recognize which data is located in which part of our cluster. Next, there are the slave nodes: multiple data nodes for storing data. Our data is actually stored directly on the slave nodes.
The master node is used only for managing this data. In addition, we can use another node, the secondary name node, as a backup.
This secondary node can serve as a backup for the slave node layer or even for the master node. Here you can see the HDFS architecture. Our master node, the name node, keeps the metadata: the names of our data, their allocation, the directory structure and so on.
This name node interacts directly with our client; the client actually interacts only with the name node and has access only to this part of the architecture, only to the name node. If we would like to organize a backup, we can use the secondary node, but in this case the secondary name node also interacts directly with the name node. And then we have a lot of data nodes.
These data nodes are used for data storage, for storing the different blocks of data. Here you can see an example of how data storage is implemented and how we can use HDFS to deal with storage problems, for example when one or more nodes fail, when a connection fails, and so on. First of all, we have our master node and information about the parts of the file that we would like to save in HDFS. For example, our file consists of four parts, four blocks, B1 to B4. Each of these blocks is then stored multiple times on our slave nodes.
Here you can see multiple copies of B1, multiple copies of B2, B3 and B4. If we have a problem with one of these nodes, we can of course use the other nodes for data recovery.
The master node is responsible for knowing how and where each part is stored.
For example, we know that B2 is stored on this node, this node and this node. That is why the master node is so important for us: it holds this kind of metadata.
In this case, each file is divided into blocks, as you can see here, for example four blocks or more. Each block is the basic unit of reading and writing, which means that we can organize direct access to each of these blocks.
We can change the block size. By default it is 64 megabytes, but of course we can configure larger blocks, for example 128 megabytes.
Based on that, we can store large databases. Secondly, as we saw on this slide, HDFS blocks are replicated multiple times, usually three times, but of course you can increase the number of replicas.
This makes our system, our storage, more fault tolerant and faster to read, because we know exactly where our data is stored.
The commands are very similar to those of other file systems, for example on Windows or Linux. First of all, we can create a directory with mkdir, similar to what you would use on the command line.
Next, we can list the contents of a directory with ls, upload and download files with put and get, and look at the content of a file with the cat command. These commands are very similar to the UNIX ones, so it is easy to use HDFS without additional training, as in the sketch below.
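A minimal sketch of these basic HDFS commands, run from Python with the standard subprocess module; the hdfs dfs subcommands (-mkdir, -ls, -put, -get, -cat) are the standard ones, while the paths and file names are hypothetical placeholders.

    import subprocess

    def hdfs(*args):
        # Run an "hdfs dfs ..." command and return its text output.
        return subprocess.run(["hdfs", "dfs", *args],
                              capture_output=True, text=True, check=True).stdout

    hdfs("-mkdir", "-p", "/user/demo")                 # create a directory, like mkdir
    print(hdfs("-ls", "/user/demo"))                   # list the contents, like ls
    hdfs("-put", "local_data.csv", "/user/demo/")      # upload a local file
    hdfs("-get", "/user/demo/local_data.csv", ".")     # download it back
    print(hdfs("-cat", "/user/demo/local_data.csv"))   # print the file content, like cat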
The next very important part is HBase. HBase is a NoSQL data store built on top of HDFS. A NoSQL data store allows us to implement fast access to data and to store a lot of semi-structured information. When we talk about classical relational databases, you know that we have many relations, or tables, and that we have connections or relationships between these tables.
The join operation, for example, no matter whether inner join, cross join or whatever, is one of the slowest operations in relational databases. If we have two tables with billions of records, it is very hard to join these tables, because conceptually we first build a cross join, joining each record of the first table with each record of the second table, and only after that, based on different optimizers, the appropriate records are chosen for us. Of course, this depends on the database management system.
Oracle, for example, has a more powerful optimizer than SQL Server. But it does not matter: we have huge tables, and to connect all of these tables together we need to organize a cross join. That is why NoSQL is very good for internet data, where we have a lot of data. HBase, as a NoSQL data store, can be used for various types of data.
In addition, based on an XML extension, you can build your own data types and provide a description of each type. The HBase database can be used for storing large amounts of data: terabytes, petabytes.
It is a column-oriented data store. Dear students, do you know anything about NoSQL databases, or not? Because I do not know whether we should add an additional lecture about NoSQL databases; this is a very important part of big data technologies. [A student mentions knowing of MongoDB, but without having worked with it.] OK, MongoDB is actually a document-oriented database.
Maybe next lecture I will show you other kinds of NoSQL databases. HBase, for example, is column-oriented. What does that mean? When we talk about a relational database, you know that the core of this database is the record, or tuple, in our table.
We save all of our data tuple by tuple; each tuple is stored together, similar to arrays in classical programming languages. When we talk about column-oriented databases, for example not only HBase but also Cassandra, it means that we store the data attribute by attribute: we have collections of attributes, and for each attribute we know the value and the ID of this value.
Maybe next lecture I will show you different flavors of NoSQL databases. On the other hand, it is very important that, based on this implementation, we can organize random reads and writes of our data.
This simplifies access to our data. In a relational database, we first need to know the identifier of the tuple, of this record. To do that, we first look into a separate index table; after that, based on this index table, we go to the original table, search for this identifier value, and then analyze the individual elements of the tuple. In HBase we of course need an identifier too, but we can organize direct access to each value of the tuple separately, as the sketch below illustrates.
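A minimal sketch of this column-oriented, random read and write access from Python, assuming the third-party happybase client and an HBase Thrift server running on localhost; the table name, column family and row key are hypothetical.

    import happybase

    connection = happybase.Connection("localhost")   # connect to the HBase Thrift server
    table = connection.table("students")

    # Random write: put individual column values for one row key.
    table.put(b"student-001", {b"info:name": b"Alice", b"info:gender": b"F"})

    # Random read: fetch a single column of a single row directly,
    # without scanning a whole tuple as in a row-oriented RDBMS.
    row = table.row(b"student-001", columns=[b"info:name"])
    print(row[b"info:name"])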
In addition, this database is horizontally scalable. This is the same for all NoSQL databases. What does horizontally scalable mean? Let me illustrate the point.
When we talk about a classical relational database, we have a table, for example.
In this table we have a primary key, a unique identifier of our data. In addition, we have an index table. This index table is created automatically by the database management system for each primary key, each foreign key and each index.
So here we have a lot of indexes and identifiers. And how can we scale our data? We can organize only vertical scaling, which means that we can split our data into different tables, while the index table knows only the identifiers of this data. But for other approaches, we also need to organize horizontal scaling.
It means that we would like to divide this huge table into separate parts; that is horizontal scaling. In relational databases this is very hard to implement, because we have this index table, and the index table is actually built on a hash function.
If we divide the table into separate parts, we have to divide the index table too, and this leads to collisions in the hash function.
That is why relational databases are very hard to scale horizontally. For NoSQL databases this is much simpler, and that is why in HBase we can organize horizontal scaling of our data.
On the other hand, HBase, similarly to other NoSQL databases, is not so good for transactional workloads, unlike the relational database model.
In this case, we need to take care of many operations that are implemented automatically in relational database management systems; for example, we have to take care of indexing or connecting data ourselves.
Also, many operations such as aggregations, stored procedures and so on must be written manually in NoSQL databases, whereas in relational databases you can use ready-made solutions for that.
Another difficulty is how to organize data analytics on top of NoSQL databases. When we talk about relational databases or data warehouses, we have a lot of implemented models for data analysis and analytics.
For example, in the Analysis Services of MS SQL Server we have ready tools for association rule mining, ready-made tools for regression models and so on.
For a NoSQL database, we need to write this code manually. HBase is also not very efficient for text searching and processing.
This is related to the column-based data organization. For text searching we can use other NoSQL databases, for example MongoDB. We will talk about this later.
The next very important thing in the Hadoop ecosystem is MapReduce. This is a simple programming model for big data. We talked about MapReduce last lecture. The core of MapReduce is, on one side, the possibility of parallel processing of our data, because, as you know, for each part of the map procedure you can use a different processor or a different core.
On the other side, you can organize threads over your data, different synchronization mechanisms and so on. Once more about the MapReduce paradigm: we have two important steps.
The first one is map and the second one is reduce. Map is used to organize the data into key-value pairs, and reduce is used to combine these pairs together by the value of the key.
Here you can see an example implementation. We can use MapReduce in different areas; for example, this is once again the word count program.
We have our file, and it is divided into blocks because we use the HDFS file system, so the data is automatically split into different blocks.
After that, we can analyze and process all blocks separately; then we have the shuffle and sort procedure, and finally we get the result of this data processing, as sketched below.
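A minimal sketch of this word count flow in plain Python, only to illustrate the idea: the map phase emits (word, 1) pairs per block, shuffle and sort groups the pairs by key, and the reduce phase sums the values. The block contents are hypothetical sample data.

    from collections import defaultdict

    blocks = ["big data big", "data and more data"]    # stand-ins for HDFS blocks

    # Map phase: each block is processed independently and emits key-value pairs.
    mapped = [(word, 1) for block in blocks for word in block.split()]

    # Shuffle and sort: group all pairs with the same key together.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce phase: aggregate the values of each key.
    result = {word: sum(counts) for word, counts in groups.items()}
    print(result)    # {'big': 2, 'data': 3, 'and': 1, 'more': 1}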
That is why MapReduce works very well with the HDFS file system. If you remember, one of the most important disadvantages of MapReduce is that we must store the intermediate results of the map procedure.
HDFS is ready for that, because we have these blocks and we can reuse them later. That is why, in combination with the HDFS file system, MapReduce is a very powerful solution for processing huge amounts of data. First of all, we can use map for data filtering.
In addition, we can use different aggregation functions. On the other hand, we are talking about an acyclic data flow from disk to disk in our HDFS file system, because we have the divided blocks and can process all of them separately.
Reading from and writing to disk happens before and after each MapReduce procedure, and that is why this approach is not efficient for iterative and interactive tasks. For example, in machine learning, when we talk about neural networks, about different collaborative filters or even about k-means, we have many iterations, and if you have many iterations such an organization is not very suitable.
Because, as you understand, with this procedure the result of each iteration must be stored separately, and this costs additional resources for storing and accessing data. The MapReduce implementation in Hadoop supports a Java API.
Of course, MapReduce is designed for batch processing, and that is why it can be transformed quite easily to streaming data analysis, because we have batches, separate blocks, and we can process all of these blocks separately. The next component is YARN. YARN is actually the core of the Hadoop ecosystem.
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling into separate daemons.
The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM), where an application is a single job in our job scheduling.
Based on YARN, we can allocate resources for different tasks and organize the scheduling of these tasks. Here you can see the YARN schema.
We have the ResourceManager on one side and the ApplicationMaster on the other. The ResourceManager and the NodeManager form the data computation framework. The ResourceManager is the authority that arbitrates resources among all the applications in the system.
The NodeManager is the per-machine framework agent that is responsible for the containers, monitoring their resource usage, for example CPU, memory and network resources, and reporting the same to the ResourceManager scheduler.
The per-application ApplicationMaster is a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the tasks.
So the main idea of YARN is to bring computational resources and tasks together; a small sketch of submitting a job to YARN follows below.
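A minimal sketch of how a job can be handed to YARN, here by calling spark-submit from Python: the ResourceManager receives the application, starts an ApplicationMaster, and the ApplicationMaster negotiates containers on the NodeManagers. The script name and the resource numbers are hypothetical.

    import subprocess

    subprocess.run([
        "spark-submit",
        "--master", "yarn",            # let YARN's ResourceManager allocate the resources
        "--deploy-mode", "cluster",    # run the driver inside a YARN container
        "--num-executors", "4",        # executor containers requested from the NodeManagers
        "--executor-memory", "2g",
        "--executor-cores", "2",
        "my_job.py",
    ], check=True)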
The next part of the Hadoop ecosystem is Apache Hive. This is a kind of data warehouse component. Apache Hive is used for reading, writing and managing large amounts of data in a distributed environment.
Moreover, we can use an SQL-like interface for that. As we discussed before, HBase is a NoSQL database, and one of its disadvantages is that we need to implement most SQL-style operations for data processing and analysis ourselves. Only very simple operations are available out of the box: read, write, index and so on.
More complicated operations that SQL provides must be implemented manually for HBase. With Hive, we have this SQL-like interface, and this makes it easy to analyze our data, as sketched below. This is very important.
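A minimal sketch of this SQL-like interface idea, querying a Hive table from PySpark with Hive support enabled; the table web_logs and its columns are hypothetical.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-demo")
             .enableHiveSupport()       # lets Spark read Hive's metastore and tables
             .getOrCreate())

    # Plain SQL over data stored in HDFS, no manual MapReduce code required.
    top_urls = spark.sql("""
        SELECT url, COUNT(*) AS hits
        FROM web_logs
        GROUP BY url
        ORDER BY hits DESC
    """)
    top_urls.show(10)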
The next component is Mahout. Mahout provides an environment for creating machine learning applications, and moreover these applications are scalable.
It means that we can use these applications for different amounts of data, and it allows us to use HDFS resources to store our data.
First of all, it provides a kind of distributed backend, and this backend can be extended to other distributed backends too. The main features of Mahout are, on one side, a mathematically expressive language built on Scala,
support for multiple distributed backends, including Apache Spark, which we will talk about,
and modular native solvers for CPU and GPU acceleration and so on. Based on that, Mahout can be used with different sources and for different solutions.
First of all, the main idea of this solution is to provide ready-made building blocks for linear algebra.
You know that many machine learning models are built on linear algebra. That is why Mahout can be used for mathematical solutions and for running these solutions on GPU, CPU and so on.
The next solution is Apache Spark. The main idea of Apache Spark is to implement many machine learning models and to make these implementations scalable.
Apache Spark is connected with the HDFS file system and of course with YARN. We can use HBase as the NoSQL database, and we have additional possibilities for SQL data processing even without Apache Hive.
I will show you a Spark implementation in our next lecture, because I would first like to show you a NoSQL database, then the analysis of this NoSQL database, and after that the usage of Apache Spark.
Next, Apache Spark natively supports Scala, Java, Python and R, so you can use any of these languages with Spark. But we have one limitation: Spark does not run well on Windows.
It is much better to use a Linux system for Spark, but on my computer I have Windows. With some difficulties I installed Spark on Windows and use it there, and I will show you how we can use Spark.
And you can use any of these languages for that. We have specific libraries in Spark; these libraries were developed by the AMPLab at Berkeley.
The main idea of Spark is to use in-memory data sharing. Here you can see a comparison between Hadoop, with HDFS reading and writing of information, and Spark.
When we talk about classical MapReduce and Hadoop, what do we have? We have our data storage. We must read the information from this storage, do something with it in the first iteration over our data, and then write the result of this processing back, for example the result of the map procedure. After that, once again, we must read this data and use it in the next iteration, then write it again, and so on. So each iteration costs two additional procedures, a write and a read, and even if we only talk about the two procedures map and reduce, after each map we must write.
This means that we spend time on it. On the one hand it is good, because our solution is stable: we know that the result is saved in the HDFS file system and we have access to it. But if we would like to analyze data online, we cannot do that,
because each time we must write and read, write and read. And many machine learning models use multiple iterations; examples are k-means or other kinds of clustering, the naive Bayes approach, or neural networks with thousands of iterations and so on. That is why the classical HDFS-based approach is not so good in this case.
Spark is more suitable here because we use in-memory data sharing. It means that we use only one operation for persistent data storage.
For example, we run an iteration and get intermediate results. We keep these results in memory and can immediately use them without additional reading, because the data is stored in memory, in very fast memory.
Based on that, we can reduce the time for data analysis, and this is one of the most important advantages of the Spark implementation. Here you can see a comparison of the Hadoop MapReduce implementation and Spark.
For example, you can see that the size of the data is more or less the same, but, as you can see in the next rows of this table, Spark processes the data three times faster.
Moreover, we need fewer nodes for this processing: here we have about two thousand nodes for Hadoop and only about two hundred for Spark.
Next, the number of cores: in Hadoop MapReduce we use physical cores, which means physical clusters, physical storage and physical resources.
Here you can see that we have about 50 thousand of these physical cores, while for Spark we have around 7 thousand virtualized cores.
It means that you can use a combination of cores from different locations in your cluster. And you can see further comparisons: when we talk about the sort rate, Hadoop can sort about 1 terabyte per minute, while Spark can sort more than 4 terabytes per minute.
This is a huge advantage of Spark compared with Hadoop. So what actually is Apache Spark?
As I said before, it is a set of libraries for data analysis, machine learning, graph analysis, streaming data analysis and so on, supported by multiple languages. The native language of Spark is actually Scala, but of course you can use Java, Python, R and SQL too, since we have an additional implementation of SQL.
That is why I told you that in this case Apache Hive is not strictly necessary for us, because Spark has its own implementation of an SQL-like language.
In addition, we can organize all of our data in data frames, which is classical for Python. That is why data frames can easily be connected with SQL, because each data frame can be seen as a separate table.
Second, we can use machine learning pipelines: we can organize a pipeline of model implementation, data analysis and so on. For that we can use different libraries; the most important for us are the ML library and the graph library, which allows us to process graph data.
In addition, of course, we have Spark Streaming, which allows us to work with streaming data. All of this is based on the Hadoop HDFS file system, HBase, Hive with its implemented SQL, Amazon S3, streaming data sources and so on. The next very important thing for Hadoop and particularly for Spark is the RDD, the Resilient Distributed Dataset.
This is actually a kind of data container, and it is the core abstraction of Spark.
An RDD is a read-only collection of data that can be partitioned across a subset of the Spark cluster machines and forms the main working component. RDDs are so integral to the functioning of Spark that the entire Spark API can be considered a collection of operations to create, transform and export RDDs, and every algorithm implemented in Spark is effectively a series of transformative operations performed upon data represented as an RDD. So our pipeline for machine learning models is in fact a chain of RDDs. The key performance driver of Spark is that an RDD can be cached in the memory of the Spark cluster compute nodes and then reused by many iterative and interactive tasks; it is very important that we can reuse our results. The input data that forms an RDD is partitioned into chunks and distributed across all the nodes in the Spark cluster. So, once again, we have a distribution of our data, and each node then performs its computation in parallel upon its own set of chunks. Physically, an RDD is a Scala object and can be constructed from a variety of data sources, for example from the HDFS file system, directly from Scala arrays, or from a range of transformations that can be performed upon an existing RDD, and so on.
The key feature of RDDs is that they can be reconstructed if an RDD partition is lost, using a concept called lineage, and so they can be considered fault tolerant.
Based on RDDs, we can easily organize the parallelization of our data, different kinds of transformations and so on. The next very important part is the data frame.
This is something different from an RDD: the RDD is the data container, organized as chunks of our data, while a data frame is a kind of distributed data set with named columns.
We have columns, we have a strict understanding of the nature of the data, and this data frame is similar on one side to relational databases.
On the other side, this data frame is similar to a Python data frame or to R's data table. So if you have used data frames in Python, a data frame in Spark is essentially the same.
It is also similar to a classical relational database, and you can use similar operations for data processing, for example grouping, sorting and so on; these operations resemble classical SQL.
A data frame can be constructed in different ways: for example, we can read data from a physical file and keep it as a data frame, we can convert existing data frames, for example from pandas, we can parallelize a Python collection such as a list, and so on. Here you can see an example of a data frame. First of all, we create a new data frame that contains students.
Here you can also see that we can apply different filters, for example on another table of users. The equivalent in pandas looks like this, so you can see that it is very similar.
Next, for example, in Spark we would like to count the number of students by gender, and this is the implementation. Or we can join the students with another data frame.
This is similar to the SQL join operation, and moreover you can specify the kind of join; here, for example, we have a left join. So the syntax is very similar to SQL, to pandas and so on, as sketched below.
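A minimal sketch of these data frame operations in PySpark (filter, group-by count and a left join); the column names and sample rows are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    students = spark.createDataFrame(
        [("Alice", "F", 21), ("Bob", "M", 23), ("Carol", "F", 22)],
        ["name", "gender", "age"])
    grades = spark.createDataFrame(
        [("Alice", 1.3), ("Bob", 2.0)],
        ["name", "grade"])

    # Filter, as in a SQL WHERE clause.
    students.filter(students.age >= 22).show()

    # Count students by gender, as in SQL GROUP BY.
    students.groupBy("gender").count().show()

    # Left join with another data frame, similar to a SQL LEFT JOIN.
    students.join(grades, on="name", how="left").show()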
When we compare RDDs and data frames, the RDD is a low-level interface and the data frame describes a schema of our data. So the data frame is similar to the schema of a relational database, and the RDD is the container of this database. Data frames are built on top of RDDs, and together they form the core of the Spark API.
Here you can see examples of Spark operations. There are two kinds of operations: transformations and actions. Transformations are used to create new RDDs, to change RDDs and so on, while actions are used to return results to the driver program. Here you can see map, which creates a new piece of data, filter, which changes something in the data, groupBy, reduceByKey and so on. When we talk about actions, these can be reduce, because it returns a result, first, last or top, countByKey similar to SQL, and so on. A sketch of the difference follows below.
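A minimal sketch contrasting transformations (lazy, they only return a new RDD) with actions (they trigger execution and return a result to the driver), using some of the operations named above; the numbers are hypothetical sample data.

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

    # Transformations: nothing is computed yet, only the lineage (the DAG) is built.
    evens   = rdd.filter(lambda x: x % 2 == 0)    # keep the even numbers
    squared = evens.map(lambda x: x * x)          # square each of them

    # Actions: the DAG is executed and results come back to the driver.
    print(squared.collect())                 # [4, 16, 36]
    print(squared.count())                   # 3
    print(rdd.reduce(lambda a, b: a + b))    # 21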
The next very important thing for Spark is the directed acyclic graph, or DAG. The DAG is used to represent your operations; it is a kind of scheduler of your operations.
First of all, we have nodes; our nodes represent separate RDDs, and we have the list of our operations. All of the edges represent transformations, the group of operations that we discussed before. Based on this directed acyclic graph, we can analyze the procedure of our data set transformation. We start from one state, then we apply, for example, this operation, which creates or transforms our data set into this condition; after that we apply the next transformation, whose result is that condition, and so on.
When we compare a classical map operation with operations built on a directed acyclic graph, the map operation is always presented as a narrow, strict sequence: we have a strict linear dependence between our operations. We know that this is the first operation, that only this operation can be executed after it, and so on.
With a directed acyclic graph, we can combine all of these operations together, because we can build the graph and, based on this graph, organize the transformation of our data. So for classical MapReduce we have a strict sequence of operations, while with the acyclic graph we can combine these operations and transform our data more flexibly.
Next, when we talk about actions, the second kind of operation, actions represent the final stage of our workflow, where we must present the result to the end user, arrange our data and so on.
An action actually triggers the execution; the action sits after the last operation in the directed acyclic graph, here in this place.
Moreover, the result of an action is always the final result of our data processing, and this result can be saved into the HDFS system or directly into a file and so on.
Here you can see the classical Spark workflow and how it can be implemented. First of all, we have our data in HDFS, split into chunks.
We can divide our data into groups, small parts, and this is actually the map part. After that, we can shuffle our data, organize the sequence of operations, and this gives the result of this part.
After that, we can present the result of the operation, and all of this can be saved for further analysis. And this is an example of the Python RDD API.
The first one is word count. We have our file; next, as in our previous example, we use a lambda function to split the file into separate words per line. For each word we create a key, the word itself, and a count, and in the reduce part we sum up the counts for each word. The final step, saving the result into a new file, is the action. So the first part consists of transformations and the last part is an action. We can do the same with logistic regression: first we create a data frame, then we use a ready-made logistic regression model, for example with 10 iterations, fit this model, and after that show the result. So here again you can see the combination of two parts: the transformations and, as the last part, the action that shows the result. A sketch of both examples follows below.
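A minimal sketch of both examples in PySpark, a word count on an RDD followed by a logistic regression fit on a data frame; the HDFS paths and the training data format are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("wordcount-and-lr").getOrCreate()
    sc = spark.sparkContext

    # Word count: the transformations only build the plan, saveAsTextFile is the action.
    counts = (sc.textFile("hdfs:///user/demo/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///user/demo/word_counts")

    # Logistic regression: fit() is where the distributed computation happens.
    train = spark.read.format("libsvm").load("hdfs:///user/demo/train.libsvm")
    model = LogisticRegression(maxIter=10).fit(train)
    model.transform(train).select("label", "prediction").show(5)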
Spark's main use cases are streaming data, machine learning model implementation, interactive analysis, data warehousing, batch processing of large amounts of data, and of course streaming data and neural network implementations.
Also exploratory data analysis and graph data analysis, because we have a specific library for that; I will show you this GraphX library again.
Spatial data analysis and many, many more. Use cases that we implement with our students at our university are fingerprint matching and Twitter sentiment analysis; in particular, this sentiment analysis can be used not only for classical sentiment analysis, but also for fake news analysis, propaganda analysis and so on.
In the real world, Spark is used for different purposes. Here you can see some examples. The first one is Uber, with a Kafka and Spark Streaming setup. The next is Pinterest, for its ETL pipeline.
Next, Conviva, which does video streaming analysis, and Capital One, which builds models for customer understanding.
Netflix uses it for its recommendation system, and so on. And this is another example of how we can use Spark and the Hadoop ecosystem for different machine learning models. Here you can see an implementation of k-means. You know that for k-means we run many different epochs over the data. Here, for example, we would like to create different clusterings and choose the most appropriate number of clusters for us, and of course this is an elbow method implementation, because the elbow method can be run across different numbers of clusters and so on. A sketch of this idea follows below.
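A minimal sketch of this k-means idea in PySpark: fit KMeans for several values of k and compare a clustering score to pick a reasonable number of clusters, in the spirit of the elbow method; the feature vectors are hypothetical sample data.

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("kmeans-elbow").getOrCreate()

    data = spark.createDataFrame(
        [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.2]),), (Vectors.dense([0.3, 0.1]),),
         (Vectors.dense([9.0, 9.1]),), (Vectors.dense([9.2, 8.9]),), (Vectors.dense([8.8, 9.0]),)],
        ["features"])

    evaluator = ClusteringEvaluator()      # silhouette score by default
    for k in range(2, 5):
        model = KMeans(k=k, seed=1).fit(data)
        score = evaluator.evaluate(model.transform(data))
        print(k, score)                    # inspect how the score changes with k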
Here you can also see a comparison of classical MapReduce, Mahout, Python scripting, Spark and so on; in this case MPI outpaces Hadoop, but we can also use a hybrid approach for all of that. The next component is Apache Giraph. This is an iterative graph processing framework.
Apache Giraph allows us to work with graph data, and this can be used in different solutions,
for example in social network analysis, and it can be used for analyzing large amounts of graph data.
Giraph is built on Apache Hadoop's MapReduce implementation to process graphs. For example, Apache Giraph is used at Facebook to perform data analysis, user data analysis and so on.
Giraph is based on graph processing, and a very important thing is that it can be used together with Spark too.
This is an example of graph analysis: we have a lot of data, we can analyze the message passing between the nodes, and we can easily find the shortest path in the graph data, and so on.
Now the conclusions for our lecture today. We talked about the Hadoop ecosystem. The most important elements in Hadoop are the HDFS file system, MapReduce as the computational model, and YARN as the heart or core of this technology. Next, Spark is an extension of MapReduce for machine learning model implementation. We can use data frames similarly to classical databases and pandas data frames, and RDDs with their transformations on top of Hadoop for this data. In addition, we can deal not only with streaming data but also with graph data. And that is all from my side for today. Questions, dear students?