
Intro: Big data definition


Formal Metadata

Title
Intro: Big data definition
Alternative Title
Визначення великих даних (Definition of big data)
Title of Series
Part Number
1
Number of Parts
10
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Here, colleagues, you can find the objectives of the big data and machine learning course on this slide. First of all, we will try to understand the problems related to big data: what big data is, which parts it has, and which parameters are the most important. The next objective is about data pre-processing and sample reduction technologies, followed by data processing and data analysis.
And the last objective is related to big data architecture. Today we will talk about the definition of big data, about what big data actually is. First of all, there is no single definition of big data, but big data is related to the fact that data is at the center of the knowledge economy and society. That is why it is very important to analyze data from different sources and different domains and, based on their intersection, to find something new, something important for us.
From this perspective, we can find a lot of different domains related to big data. The first one is science: here we talk about astronomical data analysis, genomic data, environmental data collected from different sensors, and so on. Then we have the humanities and social sciences: analysis of scanned books, historical documents, social interaction data collected from social media networks, and so on. When we talk about business and commerce, we analyze corporate sales, stock market transactions, airline traffic, and so on. Big data also covers entertainment content, such as movies, audio files, internet images, and so on.
Big data is also very popular in medicine: here we talk about patient reports and patient data analysis, and about different images, for example magnetic resonance images, computer tomography scans, and so on. The last example of a big data domain is industry and energy. In this case, we talk about data collected from different sensors and smart devices; for energy, for example, we can analyze data from a wind farm and, based on that data, predict energy overloads and so on. In this way, big data can be presented as an intersection of different parameters, and some of the most important parameters of big data are the three V's.
The first V is size, or volume, because when we talk about big data we mean a lot of data, terabytes and petabytes of information. Here you can see examples of that: genomic data is around three gigabytes of genetic data per person; in astronomy, you can collect around 100 gigabytes per week; and with bank transactions, billions and billions of transactions are collected each year, and hundreds of terabytes of data are transmitted to bank systems such as Visa and similar.
The next parameter, the next V, is speed, because we need not only to collect this data but also to analyze it in an appropriate time. Moreover, when we talk about sensor data and streaming data, it is very important to analyze this data online, or in a mode very close to online. That is velocity, or speed, the next important parameter of big data.
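As a rough illustration of this velocity requirement, here is a minimal Python sketch of near-real-time stream processing: readings are analyzed as they arrive using a sliding window, instead of being stored first and analyzed later. All names, values, and the alert threshold are invented for the example.

```python
# Minimal sketch: near-real-time processing of a sensor stream with a
# sliding window, so values are analyzed as they arrive rather than
# after everything has been collected. All names are illustrative.
from collections import deque
from statistics import mean

def rolling_alert(readings, window=10, threshold=90.0):
    """Yield an alert whenever the rolling mean of the last `window`
    readings exceeds `threshold`."""
    recent = deque(maxlen=window)
    for value in readings:            # `readings` can be an unbounded stream
        recent.append(value)
        if len(recent) == window and mean(recent) > threshold:
            yield f"rolling mean {mean(recent):.1f} exceeded {threshold}"

# Example with a finite list standing in for live sensor data.
sensor_stream = [70, 75, 88, 91, 94, 95, 96, 97, 98, 99, 99, 98]
for alert in rolling_alert(sensor_stream, window=5, threshold=90.0):
    print(alert)
```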
The last V of these three V's is variety of sources. As I said before, there are a lot of different sources of data, for example databases from different enterprises, social networks, sensors, different smart devices, scanned books, and so on. So we have a variety of data with different structures and different hidden relations in the data, and all of this data must be analyzed together. Based on these three V's, we can present big data as an intersection of very complex data, a huge amount of this data, and additional requirements on the speed of data processing. On the other hand, it is very important to present data from the perspective of the amount of information and the complexity of the queries against the data.
When we talk about a limited amount of data and a more or less known structure of queries against that data, this is the example of an enterprise system. When we would like to analyze not only internal data but also external data, for example about our clients, we talk about client-oriented systems: we collect data not only from our own enterprise but also from other sources, and we deal with more complex queries against this data. And when we would like to analyze not only information about clients but also their interaction with our system, for example with our website, we talk about web log analysis, browser history analysis, and so on.
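As a toy illustration of such web log analysis, the sketch below counts requests per client IP from a few invented log lines; a real system would of course read terabytes of logs from distributed storage.

```python
# Tiny illustration of web log analysis: count requests per client IP
# from (hypothetical) combined-log-style lines.
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [01/Jan/2024:10:00:01] "GET /cart HTTP/1.1" 200',
    '10.0.0.1 - - [01/Jan/2024:10:00:05] "POST /checkout HTTP/1.1" 302',
]

hits_per_client = Counter(line.split()[0] for line in log_lines)
print(hits_per_client.most_common())   # [('10.0.0.1', 2), ('10.0.0.2', 1)]
```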
In that case, we collect terabytes of information and must run many different queries and requests against our data. And if we would like to collect information not only from humans but also from smart devices, from other clouds, and so on, then we are actually talking about a big data system. So who is generating big data? As I said before, we can talk about social media, bank transactions, scientific instruments, and so on. Based on this variety of sources, we can say that classic technologies, such as relational database management systems, cannot manage, analyze, summarize, visualize, and discover hidden knowledge from this collected data. So we need new algorithms, new technologies, and new architectures for big data analysis. That is why big data, in short, is a problem of representing a huge amount of data that traditional applications cannot process, or at least struggle to process.
Based on that, we can add additional parameters of big data, such as veracity. If we have a lot of different sources, it is very important for us to understand the quality of this data. For example, the data can contain inconsistencies, outliers, incompleteness, and so on. That is why quality evaluation is the next challenge of big data technologies.
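To make these quality problems concrete, here is a small sketch of veracity checks on an invented table, assuming pandas is available: it counts missing values, flags duplicated identifiers, and marks implausible values with a simple range rule.

```python
# Minimal veracity check on a small invented table: count missing values,
# flag duplicated identifiers, and mark values outside a plausible range.
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "age":        [34, None, 41, 39, 260],   # None = incompleteness, 260 = implausible
})

missing = df.isna().sum()                          # incompleteness per column
duplicate_ids = df.duplicated(subset="patient_id") # possible inconsistency
implausible = (df["age"] < 0) | (df["age"] > 120)  # crude outlier/validity rule

print(missing)
print("duplicate ids:", int(duplicate_ids.sum()))
print("implausible ages:", df.loc[implausible, "age"].tolist())
```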
And the last parameter for today, but not the last of all possible big data parameters, is value. If we have problems with data, for example uncertain data and so on, it is very important for us to understand the value of this data: how much hidden knowledge we can find in this data source, how we can interact with the data, and so on. On the other hand, if we have a lot of different sources, as you can see here, we can also talk about innovative new approaches and technologies for processing this data.
In this context, I would like to present to you two definitions of big data. The first definition is that big data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. Why? Because, as I said before, we have not only structured data, but also streaming data and unstructured data from movies and so on. So traditional database management tools cannot be used for big data processing, because we have complex data. The other definition is that big data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it. So in this case, we need not only new database management tools, but also new architectures and new algorithms. The old algorithms cannot be used for big data because we have a lot of data, together with requirements on the speed and complexity of our queries. So how much actually is big data? Here you can see a very simple example.
For just one flight, we have around one terabyte of information, and there are around 100,000 flights per day. So it is very important to analyze how long this data must be stored, when it must be analyzed, how it must be analyzed, and so on. Here you can find other examples: per minute on the internet, there are around 1,400 Uber rides, around 200,000 photos posted to Instagram, and around 300 hours of video uploaded to YouTube. This is a huge amount of data, and it requires additional algorithms, technologies, and techniques for processing.
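Taking the quoted flight figures at face value, a quick back-of-the-envelope calculation shows why traditional tools struggle with this volume:

```python
# Back-of-the-envelope volume estimate using the figures quoted above:
# roughly 1 TB per flight and about 100,000 flights per day.
TB = 10**12                      # bytes in a terabyte (decimal)
per_flight = 1 * TB
flights_per_day = 100_000

daily = per_flight * flights_per_day
print(daily / 10**15, "PB per day")         # 100.0 PB per day
print(daily * 365 / 10**18, "EB per year")  # ~36.5 EB per year
```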
Based on these parameters of big data, these five V's, we have different points of view on the big data definition, and they allow us to analyze big data from different branches. The first branch covers big data generators: logs, different business processes, scientific measurements, and data generated by people using different mobile devices and social networks. From the competencies point of view, we can talk about the different responsibilities of engineers, such as software development, business understanding, presentation of our data using different visualization techniques, reporting, translation into business language, and so on. We also talk about specific algorithms, in particular machine learning algorithms, for big data analysis. When we talk about applications, we need specific applications for big data, and we can classify these applications by data set placement and by domain, for example healthcare, social modeling, informatics, and so on.
On the other hand, it is very important for us to analyze social trends and the expectations related to big data analysis, the ability to learn something new and important for us, and so on. The next branch is related to data science and engineering: for example, how we can find new knowledge, how we can organize business understanding, which business models can be used for that, and which visualization techniques can be used. The next very important thing is technology. For big data, as I said before, we can use specific technologies such as MapReduce, Hadoop, NoSQL databases, Spark, and so on. We will talk about these technologies in the upcoming lectures.
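To give a first impression of the MapReduce idea before those lectures, here is a plain-Python imitation of its map, shuffle, and reduce phases on a word-count example; this is only a sketch of the concept, not the actual Hadoop or Spark API.

```python
# Plain-Python imitation of the MapReduce idea (word count): a map phase
# that emits (key, value) pairs, a shuffle that groups by key, and a
# reduce phase that aggregates each group. Real systems (Hadoop, Spark)
# run these phases in parallel across many machines.
from collections import defaultdict

documents = ["big data needs new tools", "big data needs new algorithms"]

# Map: each document -> list of (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 2, 'data': 2, 'needs': 2, 'new': 2, ...}
```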
And the last branch is related to the data itself: specific data storage, data flow and data lifecycle, cyber security related to data access and data transfer, and so on. The next point of view is the engineering point of view.
From the engineering point of view, we talk about storage: the capabilities of the storage and the providers of the storage. We talk about specific processing technologies, as I mentioned, for example the Google File System, the Hadoop ecosystem, MapReduce. We also talk about cyber security: who can access this data, when it can be accessed, how it can be accessed, using which protocols, and so on. The next topic is data streaming: first of all, communication, human-to-human or computer-to-human communication, and the capacity of this data streaming. The next very important point is the data lifecycle: when our data must be created, processed, and analyzed, and when we can delete this data because its importance is no longer so high. And the last one is about data computing: in particular, when we talk about big data and sensor data analysis, it is very important for us to also use fog computing, that is, computing based on edge device processing.
Here you can see the big data landscape. This landscape is related to different technologies and services, such as infrastructure services, operational services, analytics infrastructure, and so on; to data sources, such as structured databases and semi-structured data such as logs; to vertical data storage; and to business analysis and analytics, for which we will talk about specific algorithms. In this view, big data can be presented as the intersection of hacking skills, because we must organize access to the data; mathematical and statistics knowledge, because we must analyze this data using machine learning and deep learning algorithms; and substantive expertise, such as data search, data analysis, and so on. And the last part: traditional data science consists of three stages.
The same holds for big data. The first stage is data pre-processing, and we spend around 70% of the whole time on it. Here we talk about data collection, data cleaning, data aggregation, and so on.
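A minimal sketch of these cleaning and aggregation steps, on an invented sales-like table and assuming pandas is available, could look like this:

```python
# Sketch of the pre-processing steps mentioned above on an invented
# sales table: collect (here, construct), clean, and aggregate.
import pandas as pd

raw = pd.DataFrame({
    "store":  ["A", "A", "B", "B", "B"],
    "amount": [10.0, None, 7.5, 7.5, 120.0],
})

clean = raw.dropna(subset=["amount"])          # drop incomplete rows
clean = clean.drop_duplicates()                # remove exact duplicates
clean = clean[clean["amount"] < 100]           # filter implausible values

per_store = clean.groupby("store")["amount"].agg(["count", "sum", "mean"])
print(per_store)
```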
The next stage is related to exploratory data analysis, data visualization, data representation, and so on. And the last one is the actual analytics, the mining of hidden knowledge based on different machine learning models related to clustering, classification, regression, and so on. In the next lecture, we will start from this last part, data analytics, because the methods of clustering, classification, and regression can also be used in the other stages, in the data processing stage as well as in the data pre-processing stage.
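As a small preview of that analytics stage, here is a sketch of clustering and regression on tiny synthetic data; scikit-learn is used here only as one possible tool, not necessarily the one the course will rely on.

```python
# A small preview of the analytics stage: clustering and regression on
# tiny synthetic data with scikit-learn (one possible tool among many).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Clustering: group 2-D points into two clusters.
points = np.array([[1, 1], [1.2, 0.9], [8, 8], [8.1, 7.9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print("cluster labels:", labels)          # e.g. [0 0 1 1]

# Regression: fit y ~ 2x + 1 and predict a new value.
x = np.array([[0], [1], [2], [3]])
y = np.array([1.0, 3.1, 4.9, 7.2])
model = LinearRegression().fit(x, y)
print("prediction for x=4:", model.predict([[4]])[0])
```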