
Intro: Big data definition


Formal Metadata

Title
Intro: Big data definition
Alternative Title
Визначення великих даних (Definition of big data)
Title of Series
Part Number
1
Number of Parts
10
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Here, colleagues, you can find the objectives of the big data and machine learning course on this slide. First of all, we will try to understand the problems related to big data: what big data is, which parts it has, and which parameters are the most important. The next objective is about data pre-processing and sample reduction technologies, followed by data processing and data analysis.
And the last objective is related to big data architecture. Today we will talk about the definition of big data, about what big data actually is. First of all, there is no single definition of big data, but big data is related to the fact that data is at the center of the knowledge economy and society. That is why it is very important to analyze data from different sources and different domains and, based on their intersection, to find something new, something important for us.
From this perspective, we can find a lot of different domains related to big data. The first one is science: here we talk about astronomical data analysis, genomic data, environmental data collected from different sensors, and so on. Then we have the humanities and social sciences: analysis of scanned books, historical documents, social interaction data collected from social media networks, and so on. When we talk about business and commerce, we analyze corporate sales, stock market transactions, airline traffic, and so on. Big data also covers entertainment content, such as movies, audio files, internet images, and so on.
Big data is also very popular in medicine: here we talk about patient reports and patient data analysis, and about different images, for example magnetic resonance images, computer tomography scans, and so on. The last example of a big data domain is industry and energy. In this case, we talk about data collected from different sensors and smart devices; for energy, for example, we can analyze data from a wind farm and, based on that data, predict energy overloads and so on. In this way, big data can be presented as an intersection of different parameters, and some of the most important parameters of big data are the three V's.
The first V is size, or volume, because when we talk about big data we mean a lot of data, terabytes and petabytes of information. Here you can see examples of that: genomic data is around three gigabytes of genetic data per person; in astronomy, you can collect around 100 gigabytes per week; and with bank transactions, billions and billions of transactions are collected each year, and hundreds of terabytes of data are transmitted to bank systems such as Visa and similar.
The next parameter, the next V, is speed, because we need not only to collect this data but also to analyze it in an appropriate time. Moreover, when we talk about sensor data and streaming data, it is very important to analyze this data online, or in a mode very close to online. That is velocity, or speed, the next important parameter of big data.
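As a rough illustration of this velocity requirement, here is a minimal Python sketch of near-real-time stream processing: readings are analyzed as they arrive using a sliding window, instead of being stored first and analyzed later. All names, values, and the alert threshold are invented for the example.

```python
# Minimal sketch: near-real-time processing of a sensor stream with a
# sliding window, so values are analyzed as they arrive rather than
# after everything has been collected. All names are illustrative.
from collections import deque
from statistics import mean

def rolling_alert(readings, window=10, threshold=90.0):
    """Yield an alert whenever the rolling mean of the last `window`
    readings exceeds `threshold`."""
    recent = deque(maxlen=window)
    for value in readings:            # `readings` can be an unbounded stream
        recent.append(value)
        if len(recent) == window and mean(recent) > threshold:
            yield f"rolling mean {mean(recent):.1f} exceeded {threshold}"

# Example with a finite list standing in for live sensor data.
sensor_stream = [70, 75, 88, 91, 94, 95, 96, 97, 98, 99, 99, 98]
for alert in rolling_alert(sensor_stream, window=5, threshold=90.0):
    print(alert)
```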
The last V of these three V's is variety of sources. As I said before, there are a lot of different sources of data, for example databases from different enterprises, social networks, sensors, different smart devices, scanned books, and so on. So we have a variety of data with different structures and different hidden relations in the data, and all of this data must be analyzed together. Based on these three V's, we can present big data as an intersection of very complex data, a huge amount of this data, and additional requirements on the speed of data processing. On the other hand, it is very important to present data from the perspective of the amount of information and the complexity of the queries against the data.
When we talk about a limited amount of data and a more or less known structure of queries against that data, this is the example of an enterprise system. When we would like to analyze not only internal data but also external data, for example about our clients, we talk about client-oriented systems: we collect data not only from our own enterprise but also from other sources, and we deal with more complex queries against this data. And when we would like to analyze not only information about clients but also their interaction with our system, for example with our website, we talk about web log analysis, browser history analysis, and so on.
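As a toy illustration of such web log analysis, the sketch below counts requests per client IP from a few invented log lines; a real system would of course read terabytes of logs from distributed storage.

```python
# Tiny illustration of web log analysis: count requests per client IP
# from (hypothetical) combined-log-style lines.
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [01/Jan/2024:10:00:01] "GET /cart HTTP/1.1" 200',
    '10.0.0.1 - - [01/Jan/2024:10:00:05] "POST /checkout HTTP/1.1" 302',
]

hits_per_client = Counter(line.split()[0] for line in log_lines)
print(hits_per_client.most_common())   # [('10.0.0.1', 2), ('10.0.0.2', 1)]
```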
In that case, we collect terabytes of information and must run many different queries and requests against our data. And if we would like to collect information not only from humans but also from smart devices, from other clouds, and so on, then we are actually talking about a big data system. So who is generating big data? As I said before, we can talk about social media, bank transactions, scientific instruments, and so on. Based on this variety of sources, we can say that classic technologies, such as relational database management systems, cannot manage, analyze, summarize, visualize, and discover hidden knowledge from this collected data. So we need new algorithms, new technologies, and new architectures for big data analysis. That is why big data, in short, is a problem of representing a huge amount of data that traditional applications cannot process, or at least struggle to process.
Based on that, we can add additional parameters of big data, such as veracity. If we have a lot of different sources, it is very important for us to understand the quality of this data. For example, the data can contain inconsistencies, outliers, incompleteness, and so on. That is why quality evaluation is the next challenge of big data technologies.
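To make these quality problems concrete, here is a small sketch of veracity checks on an invented table, assuming pandas is available: it counts missing values, flags duplicated identifiers, and marks implausible values with a simple range rule.

```python
# Minimal veracity check on a small invented table: count missing values,
# flag duplicated identifiers, and mark values outside a plausible range.
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "age":        [34, None, 41, 39, 260],   # None = incompleteness, 260 = implausible
})

missing = df.isna().sum()                          # incompleteness per column
duplicate_ids = df.duplicated(subset="patient_id") # possible inconsistency
implausible = (df["age"] < 0) | (df["age"] > 120)  # crude outlier/validity rule

print(missing)
print("duplicate ids:", int(duplicate_ids.sum()))
print("implausible ages:", df.loc[implausible, "age"].tolist())
```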
And the last parameter for today, but not the last of all possible big data parameters, is value. If we have problems with data, for example uncertain data and so on, it is very important for us to understand the value of this data: how much hidden knowledge we can find in this data source, how we can interact with the data, and so on. On the other hand, if we have a lot of different sources, as you can see here, we can also talk about innovative new approaches and technologies for processing this data.
In this context, I would like to present to you two definitions of big data. The first definition is that big data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. Why? Because, as I said before, we have not only structured data, but also streaming data and unstructured data from movies and so on. So traditional database management tools cannot be used for big data processing, because we have complex data. The other definition is that big data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it. So in this case, we need not only new database management tools, but also new architectures and new algorithms. The old algorithms cannot be used for big data because we have a lot of data, together with requirements on the speed and complexity of our queries. So how much actually is big data? Here you can see a very simple example.
For just one flight, we have around one terabyte of information, and there are around 100,000 flights per day. So it is very important to analyze how long this data must be stored, when it must be analyzed, how it must be analyzed, and so on. Here you can find other examples: per minute on the internet, there are around 1,400 Uber rides, around 200,000 photos posted to Instagram, and around 300 hours of video uploaded to YouTube. This is a huge amount of data, and it requires additional algorithms, technologies, and techniques for processing.
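Taking the quoted flight figures at face value, a quick back-of-the-envelope calculation shows why traditional tools struggle with this volume:

```python
# Back-of-the-envelope volume estimate using the figures quoted above:
# roughly 1 TB per flight and about 100,000 flights per day.
TB = 10**12                      # bytes in a terabyte (decimal)
per_flight = 1 * TB
flights_per_day = 100_000

daily = per_flight * flights_per_day
print(daily / 10**15, "PB per day")         # 100.0 PB per day
print(daily * 365 / 10**18, "EB per year")  # ~36.5 EB per year
```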
Based on these parameters of big data, these five V's, we have different points of view on the big data definition, and they allow us to analyze big data from different branches. The first branch covers big data generators: logs, different business processes, scientific measurements, and data generated by people using different mobile devices and social networks. From the competencies point of view, we can talk about the different responsibilities of engineers, such as software development, business understanding, presentation of our data using different visualization techniques, reporting, translation into business language, and so on. We also talk about specific algorithms, in particular machine learning algorithms, for big data analysis. When we talk about applications, we need specific applications for big data, and we can classify these applications by data set placement and by domain, for example healthcare, social modeling, informatics, and so on.
On the other hand, it is very important for us to analyze social trends and the expectations related to big data analysis, the ability to learn something new and important for us, and so on. The next branch is related to data science and engineering: for example, how we can find new knowledge, how we can organize business understanding, which business models can be used for that, and which visualization techniques can be used. The next very important thing is technology. For big data, as I said before, we can use specific technologies such as MapReduce, Hadoop, NoSQL databases, Spark, and so on. We will talk about these technologies in the upcoming lectures.
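To give a first impression of the MapReduce idea before those lectures, here is a plain-Python imitation of its map, shuffle, and reduce phases on a word-count example; this is only a sketch of the concept, not the actual Hadoop or Spark API.

```python
# Plain-Python imitation of the MapReduce idea (word count): a map phase
# that emits (key, value) pairs, a shuffle that groups by key, and a
# reduce phase that aggregates each group. Real systems (Hadoop, Spark)
# run these phases in parallel across many machines.
from collections import defaultdict

documents = ["big data needs new tools", "big data needs new algorithms"]

# Map: each document -> list of (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 2, 'data': 2, 'needs': 2, 'new': 2, ...}
```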
And the last branch is related to the data itself: specific data storage, data flow and data lifecycle, cyber security related to data access and data transfer, and so on. The next point of view is the engineering point of view.
From the engineering point of view, we talk about storage: the capabilities of the storage and the providers of the storage. We talk about specific processing technologies, as I mentioned, for example the Google File System, the Hadoop ecosystem, MapReduce. We also talk about cyber security: who can access this data, when it can be accessed, how it can be accessed, using which protocols, and so on. The next topic is data streaming: first of all, communication, human-to-human or computer-to-human communication, and the capacity of this data streaming. The next very important point is the data lifecycle: when our data must be created, processed, and analyzed, and when we can delete this data because its importance is no longer so high. And the last one is about data computing: in particular, when we talk about big data and sensor data analysis, it is very important for us to also use fog computing, that is, computing based on edge device processing.
Here you can see the big data landscape. This landscape is related to different technologies and services, such as infrastructure services, operational services, analytics infrastructure, and so on; to data sources, such as structured databases and semi-structured data such as logs; to vertical data storage; and to business analysis and analytics, for which we will talk about specific algorithms. In this view, big data can be presented as the intersection of hacking skills, because we must organize access to the data; mathematical and statistics knowledge, because we must analyze this data using machine learning and deep learning algorithms; and substantive expertise, such as data search, data analysis, and so on. And the last part: traditional data science consists of three stages.
The same holds for big data. The first stage is data pre-processing, and we spend around 70% of the whole time on it. Here we talk about data collection, data cleaning, data aggregation, and so on.
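A minimal sketch of these cleaning and aggregation steps, on an invented sales-like table and assuming pandas is available, could look like this:

```python
# Sketch of the pre-processing steps mentioned above on an invented
# sales table: collect (here, construct), clean, and aggregate.
import pandas as pd

raw = pd.DataFrame({
    "store":  ["A", "A", "B", "B", "B"],
    "amount": [10.0, None, 7.5, 7.5, 120.0],
})

clean = raw.dropna(subset=["amount"])          # drop incomplete rows
clean = clean.drop_duplicates()                # remove exact duplicates
clean = clean[clean["amount"] < 100]           # filter implausible values

per_store = clean.groupby("store")["amount"].agg(["count", "sum", "mean"])
print(per_store)
```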
The next stage is related to exploratory data analysis, data visualization, data representation, and so on. And the last one is the actual analytics, the mining of hidden knowledge based on different machine learning models related to clustering, classification, regression, and so on. In the next lecture, we will start from this last part, data analytics, because the methods of clustering, classification, and regression can also be used in the other stages, in the data processing stage as well as in the data pre-processing stage.
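As a small preview of that analytics stage, here is a sketch of clustering and regression on tiny synthetic data; scikit-learn is used here only as one possible tool, not necessarily the one the course will rely on.

```python
# A small preview of the analytics stage: clustering and regression on
# tiny synthetic data with scikit-learn (one possible tool among many).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Clustering: group 2-D points into two clusters.
points = np.array([[1, 1], [1.2, 0.9], [8, 8], [8.1, 7.9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print("cluster labels:", labels)          # e.g. [0 0 1 1]

# Regression: fit y ~ 2x + 1 and predict a new value.
x = np.array([[0], [1], [2], [3]])
y = np.array([1.0, 3.1, 4.9, 7.2])
model = LinearRegression().fit(x, y)
print("prediction for x=4:", model.predict([[4]])[0])
```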