
Data Warehouses Meet Data Lakes


Formal Metadata

Title
Data Warehouses Meet Data Lakes
Title of Series
Number of Parts
112
Author
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
EuroPython 2022 - Data Warehouses Meet Data Lakes - presented by Mauro [Liffey Hall 1 on 2022-07-13] In this talk, I will explain the current challenges of a data lake and how we can approach a modern data architecture with the help of pyspark, hudi, delta.io or iceberg. We will see how to organize data in a data lake to support real-time processing of applications and analyses across all varieties of data sets, structured and unstructured, how it provides the scale needed to support enterprise-wide digital transformation, and how it creates one unique source of data for multiple audiences. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License http://creativecommons.org/licenses/by-nc-sa/4.0/
Transcript: English(auto-generated)
Thank you very much. Before starting, I want to say thank you to EuroPython for inviting me to speak with you today, and also to Paul, Matteo, Amber and Kian for their support today.
So let's start, because I have a lot of stuff to explain to you, and I will go very fast and very directly into today's topic. In the next minutes I will try to explain to you the difference between a data warehouse and a data lake,
what we need to create them, and why we need to create a data warehouse and a data lake in our organizations. To explain data warehouses and data lakes I will use a project that I have led since
2016 or 2017, I don't remember exactly, sorry. It is a project of the European Union, because I work for the European Union, and it is about the labor market. I will start by showing you this project and then I will go into detail about two different approaches and methodologies for the analysis of the data: the top-down approach and the bottom-up approach. At the end of my talk
I will present to you three different technologies. I will not go into detail, but if you want to chat with me offline, we will have a lot of time. These are three different technologies to support us in building a data lake:
Delta.io, Hudi and Iceberg. So let me present myself. I am Mauro Pelucchi, but you can also call me Mauro Pelucci, because that's the typical English pronunciation of my surname, and it's okay. Since 2016 I have led this project for the European Union, mainly for Eurostat, Cedefop and ETF.
Eurostat, I think, is quite well known. Cedefop and ETF are two European agencies that take care of vocational training, so they take care of the development of the labor market and vocational training in the European Union:
Cedefop mainly for the countries that are part of the European Union, ETF mainly for the countries on the boundaries of the European Union like Morocco, Tunisia, Georgia and Ukraine, for example. Currently I work for Lightcast, where I lead
global data science. What do we do? It's quite particular, so I will present it in the next slide. I also teach at university, at the University of Bergamo, because I come from Italy, and at the University of Milano-Bicocca. These are my contacts if you want to reach me on Twitter or by email.
Feel free to write to me, also on LinkedIn. The presentation is public, so if you go to this GitHub repository you will find my slides. I will probably update the slides after the talk, but of course you can find the latest version in this GitHub repository; you can take them and share them.
Feel free to use the slides as you want. So let's start with today's topic. The project that I want to present to you, to explain data lakes and data warehouses, is a real-time labor market information system on skill requirements. It's a long title, I know.
Let me explain what we do. We try to explain the evolution of the labor market. Why? Because five or six years ago the European Union understood that the labor market is in continuous evolution, that official statistics are not enough to explain this evolution, and they called us to build something.
They wanted to track these topics: the evolution of professions, the evolution of skills, the impact of globalization, and also some new topics of the last two years,
which were terrible for the labor market because, you know, we were in a pandemic, we have the Ukrainian war, and we have a lot of new regulation in the European Union about the green transition. All of these topics, of course,
change the labor market. So how did we try to address this request? We tried to create a system that helps to understand this evolution, a data warehouse about the labor market. Of course we started from official statistics, mainly surveys,
but these official statistics were not enough. Why? Because they lack a lot of information, they are not fresh, and statistics usually don't speak the language of the employer and the language of the employee. So we started from the web: in 2016 we started to collect data from the web, mainly
online job advertisements from all over the European Union, to build the data warehouse. At the end we have a data warehouse; I have worked with data warehouses for 20 years, so it's not a new topic for me. From the data warehouse, at the end, we release insights and analytics,
mainly dashboards, and with these dashboards European Union agencies, training providers and universities can address some questions about the labor market. This is a link to the tool; it is a public tool, you can access it.
It is built on top of Tableau dashboards, and under the hood we have a data warehouse with all the dimensions of the labor market. What we do is quite simple. Last year I was here remotely and I presented how we collect data by scraping: mainly we scrape data from the web, from online job portals,
from job boards, from public employment services sites, mainly by scraping, but of course we take care of the quality of the data. We create a big data lake with all the data that we have, and on top of this data lake
we release data into the warehouse. So at the end we have a data warehouse inside our organization that provides data to our stakeholders. That is what we do. Of course, to move the data from the data lake to the warehouse we apply machine learning and AI techniques, because the data that we collect from online job advertisements are quite varied.
You know, speaking about my own profession: you can call me a data scientist, but in some organizations I am a data engineer, in some organizations I am a senior data scientist, in others I am a head of data science.
All these job titles are quite similar, and it's difficult for a decision maker to analyze this information if we don't normalize the data. So what we do is first create a data lake with all the information, and then we produce the data warehouse.
This data warehouse is not public, of course, but all the national statistical institutes can access it to release statistics, information, reports and so on. And then we have a lot of technology: mainly we use Spark and PySpark to process the information, and we use a lot of machine learning to normalize the data
and to build the data lake. Of course we don't use only one technology, we use a bunch of technologies, because for each question we need to choose the best one. For example, we use AWS S3 to store the information because we need to reduce the cost; we use
Neo4j and Elasticsearch to release some data to our stakeholders; we use Tableau for visualization; in some cases we only use Python and matplotlib. It depends on the question.
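As a rough illustration of this kind of stack (not the project's actual code; the bucket name, paths and columns are made up), a PySpark job on EMR that writes an aggregated slice of the data lake to S3 as compressed columnar files could look like this:

```python
# Minimal sketch: read cleaned postings from S3, aggregate, write back as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-postings-export").getOrCreate()

postings = spark.read.parquet("s3://example-bucket/clean/job_postings/")  # hypothetical path
counts = postings.groupBy("country", "occupation_code").count()

# Columnar, partitioned output keeps storage and query costs down.
counts.write.mode("overwrite").partitionBy("country").parquet(
    "s3://example-bucket/warehouse/occupation_counts/"  # hypothetical path
)
```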
So let's return now to the presentation, because what we need to understand is why we need both a data lake and a data warehouse. When I teach at university, a lot of students search on the web, on Google, and say, okay, data warehouses are dead, we don't need to create a data warehouse. But really we do need a data warehouse, because we need quality data; and we also need a data lake, because the data warehouse and the data lake are for different questions. So let's start with the topic.
Really are four different questions. So let's to start with the topic I have a question for you. It's not my question. I took this question from a presentation of Cook here in 2014 so from this deck store you can you can find the end of the web
So, in your opinion, what is America's favorite type of cake? Apple pie, thank you. Thank you, it is the same response as ten years ago, eight years ago, sorry; we get the same response.
It is a beautiful question for me, because I asked you what is America's favorite type of cake, and in your mind you are applying the typical BI process, the business intelligence process. Let me recap the process in your mind. Of course you collect the data; in this case
usually we use retail sales data. To respond to this question, in your mind you are applying techniques to organize and aggregate the information; we can call this bunch of steps ETL: extract, transform and load.
You build the warehouse in your mind with the list of the types of cake, apple pie and all the others, and then you use the turnover to rank them and take the first one: apple pie, right? Okay, so this is a typical BI process.
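Just to make the toy example concrete, a minimal PySpark version of that "favorite cake" ETL could look like the sketch below (the file name and column names are invented for illustration, not taken from the talk):

```python
# Toy "favorite cake" BI process: extract retail sales, aggregate, take the top item.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cake-etl").getOrCreate()

sales = spark.read.csv("retail_sales.csv", header=True, inferSchema=True)  # extract

top_cakes = (
    sales.filter(F.col("category") == "cake")           # transform: keep only cakes
         .groupBy("product_name")
         .agg(F.sum("units_sold").alias("total_sold"))  # aggregate the turnover
         .orderBy(F.desc("total_sold"))                  # rank
)

top_cakes.show(1)  # the first row would be the "favorite" cake, e.g. apple pie
```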
If you go back to 2004, Solomon described the BI process as a bunch of processes in our organization that take care of the collection of the data, the cleaning and the improvement of the quality of this data, the storage of this data in the warehouse, and the release of answers for the decision maker. So this is the typical BI process,
but this process usually, or rather always, starts from a question. Starting from a question, we select the source systems relevant to that question,
we select the data from an ERP, an enterprise resource planning system, to respond to the question, we collect the data, we store the data somewhere, we provide the data and we create an application. This is a typical BI process, and I
reuse this process to take care of the labor market data, because this is what I do for the European Union, of course not with types of cake but with 100 terabytes of data. It is the same: I collect the data from online job websites,
I try to address a lot of issues about the quality of the data that we collect, I store the data, and then I create something to provide the data to my stakeholders. Of course, my stakeholders are quite complicated because I work with a lot of politicians, but I try to simplify the dashboards for them because they need this data to
improve our training systems, our labor market systems and so on. So it is a typical top-down approach: we start from a question, we select the data to address this question, and then we build something to produce answers.
But now I need to return to the beginning, because the question was "what is America's favorite type of cake", and we need to understand the difference between "favorite" and the retail sales data. In our mind we run this process, but I don't know.
For example, one or two years ago, during the pandemic, I needed to buy a printer because I have three children at home and I didn't have a printer before, and I chose a printer: not the best one, just the one that was available.
So, coming back to the question, I don't know if "favorite" is the same as the retail sales data. I don't know. Of course, now we have a lot of new data and a lot of new questions. It is not the same as 20 years ago, when I started to work and I had only
surveys, surveys, surveys. Now we have a lot of new questions coming from our decision makers, our colleagues, my wife and so on, and we have a lot of new sources. Of course, we need to understand what the question is and what sources we can use.
So we need to change the approach: not the top-down approach, starting from a question, but a bottom-up approach.
Starting from a question is fine because we can go quickly, but we need to understand that we may also have to address new questions. So we can use a bottom-up approach: we can store everything now, not only the data we planned for, but also social media data, web data, data coming from Amazon and e-commerce.
We can store everything now and then use machine learning, AI and pattern recognition to extract new insights for our stakeholders. Let me give you an example. I started in 2016 to collect data about the labor market.
The first question from the decision makers was: what are the top five occupations, the professions most requested in the European Union? Okay, so I built a machine learning model, I tried to normalize the job titles, and I had to build this machine learning model for 45 different languages, because in Europe
there is not only English but also Spanish, Italian, Portuguese, Basque, Catalan, Galician and so on. So it was not easy, but now it is in our data warehouse. Last year, the same decision maker asked me: what is the demand for remote work?
So a new, different question. Of course, using this approach I started to use the raw data that I had stored over seven years of the project, and I found that remote working,
before February 2020, was almost absent in Europe, and now it is present. Speaking about the green transition: how do we measure the impact of the green economy on the labor market? Because we have the data now, I can respond to my decision maker,
not through the data warehouse, because the data warehouse is built with fixed dimensions: occupation, contract, working hours, employer, salary. These new questions are different. Of course, if we use a bottom-up approach we have a lot of challenges to address, starting from the use of machine learning.
We need a way to integrate data, AI, machine learning and tools for analysis. We need a way to introduce any data and any source, because online job postings are okay, but now we also have job platforms:
I can collect data about the labor market from Facebook, Amazon, Twitter; I'm starting to watch Twitch as well, because it's another channel. Three years ago I found a job advertisement inside the HTML code of a company's website: if you went to the source code of the page,
there was a job advertisement, of course for technicians. So when I collect the data I need to take care of the evolution of the sources, and of course a job advertisement hidden in HTML code is not like the job descriptions that I usually find on the typical job boards.
There is of course also the question of the accessibility of the data. Why? Because when I provide the data to the decision makers, for example, they can access only the final data: I don't want them to access my kitchen, where I prepare the data.
I don't want that. But of course, since I have a lot of good data scientists, my colleagues, I want them to access the data and play with it, because they can find new patterns, new metrics. Speaking about new metrics, one month ago
one of my colleagues started to play with the data, started to play with a model that is not new in itself but is new in our context: a model coming from econometrics called economic complexity. We found something interesting, we presented it to my
economics colleagues, it was okay, and we are working on it. So I want the data scientists to have access to the data, but I don't want the decision makers to access the raw data, because they cannot understand its complexity. So we need data warehouses. We need data warehouses because they are like the signals that we have in our organization.
We need these signals to understand the state of the organization, because data warehouses are clean, integrated and subject-oriented. So we don't have
raw, open questions in the data warehouse: data warehouses are built to respond to predefined questions. Of course we also need a data lake. Why? Because in a data lake we can find a lot of new insights, new questions. Of course, data lakes are more complex than data warehouses, because we have machine learning:
we need to prepare the environment to be ready for machine learning, we need to give access to the raw data, and we need to be schema-free. Please, not schema-less, but schema-free, because every piece of information has a schema, sooner or later.
So we need these characteristics, and we also need tools and techniques that support rapid change of the information. I love this representation of a data lake; I found it on the web and I don't know who prepared it, but I love it because it recaps my 20 years of experience
in a single picture. As you can see, we have data coming from ERPs, we have a data lake where we can store the raw data and where we can run our processes to improve the quality of the data, and we have the data warehouse.
As you can see, we have the access zone where the data are available. So it is the same as the BI process that I presented ten minutes ago; it's the same, only the name has changed to "data lake". The name changes because the approach changes: we have the bottom-up approach, we don't start from a predefined question,
but we also let the data speak. We leave the data available for our data scientists and our data analysts. Speaking about the process, this is a recap of what we do in my organization for the European Union.
As you can see, we ingest the data from scraping, we have some stages that clean the data and apply some pre-processing techniques, and then we apply a lot of machine learning to extract each single piece of information: we extract the occupation, we try to extract the skills requested, and we have 30 different dimensions,
the classical dimensions of the labor market, and with these we build the data lake. We have 100 terabytes of compressed columnar data, so it's difficult to handle change, also because every six months the occupation code list changes and we need to reprocess all the data.
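A hedged sketch of what one such cleaning and classification stage could look like in PySpark is below; the paths, column names and the placeholder classifier are illustrative assumptions, not the project's real pipeline:

```python
# Sketch of one pipeline stage: deduplicate raw scraped postings, clean the title,
# attach an occupation code, and append to the columnar data lake.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("lmi-stage").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/scraped/2022-07/")  # hypothetical landing zone

def classify_occupation(title):
    # Placeholder for the real multilingual ML classifier described in the talk.
    return "0000" if title is None else title.strip().lower()[:4]

classify_udf = F.udf(classify_occupation, StringType())

clean = (
    raw.dropDuplicates(["url"])                              # crude dedup by URL
       .withColumn("title", F.trim(F.col("title")))
       .withColumn("occupation_code", classify_udf("title"))
)

clean.write.mode("append").partitionBy("country").parquet(
    "s3://example-bucket/lake/postings_classified/"          # hypothetical path
)
```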
In this case we use a particular technology that I will present in a minute. Of course, this area is available, but only to me and my data scientists. Then, every month, we copy the data.
Really, it is a copy-and-filter: only the data that pass our quality criteria are moved to the data warehouse. And this is not the end, because before the data are available to data analysts, decision makers and also citizens, we apply some validation rules. Why validation rules? Because in this case we don't have control over the sources of
the data: we don't have control over the 35,000 different websites that send us the data. So every month we have an AWS batch process that applies statistical tests,
distribution checks and so on to the data. If we detect an issue, we stop the release of the data to the citizens, we look at the data, we understand whether the change is okay or not, and then we release the data.
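As a rough idea of such a monthly validation rule (the paths, columns and the 50% threshold here are assumptions for illustration, not the project's tuned checks), a distribution check could compare the new month against the previous one and block the release on a large deviation:

```python
# Sketch of a release gate: flag countries whose posting counts changed suspiciously.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("release-checks").getOrCreate()

prev = spark.read.parquet("s3://example-bucket/warehouse/2022-06/").groupBy("country").count()
curr = spark.read.parquet("s3://example-bucket/warehouse/2022-07/").groupBy("country").count()

check = (
    prev.withColumnRenamed("count", "prev_n")
        .join(curr.withColumnRenamed("count", "curr_n"), "country")
        .withColumn("rel_change",
                    F.abs(F.col("curr_n") - F.col("prev_n")) / F.col("prev_n"))
)

suspicious = check.filter(F.col("rel_change") > 0.5)  # arbitrary 50% threshold
if suspicious.count() > 0:
    suspicious.show()
    raise SystemExit("Stop the release: distribution check failed")
```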
So, speaking about technology: the data warehouse and data lake that we build are based mainly on two technologies. Delta Lake, delta.io, is a technology mainly supported by Spark, because under the hood there is Databricks; and there is Hudi. Apache Hudi is a public project; delta.io is public too, but in a different way.
Apache Hudi is mainly maintained by Uber and the Apache Foundation. We use these two different technologies to build our data lake. And there is also Iceberg: currently I'm not using Iceberg, but it's a beautiful tool, so I also want to present it in one slide. Iceberg is mainly the same kind of
technology, of course with a different approach, coming from Netflix and AWS. So why are we using delta.io? Because we need a library that can help us merge the data, mainly insert and update the data, because every month we may reprocess our entire data lake.
With delta.io we found a tool integrated with Apache Spark that was also easy to integrate on AWS, because we are using AWS EMR to release the data, to process the data, to classify the data.
So we found in delta.io a beautiful technology that helps us merge the data, because delta.io also supports time travel, schema enforcement, and changing the data inside a partition.
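A minimal sketch of the kind of Delta Lake merge and time travel being described, using the delta-spark package (the paths and the key column are illustrative assumptions, not the project's actual tables):

```python
# Upsert reprocessed postings into a Delta table, then read an older version back.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-upsert").getOrCreate()

target = DeltaTable.forPath(spark, "s3://example-bucket/lake/postings_delta/")
updates = spark.read.parquet("s3://example-bucket/staging/reprocessed/")

(target.alias("t")
       .merge(updates.alias("s"), "t.posting_id = s.posting_id")
       .whenMatchedUpdateAll()      # update postings that were reprocessed
       .whenNotMatchedInsertAll()   # insert brand-new postings
       .execute())

# Time travel: read the table as it was at an earlier version.
old = spark.read.format("delta").option("versionAsOf", 0).load(
    "s3://example-bucket/lake/postings_delta/"
)
```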
But really, what I was looking for years ago was a technology that helps us with insert, update and delete, because, you know, classical file formats like Avro, Parquet and ORC don't support merges and updates. Apache Hudi is another technology that we use. Why? Because with delta.io
we had some issues with bulk processing of the data, because it is quite slow in the case of big bulk processing, and because the maintenance tasks, the compaction of the table and the vacuuming of the table, are not automatic.
So with delta.io we had separate processes to support the maintenance, and we needed to take care of this stuff ourselves. With Apache Hudi we simplified this approach last year, because Apache Hudi has automatic compaction and exposes different views of the data: we have a read-optimized view,
with more latency but less cost, and we also have a real-time view, with less latency in this case but more cost. With these different views on the data
we reduce the processing cost and the processing time; we are roughly ten times faster now. Why? Because Apache Hudi is designed for batch ingestion.
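A hedged sketch of a Hudi merge-on-read table with the two query views just mentioned is below; the paths, field names and values are illustrative, and exact option keys can vary by Hudi version:

```python
# Upsert into a merge-on-read Hudi table, then read it through the two views.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()
updates = spark.read.parquet("s3://example-bucket/staging/reprocessed/")

hudi_options = {
    "hoodie.table.name": "postings",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "posting_id",
    "hoodie.datasource.write.precombine.field": "scraped_at",
    "hoodie.datasource.write.partitionpath.field": "country",
}
updates.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://example-bucket/lake/postings_hudi/"
)

# Read-optimized view: cheaper but slightly stale; snapshot (real-time) view: fresher, costlier.
ro = (spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "read_optimized")
      .load("s3://example-bucket/lake/postings_hudi/"))
rt = (spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "snapshot")
      .load("s3://example-bucket/lake/postings_hudi/"))
```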
In our case it is important to reprocess the data every month or every two months, and we apply all the machine learning models to 100 terabytes of data, more or less. In our data warehouse now we have 400 million job postings, and we also keep the HTML, the description and so on. So with this
optimization for batch ingestion we simplify the reprocessing a lot, and all of the compaction and vacuuming of the tables is integrated. Apache Iceberg is another good technology. I will mention only one point about Apache Iceberg:
with Iceberg we don't have a dependency on Spark, so in this case you can also use other technologies. Speaking about the main differences, I recap them in this slide so you can look at it if you want. Only one piece of advice, because I have 20 years of experience: take a look at the contributions.
Because when you choose an open solution you also need to take care of how it is supported. As you can see, the issue with Delta, in my opinion, is that it is mainly supported by Databricks.
It's good, but you have only one company that currently supports its evolution. Second, I'm moving now to Hudi. Why? Because it is well integrated with AWS, there is a lot of support and a lot of contributors.
It's usual: when I speak with my younger colleagues, they want to use the newest technology. But data lakes and data warehouses are long-term decisions; you have to take a longer view on what you are building, because you cannot change the data format every two months. So four years ago I chose delta.io, because it was the only one
that was really ready for production. Now I am moving, because in my opinion the one with more contributors will be better in the future. So thank you, everyone; if you have any questions, please use the mic.
So thanks, Mauro, very impressive setup you have there. We have time for five minutes of questions. If there are any in the hall, please come to the mic, and if we have any remote questions, you'll let me know. Great talk, thank you.
I have a question. You said that you have two types of data: you have census-like data, which is very strong, high value, and then you have all this other data, and you mentioned that you use statistical tests to clean your data before you publish it in the data warehouse. Can you go a little bit more into detail on how you use the first kind of data for that?
Okay, so, about the data and the quality process: when we build our data lake it is important, of course, to store everything. In our data lake we have two main areas: we have a raw data area where we store everything as-is, and then we have a cleaned data area,
not ready for the citizens but ready for the data scientists. What does that mean? In this area we apply some classical processing techniques; mainly I apply
two techniques, because the main issue that we have with job data is the duplication of data: I can find the same page on a lot of different websites. So what we apply in this case is, first, a model that is like a spam model, like your email spam model,
which for each page that we collect decides whether the page is a real job posting or a fake job posting. Of course we don't delete the fake ones: we remove them from the data warehouse, but we store them in another table, because, for example, three years ago one European agency asked me to release a
report about the informal economy, and in that case I used the fake job postings to answer the question. So mainly we use this model; second, we use a deduplication model. The deduplication model is built with a sort of fuzzy matching between the HTML code and the job description;
a second deduplication model applies machine learning, because in some cases we have the same job posting published in different languages, for example. Just to give you an idea of the numbers: of the pages we collect in the European Union, only a fraction are really distinct job postings.
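A minimal sketch of the fuzzy-matching style of deduplication just described, using the rapidfuzz library (the threshold and the example texts are arbitrary illustrations, not the project's tuned values):

```python
# Treat two job descriptions as duplicates if their token-set similarity is high.
from rapidfuzz import fuzz

def is_duplicate(desc_a: str, desc_b: str, threshold: int = 90) -> bool:
    """Fuzzy-match two descriptions; returns True above the similarity threshold."""
    return fuzz.token_set_ratio(desc_a, desc_b) >= threshold

a = "Data engineer - build pipelines with Spark and AWS, remote possible"
b = "Remote possible: data engineer to build Spark/AWS pipelines"
print(is_duplicate(a, b))  # likely True for these near-identical postings
```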
Thank you for the description of your process in your talk. I had a question in terms of data quality: could you comment on the differences?
Obviously, if you have sampling biases or other biases within your data, that's going to affect the insights you get from your analysis, and obviously the kind of bottom-up data lake approach helps you to
be less biased in posing, let's say, hypotheses. Yes, a beautiful question, thank you, because it's very difficult for me to address it. I will try in a few minutes and then we can chat afterwards. Speaking about the classical data that we have from a survey or from a company:
we have an idea of the population. If I ask you, if I ask Paul, how many participants there are in this room, we have a count at the door and we have an idea of the population. Speaking about the job data,
we don't have an idea of the number of job postings in the European Union; we don't have a real number. So when the decision makers ask me, okay, do you have all the data, all the job postings? No, because I only have the online job postings. But do you have all of the online job postings? I don't know, because I collect the data from
35,000 different websites, but I don't know how many websites with job postings there are in Europe. Is that a good number, could there be a better number? I don't know. So usually we compare the data offline, in this case, with classical statistics. Speaking about statistics,
we have different types of data, because in this case I have flow data while classical statistics are more or less stock data, so we use techniques to try to estimate a sort of representativeness of the web data that we have in comparison with the labor market.
It's something that we are trying to address this year. Thanks a lot, Mauro. We have to move on to the next talk, so please put your hands together for a very interesting talk.