
A Processing Pipeline For European Official Statistics: Towards Standardisation Of Mobile Network Operator Data Processing


Formal Metadata

Title
A Processing Pipeline For European Official Statistics: Towards Standardisation Of Mobile Network Operator Data Processing
Title of Series
Number of Parts
156
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Disclaimer: The views in this abstract are those of the authors and do not necessarily reflect the position of the European Commission (EC) or national statistical institutes.
Abstract: The European Statistical System (ESS) - the partnership between the EU statistical authority (Eurostat) and national statistical institutes (NSI), and other statistical authorities in the European member states - considers Mobile Network Operator (MNO) data as one of the most promising new data sources for future statistical production. The production of official statistics based on MNO data has the potential to provide considerable societal value. In this context, the ESS emphasises the need for standardised reference methods adhering to the principles of statistical production, such as quality, privacy protection, and transparency. In line with the ESS Innovation Agenda, following an open call for tenders, in December 2022, Eurostat awarded the service contract "Development, implementation and demonstration of a reference processing pipeline for the future production of official statistics based on Multiple Mobile Network Operator data (TSS multi MNO)". The project is a significant milestone towards the future reuse of MNO data for the production of official statistics at EU level. The goal of the project is to develop a complete, open end-to-end processing pipeline that should serve as a starting point towards the regular production of future official statistics based on MNO data Europe-wide. This "processing pipeline" encompasses a combination of a fully documented open methodological and quality framework, plus the implementation of a reference open-source software pipeline compliant with the said framework. The processing pipeline will be demonstrated across data from multiple MNOs. If successful, the reference pipeline developed by the project will be proposed for adoption by the ESS as a methodological standard.
The project is being implemented by a consortium providing extensive experience from both the business and the official statistics domains. The consortium is composed of GOPA Worldwide Consultants GmbH (DE) - lead, Nommon Solutions and Technologies SL (ES), Positium OÜ (EE), Statistics Netherlands (NL) and the Italian Statistical Institute (IT). Additionally, five European MNOs from four distinct countries will be involved in the pipeline testing. This collaborative endeavour aligns with the European Data Strategy's goal of providing comparable and reliable statistics across European countries. The project addresses the challenge of providing open and standardised methodologies for official statistics without hampering the development of future private initiatives nor the continuation of the range of analytic products based on MNO data that have been developed and commercialised by mobile operators or other third-party entities for purposes other than European official statistics. While the project is financed by Eurostat (the EU statistical office), its ultimate success will depend on the potential endorsement of the project result by the larger ESS community (integrating all EU statistical offices and other national authorities). It is expected that this will have positive implications for future activities and may serve as a model that can be replicated in other domains, along with seeking closer collaboration with industry or business partners, more in general, in the context of initiating or strengthening co-development undertakings for the production of official statistics. This contribution will focus on the presentation of the overall pipeline architecture and the description of an initial version of the processing pipeline. The architecture design will adhere to the highest technical requirements and methodological soundness. 
The proposed pipeline considers the division between data processing at the MNO environments and additional processing steps at the NSI or other parties. The software will be divided into modules for (1) the processing of disaggregated data exclusively at each MNO's secured environment, and (2) the post-processing of aggregated and anonymous data at national statistical offices. The latter is particularly relevant since the post-processing will be performed on aggregated data after the application of statistical procedures, such as Statistical Disclosure Control (SDC), that ensure that individual data cannot be referenced back. Comprehensive documentation, including functionality, implementation details, and usage instructions, will accompany the software. Reference test data, consisting of synthetic or semi-synthetic samples, will be created for each software module to ensure reproducibility and ease the development of alternative but fully compliant software implementations by independent entities. The entire open-source pipeline, including the codes and related documentation, as well as the methodological framework, will be openly published. The software codes will be published under an EUPL license, promoting transparency and accessibility, facilitating the replication and adoption of the developed software solutions, and encouraging collaboration and further advancements in the field of statistical production. The reference implementation of the pipeline will be public, and results will be communicated to interested audiences through public official channels.
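As an illustration of the Statistical Disclosure Control step mentioned in the abstract, here is a minimal sketch of one common SDC technique, minimum-count suppression. The abstract does not say which SDC method the project uses, so the technique, the threshold `MIN_COUNT` and the function name `suppress_small_counts` are all assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical threshold: the abstract does not state one.
MIN_COUNT = 10

ZoneDay = Tuple[str, str]  # (zone identifier, date) used as an illustrative key


def suppress_small_counts(
    counts: Dict[ZoneDay, int], min_count: int = MIN_COUNT
) -> Dict[ZoneDay, Optional[int]]:
    """Replace every count below min_count with None before the data leaves the MNO.

    Small cells are the ones most likely to point back to individuals,
    so they are withheld rather than published.
    """
    return {
        key: (value if value >= min_count else None)
        for key, value in counts.items()
    }


if __name__ == "__main__":
    # Made-up aggregated indicator: number of devices per zone and day.
    aggregated = {("zone_A", "2024-01-01"): 1520, ("zone_B", "2024-01-01"): 3}
    print(suppress_small_counts(aggregated))
```

In a setup like the one described, a step of this kind would run inside the operator's environment, so that only the suppressed aggregates reach the statistical office.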
Transcript: English (auto-generated)
Thank you. I hope you can hear me well. Nice to see you here this late afternoon. I know that you are all quite tired already. I have also enjoyed the conference very much, and this is the fourth day already. So let me start now.
My name is Marco. I come from Positium. Positium is a company located in Tartu, a spin-off from the University of Tartu. Positium has been analyzing mobility data, or location data, for almost 20 years already.
And of course we have been using different kinds of data and data sources, but one of the main ones has been mobile network operator data.
You may already be wondering whether this is about tracking, so let me say by way of introduction that we don't study the movements of individual people.
It's rather about understanding the movements of groups of people. Today's presentation is specifically about creating standards inside Europe:
standards which all operators and statistical offices can follow, where all the privacy issues and technological issues are put into a single framework, so that everybody who sees the results later can understand them the same way.
And maybe just to start off with some examples: why mobile network operator data? From here on I will refer to it as MNO data.
So far it has mostly been used to understand tourism: for example, roaming subscribers from other countries inside a domestic operator's network, and also domestic or local people going abroad, who went to Finland,
who went to Latvia, for example. It's also good for understanding and planning traffic and public transportation. We have quite a good example from this same city, here in Tartu,
where mobile network data was one of the sources used to understand which bus stops actually matter, and the bus lines were then shifted or rearranged.
Of course you can never get a perfect transportation system, but this is still one of the ways of understanding the hotspots where people are moving or gathering.
And of course the geographical distribution of the population. This is a deeper topic, going into the census and these kinds of things, but the foundation is already there to use it in the future.
And of course crisis management: the rescue board, for example, wants to understand where people are actually present at a given point in time or in some specific time slot, or how many of them there are on a larger scale.
And also, yeah, we can also say that the mobile positioning data or the MNO data is not that precise. Of course it's more precise inside of cities where there are more network towers but of course it's less precise like outside of urban areas maybe.
But over these years we have understood quite well how difficult it is to work with different kinds of operators
all over the world, including Estonia. They all have different systems, and they all have a different understanding of how to extract this kind of raw data and make it suitable for the scripts that produce meaningful indicators.
And of course the statistical offices also have their own understanding. So we know the pain quite well, let's say, and the approach of this current project is a really good one.
And maybe just one thing to highlight: the map on the right side, showing Estonia, is one of the good examples, here for commuting. It doesn't show you the precise locations where people are moving,
but rather the movements between municipalities, cities and so on. What we have also done with mobile data, to get a better understanding of, for example, how much the local roads
between different areas have been used, is to merge it with a road network layer, just a regular GIS component, the road data, to understand which roads people most probably took.
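The commuting map described above is, in essence, an origin-destination count between municipalities. A minimal sketch of that aggregation, assuming each anonymous device has already been assigned a home and a work municipality (the talk does not describe how this assignment is done, and the function name and sample values are made up for illustration):

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Flow = Tuple[str, str]  # (home municipality, work municipality)


def commuting_flows(anchors: Iterable[Flow]) -> Dict[Flow, int]:
    """Count devices per (home, work) pair, ignoring those that stay in one municipality."""
    flows: Counter = Counter()
    for home, work in anchors:
        if home != work:
            flows[(home, work)] += 1
    return dict(flows)


if __name__ == "__main__":
    # Made-up sample: one (home, work) pair per anonymous device.
    sample = [
        ("Elva", "Tartu"),
        ("Elva", "Tartu"),
        ("Tartu", "Tartu"),
        ("Tartu", "Tallinn"),
    ]
    for (home, work), n in sorted(commuting_flows(sample).items()):
        print(f"{home} -> {work}: {n}")
```

Only the counts per municipality pair are kept, which matches the point made in the talk that the map shows movements between areas and not precise individual locations.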
Okay, so moving forward. Basically repeating what I was already saying: the point is to have a common understanding, and also willingness on the MNO side, the operator side,
to work with the statistical offices. The GDPR started to apply in Europe exactly six years ago, so a lot of new regulation came, and the mobile operators are not really happy
or willing to give out the data or let others process it, which in some sense is of course the correct thing for them to do.
But on the other hand they have a huge amount of this kind of passive sensor data, because all mobile devices are actually just sensors, and it could be used for everybody's benefit: better results, a better urban environment,
better bus stops and these kinds of things. The project is led by Eurostat, which is basically the European-level statistical office.
And we have several partners. GOPA, a consulting company, is actually leading it, and Positium, along with Nommon, is one of the two technical partners who are actually building this framework
and software in real life, so that the software framework can be installed on the MNOs' premises.
Here in the diagram we already see that we have five mobile operators: Post Luxembourg, A1 Slovenia, Vodafone Italy, Vodafone Spain and Orange Spain. These are the five operators who have
agreed and given permission to participate actively in this project, so all the tests and the things we need to validate, agree on
and put in place will be happening inside these five operators. We also have different statistical offices and consulting partners, because after the data has been processed
within the mobile operator's network or premises, the aggregated indicators will move on to the statistical departments, and that is another level of what will happen with this data.
So maybe just to highlight the image on this next slide: you can see that there are many components which need to connect to make it happen
and to put the raw data into a meaningful form so that conclusions can be drawn from it. One of them is legislation and privacy and these kinds of things,
and another is partnerships with the mobile network operators, because this is not a one-time project. It's meant to be something that keeps running constantly, otherwise it doesn't make sense in the long term.
At this point we are focusing on the standard methodology, marked with the red circle, and the other parts are covered by other projects in the European system.
So basically what we are trying to achieve within these two years, or a bit more than two years, is to have the software,
or at least to some extent the framework, put in place, to run the software processing successfully on the operators' premises, and also to validate that when this package of methodology
and software is given to the operators, they can manage it without any specific background
and without having tens and tens of specialists working on the same project: we give them the software, and they can successfully run it and export the statistical indicators. Of course we are trying to cover different use cases,
like population, commuting, tourism and so on, and we are hoping that after this specific project, within five or six years,
it will be applied as a new working framework, also at the European Commission level. And of course we have many challenges of this kind,
just to ensure flexibility and adaptability, and how these all work together. So we are working in parallel on all of these,
let's say, twelve different topics, but at the end of the day they overlap in many cases. Some of them are, let's say, precautions, things that you can prepare
once and that are then set, for example the privacy issues and these kinds of things. But of course there are many things that will be ongoing,
for example quality assurance, because technology is changing and the data will be changing in the future as well. Operators have really different kinds of internal network frameworks, and in many cases this is also something
that needs to be taken care of in the future. And the methodology as well: of course the methodology can be put in place, but we still expect it to be something that will evolve in the future.
So this is the high-level diagram. It basically means that we have MNO data on different levels. We have the network data: for example, we have a network tower
somewhere, it has geographical coordinates, and later this tower information is applied to a network event. We don't distinguish between network events such as an SMS, a call or whatever; we just treat them in an abstract way.
Every event in the network gets coordinates, and after that it can be located in some area. So we basically work with different kinds of data sets:
as I mentioned, the MNO data itself; the contextual data, which could be for example country borders, city borders and these kinds of things, different kinds of reference data, and it could also later be merged with population data
and so on. And separately, of course, we have the configuration, which is important for defining what you actually want to export from these calculations.
So here is just an example of one of the ways to assign a location to a network event. Some events may have coordinates, like latitude and longitude,
but usually these are just the coordinates of the cell tower. Every cell has its direction and its signal strength, and all these things are taken into account to place people in some geographical area.
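To make the idea of assigning a location to a network event concrete, here is a rough sketch under strong simplifying assumptions. It joins each event to its cell and shifts the tower position along the antenna direction by half of an assumed cell range. The talk also mentions signal strength, which this sketch ignores, and the `Cell` fields, the half-range rule and the flat-earth approximation are illustrative choices, not the project's method.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

METERS_PER_DEGREE_LAT = 111_320.0  # rough average, good enough for a sketch


@dataclass(frozen=True)
class Cell:
    lat: float          # tower latitude in degrees
    lon: float          # tower longitude in degrees
    azimuth_deg: float  # antenna direction, clockwise from north
    range_m: float      # assumed typical reach of the cell (hypothetical field)


def estimate_position(cell: Cell) -> Tuple[float, float]:
    """Shift the tower position half the assumed range along the antenna direction."""
    distance = cell.range_m / 2.0
    azimuth = math.radians(cell.azimuth_deg)
    dlat = distance * math.cos(azimuth) / METERS_PER_DEGREE_LAT
    dlon = distance * math.sin(azimuth) / (
        METERS_PER_DEGREE_LAT * math.cos(math.radians(cell.lat))
    )
    return cell.lat + dlat, cell.lon + dlon


def locate_events(
    events: Iterable[Tuple[str, str]], cells: Dict[str, Cell]
) -> List[Tuple[str, float, float]]:
    """Attach an estimated position to each (device_id, cell_id) event.

    Events that reference an unknown cell are skipped.
    """
    located = []
    for device_id, cell_id in events:
        cell = cells.get(cell_id)
        if cell is None:
            continue
        located.append((device_id, *estimate_position(cell)))
    return located


if __name__ == "__main__":
    cells = {"c1": Cell(lat=58.38, lon=26.72, azimuth_deg=90.0, range_m=1000.0)}
    print(locate_events([("dev1", "c1"), ("dev2", "unknown")], cells))
```

The estimated point would then be matched against the contextual data, such as municipality borders, to place the event in an area.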
And of course, as I mentioned, quality is really important. There is a lot of data cleaning, because there are missing columns, missing values and all kinds of things which don't make sense; you just have a package of noise which is not worth processing,
because we are still talking about terabytes and terabytes of data which need to be processed. For this project we will probably select somewhat smaller areas,
just to get the results faster, but in the long run we are looking at daily processing, so the activity of daily events and also long-term events.
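The cleaning step mentioned above can be sketched as a simple filter that drops events with missing columns or missing values and reports how many were dropped. The field names in `REQUIRED_FIELDS` are hypothetical; the talk does not list the actual schema.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical minimal schema for a network event.
REQUIRED_FIELDS = ("device_id", "timestamp", "cell_id")

Row = Dict[str, Optional[str]]


def clean_events(rows: Iterable[Row]) -> Tuple[List[Row], int]:
    """Keep rows where every required field is present and non-empty.

    Returns the kept rows and the number of dropped rows, so the share of
    unusable input can be tracked as a quality indicator.
    """
    kept: List[Row] = []
    dropped = 0
    for row in rows:
        if all(row.get(field) not in (None, "") for field in REQUIRED_FIELDS):
            kept.append(row)
        else:
            dropped += 1
    return kept, dropped


if __name__ == "__main__":
    raw = [
        {"device_id": "a1", "timestamp": "2024-01-01T08:00:00", "cell_id": "c1"},
        {"device_id": "a2", "timestamp": "", "cell_id": "c1"},  # missing value
        {"device_id": "a3", "cell_id": "c2"},                   # missing column
    ]
    kept, dropped = clean_events(raw)
    print(len(kept), "kept,", dropped, "dropped")
```

Counting the dropped rows, instead of discarding them silently, is one way to keep the quality of each operator's input visible over time.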
And also the modularity which I spoke about: if new analyses are needed, we can just add functions and operations in a modular way, and it won't affect anything else.
We will just have an enrichment with different dimensions. And to come back to the mid-term and long-term: it is about understanding what happens within one day, or one month, or even a season, let's say three or four months,
and also what happens over one or two years. And at the end of the day the indicators are also something
that needs to be merged and used with other data sources to get greater value out of them.
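The modularity described above, where new analyses are added without touching existing ones, could look roughly like the following registry of independent steps. This is a generic sketch, not the project's architecture; `REGISTRY`, `register`, `run_pipeline` and the two example steps are invented for illustration.

```python
from typing import Callable, Dict, List

State = dict
Step = Callable[[State], State]

# Each module registers itself under a name; the configuration decides which ones run.
REGISTRY: Dict[str, Step] = {}


def register(name: str) -> Callable[[Step], Step]:
    """Decorator that adds a processing step to the registry."""
    def decorator(func: Step) -> Step:
        REGISTRY[name] = func
        return func
    return decorator


@register("count_events")
def count_events(state: State) -> State:
    return {**state, "event_count": len(state["events"])}


@register("count_devices")
def count_devices(state: State) -> State:
    devices = {event["device_id"] for event in state["events"]}
    return {**state, "device_count": len(devices)}


def run_pipeline(step_names: List[str], state: State) -> State:
    """Run the configured steps in order; each step returns a new state."""
    for name in step_names:
        state = REGISTRY[name](state)
    return state


if __name__ == "__main__":
    events = [{"device_id": "a1"}, {"device_id": "a1"}, {"device_id": "a2"}]
    result = run_pipeline(["count_events", "count_devices"], {"events": events})
    print(result["event_count"], result["device_count"])
```

Adding a new indicator then means registering one more function and listing its name in the configuration, while the existing steps stay unchanged.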
Just to summarize, this project is about creating the framework and testing it out with a few operators who are willing to participate.
And just to stress once more: everything regarding privacy will be covered by really strong processes and agreements, because this is a really sensitive topic
and everybody is of course afraid of their data getting out somewhere they don't know about. This is being taken care of at a higher level, and once all of these things are put in place,
we hope that in a few years, in less than five or six years, we will have a successful framework that works and that everybody benefits from,
as I said, with a better urban environment or something else. If you have any specific interests, you can check out the project on this webpage; I will put it back up later. So yes, thank you, I'm finished.
Thank you very much for the presentation. Are there any questions?
Thank you. First I would like to acknowledge that it's quite a difficult task, so congratulations, because we have also been involved with some MNO data
and I know from experience that it's extremely difficult. The data we were getting was very heterogeneous, so I was wondering whether you were also involved in the anonymization process, or whether you were getting the data already anonymized.
That's one question, because I have the feeling that it was treated quite differently by the providers, and in the end it was not so useful for our case, because with the missing data and the lack of homogeneity it was very difficult. And then a follow-up question:
knowing how difficult and how heterogeneous it is, what are your thoughts on other data that is not available to us but is in the hands of, say, Google? They have much better data,
it's not heterogeneous, they have all the data they need, and they do whatever they want with their data to get to the analysis. So it's maybe a kind of unfair battle to have to work with data
whose quality is not so good. So maybe a bit of your thoughts on that.
Okay, thank you. So, the first question: basically the data is anonymized on the operator side, so we don't even have access to it. Everything will happen on their side,
and after that the data is extracted into a kind of testing environment, but in most cases the operators will not let us access that either, so it's totally in their hands. I personally have spent days, weeks and months
in mobile operator offices all over the world, waiting for the data and then just pointing with my finger, sitting next to them. Usually they don't allow any access of this kind, which is of course a good thing
from the privacy side, unless all the anonymization and so on is done. In our project there is a high probability that all of these operations will happen only within their team,
without even our involvement. We will probably have to step out of the room: after the testing, when the system is up and running, they will take the real data and put it in. As for the follow-up question: yes, it's true
that these really giant technological systems most probably understand us better than we do ourselves. I don't even know whether this is under discussion somewhere,
but I think in many cases people are wondering whether this data could be used. I understand it's quite difficult to give something
to them in order to get something back. So I don't know. It's definitely unfair, because the precision of mobile network data is definitely poor compared to really precise GPS and location data.
But MNO data is at least a starting point, because the MNOs have been around for 20 or 30 years already, and they are actually willing to participate. But this is Europe; we have to see what happens in the rest of the world after that.
Any more questions?
Thank you for the interesting presentation. I myself worked on this kind of data in a previous life, so I more or less have some idea of the challenges.
First, a technical question: what does your data life cycle strategy look like? You have massive data, so how do you get it ingested from the source? And the second one: are you doing batch, or focusing more on streaming processing?
And then I'll leave the others a chance to ask questions.
Okay, thanks. So regarding this specific project, we will first focus on maybe a few months of data,
and it will be batch at this point. In the future, however, in order to have these daily statistics running, we would definitely need to be ready for maybe not real-time, but near-real-time operations as well.
So it really depends on the capability of the operators' framework, the infrastructure, the servers and everything, as well as on the methodology side, just to understand what needs to be calculated fast
and what could be done later with a batch process. Maybe it takes more time, but it's more efficient not to spend so many resources at once.