Towards Big Earth Data Analytics: The EarthServer Approach
Formal Metadata

Title | Towards Big Earth Data Analytics: The EarthServer Approach
Title of Series | FOSS4G Nottingham 2013 (87 / 95)
Number of Parts | 95
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt, copy, distribute, and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers | 10.5446/15588 (DOI)
Production Place | Nottingham
Transcript: English (auto-generated)
00:02
Okay, second session of this afternoon. We have a small change to the program: we have Alan Beccati instead of Peter, talking about Towards Big Earth Data Analytics. Thank you. So, I'm Alan Beccati from Jacobs University Bremen, and we are coordinating the EarthServer project.
00:21
So I'm here to talk to you about the approach of this project towards Big Earth Data Analytics. In this presentation we will see a brief overview of the EarthServer project itself and which open standards we are using in the project.
00:40
Then something about the technical platform, oriented towards the scalability issues of dealing with big data, and some demonstrations of the services being built on the EarthServer infrastructure for delivering access to the datasets themselves. So, EarthServer is an EU-funded project.
01:02
It involves 11 partners from both the computer and Earth sciences, putting together software development and technologies to build an infrastructure for serving and accessing Earth science datasets efficiently, providing analytics over them in a flexible way,
01:24
and building on top of that technology pre-operational services for such access and analysis. What's our approach? Well, we use distributed systems for server-side processing,
01:40
and we move toward the integration of data and metadata for dataset analysis and location, which I will show you later. We visualize that on the web using 3D and 2D clients. All of that is based on open source software and open standards.
02:00
Let's talk about standards to begin with. To ensure interoperability of the data served from the archives, we use OGC standards. I'll focus this presentation on the data model, which is GMLCOV, the GML coverage model.
02:21
What, basically, is a coverage? Well, it's a representation of a spatio-temporally varying phenomenon. It is provided in a standardized way: the ISO definition of a coverage provides the abstract model, and the GML definition provides the concrete implementation with which you can serve and deliver your data and
02:41
read data from other systems in an interoperable way. The basic types of coverage that we deal with are grids. So, gridded data; not just maps but, going up in dimensionality, multi-dimensional grids like data cubes in space and time, for example,
03:01
if you have a time series of satellite images. Well, you have to locate your data in space, so there is a standardized way of dealing with the coordinate reference system of your dataset. Beyond gridded data, which is quite convenient for the underlying technology,
03:22
the coverage model defines other kinds of datasets, like multi-point coverages, or topologically different coverages like curved surfaces and solids. Let's have a look into the standard itself. This is the conceptual top-level view of it.
03:45
It's basically a feature from GML, so it is compatible with GML. The coverage is defined by three main elements. Well, I said that it's a definition of a dataset, so the core element is the range set,
04:03
which is the container of the actual values that you want to access. So if it's a multi-band spectral image, all the pixel values are contained in, and delivered through, this range set element.
04:22
Well, I said it can be structured data, so you have to find a way to deliver information about the structure of the data itself. We have the range type element for that. It comes from SWE Common and tells you how each pixel, how each value, let's say, is structured. So if it's a multi-spectral image,
04:40
you get information about each band value and the semantics of the value itself. This is the data part. Then you have to locate these data in space. How do you do that? With the domain set element. The domain set again comes from GML, and it holds the coordinates of the coverage.
05:03
It's also what defines the type of coverage: whether it's a gridded coverage or a multi-solid coverage all depends on the domain set element. So with this element you deliver the coordinates of the data; it can take on different topologies and can be
05:22
compact or extended depending on the layout of the data itself. So what is the idea behind having a coverage? Well, it helps data integration, because you have a unified model that takes in
05:42
observations from a wide variety of sensors and puts them within a generic schema that can serve out n-dimensional data in n-dimensional coordinate systems. So, on one side,
06:00
you can fill the coverage with different data sources, and on the other side you can access and process these data, with the aim of system interoperability. Okay. There are many standards based on the coverage model; there is a core and an extension model.
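For comparison, the WCS core alone already gives standardized access to coverages via GetCoverage. A minimal sketch of such a request, assuming a hypothetical endpoint and coverage name (wrapped for readability):

    http://example.org/ows?service=WCS&version=2.0.1
        &request=GetCoverage
        &coverageId=SatTimeSeries
        &subset=Lat(40,50)&subset=Long(0,10)
        &format=image/tiff

What the core alone cannot do is arbitrary server-side processing; that is where the query language discussed next comes in.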
06:22
But the key aspect of this project that I want to show you is flexibility. So which part of these standards can we use for that? Well, how do you get flexibility for analyzing and accessing standardized datasets? We use a high-level query language approach. So we use a standard that allows you
06:44
to write queries directly over your data model. Well, querying over a dataset is a proven, valid model. We do that on the coverage model.
07:01
So basically, we have the Web Coverage Processing Service (WCPS) standard, which defines this query language over the n-dimensional dataset that is stored in the coverage. What can you do with this kind of language? Several operations. You can do server-side computation.
07:22
Well, the language is composed of main elements: you define which coverage, so which dataset, you want the query to operate on, in the for clause, and you decide what to return out of this selection. For example, here there is a band math computation defined in the query.
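As a hedged sketch of what such a band math query can look like (the coverage and band names here are hypothetical, not the ones from the talk), an NDVI-style computation in WCPS:

    for c in ( SatScene )
    return encode(
        ( (float) c.nir - c.red ) / ( (float) c.nir + c.red ),
        "image/tiff"
    )

The for clause binds the coverage to a variable, and the return clause states both the computation and the encoding of the result.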
07:42
So you're processing a coverage dataset and already extracting a sub-selection of the bands stored in it, with some computation done on them. Obviously, if your coverage is large, you want to subset it to the data of interest.
08:02
The query language allows for that: you can specify subsetting in the coordinate reference system in which the coverage is stored and defined. Here the example is latitude, longitude, and time for a three-dimensional cube of data. Another interesting thing is that you can integrate
08:23
different coverages by specifying them as different variables in the query, and apply operators over these different coverages within a single query to produce an integrated result.
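A hedged sketch combining both ideas, with hypothetical coverage and axis names (the time axis is often named "ansi" in rasdaman-served coverages): trimming two 3-D cubes in latitude, longitude, and time, and differencing them in one query:

    for a in ( CubeA ),
        b in ( CubeB )
    return encode(
        a[ Lat(35:45), Long(5:15), ansi("2013-01-01":"2013-01-31") ]
      - b[ Lat(35:45), Long(5:15), ansi("2013-01-01":"2013-01-31") ],
        "netcdf"
    )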
08:47
As an example of the semantics that you can get with the query: once you learn how the query language is laid out and how it works, you have the semantics of what you want to get from the processing of the dataset directly encoded in the query,
09:03
instead of having it in extended, human-readable form as in WPS. So it's also a compact way of representing your function. What we are doing on top of that, within the frame of the project, is
09:21
integrating not only the processing part and the access part, but also access to the metadata of the coverages stored on the server. What does that mean? That you don't have to know all the coverages by name and know what they mean. You can do that by describing the coverages, but the goal is to have
09:40
predefined metadata and specify the query directly on that. So, for example, you can say: I want to process, with this query content, all coverages that have Barcelona as their geographic extent. Or you can get metadata from the result of the processing;
10:02
for example, I want to test some condition on the coverages and return the ID and the extent of the coverages matching it. The implementation is ongoing by the Jacobs University and Athena Research Centre partners of the project.
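Since this integration was still being implemented at the time, the following is purely an illustrative sketch of the idea, not any standardized syntax: a query that selects coverages by a metadata condition and returns identifiers and extents instead of array values:

    for c in ( coverages overlapping extent("Barcelona") )
    return id(c), extent(c)

Again, both the selection predicate and the metadata accessors here are made up solely to convey the concept.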
10:22
So this is the storage and processing level. Then you want to visualize some of the data after processing, and you do that with the X3D standard, which is being employed in the project for visualizing multidimensional data.
10:40
So once you do your extraction and define how to visualize it in 3D, you can visualize your data. What we basically do is again leverage the query standard to build web interfaces that build the query for you and display the results.
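In the rasdaman-based stack, such a generated query is typically submitted over HTTP through the WCS Processing Extension's ProcessCoverages request. A hedged sketch, with a hypothetical host and the query left unescaped for readability (a real client must URL-encode it):

    http://example.org/rasdaman/ows?service=WCS&version=2.0.1
        &request=ProcessCoverages
        &query=for c in (SatScene) return encode(c.red, "image/png")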
11:03
An example of that is the 3D visualization we've used: you can see the result of a query extracting data from two different coverages. One is the red, green, and blue bands of the dataset,
11:22
and the other is the alpha channel, which is built from a digital elevation model, so that you can get it displayed as an image laid out on the 3D scene. This we are doing with the Fraunhofer project partner. Okay. Let me talk then about the platform,
11:42
the technical platform that we are using for storing and accessing the data. We are basing that on rasdaman, which is an array database, and it provides the core storage of the system. A core focus of the project is dealing with the scalability issues of accessing data.
12:02
So we aim at scalability through parallelization of the system. The approach is: well, we use queries to access and extract the data, so we want to distribute the queries based on their content and on the data location.
12:25
The rasdaman system offers several optimizations for that. For dealing with the data itself on a server, we have a tiled architecture and we perform tile processing in a pipeline,
12:41
and, well, multi-threading is employed there. But what we aim to do is parallelization at the query level, so that you receive a query and are able to split it according to the processing nodes you have available and according to the content of the query. So, based on where the coverages are located,
13:03
you receive a single query, you split it over different processing servers, and you then join back the results, which hopefully are reduced in dimensionality, because of the subsetting done on each single server, and you fuse back the resulting coverage.
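To illustrate the idea with a hedged sketch (hypothetical names; the actual decomposition is done internally by the system), a query aggregating over a large extent, such as

    for c in ( GlobalMosaic )
    return avg( c[ Lat(-90:90), Long(-180:180) ] )

could be split by latitude into two subqueries, one per server holding that half of the data:

    server 1:  for c in ( GlobalMosaic ) return avg( c[ Lat(-90:0), Long(-180:180) ] )
    server 2:  for c in ( GlobalMosaic ) return avg( c[ Lat(0:90),  Long(-180:180) ] )

with the two partial averages then fused into the final answer (for an average, weighted by the number of cells each server processed).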
13:26
Okay. So, as I said, the rasdaman system provides the horsepower for storing and accessing the data. And again, it is an array database, so it's particularly well suited for gridded coverages,
13:42
but in the project we are extending the system to support more than regular grids. We already have a working example of multi-point coverage extraction, and we are moving toward irregular gridded coverages, so that they can be used in the system itself.
14:01
One interesting feature that we employ to avoid duplication of data is the in-situ feature, which I will show you later, and the distributed query processing system is being implemented, along with the integration of the metadata.
14:21
Okay, so a brief note: rasdaman is an array database system, and it works on n-dimensional data. So you can have not only maps, not only 3D data volumes, but 4D data cubes as in climate simulations. And, well, it provides not only an array engine,
14:41
but also the reference implementation of the standards, both WCS and WCPS, that you can use to access your datasets. Okay. One interesting feature is that it partitions your data archive with custom tiling; you can do 2D or 3D tiling.
15:01
And that optimizes access to the data according to the layout of the access pattern. Well, the interesting feature I mentioned earlier was the in-situ feature. What does it mean? Well, to obtain the optimization of the tiled storage,
15:21
you have to import your data into the database and lay it out according to what you expect to be the access pattern of your clients. One extension is to reference the data files themselves, or existing archives themselves. So you do not import, but register, your data.
15:41
Of course, with this approach the optimization of the data structure is lost, but you don't have to duplicate your entire archive. So what you can do is link the whole archive and, where you have hotspots where the data is accessed more frequently, import those directly and build the optimized data structure for them.
16:02
Okay. Let's have a look at the services that are being built on top of this technology stack. So, on top of the storage, processing, and visualization, we have different data domains providing solutions for accessing, processing, and analyzing data.
16:25
One of them is the cryosphere data service, which provides a web interface for accessing and analyzing snow cover products, and you can do combined analyses with digital elevation models on this data.
16:44
That can be done directly in the web interface, which builds the query for you, and you get the results displayed. The second domain is the atmosphere data service, which employs the same technology and provides similar interfaces to MODIS-derived atmosphere products.
17:07
So with this service you can access, for example, two-dimensional extracts of the data coverages stored in the service, or temporal profiles of the parameter values.
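A hedged sketch of such a temporal profile extraction (hypothetical coverage and axis names): slicing at one latitude/longitude point and trimming in time yields a one-dimensional series that can be encoded, for example, as CSV:

    for c in ( AerosolOpticalDepth )
    return encode(
        c[ Lat(41.9), Long(12.5), ansi("2010-01-01":"2010-12-31") ],
        "csv"
    )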
17:25
The third service deals with ocean data: it provides web access to marine datasets that can be analyzed dynamically in the web interface. And again, this leverages the same stack of standards.
17:43
What's flexible here is that you can parameterize the query through the interface quite conveniently. One example of three-dimensional visualization comes from the geology domain, where you have solid Earth information like geological parcels,
18:04
and access can be done by visualizing them in 3D and moving them in space to see how they are located and how they relate to each other. So that's a good example of the 3D visualization.
18:20
To make it a little fancier, not only Earth is considered: we also have planetary science implementing this kind of service. This service is provided by Jacobs, and it analyzes multispectral, sorry, hyperspectral datasets coming from Mars. Again, paired with the web interface,
18:41
it provides good examples of queries that compute statistics, in this case a histogram of the dataset itself. So, writing a single query, you get the data for your diagram directly on the web.
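A hedged sketch of what such a histogram query can look like, using the WCPS coverage constructor together with the count condenser (hypothetical coverage and band names; 8-bit values assumed):

    for c in ( MarsScene )
    return encode(
        coverage histogram
        over $bucket x(0:255)
        values count( c.band0 = $bucket ),
        "csv"
    )

The constructor iterates $bucket over the 256 possible values and, for each, counts the pixels equal to it, so a single query returns the whole histogram.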
19:03
To conclude, the aim of this project and technology stack is to provide what we call agile analytics over the datasets. The concept is to have a flexible way of querying your data without having to program, without having to deal with the internals of dataset access,
19:21
but just have high-level access to a structured dataset. Well, the integration of data and metadata to provide catalog search is a good addition to that analytics part. And, yeah, basically we have a standards-based interface to the system,
19:43
all implemented with open source solutions and visualization toolkits. One thing I also want to stress before concluding is that the project is building user interest groups. So, any colleague working in this domain area,
20:01
please have a look at the website and at the service provider implementations, and, where you find them useful, use their data and access these datasets, so that we get feedback on the solutions and on how the implementation is going. Good, that's all. Thank you for your attention. The query language you've defined as part of WCPS,
20:40
is that language sufficiently well defined that I could, in theory, implement it myself on a different backend? Indeed. It's specified in a standard document from the OGC, so it's completely defined. And you can provide different implementations, not only this one, of course. We have the reference implementation in rasdaman, but you can build your own, obviously.
21:14
Just roughly speaking, if I had a terabyte of data, how long would it take me to ingest it into rasdaman, roughly,
21:22
and how big would it be once it ended up in the database? Roughly? Well, it depends on which tiling scheme you are using for ingesting the data and on which backend engine you are running, so, roughly, we can get some feedback from the service providers themselves.
21:40
So, I don't know exactly. We have one service provider in the audience, so maybe he can.