
Big Data In Standardization: Can This Fly?


Formal Metadata

Title
Big Data In Standardization: Can This Fly?
Number of Parts
95
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Production Place
Nottingham

Content Metadata

Abstract
In geo data, a major part of the Big Data footprint stems from remote sensing, atmospheric and ocean models, and statistics data. In the drive for interoperability, standardization bodies establish interface specifications for large-scale geo services. Are these standards really helpful, or do they inhibit performance? We investigate this question and show both positive and negative examples, based on OGC, INSPIRE, and ISO standards relevant for scalable geo services.
Transcript: English (auto-generated)
So the particular field of standardization that I've been working on is the OGC standards, the OGC standards suite. It's a large set of geospatial standards. You may already know some of them, but here is a picture that puts them together. We have different kinds of data sets, with different kinds of data to deal with. Most of the GIS world speaks about vector data, that is, polygons, and for those we have the Web Feature Service. From that you then generate maps, and maps are the main topic of this conference; there is the Web Map Service for serving just maps, to be displayed in a web browser or on any other display. Well, maps are generated not only from vector data but also from raster data, like multispectral satellite imagery, where you have an array of multiband data sets, and the first two standards don't deal with that: you get either a map or a feature, but you don't get the actual numbers, the actual band values and their geolocation. For that we have the coverage service, the coverage standards that I will be speaking about. Then, for finding your data, OGC defines catalogue services like CSW, which deal with metadata.
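To make that difference concrete, here is a minimal sketch, assuming a hypothetical server at http://example.org/ows that speaks both WMS and WCS; the layer and coverage names and the bounding box are illustrative, not from the talk:

    import requests

    base = "http://example.org/ows"   # hypothetical endpoint offering both WMS and WCS

    # WMS GetMap: returns a rendered picture, styled for display, with no band values.
    picture = requests.get(base, params={
        "service": "WMS", "version": "1.3.0", "request": "GetMap",
        "layers": "landsat", "styles": "", "crs": "EPSG:4326",
        "bbox": "40,10,50,20", "width": 512, "height": 512,
        "format": "image/png",
    })

    # WCS GetCoverage: returns the actual data values together with their geolocation.
    data = requests.get(base, params={
        "service": "WCS", "version": "2.0.1", "request": "GetCoverage",
        "coverageId": "landsat",      # assumed coverage name
        "format": "image/tiff",
    })
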
So you have different kinds of data, and descriptions of this data, that you can put together, all with standards. For those of you who don't know the OGC: it deals exactly with geospatial data, and as you can see at opengeospatial.org, its standards are built in a consensus-based process involving industry partners. One particular standard that I'm working with is the coverage standard, so that's one aspect, one particular feature, and our working group is contributing to several other technical groups in the OGC. So what's a coverage? Basically, it tackles the variety of data by providing a structured way of looking at it. OGC defines the abstract coverage as a feature type.
From that you can derive different types of coverages. In our group we work with large array databases, which are particularly suited for serving gridded coverages, and this is what you will find in the rasdaman implementation that we will see later. Gridded coverages are just the start of what is available on the web and around the world as data, so we are moving on to the other types of coverages, like multipoint, curves, surfaces and solids, bringing an interoperable way of formatting this data and delivering it from one partner to another, one system to another. So, coverage data and service model: the concrete aspects of this.
A coverage, from the top-level, architectural, let's say conceptual view, is a rather simple organization of the data. It all starts with the feature, which is a GML standard element, and from that we derive the coverage. What is a coverage? It is a representation of spatio-temporal data, so it must contain, and in these three key elements it provides, the definition of the data. Where the data is located in space you find in the domain set element; this again uses GML as a base standard and provides the topological description, so the coordinates of where the values sit in space and time. Once you know where the data is located, you have to know what each point represents, what these structured values mean, and this is done by the SWE Common data record, which is the range type element. So you get the spatial location of the data and the structure of the values, like in a multiband image (that was my former domain, so I know it very well), and then the range set. The range set is the actual values, all put together in the way described by the range type. This allows a compact representation of the data to be delivered along with the description of how to extract the data and how to locate it in space. So all that you need for a data set is there in this standard.
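To make those three parts concrete, here is a minimal sketch in Python that picks them out of a GML coverage document already saved on disk; the file name is hypothetical, while the element and namespace names are the ones used by GML 3.2 and the GMLCOV coverage model:

    import xml.etree.ElementTree as ET

    ns = {
        "gml":    "http://www.opengis.net/gml/3.2",
        "gmlcov": "http://www.opengis.net/gmlcov/1.0",
    }

    root = ET.parse("coverage.xml").getroot()         # hypothetical coverage document

    domain_set = root.find(".//gml:domainSet", ns)    # where the values sit in space and time
    range_type = root.find(".//gmlcov:rangeType", ns) # what each value means (SWE Common DataRecord)
    range_set  = root.find(".//gml:rangeSet", ns)     # the actual values, laid out as rangeType describes
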
On top of this data model we provide the WCS service model. What is the service model about? Well, it also provides a few key elements. One of them is the coverage offerings: this is what the server provides, so you can implement a web server offering coverages as a service. It contains offered coverages, which hold the data, and in an offered coverage you have the description of the specific server implementation plus the coverage itself, which is the data model I have shown you before. So that is basically the data structure of the service that you can put on the web and that other systems can access. The default format in which data is delivered is GML, so an XML document, rather verbose, which makes it suitable only for very small data sets; when you need to extract larger data sets you use different encodings, and I will show you an example later. Then, like many other OGC web services, you have the key GetCapabilities operation. Every OGC-compliant web service offers this operation, and it is what gives you information about the server: what its content is, which formats and extensions it supports, and in particular which coverages are offered. If you find a coverage of interest, which might be a single data set or a time series of data sets, you can get the details about that coverage, all the metadata associated with it, with DescribeCoverage, and when you are happy, knowing what is inside the server, you extract the data with GetCoverage. The key point is that you can extract not an RGB map but the actual values, packed in a semantically enriched way, so that your machine can understand how to process this data further.
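As a sketch of that three-step workflow, again against a hypothetical WCS 2.0 endpoint and an assumed coverage name:

    import requests

    wcs = "http://example.org/rasdaman/ows"   # hypothetical WCS 2.0 endpoint

    # 1. What does the server offer? Formats, extensions, and the list of coverage ids.
    caps = requests.get(wcs, params={
        "service": "WCS", "version": "2.0.1", "request": "GetCapabilities"})

    # 2. What is inside one coverage? Domain set, range type, and further metadata.
    desc = requests.get(wcs, params={
        "service": "WCS", "version": "2.0.1", "request": "DescribeCoverage",
        "coverageId": "AvgLandTemp"})         # assumed coverage name

    # 3. Extract the actual values, not a rendered picture.
    data = requests.get(wcs, params={
        "service": "WCS", "version": "2.0.1", "request": "GetCoverage",
        "coverageId": "AvgLandTemp", "format": "application/netcdf"})
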
Now, for GetCoverage to work correctly, and to actually know where the data is, the data must be addressable. We use coordinate reference systems for addressing the data, and the interesting thing about coverages is that you can address multi-dimensional data sets, which means not only geographically located data but also time series, within the same model. So, in our standardization effort we aim to have time as just another axis of the data set, which is quite different from the separate time slices you might be used to. That is a challenging thing to model, and my colleague Piero Campalani is working specifically on that aspect. To show you what this CRS mapping looks like in the document: it is URI-based, so you specify which axis uses which coordinate system, like the EPSG authority ones you see here, via URIs, so it is portable over the web and you can retrieve it directly in an intelligible form. The interesting part of this mapping is that the URIs resolve to the actual GML definition of the coordinate system, which gives you, among other things such as the datum and everything else relevant to the coordinate space, the association with the axis labels, so latitude, longitude and time in this example. This you, or your system automatically, can use to query the coordinate space of the data set properly. If you are interested in how time is handled within the data sets, I invite you to follow the discussion in OGC at this link; that is where the temporal working group is working on the addressing aspects of the standard, so on how to address time.
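As an illustration of what such URI-based axis addressing can look like, a sketch using the OGC definition-resolver URIs; the particular identifiers below are assumed examples rather than taken from the slides:

    # A 2-D geographic CRS, identified by an authority URI instead of a bare code:
    spatial_crs = "http://www.opengis.net/def/crs/EPSG/0/4326"   # latitude, longitude

    # The temporal axis gets its own CRS URI, here an ANSI-date axis:
    time_crs = "http://www.opengis.net/def/crs/OGC/0/AnsiDate"   # time

    # Combined into one compound CRS, so latitude, longitude and time form a single coordinate space:
    compound_crs = ("http://www.opengis.net/def/crs-compound?"
                    "1=" + spatial_crs + "&2=" + time_crs)
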
Getting to the coverage service itself: what does it provide you, basically? Well, it has a core that lets you get subsets out of your data archive with two operations, trim and slice. With a trim you say, for example on a three-dimensional data set, let's use a cube as a simple example, so regular time slices stacked up to give three dimensions: I want two months of data over this bounding box, and you get just that part out of the service, still in a three-dimensional coordinate space. With a slice you say: I want, for example, one map you can produce from that, just the spatial extent of the time series at this one time instant, and you get a two-dimensional data set out of it, so you reduce the dimensionality. This is the main difference between the two operations.
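In the WCS 2.0 KVP encoding this shows up in the subset parameters; a sketch, with axis labels, time values and a coverage name that are assumed here:

    import requests

    wcs = "http://example.org/rasdaman/ows"        # hypothetical endpoint
    common = {"service": "WCS", "version": "2.0.1",
              "request": "GetCoverage", "coverageId": "AvgLandTemp"}

    # Trim: give a lower and upper bound per axis; the result keeps all three dimensions.
    trim = requests.get(wcs, params={**common, "subset": [
        "Lat(40,50)", "Long(10,20)", 'ansi("2008-01-01","2008-03-01")']})

    # Slice: give a single point on the time axis; that axis disappears and the result is 2-D.
    sliced = requests.get(wcs, params={**common, "subset": [
        "Lat(40,50)", "Long(10,20)", 'ansi("2008-02-01")']})
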
Then the standard provides you with extensions. With extensions you can plug further operations into the server implementation, like scaling of the output, reprojecting the coordinates, warping the output, and others. This is the big picture of the coverage data model that I am showing you, and it includes format encodings: if you want your data out of this web service in GeoTIFF or NetCDF format, you can specify that and get it out.
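A sketch of how such extensions surface as extra request parameters; the scaling and output-CRS parameters below follow the WCS scaling and CRS extensions, while the endpoint and coverage name remain assumed:

    import requests

    wcs = "http://example.org/rasdaman/ows"        # hypothetical endpoint

    resp = requests.get(wcs, params={
        "service": "WCS", "version": "2.0.1", "request": "GetCoverage",
        "coverageId": "AvgLandTemp",               # assumed coverage name
        "subset": ["Lat(40,50)", "Long(10,20)"],
        "scaleFactor": "0.5",                                        # scaling extension: shrink the output
        "outputCrs": "http://www.opengis.net/def/crs/EPSG/0/3857",   # CRS extension: reproject the result
        "format": "image/tiff",                                      # format encoding extension
    })
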
What is also interesting is the service side of the extensions, which gives you further functionality. Besides the basic subsetting, which is the core of the service, you get other extensions. I don't have time to go into the details of each of them, but I will just introduce the Web Coverage Processing Service, WCPS, one particular extension that we are using to provide a flexible way of accessing data expressed as coverages. In our group we have been dealing with languages and language processing systems, so we take a language-oriented approach to getting at the data, and we aim to provide, you could say, a basic query language for rasters. What does that mean? It means that once you understand the data model, you can query it with a textual query that is sent to the server. What can you do with these queries? On the server side you can compute on the single bands of the data set and return the output of the computation, so without downloading whole data set files you can place your query and get a pre-processed data set back directly. Yes, this is our language. In the top part you define which coverage you are going to operate on, and then, for example, here in the return clause, you encode some computation over the bands of the coverage. This is simplified; you also have to specify, as you can see here, the coordinates, the region over which you want the computation done, so that you extract a pre-processed subset. And the main thing we are aiming for is coverage integration: this is already possible if you, let's say, resample or reproject data onto the same coordinate space, so that you can put several data sets together and encode the result of their combination. I have no examples to show here, but we have a presentation tomorrow afternoon where you can see how this language is being used in the EarthServer project for building services on top of this OGC standard.
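Since no example fits into the talk, here is a minimal sketch of what such a query can look like, sent through the ProcessCoverages request of the processing extension; the endpoint, coverage names, band names and coordinates are all assumptions for illustration:

    import requests

    wcs = "http://example.org/rasdaman/ows"        # hypothetical endpoint

    # A WCPS query: subset two assumed single-band coverages to the same region,
    # combine them into a normalized difference, and encode the result as GeoTIFF.
    wcps_query = """
    for red in ( S2_red ), nir in ( S2_nir )
    return encode(
        ( nir[ Lat(40:50), Long(10:20) ] - red[ Lat(40:50), Long(10:20) ] )
      / ( nir[ Lat(40:50), Long(10:20) ] + red[ Lat(40:50), Long(10:20) ] ),
        "image/tiff" )
    """

    resp = requests.post(wcs, data={
        "service": "WCS", "version": "2.0.1",
        "request": "ProcessCoverages", "query": wcps_query})
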
For now, let's just talk about the implementations that are available. WCS is a standard, and being an open standard it has not only rasdaman as an implementation but several others, most of them open source. Rasdaman is the core reference implementation, and we believe WCS is a good, future-oriented standard for dealing with large data sets. What does it mean to be a reference implementation of an OGC standard? Well, the standard needs, within its specification, a test suite that you have to run against a server before it may call itself a WCS server, for example, and there is TEAM Engine, an automated test system for OGC services. Being a reference implementation means you must pass all of these tests, which go down to the pixel level of the data sets being served.
Okay, we have to keep this fast. A brief word about rasdaman, which is the implementation we are actively developing, both as a community and through a commercial company working on the engineering aspects. It is basically an array database, and on top of it we have the implementation of the raster services, so any kind of raster data set that you might want to serve and process can be handled efficiently through this implementation. As for where it is being used and how this work is currently funded: EarthServer, a newly funded community project, is using rasdaman as the core platform to deliver analytics over several geospatial and geoscience domains, and it provides services that the community can use over the web to access the data sets. So this was the compact introduction and overview of the standardization effort. Now let's look at some comparisons and assessments we have made while applying these standards to large data sets.
First of all, coverage encoding. As I told you before, one aspect of coverages is that you can encode them in several different ways, in several formats. The reference is GML, so XML; it is quite verbose, and for the range set, the actual values, that is not always the optimal choice. You can get out of the server a format that encodes everything, so domain set, range set and range type definition all together; if such a format suits you, you can use it, and NetCDF, for example, is a good candidate. If not, you can still use PNG or other common file formats for storing the values, the pixels of the image or the values of the raster bands, paired with all the other descriptions. Here is a simple example, in which the binary part could just as well be replaced by a GeoTIFF file directly, just to show how it works: you have the XML part, the coverage definition, which lets you interoperate this data set with other automated services, while the values are provided in a compressed binary file. That is quite convenient for delivering data over the web efficiently.
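In WCS 2.0 terms, this delivery of an XML description plus a binary file is what you get when you ask for a multipart response; a sketch, with the usual assumed endpoint and coverage name:

    import requests

    wcs = "http://example.org/rasdaman/ows"        # hypothetical endpoint

    resp = requests.get(wcs, params={
        "service": "WCS", "version": "2.0.1", "request": "GetCoverage",
        "coverageId": "S2_scene",                  # assumed coverage name
        "format": "image/tiff",                    # the binary part: values as GeoTIFF
        "mediaType": "multipart/related",          # ask for the GML description plus the binary file
    })
    # resp.content now holds a multipart message: a GML coverage document describing the
    # domain set and range type, plus the GeoTIFF file carrying the range set values.
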
Then we made a comparison with WaterML2, which is another standard, especially about time. Since one of our core efforts is embedding time into the data set itself, making no assumptions about how it is managed, the key difference with our model is that we do not deal with separate time instances; we do not have time labels. We can optimize the efficiency of accessing the data by having a compact representation of it, with time as one more axis that is directly accessible in the query language and in the service model, as you have seen. One technical detail I want to share with you concerns GML 3.3, the standard we are using: grids can be of different types, they can be irregular or totally warped, and we are going to propose a candidate optimization that allows mixing these two kinds within one data set. I have no time for the details, but I wanted to share this with you as well. With respect to ISO standardization, the best-known standard is SQL. What we are trying to pursue is mixing array data into SQL directly, so having a tight integration of arrays, and not only array data but also operators on array data, into an SQL-like language.
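As a sketch of the direction this points in, with a hypothetical collection name and band names, and syntax that is illustrative of an SQL-style array query rather than a finished standard:

    # An SQL-style query over array data: subset and combine raster bands on the
    # server, the way a relational query selects and combines columns.
    query = """
    SELECT encode( (c.nir - c.red) / (c.nir + c.red), "image/tiff" )
    FROM   satellite_scenes AS c
    """
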
This is how we position ourselves and compare with other standardization efforts. Sorry I had to keep it brief; thank you for your attention.