Data-driven innovation, as outlined by Granell et al. (2022), has been propelled by recent technological advances: the continuous influx of data, the miniaturization and massive deployment of sensing technology, data-driven algorithms, and the Internet of Things (IoT). Data-driven innovation is considered key in several policy efforts, including the recently published European strategy for data, in which the European Commission acknowledged Europe’s huge potential in the data economy by leveraging data produced by all actors (including the public sector, the private sector, academia and citizens). Technologies currently used for the management, exchange and transmission of data, including geospatial data, must be evaluated for their suitability to adapt efficiently to larger data streams and datasets. As more users access data services through mobile devices and service providers face the challenge of making larger volumes of data available, we must consider how to optimise the exchange of data between these clients and servers (services). For many years, JSON, GeoJSON, CSV and XML have been considered the de facto standard data serialisation formats. These formats, which enjoy near-ubiquitous software tool support, are commonly used to store and share large amounts of data in an interoperable way. Most Application Programming Interfaces (APIs) available today facilitate data sharing and exchange, for a myriad of different types of applications and services, using these exchange formats (Vaccari et al., 2020). However, approaches based on JSON and XML have many limitations when the volume of data is likely to be large. Potentially the most serious of these is reduced computational performance: exchanging or managing large volumes of data incurs high computational costs for (de)serializing and processing these data.
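As a minimal illustration of the (de)serialization costs discussed above, the following Python sketch round-trips a batch of GeoJSON-like point features through the standard `json` module; the feature structure and counts are assumptions chosen for illustration, not data from the experiments described here. Every record must be rendered to text and parsed back character by character, which is the per-record overhead at issue.

```python
import json

# Build a small batch of hypothetical GeoJSON-like point features.
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [24.0 + i * 1e-4, 60.0]},
        "properties": {"id": i},
    }
    for i in range(1000)
]
collection = {"type": "FeatureCollection", "features": features}

# Serialize to text and parse it back; both steps carry the
# text-parsing overhead discussed above.
text = json.dumps(collection)
restored = json.loads(text)

assert restored == collection
print(len(text))  # size in characters of the serialized payload
```

At larger feature counts, repeating this round trip dominates the cost of a data exchange workflow, which motivates the binary alternatives considered next.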
Against this background, binary data serialization approaches that allow the interoperable exchange of large volumes of data have been used extensively for decades within scientific communities such as meteorology and astronomy. In recent years, popular distributors of geospatial data have also begun making use of binary data formats. Examples include OpenStreetMap (OSM) data (e.g. the OSM Planet and OSM Full History Planet files, providing access to the whole OSM database and its history) as well as the main file (.shp) of the popular ESRI Shapefile format, which stores geometry data in binary form. In this paper we describe the methodology, implementation and analysis of a set of experiments investigating binary data serialization as an alternative to XML or JSON data exchange for several commonly encountered GIS workflows. Binary data serialization allows the storage and exchange of large amounts of data in an interoperable fashion (Vanura and Kriz, 2018). While anecdotal evidence indicates that binary serialization approaches are more efficient in terms of computational cost, processing time, etc., they also bring additional overheads, including special software tools, additional configuration and schema definitions (Viotti and Kinderkhedia, 2022). Additionally, there have been few, if any, investigations of binary data serialization approaches specifically for geographical data. Our set of experiments investigates the advantages and disadvantages of binary data serialization for three common GIS workflow scenarios: (1) geolocation point data from an OGC SensorThings API; (2) geolocation point data from a very large static GeoPackage dataset representing the conflation of address data from the National Land Survey of Finland and OpenStreetMap; and (3) geographic polygon datasets containing land cover polygons (currently ongoing work).
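To make the contrast with text formats concrete, the sketch below packs point coordinates into a fixed-width binary layout using Python's standard `struct` module and compares the encoded size against equivalent JSON text. This is only an illustrative stand-in for a schema-based binary format such as Protobuf or Avro, not the implementation used in our experiments; the record layout and field names are assumptions.

```python
import json
import struct

# Hypothetical point records: (id, longitude, latitude).
points = [(i, 24.0 + i * 1e-4, 60.0 + i * 1e-4) for i in range(1000)]

# Text encoding: a JSON array of objects.
json_bytes = json.dumps(
    [{"id": i, "lon": lon, "lat": lat} for i, lon, lat in points]
).encode("utf-8")

# Binary encoding: one fixed-width 20-byte record per point
# (little-endian unsigned 32-bit id, then two 64-bit doubles).
record = struct.Struct("<Idd")
binary_bytes = b"".join(record.pack(i, lon, lat) for i, lon, lat in points)

# Decoding needs no text parsing, only a fixed-stride unpack.
decoded = [record.unpack_from(binary_bytes, k * record.size)
           for k in range(len(points))]

assert decoded == points
print(len(binary_bytes), len(json_bytes))  # binary payload is far smaller
```

The trade-off noted above is also visible here: the binary bytes are meaningless without knowledge of the `<Idd` layout, which is exactly the role a schema definition plays in Protobuf and Avro.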
We consider comparisons of JSON and GeoJSON with two very popular binary data formats (Proos and Carlsson, 2020), namely Google Protocol Buffers and Apache Avro. Protocol Buffers (Protobuf) is an open-source project developed by Google that provides a platform-neutral mechanism for serializing structured data. Apache Avro, another very popular schema-based binary data serialization technique, is likewise language-neutral and was originally developed for serializing data within Apache Hadoop. Both Protobuf and Avro are widely supported in many popular languages such as C++, C#, Java and Python. The full paper will provide detailed descriptions of the implementations of our experiments; here we provide a summary of some of the key results and highlights of our analysis. As binary data formats such as Protobuf and Avro are not self-describing, a schema definition is required for each dataset or data stream; these definitions are required for the serialization and…
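As an example of such a schema definition, an Avro schema for the kind of geolocation point record used in scenario (1) could be declared as follows; the record and field names here are illustrative assumptions, not the schemas used in our experiments. Avro schemas are themselves plain JSON documents, so the sketch loads one with the standard `json` module.

```python
import json

# Hypothetical Avro schema for a geolocation point observation.
# The schema must accompany the binary data, since the encoded
# records themselves are not self-describing.
AVRO_SCHEMA = """
{
  "type": "record",
  "name": "PointObservation",
  "namespace": "example.gis",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "longitude", "type": "double"},
    {"name": "latitude", "type": "double"},
    {"name": "observedAt", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(AVRO_SCHEMA)
assert schema["type"] == "record"
print([f["name"] for f in schema["fields"]])
```

Both producer and consumer must agree on such a definition before any record can be exchanged, which is the configuration overhead weighed against the performance gains in our analysis.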