
Cluster Analysis


Formal Metadata

Title
Cluster Analysis
Subtitle
a comprehensive and versatile QGIS plugin for pattern recognition in geospatial data
Number of Parts
351
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year
2022

Content Metadata

Abstract
As geospatial data continuously grows in complexity and size, the application of Machine Learning and Data Mining techniques to geospatial analysis is increasingly essential to solve real-world problems. Although the research in this field has produced innovative methodologies over the last two decades, they are usually applied to specific situations and not automated for general use. Both generalization and integration of these methods with Geographic Information Systems (GIS) are therefore necessary to support researchers and organizations in data exploration, pattern recognition, and prediction in the various applications of geospatial data. The lack of machine learning tools in GIS is especially clear for unsupervised learning and clustering: the most used clustering plugins in QGIS [1] contain few functionalities beyond the basic application of a clustering algorithm. In this work we present Cluster Analysis, a Python plugin that we developed for the open-source software QGIS and that offers functionalities for the entire clustering process: from (i) pre-processing, to (ii) feature selection and clustering, and finally (iii) cluster evaluation. Our tool provides several improvements over the current solutions available in QGIS, as well as in other widespread GIS software. The expanded features provided by the plugin allow users to deal with some of the most challenging problems of geospatial data, such as high-dimensional spaces, poor data quality, and large data size. In particular, the plugin is composed of three main sections:

- feature cleaning: This part provides options to reduce the dimensionality of the dataset by removing the attributes that are most likely harmful to the clustering process. This is important to achieve better results and faster execution times, avoiding the problems of clustering in high dimensionality. The first filter removes the features that are correlated above a user-defined threshold, since highly correlated features usually provide redundant information and can lead to an overweighting of some characteristics. The other two filters identify the attributes with constant values for all the data points, or with only a few outliers differing from them. These types of features do not provide any valuable information and can worsen the performance of clustering. To identify quasi-constant features, we use two parameters introduced in the function NearZeroVar() from the Caret package for R [2]: the ratio between the two most frequent values and the number of unique values relative to the number of samples.
- clustering: This section is used to perform clustering on the chosen vector layer. First of all, the user needs to select the features to use in the process; this can be done both manually and automatically. Automatic feature selection uses an entropy-based algorithm [3], provided in two versions with different computational complexities. The currently available clustering algorithms are K-Means and Agglomerative Hierarchical, and users can select the one that best suits their needs. Before performing clustering, the plugin offers the possibility to scale the dataset with standardization or normalization, and to plot two different graphs that facilitate the choice of the number of clusters.
- evaluation: This section shows all the experiments carried out in the current session, with a recap of their settings and performances and the possibility to save and load them as text files. To evaluate the quality of the experiments we calculate two indexes and compute comparisons among experiments on the same dataset. The indexes are the internal metrics Silhouette coefficient and Davies-Bouldin index. To directly compare the clusters formed by two or more experiments, we compute a score [4] that evaluates how many pairs of data points are grouped together in all of the experiments or in none of them. Every experiment completed in the current session can be stored in a text file, and experiments saved in previous sessions can be loaded into the plugin and shown in the evaluation section along with the others.

One of the major challenges during development has been supporting most of the functionalities on large datasets as well, both in terms of number of samples and number of dimensions. To achieve this, we implemented algorithm variants with good time complexities, as in the case of entropy with sampling and K-Means. Moreover, for all the data storage and manipulation done in the system, we use the data structures and functions provided by the pandas and NumPy libraries to guarantee high performance. Another important objective of the research is the accessibility and ease of use of the plugin, since general GIS users often lack a machine learning and computer science background. To guarantee this, the User Interface is simple and self-explanatory, and each section contains a brief guide explaining all the functionalities. Furthermore, some algorithm parameters that cannot be modified via the interface are stored in an external configuration file and can be modified there; this is done to avoid confusing less experienced users. Along with the implementation, the research includes a considerable experimental phase, both during and after development. This phase is essential to highlight both the potential of the plugin and its limitations in real-world scenarios. The bulk of the experiments is conducted on data about the city of Milan, describing socio-demographic, urban, and climatic characteristics at different granularities (ranging from less than 100 data points to almost 70,000, and with a large number of numerical attributes, up to 109). Overall, the experimental phase demonstrates the flexibility of the plugin and outlines possibilities for future developments, which can also be provided by the QGIS community, given the open-source nature of the project.
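As a rough illustration of the quasi-constant filter described above (not the plugin's actual code), the sketch below computes the two NearZeroVar()-style parameters for each column of a pandas DataFrame; the cutoffs 95/5 and 10% mirror Caret's commonly documented defaults and are used here only as assumptions.

```python
# Rough sketch (not the plugin's code): flag quasi-constant columns using the
# two NearZeroVar()-style parameters described in the abstract.
import pandas as pd

def near_zero_variance(df, freq_cut=95 / 5, unique_cut=10.0):
    """Return columns whose top-two value-frequency ratio exceeds freq_cut and
    whose percentage of unique values is below unique_cut."""
    flagged = []
    for col in df.columns:
        counts = df[col].value_counts()
        freq_ratio = counts.iloc[0] / counts.iloc[1] if len(counts) > 1 else float("inf")
        pct_unique = 100.0 * df[col].nunique() / len(df)
        if freq_ratio > freq_cut and pct_unique < unique_cut:
            flagged.append(col)
    return flagged
```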
Transcript: English (auto-generated)
Hello everyone, I am Andrea Foligni, a research fellow at Politecnico di Milano, and my presentation is about Cluster Analysis, a comprehensive and versatile plug-in for QGIS for attribute-based clustering on geospatial data. The application of unsupervised machine learning to geospatial data is important to uncover hidden patterns and to support data exploration in multiple fields, such as urban planning or anomaly detection for natural disasters. The integration of these methods with GIS software allows the processes to be automated and also allows a wider range of users to access these methods. During the presentation we will see a bit of background theory on clustering and some related topics, then a detailed explanation of the functionalities of the plug-in, and finally a couple of simple example use cases.
When dealing with machine learning on geospatial data there are some particular challenges that we have to take into account: the large size of the data sets, which can slow down the algorithms; the poor quality of the data, which is why we should always perform some kind of data cleaning before running our analysis; and the large number of dimensions or attributes, which moves the problem into a high-dimensional space and, especially in clustering, can reduce the performance of the algorithms. For these reasons, the goals during the development of our plug-in were, first of all, to develop a tool that completely covers the clustering process, not only the application of a single algorithm but everything from data cleaning to the evaluation of the obtained results; then, to provide flexibility with respect to both the size and the kind of data we are using; and finally, to guarantee that the software is easy to use for every user, regardless of their experience with machine learning or GIS software.
Clustering is the task of separating a population of data points into multiple groups such that points in the same cluster are similar to each other and points in different clusters are far from each other, based on some kind of similarity measure. There are different algorithms to perform this task; some of the most common are K-Means, hierarchical clustering, DBSCAN, and so on, and each of them has its own advantages and disadvantages depending on the situation. Another topic closely related to clustering is feature selection, which is the process of selecting the attributes to use during the analysis. This is important both to reduce the total dimensionality of the data set, and so achieve better execution times and performance, and to use only those attributes that best separate the data points. Another choice that we have to make before running a clustering algorithm is the number of clusters we want to divide our population into. Unless we already know the target number, this is a difficult choice, which is usually made with graphical methods. Here we can see a dendrogram of hierarchical clustering, which shows the entire hierarchy of clusters, how they are formed, and at which distance they are merged; the other graph shows the within-cluster sum of squares (WSS) and between-cluster sum of squares (BSS) trends, two indexes that show, respectively, how dense each cluster is and how far apart different clusters are.
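As a rough illustration of these graphical methods, the following sketch builds a dendrogram and a WSS ("elbow") curve for a small numeric table; it uses SciPy and scikit-learn directly rather than the plugin's own code, so the data and parameters here are illustrative assumptions.

```python
# Illustrative sketch (not the plugin's code): dendrogram and WSS curve
# to help choose the number of clusters for a small numeric dataset.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # stand-in for the layer's numeric attributes

# Dendrogram: full merge hierarchy of agglomerative clustering (Ward linkage).
plt.figure()
dendrogram(linkage(X, method="ward"))
plt.title("Dendrogram (Ward linkage)")

# WSS ("elbow") curve: within-cluster sum of squares for k = 1..10.
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]
plt.figure()
plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```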
After running a clustering algorithm we obtain cluster labels, and we have to interpret these results to understand whether the clusters are well formed. There are mostly two ways to do so. The first one is internal evaluation, usually an index that expresses how close the data points within the same cluster are to each other and how far apart data points from different clusters are; this is the fastest way to perform evaluation and is also easy to interpret. The other one is called external evaluation, which compares the obtained clusters with a gold-standard classification obtained before the analysis; this is usually less common, since it is not easy to have a classification of the data points: it is both time consuming and requires expertise.
As for existing tools for clustering in GIS software, there are some solutions for both paid and free software, such as QGIS and ArcGIS Pro, but all of them lack some functionalities to support the users during the entire process. Our plugin, as I said, is developed for QGIS and written in Python. It is obviously open source and available in the official QGIS plugin repository, so anyone can download the code and change it as they wish. It is applicable to any vector file format and to the numerical attributes of these layers.
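To make "numerical attributes of a vector layer" concrete, here is a minimal, hypothetical sketch of how such attributes could be pulled from a loaded layer into a pandas DataFrame using standard PyQGIS calls (it assumes the QGIS Python environment); the layer name is a placeholder, and this is not the plugin's actual code.

```python
# Minimal sketch (not the plugin's code): collect the numeric attributes of a
# loaded vector layer into a pandas DataFrame for clustering.
import pandas as pd
from qgis.core import QgsProject

layer = QgsProject.instance().mapLayersByName("my_layer")[0]   # placeholder layer name
numeric_fields = [f.name() for f in layer.fields() if f.isNumeric()]

rows = [[feat[name] for name in numeric_fields] for feat in layer.getFeatures()]
df = pd.DataFrame(rows, columns=numeric_fields)
print(df.describe())
```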
The plugin is composed of three main sections: the first one for feature cleaning, the second for clustering, and the last one for evaluation. All of the sections are developed independently, so that users can decide which functionalities to use at any time and are not bound to a specific process they have to follow. To avoid any confusion in the following sections: when I refer to features, I mean the columns or attributes of the data table, and not the rows, which is what features are called in QGIS.
The first section, feature cleaning, aims to reduce the total dimensionality of the dataset by dropping those features that provide little or no benefit to the analysis. There are three filters in this section. The first one removes highly correlated features, that is, attributes that present a similar trend and are strictly related to each other; from a group of such attributes we can usually keep only one, and the user can define the threshold for the correlation as well as the criterion used to choose which feature to keep. The second filter removes constant features, attributes that contain the same unique value for all the rows; this kind of attribute is obviously not useful to separate our data points. The last one is similar and targets quasi-constant features, attributes that present only a few outliers differing from an otherwise unique value; again, these attributes are not very good at dividing the data points, unless we specifically want to find those outliers.
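As an illustration of the correlation filter just described, the sketch below drops one feature from every pair whose absolute Pearson correlation exceeds a user-defined threshold; the threshold value and the keep-the-first-column criterion are arbitrary choices for the example, not necessarily the plugin's defaults.

```python
# Illustrative sketch: drop features whose absolute pairwise correlation
# exceeds a user-defined threshold, keeping the first feature of each pair.
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```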
The second section is the main one and contains the algorithms for clustering and feature selection. Feature selection can be done manually, by selecting the numerical attributes of the layer, or automatically, with an entropy-based feature selection algorithm that we implemented, which ranks all of the features based on their ability to separate the data points and returns only the best ones. This algorithm is implemented in two different versions: one uses all of the data points at once, which is very time consuming and not advised on data sets larger than a few hundred points, while the second exploits random sampling to speed up the computation and reduce the time complexity, at the cost of a slightly worse feature selection; this makes automatic feature selection usable on data sets of any size.
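The exact ranking procedure is not shown in the talk, so the following is only a rough sketch of a common entropy-based criterion (in the spirit of Dash and Liu's method): a feature is scored by the entropy of pairwise similarities when it is removed, and a random sample of points keeps the pairwise computation affordable. The function names, formula, and sample size are assumptions for illustration, not the plugin's implementation.

```python
# Rough sketch (assumptions, not the plugin's code): entropy-based feature
# ranking with random sampling.
import numpy as np
from scipy.spatial.distance import pdist

def similarity_entropy(X):
    """Entropy of pairwise similarities S = exp(-alpha * d); low entropy
    suggests a well-separated cluster structure."""
    d = pdist(X)                              # condensed pairwise Euclidean distances
    alpha = -np.log(0.5) / d.mean()
    s = np.clip(np.exp(-alpha * d), 1e-12, 1 - 1e-12)
    return float(-np.sum(s * np.log(s) + (1 - s) * np.log(1 - s)))

def rank_features(X, sample_size=500, seed=0):
    """Score each feature by the entropy of the data with that feature removed:
    the larger the entropy after removal, the more structure the feature carried."""
    rng = np.random.default_rng(seed)
    if X.shape[0] > sample_size:              # random sampling keeps pdist affordable
        X = X[rng.choice(X.shape[0], sample_size, replace=False)]
    scores = {f: similarity_entropy(np.delete(X, f, axis=1))
              for f in range(X.shape[1])}
    return sorted(scores, key=scores.get, reverse=True)   # most important first
```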
As for the clustering algorithms implemented, we only have two at the moment, K-Means and agglomerative hierarchical clustering, which are two of the most common clustering algorithms, and each has its own advantages and disadvantages: for example, K-Means is faster and more advisable on large data sets, while with hierarchical clustering we can visualize the entire hierarchy of the clusters, so we get a better understanding of how the process works. We can then plot the graphs that we saw before: the dendrogram is only available for hierarchical clustering, while the WSS and BSS trends are available for both algorithms. Finally, in this section we have the possibility to scale our data with normalization or standardization; this is important when the features of a data set have different units or different scales, since those could lead to some features being over- or under-weighted.
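A minimal sketch of this scale-then-cluster step, using scikit-learn directly rather than the plugin's interface; the choice of standardization and of four clusters is arbitrary here, and the data is a random stand-in.

```python
# Minimal sketch: scale the selected attributes, then run the two
# clustering algorithms offered by the plugin (K-Means and agglomerative).
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.random.default_rng(0).normal(size=(1000, 5))   # stand-in for layer attributes

X_std = StandardScaler().fit_transform(X)      # standardization (zero mean, unit variance)
# X_nrm = MinMaxScaler().fit_transform(X)      # or normalization to the [0, 1] range

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_std)
hier_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X_std)
```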
Once the clustering is done, a new numerical attribute is added to our layer, so that we can visualize the results on the QGIS map, and the experiment is also added to the last section, the evaluation one. This section contains two indexes for internal evaluation, the silhouette score and the Davies-Bouldin index, which are basically just a number and really easy to interpret. In this section we also have the possibility to compare different experiments on the same data set; this comparison is done by calculating a score that tells us how close two clusterings are to each other. Here we can also save all of our current experiments in a text file and load previously run experiments back into the plugin.
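The talk does not spell out the comparison formula, so the sketch below pairs the two internal indexes mentioned (both available in scikit-learn) with a simple pair-counting agreement score in the spirit of the description, i.e. the fraction of point pairs grouped together in both clusterings or in neither; treat the agreement function as an illustrative stand-in, not the plugin's exact score.

```python
# Sketch: internal evaluation indexes plus a simple pair-counting agreement
# between two clusterings (illustrative stand-in for the plugin's comparison score).
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

def pair_agreement(labels_a, labels_b):
    """Fraction of point pairs grouped together in both clusterings or in neither."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)          # count each unordered pair once
    return float(np.mean(same_a[iu] == same_b[iu]))

# Usage with the arrays from the previous sketch:
# print(silhouette_score(X_std, kmeans_labels))       # higher is better
# print(davies_bouldin_score(X_std, kmeans_labels))   # lower is better
# print(pair_agreement(kmeans_labels, hier_labels))
```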
As I said before, the goal for the user interface is usability for every user, so we tried to keep it as simple as possible. Each section that I described has its own tab, and all of the tabs have a similar layout: on one side we have the widgets and the parameters for the user, on the right side a brief user guide that explains all of the functionalities, and at the bottom a message section that notifies the user of any error or of the completion of the tasks. We also implemented an external configuration file which contains some of the more technical parameters of the algorithms in the plugin. This choice was made to avoid confusing less experienced users, while keeping the possibility for more experienced users to modify those parameters.
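The format of that configuration file is not described in the talk; as a purely hypothetical illustration of the idea, a plugin can keep advanced parameters in a small INI file and read them with Python's standard configparser, falling back to defaults when the file is absent. The file name, sections, and keys below are invented for the example.

```python
# Hypothetical illustration: advanced parameters kept out of the UI in an INI
# file and read with the standard library; names and defaults are invented.
import configparser

cfg = configparser.ConfigParser()
cfg.read("cluster_analysis.cfg")               # placeholder file name

sample_size = cfg.getint("feature_selection", "sample_size", fallback=500)
max_iter = cfg.getint("kmeans", "max_iter", fallback=300)
print(sample_size, max_iter)
```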
Both during and after the development phase we performed a large number of experiments to analyze the weaknesses and the strengths of our plugin. All of the data used during this phase is about the city of Milan; it ranged greatly in size, with a number of data points from around 100 to almost 70,000, and it also differed a lot in nature, with climatic, building and urban data as well as socio-demographic data. The first example use case is an attempt to separate the city of Milan into different climate zones using automatically selected features; the selected features concern mean temperature, high temperature, mean relative humidity, and maximum wind speed. The spatial resolution is a grid of 100 square meters, and the algorithm we use here is K-Means, given the size of the data set, which is about four thousand data points. We can identify some clusters that follow the morphology of the city: in the south and the west a cluster formed by rural areas, in the center of the city a cluster formed by areas with high-rise buildings, and we can also clearly see one cluster with urban parks, and so on. The second use case is done on socio-demographic data, with manually selected features about the employment and education of young people. Here the goal was to identify the outlier neighborhoods in the city with respect to those attributes; this is why we chose only two clusters, to separate the farthest points, and also why we used hierarchical clustering, since it is more suited to the task of identifying outliers. As we can see from the map, the highlighted clusters are the neighborhoods that are among the most fragile in the city.
At the end of the experimental phase, the plugin worked well on all the types of data and all the use cases that we tried, and all of the functionalities were available for all sizes of data sets, with very few exceptions for the largest ones. As a demonstration of the interest in the topic, the plugin has already been downloaded by around 2,500 users in a few months, and this number is constantly growing. We also identified two main directions for future developments: on one hand, the optimization of the software performance, so that the plugin is usable on even larger data sets; on the other hand, the expansion and improvement of the analysis functionalities, for example by adding new algorithms for feature selection or clustering, or even new sections, such as one for data visualization before running any analysis. Obviously these modifications can also be provided by the entire QGIS community, given the open-source nature of the project. Thank you for your attention, and I will gladly respond to any questions.