Cluster Analysis
Formal Metadata
Number of Parts | 351
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/68979 (DOI)
Production Year | 2022
FOSS4G Firenze 2022, 263 / 351
Transcript: English (auto-generated)
00:00
Hello everyone, I am Andrea Foligni, a research fellow at Politecnico di Milano, and my presentation is about Cluster Analysis, a comprehensive and versatile plugin for QGIS for attribute-based clustering on geospatial data. The application of unsupervised machine learning to geospatial data is important for uncovering hidden patterns in the data and for data exploration
00:25
in multiple fields, such as urban planning or anomaly detection for natural disasters. The integration of these methods with GIS software allows the automation of
00:41
the processes and gives a wider range of users access to these methods. During the presentation we will see a bit of background theory on clustering and some related topics, then a detailed explanation of the functionalities of the plugin
01:00
and finally a couple of simple example use cases. Machine learning on geospatial data poses some particular challenges that we have to take into account, such as the large size of the data sets, which can slow down the algorithms,
01:22
the poor quality of the data, which is why we should always perform some kind of data cleaning before running our analysis, and the large number of dimensions or attributes, which moves the problem into a high-dimensional space and, especially in clustering, can reduce
01:43
the performance of the algorithm. For these reasons, our first goal during the development of the plugin was to build a tool that covers the whole clustering process: not only the application of a single algorithm, but everything from data cleaning to
02:03
the evaluation of the obtained results. We also wanted to provide flexibility with respect to both the size of the data sets and the kind of data being used. Another important point was to guarantee that the software is easy to use for every user,
02:26
regardless of their experience with machine learning or GIS software. Clustering is the task of separating a population of data points into multiple groups such that the points in the same cluster are similar to each other and the points in
02:46
different clusters are far from each other, based on some kind of similarity measure. There are different algorithms to perform this task; some of the most common are k-means, hierarchical clustering and DBSCAN, and each of them has
03:04
its own advantages and disadvantages depending on the situation. A topic closely related to clustering is feature selection, which is the process of selecting the attributes that we want to use during the analysis. This is important both to
03:23
reduce the total dimensionality of the data set, and so achieve better execution time and performance, and to use in the analysis only those attributes that best separate the data points. Another choice that we have to make before running a clustering algorithm is
03:43
the number of clusters into which we want to divide our population. Unless we already know the target number, this is a difficult choice, which is usually made with graphical methods. Here we can see a dendrogram of a hierarchical clustering, which shows the entire
04:04
hierarchy of clusters, how they are formed and at which distance they merge. The other one is a graph that shows the trends of the within sum of squares and the between sum of squares, two indexes that show, respectively, how much the single clusters are
04:29
dense and how far different clusters are from each other. After running a clustering algorithm we obtain some cluster labels, and we have to interpret these results to understand whether the
04:44
clusters are well formed. There are mainly two ways to do so. The first one is internal evaluation, usually an index that expresses how close the data points in the same cluster are to each other and how far apart data points from different clusters are. This is the
05:06
fastest way to perform the evaluation and is also easy to interpret. The other one is called external evaluation, which is a comparison of the obtained clusters with a gold standard
05:20
classification obtained before the analysis. This is less common, since a classification of the data points is not easy to obtain: it is both time-consuming and requires expertise. As for existing tools for clustering in GIS software, there are some
05:45
solutions for both paid and free software, such as QGIS and ArcGIS Pro, but all of them lack some functionalities to support the users during the entire process, and covering that process is the goal of our
06:06
plugin. As I said, it is developed for QGIS, and it is developed in Python. It is obviously open source and completely available in the official QGIS plugin repository,
06:22
so anyone can download the code and change it as they wish. It is applicable to any vector file format and to the numerical attributes of these layers.
06:41
The plugin is composed of three main sections: the first for feature cleaning, the second for clustering and the last for evaluation. The sections are developed independently, so that a user can decide which functionalities to use at any time
07:00
and is not bound to a specific process. To avoid any confusion in the following sections: when I refer to features, I mean the columns or attributes of the data table, and not the rows, as they are called in QGIS.
07:25
The first section, feature cleaning, aims to reduce the total dimensionality of the data set by dropping those features that provide little or no benefit to the analysis. There are three filters in this section. The first one is used to remove highly correlated features,
07:45
which are attributes that present a similar trend and are closely related to each other. From a group of such attributes we can usually keep only one, and the user can define
08:01
the threshold for the correlation and also the criterion used to choose the feature to keep. The second filter can be used to remove constant features, which are attributes that contain the same value for all the rows; this kind of attribute is obviously
08:21
not very useful for separating our data points. The last filter is similar to the previous one and removes quasi-constant features, which are attributes where only one or a few outliers differ from a single dominant value. Again, these attributes do not help to
08:42
divide the data points, unless we specifically want to find the outliers in those attributes. The second section is the main one and contains the algorithms for clustering and feature selection. Feature selection can be done either manually, by selecting the numerical
09:03
attributes of the layer, or automatically, with an entropy-based feature selection algorithm that we implemented, which ranks all of the features based on their ability to separate the data points and returns only the best ones. This
09:23
algorithm is implemented in two different versions. The first uses all of the data points at once, which is obviously very time-consuming and is not advised on data sets larger than a few hundred data points. The second version exploits random sampling to
09:49
speed up the computation and reduce the time complexity, at the cost of a slightly worse feature selection, but this makes automatic feature selection usable on any type of data set.
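The three filters of the feature cleaning section can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the plugin's code: the function names, both default thresholds, and the rule for which correlated feature survives (here simply the first one encountered) are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)


def clean_features(table, corr_threshold=0.9, quasi_constant_share=0.95):
    """Return the names of the columns that survive the three filters.

    `table` maps a column name to its list of numeric values.
    Both thresholds are illustrative defaults, not the plugin's.
    """
    kept = []
    for name, values in table.items():
        # Constant and quasi-constant filters: look at the share of rows
        # taken by the single most frequent value in the column.
        top_share = Counter(values).most_common(1)[0][1] / len(values)
        if top_share >= quasi_constant_share:
            continue  # constant (share == 1.0) or quasi-constant column
        # Correlation filter: drop a column that is highly correlated
        # with one already kept (assumed rule: the first one wins).
        if any(abs(pearson(values, table[k])) >= corr_threshold for k in kept):
            continue
        kept.append(name)
    return kept
```

Called on a table with a mean and a max temperature column that move together, a constant column and an unrelated humidity column, the function would keep only one of the two temperature columns plus the humidity column.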
10:06
As for the clustering algorithms implemented, we only have two at the moment, k-means and agglomerative hierarchical clustering, which are two of the most common clustering algorithms. Both of them have their own advantages and disadvantages: for
10:25
example, k-means is faster and better suited to large data sets, while with hierarchical clustering we can obtain and visualize the entire hierarchy of the clusters, so we have a better understanding of how the clusters are formed. Then we can plot the graphs that we saw
10:45
before: the dendrogram is only available for hierarchical clustering, while the WSS and BSS trends are available for both algorithms. Finally, in this section we have the possibility to scale
11:01
our data with normalization or standardization. This step is important when the data set has features with different units or different scales, since those could lead to some features being overweighted or underweighted. Once the clustering is done,
11:25
a new numerical attribute is added to our layer, so that we can visualize the results on the QGIS map. The experiment is also added to the last section, the evaluation one, which contains two indexes for internal evaluation: the silhouette score and the
11:47
Davies-Bouldin index, each of which is basically just a number and really easy to interpret. In this section we also have the possibility to compare different experiments on the same data
12:01
set. This comparison is done by calculating a score that tells us how close the two clusterings are to each other. Here we can also save our current experiments in a text file and load previously run experiments back into the plugin.
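To make the evaluation section more concrete, here is a minimal plain-Python sketch of the silhouette score, one of the two internal indexes named above, together with a way of comparing two experiments. The talk does not name the comparison score, so the Rand index used here is an assumed stand-in, and both function bodies are illustrative rather than the plugin's implementation.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        # (the point itself adds a distance of 0 to the sum)
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def rand_index(labels_a, labels_b):
    """Share of point pairs on which two clusterings agree (0 to 1).

    Assumed stand-in for the plugin's unnamed comparison score.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

A silhouette close to 1 means dense, well-separated clusters, while values near 0 or below suggest overlapping or badly assigned points. The Rand index ignores the label names themselves, so two experiments that produce the same groups under different numbering still score 1. Libraries such as scikit-learn provide `silhouette_score` and `davies_bouldin_score` for real workloads.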
12:26
As I said before, the goal for the user interface is usability for every user, so we tried to keep it as simple as possible. To do so, each section that I described
12:41
before has its own tab, and all of the tabs have a similar layout. On one side we have the widgets and the parameters for the user, on the right side we have a brief user guide that explains all of the functionalities, and at the bottom we have a message section to notify
13:06
the users of any error or of the completion of the tasks. We also implemented an external configuration file, which contains some of the most technical parameters for the algorithms in the plugin. This choice was made to avoid any confusion among less experienced users
13:30
while still allowing more experienced users to modify those parameters. Both during and after the development phase we performed a large number of experiments to
13:47
analyze the weaknesses and the strengths of our plugin. All of the data that we used during this phase was from the city of Milan, and it ranged greatly in size, with a number of data points
14:01
from around 100 to almost 70 thousand. It also differed a lot in its nature, with climatic data, building and urban data, and socio-demographic data. The first use case example that we have is an attempt to separate the city of Milan
14:26
into different climate zones using automatically selected features. The selected features are mean temperature, high temperature, mean relative humidity and max wind speed.
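The scaling and k-means steps described for the clustering section can be sketched on four such features per grid cell. This is a plain-Python illustration under stated assumptions, not the plugin's code: the cell values are invented, the function names are hypothetical, and the centroids are seeded with a simple deterministic farthest-first rule instead of the random or k-means++ seeding a library would normally use.

```python
from math import dist  # Euclidean distance, Python 3.8+


def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        (sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0  # avoid /0
        for c, m in zip(columns, means)
    ]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows
    ]


def k_means(rows, k, iterations=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Assumed seeding: start from the first row, then repeatedly add the
    # row farthest from the centroids chosen so far.
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(
            max(rows, key=lambda r: min(dist(r, c) for c in centroids))
        )
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: dist(row, centroids[c])) for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


def within_sum_of_squares(rows, labels, centroids):
    """WSS: squared distance of every point to its own cluster centroid."""
    return sum(dist(row, centroids[lab]) ** 2 for row, lab in zip(rows, labels))


# Invented cells: (mean temperature, high temperature, humidity, wind speed)
cells = [
    (24.1, 31.0, 55.0, 3.2), (24.3, 31.4, 54.0, 3.0), (24.0, 30.8, 56.0, 3.1),
    (21.0, 27.5, 70.0, 5.5), (21.2, 27.9, 69.0, 5.8), (20.9, 27.2, 71.0, 5.6),
]
scaled = standardize(cells)
for k in (1, 2, 3):
    labels, centroids = k_means(scaled, k)
    print(k, labels, round(within_sum_of_squares(scaled, labels, centroids), 3))
```

Printing the WSS for increasing values of k gives the kind of trend used to choose the number of clusters. Since the total sum of squares is fixed for a given data set, the BSS can be obtained as the total minus the WSS.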
14:46
The spatial resolution is a grid of 100 square meters, and the algorithm we used here is k-means, given the size of the data set, which is about four thousand data points. We can
15:01
identify some clusters that follow the morphology of the city: in the south and the west we can see a cluster formed by rural areas, while in the center of the city we have a cluster formed by areas with high-rise buildings, and we can also clearly see one cluster with urban parks
15:25
and so on. The second use case is based on socio-demographic data, and in particular on manually selected features about the employment and education of young people.
15:43
Here the goal was to identify the outlier neighborhoods in the city with respect to those attributes. This is why we chose to select only two clusters, to separate the farthest points, and also the reason why we used hierarchical clustering, since it is better suited
16:04
for the task of identifying outliers. As we can see from the map, the highlighted clusters are the neighborhoods that are among the most fragile in the city. At the end of
16:24
the experimental phase, the plugin worked well on each type of data and on all the use cases that we tried, and all of the functionalities were available on all sizes of data sets, with very few exceptions for the largest data sets. As a
16:47
demonstration of the interest in the topic, the plugin has already been downloaded by around 2500 users in a few months, and the number is constantly growing. We also identified
17:05
two main paths for future development. On one hand we have the optimization of the software performance, so that the plugin is usable on even larger data sets, while on the other hand we have the expansion and improvement of the analysis functionalities,
17:25
for example by adding new algorithms for feature selection or clustering, or even new sections, such as one for data visualization before running any analysis. Obviously these modifications can be contributed by the entire QGIS community, given the open source
17:47
nature of the project. Thank you for your attention; I will gladly respond to any questions.