Making Sense of the Noise: Integrating Multiple Analyses for Stop and Trip Classification
Formal Metadata
Number of Parts: 351
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/68919 (DOI)
Production Year: 2022
FOSS4G Firenze 2022, talk 323 of 351
Transcript: English (auto-generated)
00:00
Thank you very much for the introduction. When data quality can't be changed, rethinking the status quo of how to make sense of your data can drastically improve your analysis results. And while this is a very generic statement, I'd like to apply it to a very foundational,
00:20
very basic, very first step of mobility research. My name is Robert Spang from the Technical University of Berlin, and I'm usually concerned with topics somewhere between computer science and psychology. As such, I often collaborate in research projects with the University Hospital of Berlin, Charité. I guess since Corona, everyone knows about Charité Berlin.
00:42
And that's how I got into geoscience, because in one of these projects, I was asked to build a solution to measure older adults' mobility in rural areas over several days in a row, multiple times a year. I was looking into GPS trackers and low-end smartphones, and I opted for the latter because they offer more options, more possibilities later on.
01:03
We then took the micrologger Android app as a basis, developed software around it to also record accelerometer readings, and built a backend for our colleagues to collect and access the data in the end. Imagine how happy I was when I was able to say, mission accomplished, here is your
01:23
participants' tracking data: longitude, latitude, timestamp, and a user name. And they were like, sure, happy for me, but I guess they expected something more, namely the typical analysis variables: time out of home, number of visited places per day, maximum distance from home, et cetera.
01:44
That was the point when I realized I needed to switch from being a general-purpose engineer to being a geodata analyst. I decided that, as the very first step, before doing anything else, I would need to find a way to transform my raw samples into a list of significant locations, or stops.
02:01
Most of these variables of interest would then be derivable if I had a way to tell how long people stayed at which place. Now, switching from my older adults to a more tangible example: this is the movement I made through the city of Florence three days prior to this conference. I did a free walking tour, but apparently it did not help me overcome my north-south bias here.
02:23
Now, that's just a bunch of raw positions plotted onto a map. No filtering, no pre-processing, just points in consecutive order. To analyze that and to learn more about my mobility behavior, I would need to find out which places I visited and stayed at for at least some while.
02:43
Because these places tell a lot about what's important to me, what my interests might be, how active I am, and so on. So this is an actual representation of all the stops I made. The color coding is just to distinguish the stops from each other. Transforming this bunch of raw data into a list of stops, each with a position and timestamps, is my main objective
03:03
here. If I had such a list, I could then go ahead and group these places together to know which places are unique, which place I have re-visited before, and so on. Based on these unique places, I could then go deeper and aggregate some data, for example the number of visited places and the time I spent at each place, as sketched below.
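A minimal sketch of that aggregation step, assuming a hypothetical pandas DataFrame of already-detected stops; the column names (place_id, start, end) are mine, not from the talk:

```python
import pandas as pd

# Hypothetical toy example: stops already grouped into unique places.
stops = pd.DataFrame({
    "place_id": ["home", "cafe", "home", "duomo"],
    "start": pd.to_datetime(["2022-08-20 08:00", "2022-08-20 10:30",
                             "2022-08-20 18:00", "2022-08-21 09:15"]),
    "end":   pd.to_datetime(["2022-08-20 10:00", "2022-08-20 17:30",
                             "2022-08-21 08:00", "2022-08-21 11:00"]),
})
stops["duration"] = stops["end"] - stops["start"]

# Per unique place: number of visits and total dwell time.
per_place = stops.groupby("place_id")["duration"].agg(visits="count",
                                                      total_time="sum")
print(per_place.sort_values("total_time", ascending=False))
```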
03:23
And that's pretty insightful. Usually the place where people spend the most time is their home, simply because of sleeping. That's right here, where my Airbnb is. And work is typically the second most visited place; in my case, that's a nice co-working cafe in the south of the city. So how do we actually do this translation from my raw bag of positions into a tidy list of
03:47
significant places? As many of you might know, what we measure when we ask our devices where we are right now is an engineering masterpiece, because our phones do a lot of pre-processing before we actually get to see our location.
04:03
If you pull out your phone right now to see your current location on the map, you'll probably end up with something like this. However, the raw signal your phone actually measured looks much more like this. The interface polishes the measured data nicely, integrating multiple sensors, so you don't end up zipping around like the original raw signal the phone actually
04:23
recorded. And this is because physical signal measurement always records noise. Unfortunately, there is no such thing as a perfect signal. Our raw position samples are scattered around the true position we are actually at. And because identifying these true positions is such a crucial pre-processing step, all
04:43
the other metrics that we will derive from this depend on having the very best accuracy possible for classifying stops and trips from raw data. Okay, open source community, you have never failed me so far. Show me what we've got. Of course, this is a solved problem; many have done this before. Most notably, the work by Ashbrook and Starner in the early 2000s needs to be mentioned here.
05:04
Here visualized by Yi and colleagues, the fundamental idea is simple. You have only two variables, a radius r and a time threshold t. For all points that fall within this time threshold, let's say five minutes, you test whether they all lie within a circle of radius r.
05:22
And if that's the case, then you conclude that all these samples belong together as one stop. If I stick around long enough at one position, I'll record several samples very close to it, so they may well fall into such a circle. And as such, it is decided that all these points belong together and form one stop on the map. Yes, there are other approaches too; especially the work by Petteri Nurmi needs to be mentioned
05:44
here if you want to dig deeper into this. However, most open source libraries still rely on an approach similar to the one described here, most famously MovingPandas and scikit-mobility. I think of them as the most important libraries for mobility analysis in Python.
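To make the idea concrete, here is a deliberately simplified sketch of such a radius-and-time stop detector. It is my own illustration, not the actual MovingPandas or scikit-mobility implementation, and it assumes points given as (timestamp in seconds, x, y) tuples in a metric projection:

```python
from math import dist

def detect_stops(points, r=50.0, t=300.0):
    """points: list of (timestamp_seconds, x, y) in a metric projection.
    Returns (start_time, end_time, (cx, cy)) for every detected stop:
    a run of consecutive points that stays within radius r of its first
    point for at least t seconds. Deliberately simplified."""
    stops, i, n = [], 0, len(points)
    while i < n:
        j = i
        # grow the candidate window while points stay near anchor point i
        while j + 1 < n and dist(points[i][1:], points[j + 1][1:]) <= r:
            j += 1
        if points[j][0] - points[i][0] >= t:           # stayed long enough
            xs = [p[1] for p in points[i:j + 1]]
            ys = [p[2] for p in points[i:j + 1]]
            stops.append((points[i][0], points[j][0],
                          (sum(xs) / len(xs), sum(ys) / len(ys))))
            i = j + 1                                   # continue after the stop
        else:
            i += 1
    return stops
```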
06:04
Both of these libraries rely on this radius-and-time idea to distinguish stops from trips. Now, my personal success with this algorithm was actually a bit mixed. The trackers that we bought are, let's call it, cost-sensitive, so the sensors are not ideal and do record a lot of outliers. And that's why we ended up with a lot of fragmentation.
06:23
Fragmentation in this case means that one stop that should belong together is detected as multiple stops. That's especially troublesome when you are interested in, for example, the number of visited places. And because of that, and to be honest, because I really wanted to test out some ideas, I
06:40
decided to implement my own stop and trip classifier. I found it just so easy to spot these things with my own eyes, right? So how hard could it be to teach a computer to do that? I had several ideas in mind, all about using this entanglement, this geometric shape that the signal noise draws, to solve the stop and trip classification problem.
07:01
With the remaining time, I'd like to discuss a new class of algorithm. I'll walk you through four different ideas for interpreting this signal noise geometrically. Then I'll benchmark each of these ideas to demonstrate how they can work together to form an even stronger decision. And eventually you'll learn where this algorithm lives right now and how to use it yourself. Let's have a look at this sketch of GPS records.
07:23
It's a subset of points recorded while I was on the go. Oh yeah, and this is a good time to point out that all of these ideas have in common that they don't work on the entire data set at once, but rather on a subset. We move through the whole data set in a rolling-window manner: we take a couple of points, and for the next iteration we drop the first point and add one more at the end, and we apply this to all these subsets until we have scored all the points in our data set.
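As a small sketch of that rolling-window idea (the window size of ten samples is an arbitrary choice of mine):

```python
def rolling_windows(samples, size=10):
    """Yield overlapping subsets of consecutive samples, shifted by one
    sample per step, so every window can be scored independently."""
    for start in range(len(samples) - size + 1):
        yield samples[start:start + size]

# e.g. scores = [some_score(window) for window in rolling_windows(points)]
```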
07:42
Okay, good, back to the sketch. Analytically, first, let's sum the distances between each pair of consecutive points; here, let's say they have a combined length of 120 meters. Second, we divide this by the single greatest distance
08:04
between all these points. Here, let's say 105 meters. When we divide these two numbers, we end up with a ratio that's just above one. Okay, let's apply the same idea to another example. Here, we see a similar drawing that was recorded
08:21
while I was just stopping in a shop to grab some coffee. I was not really zipping around like that; that's just my phone having a hard time determining my actual position, because positioning is hard indoors, the signal has noise, and we end up with a drawing like this. Again, let's sum the distances
08:40
and divide the sum by the largest distance between all these points. Here, the ratio of these two numbers is much larger, and as you can see, this method provides a very good discrimination between the two classes of point clouds. The lower the value, the more likely it was that I was just moving. The larger the value, the more likely I was actually stopping or dwelling somewhere in the center
09:03
of all these crowded samples. As you can see, such a geometrical analysis can work very well for our stop and trip detection problem. We named this approach the width-distance-ratio method because that's exactly what it is: it computes the ratio between the width and the travelled distance of these points. Okay, that was the first of the four.
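A minimal sketch of this width-distance-ratio score, assuming the window's points have already been projected to metric (x, y) coordinates:

```python
from itertools import combinations
from math import dist

def width_distance_ratio(points):
    """points: list of (x, y) in metres for one rolling window.
    Ratio of the summed path length to the largest pairwise distance:
    close to 1 while moving, much larger while dwelling in one spot."""
    path_length = sum(dist(a, b) for a, b in zip(points, points[1:]))
    max_spread = max(dist(a, b) for a, b in combinations(points, 2))
    return path_length / max_spread

# Trip-like window (roughly a straight line) -> ratio just above 1
print(width_distance_ratio([(0, 0), (30, 2), (60, -1), (105, 0)]))
# Stop-like window (noise scattered around one spot) -> much larger ratio
print(width_distance_ratio([(0, 0), (8, 5), (-3, 7), (6, -4), (-5, -6)]))
```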
09:21
Now, let's start again with the same trajectory. Here, we compute the angle between each consecutive pair of path segments. We calculate all these angles, note them down, and eventually compute the mean of all the angles
09:40
that we were looking at in our subset. Let's apply the same thing to our stop example. Here again, we calculate all these angles, note them down, and compute the mean. Here we end up with 144 degrees, compared to only 27 degrees in our trip example. So this method, again, provides a single score
10:02
that we can use to distinguish stops from trips. We call this the bearing analysis, bearing as in direction.
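A sketch of this bearing analysis along the same lines, again assuming projected (x, y) points for one window:

```python
from math import atan2, degrees

def mean_turning_angle(points):
    """Mean absolute change of bearing between consecutive path segments,
    in degrees: small while walking roughly straight (trip), large while
    the signal zig-zags around one spot (stop)."""
    bearings = [atan2(b[1] - a[1], b[0] - a[0])
                for a, b in zip(points, points[1:])]
    turns = []
    for b1, b2 in zip(bearings, bearings[1:]):
        d = abs(degrees(b2 - b1)) % 360
        turns.append(min(d, 360 - d))     # fold into the range 0..180
    return sum(turns) / len(turns)
```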
10:21
Third, let's take the average of the first two points and the average of the last two points; this is done to make the method more robust to outliers. Then we compute the distance between these new centers of the first and the last segment. We do exactly the same with our stop example and end up with a much smaller distance. This distance, again, provides a continuous number that we can use to distinguish stops from trips. We call this the start-and-end-distance analysis.
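And a sketch of this start-and-end-distance score, under the same assumptions:

```python
from math import dist

def start_end_distance(points):
    """Distance between the centre of the first two points and the centre
    of the last two points of a window; averaging two points makes the
    score more robust to single outliers. Large values suggest a trip,
    small values a stop."""
    start = ((points[0][0] + points[1][0]) / 2,
             (points[0][1] + points[1][1]) / 2)
    end = ((points[-1][0] + points[-2][0]) / 2,
           (points[-1][1] + points[-2][1]) / 2)
    return dist(start, end)
```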
10:40
And last but not least, the simplest of them all: let's count the intersections of all the path segments. As you can see, in the trip example there are none. However, when we count the intersections in our stop example, we already find five. So counting these intersections alone is, again, a good discriminatory method to distinguish stops from trips. We call this the intersecting-segments analysis.
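A sketch of this intersection count, using a standard orientation test for segment crossings; this is my own illustration of the idea:

```python
def _ccw(a, b, c):
    # positive if a, b, c make a counter-clockwise turn
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def _segments_intersect(p1, p2, p3, p4):
    # proper intersection test for segments p1-p2 and p3-p4
    d1, d2 = _ccw(p3, p4, p1), _ccw(p3, p4, p2)
    d3, d4 = _ccw(p1, p2, p3), _ccw(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def count_intersections(points):
    """points: list of (x, y) for one window. Number of crossings between
    non-adjacent path segments: around zero for trips, larger for stops."""
    segments = list(zip(points, points[1:]))
    count = 0
    for i in range(len(segments)):
        for j in range(i + 2, len(segments)):   # skip adjacent segments
            if _segments_intersect(*segments[i], *segments[j]):
                count += 1
    return count
```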
11:02
We created a whole algorithm around all of these ideas. Assuming that you record data with your phone, we can use your GPS data and accelerometer data. However, we are not limited to that; if you have another data source, that's perfectly fine. But in case you have both,
11:21
at that stage we align all these sensor data. Next, we score each individual sample using the four different analysis ideas I just described. As depicted here, they are all applied in a rolling manner, so we process only one subset at a time. In addition to that, there are two more aspects we considered
11:40
that I haven't mentioned so far: the analysis of missing data and our so-called motion score. Okay, missing data: basically, we just have some gaps in the recorded samples, mostly of course because our tracking device can't compute a position, or only with an accuracy below a certain threshold. That depends a little bit on the settings of your app when you record the data; we usually set that to 25 meters.
12:02
And especially indoors, these systems always have a hard time finding your position. Right now in this building, for example, I don't get any GPS signal at all. Think of it like this: you walk towards a building, you have a perfect signal, you record position data. Then you enter the building, you have no signal,
12:20
you don't record anything for, say, the eight hours you work there, and when you leave the building again you have a perfect signal again, so you record positions again. Now, the data that you actually recorded is just the path to the building and the path from the building, with a gap of eight hours between the last and the first part. We can interpret that and say, okay,
12:40
let's look at the last sample we recorded before the gap and the first sample just after the gap. If they are close together, then it's quite likely that we were actually just stopping and not recording anything because of, you know, buildings. In another example, say you board a plane: you don't record any GPS data either, but the positions of the last and the first sample would be very far apart,
13:01
right? And in this case, it's just inconclusive. You can't decide anything here. Maybe the phone or the tracker ran out of battery or something, we don't know. But we can assume a stop when these two points around the gap are very close together, so that's what we do here.
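A sketch of this gap heuristic; the thresholds here are illustrative choices of mine, not the values from the paper:

```python
from math import dist

def classify_gap(last_before, first_after, max_jump=100.0, min_gap=300.0):
    """last_before / first_after: (timestamp_seconds, x, y) around a
    recording gap. Returns 'stop' if the device reappears close to where
    it disappeared, otherwise 'inconclusive' (plane ride, dead battery)."""
    gap_seconds = first_after[0] - last_before[0]
    if gap_seconds < min_gap:
        return "no gap"
    jump = dist(last_before[1:], first_after[1:])
    return "stop" if jump <= max_jump else "inconclusive"
```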
13:21
Our motion score is kind of similar. That's where our accelerometer data comes in; it captures physical movement. It's basically just a number we derive with a simple formula, look up the paper if you want the details, where we put in all three dimensions of the accelerometer and get one number saying: if I shake this device, I get a large number; if I hold it very still or put it on a table,
13:42
then I get a very small number. I can use this number because the idea is that when you are in transit, when you are moving somewhere, it's unlikely that there is no movement at all, no physical impact on your tracker device. However, if you are sleeping at home and you just put your tracker on the desk, then there is no physical impact at all,
14:01
and that's what we can use this number for. When we see very small numbers, we can say: okay, it's fair to assume that this is a stop. We can't conclude anything from large numbers; that could be walking around in a mall or something, where you're still at one place but you're actually moving. But for these low numbers, we can use it.
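The exact formula is in the paper; purely as an illustration, one common way to condense the three accelerometer axes into such a single "how much was the device moved" number is the spread of the acceleration magnitude over a window:

```python
from math import sqrt

def motion_score(accel_samples):
    """accel_samples: list of (ax, ay, az) readings for one window.
    Spread of the acceleration magnitude: near zero when the device
    rests on a table, large when it is shaken or carried around.
    (Illustration only; the exact formula is described in the paper.)"""
    magnitudes = [sqrt(ax * ax + ay * ay + az * az)
                  for ax, ay, az in accel_samples]
    mean = sum(magnitudes) / len(magnitudes)
    variance = sum((m - mean) ** 2 for m in magnitudes) / len(magnitudes)
    return sqrt(variance)
```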
14:21
Of course, all that is optional. So if your use case doesn't have any accelerometer data, don't worry, you can still use the classifier. Okay, good. Third, our classification. Now we integrate our four geometrical analyses and add the estimates from the missing-data analysis and the motion score into the mix. Then we compare the different results to form a majority decision.
14:41
This way, they can compensate for each other. For example, three methods might lean towards voting that this window looks like a trip, but one method is really confident and says: no, I'm certain that's a stop. Then they can balance each other out. You can think of it as a democratic process if you want.
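A sketch of such a confidence-weighted majority vote; this is an illustrative scheme, not the exact weighting used in the classifier:

```python
def combine_votes(votes):
    """votes: list of (label, confidence) pairs with confidence in 0.5..1.
    Each analysis votes with its margin above chance, so one very
    confident method can outweigh several hesitant ones."""
    weight = {"stop": 0.0, "trip": 0.0}
    for label, confidence in votes:
        weight[label] += confidence - 0.5
    return max(weight, key=weight.get)

# Three methods weakly lean towards 'trip', one is very sure it is a 'stop':
print(combine_votes([("trip", 0.52), ("trip", 0.55),
                     ("trip", 0.51), ("stop", 0.99)]))   # -> 'stop'
```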
15:00
And lastly, we have a module to filter outliers and merge fragmentation so that the final result is as clean as possible. This module, again, uses different scoring methods under the hood to decide which stop to keep and which to get rid of. Eventually, the stop list should be nice and tidy. All of that is what we call the stop-and-go classifier. It uses geometrical analysis of the signal noise.
15:21
It operates on the raw signal and does not need any data pre-processing or filtering. And it combines multiple strategies to form one unified, strong output. Okay, but how well does it perform? Well, to answer this, there was a lightning talk yesterday, just in case you missed it: there's a data set that we recorded solely for this purpose.
15:42
We recorded over 120 days, captured 120,000 GPS samples, and kept a manually labeled movement diary. So we have annotations for this data set telling us which samples were recorded while on the go and which samples were recorded while dwelling somewhere.
16:01
This is the Staga data set. We published it here at FOSS4G, and there's a paper about it if you want to look into this, but basically that's what we used for our analysis. Okay, good. Let's use that to score all four methods I just described. First, our intersecting-segments analysis has a balanced accuracy of 0.90.
16:21
That was the one where we count the intersections. Next is the bearing analysis, the one with the angles, at 0.93 balanced accuracy. The width-distance ratio scores 0.94, and the start-and-end-distance analysis close to 0.95. If we put all of them together, and for me personally that's the most interesting aspect,
16:41
then we take a good step forward of an additional 1.5% of balanced accuracy, to 0.96. And if you put all of the mentioned ideas together, including the motion score and the missing-data analysis,
17:01
then we end up with 0.965. And yes, of course, that's very academic, like optimizing the last 3%, but I would argue the last 5% are always the hardest. So that seems cool, but how good is it actually compared to other libraries? Well, to find out, we compared all of that with MovingPandas and scikit-mobility,
17:20
as they provide classifiers for this as well. Without going into too much detail, the best values are highlighted in green. Of course, we only used the version without the motion score, because the other libraries don't have a function for that and it would be an unfair comparison otherwise. And of course, there are several parameters
17:40
to set all these things up, for the other libraries too. So we tuned all these libraries to get the very best outcome on our data set in order to have an optimal comparison, but that's important to keep in mind: it is a comparison under optimal conditions. Apart from the sample-by-sample analysis, we also looked at the performance
18:01
of our stops and the trips. Here, notably, MovingPandas was able to miss fewer trips. I would argue they find double the number of trips than there actually are, but sure, they miss fewer. And last but not least, looking at the runtime might be interesting here.
18:21
It's just outstanding how quick scikit-mobility is compared to MovingPandas and our stop-and-go classifier. Ours, I guess, still has a lot of headroom to explore; if performance really matters to you, one idea might be changing the Python backend or rewriting everything in C,
18:40
which you would do over the weekend, right? All right, good. So yeah, if there's only one slide to remember, it's this one. First, the library offers a powerful geometrical analysis of signal noise. Second, it combines different approaches to form the best result. And third, if you need to identify stops and trips from raw position data, the thing you want to Google
19:01
is called the stop-and-go classifier. All right, good. So how do you get it? Well, you can scan this code to get to the GitHub repository. There is also a reference to the paper where we describe in detail why, how, and what exactly we did. And there are also some examples to get you started super quickly. Of course, I did not write good documentation,
19:22
as I just learned I should, but I hope the examples still help you. And behind the scenes, Anita Graser, who's the mind behind MovingPandas, and I are in touch, and we are discussing integrating this into MovingPandas. So if you're already using that, which you should, then there's a good chance you'll get to use this in the future.
19:41
Now, I'd like to encourage you to get your hands dirty: record your own GPS data, use this tool to analyze it, and get in touch if you're interested. I'm very much looking forward to discussing this with you. Thank you very much.