
Analysing spatial data with R (ECS530)


Formal Metadata

Title
Analysing spatial data with R (ECS530)
Subtitle
Norwegian School of Economics PhD course
License
CC Attribution - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Transcript (English, auto-generated)
The course has been run before: the first course which was based on R-spatial was organized by political scientists in Trondheim in the summer of 2004. That was very early, and subsequently I've done other doctoral courses or tutorials, 10 years ago or 15 years ago. Fifteen years ago is Trondheim, summer of 2004, for political scientists, organized in collaboration with the Peace Research Institute Oslo and studies of conflict. This was at the time when we had no idea that anybody used the software we were writing.
So what I'll be doing initially, for the first half hour or so (I'll go through a schedule in a moment), is to talk about the background for where we are, and where these two books came from: the first edition of our book from 2008, which is quite like the original course, then the second edition of the book from 2013. About half an hour into this morning I'll be explaining why we haven't got a third edition of the book, which is that the world has changed and we are under contract for a different book. Virgilio, who's the co-author of this one, has several books on Bayesian statistics which have just been published, and Edzer Pebesma and I are under contract for a different book, but it's not ready yet and we're not quite sure where we are. So you're in a slightly difficult place, because those who've taken responsibility for making things happen over the last 15 years are somewhat uncertain about where we are and where we should be going. But that will develop as it goes. That's with regard to data representation and data handling. With regard to analysis, I'll be talking to you, once we're off-streaming later on, about the choice of topics towards the end of the week. At the moment the end of the week is a bit uncertain, and I need feedback from you about that.
The format for all of the talks is identical, so you get a copyright notice: CC-BY-SA. I'm running current R 3.6.1 on Fedora 31, so on Linux. These are the packages which you need to have installed if you want to run the code for this morning, and the code can be downloaded from this link on GitHub: a zip file which contains the R script. In some cases there's just the R script; in some cases there's the R script and some data files.
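As a sketch only (the authoritative list is in the script downloaded from the GitHub link), installing the packages referred to during this session might look like this:

```r
## A sketch, not the course's definitive list: these are the CRAN packages
## referred to during this session.
pkgs <- c("sf", "sp", "rgdal", "rgeos", "maptools", "spatstat", "osmdata",
          "stplanr", "tmap", "mapview", "raster", "elevatr", "gstat",
          "spdep", "spatialreg")
inst <- pkgs %in% rownames(installed.packages())
if (any(!inst)) install.packages(pkgs[!inst])
```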
The schedule for the whole course: today we're looking at spatial data representation, which is something you may find unnecessary ("please just let us get to work"). Sorry, we've got to look at data representation: it saves time in the longer run. You may know your data at the moment, but in two years' time you'll be on a different project and the data will be different. Data representation matters, and in particular the changes which are occurring in data representation. We'll go through to 11 o'clock, turn off the streaming, then we can talk a bit; first, say, half an hour, if I haven't keeled over; then we take a lunch break, then we can interact a little more before one o'clock, then start again with streaming at 13:15 and look at support, topology, and input/output. Those are the other elements. Tomorrow morning we look at coordinate reference systems, and in the afternoon visualization; for the visualization I'd like you to be fairly active and try things out, since a lot has been changing, certainly since the books were written, and much of it is quite exciting. On Wednesday morning (and this is up for discussion), if nobody needs interfaces with GIS, then we can choose something else. From then on it will be analysis; depending on what your needs for analysis are, we can substitute topics. I've said project surgery on Wednesday after lunch: that means that I won't be talking to you, but you can talk to me about your project; we'll be here in the classroom and we can do that. The same on Friday after lunch, with a presentation towards the end of the afternoon. Then on Saturday, presentations for those of you who haven't presented on Friday, and then we'll be done. I was a little bit worried about my first line in the outline: why break stuff, and then why not?
There are some people who provide open source software who really enjoy breaking stuff. I don't. But sometimes you have to. So what I'll be doing in data representation, in input/output, in visualization, and particularly in coordinate reference systems, is talking about things which have now outlived their usefulness but still have many uses, which means: how do we manage a process of migration to more modern, robust and sustainable representations of the data? It's also useful to know that there's good communication between the various communities, and this includes the Python community. The Python community is also lively, and I can provide links, if someone's interested, to an R/Python workshop that I did in September in Luxembourg. There are new opportunities for visualization, particularly the tmap package and mapview, and there are challenges regarding the upstream software libraries, and I'll be talking about these. And I'll go on to talk about spatial weights, spatial autocorrelation and spatial regression, if that's something you find useful; I can talk about that because I'm responsible for those packages, spdep and spatialreg. And then we can choose other things to do later on as well. So this morning, and we're now at 9:15 to 9:45: check Slido. I have confirmation from outside that streaming is running, so at least I can relax about that. Thank you, Lorenzo.
We haven't really worked out a way to get feedback, but Slido was one of the possibilities; it may be that some questions come in through Slido as well, but for the time being this will be more or less straight provision of information from my side. If any concepts I'm using are unknown to you, please put up your hand and say so. At the moment I'm stuck completely inside a particular problem of data representation, and I find it quite difficult to get out of it, so it may be that what I say to you, even though I'm speaking English, does not make any sense. If that is the case, put your hands up and say: could you repeat that in English, please? In which case I'll do my best. And the fact that I'm at one battery bar rather than five means that I may actually get lost in my own thoughts; if I do, wake me up.
Now, if we step back not 15 years, not even 20 years, but 25 years: many of us, or many of a small group of people, were teaching courses in spatial analysis. Edzer Pebesma's work was in particular in geostatistics, writing an excellent standalone program, gstat, for geostatistical analysis. Others were working in the same field, the same area; Albrecht Gebhardt was teaching in Klagenfurt, and other people were trying to find software tools to teach with outside GIS. It's not that GIS was a bad thing, but most geographical information systems did not really provide the tools that you needed to do analysis. Partly this concerned geostatistical analysis, partly point patterns; originally, in spatial point pattern analysis, there was the S-PLUS package splancs ("sp" plus "lancs", because it was written at Lancaster University), and so on. There were people who needed software to do analysis and to do teaching, and it was also an advantage if the licence fees for the teaching software were not unreasonable, as most universities had limited budgets for software, and going through the departmental and faculty committees to try to get more money to buy more software was difficult. Gstat was open source from the beginning; splancs was made available free to licence holders of S-PLUS; so there was a community working there, disparate individuals who maybe knew each other.
It was possible to write and share scripts for ARC/INFO in AML, or for ArcView in Visual Basic, for ArcGIS; site licences, dongles. I still have a postcard from a French ecologist by my screen, from a long time ago, 14 or 15 years ago, saying: I'm sitting here on an island in a river in Tibet with my students; our batteries are still running; we can do the analysis on site because we have open source software. Otherwise you had to have a dongle attached to the parallel port of your laptop, so working in the field was clunky if you had to have licences. And in 2005, on an island in Tibet, you didn't have an internet connection in any case, so you couldn't run ARC/INFO with the licence fed from a licence manager over the internet. There was a practical problem in teaching and field work which could be resolved by open source software. Site licences, dongles: nowadays they seem completely... I mean, why would anybody need anything like that, everything's open source? But then it wasn't.
From late 1996, R became a viable alternative for teaching, and several of us started saying: okay, why don't we try this out? R uses much of the syntax of the S language (we began by using it), which was then commercially available; quite a lot of universities did have site licences, and they were managed in a different way to the GIS licences. R is licensed under the GNU General Public License version 2, and remains so: it's free to use wherever the user wishes, free to use, extend and distribute, with modifications to the code restricted by the GPL, so that you can't sell a modified version without contributing back the modifications you've made. This is slightly different from the Python ethos, but it's the ethos that R has. In 1997 or 1998, maybe the end of 1996, a spatial stats module was made available for S-PLUS, but it was still licensed: you still needed an S-PLUS licence, and a licence for that module.
There was also a meeting in Leicester in the UK, where quite a number of people working on exploratory spatial data analysis met and discussed all kinds of things: could we even use Tcl, a scripting language of the time? The porting of S code to R was begun by Albrecht Gebhardt, because he needed it for teaching, and was made available as soon as the R package mechanism matured, which is 20 years ago. Porting mattered because, if you'd been running a course using S-PLUS, the libraries (in S-PLUS they're called libraries) sgeostat, splancs and so on were available; you could teach with them. So you can imagine that you're dealing with what we would now call master's courses, or doctoral courses, giving introductions to applied spatial data analysis. There is a separate book by Bailey and Gatrell on interactive spatial data analysis from the mid-1990s, distributed with a diskette which you had to run on DOS; it wouldn't run on Windows, because the mouse didn't work: they had a special mouse driver. Everybody who was teaching this kind of stuff was trying to work out how on earth we could get it to work and distribute it to our students, without them having to go around with these half-kilo dongles (it wasn't half a kilo, maybe 100 grams, but still).
So the first packages on the Comprehensive R Archive Network from 1998 were ported by Albrecht Gebhardt: tripack and akima, both available within S-PLUS, unfortunately on non-open-source licences, but needed by people doing spatial work, followed by ash and sgeostat six months later, about a semester later. You could see the clock of the semesters ticking: Albrecht was getting stuff out which matched the semesters. Albrecht also helped port parts of the spatial package, which is part of Modern Applied Statistics with S.
From the very beginning the administrators were very helpful. Albrecht and I were in contact, and we did a seriously mismanaged talk at the regional science conference in Vienna in 1998: we had a 20-minute slot and we used 45. But people were tolerant in those times; you didn't have people who would turn off your beamer, because actually they were interested, they were sympathetic, and they gave us some feedback. So we were able to show that in some cases, with packages which you had to download from our own FTP sites, you could do the teaching you needed to do, and if you needed to do research in the field you could do that as well.
I've already mentioned the S-PLUS version of splancs; I'd contacted Barry Rowlingson in 1997, but the port only moved forward, because of the amount of Fortran involved, in September 1998. At that stage we'd started realizing that there was an issue, because we had one implementation of Ripley's K test for spatial randomness (call it that; it's not quite that, but it's for point patterns) in the spatial package, and we had another one in the splancs package, and it would be really nice to be able to confirm that both of them came to the same results from the same data. Okay, we've got a standard test: are the two implementations the same or different? This was a question which arose very early. So there's this quote from an email: "An issue I thought about a little is whether at some stage Albrecht and I wouldn't integrate or harmonize the points and pairs objects in splancs, spatial and sgeostat; they aren't the same, but for users maybe they were to appear to be so."
So we were thinking about shared classes for representing spatial data, and it turned out this was quite fruitful. I stepped aside a little and worked with Markus Neteler on an interface to the GRASS GIS. GRASS GIS was originally public domain and became open source in the mid-1990s. It was written by the US Army, for the reasonably un-military purpose of monitoring erosion on army ranges: if you drove your tanks along the contours of a hill, they created fewer gullies than if you drove them down the hill, so they were interested in modelling the erosion caused by army exercises on an army range in the middle of the United States. So GRASS existed, and still exists; version 7.8.2 is now considered almost ready to be released, and the release candidate was published last night. I was working on interfacing R and GRASS, and using R to analyse quite large raster data sets from GRASS at that point in time, classifying landscape types.
The interfaces both to GRASS and to other GIS have evolved, and if you want me to talk more about this, we can look at that on Wednesday morning. I'll already draw your attention to this book. If you need help ("I don't know what to do, I'm new to this"), this is the book to go to. It's not the third edition of our book, although Robin Lovelace wrote a very well thought through review of the second edition. He knew the book before, but he was thinking: how do you teach people this stuff towards the end of the 2010s? And he was saying: this isn't going to work. The reviews of the first edition were: okay, this is tough stuff, but if I have a doctoral student who has to do it, then I give him this and tell him to chew until he's chewed all the way through. In the 2000s, people were expected to have considerable determination and independence of thought, and if they didn't understand a paragraph they were expected to reread it, successively, until they had understood it. In the 2010s this is not so much the case, and people are perhaps more expected to be stroked. So I'm not saying that this book isn't accurate and concise, but it reads differently. It's also available online in addition to the printed edition; it was written as a bookdown project, and the link is provided in the references at the end of the slides. So if you're lost at the moment, this book will be your friend.
Going back to 20 years ago, working on different packages on the archive network: the CRAN team, Kurt Hornik and Fritz Leisch in Vienna, said we're going to have a meeting for R people in March 2001, and can you come and give a talk about the GRASS interface? Okay, so it's a bit scary. I'd done another scary thing in 1986, when I went to the European UNIX user group meeting to talk about spatial data. I'd sent an abstract thinking they're not going to be interested in this, but they put me in with three or four hundred people, including the NeXT developers. NeXT was what came before the new Mac; it was where Steve Jobs went when he wasn't friends with Apple, and they were seriously intense people. I talked to some of them after my talk, and they were actually quite interested in what was going on, what people were doing with spatial data, why UNIX was useful for this, and the modular approach to writing software. That was scary. But then walking into this room of people, about whom the only things I knew were that they were all statisticians (I'm a geographer) and that they'd all written a good deal of software which I used on a daily basis: they were very kind, polite, interested even, and so on. That was fun, and they were extremely helpful to talk to, because you got all kinds of hints: had you thought of doing it that way, did you try that out, and so on. The community was feeding back, and the mailing list was very useful as well, so I knew all of them from the mailing list. We were fewer than 70 at this meeting in Vienna.
So, unique insights? Yes, definitely. A bit later the same year I was asked to go to Santa Barbara, to a workshop on software for spatial data, by Luc Anselin and Serge Rey. Serge Rey is most involved in Python development, and Luc Anselin has been more involved in coordinating standalone programs: at the time SpaceStat, which unfortunately escaped his control, and subsequently GeoDa. I was continuing to work on spatial econometrics as a narrower field. During the second half of 2002 it seemed sensible to try to do something at the next R meeting in Vienna.
The next R meeting was being organized, and I'd been asked: could you do a paper session on spatial statistics? Okay, we can send out some emails, get some people to submit papers, and we'll see how that goes. I also thought that maybe we should have a workshop to discuss classes for spatial data. I'd contacted Edzer Pebesma because of his work with gstat, and coincidentally (it's the end of 2002) he'd at the time been approached by a Netherlands environmental agency to write an interface between gstat and S-PLUS. So, from an email from Edzer in November 2002, which is now 17 years ago, he mentioned: "perhaps I should, I wonder whether I should, start writing S classes. I'm afraid I should." I'm not sure whether he's grateful for the insight, but since then he has done lots of other things, and certainly the engagement with classes has been very fruitful. Virgilio Gómez-Rubio, my third co-author on the book, had developed two packages, RArcInfo for interfacing old ArcInfo vector formats and DCluster for disease clustering,
and was also committed to coming to that meeting. Other people wanted to come to the meeting, or did come, like Markus Neteler, Albrecht Gebhardt, Manfred Fischer and a number of others; we were about a dozen. Nicholas Lewin-Koh, who had made contributions to maptools, also said more or less the same kinds of things: "There's a lot of duplication of effort, I did notice after looking through people's packages. My suggestion would be to set up a repository for spatial packages" (which we did, on SourceForge) "similar to the Bioconductor mode, with the base spatial packages using S4, that was then new-style, classes and methods, which are efficient in general." So we had, if you like, a mandate before we met, or around the time that we met, to do this.
After the workshop we set up a collective repository on SourceForge, set up the R-sig-geo mailing list (there are still three and a half thousand subscribers), and that was the beginning of the sp package. So we had a mandate for the development of the sp package. We then met for coding meetings in Lancaster with Barry Rowlingson in 2004, and with Virgilio in Valencia in 2005, and we got sp onto the Comprehensive R Archive Network in April 2005. What we'd done in writing the sp package, a package containing definitions of classes for spatial data, was to use new-style class representations for spatial objects, whether they were raster or vector, which should behave like data.frame objects; and in the sp package we also included visualization methods, to make it easy to show those objects. (Checking to see whether there are any further comments.)
One of the things that we were clear about was that we didn't oblige the authors of any other package to use those classes. If they wanted to use them, they could; if they didn't, we would provide coercion methods, that is, a way of converting, say, a SpatialPoints object into a ppp object for point pattern analysis, so that you could move freely between the representations. If a package didn't want to adopt the sp representation of the data: fine, great, not a problem. We did talk about whether we should do this, and very early on we said that we're not sure our representation is the one-size-fits-all, so we won't go out and assert it in that way. The spatstat package has grown considerably, does very well, and we keep the interface with spatstat up to date, completely current, and running from sp. This year we have made a radical change: up until the penultimate release of maptools, you could convert a point pattern in geographical coordinates, that is, in decimal degrees, into a ppp object for point pattern analysis, despite the fact that the spatstat package was designed to handle planar geometries, where distance measurements should be Euclidean and not great circle. Edzer took this up in a question to Adrian Baddeley's plenary at a spatial statistics conference in Spain in July this year: why can people still make this mistake? He said: I have this problem; I have students who have data in geographical coordinates, and they coerce them to spatstat and carry on happily analyzing their points, even though technically the results are rubbish. So we agreed that we would insert either warnings or errors into the coercion, so that if it is known that your object is in geographical coordinates, you get a slap in the face: don't do that. There is a way around it, of course, which is to say that we don't know whether the data are in geographical or planar coordinates, in which case planar are assumed; but that's the user's choice. So now, if a user goes in with data known to be in geographical coordinates, they get pushback. That's the first radical intervention we've made in 15 years, so there's quite a lot of continuity here, and continuity is something that we see as mattering.
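A minimal sketch of the coercion path just described, assuming sp, maptools (which supplied the coercion methods to spatstat classes) and spatstat; the final line shows the pushback for known geographical coordinates:

```r
## Sketch, assuming sp + maptools + spatstat as described above.
library(sp)
library(maptools)   # supplies the as(, "ppp") coercion methods
library(spatstat)
xy <- SpatialPoints(cbind(x = runif(20), y = runif(20)))
pp <- as(xy, "ppp")  # no CRS set: planar is assumed, coercion is allowed
proj4string(xy) <- CRS("+proj=longlat +datum=WGS84")
try(as(xy, "ppp"))   # known geographical coordinates: warning or error
```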
Accessing data from outside had been available in the maptools package, for shapefiles, prior to other work; this was superseded by rgdal. But rgdal had originally been written simply to read rasters. I'm using some pronunciations which you may not be familiar with. I say "c-ran" for CRAN, the Comprehensive R Archive Network, because there's also CPAN, the Comprehensive Perl Archive Network, and CTAN, the Comprehensive TeX Archive Network; some of the software originally used for CRAN was also used for CPAN, and it was written in Perl, so CRAN ran on Perl until quite late. So that's a possibly unusual pronunciation. The other unusual pronunciation I've just used is "goo-dal" for GDAL, which you might read as "gee-dal". Why am I reading it that way? Because Frank Warmerdam, one of the prime original contributors to the software library, said that they had always wanted it to be object oriented, so he always pronounced it as a geographical object-oriented data abstraction library. They never really got it very object oriented, but he kept pronouncing it that way, because that was their original ambition. So when you hear me say "goo-dal", I mean what you might have pronounced "gee-dal". GDAL provided extensive access to raster data, and the original work on rgdal was done by Tim Keitt. Other parts, for reading vector data, what was then called the OGR part of GDAL,
were written separately by Barry Rowlingson, and Barry also contributed parts for dealing with coordinate reference systems, projections and transformations. This was then adapted to sp, so that rgdal could read a raster into a spatial object and write from a spatial object out to a raster, and could read a vector file into a spatial object and write from a spatial object out to a vector file. All of these things were in place fairly early on, but in bits, and completing them took a little longer; that was complete by about 2008. Completing this involved using the external libraries GDAL and PROJ, and we've managed to keep everything running more or less consistently since then. This was then using the sp package to define the classes, and the rgdal package to handle data input and output in either raster or vector representations.
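A minimal sketch of that division of labour in the legacy sp plus rgdal stack; the file and layer names here are placeholders:

```r
## Sketch of the legacy sp + rgdal input/output round trip; "dem.tif" and
## "tracks" are placeholder file and layer names.
library(sp)
library(rgdal)
g <- readGDAL("dem.tif")                   # raster file -> SpatialGridDataFrame
writeGDAL(g, "dem_copy.tif")               # spatial object -> raster file
v <- readOGR(dsn = ".", layer = "tracks")  # vector file -> Spatial*DataFrame
writeOGR(v, dsn = ".", layer = "tracks_copy",
         driver = "ESRI Shapefile")        # spatial object -> vector file
```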
The final part of the framework arrived thanks to Colin Rundel, who participated in a 2010 Google Summer of Code project that led to the rgeos package, which permitted spatial vector data handling: topological operations on vector geometries. However, the representation of the vector geometries used by GDAL and GEOS was what are known as simple features representations, an international standard which began to become important towards the end of the 2000s. So by about the time we were done with writing the first book, the choices we'd made in terms of vector representation, and to a certain extent also raster representation, were beginning to show their age. We'd made those choices seven or eight years earlier; they were documented in the first edition of the book, and it was already apparent that they were not the only ones possible. We published the first edition (completed in 2007, published in 2008, some of it modified immediately prior to publication), and the second edition then involved a certain amount of modification, with about 20 to 30 percent of the book changed between the first and second editions; it came out in 2013. The significant changes were the addition of the spacetime package and the addition of rgeos, but beyond that there were no very substantial changes. So we had started to realize that spatial data was not just the end of the road, because we needed to deal
with time as well. Those of you who've used geographical information systems will be aware that time is something that they don't do very well, and we'd followed the same line of thought with regard to handling time. What we had done, however, in the preface of the first book was to include a figure which showed the dependency tree. In June or July 2008 you could still print all of the names of the packages which depended on sp: we've got sp here, and then there are the packages which depended on sp, and the ones in gray are the ones which the authors of the book maintained; we maintained most of the ones which were there. There are packages which I also maintain, like splancs, which aren't sp packages; splancs uses its own representation. Some people had begun to adopt this by the time we got to the 2013 book. I'm not even sure if I can find the right figure; we couldn't fit the graph onto the page (I have a copy somewhere).
In 2014, Andrie de Vries did a cluster analysis using PageRank of packages on CRAN; this is a rerun from last month. This is the fourth-largest cluster on CRAN; it's not a very big cluster, and I'm not showing here how big it is: if you scale the figures by the PageRanks of the packages, then we're not big. But in a poster at the useR! meeting in Aalborg in 2015, the same person, who's now a Microsoft employee, had discovered that spatial was the third, fourth or fifth cluster in terms of use of R. So at that stage we realized that we'd got ourselves into more trouble than we expected, because we thought we were doing something which was to enable teaching. The first time I visited Edzer and gave a talk at Utrecht, when he was still in Utrecht (I think that must have been in 2004), the point of writing the software was the same as it had been six or eight years earlier: writing software so that you can teach stuff. The idea was that you have a book on spatial data analysis and you can teach it with the software, so the students can not only read about the theoretical definitions of the methods but can try them out, on different data, and can see what happens if, say, you change the variogram fit: if you're fitting a variogram by eye, what happens if you insert different values into the coefficients?
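A sketch of that kind of classroom experiment, using the meuse data shipped with sp and the gstat package rather than the course's own example:

```r
## Sketch of the "change the coefficients and see" experiment, using the
## meuse data from sp; not the course's own example.
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- ~x + y
v <- variogram(log(zinc) ~ 1, meuse)
fit <- fit.variogram(v, vgm(psill = 0.6, model = "Sph",
                            range = 900, nugget = 0.05))
plot(v, model = fit)
## now refit with other values for psill and range and compare the curves
```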
We were still sort of thinking: well, people are going to use this for teaching, aren't they? And now we've realized that it's much worse than that: if we do break stuff, if we change something in sp or in one of the key packages like rgdal or rgeos, then we get lots of attention very quickly on the mailing list; people get in touch and tell you things. So this is the rationale for where we go from here. I'll carry on without a break, because we're online.
So I think what we hadn't realized was that there was a considerable appetite out there for doing stuff with spatial data. This has also been influenced, particularly in the last five to six years, by increasing access to data. If you look at the behaviour of national mapping agencies: five years ago it was quite difficult to download anything; you might be able to download a picture of a topographic map, but you probably wouldn't be able to download any detailed data. And 10 years ago you would find a number of people exercising in the mountains above Bergen who had their own GPSs, but the GPS was this fairly clunky thing, and its batteries ran out after about an hour. Has anybody used one like that? So if you were going for a long hike, you took lots of batteries if you needed your GPS. Do you remember this? Yeah; so this is not just me making it up, and that's just 10 years ago. At the doctoral course here in 2006 there was a fisheries researcher who was working on lobsters, and he didn't yet have his lobster tracking data, so he was using a handheld GPS around the campus here to generate sample data for his project: how he thought lobsters moved. It turned out later that they didn't move like that. He was using acoustic tracking, and he set out his triangulation points; the lobsters had a reflector glued on, which he had to remove before they shed their carapaces, because otherwise they'd be killed; and they were moving around on the bottom of the sea. After he'd stationed his data collection points, the first thing the lobsters did was move out of range, because his imagination of how far the lobsters would move on a given night had nothing to base it on: there were no observations to give him a good idea of what the home
range of a lobster was. But then you get to a course in 2008, 2010, 2012, and the amount of data that people have access to has exploded; it's become very, very large. Whereas in 2006, if you had 150 points, this was a big data set; now, under 150,000 is a bit small, really. So things have happened quite rapidly with regard to access to data. The way the data are configured is unchanging: spatial data is position data in 2D or 3D, we've got attribute data, and we've got metadata connected to the position data. You could call spatial data map data, or you could call it GIS data. As for the use of sp and similar: we're not clearly aware that it's been used on other planets (I mean, not that the people were on other planets, but used with regard to planetary data), but we do know that it's been used with regard to microbiological data, so it's been used for very small things as well, even though that's just treating them as planar or 3D. I mentioned in the script that GPS, the American military system, only lost the additive noise component on its civilian use in 2000; removing that noise component was one of the last decisions of the Clinton presidency. So where are we today? Okay, just to give you a few pictures to begin with:
We're in the sf package for vector data, not the sp package; I'll be pointing to the sp package when we get to it, and I'll be explaining where the sf package came from in a moment. But this is also using the osmdata package. OSM, anybody? Three-letter acronyms, we won't get away from them, even four-letter acronyms: OSM is OpenStreetMap. Does anybody use OpenStreetMap, anybody consciously view OpenStreetMap data? Newspaper websites, in the articles that they're running online: if they have a map, it may be Google, but OpenStreetMap is free, so they may be using an interface to OpenStreetMap; sometimes, if you look at the copyright line at the bottom, you see that it is OpenStreetMap data. OpenStreetMap started being visible, in terms of its usefulness in a worldwide setting, after the Haitian earthquake, because there were no extant digital maps; volunteers in the field were recording GPS data and uploading it, so that, from not having proper maps, you then had maps where you could identify where different aid was required. So OpenStreetMap is not 100% reliable, and the code (which you won't like to look at) gives you an example of this in one of the sources for the Bergen light rail system. Did any of you use the light rail on the way in from the airport? No? It's much cheaper than the airport bus, especially for people over 67, because then I get half price, so I've stopped using
taxis to the airport; I can get there just as quickly with the light rail, since 2017. But the first part was coded as light rail and the second part was coded as a tram. So, in downloading this data, what I've said here with regard to OpenStreetMap is: this is the query that I'm going to generate, and I want to generate it from a bounding box just for Bergen, Norway. I'm saying this is in Bergen, Norway: I don't want light rail from everywhere. What I want to do is to query the railway features with the value of light rail, extract the lines as sf (simple features) lines, and put them in this one. Then I'm going to get the trams; then I have to remove some of the tram entries, which are bogus. Then, because the two different data sets have different sets of headers, so they have different columns with data in, I take an intersection of the column names so that I can merge the two data sets together, and here I'm saving them as an RDS object. So here I'm defining the area of interest; here I had to do a little exploration to find out which values worked: you have to look at the table, see which values are present, and then guess that they may be the right ones. Looking a little bit further out, I found that some were light rail and some were tram, and some of the trams are actually the museum tram in the centre of town, which doesn't run; those are the ones here which are being removed, because some enthusiasts had been around and made an extremely detailed map of the museum tram tracks in the centre of town, which don't run. And here we have it. So spatial vector data is points: we've got points, they make lines, and then we can construct larger, more complex objects from these. The light rail tracks are 2D vector data; the points are stored as double-precision floating point, and they're downloaded from OpenStreetMap, from the cloud.
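A hedged reconstruction of the query just described, assuming the osmdata and sf APIs; the feature values were found by inspecting the tables, and bogus_rows below is a hypothetical, hand-checked vector of row indices for the museum-tram entries:

```r
## Sketch of the OSM query described above; "bogus_rows" is a hypothetical
## vector of hand-checked row indices for the museum tramway.
library(osmdata)
library(sf)
bb <- getbb("Bergen, Norway")                     # bounding box for Bergen
q1 <- add_osm_feature(opq(bb), key = "railway", value = "light_rail")
byb0 <- osmdata_sf(q1)$osm_lines                  # light rail as sf lines
q2 <- add_osm_feature(opq(bb), key = "railway", value = "tram")
tram <- osmdata_sf(q2)$osm_lines                  # tram as sf lines
tram <- tram[-bogus_rows, ]                       # drop the bogus entries
keep <- intersect(names(byb0), names(tram))       # shared columns only
byb <- rbind(byb0[, keep], tram[, keep])          # merge the two data sets
saveRDS(byb, "byb.rds")                           # save as an RDS object
```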
And this is then where we are. But what else are we doing here; what else has happened to the way that we handle spatial data in R? Both the tmap package and mapview provide interactive mapping. They provide it through leaflet, which is another R package, which uses Leaflet.js, a JavaScript library; so there are layers, one above another. This means that instead of choosing this background we could choose this one, or we could choose an OpenStreetMap background, which takes a little longer to load; and we can of course zoom and pan and so on, so we can visit ourselves here. This is a standard interface of the kind that you're used to from web maps.
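A minimal sketch of that interactive display, assuming byb is the merged light-rail object built above:

```r
## Sketch: interactive web-map display of the object built above.
library(mapview)
mapview(byb)                  # leaflet map: zoom, pan, switch backgrounds

library(tmap)
tmap_mode("view")             # switch tmap from static "plot" to interactive
tm_shape(byb) + tm_lines()
```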
The mapview package began probably in 2014, by Tim Appelhans.
There's a series of seminars now called the OpenGeoHub seminars; the 2014 one was held in these two rooms here, C and B, and was also streamed. At that stage we hadn't really realized that this was going to happen; I noticed that the package had been made available on CRAN, and emailed Tim and said: well, we're going to have the next OpenGeoHub seminar in Lancaster in August 2015, could you drop by? So he dropped by, and he's a good community player and contributes a lot.
And it is really, satisfying is perhaps the right word, to see a generation of people who are 30 or 35 years younger than I am: Jakub Nowosad, Robin Lovelace, Tim, and lots of others; Martijn Tennekes is the author of tmap. There are lots of other people contributing things, but they all build on an infrastructure which we have to maintain. So mapview is important, and it's also based on what we didn't have until very recently, which is access to the tiles to place behind the interactive web maps. We can also take another example, based on a package by
Robin Lovelace, stplanr, for transport planning. What's going on here (if you want to replicate it) is a conversion from desire lines, the from/to pairs of transport data, to routing on a particular route. This isn't done properly; this is just a picture. But what we're doing here is downloading the complete set of monthly comma-separated value files from the city bike system in Bergen; they've been downloaded and placed in a folder called
bbs. You can download the same ones if you like; I can't recall whether I made them available. Then we need to read in the trips, and we need to massage them. Quite a lot of the initial work is finding out which stations are involved, the from and to stations, since obviously the city bikes are taken from and handed in to the same set of hubs. Some of them are moved, which we don't see: we don't have the data on movements of bikes where they accumulate at one city-centre hub and need to be moved back to a place with no bikes. But we know where the stations are, and we also know that one of the stations was actually in Oslo, because that was where the bikes were primed, so you get spurious movements across the whole of southern Norway, where they're not actually cycling; they're just being moved. Actually, I'm fairly certain that these are the actual cycle trips and not the trips made when they're moved by the trucks.
Okay, so then we have counts: what we're doing here is summing the counts between each potential pair of stations. You've got the origin station and the destination station, so this is the OD object, and the OD object is simply a table of counts from and to for each pair of stations for which trips exist, subtracting the ones where the bike was taken from and returned to the same station, because obviously then there's no desire line. Using stplanr, we want to create the OD lines given the stations, where we know the geographical coordinates of the stations, and we know the flows, that is, the number of trips from which station to which station. Here we've got a table of about 100-something stations, and we haven't got 100 by 100, that is 10,000, desire lines, because some of them would have been zero and drop out, and some are from and to the same station, down the principal diagonal, so they're out. So we have the lines, this one, which we could then again zoom into.
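A hedged sketch of the desire-line step; od (columns from, to, count) and hubs (station points with matching IDs) are stand-ins for the massaged city-bike objects, not the course's exact names:

```r
## Sketch only: "od" and "hubs" stand in for the massaged city-bike data.
library(stplanr)
flows <- od2line(flow = od, zones = hubs)  # join counts to station geometry
plot(flows["count"])                       # width or alpha can encode counts
```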
Here an alpha channel in the width of the lines is used to indicate the flows. And the closest hub here is actually a bit closer to the city than here, so you've got to walk about 15 minutes to get to the first hub; 15, 20 minutes, something like that. However, the stplanr package also
provides a function called line2route. If you have signed up to get an API key from CycleStreets, a UK website which you'd usually use as: I have a bike, I'm standing here, I want to get to there, give me a route; but here we're giving them a subset of 10,000 routes, and that takes a little longer, so this was pre-generated. You also need to apply in advance to get the API key; once you've got the API key you can go. If you're using it just once from an application or something like that, then you'll be using the API key of the owner or developer of the application.
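A sketch of the routing step, assuming the stplanr helpers of the time; the environment variable name for the CycleStreets key follows the stplanr documentation and may differ between versions:

```r
## Sketch: route the desire lines via CycleStreets; the env var name
## (CYCLESTREET) follows stplanr docs of the time and may vary by version.
Sys.setenv(CYCLESTREET = "your-api-key")  # key applied for in advance
library(stplanr)
routes <- line2route(flows, route_fun = route_cyclestreets)
```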
So here we get to something like that. If we allocate all of these cycle trips, assuming that the hub the bikes were handed back to was logical: on a sunny day in Bergen (there are sunny days in the summer, or even in the spring), people will say, I just want to cycle, I just want to cycle around, so there isn't a real desire line; they're just cycling around in circles from somewhere and leaving the bike somewhere else, and if you like, that's okay. Most of that is fairly central in town. You do begin now to see a certain density out around here, and there are certain things going on here. But, partly assuming that we've picked up the effective movements, we are starting to get a density of cycle movements. So those are the kinds of things which are going on. And these are the files:
I actually only went through to halfway through November (I could have gone a bit further), and I didn't get October, which I should have. Each of the CSV files is about 30 MB; these are data sizes which are just not real compared to where we were 20 years ago, when you might have had 30 MB on your whole hard disk and R wouldn't run: a 2 MB object in R in 1997 was too much, even though you could do some quite hairy regressions and lots of modelling. The data sizes were much smaller than the ones with which we're familiar now. And where data has been made available openly, access to data is much greater, but it's fairly heterogeneous. You've seen in both of the examples I've given you that if you were looking at R for the first time (maybe some of you are), you'd say: this looks scary, because what I'm having to do is a lot of data cleaning to get something which is even representable on the map. The data is provided by the data providers in ways which they feel are appropriate, and which for their purposes almost certainly are appropriate; they may not have thought about it a great deal, they just dump it out, but those are the variables they need internally. Okay, good. But that leaves us with problems of advancing from the sp representation.
So if we take the object that we had here, the light rail system (again checking every little bit: no, no further comments coming in), we're in something of a time loop. Seventeen years ago, it made a great deal of sense to use the formal class system for representing spatial data. Many other implementation projects at the time used the same representation: Bioconductor in particular, an archive network off CRAN with curated packages, a very solid bioinformatics resource, also chooses by and large to use S4, formal, classes: you define them ahead of time. And we can get from the sf representation, which I'll talk about in a moment, to sp by
coercion. So here we're coercing from the object to an sp object, and then we can look at the formal representation, which we can see here. We have an object with four slots: a data slot; a lines slot, which contains the geometries; a bounding box slot; and a proj4string slot, which contains the coordinate reference system, itself also a formal class. Then we can look at the way the geometries are represented. If we just start at the first of these lines, taking the lines slot and then looking at the first element of that list, we can see that this is a formal class of Lines, a formal class of Line, and within the line we've then got a matrix of coordinates inside. And we know ahead of time that the coordinates are floating-point numbers: if you are moving between R and C code and you had integers as coordinates, you would get a mess, or you'd have to check; but in a formal class system you don't have to check, because the class would be invalid if somebody tried to insert an integer as a coordinate; it would be converted to floating point straight away, they wouldn't be allowed to do it. So we know ahead of time a lot about the way that the data is structured. That was the idea of having a formal representation: it saved time in interfacing compiled languages outside.
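A minimal sketch of that coercion and slot inspection, assuming byb is the sf lines object from earlier:

```r
## Sketch: coerce the sf object from earlier to sp and inspect the slots.
library(sp)
byb_sp <- as(byb, "Spatial")             # sf -> SpatialLinesDataFrame
slotNames(byb_sp)                        # "data" "lines" "bbox" "proj4string"
str(byb_sp@lines[[1]], max.level = 3)    # first Lines object, coords inside
```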
Spatial raster data stands in contrast to vector data, which is observed at points: from the points you construct lines, and if you need polygons you construct those from the lines, deciding which lines, in which directions, make up a ring defining a polygon. Or, if we want some raster data, and so far we haven't got any, we could for example use the elevatr package. Again, when we were working with sp, and even at the level of the second edition of the book (that's 2012-13), if you wanted satellite elevation data you had to download it: you had to download tiles of elevation data, and you had to identify where you were going to get them from. Many were available, sometimes with a login, if what you had requested had been packaged and could be downloaded. The typical system was that you'd go to a web interface, choose the files you wanted, and request them; generally you'd need a login or some kind of email interaction: you'd give your email address, be sent a challenge asking whether you're a bot or a person, reply that you're a person, and then be sent a link
from which you could download the data. But this is now on AWS: full elevation data at different levels of resolution, and it's there. The elevatr package does document the provenance of the data: when it was observed, and what its quality is. That's something one needs to keep an eye on when working with online data sources. In this case, if you ask for this (here we're using the coerced sp version of the light rail tracks), you get a bounding box around the light rail tracks, and that gets pushed out to the server: okay, this is the area you want, this is the zoom level you want, and off we go. It then goes to the cloud, says this is what we need, and what we pull in is read in as a RasterLayer from the raster package.
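A sketch of that request, assuming the elevatr package; the zoom level chosen here is illustrative:

```r
## Sketch: fetch elevation for the bounding box of the sp object from AWS;
## z (tile zoom level) is an illustrative choice.
library(elevatr)
dem <- get_elev_raster(byb_sp, z = 10)   # returns a RasterLayer
```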
The raster package builds on the raster representation in sp and also uses formal classes. It was written at the end of the 2000s (2008, 2009, 2010) and uses the same S4 classes as sp. We did write in the second edition of our book that we really hoped the raster book would come out soon; it still hasn't come out, and it would be really useful, but Robert Hijmans, who wrote the raster package, is very busy and has done an awful lot of work modernizing it, and you have to be able to do the other things you do: he works on crop robustness, he's worked in the Philippines, he's worked on potatoes, on rice and things like that, so he's a working field ecologist, and writing a book in addition is something that just hasn't happened yet. But this is a formal class, and we can look at its representation as a SpatialGridDataFrame. There's the spatial representation of the data frame, and once again you see that there's a data slot; but there aren't lines as there were with the light rail, there's a grid which is
defining the geometry. In this case the grid is quite simple, because it just records the coordinates of the southwest grid cell centre, how many grid cells there are in each direction, and what their step, their cell size, is. So you have the data frame with the observations of elevation, and the grid defining the geometry, the bounding box, and the proj4string. We can, whoops, we come back to this on Tuesday.
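A sketch of the coercion being shown, assuming elev is the RasterLayer just downloaded:

```r
# Sketch: coerce the RasterLayer to the sp grid representation and
# inspect the slots described above.
elev_sp <- as(elev, "SpatialGridDataFrame")
slot(elev_sp, "grid")   # GridTopology: SW cell-centre offset, cell size, cell counts
summary(elev_sp)        # data, grid, bounding box and proj4string together
```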
So I've got your attention: the warning says that the grid is not updated for PROJ >= 6, and we can display the object. Here I probably should have changed the colour representation: I was using topo.colors, and I probably should have used terrain.colors, so that where it isn't blue you would get green, darker green. But you can see that once again we've got the data. The initial warnings, before we got to the many warnings about PROJ, were mapview saying that it is quite difficult to represent this much data: if you really want to show all of the pixels, tell it to do that, but otherwise it will decimate the number of pixels being displayed. One of the consequences of these PROJ warnings, which we will be looking at tomorrow morning, is that the data might have been offset in relation to where they should be, in register on the web map. (In answer to a question:) Yes, with qualifications: there is a
dggridR package which provides not only hexagons but a mixture of hexagons and pentagons, to give a complete global coverage. However, the current status with regard to whether you can treat those objects as vector or raster is unclear, for understandable reasons: raster objects are most often arrays, in depth, because they may have four dimensions, the x and y dimensions, the time dimension, and an attribute dimension, since you could be measuring with different instruments on a satellite. We're not there yet, but dggridR is somewhere to look. Could you get back to that after we turn off the streaming at 11 o'clock? Then I can change my screen and look for the package, or you can look for the package yourself. But there are a number of possibilities like that. Okay: the raster package has been widely adopted and is a fairly robust way of
representing data. What raster does in particular is to exploit the fact that the GDAL library, accessed through rgdal, offers the opportunity of reading not the whole raster but chunks of it: you can decide which columns and rows of a raster you want to read, and you don't have to read the whole raster at the same time. So one of the things raster permitted was to iterate across a large raster to generate results from a raster which, in 2008, you could not get into memory, because memory was much smaller then. On the 32-bit systems of the time you probably weren't really handling more than a couple of gigabytes of memory. This machine has 16 gigabytes, but 16 is something I've only had recently; four gigabytes was much more typical, or two, or one, and being able to handle a big raster ten years ago was really hard. So that facility in raster was important, and it used rgdal to do it.
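A hedged sketch of that block-wise iteration, using the raster package's blockSize() and getValues(); the file name is hypothetical:

```r
# Sketch: compute the mean of a large raster chunk by chunk,
# without ever holding the whole raster in memory.
library(raster)
r <- raster("big_raster.tif")   # hypothetical large file
bs <- blockSize(r)              # suggested row chunks for reading
total <- 0; n <- 0
for (i in seq_len(bs$n)) {
  v <- getValues(r, row = bs$row[i], nrows = bs$nrows[i])
  total <- total + sum(v, na.rm = TRUE)
  n <- n + sum(!is.na(v))
}
total / n                       # mean of all non-missing cells
```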
One of the things which has been absolutely crucial has been help from the CRAN administrators, in particular Professor Brian Ripley at the University of Oxford, who from very early on compiled all of the external dependencies for Windows and for OS X himself, on his own machine, and made them available. Before the Windows binary version of rgdal was available on CRAN, he was providing it from his own server in Oxford. This was extremely useful; in the sense of growing the user base it was decisive, because it meant that lots and lots of students were using this stuff, since they could install it, and they were installing it from Oxford rather than from CRAN, which hadn't then developed the capacity to do this. And his sympathetic support continues to be very important: it is the rule rather than the exception that if something is going to go wrong, Brian Ripley will find it before we do. And this is not just for R-spatial; he covers everybody's backs. Like finding out that the Fortran compiler in Fedora 30 was stricter about standards than any previous version, which led to things falling apart for everybody. That is maybe a minority interest, but when you use R you can be sure that the fact that it runs properly is to quite a large extent down to Brian Ripley, from things like memory management on Windows, which he wrote himself, or parts of which he modified himself,
going back an awful long way. Obviously there was at first a limited set of vector and raster drivers, so that some were not available, and others have been made available as time has gone on. When we were contacted by the Sentinel team at the Joint Research Centre, the European Commission's research institution, to add the JPEG 2000 driver to the Windows and OS X builds, we found out how to do that using the OpenJPEG library. So we have, if you like, full Sentinel support,
because there was interaction between the data providers and the people around CRAN who could help us with the libraries we needed to permit GDAL to handle these kinds of things. Okay, so questions were arising. I've already mentioned the rgeos package, which was fairly consistent in its use of simple features. The idea with simple features was that a hierarchy of classes was defined theoretically, and it was then a good idea for software to implement that hierarchy of classes and not some other hierarchy. The way in which we implemented the hierarchy of classes for vector data in sp was based more on the then most used vector format, the shapefile, and the shapefile does not distinguish adequately between an internal ring and an external ring. A polygon will have an exterior ring, and if there is a hole in the polygon, that hole is an interior ring. The difference between the exterior ring and the interior ring in a shapefile is that they go in different directions: the coordinates run clockwise or anticlockwise to define whether a ring is exterior or interior. In simple features, however, each polygon can have only one exterior ring. In a shapefile you can have an object which calls itself a polygon but has multiple exterior rings, like a collection of islands; in simple features you have to call this a multipolygon, you cannot call it a polygon. Our system was inconsistent in this way, so we could get messes. The first one was drawn to my attention by my brother in 2004, when he was trying to plot labour market data for Sheffield and found that some enumeration districts were disappearing. It turned out that we were plotting the enumeration districts in order by identifier, so if the order was a, b, c, d, we plotted them that way; to get around the problem you needed to plot the biggest one first, and then the successively smaller ones which would otherwise have been overplotted by the big one. So there was a lot of mess caused by not using simple features. If we had used simple features from the beginning, which we couldn't, because they hadn't yet been defined or standardized, everything would have been a lot simpler; but that simply wasn't available. So we need vector standards compliance.
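As a forward-looking sketch in sf terms (sf is introduced later in the course): ring direction no longer carries the meaning, a POLYGON has one exterior ring plus optional holes, and multiple exterior rings require a MULTIPOLYGON. The coordinates here are invented for illustration.

```r
library(sf)
outer  <- rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0))
hole   <- rbind(c(4, 4), c(6, 4), c(6, 6), c(4, 6), c(4, 4))
island <- rbind(c(20, 20), c(24, 20), c(24, 24), c(20, 24), c(20, 20))
p  <- st_polygon(list(outer, hole))       # one exterior ring, one interior ring
mp <- st_multipolygon(list(list(outer, hole), list(island)))  # two exterior rings
st_geometry_type(st_sfc(p))   # POLYGON
st_geometry_type(st_sfc(mp))  # MULTIPOLYGON
```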
JTS and GEOS, JTS being the Java original version of GEOS, require simple feature compliance, and to provide it we had to create a kludge for sp Polygons objects, defining which of the component rings were exterior and which interior, and, for the interior rings, recording to which exterior ring each belonged. So it was a mess. Spatiotemporal data also appeared as a question. It should essentially
be obvious that all spatial data are spatiotemporal anyway; maybe you just have one observation for each point, but each point is observed at a particular point in time. We get back to this tomorrow morning with the examination of coordinate reference systems. Indeed, geodesists would like each observed point to be given a timestamp: when was this point observed? As one of the geodesists who really enjoys working on Iceland says, you go back a week later and it has moved, because there are earthquakes, there are tectonic movements, so the landscape is dancing; for a geodesist, he says, this is really fun. So if you are observing a GPS point we need that timestamp as well; it is not good enough just to have the position, and GPS observations do come with a timestamp, because time is what drives GPS. So we had realized that the setting was inadequate. The original publication of the ISO standard, which at that stage was closed, there wasn't open access to it, was in 2004, and work on international standards proceeded from then on; there is an article by Kralidis and a longer work by Herring about this. So we needed to go back to simple features. In terms of my presentation I am now somewhat behind my schedule, so when I break at 11 o'clock I'll break at 11 o'clock, and we'll carry on with what I haven't completed from the first section at 1:15, because these foundations are, I think, quite important.
How many of you are familiar with data frames, in terms of R, or with data tables from Python? Okay, yep, so I'll give you some basic background on data frame objects in R.
I'll complete this little bit first. Simple features in R, which is where we're going to get to, sort of began after Edzer Pebesma, Paulo Ribeiro and I were through with a special issue of the Journal of Statistical Software in 2015. So we finished the book, then there was the special issue, and then we took a shallow breath and said: what's going to hit us next? We needed to revisit the vector classes. Funding for simple features support was offered from 2016 by the newly started R Consortium, and the key breakthrough came after Hadley Wickham, a statistician and programmer working at RStudio, who had previously worked on ggplot and ggplot2, had declared that data needed to be tidy, and that data frames were tidy. He had also said that list columns were not tidy. However, at the 2016 useR! conference at Stanford, Edzer and I were at the side of the room coding, as we usually did, and suddenly we started listening to the plenary: Hadley had just declared that list columns were tidy. He then explained why. They had been trying to draw maps in ggplot2 and had problems with exactly this: how do you create a tidy data frame where two of the variables are the x and y coordinates of the lines you want to draw, or the containers for fill colours? The way they had done it previously was simply by having vectors of the x coordinates and vectors of the y coordinates, where the pen was supposed to move from (x1, y1) to (x2, y2) and so on, jumping to the next part when it reached a break, the way it had been done in S-PLUS as well. But they were losing holes from the middle of the polygons when filling them, and there was no way to sort this out. He had come around to our point of view, which was that you needed a richer data structure to handle geometries of this kind.
So list columns are tidy. Now, what is a data frame object? A data frame object in R, and the same kind of structure elsewhere, for example in Python, is a list object. Lists are one of the things behind much of the success of modern programming languages, or what would probably now be called standard programming languages. If you go back to when I started programming, which was Fortran and Algol, there weren't structures like this; but when you got to C there were lots of them: a list which you can grow, where one element may point to the next element, so you have something which is not structured as a vector. Well, actually a list in R is a kind of vector. In a regular atomic vector all of the elements have to be of the same kind, but into a list you can put whatever you like: a list component can be another list, or an integer, or a character string, or a floating point number. So lists are very flexible tools. If you look at the output of fitting a regression in R, what is it? Of course it's a list; it has class "lm", but it is a list. So lists are prevalent; there are lots of them. Vectors are fairly simple, and I can give references if people need them.
Lists can be manipulated with single square brackets, and you can get at what's inside with double square brackets. Here we start with vectors of four different types: v1 is the integers one through three; v2 is the letters one through three, "a" "b" "c"; v3 is the square root of v1, which will be floating point; and v4 is the square root of the negative of v1 taken as complex, which gives us a complex vector. We can make these into a list, and if we look at the structure (str is a function for showing the structure of an object) we can see what's inside: v1 is 1, 2, 3; v2 is "a", "b", "c"; v3 is 1, 1.41, 1.73; and we see the same for the complex numbers. We can access the components either using double square brackets or using a dollar sign and the name of the list element. So we can handle this list.
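A reconstruction of the objects being narrated (the names v1 to v4 follow the talk; the exact calls are an assumption):

```r
v1 <- 1:3                    # integer
v2 <- letters[1:3]           # character: "a" "b" "c"
v3 <- sqrt(v1)               # double (floating point)
v4 <- sqrt(as.complex(-v1))  # complex
l <- list(v1 = v1, v2 = v2, v3 = v3, v4 = v4)
str(l)       # int, chr, num, cplx components
l[["v3"]]    # double square brackets extract a component
l$v3         # dollar-and-name access does the same
```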
But the list is also the template for creating a data frame. Why is a data frame different from any other list? The only serious difference is that the list components have to be of the same length: the data frame is thought of as a regular, rectangular container for data, but it is a list. We can create one by using as.data.frame, coercing the list to a data frame, and here we see that the classes of the objects mostly remain; I can't quite get both of them on screen at the same time, but you see that where we had integer, character, numeric, complex, we now have integer, factor, numeric, complex. stringsAsFactors = TRUE has been the default in R, and was in S, because statisticians needed to treat categorical variables as statistically important variables, to be handled properly and not simply as text. Treating text as categorical variables goes back to the beginning of S and has been inherited, so by default, if you read data in or convert another object to a data frame, it will say: you want me to handle this character string data, so I am going to convert it to a factor. A factor is something like a hash table, which makes it easy to build dummy variables in a model, for instance. It is possible to set the argument stringsAsFactors to FALSE and not take the default, in which case we get the representation we had before, but now as a data frame rather than a list. The data frame can also be handled in other ways. We could also extend two of these components and try to create a data frame from components of differing lengths.
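A sketch of the coercion; note that since R 4.0 the stringsAsFactors default has changed to FALSE, so the factor conversion described here reflects the older default:

```r
df <- as.data.frame(l, stringsAsFactors = TRUE)    # character column becomes a factor
sapply(df, class)   # integer, factor, numeric, complex
df2 <- as.data.frame(l, stringsAsFactors = FALSE)  # keep the character column as-is
sapply(df2, class)  # integer, character, numeric, complex
```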
As a list this is quite fine, with components of different lengths, but when we try to convert it to a data frame there is an error: "arguments imply differing number of rows". It will stop us doing this, so we know that the difference between an arbitrary list and a data frame is that a data frame is a list whose components all have the same length.
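For instance (the component values are invented):

```r
l2 <- list(a = 1:3, b = letters[1:4])  # lengths 3 and 4: fine in a list
try(as.data.frame(l2))
# Error: arguments imply differing number of rows: 3, 4
```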
We can access the elements in the same way as with a list, but we can also access them as though the data frame were a matrix. It is not a matrix, because it is a list, but the data frame is a rectangular object, so we can access things treating them as elements of a matrix. There is a further point here, concerning drop = FALSE versus drop = TRUE. drop = TRUE has been the default since early S, so for a very, very long time: if you subselect from a matrix, a data frame, or an array of two or more dimensions, the result is dropped to the lowest dimension that fits. If you have a three-dimensional array and take just one slice, it goes down to two dimensions; if you take just one vector, it goes down to one dimension. So if we ask for a subset which is just one element, the understanding is that we just want that single element: drop is TRUE by default. If we set drop = FALSE, we get a data frame with one row and one column.
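A sketch of matrix-style indexing on the data frame built above:

```r
df[1, 1]                # a single element: dimensions dropped (drop = TRUE default)
df[1, 1, drop = FALSE]  # a 1 x 1 data frame: dimensions kept
df[, 2]                 # one column, dropped to a vector
df[, 2, drop = FALSE]   # one column, still a data frame
```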
We could try to coerce the data frame to a matrix, but in this case we have a factor or character variable in the data frame, so all of the values will be coerced to character: the result is a character matrix. If we take just the two columns of the data frame which are numeric, the integer and the numeric ones, leaving out the complex one, we get a numeric matrix. Obviously the length of the list l was four: we put four things into it, so its length is four. So what is the length of the data frame we created from l? It is four, because it is a list of four components, the columns. And what is the length of the data frame once we have turned it into a matrix? It is twelve, because it is three times four: four columns, three rows. Why should the matrix have a length? Because it is a vector. Yes.
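A sketch of the coercions and lengths just described:

```r
m <- as.matrix(df)              # the factor column forces a character matrix
storage.mode(m)                 # "character"
m2 <- as.matrix(df[, c(1, 3)])  # just the integer and numeric columns
storage.mode(m2)                # "double"
length(l); length(df)           # 4 and 4: both are lists of four components
length(m)                       # 12: a matrix is a vector, here 3 rows x 4 columns
```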
The answer, for those who don't enjoy this level of abstraction, is that a matrix is a vector with a dim attribute of length two, and an array is a vector with a dim attribute of length two or more. This goes back to S, so it goes back a long way, and part of the reason is that the data are organized to make moving an object from the S side to the C side straightforward. R stores matrices in column-major order, as Fortran does, while C uses row-major order; that is even more abstract, but it means that when you ask what the length of something is, you need to think about what the underlying representation of the data is. If the underlying representation is a list, the length will be the length of the list. If we ask what the dim of a list is, it says NULL: it doesn't have that attribute. If we ask what the dim of the data frame is, it answers, even though it doesn't really need to have one; it sort of pretends to have a dim attribute. But a matrix has to have a dim attribute, otherwise it is just a vector. If we look at the coercion of the data frame to a matrix, we see that it is a character matrix of three rows and four columns, and it has a dimnames attribute; the dim attribute itself is not being shown here.
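In code, the point about dim:

```r
dim(l)    # NULL: a plain list has no dim attribute
dim(df)   # 3 4: the data frame answers via its class, without a real dim attribute
dim(m)    # 3 4: the matrix must carry a dim attribute, or it is just a vector
```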
The row names of the data frame have also been modified as time has gone on. Originally all data frames had fully enumerated row names; if you had one to a thousand it didn't really matter, they took up a bit of space, but not very much, but with one to a hundred million they take up real space, and perhaps they are irrelevant. So about ten years ago R changed so that if the row names are just the integers from one to n, a marker is stored saying: generate them on the fly if you need them. You can change the names of a data frame, you can adapt them. We can look here at the attributes of the data frame: we have now changed the names to big A, big B, big C; it has a class, and it has row names one, two, three. If we look at the matrix, we see that it has a list of two attributes.
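A sketch of renaming and of the compact row-name marker (the new names follow the talk, extended with a fourth since our reconstruction has four columns):

```r
names(df) <- c("A", "B", "C", "D")  # fourth name added for the fourth column
attributes(df)       # names, class, row.names
.row_names_info(df)  # negative: 1:n row names stored only as a compact marker
attributes(m)        # the matrix: dim and dimnames
```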
What is happening here is that str sees that this is a matrix, encodes the information from the dim attribute in its one-line description, and then displays only the other attribute which is present; but if we ask for the attributes of the object, we can see both of them, the dim attribute and the dimnames attribute. One of the possibilities for handling vectors of different lengths is to pad them with missing values, NA, not available.
Now, because it's important, I'll mention list columns, and then we'll be fresh to start sf at quarter past one. So what is a list column? Here we are adding an extra component to our data frame, an extra column, which is a list, and this list contains one floating point number, one character string, and one logical value. If we then look at the structure of our data frame, we have the data frame we had before, all the good things are there, and we also have a list column. Putting this into a regression, saying that we want to regress A on E, is going to lead to mayhem, so don't do it. But list columns are valid; they have been valid since forever.
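A sketch of such a list column (the values are those narrated):

```r
df$E <- list(1.5, "a", TRUE)  # one double, one character, one logical
str(df)                       # the old columns plus a list column E
# lm(A ~ E, data = df)        # don't: a list column makes no sense in a model
```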
It is not immediately obvious how, say, to write them out to a comma separated value file or a spreadsheet; in some settings it should be okay, in others it might not be, because how do you know what formatting constraints to put on column E? You don't, really. So there are things about list columns which are iffy, but list columns are completely legal, and that is where we go when we get to sf. As I said, at the useR! meeting, in a plenary, Hadley Wickham said that list columns are tidy; so from there on, okay, off we go. I'll stop the streaming now.