
Scipp: multi-dimensional arrays with labeled dimensions and physical units


Formal Metadata

Title
Scipp: multi-dimensional arrays with labeled dimensions and physical units
Title of Series
Number of Parts
141
Author
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Inspired by Xarray, Scipp [scipp.github.io] enriches raw NumPy-like multi-dimensional data arrays by adding named dimensions and associated coordinates. For an even more intuitive and less error-prone user experience, Scipp adds physical units to arrays and their coordinates. Scipp data arrays additionally support a dictionary of masks, as well as histogram bin-edge coordinates. One of Scipp's key features is the possibility of using multi-dimensional non-destructive binning to sort record-based "tabular"/"event" data into arrays of bins. This provides fast and flexible binning, rebinning, and filtering operations, all while preserving the original individual records. Scipp ships with data display and visualization features for Jupyter notebooks, including a powerful plotting interface. Named Plopp, this tool uses a graph of connected nodes to provide interactivity between multiple plots and widgets, requiring only a few lines of code from the user.
Transcript: English (auto-generated)
Hello everyone. My name is Neil Vaytet and I'm a scientific software developer at the European Spallation Source in Denmark and Sweden. I do Python for scientific data analysis and visualization.
And I'm going to talk to you today about our project, Scipp. We pronounce it "skip"; you can come and ask me why after the talk. It's going to be about multi-dimensional arrays with labeled dimensions and physical units. And just a shout-out to my awesome team: Simon, Lukas, and Sunyoung.
I'm aware that some people in the audience can't really see the bottom part of the screen, so I'm going to try and keep what I do towards the top, but just let me know if you can't see. Okay. And I can also make it...
So I'm basically going to do this as a demo in a Jupyter notebook. I just have a bunch of imports and I'm defining myself a few useful plotting functions, but that's for later. Okay. So, labeled dimensions: why do we need them?
So say I have a rectangular NumPy array, which has a shape of 10 by 20, and it might look something like this. I would like to slice out row number four. I look at the shape of my array and I know that this is the dimension that has only 10 elements, so I have to slice on the first index, which is fine; it gives me what I want.
However, you can't always deduce it from the shape. Say now I have something that's square. It looks like this. Now do I remember which one it was? Was it the first index or was it the second index?
And obviously you're going to get very different answers if you get it wrong. It gets even worse when you have more dimensions. Say I have four dimensions, x, y, z, and time, in that order. Maybe I want to get the first z-slice. Which one is it? Do you remember? Is it colon, colon, zero, or is it zero?
So, hands up who has never struggled with this while using NumPy. Good, that's what I thought. If you had put your hand up, I would say you were lying.
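The ambiguity described above can be reproduced with plain NumPy. This is only an illustrative sketch; the array sizes are made up, and the dimension order (x, y, z, time) is the one quoted in the talk.

```python
import numpy as np

# Four dimensions in the order (x, y, z, time), as in the talk.
nx, ny, nz, nt = 3, 4, 5, 6
a = np.arange(nx * ny * nz * nt).reshape(nx, ny, nz, nt)

# The first z-slice: z is axis 2, so the index 0 goes in the third position.
first_z = a[:, :, 0, :]

# A plausible mistake: indexing the first axis instead.
# NumPy does not complain; it silently returns the first x-slice.
wrong = a[0]

print(first_z.shape)  # (3, 4, 6)
print(wrong.shape)    # (4, 5, 6)

# With a square array the shape gives no hint at all.
sq = np.arange(16).reshape(4, 4)
row = sq[1]      # second row
col = sq[:, 1]   # second column, same shape as the row
```

Both results are valid arrays, so nothing fails at the point of the mistake; the error only shows up later, if at all.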
So, labeled dimensions. There's a really cool project called Xarray; if you haven't heard of it, go and check it out. They introduced labeled dimensions to multi-dimensional NumPy arrays, and in their documentation they say: real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space and time, etc.
And what we have done in the Scipp project is we have embraced, and to a large extent copied, the Xarray mechanism.
How this works is that (sc is for Scipp) we create a Scipp array by giving it the NumPy array we had above, but now we also give it a list of dimensions: some strings that are going to label each dimension that we have.
We have some fancy HTML representations for Jupyter notebooks, and you can see that every label for the dimensions and the sizes are here, and then we have the values. And now when I want to get the z-slice, all I need to do is give it the z label and then the index.
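The idea of slicing by dimension label rather than by position can be sketched in a few lines of plain NumPy. This is not Scipp's API or implementation; `slice_by_label` is a hypothetical helper written only to show the lookup from a name to an axis.

```python
import numpy as np

def slice_by_label(values, dims, dim, index):
    """Return (values, dims) with dimension `dim` sliced at integer `index`."""
    axis = dims.index(dim)                      # look up the axis position by name
    sliced = np.take(values, index, axis=axis)  # drop that axis at the given index
    remaining = [d for d in dims if d != dim]
    return sliced, remaining

dims = ["x", "y", "z", "time"]
values = np.arange(3 * 4 * 5 * 6).reshape(3, 4, 5, 6)

# "Give it the z label and then the index."
z0, z0_dims = slice_by_label(values, dims, "z", 0)
```

The caller never needs to know that z happens to be the third axis, which is what makes the slicing line readable months later.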
So compared to all the colons and a zero, this is really nice and easy. But I think most importantly, and that's a point that is often forgotten, it makes your code extremely readable. If I go back to my code a month or two months later and I look at this, I can see: oh yeah, I was trying to slice the z dimension. The same goes if somebody else looks at your code, and I think that's really important.
Okay. And then this is also something that Xarray and Scipp both have: you can add coordinates. You can have coordinates on each of the dimensions of your array.
They basically describe the extent of each axis, or maybe how far every data point is from its neighbors. We have some visual representations for this. So say I have a two-dimensional array; maybe it's representing the air temperature above a city, at different altitudes and as a function of year. So that's your dense two-dimensional array.
In Scipp and Xarray, coordinates are added in a structure called a data array. You feed it your data variable, and then you give it a dictionary of coordinates saying that the years go from 2015 to 2023 and the altitude goes from 0 to 8,000 meters. So effectively what you're doing is this: you're adding coordinates to your data.
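As a rough picture of what a data array holds, the sketch below pairs a NumPy array with a dictionary of one coordinate per dimension. It is a plain-Python stand-in, not the Scipp or Xarray data structure, and the temperature values are invented for illustration.

```python
import numpy as np

# Hypothetical temperatures (kelvin), one row per altitude, one column per year.
years = np.arange(2015, 2024)                 # 2015 ... 2023, 9 values
altitudes = np.linspace(0.0, 8000.0, 5)       # meters, 0 ... 8000
temperature = 288.0 - 0.0065 * altitudes[:, None] + np.zeros(years.size)

# A "data array" in this sketch: values, dimension names, one coordinate per dimension.
data_array = {
    "dims": ("altitude", "year"),
    "values": temperature,
    "coords": {"altitude": altitudes, "year": years},
}

# Each coordinate has the same length as the dimension it labels.
for axis, dim in enumerate(data_array["dims"]):
    assert data_array["coords"][dim].shape == (temperature.shape[axis],)

# Coordinates let us select by value rather than by position.
i = int(np.flatnonzero(data_array["coords"]["year"] == 2020)[0])
column_2020 = data_array["values"][:, i]
```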
You can also look at the HTML representation. You have the original data that we had, and then you have a list of coordinates, with the altitude in here. Good. So now I want to talk about what we've added on top of this in the Scipp project.
The first one is physical units. Every data variable and coordinate in Scipp has physical units, and it was very important for us to have this embedded from the start. There are other Python projects that do this: there's Pint, which is just for the units, and there's a pint-xarray project that tries to incorporate this into Xarray, but we needed to have this baked in from the beginning.
I'll just give you an example. Maybe I'll also plot this, and I hope...
So when you look at the representation, you can see that my x and my y coordinates both have units of centimeters. Think of it as maybe a detector panel, where I'm imaging some counts coming in. My data has units of counts, and we can just plot this and it automatically labels the axes.
Now say I also have an integration time: I know for how long I was counting when I was recording, say 300 seconds. So I divide my image by the integration time, and now the unit is counts per second, automatically. The library just does this for you.
We can do pretty much any combination of units, and you also see that the values have changed, so my image has been normalized. This is really useful if you're dealing with physics and you can't remember if your energy was per unit volume or something like that: you can actually see it by just looking at the unit of your variables.
However, there's another bonus, which is that the units also provide protection. Say now I have a background image, like a dark frame, which I want to subtract from the signal image above, but I forgot to first normalize it by integration time. So I have my background, which has units of counts. My image above had units of counts per second, and now the library is telling me: you can't do this. So I first have to divide my background by the background integration time, and then I can do the subtraction here.
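The two behaviors just described, units combining under division and mismatched units being rejected under subtraction, can be mimicked with a toy class. This is a deliberately simplified sketch: `Quantity` is a hypothetical name, units are compared as plain strings, and the numbers are invented. Scipp relies on a real units library rather than string comparison.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    """A value with a unit stored as a string, e.g. 'counts' or 'counts/s'."""
    value: float
    unit: str

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Division always works; the units combine.
        return Quantity(self.value / other.value, f"{self.unit}/{other.unit}")

    def __sub__(self, other: "Quantity") -> "Quantity":
        # Subtraction only makes sense for identical units.
        if self.unit != other.unit:
            raise ValueError(f"cannot subtract {other.unit} from {self.unit}")
        return Quantity(self.value - other.value, self.unit)

signal = Quantity(600.0, "counts") / Quantity(300.0, "s")   # 2.0 counts/s
background = Quantity(50.0, "counts")

try:
    signal - background          # counts/s minus counts: rejected
    mismatch_caught = False
except ValueError:
    mismatch_caught = True

# Normalize the background by its own integration time first, then subtract.
corrected = signal - background / Quantity(100.0, "s")      # 1.5 counts/s
```

The point of the sketch is where the failure happens: at the subtraction itself, not at the end of a long script.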
So units are extremely useful for the early prevention of difficult-to-spot bugs. If you have a very long Python script, normally you arrive at the end and you don't really understand why the units don't match or what went wrong. This will catch it really early. So they save hours, and I mean hours, of debugging time.
And I think it's also very important that they free up a lot of mental capacity for the user. They don't really have to remember: did I divide by area or volume, or something like that? It just lets you focus on the important thing, which is doing the science that you want to do.
Just as a side note, we can also use units for what you call label-based indexing, if you know Xarray. Say I want the slice at 0.5 centimeters and I don't know the number of the index. I can just say: slice x at 0.5 centimeters, and it will just find the correct slice. That's also nice. This is something that you can do with Xarray, but this is a really nice way to do it with the units.
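A minimal sketch of label-based indexing with a unit check follows. `index_at` is a hypothetical helper, the coordinate values are invented, and refusing a mismatched unit (rather than converting it) is a choice made for this sketch, not a statement about what Scipp does internally.

```python
import numpy as np

def index_at(coord, coord_unit, value, unit):
    """Return the position where `coord` equals `value`, given in the same unit."""
    if unit != coord_unit:
        raise ValueError(f"expected a value in {coord_unit}, got {unit}")
    matches = np.flatnonzero(np.isclose(coord, value))
    if matches.size == 0:
        raise KeyError(f"no coordinate value equal to {value} {unit}")
    return int(matches[0])

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # hypothetical x coordinate, in cm
image = np.arange(15).reshape(5, 3)         # dims: (x, y)

# "Slice x at 0.5 centimeters" without knowing the integer index.
row = image[index_at(x, "cm", 0.5, "cm")]
```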
Okay, the second thing that we have in addition is what we call bin-edge coordinates. It's sometimes necessary to have coordinates that represent a range for each data value. Say the temperature was 310 kelvin between 10 and 20 seconds. It's not a given point in time; you have a range over which that data value was valid.
This is also what you have every time you histogram data. Just like in my image above, where we did some histogramming of counts: the counts are this much between 0.1 and 0.2 centimeters, or something like that.
Scipp supports this by having bin-edge coordinates, which are coordinates whose length is one more than the size of the corresponding dimension of your data. So in my little representation here, I have an 8 by 8 image and my coordinates have a length of 9 on each side. In the representation these are marked as bin edges, so you can see whether you have bin-edge coordinates.
And yeah, this is like the image I had above, but I've histogrammed it into 8 by 8 bins. You've probably used histogramming with NumPy or Matplotlib, and they will return the edges and the data separately, in a tuple.
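For reference, this is what the NumPy behavior mentioned above looks like: `np.histogram2d` returns the counts and the two edge arrays as separate objects, and each edge array is one element longer than the matching dimension of the counts. The input points are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.uniform(0.0, 1.0, size=1000)   # hypothetical hit positions in cm
y = rng.uniform(0.0, 1.0, size=1000)

# NumPy returns the counts and the two sets of edges as separate arrays.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=8, range=[[0.0, 1.0], [0.0, 1.0]]
)

print(counts.shape)    # (8, 8)
print(x_edges.shape)   # (9,)
```

Keeping the three arrays in sync through later slicing and arithmetic is then up to the user, which is the bookkeeping a single data structure removes.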
We have everything inside a single data structure. Now, this addition has actually allowed us to create something which I think is one of the most powerful features of Scipp. This is the third part of my talk, and we call this binned data.
So Scipp distinguishes between histogrammed data and binned data. Histogrammed data is the regular dense array, when you've basically collected all your counts and then you've done the sum. So you have a value of three between zero and one, seven between one and two, and so on. Binned data refers to the precursor of histogrammed data. It's basically that you have a list of bins, and each one of them contains a list of records. You can of course convert from one to the other by summing all the data inside each bin, but there is a loss of information there, and you can actually do some cool things if you keep this structure.
If you know a little something about Awkward Array, it's basically conceptually similar to a multi-dimensional Awkward array. To best illustrate this I'll do a little example of data analysis, and for that I'm going to use something called the New York yellow taxi dataset.
If you haven't heard of this, it's quite a famous dataset in data analysis. It basically is a really long table of data on New York taxi trips. You have a pickup date and time, a drop-off date and time, how many passengers, the distance, the pickup latitude and longitude. So this is for example an image made of the histogram of pickup latitude and longitude, and you see Manhattan, and here you have JFK Airport.
I've got my dataset from the Vaex documentation, which is also quite a nice project for data analysis; go and check it out as well. I'm only going to load a subset of this; if not, my laptop is going to cry or scream at me. But basically I have loaded the latitude and longitude of drop-offs, so where people were dropped off by the taxi, the trip distance, the hour of the day, and how much they paid for the trip, and I have 71 million rows in my table.
It's about 3.2 gigabytes in my memory. So if we have a quick look at this data: I'm plotting one in a thousand points, because 71 million scatter points in Matplotlib is still quite difficult. But you can see Manhattan, and if you zoom in here you start seeing individual streets. So there's a lot of data in here.
Okay. So now I'm going to show you what you can do if you bin the records. Working with binned data is actually most efficient when you keep the number of bins relatively low.
We can have a lot of bins, but it's basically most efficient when you keep the number of bins low. Binning is essentially like overlaying a grid of bin edges onto our data. So this is kind of what we're doing: we're keeping the underlying data, but we're overlaying a grid wrapper onto it.
You can do this with any kind of data which is scattered. For example, there was a talk yesterday about some cosmological simulations that use particles, and you could apply this to that, grouping your particles.
So the way to do this is very simple in Scipp. I've got my original data array. I do da.bin and I say I want eight bins in latitude and eight bins in longitude. It takes about a second, and now I have my binned data structure, with eight bins in latitude and longitude.
My data actually has kind of a weird type: it's a data array view, and what it's telling me is that it's a view onto my original array. Then it has the different bins: the first bin has 65,000 records in it, the second one has 50,000 records, and so on.
And so if I naively just histogram this, I'm going to get a very pixelated 8 by 8 image of Manhattan, which is not very useful.
But because binning only groups the data into bins, it actually just reorders the data, and you don't lose any information. It's simply reordered, so the bins can then be used for very efficient slicing or filtering.
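The "reorder, don't reduce" idea can be sketched with NumPy alone: compute a bin index per record, sort the table by that index, and remember where each bin starts. This illustrates the concept under made-up data and column names; it is not how Scipp implements `bin` internally.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 10_000
# Hypothetical records: a small table with one column per field.
table = {
    "latitude": rng.uniform(40.6, 40.9, size=n),
    "longitude": rng.uniform(-74.1, -73.7, size=n),
    "fare": rng.uniform(2.5, 80.0, size=n),
}

nbins = 8
lat_edges = np.linspace(40.6, 40.9, nbins + 1)
lon_edges = np.linspace(-74.1, -73.7, nbins + 1)

# Bin index of every record along each dimension (0 ... nbins - 1).
ilat = np.clip(np.searchsorted(lat_edges, table["latitude"], side="right") - 1, 0, nbins - 1)
ilon = np.clip(np.searchsorted(lon_edges, table["longitude"], side="right") - 1, 0, nbins - 1)
flat = ilat * nbins + ilon

# Reorder the table so that records of the same bin are contiguous.
order = np.argsort(flat, kind="stable")
binned = {name: column[order] for name, column in table.items()}

# offsets[b]:offsets[b + 1] is the range of rows that belongs to bin b.
sizes = np.bincount(flat, minlength=nbins * nbins)
offsets = np.concatenate([[0], np.cumsum(sizes)])

# Histogramming is just the per-bin record count; the records are still there.
histogram = sizes.reshape(nbins, nbins)

# Select one bin (latitude bin 3, longitude bin 0) without scanning the table.
b = 3 * nbins + 0
fares_in_bin = binned["fare"][offsets[b]:offsets[b + 1]]
```

Because a bin is a contiguous range of rows, selecting it is a slice, and the selected records can be histogrammed again at any resolution.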
So for example, I want to select a bin in Manhattan. I take the first one in longitude and the fourth one in latitude, so I'm going to be up here, which is probably this one. You just reuse the slicing that we did before with the z dimension: longitude, the first one; latitude, the fourth. And now I have something like this, where I have 770 megabytes out of 3.2 gigabytes, and about 17 million records in that bin.
And because I haven't lost any information, I can re-histogram it at a much higher resolution, and now you can see that you have all the data in there. So it's really useful for working on subsets of your data. I'm going to select another bin, which contains JFK Airport,
and you see some hot spots here. If you look at the map, you can see these are the different terminals at the airport. I'm not sure why people are being dropped off on the highway, but, you know.
Yeah, I'm guessing that's probably inaccuracies in the GPS positions as recorded by the taxis.
Okay, now I've selected a single bin, but once I've done this, what you can do after that is bin this into a new dimension. So let's go back to my Manhattan bin. I have a single bin which has 17 million records, but if I look inside it I can see that I still have all the information on fare amount, trip distance, latitude, longitude, and all this.
So if I want to look at the trip distances inside the Manhattan and JFK bins I've selected above, I take the bin that I've sliced out and I make a hundred trip-distance bins. Now I have a dimension of length 100 in trip distance, and I can plot this, and I can see that most of the trips in Manhattan are short-distance trips, less than five or ten miles. If you do the same with JFK, you can see that people who go to the airport usually take a longer trip.
I'm not saying this is the only way to do it. You can do this with pandas, you can do this with Xarray, but the ways I found to do it with pandas are usually not as simple; the syntax we have is, I think, really nice. They also tend to consume more memory, especially if you bin a second time into something. Reordering the data makes it really efficient.
You can also do other things with bins. It's a little bit like when you use something like group-by: you don't always have to just sum the things you have in your bins. You can also do other reductions, like min and max or mean. So I have a little question: I would like to know what the fare amount is as a function of distance.
So I'm going to go back to this data I had from Manhattan, which has a hundred bins in trip distance, and once again, if I look inside it, I know that I still have all the information on the fare amount or the hour of the day. So to get the minimum and maximum fares for all trips that are inside the Manhattan area, this is my data right here: you do .bins.coords and then the min and the max, and this will give you the min and the max of the fare amounts for all the trips inside that bin.
The first thing you see is that the minimum is minus 242 dollars, which is a bit weird, and the maximum is 7,000 dollars, which seems a bit excessive. So these values are a bit strange, maybe indicative of bad data in the table.
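Reductions other than a sum over each bin can be sketched with NumPy's `reduceat`, once the records are grouped by bin. The data below is synthetic and the variable names are illustrative; this is a stand-in for the idea, not the Scipp calls used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
distance = rng.uniform(0.0, 20.0, size=5_000)              # hypothetical trip distances, miles
fare = 2.5 + 2.5 * distance + rng.normal(0.0, 1.0, 5_000)  # hypothetical fares, dollars

edges = np.linspace(0.0, 20.0, 101)                        # 100 trip-distance bins
index = np.clip(np.searchsorted(edges, distance, side="right") - 1, 0, 99)

# Group the fares by distance bin, keeping every record.
order = np.argsort(index, kind="stable")
fare_sorted = fare[order]
counts = np.bincount(index, minlength=100)
starts = np.concatenate([[0], np.cumsum(counts)[:-1]])

# reduceat needs non-empty bins; that holds for this synthetic data,
# but real data would need explicit handling of empty bins.
if counts.min() == 0:
    raise ValueError("empty bin: reduceat would give a wrong result")

fare_min = np.minimum.reduceat(fare_sorted, starts)
fare_max = np.maximum.reduceat(fare_sorted, starts)
fare_mean = np.add.reduceat(fare_sorted, starts) / counts
```

A per-bin minimum or maximum far outside the expected range, like the negative fare in the talk, is a quick way to spot suspicious records.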
So I'm going to restrict the range to between zero and 200 dollars. You don't have to specify bins only by a number of bins; just like in NumPy, you can directly specify the bin edges that you want. So I'm doing a linspace between zero and 200.
Because this had one dimension here, and I'm now making a new dimension with a hundred points, I get something that's two-dimensional, and it looks like this: I have the fare amount on the y-axis as a function of trip distance.
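Passing explicit edges instead of a bin count works the same way in NumPy, which makes for a compact illustration of restricting the fare range. The data is synthetic, with deliberately wide noise so that some fares fall outside 0 to 200 dollars.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
distance = rng.uniform(0.0, 20.0, size=5_000)                # hypothetical, miles
fare = 2.5 + 2.5 * distance + rng.normal(0.0, 50.0, 5_000)   # hypothetical, dollars

distance_edges = np.linspace(0.0, 20.0, 101)   # 100 bins
fare_edges = np.linspace(0.0, 200.0, 101)      # 100 bins covering 0 to 200 dollars only

# Fare on the first axis, trip distance on the second.
hist, _, _ = np.histogram2d(fare, distance, bins=[fare_edges, distance_edges])

# Records with a fare outside 0 ... 200 fall outside the edges and are not counted.
n_outside = int(np.sum((fare < 0.0) | (fare > 200.0)))
```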
There are a few things we can say about the data. The first one is that you have this diagonal line, which you kind of expect: the further you go, the more you pay. Makes sense. The other thing you see is that people mostly pay above the line, but not really below it, apart from maybe here at the bottom; some people seem quite good at negotiating.
And the last one is that you have this magical number of 52 dollars, which will take you anywhere from zero to sixty miles, which is kind of interesting.
So is it bad data? Maybe there's a default value that, if it doesn't get overwritten, is always 52 dollars. I don't know. Well, actually I think I do know, because in the last few minutes that I have I want to talk about the things that we build around Scipp.
So that was mostly the core features of Scipp, but we've also developed this library called Plopp, which is what we use for all the visualization we do in Scipp. The name sounds a little funny. The first time I was working on the logo, my wife looked at it and she was like: you made something called Plopp? Is that what they're paying you for? Everybody laughed, but everybody remembered the name, so we stuck with it. It's supposed to stand for plotting plus plus, but anyway. We've got a lot of tools, but I just want to show you quickly one of them,
which I think is quite... So now I'm going to go back to my original data and histogram it in three dimensions. I have latitude, longitude, and the fare amount, so I have a three-dimensional cube.
Then I have this thing that we call the inspector plot. Maybe it's better like this. On the left is the map, so it's latitude and longitude, two of my dimensions. And on the right is where this is going to be useful. I've got this little tool here and I can add these dots, and this is probing the third dimension, so it's giving you the profile. You can move these dots around and the profiles will update.
So if I put one down here, and then the last thing I want to do is go back to my airport and add another dot here. And now all of a sudden you see that you've got this spike at 52 dollars. So I think what it is, is that it's all the airport shuttles. They've got a fixed fare, which is pre-arranged, and then they'll just take you anywhere.
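What each dot in the inspector shows can be described as a one-dimensional cut through the third dimension of the cube at one map pixel. The sketch below builds such a cube with `np.histogramdd` from synthetic records, including an artificial fixed fare in one corner; it illustrates the data behind the plot, not Plopp's API.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n = 20_000
lat = rng.uniform(0.0, 1.0, size=n)       # hypothetical, arbitrary units
lon = rng.uniform(0.0, 1.0, size=n)
fare = rng.uniform(0.0, 100.0, size=n)    # hypothetical, dollars

# Mimic a fixed fare: every record in one corner of the map costs exactly 52 dollars.
corner = (lat > 0.9) & (lon > 0.9)
fare[corner] = 52.0

edges = [
    np.linspace(0.0, 1.0, 11),     # 10 latitude bins
    np.linspace(0.0, 1.0, 11),     # 10 longitude bins
    np.linspace(0.0, 100.0, 51),   # 50 fare bins, 2 dollars wide
]
cube, _ = np.histogramdd(np.column_stack([lat, lon, fare]), bins=edges)

# The map is the cube summed over the fare dimension.
map_2d = cube.sum(axis=2)

# The profile at one map pixel is the 1-D cut through the fare dimension.
profile = cube[9, 9, :]
peak_bin = int(np.argmax(profile))        # bin 26 covers 52 to 54 dollars
```

In the summed map the corner pixel looks like any other; the spike only appears once the third dimension is probed at that location.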
Yeah, that was about it. Thank you for listening. There are a few links for you here; go and check them out. I would also like to say that we are hiring: we have a permanent position as a software engineer developing tools for science. So if you're interested, come and talk to me. Thank you very much.
I have a question about the units. Is there a predefined list of units that you accept, and can they be converted into each other? Like if you have grams, can you say I want them in kilograms or tons, those kinds of things?
Yes. So there's a long list of units. Scipp is written with a C++ core and then it has Python bindings on top, and the units library is a runtime C++ library that has a very long list of lots of different units. You can definitely do things like convert from meters to kilometers, with something like that: I'll just give it 200 centimeters and then convert. Sorry, you can also convert to feet or miles.
So not only SI, but also imperial?
Yeah, yeah. Okay.
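The kind of conversion mentioned in this answer amounts to rescaling through a common base unit. The sketch below uses a tiny hand-written table of length units for illustration only; `TO_METERS` and `convert` are hypothetical names, and Scipp delegates this to its C++ units library rather than to a table like this.

```python
# Conversion factors to meters; a tiny hand-written table for illustration only.
TO_METERS = {"cm": 0.01, "m": 1.0, "km": 1000.0, "ft": 0.3048, "mi": 1609.344}

def convert(value, unit, to):
    """Convert a length `value` from `unit` to `to` by going through meters."""
    return value * TO_METERS[unit] / TO_METERS[to]

in_meters = convert(200.0, "cm", "m")    # 2 meters
in_feet = convert(200.0, "cm", "ft")     # about 6.56 feet
```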
Hi, my name is Mark. Thank you for a great and amazing talk. My question is about Plopp, actually. I can see that you love making things interactive, and you need to visualize big amounts of data. In the Xarray ecosystem there's always the integration with HoloViews, hvPlot, and so on. Have you considered building on that instead? Because they already have all the tools for what you've been showing.
We took a deep look at, well, probably not all of them, because there are a lot, but many different visualization packages. I think the issue we had with HoloViews is that your data needs to be either a pandas DataFrame or an Xarray data array, so we would probably have to convert to that.
Let's talk about it, because what the system does is really implement backends. So you could implement a Scipp backend and then everything would just work.
Yeah, thank you. Sounds good.
Thanks for your talk. I really liked the feature where you could slice your data using units. Can you control the interpolation when you do that in Scipp?
So the way it works is that it doesn't do any interpolation; that's something you could probably add on top. Basically, if you have a bin-edge coordinate and you give it a value that's inside a given bin, then it will just return the bin that contains that value. If you don't have bin edges, so if your coordinates are actually marking exact points, then you have to have an exact match when you request data with a unit. If not, it will tell you it can't find anything.
Yeah, I'll make a request on that.
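The two lookup rules in this answer, containment for bin edges and exact match for point coordinates, can be sketched as follows. `find_bin` and `find_point` are hypothetical helpers, and the half-open bin convention is an assumption of the sketch rather than something stated in the talk.

```python
import numpy as np

def find_bin(edges, value):
    """Index of the bin [edges[i], edges[i + 1]) that contains `value`."""
    if not (edges[0] <= value < edges[-1]):
        raise KeyError(f"{value} is outside the bin edges")
    return int(np.searchsorted(edges, value, side="right") - 1)

def find_point(points, value):
    """Index of the coordinate that equals `value` exactly; no interpolation."""
    matches = np.flatnonzero(points == value)
    if matches.size == 0:
        raise KeyError(f"no coordinate equal to {value}")
    return int(matches[0])

edges = np.array([0.0, 10.0, 20.0, 30.0])    # bin edges: 3 bins
points = np.array([0.0, 10.0, 20.0, 30.0])   # point coordinate: 4 exact positions

bin_index = find_bin(edges, 12.5)            # inside the second bin
point_index = find_point(points, 20.0)       # exact match required
```

The same value, 12.5, is a valid request against the bin edges but fails against the point coordinate, which is the behavior described above.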
Thanks for a nice talk. My question is maybe a philosophical one. What were the reasons why you decided to create your own library, and not, for example, to extend Xarray, mount Pint on that, and somehow make a better combination out of these?
So we looked at Xarray for a long time, and we got a lot of our inspiration from it before we started building. We have two reasons. The first one is historical: to start with, we thought that Scipp was going to need to interact with a lot of other C++ code at our facility, so we needed to have a C++ core, and then we added Python bindings on top. This may not be so true anymore, but it was true when we started the project in 2018 or 2019.
The second one is that we considered contributing to Xarray, but we thought that if we wanted to add something as fundamental as units or bin edges, it was going to take a really long time to get it right and to get it adopted in Xarray, and we actually needed things to move quite fast. So those are the two reasons. We're not trying to replace Xarray; we just needed something at our facility.
Thanks.
I think saving units is very important for sharing data, also in the scientific community.
So my first question is: where do you store the unit? Is it like an attribute, or a column, or is it part of the data structure in the file?
It's in the C++ data structure, so inside the variable. We have different data structures. You have the variable, which is the lowest-level thing; this is in the C++, and the unit is stored in there next to the buffer. We have the dimensions and the unit, and we can also store variances, like uncertainties, alongside the values as well.
Second question, regarding the binning: can you use some custom grid instead of this regular one?
Yes. For example, the bins do not all need to be the same size, so your grids can have different sizes, any size you want. But they do have to be rectangular at the moment.
Thanks.
Thank you so much. It was such a delight to listen to such an interesting topic, and now we're in the lunch break.