
Introduction to Programming for Business Analytics - Exercise 8: Plotting

Formal Metadata

Title: Introduction to Programming for Business Analytics - Exercise 8: Plotting
Number of Parts: 22
License: CC Attribution 4.0 International. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Transcript: English (auto-generated)
Hi, welcome to the eighth exercise video for the Introduction to Programming for Business Analytics class. Today we're going to look at the plotting exercise. Before we get started,
please make sure that you attempt to solve all the tasks in the exercise by yourself before you watch the respective part of the video where I explain the solution, because if you don't do that, you will not really learn how to solve the task by yourself. With that said, let's get started with the first task, which is about
plotting COVID data. So the first subtask is to load the Plots package. And if we haven't installed the package yet, we have to install it first. So let's just type the commands we need to install the package, just for good measure. Okay, the package already seems
to be installed. So we can load it with this command. And the next task is to create
a data frame that contains the data from the file COVID WHO data dot CSV, which we provided via Moodle; you have to download it and put it into the same directory on your hard drive as the Jupyter notebook you are editing. And to create a data frame from a CSV file, we need the packages DataFrames and CSV. And then we can just call
CSV.read: we assign the result to a variable, then we put the file name here, and then the data type we want to create from the data which we read from the file.
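As a rough sketch of the call just described, the following reads a small stand-in CSV file into a data frame. The file contents, the file name, the column spellings and the "dd.mm.yy" date layout are assumptions made up for illustration; the real file from Moodle may look different.

```julia
using CSV, DataFrames

# Write a tiny stand-in file so the sketch is self-contained.
# The values and column names are invented placeholders, not WHO data.
path = joinpath(mktempdir(), "covid_sample.csv")
write(path, """
date_reported,new_cases,new_deaths
21.08.20,100,1
22.08.20,200,2
23.08.20,300,3
""")

# The second argument is the target type DataFrame,
# not the package name DataFrames.
df = CSV.read(path, DataFrame)
```

Passing the type as the second argument tells CSV.read what kind of table to build from the parsed file.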
All right, I provided the wrong data type: I provided a package instead of a data type. Okay, but now it works, we get the data frame out. As we can see, the data frame has three columns, which are called date_reported, new_cases and
new_deaths respectively. And so now let's go to the next task, which is to plot the data from the data frame we created. We should create a line plot which shows date_reported on the x axis and new_cases on the y axis. So as you know, when we call the plot function without any special arguments, we just get a line plot. And the
first argument is basically the data that goes on the x axis. So that would be df.date_reported. And on the y axis, we have df.new_cases. And this gives us this plot, where, as you can see, on the x axis we have the dates in this very
weird looking way where they all intersect and are not very nice and informative. And on the y axis, we have the case numbers, like that. And the task also says that we should make sure that our plot does not show a legend. And we
should give it an appropriate title. So let's put the keyword argument legend = false here, which will turn off the legend. There it goes. And if we also put the keyword argument title and provide an appropriate title, such as new
COVID infections, Germany, then we get this plot, which has an appropriate title and no legend. The next task says: as you can see, the labels on the x axis of the plot are not very informative. This is because the dates are
represented as strings in the data frame. Now we should add another column to the data frame in which the dates are represented as the data type Date from the Julia package Dates. The first thing we have to do is to load the package, like this. And now we can make a new column by just assigning to
the column name. For example, we could just call it date. And then to this, we assign the new column. And for this new column, we basically have to call the constructor for the Date data type; we have to call it for
every item in the old column. The old column was called date_reported, so why don't we start with this. And now we can use a neat little trick, which is called an array comprehension. With
this, we can just call some function for every item in the vector. And in this case, we call the constructor for the data frame... sorry, for the Date data type. And now, we don't really know how to use this
constructor. So let's have a look at the documentation of the package Dates. And there we have the constructor. Maybe we can also find a more comprehensive documentation here. And here you
can see, for example, that they provide a string to this Date constructor. And as the second argument, they provide a date format string. So let's try this in our code. And
so if we just make a new cell and execute this, then you can see that we get this Date object from this string now. But in our case, the strings are formatted in a different way:
the day comes first, then the month and the year. So we will have to change this date format. Let's provide an example
date to see if it works. Okay, so this gives us a date in the year 15: the first of January in the year 15. And maybe we can
adjust this by adding 2000 years. Nope, that does not
work. Let's have another look at the documentation. Okay, here
we can see that they do some date arithmetic. And they do
this with these functions, Dates.Month, Dates.Day. And these actually create instances of the Period data type. So there is also the period here. And I think if
we adjust the code by adding this, but instead
of one year we add 2000 years... yes, then we get the date we want, which is in the year 2015, because this is how the date is represented in our data frame up here. So now we just have to copy this here. We copy the date format here. And
we put date, which is, if you remember, what we called the single items in our column. And now let's have a look at the resulting data frame. Okay, and we can see there is a new
column, it has the type Date. And it has these representations of the date, which look like what we want, don't they? They have the year first, then the month, in this case it's August, and then there's the day, 23. Okay. So
with that, we are done with this task. And for the next task, we should draw the plot again. But this time, we use our newly created column for the x axis. So here we use df.date. And on the y axis, we still have the case
numbers. Let's see what these were called again: new_cases. Okay, so new_cases. Okay, the plot looks very similar. Let's also copy the stuff that made it a little nicer. Okay, and
there we have it, our COVID infections. But now the x axis has changed, because the labels on the x axis are not intersecting anymore. And they are, in general, a little more informative. Okay. So now the task goes on
saying: from the plot, we can see that the case numbers can strongly vary between adjacent days, but follow steadier long-term trends when we compare adjacent weeks. What is meant by that is that the case numbers have this huge interval where they kind of go up and down. But all in all, they
follow a trend that goes like this, right? So they go up here and even more up, and then they go down again, and then they go up and so on. So this is what is meant by the trend. And if we are only interested in the trend, rather than the
day by day numbers, we can create a less noisy plot by computing weekly averages. And now the task is to create a new data frame in which we group the entries of the original data frame by week, then take the averages of the case and death numbers. And the hint is to use the function firstdayofweek
to obtain the first day of the respective week from an instance of the type Date. Okay. So if we look at our data frame, it has this date column now. And if we call the
function firstdayofweek on all of these, then we get back the first day of the respective week for each date. So you can see that there are seven entries which all have the same date, then
followed by another seven entries for the next first day of the week, and so on. And we can now use this to create weekly averages. Let's just add this first day of the week
as a column to our data frame. Now the column is here, and then we can group by the column and we can combine, and
we want to create averages of the columns new_cases and new_deaths. So we take the column new_cases, we pass it to mean, and we call the result new cases average, or maybe
weekly average, even. And then we can do the same thing for
the deaths. Okay, let's see what this gives us. mean
is not defined. That's because we have to load the Statistics package first. Okay, now we have this data frame, which has three columns: the first day of the respective week, then the new cases weekly average, and the new deaths, also a weekly
average. All right, and to use this later, we now have to assign it to a new variable. Let's call this df2. Now, the next task is to create a combined line plot, which is one plot with two lines, which show the weekly averages
of both cases and deaths on the y axis with respect to the first day of the week, which we will put on the x axis. And the task goes on by saying: choose an informative title for your plot and create a legend that tells which line is which. Okay, let's get started. So the first
argument is always our x axis, which in this case is the first-day-of-week column of df2. And on the y axis, we want both cases and deaths. So here we pass a vector. The first
element of the vector is our cases, and the second element is then the deaths. Let's see what this looks like. Okay, yeah, the rudimentary plot is already
finished. Now, we still have to choose an informative title for our plot. So let's put something like new COVID
infections. Or maybe just COVID data: weekly COVID data, Germany, and then the legend will do the rest.
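The data behind this plot comes from the date conversion and weekly averaging steps described earlier. A minimal sketch of those two steps, using only the standard libraries Dates and Statistics instead of a data frame, might look like this. The raw strings, the case numbers and the "dd.mm.yy" layout are assumptions for illustration, and the dictionary stands in for the grouped data frame df2.

```julia
using Dates, Statistics

# Hypothetical raw values in an assumed "dd.mm.yy" layout.
raw_dates = ["17.08.20", "18.08.20", "23.08.20", "24.08.20"]
new_cases = [10, 20, 30, 40]

# A two-digit year parses as the year 20, so 2000 years are added
# as a Period, which is the date arithmetic found in the documentation.
dates = [Date(d, dateformat"dd.mm.yy") + Year(2000) for d in raw_dates]

# Monday of the week each date belongs to.
week_start = firstdayofweek.(dates)

# Average the case numbers of all entries that share the same week start.
weekly_mean = Dict(w => mean(new_cases[week_start .== w])
                   for w in unique(week_start))
```

With a data frame, the same grouping is what groupby and combine do on the first-day-of-week column.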
And in the legend, we can put something like new cases, weekly average, and then we leave out the weekly in the title. And then we put something like
new deaths, weekly average. Let's see what this looks like. We missed a comma here. Okay, now the format of the legend is wrong. Maybe it has to be a vector that is shaped like this. No, maybe like this. No, that also doesn't
work. Well, let's look it up in the documentation. Let's see
if they have an example here. Plot attributes. Okay. Okay, the keyword that we have to use is not legend, but label instead. So yeah, we just put label here. Again,
what was the format? Okay, they put a space between the labels instead of a comma. So this should do the trick. Yes. Okay. And yeah, there we have our plot. But one last thing we can do is actually to put
the legend in a different place. I think this works like this: topright is where it is currently, I guess. Yep. And if we put topleft, then it should go over here. Yes. Okay, now it's not overlapping with the line in the
plot anymore. Okay. The next task is: do you think the plot you created is perfectly informative? Can you think of a better plot? If so, create a better one. So let's have a look at our plot. What is not very informative about the plot is the fact that we cannot really make out any
differences in the deaths, because the numbers of cases are so high that the deaths basically become this flat line. So we don't see any of the variance in the deaths, because the numbers are just much smaller. And I think we could make the
plot more informative by splitting it in two, where we have one plot that shows just the cases and another that shows just the deaths, but they are both on the same x axis so we can compare the dates. And how can we do this? Let's
have another look at the documentation. Yeah, okay, they have this example here. And here they introduce this parameter which they call layout. And if we pass layout
(4, 1), then we get these four plots which are stacked on top of each other. And the one says that it's just one column of plots. So let's try what happens if we just add
this to our existing plot code. Okay. What we have now is two plots, where the first plot shows the new
cases weekly average and the second plot shows the new deaths weekly average, and as you can see, they are both on the same x axis now. But what is a little sad is that the legend for the new deaths is actually overlapping with this plot here. So maybe we can actually move
that by passing a vector here. Let's see. So we want the first to be in the top left and the second to be in the top right. No, that does not work. Maybe we have to
call it legends. Also doesn't work. Let's see if the documentation has something to say about that. Maybe in the plot attributes. Or maybe we can just search for legend. Okay,
so now we have to find out whether it's a series attribute, a plot attribute, an axis attribute or a subplot attribute. I guess it could be a subplot attribute, or maybe we just have a vector that has
the wrong shape. Okay. Yes, now. Now we're done. Yeah, there is our plot with the legends not intersecting with the lines of the plot. But what
is a little suboptimal is that we have the title twice. So there was one way to change this. Maybe the keyword we wanted was plot title. No, that does not seem to be the case. Okay, so what are the different ways to set titles?
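The stacked layout with one legend position per panel, as worked out so far, can be sketched as follows. The numbers are invented placeholders, and the variable names are assumptions; the point is only the shape of the keyword arguments: labels and legend positions are given as a row (separated by spaces), so that each entry applies to one series or panel.

```julia
using Plots, Dates

# Placeholder data: four week starts with invented averages.
weeks  = collect(Date(2020, 8, 3):Week(1):Date(2020, 8, 24))
cases  = [900.0, 1100.0, 1300.0, 1200.0]
deaths = [5.0, 6.0, 4.0, 7.0]

p = plot(weeks, [cases, deaths];
         layout = (2, 1),   # two rows, one column of subplots
         label  = ["new cases, weekly average" "new deaths, weekly average"],
         legend = [:topleft :topright])
```

A column vector (entries separated by commas) would be read as several values for one series, which is presumably why the earlier attempts had "the wrong shape".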
But I guess this should definitely be a plot attribute. Maybe we can search for title here. Okay, it's plot_title. And it also says: title for the whole plot, not the subplots. So let's try plot_title. Okay, and there it is, our plot with
only one title and non-overlapping legends. Task two is about locations in Aachen. The task starts by saying: obtain the coordinates of 10 of your favorite places in the city of Aachen. You can use OpenStreetMap.org: click on
query features, then click on the map, then choose a node from the list on the left and write down its location, which is given in coordinates, latitude and longitude. Okay, let's have a look. So this is OpenStreetMap, we can scroll all the way into Aachen. And maybe we want the
coordinates of the park in the Frankenberger Viertel. Then we use query features, we click on something in the park, and then it shows us this list. And we can just click on some node, and there we have its location. And then we
can just do this for 10 things in Aachen, and then we obtain a list of locations that looks much like this. And to create a scatter plot of our favorite places, we just call the plot function, we provide the latitude and the longitude, and then we put the keyword argument seriestype
= :scatter. And then we have this not very nice yet map of our favorite places in Aachen. Let's actually remove the legend, because I think the legend is not very informative. There we go. Now the second task
is to create another plot, this time remove the legend. Okay, we already did that. And add an informative title. Okay, let's just start with this code. Add an informative title, how about my favorite places in
Aachen. And we should add a label to every place which provides a description like SuperC, if the SuperC should
be among our favorite locations. And to do this, we can provide a vector of tuples as an argument with the keyword annotations. So let's do this: annotations, a vector of tuples. And the tuples must have three items, which indicate the x coordinate, y coordinate and label data respectively. And the label data in turn is another
tuple with three items, which indicate the label text, relative position such as top or left, and font size. Okay, so we have to make a tuple which contains three things: x, y, and the label data, which is another tuple, the first
of which would be the label text, then the relative position, let's go with left, and the font size, let's go with nine. And we do this for every something in something.
And for the latter something, we can use the function zip. The function zip takes vectors and puts them
into one sequence of tuples. And if we provide longitude, latitude, and the labels, which we still have to create, then it will zip this, and then we can read x, y and label out
of here. And now we have to create the labels. For this, I will refer to the sample solution, because I don't want to type it all right now. Let's see what this does. Okay, now
you can see that we have some annotations. The annotations intersect with the markers of the scatter plot, unfortunately, and also some of them are not really inside of our plot anymore. We can fix the fact that they
intersect with the markers by just adding some white space. So like this, we just add two spaces to every label. And then the labels are on the
right side of the dots, even though we told them to be on the left side, but whatever. I guess the dots are on the left side of the labels. Okay, I think we're done with the task. Let's see. Yep, looks like we're done.
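The annotation vector described above can be sketched with a comprehension over zip. The coordinates and labels below are invented placeholders, not real places, and the variable names long, lat and labels are assumptions that follow the names used in the video.

```julia
# Placeholder coordinates and labels (not real locations).
long   = [6.06, 6.08, 6.10]
lat    = [50.77, 50.78, 50.76]
labels = ["Place A", "Place B", "Place C"]

# One tuple (x, y, label data) per place; the label data is itself a tuple
# of text, relative position and font size. Two leading spaces keep the
# text from overlapping the marker.
annotations = [(x, y, ("  " * label, :left, 9))
               for (x, y, label) in zip(long, lat, labels)]
```

The resulting vector is what gets passed to the plot call through the annotations keyword.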
Next task: depending on your places and labels, some labels may not be fully visible in the plot you created. Yes, definitely the case down here. To change this, adjust the x axis limits and y axis limits as needed by providing the appropriate arguments to the plot function. Okay, let's copy and paste our plot down here. Now, what are
the keyword arguments for the x axis limits and y axis limits? We can look it up in the Plots documentation. Yeah, the section axis limits sounds about right. So ylims seems
to control the limits of the y axis here. And yeah, I think then xlims would control the limits of the x axis. So
let's see, we have to put a comma here first. Okay, let's see what this does. So if I put xlims, what could be appropriate limits? Maybe 6.04 or something and 6.011. Let's start with that. 6.11, right? Yes. Oh, no, maybe
6.101, actually. No, this is not enough. Let's go with this.
Okay. How about this? Okay, this looks better. All right. Almost good. Okay, maybe 125. Okay, this is barely enough.
And maybe we can also move it a little bit in the other direction by passing ylims. Otherwise, we will just have to live with the fact that these intersect. So I guess the upper y limit would be something like 50.77, maybe 80. Like
this. The lower limit would be something like 76. Okay, this is way too much. Nope, it doesn't really work. Okay. So we just
have to live with the fact that these intersect, but at least the whole label is now readable in the plot. The next
task is to choose five out of our 10 places and to draw a route that starts from our house and visits each of the five locations exactly once before finally ending at our house again. The sequence of the locations visited in the route should be indicated by an arrow that points from each of the locations to the
location that will be visited next. So then there's the hint to use the function quiver. But first, let's actually make the route. So I don't know about you, but I personally live in my office, which is in the DPO
chair. So the DPO chair is the first item in the list. So I will put one into my route. And then, yeah, let's go maybe
to the Mensa, which is the second, fourth, fifth... sixth item. And then why don't I go to C.A.R.L., which is the third item, and then to the university
hospital, which is the fifth, so three, five. And then we go to Campus Melaten, maybe, and then we go home. So four.
Okay, now we have this vector route. And I can use this to index, for example, the longitudes. Yes, and also the latitudes. And, oh, that's wrong: route. And then
also the labels. No, that's also wrong. Labels of route, like this. Yeah. All right: DPO chair, Mensa, C.A.R.L., university hospital, Campus Melaten. Perfect. Now how does quiver
work? Yeah, maybe we can actually start with the plot we want to draw over, like this. And then we put the route here, we put the route here, and we put
the route here. Okay, this only shows these five locations now. Yeah, but the y axis limits are now
really weird. So let's fix those, actually. No. Yeah, ylims: 50.769, maybe six, nine, and 50.792
or something. Whoa. What did I do? Oh, it should
be 769. Okay, that's not enough. Okay, maybe seven will do the trick. Okay, let's go with that. So we copy this down here. Okay, now it looks better. And now for quiver:
quiver is a function that makes a vector field plot, and
the i-th vector extends from... so the vector is the thing that is plotted, in our case that's the arrow, and it extends from (x[i], y[i]), so that's the coordinates of the start of the arrow, to (x[i] + u[i], y[i] + v[i]). And then they pass
these x, y, quiver = (u, v) here. So if we want to plot some arrows which draw our tour, and we have our tour, we have it in arrays that are like this, right, like these: we have two arrays where we have the
latitude and the longitude. So we want to start, obviously, at (x[i], y[i]), that's correct, we want the i-th vector to start there. But we actually want it to go to x[i+1] plus u... no, sorry, just
x[i+1]. So we have to find some way to create this u array in such a way that x[i] + u[i] is actually equal to x[i+1]. And to do this, I think it should be
sufficient to make u equal to the difference between x[i+1] and x[i]. So let's try to do that. Sorry. So we
basically have to shift this, we have to shift
this to the end. So this is x[i+1], basically, right. And then we have to compute the difference to x[i].
Okay, let's just see what this looks like. Yeah,
this is almost good. The only problem is that this arrow is pointing down here. Why though? Why is it pointing down
there? I guess I have to add some element to the last... I
have to add something to the
vectors in u. So maybe... Yeah, I guess the last
arrow should go to the first element again. So we need to put the difference between the first element and the last
element. So that would be route[1], minus... Right. Let's see if this works. Nice. Okay. Now it looks
exactly like what we want. The next task is to wrap the code
for plotting your tour into a function draw_tour, which receives a tour of arbitrary length and the plot's title as its two arguments and returns the plot. You can obtain a Julia representation of the plot by assigning the result of the functions plot and quiver to a variable and returning that variable. All right, so the function should be called draw_tour
and receive the arguments, the tour and also the plot's title. So let's just call this title. Let's copy and paste our code. And we have to change the title to title.
And we previously called the tour route. So let's make it simple and just assign it like that. Yeah, the indentation is now very ugly, but I guess it should work.
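The arrow computation that goes into this function can be sketched on its own. The helper name tour_arrows is hypothetical, and circshift is used here as a shortcut for the manual "shift, then append the first element" step from the video, so this is not necessarily how the sample solution does it.

```julia
# Hypothetical helper: arrow start points and offsets for a closed tour.
# x and y hold the coordinates of all places, route holds the indices of
# the places to visit, in order.
function tour_arrows(x, y, route)
    xs, ys = x[route], y[route]
    # circshift(xs, -1) moves every element one position forward and wraps
    # the first element to the end, so entry i holds the next stop and the
    # last arrow points back to the start.
    u = circshift(xs, -1) .- xs
    v = circshift(ys, -1) .- ys
    return xs, ys, u, v
end

# Toy coordinates, not real places.
xs, ys, u, v = tour_arrows([0.0, 1.0, 1.0], [0.0, 0.0, 1.0], [1, 2, 3])
```

The four results correspond to the arguments of the quiver call, with the offsets passed as quiver = (u, v).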
Anyway, one, two, three. draw_tour not defined. Okay. Okay. Nice.
That worked. But what I don't like about this is that it only shows the dots which are inside of the tour. I want it such that all of them are always drawn. Okay, that's
better. Cool. The next task is to use your function draw_tour to create an animated GIF file of plots in which the tour emerges sequentially, with each frame in the sequence
containing one more tour stop than the last. Okay, so we do this by first calling the Animation constructor, which creates our animation, then we have to add frames to the animation. And in the end, we create the GIF from the
animation by calling the function gif with the argument anim, and then we have to provide the frames per second, so that's how fast it will cycle through the frames it has, basically, and we also have to provide a file name, let's call this tour.gif. And in the middle,
we now write a little for loop, for stop in tour, or I guess we called it route, which adds a frame to our
animation. So to create the frame, we just create this plot, which we didn't return from the function yet. We can return it by assigning the result of quiver to a variable
and then returning that should do the trick. No, this just returns... Oh, no, it doesn't return anything, actually. Yeah, it doesn't return the... Oh, wait, I didn't type return
result. It just returns nothing. Okay, now it returns the plot. Perfect. And here we put draw_tour now, with our tour, which is called route in this case, and we only want to go up to the current stop, so we have to slice it. So let's use the
function eachindex to obtain the indices for our route. And then we just slice from one to stop. And in the end, we also want a title. So let's do it like
that: a tour in Aachen, and then we put the frame number here. Let's see what this does. Okay, and then we have our animated GIF, which sequentially builds the
tour through Aachen. Okay, the part with the interpolation doesn't work yet. That's because I confused the string interpolation syntax between Julia and Python. And now it also counts through the frames. Nice. The next task is to
implement a nearest neighbor heuristic for the TSP. So let's read through it. The problem of finding the shortest tour that visits every location in a given set of locations exactly once and returns to the first visited location at the end is known as the traveling salesman
problem, TSP. Even for small sets of locations, finding the shortest tour can be very difficult. One very simple way to find a suboptimal TSP tour is the nearest neighbor algorithm. The algorithm starts with a tour that contains only one location and builds a TSP solution by adding to the tour the location which is closest to the current end of the tour, as long as there are locations left to visit.
Implement the nearest neighbor algorithm to find a TSP tour that connects all the locations in Aachen which you have plotted above. Every time your algorithm adds a location to the tour, draw a frame and add it to an animation object. At the end, create an animated GIF file which shows the progress of your algorithm iteration by iteration. To compute distances between the locations, use the Euclidean
distance. Remark: Euclidean distance on coordinates is not a useful way to compute any real-world distance, but sufficient for this exercise. Okay. So on the outermost layer, we're creating an animation. So we
can just use our animation code from the last task. And now we basically have to change this. Instead of the route from one to stop, we want to draw something else. This
is the nearest neighbor tour we now have to compute. And instead of looping over the indices, we loop over the locations that are left. So
in every iteration of our loop, we will mark one location as used, basically. And to do this, we can use an array or a vector in which we store the value true if the
location has been used in the route. And if not, then we store the value false. So in the beginning, all locations are not used, so they are all stored as false.
So for the length of our locations, we can just use this vector labels, we can also use the vector long, just
to make the array used exactly as long as the number of locations. Let's have a look at this, actually. Okay, so this is what the array used looks like. Now, we can
also loop over long as well. And also, we have to create this nearest
neighbor route. As the text said, at first the nearest neighbor route contains only one location. Let's just start with location one. It doesn't matter which location we start with,
because it doesn't reduce the generality of the TSP solution, since every location will have to be visited eventually. It just may be that some starting locations give us a different route than others because of how the
nearest neighbor algorithm works. So obviously, if we use one as our first stop, then we will also have to mark this as used. So we can
put used of first stop equals true. Now, what we want to do is compute the distances from the last stop of our NN route to every other location that is
not yet used. And then from these, we want to choose the nearest, or the minimum. And we can do this by making a variable which we call minimum index. This can be anything, so let's just put minus one as a dummy value. And then minimum
distance. This has to be infinity or some very, very large value, because it will be necessary to accept the first thing we find as the current minimum. So let's just put
infinity here. And now we loop over the remaining stops. So again, we use a for loop for this.
And now we have to compute the distance to the current end of
the route. To do this, we use the Euclidean distance, which means that the outermost layer is the square root. And then we have to add two things. The two things we add are two differences, and we square the two differences. And what
we add is the longitude of neighbor minus the longitude of the current end of the route, which is the end of the NN
route. And yeah, let's just make some line breaks here. Okay, so this computes the distance. However, one
thing we forgot is that we only want to do this if the neighbor is not in use. So if used
of neighbor is false, then we want to do it. So if the opposite is true, then we do nothing. Sorry, yeah, if used at the index neighbor is true, then we skip this
entire iteration of the loop. Okay, and now we come to the minimization. So we obviously want to choose the nearest neighbor. And so if the distance to the
current neighbor we are looking at is smaller than the minimum distance to any other neighbor we have found so far, then we accept this as the new currently nearest known neighbor. This means that we put min index equals neighbor and min distance equals d. Okay, and
after this loop, these two are now meaningful values, which indicate the nearest neighbor that we have found. So
this means we can add to our nearest neighbor route the newly found neighbor, which is stored in the variable min index, and we also have to update the
array called used. And this has to be true now, because in this line we are using the neighbor as a new item in our route. And this end keyword
is not in the right place. Now we want to draw a frame, and the frame should just display the NN route that we
found so far. And to make it even nicer, we can already do that up here once. So here i equals one, or maybe i equals zero, actually. And let's call this iteration index i, and then we can do $i down here. Let's see what
this gives us. Unable to check bounds for indices of type typeof(first). And this is in line eight.
Oh, right. Maybe it works now. No: attempt to access 10-element Vector{Bool} at index [-1]. Oh, I guess this is because we have one iteration too many. So let's
just iterate from two to end. Okay. And there we have it, a nearest neighbor algorithm which creates a TSP tour among the
locations in Aachen and creates a GIF that cycles through the various stages, displaying the intermediate results as the nearest neighbor algorithm goes along.
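Leaving out the animation, the algorithm built up above can be sketched as a plain function. The function name, the argument names and the toy coordinates are assumptions for illustration; the structure (a used vector, a dummy minimum index of -1, a minimum distance starting at infinity, and an outer loop from two to the number of locations) follows the steps from the video.

```julia
# Sketch of the nearest neighbor heuristic on plain coordinate vectors.
function nearest_neighbor_route(x, y; start = 1)
    n = length(x)
    used = falses(n)          # used[i] is true once location i is in the route
    route = [start]
    used[start] = true
    for _ in 2:n              # one location is added per iteration
        current = route[end]
        min_index, min_distance = -1, Inf
        for neighbor in 1:n
            used[neighbor] && continue   # skip locations already visited
            # Euclidean distance to the current end of the route
            d = sqrt((x[neighbor] - x[current])^2 + (y[neighbor] - y[current])^2)
            if d < min_distance
                min_index, min_distance = neighbor, d
            end
        end
        push!(route, min_index)
        used[min_index] = true
    end
    return route
end

# Toy example: four points on a line.
route = nearest_neighbor_route([0.0, 5.0, 1.0, 2.0], zeros(4))
```

In the exercise, a frame would be drawn after each push! so that the GIF shows the route growing stop by stop.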
And you can immediately see that the result of the nearest neighbor algorithm is not an optimal solution to the TSP. Because obviously, if you want to make the shortest route among these and you're in West Park, you would never come up with the idea to go to campus first, then the university hospital, and then the DPO
chair. Chances are that you would go to the hospital first, then Campus Melaten, and then to the DPO chair. Anyway, with this, we are done with this task. For task three, we have to create some scatter plots from a data set that we can download by executing this cell. And there
it is, downloaded. Now we are supposed to read the data into a data frame. So again, we call our well-known function CSV.read. And we put DataFrame here,
not DataFrames. And there we have our data frame, which has 44 rows and four columns. Personally, I think the ID column is redundant, so I want to get rid of it. Okay, and now it's
just three columns. The task is to make ourselves familiar with the data frame: how many columns are there (it's three), and what are their names and types? As you can see, the x and y columns contain floating point numbers, while the column data set contains strings, which are Roman
numeral I, Roman numeral II, Roman numeral III, and, you can believe me, it's going to be Roman numeral IV for the last 10 rows, or no, 12 rows, I think. It doesn't really matter.
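As a small sketch of this inspection step, assuming DataFrames.jl is installed: the column names `id`, `dataset`, `x`, `y` and the four rows below are hypothetical stand-ins for the downloaded file, which in the exercise is read with `CSV.read` and `DataFrame` as the sink.

```julia
using DataFrames

# Hypothetical stand-in for the downloaded data; names and values are assumptions.
df = DataFrame(id = 1:4,
               dataset = ["I", "I", "II", "II"],
               x = [10.0, 8.0, 10.0, 8.0],
               y = [8.04, 6.95, 9.14, 8.14])

select!(df, Not(:id))            # drop the redundant id column in place

println(size(df))                # (number of rows, number of columns)
println(names(df))               # column names as strings
println(eltype.(eachcol(df)))    # element type of each column
```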
It's not extremely important. Again, the answers to the questions: we have three columns, with types string, float, and float. The next task starts with an explanation, which says: the data in our data frame is divided into four disjoint sub-data sets,
which are labeled I, II, III, and IV, respectively, as indicated in the column data set. The actual data is stored in the columns x and y. Now the task is: for each sub-data set, calculate and print descriptive statistics for both the x and the y values. Include the mean, median, and standard deviation. Okay, and to access the four sub-data sets sequentially, I think it's useful to just
use the function groupby. We group our data frame, which we haven't assigned to a variable yet, so let's do it like this. We group our data frame by the column data set.
This gives us a grouped data frame with four groups based on the key data set. Now we can use a for loop to iterate through these groups, and the for loop can just print the four data frames like this. And
now the task is to calculate and print descriptive statistics for both the x and the y values. We can use the function describe for that: we call describe on the group, and we select
the columns x and y. A missing comma in the argument list? That's weird. Where's the missing comma? All right, the mistake is in the describe call. Okay, it's written like this.
Okay, and now we get these data frames. The task is to include the mean, median, and standard deviation, and we can look at the documentation to find out how to do that. So let's search for Julia DataFrames
describe. How about that. Okay, describe takes a data frame and columns, or we can call it with a data frame, stats, and columns. The stats are
symbols, and each can be a symbol from the list mean, std, min, and so on. So this is, I think, what we want to do. So let's put stats here: stats equals
the mean, so mean, the median, and the standard deviation.
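A minimal sketch of the grouping and the describe call, assuming DataFrames.jl 1.x, where the statistics are given as positional symbols and `cols` selects the columns. The six rows and the column name `dataset` are hypothetical; the real exercise data has 44 rows.

```julia
using DataFrames

# Hypothetical miniature data set; names and values are assumptions.
df = DataFrame(dataset = ["I", "I", "I", "II", "II", "II"],
               x = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
               y = [2.0, 4.0, 9.0, 1.0, 1.0, 4.0])

gdf = groupby(df, :dataset)      # one sub-data frame per label

for group in gdf
    # Statistics are positional symbols; `cols` restricts the described columns.
    stats = describe(group, :mean, :median, :std, cols = [:x, :y])
    println(first(group.dataset))
    println(stats)
end
```

Each printed table has one row per described column (x and y) and one column per requested statistic.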
No method matching describe. Maybe this is superfluous. Yeah, okay, now it works. As you can see, we have the four groups here, and the means of the x values and the y values of the groups are all basically the same. So the mean of the x values is always 9,
the mean of the y values is always very close to 7.5, and the same is true for the median. So the median of the x values is always close to 9,
the median of the y values is also always somewhere around 7 or 8, so no big difference here. And the standard deviation is also remarkably similar: for the x values the standard deviation is approximately 3.316, and for the y values it is
approximately 2.03. So from these descriptive statistics, it looks like the four sub-data sets are extremely similar, I think. And this is also part of the task: to write down how similar or different you expect the four sub-data sets to be. So I would
personally expect them to be extremely similar if I just saw these statistics. The next task again starts with a short description: to allow for a good comparison between the four sub-data sets, it is best to plot them as four scatter plots in one figure. Adapt the example from the Julia plotting
tutorial and plot the four sub-data sets into a two-by-two grid. And then we have some more tasks to make the xlims and ylims the same for all four plots. So let's look at this example from the Julia plotting tutorial to see what this is about.
And there's this example here where, in the end, they call the plot function and pass these four arguments, which they have created before by calling various plot functions.
Then they put layout two by two and legend equals false. So let's actually copy and paste this, because I think it will be very useful. Now we only have to create p1, p2, p3, and p4, and we can do that, I think, by again using groupby to get our sub-data sets.
Or why don't we do it in a loop, actually. Okay, so this should create four scatter plots
and plot them in a grid. I just have to replace this. Let's see what it looks like. Okay.
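One way to build the four panels in a loop and arrange them in a grid might look like this, assuming Plots.jl and DataFrames.jl are installed. The data values, the column name `dataset`, and the variable names `panels` and `fig` are hypothetical; only the `layout` and `legend` keywords come from the tutorial call mentioned above.

```julia
using DataFrames, Plots

# Hypothetical miniature data with four labeled groups of three points each.
df = DataFrame(dataset = repeat(["I", "II", "III", "IV"], inner = 3),
               x = Float64.(1:12),
               y = [2.0, 4.0, 6.0, 1.0, 9.0, 4.0, 5.0, 5.0, 8.0, 3.0, 2.0, 7.0])

# One scatter plot per group, collected in a vector instead of p1, p2, p3, p4.
panels = [scatter(g.x, g.y, title = first(g.dataset))
          for g in groupby(df, :dataset)]

# Splat the four panels into one figure with a 2×2 layout.
fig = plot(panels..., layout = (2, 2), legend = false)
```

Collecting the panels in a vector keeps the code the same length no matter how many groups there are; only the layout tuple has to match.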
So we have these four plots now, but the problem is that it's difficult to compare them because they're not all on the same axes. This one goes from 10 to 15.7, these go from 5 to 12.5,
and the y-axes are also different. So maybe we can find a way to make them share the same axes. Why don't we just try this first
and see what happens if we just put 1 and 20 here and also 1 and 20 here. Ah, missed a comma. Okay. I think we can make it a little nicer. The task says:
try to infer appropriate axis limits from the data. So let's just try the minimum of the column x in our data frame for the x limits,
and the maximum of the column x, and let's do the same for the column y.
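Inferring shared limits from the data can be sketched in plain Julia as below. The helper name `axis_limits`, its optional `pad` margin, and the sample values are assumptions made for illustration.

```julia
# Shared axis limits inferred from the data (sketch).
# `pad` is a hypothetical optional margin; with pad = 0 the limits are exactly min and max.
axis_limits(values; pad = 0) = (minimum(values) - pad, maximum(values) + pad)

# Hypothetical x and y values pooled over all sub-data sets.
x = [10.0, 8.0, 13.0, 4.0, 19.0]
y = [8.04, 6.95, 3.1, 12.74, 5.25]

xl = axis_limits(x)
yl = axis_limits(y)
println(xl, " ", yl)

# With Plots.jl, the same tuples can be passed to every panel, for example
#   scatter(g.x, g.y, xlims = xl, ylims = yl)
```

Because the limits are computed once over the whole data frame and not per group, all four panels end up on identical axes.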
Okay, now it's still not perfect, because the one marker that's on the axis is not fully visible. So why don't we subtract one from
the minimum and add one to the maximum, just to give this a little bit more room. And now we can see it. So the final question in the task is: does the final plot
match what you expected when you compared the mean, median, and standard deviation? You remember that the mean, median, and standard deviation make it look like the data sets are very similar. But the truth is that they're not very similar. They are in a similar range, but obviously, if you look at them, they have very different
shapes when you compare the scatter plots. For example, this scatter plot is much noisier than this one. This scatter plot has barely any variance on the x-axis. And this scatter plot suggests there's a quadratic relationship between x and y, while
this scatter plot looks more like there's a linear relationship, and this one also looks more like a linear relationship, with this one point being kind of an outlier. So you can see that in these scatter plots we can find a lot of information about these data sets which we
wouldn't have seen in descriptive statistics like these numbers here, and which is in fact very hard to find with any descriptive statistics. So I hope you learned that plots are a very useful tool when you want to generate
intuitive insights into the data sets you're working with. With that said, we're done with the plots exercise video. I hope I could answer all of your questions about how to solve the plots exercises. If not, please ask a question in the
Moodle forums. I'm happy to help you there. You can also write me an email at the address which we also shared in Moodle. In any case, have a nice day and see you next time.