Introduction to Programming for Business Analytics - Exercise 8: Plotting
Formal metadata
Number of parts | 22
License | CC Attribution 4.0 International: You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/64447 (DOI)
Transcript: English (automatically generated)
00:10
Hi, welcome to the eighth exercise video for the Introduction to Programming for Business Analytics class. Today we're going to look at the plotting exercise. Before we get started,
00:23
please make sure that you attempt to solve all the tasks in the exercise by yourself before you watch the respective part of the video where I explain the solution, because otherwise you will not really learn how to solve the tasks by yourself. With that said, let's get started with the first task, which is about
00:43
plotting COVID data. The first subtask is to load the Plots package. If we haven't installed the package yet, we have to install it first. So let's just type the commands we need to install the package, just for good measure. Okay, the package already seems
01:12
to be installed. So we can load it with this command. And the next task is to create
01:21
a data frame that contains the data from the file COVID WHO data dot CSV, which we provided via Moodle; you have to download it and put it into the same directory on your hard drive as the Jupyter notebook you are editing. To create a data frame from a CSV file, we need the packages DataFrames and CSV. And then we can just do
01:44
CSV.read: we assign the result to a variable, then we put the file name here, and then the data type we want to create from the data which we read from the file.
02:05
All right, I provided the wrong data type: I passed a package instead of a data type. Okay, but now it works, we get the data frame out. As we can see, the data frame has three columns, which are called date reported, new cases, and
02:21
new deaths respectively. And so now let's go to the next task, which is to plot the data from the data frame we created. And we should create a line plot, which shows date reported on the x axis and new cases on the y axis. So as you know, when we call the plot function without any special arguments, we just get a line plot. And the
02:44
first argument is basically the data that goes on the x axis. So that would be df.date_reported. And for the y axis, we have df.new_cases. And this gives us this plot, where, as you can see, on the x axis we have the dates in this very
03:05
weird looking way where they all intersect and are not very nice and informative. And on the y axis, we have the case numbers like that. And the task also says that we should make sure that our plot does not show a legend. And we
03:21
should give it an appropriate title. So let's put the keyword argument legend = false here, which will turn off the legend. There it goes. And if we also put the keyword argument title and provide an appropriate title such as new
03:45
COVID infections, Germany, then we get this plot, which has an appropriate title, and no legend. The next task says, as you can see, the labels on the x axis of the plot are not very informative. This is because the dates are
04:03
represented as strings in the data frame. Now we should add another column to the data frame in which the dates are represented as the data type Date from the Julia package Dates. The first thing we have to do is to load the package like this. And now we can make a new column by just assigning to
04:24
the column name. For example, we could just call that date. And then to this, we assign the new column. And for this new column, we basically have to call the constructor for the Date data type; we have to call it for
04:41
every item in the old column. The old column was called date_reported, so why don't we start with this. And now we can use a neat little trick, which is called an array comprehension. And for
05:00
this, we can just call some function for every item in the vector. And in this case, we call the constructor for the Date data type. And now we don't really know how to use this
05:20
constructor. So let's have a look at the documentation of the package Dates. And there we have the constructor. Maybe we can also find a more comprehensive documentation here. And here you
05:47
can see, for example, that they provide a string to this Date constructor. And as the second argument, they provide a date format string. So let's try this in our code. And
06:08
so if we just make a new cell and execute this, then you can see that we get this date object from this string now. But in our case, the strings are formatted in a different way. And
06:23
the day comes first, then the month, and then the year. So we will have to change this date format. Let's provide an example
06:43
date to see if it works. Okay, so this gives us a date in the year 15. First of January in the year 15. And maybe we can
07:03
adjust this by adding 2000 years. Nope, that does not
07:20
work. Let's have another look at the documentation. Okay, here
07:55
we can see that they do some date arithmetic. And they do
08:01
this with these functions, Dates.Month and Dates.Day. And these actually create instances of the Period data type. So there is also the Period here. And I think if
08:26
we adjust the code by adding this, but instead
08:41
of one year we add 2000 years, yes, then we get the date we want, which is in the year 2015, because this is how the data is represented in our data frame up here. So now we just have to copy this here. We copy the date format here. And
09:02
we put date, which is, if you remember, what we called the single items in our column. And now let's have a look at the resulting data frame. Okay, and we can see there is a new
09:22
column; it has the type Date. And it has these representations of the date, which look like what we want: they have the year first, then the month, in this case August, and then the day, 23. Okay.
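The parsing step walked through here can be sketched as follows. This is only a minimal sketch: the slash separator and the sample strings are assumptions (the video only states that the day comes first, then the month, then a two-digit year), and the variable names are made up for illustration.

```julia
using Dates

# Hypothetical sample strings in day/month/two-digit-year order.
# The "/" separator is an assumption; the real file may differ.
date_reported = ["23/08/15", "24/08/15", "01/01/20"]

# "yy" takes the two-digit year literally (15 becomes the year 0015),
# so 2000 years are added afterwards as a Dates.Year period.
fmt = dateformat"dd/mm/yy"
dates = [Date(s, fmt) + Dates.Year(2000) for s in date_reported]

println(dates)
```

The comprehension calls the `Date` constructor once per string, which is the same pattern as assigning the result to a new column of the data frame.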
09:41
With that, we are done with this task. For the next task, we should draw the plot again, but this time we use our newly created column for the x axis. So here we use df.date. And on the y axis, we still have the case
10:00
numbers. Let's see what these were called again, new cases. Okay, so new cases. Okay, the plot looks very similar. Let's also copy the stuff that made it a little nicer. Okay, and
10:26
there we have it, our COVID infections. But now the x axis has changed, because the labels on the x axis are not intersecting anymore. And they are, in general, a little more informative. Okay. So now, from the plot, the task goes on
10:45
saying, we can see that the case numbers can strongly vary between adjacent days, but follow steady or long-term trends when we compare adjacent weeks. What is meant by that is that the case numbers have this huge interval where they kind of go up and down. But all in all, they
11:05
follow a trend that goes like this, right? So they go up here and even more up and then they go down again, and then they go up and so on. So this is meant by the trend. And if we are only interested in the trend, rather than the
11:22
day by day numbers, we can create a less noisy plot by computing weekly averages. And now the task is to create a new data frame in which we group the entries of the original data frame by week, then take the averages of the case and death numbers. And so the hint is to use the function
11:45
firstdayofweek to obtain the first day of the respective week from an instance of the type Date. Okay. So if we look at our data frame, it has this date column now. And if we call the
12:08
function firstdayofweek on all of these, then we get back the first day of the respective week for each date. So you can see that there are seven entries which all have the same date, then
12:25
followed by another seven entries for the next first day of the week, and so on. And we can now use this to create week averages. And let's just add this first day of the week
12:43
as a column to our data frame. Now the column is here and then we can now group by the column and we can combine and
13:04
we want to create averages in the column new cases and new deaths. So we take the column new cases, we send it to the mean, and we call the result new cases, average, or maybe
13:32
weekly average even. And then we can do the same thing for
13:41
the deaths. Okay, let's see what this gives us. The mean
14:05
is not defined. That's because we have to load the Statistics package first. Okay, now we have this data frame, which has three columns: the first day of the respective week, then the new cases weekly average, and the new deaths, also a weekly
14:22
average. Alright, and to use this later, we now have to assign it to a new variable. Let's call this df2. Now, the next task is to create a combined line plot, which is one plot with two lines, which show the weekly averages
14:44
of both cases and deaths on the y axis with respect to the first day of the week, which we will put on the x axis. And now the task goes on by saying: choose an informative title for your plot and create a legend that tells which line is which. Okay, let's get started. So the first
15:04
argument is always our x axis, which in this case is df2 dot first day of week. And on the y axis, we want both cases and deaths. So here we pass a vector. The first
15:24
element of the vector is our cases, and the second element would then be the deaths. Let's see what this looks like. Okay, yeah, the rudimentary plot is already
15:46
finished. Now, we still have to choose an informative title for our plot. So let's put something like new COVID
16:06
infections. Or maybe just COVID data, weekly COVID data Germany, and then the legend will do the rest.
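The weekly averages that feed this plot can also be sketched with the standard library alone. Note that this swaps in plain vectors and a comprehension for the grouping and combining of the data frame done in the video; the dates and case numbers below are made up for illustration.

```julia
using Dates, Statistics

# Hypothetical daily data: 14 consecutive days, starting on a Monday.
dates = collect(Date(2021, 3, 1):Day(1):Date(2021, 3, 14))
new_cases = collect(1:14) .* 100   # made-up numbers

# Map every date to the first day (Monday) of its week.
week_start = firstdayofweek.(dates)

# Average the values that share the same first day of the week.
weeks = sort(unique(week_start))
weekly_avg = [mean(new_cases[week_start .== w]) for w in weeks]

for (w, avg) in zip(weeks, weekly_avg)
    println(w, " => ", avg)
end
```

Each week collapses to a single value, which is why the resulting line is much less noisy than the day-by-day numbers.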
16:21
And in the legend, we can put something like new cases, average, or weekly average, and then we leave out the weekly in the title. And again, then we put something like
16:48
new deaths, weekly average. Let's see what this looks like. We missed a comma here. Okay, now the format of the legend is wrong. Maybe it has to be a vector that is shaped like this. No, maybe like this. No, that also doesn't
17:02
work. Well, let's look it up in the documentation. Let's see
17:22
if they have an example here. Plot attributes. Okay. Okay, the keyword that we have to use is not legend, but label instead. So yeah, we just put label here. Again,
17:43
what was the format? Okay, they put a space. So this should do the trick. Yes. Okay. And yeah, there we have our plot. But one last thing we can do is actually to put
18:01
the legend in a different place. I think this works like this: top right is where it is currently, I guess. Yep. And if we put top left, then it should go over here. Yes. Okay, now it's not overlapping with the line in the
18:21
plot anymore. Okay. The next task is, do you think the plot you created is perfectly informative? Can you think of a better plot? If so create a better one. So let's have a look at our plot. And what is not very informative about the plot is the fact that we cannot really make out any
18:43
differences in the deaths, because the numbers of cases are so high that the deaths basically become this flat line. So we don't see any of the variation in the deaths, because the numbers are just much smaller. And I think we could make the
19:02
plot more informative by splitting it in two, where we have one plot that shows just the cases and another that shows just the deaths, but they are both on the same x axis so we can compare the dates. And how can we do this? Let's
19:20
have another look at the documentation. Yeah, okay, they have this example here. And here they introduced this parameter which they call layout. And if we pass layout
19:42
(4, 1), then we get these four plots which are stacked on top of each other. And the one says that it's just one column of plots. So let's try what happens if we just add
20:04
this to our existing plot code. Okay. What we have now is we have two plots where the first plot shows the new
20:22
cases weekly average and the second plot shows the new deaths weekly average, and as you can see, they are both on the same x axis now. But what is a little sad is that the legend for the new deaths is actually overlapping with this plot here. So maybe we can actually move
20:43
that by passing a vector here. Let's see. So we want the first to be in the top left and the second to be in the top right. No, that does not work. Maybe we have to
21:02
call it legends. Also doesn't work. Let's see if the documentation has to say something about that. Maybe in the plot attribute. Or maybe we can just search for legend. Okay,
21:41
so now we have to find out whether it's a series attribute, a plot attribute, an axis attribute, or a subplot attribute. I guess it could be a subplot attribute, or maybe we just have a vector
22:03
that has the wrong shape. Okay. Yes, now we're done. Yeah, there is our plot with the legends not intersecting with the lines of the plot. But what
22:21
is a little suboptimal is that we have the title twice. So there was one way to change this. Maybe the keyword we wanted was plot title. No, that does not seem to be the case. Okay, so what are the different ways to set titles?
22:41
But I guess this should definitely be a plot attribute. Maybe we can search for title here. Okay, it's plot_title. And it also says: title for the whole plot, not the subplots. So let's try plot_title. Okay, and there it is, our plot with
23:04
only one title and non overlapping legends. Task two is about locations in Aachen. And the task starts by saying, obtain the coordinates of 10 of your favorite places in the city of Aachen. You can use OpenStreetMap.org, click on
23:22
query features, then click on the map, then choose a node from the list on the left and write down its location, which is given in coordinates, latitude and longitude. Okay, let's have a look. So this is OpenStreetMap; we can scroll all the way into Aachen. And maybe we want the
23:42
coordinates of the park in the Frankenberger Viertel. Then we use query features, we click on something in the park, and then it shows us this list. And we can just click on some node and there we have its location. And then we
24:01
can just do this for 10 things in Aachen, and then we obtain a list of locations that looks much like this. And to create a scatter plot of our favorite places, we just call the plot function, we provide the latitude and the longitude, and then we put the keyword argument seriestype
24:21
= :scatter. And then we have this not very nice yet map of our favorite places in Aachen. Let's actually remove the legend, because I think the legend is not very informative. There we go. Now the
24:45
second task is to create another plot this time remove the legend. Okay, we already did that. And add an informative title. Okay, let's just start with this code. Add an informative title, how about my favorite places in
25:05
Aachen. And we should add a label to every place which provides a description like SuperC, if the SuperC should
25:20
be among our favorite locations. And to do this, we can provide a vector of tuples as an argument with the keyword annotations. So let's do this: annotations, a vector of tuples. And the tuples must have three items, which indicate the x coordinate, y coordinate, and label data, respectively. And the label data in turn is another
25:43
tuple with three items, which indicate the label text, the relative position such as top or left, and the font size. Okay, so we have to make a tuple which contains three things: x, y, and the label data, which is another tuple, the first
26:05
of which would be the label text, then the relative position, let's go with left, and the font size, let's go with nine. And we do this for every something in something.
26:27
And for the latter something, we can use the function zip. The function zip takes vectors and puts them all
26:41
into one vector of tuples. And if we provide longitude, latitude, and the labels, which we still have to create, then it will zip this and then we can read x, y and label out
27:01
of here. And now we have to create the labels. For this, I will refer to the sample solution because I don't want to type it all right now. Let's see what this does. Okay, now
27:21
you can see that we have some annotations. The annotations intersect with the markers of the scatter plot, unfortunately, and also some of them are not really inside of our plot anymore. We can fix the fact that they
27:41
intersect with the markers by just adding some white space. So like this, we just add two spaces to every label. And then the labels are on the
28:01
right side of the dots, even though we told them to be on the left side, but whatever. I guess the dots are on the left side of the labels. Okay, I think we're done with the task. Let's see. Yep, looks like we're done.
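The annotation vector built in this task can be sketched as below. The coordinates and labels are hypothetical placeholders, not the ones from the sample solution, and the names `long`, `lat` and `labels` simply follow the way the vectors are referred to in the video.

```julia
# Hypothetical coordinates and labels for three places.
long = [6.0600, 6.0780, 6.0440]
lat = [50.7780, 50.7790, 50.7760]
labels = ["SuperC", "Mensa", "University hospital"]

# One (x, y, (text, position, font size)) tuple per place. The two
# leading spaces keep the text from sitting on top of the marker.
annotations = [(x, y, ("  " * label, :left, 9))
               for (x, y, label) in zip(long, lat, labels)]

println(annotations[1])
```

With Plots loaded, a vector like this would be passed to the plotting call through the `annotations` keyword, as described in the video.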
28:20
Next task: depending on your places and labels, some labels may not be fully visible in the plot you created. Yes, definitely the case down here. To change this, adjust the x axis limits and y axis limits as needed by providing the appropriate arguments to the plot function. Okay, let's copy and paste our plot down here. Now, what are
28:45
the keyword arguments for the x axis limits and y axis limits? We can look it up in the Plots documentation. Yeah, the section on axis limits sounds about right. So ylims seems
29:03
to control the limits of the y axis here. And yeah, I think then xlims would control the limits of the x axis. So
29:20
let's see, we have to put a comma here first. Okay, let's see what this does. So if I put xlims, what could be appropriate limits? Maybe 6.04 or something, and 6.011. Let's start with that. 6.11, right? Yes. Oh, no, maybe
29:51
6.101, actually. No, this is not enough. Let's go with this.
30:01
Okay. How about this? Okay, this looks better. All right. Almost good. Okay, maybe 125. Okay, this is barely enough.
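Instead of finding the limits by trial and error as in the video, they could also be derived from the data. The following is only a sketch of that alternative: the longitudes are hypothetical and the padding fractions are arbitrary choices, not values from the exercise.

```julia
# Hypothetical longitudes of the plotted places.
long = [6.0440, 6.0600, 6.0780]

# Pad the data range: a little room on the left and more on the right,
# because left-aligned labels extend to the right of their markers.
lo, hi = extrema(long)
span = hi - lo
x_limits = (lo - 0.05 * span, hi + 0.40 * span)

println(x_limits)
```

With Plots loaded, the tuple would be passed as `xlims = x_limits`; the same idea applies to `ylims` with the latitudes.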
30:22
And maybe we can also move it a little bit into the other direction by passing ylims. Otherwise, we will just have to live with the fact that these intersect. So I guess the upper y limit would be something like 50.77, maybe 80. Like
30:44
this, the lower limit would be something like 76. Okay, this is way too much. Nope, it doesn't really work. Okay. But we just
31:15
have to live with the fact that these intersect, but at least the whole label is now readable in the plot. The next
31:25
task is to choose five out of our 10 places and to draw a route that starts from our house and visits each of the five locations exactly once before finally ending at our house. Again, the sequence of the locations visited in the route should be indicated by an arrow that points which
31:41
location will be visited next from each of the locations. So then there's the hint to use the function quiver. But first, let's actually make the route. So I don't know about you, but I personally live in my office, which is in the DPO
32:01
chair. So the DPO chair is the first item in the list. So I will put one into my route. And then yeah, let's go maybe
32:21
to the Mensa, which is the second, fourth, fifth, sixth item. And then why don't I go to CARL, which is the third item, and then to the university
32:49
hospital, which is the fifth, so three, five. And then we go to Campus Melaten, maybe, and then we go home. So for
33:05
Okay, now we have this vector route. And I can use this to index the long, for example, the longitude. Yes, and also the latitudes. And oh, oh, that's wrong route. And then
33:24
also the labels. No, that's also wrong. Labels route, like this. Yeah. All right. DPO chair, Mensa, CARL, university hospital, Campus Melaten. Perfect. Now how does quiver
33:44
work? Yeah, maybe we can actually start with the plot we want to draw over, like this. And then we put the route here, we put the route here. And we put
34:03
the route here. Okay, this only shows these five locations now. Yeah, but the y axis limits are now
34:20
really weird. So let's fix those, actually. No. Yeah, ylims: 50.769, maybe, and 50.792
34:46
or something. Whoa. What did I do? Oh, it should
35:01
be 769. Okay, that's not enough. Okay, maybe seven will do the trick. Okay, let's go with that. So we copy this down here. Okay, now it looks better. And now for quiver
35:37
quiver is a function that makes a vector field plot and
35:41
the i-th vector extends from, so the vector is the thing that is plotted, in our case that's the arrow, and it extends from (x_i, y_i), so that's the coordinates of the start of the arrow, to (x_i + u_i, y_i + v_i). And then they pass
36:01
these x, y, quiver = (u, v) here. So if we want to plot some arrows which draw our tour, and we have our tour, we have it in arrays that are like this, right, like these, we have two arrays where we have the
36:23
latitude and the longitude. So we want to start, obviously, at (x_i, y_i), that's correct, we want the i-th vector to start there. But we actually want it to go to x_{i+1}, plus, you know, sorry, just
36:42
x_{i+1}. So we have to find some way to create this u array in such a way that this is actually equal to x_{i+1}. And to do this, I think it should be
37:03
sufficient to make u equal to the difference between x_i and x_{i+1}. So let's try to do that. Sorry. So we
37:54
basically have to shift this, we have to shift
38:01
this to the end. So this is x_{i+1}, basically, right. And then we have to compute the difference to x_i.
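The shift-and-subtract idea can be sketched as follows. The coordinates are hypothetical, and `circshift` is one possible way to do the shift (the video does not name the function it uses); because the shift wraps around, the last arrow points back to the first stop.

```julia
# Hypothetical tour coordinates (longitude as x, latitude as y).
x = [6.060, 6.078, 6.044, 6.050]
y = [50.778, 50.779, 50.776, 50.781]

# circshift(x, -1) moves every element one position towards the front
# and wraps the first element to the end, so entry i holds x[i+1]
# and the last entry holds x[1].
u = circshift(x, -1) .- x
v = circshift(y, -1) .- y

# Arrow i starts at (x[i], y[i]) and ends at (x[i] + u[i], y[i] + v[i]).
println(collect(zip(u, v)))
```

With Plots loaded, the arrows would be drawn with `quiver(x, y, quiver = (u, v))`, matching the signature read from the documentation in the video.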
38:29
Okay, let's, let's just see what this looks like. Yeah,
38:44
this is almost good. The only problem is that this arrow is pointing down here. Why though? Why is it pointing down
39:07
there? I guess I have to add some element to the last,
39:40
I have to add something to the end of
39:46
the vectors in q. So maybe... Yeah, I guess the last
40:05
arrow should go to the first element again. So we need to put the difference between the first element and the last
40:23
element. So that would be route one, minus. Right. Let's see if this works. Nice. Okay. Now it looks
40:53
exactly like what we want. The next task is to wrap the code
41:00
for plotting your tour into a function draw tour which receives a tour of arbitrary length and the plot's title as its two arguments and returns the plot. You can obtain a Julia representation of the plot by assigning the result of the functions plot and quiver to a variable and returning that variable. Alright, so the function should be called draw
41:21
tour, receive the arguments, the tour and also the plot's title. So let's just call this title. Let's copy and paste our code. And we have to change the title to title.
41:45
And we previously called the tour route. So let's make it simple and just assign it like that. Yeah, the indentation is now very ugly, but I guess it should work.
42:02
Anyway, 123. Draw tour not defined. Okay. Okay. Nice.
42:22
That worked. But what I don't like about this is that it only shows the dots which are inside of the tour. I want it such that all of them are always drawn. Okay, that's
42:50
better. Cool. The next task is to use your function draw tour to create an animated GIF file of plots in which the tour emerges sequentially, with each frame in the sequence
43:01
containing one more tour stop than the last. Okay, so we do this by first calling the Animation constructor, which creates our animation, then we have to add frames to the animation. And in the end, we create the GIF from the
43:21
animation by calling the function gif with the animation as the argument, and then we have to provide the frames per second. So that's how fast it will cycle through the frames it has, basically, and we also have to provide a file name, let's call this tour.gif. And in the middle,
43:45
we now write a little for loop, for stop in tour, or I guess we call it route, which adds a frame to our
44:02
animation. So to create the frame, we just create this plot, which we didn't return from the function yet. We can return it by assigning the result of quiver to a variable
44:22
and then returning that. That should do the trick. No, this just returns... oh no, it doesn't return anything, actually. Oh, wait, I didn't type return
44:42
result. It just returns nothing. Okay, now it returns the plot. Perfect. And here we put draw tour now, with our tour, which is called route in this case, and we only want to go to the current stop, so we have to slice it. So let's use the
45:03
function eachindex to obtain the indices for our route. And then we just slice from one to stop. And in the end, we also want a title. So let's do it like
45:24
that: a tour in Aachen, and then we put the frame number here. Let's see what this does. Okay, and then we have our animated GIF which sequentially builds the
45:44
tour through Aachen. Okay, the part with the interpolation doesn't work yet. That's because I confused the interpolation syntax between Julia and Python. And now it also counts through the frames. Nice. The next task is to
46:06
implement a nearest neighbor heuristic for the TSP. So let's read through it. The problem of finding the shortest tour that visits every location in a given set of locations exactly once and returns to the first visited location at the end is known as the traveling salesman
46:20
problem (TSP). Even for small sets of locations, finding the shortest tour can be very difficult. One very simple way to find a suboptimal TSP tour is the nearest neighbor algorithm. The algorithm starts with a tour that contains only one location and builds a TSP solution by adding to the tour the location which is closest to the current end of the tour, as long as there are locations left to visit.
46:42
Implement the nearest neighbor algorithm to find a TSP tour that connects all the locations in Aachen which you have plotted above. Every time your algorithm adds a location to the tour, draw a frame and add it to an animation object. At the end, create an animated GIF file which shows the progress of your algorithm iteration by iteration. To compute distances between the locations, use the Euclidean
47:03
distance. Remark: Euclidean distance on coordinates is not a useful way to compute any real-world distance, but sufficient for this exercise. Okay. So on the outermost layer, we're creating an animation. So we
47:22
can just use our animation code from the last task. And now we basically have to change this. Instead of the route one to stop, we want to draw something else. This
47:40
is the nearest neighbor tour we now have to compute. And instead of looping over the indices, we loop over the locations that are left. So we
48:04
in every iteration of our loop, we will mark one location as used basically. And to do this, we can use an array or a vector in which we store the value true if the
48:24
location has been used in the route. And if not, then we store the value false. So in the beginning, all locations are not used, so they are all stored as false.
48:45
So as the length of our locations, we can just use this vector labels, we can also use the vector long, just
49:01
to make the array used exactly as long as the number of locations. Let's have a look at this, actually. Okay, so this is what the array used looks like. Now, we can
49:23
also loop over long as well. And also, we have to create this nearest
49:45
neighbor route. As the text said, at first the nearest neighbor route contains only one location. Let's just start with location one. It doesn't matter which location we start with,
50:04
because it doesn't reduce the generality of the TSP solution, because every location will have to be visited eventually. It just may be that some locations give us a different route than others because of how the
50:24
nearest neighbor algorithm works. So obviously, if we use one as our first stop, then we will also have to mark this as used. So we can
50:43
put used of first stop equals true. Now, what we want to do is compute the distances from the last stop of our route, the NN route, to every other location that is
51:03
not yet used. So and then from these, we want to choose the nearest or the minimum. And we can do this by making a variable which we call minimum index. This can be anything. So let's just put minus one as a dummy value. And then minimum
51:23
distance. This has to be infinity or some very, very large value, because it will be necessary to accept the first thing we find as the current minimum. So let's just put
51:47
infinity here. And now we loop over the remaining stops. So again, we use a for loop for this.
52:17
And now we have to compute the distance to the current end of
52:20
the route. To do this, we use the Euclidean distance, which means that the outermost layer is the square root. And then we have to add two things. The two things we add are two differences. And then we square the two differences. And what
52:48
we add is the longitude of neighbor minus the longitude of the current end of the route, which is nn
53:00
route. And yeah, let's just make some line breaks here. Okay, so this computes the distance. However, one
53:35
thing we forgot is that we only want to do this if the neighbor is not in use. So if used
53:46
of neighbor is false, then we want to do it. So if the opposite is true, then we do nothing. Sorry, yeah, if used at the index neighbor is true, then we skip this
54:07
entire iteration of the loop. Okay, and now we come to the minimization. We obviously want to choose the nearest neighbor. And so if the current distance to the
54:21
current neighbor we are looking at is smaller than the minimum distance to any other neighbor we have found so far, then we accept this as the new currently nearest known neighbor. This means that we put min index equals neighbor and min distance equals D. Okay, and after this loop,
54:51
these two are now meaningful values, which indicate the nearest neighbor that we have found. So
55:00
this means we can add to our nearest neighbor route the newly found neighbor, which is stored in the variable min index, and we also have to update the
55:21
array called used. And this has to be true now, because in this line we are using the neighbor as a new item in our route. And this end keyword
55:48
is not in the right place. Now we want to draw a frame and the frame should just display the NN route that we
56:02
found so far. And to make it even nicer, we can already do that up here once. So here i equals one, or maybe i equals zero actually. And let's call this iteration index i, and then we can do dollar i down here. Let's see what
56:26
this gives us. Unable to check bounds for indices of type typeof(first). And this is in line eight.
56:46
Oh, right. Maybe it works now. No: attempt to access 10-element vector at index minus one. Oh, I guess this is because we have one iteration too many. So let's
57:08
just iterate from two to n. Okay. And there we have it, a nearest neighbor algorithm, which creates a TSP route among the
57:24
locations in our home and creates a GIF that cycles through the various stages, displaying the intermediate results as the nearest neighbor algorithm goes along.
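The loop described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the function name `nearest_neighbor_route`, the argument names `lon` and `lat`, and the `start` keyword are assumptions made for this example, and the frame drawing for the GIF is left out.

```julia
# Minimal nearest-neighbor sketch. All names here are hypothetical and not
# taken from the exercise notebook; plotting of the frames is omitted.
function nearest_neighbor_route(lon::Vector{Float64}, lat::Vector{Float64}; start::Int = 1)
    n = length(lon)
    used = falses(n)          # used[i] is true once stop i is part of the route
    route = [start]
    used[start] = true
    for _ in 2:n              # one stop is already placed, so n - 1 iterations remain
        current = route[end]
        min_index = 0
        min_distance = Inf    # any real distance is smaller, so the first candidate is accepted
        for neighbor in 1:n
            used[neighbor] && continue   # skip stops that are already in the route
            # Euclidean distance between the candidate and the current end of the route
            d = sqrt((lon[neighbor] - lon[current])^2 + (lat[neighbor] - lat[current])^2)
            if d < min_distance
                min_index = neighbor
                min_distance = d
            end
        end
        push!(route, min_index)
        used[min_index] = true
    end
    return route
end

# Four made-up stops on a straight line, starting at stop 1.
route = nearest_neighbor_route([0.0, 3.0, 1.0, 2.0], [0.0, 0.0, 0.0, 0.0])
println(route)
```

Starting the inner search with `min_distance = Inf` is what lets the first unused stop be accepted as the provisional nearest neighbor, and looping over `2:n` avoids the extra iteration that caused the bounds error.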
57:41
And you can immediately see that the result of the nearest neighbor algorithm is not an optimal solution to the TSP. Because obviously, if you want to make the shortest route among these, and you're in West Park, you will never ever come up with the idea to go to campus first, then the university hospital, and then the DPO
58:01
chair. Chances are that you will go to the hospital first and campus Milad and then to the DPO chair. Anyway, with this, we are done with this task. For task three, we will have to create some scatter plots from a data set that we can download by executing this cell. And there
58:26
it is, downloaded. Now we are supposed to read the data into a data frame. So again, we call our well-known function CSV.read. And we put DataFrame here,
58:40
not DataFrames. And there we have our data frame, which has 44 rows and four columns. And personally, I think the ID column is very redundant. So I want to get rid of that. Okay, and now it's
59:03
just three columns. The task is to make ourselves familiar with the data frame: how many columns are there (it's three), and what are their names and types? And as you can see, the x and y columns contain floating point numbers, while the column data set contains these strings, which are Roman
59:21
number one, Roman number two, Roman number three, and yeah, you can believe me that it's going to be Roman number four for the last 10 rows, or no, 12 rows, I think. Doesn't really matter.
59:42
It's not extremely important. Again, the answer to the question: we have three columns, string, float, and float. And the next task starts with an explanation, which says the data in our data frame is divided into four disjoint sub data sets,
01:00:00
which are labeled Roman number one, Roman number two, Roman number three, and Roman number four, respectively, as indicated in the column data set. The actual data is stored in the columns x and y. Now the task is: for each sub data set, calculate and print descriptive statistics for both the x and the y values; include the mean, median, and standard deviation. Okay, and to access the four sub data sets sequentially, I think it's useful to just
01:00:25
use the function groupby, and we group our data frame, which we haven't assigned yet, so let's do it like this. We group our data frame by the column data set.
01:00:43
This gives us this grouped data frame with four groups based on the key data set, and now we can use a for loop to iterate through these groups. And yeah, the for loop can just print the four data frames like this, and
01:01:10
now the task is to calculate and print descriptive statistics for both the x and the y values. We can use the function describe for that, and we can call describe on group. We can select
01:01:26
the columns x and y as a column argument. That's just weird. Where's the missing comma? All right, um, there's a missing... I see, okay, it's written like this.
01:01:47
Okay, and now we get these data frames. Um, the task is to include the mean, median, and standard deviation, and we can look at the documentation maybe to find out how you do that. So Julia Data
01:02:05
Frames describe. How about that? Okay, describe gets a data frame and columns, or we can call it with a data frame, stats, and columns. The stats are
01:02:28
symbols, and each can be a symbol from the list mean, std, min, and so on. So this is, I think, what we want to do. So let's put stats here: stats equals
01:02:48
the mean, so mean, the median, and the standard deviation.
01:03:00
No method matching describe. Maybe this is superfluous. Yeah, okay, now it works. Um, as you can see, we have the four groups here, and the means of the x values and the y values of the groups are all basically the same. So the mean of the x values is always nine.
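As a rough sketch of the call that ends up working: in DataFrames.jl, `describe` takes the requested statistics as positional symbols and the column selection through the `cols` keyword, so there is no `stats =` keyword. The small data frame below is made up for illustration and is not the exercise data; the names `df` and `summaries` are assumptions.

```julia
using DataFrames

# Hypothetical stand-in data: two small groups instead of the four real sub data sets.
df = DataFrame(
    dataset = ["I", "I", "I", "II", "II", "II"],
    x = [1.0, 2.0, 3.0, 0.0, 2.0, 4.0],
    y = [2.0, 4.0, 6.0, 5.0, 4.0, 3.0],
)

# One summary table per group: statistics as positional symbols, columns via `cols`.
summaries = Dict(
    first(group.dataset) => describe(group, :mean, :median, :std; cols = [:x, :y])
    for group in groupby(df, :dataset)
)

for label in sort(collect(keys(summaries)))
    println("Data set ", label)
    println(summaries[label])
end
```

In this toy data both groups share the same mean and median in x and y and differ only in their standard deviations, which hints at how little such summaries reveal about the shape of the data.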
01:03:25
The mean of the y values is always very, very close to 7.5, and the same is true for the median. So the median of the x values is always close to nine,
01:03:40
the median of the y values is always also very close to something around seven or eight, so no big difference here. And the standard deviation is also remarkably similar. So for the x values, the standard deviation is 3.316, approximately, and the y values' standard deviation is
01:04:04
2.03, approximately. So from these descriptive statistics, it looks like the four sub data sets are extremely similar, I think. And this is also part of the task: to write down how similar or different you expect the four sub data sets to be. And yeah, so I would
01:04:24
personally expect them to be extremely similar if I just saw these statistics. The next task is again a short description: to allow for good comparison between the four sub data sets, it is best to plot them as four scatter plots in one figure. Adapt the example from the Julia plotting
01:04:42
tutorial and plot the four sub data sets into a two-by-two grid. And then we have some more tasks to make the xlims and ylims all the same for the four plots. So let's look at this example from the Julia plotting tutorial to see what this is about.
01:05:09
And there's this example here where, in the end, they call the plot function and they provide these four arguments, which they have created before by calling various plot functions.
01:05:21
Then they put layout two by two and legend equals false. So let's actually copy and paste this because I think this will be very useful. And now we only have to create p1, p2, p3, and p4. Um, and we can do that, I think, by again using groupby on our sub data set.
01:05:59
Or why don't we do it in a loop, actually? Okay, so this should create four scatter plots
01:06:34
and plot them in a grid. I just have to replace this. Let's see what it looks like. Okay, um,
01:06:53
so we have these four plots now, but the problem is it's difficult to compare them because they're not all on the same axes. This one is on 10 to 15.7, these are on 5 to 12.5,
01:07:07
and the y-axes are also different. So maybe we can find a way to, um, make them share the same axes. How about we put... um, yeah, why don't we just try this first?
01:07:48
See what happens if we just put one and 20 here and also one and 20 here. Ah, missing comma. Okay. Um, yeah, I think we can make it a little nicer. The task says:
01:08:09
try to infer appropriate axis limits from the data. So let's just try the minimum of the column x in our data frame for the x limits
01:08:27
and the maximum of the column x, and let's do the same for the column y.
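A possible shape for this step, sketched with Plots.jl on made-up stand-in data: the vectors `labels`, `x`, and `y` and the names `x_limits`, `y_limits`, `panels`, and `figure` are assumptions for this example, and the margin of one is the adjustment the walkthrough settles on right after this point.

```julia
using Plots

# Hypothetical stand-in data: four labeled groups of three points each.
labels = repeat(["I", "II", "III", "IV"], inner = 3)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 2.0, 2.0, 2.0, 8.0, 9.0, 10.0]
y = [2.0, 3.0, 4.0, 9.0, 6.0, 9.0, 3.0, 5.0, 7.0, 4.0, 4.5, 5.0]

# Shared limits inferred from the whole data set, padded by one on each side
# so that markers sitting on the border stay fully visible.
x_limits = (minimum(x) - 1, maximum(x) + 1)
y_limits = (minimum(y) - 1, maximum(y) + 1)

# One scatter plot per group, all with the same axis limits.
panels = map(unique(labels)) do label
    rows = labels .== label
    scatter(x[rows], y[rows]; title = label, xlims = x_limits, ylims = y_limits)
end

# Arrange the four panels in a two-by-two grid without legends.
figure = plot(panels...; layout = (2, 2), legend = false)
```

Computing the limits once from the full columns, rather than per group, is what makes the four panels directly comparable.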
01:08:46
Okay, now, um, still not perfect, because now the one marker that's on the axis is actually not fully visible. So why don't we subtract one from
01:09:01
the minimum and add one to the maximum to just give this a little bit more room? And now we can see it. Um, yeah, so, um, the final question in the task is: does the final plot
01:09:21
match what you expected when you compared the mean, median, and standard deviation? So you remember that the mean, median, and standard deviation actually make it look like the data sets are very similar, but the truth is that they're actually not very similar. They are in a similar range, but obviously, if you look at them, they have very, very different
01:09:44
shapes, um, when you compare the scatter plots. For example, this scatter plot is much noisier than this one, this scatter plot has barely any variance on the x-axis, and this scatter plot kind of suggests there's a quadratic relationship between x and y, while
01:10:07
this scatter plot looks more like there's a linear relationship. This one also looks more like there's a linear relationship, with this one being kind of an outlier. So you can see that in these scatter plots we can actually, um, find a lot of information about these data sets which we
01:10:25
wouldn't have seen in descriptive statistics like these numbers here, and which is in fact very, very hard to find out with any descriptive statistics. So I hope you learned that plots are a very, very useful tool when you want to generate
01:10:45
intuitive insights into the data sets you're working with. Um, with that said, we're done with the plots exercise video. I hope I could answer all of your questions about how to solve the plots exercises. If not, please ask a question in the
01:11:06
Moodle forums. I'm happy to help you there, and you can also write me an email to the email address which we also shared in Moodle. And in any case, have a nice day and see you next time.