Monitoring of a Large-Scale University Network
Formal Metadata
Title: Monitoring of a Large-Scale University Network
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/47327 (DOI)
Language: English
Transcript: English (auto-generated)
00:05
I am Simone Mainardi, and this presentation is about the monitoring of a large-scale university network. It's basically the result of the cooperation between ntop and Tobias Appel. Tobias Appel works at the
00:22
Leibniz Supercomputing Centre in Germany. This research centre is basically the institution that actually operates this large-scale university network. So what are we going to see today? Well, I'm going to start this presentation with a brief overview of this university network, and then we are going to
00:45
briefly see the monitoring goals we had in mind when we started this project. Then I'm going to move on to discuss the main challenges we faced and the main solutions we found when trying to monitor this
01:05
challenging infrastructure. And finally I'm going to conclude the presentation with the main lessons learned and some future directions. Okay, a few words on the university network. The university network is the
01:26
backbone for Munich universities and research centers. It's a pretty large network, I mean we are talking about tens of core routers, thousands of switches, and just to give you a ballpark in terms of traffic exchanged
01:44
with the internet, we are talking here about 3,000 terabytes downloaded and 2,000 terabytes uploaded to the internet every month. So this is the network that we monitored in this project. So our goals: why did we want
02:07
to monitor this network? Well, first of all we wanted to have visibility in terms of top talkers. Which are the IP addresses that generate most of the traffic? Which are the autonomous systems that generate most of the traffic? One of them is Netflix, because you know students like to watch Netflix at the
02:24
university. But then we also wanted to see the who-talks-to-whom communications. So not only the top talkers: given an IP address, we also wanted to have visibility into its peers, into the other IP addresses it is communicating with.
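(To make the idea concrete, here is a minimal Python sketch, with made-up flow records, of how top talkers and per-address peers can be derived from a list of flows. This is only an illustration, not ntopng code.)

    from collections import Counter, defaultdict

    # Hypothetical flow records: (source IP, destination IP, bytes exchanged).
    flows = [
        ("10.0.0.1", "198.51.100.7", 1_200_000),
        ("10.0.0.2", "198.51.100.7", 300_000),
        ("10.0.0.1", "203.0.113.9", 4_500_000),
    ]

    bytes_per_host = Counter()   # top talkers
    peers = defaultdict(set)     # who talks to whom

    for src, dst, nbytes in flows:
        bytes_per_host[src] += nbytes
        bytes_per_host[dst] += nbytes
        peers[src].add(dst)
        peers[dst].add(src)

    print(bytes_per_host.most_common(2))  # the two top talkers
    print(sorted(peers["10.0.0.1"]))      # peers of a given IP address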
02:44
We also wanted visibility in terms of application protocols and ports used, protocols such as UDP and TCP, and we also wanted to correlate network traffic data with security metadata to augment the network
03:01
traffic data with the security information, to know if the communications are secure or not. And finally, we wanted to provide both live and historical views. So on the one hand we wanted to open up the monitoring system to see what is happening in the network right now, but on the other
03:21
hand we also wanted to open up the monitoring system and go back in time and to see for example what happened one week ago. So these were mainly our goals. We also wanted to do this using mainly open source software because you
03:40
know guys we are here at FOSDEM, you know why it's important to use open source software. So we wanted to do this using open source software and as there was no budget to buy new fancy hardware, new fancy software licenses, we also needed to do this project on commodity hardware. And in
04:02
practice we reused already existing hardware that had been decommissioned and then re-allocated for this project. So the point is: given our goals, how do we get there? Well, we at ntop are the developers of a
04:22
network monitoring tool which is called ntopng. It's free, open source; you can go to our GitHub page, download it and try it if you want. It has a web-based interface that gives us live and historical views and security events, so that was the main candidate for this project. However, we
04:46
needed to use it in combination with another piece of software called nProbe. nProbe is necessary in this network infrastructure because the
05:00
network traffic data is exported via NetFlow. I don't know if you are familiar with NetFlow, but it's basically a way to export network communications over UDP packets. So we needed software able to read these UDP packets and feed the results into the monitoring system. So we needed to use nProbe,
05:24
which is not open source, but it's free for nonprofits. So if you are a university, a nonprofit organization, a hospital or whatever, and basically you don't make money with the software, you can get it for free. And finally, for the time series data, so for the historical analysis, we
05:44
wanted to use InfluxDB as the time series store. So in this chart I tried to summarize graphically the structure of the monitoring system. We have our NetFlow exporter, and we have the nProbe NetFlow collector, which is
06:06
in charge of receiving NetFlow packets and converting them into a format which is understandable by ntopng. At the core of the monitoring system we have ntopng, which combines traffic data with blacklists released by security
06:25
vendors such as Cisco Talos, Emerging Threats and other vendors; it combines this data, generates alerts, and also interacts with InfluxDB to write and read time series.
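(As a rough illustration of the "combine traffic data with blacklists" step, here is a minimal Python sketch that matches flow endpoints against blacklisted networks. The networks and flows are made up, and ntopng's actual implementation is of course different.)

    import ipaddress

    # Hypothetical blacklisted networks, as they might come from a vendor feed.
    blacklist = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "198.51.100.42/32")]

    def is_blacklisted(ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in blacklist)

    # A flow whose endpoint matches a blacklisted network would raise an alert.
    for src, dst in [("10.0.0.1", "203.0.113.9"), ("10.0.0.2", "192.0.2.1")]:
        if is_blacklisted(src) or is_blacklisted(dst):
            print("ALERT: blacklisted communication", src, "->", dst)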
06:42
So this is basically the structure of the monitoring system. It's pretty simple, but the point was that the input data was massive. By massive I mean almost 20 billion network communications per week, 20 billion. That's an average of almost 50,000
07:07
network communications per second, every second of the week on average. So the point was not having that monitoring system; the point was making that monitoring system work at that scale. And indeed this
07:24
presentation is about the main challenges; it's basically about how we managed to make this monitoring system work at this scale. And as we are going to see now, initially things were a mess,
07:41
nothing was working as expected. Then, interaction after interaction, email after email, commit after commit, things started to improve and things started to work. So let's have a look at this journey. Everything started basically with this message on our ntop
08:04
Community Telegram group, where Tobias sent us a message telling us that the traffic reported was completely wrong. It reported a maximum of 1.8 gigabits per second, but it should actually have been around 20 gigabits per second. So the
08:25
monitoring system was up and running but it was losing basically 90% of the traffic. So the point was, why? What is happening? Why is the system losing all
08:41
this traffic? So we started looking at all the buffers in the system. We started checking the buffers, and what we found out is that the drops were occurring at the very beginning of the monitoring system. They were occurring
09:01
on the UDP socket in charge of receiving the NetFlow packets. It's pretty easy to spot these drops, because on Linux you can just cat /proc/net/udp and you have a column which tells you the number of
09:21
drops on the UDP socket. So we were dropping on the UDP socket. What does this mean? It means that nProbe was not able to process these NetFlow UDP packets fast enough.
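(For reference, a small Python sketch that reads those counters; on Linux the last column of /proc/net/udp is the per-socket drop counter.)

    # Print per-socket drop counters from /proc/net/udp (Linux only).
    with open("/proc/net/udp") as f:
        f.readline()                   # skip the header line
        for line in f:
            fields = line.split()
            local = fields[1]          # local address:port, in hex
            drops = int(fields[-1])    # last column: datagrams dropped on this socket
            if drops:
                print("socket", local, "dropped", drops, "datagrams")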
09:43
However, we were confused, because we started monitoring the rate of these NetFlow packets and we found out that this rate was always less than 10,000 packets per second. We were confused because we knew that nProbe could process even 25,000 packets per second. So we said: on the
10:02
one hand we have just 10,000 packets per second, but we still have drops on the UDP socket. So why is this thing dropping? Why is nProbe dropping? Well, we were looking at the wrong time scale, because on a sub-second scale the NetFlow
10:24
packets were arriving from the exporter in very, very steep bursts. On a sub-second scale here, look from 0 to 1 second: you have a huge burst of packets and then silence. A huge burst of packets and then silence. So that was
10:43
causing the drops. We were looking at the wrong time scale. What was happening here is basically that there were idle periods where nProbe was just doing nothing, and then it was asked to do a huge amount of work in a very short period of time, and that was causing the drops.
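(A minimal Python sketch of how you could observe this burstiness yourself, assuming a collector port you control; port 2055 here is just an assumption. It bins packet arrivals into 100 ms buckets: per-second averages look harmless while individual buckets spike.)

    import socket, time
    from collections import Counter

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 2055))       # assumed NetFlow collector port
    sock.settimeout(0.5)

    buckets = Counter()
    start = time.monotonic()
    while time.monotonic() - start < 10:                # observe for 10 seconds
        try:
            sock.recv(65535)                            # one NetFlow datagram
        except socket.timeout:
            continue
        bucket = int((time.monotonic() - start) * 10)   # 100 ms resolution
        buckets[bucket] += 1

    for b in sorted(buckets):
        print("%.1fs  %d packets" % (b / 10, buckets[b]))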
11:05
So how did we fix this? Well, we couldn't smooth or change this bursty behavior of the NetFlow exporter, so we needed to make nProbe even faster. So we started studying how to make nProbe faster, and we
11:22
found out that a lot of time was spent converting NetFlow to JSON, because JSON was the format used by ntopng to read input traffic. So we said, okay, JSON conversion is very expensive; we dropped JSON in favor of a
11:44
pretty simple type-length-value encoding. And along with other changes, we made it work at this rate.
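(To give an idea of why this is cheaper than JSON, here is a toy type-length-value codec in Python. The field IDs are made up and this is not nProbe's actual wire format, but it shows that the consumer can parse fields with no text parsing at all.)

    import struct

    # Toy TLV codec: 2-byte type, 2-byte length, raw value bytes.
    def tlv_encode(fields):
        out = bytearray()
        for ftype, value in fields:
            out += struct.pack("!HH", ftype, len(value)) + value
        return bytes(out)

    def tlv_decode(buf):
        fields, off = [], 0
        while off < len(buf):
            ftype, length = struct.unpack_from("!HH", buf, off)
            fields.append((ftype, buf[off + 4:off + 4 + length]))
            off += 4 + length
        return fields

    record = tlv_encode([(1, b"\x0a\x00\x00\x01"),          # e.g. a source IPv4
                         (2, struct.pack("!Q", 123456))])   # e.g. a byte counter
    print(tlv_decode(record))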
12:01
So now, after these changes, it seemed that everything had started working. We received an email from Tobias telling us, hey, it looks very good and stable now. So we said, wow, everything is done, the system is working. No, actually that was only the first part of the story, because at least two other issues came out at this point.
12:24
First, the downstream data store, InfluxDB, was eating up all the available memory in the system. It was causing a lot of memory pressure; the system was becoming unusable, all the swap gone, all the RAM gone. So, first issue: the system was unusable.
12:45
And then we noticed that there were continuous scans of the whole network that were causing, let's say, fake hosts to show up. We are going to see what fake hosts means in a while. So let's have a look at these two
13:03
other issues. This is, oh, I don't know if you guys can read this right now, this is how we found the system after a few days. If you can read here, we had our InfluxDB instance, which was taking a hundred gigs of memory,
13:26
all the swap. So even if the system seemed idle, you know, all these cores almost idle, the RAM was gone, completely used by InfluxDB. So this was another big issue, because the system was unusable. So we started looking more closely at InfluxDB,
13:47
we started doing experiments and analysis of why InfluxDB was behaving like this, and we found out that the issue was due to the continuous queries. Continuous queries
14:00
are a programmatic way to instruct InfluxDB to perform certain operations at regular intervals of time. In our case, we use continuous queries to do the so-called data roll-up, that is to aggregate fine-grained data points into coarser-grained data points. So for example,
14:23
I take all the second-by-second points of an hour, I take the average, and I generate one point. This is to make queries faster when you need to run them. However, we found out that when the number of time series in InfluxDB grows above,
14:43
let's say, one million, in the order of millions, this creates very, very big issues with performance and with the use of RAM. So this time, more than a solution, it's a trade-off: we needed to disable
15:00
a part of these continuous queries, actually the hourly ones. That made InfluxDB more stable: memory usage stabilized at around 60 gigs, and it's working pretty okay. So continuous queries were the issue.
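(Conceptually, an hourly roll-up like the one those continuous queries performed boils down to something like this plain-Python sketch; this is only the aggregation idea, not InfluxDB syntax.)

    from statistics import mean

    # Fine-grained samples: (unix timestamp in seconds, value), e.g. bytes/s for one host.
    samples = [(1580000000 + i, 100.0 + (i % 60)) for i in range(7200)]  # two hours of data

    # Roll up second-by-second points into one averaged point per hour.
    per_hour = {}
    for ts, value in samples:
        per_hour.setdefault(ts - ts % 3600, []).append(value)

    rollup = [(hour, mean(values)) for hour, values in sorted(per_hour.items())]
    print(rollup)   # two coarse-grained points instead of 7200 fine-grained ones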
15:26
So okay, as the second part of the lessons learned: keep in mind that if you are planning to use InfluxDB continuous queries, they can be very demanding in terms of RAM, and you might experience performance degradation when the number of time series that you have
15:43
grows, let's say, in the order of millions. Now, scanners. I mentioned the scanners earlier in this presentation. What happens with the scanners? Well, keep in mind that this university network has a range of public IP addresses. So what happens is that someone
16:04
on the internet is continuously and constantly probing their whole range of IP addresses, one after the other. Why? Typically, this kind of probing is done by attackers, who try to detect
16:21
and discover if hosts have open ports, if they are vulnerable on certain ports. So they probe, and they try to open up connections on all the available range of ports, both in TCP and UDP. So this creates issues on the monitoring system, because when someone is continuously
16:48
probing all the network, this, from the point of view of the monitoring system, creates a lot of fake hosts. Why am I saying fake hosts? That is the term used by Tobias.
17:03
They are fake hosts because when an attacker is probing your host, that host may well be inactive. No one may be there, but since someone is probing it from the outside world, you see this host popping up in the monitoring system. So this creates a lot of trouble,
17:27
because from a software perspective, you waste a lot of CPU time and a lot of memory in keeping track of these hosts that are just being probed, but they do not actually generate
17:43
any kind of traffic. This is also an issue for the analyst, because when you open up the monitoring system, and the monitoring system is poisoned, I would say polluted, with thousands or hundreds of thousands of fake hosts, then you are confused, because you no
18:04
longer know if you are looking at a host which is actively sending traffic, or at a host which is just being probed. So we needed somehow to identify and discard this probing traffic, while still keeping track of it. So
18:25
we started looking at the patterns in the probing traffic. What's the shape of the probing traffic? We started looking at the TCP flags. We also generated scans using a popular
18:42
tool, maybe you guys know it, it's called Nmap. We started generating scans with Nmap to identify the patterns in the scans, and to implement a simple algorithm to actually identify the probing traffic.
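(The heuristic can be as simple as the following Python sketch, with made-up thresholds and flow records; it is not the exact algorithm used in nProbe/ntopng. A flow looks like a probe if it carries only a SYN and no payload, and a source touching many distinct destinations that way is treated as a scanner and kept out of the host tables.)

    from collections import defaultdict

    SYN, ACK = 0x02, 0x10

    # Hypothetical flow records: (src IP, dst IP, dst port, TCP flags seen, payload bytes).
    flows = [("192.0.2.10", "10.0.0.1", p, SYN, 0) for p in range(1, 200)]   # scanner-like
    flows.append(("10.0.0.2", "10.0.0.3", 443, SYN | ACK, 150000))           # real traffic

    def looks_like_probe(flags, payload):
        return flags == SYN and payload == 0      # SYN only, nothing exchanged

    targets = defaultdict(set)
    for src, dst, port, flags, payload in flows:
        if looks_like_probe(flags, payload):
            targets[src].add((dst, port))

    SCAN_THRESHOLD = 50                           # made-up threshold
    scanners = {src for src, t in targets.items() if len(t) >= SCAN_THRESHOLD}
    kept = [f for f in flows if f[0] not in scanners]
    print("discarded", len(flows) - len(kept), "probe flows from", len(scanners), "scanner(s)")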
19:01
Once we were able to identify this probing traffic, we started discarding it, while keeping track of it, but we didn't let it enter the monitoring system. This has the benefit of leaving the monitoring system clean from this noise, but on the other hand, we are still able
19:23
to keep track of and quantify this probing traffic. And to give you some numbers, this is the probing traffic that we have discovered in the monitoring system. Basically, there is continuous traffic of almost 20 megabits per second, which is constantly
19:46
hitting the network. 20 megabits per second of traffic constantly hitting the network is probing traffic. So we discarded it, but we were able to keep track of it. So now, again, lessons learned,
20:05
part three: it is important to identify the probing traffic when you are monitoring a large public network, because it will very likely mess up the real data. And you need to be very, very fast and efficient. This is fundamental, because you always have to stay up. You need to stay
20:24
up. Even in the worst case, you must not play the attackers' game, because it's when something bad is happening that a monitoring system is most useful. So you need to always stay up and always function. So be very, very fast and efficient. Let me show you now a couple of
20:44
slides of what you can get out of the monitoring system. Now that everything is in place, the main problems have been solved, the monitoring system is up and running. Let me show you something that you can extract from the monitoring system. I mentioned in the beginning of this
21:00
presentation that we were interested in the who-talks-to-whom communications. So let's have a look at these who-talks-to-whom communications, but from a security perspective. You open up the monitoring system, you just pick the network communications, which are also known as flows, you select the status blacklisted, and you immediately have
21:24
a live overview of all the communications taking place with your network that are known to be insecure or suspect. For example, in first place here we have a host from Pakistan, which is exchanging data with one of our hosts, and they have been exchanging almost
21:45
700 megabytes of data. So immediately, two clicks, boom, and you have visibility into what is happening, not only from a traffic point of view, but also from a security point of view. We have seen the network communications; the same is true for the hosts.
22:04
For the hosts, we have defined a score, which is basically a security or trouble indicator. So you can just, with two clicks, select all the hosts, sort them by score, and the hosts with the highest score are automatically those that are most relevant
22:23
from a security perspective. If I pick the first one, for example, and I look at its communication flows that are active, I mean, I found out that this was doing pings towards a lot of blacklisted addresses. See all this bang, bang, bang, bang here. So this host with
22:46
the highest score is contacting all these blacklisted hosts. So what is happening here? Is this host a victim? Has this host been compromised by someone? So in just a few clicks, even if the scale of this network is massive, in just a few clicks you can see the network
23:05
communications. You can see the hosts that are vulnerable, maybe the hosts that have been compromised, in a very, very short period of time. So, well, it's time for me to wrap up.
23:22
Basically, I hope the message here is that if you want to monitor a large-scale network, this is something that you can do using open source software and commodity hardware. You don't have to pay to buy fancy software licenses or to buy fancy hardware. You can just do it using open source software and commodity hardware. This is the hardware that
23:46
we have used. It's a dual Intel Xeon machine. But if you look at the cores, I mean, almost all the cores are idle. The load is not even five. So it's even over-provisioned to monitor all this traffic. And as future directions, now we are going to try other
24:06
time series databases. We are considering TimescaleDB. If you guys have experience with TimescaleDB and want to discuss it with us later, that would be great. We are also working on efficient algorithms, not only to discard the probing traffic,
24:22
but also to keep track of the top attackers, so the top sources that are creating the probing traffic. Again, that must be done in a very, very efficient way. And yeah, that's pretty much all. I would like to close this presentation with a dashboard
24:43
to show you what I mentioned in the beginning. Here we have the top talkers and the top destinations of the traffic of the whole university network. On the right side of this dashboard, we have the live part of the traffic here at the top and the historical part of the traffic here at the bottom, as you can see, the last-day view. And here we have
25:04
constant traffic of, let's say, up to 20 gigabits per second during the day, and it goes down to around 10 gigabits during the night. So thanks for your time. If you have questions, please let me know. Thank you.