
Monitoring of a Large-Scale University Network


Formal Metadata

Title
Monitoring of a Large-Scale University Network
Subtitle
Lessons Learned and Future Directions
Number of Parts
490
Author
Simone Mainardi
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The complexity of network monitoring strongly depends on the size of the network under observation. Challenges in monitoring large-scale networks arise not only from dealing with a large volume of traffic, but also from keeping track of all traffic sources, destinations, and who-talks-to-whom communications. Analyzing this information makes it possible to uncover behaviors that would not have been visible by merely observing common metrics such as bytes and packets. The drawback is that extra pressure is put on the monitoring system as well as on the downstream data and time-series stores. This talk presents a case study based on the monitoring of a large-scale university network. Challenges faced, findings, and lessons learned will be examined. It will be shown how to make sense of the input data and how to properly manage and reduce its scale as early as possible in the monitoring system. The discussion will also highlight the advantages and limitations of the open-source software components of the monitoring system. In particular, the open-source network monitoring tool ntopng and the time-series store InfluxDB will be considered. It will be shown what happens when ntopng and InfluxDB are pushed to their limits and beyond, and what can be done to ensure their smooth operation. Relevant findings, behaviors uncovered in the network traffic, and future directions will conclude the talk. The intended audience is technical and managerial individuals who are familiar with network monitoring.
- Main challenges of monitoring large-scale networks
- Case study based on the monitoring of a university network
- Availability, scalability and suitability of open-source software for network monitoring
Transcript: English (auto-generated)
I am Simone Mainardi, and this presentation is about the monitoring of a large-scale university network. It is basically the result of a cooperation between ntop and Tobias Appel. Tobias Appel works at the Leibniz Supercomputing Centre in Germany, the institution that actually operates this large-scale university network. So what are we going to see today? I am going to start this presentation with a brief overview of the university network, and then we are going to briefly see the monitoring goals we had in mind when we started this project. Then I am going to discuss the main challenges we have faced and the main solutions we have found when trying to monitor this challenging infrastructure. Eventually, I am going to conclude the presentation with the main lessons learned and some future directions.
A few words on the university network. It is the backbone for the Munich universities and research centers. It is a pretty large network: we are talking about tens of core routers and thousands of switches and, just to give you a ballpark in terms of traffic exchanged with the internet, about 3,000 terabytes downloaded from and 2,000 terabytes uploaded to the internet every month. So this is the network that we monitored in this project.
So, our goals: why did we want to monitor this network? First of all, we wanted to have visibility in terms of top talkers. Who are the IP addresses that generate most of the traffic? Which are the autonomous systems that generate most of the traffic? One of them is Netflix, because, you know, students like to watch Netflix at the university. But we also wanted to see the who-talks-to-whom communications: not only the top-talking IP addresses, but, given an IP address, visibility into its peers, the other IP addresses it is communicating with, as well as the application protocols, the ports used, and protocols such as UDP and TCP. We also wanted to correlate the network traffic data with security metadata, to augment the traffic data with security information and know whether the communications are secure or not. And eventually we wanted to provide both live and historical views: on the one hand, open up the monitoring system and see what is happening in the network right now; on the other hand, go back in time and see, for example, what happened one week ago. These were mainly our goals. We also wanted to do this using mainly open-source software, because, you know, we are here at FOSDEM and you know why it is important to use open-source software. And as there was no budget to buy new fancy hardware or new fancy software licenses, we also needed to do this project on commodity hardware.
In practice, we reused already existing hardware that had been decommissioned and then re-allocated for this project. So, given our goals, how to get there? Well, we at ntop are the developers of a network monitoring tool called ntopng. It is free and open source: you can go to our GitHub page, download it, and try it if you want. It has a web-based interface that gives us live and historical views as well as security events, so it was the main candidate for this project. However, we needed to use it in combination with another piece of software called nProbe. nProbe is necessary in this network infrastructure because the network traffic data is exported via NetFlow. I don't know if you are familiar with NetFlow, but it is basically a way to export the network communications over UDP packets. So we needed software able to read these UDP packets and feed the result into the monitoring system, and that software is nProbe. nProbe is not open source, but it is free for nonprofits: if you are a university, a nonprofit organization, a hospital or similar, and you basically don't make money with the software, you can get it for free.
Finally, for the time-series data, that is, for the historical analysis, we wanted to use InfluxDB as the time-series store. In this chart I have tried to summarize the structure of the monitoring system graphically. We have the NetFlow exporter and the nProbe NetFlow collector, which is in charge of receiving the NetFlow packets and converting them into a format understandable by ntopng. At the core of the monitoring system we have ntopng, which combines the traffic data with blacklists released by security vendors such as Cisco Talos, Emerging Threats, and others, generates alerts, and interacts with InfluxDB to write and read time series. So this is basically the structure of the monitoring system.
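To make the collector side of this pipeline concrete, the sketch below shows, in Python, the kind of work a NetFlow collector does at its input: bind a UDP socket and receive the exported datagrams. This is purely illustrative and not nProbe's actual implementation; the port number and the buffer size are assumptions.

```python
import socket
import struct

NETFLOW_PORT = 2055  # commonly used NetFlow export port (assumption for this sketch)

def receive_netflow():
    """Receive raw NetFlow datagrams exported over UDP and print a one-line summary."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # A generous receive buffer helps absorb bursty exporters (see later in the talk).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024 * 1024)
    sock.bind(("0.0.0.0", NETFLOW_PORT))
    while True:
        datagram, (exporter_ip, _) = sock.recvfrom(65535)
        if len(datagram) < 2:
            continue
        # NetFlow v5/v9 and IPFIX packets all start with a 2-byte version field.
        version = struct.unpack("!H", datagram[:2])[0]
        print(f"{len(datagram)}-byte NetFlow v{version} packet from {exporter_ip}")

if __name__ == "__main__":
    receive_netflow()
```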
The monitoring system is pretty simple, but the point was that the input data was massive. By massive I mean almost 20 billion network communications per week. That is an average of almost 50,000 network communications per second, every second of the week. So the point was not in having that monitoring system; the point was in making that monitoring system work at that scale. And indeed this presentation is about the main challenges, about how we managed to make this monitoring system work at this scale. As we are going to see now, initially things were a mess and nothing was working as expected. Then, interaction after interaction, email after email, commit after commit, things started to improve and started to work. So let's have a look at this journey.
Everything started with a message on our ntop Community Telegram group, where Tobias told us that the traffic reported was completely wrong: it reported a maximum of 1.8 gigabits per second, whereas it should have been around 20 gigabits per second. The monitoring system was up and running, but it was losing basically 90% of the traffic. So the question was: why? What is happening? Why is the system losing all this traffic? We started looking at all the buffers in the system, and what we found out is that the drops were occurring at the very beginning of the monitoring system: they were occurring on the UDP socket in charge of receiving the NetFlow packets. It is pretty easy to spot these drops because, on Linux, you can just use cat /proc/net/udp, which has a column telling you the number of drops on the UDP socket. So we were dropping on the UDP socket. What does this mean? It means that nProbe was not able to process these NetFlow UDP packets fast enough.
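The drop counter mentioned here is the last column of /proc/net/udp. As a small aid, here is a sketch of how that column could be read programmatically; filtering on port 2055 is just an assumption for a NetFlow collector.

```python
def udp_socket_drops(port=2055):
    """Return the kernel drop counter for UDP sockets bound to the given port.

    The last column of each /proc/net/udp row counts datagrams dropped because
    the socket receive buffer was full. Port 2055 is an illustrative NetFlow port.
    """
    drops = {}
    with open("/proc/net/udp") as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local_port = int(fields[1].split(":")[1], 16)  # local_address is hex "ip:port"
            if port is None or local_port == port:
                drops[local_port] = int(fields[-1])  # 'drops' column
    return drops

if __name__ == "__main__":
    print(udp_socket_drops())
```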
However, we were confused, because when we started monitoring the rate of these NetFlow packets we found out that it was always less than 10,000 packets per second, while we knew that nProbe could process even 25,000 packets per second. So on the one hand we had just 10,000 packets per second, and yet we still had drops on the UDP socket. Why was nProbe dropping? Well, we were looking at the wrong time scale, because on a sub-second scale the NetFlow packets were arriving in very, very steep bursts. On a sub-second scale — look from 0 to 1 second — you have a huge burst of packets and then silence, another huge burst and then silence again. That was causing the drops: there were idle periods where nProbe was doing nothing at all, and then it was asked to do a huge amount of work in a very short period of time.
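A per-second average can hide exactly this kind of burstiness. The sketch below — a rough illustration, not the tooling actually used — counts arrivals in 10 ms buckets so that bursts followed by silence become visible; the bucket size and port are assumptions.

```python
import socket
import time
from collections import Counter

def measure_subsecond_bursts(port=2055, bucket_ms=10, duration_s=5):
    """Count received datagrams per 10 ms bucket to expose bursty arrivals."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    sock.settimeout(0.1)
    buckets = Counter()
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        try:
            sock.recvfrom(65535)
        except socket.timeout:
            continue
        elapsed_ms = (time.monotonic() - start) * 1000
        buckets[int(elapsed_ms // bucket_ms)] += 1
    # An average of 10,000 pkt/s is only 100 packets per 10 ms bucket; a bursty
    # exporter instead shows a few huge buckets followed by long silences.
    for bucket, count in sorted(buckets.items()):
        print(f"{bucket * bucket_ms:6d} ms  {count:6d} packets")

if __name__ == "__main__":
    measure_subsecond_bursts()
```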
So how did we fix this? We could not change the bursty behavior of the NetFlow exporter, so we needed to make nProbe even faster. We started studying how to make nProbe faster and found out that a lot of time was spent converting NetFlow to JSON, since JSON was the format used by ntopng to read the input traffic. JSON conversion is very expensive, so we dropped JSON in favor of a pretty simple type-length-value (TLV) encoding. Along with other changes, this made the system work at this rate.
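To give an idea of why the encoding matters, here is a hedged sketch contrasting a JSON record with a simple type-length-value encoding of the same flow fields. The field identifiers and layout are invented for illustration; they are not nProbe's actual wire format.

```python
import json
import struct

# Hypothetical numeric field identifiers (not nProbe's real ones).
FIELD_SRC_ADDR, FIELD_DST_ADDR, FIELD_BYTES = 1, 2, 3

def encode_json(flow):
    # Textual keys and values: flexible, but expensive to produce and parse.
    return json.dumps(flow).encode()

def encode_tlv(flow):
    # Each field becomes: 1-byte type, 2-byte length, raw value bytes.
    out = bytearray()
    for field_type, value in flow.items():
        raw = value if isinstance(value, bytes) else str(value).encode()
        out += struct.pack("!BH", field_type, len(raw)) + raw
    return bytes(out)

flow_json = {"src_addr": "10.0.0.1", "dst_addr": "10.0.0.2", "bytes": 1500}
flow_tlv = {FIELD_SRC_ADDR: b"10.0.0.1", FIELD_DST_ADDR: b"10.0.0.2", FIELD_BYTES: 1500}

print(len(encode_json(flow_json)), "bytes as JSON")
print(len(encode_tlv(flow_tlv)), "bytes as TLV")  # smaller and trivially parseable
```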
After these changes it seemed that everything had started working. We received an email from Tobias telling us: hey, it looks very good and stable now. So we said: wow, we are done, the system is working. Actually, no: that was only the first part of the story, because at least two other issues came up at this point. First, the downstream data store, InfluxDB, was eating up all the available memory in the system — all the swap gone, all the RAM gone — making the system unusable. Second, we noticed continuous scans of the whole network, which caused, let's say, fake hosts to show up; we are going to see in a while what fake hosts means. So let's have a look at these two other issues.
This is — I don't know if you can read it — how we found the system after a few days. Our InfluxDB instance was taking a hundred gigabytes of memory plus all the swap. Even though the system seemed idle — almost all the cores idle — the RAM was gone, completely used by InfluxDB. This was another big issue, because the system was unusable.
So we started looking more closely at InfluxDB, doing experiments and analyses of why it was behaving like this, and we found out that the issue was due to continuous queries. Continuous queries are a programmatic way to instruct InfluxDB to perform certain operations at regular intervals of time. In our case, we used continuous queries to do the so-called data roll-up, that is, to aggregate fine-grained data points into coarser-grained ones. For example, take all the second-by-second points of an hour, compute the average, and generate one point; this makes later queries somewhat faster.
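As an illustration of the kind of roll-up being discussed, here is a hedged sketch of an hourly continuous query submitted through the influxdb Python client. The database, measurement, and retention-policy names are invented for the example; they are not the ones used in this deployment.

```python
from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="localhost", port=8086, database="ntopng")

# Roll up fine-grained points into one hourly mean per series.
# Measurement and retention-policy names here are illustrative only.
client.query("""
    CREATE CONTINUOUS QUERY "cq_host_traffic_1h" ON "ntopng"
    BEGIN
      SELECT mean("bytes") AS "bytes"
      INTO "ntopng"."autogen"."host_traffic_1h"
      FROM "host_traffic"
      GROUP BY time(1h), *
    END
""")
```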
However, we found out that when the number of time series in InfluxDB grows above, let's say, one million — into the order of millions — continuous queries create very big issues with performance and with RAM usage. So this time, more than a solution, we found a trade-off: we had to disable part of these continuous queries, namely the hourly ones. That made InfluxDB more stable; memory usage stabilized at around 60 gigabytes, and it has been working pretty well since. So continuous queries were the issue. As the second part of the lessons learned, keep in mind that if you are planning to use InfluxDB continuous queries, they can be very demanding in terms of RAM, and you might experience performance degradation when the number of time series you have grows into the order of millions.
Now, the scanners. I mentioned the scanners earlier in this presentation; what happens with them? Keep in mind that this university network has a range of public IP addresses. What happens is that someone on the internet is continuously and constantly probing that whole range of IP addresses, one after the other. Why? Typically this kind of probing is done by attackers who try to discover whether hosts have open ports or are vulnerable on certain ports, so they try to open connections on the whole range of ports, both TCP and UDP. This creates issues for the monitoring system, because when someone is continuously probing the whole network it creates, from the point of view of the monitoring system, a lot of fake hosts. Why am I saying fake hosts? That is the term used by Tobias. They are fake hosts because when an attacker is probing a host, that host can even be inactive: no one may be there, but since someone is probing it from the outside world, you see it popping up in the monitoring system. This creates a lot of trouble, because from a software perspective you waste a lot of CPU time and memory keeping track of hosts that are merely being probed and do not actually generate any traffic. It is also an issue for the analyst: when you open up the monitoring system and it is polluted with hundreds of thousands of fake hosts, you are confused, because you no longer know whether you are looking at a host that is actively sending traffic or at a host that is just being probed. So we needed to identify and discard this probing traffic, while still keeping track of it.
So we started looking at the patterns in the probing traffic: what is its shape? We looked at the TCP flags, and we also generated scans ourselves using a popular tool you may know, nmap, to identify the patterns in the scans and to implement a simple algorithm that actually identifies the probing traffic. Once we were able to identify this probing traffic, we started discarding it — keeping track of it, but not letting it enter the monitoring system.
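The detection algorithm itself is not spelled out in the talk; below is a hedged sketch of the general idea, flagging a source as a probable scanner when it sends SYN-only, unanswered flows to many distinct destination hosts or ports. The thresholds and the flow-record fields are assumptions.

```python
from collections import defaultdict

# Hypothetical flow records: (src, dst, dst_port, tcp_flags, bytes_from_server)
SYN = 0x02

def find_probable_scanners(flows, min_targets=100):
    """Flag sources that open SYN-only, unanswered flows to many targets."""
    targets = defaultdict(set)
    for src, dst, dst_port, tcp_flags, bytes_from_server in flows:
        syn_only = tcp_flags == SYN          # SYN set, no ACK/FIN/RST seen
        unanswered = bytes_from_server == 0  # the probed host sent nothing back
        if syn_only and unanswered:
            targets[src].add((dst, dst_port))
    return {src for src, seen in targets.items() if len(seen) >= min_targets}

# Example: one source hammering many ports on many hosts, plus one normal client.
flows = [("203.0.113.9", f"10.0.{i % 256}.1", 1000 + i, SYN, 0) for i in range(500)]
flows.append(("10.0.0.42", "10.0.0.1", 443, 0x12, 4096))  # SYN+ACK seen, real data
print(find_probable_scanners(flows))  # -> {'203.0.113.9'}
```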
This has the benefit of leaving the monitoring system clean of this noise, while still being able to keep track of, and quantify, the probing traffic. To give you some numbers, this is the probing traffic we discovered in the monitoring system: a continuous stream of almost 20 megabits per second constantly hitting the network. We discarded it, but we were able to keep track of it. So, lessons learned, part three: it is important to identify the probing traffic when you are monitoring a large public network, because it will very likely mess up the real data. And you need to be very fast and efficient. This is fundamental, because you always have to stay up: you don't want to play the attackers' game, and it is exactly when something bad is happening that a monitoring system is most useful. So always stay up, always keep functioning, and be very fast and efficient.
Now that everything is in place, the main problems have been solved, and the monitoring system is up and running, let me show you a couple of slides of what you can get out of it. I mentioned at the beginning of this presentation that we were interested in the who-talks-to-whom communications, so let's have a look at them, but from a security perspective. You open up the monitoring system, you pick the network communications — also known as flows — you select the status "blacklisted", and you immediately get a live overview of all the communications taking place with your network that are known to be insecure or suspect. For example, in first place here we have a host from Pakistan that has exchanged almost 700 megabytes of data with one of our hosts. So two clicks, boom, and you have visibility into what is happening, not only from a traffic point of view but also from a security point of view.
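Conceptually, the "blacklisted" status is just a membership check of a flow's peers against the IP sets released by the security vendors. A hedged Python sketch of that check (the blacklist contents and flow structure are invented for the example):

```python
# Minimal sketch: mark flows whose peer appears in a vendor-supplied blacklist.
# The addresses below come from documentation/test ranges, not real threat data.
blacklist = {"203.0.113.50", "198.51.100.7"}

flows = [
    {"client": "10.156.0.10", "server": "203.0.113.50", "bytes": 700_000_000},
    {"client": "10.156.0.11", "server": "192.0.2.10", "bytes": 12_000},
]

suspect = [f for f in flows if f["client"] in blacklist or f["server"] in blacklist]
for f in suspect:
    print(f"blacklisted flow: {f['client']} <-> {f['server']} ({f['bytes']} bytes)")
```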
The same is true for hosts. For hosts we have defined a score, which is basically a security, or trouble, indicator. With two clicks you can select all the hosts and sort them by score, and the hosts with the highest score are automatically those that are most relevant from a security perspective.
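The talk does not define how the score is computed; as a rough illustration, assume it simply accumulates a penalty for every security issue observed on a host, so that sorting by score surfaces the most troublesome hosts first. A hedged sketch under that assumption:

```python
from collections import Counter

# Hypothetical per-host score: add a penalty for every security issue
# (blacklisted peer, suspicious flow, ...) observed on that host.
ISSUE_PENALTY = {"blacklisted_peer": 50, "suspicious_flow": 10}

def rank_hosts(issues):
    """issues: iterable of (host_ip, issue_type) events; returns hosts by score."""
    scores = Counter()
    for host, issue in issues:
        scores[host] += ISSUE_PENALTY.get(issue, 1)
    # Highest score first: the hosts most worth investigating.
    return scores.most_common()

events = [
    ("10.156.0.99", "blacklisted_peer"),
    ("10.156.0.99", "blacklisted_peer"),
    ("10.156.0.10", "suspicious_flow"),
]
print(rank_hosts(events))  # [('10.156.0.99', 100), ('10.156.0.10', 10)]
```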
If I pick the first one, for example, and look at its active communication flows, I find that it has been doing pings towards a lot of blacklisted addresses — see all these entries here. So the host with the highest score is contacting all these blacklisted hosts. What is happening here? Is this host a victim? Has it been compromised by someone? So even though the scale of this network is massive, in just a few clicks you can see the network communications, the hosts that are vulnerable, maybe the hosts that have been compromised, in a very short period of time. Well, it's time for me to wrap up.
Basically, I hope the message here is clear: if you want to monitor a large-scale network, you can do it using open-source software and commodity hardware; you don't have to buy fancy software licenses or fancy hardware. This is the hardware we used: a dual Intel Xeon machine. If you look at the cores, almost all of them are idle — the load is not even five — so it is even over-provisioned for monitoring all this traffic. As for future directions, we are now going to try other time-series databases; we are considering TimescaleDB, so if you have experience with TimescaleDB and want to discuss it with us later, that would be great. We are also working on efficient algorithms not only to discard the probing traffic but also to keep track of the top attackers, that is, the top sources generating the probing traffic — again, something that must be done in a very efficient way. And that's pretty much all. I would like to close this presentation with a dashboard showing what I mentioned at the beginning. Here we have the top talkers and the top destinations of the traffic of the whole university network. On the right side of the dashboard we have the live part of the traffic at the top and the historical part at the bottom, the last-day view. There is constant traffic of, let's say, up to 20 gigabits per second during the day, going down to around 10 gigabits per second during the night. So thanks for your time. If you have questions, please let me know. Thank you.