Network Traffic Analysis of Hadoop Clusters

Formal Metadata

Title
Network Traffic Analysis of Hadoop Clusters
Subtitle
Understand the common usage patterns and identify typical / atypical workloads.
Title of Series
Number of Parts
611
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2017

Content Metadata

Subject Area
Genre
Abstract
Cybersecurity is a broad topic, and many commercial products are related to it. We demonstrate a fundamental concept in network analysis: reconstruction and visualization of temporal networks. Furthermore, we apply the method to describe the operational conditions of a Hadoop cluster. Our experiments provide first results and allow a classification of the cluster state in relation to current workloads. The temporal networks show significant differences between operation modes. In reality, we would expect mixed workloads. If such workload parameters are known, we can handle atypical events accordingly: we can create alerts based on context information rather than on packet content alone. We show an end-to-end example: (1) data collection is done in Python, using the sniffer script; (2) using Apache Hive and Apache Spark, we analyze the network traffic data and create the temporal network; (3) finally, we visualize the results using Gephi. As a next step, we plan to contribute to the Apache Spot project.

Expected prior knowledge / intended audience:
No special skills required, but minimal exposure to the Hadoop ecosystem is helpful.

Speaker bio:
Márton Balassi is a Solution Architect at Cloudera and a PMC member at Apache Flink. He focuses on Big Data application development, especially in the streaming space. Márton is a regular contributor to open source and has recently spoken at a number of open source Big Data conferences, including Hadoop Summit and Apache Big Data, and at meetups.

Mirko Kämpf is a Solution Architect at Cloudera and the initiator of the Etosha project. He holds a Diploma in Physics and has worked on several projects related to complex systems analysis. His focus is on time-dependent network analysis and time series analysis using tools from the Hadoop ecosystem, and especially on the related metadata management. Mirko actively uses open source tools, is the author of several articles on the Cloudera engineering blog, and speaks at Big Data conferences and meetups.
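
The temporal-network reconstruction named in the abstract can be illustrated with a minimal sketch. This is not the speakers' implementation: it uses plain Python in place of Apache Hive and Apache Spark, and the record layout `(timestamp, source host, destination host, bytes)`, the function names, and the 60-second window are assumptions chosen for illustration. Each time window becomes one snapshot of the network, with directed edges weighted by the bytes exchanged, and the snapshots are written as an edge list with `Source`, `Target`, and `Weight` columns, which Gephi's CSV import understands.

```python
import csv
import io
from collections import defaultdict


def build_temporal_network(packets, window_seconds=60):
    """Group packet records into time windows and sum bytes per directed edge.

    packets: iterable of (timestamp_seconds, src_host, dst_host, n_bytes).
    This record layout is an assumption, not the format of the sniffer script.
    Returns {window_start: {(src, dst): total_bytes}}.
    """
    snapshots = defaultdict(lambda: defaultdict(int))
    for ts, src, dst, n_bytes in packets:
        # Align each packet to the start of the window it falls into.
        window_start = int(ts // window_seconds) * window_seconds
        snapshots[window_start][(src, dst)] += n_bytes
    return {w: dict(edges) for w, edges in sorted(snapshots.items())}


def to_edge_csv(snapshots):
    """Serialize all snapshots as one edge list with a window column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight", "WindowStart"])
    for window_start, edges in snapshots.items():
        for (src, dst), weight in sorted(edges.items()):
            writer.writerow([src, dst, weight, window_start])
    return out.getvalue()


# Hypothetical sample traffic between three cluster nodes.
sample = [
    (0.5, "node1", "node2", 1500),
    (10.0, "node1", "node2", 500),
    (30.0, "node2", "node3", 800),
    (65.0, "node1", "node3", 1200),
]
network = build_temporal_network(sample, window_seconds=60)
edge_csv = to_edge_csv(network)
print(edge_csv)
```

Comparing the snapshots of successive windows is what allows different operation modes to be told apart; the window length controls the trade-off between time resolution and noise.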