Datacenter Network Monitoring and Management

Course Description:

The seminar focuses on understanding the sources of failures and performance issues in a datacenter network, and on solutions for monitoring them. The goal is to gain a broad understanding of the tooling and practices used to monitor network health. We will work to establish a problem-solving mindset and build the skills to tackle network performance problems.

The course is built around a set of papers that cover the landscape of monitoring solutions in datacenter management. Grading is based on three reports, each covering a subset of the papers. Together we will work on establishing a clear report structure and a set of questions to answer.

Schedule:

Date     Topic                            Documents
Feb 18   Introduction                     Slides
Feb 25   Failures                         [1,2,3]
Mar 03   SNMP                             [4] Slides
Mar 10   Self study                       Report A instructions
Mar 17   Counters; Report A submission    [5] Paper notes; FlowRadar notes
Mar 24   Applications                     [6] Paper notes
Mar 31   Tomography                       [7] Paper notes
Apr 07   Self study                       Report B instructions
Apr 14   Easter break
Apr 21   Probing 1                        [8,9]
Apr 28   Probing 2                        [10] Paper notes, Slides
May 05   Probing 3; Report B submission   [11] Paper notes, Slides
May 12   Self study                       Report C instructions
May 19   Presentation discussion          [10] Slide notes; [11] Slide notes
May 26   Mirroring & Triggers             [12] Paper notes; [13] Paper notes
Jun 02   Report C submission

 

Support papers:

How to read a paper (PDF)
How to read a research paper (PDF)

Mark Handley animation (YouTube video)

Overview of discussed papers 

Papers: 

[1] Understanding and Mitigating Packet Corruption in Data Center Networks (2017) Paper Sections: 2,3,4

[2] Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications (2011) Paper Sections: 2,4

[3] Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure (2016) Paper Sections: 4,5,6

[4] SNMP tutorial link 

[5] LossRadar: Fast Detection of Lost Packets in Data Center Networks (2016) Paper Sections: 1-6 and 8

[6] Passive Realtime Datacenter Fault Detection and Localization (2017) Paper Sections: 1-5

[7] Netscope: Practical Network Loss Tomography (2010) Paper Sections: full paper

[8] Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis (2015) Paper Sections: 1-3, 5 and 6.2

[9] NetNORAD: Troubleshooting networks via end-to-end probing (2016) Paper Sections: full blog

[10] NetBouncer: Active Device and Link Failure Localization in Data Center Networks (2019) Paper Sections: see shared doc

[11] Measuring and Troubleshooting Large Operational Multipath Networks with Gray Box Testing (2015) Paper Sections: see shared doc

[12] Packet-Level Telemetry in Large Datacenter Networks (2015) Paper Sections: 1,3-4, 7.1 and 7.2

[13] Trumpet: Timely and Precise Triggers in Data Centers (2016) Paper Sections: 2-4,5.1-5.5, only scan 6

[14] NetPilot: Automating Datacenter Network Failure Mitigation (2012) Paper Sections: 1-4

Optional reading 

FlowRadar: A Better NetFlow for Data Centers (2016) Paper Similar to LossRadar

deTector: a Topology-aware Monitoring System for Data Center Networks (2017) Paper Solution using probing 

Scalable Near Real-Time Failure Localization of Data Center Networks (2014) Paper Solution using probing