Course Description:
The seminar focuses on understanding the sources of failures and performance issues in a datacenter network, and on solutions for monitoring them. The goal is to gain a broad understanding of the tooling and practices used to monitor network health. We will work to establish a problem-solving mindset and build the skills to tackle network performance problems.
The course is built around a set of papers that cover the landscape of monitoring solutions in datacenter management. Grading is based on three reports to be submitted, each covering a subset of the papers. Together we will work on establishing a clear report structure and a set of questions to answer.
Schedule:
Date | Topic | Documents |
---|---|---|
Feb 18 | Introduction | Slides |
Feb 25 | Failures | [1,2,3] |
Mar 03 | SNMP | [4] Slides |
Mar 10 | Self study | Report A instructions |
Mar 17 | Counters, Report A submission | [5] Paper notes FlowRadar notes |
Mar 24 | Applications | [6] Paper notes |
Mar 31 | Tomography | [7] Paper notes |
Apr 07 | Self study | Report B instructions |
Apr 14 | Easter break | |
Apr 21 | Probing 1 | [8,9] |
Apr 28 | Probing 2 | [10] Paper notes Slides |
May 05 | Probing 3, Report B submission | [11] Paper notes Slides |
May 12 | Self study | Report C instructions |
May 19 | Presentation discussion | [10 Slide notes] [11 Slide notes] |
May 26 | Mirroring & Triggers | [12] Paper notes [13] Paper notes |
Jun 02 | Report C submission | |
Support papers:
How to read a paper pdf
How to read a research paper pdf
Mark Handley animation YouTube video
Overview of discussed papers
Papers:
[1] Understanding and Mitigating Packet Corruption in Data Center Networks (2017) Paper Sections: 2,3,4
[2] Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications (2011) Paper Sections: 2,4
[3] Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure (2016) Paper Sections: 4,5,6
[4] SNMP tutorial link
[5] LossRadar: Fast Detection of Lost Packets in Data Center Networks (2016) Paper Sections: 1-6 and 8
[6] Passive realtime datacenter fault detection and localization (2017) Paper Sections: 1-5
[7] Netscope: Practical Network Loss Tomography (2010) Paper Sections: full paper
[8] Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis (2015) Paper Sections: 1-3, 5 and 6.2
[9] NetNORAD: Troubleshooting networks via end-to-end probing (2016) Paper Sections: full blog
[10] NetBouncer: Active Device and Link Failure Localization in Data Center Networks (2019) Paper Sections: see shared doc
[11] Measuring and Troubleshooting Large Operational Multipath Networks with Gray Box Testing (2015) Paper Sections: see shared doc
[12] Packet-Level Telemetry in Large Datacenter Networks (2015) Paper Sections: 1,3-4, 7.1 and 7.2
[13] Trumpet: Timely and Precise Triggers in Data Centers (2016) Paper Sections: 2-4,5.1-5.5, only scan 6
[14] NetPilot: Automating Datacenter Network Failure Mitigation (2012) Paper Sections: 1-4
Optional reading:
FlowRadar: A Better NetFlow for Data Centers (2016) Paper Similar to LossRadar
deTector: a Topology-aware Monitoring System for Data Center Networks (2017) Paper Solution using probing
Scalable Near Real-Time Failure Localization of Data Center Networks (2014) Paper Solution using probing