CAB E 72
Talk by Vasileios Tsoutsouras (Institute of Communication and Computer Systems (ICCS), Greece):
Title: Design Methodologies for Resource Management of Many-core Computing Systems
Abstract:
The complexity and elevated requirements of modern applications have driven the development of computing systems characterized by a high number of processing cores, heterogeneity, and complex communication interconnects. For these systems to yield their maximum performance, novel dynamic resource management mechanisms are required. To this end, this presentation outlines the building blocks and design decisions of a run-time resource manager targeting many-core computing systems with Network-on-Chip (NoC) interconnection. Due to the high complexity and fast response requirements of dynamically mapping many concurrently running applications, a novel run-time resource management framework is introduced, aiming to provide a scalable solution based on distributed decision-making mechanisms.
This Distributed Run-Time Resource Management (DRTRM) framework is implemented and evaluated on top of Intel SCC, an actual many-core, NoC-based computing platform. Motivated by unpredictable workload dynamics and application requests, an impact analysis of their arrival rate on DRTRM is performed, showing that a fast and resource-hungry scenario of incoming applications can be the breaking point not only for conventional centralized managers but also for distributed ones. In addition, the distribution of decisions in DRTRM complicates the enforcement of a system-wide mitigation scheme, as it requires the consensus of many agents. This issue is addressed by proposing an admission control policy that retains distributed features by taking advantage of the resource allocation hierarchy in DRTRM and enforcing Voltage and Frequency Scaling on a few specific distributed agents. This policy is implemented and evaluated as an extension of DRTRM, showing that it can relieve the congestion of applications under stressful conditions and also provide energy consumption gains.
Last, the talk addresses the increased probability of manifested hardware errors, a side effect of the tight integration of many processing elements on the same system that jeopardizes the quality of service provided to the end user. Adhering to the concept of dynamic recovery from manifested errors, SoftRM is introduced: a DRTRM augmented with fault-tolerant features. SoftRM extends concepts of the well-known Paxos consensus algorithm, providing dynamic self-organization and workload-aware error mitigation. SoftRM policies also refrain from provisioning spare cores for fault tolerance, thus maximizing system throughput both in the presence and in the absence of errors in the processing elements of the SoC.
Short Bio:
Vasileios Tsoutsouras received his Diploma and Ph.D. degrees in Electrical and Computer Engineering from the Microprocessors and Digital Systems Laboratory of the National Technical University of Athens, Greece, in 2013 and 2018, respectively. The main topics of his research include dynamic resource management of many-core computing systems, edge computing in Internet of Things architectures, and HW/SW co-design. He has published over 20 technical and research papers in scientific books, international conferences, and journals. Since 2013, he has also worked as a research associate at the Institute of Communication and Computer Systems (ICCS) on two European-funded projects regarding run-time resource management of medical embedded devices and Cloud infrastructure.