CAPES PrInt – C+HPC

Cloud Resource Management for Running High-Performance Applications

Coordinator: Lúcia Maria de Assumpção Drummond

The project addresses resource allocation and management in cloud environments for HPC (High Performance Computing) applications. It aims to minimize execution time and energy consumption and to maximize fault tolerance without violating service level agreements (SLAs), using applications from biology as a case study. Traditionally, cloud computing has been used for sharing data and general-purpose services; more recently, however, it has also emerged as a promising alternative for running HPC applications. This computing paradigm offers several advantages over dedicated infrastructure, such as rapid resource provisioning and a significant reduction in operational costs.

However, some challenges must be overcome to narrow the performance gap between dedicated infrastructure and clouds. Overheads introduced by the virtualization layer, hardware heterogeneity, and high network latencies negatively affect the performance of HPC applications. Furthermore, cloud providers often adopt resource sharing policies that can further reduce the performance of such applications. A physical server typically hosts several virtual machines, which can contend for shared resources such as cache and main memory, significantly reducing their performance.

Furthermore, selecting and manually configuring virtual machines is a very complex task for scientists who develop HPC applications but are not experts in cloud administration tools. Application schedulers play a fundamental role in executing such applications efficiently; their policies vary according to the objective function, such as minimizing total execution time, minimizing energy demand, or maintaining a guaranteed level of service for the user. To leverage clouds for running HPC applications, this project aims to address these various aspects. The importance of using clouds to run HPC applications is illustrated by initiatives such as UberCloud, which has offered an HPC service in the cloud, where users can discuss their experience of using such an environment. As a case study, we mainly consider experiments in bioinformatics and, in particular, comparative genomics.

Team

Universidade Federal Fluminense
Lúcia Maria de Assumpção Drummond
Maria Cristina Silva Boeres
Eugene Francis Vinod Rebello
Yuri Abitbol de Menezes Frota
Igor Monteiro Moraes
José Viterbo Filho
Daniel Cardoso Moraes de Oliveira
Eduardo Uchoa
Artur Pessoa

Sorbonne Université
Pierre Sens
Luciana Arantes
Guy Pujolle
Amal El Fallah Seghrouchni

Université d’Avignon
Rosa Figueiredo

Université de Montpellier
Esther Pacitti

Université de Bordeaux
François Vanderbeck
Ruslan Sadykov
Laércio Lima Pilla

MINES ParisTech
Claude Tadonki

Notices

Notice	Links	Registration Period	Status
PDSE Scholarship Selection 2019	Notice Attachment I Attachment II	03/15/2019 to 04/01/2019	Closed
Junior Visiting Professor Abroad	Notice	07/01/2019 to 07/08/2019	Closed
Visiting Researcher in Brazil	Notice	07/01/2019 to 07/15/2019	Closed
PDSE Scholarship Selection 2020	Notice Attachment I Attachment II Selection Minutes	10/13/2020 to 10/19/2020	Closed
Senior Visiting Professor Abroad	Notice Selection Minutes	10/26/2020 to 11/02/2020	Closed
Visiting Researcher in Brazil	Notice	11/12/2020 to 11/13/2020	Closed
Visiting Researcher in Brazil	Notice	08/10/2022 to 08/19/2022	Closed

Events

Event	Date	Location
Talk: Smart City Network Uberization Abstract: Connectivity is a key issue for smart cities. Providing Internet access to a large number of users and things requires intensive network densification, an operation consisting of increasing the number of access points in order to provide a high density of users and things with an Internet connection. This presentation describes how to reach this goal using an uberization solution. We compare our solution with 5G and Fog networking solutions. We also describe how it is possible to secure the network with blockchain solutions. Guy Pujolle (Visiting Professor)	03/26/2019 at 11am	IC/UFF Auditorium
Defense of Doctoral Thesis Proposal: A Hibernation-Aware Dynamic Scheduler for Cloud Environments Abstract: Cloud platforms have emerged as a prominent environment to execute different classes of applications, providing on-demand resources as well as scalability. They usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs are unused instances available for a lower price. Despite the financial advantages, a spot VM can be terminated or hibernated by EC2 at any moment. Using both hibernation-prone spot VMs (for the sake of cost) and on-demand VMs, we propose in this paper the Hibernation-Aware Dynamic Scheduler (HADS), which uses those VMs to execute applications composed of independent tasks (bag-of-tasks) with deadline constraints. We also define the problem of temporal failures, which occur when a spot VM hibernates and does not resume within a time that guarantees the application’s deadline. Our scheduling approach thus aims at minimizing the monetary costs of bag-of-tasks applications in the EC2 cloud, respecting their deadlines even in the presence of hibernation, and avoiding temporal failures. Performance results with real executions using Amazon EC2 VMs confirm the effectiveness of our scheduling and that it can tolerate temporal failures. Luan Teylo (PhD student) Advisors: Lúcia Drummond, Luciana Arantes and Pierre Sens	05/03/2019 at 10:30 am	IC Room 317
Talk: Dealing with Non Uniform Memory Access Abstract: A natural trend in the multicore evolution is the increase in the number of integrated cores, while keeping the configuration of a shared main memory. One common packaging considers a Non Uniform Memory Access (NUMA), where the overall available memory is made of several blocks that are physically separated but interconnected so as to form a virtually contiguous memory with a unified addressing. From the programming point of view, this aspect is completely virtual, and ordinary programmers are not even aware of this technical reality. The main consequence of not addressing the NUMA configuration is the unacceptable scalability that can be observed using a standard parallelization. This penalty is the conjunction of remote accesses and bus contention, among others. In this talk, we will explain the concept, how it is technically considered, what the effects are, and how to deal with this programming concern. Claude Tadonki (Visiting Professor)	05/15/2019 at 10am	IC Room 202
Talk: Influence of Tasks Duration Variability on Task-Based Runtime Schedulers Abstract: In the context of HPC platforms, individual nodes nowadays consist of heterogeneous processing resources such as GPU units and multicores. Those resources share communication and storage resources, inducing complex co-scheduling effects, and making it hard to predict the exact duration of a task or of a communication. To cope with these issues, runtime dynamic schedulers such as StarPU have been developed. These systems base their decisions at runtime on the state of the platform and possibly on static priorities of tasks computed offline. In this paper, our goal is to quantify performance variability in the context of HPC heterogeneous nodes, by focusing on very regular dense linear algebra kernels, such as Cholesky and LU factorizations. We therefore first concentrate on the evaluation of the individual block-size kernels variability. Then, we analyze the impact of this variability at the scale of a full application on a dynamic runtime scheduler such as StarPU, in order to analyze whether the strategies that have been designed in the context of MapReduce applications to cope with stragglers could be transferred to HPC systems, or if the dynamic nature of runtime schedulers is enough to cope with actual performance variations, even in the presence of task dependencies. Olivier Beaumont (Visiting Professor)	05/29/2019 at 11am	IC Room 202
Talk: Optical Interconnects for Advanced Processing Systems Abstract: The presentation intends to identify the role of optical interconnects in next-generation distributed processing systems and will review the challenges and recent developments in board-level and on-board chip-to-chip interconnection for Data Centres and High-Performance Compute applications. George T. Kanellos (Visiting Professor)	10/29/2019 at 3pm	IC/UFF Auditorium
Talk: Failure Detection in Large Distributed Systems Abstract: Failure detection is a prerequisite to failure mitigation and a key component for building distributed algorithms requiring resilience. This talk introduces the problem of failure detection in asynchronous networks where the transmission delay is not known. We show how distributed failure detector oracles can be used to address fundamental problems such as consensus, k-set agreement, or mutual exclusion. Finally, we focus on how to build scalable failure detectors. Pierre Sens (Visiting Professor)	10/29/2019 at 4pm	IC/UFF Auditorium
Talk: A Communication-Efficient Causal Broadcast Protocol Abstract: A causal broadcast ensures that messages are delivered to all nodes (processes) while preserving the causal relation among the messages. We have proposed [ICPP 2018] a causal broadcast protocol for distributed systems whose nodes are logically organized in a virtual hypercube-like topology called Vcube. Messages are broadcast by dynamically building spanning trees rooted in the message’s source node. By using multiple trees, the contention bottleneck problem of a single root spanning tree approach is avoided. Furthermore, different trees can intersect at some node. Hence, by taking advantage of both the out-of-order reception of causally related messages at a node and these path intersections, a node can delay forwarding, to one or more of its children in the tree, messages with causal dependencies that it knows those children cannot yet satisfy. Such a delay does not induce any overhead. Experimental evaluation conducted on top of the PeerSim simulator confirms the communication effectiveness of our causal broadcast protocol in terms of latency and message traffic reduction. Luciana Arantes (Visiting Professor)	10/30/2019 at 11am	IC/UFF Auditorium
Talk: Probabilistic Byzantine Tolerance Scheduling in Hybrid Cloud Environments Abstract: This talk explores scheduling challenges in providing probabilistic Byzantine fault tolerance in a hybrid cloud environment, consisting of nodes with varying reliability levels, compute power, and monetary cost. In this context, the probabilistic Byzantine fault tolerance guarantee refers to the confidence level that the result of a given computation is correct despite potential Byzantine failures. We formally define a family of such scheduling problems distinguished by whether they insist on meeting a given latency limit and trying to optimize the monetary budget or vice versa. For the case where the latency bound is a restriction and the budget should be optimized, we present several heuristic protocols and compare them using extensive simulations. Pierre Sens (Visiting Professor)	10/30/2019 at 12pm	IC/UFF Auditorium
Workshop: 1º WCloud-HPC Lúcia Drummond (Organizer)	08/31/2020 and 09/01/2020	Online
Workshop: 2º WCloud-HPC Lúcia Drummond (Organizer)	09/16/2021 and 09/17/2021	Online

Missions

Mission	Period	Location
Talk: A Hibernation Aware Scheduler for Cloud Environments Abstract: Nowadays, cloud platforms usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs are unused instances available for a lower price. Despite the monetary advantages, a spot VM can be terminated or hibernated by EC2 at any moment. In this talk, we present the Hibernation-Aware Dynamic Scheduler (HADS), which schedules applications composed of independent tasks (bag-of-tasks) with deadline constraints on both hibernation-prone spot VMs (for the sake of cost) and on-demand VMs. We also consider the problem of temporal failures, which occur when a spot VM hibernates and does not resume within a time that guarantees the application’s deadline. Our dynamic scheduling approach aims at minimizing the monetary costs of executing bag-of-tasks applications, respecting their deadlines even in the presence of hibernation. It is also able to avoid temporal failures by using task migration and work-stealing techniques. Experimental results with real executions using Amazon EC2 VMs confirm the effectiveness of our scheduling, in terms of monetary costs and execution times, when compared with approaches based only on on-demand VMs. It is also shown that our strategy can tolerate temporal failures. Lúcia Drummond	11/14/2019 to 11/25/2019	Sorbonne Université, Paris

Scholarships

Scholarship	Period	Location
Sandwich Doctoral Scholarship (Six months) Thesis: A Dynamic Task Scheduler Tolerant to Multiple Hibernations in Cloud Environments. Luan Teylo	09/2019 to 02/2020	Sorbonne Université, Paris
Visiting Researcher Scholarship in Brazil Talk: A Communication-Efficient Causal Broadcast Protocol. Luciana Arantes	11/2019	Universidade Federal Fluminense, Niterói

Approved Projects

Project	Coordinator
STIC AMSUD – Notice no. 10/2019 Name of the CAPES Program: STIC AMSUD – COOPERATION IN SCIENCE AND INFORMATION AND COMMUNICATION TECHNOLOGY FRANCE – SOUTH AMERICA – CAPES/CDEFI Coordinator: EDUARDO UCHOA BARBOZA Selection instrument number: STIC AMSUD – Notice no. 10/2019 – Projects Calendar: the first year of the project starts on 01/2020 and ends on 12/2020; the second year of the project starts on 01/2021 and ends on 12/2021. According to the notice of the STIC AMSUD – COOPERATION IN SCIENCE AND INFORMATION AND COMMUNICATION TECHNOLOGY FRANCE – SOUTH AMERICA – CAPES/CDEFI program, CAPES will subsidize the project and the Brazilian team, according to annual budget availability, with: 4 (four) work missions, up to 2 (two) to France and 2 (two) to South American countries, for the coordinator or registered doctoral team members in the project, each work mission lasting a minimum of 7 (seven) and a maximum of 20 (twenty) days; and up to 2 (two) scholarships per year, in the following modalities: Sandwich Doctorate, lasting 4 (four) to 12 (twelve) months; Post-Doctorate, lasting 2 (two) to 12 (twelve) months.	Eduardo Uchoa Barboza
