CAPES PrInt

Cloud Resource Management for Running High-Performance Applications

Coordinator: Lúcia Maria de Assumpção Drummond

The project addresses the problem of resource allocation and management in clouds for HPC (High Performance Computing) applications, minimizing execution time and energy consumption and maximizing fault tolerance without violating service level agreements (SLAs), using biology applications as a case study. Traditionally, cloud computing has been used for sharing data and general-purpose services; more recently, however, it has also emerged as a promising alternative for running HPC applications. Compared with a dedicated infrastructure, this computing paradigm offers several advantages, such as rapid resource provisioning and a significant reduction in operational costs.

However, some challenges must be overcome to narrow the performance gap between dedicated infrastructure and clouds. Overheads introduced by the virtualization layer, hardware heterogeneity, and high network latencies negatively affect the performance of HPC applications. Furthermore, cloud providers often adopt resource sharing policies that can further reduce the performance of such applications. Typically, a physical server hosts several virtual machines, which can contend for shared resources such as cache and main memory, significantly reducing application performance.

Furthermore, selecting virtual machines and manually configuring them is a very complex task for scientists who develop HPC applications and are not experts in cloud administration tools. Application schedulers play a fundamental role in executing such applications efficiently; their policies vary according to the objective function, such as minimizing total execution time, minimizing energy demand, or maintaining a guaranteed level of service for the user. To leverage the use of clouds for running HPC applications, this project aims to address these various aspects. The importance of using clouds to run HPC applications is illustrated by initiatives such as UberCloud, which has offered an HPC service in the cloud where users can discuss their experience of using such an environment. As a case study, we mainly consider experiments in the area of bioinformatics and, in particular, comparative genomics.


Team
Universidade Federal Fluminense
Lúcia Maria de Assumpção Drummond      
Maria Cristina Silva Boeres    
Eugene Francis Vinod Rebello    
Yuri Abitbol de Menezes Frota    
Igor Monteiro Moraes    
José Viterbo Filho    
Daniel Cardoso Moraes de Oliveira    
Eduardo Uchoa    
Artur Pessoa    
Sorbonne Université
Pierre Sens    
Luciana Arantes    
Guy Pujolle  
Amal El Fallah Seghrouchni    
Université d’Avignon
Rosa Figueiredo    
Université de Montpellier
Esther Pacitti    
Université de Bordeaux
François Vanderbeck  
Ruslan Sadykov  
Laércio Lima Pilla      
MINES ParisTech
Claude Tadonki    

Notices
Notice | Links | Registration Period | Situation
PDSE Scholarship Selection 2019 | Notice, Attachment I, Attachment II | 03/15/2019 to 04/01/2019 | Closed
Junior Visiting Professor Abroad | Notice | 07/01/2019 to 07/08/2019 | Closed
Visiting Researcher in Brazil | Notice | 07/01/2019 to 07/15/2019 | Closed
PDSE Scholarship Selection 2020 | Notice, Attachment I, Attachment II, Selection Minutes | 10/13/2020 to 10/19/2020 | Closed
Senior Visiting Professor Abroad | Notice, Selection Minutes | 10/26/2020 to 11/02/2020 | Closed
Visiting Researcher in Brazil | Notice | 11/12/2020 to 11/13/2020 | Closed
Visiting Researcher in Brazil | Notice | 08/10/2022 to 08/19/2022 | Closed

Events
Event | Date | Location
Talk: Smart City Network Uberization

Abstract: Connectivity is a key issue for smart cities. Providing Internet access to a large number of users and things requires intensive network densification, that is, increasing the number of access points so that a high density of users and things can be connected to the Internet. This presentation describes how to reach this goal using an uberization solution. We compare our solution with 5G and Fog networking solutions. We also describe how the network can be secured with blockchain solutions.


Guy Pujolle (Visiting Professor)
03/26/2019 at 11am | IC/UFF Auditorium
Defense of Doctoral Thesis Proposal: A Hibernation-Aware Dynamic Scheduler for Cloud Environments

Abstract: Cloud platforms have emerged as a prominent environment to execute different classes of applications providing on-demand resources as well as scalability. They usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs are unused instances available for a lower price. Despite the financial advantages, a spot VM can be terminated or hibernated by EC2 at any moment.

Using both hibernation-prone spot VMs (to reduce cost) and on-demand VMs, we propose in this paper the Hibernation-Aware Dynamic Scheduler (HADS), which uses those VMs to execute applications composed of independent tasks (bag-of-tasks) with deadline constraints. We also define the problem of temporal failures, which occur when a spot VM hibernates and does not resume in time to guarantee the application’s deadline. Our scheduling approach thus aims at minimizing the monetary cost of bag-of-tasks applications in the EC2 cloud, respecting their deadlines even in the presence of hibernation, and avoiding temporal failures. Performance results with real executions using Amazon EC2 VMs confirm the effectiveness of our scheduling and show that it can tolerate temporal failures.


Luan Teylo (PhD student)

Advisors: Lúcia Drummond, Luciana Arantes and Pierre Sens
05/03/2019 at 10:30am | IC Room 317
Talk: Dealing with Non Uniform Memory Access

Abstract: A natural trend in multicore evolution is the increase in the number of integrated cores while keeping a shared main memory. One common design is Non-Uniform Memory Access (NUMA), where the overall available memory is made of several blocks that are physically separated but interconnected so as to form a virtually contiguous memory with unified addressing. From the programming point of view, this aspect is completely virtual, and ordinary programmers are not even aware of this technical reality. The main consequence of not addressing the NUMA configuration is the unacceptably poor scalability that can be observed with a standard parallelization. This penalty results from the combination of remote accesses and bus contention, among other factors. In this talk, we will explain the concept, how it is handled technically, what its effects are, and how to deal with this programming concern.


Claude Tadonki (Visiting Professor)
05/15/2019 at 10am | IC Room 202
Talk: Influence of Tasks Duration Variability on Task-Based Runtime Schedulers

Abstract: In the context of HPC platforms, individual nodes nowadays consist of heterogeneous processing resources such as GPUs and multicores. Those resources share communication and storage resources, inducing complex co-scheduling effects and making it hard to predict the exact duration of a task or of a communication. To cope with these issues, dynamic runtime schedulers such as StarPU have been developed. These systems base their decisions at runtime on the state of the platform and possibly on static priorities of tasks computed offline. In this paper, our goal is to quantify performance variability in the context of heterogeneous HPC nodes, by focusing on very regular dense linear algebra kernels, such as Cholesky and LU factorizations. We therefore first concentrate on evaluating the variability of the individual block-size kernels. Then, we analyze the impact of this variability at the scale of a full application on a dynamic runtime scheduler such as StarPU, in order to determine whether the strategies designed to cope with stragglers in MapReduce applications could be transferred to HPC systems, or whether the dynamic nature of runtime schedulers is enough to cope with actual performance variations, even in the presence of task dependencies.


Olivier Beaumont (Visiting Professor)
05/29/2019 at 11am | IC Room 202
Talk: Optical Interconnects for Advanced Processing Systems

Abstract: The presentation intends to identify the role of optical interconnects in next-generation distributed processing systems and will review the challenges and recent developments in board-level and on-board chip-to-chip interconnection for Data Centres and High-Performance Compute applications.


George T. Kanellos (Visiting Professor)
10/29/2019 at 3pm | IC/UFF Auditorium
Talk: Failure Detection in Large Distributed Systems

Abstract: Failure detection is a prerequisite to failure mitigation and a key component for building distributed algorithms that require resilience. This talk introduces the problem of failure detection in asynchronous networks, where the transmission delay is not known. We show how distributed failure detector oracles can be used to address fundamental problems such as consensus, k-set agreement, or mutual exclusion. Finally, we focus on how to build scalable failure detectors.


Pierre Sens (Visiting Professor)
10/29/2019 at 4pm | IC/UFF Auditorium
Talk: A Communication-Efficient Causal Broadcast Protocol

Abstract: A causal broadcast ensures that messages are delivered to all nodes (processes) while preserving the causal order of the messages. We have proposed [ICPP 2018] a causal broadcast protocol for distributed systems whose nodes are logically organized in a virtual hypercube-like topology called Vcube. Messages are broadcast by dynamically building spanning trees rooted at the message’s source node. By using multiple trees, the contention bottleneck of a single-root spanning tree approach is avoided. Furthermore, different trees can intersect at some node. Hence, by taking advantage of both the out-of-order reception of causally related messages at a node and these path intersections, a node can delay forwarding, to one or more of its children in the tree, those messages with causal dependencies that it knows the children in question cannot satisfy yet. Such a delay does not induce any overhead. An experimental evaluation conducted on top of the PeerSim simulator confirms the communication effectiveness of our causal broadcast protocol in terms of latency and message traffic reduction.


Luciana Arantes (Visiting Professor)
10/30/2019 at 11am | IC/UFF Auditorium
Talk: Probabilistic Byzantine Tolerance Scheduling in Hybrid Cloud Environments

Abstract: This talk explores scheduling challenges in providing probabilistic Byzantine fault tolerance in a hybrid cloud environment consisting of nodes with varying reliability levels, compute power, and monetary cost. In this context, the probabilistic Byzantine fault tolerance guarantee refers to the confidence level that the result of a given computation is correct despite potential Byzantine failures. We formally define a family of such scheduling problems, distinguished by whether they require meeting a given latency limit while trying to optimize the monetary budget, or vice versa. For the case where the latency bound is a constraint and the budget should be optimized, we present several heuristic protocols and compare them using extensive simulations.


Pierre Sens (Visiting Professor)
10/30/2019 at 12pm | IC/UFF Auditorium
Workshop: 1st WCloud-HPC

Lúcia Drummond (Organizer)
08/31/2020 and 09/01/2020 | Online
Workshop: 2nd WCloud-HPC

Lúcia Drummond (Organizer)
09/16/2021 and 09/17/2021 | Online

Missions
Mission | Period | Location
Talk: A Hibernation Aware Scheduler for Cloud Environments

Abstract: Nowadays, cloud platforms usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs are unused instances available for a lower price. Despite the monetary advantages, a spot VM can be terminated or hibernated by EC2 at any moment.

In this talk, we present the Hibernation-Aware Dynamic Scheduler (HADS), which schedules applications composed of independent tasks (bag-of-tasks) with deadline constraints on both hibernation-prone spot VMs (to reduce cost) and on-demand VMs. We also consider the problem of temporal failures, which occur when a spot VM hibernates and does not resume in time to guarantee the application’s deadline. Our dynamic scheduling approach aims at minimizing the monetary cost of executing bag-of-tasks applications, respecting their deadlines even in the presence of hibernation. It is also able to avoid temporal failures by using task migration and work-stealing techniques. Experimental results with real executions using Amazon EC2 VMs confirm the effectiveness of our scheduling, in terms of monetary cost and execution time, when compared with approaches based only on on-demand VMs. They also show that our strategy can tolerate temporal failures.


Lúcia Drummond
11/14/2019 to 11/25/2019 | Sorbonne Université, Paris

Scholarships
Scholarship | Period | Location
Sandwich Doctoral Scholarship (Six months)

Thesis: A Dynamic Task Scheduler Tolerant to Multiple Hibernations in Cloud Environments.


Luan Teylo
09/2019 to 02/2020 | Sorbonne Université, Paris
Visiting Researcher Scholarship in Brazil

Talk: A Communication-Efficient Causal Broadcast Protocol.


Luciana Arantes
11/2019 | Universidade Federal Fluminense, Niterói

Approved Projects
Project | Coordinator
STIC AMSUD – Notice no. 10/2019


Name of the CAPES Program: STIC AMSUD – COOPERATION IN SCIENCE AND INFORMATION AND COMMUNICATION TECHNOLOGY FRANCE – SOUTH AMERICA – CAPES/CDEFI

Coordinator: EDUARDO UCHOA BARBOZA

Selection instrument number: STIC AMSUD – Notice no. 10/2019 – Projects

Calendar
– The first year of the project starts in 01/2020 and ends in 12/2020;
– The second year of the project starts in 01/2021 and ends in 12/2021.

According to the notice of the STIC AMSUD – COOPERATION IN SCIENCE AND INFORMATION AND COMMUNICATION TECHNOLOGY FRANCE – SOUTH AMERICA – CAPES/CDEFI program, CAPES will subsidize the project and the Brazilian team, according to annual budget availability, in:
– 4 (four) work missions, up to 2 (two) to France and 2 (two) to South American countries, for the coordinator or registered doctoral team members in the project. Each work mission must last a minimum of 7 (seven) and a maximum of 20 (twenty) days;
– Up to 2 (two) scholarships per year, in the following modalities: Sandwich Doctorate, with a duration of 4 (four) to 12 (twelve) months; Post-Doctorate, with a duration of 2 (two) to 12 (twelve) months.

Eduardo Uchoa Barboza

Publications