CAPES PrInt – C+HPC

Cloud Resource Management for Running High-Performance Applications

Coordinator: Lúcia Maria de Assumpção Drummond

The project addresses the problem of allocating and managing cloud resources for HPC (High Performance Computing) applications, minimizing execution time and energy consumption and maximizing fault tolerance without violating service-level agreements (SLAs), with applications from biology as a case study. Cloud computing has traditionally been used for data sharing and general-purpose services, but more recently it has also emerged as a promising alternative for running HPC applications. Compared with a dedicated infrastructure, this computing paradigm offers several advantages, such as rapid resource provisioning and a significant reduction in operating costs.

However, some challenges must be overcome to narrow the performance gap between a dedicated infrastructure and clouds. Overheads introduced by the virtualization layer, hardware heterogeneity, and high network latencies degrade the performance of HPC applications. In addition, cloud providers generally adopt resource-sharing policies that can reduce the performance of such applications even further. Typically, a physical server hosts several virtual machines, which may contend for shared resources such as cache and main memory, significantly reducing their performance.

Moreover, selecting and manually configuring virtual machines is a rather complex task for scientists who develop HPC applications but are not experts in cloud administration tools. Application schedulers play a fundamental role in ensuring that such applications run efficiently; their policies vary with the objective function, such as minimizing total execution time or minimizing energy demand while maintaining a service-level guarantee for the user, among others. To leverage the use of clouds for running HPC applications, this project aims to address these various aspects. The importance of clouds for running HPC applications can be seen in initiatives such as UberCloud, which offers HPC as a cloud service and where users can discuss their experience with such an environment. As a case study, we mainly consider experiments in bioinformatics, in particular comparative genomics.

Team

Universidade Federal Fluminense
Lúcia Maria de Assumpção Drummond
Maria Cristina Silva Boeres
Eugene Francis Vinod Rebello
Yuri Abitbol de Menezes Frota
Igor Monteiro Moraes
José Viterbo Filho
Daniel Cardoso Moraes de Oliveira
Eduardo Uchoa
Artur Pessoa

Sorbonne Université
Pierre Sens
Luciana Arantes
Guy Pujolle
Amal El Fallah Seghrouchni

Université d’Avignon
Rosa Figueiredo

Université de Montpellier
Esther Pacitti

Université de Bordeaux
François Vanderbeck
Ruslan Sadykov
Laércio Lima Pilla

MINES ParisTech
Claude Tadonki

Calls for Applications

Call	Links	Application Period	Status
PDSE 2019 Scholarship Selection	Call, Annex I, Annex II	15/03/2019 to 01/04/2019	Closed
Junior Visiting Professor Abroad	Call	01/07/2019 to 08/07/2019	Closed
Visiting Researcher in Brazil	Call	01/07/2019 to 15/07/2019	Closed
PDSE 2020 Scholarship Selection	Call, Annex I, Annex II, Selection Minutes	13/10/2020 to 19/10/2020	Closed
Senior Visiting Professor Abroad	Call, Selection Minutes	26/10/2020 to 02/11/2020	Closed
Visiting Researcher in Brazil	Call	12/11/2020 to 13/11/2020	Closed
Visiting Researcher in Brazil	Call	10/08/2022 to 19/08/2022	Closed

Events

Event	Date	Location
Talk: Smart City Network Uberization. Abstract: Connectivity is a key issue for smart cities. Providing Internet to a large number of users and things requires intensive network densification, that is, increasing the number of access points so that a high density of users and things can be given an Internet connection. This presentation describes how to reach this goal using an uberization solution. We compare our solution with 5G and Fog networking solutions. We also describe how the network can be secured with blockchain solutions. Guy Pujolle (Visiting Professor)	26/03/2019 at 11:00	IC/UFF Auditorium
Doctoral Thesis Proposal Defense: A Hibernation-Aware Dynamic Scheduler for Cloud Environments. Abstract: Cloud platforms have emerged as a prominent environment to execute different classes of applications, providing on-demand resources as well as scalability. They usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs, while spot VMs are unused instances available for a lower price. Despite the financial advantages, a spot VM can be terminated or hibernated by EC2 at any moment. Using both hibernation-prone spot VMs (for the sake of cost) and on-demand VMs, we propose in this paper the Hibernation-Aware Dynamic Scheduler (HADS), which uses those VMs to execute applications composed of independent tasks (bag-of-tasks) with deadline constraints. We also define the problem of temporal failures, which occur when a spot VM hibernates and does not resume within a time that guarantees the application's deadline. Our scheduling approach thus aims at minimizing the monetary cost of bag-of-tasks applications in the EC2 cloud, respecting their deadlines even in the presence of hibernation, and avoiding temporal failures. Performance results with real executions using Amazon EC2 VMs confirm the effectiveness of our scheduling and show that it can tolerate temporal failures. Luan Teylo (Doctoral Student). Advisors: Lúcia Drummond, Luciana Arantes, and Pierre Sens	03/05/2019 at 10:30	IC Room 317
Talk: Dealing with Non Uniform Memory Access. Abstract: A natural trend in the multicore evolution is the increase in the number of integrated cores, while keeping the configuration of a shared main memory. One common packaging considers a Non Uniform Memory Access (NUMA), where the overall available memory is made of several blocks that are physically separated but interconnected so as to form a virtually contiguous memory with a unified addressing. From the programming point of view, this aspect is completely virtual, and ordinary programmers are not even aware of this technical reality. The main consequence of not addressing the NUMA configuration is the unacceptable scalability that can be observed using a standard parallelization. This penalty is the conjunction of remote accesses and bus contention, among others. In this talk, we will explain the concept, how it is technically handled, what its effects are, and how to deal with this programming concern. Claude Tadonki (Visiting Professor)	15/05/2019 at 10:00	IC Room 202
Talk: Influence of Tasks Duration Variability on Task-Based Runtime Schedulers. Abstract: In the context of HPC platforms, individual nodes nowadays consist of heterogeneous processing resources such as GPU units and multicores. Those resources share communication and storage resources, inducing complex co-scheduling effects and making it hard to predict the exact duration of a task or of a communication. To cope with these issues, runtime dynamic schedulers such as StarPU have been developed. These systems base their decisions at runtime on the state of the platform and possibly on static priorities of tasks computed offline. In this paper, our goal is to quantify performance variability in the context of HPC heterogeneous nodes, by focusing on very regular dense linear algebra kernels, such as Cholesky and LU factorizations. We therefore first concentrate on evaluating the variability of the individual block-size kernels. Then, we analyze the impact of this variability at the scale of a full application on a dynamic runtime scheduler such as StarPU, in order to determine whether the strategies that have been designed in the context of MapReduce applications to cope with stragglers could be transferred to HPC systems, or whether the dynamic nature of runtime schedulers is enough to cope with actual performance variations, even in the presence of task dependencies. Olivier Beaumont (Visiting Professor)	29/05/2019 at 11:00	IC Room 202
Talk: Optical Interconnects for Advanced Processing Systems. Abstract: The presentation intends to identify the role of optical interconnects in next-generation distributed processing systems and will review the challenges and recent developments in board-level and on-board chip-to-chip interconnection for Data Centres and High-Performance Compute applications. George T. Kanellos (Visiting Professor)	29/10/2019 at 15:00	IC/UFF Auditorium
Talk: Failure Detection in Large Distributed Systems. Abstract: Failure detection is a prerequisite to failure mitigation and a key component to build distributed algorithms requiring resilience. This talk introduces the problem of failure detection in asynchronous networks, where the transmission delay is not known. We show how distributed failure detector oracles can be used to address fundamental problems such as consensus, k-set agreement, or mutual exclusion. Finally, we focus on how to build scalable failure detectors. Pierre Sens (Visiting Professor)	29/10/2019 at 16:00	IC/UFF Auditorium
Talk: A Communication-Efficient Causal Broadcast Protocol. Abstract: A causal broadcast ensures that messages are delivered to all nodes (processes) preserving the causal relation of the messages. We have proposed [ICPP 2018] a causal broadcast protocol for distributed systems whose nodes are logically organized in a virtual hypercube-like topology called Vcube. Messages are broadcast by dynamically building spanning trees rooted in the message's source node. By using multiple trees, the contention bottleneck problem of a single root spanning tree approach is avoided. Furthermore, different trees can intersect at some node. Hence, by taking advantage of both the out-of-order reception of causally related messages at a node and these path intersections, a node can delay forwarding, to one or more of its children in the tree, those messages whose causal dependencies it knows the children in question cannot yet satisfy. Such a delay does not induce any overhead. Experimental evaluation conducted on top of the PeerSim simulator confirms the communication effectiveness of our causal broadcast protocol in terms of latency and message traffic reduction. Luciana Arantes (Visiting Professor)	30/10/2019 at 11:00	IC/UFF Auditorium
Talk: Probabilistic Byzantine Tolerance Scheduling in Hybrid Cloud Environments. Abstract: This talk explores scheduling challenges in providing probabilistic Byzantine fault tolerance in a hybrid cloud environment, consisting of nodes with varying reliability levels, compute power, and monetary cost. In this context, the probabilistic Byzantine fault tolerance guarantee refers to the confidence level that the result of a given computation is correct despite potential Byzantine failures. We formally define a family of such scheduling problems, distinguished by whether they insist on meeting a given latency limit and try to optimize the monetary budget, or vice versa. For the case where the latency bound is a restriction and the budget should be optimized, we present several heuristic protocols and compare them using extensive simulations. Pierre Sens (Visiting Professor)	30/10/2019 at 12:00	IC/UFF Auditorium
Workshop: 1st WCloud-HPC. Lúcia Drummond (Organizer)	31/08/2020 and 01/09/2020	Online
Workshop: 2nd WCloud-HPC. Lúcia Drummond (Organizer)	16/09/2021 and 17/09/2021	Online
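
As an illustration of the failure-detection talk listed above, the following is a minimal sketch of a heartbeat-based failure detector for a network with unknown transmission delay. It is not taken from the talk: the class name, the method names, and the rule of lengthening the timeout after each false suspicion are assumptions chosen to show the general idea of an unreliable failure detector that corrects its own mistakes.

```python
class HeartbeatFailureDetector:
    """Suspects a process when no heartbeat arrives within its current timeout.

    Hypothetical sketch: because the real delay is unknown, a suspicion may be
    wrong. When a heartbeat arrives from a suspected process, the suspicion is
    withdrawn and that process's timeout is increased.
    """

    def __init__(self, initial_timeout: float, timeout_increment: float):
        self.initial_timeout = initial_timeout
        self.timeout_increment = timeout_increment
        self.last_heartbeat = {}  # process id -> time of last heartbeat
        self.timeout = {}         # process id -> current timeout
        self.suspected = set()    # processes currently suspected of failure

    def on_heartbeat(self, process: str, now: float) -> None:
        """Record a heartbeat; correct a false suspicion if there was one."""
        self.last_heartbeat[process] = now
        self.timeout.setdefault(process, self.initial_timeout)
        if process in self.suspected:
            # The process is alive, so the delay was underestimated.
            self.suspected.discard(process)
            self.timeout[process] += self.timeout_increment

    def check(self, now: float) -> set:
        """Return the set of processes suspected at time `now`."""
        for process, last in self.last_heartbeat.items():
            if now - last > self.timeout[process]:
                self.suspected.add(process)
        return set(self.suspected)
```

For example, with `HeartbeatFailureDetector(1.0, 0.5)`, a process whose heartbeat arrives late is first suspected, then trusted again with a timeout of 1.5 once its heartbeat is received.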

Missions

Mission	Period	Location
Talk: A Hibernation Aware Scheduler for Cloud Environments. Abstract: Nowadays, cloud platforms usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs, while spot VMs are unused instances available for a lower price. Despite the monetary advantages, a spot VM can be terminated or hibernated by EC2 at any moment. In this talk, we present the Hibernation-Aware Dynamic Scheduler (HADS), which schedules applications composed of independent tasks (bag-of-tasks) with deadline constraints on both hibernation-prone spot VMs (for the sake of cost) and on-demand VMs. We also consider the problem of temporal failures, which occur when a spot VM hibernates and does not resume within a time that guarantees the application's deadline. Our dynamic scheduling approach aims at minimizing the monetary cost of executing bag-of-tasks applications, respecting their deadlines even in the presence of hibernation. It is also able to avoid temporal failures by using task migration and work-stealing techniques. Experimental results with real executions using Amazon EC2 VMs confirm the effectiveness of our scheduling when compared with approaches based only on on-demand VMs, in terms of monetary cost and execution time. It is also shown that our strategy can tolerate temporal failures. Lúcia Drummond	14/11/2019 to 25/11/2019	Sorbonne Université, Paris
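
The temporal-failure condition described in the abstract above can be illustrated with a small arithmetic sketch. This is not the HADS implementation: the names below, the use of seconds as the time unit, and the assumptions that a VM's remaining tasks run sequentially and that migration has a fixed overhead are all hypothetical, chosen only to show how a deadline turns a hibernation into a failure.

```python
from dataclasses import dataclass


@dataclass
class HibernatedVM:
    """A hibernated spot VM (illustrative model, not from the source)."""
    remaining_work: float  # seconds of task execution still assigned to the VM


def latest_safe_resume(vm: HibernatedVM, deadline: float) -> float:
    """Latest time the VM can resume and still finish its tasks by the deadline."""
    return deadline - vm.remaining_work


def is_temporal_failure(vm: HibernatedVM, deadline: float, now: float) -> bool:
    """True if the VM is still hibernated after its latest safe resume time."""
    return now > latest_safe_resume(vm, deadline)


def migration_cutoff(vm: HibernatedVM, deadline: float,
                     migration_overhead: float) -> float:
    """Latest time to start migrating the tasks elsewhere (assumed fixed overhead)."""
    return latest_safe_resume(vm, deadline) - migration_overhead


vm = HibernatedVM(remaining_work=600.0)
print(latest_safe_resume(vm, deadline=1000.0))                      # 400.0
print(is_temporal_failure(vm, deadline=1000.0, now=500.0))          # True
print(migration_cutoff(vm, deadline=1000.0, migration_overhead=50.0))  # 350.0
```

Under these assumptions, a scheduler that wants to avoid the temporal failure would have to move the VM's tasks to another VM no later than the migration cutoff, rather than wait for the hibernated VM to resume.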

Scholarships

Scholarship	Period	Location
Sandwich Doctorate Scholarship (Six Months). Thesis: A Dynamic Task Scheduler Tolerant to Multiple Hibernations in Cloud Environments. Luan Teylo	09/2019 to 02/2020	Sorbonne Université, Paris
Visiting Researcher in Brazil Scholarship. Talk: A Communication-Efficient Causal Broadcast Protocol. Luciana Arantes	11/2019	Universidade Federal Fluminense, Niterói

Approved Projects

Project	Coordinator
STIC AMSUD – Edital nº 10/2019. CAPES program name: STIC AMSUD – COOPERAÇÃO EM CIÊNCIA E TECNOLOGIA DA INFORMAÇÃO E DA COMUNICAÇÃO FRANÇA – AMÉRICA DO SUL – CAPES/CDEFI. Coordinator: EDUARDO UCHOA BARBOZA. Selection instrument number: STIC AMSUD – Edital nº 10/2019 – Projetos. Schedule: the first year of the project starts in 01/2020 and ends in 12/2020; the second year starts in 01/2021 and ends in 12/2021. According to the call for the STIC AMSUD program, CAPES will fund the project and the Brazilian team, subject to annual budget availability, as follows: 4 (four) work missions, up to 2 (two) to France and 2 (two) to South American countries, for the coordinator or team members holding a doctorate who are registered in the project, each work mission lasting a minimum of 7 (seven) and a maximum of 20 (twenty) days; and up to 2 (two) scholarships per year, in the following modalities: Sandwich Doctorate, lasting 4 (four) to 12 (twelve) months; Postdoctoral, lasting 2 (two) to 12 (twelve) months.	Eduardo Uchoa Barboza
