Resource Management in Clouds for the Execution of High-Performance Applications
Coordinator: Lúcia Maria de Assumpção Drummond
The project addresses the problem of allocating and managing cloud resources for HPC (High Performance Computing) applications, minimizing execution time and energy consumption while maximizing fault tolerance, without violating service level agreements (SLAs), and using biology applications as a case study. Cloud computing has traditionally been used for data sharing and general-purpose services, but more recently it has also emerged as a promising alternative for running HPC applications. Compared with dedicated infrastructure, this computing paradigm offers several advantages, such as rapid resource provisioning and a significant reduction in operating costs.
However, several challenges must be overcome to narrow the performance gap between dedicated infrastructure and clouds. Overheads introduced by the virtualization layer, hardware heterogeneity, and high network latencies degrade the performance of HPC applications. In addition, cloud providers generally adopt resource-sharing policies that can reduce the performance of such applications even further. Typically, a physical server hosts several virtual machines, which can cause contention for shared resources such as cache and main memory, significantly reducing their performance.
Moreover, selecting and manually configuring virtual machines is a complex task for scientists who develop HPC applications but are not experts in cloud administration tools. Application schedulers play a fundamental role in ensuring that such applications run efficiently; their policies vary with the objective function, for example minimizing total execution time or minimizing energy demand while maintaining a service-level guarantee to the user. To promote the use of clouds for running HPC applications, this project aims to address these aspects. The importance of clouds for HPC is illustrated by initiatives such as UberCloud, which offers HPC as a cloud service and where users can discuss their experience with such an environment. As a case study, we consider mainly experiments in bioinformatics, in particular comparative genomics.
Team
Calls for Applications
Call | Links | Application Period | Status |
---|---|---|---|
PDSE 2019 Scholarship Recipient Selection | Call, Annex I, Annex II | 15/03/2019 to 01/04/2019 | Closed |
Junior Visiting Professor Abroad | Call | 01/07/2019 to 08/07/2019 | Closed |
Visiting Researcher in Brazil | Call | 01/07/2019 to 15/07/2019 | Closed |
PDSE 2020 Scholarship Recipient Selection | Call, Annex I, Annex II, Selection Minutes | 13/10/2020 to 19/10/2020 | Closed |
Senior Visiting Professor Abroad | Call, Selection Minutes | 26/10/2020 to 02/11/2020 | Closed |
Visiting Researcher in Brazil | Call | 12/11/2020 to 13/11/2020 | Closed |
Visiting Researcher in Brazil | Call | 10/08/2022 to 19/08/2022 | Closed |
Events
Event | Date | Location |
---|---|---|
Talk: Smart City Network Uberization. Abstract: Connectivity is a key issue for smart cities. Providing Internet to a large number of users and things requires intensive network densification, an operation consisting in increasing the number of access points in order to provide a high density of users and things with an Internet connection. This presentation describes how to reach this goal using an uberization solution. We compare our solution with 5G and Fog networking solutions. We also describe how it is possible to secure the network with blockchain solutions. Guy Pujolle (Visiting Professor) | 26/03/2019 at 11:00 | IC/UFF Auditorium |
Doctoral Thesis Proposal Defense: A Hibernation-Aware Dynamic Scheduler for Cloud Environments. Abstract: Cloud platforms have emerged as a prominent environment to execute different classes of applications providing on-demand resources as well as scalability. They usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs are unused instances available for a lower price. Despite the financial advantages, a spot VM can be terminated or hibernated by EC2 at any moment. Luan Teylo (PhD Student). Advisors: Lúcia Drummond, Luciana Arantes, and Pierre Sens | 03/05/2019 at 10:30 | IC Room 317 |
Talk: Dealing with Non Uniform Memory Access. Abstract: A natural trend in the multicore evolution is the increase in the number of integrated cores, while keeping the configuration of a shared main memory. One common packaging considers a Non Uniform Memory Access (NUMA), where the overall available memory is made of several blocks that are physically separated but interconnected so as to form a virtually contiguous memory with a unified addressing. From the programming point of view, this aspect is completely virtual, and ordinary programmers are not even aware of this technical reality. The main consequence of not addressing the NUMA configuration is the unacceptable scalability that can be observed using a standard parallelization. This penalty is the conjunction of remote accesses and bus contention, among others. In this talk, we will explain the concept, how it is handled technically, what the effects are, and how to deal with this programming concern. Claude Tadonki (Visiting Professor) | 15/05/2019 at 10:00 | IC Room 202 |
Talk: Influence of Tasks Duration Variability on Task-Based Runtime Schedulers. Abstract: In the context of HPC platforms, individual nodes nowadays consist of heterogeneous processing resources such as GPUs and multicores. Those resources share communication and storage resources, inducing complex co-scheduling effects, and making it hard to predict the exact duration of a task or of a communication. To cope with these issues, runtime dynamic schedulers such as StarPU have been developed. These systems base their decisions at runtime on the state of the platform and possibly on static priorities of tasks computed offline. In this paper, our goal is to quantify performance variability in the context of HPC heterogeneous nodes, by focusing on very regular dense linear algebra kernels, such as Cholesky and LU factorizations. We therefore first concentrate on evaluating the variability of individual block-size kernels. Then, we analyze the impact of this variability at the scale of a full application on a dynamic runtime scheduler such as StarPU, in order to analyze whether the strategies that have been designed in the context of MapReduce applications to cope with stragglers could be transferred to HPC systems, or if the dynamic nature of runtime schedulers is enough to cope with actual performance variations, even in the presence of task dependencies. Olivier Beaumont (Visiting Professor) | 29/05/2019 at 11:00 | IC Room 202 |
Talk: Optical Interconnects for Advanced Processing Systems. Abstract: The presentation intends to identify the role of optical interconnects in next-generation distributed processing systems and will review the challenges and recent developments in board-level and on-board chip-to-chip interconnection for Data Centres and High-Performance Compute applications. George T. Kanellos (Visiting Professor) | 29/10/2019 at 15:00 | IC/UFF Auditorium |
Talk: Failure Detection in Large Distributed Systems. Abstract: Failure detection is a prerequisite to failure mitigation and a key component to build distributed algorithms requiring resilience. This talk introduces the problem of failure detection in asynchronous networks where the transmission delay is not known. We show how distributed failure detector oracles can be used to address fundamental problems such as consensus, k-set agreement, or mutual exclusion. Finally, we focus on how to build scalable failure detectors. Pierre Sens (Visiting Professor) | 29/10/2019 at 16:00 | IC/UFF Auditorium |
Talk: A Communication-Efficient Causal Broadcast Protocol. Abstract: A causal broadcast ensures that messages are delivered to all nodes (processes) preserving the causal relation among the messages. We have proposed [ICPP 2018] a causal broadcast protocol for distributed systems whose nodes are logically organized in a virtual hypercube-like topology called Vcube. Messages are broadcast by dynamically building spanning trees rooted in the message’s source node. By using multiple trees, the contention bottleneck problem of a single-root spanning tree approach is avoided. Furthermore, different trees can intersect at some node. Hence, by taking advantage of both the out-of-order reception of causally related messages at a node and these path intersections, a node can delay forwarding, to one or more of its children in the tree, messages with causal dependencies that it knows those children cannot yet satisfy. Such a delay does not induce any overhead. Experimental evaluation conducted on top of the PeerSim simulator confirms the communication effectiveness of our causal broadcast protocol in terms of latency and message traffic reduction. Luciana Arantes (Visiting Professor) | 30/10/2019 at 11:00 | IC/UFF Auditorium |
Talk: Probabilistic Byzantine Tolerance Scheduling in Hybrid Cloud Environments. Abstract: This talk explores scheduling challenges in providing probabilistic Byzantine fault tolerance in a hybrid cloud environment, consisting of nodes with varying reliability levels, compute power, and monetary cost. In this context, the probabilistic Byzantine fault tolerance guarantee refers to the confidence level that the result of a given computation is correct despite potential Byzantine failures. We formally define a family of such scheduling problems distinguished by whether they insist on meeting a given latency limit and trying to optimize the monetary budget or vice versa. For the case where the latency bound is a restriction and the budget should be optimized, we present several heuristic protocols and compare them using extensive simulations. Pierre Sens (Visiting Professor) | 30/10/2019 at 12:00 | IC/UFF Auditorium |
Workshop: 1st WCloud-HPC. Lúcia Drummond (Organizer) | 31/08/2020 and 01/09/2020 | Online |
Workshop: 2nd WCloud-HPC. Lúcia Drummond (Organizer) | 16/09/2021 and 17/09/2021 | Online |
Missions
Mission | Period | Location |
---|---|---|
Talk: A Hibernation Aware Scheduler for Cloud Environments. Abstract: Nowadays, cloud platforms usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs are unused instances available for a lower price. Despite the monetary advantages, a spot VM can be terminated or hibernated by EC2 at any moment. Lúcia Drummond | 14/11/2019 to 25/11/2019 | Sorbonne Université, Paris |
Grants
Grant | Period | Location |
---|---|---|
Sandwich Doctorate Scholarship | 09/2019 to 02/2020 | Sorbonne Université, Paris |
Visiting Researcher in Brazil Grant. Talk: A Communication-Efficient Causal Broadcast Protocol. Luciana Arantes | 11/2019 | Universidade Federal Fluminense, Niterói |
Approved Projects
Project | Coordinator |
---|---|
STIC AMSUD – Call No. 10/2019 | Eduardo Uchoa Barboza |