Accelerating many-core, heterogeneous, and distributed architectures with hardware runtimes and programming models

Haro Ruiz, Juan Miguel de

dc.contributor

Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors

dc.contributor.author

Haro Ruiz, Juan Miguel de

dc.date.accessioned

2025-10-01T06:21:10Z

dc.date.available

2025-10-01T06:21:10Z

dc.date.issued

2025-09-10

dc.identifier.uri

http://hdl.handle.net/10803/695347

dc.description.abstract

(English) Motivated by growing concern about energy efficiency and the current trend of scaling out HPC systems to many computing nodes, this thesis tackles both problems with hardware acceleration and programming models. Regarding the first topic, we study FPGAs because of their flexibility to adapt to any computing workload and their high energy efficiency. We present extensions to the OmpSs@FPGA framework, which provides a high-level task-based programming interface for programmers who are not FPGA experts. These extensions include compiler directives to automatically optimize FPGA code, a hardware task scheduling runtime with dependence analysis called POM, and a multi-FPGA MPI-like API and runtime called OMPIF. In addition, we present the Implicit Message Passing (IMP) model, which combines task-based and message-passing programming models, leveraging dependence information and a static data distribution. IMP automatically communicates data between nodes when required by the data dependencies of a task. Therefore, the user does not need to write any call to MPI or OMPIF in the code, as this is handled by IMP. We evaluate this model on both FPGA and CPU clusters, with hardware acceleration for task scheduling and message passing using the POM and OMPIF runtimes. For CPU clusters, we study several ways to incorporate POM into an SoC: first with an embedded FPGA, then as an ASIC for a RISC-V core, and finally in an FPGA softcore also based on RISC-V. In the last case, we use both POM and OMPIF to evaluate distributed applications on a cluster of FPGAs that emulates a CPU cluster. We evaluate IMP and regular MPI+tasks programming with several benchmarks: Matrix Multiply, Spectra, N-body, Heat, and Cholesky. With these contributions, we achieve several objectives.
First, we demonstrate that OmpSs@FPGA can achieve absolute performance similar to that of a CPU node for some benchmarks, such as N-body, and can surpass the energy efficiency of CPU and GPU architectures that are similar in area and technology. Second, we evaluate multi-FPGA applications on three different clusters: cloudFPGA, ESSPER, and MEEP, which have very distinct characteristics. With IMP, we show that the N-body, Heat, and Cholesky benchmarks scale linearly to 64 FPGAs. On CPUs, the same benchmarks also scale linearly on an 8-core, 64-node cluster, with 512 cores in total. With our hardware-software co-design, which combines the hardware acceleration of task scheduling and message passing with IMP, we show a solution that accelerates HPC workloads as transparently as possible to the programmer, thus boosting productivity. This solution has been designed for heterogeneous systems based on FPGAs, but also for systems based on CPUs. The latter also benefit significantly from the reduction in runtime overhead that hardware acceleration provides.

dc.description.abstract

(Català) A causa de la importància de l'eficiència energètica i la tendència a escalar sistemes HPC amb molts nodes, en aquesta tesi tractem ambdós temes amb l'ajuda d'acceleració hardware i models de programació. Sobre el primer tema, les FPGA són objecte d'estudi a causa de la seva alta flexibilitat per adaptar-se a qualsevol càrrega de treball i a la seva alta eficiència energètica. En aquesta tesi presentem extensions al framework d'OmpSs@FPGA, el qual proveeix un model de programació basat en tasques per programadors no experts en FPGA. Aquestes extensions inclouen directives del compilador per optimitzar el codi FPGA automàticament, un runtime hardware per tasques amb anàlisi de dependències, anomenat POM, i una API basada en MPI per multi-FPGA amb el seu runtime associat, anomenat OMPIF. Addicionalment, presentem el model anomenat Implicit Message Passing (IMP), el qual combina models de programació basats en tasques i intercanvi de missatges, aprofitant les dependències i una distribució estàtica de les dades. IMP comunica dades automàticament entre nodes quan les dependències d'una tasca ho requereixin. Per tant, l'usuari no ha d'utilitzar MPI o OMPIF en el codi, ja que IMP s'encarrega de la comunicació. Avaluem aquest model en clústers d'FPGA i CPU, amb acceleració hardware per scheduling de tasques i intercanvi de missatges fent servir els runtimes OMPIF i POM. Per clústers de CPU, estudiem diferents maneres d'incorporar POM en un SoC, primer amb una FPGA encastada, després dissenyem POM com ASIC per un processador RISC-V, i finalment per un softcore FPGA també basat en RISC-V. En l'últim cas, utilitzem POM i OMPIF per avaluar aplicacions distribuïdes amb un clúster de FPGA que emula un clúster de CPU. Avaluem IMP i programació tradicional amb MPI i tasques en diversos benchmarks: Multiplicació de matrius, Spectra, N-body, Heat i Cholesky. Amb les contribucions esmentades, complim diversos objectius.
Primer, demostrem que amb OmpSs@FPGA podem aconseguir rendiment absolut similar a una CPU en alguns benchmarks, com N-body, i superar en eficiència energètica a arquitectures CPU i GPU similars (en àrea i tecnologia). Segon, també avaluem aplicacions multi-FPGA en tres clústers diferents: cloudFPGA, ESSPER, i MEEP, els quals tenen característiques molt distintives. Amb IMP, demostrem que podem escalar linealment N-body, Heat, i Cholesky amb 64 FPGA. Per CPU, també podem escalar linealment amb els mateixos benchmarks en un clúster de 64 nodes i 8 nuclis, amb un total de 512 nuclis. Amb el nostre codiseny hardware-software, el qual combina acceleració hardware per scheduling de tasques i intercanvi de missatges i IMP, proveïm una solució per accelerar aplicacions HPC transparentment al programador, impulsant productivitat. Aquesta solució ha sigut dissenyada per sistemes heterogenis basats en FPGA i CPU. Les CPU també es beneficien significativament de la reducció d'overhead del runtime per l'acceleració hardware.

dc.description.abstract

(Español) Debido a la importancia de la eficiencia energética y la tendencia a escalar sistemas HPC con muchos nodos, en esta tesis tratamos ambos temas con la ayuda de aceleración hardware y modelos de programación. Sobre el primer tema, las FPGA son objeto de estudio debido a su alta flexibilidad para adaptarse a cualquier carga de trabajo y a su alta eficiencia energética. En esta tesis presentamos extensiones al framework de OmpSs@FPGA, el cual provee un modelo de programación basado en tareas para programadores no expertos en FPGA. Estas extensiones incluyen directivas del compilador para optimizar código FPGA automáticamente, un runtime hardware para tareas con análisis de dependencias llamado POM, y una API basada en MPI para multi-FPGA con su runtime asociado, llamado OMPIF. Además, presentamos el modelo llamado Implicit Message Passing (IMP), el cual combina modelos de programación basados en tareas y en intercambio de mensajes, aprovechando las dependencias y una distribución estática de los datos. IMP comunica datos automáticamente entre nodos cuando las dependencias de una tarea lo requieren. Por lo tanto, el usuario no ha de escribir ninguna llamada a MPI u OMPIF en el código, ya que IMP se encarga de la comunicación. Evaluamos este modelo en clústeres de FPGA y CPU, con aceleración hardware para scheduling de tareas e intercambio de mensajes usando los runtimes OMPIF y POM. Para clústeres de CPU, estudiamos diferentes maneras de incorporar POM en un SoC, primero con una FPGA encastada, luego diseñamos POM como ASIC para un procesador RISC-V, y finalmente para un softcore FPGA también basado en RISC-V. En el último caso, usamos POM y OMPIF para evaluar aplicaciones distribuidas con un clúster de FPGA que emulan un clúster de CPU. Evaluamos IMP y programación tradicional con MPI y tareas en varios benchmarks: Multiplicación de matrices, Spectra, N-body, Heat y Cholesky. Con las contribuciones mencionadas, cumplimos varios objetivos. 
Primero, demostramos que con OmpSs@FPGA podemos conseguir rendimiento absoluto similar a una CPU en algunos benchmarks, como N-body, y superar en eficiencia energética a arquitecturas CPU y GPU similares (en área y tecnología). Segundo, también evaluamos aplicaciones multi-FPGA en tres clústeres diferentes: cloudFPGA, ESSPER, y MEEP, los cuales tienen características muy distintas. Con IMP, demostramos que podemos escalar linealmente el N-body, Heat, y Cholesky con 64 FPGA. Para CPU, también podemos escalar linealmente con los mismos benchmarks en un clúster de 64 nodos y 8 núcleos, con un total de 512 núcleos. Con nuestro codiseño hardware-software, el cual combina aceleración hardware de scheduling de tareas e intercambio de mensajes e IMP, proveemos una solución para acelerar aplicaciones HPC transparentemente al programador, impulsando la productividad. Esta solución ha sido diseñada para sistemas heterogéneos basados en FPGA y CPU. Las CPU también se benefician significativamente de la reducción del overhead del runtime por la aceleración hardware.

dc.format.extent

220 p.

dc.language.iso

eng

dc.publisher

Universitat Politècnica de Catalunya

dc.rights.license

Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons license: http://creativecommons.org/licenses/by/4.0/

dc.rights.uri

http://creativecommons.org/licenses/by/4.0/

dc.source

TDX (Tesis Doctorals en Xarxa)

dc.subject

High Performance Computing (HPC)

dc.subject

Field-Programmable Gate Array (FPGA)

dc.subject

task scheduling

dc.subject

task-based programming models

dc.subject

computer architecture

dc.subject

CPU

dc.subject

MPI

dc.subject

High-Level Synthesis (HLS)

dc.subject

energy efficiency

dc.subject

programmability

dc.subject

many-core architectures

dc.subject

FPGA clusters

dc.subject

hardware runtimes

dc.subject

hardware acceleration

dc.subject

ASIC

dc.subject

Implicit Message Passing (IMP)

dc.subject.other

Àrees temàtiques de la UPC::Informàtica

dc.title

Accelerating many-core, heterogeneous, and distributed architectures with hardware runtimes and programming models

dc.type

info:eu-repo/semantics/doctoralThesis

dc.type

info:eu-repo/semantics/publishedVersion

dc.date.updated

2025-10-01T06:21:09Z

dc.subject.udc

004 - Informàtica

dc.contributor.director

Álvarez Martínez, Carlos

dc.contributor.director

Jiménez González, Daniel

dc.embargo.terms

none

dc.rights.accessLevel

info:eu-repo/semantics/openAccess

dc.identifier.doi

https://dx.doi.org/10.5821/dissertation-2117-442722

dc.description.degree

DOCTORAT EN ARQUITECTURA DE COMPUTADORS (Pla 2012)

Documents

TJMDHR1de1.pdf

4.054Mb PDF

This item appears in the following collection(s)

Programa de Doctorat en Arquitectura de Computadors [272]
