OpenHPC Validation Plan

This page describes a validation plan for OpenHPC deployment on AArch64 systems. It is not restricted to hardware, cloud or emulation systems and should be reasonably independent of the medium.

Our main concern is how to automate the deployment in a way that is not only easy but also meaningful.

The key points we need are:

  1. Full automation. To be able to start new images, install OpenHPC, download some libraries, configure the master and at least one slave, compile a few HPC programs, run them and make sure that they conform to the expected output.
  2. Representative images. We need to make sure that the minimum number of nodes brings a meaningful result, that the programs we run will be executed in real clusters and that the environment we have chosen is not hiding real-world errors.
  3. Change driven. Builds and validation need to be triggered automatically by relevant changes to the relevant repositories, or at least at a frequency compatible with those repositories.

To achieve the final stage of continuous integration, we'll need to go through a set of steps, from a fully manual setup to a fully automated one. Not all steps will need to be done in this effort, as many of them already exist for other projects, and we should aim not to duplicate any previous (or future) efforts in achieving our CI loop.

First Iteration

The first steps form the bootstrap process, where we'll define best practices and consult other Linaro teams, the upstream OpenHPC community and the SIG members on the many decisions we'll need to take, guiding us towards a meaningful validation process without duplicating work or moving in the wrong direction.

Step #0: Identify Process

The first step is to identify everything that is needed so that we can get a fully repeatable process.

Engineering Specification

This will likely involve:

  1. Installing CentOS/SLES on QEMU, container and bare-metal environments.
  2. Applying the industry-standard changes to those images (security, authentication, performance, etc.).
  3. Installing OpenHPC directly from its upstream documentation, identifying all pertinent steps that are taken/not taken and the reasons, per environment.
  4. Running a baseline validation on the core setup. This should live in a git repository somewhere, probably GitHub, so we can share it with our members and the community.
  5. Installing all additional components that are needed for the tests, for example LLVM's libomp (to compare with GNU's GOMP), special libraries, etc.
  6. Downloading all tests that will be compiled and executed as part of the validation.
  7. Compiling all tests, possibly with different toolchains, options and libraries, and identifying every problem (errors, warnings, etc.).
  8. Running all tests that compiled successfully and making sure that their output is as expected, allowing for the floating-point nature of most of them (a sketch of such a check follows this list).
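
As a rough sketch of how the last two steps could later be automated, assuming a test that prints only numbers, and with the test command and reference file being placeholders we would define ourselves, the output comparison could look like this:

    import subprocess

    def run_and_check(cmd, reference_file, rel_tol=1e-6):
        """Run a compiled test and compare its numeric output against a
        reference, tolerating small floating-point differences."""
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        # Assumes the test prints only numbers, one or more per line.
        produced = [float(x) for x in result.stdout.split()]
        with open(reference_file) as f:
            expected = [float(x) for x in f.read().split()]
        if len(produced) != len(expected):
            return False
        return all(abs(p - e) <= rel_tol * max(abs(e), 1.0)
                   for p, e in zip(produced, expected))

    # Hypothetical usage: an MPI job with a reference file kept next to the test.
    ok = run_and_check(["mpirun", "-np", "4", "./hello_mpi"], "hello_mpi.ref")
    print("PASS" if ok else "FAIL")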

Once these steps are reproduced by hand, and documented thoroughly, we can start automating the process.

Current Infrastructure

OpenHPC already has a good part of that infrastructure ready. The install guide (PDF) has a full description, in Appendix A, of how to extract the automation scripts from the docs-ohpc package.
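
As a rough illustration of that flow, driving it from a script could look like the sketch below; the recipe path is an assumption based on the Appendix A description and varies per OS, architecture and release, so it would need to be checked against the actual package contents:

    import subprocess

    # Install the documentation package that ships the automation recipe
    # (needs root; package name taken from the text above).
    subprocess.run(["yum", "-y", "install", "docs-ohpc"], check=True)

    # Assumed location of the extracted recipe script; the real path depends on
    # the chosen OS, architecture, provisioner and resource manager.
    RECIPE = "/opt/ohpc/pub/doc/recipes/centos7/aarch64/warewulf/slurm/recipe.sh"

    # The recipe reads site-specific settings (master hostname, NIC names,
    # compute node MACs, etc.) from an input file that must be edited first.
    subprocess.run(["bash", RECIPE], check=True)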

This is what is used by the OpenHPC CI, which tries to run more than just "make check" tests by using the test-suite-ohpc package.

Any change we need to make to the automation and testing should first be done in those packages; only if that is not possible should we write our own scripts, hopefully shared in their own repository.

We may not be able to share our resources with them (license, NDA), but if we do replicate their setup (Jenkins, etc.), we should use the same sources, repositories and configurations.

Acceptance Criteria

The outcomes of this step are:

  • At least one deployment of OpenHPC on AArch64 running OpenMP and MPI workloads on at least one compute node.
  • A document, with step-by-step instructions, on how to install, set up and execute a minimal OpenHPC validation suite.
  • At least one base test and one workload test need to complete successfully (a sketch of a minimal base test follows this list).
  • Not all of this task needs to be finished before the following ones start, but it would be good to have at least one successful base test (before additional components are added).
  • (optional) GitHub repositories containing a few examples on base tests, scripts to use, etc.
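
As an example of what a minimal MPI base test could look like, the sketch below uses mpi4py, which would be an extra dependency installed on top of the OpenHPC MPI stack rather than something this plan assumes OpenHPC provides:

    # Minimal MPI base test: every rank reports in, rank 0 verifies the count.
    # Run with e.g.: mpirun -np 4 python3 mpi_base_test.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Gather all ranks on rank 0 and check that nobody is missing.
    ranks = comm.gather(rank, root=0)
    if rank == 0:
        assert sorted(ranks) == list(range(size)), "missing ranks"
        print("PASS: %d ranks reported in" % size)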

Step #1: Automating deployment (0.1 ~ 0.2)

After we know how to validate OpenHPC on AArch64 by hand, we need to identify which deployment solution is both easy to automate and meaningful in the HPC community.

Engineering Specification

Although bare-metal deployments are the most common nowadays, containers are becoming ubiquitous in large-scale, heterogeneous HPC clusters. Containers are also a very easy way to deploy base images on a busy validation hardware infrastructure.

We have a few options:

  1. Cross-QEMU containers (emulation) on x86_64 hardware. This is slow and problematic, but can sometimes be the only way some SIG members will have access to ARM hardware.
    1. PROs: AArch64 QEMU is available upstream and anyone can use it;
    2. CONs: It's really slow and support for HPC OSs is not complete;
  2. Native containers (acceleration) on AArch64 hardware: This is the easiest way to deploy images, but will be limited by hardware availability and upstream support for the existing hardware.
    1. PROs: Fast, reliable, easy to automate;
    2. CONs: Needs access to hardware (the developer cloud would help here);
  3. Bare-metal AArch64 hardware: This is closer to most current deployments and would allow us to test real-world situations (10GbE, InfiniBand, UEFI via serial, etc.).
    1. PROs: Tests real-world cases for large-scale deployment;
    2. CONs: Really limited by hardware availability, as each machine will be entirely dedicated;

We shouldn't focus on a single one, as all of them are important, but we should prioritise them, pick the most important to do first, and leave the others until step #2 is finished in its first iteration.
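
To make the trade-offs concrete, a small dispatcher could build the node launch command per backend. The sketch below is illustrative only; the QEMU flags, image names and container invocation are assumptions, not a tested configuration:

    def launch_command(backend, image):
        """Return the command used to bring up one node for a given backend."""
        if backend == "qemu":          # option 1: emulation on x86_64
            return ["qemu-system-aarch64", "-M", "virt", "-cpu", "cortex-a57",
                    "-m", "4096", "-nographic",
                    "-drive", "file=%s,format=qcow2" % image]
        if backend == "container":     # option 2: native containers on AArch64
            return ["docker", "run", "--rm", "-d", image]
        if backend == "bare-metal":    # option 3: handled by lab provisioning
            raise NotImplementedError("bare-metal goes through PXE/lab tooling")
        raise ValueError("unknown backend: %s" % backend)

    # Example: print the command for a (hypothetical) native container image.
    print(" ".join(launch_command("container", "centos7-aarch64-base")))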

Current Infrastructure

Right now, what we have is a cloud project on the Linaro Developer Cloud. This allows us to deploy virtual images on AArch64 hardware, create local networks and try out CentOS+OpenHPC installations and mini clusters.

This is option 2 above, and it's probably the closest we'll get to production environments, at least at such an early stage. Progress will be tracked in Jira.
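
Assuming the Developer Cloud project is reachable through its OpenStack APIs, spinning up a mini cluster could be scripted along these lines with openstacksdk; the cloud entry, image, flavor and network names are all placeholders:

    import openstack

    # Credentials come from a clouds.yaml entry; "devcloud" is a placeholder name.
    conn = openstack.connect(cloud="devcloud")

    # One master plus one compute node on a private network (names are examples).
    for name in ("ohpc-master", "ohpc-compute-01"):
        conn.create_server(
            name=name,
            image="CentOS7-aarch64",
            flavor="medium",
            network="ohpc-internal",
            wait=True,
        )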

Acceptance Criteria

The outcomes of this step are:

  • Having validated with the SIG members that the chosen deployment is representative of their needs.
  • At least one successful deployment of an OS + changes (using Ansible, SALT, etc.) using a reproducible and scalable method (a sketch of driving this follows this list).
  • Being able to install OpenHPC by hand on the installation above and run the same base tests as produced in step #0.
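
For the "OS + changes" bullet, one way to keep the step reproducible would be to drive a playbook from the CI job, as sketched below; the playbook, inventory and variable names are hypothetical:

    import subprocess

    # Apply the industry-standard changes (security, authentication, performance)
    # to freshly deployed nodes. site.yml and hosts.ini are placeholder names.
    subprocess.run(
        ["ansible-playbook", "-i", "hosts.ini", "site.yml",
         "--extra-vars", "role=ohpc_base"],
        check=True,
    )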

Step #2: OpenHPC deployment (0.3 ~ 0.5)

Once we can quickly deploy CentOS/SLES (and potentially other) images on demand, and successfully install OpenHPC by hand using the instructions produced in step #0, we need to work on how to automate that deployment.

Engineering Specification

This is not as simple as having OS images on demand, because some parameters need to be chosen on demand too: for example, the number of nodes, which toolchain to use, which MPI stack, etc.
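
A sketch of what those on-demand parameters could look like as a single structure the automation consumes; the field names and default values are examples, not a fixed list:

    from dataclasses import dataclass

    @dataclass
    class ClusterConfig:
        """Parameters that must be selectable at deployment time."""
        compute_nodes: int = 2
        compiler: str = "gnu"          # e.g. "gnu" or "llvm"
        mpi: str = "openmpi"           # e.g. "openmpi" or "mpich"
        provisioning: str = "direct"   # "direct" (step #1 images) or "tftp" (master-provisioned)

    cfg = ClusterConfig(compute_nodes=4, compiler="llvm", mpi="mpich")
    print(cfg)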

We also need to understand the difference between setting up an OpenHPC master with TFTP boot process for compute nodes and setting them up directly using the same deployment as in step #1. While the former is more representative of the way OpenHPC works, the latter might be much easier to deploy consistently using the same CI infrastructure.

In this step, we also need to choose the procedure for automating it. We can use production tools, like Ansible or SALT, and have playbooks on a git repo. We can also write our own scripts, or discuss with the OpenHPC community what's the best way to do this.

In the end, having an upstream-approved process, even if it's a bit tedious, is perhaps more important than following the guidelines of other teams. But, in the absence of that, we should try to integrate our jobs as much as possible with the other infrastructure teams at Linaro, most notably Builds & Baselines, Systems and the Lab.

Acceptance Criteria

The outcome of this step is a process that:

  • is accepted upstream (OpenHPC), so that it can be repeated by other members of the community, not just our members.
  • is accepted by the infrastructure teams at Linaro, having had their review and input.
  • is fully automated, with a few choices (compiler, MPI, etc.), and can be deployed on top of a vanilla image produced in step #1.
  • runs at least one OpenMP and one MPI test successfully on each variation.

Step #3: Workload Testing (0.6 ~ 0.8)

Once OpenHPC is being automatically deployed and tested for simple programs, we need to start looking at the packages and tests that the SIG members want us to continuously test.

Engineering Specification

As a first step, we want the workloads/libraries to be available through OpenHPC (we're already working on getting some in there), but ultimately we may be able to provide a git repository and a recipe (as a script or playbooks) and run that as additional testing.

We'd need to understand how to encapsulate each test, hopefully in similar steps: download, install, build, test. For example, installing an OpenHPC package means "download" and "install" happen in the same step, so we'll need to allow some steps to be empty.
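
A sketch of that encapsulation, where each stage is optional so an OpenHPC-packaged workload can leave "download" empty; the class name, example package and commands are illustrative:

    import subprocess

    class WorkloadTest:
        """A test wrapped in four optional stages: download, install, build, test.
        Any stage left as None is simply skipped."""

        def __init__(self, name, download=None, install=None, build=None, test=None):
            self.name = name
            self.stages = [("download", download), ("install", install),
                           ("build", build), ("test", test)]

        def run(self):
            for stage, cmd in self.stages:
                if cmd is None:
                    continue  # e.g. OpenHPC packages have no separate download
                print("[%s] %s: %s" % (self.name, stage, " ".join(cmd)))
                subprocess.run(cmd, check=True)

    # Hypothetical example: an OpenHPC-packaged workload, so download is empty.
    WorkloadTest(
        "example-workload",
        install=["yum", "-y", "install", "example-workload-ohpc"],
        test=["mpirun", "-np", "2", "./example-workload", "--self-test"],
    ).run()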

The testing phase can also be a benchmark, so not only will we have outputs that indicate quality (proximity to certain floating-point values, or exact results), but we'll also measure run time, or "reported time", and return that as integers so we can track them in a separate database.

This last step is not necessary on the first run of step #3, but we ultimately want to track benchmark numbers and be able to identify regressions, spikes and trends.

Acceptance Criteria

The first iteration of this step needs to produce:

  • A framework for how to create tests, separated in steps (download, install, build, test), and applicable to OpenHPC packages.
  • A way to analyse the results based on the nature of the workload (statistical, exact, approximate) or to allow external scripts to validate output.
  • A way to communicate pass/fail back to the process that initiated the deployment in the first place, so that we can have a nice green/red status (a sketch follows this list).
  • An example with at least one OpenHPC package being installed, compiled, tested and the results showing green/red on real errors.
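
For the pass/fail bullet, a minimal sketch of reporting back to the dispatching job: write a machine-readable summary and set the exit code that turns the build green or red. The file name and fields are only a suggestion:

    import json, sys

    def report(results, path="validation-result.json"):
        """results: mapping of test name -> bool. Writes a summary the CI job
        can archive, and exits non-zero if anything failed."""
        summary = {"passed": sum(results.values()),
                   "failed": sum(not ok for ok in results.values()),
                   "tests": results}
        with open(path, "w") as f:
            json.dump(summary, f, indent=2)
        sys.exit(0 if summary["failed"] == 0 else 1)

    # Hypothetical test names.
    report({"omp_hello": True, "mpi_ring": True})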

Second Iteration

Once we can successfully deploy an OpenHPC image with some options and an HPC test reporting red/green status, we need to refine the options and improve the availability of the workloads.

However, unlike steps 1 ~ 3, these can and should happen in parallel.

Refining the deployment process

We'll surely make concessions on step #1, likely pick one option and continue the process. But after having OpenHPC workloads validated, we need to understand what we left out because we didn't need it, and what was meant to be done later.

Hopefully, step #1 will have left a document, or Jira tasks, so that we can stay on track without needing to re-evaluate anything at this step, just implement what's missing.

A few topics in this track would be:

  • Check the other two deployment strategies (out of QEMU, container, bare-metal), and try them out in order of member priorities.
  • Simplify the process by having ready-made images, snapshots, micro-instances.
  • Make those images available for cloud / public access?
  • Maybe use different types of containers?

Refining the OpenHPC options

We should be able to specify a few options for the images, regardless of whether they're built on the fly or are pre-defined images.

The options could be:

  • Which compiler to use: GCC, LLVM or some proprietary one available at some URL.
  • Which libraries to use for OpenMP, MPI, etc.
  • Additional components (like monitoring, fast networking, etc.)

These options should be available at dispatch time (Jenkins?), and trigger jobs could pick different values for them to spawn a matrix validation on every update event: for instance, a pre-commit hook from Gerrit, a new upstream release, new versions of packages being made available, etc.
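
A sketch of how the dispatch-time options could be expanded into a validation matrix; the option values mirror the bullets above and are examples, not the final set:

    from itertools import product

    compilers = ["gcc", "llvm"]
    mpi_stacks = ["openmpi", "mpich"]
    extras = [[], ["monitoring"], ["fast-networking"]]

    # Each combination becomes one validation job, triggered by the update event.
    for compiler, mpi, components in product(compilers, mpi_stacks, extras):
        job = {"compiler": compiler, "mpi": mpi, "extra_components": components}
        print("dispatch:", job)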

Improving test coverage

The framework defined in step #3 is an over-simplified version of what we may need to have, because it expects OpenHPC (or the base OS) to make the package-management dependency decisions. This strategy won't work if we have to try external packages (for licensing reasons).

OpenHPC has a way to install third-party packages, and we may come up with a packaging that exposes the dependencies for each individual package, but that process needs to be well defined and possible to do for all third-party software the SIG members may want to add.
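
One way to expose those dependencies would be a small manifest per third-party workload that the step #3 framework reads before its install stage; the field names, URL and package names below are invented for illustration:

    # Hypothetical manifest for a third-party workload that is not (yet)
    # packaged in OpenHPC: the framework installs the listed dependencies
    # first, then follows the recipe.
    MANIFEST = {
        "name": "example-cfd-solver",
        "source": "https://example.org/example-cfd-solver.git",  # placeholder URL
        "depends": ["openmpi-devel", "fftw-devel"],               # example packages
        "build": ["make", "-j4"],
        "test": ["./run_selftest.sh"],
    }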

In addition to that improvement, we need to start populating the validation with a lot more tests, hopefully most of them as OpenHPC packages, with their own validation process, pass/fail scripts, etc.

So, while there probably won't be a lot of work in the OpenHPC infrastructure, especially related to our validation process, there will be substantial upstreaming work to get the packages into the releases with a full validation process.

We may also need to have a local repository (OpenHPC also allows some of that) for the experimental packages that haven't made it into an upstream release yet.

Benchmarking

The final piece of the puzzle is how to measure performance in a CI loop.

When creating the packages on the task above, we should take care to enable them to run in two modes: validation and benchmark.

The validation mode will just run a small subset that hopefully encompasses most (if not all) of the functionality, so that we can get a quick report on the status of those features as works / doesn't work.

The benchmark mode will run a subset of those features in larger loops and with internal timers, so that we can print the run time (or specific counters per second) into the test output.

Not all programs can take such an intrusive change, so we should also allow for a simple "execution time" measurement and cope with the noise by running it multiple times (a sketch follows).
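
A sketch of that simple "execution time" fallback, coping with noise by repeating the run and reporting the median as an integer (the benchmark command is a placeholder):

    import statistics, subprocess, time

    def timed_runs(cmd, repeats=5):
        """Run cmd several times and return the median wall-clock time in
        milliseconds (an integer, as the tracking database expects)."""
        samples = []
        for _ in range(repeats):
            start = time.monotonic()
            subprocess.run(cmd, check=True, capture_output=True)
            samples.append(time.monotonic() - start)
        return int(statistics.median(samples) * 1000)

    print(timed_runs(["./example-benchmark"]))  # placeholder binary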

Both validation and benchmarking modes need to run with at least GCC and Clang, GOMP and libomp, so that we can identify any issues that arise in due time.

The second part of this task involves aggregating all the data into a database, so that we can track performance regressions.

Benchmark databases are generally large, NoSQL-based and, most of the time, hand-made. Other teams (e.g. the toolchain team) already have extensive experience with benchmarking and tracking, so we should leverage their knowledge and existing tools.
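
Until we settle on (or reuse) one of those databases, even a flat, append-only record per run would let us start collecting history; the schema below is only a starting point and maps naturally onto most document stores later on:

    import json, time

    def record_result(db_path, benchmark, value_ms, compiler, mpi):
        """Append one benchmark sample as a JSON document (one per line)."""
        entry = {
            "timestamp": int(time.time()),
            "benchmark": benchmark,
            "runtime_ms": value_ms,
            "compiler": compiler,
            "mpi": mpi,
        }
        with open(db_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    # Hypothetical sample: benchmark name, median run time, and the variation.
    record_result("results.jsonl", "example-benchmark", 1234, "gcc", "openmpi")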