Updates


Event Date Summary

An email was sent on 3 Feb with the following updates:

We are notifying you of updates to the previously announced outage and capacity reduction schedule for the Graham compute cluster:

New availability date: Starting February 17, the cluster will resume running jobs with a reduced computational capacity.

No action is required.

Graham is available for login, and user storage is accessible. Project storage is read-only while we finish migrating the data to the new storage system.

We ran into a few issues completing the migration, but it is nearly done. The migrated data has filled the new storage system to capacity; we have ordered additional capacity, which will be installed the week of February 10.

Until the new Nibi system is available, the reduced Graham cluster will have a simplified scheduling configuration. This is required because the cluster is smaller.

Jobs can be either CPU or GPU jobs; only V100, T4, A100 and A5000 GPUs remain available.

Auxiliary services like Globus and gra-vdi will become available as time permits.

Specific details are available on the Graham wiki page.

Please note that Graham cloud will remain operational during this period.

We recognize that this may cause disruption for those relying on Graham, but these measures are necessary to make room for the new system, Nibi, which will provide even greater capacity and capabilities.

We assure you your data will remain secure at all times.

The status of Graham can be checked on https://status.alliancecan.ca. You can also visit the Infrastructure Renewal Wiki page where National Host Sites continue to publish planned activities and anticipated impacts including outages and system reductions.

If you have any questions or concerns about this, please contact [email protected].

Thank you for your patience and support,

We've made progress in restoring CPU and GPU nodes, and will be able to run jobs soon.

Finalizing /project access (which is still read-only) has been delayed by some space issues, which we are resolving.

miniGraham still has no compute nodes available, which means that job submission will fail. JupyterHub also depends on Slurm, so it is affected as well.

We expect to return /project to normal (read/write) status on Monday, Jan 28. We may be able to restart Slurm then, too, subject to progress on rewriting.
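Before submitting work, it can help to check whether Slurm reports any nodes that can accept jobs. The following is a minimal sketch, not an official tool: it assumes the standard `sinfo` command is on the login node's path, and the function names and the set of "usable" states are illustrative choices made for this example.

```python
import subprocess

# Node states treated as able to run jobs (an assumption for this sketch).
USABLE_STATES = {"idle", "mixed", "allocated", "completing"}


def count_usable_nodes(sinfo_output: str) -> int:
    """Sum node counts from `sinfo -h -o "%D %T"` output for usable states.

    States carrying a suffix flag (for example "idle*", a node that is
    not responding) do not match exactly and are deliberately skipped.
    """
    total = 0
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # ignore blank or unexpected lines
        count, state = parts
        if state.lower() in USABLE_STATES:
            total += int(count)
    return total


def usable_nodes_now() -> int:
    """Run sinfo and return the number of usable nodes (requires Slurm)."""
    result = subprocess.run(
        ["sinfo", "-h", "-o", "%D %T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_usable_nodes(result.stdout)
```

If `usable_nodes_now()` returns 0, the cluster probably has no compute nodes able to run jobs yet, and the status page remains the authoritative source.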

Further detail:

- Graham's Globus is not up yet. This involves some nodes that were moved, so it may require adjustment of network cables.

- Nearline (/nearline) is also not online. This also involves a few servers, which should become accessible in the coming days.


Incident description

Service    Incident status    Start Date    End Date
Graham     Open
Created by Mark Hahn on

Title


Planned Outage - Arrêt planifié


Summary


Graham is available for login and access to files.

The /project filesystem is read-only, as we verify that the migration is complete.
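Scripts that write results under /project may want to detect the read-only state before starting. One possible check, sketched here under the assumption that creating a temporary file is an acceptable probe, is shown below; the helper name `is_writable` and the example path are hypothetical.

```python
import tempfile


def is_writable(directory: str) -> bool:
    """Return True if a temporary file can be created in `directory`.

    A read-only filesystem, a missing directory, or insufficient
    permissions all raise OSError, which is reported as False.
    """
    try:
        # The temporary file is removed automatically when closed.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


# Example (hypothetical path; substitute your own project directory):
# if not is_writable("/project/my-group"):
#     print("Project storage is not writable yet; writing elsewhere.")
```

Probing with a real file creation is used here rather than a permission-bit check, since permission bits alone do not reveal that a filesystem is mounted read-only.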

Slurm currently has no compute nodes to run jobs; as we add compute nodes, jobs will become possible.


Updated by Mark Hahn on