Updates

Event Date	Summary
	Read/write access to /project, /work, and /scratch returned last week: new storage capacity allowed us to finish the migration from Graham's former Lustre storage. With full storage access restored, we've re-enabled Globus (for instance, the computecanada#graham-globus collection). Logins now go to the standard three redundant login nodes, gra-login[1-3]. Slurm has been updated and provides access to CPU nodes. We're currently running with a simplified set of partitions: just the b1 and b3 job lengths (<3h and <3d), with no special provisions for bycore vs bynode. Please let us know if you have trouble submitting jobs. (The simplified set of partitions does interact with Slurm job-submission scripting, but should still permit you to run the jobs you need.) NOTE: GPU nodes are not available yet; we're trying to fix a problem with the node image. gra-vdi has also not yet returned to service.
	We apologize for giving today, the 17th, as the target for the return of Graham, since it is a regional holiday. Further, shipment of the new storage hardware (needed for full read/write access on Graham) has been delayed in transit. We intend to open as soon as possible, hopefully this week.
	An email communication was sent on 3 Feb with some updates: We are notifying you of updates to the previously announced outage and capacity reduction schedule for the Graham compute cluster: New availability date: Starting February 17, the cluster will resume running jobs with a reduced computational capacity. No action is required. Graham is available for login and user storage is available. Project storage is read-only while we finish migrating the data to the new storage system. We have run into a few issues completing the migration, but it is nearly complete. Migrated data has brought the new storage system to capacity. We have ordered additional capacity and it will be installed the week of February 10. Until the new Nibi system is available, the reduced Graham cluster will have a simplified scheduling configuration. This is required because the cluster is smaller. Jobs can either be CPU or GPU, but only V100, T4, A100 and A5000 GPUs remain available. Auxiliary services like Globus and gra-vdi will become available as time permits. Specific details are available on the Graham wiki page. Please note that Graham cloud will remain operational during this period. We recognize that this may cause disruption for those relying on Graham, but these measures are necessary to make room for the new system, Nibi, which will provide even greater capacity and capabilities. We assure you your data will remain secure at all times. The status of Graham can be checked on https://status.alliancecan.ca. You can also visit the Infrastructure Renewal Wiki page, where National Host Sites continue to publish planned activities and anticipated impacts, including outages and system reductions. If you have any questions or concerns about this, please contact [email protected]. Thank you for your patience and support.
	We've made progress in restoring CPU and GPU nodes, and will be able to run jobs soon. Finalizing /project access (which is still read-only) has been delayed by some space issues, which we are resolving.
	miniGraham still has no compute nodes available, which means that job submission will fail. JupyterHub also depends on Slurm. We expect to return /project to normal (read/write) status on Monday, Jan 28. We may be able to restart Slurm then, too, subject to progress on rewriting.
	Further detail: Graham's Globus is not up yet; it involves some nodes that were moved, so it may require adjustment of network cables. Nearline (/nearline) is also not online; it also involves a few servers, which should become accessible in the coming days.

Incident description

Service	Incident status	Start Date	End Date
Graham	Closed

Created by Mark Hahn on

Title

Planned Outage - Arrêt planifié

Summary

Graham is available for login and access to files.

The /project filesystem is read-only while we verify that the migration is complete.

Slurm currently has no compute nodes to run jobs; as we add compute nodes, jobs will become possible.

Updated by Mark Hahn on