|Graham has been operating normally, with no recurrence of the previous problems. A small fraction of nodes are still being drained for reboot, but this is not affecting throughput.
|Most Graham nodes have been rebooted, and the new configuration appears to fix the previous problem. An additional problem has also been fixed: it caused spurious permission failures for some users. For instance, a job that opened an input file may have failed with an error message similar to "permission to access file denied".
During our recent upgrade, we updated packages related to CVMFS, which is used to deliver most software on CC systems. The updated version is not behaving well, causing error messages such as "Transport endpoint is not connected" and possibly also causing programs to crash with "bus error". We've downgraded to the previous known-good versions. To make sure any references are cleaned up, all Graham nodes are set to drain and will be automatically rebooted and returned to service once the jobs on them are complete. Users might notice reduced throughput.
Updated by Mark Hahn on