Update: 2022.06.08 @ 14:00:
I am tentatively calling maintenance complete at this time. The unit hosting /home and associated shares now seems to be operating normally. It will be slow for the next several hours as it does various checks and performs a backup of /home, which hadn’t happened since early May.
The scheduler has been crashing over the last few days. I’m hopeful that clearing out the scheduler and restarting things during the maintenance will stop that. If you’ve recently started running jobs again, have started submitting a new type of job, or are doing anything else new within the last week or so, please make sure that your scripts are not submitting things to the queue too quickly, as I have been informed that this can make the scheduler crash. For example, if you have a script that submits 10 jobs, have it sleep for a few seconds between each submission to the queue.
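As a rough illustration of spacing out submissions, here is a minimal Python sketch. It is not an official tool for this cluster. The function name submit_with_delay, the 5-second default, and the script names are all hypothetical, and the submit command is passed in by the caller because this announcement does not name the scheduler's command.

```python
import subprocess
import time
from typing import Callable, Sequence


def submit_with_delay(
    job_scripts: Sequence[str],
    submit_cmd: Sequence[str],
    delay_seconds: float = 5.0,  # assumed value; "a few seconds" per the note above
    run: Callable = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Submit each job script in turn, pausing between submissions.

    submit_cmd is whatever command your scheduler uses to queue a job
    (an assumption here, supplied by the caller). Returns the number
    of jobs submitted.
    """
    submitted = 0
    for index, script in enumerate(job_scripts):
        if index > 0:
            # Pause before every submission except the first, so the
            # scheduler is not hit with a burst of requests.
            sleep(delay_seconds)
        # check=True stops the loop if a submission fails, rather than
        # continuing to push jobs at a scheduler that may be struggling.
        run([*submit_cmd, script], check=True)
        submitted += 1
    return submitted
```

A call might look like submit_with_delay(["job_01.sh", "job_02.sh"], submit_cmd=["qsub"]), where "qsub" is only a placeholder for your site's actual submit command. The run and sleep parameters exist so the function can be dry-run with stand-ins before pointing it at the real queue.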
I have removed the job hold as of 13:30 and new jobs seem to be processing without issue so far.
If you spot anything out of the ordinary today or in the future, please email us at firstname.lastname@example.org and we’ll look into it as soon as we can. If it is something that will require another maintenance outage, I’ll then announce that here.
It has come to our attention that /home and associated shares/mounts are not happy due to some complications from the last power outage, and have been limping along since then. Next Wednesday, June 8th, there will be downtime from noon (12:00) until 3:45 pm (15:45) Pacific time to attempt recovery. I will be placing a scheduling hold for those hours by the end of the day today.
POC: Dave Anderson