Bill Swisher

Joined: 10 Jun 13 Posts: 80 Credit: 61,510,959 RAC: 2,052
|
I might as well ask here...
Anybody got an idea about what's going on over at WCG? Seems to be TU.
|
|
Dr Who Fan

Joined: 28 May 06 Posts: 103 Credit: 289,080 RAC: 4
|
I might as well ask here...
Anybody got an idea about what's going on over at WCG? Seems to be TU.
Follow along in the WCG FORUM @ BOINC message boards starting with Message 116748
Grumpy Swede wrote: New update: https://www.cs.toronto.edu/~juris/jlab/wcg.html (click operational status heading) - August 29.
Also pushed to the BOINC client.
August 29, 2025
Full migration of WCG from the Graham to Nibi cloud facilities will be completed between 3:00-5:00 p.m. on August 31st, 2025
Sharcnet will then power down all hardware at Graham.
We have put in a ticket with UHN Digital to move our DNS records to the new IP addresses we have been allocated in Nibi cloud, and all storage, networking, and compute resources are already provisioned at Nibi.
We continue testing QA and Prod on the new infrastructure.
We will experience some downtime as *.worldcommunitygrid.org URLs switch over. We will be bringing down workunit creation scripting, BOINC server components, and upload/download servers in sequence, halting the database, performing a final rsync and then bringing down the website, forums, and internal services over the next 48h.
In the best case, our DNS records will be switched over on the 31st and everything behind the load balancer will be up and running. However, we want to prepare users for the possibility of additional downtime as we stand up prod on Nibi.
|
|
Bill Swisher

Joined: 10 Jun 13 Posts: 80 Credit: 61,510,959 RAC: 2,052
|
Thanks.
They seem to be overachievers. :-) The site disappeared sometime around Friday. I got lots of jobs stacked up waiting to upload.
|
|
Bill Swisher

Joined: 10 Jun 13 Posts: 80 Credit: 61,510,959 RAC: 2,052
|
Follow along in the WCG FORUM @ BOINC message boards starting with...
Thanks for that link, I've been following it. I can only say the "96 hours" added has come and gone. They might consider adding at least another 96 hours, maybe 192. :-)
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2454 Credit: 46,464,996 RAC: 2,192
|
Another update today.
Sounds positive - just taking a little longer than I'd like
September 8, 2025
Over the weekend we were able to restore the DB2 databases for the website and forums.
It was a redirected restore that first required a fully containerized instance of DB2 running the same OS as we were in Graham cloud, and we ran into issues attempting the restore of the final backups. Both databases are now successfully restored, and we have moved on to containerizing Websphere and IBM MQ.
We were able to restore the BOINC database.
As part of our work on MAM1 we developed an integration testing environment and containerized the BOINC database.
We also did this for the BOINC server components (scheduler, upload and download servers with file_upload_handler, transitioner/validators/assimilators etc.).
Once we get IBM MQ and Websphere up, we will be able to bring the entire system online shortly afterwards.
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2454 Credit: 46,464,996 RAC: 2,192
|
And another - tantalisingly close
September 9, 2025
We are finalizing the IBM MQ <-> DB2 <-> BOINC db <-> website axis, which will allow us to bring up the website. If all goes to plan now - we should have the website up tonight.
Once that is solved - we will go through the BOINC stack to ensure nothing catastrophic will happen once we let traffic through to the scheduler and upload/download servers. Then we can finally start letting the BOINC daemons manipulate state in the BOINC db.
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2454 Credit: 46,464,996 RAC: 2,192
|
I've still got files to upload to WCG, so I'm taking it that the WCG migration still hasn't fully worked out.
Messages since the last ones I posted:
September 12, 2025
Configuration of Websphere and IBM MQ is taking longer than expected. We are moving all provisioning, build, and deploy stages for all repos from Ansible and Gitlab CI to Dockerfiles and docker compose files, which is a step that precedes running these containers as StatefulSets on Kubernetes. So far, we have functional containers for IBM MQ, Websphere, DB2, MariaDB, and all BOINC endpoints up and running, and what we are still struggling through is configuration.
This approach will benefit site reliability and scalability in an obvious way on Kubernetes, and will improve our development and QA lifecycles drastically. It was also necessary to preserve maximum compatibility with the CentOS 7 virtual machines that the legacy stack was previously running on, a requirement for the redirected restore of the DB2 data for example, https://www.ibm.com/docs/en/db2/11.5.x?topic=restore-performing-redirected-operation.
So why are we not up, and when will we be up? We are debugging the entrypoint scripts for Websphere and IBM MQ containers. The website cannot be brought up until Websphere is up and configured correctly, receiving messages from all MQ sidecars across the stack, sending emails, etc. Each of the databases, the webserver, and the scheduler has to run MQ, and we are still adapting some of the previous mqsc and other runtime configuration for the MQ service to work with this new setup, where each important container that requires one gets an MQ sidecar container that uses the Ubuntu 24.04 host VM network.
September 17, 2025
WCG migration is complete.
Sorry about the much longer downtime than anticipated - but - we are finally back online - and Dylan was able to make some improvements as well.
We will share more details soon - but we also want to finish our start of the MAM project.
September 18, 2025
Forum issues detected.
Sorry about these problems starting so soon after we came back - but - hopefully Dylan has resolved them now.
It was all related to the AutoMaint feature of DB2. Most likely, the database went into one of its scheduled maintenance tasks, MVNForum tried to talk to the database and failed, MQ blocked MVNForum, and the lack of permissions on the queue showed up as this init check failure.
I see their forums are back to normal - full of complaints...
Can anyone point to the latest info from their admins, if any, or is it just radio-silence until they have something positive to say?
|
|
Dr Who Fan

Joined: 28 May 06 Posts: 103 Credit: 289,080 RAC: 4
|
I see their forums are back to normal - full of complaints...
Can anyone point to the latest info from their admins, if any, or is it just radio-silence until they have something positive to say?
Same as usual from WCG admin, "silence is golden" even if the natives (us crunchers) are ready to cause a mutiny!
|
|