Rosetta Python floods disks with snapshots

Author	Message
computezrmle Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0	Message 104262 - Posted: 15 Jan 2022, 19:28:49 UTC

I don't run Rosetta's python (vbox) tasks. Nonetheless, a couple of comments might be interesting for those who do.

Found this in the logfiles:

2022-01-15 21:08:35 (2464): Setting Memory Size for VM. (6144MB)

This means the VM allocates RAM up to 6144 MB. RAM that is really in use becomes important looking at the next entries.

2022-01-15 21:21:13 (2464): Creating new snapshot for VM.
2022-01-15 21:21:19 (2464): Checkpoint completed.
2022-01-15 21:31:13 (2464): Creating new snapshot for VM.
2022-01-15 21:31:19 (2464): Deleting stale snapshot.
2022-01-15 21:31:20 (2464): Checkpoint completed.
2022-01-15 21:41:14 (2464): Creating new snapshot for VM.
2022-01-15 21:41:20 (2464): Deleting stale snapshot.
2022-01-15 21:41:21 (2464): Checkpoint completed.
. . .

This means vboxwrapper writes a snapshot to disk (to the snapshot directory below .../slots/n) every 10 minutes. This snapshot includes an image of the RAM used by the VM at that moment. Hence, the size could be small, but it could also be the full 6144 MB mentioned above. Your disks are happy about that, especially SSDs.

It might be worth testing whether those snapshots are really required for Rosetta tasks. If not, the project admins should add "<disable_automatic_checkpoints/>" to the vbox_job.xml delivered as part of the app.

Volunteers who want to test it should do the following steps:
1. Shut down BOINC
2. Insert "<dont_check_file_sizes>1</dont_check_file_sizes>" in cc_config.xml (remove it after the test)
3. Insert "<disable_automatic_checkpoints/>" in Rosetta's vbox_job.xml (I don't know which filename they use, but it is mentioned in the softlink you find in the slots directory)
4. Start BOINC
5. Run a new Rosetta python task and check its stderr.txt as well as the corresponding VM's snapshot folder.
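For anyone who prefers to script steps 2 and 3 rather than edit the files by hand, here is a minimal Python sketch. It works on XML text only and does not touch any files. It assumes that cc_config.xml keeps its switches under an <options> element and that the job file's root element is <vbox_job>; the function names and the sample inputs are made up for illustration, so check them against your own files before relying on this.

```python
import xml.etree.ElementTree as ET


def set_dont_check_file_sizes(cc_config_xml: str) -> str:
    """Return cc_config.xml text with dont_check_file_sizes set to 1.

    Assumption: the switch lives under <options> inside <cc_config>.
    """
    root = ET.fromstring(cc_config_xml)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


def disable_automatic_checkpoints(vbox_job_xml: str) -> str:
    """Return the job file text with <disable_automatic_checkpoints/> added once.

    Assumption: the element goes directly under the root element.
    """
    root = ET.fromstring(vbox_job_xml)
    if root.find("disable_automatic_checkpoints") is None:
        ET.SubElement(root, "disable_automatic_checkpoints")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder inputs, not real project files.
    print(set_dont_check_file_sizes("<cc_config><options></options></cc_config>"))
    print(disable_automatic_checkpoints("<vbox_job></vbox_job>"))
```

Both functions are safe to run twice: the second call leaves the text unchanged. Remember to remove the dont_check_file_sizes entry again after the test, as step 2 says.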

Grant (SSSF) Joined: 28 Mar 20 Posts: 1926 Credit: 18,534,891 RAC: 0	Message 104267 - Posted: 15 Jan 2022, 21:18:19 UTC - in response to Message 104262.

computezrmle wrote: "This means the VM allocates RAM up to 6144 MB. RAM that is really in use becomes important looking at the next entries."

RAM in use is generally only a GB or so.

The problem with no snapshots: if BOINC has to restart at any time, all the work done up to that point on any Tasks not yet completed is lost.

As it is, the RAM requirement was meant to have been changed to 3GB. Looks like it might have reverted to its previous value, which would explain the drop in the number of Python Tasks being processed at any given time, even with the lack of Rosetta 4.20 work.

Grant
Darwin NT
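To put rough numbers on the two RAM figures mentioned so far (about 1 GB in use versus the 6144 MB allocation), the arithmetic can be sketched in a few lines of Python. This is an estimate under stated assumptions, not a measurement: it assumes one task running all day, that every snapshot writes the full RAM image once, and that the RAM image dominates the snapshot size. The one-hour interval is included only for comparison.

```python
SECONDS_PER_DAY = 86_400


def daily_snapshot_writes_mb(snapshot_mb: float, interval_s: int) -> float:
    """Estimated MB written per day by one task's periodic snapshots.

    Assumption: each snapshot writes snapshot_mb once, every interval_s seconds.
    """
    snapshots_per_day = SECONDS_PER_DAY / interval_s
    return snapshot_mb * snapshots_per_day


if __name__ == "__main__":
    for label, size_mb in (("about 1 GB in use", 1024), ("full 6144 MB", 6144)):
        for interval_s in (600, 3600):
            total_mb = daily_snapshot_writes_mb(size_mb, interval_s)
            print(f"{label}, every {interval_s} s: {total_mb / 1024:.0f} GB per day")
```

With these assumptions a 1 GB snapshot every 10 minutes comes to 144 GB per day per task, and the 6144 MB worst case to 864 GB per day. Multiply by the number of concurrent python tasks for a per-host figure.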

computezrmle Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0	Message 104270 - Posted: 15 Jan 2022, 21:42:12 UTC - in response to Message 104267.

Might be worth checking this. My experience with CMS from LHC@home:
- They use "<disable_automatic_checkpoints/>"
- I set my BOINC client's checkpoint interval to ~3200 s
- If I suspend/resume a task before that point, it starts from scratch
- If I suspend/resume after that point, it writes a snapshot and continues from there

.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0	Message 104274 - Posted: 16 Jan 2022, 1:39:18 UTC

Looks like they checkpoint every ten minutes. This line is from one of my valid work units:

2022-01-15 18:25:40 (3920): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))

I have set the checkpoint interval to 3600 seconds [one hour] in BOINC Manager [I don't reboot etc. unless I have to]. I will have a play with this idea and see what I can break :)
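The rule stated in that log line ("Higher value of" the preference and the Vbox_job.xml setting) can be written out as a one-line Python sketch. The function name is made up for illustration; this only restates what the log message says and is not taken from the vboxwrapper source.

```python
def effective_checkpoint_interval(preference_s: int, vbox_job_s: int) -> int:
    """Return the higher of the client preference and the Vbox_job.xml value."""
    return max(preference_s, vbox_job_s)


if __name__ == "__main__":
    # Values from the log line above: preference 60 s, Vbox_job.xml 600 s.
    print(effective_checkpoint_interval(60, 600))
    # A one-hour preference would win over the 600 s in Vbox_job.xml.
    print(effective_checkpoint_interval(3600, 600))
```

The practical consequence is that a volunteer can lengthen the snapshot interval through the client preference, but cannot shorten it below the value shipped in Vbox_job.xml.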

JAMES DORISIO Joined: 25 Dec 05 Posts: 15 Credit: 213,917,639 RAC: 100	Message 104280 - Posted: 16 Jan 2022, 15:30:07 UTC

I have had the checkpoint interval set to 3600 seconds. Below is the stderr output from a python task:

2022-01-16 01:57:49 (9405): Setting checkpoint interval to 3600 seconds. (Higher value of (Preference: 3600 seconds) or (Vbox_job.xml: 600 seconds))
2022-01-16 02:58:03 (9405): Creating new snapshot for VM.
2022-01-16 02:58:12 (9405): Checkpoint completed.
2022-01-16 03:36:54 (9405): Status Report: Elapsed Time: '6000.749009'
2022-01-16 03:36:54 (9405): Status Report: CPU Time: '5950.310000'
2022-01-16 03:58:29 (9405): Creating new snapshot for VM.
2022-01-16 03:58:38 (9405): Deleting stale snapshot.
2022-01-16 03:58:38 (9405): Checkpoint completed.
2022-01-16 04:58:55 (9405): Creating new snapshot for VM.
2022-01-16 04:59:04 (9405): Deleting stale snapshot.
2022-01-16 04:59:04 (9405): Checkpoint completed.
2022-01-16 05:15:58 (9405): Status Report: Elapsed Time: '12001.543604'
2022-01-16 05:15:58 (9405): Status Report: CPU Time: '11930.110000'

It looks like it is using the higher value of 3600 seconds (1 hour). I am going to try changing this to 7200 seconds to see what happens, as these computers are on 24 hours a day and rarely reboot.

Jim
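To check the snapshot spacing in your own stderr output instead of reading timestamps by eye, a short Python sketch like the following can help. It assumes the line layout seen in this thread ("timestamp (pid): message") and uses the "Creating new snapshot for VM." lines quoted above as sample input; the function names are made up for illustration.

```python
import re
from datetime import datetime

# Layout assumed from the log lines quoted in this thread.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")

SAMPLE_LOG = """\
2022-01-16 02:58:03 (9405): Creating new snapshot for VM.
2022-01-16 02:58:12 (9405): Checkpoint completed.
2022-01-16 03:58:29 (9405): Creating new snapshot for VM.
2022-01-16 03:58:38 (9405): Deleting stale snapshot.
2022-01-16 03:58:38 (9405): Checkpoint completed.
2022-01-16 04:58:55 (9405): Creating new snapshot for VM.
2022-01-16 04:59:04 (9405): Deleting stale snapshot.
2022-01-16 04:59:04 (9405): Checkpoint completed.
"""


def snapshot_times(log_text: str) -> list[datetime]:
    """Collect the timestamps of all 'Creating new snapshot' lines."""
    times = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == "Creating new snapshot for VM.":
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def snapshot_intervals_s(log_text: str) -> list[int]:
    """Seconds between consecutive snapshot starts."""
    times = snapshot_times(log_text)
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(snapshot_intervals_s(SAMPLE_LOG))  # prints [3626, 3626]
```

For the log above, the snapshots start 3626 seconds apart, which matches the reading that the 3600-second preference is in effect rather than the 600 seconds from Vbox_job.xml.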

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104281 - Posted: 16 Jan 2022, 15:50:44 UTC - in response to Message 104262. Last modified: 16 Jan 2022, 16:39:05 UTC

Very nice. I have now set the write interval to 3600 seconds. Thanks.

EDIT: But I have to reboot this Ubuntu machine at least once a day to restart the "Vm job unmanageable" jobs, and would take a big hit. So I will go back to 600 seconds and rely on my large write cache to protect my SSD. The writes by the pythons are much larger than the checkpoints, it seems.