Message boards : Number crunching : 3 x 36-Processor Machines with CPU set to 50% are now working
Author | Message |
---|---|
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,170,646 RAC: 10,441 |
The Rosetta conversion to vbox caused big problems for me. 1. I had to figure out the Rosetta ALLOW switch. 2. I had to limit the number of Rosetta jobs active on the computer (currently 8 GB/job) with a 3-line app_config.xml. 3. I found high-memory errors in one machine that had been running fine. 4. I had to load VirtualBox packages on a Linux machine so the vbox jobs would run. I think things have stabilized. 64 GB Fedora Linux machine: I had to load the VirtualBox package to fix COMPUTATION ERRORS. 64 GB Windows 11 machine: Heavy disk usage caused by WU setup and runtime paging from lack of memory. Near-zero CPU usage. Long runs. I LIMITED the maximum Rosetta jobs to 8. I can probably relax that some. The jobs seem to want 3 GB to start with, but demand more later in the computation. The failures likely occurred when disk space requests were exhausted. The "app_config.xml" file at C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\app_config.xml (3 lines) limits the number of project jobs executed simultaneously: <app_config> <project_max_concurrent> 8 </project_max_concurrent> </app_config> 128 GB Windows 11 machine: Frequent stalled jobs with little CPU usage. Constant high disk usage. Isolated two bad memory sticks in the 64 GB to 128 GB memory range. 2 x 16 GB DIMM sticks on order. Added the 3-line app_config.xml file above. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
<app_config> There was a bug in BOINC that would flood your machine with too many work units if you used that tag. It has been noted, but I don't know if it has been fixed yet. I use PrimoCache on Win10 (Ryzen 3600) with 96 GB of write-cache to run my pythons. I think I can run nine of them without problems, maybe more. But six of them take 56 GB of cache, and the writes with only six pythons are 1.25 TB/day. That will kill most SSDs in a few months. Linux is hopeless. |
Bryn Mawr Joined: 26 Dec 18 Posts: 396 Credit: 12,254,928 RAC: 11,616 |
<app_config> It has not, but it is not as simple as "use this tag and you will be flooded." I’ve used exactly that app_config file on all my projects for several years and never had a problem. |
Jean-David Beyer Joined: 2 Nov 05 Posts: 194 Credit: 6,540,448 RAC: 8,022 |
<app_config> Well, I have been using that for a couple of years now, and have had no trouble with it. [/var/lib/boinc/projects/boinc.bakerlab.org_rosetta]$ cat app_config.xml <app_config> <project_max_concurrent>6</project_max_concurrent> </app_config> I am running Computer 5910575. Computer information: CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]; Number of processors: 16; Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.20.1.el8_5.x86_64|libc 2.28 (GNU libc)]; BOINC version: 7.16.11; Memory: 63902.14 MB; Cache: 16896 KB |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"It has not, but it is not as simple as 'use this tag and you will be flooded.' I’ve used exactly that app_config file on all my projects for several years and never had a problem." You can investigate it in more detail, and maybe avoid the problem, or not, as the case may be. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5726&postid=45384#45384 |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,170,646 RAC: 10,441 |
"It has not, but it is not as simple as 'use this tag and you will be flooded.' I’ve used exactly that app_config file on all my projects for several years and never had a problem." I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine. Your suggestion of a disk cache with WRITE BACK enabled is very good. It will reduce disk write traffic and spare the SSD/HDD. VirtualBox BOINC crunchers can decide whether to use memory to reduce disk writes or to run more jobs. Thanks |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine." It isn't the .xml file itself that is the problem, but the "<project_max_concurrent>" tag (also the "<max_concurrent>" tag). Under certain conditions, BOINC thinks it needs to download more work. You can check it with a test case. https://github.com/BOINC/boinc/issues/4322 It caused me problems here the last time I used it a year or two ago, and as far as I have seen, no one has said that it has been fixed yet. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,170,646 RAC: 10,441 |
"I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine." I watch my changes to the configuration until I am sure they work without problems. I have never had problems with this particular option, but I will watch more closely ... just in case. How did you set up PrimoCache? Did you enable DEFER-WRITES or ... ??? |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
How did you set up PrimoCache? Did you enable DEFER-WRITES or ... ??? Yes, I enable "Defer writes" with an "infinite" latency. That way it acts like a ramdisk. Note that you don't have to set "infinite", you could try a shorter period. Normally 4 hours or so should work for the amount of writes it produces. But each time I try that, I still get several hundred GB written to the disk every 24 hours, so I just do the infinite. I suspect the .VDI files are not being cached for some reason. As you can see, you need lots of memory. I use 96 GB for the write-cache, and have 128 GB total. It is not a low-resource project. You could just use a ramdisk instead, but if you get the one from Primo, you need "Ramdisk Ultimate" for a 64 GB size ramdisk. The one from Dataram is a bit cheaper, and should work, but the license is tied to the original PC. The one from Primo can be transferred upon request. You then place the entire BOINC Data folder on the ramdisk. It should fit about 8 work units in a 64 GB ramdisk. Or just run on 4 cores; that should fit on a 32 GB ramdisk, more or less. |
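
The sizing rules of thumb in this thread (roughly 8 GB per Rosetta VirtualBox job, so about 8 work units in 64 GB and 4 in 32 GB) can be sketched in Python as follows. This is only a sketch under stated assumptions: the 8 GB figure is what posters here observed, not a project specification, and `max_jobs`, `render_app_config`, and the `reserve_gb` parameter are hypothetical helpers for illustration, not part of BOINC.

```python
GB_PER_JOB = 8  # assumption: rough per-job memory figure reported in this thread


def max_jobs(total_gb, gb_per_job=GB_PER_JOB, reserve_gb=0):
    """Return how many jobs fit in total_gb after holding back reserve_gb."""
    usable = total_gb - reserve_gb
    if usable < gb_per_job:
        return 0
    return int(usable // gb_per_job)


def render_app_config(limit):
    """Return the text of a 3-line app_config.xml capping concurrent project jobs."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        "<app_config>\n"
        f"  <project_max_concurrent>{limit}</project_max_concurrent>\n"
        "</app_config>\n"
    )


if __name__ == "__main__":
    # 64 GB of memory (or ramdisk) at about 8 GB per job gives a limit of 8.
    print(render_app_config(max_jobs(64)), end="")
```

The printed text would go in app_config.xml in the Rosetta project folder mentioned earlier in the thread. Since jobs reportedly start near 3 GB and grow later, setting `reserve_gb` to leave headroom for the operating system may be the safer choice, and the caveat about `<project_max_concurrent>` raised above still applies.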
©2024 University of Washington
https://www.bakerlab.org