3 x 36-Processor Machines with CPU set to 50% are now working

Message boards : Number crunching : 3 x 36-Processor Machines with CPU set to 50% are now working

rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 22,950,823
RAC: 17,561
Message 105718 - Posted: 27 Mar 2022, 0:41:33 UTC

The Rosetta conversion to vbox caused big problems for me.

1. I had to figure out the Rosetta ALLOW switch.
2. I had to limit the number of Rosetta jobs active on the computer (currently 8 GB/job) with a 3-line app_config.xml.
3. I found errors in the upper memory range of one machine that had been running fine.
4. I had to install the VirtualBox packages on a Linux machine so the vbox jobs would run.

I think things have stabilized.


64 GB Fedora Linux machine
I had to install the VirtualBox package to fix COMPUTATION ERRORS.


64 GB Windows 11 machine
Heavy disk usage caused by WU setup and by runtime paging from lack of memory.
Near-zero CPU usage. Long runs.
I limited the maximum number of Rosetta jobs to 8. I can probably relax that somewhat. The jobs seem to want 3 GB to start with, but demand more later in the computation.
The failures likely occurred when disk space was exhausted.

"app_config.xml" file at C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\app_config.xml (3 lines) limits the number of project jobs executed simultaneously.

<app_config>
<project_max_concurrent> 8 </project_max_concurrent>
</app_config>
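
For anyone sizing that limit on a different machine, here is a rough sketch of the arithmetic, assuming the 8 GB/job figure mentioned above. The function names and the optional reserve are made up for illustration; none of this is part of BOINC.

```python
def max_concurrent_jobs(total_ram_gb, per_job_gb, reserve_gb=0.0):
    """Largest job count whose combined memory fits in RAM after a reserve.

    All three figures are estimates supplied by the user (hypothetical helper).
    """
    if per_job_gb <= 0:
        raise ValueError("per_job_gb must be positive")
    usable_gb = total_ram_gb - reserve_gb
    return max(0, int(usable_gb // per_job_gb))


def render_app_config(limit):
    """Return the text of a 3-line app_config.xml with the given limit."""
    return (
        "<app_config>\n"
        f"<project_max_concurrent>{limit}</project_max_concurrent>\n"
        "</app_config>\n"
    )


if __name__ == "__main__":
    # 64 GB machine at roughly 8 GB per job, no reserve for the OS
    limit = max_concurrent_jobs(64, 8)
    print(render_app_config(limit))
```

With 64 GB and 8 GB per job this gives the limit of 8 used here. Setting a non-zero reserve for the operating system lowers it.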



128 GB Windows 11 machine
Frequent stalled jobs with little CPU usage. Constant high disk usage.
Isolated two bad memory sticks in the 64 GB to 128 GB memory range.
2 x 16 GB DIMM sticks on order.
Added the 3-line app_config.xml file shown above.
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 105719 - Posted: 27 Mar 2022, 0:55:27 UTC - in response to Message 105718.  
Last modified: 27 Mar 2022, 0:59:35 UTC

<app_config>
<project_max_concurrent> 8 </project_max_concurrent>
</app_config>

There was a bug in BOINC that would flood your machine with too many work units if you used that tag.
It has been noted, but I don't know if it has been fixed yet.

I use PrimoCache on Win10 (Ryzen 3600) with 96 GB of write-cache to run my pythons. I think I can run nine of them without problems, maybe more.
But six of them take 56 GB of cache, and the writes with only six pythons are 1.25 TB/day. That will kill most SSDs in a few months.
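
To put a number on "a few months", here is a back-of-the-envelope sketch. The 1.25 TB/day figure is the one measured above. The endurance ratings (TBW) in the loop are hypothetical example values, not the specs of any particular drive, so check the datasheet of your own SSD.

```python
def days_until_rated_endurance(rated_tbw_tb, writes_tb_per_day):
    """Days of sustained writing before a drive's rated TBW is reached."""
    if writes_tb_per_day <= 0:
        raise ValueError("writes_tb_per_day must be positive")
    return rated_tbw_tb / writes_tb_per_day


if __name__ == "__main__":
    writes_per_day = 1.25  # TB/day with six pythons, as reported above
    # Hypothetical endurance ratings in TB written; real drives vary widely
    for rated_tbw in (150, 300, 600):
        days = days_until_rated_endurance(rated_tbw, writes_per_day)
        print(f"{rated_tbw} TBW rating: about {days:.0f} days")
```

A drive can outlive its rating, so treat the result as a warranty-style estimate and not as a failure date.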

Linux is hopeless.
Bryn Mawr

Joined: 26 Dec 18
Posts: 389
Credit: 12,042,062
RAC: 14,396
Message 105721 - Posted: 27 Mar 2022, 11:16:19 UTC - in response to Message 105719.  

<app_config>
<project_max_concurrent> 8 </project_max_concurrent>
</app_config>

There was a bug in BOINC that would flood your machine with too many work units if you used that tag.
It has been noted, but I don't know if it has been fixed yet.


I use PrimoCache on Win10 (Ryzen 3600) with 96 GB of write-cache to run my pythons. I think I can run nine of them without problems, maybe more.
But six of them take 56 GB of cache, and the writes with only six pythons are 1.25 TB/day. That will kill most SSDs in a few months.

Linux is hopeless.


It has not, but it is not as simple as "use this tag and you will be flooded". I’ve used exactly that app_config file on all my projects for several years and never had a problem.
Jean-David Beyer

Joined: 2 Nov 05
Posts: 187
Credit: 6,328,464
RAC: 6,028
Message 105723 - Posted: 27 Mar 2022, 14:25:08 UTC - in response to Message 105719.  

<app_config>
<project_max_concurrent> 8 </project_max_concurrent>
</app_config>

There was a bug in BOINC that would flood your machine with too many work units if you used that tag.
It has been noted, but I don't know if it has been fixed yet.

Well, I have been using that for a couple of years now, and have had no trouble with it.

[/var/lib/boinc/projects/boinc.bakerlab.org_rosetta]$ cat app_config.xml 
<app_config>
 <project_max_concurrent>6</project_max_concurrent>
</app_config>
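
If you want to check what limit a given app_config.xml actually sets, a minimal sketch using Python's standard XML parser could look like this. The helper name is made up for illustration, and the BOINC client does its own parsing, so this only confirms that the file is well-formed and shows the value it contains.

```python
import xml.etree.ElementTree as ET


def read_project_max_concurrent(xml_text):
    """Return the <project_max_concurrent> value as an int, or None if absent."""
    root = ET.fromstring(xml_text)
    if root.tag != "app_config":
        raise ValueError("root element is not <app_config>")
    value = root.findtext("project_max_concurrent")
    if value is None:
        return None
    # Surrounding whitespace, as in "<tag> 8 </tag>", is stripped before conversion
    return int(value.strip())


if __name__ == "__main__":
    sample = (
        "<app_config>\n"
        " <project_max_concurrent>6</project_max_concurrent>\n"
        "</app_config>\n"
    )
    print(read_project_max_concurrent(sample))
```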

I am running

Computer 5910575
Computer information

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16

Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.20.1.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version 	7.16.11
Memory 	63902.14 MB
Cache 	16896 KB

Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 105724 - Posted: 27 Mar 2022, 16:25:28 UTC - in response to Message 105721.  

It has not, but it is not as simple as "use this tag and you will be flooded". I’ve used exactly that app_config file on all my projects for several years and never had a problem.

You can investigate it in more detail, and maybe avoid the problem, or not, as the case may be.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5726&postid=45384#45384
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 22,950,823
RAC: 17,561
Message 105745 - Posted: 28 Mar 2022, 19:24:40 UTC - in response to Message 105724.  

It has not, but it is not as simple as "use this tag and you will be flooded". I’ve used exactly that app_config file on all my projects for several years and never had a problem.

You can investigate it in more detail, and maybe avoid the problem, or not, as the case may be.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5726&postid=45384#45384


I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine.

Your suggestion of a disk cache with write-back enabled is very good. It will reduce disk write traffic and spare the SSD/HDD. VirtualBox BOINC crunchers can decide whether to use memory to reduce disk writes or to run more jobs.

Thanks
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 105748 - Posted: 28 Mar 2022, 22:32:56 UTC - in response to Message 105745.  

I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine.

It isn't the .xml file itself that is the problem, but the "<project_max_concurrent>" tag (also the "<max_concurrent>" tag).
Under certain conditions, BOINC thinks it needs to download more work.

You can check it with a test case. https://github.com/BOINC/boinc/issues/4322
It caused me problems here the last time I used it, a year or two ago, and I have not seen anyone say that it has been fixed.
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 22,950,823
RAC: 17,561
Message 105764 - Posted: 31 Mar 2022, 16:58:24 UTC - in response to Message 105748.  

I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine.

It isn't the .xml file itself that is the problem, but the "<project_max_concurrent>" tag (also the "<max_concurrent>" tag).
Under certain conditions, BOINC thinks it needs to download more work.

You can check it with a test case. https://github.com/BOINC/boinc/issues/4322
It caused me problems here the last time I used it, a year or two ago, and I have not seen anyone say that it has been fixed.


I watch my configuration changes until I am sure they work without problems. I have never had problems with this particular option, but I will watch more closely ... just in case.

How did you set up PrimoCache? Did you enable DEFER-WRITES or ... ???
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 105765 - Posted: 31 Mar 2022, 17:44:25 UTC - in response to Message 105764.  
Last modified: 31 Mar 2022, 17:52:44 UTC

How did you set up PrimoCache? Did you enable DEFER-WRITES or ... ???

Yes, I enable "Defer writes" with an "infinite" latency. That way it acts like a ramdisk.
Note that you don't have to set "infinite", you could try a shorter period. Normally 4 hours or so should work for the amount of writes it produces.
But each time I try that, I still get several hundred GB written to the disk every 24 hours, so I just do the infinite. I suspect the .VDI files are not being cached for some reason.

As you can see, you need lots of memory. I use 96 GB for the write-cache, and have 128 GB total. It is not a low-resource project.




You could just use a ramdisk instead, but if you get the one from Primo, you need "Ramdisk Ultimate" for a 64 GB size ramdisk.
The one from Dataram is a bit cheaper, and should work, but the license is tied to the original PC. The one from Primo can be transferred upon request.
You then place the entire BOINC Data folder on the ramdisk.

About 8 work units should fit in a 64 GB ramdisk.
Or just run on 4 cores; that should fit on a 32 GB ramdisk, more or less.




©2024 University of Washington
https://www.bakerlab.org