Message boards : Number crunching : task swamping on multi-project host guidance requested
Viktor | Joined: 7 Jul 08 | Posts: 5 | Credit: 3,281,899 | RAC: 81
Howdy all, I have a Linux machine running BOINC 24/7. I run Milkyway@home on 1 core/1 GPU, Einstein@home on 1 core/1 GPU, and Rosetta@home on 4 CPU cores. To accomplish this I have my Rosetta app_config set to: <project_max_concurrent>4</project_max_concurrent>

This works great, except as soon as I accept tasks Rosetta@home feels the need to give me 1000 tasks which are due in 5 minutes. (Exaggeration, but not by much.) If I turn my cache down to 0.01 + 0.01 days, which seems to be the overall preferred "fix" after much Google action, my GPU projects starve due to lack of cache. Ideas?
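For reference, the one-line app_config.xml described here would look like this in the Rosetta project directory:

```xml
<!-- app_config.xml in the Rosetta@home project directory -->
<app_config>
    <!-- run at most 4 Rosetta tasks at once -->
    <project_max_concurrent>4</project_max_concurrent>
</app_config>
```

Note that project_max_concurrent limits how many tasks *run* at once, not how many the scheduler *sends*, which is the mismatch behind the flood of tasks described above.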
Grant (SSSF) | Joined: 28 Mar 20 | Posts: 1682 | Credit: 17,854,150 | RAC: 18,215
> Howdy all,

Don't use project_max_concurrent. With the number of cores/threads limited for Rosetta, the system will struggle to do enough work to meet your Resource share settings, as the GPU projects will always outperform the work done by CPU-only Rosetta. So in order to do enough Rosetta work to catch up with the GPU projects, it will need to stop doing GPU work to allow Rosetta to catch up. Give Rosetta more cores & threads, and the GPUs can continue to crunch without getting way ahead of Rosetta in work done.

Ideally, use an app_config.xml file to reserve a CPU core/thread to support your GPUs (if needed), but allow all projects to use all available CPU cores/threads that aren't being used to support a GPU.

With more than one project, no cache is best, as it will allow your Resource share settings to be met in a matter of days (or weeks) and not months (possibly many months). As long as the estimated completion time for any Rosetta tasks you get is around 8 hours, and Rosetta can use all the available CPU cores/threads (other than the 2 reserved to support the GPUs), with no cache things should settle down within 24 hours.

We did have a batch of work that was erroring out in a matter of seconds, and a couple of other batches that could error out after only an hour or 2, but they have been cleared up so things should settle down now.

Grant
Darwin NT
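The "reserve a CPU core per GPU task" part is done with a gpu_versions section in the GPU project's own app_config.xml. A sketch for Einstein@Home (the app name hsgamma_FGRPB1G matches the process name reported later in this thread, but adjust it to whatever your client actually shows):

```xml
<!-- app_config.xml in the Einstein@Home project directory -->
<app_config>
    <app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>  <!-- one task per GPU -->
            <cpu_usage>1.0</cpu_usage>  <!-- reserve a full CPU core per GPU task -->
        </gpu_versions>
    </app>
</app_config>
```

With cpu_usage at 1.0, the client subtracts one core from the CPU-task budget for each running GPU task, which is what keeps Rosetta from oversubscribing the CPU.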
floyd | Joined: 26 Jun 14 | Posts: 23 | Credit: 10,268,639 | RAC: 0
First, I agree with Grant's analysis of the underlying problem. Second, I'd like to suggest another course of action which may be more to your liking. Some remarks ahead:

* Don't use project_max_concurrent, and if you do, make sure to adjust "use n% of CPUs" accordingly. Otherwise you can expect BOINC to fetch more tasks than you allow it to actually process.
* Don't insist on running 1 Milkyway, 1 Einstein and 4 Rosetta tasks at all times. Run the projects CPU-only or GPU-only and use resource share to balance projects within each group; this will not work across groups.
* Keep your cache of work small, but you don't need to go as far as 0.01 days. Maybe 0.1 to 0.5 days is good.

Plan 1 (this is mostly what Grant already suggested): Configure your GPU projects to reserve 1 CPU per task. Configure BOINC to use 100% of CPUs, or whatever your preferred maximum is.
Pro: Will always fully load your CPU.
Con: May not run GPU work at all times.

Plan 2: Set "use n% of CPUs" for CPU tasks only. Make sure there's one CPU left for any possible GPU task running concurrently. Configure the GPU projects to reserve 0.1 (or even less) CPUs per task, so the total of all possible GPU tasks is less than 1 CPU.
Pro: Will always run GPU work if available.
Con: If there is no GPU work, it will not do more CPU work instead.

I think plan 2 is more like what you want.
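Plan 2's "0.1 CPUs per GPU task" would translate to something like this in each GPU project's app_config.xml (a sketch; the app name milkyway is an assumption here, use the name your client reports):

```xml
<!-- app_config.xml in the GPU project's directory -->
<app_config>
    <app>
        <name>milkyway</name>  <!-- assumed app name; check your client -->
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>   <!-- one task per GPU -->
            <cpu_usage>0.1</cpu_usage>   <!-- schedule only a fraction of a CPU -->
        </gpu_versions>
    </app>
</app_config>
```

With two such tasks running, the total scheduled CPU usage is 0.2, which is below one whole core, so the client reserves no cores for GPU support and the full "use n% of CPUs" budget stays available for CPU tasks.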
Jim1348 | Joined: 19 Jan 06 | Posts: 881 | Credit: 52,257,545 | RAC: 0
> This works great except as soon as I accept tasks Rosetta@home feels the need to give me 1000 tasks which are due in 5 minutes. (Exaggeration, but not by much.) If I turn my cache to .01 - .01 which seems to be the overall preferred "fix" after much google action my gpu projects starve due to lack of cache.

Recent (in the last couple of years) versions of BOINC have a strange problem due to a change in the scheduler, where they randomly go berserk and download too many work units. I have posted on it in a number of forums. It will eventually correct itself, but in the meantime you can do some of the other fixes.
mikey | Joined: 5 Jan 06 | Posts: 1895 | Credit: 9,169,305 | RAC: 3,078
> This works great except as soon as I accept tasks Rosetta@home feels the need to give me 1000 tasks which are due in 5 minutes. (Exaggeration, but not by much.) If I turn my cache to .01 - .01 which seems to be the overall preferred "fix" after much google action my gpu projects starve due to lack of cache.

AND it's important to remember that aborting unwanted tasks is an okay thing to do!!! JUST because you got sent a bazillion tasks doesn't mean you have to actually try and finish them; abort the unwanted ones.
Viktor | Joined: 7 Jul 08 | Posts: 5 | Credit: 3,281,899 | RAC: 81
Thank you guys for your thoughtful replies. I will tinker with settings and see if I can get the desired behavior out of my setup. I like the second plan proposed. I do not want my GPUs idle, and I need to hold back 2 cores for other non-BOINC work.
Viktor | Joined: 7 Jul 08 | Posts: 5 | Credit: 3,281,899 | RAC: 81
Thanks again all who helped. Asking for aid and then not giving updates is a dick'ish move... thus:

* Gutted all my controls via app_config on projects
* Changed my prefs to use max of 75% of cores
* Kept my cc_config GPU exclusions to force certain GPU apps onto certain GPUs
* Verified .5 day cache with .01 additional

Updated all projects and kickstarted it. Rosetta took 5 cores, GPU projects 1 per, 2 total.

* Changed my prefs to use max of 74% of cores because inclusive programmer math. Oops.

Updated all projects and kickstarted it. Rosetta took 4 cores, GPU projects 1 per. Rosetta has 4 tasks waiting in reserve, which is perfect.
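For anyone following along, GPU exclusions of the kind mentioned above take this general form in cc_config.xml (a sketch; the project URL and device number are placeholders for your own setup):

```xml
<!-- cc_config.xml in the BOINC data directory -->
<cc_config>
    <options>
        <exclude_gpu>
            <url>http://milkyway.cs.rpi.edu/milkyway/</url>  <!-- project URL, example -->
            <device_num>1</device_num>                       <!-- GPU index to exclude -->
        </exclude_gpu>
    </options>
</cc_config>
```

Each exclude_gpu entry stops one project from using one GPU; pairing them up across projects is how you pin each GPU app to a specific card.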
floyd | Joined: 26 Jun 14 | Posts: 23 | Credit: 10,268,639 | RAC: 0
> * Gutted all my controls via app_config on projects

Please don't be so vague; undoing app_config settings is not trivial. Of course deleting the file is not enough, but reloading the (now non-existent) configuration, updating the project or restarting BOINC isn't either; at least the CPU and GPU values persist. The project's original values only come back with new tasks, but I'm not sure to what extent they are applied then. I am however sure that the values displayed with old tasks are not updated without another client restart, so whatever you see there may be outdated. When I want to revert app_config settings I first change them to the values used by the project, then reload the configuration, then delete it and restart the client. And I try to avoid app_config in the first place. Don't think of app_config as an easy and safe configuration tool for average users; it is a later add-on to BOINC which, as far as I know, has never been fully integrated. If you use it you can expect unexpected things to happen. I'm quite sure that getting many more tasks than you could finish was such a thing.

> * Changed my prefs to use max of 75% of cores

At that point the event log will show you how many CPUs that translates to. Likely the correct six. I've seen BOINC schedule one CPU more than configured when in panic mode, but that shouldn't be the case here with only 10 tasks in progress and nearly full time left.

> Updated all projects and kickstarted it. Rosetta took 5 cores, GPU projects 1 per, 2 total.

Is that what the Manager showed you? Again, that may not be reality. Without different configuration I'd expect 1 core total scheduled for the GPU tasks and the remaining 5 of 6 for CPU tasks. Real usage will rather have been 2+5, more than you wanted. But if the Manager displayed just that, in this case it was coincidence.

> * Changed my prefs to use max of 74% of cores because inclusive programmer math.

I don't think so. Either set 75% and configure the GPU projects to schedule 1 CPU and 1 GPU per task. That way up to 2 CPUs will be scheduled for (usually) 2 GPU tasks and the remaining 4-6 for CPU tasks. Or set 50% and 0.1 CPU + 1 GPU. Due to the way BOINC schedules CPUs it will not reserve any for GPU support (but the applications still use them), and you always have 4 for CPU tasks but never more. Those are two simple suggestions; of course you can make things more complicated by running several tasks per GPU.
Viktor | Joined: 7 Jul 08 | Posts: 5 | Credit: 3,281,899 | RAC: 81
> Please don't be so vague; undoing app_config settings is not trivial.

Sure thing, and warning heeded. I was checking project status and found a private message from a user in 2020 offering help for the amount of errors my client was throwing. I did a deep dive on how my Rosetta progress was going and noticed the flood of tasks, etc. mentioned in my initial post. I disallowed any new Rosetta tasks a week ago and let them run through to avoid giving the project any headaches. After I had no tasks left I posted on the forum and received help. Per recommendations I removed the 1 line present in my app_config, which was to limit the concurrent tasks. As BOINC does not like a blank app_config I deleted it from all projects, as I had only created them to help balance Rosetta vs the GPU projects. I issued the command to update the projects via "boinccmd --project (url of project) update". I restarted the boinc service, which was when I ran into the 6 vs 7 problem. See below.

> At that point the event log will show you how many CPUs that translates to. Likely the correct six. [...] Is that what the Manager showed you? Again, that may not be reality.

I agree that in theory BOINC with 75% volunteered on an 8-core CPU should = 6 cores. With that allocation Rosetta wanted to run 5 processes and my other two GPU projects wanted to run 2 total, resulting in 7 total used. 6 =/= 7. My amateur assumption was that I had run into a "counts from 0" issue. My solution was to volunteer 74%, which is confirmed as 5 cores via journalctl. 74% "cpus" volunteered on an 8-core is 5.9x..., so it makes no sense that this would result in my desired effect:

viktor@bender:~$ ps -u boinc
  PID TTY          TIME CMD
84619 ?        00:00:16 boinc
84673 ?        00:11:47 rosetta_4.20_x8
84676 ?        00:11:42 rosetta_4.20_x8
84678 ?        00:11:37 rosetta_4.20_x8
84681 ?        00:11:32 rosetta_4.20_x8
84746 ?        00:09:00 hsgamma_FGRPB1G
84812 ?        00:00:40 milkyway_1.46_x

with boinc reporting:

max CPUs used: 5

As to what the event manager thinks I can't help you. I could try to fire up a GUI, but I can gather what info I need from logs/ps/nvidia-smi/etc.

> Either set 75% and configure the GPU projects to schedule 1 CPU and 1 GPU per task. That way up to 2 CPUs will be scheduled for (usually) 2 GPU tasks and the remaining 4

OK, so it sounds like regardless of my current real-life situation being what I am looking for, I came to it via an incorrect way. I am 100% down to keep working until it is done right. I will work with cpu_usage on gpu_versions of the GPU projects. I know I sound like a broken record, but thanks. The replies take me ~15 minutes to type out, and those who are providing aid are doing so of their own free will. Much easier to click "next thread". I will report back when I bump my GPU apps to cpu_usage of 1 and see if Rosetta takes the other 4 seats.
Viktor | Joined: 7 Jul 08 | Posts: 5 | Credit: 3,281,899 | RAC: 81
> Either set 75% and configure the GPU projects to schedule 1 CPU and 1 GPU per task. That way up to 2 CPUs will be scheduled for (usually) 2 GPU tasks and the remaining 4

Well, that did it. Forcing the GPU projects to eat 1 core per, 2 total, and volunteering 75% of cores has resulted in 4 Rosetta tasks, 1 Milkyway, 1 Einstein. In hindsight this makes sense: when the GPU projects were only claiming a fraction of a CPU core each, the earlier 7-core math works out. Will report back in a week or so.
©2024 University of Washington
https://www.bakerlab.org