Message boards : Number crunching : WU scheduling issues remain an issue
Author | Message |
---|---|
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
While crunching a few WUs that take ~2 hours each, I get a download of WUs that take ~15 hours each... but in numbers that would require 2-hour completions to avoid machine over-commitment. I share projects on some machines, and "just wait until BOINC figures it out" doesn't work for me because I don't believe the other project should be idled to make up for this scheduling miscalculation. I have been training my teammates to use the abort and reset buttons.... I would love to stop issuing 'refunds' of your Work Units... Please help. Proudly crunching with TeAm Anandtech |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
If you've set a preference for how long to crunch a WU, then it will try to crunch about that long. Note that the setting will take effect the next time BOINC contacts the Rosetta server. If you haven't set a preference, the WU will use its built-in default value. This is 2 hr for current WUs and 8 hr for older ones. The estimated crunch time that BOINC displays has absolutely no effect on how long the WU will actually take. See the FAQ for more details. |
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
If you've set a preference for how long to crunch a WU, then it will try to crunch about that long. Note that the setting will take effect next time boinc contacts the rosetta server. I have left the settings at default. Obviously if I had changed them, I wouldn't be complaining that there is an issue. -Sid Proudly crunching with TeAm Anandtech |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
If you've set a preference for how long to crunch a WU, then it will try to crunch about that long. Note that the setting will take effect next time boinc contacts the rosetta server. On the three machines I have been observing, only one was forced into DCF mode when the new time setting became available. At first I tried to manually intervene, but this only produced a temporary fix for the problem and required me to constantly tinker with the machine. When I decided to allow the machine to sort itself out, I set the time to 4 hours, and the time between contacts to the server to 0.25 days. In less than 24 hours the system stabilized. I was then able to raise the connection time in increments over two days (about 5 adjustments total) and it is now running very well. You could probably make larger adjustments than I did in the connect time to make it happen faster, but the point is that the system MUST be allowed to correct itself over time. BOINC does not have any information about the actual length of the WUs and so it must adjust to them over time. This same situation occurs on other projects when shorter WUs are replaced by longer ones. BOINC is designed to work this out for itself. Moderator9 ROSETTA@home FAQ Moderator Contact |
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
If you've set a preference for how long to crunch a WU, then it will try to crunch about that long. Note that the setting will take effect next time boinc contacts the rosetta server. That is a very accurate re-iteration of the issue I am trying to describe. (the idling of another BOINC project in favor of Rosetta for a day or so) If it were only a matter of a single instance of this occurrence I wouldn't think too much of it. The trouble is that this particular machine has gone into this cycle 2 times now. The first time, I aborted the excess work units and the machine was fine for a while but overloaded itself again after a few days. So, this time I reset the project and again, after a few days I found that it had overloaded itself once again. I aborted about a half-dozen of the pending WUs and now it is happy... but it is frustrating to keep watching Rosetta push the other project aside. (I like the other project too) Something is initially telling BOINC that these work units will take several hours beyond the default to complete (despite the fact they will not) and confusing BOINC to the point that it stops work on any other project on this machine. Yes, BOINC will figure it out over time (at the expense of other projects), but why can't you have Rosetta tell BOINC it will take the amount of time it is defaulted to instead of 16 hours? (where is this 16 hour estimate coming from?) -Sid Proudly crunching with TeAm Anandtech |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
As I said, I have seen this behavior before, and worked through it. So yes, I gave you a perfect example of what your system is doing, and I understand it quite well. Until the next version of BOINC is released that understands the R@H time setting, the only way to stabilize the system is to let it work it out on its own. The 16 hour estimate is being created by BOINC, because it is not being allowed to adjust itself to the run conditions of your system. You can change the estimate manually if you want, but that will not really help. BOINC uses a correction factor found in one of the files in the system on your machine to calculate the value. That number, however, is NOT used for requesting work; it is used to display an estimated time in the BOINC manager. BOINC requests work based on what it sees in its queue, based on its experience with similar WUs, and the amount of time till the next connection. If you set the connection interval too long it will ask for too much work. If you abort WUs, or reset the project, BOINC will never be able to figure out how long they would run and adjust work requests accordingly. If you want the system to settle down reasonably quickly, then set your time setting to 2 hours, set your connection interval to 0.2 days, update the project (not reset, UPDATE) and let it run for a while. In less than 1/2 a day it will all balance out. Then, if you don't like those settings for some reason, adjust them. But do not make large adjustments in short periods of time or it will get lost again. If you look at the time estimates for the R@H WUs (assuming you have a number of them to look at) you will see that each time it completes a WU and loads a new one, the new one will show a shorter completion estimate. This is because the system is adjusting. Eventually the estimated time to completion will be about equal to the time setting, and the actual run time for the WU. Moderator9 ROSETTA@home FAQ Moderator Contact |
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
Thanks for the explanations (and patience) -Sid (I don't delete ALL of the downloaded WUs, just enough to get out of earliest deadline mode) Proudly crunching with TeAm Anandtech |
Grenadier Send message Joined: 17 Sep 05 Posts: 1 Credit: 790,880 RAC: 0 |
That is a very accurate re-iteration of the issue I am trying to describe. (the idling of another BOINC project in favor of Rosetta for a day or so) Most of your problem is right here. The continual deletion of WUs and resetting of the project keep BOINC from adjusting the duration correction factor properly to the new WU size. I know you don't want to hear this, but leave BOINC alone, and you'll have fewer problems in the long run. Yes, you'll have days where one project monopolizes the machine (I've had this with Leiden and Sztaki recently.) But in the end, the adjustment factor will kick in, the long-term debts will accrue correctly, and you will have days with NO work for Rosetta. In the end, everything will balance out. But by micro-managing, you're probably making it worse, not better. |
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
That is a very accurate re-iteration of the issue I am trying to describe. (the idling of another BOINC project in favor of Rosetta for a day or so) Actually, the winning combination seems to be to delete only enough of the mis-estimated WUs in my cache to come out of earliest deadline mode, but let the crunching process continue (by not deleting ALL Work Units) until it "gets straightened out". I lose no crunching time on the shared project and BOINC gets to continue adjusting its estimation of completion times until it is correct. Yes, the Rosetta project gets a few returned WUs that have to be re-issued this way, but I think "sharing the pain" is only appropriate. -Sid Proudly crunching with TeAm Anandtech |
Robert Everly Send message Joined: 8 Oct 05 Posts: 27 Credit: 665,094 RAC: 0 |
Sid, are your estimated times going down? Closer to actual? If so, it is working. As others have pointed out, letting it go into panic mode will get the estimates closer faster. Also, what sort of time frame are you looking at for your resource balance? If it's daily, then BOINC in general may be a lost cause for you; if it's a longer-term balance, it will sort itself out over time. As a side note, SETI will futz with your completion times as well with the various angle ranges of the WU, and this will be more pronounced when enhanced goes live. Also a member of the TeAm. :) |
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
Sid, are your estimated times going down? Closer to actual? If so it is working. As others have pointed out, letting it go into panic mode will get the estimates closer faster. I'm having great luck with my latest maneuver to let both projects crunch and let BOINC get its cache size adjusted to what is appropriate for these WUs. I am seeing my estimated time go down (as expected) and I have several more WUs in the cache to keep it busy. Rosetta isn't asking for more work because it knows it has plenty and it is sharing nicely. The only "loss" is the 7 work units I aborted (in their "ready to run" state) to eliminate the earliest deadline mode of ops. From all the help I have received here in the way of explanation, I see that until BOINC updates their client to recognize Rosetta's 'time management' scheme (if you will allow my phrasing), this will just be necessary whenever Rosetta makes drastic changes in work unit crunch times, until the WUs from the earlier issue are cleared from the system. I'm happy.... -Sid Proudly crunching with TeAm Anandtech |
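
The adjustment described in this thread (an estimate that is far too high, then shrinks a little with each completed WU, and a work request that grows with the connection interval) can be illustrated with a minimal Python sketch. This is not BOINC's actual code: the function names, the 0.1 smoothing rate, the rule that an underestimate is corrected at once, and the simple buffer formula are all assumptions chosen only to show the shape of the behavior.

```python
from typing import List


def update_correction(dcf: float, raw_estimate_h: float, actual_h: float,
                      rate: float = 0.1) -> float:
    """Return the correction factor after one WU completes (assumed rule)."""
    ratio = actual_h / raw_estimate_h  # how wrong the uncorrected estimate was
    if ratio > dcf:
        # Assumption: an underestimate is corrected immediately.
        return ratio
    # Assumption: an overestimate shrinks gradually, a fraction `rate` per WU.
    return dcf + rate * (ratio - dcf)


def displayed_estimates(dcf: float, raw_estimate_h: float, actual_h: float,
                        completed: int) -> List[float]:
    """Estimate shown for the next WU, before and after each completion."""
    shown = [dcf * raw_estimate_h]
    for _ in range(completed):
        dcf = update_correction(dcf, raw_estimate_h, actual_h)
        shown.append(dcf * raw_estimate_h)
    return shown


def hours_to_request(connect_interval_days: float, resource_share: float,
                     queued_hours: float) -> float:
    """Hours of work needed to cover the time until the next contact
    (assumed formula; resource_share is this project's fraction, 0 to 1)."""
    buffer_hours = connect_interval_days * 24.0 * resource_share
    return max(0.0, buffer_hours - queued_hours)


if __name__ == "__main__":
    # Hypothetical numbers: 2-hour WUs that are currently shown as 16 hours.
    for n, est in enumerate(displayed_estimates(8.0, 2.0, 2.0, 6)):
        print(f"after {n} completed WUs: estimate {est:.1f} h")
    print(hours_to_request(0.25, 0.5, 0.0), "h requested at 0.25 days")
    print(hours_to_request(2.0, 0.5, 0.0), "h requested at 2 days")
```

In this model the estimate only moves when a WU actually completes, which is the thread's argument against aborting everything or resetting the project. The second function shows why a short connection interval keeps the request small while the estimates are still settling. How quickly a real client converges depends on the client version and is not captured here.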
©2024 University of Washington
https://www.bakerlab.org