WU scheduling issues remain an issue

Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 11735 - Posted: 6 Mar 2006, 22:31:00 UTC

While crunching a few WUs that take ~2 hours each, I get a download of WUs that take ~15 hours each... but in numbers that would require 2 hour completions to avoid machine over-commitment.

I share projects on some machines and "just wait until BOINC figures it out" doesn't work for me because I don't believe the other project should be idled to make up for this scheduling miscalculation.

I have been training my team mates to use the abort and reset buttons....

I would love to stop issuing 'refunds' of your Work Units...

Please help
Proudly crunching with TeAm Anandtech
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 11736 - Posted: 6 Mar 2006, 23:07:37 UTC

If you've set a preference for how long to crunch a WU, then it will try to crunch for about that long. Note that the setting takes effect the next time BOINC contacts the Rosetta server.

If you haven't set a preference, the WU will use its built-in default value. This is 2 hr for current WUs and 8 hr for older ones.

The estimated crunch time that BOINC displays has absolutely no effect on how long the WU will actually take.

See the FAQ for more details.
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 11739 - Posted: 6 Mar 2006, 23:54:39 UTC - in response to Message 11736.  

I have left the settings at default. Obviously if I had changed them, I wouldn't be complaining that there is an issue.

-Sid

Proudly crunching with TeAm Anandtech
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11740 - Posted: 7 Mar 2006, 0:46:22 UTC - in response to Message 11739.  

On the three machines I have been observing, only one was forced into DCF mode when the new time setting became available. At first I tried to manually intervene, but this only produced a temporary fix for the problem and required me to constantly tinker with the machine. When I decided to allow the machine to sort itself out, I set the time to 4 hours and the time between contacts to the server to 0.25 days. In less than 24 hours the system stabilized. I was then able to raise the connection time in increments over two days (about 5 adjustments total), and it is now running very well.

You could probably make larger adjustments than I did in the connect time to make it happen faster, but the point is that the system MUST be allowed to correct itself over time. BOINC does not have any information about the actual length of the WUs, so it must adjust to them over time. This same situation occurs on other projects when shorter WUs are replaced by longer ones. BOINC is designed to work this out for itself.


Moderator9
ROSETTA@home FAQ
Moderator Contact
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 11743 - Posted: 7 Mar 2006, 1:52:32 UTC - in response to Message 11740.  
Last modified: 7 Mar 2006, 1:55:54 UTC


That is a very accurate re-iteration of the issue I am trying to describe (the idling of another BOINC project in favor of Rosetta for a day or so).
If it were only a single instance of this occurrence, I wouldn't think too much of it. The trouble is that this particular machine has gone into this cycle 2 times now. The first time, I aborted the excess work units and the machine was fine for a while, but it overloaded itself again after a few days. So this time I reset the project, and again, after a few days, I found that it had overloaded itself once more. I aborted about a half-dozen of the pending WUs and now it is happy... but it is frustrating to keep watching Rosetta push the other project aside. (I like the other project too.)

Something is initially telling BOINC that these work units will take several hours beyond the default to complete (despite the fact that they will not), confusing BOINC to the point that it stops work on any other project on this machine. Yes, BOINC will figure it out over time (at the expense of other projects), but why can't you have Rosetta tell BOINC it will take the amount of time it is defaulted to instead of 16 hours? (Where is this 16 hour estimate coming from?)


-Sid
Proudly crunching with TeAm Anandtech
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11744 - Posted: 7 Mar 2006, 3:59:04 UTC - in response to Message 11743.  


As I said I have seen this behavior before, and worked through it. So yes I gave you a perfect example of what your system is doing, and I understand it quite well. Until the next version of BOINC is released that understands the R@H time setting, the only way to stabilize the system is let it work it out on its own.

The 16 hour estimate is being created by BOINC, because it is not being allowed to adjust itself to the run conditions of your system. You can change the estimate manually if you want, but that will not really help. BOINC uses a correction factor found in one of the files in the system on your machine to calculate the value. That number, however, is NOT used for requesting work; it is used to display an estimated time in the BOINC manager. BOINC requests work based on what it sees in its queue, its experience with similar WUs, and the amount of time until the next connection. If you set the connection interval too long, it will ask for too much work. If you abort WUs, or reset the project, BOINC will never be able to figure out how long they would run and adjust work requests accordingly.
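As a rough illustration of the two quantities described above, here is a minimal Python sketch. It is not BOINC's actual code: the function names, the factor of 8, the 50% resource share, and the simple "budget divided by expected hours per WU" sizing rule are all assumptions, chosen only to show how a 2 hour default can be displayed as 16 hours and why a longer connection interval leads to a larger work request.

```python
import math


def displayed_estimate_hours(base_estimate_hours: float, correction_factor: float) -> float:
    """Estimate shown in the manager: the base estimate scaled by a correction factor."""
    return base_estimate_hours * correction_factor


def wus_to_request(connect_interval_days: float, resource_share: float,
                   expected_hours_per_wu: float, queued_wus: int) -> int:
    """Hypothetical sizing rule: cover the connection interval with this
    project's share of CPU time, minus what is already queued."""
    budget_hours = connect_interval_days * 24.0 * resource_share
    wanted = math.ceil(budget_hours / expected_hours_per_wu)
    return max(0, wanted - queued_wus)


# A 2 hour WU with an (assumed) correction factor of 8 is displayed as 16 hours.
print(displayed_estimate_hours(2.0, 8.0))

# Short versus long connection interval, 50% share, 2 hour WUs, empty queue.
print(wus_to_request(0.2, 0.5, 2.0, 0))
print(wus_to_request(2.0, 0.5, 2.0, 0))
```

Under these assumptions, a 0.2 day interval asks for a couple of WUs while a 2 day interval asks for a dozen, which is the sense in which a long interval magnifies any error in the expected run time.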

If you want the system to settle down reasonably quickly, then set your time setting to 2 hours, set your connection interval to 0.2 days, update the project (not reset, UPDATE), and let it run for a while. In less than half a day it will all balance out. Then, if you don't like those settings for some reason, adjust them. But do not make large adjustments in short periods of time, or it will get lost again.

If you look at the time estimates for the R@H WUs (assuming you have a number of them to look at) you will see that each time it completes a WU and loads a new one the new one will show a shorter completion estimate. This is because the system is adjusting. Eventually the estimated time to completion will be about equal to the time setting, and the actual run time for the WU.
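The shrinking estimates described above can be sketched with a toy simulation. This is a simplified model, not BOINC's real update rule: the `smoothing` value of 0.5 and the starting figures (16 hours estimated, 2 hours actual) are assumptions taken from the numbers mentioned in this thread. Only completed WUs move the estimate, which is why aborted WUs do not help.

```python
def simulate_estimates(initial_estimate_hours: float, actual_hours: float,
                       completed_wus: int, smoothing: float = 0.5) -> list[float]:
    """Return the estimate shown before each WU starts.

    Assumed update rule: after every completed WU, the estimate closes a
    fixed fraction (`smoothing`) of the gap to the actual run time.
    """
    estimates = []
    estimate = initial_estimate_hours
    for _ in range(completed_wus):
        estimates.append(estimate)
        estimate += smoothing * (actual_hours - estimate)
    return estimates


history = simulate_estimates(16.0, 2.0, 8)
for n, est in enumerate(history, start=1):
    print(f"WU {n}: estimated {est:.2f} h")
```

With these assumed numbers the estimate falls from 16 hours toward 2 hours over a handful of completed WUs; a smaller `smoothing` value would model a slower adjustment.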


Moderator9
ROSETTA@home FAQ
Moderator Contact
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 11749 - Posted: 7 Mar 2006, 10:16:10 UTC
Last modified: 7 Mar 2006, 10:35:56 UTC

Thanks for the explanations (and patience)

-Sid

(I don't delete ALL of the downloaded WUs, just enough to get out of earliest deadline mode)
Proudly crunching with TeAm Anandtech
Grenadier
Joined: 17 Sep 05
Posts: 1
Credit: 790,880
RAC: 0
Message 11759 - Posted: 7 Mar 2006, 20:32:43 UTC - in response to Message 11743.  

That is a very accurate re-iteration of the issue I am trying to describe (the idling of another BOINC project in favor of Rosetta for a day or so).
If it were only a single instance of this occurrence, I wouldn't think too much of it. The trouble is that this particular machine has gone into this cycle 2 times now. The first time, I aborted the excess work units and the machine was fine for a while, but it overloaded itself again after a few days. So this time I reset the project, and again, after a few days, I found that it had overloaded itself once more. I aborted about a half-dozen of the pending WUs and now it is happy... but it is frustrating to keep watching Rosetta push the other project aside. (I like the other project too.)


Most of your problem is right here. The continual deletion of WUs and resetting of the project keep BOINC from adjusting the duration correction factor properly to the new WU size.

I know you don't want to hear this, but leave BOINC alone, and you'll have fewer problems in the long run. Yes, you'll have days where one project monopolizes the machine (I've had this with Leiden and Sztaki recently.) But in the end, the adjustment factor will kick in, and the long-term debts will accrue correctly and you will have days with NO work for Rosetta. In the end, everything will balance out. But by micro-managing, you're probably making it worse, not better.
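The balancing effect described above can be illustrated with a toy debt model. This is only a sketch under stated assumptions, not BOINC's real debt accounting: the function name, the hourly time step, the 50/50 resource shares, and the "run the project with the highest debt" rule are all hypothetical. Each hour every project earns its share of one CPU hour, and the project that actually runs is charged the full hour.

```python
def simulate_debt(shares: dict[str, float], forced_project: str,
                  forced_hours: int, total_hours: int) -> dict[str, int]:
    """Return the hours each project ran on one CPU.

    For the first `forced_hours`, `forced_project` monopolizes the CPU
    (for example because of a deadline crunch). Afterwards the project
    with the highest accumulated debt runs.
    """
    debt = {name: 0.0 for name in shares}
    ran = {name: 0 for name in shares}
    for hour in range(total_hours):
        if hour < forced_hours:
            running = forced_project
        else:
            running = max(debt, key=debt.get)
        for name, share in shares.items():
            debt[name] += share   # every project earns its share
        debt[running] -= 1.0      # the running project pays for the hour
        ran[running] += 1
    return ran


shares = {"rosetta": 0.5, "other": 0.5}
print(simulate_debt(shares, "rosetta", 24, 48))
print(simulate_debt(shares, "rosetta", 24, 96))
```

In this model, a day in which Rosetta monopolizes the CPU is followed by a day in which it gets no time at all, after which the two projects alternate and the totals match the configured shares.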

Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 11761 - Posted: 7 Mar 2006, 21:00:53 UTC - in response to Message 11759.  

Actually, the winning combination seems to be to delete only enough of the mis-estimated WUs in my cache to come out of earliest deadline mode, but let the crunching process continue (by not deleting ALL work units) until it "gets straightened out".

I lose no crunching time on the shared project, and BOINC gets to continue adjusting its estimation of completion times until it is correct.

Yes, the Rosetta project gets a few returned WUs that have to be re-issued this way, but I think "sharing the pain" is only appropriate.
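The "delete only enough" idea can be put into numbers with a small sketch. It assumes a much simpler rule than the BOINC client really uses to decide on earliest deadline mode: the queue is treated as safe when the estimated hours for all queued WUs fit into the project's share of CPU time before a common deadline. The function name and all figures below (10 queued WUs, a 7 day deadline, a 50% share) are hypothetical.

```python
import math


def wus_to_abort(queued_wus: int, estimated_hours_per_wu: float,
                 hours_until_deadline: float, resource_share: float) -> int:
    """Smallest number of WUs to abort so the rest fit before the deadline,
    judged by the current (possibly inflated) estimate."""
    available_hours = hours_until_deadline * resource_share
    fits = math.floor(available_hours / estimated_hours_per_wu)
    return max(0, queued_wus - fits)


# With a 16 hour estimate, only part of the queue appears to fit.
print(wus_to_abort(10, 16.0, 7 * 24.0, 0.5))

# With the true 2 hour run time, nothing would need to be aborted.
print(wus_to_abort(10, 2.0, 7 * 24.0, 0.5))
```

The second call shows why the aborts are only needed while the estimate is inflated: once it drops toward the real run time, the same queue fits without intervention.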

-Sid
Proudly crunching with TeAm Anandtech
Robert Everly

Joined: 8 Oct 05
Posts: 27
Credit: 665,094
RAC: 0
Message 11763 - Posted: 7 Mar 2006, 22:21:22 UTC
Last modified: 7 Mar 2006, 22:21:48 UTC

Sid, are your estimated times going down? Closer to actual? If so it is working. As others have pointed out, letting it go into panic mode will get the estimates closer faster.

Also, what sort of time frame are you looking at for your resource balance? If it's daily, then BOINC in general may be a lost cause for you; if it's a longer-term balance, it will sort itself out over time.

As a side note SETI will futz with your completion times as well with the various angle ranges of the WU, and will be more pronounced when enhanced goes live.


Also a member of the TeAm. :)

Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 11764 - Posted: 7 Mar 2006, 22:36:19 UTC - in response to Message 11763.  

I'm having great luck with my latest maneuver to let both projects crunch and let BOINC get its cache size adjusted appropriately for these WUs.
I am seeing my estimated time go down (as expected) and I have several more WUs in the cache to keep it busy. Rosetta isn't asking for more work because it knows it has plenty, and it is sharing nicely.
The only "loss" is the 7 work units I aborted (in their "ready to run" state) to eliminate the earliest deadline mode of ops.
From all the help I have received here in the way of explanation, I see that until BOINC updates their client to recognize Rosetta's 'time management' scheme (if you will allow my phrasing), this will just be necessary whenever Rosetta makes drastic changes in work unit crunch times, until the WUs from the earlier issue are cleared from the system.

I'm happy....

-Sid


Proudly crunching with TeAm Anandtech