Maximum CPU time Exceeded...How about some granted credit!

Message boards : Number crunching : Maximum CPU time Exceeded...How about some granted credit!

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 8733 - Posted: 10 Jan 2006, 20:45:37 UTC
Last modified: 10 Jan 2006, 21:01:51 UTC


ID: 8733 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 8805 - Posted: 11 Jan 2006, 22:02:54 UTC - in response to Message 8733.  
Last modified: 11 Jan 2006, 22:06:28 UTC


ID: 8805 · Rating: 0
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 8806 - Posted: 11 Jan 2006, 22:07:11 UTC - in response to Message 8805.  

I don't like to kick this post, but I also hate being taken for a fool.
If it is too much trouble to reply to this post, it becomes too much trouble to continue this project.

Why do you think you are being taken for a fool? I don't understand.

I am assuming they have not had time to respond. One may criticize that as not giving enough priority to this issue, but I would not assume that I am being taken for a fool. :)

Regards,
Bob P.
ID: 8806 · Rating: 0
The Gas Giant
Joined: 20 Sep 05
Posts: 23
Credit: 58,591
RAC: 0
Message 8808 - Posted: 11 Jan 2006, 22:14:11 UTC - in response to Message 8805.  
Last modified: 11 Jan 2006, 22:16:34 UTC

ID: 8808 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 8809 - Posted: 11 Jan 2006, 22:14:44 UTC - in response to Message 8806.  
Last modified: 11 Jan 2006, 22:19:42 UTC

I don't like to kick this post, but I also hate being taken for a fool.
If it is too much trouble to reply to this post, it becomes too much trouble to continue this project.

Why do you think you are being taken for a fool? I don't understand.

I am assuming they have not had time to respond. One may criticize that as not giving enough priority to this issue, but I would not assume that I am being taken for a fool. :)

Well, if you remove something like a WU from someone's results list, I assume that's reason enough to send that person a message or post something on the forum, assuming this will be happening to a lot of crunchers.
And looking at the response time to other replies, a day should be enough.
Besides, it's not the amount of credit but the fact that it is happening that disturbs me.

ID: 8809 · Rating: 0
Moderator7
Volunteer moderator
Joined: 27 Dec 05
Posts: 10
Credit: 0
RAC: 0
Message 8813 - Posted: 11 Jan 2006, 22:42:16 UTC

I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done. (Servers have been down a couple of times today.) I have not seen the script he's going to be running, so I don't know exactly what is covered.

ID: 8813 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 8816 - Posted: 12 Jan 2006, 1:02:21 UTC - in response to Message 8809.  

I don't like to kick this post, but I also hate being taken for a fool.
If it is too much trouble to reply to this post, it becomes too much trouble to continue this project.

Why do you think you are being taken for a fool? I don't understand.

I am assuming they have not had time to respond. One may criticize that as not giving enough priority to this issue, but I would not assume that I am being taken for a fool. :)

Well, if you remove something like a WU from someone's results list, I assume that's reason enough to send that person a message or post something on the forum, assuming this will be happening to a lot of crunchers.
And looking at the response time to other replies, a day should be enough.
Besides, it's not the amount of credit but the fact that it is happening that disturbs me.


At some point ALL of your WUs will be removed from your list. That is the way they keep the database lean. How old was this WU you lost? If it was more than a week or two old, it would be removed in the normal course of running the project. But don't despair!! They have them ALL offline and they will eventually fix the problem.

While it might seem you have been ignored, you have not. There are a lot of users with a number of issues to be answered. What you will eventually discover is that this project is the most responsive of all of the BOINC projects to the needs of the user community. I don't mean to slight E@H with that comment because it is really hard to pick which of the two is better, but the point is they will get this taken care of in due course. Just calm down and give them a few days to take a look at the problem. The few credits that you are waiting for will not make much difference in the big picture.

As "The Gas Giant" pointed out, he, I, and others have a few thousand credits each at stake. This is supposed to be fun! This is more about lost computing time that could have been put to better use than about lost credits. What the discussion is about is fixing a problem in the application to make it a better science project.

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ID: 8816 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 8817 - Posted: 12 Jan 2006, 1:12:14 UTC - in response to Message 8813.  

I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done. (Servers have been down a couple of times today.) I have not seen the script he's going to be running, so I don't know exactly what is covered.



David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days.
ID: 8817 · Rating: 0
Divide Overflow
Joined: 17 Sep 05
Posts: 82
Credit: 921,382
RAC: 0
Message 8820 - Posted: 12 Jan 2006, 2:13:07 UTC - in response to Message 8817.  

David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days.

Another example of why I respect the management of this project so much. Thanks for the follow through!
ID: 8820 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 8821 - Posted: 12 Jan 2006, 3:22:43 UTC - in response to Message 8820.  

David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days.

Another example of why I respect the management of this project so much. Thanks for the follow through!



Hear, hear!!

Regards
Phil


ID: 8821 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 8825 - Posted: 12 Jan 2006, 6:55:38 UTC - in response to Message 8816.  
Last modified: 12 Jan 2006, 6:56:32 UTC


At some point ALL of your WUs will be removed from your list. That is the way they keep the database lean. How old was this WU you lost? If it was more than a week or two old, it would be removed in the normal course of running the project. But don't despair!! They have them ALL offline and they will eventually fix the problem.

While it might seem you have been ignored, you have not. There are a lot of users with a number of issues to be answered. What you will eventually discover is that this project is the most responsive of all of the BOINC projects to the needs of the user community. I don't mean to slight E@H with that comment because it is really hard to pick which of the two is better, but the point is they will get this taken care of in due course. Just calm down and give them a few days to take a look at the problem. The few credits that you are waiting for will not make much difference in the big picture.

As "The Gas Giant" pointed out, he, I, and others have a few thousand credits each at stake. This is supposed to be fun! This is more about lost computing time that could have been put to better use than about lost credits. What the discussion is about is fixing a problem in the application to make it a better science project.

Regards
Phil

I don't know whether I've had more time-exceeded WUs.
I just came across this one, which is how I know about it.
People were talking about crediting on Monday, but suddenly the WU was gone without being credited.
Of course the lost time is more important than the credits. As I stated elsewhere, I only choose medical projects to participate in, but when (to me) strange things like this happen I get a bit ???
We'll see what happens next.


ID: 8825 · Rating: 0
The Gas Giant
Joined: 20 Sep 05
Posts: 23
Credit: 58,591
RAC: 0
Message 8881 - Posted: 12 Jan 2006, 20:41:34 UTC - in response to Message 8817.  

I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done. (Servers have been down a couple of times today.) I have not seen the script he's going to be running, so I don't know exactly what is covered.



David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days.


Ah, but the credit granted was not for max-cpu-time-exceeded. We have a major problem here. BOINC/Rosetta is not capable of handling some of the versions of wu's that have been released when the cpu time exceeds the estimated time by something like 20%. Under normal circumstances of BOINC operation a wu hitting this limit is a regular occurrence.

David, something needs to be done about this. I have confirmed that if I manually alter the DCF, a wu that has an extended completion time does complete normally. Maybe for these wu's you need to increase the number of estimated flops and iops.
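
The relationship described above can be sketched roughly as follows. This is only an illustration, assuming the client aborts a task once its CPU time passes the project's estimate, scaled by the DCF, plus about 20%. The function names, the 20% margin, and the numbers are assumptions made for the example, not BOINC's actual code or settings.

```python
def cpu_time_limit(estimated_seconds: float, dcf: float, margin: float = 0.20) -> float:
    """Hypothetical abort threshold: the DCF-corrected estimate plus a margin."""
    return estimated_seconds * dcf * (1.0 + margin)


def would_abort(actual_seconds: float, estimated_seconds: float, dcf: float) -> bool:
    """True if a work unit needing actual_seconds would hit the limit first."""
    return actual_seconds > cpu_time_limit(estimated_seconds, dcf)


# Illustrative numbers only: a 3.5 hour estimate and a work unit that needs 5 hours.
estimate = 3.5 * 3600
actual = 5.0 * 3600

aborted_at_default = would_abort(actual, estimate, dcf=1.0)   # limit is 4.2 hours
aborted_after_raise = would_abort(actual, estimate, dcf=1.5)  # limit is 6.3 hours
```

Under these assumptions, raising the DCF on the host or raising the estimate on the server both move the limit in the same direction, which is why either change could let a long wu finish.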

I know of 2 fairly large crunchers who have left this project because of this issue and the lost credit.

Paul.
ID: 8881 · Rating: 0
Moderator7
Volunteer moderator
Joined: 27 Dec 05
Posts: 10
Credit: 0
RAC: 0
Message 8884 - Posted: 12 Jan 2006, 22:22:28 UTC - in response to Message 8881.  

David, something needs to be done about this.


I have asked again for specifics... I don't expect an answer at this point to be today, but probably tomorrow. Handling the "cpu time exceeded" cases is likely to be more difficult than ones where the original issue date is known; the error message is not readily accessible. I don't know if this _can_ be done, it may be that to grant credit for these, credit would have to be given for every failed WU regardless of reason... and I don't know how big a problem that could cause. Someone WILL give more info as soon as it's available.

Be sure to look at the text file for credits and not just the results web page.

ID: 8884 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 8886 - Posted: 13 Jan 2006, 0:02:33 UTC - in response to Message 8881.  
Last modified: 13 Jan 2006, 0:04:20 UTC

Ah, but the credit granted was not for max-cpu-time-exceeded. We have a major problem here. BOINC/Rosetta is not capable of handling some of the versions of wu's that have been released when the cpu time exceeds the estimated time by something like 20%. Under normal circumstances of BOINC operation a wu hitting this limit is a regular occurrence.

David, something needs to be done about this. I have confirmed that if I manually alter the DCF, a wu that has an extended completion time does complete normally. Maybe for these wu's you need to increase the number of estimated flops and iops....

Paul.


I can confirm Paul's solution. If the DCF is increased to a sufficiently high value, all of the WUs will complete successfully. I would also add that the problem is made worse by the way R@H increments the values for WU progress. All other projects increment the CPU time and percent complete at the same time. BOINC then uses these values to calculate the time remaining. R@H does not increment the percent complete except at checkpoints (jumping 10% at a time). This causes the time remaining to rise as the WU progresses, and then suddenly drop by what BOINC calculates to be 10% of the time remaining when the percent complete jumps. This will work fine early in the processing, but towards the end it can cause the time remaining to drop below the amount required to complete the WU, or even to zero out or go negative. When this happens the WU will fail. This will usually occur around 80-90 percent. On my system it is right at the jump point, and usually occurs above 90%.
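
To illustrate the rise-and-drop pattern described above, here is a small simulation. It assumes the time remaining is a plain linear extrapolation from CPU time and percent complete; the real BOINC estimator may well differ, and the 10,000 second work unit is a made-up number.

```python
def reported_percent(true_percent: int, step: int = 10) -> int:
    """Percent complete as shown when it only updates at checkpoints."""
    return (true_percent // step) * step


def time_remaining(cpu_seconds: float, percent_done: int) -> float:
    """Naive linear extrapolation from CPU time used and percent complete."""
    projected_total = cpu_seconds * 100.0 / percent_done
    return projected_total - cpu_seconds


TOTAL_CPU_SECONDS = 10_000  # hypothetical true length of the work unit

for true_percent in (10, 15, 19, 20, 85, 89, 90):
    cpu = TOTAL_CPU_SECONDS * true_percent / 100
    shown = reported_percent(true_percent)
    print(f"{true_percent:3d}% done, shown {shown:3d}%, "
          f"estimated remaining {time_remaining(cpu, shown):8.0f} s")
```

In this toy model the estimate climbs while the shown percent is frozen and falls each time it jumps. It reproduces the sawtooth only, not the failure itself.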

One solution would be to increment the percent complete even if it has no direct connection to the actual completion time, to prevent the time to completion from rising and throwing off the calculation of what is actually 10% of the time for the WU. There is no need to change the actual checkpoints to do this. This would fit the BOINC model and possibly fix the problem.
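
A minimal sketch of that idea, assuming the application knows roughly how much CPU time one 10% step takes (for example, by timing the first step). The function and parameter names are hypothetical, not part of the R@H application.

```python
def smoothed_percent(checkpoint_percent: float,
                     cpu_since_checkpoint: float,
                     cpu_per_step: float,
                     step: float = 10.0) -> float:
    """Advance the displayed percent between checkpoints.

    The real checkpoints are untouched; only the reported figure moves.
    """
    # Cap just below a full step so the display never claims a checkpoint
    # that has not actually been reached.
    partial = min(cpu_since_checkpoint / cpu_per_step, 0.99)
    return checkpoint_percent + partial * step
```

For example, with the last checkpoint at 20% and half of a typical step's CPU time used since then, this would report 25% instead of holding at 20%.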

Regards
Phil


ID: 8886 · Rating: 0
The Gas Giant
Joined: 20 Sep 05
Posts: 23
Credit: 58,591
RAC: 0
Message 8891 - Posted: 13 Jan 2006, 1:09:41 UTC - in response to Message 8884.  

David, something needs to be done about this.


I have asked again for specifics... I don't expect an answer at this point to be today, but probably tomorrow. Handling the "cpu time exceeded" cases is likely to be more difficult than ones where the original issue date is known; the error message is not readily accessible. I don't know if this _can_ be done, it may be that to grant credit for these, credit would have to be given for every failed WU regardless of reason... and I don't know how big a problem that could cause. Someone WILL give more info as soon as it's available.

Be sure to look at the text file for credits and not just the results web page.


I checked the 4.2MB text file and found I received about 120c (one nice wu of 111, the rest being the small variety). Only another ~1900 to go...lol!

It would help if BOINC increased the DCF when a wu errored out on max_cpu_time_exceeded. I also understand why it is there, since I had to stop and restart BOINC yesterday morning just prior to leaving for work as I had a wu stuck at 1% for 4hrs. I couldn't get the stdout info as I was running a little late. I lost 4hrs of cpu time, but at least the wu then completed OK.

So overall there are two problems:

1. WU's get stuck at 1%.

2. WU's progress OK but are longer than typical and error out due to max_cpu_time_exceeded, but it would have completed if left to run.

So if we get rid of problem #1 we can relax the settings that cause #2.

Paul.
ID: 8891 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 8911 - Posted: 13 Jan 2006, 5:57:48 UTC - in response to Message 8891.  

So overall there are two problems:

1. WU's get stuck at 1%.

2. WU's progress OK but are longer than typical and error out due to max_cpu_time_exceeded, but it would have completed if left to run.

So if we get rid of problem #1 we can relax the settings that cause #2.

Paul.


You correctly note that the program solution for the 1% problem is in fact causing the "max_CPU_time_exceeded" errors. So the project really needs to decide what has more impact at this point. Most people recognize stuck WUs and act to intervene except on crunching farms where many systems are not watched very often.

But it is really three problems working against each other. Since R@H only runs well if kept in memory during application swaps because of the shortage of checkpoints during processing, the project team has decided to solve the 1% problem through extreme measures with a hard coded abort solution. If the application could be cleared from memory between swaps, this would in effect force a restart of the WU if it was stuck, and it would possibly then run to completion.

The "Max time" failures provide no warning that anything is amiss until they suddenly fail after many hours of work. Usually this is followed by more failures. But the problem is more complex than simply not enough time to finish. The way R@H does its progress monitoring aggravates all of this. Because it does not do checkpoints throughout the WU run time, the percent complete moves in chunks, which messes with the BOINC status keeping functions.

Before the project implemented the 1% fix I had never seen a Max time error. In some rare cases a WU might run as long as 25 or even 30 hours. I had a few of these. Now (on my systems) if they run longer than about 4 hours and 15 minutes I can expect them to fail on a max time error, unless I make a manual adjustment to the DCF periodically.

The correct solution will have to take all three of these elements into consideration. The 1% fix should check not only the CPU time but the percent complete as well. Only if a WU runs for some significant period of time without any change in percent complete should the system act. That in and of itself might fix both issues.
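
A rough sketch of such a check, assuming it runs inside the application (or a wrapper around it) and is called periodically with the current CPU time and percent complete. The class name and the one-hour stall limit used in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StallWatchdog:
    """Flags a task only when percent complete has not moved for too long."""
    stall_limit_seconds: float
    last_percent: float = 0.0
    cpu_at_last_change: float = 0.0

    def update(self, cpu_seconds: float, percent_done: float) -> bool:
        """Record a progress report; return True if the task looks stuck."""
        if percent_done > self.last_percent:
            # Progress was made, so restart the stall clock.
            self.last_percent = percent_done
            self.cpu_at_last_change = cpu_seconds
            return False
        return cpu_seconds - self.cpu_at_last_change > self.stall_limit_seconds


# A WU that sits at 1% is flagged once the stall limit passes...
stuck = StallWatchdog(stall_limit_seconds=3600)
stuck_flags = [stuck.update(cpu, 1.0) for cpu in (60, 3000, 4000)]

# ...while a long WU that keeps moving is never flagged, however long it runs.
slow = StallWatchdog(stall_limit_seconds=3600)
slow_flags = [slow.update(3000 * i, 10.0 * i) for i in range(1, 11)]
```

The design point is that total CPU time never enters the decision, so a long but healthy WU is left alone.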

Some folks have said that the variation in WU size is the cause of all of this. Projects like E@H have fairly large WUs like R@H but they are all about equal in size for a particular WU type. When the type changes the WU size also changes. These problems do not exist on any of the other projects. I for one do not think the variation in WU size is at the bottom of the problem, it simply brings the issue to the surface.

But to be certain, the R@H application should be doing some kind of incremental movement of the percent complete all through the processing even if it actually only checkpoints at 10% intervals. The BOINC client expects this. Failing that approach, the system could take a measure of how many CPU seconds it takes to process the first 10% of the WU and deduct that amount from the time remaining at each 10% jump. This would make each 10% decrement of the time remaining equal as it is in the real world. The way it is now the early reductions in the time remaining are significantly larger than those near the end of processing because the time remaining is always increasing during processing. I just hope the project folks are seeing the same stuff we are.

In any case all of these failures are bugs in the system and I do not believe they can be traced back to problems at the user end of the pipe. This is what makes the awarding of credit for these WUs appropriate.

Regards
Phil

ID: 8911 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 11
Message 8915 - Posted: 13 Jan 2006, 6:12:56 UTC - in response to Message 8911.  

The correct solution will have to take all three of these elements into consideration. The 1% fix should check not only the CPU time but the percent complete as well. Only if a WU runs for some significant period of time without any change in percent complete should the system act. That in and of itself might fix both issues.


The problem here is that BOINC provides a "maximum CPU time" field, but not a "maximum without percent complete" field - this requires a change to the application itself. Ideally, the root cause of the "hanging" can be found, rather than putting another patch on to terminate the WU early.

Some folks have said that the variation in WU size is the cause of all of this. Projects like E@H have fairly large WUs like R@H but they are all about equal in size for a particular WU type. When the type changes the WU size also changes. These problems do not exist on any of the other projects. I for one do not think the variation in WU size is at the bottom of the problem, it simply brings the issue to the surface.


I believe the problem is actually that _all_ the WUs from Rosetta contain the same "estimated number of flops" or "estimated time"... Einstein, for example, with the new Albert app and varying WU run-length, is (after an initial failure to do so) varying the _estimate_ and this maintains a more "reasonable" DCF. Once Rosetta Alpha is available to calculate average run times for the different types of WUs, this will be easier to do. Rosetta is pretty unique in having so _many_ different types of WUs - Einstein went from one to something like 6 or 7, SETI is pretty much 1 unless it's "noisy", Predictor changed every few weeks; Rosetta can have a dozen varieties being issued all at one time. Also, this is where "flops-counting" can potentially solve yet another problem... (hint!)
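
As a sketch of what per-type estimates could look like, assuming the project can collect CPU times of completed results keyed by WU type. The type names and numbers in the example are made up, and how such an average would be turned into the flops estimate sent with each WU is left open here.

```python
from collections import defaultdict
from statistics import mean


def estimates_by_type(completed_runs):
    """Average CPU seconds per WU type.

    completed_runs is an iterable of (wu_type, cpu_seconds) pairs.
    """
    runs = defaultdict(list)
    for wu_type, cpu_seconds in completed_runs:
        runs[wu_type].append(cpu_seconds)
    return {wu_type: mean(times) for wu_type, times in runs.items()}


# Hypothetical sample: two WU types with very different run lengths.
sample = [("type_a", 100.0), ("type_a", 300.0), ("type_b", 50.0)]
averages = estimates_by_type(sample)
```

With one shared estimate, both types above would be judged against the same expected length; with per-type averages, each gets its own.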

But to be certain, the R@H application should be doing some kind of incremental movement of the percent complete all through the processing even if it actually only checkpoints at 10% intervals. The BOINC client expects this. Failing that approach, the system could take a measure of how many CPU seconds it takes to process the first 10% of the WU and deduct that amount from the time remaining at each 10% jump. This would make each 10% decrement of the time remaining equal as it is in the real world. The way it is now the early reductions in the time remaining are significantly larger than those near the end of processing because the time remaining is always increasing during processing. I just hope the project folks are seeing the same stuff we are.


I see ways to "improve" the % complete figure a bit without major changes; for example, if the ab initio stage normally takes 10% of the total run time for a structure, then there could at least be an 11%, 21%, 31% figure. And as we saw with the "default_xxxx_205" WU's (that had 1000 instead of 10 structs) the increment _could_ be much smaller, 0.1%, if we could tolerate a 100x increase in time... but it would be a big improvement even if, say, halfway through the "relax" portion, the % complete was bumped by 5%.
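
The arithmetic behind those figures can be sketched like this. The 10% share for the ab initio stage is the hypothetical from the paragraph above, and the function name is made up for the example.

```python
def percent_complete(structs_done: int, total_structs: int,
                     stage_fraction: float = 0.0) -> float:
    """Percent complete from finished structures plus progress in the current one.

    stage_fraction is the share of the current structure already done,
    e.g. 0.1 once ab initio finishes if that stage is 10% of a structure.
    """
    return 100.0 * (structs_done + stage_fraction) / total_structs


after_ab_initio = percent_complete(1, 10, stage_fraction=0.1)  # second of 10 structs
per_struct_of_1000 = percent_complete(1, 1000)                 # one of 1000 structs
```

With 10 structs this gives the 11%, 21%, 31% steps mentioned above, and with 1000 structs each finished structure moves the figure by 0.1%.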

Rosetta by the way is _far_ from the "worst" on the % complete issue... and it _will_ be very difficult to come up with a way to report _very_ frequently, as SETI and Einstein do, just because of the nature of the work being done.

In any case all of these failures are bugs in the system and I do not believe they can be traced back to problems at the user end of the pipe. This is what makes the awarding of credit for these WUs appropriate.


Agreed - which is why _I_ am so glad that the project staff is doing what they can to award these credits where possible. (Please note; my PC was down with a dead power supply for a week, right in the middle of the 'problem WU' period, so I personally was not affected much; maybe 2 or 3 credits worth, not the 1000's others were.)

ID: 8915 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 8916 - Posted: 13 Jan 2006, 6:38:16 UTC - in response to Message 8915.  

....Agreed - which is why _I_ am so glad that the project staff is doing what they can to award these credits where possible. (Please note; my PC was down with a dead power supply for a week, right in the middle of the 'problem WU' period, so I personally was not affected much; maybe 2 or 3 credits worth, not the 1000's others were.)


Thanks for the note Bill. Also please understand I am not criticizing the team, I am trying to offer observations and ideas for them to think about to reach a solution. Everyone really needs to pull together to get this application to stand up and run. There are a lot of cures waiting to be found.

Regards
Phil


ID: 8916 · Rating: 0
The Gas Giant
Joined: 20 Sep 05
Posts: 23
Credit: 58,591
RAC: 0
Message 9612 - Posted: 23 Jan 2006, 2:53:24 UTC

So it looks as though no credit will be issued for a problem caused by Rosetta that has resulted in a lot of wasted cpu time and wasted effort by our machines. Thanks for telling us, guys!

Live long and crunch.
ID: 9612 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9615 - Posted: 23 Jan 2006, 4:26:01 UTC - in response to Message 9612.  

So it looks as though no credit will be issued for a problem caused by Rosetta that has resulted in a lot of wasted cpu time and wasted effort by our machines. Thanks for telling us, guys!

Live long and crunch.


Gas,

That has not been decided yet. Right now they are trying to figure out how to locate all of these in the database so that they can award credit if warranted. This may take some time for them to figure out as it is not as simple as it might seem. The fact that some of the records are in the archive and some in the live data complicates the problem.

Also the problem should begin to go away soon. On another thread they announced that they intend to reduce the run-length of the WU and they are going to give them more time to run. As those changes to the WUs begin to feed into the work the problem should dissolve.

That said, what they would like to do is get all the records into the archive where they can work on all of them at once. As you can imagine this may take a while because all the WUs that are out there and being sent out again after failing, have to be returned complete or Maxed out on resends.

I think if we are all patient the credit will be awarded; it just may take some time.

Regards
Phil
ID: 9615 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org