Message boards : Number crunching : Maximum CPU time Exceeded...How about some granted credit!
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
|
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
I don't like to kick this post, but also I hate being taken for a fool.
Why do you think you are being taken for a fool? I don't understand. I am assuming they have not had time to respond, which one may criticize as not giving this issue enough priority, but I would not assume that I am a fool. :) Regards, Bob P. |
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
|
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
I don't like to kick this post, but also I hate being taken for a fool.
Well, if you remove something like a WU from someone's results list, I assume that's reason enough to send that person a message or post something on the forum, assuming this is happening to a lot of crunchers. And looking at the response time to other replies, a day should be enough. Besides, it's not the amount of credits but the fact that it is happening at all that disturbs me. |
Moderator7 Volunteer moderator Send message Joined: 27 Dec 05 Posts: 10 Credit: 0 RAC: 0 |
I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done. (Servers have been down a couple of times today.) I have not seen the script he's going to be running, so I don't know exactly what is covered. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
I don't like to kick this post, but also I hate being taken for a fool.
At some point ALL of your WUs will be removed from your list. That is the way they keep the database lean. How old was the WU you lost? If it was more than a week or two old, it would be removed in the normal course of running the project. But don't despair! They have them ALL offline, and they will eventually fix the problem. While it might seem you have been ignored, you have not. There are a lot of users with a number of issues to be answered. What you will eventually discover is that this project is the most responsive of all the BOINC projects to the needs of the user community. I don't mean to slight E@H with that comment, because it is really hard to pick which of the two is better, but the point is they will get this taken care of in due course. Just calm down and give them a few days to take a look at the problem. The few credits that you are waiting for will not make much difference in the big picture. As "The Gas Giant" pointed out, he, I, and others have a few thousand credits each at stake. This is supposed to be fun! This is more about lost computing time that could have been put to better use than about lost credits. What the discussion is about is fixing a problem in the application to make it a better science project. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done. (Servers have been down a couple of times today.) I have not seen the script he's going to be running, so I don't know exactly what is covered.
David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days. |
Divide Overflow Send message Joined: 17 Sep 05 Posts: 82 Credit: 921,382 RAC: 0 |
David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days.
Another example of why I respect the management of this project so much. Thanks for the follow-through! |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days.
Hear, hear!! Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
I don't know whether I've had more time-exceeding WUs; I just came across this one, so I knew about it. People were talking about crediting on Monday, but suddenly the WU was gone without crediting. Of course the lost time is more important than the credits; as I stated somewhere else, I only chose a medical project in which I want to participate, but if (to me) strange things like this happen, I get a bit ??? We'll see what happens next. |
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done.
Ah, but the credit granted was not for max-cpu-time-exceeded. We have a major problem here. BOINC/Rosetta is not capable of handling some of the versions of WUs that have been released when the CPU time exceeds the estimated time by something like 20%. Under normal circumstances of BOINC operation, a WU hitting this limit is a regular occurrence. David, something needs to be done about this. I have confirmed that if I manually alter the DCF, a WU that has an extended completion time does complete normally. Maybe for these WUs you need to increase the number of estimated flops and iops. I know of two fairly large crunchers who have left this project because of this issue and the lost credit. Paul. |
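Paul's point about the flops estimate can be made concrete. The sketch below is not the actual BOINC client code; the field names `rsc_fpops_est` and `rsc_fpops_bound` mirror real BOINC workunit fields, but the arithmetic and the numbers are illustrative assumptions about how a CPU-time ceiling can be derived from them.

```python
# Simplified sketch (NOT the real BOINC client) of a "maximum CPU time
# exceeded" abort.  A workunit carries an estimate of its floating-point
# work (rsc_fpops_est) and a hard bound (rsc_fpops_bound); dividing the
# bound by the host's benchmarked speed gives a CPU-time ceiling.

def max_cpu_seconds(rsc_fpops_bound: float, host_flops: float) -> float:
    """CPU-time ceiling: the flops bound divided by host speed."""
    return rsc_fpops_bound / host_flops

def should_abort(cpu_time: float, rsc_fpops_bound: float,
                 host_flops: float) -> bool:
    """True when the task has burned more CPU time than the ceiling."""
    return cpu_time > max_cpu_seconds(rsc_fpops_bound, host_flops)

flops = 1e9                 # hypothetical 1 GFLOPS host benchmark
bound = 2e13                # bound set at 2x a 1e13-flop estimate
limit = max_cpu_seconds(bound, flops)   # 20,000 s ceiling

# A long-running WU needing 25% more work than estimated survives a
# generous 2x bound ...
assert not should_abort(1.25e13 / flops, bound, flops)
# ... but is killed if the bound sits only ~20% above the estimate,
# which is the failure mode described in this thread:
assert should_abort(1.25e13 / flops, 1.2e13, flops)
```

This also shows why raising the estimate (or the bound) for the long WU types, as Paul suggests, gives the headroom needed for them to finish.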
Moderator7 Volunteer moderator Send message Joined: 27 Dec 05 Posts: 10 Credit: 0 RAC: 0 |
David, something needs to be done about this.
I have asked again for specifics... I don't expect an answer today, but probably tomorrow. Handling the "cpu time exceeded" cases is likely to be more difficult than ones where the original issue date is known; the error message is not readily accessible. I don't know if this _can_ be done; it may be that to grant credit for these, credit would have to be given for every failed WU regardless of reason... and I don't know how big a problem that could cause. Someone WILL give more info as soon as it's available. Be sure to look at the text file for credits and not just the results web page. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Ah, but the credit granted was not for max-cpu-time-exceeded. We have a major problem here. BOINC/Rosetta is not capable of handling some of the versions of WUs that have been released when the CPU time exceeds the estimated time by something like 20%. Under normal circumstances of BOINC operation, a WU hitting this limit is a regular occurrence.
I can confirm Paul's solution. If the DCF is increased to a sufficiently high value, all of the WUs will complete successfully. I would also add that the problem is made worse by the way R@H increments the values for WU progress. All other projects increment the CPU time and percent complete at the same time; BOINC then uses these values to calculate the time remaining. R@H does not increment the percent complete except at checkpoints (jumping 10% at a time). This causes the time remaining to rise as the WU progresses, and then suddenly drop by what BOINC calculates to be 10% of the time remaining when the percent complete jumps. This works fine early in the processing, but towards the end it can cause the time remaining to drop below the amount required to complete the WU, or even to zero out or go negative. When this happens the WU will fail. This usually occurs around 80-90 percent complete; on my system it is right at the jump point, and usually above 90%. One solution would be to increment the percent complete even if it has no direct connection to the actual completion time, to prevent the time to completion from rising and throwing off the calculation of what is actually 10% of the time for the WU. There is no need to change the actual checkpoints to do this. This would fit the BOINC model and possibly fix the problem. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
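Phil's description of the climbing time-remaining estimate can be reproduced with a toy model. The estimator below, elapsed × (1 − fraction_done) / fraction_done, is an assumed simplification of what the BOINC client of that era displays, not its exact code; the point is only how a fraction_done that freezes between 10% checkpoints distorts it.

```python
# Toy model of BOINC's "time remaining" display when progress is only
# reported in 10% chunks.  Between checkpoints fraction_done is stale,
# so the estimate climbs; at each checkpoint it snaps back down.

def time_remaining(elapsed: float, fraction_done: float) -> float:
    """Assumed estimator: scale elapsed time by the work still to do."""
    return elapsed * (1.0 - fraction_done) / fraction_done

# A WU that really needs 10,000 s, checkpointing every 1,000 s:
assert abs(time_remaining(1000, 0.10) - 9000.0) < 1e-6   # accurate at a checkpoint
assert abs(time_remaining(1900, 0.10) - 17100.0) < 1e-6  # inflated just before the next
assert abs(time_remaining(2000, 0.20) - 8000.0) < 1e-6   # snaps back at the 20% jump
```

The inflated mid-interval estimate, followed by a sudden drop, is exactly the sawtooth Phil describes; near the end of the run the drops can undershoot the time actually still needed.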
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
David, something needs to be done about this.
I checked the 4.2 MB text file and found I received about 120 credits (one nice WU of 111, the rest being the small variety). Only another ~1900 to go... lol! It would help if BOINC increased the DCF when a WU errored out on max_cpu_time_exceeded. I also understand why the limit is there, since I had to stop and restart BOINC yesterday morning just prior to leaving for work because I had a WU stuck at 1% for 4 hours. I couldn't get the stdout info as I was running a little late. I lost 4 hours of CPU time, but at least the WU then completed OK. So overall there are two problems:
1. WUs get stuck at 1%.
2. WUs progress OK but are longer than typical and error out due to max_cpu_time_exceeded, though they would have completed if left to run.
So if we get rid of problem #1, we can relax the settings that cause #2. Paul. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
So overall there are two problems:
You correctly note that the programmed solution for the 1% problem is in fact causing the "max_cpu_time_exceeded" errors. So the project really needs to decide what has more impact at this point. Most people recognize stuck WUs and act to intervene, except on crunching farms where many systems are not watched very often. But it is really three problems working against each other. Since R@H only runs well if kept in memory during application swaps, because of the shortage of checkpoints during processing, the project team has decided to solve the 1% problem through extreme measures with a hard-coded abort. If the application could be cleared from memory between swaps, this would in effect force a restart of a stuck WU, and it would possibly then run to completion. The "max time" failures provide no warning that anything is amiss until they suddenly fail after many hours of work. Usually this is followed by more failures. But the problem is more complex than simply not enough time to finish. The way R@H does its progress monitoring aggravates all of this. Because it does not do checkpoints throughout the WU run time, the percent complete moves in chunks, which messes with the BOINC status-keeping functions. Before the project implemented the 1% fix I had never seen a max time error. In some rare cases a WU might run as long as 25 or even 30 hours; I had a few of these. Now (on my systems) if they run longer than about 4 hours and 15 minutes I can expect them to fail on a max time error, unless I make a manual adjustment to the DCF periodically. The correct solution will have to take all three of these elements into consideration. The 1% fix should check not only the CPU time but the percent complete as well. Only if a WU runs for some significant period of time without any change in percent complete should the system act. That in and of itself might fix both issues.
Some folks have said that the variation in WU size is the cause of all of this. Projects like E@H have fairly large WUs, like R@H, but they are all about equal in size for a particular WU type; when the type changes, the WU size also changes. These problems do not exist on any of the other projects. I for one do not think the variation in WU size is at the bottom of the problem; it simply brings the issue to the surface. But to be certain, the R@H application should be doing some kind of incremental movement of the percent complete all through the processing, even if it actually only checkpoints at 10% intervals; the BOINC client expects this. Failing that approach, the system could take a measure of how many CPU seconds it takes to process the first 10% of the WU and deduct that amount from the time remaining at each 10% jump. This would make each 10% decrement of the time remaining equal, as it is in the real world. The way it is now, the early reductions in the time remaining are significantly larger than those near the end of processing, because the time remaining is always increasing during processing. I just hope the project folks are seeing the same stuff we are. In any case, all of these failures are bugs in the system, and I do not believe they can be traced back to problems at the user end of the pipe. This is what makes the awarding of credit for these WUs appropriate. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
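The fix Phil proposes, aborting only when the CPU limit is exceeded AND the percent complete has stalled, could be sketched as follows. The class, the two-hour stall window, and the method names are all hypothetical illustrations, not anything from the actual BOINC or Rosetta code.

```python
# Hypothetical watchdog implementing Phil's proposal: a WU is aborted
# only if it is BOTH over its CPU-time limit AND has shown no change in
# percent complete for a stall window.  A hung "1%" task trips both
# conditions; a slow but progressing task trips neither.

STALL_WINDOW = 2 * 3600.0   # assumed: 2 h without progress counts as stuck

class Watchdog:
    def __init__(self, cpu_limit: float):
        self.cpu_limit = cpu_limit
        self.last_fraction = 0.0
        self.last_change_cpu = 0.0

    def should_abort(self, cpu_time: float, fraction_done: float) -> bool:
        if fraction_done > self.last_fraction:
            # Progress was made: remember it and reset the stall clock.
            self.last_fraction = fraction_done
            self.last_change_cpu = cpu_time
        stalled = (cpu_time - self.last_change_cpu) > STALL_WINDOW
        return cpu_time > self.cpu_limit and stalled

# A WU past its 4 h limit but still advancing is left to finish:
w = Watchdog(cpu_limit=4 * 3600.0)
assert not w.should_abort(5 * 3600.0, 0.90)

# A WU frozen at 1% is killed once it is over the limit and stalled:
w2 = Watchdog(cpu_limit=4 * 3600.0)
assert not w2.should_abort(1 * 3600.0, 0.01)   # under limit: just record
assert w2.should_abort(5 * 3600.0, 0.01)       # over limit, no progress for 4 h
```

This is exactly the "check the percent complete as well" idea: the hard CPU ceiling stays as a backstop, but a legitimately long WU that keeps moving is no longer executed for the crime of being slow.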
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
The correct solution will have to take all three of these elements into consideration. The 1% fix should check not only the CPU time but the percent complete as well. Only if a WU runs for some significant period of time without any change in percent complete should the system act. That in and of itself might fix both issues.
The problem here is that BOINC provides a "maximum CPU time" field, but not a "maximum time without a change in percent complete" field - this requires a change to the application itself. Ideally, the root cause of the "hanging" can be found, rather than putting on another patch to terminate the WU early.
Some folks have said that the variation in WU size is the cause of all of this. Projects like E@H have fairly large WUs, like R@H, but they are all about equal in size for a particular WU type; when the type changes, the WU size also changes. These problems do not exist on any of the other projects. I for one do not think the variation in WU size is at the bottom of the problem; it simply brings the issue to the surface.
I believe the problem is actually that _all_ the WUs from Rosetta contain the same "estimated number of flops" or "estimated time"... Einstein, for example, with the new Albert app and varying WU run-length, is (after an initial failure to do so) varying the _estimate_, and this maintains a more "reasonable" DCF. Once Rosetta Alpha is available to calculate average run times for the different types of WUs, this will be easier to do. Rosetta is pretty unique in having so _many_ different types of WUs - Einstein went from one to something like 6 or 7, SETI has pretty much one unless the data is "noisy", Predictor changed every few weeks; Rosetta can have a dozen varieties being issued all at one time. Also, this is where "flops-counting" can potentially solve yet another problem... (hint!)
But to be certain, the R@H application should be doing some kind of incremental movement of the percent complete all through the processing, even if it actually only checkpoints at 10% intervals; the BOINC client expects this. Failing that approach, the system could take a measure of how many CPU seconds it takes to process the first 10% of the WU and deduct that amount from the time remaining at each 10% jump. This would make each 10% decrement of the time remaining equal, as it is in the real world. The way it is now, the early reductions in the time remaining are significantly larger than those near the end of processing, because the time remaining is always increasing during processing. I just hope the project folks are seeing the same stuff we are.
I see ways to "improve" the % complete figure a bit without major changes; for example, if the ab initio stage normally takes 10% of the total run time for a structure, then there could at least be an 11%, 21%, 31% figure. And as we saw with the "default_xxxx_205" WUs (which had 1000 structures instead of 10), the increment _could_ be much smaller, 0.1%, if we could tolerate a 100x increase in run time... but it would be a big improvement even if, say, halfway through the "relax" portion, the % complete was bumped by 5%. Rosetta, by the way, is _far_ from the "worst" on the % complete issue... and it _will_ be very difficult to come up with a way to report _very_ frequently, as SETI and Einstein do, just because of the nature of the work being done.
In any case all of these failures are bugs in the system, and I do not believe they can be traced back to problems at the user end of the pipe. This is what makes the awarding of credit for these WUs appropriate.
Agreed - which is why _I_ am so glad that the project staff is doing what they can to award these credits where possible.
(Please note; my PC was down with a dead power supply for a week, right in the middle of the 'problem WU' period, so I personally was not affected much; maybe 2 or 3 credits worth, not the 1000's others were.) |
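Tern's suggestion of bumping the percent complete inside each structure amounts to something like the following. The stage shares and the function are invented for illustration; in a real BOINC application the computed value would be passed to `boinc_fraction_done()`.

```python
# Illustrative finer-grained progress for a WU that computes 10 protein
# structures, checkpointing (and today, reporting) only at 10% steps.
# The stage split inside a structure is an assumption, not Rosetta's real
# timing: e.g. ab initio ~10% of a structure's run, relax the rest.

N_STRUCTURES = 10          # checkpoints land at 10% intervals
ABINITIO_SHARE = 0.1       # assumed share of one structure's run time

def fraction_done(structures_done: int, stage_frac: float) -> float:
    """Overall progress: completed structures plus partial credit for
    the stage fraction finished inside the current structure."""
    return (structures_done + stage_frac) / N_STRUCTURES

# Instead of sitting at 30% for the whole 4th structure, the app could
# report 31% after its ab initio stage and 35% halfway through relax:
assert fraction_done(3, 0.0) == 0.30
assert abs(fraction_done(3, ABINITIO_SHARE) - 0.31) < 1e-9
assert abs(fraction_done(3, 0.5) - 0.35) < 1e-9
```

Even this coarse interpolation keeps the reported fraction moving between checkpoints, which is all the BOINC client's time-remaining estimate needs to stay sane; the actual checkpoint positions never change.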
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Agreed - which is why _I_ am so glad that the project staff is doing what they can to award these credits where possible. (Please note: my PC was down with a dead power supply for a week, right in the middle of the 'problem WU' period, so I personally was not affected much; maybe 2 or 3 credits' worth, not the 1000's others were.)
Thanks for the note, Bill. Also, please understand I am not criticizing the team; I am trying to offer observations and ideas for them to think about in reaching a solution. Everyone really needs to pull together to get this application to stand up and run. There are a lot of cures waiting to be found. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
So it looks as though no credit will be issued for a problem caused by Rosetta that has resulted in a lot of wasted CPU time and wasted effort by our machines. Thanks for telling us, guys! Live long and crunch. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
So it looks as though no credit will be issued for a problem caused by Rosetta that has resulted in a lot of wasted CPU time and wasted effort by our machines. Thanks for telling us, guys!
Gas, that has not been decided yet. Right now they are trying to figure out how to locate all of these in the database so that they can award credit if warranted. This may take some time to figure out, as it is not as simple as it might seem. The fact that some of the records are in the archive and some in the live data complicates the problem. Also, the problem should begin to go away soon: on another thread they announced that they intend to reduce the run-length of the WUs and to give them more time to run. As those changes feed into the work, the problem should dissolve. That said, what they would like to do is get all the records into the archive, where they can work on all of them at once. As you can imagine, this may take a while, because all the WUs that are out there, including those being sent out again after failing, have to be returned complete or maxed out on resends. I think if we are all patient the credit will be awarded; it just may take some time. Regards Phil |
©2024 University of Washington
https://www.bakerlab.org