Message boards : Number crunching : Maximum CPU time Exceeded...How about some granted credit!
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
|
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
I don't like to kick this post, but also I hate being taken for a fool.
Why do you think you are being taken for a fool? I don't understand. I am assuming they have not had time to respond, which one may criticize as not giving this issue enough priority, but I would not assume that I am a fool. :) Regards, Bob P. |
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
|
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
I don't like to kick this post, but also I hate being taken for a fool.
Well, if you remove something like a WU from someone's results list, I assume that's reason enough to send that person a message or post something on the forum, assuming this is happening to a lot of crunchers. And looking at the response time to other replies, a day should be enough. Besides, it's not the amount of credits but the fact that it is happening at all that disturbs me. |
Moderator7 Volunteer moderator Send message Joined: 27 Dec 05 Posts: 10 Credit: 0 RAC: 0 |
I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done. (Servers have been down a couple of times today.) I have not seen the script he's going to be running, so I don't know exactly what is covered. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
I don't like to kick this post, but also I hate being taken for a fool.
At some point ALL of your WUs will be removed from your list. That is the way they keep the database lean. How old was the WU you lost? If it was more than a week or two old, it would be removed in the normal course of running the project. But don't despair! They have them ALL offline, and they will eventually fix the problem. While it might seem you have been ignored, you have not. There are a lot of users with a number of issues to be answered. What you will eventually discover is that this project is the most responsive of all the BOINC projects to the needs of the user community. I don't mean to slight E@H with that comment, because it is really hard to pick which of the two is better, but the point is they will get this taken care of in due course. Just calm down and give them a few days to take a look at the problem. The few credits that you are waiting for will not make much difference in the big picture. As "The Gas Giant" pointed out, he, I, and others have a few thousand credits each at stake. This is supposed to be fun! This is more about lost computing time that could have been put to better use than about lost credits. What the discussion is about is fixing a problem in the application to make it a better science project. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done. (Servers have been down a couple of times today.) I have not seen the script he's going to be running, so I don't know exactly what is covered.
David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days. |
Divide Overflow Send message Joined: 17 Sep 05 Posts: 82 Credit: 921,382 RAC: 0 |
David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days.
Another example of why I respect the management of this project so much. Thanks for the follow-through! |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
David has just finished awarding credits to recently returned jobs, and will have gone through all of the archived jobs within the next two days.
Hear, hear!! Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
I don't know whether I've had more time-exceeding WUs; I just came across this one, so I knew about it. People were talking about crediting on Monday, but suddenly the WU was gone without crediting. Of course the lost time is more important than the credits; as I stated somewhere else, I only chose a medical project in which I want to participate, but if (to me) strange things like this happen, I get a bit ??? We'll see what happens next. |
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
I just asked David Kim about the whole credits issue, and he said he has had "backend stuff" that has tied him up, that he would try to do something today if possible, and would post when he was done.
Ah, but the credit granted was not for max-cpu-time-exceeded. We have a major problem here. BOINC/Rosetta is not capable of handling some of the versions of WUs that have been released when the CPU time exceeds the estimated time by something like 20%. Under normal circumstances of BOINC operation, a WU hitting this limit is a regular occurrence. David, something needs to be done about this. I have confirmed that if I manually alter the DCF, a WU that has an extended completion time does complete normally. Maybe for these WUs you need to increase the number of estimated flops and iops. I know of two fairly large crunchers who have left this project because of this issue and the lost credit. Paul. |
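Paul's point about the flops estimate can be made concrete. The sketch below is not the actual BOINC client code; the field names `rsc_fpops_est` and `rsc_fpops_bound` mirror real BOINC workunit fields, but the arithmetic and the numbers are illustrative assumptions about how a CPU-time ceiling can be derived from them.

```python
# Simplified sketch (NOT the real BOINC client) of a "maximum CPU time
# exceeded" abort.  A workunit carries an estimate of its floating-point
# work (rsc_fpops_est) and a hard bound (rsc_fpops_bound); dividing the
# bound by the host's benchmarked speed gives a CPU-time ceiling.

def max_cpu_seconds(rsc_fpops_bound: float, host_flops: float) -> float:
    """CPU-time ceiling: the flops bound divided by host speed."""
    return rsc_fpops_bound / host_flops

def should_abort(cpu_time: float, rsc_fpops_bound: float,
                 host_flops: float) -> bool:
    """True when the task has burned more CPU time than the ceiling."""
    return cpu_time > max_cpu_seconds(rsc_fpops_bound, host_flops)

flops = 1e9                 # hypothetical 1 GFLOPS host benchmark
bound = 2e13                # bound set at 2x a 1e13-flop estimate
limit = max_cpu_seconds(bound, flops)   # 20,000 s ceiling

# A long-running WU needing 25% more work than estimated survives a
# generous 2x bound ...
assert not should_abort(1.25e13 / flops, bound, flops)
# ... but is killed if the bound sits only ~20% above the estimate,
# which is the failure mode described in this thread:
assert should_abort(1.25e13 / flops, 1.2e13, flops)
```

This also shows why raising the estimate (or the bound) for the long WU types, as Paul suggests, gives the headroom needed for them to finish.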
Moderator7 Volunteer moderator Send message Joined: 27 Dec 05 Posts: 10 Credit: 0 RAC: 0 |
David, something needs to be done about this.
I have asked again for specifics... I don't expect an answer today, but probably tomorrow. Handling the "cpu time exceeded" cases is likely to be more difficult than ones where the original issue date is known; the error message is not readily accessible. I don't know if this _can_ be done; it may be that to grant credit for these, credit would have to be given for every failed WU regardless of reason... and I don't know how big a problem that could cause. Someone WILL give more info as soon as it's available. Be sure to look at the text file for credits and not just the results web page. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Ah, but the credit granted was not for max-cpu-time-exceeded. We have a major problem here. BOINC/Rosetta is not capable of handling some of the versions of WUs that have been released when the CPU time exceeds the estimated time by something like 20%. Under normal circumstances of BOINC operation, a WU hitting this limit is a regular occurrence.
I can confirm Paul's solution. If the DCF is increased to a sufficiently high value, all of the WUs will complete successfully. I would also add that the problem is made worse by the way R@H increments the values for WU progress. All other projects increment the CPU time and percent complete at the same time; BOINC then uses these values to calculate the time remaining. R@H does not increment the percent complete except at checkpoints (jumping 10% at a time). This causes the time remaining to rise as the WU progresses, and then suddenly drop by what BOINC calculates to be 10% of the time remaining when the percent complete jumps. This works fine early in the processing, but towards the end it can cause the time remaining to drop below the amount required to complete the WU, or even to zero out or go negative. When this happens the WU will fail. This usually occurs around 80-90 percent complete; on my system it is right at the jump point, and usually above 90%. One solution would be to increment the percent complete even if it has no direct connection to the actual completion time, to prevent the time to completion from rising and throwing off the calculation of what is actually 10% of the time for the WU. There is no need to change the actual checkpoints to do this. This would fit the BOINC model and possibly fix the problem. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
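Phil's description of the climbing time-remaining estimate can be reproduced with a toy model. The estimator below, elapsed × (1 − fraction_done) / fraction_done, is an assumed simplification of what the BOINC client of that era displays, not its exact code; the point is only how a fraction_done that freezes between 10% checkpoints distorts it.

```python
# Toy model of BOINC's "time remaining" display when progress is only
# reported in 10% chunks.  Between checkpoints fraction_done is stale,
# so the estimate climbs; at each checkpoint it snaps back down.

def time_remaining(elapsed: float, fraction_done: float) -> float:
    """Assumed estimator: scale elapsed time by the work still to do."""
    return elapsed * (1.0 - fraction_done) / fraction_done

# A WU that really needs 10,000 s, checkpointing every 1,000 s:
assert abs(time_remaining(1000, 0.10) - 9000.0) < 1e-6   # accurate at a checkpoint
assert abs(time_remaining(1900, 0.10) - 17100.0) < 1e-6  # inflated just before the next
assert abs(time_remaining(2000, 0.20) - 8000.0) < 1e-6   # snaps back at the 20% jump
```

The inflated mid-interval estimate, followed by a sudden drop, is exactly the sawtooth Phil describes; near the end of the run the drops can undershoot the time actually still needed.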
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
David, something needs to be done about this.
I checked the 4.2 MB text file and found I received about 120 credits (one nice WU of 111, the rest being the small variety). Only another ~1900 to go... lol! It would help if BOINC increased the DCF when a WU errored out on max_cpu_time_exceeded. I also understand why the limit is there, since I had to stop and restart BOINC yesterday morning just prior to leaving for work because I had a WU stuck at 1% for 4 hours. I couldn't get the stdout info as I was running a little late. I lost 4 hours of CPU time, but at least the WU then completed OK. So overall there are two problems:
1. WUs get stuck at 1%.
2. WUs progress OK but are longer than typical and error out due to max_cpu_time_exceeded, though they would have completed if left to run.
So if we get rid of problem #1, we can relax the settings that cause #2. Paul. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
So overall there are two problems:
You correctly note that the programmed solution for the 1% problem is in fact causing the "max_cpu_time_exceeded" errors. So the project really needs to decide what has more impact at this point. Most people recognize stuck WUs and act to intervene, except on crunching farms where many systems are not watched very often. But it is really three problems working against each other. Since R@H only runs well if kept in memory during application swaps, because of the shortage of checkpoints during processing, the project team has decided to solve the 1% problem through extreme measures with a hard-coded abort. If the application could be cleared from memory between swaps, this would in effect force a restart of a stuck WU, and it would possibly then run to completion. The "max time" failures provide no warning that anything is amiss until they suddenly fail after many hours of work. Usually this is followed by more failures. But the problem is more complex than simply not enough time to finish. The way R@H does its progress monitoring aggravates all of this. Because it does not do checkpoints throughout the WU run time, the percent complete moves in chunks, which messes with the BOINC status-keeping functions. Before the project implemented the 1% fix I had never seen a max time error. In some rare cases a WU might run as long as 25 or even 30 hours; I had a few of these. Now (on my systems) if they run longer than about 4 hours and 15 minutes I can expect them to fail on a max time error, unless I make a manual adjustment to the DCF periodically. The correct solution will have to take all three of these elements into consideration. The 1% fix should check not only the CPU time but the percent complete as well. Only if a WU runs for some significant period of time without any change in percent complete should the system act. That in and of itself might fix both issues.
Some folks have said that the variation in WU size is the cause of all of this. Projects like E@H have fairly large WUs, like R@H, but they are all about equal in size for a particular WU type; when the type changes, the WU size also changes. These problems do not exist on any of the other projects. I for one do not think the variation in WU size is at the bottom of the problem; it simply brings the issue to the surface. But to be certain, the R@H application should be doing some kind of incremental movement of the percent complete all through the processing, even if it actually only checkpoints at 10% intervals; the BOINC client expects this. Failing that approach, the system could take a measure of how many CPU seconds it takes to process the first 10% of the WU and deduct that amount from the time remaining at each 10% jump. This would make each 10% decrement of the time remaining equal, as it is in the real world. The way it is now, the early reductions in the time remaining are significantly larger than those near the end of processing, because the time remaining is always increasing during processing. I just hope the project folks are seeing the same stuff we are. In any case, all of these failures are bugs in the system, and I do not believe they can be traced back to problems at the user end of the pipe. This is what makes the awarding of credit for these WUs appropriate. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
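The fix Phil proposes, aborting only when the CPU limit is exceeded AND the percent complete has stalled, could be sketched as follows. The class, the two-hour stall window, and the method names are all hypothetical illustrations, not anything from the actual BOINC or Rosetta code.

```python
# Hypothetical watchdog implementing Phil's proposal: a WU is aborted
# only if it is BOTH over its CPU-time limit AND has shown no change in
# percent complete for a stall window.  A hung "1%" task trips both
# conditions; a slow but progressing task trips neither.

STALL_WINDOW = 2 * 3600.0   # assumed: 2 h without progress counts as stuck

class Watchdog:
    def __init__(self, cpu_limit: float):
        self.cpu_limit = cpu_limit
        self.last_fraction = 0.0
        self.last_change_cpu = 0.0

    def should_abort(self, cpu_time: float, fraction_done: float) -> bool:
        if fraction_done > self.last_fraction:
            # Progress was made: remember it and reset the stall clock.
            self.last_fraction = fraction_done
            self.last_change_cpu = cpu_time
        stalled = (cpu_time - self.last_change_cpu) > STALL_WINDOW
        return cpu_time > self.cpu_limit and stalled

# A WU past its 4 h limit but still advancing is left to finish:
w = Watchdog(cpu_limit=4 * 3600.0)
assert not w.should_abort(5 * 3600.0, 0.90)

# A WU frozen at 1% is killed once it is over the limit and stalled:
w2 = Watchdog(cpu_limit=4 * 3600.0)
assert not w2.should_abort(1 * 3600.0, 0.01)   # under limit: just record
assert w2.should_abort(5 * 3600.0, 0.01)       # over limit, no progress for 4 h
```

This is exactly the "check the percent complete as well" idea: the hard CPU ceiling stays as a backstop, but a legitimately long WU that keeps moving is no longer executed for the crime of being slow.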
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
The correct solution will have to take all three of these elements into consideration. The 1% fix should check not only the CPU time but the percent complete as well. Only if a WU runs for some significant period of time without any change in percent complete should the system act. That in and of itself might fix both issues.
The problem here is that BOINC provides a "maximum CPU time" field, but not a "maximum time without a change in percent complete" field - this requires a change to the application itself. Ideally, the root cause of the "hanging" can be found, rather than putting on another patch to terminate the WU early.
Some folks have said that the variation in WU size is the cause of all of this. Projects like E@H have fairly large WUs, like R@H, but they are all about equal in size for a particular WU type; when the type changes, the WU size also changes. These problems do not exist on any of the other projects. I for one do not think the variation in WU size is at the bottom of the problem; it simply brings the issue to the surface.
I believe the problem is actually that _all_ the WUs from Rosetta contain the same "estimated number of flops" or "estimated time"... Einstein, for example, with the new Albert app and varying WU run-length, is (after an initial failure to do so) varying the _estimate_, and this maintains a more "reasonable" DCF. Once Rosetta Alpha is available to calculate average run times for the different types of WUs, this will be easier to do. Rosetta is pretty unique in having so _many_ different types of WUs - Einstein went from one to something like 6 or 7, SETI has pretty much one unless the data is "noisy", Predictor changed every few weeks; Rosetta can have a dozen varieties being issued all at one time. Also, this is where "flops-counting" can potentially solve yet another problem... (hint!)
But to be certain, the R@H application should be doing some kind of incremental movement of the percent complete all through the processing, even if it actually only checkpoints at 10% intervals; the BOINC client expects this. Failing that approach, the system could take a measure of how many CPU seconds it takes to process the first 10% of the WU and deduct that amount from the time remaining at each 10% jump. This would make each 10% decrement of the time remaining equal, as it is in the real world. The way it is now, the early reductions in the time remaining are significantly larger than those near the end of processing, because the time remaining is always increasing during processing. I just hope the project folks are seeing the same stuff we are.
I see ways to "improve" the % complete figure a bit without major changes; for example, if the ab initio stage normally takes 10% of the total run time for a structure, then there could at least be an 11%, 21%, 31% figure. And as we saw with the "default_xxxx_205" WUs (which had 1000 structures instead of 10), the increment _could_ be much smaller, 0.1%, if we could tolerate a 100x increase in run time... but it would be a big improvement even if, say, halfway through the "relax" portion, the % complete was bumped by 5%. Rosetta, by the way, is _far_ from the "worst" on the % complete issue... and it _will_ be very difficult to come up with a way to report _very_ frequently, as SETI and Einstein do, just because of the nature of the work being done.
In any case all of these failures are bugs in the system, and I do not believe they can be traced back to problems at the user end of the pipe. This is what makes the awarding of credit for these WUs appropriate.
Agreed - which is why _I_ am so glad that the project staff is doing what they can to award these credits where possible.
(Please note; my PC was down with a dead power supply for a week, right in the middle of the 'problem WU' period, so I personally was not affected much; maybe 2 or 3 credits worth, not the 1000's others were.) |
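Tern's suggestion of bumping the percent complete inside each structure amounts to something like the following. The stage shares and the function are invented for illustration; in a real BOINC application the computed value would be passed to `boinc_fraction_done()`.

```python
# Illustrative finer-grained progress for a WU that computes 10 protein
# structures, checkpointing (and today, reporting) only at 10% steps.
# The stage split inside a structure is an assumption, not Rosetta's real
# timing: e.g. ab initio ~10% of a structure's run, relax the rest.

N_STRUCTURES = 10          # checkpoints land at 10% intervals
ABINITIO_SHARE = 0.1       # assumed share of one structure's run time

def fraction_done(structures_done: int, stage_frac: float) -> float:
    """Overall progress: completed structures plus partial credit for
    the stage fraction finished inside the current structure."""
    return (structures_done + stage_frac) / N_STRUCTURES

# Instead of sitting at 30% for the whole 4th structure, the app could
# report 31% after its ab initio stage and 35% halfway through relax:
assert fraction_done(3, 0.0) == 0.30
assert abs(fraction_done(3, ABINITIO_SHARE) - 0.31) < 1e-9
assert abs(fraction_done(3, 0.5) - 0.35) < 1e-9
```

Even this coarse interpolation keeps the reported fraction moving between checkpoints, which is all the BOINC client's time-remaining estimate needs to stay sane; the actual checkpoint positions never change.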
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Agreed - which is why _I_ am so glad that the project staff is doing what they can to award these credits where possible. (Please note: my PC was down with a dead power supply for a week, right in the middle of the 'problem WU' period, so I personally was not affected much; maybe 2 or 3 credits' worth, not the 1000's others were.)
Thanks for the note, Bill. Also, please understand I am not criticizing the team; I am trying to offer observations and ideas for them to think about in reaching a solution. Everyone really needs to pull together to get this application to stand up and run. There are a lot of cures waiting to be found. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
So it looks as though no credit will be issued for a problem caused by Rosetta that has resulted in a lot of wasted CPU time and wasted effort by our machines. Thanks for telling us, guys! Live long and crunch. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
So it looks as though no credit will be issued for a problem caused by Rosetta that has resulted in a lot of wasted CPU time and wasted effort by our machines. Thanks for telling us, guys!
Gas, that has not been decided yet. Right now they are trying to figure out how to locate all of these in the database so that they can award credit if warranted. This may take some time to figure out, as it is not as simple as it might seem. The fact that some of the records are in the archive and some in the live data complicates the problem. Also, the problem should begin to go away soon: on another thread they announced that they intend to reduce the run-length of the WUs and to give them more time to run. As those changes feed into the work, the problem should dissolve. That said, what they would like to do is get all the records into the archive, where they can work on all of them at once. As you can imagine, this may take a while, because all the WUs that are out there, including those being sent out again after failing, have to be returned complete or maxed out on resends. I think if we are all patient the credit will be awarded; it just may take some time. Regards Phil |
©2024 University of Washington
https://www.bakerlab.org