Message boards : Number crunching : Four kinds of errors
Author | Message |
---|---|
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I thought it would be useful to summarise some key points from many other threads. We have recently been seeing four different kinds of errors.

No Progress error

This is where the cpu time for a result accumulates but the result never shows more than 1% progress. The project programmers hope they have fixed this, so it is vitally important to say if you see this error again! Don't report them in this thread; please use the Report stuck here thread. Please make it clear that the progress is stuck but the clock is running OK.

Long job error

Believed to affect only WU with names DEFAULT_xxxxx_205_xxxxx. Certainly affects all such WU. Early symptoms are that you see fractional % points in the progress box (e.g. 1.3%), combined with an unbelievably long projected time to completion. If left, these jobs will self-destruct after a *long* time (~11 hours on a fast box, well over a day on a slow one).

Official advice is to abort these, preferably before they run or as soon as you notice them. My unofficial advice is to suspend them so that they don't get passed to someone else. Do the suspend in the Work tab, not in the project tab, as that way Rosetta is free to run other work. Eventually, once the staff come back from their hols, they will disable the jobs on the server and *then* you can resume and abort these jobs.

The project team intend to give everyone full credit for the time spent on these jobs, whether you abort or whether they run to self-destruct. You will not get this credit right away; the team have to figure out how to do it first, so you will hopefully have a nice boost to your stats one day in Jan. They have said that they will be able to make 'some use' of the results, so the time is not totally wasted scientifically either :-)

There is no need to report these - everything will be handled automatically once the job is aborted.

Short job error

Job starts OK but self-destructs after a few seconds or a few minutes. There is nothing to do about these - BOINC should automatically get more work. Some users have experienced chains of these jobs - if you have a cache it might be worth resetting the project, which will remove everything from your cache. However, I have had only limited success with this and don't know if it helped.

It is noticeable that some machines suffer more than others - is this just luck, or does one such job poison the box for the next? One suggestion (thanks PoorBoy) is that too many short jobs affect the data held on total runtime & benchmarking.

There is no need to report these. (Tell us all if I've got this wrong, Bill!)

Clock Stops error

CPU time stops increasing. Please note that if this happens for a minute or so and then it restarts, this is normal (meaning that Rosetta needed something on disk and the disks had been powered down to save energy). If it is stuck for more than 5 min you have been hit with this error.

It is a particularly nasty error from the user perspective as there seems to be no recovery. The job will not get cancelled for exceeding its cpu time, as this value has stopped increasing. Therefore it will help if you check your boxes more often than usual. My personal suggestion is to restart BOINC, except on win98 & winME where I suggest a reboot. Otherwise, in my experience the problem just repeats. Your mileage may vary!

There are other ways to look out for this one - in all cases don't treat it as this error until the symptoms last ~5 min.

On Win2000 and winXP you can use task manager to identify this - ctrl-alt-del, select task manager, select the processes tab, click two times on the CPU column heading. (Not a double click, two single clicks with a gap between.) The processes will be sorted so that the ones using most CPU are at the top. This should be Rosetta. On a multi cpu box you should have as many Rosettas as you have cpus (or perhaps a Rosetta and an Einstein, or whatever).

On Linux you can identify this error using top. (From the GUI open a terminal window first.) Enter the top command, and the processes are shown with the ones using most cpu at the top. This should be Rosetta. On a multi cpu box you should have as many Rosettas as you have cpus (or perhaps a Rosetta and an Einstein, or whatever).

If you use BOINCview it is particularly easy to see this problem - go to the Work tab and the running task will be highlighted in yellow instead of the normal green. Tip: if BOINCview is watching several boxes, sort the work by status to get all the running tasks together on the screen. By the way, if you leave BV open for long periods to monitor this, please change the refresh interval for each location to something like 2 min. BV is very hungry on cpu time at its standard 5 sec refresh rate.

The project have not said anything about giving credit for these jobs. They are not realistically going to be able to give credit for the time after the clock stopped, but I personally think it would be a nice gesture if they did give the credit claimed for the time shown on the stopped clock. Perhaps the best thing would be to give full credit for all aborted jobs from 20th? Dec to the date when all these four issues are resolved? I hope credit is not the main thing that attracts anyone here, but most of us do like to see our numbers going up. I know I do!

I'd suggest you report these jobs in the same Report stuck here thread, making it clear that it is the clock that has stuck as well as the progress.

Long Term

In the long term these sorts of problem will be screened out by the proposed "Ralph" team (Rosetta alpha), where those of us who enjoy a challenge will try out new WU and new apps before the majority, who prefer to be able to donate cpu cycles without it becoming a hands-on job. This cluster of several errors should therefore be a one-off on the mainstream project.

Happy New Year. |
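The difference between the "No Progress" and "Clock Stops" symptoms in the post above comes down to which counter has stopped moving. The following is only a rough sketch of that distinction, assuming you note down the CPU time and progress of a task twice, at least five minutes apart. The `Sample` type and `classify` function are made-up names for illustration and are not part of BOINC or Rosetta.

```python
from dataclasses import dataclass

STUCK_SECONDS = 5 * 60  # the post's "~5 min" rule of thumb


@dataclass
class Sample:
    wall_time: float  # seconds on the wall clock when the sample was taken
    cpu_time: float   # CPU seconds the task reports
    progress: float   # percent complete, 0-100


def classify(earlier: Sample, later: Sample) -> str:
    """Compare two samples of one running task and name the symptom, if any."""
    if later.wall_time - earlier.wall_time < STUCK_SECONDS:
        # Short pauses (e.g. a disk spinning up) are normal, so do not judge yet.
        return "too soon to tell"
    if later.cpu_time <= earlier.cpu_time:
        return "clock stopped"
    if later.progress <= earlier.progress and later.progress <= 1.0:
        # Clock is running, but the task never got past 1%.
        return "no progress"
    # Progress only jumps occasionally, so a flat reading above 1% is not
    # treated as an error in this sketch.
    return "ok"
```

For example, two samples ten minutes apart with the same CPU time would come back as "clock stopped", while a growing CPU time with progress still at 1% would come back as "no progress".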
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
My own personal opinion - River has the "No progress" and "long job" sections exactly right; I have some issues with the "short job" section, and I think the section on "clock stops error" is wrong.

"Clock Stops" - If anyone has seen this on any platform _other_ than Windows 95/98/ME, please advise; every case I've heard about has been on one of those OS versions, which ARE NOT SUPPORTED by Rosetta. They may work, but anyone using them is "on their own". It is a KNOWN PROBLEM that these OS versions sometimes request 0 credits (on _any_ project) - and there is no way on Rosetta to grant credit on them.

"Short Jobs" - I believe the "short WU poisoning" effect is not causing more computation errors, but instead may be causing (on Win9x) the "clock stops" problem on the following WU. I think the issue PoorBoy raised was that a series of short WUs will cause your Duration Correction Factor to be lowered, which _may_ cause a "CPU_time_exceeded" error on the next long-running 'normal' WU... I think at _this_ point, there is little need to clear your cache. Most of these are gone by now, unless you have a very large cache that would have days-old data in it. And even then, the newer data in your cache is probably good.

All in all, the basic original recommendation remains: if you get a DEFAULT_xxxxx_205 WU (and please note ONLY the ones that start with DEFAULT and have a "205" for the batch number, NO others are affected) you should suspend (my preference also) or abort it. If you have one of these suspended, and it has any CPU time already applied, you will need to resume and abort it as SOON as the project staff returns and starts working on these, in order to get that credit. If you have aborted it, no further action is needed.

NO OTHER action is needed on ANY other WUs. If you are on dial-up, and want to be absolutely sure of not spending more "communications" time than "calculation" time, then I would suggest you suspend the Rosetta _project_ until after Jan 1st. Otherwise, the best thing you can do is "nothing". Let it flow. If it is possible to grant credit for the "short" WUs, it would not surprise me to see the staff do this, but even if they do not, _most_ of the WUs currently flowing are just fine. |
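The "starts with DEFAULT and has 205 for the batch number" rule above can be written as a small check. This is a sketch only: it assumes the name fields are separated by underscores and that the placeholder fields shown as xxxxx contain no underscores themselves, which the thread does not confirm. The function name `is_long_job_wu` and the sample names in the usage note are made up.

```python
def is_long_job_wu(wu_name: str) -> bool:
    """True if the name looks like DEFAULT_<something>_205_<something>."""
    fields = wu_name.split("_")
    # Assumed layout: field 0 is the prefix, field 2 is the batch number.
    return len(fields) >= 4 and fields[0] == "DEFAULT" and fields[2] == "205"
```

With that assumption, a made-up name such as `DEFAULT_1abc_205_4321` would match, while `DEFAULT_1abc_207_4321` or `OTHER_1abc_205_4321` would not.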
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I should have listed five kinds of errors. The clock stop is different from the zero credit. The WU runs normally, accruing cpu time nicely, then the clock freezes. I have seen this on linux, as I've been using top to catch it, as I said in the posting. To be fair, I can't remember seeing it on winxp or win2k, but will let you know if I do. I thought it had been understood that these clock stops were a problem. I've had to stop & restart four of my boinc clients (3 linux & one winME) several times in the last few days over this. If I had not, they would not have crunched anything - how could they, when top shows that the tasks are there but absolutely nothing is running? I will look back and report more in the 'stuck jobs' thread (link in my previous post).
That was what PoorBoy meant. I do not think that is the same as the clock stops effect though. |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
"I should have listed five kinds of errors."
Hm... that's a new one on me. % complete is increasing but CPU time is not? On Linux? Yuck. I missed the "top" reference. Been a long day... |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"I should have listed five kinds of errors."
No, sorry, I still haven't explained it well then. At first the cpu time & % progress behave normally (i.e. the clock runs continuously & progress occasionally jumps). Then after a while the cpu time stops increasing and the progress never moves again either. Two examples are posted on the 'report stuck' thread. R~~ |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
"At first the cpu time & % progress behave normally (i.e. the clock runs continuously & progress occasionally jumps). Then after a while the cpu time stops increasing and the progress never moves again either."
Okay, then this is probably related to the "stuck at 1%" ones - realizing that there are 10 "passes" in each WU, 10 random seeds... if it "sticks" at the first one, you'll see the "stuck at 1%" - if it sticks at the start of the _second_ one, it'll be "stuck at 10%", and so forth. Just guessing, but I think it's a special case of _that_ error, rather than related to the "clock stops but % complete keeps going" (which if the clock stops at the start, is the "0-credit") issue, which I still think is Win9x only. |
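Tern's guess about the 10 passes can be turned into a small lookup from pass number to the percentage a stuck WU would show. This is only a sketch of that guess: the 1% and 10% values come from the post, while the values for the third pass onwards are an extrapolation of "and so forth". `stuck_percent` is a made-up name.

```python
PASSES_PER_WU = 10  # "10 passes in each WU", per the post


def stuck_percent(pass_number: int) -> int:
    """Progress a WU would show if it stuck at the start of the given pass (1-10)."""
    if not 1 <= pass_number <= PASSES_PER_WU:
        raise ValueError("pass_number must be between 1 and 10")
    if pass_number == 1:
        return 1  # the familiar "stuck at 1%" case
    # Assumed: each completed pass adds an equal share of the progress bar.
    return (pass_number - 1) * 100 // PASSES_PER_WU
```

Read the other way round, a WU frozen at 10% would, under this guess, have stuck at the start of its second pass.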
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
|
mikus Send message Joined: 7 Nov 05 Posts: 58 Credit: 700,115 RAC: 0 |
Short job error - two comments:

I'm running Linux. Have had a bunch of work units terminate with code 131 (decimal). Looked at the reported results from those work units -- EVERY instance anyone has run has ended with 'Client error' -- except that on Windows computers the reported error code has been -5.

What I find interesting is that with rosetta version 4.79, similar work units completed successfully for me. As soon as my system switched (automatically) to version 4.80, it started experiencing the code 131 errors on "topology_sample" work units. |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
"What I find interesting is that with rosetta version 4.79, similar work units completed successfully for me. As soon as my system switched (automatically) to version 4.80, it started experiencing the code 131 errors on 'topology_sample' work units."

Yes - they released the new app (only change as far as I know is to the graphics) and basically the same day changed the method of creation of the random seed. The new method increased the probability of a failure due to a bug in the code that had already been there, but had only been failing "about 7%" of the time, on some random seeds. Because this change was on the server side, it affected all of the different WU names. It seems to have reversed the probabilities (i.e. 7% succeed) until it was yanked back out.

Unfortunately, a whole lot of WUs "escaped" during the short time it was running, and until the staff can get back in and delete the remaining files from the server, they will cycle through until 11 people have had the same WU and errored out on it. Most were "flushed" in the first two days, but some were in large caches and are just now being processed by those people, and being sent on to others.

The "short" errors aren't a problem unless someone is on dial-up; they just finish very quickly and get out of the way for the next one. If you're on dial-up, the download time is actually longer than the execution time, which is a pain. "No new work" and then "suspend" when the cache is empty of Rosetta work is the recommended solution there.

None of us are happy about the errors, including the project staff, and they have said they will take care of credit issues on the "long" WUs as soon as they're back. I don't know if it's even possible, or if so, if it's easy, but I would not be surprised if they at least try to give credit on at least some of the other errors as well. |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"... they will take care of credit issues on the 'long' WUs as soon as they're back"

Not literally, I hope. It needs to be done in a few days, in my opinion, but the very first thing I'd like them to do is stop the server sending out any more. Prevention before cure.

"I don't know if it's even possible, or if so, if it's easy, but I would not be surprised if they at least try to give credit on at least some of the other errors..."

My understanding is that it is reasonably easy to run a script to retrospectively give claimed credit on wu that are still in the database, back to any selected date. That is only what I gather from half-remembered postings on the Einstein board at the time of their bad wu storm - I don't actually *know*, as I have not done it.

Claimed credit will fully cover the long wu and the stuck (clock running) work. It will partially cover the stuck (clock stopped >> 0) work. It will not compensate at all for time lost after the clock stopped, nor if the client ran out of work due to short wu and could not get new work for a while as the server was too busy, etc etc.

Personally I'd settle for that claimed credit as a reasonable balance between giving something and not spending too much time away from the science. In my opinion. River~~ |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
The granting of credit to work still in the database is not a problem. The difficulty is that for some of us the records have already been purged. In any case, in most of the places where I think I would be getting credit it is a 0.06 CS size dose, so it is not going to be very noticeable for me. So, if the work unit is still listed, a script can easily be run to change the contents of the work unit rows and add a similar amount to the participant's account. Actually, it would have to be 3 statements: update team where ..., update participant where ..., then update result where ... in that order ... :) |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"Actually, it would have to be 3 statements, update team where ... update participant where, then update result where ... in that order ... :)"

Remember that you will want to run the script more than once, for late-arriving wu, so make sure either to purge the records after each script run or to include some test (on return date perhaps?) to ensure people don't get credited twice for the lost work. Tho on second thoughts, that might be a way to keep people happy. No, on third thoughts people would then complain if they had not had enough errors ;-) R~~ |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Where clause would be where claimed > 0, granted = 0 etc. so, once the third was run the granted would no longer be 0 and later runs would be fine ... |
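The three-statement idea, with the "claimed > 0, granted = 0" guard that makes repeat runs harmless, could look roughly like the sketch below. The table and column names here are invented for illustration and are not the project's real database schema; SQLite is used only so the example runs on its own.

```python
import sqlite3

# Hypothetical schema for illustration only; the real project tables differ.
SCHEMA = """
CREATE TABLE team (id INTEGER PRIMARY KEY, credit REAL NOT NULL);
CREATE TABLE participant (id INTEGER PRIMARY KEY, team_id INTEGER,
                          credit REAL NOT NULL);
CREATE TABLE result (id INTEGER PRIMARY KEY, participant_id INTEGER,
                     claimed REAL NOT NULL, granted REAL NOT NULL);
"""


def grant_claimed_credit(conn: sqlite3.Connection) -> None:
    """Grant claimed credit to results that have none yet. Safe to run repeatedly."""
    with conn:  # commit all three updates together, or roll all of them back
        # 1. Teams: add the pending claimed credit of their members' results.
        conn.execute("""
            UPDATE team SET credit = credit + (
                SELECT COALESCE(SUM(r.claimed), 0)
                FROM result AS r
                JOIN participant AS p ON p.id = r.participant_id
                WHERE p.team_id = team.id
                  AND r.claimed > 0 AND r.granted = 0)
        """)
        # 2. Participants: add the pending claimed credit of their own results.
        conn.execute("""
            UPDATE participant SET credit = credit + (
                SELECT COALESCE(SUM(r.claimed), 0)
                FROM result AS r
                WHERE r.participant_id = participant.id
                  AND r.claimed > 0 AND r.granted = 0)
        """)
        # 3. Results last: once granted is non-zero, the rows above no longer
        #    count as pending, so a later run adds nothing for them.
        conn.execute(
            "UPDATE result SET granted = claimed WHERE claimed > 0 AND granted = 0"
        )
```

The order matters for the reason given in the post: the result rows are what the first two statements use to find pending credit, so they have to be updated last. Late-arriving results can then be picked up simply by running the function again.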
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
"I thought it would be useful to summarize some key points from many other threads. We have recently been seeing four different kinds of errors"

I would have to say that there is a FIFTH type of error not mentioned here. That would be the case where, as the system works, it makes adjustments to the projected time to completion for waiting WUs. The system then encounters one or more short WUs and adjusts the projected time accordingly. It then tries to process a longer WU, and errors out for taking too long to complete the work. This is of course an introduced error caused by an attempt to fix the "1%" stuck problem.

The fix here should be to check not only the CPU time but also the percent of progress to determine if a WU is stuck. The best fix would of course be to have people properly configure their systems. I have all of mine set to keep the applications in memory during swaps, with 120 min between swaps. So far this has caused no heating problems (even running CPDN) and I have yet to have a WU stop at 1%. I also use my systems for other things during the day and I have not had any issues in doing so. Moreover, my WUs do not fail if I suspend them, or stop and restart BOINC. I am of course now having a number of client errors resulting from long WUs following shorter ones in the work queue. Usually this causes a loss of about 6-7 hours of processing time every time it happens.

Of course I am running all Macs at system 10.4.+, which seems to put me (and other people) at the bottom of the queue for fixes (and graphics too). But if it helps to raise the priority, the Mac population on this project is nominally 4%, but we have processed over 18% of the work. We might do even better with an application designed to use the Mac system AltiVec code. In any case, the fixes you deploy should be released for all of the applications in reasonably the same time frame. Just my opinion.
Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
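The fix proposed in the post above, checking progress as well as CPU time before declaring a WU stuck, could be sketched as follows. This is not how the Rosetta or BOINC code actually decides; `should_abort` and its parameters are hypothetical names used only to show the idea.

```python
def should_abort(cpu_time: float, cpu_limit: float,
                 progress_now: float, progress_at_last_check: float) -> bool:
    """Abort only when the task is over its CPU limit AND has stopped advancing."""
    over_limit = cpu_time > cpu_limit
    still_advancing = progress_now > progress_at_last_check
    return over_limit and not still_advancing
```

Under this rule a long but healthy WU that has run past its limit while still moving, say from 60% to 70% between checks, would be left alone, whereas one that is over the limit and has not moved would be aborted. Since progress only jumps occasionally, the gap between checks would need to be long enough to span at least one jump.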
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
"But if it helps to raise the priority, the Mac population on this project is nominally 4%, but we have processed over 18% of the work."

Hm, I don't think your claim that 4% Macs have processed 18% of the work can be right; see this boincstats page (average RAC/host Darwin: 45; WinXP: 61; Linux: 75). As to your fifth type of error, could this be something Mac-specific? I don't think I ever encountered this kind of problem on my Linux box (don't know about Win). |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
"As to your fifth type of error, could this be something Mac-specific?"

Definitely not - in fact, it's less likely on the Mac, as even the "shorter but good" results just take too long. It comes from the imbalance between the lengths of the results, caused by the estimates being _way_ off on all of them. It _can_ happen on any system, and is more likely on the fastest ones. (But it is still a matter of being unlucky in the order in which you get results...)

Even with no other fix, if the estimates were made more accurate, it wouldn't be a problem. Einstein is currently sending out "Albert" results which can range from 25% to just over 100% of the "old" (and very stable) time per result - but they're attaching at least a _guess_ to each one as to how long it will run. The first few all had the old standard estimate, and caused some problems with DCF, same as Rosetta's shorter ones are. If Einstein had a lower max_cpu_time, they would have had the same failures there.

This won't be a problem once all new types of WUs are run through Rosetta Alpha first and there is a good estimate of the length of them in relation to the "standard". The estimates don't have to (and can't) be "right", but if they're even in the ballpark, this type of problem goes away. |
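The point about estimates only needing to be "in the ballpark" can be illustrated with simple ratios of actual to estimated run time. This is a toy illustration with made-up numbers, not BOINC's actual Duration Correction Factor rule; `runtime_ratios` and `spread` are hypothetical helper names.

```python
def runtime_ratios(actual_hours, estimated_hours):
    """Ratio of actual to estimated run time for each result."""
    return [a / e for a, e in zip(actual_hours, estimated_hours)]


def spread(ratios):
    """How far apart the largest and smallest ratios are, as a factor."""
    return max(ratios) / min(ratios)


# Made-up results that take 25%, 50% and 100% of an old 4-hour standard.
actual = [1.0, 2.0, 4.0]
flat_estimate = [4.0, 4.0, 4.0]   # every result carries the old standard estimate
rough_guess = [1.5, 2.0, 3.5]     # a per-result guess that is only roughly right

flat_spread = spread(runtime_ratios(actual, flat_estimate))
rough_spread = spread(runtime_ratios(actual, rough_guess))
```

With the flat estimate the ratios differ by a factor of four between the shortest and longest result, so any correction learned from the short ones is badly wrong for the long one. With the rough per-result guesses the ratios stay within a factor of two of each other, which is the sense in which a ballpark estimate makes this kind of problem go away.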
©2024 University of Washington
https://www.bakerlab.org