Four kinds of errors

Message boards : Number crunching : Four kinds of errors

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7749 - Posted: 27 Dec 2005, 21:15:49 UTC
Last modified: 27 Dec 2005, 21:16:25 UTC

I thought it would be useful to summarise some key points from many other threads. We have recently been seeing four different kinds of errors:


No Progress error

This is where the CPU time for a result accumulates but the result never shows more than 1% progress.

The project programmers hope they have fixed this, so it is vitally important to say if you see this error again! Please don't report them in this thread; use the Report stuck here thread. Please make it clear that the progress is stuck but the clock is running OK.


Long job error

Believed to affect only WU with names DEFAULT_xxxxx_205_xxxxx. Certainly affects all such WU.

Early symptoms are that you see fractional percentage points in the progress box (e.g. 1.3%), combined with an unbelievably long projected time to completion. If left, these jobs will self-destruct after a *long* time (~11 hours on a fast box, well over a day on a slow one).

Official advice is to abort these, preferably before they run or as soon as you notice them.

My unofficial advice is to suspend them so that they don't get passed to someone else. Do the suspend in the Work tab, not in the project tab, so that Rosetta is still free to run other work. Eventually, once the staff come back from their hols, they will disable the jobs on the server and *then* you can resume and abort these jobs.

The project team intend to give everyone full credit for the time spent on these jobs, whether you abort or whether they run to self-destruct. You will not get this credit right away, the team have to figure out how to do it first, so you will hopefully have a nice boost to your stats one day in Jan. They have said that they will be able to make 'some use' of the results, so the time is not totally wasted scientifically either :-)

There is no need to report these - everything will be handled automatically once the job is aborted.


Short job error

Job starts OK but self destructs after a few seconds or a few minutes.

There is nothing to do about these - BOINC should automatically get more work.

Some users have experienced chains of these jobs - if you have a cache it might be worth resetting the project which will remove everything from your cache. However I have had only limited success with this and don't know if it helped.

It is noticeable that some machines suffer more than others - is this just luck or does one such job poison the box for the next? One suggestion (thanks PoorBoy) is that too many short jobs affect the data held on total runtime & benchmarking.

There is no need to report these. (Tell us all if I've got this wrong, Bill!)


Clock Stops error

CPU time stops increasing. Please note that if this happens for a minute or so and then the clock restarts, this is normal (it means that Rosetta needed something on disk and the disks had been powered down to save energy). If it is stuck for more than 5 min you have been hit with this error.

It is a particularly nasty error from the user's perspective as there seems to be no recovery. The job will not get cancelled for exceeding its CPU time as this value has stopped increasing. Therefore it will help if you check your boxes more often than usual.

My personal suggestion is to restart BOINC except on win98 & winME where I suggest a reboot. Otherwise, in my experience the problem just repeats. Your mileage may vary!

There are other ways to look out for this one - in all cases don't treat it as this error until the symptoms last ~5min

On Win2000 and winXP you can use task manager to identify this - ctrl-alt-del, select task manager, select the processes tab, click two times on the CPU column heading. (Not a double click, two single clicks with a gap between). The processes will be sorted so that the ones using the most CPU are at the top. This should be Rosetta. On a multi cpu box you should have as many Rosettas as you have cpus (or perhaps a Rosetta and an Einstein, or whatever)

On Linux you can identify this error using top. (From the GUI open a terminal window first). Enter the top command, and the processes are shown with the ones using the most CPU at the top. This should be Rosetta. On a multi cpu box you should have as many Rosettas as you have cpus (or perhaps a Rosetta and an Einstein, or whatever)
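
For those who would rather script the check than watch top, here is a minimal Python sketch, assuming a Linux box with /proc mounted. It reads the accumulated user and system CPU time of one process twice and reports whether the clock moved. The function names and the 300-second wait (taken from the ~5 min rule of thumb above) are assumptions for illustration; none of this comes from BOINC or Rosetta.

```python
import os
import time


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Return user + system CPU seconds from one /proc/<pid>/stat line."""
    # The process name (field 2) is wrapped in parentheses and may contain
    # spaces, so cut at the last ')' before splitting the rest on whitespace.
    after_name = stat_line.rsplit(")", 1)[1].split()
    # after_name[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime_ticks = int(after_name[11])
    stime_ticks = int(after_name[12])
    return (utime_ticks + stime_ticks) / ticks_per_second


def clock_has_stopped(pid, wait_seconds=300):
    """Sample a process's CPU time twice; True if it did not advance."""
    ticks = os.sysconf("SC_CLK_TCK")  # clock ticks per second on this box
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = cpu_seconds_from_stat(f.read(), ticks)
    time.sleep(wait_seconds)
    with open(path) as f:
        after = cpu_seconds_from_stat(f.read(), ticks)
    return after <= before
```

Find the pid of the Rosetta process with top or ps, then call clock_has_stopped(pid). A True result after five minutes matches the symptom described in this section.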

If you use BOINCview it is particularly easy to see this problem - go to the Work tab and the running task will be highlighted in yellow instead of the normal green. Tip: if BOINCview is watching several boxes, sort the work by status to get all the running tasks together on the screen. By the way, if you leave BV open for long periods to monitor this, please change the refresh interval for each location to something like 2min. BV is very hungry on cpu time at its standard 5sec refresh rate.

The project have not said anything about giving credit for these jobs. They are not realistically going to be able to give credit for the time after the clock stopped, but I personally think it would be a nice gesture if they did give credit claimed for the time shown on the stopped clock. Perhaps the best thing would be to give full credit for all aborted jobs from 20th? Dec to the date when all these four issues are resolved? I hope credit is not the main thing that attracts anyone here, but most of us do like to see our numbers going up. I know I do!

I'd suggest you report these jobs in the same Report stuck here thread, making it clear that it is the clock that has stuck as well as the progress.


Long Term

In the long term these sorts of problem will be screened out by the proposed "Ralph" team (Rosetta alpha) where those of us who enjoy a challenge will try out new WU and new apps before the majority who prefer to be able to donate cpu cycles without it becoming a hands-on job. This cluster of several errors should therefore be a one-off on the mainstream project.

Happy New Year.

Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 7757 - Posted: 27 Dec 2005, 21:53:49 UTC

My own personal opinion - River has the "No progress" and "long job" sections exactly right; I have some issues with the "short job" section, and I think the section on "clock stops error" is wrong.

"Clock Stops" - If anyone has seen this on any platform _other_ than Windows 95/98/ME, please advise; every case I've heard about has been on one of those OS versions, which ARE NOT SUPPORTED by Rosetta. They may work, but anyone using them is "on their own". It is a KNOWN PROBLEM that these OS versions sometimes request 0 credits (on _any_ project) - and there is no way on Rosetta to grant credit on them.

"Short Jobs" - I believe the "short WU poisoning" effect is not causing more computation errors, but instead may be causing (on Win9x) the "clock stops" problem on the following WU. I think the issue PoorBoy raised was that a series of short WUs will cause your Duration Correction Factor to be lowered, which _may_ cause a "CPU_time_exceeded" error on the next long-running 'normal' WU... I think at _this_ point, there is little need to clear your cache. Most of these are gone by now, unless you have a very large cache that would have days-old data in it. And even then, the newer data in your cache is probably good.

All in all, the basic original recommendation remains: if you get a DEFAULT_xxxxx_205 WU (and please note ONLY the ones that start with DEFAULT and have a "205" for the batch number, NO others are affected) you should suspend (my preference also) or abort it. If you have one of these suspended, and it has any CPU time already applied, you will need to resume and abort it as SOON as the project staff returns and starts working on these, in order to get that credit. If you have aborted it, no further action is needed.

NO OTHER action is needed on ANY other WUs. If you are on dial-up, and want to be absolutely sure of not spending more "communications" time than "calculation" time, then I would suggest you suspend the Rosetta _project_, until after Jan 1st. Otherwise, the best thing you can do is "nothing". Let it flow. If it is possible to grant credit for the "short" WUs, it would not surprise me to see the staff do this, but even if they do not, _most_ of the WUs currently flowing are just fine.

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7762 - Posted: 27 Dec 2005, 23:04:16 UTC - in response to Message 7757.  


"Clock Stops" - If anyone has seen this on any platform _other_ than Windows 95/98/ME, please advise;

...

It is a KNOWN PROBLEM that these OS versions sometimes request 0 credits (on _any_ project) - and there is no way on Rosetta to grant credit on them.


I should have listed five kinds of errors.

The clock stop is different from the zero credit. The WU runs normally, accruing CPU time nicely, then the clock freezes. I have seen this on Linux as I've been using top to catch it, as I said in the posting. To be fair I can't remember seeing it on winxp or win2k but will let you know if I do.

I thought it had been understood that these clock stops were a problem. I've had to stop & restart four of my boinc clients (3 linux & one winME) several times in the last few days over this. If I had not they would not have crunched anything - how could they when top shows that the tasks are there but absolutely nothing is running?

I will look back and report more in the 'stuck jobs' thread (link in my previous post)

"Short Jobs" - I believe the "short WU poisoning" effect is not causing more computation errors, but instead may be causing (on Win9x) the "clock stops" problem on the following WU. I think the issue PoorBoy raised was that a series of short WUs will cause your Duration Correction Factor to be lowered, which _may_ cause a "CPU_time_exceeded" error on the next long-running 'normal'


That was what PoorBoy meant. I do not think that is the same as the clock stops effect though.

Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 7774 - Posted: 28 Dec 2005, 1:42:48 UTC - in response to Message 7762.  

I should have listed five kinds of errors.

The clock stop is different from the zero credit. The WU runs normally, accruing CPU time nicely, then the clock freezes. I have seen this on Linux as I've been using top to catch it, as I said in the posting.


Hm... that's a new one on me. % complete is increasing but CPU time is not? On Linux? Yuck. I missed the "top" reference. Been a long day...

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7787 - Posted: 28 Dec 2005, 2:54:22 UTC - in response to Message 7774.  

I should have listed five kinds of errors.

The clock stop is different from the zero credit. The WU runs normally, accruing CPU time nicely, then the clock freezes. I have seen this on Linux as I've been using top to catch it, as I said in the posting.


Hm... that's a new one on me. % complete is increasing but CPU time is not? On Linux? Yuck. I missed the "top" reference. Been a long day...


No sorry, I still haven't explained it well then.

At first the CPU time & % progress behave normally (i.e. the clock runs continuously & progress occasionally jumps). Then after a while the CPU time stops increasing and the progress never moves again either.

Two examples posted on the 'report stuck' thread.
R~~

Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 7789 - Posted: 28 Dec 2005, 3:12:14 UTC - in response to Message 7787.  

At first the CPU time & % progress behave normally (i.e. the clock runs continuously & progress occasionally jumps). Then after a while the CPU time stops increasing and the progress never moves again either.


Okay, then this is probably related to the "stuck at 1%" ones - realizing that there are 10 "passes" in each WU, 10 random seeds... if it "sticks" at the first one, you'll see the "stuck at 1%" - if it sticks at the start of the _second_ one, it'll be "stuck at 10%", and so forth.

Just guessing, but I think it's a special case of _that_ error, rather than related to the "clock stops but % complete keeps going" (which if the clock stops at the start, is the "0-credit") issue, which I still think is Win9x only.

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7800 - Posted: 28 Dec 2005, 9:10:11 UTC - in response to Message 7789.  


Okay, then this is probably related to the "stuck at 1%" ones - realizing that there are 10 "passes" in each WU, 10 random seeds... if it "sticks" at the first one, you'll see the "stuck at 1%" - if it sticks at the start of the _second_ one, it'll be "stuck at 10%", and so forth.

No, because with the old stuck at 1% error the clock kept on going. I confidently predict that top would still show 99% cpu usage.

With the new stuck error (which I agree can happen at 1%, 10% ...80% ...) the clock stops.

This is an important clue to what is going wrong.

One is an infinite loop, where the program is repeatedly wasting CPU by running the same code over and over; the other is most likely a mutual wait issue such as a deadlock, a wait on a reply from a dead thread, two threads each waiting for the other to do something first, etc. They can't be the same error.

This is why the heartbeat message (see my posting in the 'Please report' thread) would excite me if I had the job of tracking the bug. A dropped message between tasks is one way a mutual wait arises: a letter gets lost in the post and the relationship ends with both lovers saying 's/he owes me a letter, I ain't writing again'.

Any error that prevents a wu making progress will obviously stop the progress counter. Same symptom != same disease.
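
The distinction drawn above can be written down as a small decision rule. The following Python sketch is only an illustration of the reasoning in this post, not a project tool: it takes two samples of CPU time and percent progress, which you would read off the Work tab yourself, and the function name and labels are made up. Because progress normally moves only in occasional jumps, the two samples need to be far enough apart that a healthy job would have moved.

```python
def classify_stall(cpu_before, cpu_after, progress_before, progress_after):
    """Classify a task from two samples taken some time apart.

    cpu_* are accumulated CPU seconds, progress_* are percent complete.
    The returned labels are illustrative, not official error names.
    """
    cpu_advanced = cpu_after > cpu_before
    progress_advanced = progress_after > progress_before
    if progress_advanced:
        return "running"
    if cpu_advanced:
        # CPU is being burned with nothing to show for it: looks like a loop.
        return "no progress, clock running"
    # Neither counter moves: the process is waiting on something.
    return "no progress, clock stopped"


# Example readings (invented numbers):
print(classify_stall(3600, 7200, 1.0, 1.0))   # the old "stuck at 1%" pattern
print(classify_stall(5400, 5400, 30.0, 30.0)) # the newer clock-stops pattern
```

The same stuck progress figure lands in two different branches, which is the point being made: the progress counter alone cannot tell the two cases apart.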


mikus
Joined: 7 Nov 05
Posts: 58
Credit: 700,115
RAC: 0
Message 7816 - Posted: 28 Dec 2005, 13:07:02 UTC - in response to Message 7749.  

Short job error - two comments:

I'm running Linux. Have had a bunch of work units terminate with code 131 (decimal). Looked at the reported results from those work units -- EVERY instance anyone has run has ended with 'Client error' -- except that on Windows computers the reported error code has been -5.

What I find interesting is that with rosetta version 4.79, similar work units completed successfully for me. As soon as my system switched (automatically) to version 4.80, it started experiencing the code 131 errors on "topology_sample" work units.
.

Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 7832 - Posted: 28 Dec 2005, 18:07:49 UTC - in response to Message 7816.  

What I find interesting is that with rosetta version 4.79, similar work units completed successfully for me. As soon as my system switched (automatically) to version 4.80, it started experiencing the code 131 errors on "topology_sample" work units.


Yes - they released the new app (only change as far as I know is to the graphics) and basically the same day changed the method of creation of the random seed. The new method increased the probability of a failure due to a bug in the code that had already been there, but had only been failing "about 7%" of the time, on some random seeds. Because this change was on the server side, it affected all of the different WU names. It seems to have reversed the probabilities (i.e. only 7% succeed) until it was yanked back out. Unfortunately, a whole lot of WUs "escaped" during the short time it was running, and until the staff can get back in and delete the remaining files from the server, they will cycle through until 11 people have had the same WU and errored out on it. Most were "flushed" in the first two days, but some were in large caches and are just now being processed by those people, and being sent on to others.

The "short" errors aren't a problem unless someone is on dial-up; they just finish very quickly and get out of the way for the next one. If you're on dial-up, the download time is actually longer than the execution time, which is a pain. "No new work" and then "suspend" when cache is empty of Rosetta work is the recommended solution there.

None of us are happy about the errors, including the project staff, and they have said they will take care of credit issues on the "long" WUs as soon as they're back. I don't know if it's even possible, or if so, if it's easy, but I would not be surprised if they at least try to give credit on at least some of the other errors as well.

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7902 - Posted: 29 Dec 2005, 16:23:55 UTC - in response to Message 7832.  

... they will take care of credit issues on the "long" WUs


good

as soon as they're back


Not literally I hope. It needs to be done in a few days, in my opinion, but the very first thing I'd like them to do is stop the server sending out any more. Prevention before cure.

I don't know if it's even possible, or if so, if it's easy, but I would not be surprised if they at least try to give credit on at least some of the other errors...


My understanding is that it is reasonably easy to run a script to retrospectively give claimed credit on wu that are still in the database, back to any selected date. That is only what I gather from half remembered postings on the Einstein board at the time of their bad wu storm - I don't actually *know*, as I have not done it.

Claimed credit will fully cover the long wu, and stuck (clock running) work in full.

It will partially cover the stuck (clock stopped >> 0) work.

It will not compensate at all for time lost after the clock stopped, nor if the client ran out of work due to short wu and could not get new work for a while as the server was too busy, etc etc.

Personally I'd settle for that claimed credit as a reasonable balance between giving something and not spending too much time away from the science.

In my opinion.

River~~

Profile Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7913 - Posted: 29 Dec 2005, 17:43:46 UTC

The granting of credit to work still in the database is not a problem. The difficulty is that for some of us the records have already been purged. In any case, in most of the places where I think I would be getting credit it is a 0.06 CS sized dose, so it is not going to be very noticeable for me.

So, if the work unit is still listed, a script can easily be run to change the contents of the work unit rows and add a similar amount to the participant's account. Actually, it would have to be 3 statements: update team where ..., update participant where ..., then update result where ..., in that order ... :)

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7916 - Posted: 29 Dec 2005, 18:26:13 UTC - in response to Message 7913.  

Actually, it would have to be 3 statements, update team where ... update participant where, then update result where ... in that order ... :)


Remember that you will want to run the script more than once, for late-arriving wu, so make sure either to purge the records after each script run or to include some test (on return date perhaps?) to ensure people don't get credited twice for the lost work.

Tho on second thoughts, that might be a way to keep people happy.

No, on third thoughts people would then complain if they had not had enough errors ;-)

R~~

Profile Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7919 - Posted: 29 Dec 2005, 19:14:57 UTC

The where clause would be where claimed > 0, granted = 0, etc., so once the third statement had run, granted would no longer be 0 and later runs would be fine ...
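
As an illustration of the three statements and the where clause discussed above, here is a hedged Python sketch using the standard sqlite3 module on an in-memory database. The table and column names (team, participant, result, claimed_credit, granted_credit, total_credit) are invented for the example and are not the real BOINC schema; the point is only the ordering and the fact that a second run changes nothing.

```python
import sqlite3

# Order matters: team and participant totals are computed from the rows
# that are still pending, so the result rows must be updated last.
GRANT_STATEMENTS = [
    """UPDATE team SET total_credit = total_credit + COALESCE((
           SELECT SUM(r.claimed_credit)
           FROM result AS r JOIN participant AS p ON p.id = r.participant_id
           WHERE p.team_id = team.id
             AND r.claimed_credit > 0 AND r.granted_credit = 0), 0)""",
    """UPDATE participant SET total_credit = total_credit + COALESCE((
           SELECT SUM(r.claimed_credit)
           FROM result AS r
           WHERE r.participant_id = participant.id
             AND r.claimed_credit > 0 AND r.granted_credit = 0), 0)""",
    """UPDATE result SET granted_credit = claimed_credit
       WHERE claimed_credit > 0 AND granted_credit = 0""",
]


def grant_claimed_credit(conn):
    """Run the three updates in one transaction."""
    with conn:  # commits on success, rolls back on error
        for statement in GRANT_STATEMENTS:
            conn.execute(statement)


def make_demo_db():
    """Build a tiny hypothetical database with one uncredited result."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE team (id INTEGER PRIMARY KEY, total_credit REAL);
        CREATE TABLE participant (id INTEGER PRIMARY KEY, team_id INTEGER,
                                  total_credit REAL);
        CREATE TABLE result (id INTEGER PRIMARY KEY, participant_id INTEGER,
                             claimed_credit REAL, granted_credit REAL);
        INSERT INTO team VALUES (1, 100.0);
        INSERT INTO participant VALUES (1, 1, 40.0);
        INSERT INTO result VALUES (1, 1, 12.5, 0.0);  -- errored, never credited
        INSERT INTO result VALUES (2, 1, 8.0, 8.0);   -- already credited
    """)
    return conn


conn = make_demo_db()
grant_claimed_credit(conn)
grant_claimed_credit(conn)  # second run finds no pending rows
print(conn.execute("SELECT total_credit FROM participant").fetchone()[0])
```

Note that this simple where clause would also credit any result that was deliberately granted zero, so a real script would presumably need a further test, such as the return date mentioned earlier in the thread.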

Profile Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 8276 - Posted: 3 Jan 2006, 15:31:21 UTC - in response to Message 7749.  

I thought it would be useful to summarize some key points from many other threads. We have recently been seeing four different kinds of errors


I would have to say that there is a FIFTH type of error not mentioned here. That would be the case where, as the system works, it makes adjustments to the projected time to completion for waiting WUs. The system then encounters one or more short WUs and adjusts the projected time accordingly. It then tries to process a longer WU, and errors out for taking too long to complete the work. This is of course an introduced error caused by an attempt to fix the "1%" stuck problem.

The fix here should be to not check only the CPU time but to also check the percent of progress to determine if a WU is stuck. The best fix would of course be to have people properly configure their systems. I have all of mine set to Keep the applications in memory during swaps and 120 min between swaps. So far this has caused no heating problems (even running CPDN) and I have yet to have a WU stop at 1%. I also use my systems for other things during the day and I have not had any issues in doing so. Moreover my WUs do not fail if I suspend them, or stop and restart BOINC.
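
The check proposed at the start of the paragraph above might look something like this in Python. It is a sketch of the suggestion only, not how BOINC or Rosetta actually decides to stop a WU; the function name, the parameters and the idea of comparing progress with the value seen at the previous check are assumptions.

```python
def should_abort(cpu_seconds, cpu_limit, progress_now, progress_at_last_check):
    """Abort only when the WU is over its CPU limit AND progress has stalled.

    A WU that is over the limit but still advancing is left to finish.
    """
    over_limit = cpu_seconds > cpu_limit
    stalled = progress_now <= progress_at_last_check
    return over_limit and stalled


# A long but healthy WU: over the limit, yet progress moved from 60% to 70%.
print(should_abort(30000, 25000, 70.0, 60.0))  # False
# A stuck WU: over the limit and still sitting at 1%.
print(should_abort(30000, 25000, 1.0, 1.0))    # True
```

Under this rule a long WU that follows a run of short ones would not be killed just for exceeding the time limit, while a job that sits at the same percentage would still be caught.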

I am of course now having a number of client errors resulting from long WUs following shorter ones in the work queue. Usually this causes a loss of about 6-7 hours of processing time every time it happens.

Of course I am running all Macs at system 10.4.+, which seems to put me (and other people) at the bottom of the queue for fixes (and graphics too). But if it helps to raise the priority, the Mac population on this project is nominally 4%, but we have processed over 18% of the work. We might do even better with an application designed to use the Mac's AltiVec code. In any case the fixes you deploy should be released for all of the applications in roughly the same time frame.

Just my opinion.

Regards
Phil



We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

Profile Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 8281 - Posted: 3 Jan 2006, 17:43:35 UTC - in response to Message 8276.  
Last modified: 3 Jan 2006, 17:45:23 UTC

But if it helps to raise the priority, the Mac population on this project is nominally 4%, but we have processed over 18% of the work.

Hm, I don't think your claim that 4% Macs have processed 18% of the work can be right; see this boincstats page (average RAC/host Darwin: 45; WinXP: 61; Linux: 75). As to your fifth type of error, could this be something Mac-specific? I don't think I ever encountered this kind of problem on my Linux box (don't know about Win).

Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 8286 - Posted: 3 Jan 2006, 18:47:27 UTC - in response to Message 8281.  

As to your fifth type of error, could this be something Mac-specific?


Definitely not - in fact, it's less likely on the Mac as even the "shorter but good" results just take too long. It's caused by the imbalance between the lengths of the results, which comes from the estimates being _way_ off on all of them; it _can_ happen on any system, and is more likely on the fastest ones. (But still a matter of being unlucky in the order in which you get results...) Even with no other fix, if the estimates were made more accurate, it wouldn't be a problem. Einstein is currently sending out "Albert" results which can range from 25% to just over 100% of the "old" (and very stable) time per result - but they're attaching at least a _guess_ to each one as to how long it will run. The first few all had the old standard estimate, and caused some problems with DCF, same as Rosetta's shorter ones are. If Einstein had a lower max_cpu_time, they would have had the same failures there.

This won't be a problem once all new types of WUs are run through Rosetta Alpha first and there is a good estimate of the length of them in relation to the "standard". The estimates don't have to (and can't) be "right", but if they're even in the ballpark, this type of problem goes away.




©2024 University of Washington
https://www.bakerlab.org