Computation Error

Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 7975 - Posted: 30 Dec 2005, 9:18:24 UTC
Last modified: 30 Dec 2005, 9:27:36 UTC

Anyone with a "suspended" DEFAULT_xxxxx_205 please check the webpage for your results, and look at that one - if the "errors" line at the top says "Cancelled", you can unsuspend it and abort it. That will let it get back to the server and be finished. Thanks!

Etienne Guyot

Joined: 27 Oct 05
Posts: 10
Credit: 952,910
RAC: 0
Message 7980 - Posted: 30 Dec 2005, 9:37:26 UTC
Last modified: 30 Dec 2005, 9:43:27 UTC

Hello,
I'm getting many computation errors with Rosetta 4.81 on all my computers, most of them ending with error 0xC0000005.

Following is a sample of the boinc error log (stdoutdae.txt):

2005-12-29 22:17:57 [LHC@home] No work from project
2005-12-29 22:22:38 [---] request_reschedule_cpus: process exited
2005-12-29 22:22:38 [SETI@home] Computation for result 19fe05aa.27874.4082.498562.1.33_1 finished
2005-12-29 22:22:39 [rosetta@home] Starting result 1n0u__topology_sample_207_8081_9 using rosetta version 481
2005-12-29 22:22:41 [SETI@home] Started upload of 19fe05aa.27874.4082.498562.1.33_1_0
[color=red]2005-12-29 22:23:06 [rosetta@home] Unrecoverable error for result 1n0u__topology_sample_207_8081_9 ( - exit code -1073741819 (0xc0000005))[/color]
2005-12-29 22:23:06 [---] request_reschedule_cpus: process exited
2005-12-29 22:23:06 [rosetta@home] Computation for result 1n0u__topology_sample_207_8081_9 finished
2005-12-29 22:23:06 [SETI@home] Starting result 13dc03aa.20895.6785.561076.1.70_3 using setiathome version 418
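
A note on the exit code in the log above: -1073741819 and 0xc0000005 are the same 32-bit value, read once as a signed integer and once as unsigned hex. On Windows, 0xC0000005 is the status code for an access violation. A minimal Python sketch of the conversion follows; the function names are illustrative and not part of BOINC or Rosetta.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the signed reading is negative.
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(to_unsigned32(-1073741819)))  # 0xc0000005
print(to_signed32(0xC0000005))          # -1073741819
```

This only shows that the two numbers in the log line are one and the same code; it says nothing about why Rosetta hits the access violation.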


I've noticed that this kind of error always happens at the exact moment BOINC Manager performs a project switch. Issuing a Suspend command while Rosetta is crunching, or a Quit, produces the same behavior.
I've also noticed that a Rosetta WU always completes successfully if the task is never interrupted.

Maybe this is a clue to fixing the problem?

(I'm running Boinc 5.2.15 on one computer and 5.3.6 on the other - Win32 XP, no graphics used, no screensaver, not linked to DEFAULT_xxxx_205 WU).

Regards,


Gex - France
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7981 - Posted: 30 Dec 2005, 10:13:16 UTC

Set your "leave applications in memory when preempted" preference to yes and see what happens... Rosetta needs this, it seems.
Etienne Guyot

Joined: 27 Oct 05
Posts: 10
Credit: 952,910
RAC: 0
Message 7984 - Posted: 30 Dec 2005, 10:39:44 UTC - in response to Message 7981.  

[quote]Set your "leave applications in memory when preempted" preference to yes and see what happens... Rosetta needs this, it seems.[/quote]


Thanks for the trick. I'll try it.

But it's not a long-term fix, as it's a general switch that applies to all projects.
I don't want my physical memory swapping to the hard drive too much, since I run other applications (these machines are not dedicated to BOINC). It slows my computers down a lot.

I hope the Rosetta team will fix this quickly; otherwise I'll consider suspending this project, as I'm wasting CPU time for nothing (and so is the Rosetta project)!

Regards,


Gex - France
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7985 - Posted: 30 Dec 2005, 12:00:00 UTC

They are working on a fix, but I don't know how long it will take...
winman

Joined: 5 Dec 05
Posts: 2
Credit: 267,850
RAC: 0
Message 8036 - Posted: 31 Dec 2005, 4:46:45 UTC
Last modified: 31 Dec 2005, 4:52:20 UTC

Had similar probs with my 3200+. It seemed to make my machine lock up, and caused other probs with it. It was very frustrating: I leave on Tuesday morning and don't get back till Friday evening, and I was not happy to see that the machine had locked up an hour or two after I left and sat there idle, not crunching, for almost 4 days. My 3700+ seems to have no probs, so it is the only machine that runs Rosetta, and sadly it doesn't run when I am gone. Nice to hear I am not the only one with Rosetta probs, though. My 3200+ happily crunches SETI and Einstein 24/7 now, and my 3700+ runs Rosetta and LHC (when there is work), or SETI when LHC doesn't have anything.

Live long and crunch!!!

Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 8616 - Posted: 9 Jan 2006, 3:44:39 UTC

There are still a few of these floating around - [url=https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3819739]this one[/url] sat in someone's queue from Dec 21 to Jan 8. Six people have had it so far out of the 10 these were set to allow. So far the next result is "unsent"...

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 8640 - Posted: 9 Jan 2006, 12:11:37 UTC - in response to Message 8616.  

[quote]There are still a few of these floating around - [url=https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3819739]this one[/url] sat in someone's queue from Dec 21 to Jan 8. Six people have had it so far out of the 10 these were set to allow. So far the next result is "unsent"...[/quote]

Could one of the project team make sure it is not sent out again please?

Maybe steal its files from the server (and wait for the questions about download errors) if you can't make its status change to 'not needed'

Obviously, I don't mean just this one, but maybe run a script to identify all wu where the number of errors exceeds the new max?

Just a fort ;-)
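
The script suggested above could be sketched roughly as follows. This is only an illustration: the `WorkUnit` record, the sample IDs and the limit value are all hypothetical, since the project's actual database layout and the new maximum are not given in this thread.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    wuid: int         # hypothetical workunit ID
    error_count: int  # hypothetical count of errored results so far


def over_error_limit(workunits, max_errors):
    """Return the workunits whose error count exceeds max_errors."""
    return [wu for wu in workunits if wu.error_count > max_errors]


# Made-up sample data, for illustration only.
sample = [WorkUnit(101, 6), WorkUnit(102, 2), WorkUnit(103, 4)]

flagged = over_error_limit(sample, max_errors=3)
print([wu.wuid for wu in flagged])  # [101, 103]
```

The flagged list would then be the set of WUs to stop sending out; how that is done on the server side is left open here.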
Aquila audax
Joined: 13 Dec 05
Posts: 3
Credit: 55,412
RAC: 0
Message 8692 - Posted: 9 Jan 2006, 21:36:44 UTC
Last modified: 9 Jan 2006, 21:43:17 UTC

I am also still having problems with 'computation errors'... and these are with new WUs downloaded yesterday.

and as Etienne Guyot noted, they all occur when pausing a running R@H job as BOINC switches to a different job. [See log snippets below]

10/01/2006 1:06:39 AM|Predictor @ Home|Restarting result h0013B_1_139120_3 using mfoldB125 version 428
10/01/2006 1:06:39 AM|SETI@home|Restarting result 15mr05aa.29724.21872.447166.1.38_2 using setiathome version 418
10/01/2006 1:06:39 AM|rosetta@home|Pausing result NO_RAND_WTS_2tif_230_6530_0 (removed from memory)
10/01/2006 1:06:40 AM|rosetta@home|Unrecoverable error for result NO_RAND_WTS_2tif_230_6530_0 ( - exit code -1073741819 (0xc0000005))
10/01/2006 1:06:40 AM||request_reschedule_cpus: process exited

...

10/01/2006 5:14:42 AM|Einstein@Home|Restarting result r1_0930.0__761_S4R2a_1 using albert version 437
10/01/2006 5:14:42 AM|SETI@home|Restarting result 15mr05aa.29724.21872.447166.1.38_2 using setiathome version 418
10/01/2006 5:14:42 AM|rosetta@home|Pausing result MORE_FRAGS_W_BARCODE_2tif_231_6530_0 (removed from memory)
10/01/2006 5:14:42 AM|rosetta@home|Pausing result NO_RANDOM_WTS_OR_FRAGS_1dcj_223_9021_0 (removed from memory)
10/01/2006 5:14:43 AM|rosetta@home|Unrecoverable error for result MORE_FRAGS_W_BARCODE_2tif_231_6530_0 ( - exit code -1073741819 (0xc0000005))
10/01/2006 5:14:43 AM|rosetta@home|Unrecoverable error for result NO_RANDOM_WTS_OR_FRAGS_1dcj_223_9021_0 ( - exit code -1073741819 (0xc0000005))
10/01/2006 5:14:43 AM||request_reschedule_cpus: process exited
10/01/2006 5:14:43 AM|rosetta@home|Computation for result MORE_FRAGS_W_BARCODE_2tif_231_6530_0 finished
10/01/2006 5:14:43 AM|rosetta@home|Computation for result NO_RANDOM_WTS_OR_FRAGS_1dcj_223_9021_0 finished
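
The pattern in these snippets (a result is removed from memory, then fails) can be checked mechanically. Below is a minimal Python sketch that assumes the '|'-separated line layout shown above; the function name and the regular expressions are my own, not part of BOINC.

```python
import re

# Message patterns copied from the log lines quoted in this post.
PAUSED = re.compile(r"Pausing result (\S+) \(removed from memory\)")
FAILED = re.compile(r"Unrecoverable error for result (\S+) \(")


def errors_after_removal(lines):
    """Return names of results that errored after being removed from memory."""
    removed = set()
    hits = []
    for line in lines:
        # The message text is the last '|'-separated field.
        message = line.rsplit("|", 1)[-1]
        match = PAUSED.search(message)
        if match:
            removed.add(match.group(1))
            continue
        match = FAILED.search(message)
        if match and match.group(1) in removed:
            hits.append(match.group(1))
    return hits


log = [
    "10/01/2006 1:06:39 AM|SETI@home|Restarting result 15mr05aa.29724.21872.447166.1.38_2 using setiathome version 418",
    "10/01/2006 1:06:39 AM|rosetta@home|Pausing result NO_RAND_WTS_2tif_230_6530_0 (removed from memory)",
    "10/01/2006 1:06:40 AM|rosetta@home|Unrecoverable error for result NO_RAND_WTS_2tif_230_6530_0 ( - exit code -1073741819 (0xc0000005))",
    "10/01/2006 1:06:40 AM||request_reschedule_cpus: process exited",
]

print(errors_after_removal(log))  # ['NO_RAND_WTS_2tif_230_6530_0']
```

Run over a full message log, this would list every result that failed after a "removed from memory" pause, which helps separate this bug from unrelated computation errors.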

Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 8695 - Posted: 9 Jan 2006, 22:01:11 UTC - in response to Message 8692.  
Last modified: 9 Jan 2006, 22:01:58 UTC

[quote]10/01/2006 1:06:39 AM|rosetta@home|Pausing result NO_RAND_WTS_2tif_230_6530_0 (removed from memory)[/quote]


Yes - Rosetta will error out if it is removed from memory. Until they find and fix this bug, you have to have "leave applications in memory when preempted" set to "yes" on the website preferences.

Also - please edit your post to break the lines in the 'pre' blocks. This causes stretching.

STE\/E

Joined: 17 Sep 05
Posts: 125
Credit: 4,103,208
RAC: 167
Message 8698 - Posted: 9 Jan 2006, 23:52:29 UTC

Yes - Rosetta will error out if it is removed from memory. Until they find and fix this bug, you have to have "leave applications in memory when preempted" set to "yes" on the website preferences.
==========

Bill, leaving applications in memory may help, but it is not a cure-all. I have my preferences set to that and I still get Computation Errors quite a bit. I had some this morning when I got up because of the Benchmarks that ran overnight, and I've had some during the day because of suspending WUs to run another Project.
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 5
Message 8699 - Posted: 10 Jan 2006, 0:05:19 UTC - in response to Message 8698.  
Last modified: 10 Jan 2006, 0:10:10 UTC

[quote]Bill, leaving applications in memory may help, but it is not a cure-all. I have my preferences set to that and I still get Computation Errors quite a bit. I had some this morning when I got up because of the Benchmarks that ran overnight, and I've had some during the day because of suspending WUs to run another Project.[/quote]


Benchmarks shouldn't be a problem with BOINC V5.2.8 or later... suspending shouldn't either with 4.45 or later. Only quitting BOINC completely will cause the "memory bug" computation error, if "leave in memory" is yes. It's possible that what you're seeing is a different problem? If it's one host only, could be overclocked too much, or overheating, or RAM problems - or of course just a string of bad luck, getting some of the "bad WUs". You've been around long enough - you know what to look for! :-)

Seriously, out of my last 80 results, four have computation errors; and all four of those were "bad WUs", three from over the holidays, one that just crawled out of somebody's cache where it'd been hiding since then.

STE\/E

Joined: 17 Sep 05
Posts: 125
Credit: 4,103,208
RAC: 167
Message 8721 - Posted: 10 Jan 2006, 14:07:26 UTC - in response to Message 8699.  

[quote]Benchmarks shouldn't be a problem with BOINC V5.2.8 or later... suspending shouldn't either with 4.45 or later.[/quote]


I've been running v5.2.7 for quite a while now, so suspending a WU is still a problem at times. The computation errors don't happen too often, but often enough to be irritating.

I've slowly started to update all my PCs to v5.2.13 to see if I still get an error sometimes when suspending a WU, will report if I do ... :)
Aquila audax
Joined: 13 Dec 05
Posts: 3
Credit: 55,412
RAC: 0
Message 8734 - Posted: 10 Jan 2006, 20:53:48 UTC

I am running 5.2.13. I will try the leave in memory option. None of the other projects I have running mind being suspended.

PS. Apologies for the long lines, but I can't edit the post anymore to fix.
STE\/E

Joined: 17 Sep 05
Posts: 125
Credit: 4,103,208
RAC: 167
Message 8768 - Posted: 11 Jan 2006, 12:01:17 UTC - in response to Message 8699.  

[quote]Only quitting BOINC completely will cause the "memory bug" computation error, if "leave in memory" is yes.[/quote]


eeerrrgggg ... Yes, even after Upgrading to BOINC v5.2.13 I found that out this morning. I Suspended a WU with 3 hours on it & everything was Kewl, then I exited the BOINC Manager & when I started BOINC back up the WU gave me a Computation Error.

This should not happen & needs to be fixed, because a lot of people don't leave their PCs running 24/7 like some of us do.
Aquila audax
Joined: 13 Dec 05
Posts: 3
Credit: 55,412
RAC: 0
Message 8811 - Posted: 11 Jan 2006, 22:15:16 UTC
Last modified: 11 Jan 2006, 22:15:48 UTC

Ok, I have been running with the 'leave in memory' option turned on for over a day now and have not had a single computation error so far.
Profile Trog Dog
Joined: 25 Nov 05
Posts: 129
Credit: 57,345
RAC: 0
Message 8812 - Posted: 11 Jan 2006, 22:26:21 UTC - in response to Message 8695.  

[quote][quote]10/01/2006 1:06:39 AM|rosetta@home|Pausing result NO_RAND_WTS_2tif_230_6530_0 (removed from memory)[/quote]

Yes - Rosetta will error out if it is removed from memory. Until they find and fix this bug, you have to have "leave applications in memory when preempted" set to "yes" on the website preferences.[/quote]



This problem doesn't seem to affect Linux - I can happily crunch WUs on my two Linux boxes (removing from memory when suspended), but the two Windows boxes (XP & 98) choke with the -1073741819 (0xc0000005) error.



©2024 University of Washington
https://www.bakerlab.org