WU errors after hibernate?

Deamiter

Joined: 9 Nov 05
Posts: 26
Credit: 3,793,650
RAC: 0
Message 8331 - Posted: 4 Jan 2006, 6:51:05 UTC
Last modified: 4 Jan 2006, 6:51:40 UTC

I've been following the boards here pretty regularly, so I'm well aware of the issue with Rosetta WUs erroring out on some machines when they are not set to "leave in memory when suspended." I never encountered the problem on any of my machines, even when I played with the setting, but that's beside the point.

Today I set my computer to hibernate by accident; I usually just suspend it (I've had trouble with hibernating in the past, though never a BOINC issue like this). When I turned it back on, the WU had errored out.

Is this the same issue as above, or is it something totally different?

The error was on computer 80879 on workunit NO_RANDOM_WTS_OR_FRAGS_1b72_223_5556
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 8334 - Posted: 4 Jan 2006, 7:04:51 UTC

It looks the same ...

I have had some errors when I had a forced re-boot. :(

I know the developers are working hard to find the issue. I also know that this is one of those hard bugs ...

As an example, my boss kept telling me he would load a model and it would fail. Well, actually, it was the second model he loaded that failed. Mattered not what the model was, just that the second model would fail because of an improperly initialized variable. That bug took me nearly 6 months to find because we could not clearly identify the failure mechanism. Every model that "failed" would load cleanly for the developers ... but we would just test the "bad" model ... sigh ...

Anyway, I know that they want to fix this, probably more than we do ...
FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 8341 - Posted: 4 Jan 2006, 10:39:25 UTC - in response to Message 8331.  

Yes, it happens, and as far as I remember (I'd need to search for my posts) "leave in memory" doesn't help either, or else it causes a 'zero status, no finish file' error. It happens with both Suspend and Hibernate (Windows XP).

Team mauisun.org
Keck_Komputers
Joined: 17 Sep 05
Posts: 211
Credit: 4,246,150
RAC: 0
Message 8345 - Posted: 4 Jan 2006, 12:23:30 UTC

BOINC in general tends not to like the Windows hibernate and standby settings. Both have been known to cause complete wipes of the queue occasionally and, fairly commonly, loss of the active workunit.
BOINC WIKI

BOINCing since 2002/12/8
bartsob5&alicjam
Joined: 17 Sep 05
Posts: 6
Credit: 183,280
RAC: 0
Message 8360 - Posted: 4 Jan 2006, 18:35:00 UTC

And I've found that every time I suspend the Rosetta project, the result being crunched gets an error... is that normal?
Rebirther
Joined: 17 Sep 05
Posts: 116
Credit: 41,315
RAC: 0
Message 8364 - Posted: 4 Jan 2006, 18:59:30 UTC - in response to Message 8360.  

And I've found that every time I suspend the Rosetta project, the result being crunched gets an error... is that normal?


Looks like it only happens on AMD processors; my P4 hasn't had any problems...
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 13
Message 8368 - Posted: 4 Jan 2006, 19:51:34 UTC - in response to Message 8360.  

And I've found that every time I suspend the Rosetta project, the result being crunched gets an error... is that normal?


Suspend shouldn't cause an error as long as "leave applications in memory when preempted" is "yes"... hibernate is a different level.

bartsob5&alicjam
Joined: 17 Sep 05
Posts: 6
Credit: 183,280
RAC: 0
Message 8375 - Posted: 4 Jan 2006, 21:58:31 UTC - in response to Message 8368.  



Suspend shouldn't cause an error as long as "leave applications in memory when preempted" is "yes"... hibernate is a different level.

So why was that my fourth error like that?
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 13
Message 8377 - Posted: 4 Jan 2006, 22:32:57 UTC - in response to Message 8375.  

So why was that my fourth error like that?


<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x77F52B6A write attempt to address 0x40253040

Exiting...

</stderr_txt>


This is absolutely typical of a host where "leave applications in memory" is a "no". Have you checked your preferences? Hit "update" on the host, and verified in the Messages tab that the preferences were picked up?
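For anyone wondering how the two numbers in that message relate: the negative exit code and the hex value are the same 32-bit quantity, and 0xC0000005 is the Windows status code for an access violation. A minimal Python sketch of the conversion (the function name is only for illustration):

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


code = to_unsigned32(-1073741819)
print(hex(code))  # 0xc0000005
```

The mask keeps only the low 32 bits, which is how the signed number shown by the client maps onto the hex code in parentheses.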

I looked at your last 8-10 errors on one host; about 1/3 seem to be the "bad WUs" that are floating around, in that you are not the only one to have errors with them. The others all look like this one. If you are truly getting the "left in memory" message when Rosetta switches out, then I would look for some other problem with your computer.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 8388 - Posted: 4 Jan 2006, 22:50:19 UTC

819 errors (exit code -1073741819) can be caused by bad drivers, aggressive overclocking, overheating, etc.

Are you seeing these types of errors, or actually any client errors, on other projects?
sbfh
Joined: 6 Dec 05
Posts: 2
Credit: 1,624
RAC: 0
Message 8397 - Posted: 5 Jan 2006, 1:45:51 UTC - in response to Message 8345.  

BOINC in general tends not to like the Windows hibernate and standby settings. Both have been known to cause complete wipes of the queue occasionally and, fairly commonly, loss of the active workunit.


I don't have these problems with my other BOINC projects, just Rosetta. I get a significantly higher rate of client errors here and have few or none on the other projects. Is this something that Rosetta is looking at?

FYI... I run SETI, Einstein and Predictor as well as Rosetta.


FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 8418 - Posted: 5 Jan 2006, 12:39:44 UTC - in response to Message 8397.  

And no problems here with SETI or CPDN (other than a zero state file when coming out of hibernate/suspend).

If you say it is BOINC, are they looking into it? After all, it is common OS functionality, and it is certainly going to become more so when they try to move to the 'instant on' living-room appliances, etc.
I use it all the time.

I believe the common Rosetta 'errors' are the cause here, though, and 'leave in memory' should fix most of the client errors. I just checked the logs and it seems to be just the 'zero error', which all projects seem to get. (I assume these are just save-point files, although I cannot see why BOINC shouldn't be able to handle the hibernate/suspend calls that most other things can.)
Team mauisun.org
sbfh
Joined: 6 Dec 05
Posts: 2
Credit: 1,624
RAC: 0
Message 8479 - Posted: 6 Jan 2006, 15:03:40 UTC - in response to Message 8397.  

I think I will suspend my Rosetta account for a while until this is taken care of. I just got another errored-out unit. I would rather spend my computer's CPU downtime on projects that don't have this particular problem.


bartsob5&alicjam
Joined: 17 Sep 05
Posts: 6
Credit: 183,280
RAC: 0
Message 8528 - Posted: 7 Jan 2006, 10:17:01 UTC

@Bill Michael
There were errors caused by aggressive overclocking, but on another host ;) Everything should be all right now ;) Since that day I haven't run any Rosetta WUs, so I don't know whether it's OK now or not. Anyway, thanks for the help!
Rich Zajac
Joined: 7 Nov 05
Posts: 4
Credit: 37,323
RAC: 0
Message 8544 - Posted: 7 Jan 2006, 19:54:58 UTC

I keep getting a message similar to: 1/7/06 11:19:59 AM|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1dtj_226_6184_2 ( - exit code -1073741819 (0xc0000005))

Could someone please give me an idea what's causing this? It started only fairly recently, although I can't say exactly when. I'm going to suspend Rosetta until I get some idea, because all I'm doing now is wasting cycles.
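If you want to collect these messages for a report, the line above can be picked apart mechanically. This is only a sketch that assumes the pipe-separated "time|project|text" layout seen in the line quoted above; the function name and the regular expression are illustrative, not part of BOINC.

```python
import re

LINE = ("1/7/06 11:19:59 AM|rosetta@home|Unrecoverable error for result "
        "INCREASE_CYCLES_10_1dtj_226_6184_2 ( - exit code -1073741819 (0xc0000005))")


def parse_error_line(line: str):
    """Split a 'time|project|text' line and extract result name and exit code."""
    timestamp, project, text = line.split("|", 2)
    match = re.search(
        r"result (\S+) \( - exit code (-?\d+) \((0x[0-9a-fA-F]+)\)\)", text
    )
    if match is None:
        return None  # not an "Unrecoverable error" line in the assumed layout
    return {
        "timestamp": timestamp,
        "project": project,
        "result": match.group(1),
        "exit_code": int(match.group(2)),
        "hex": match.group(3),
    }


info = parse_error_line(LINE)
print(info["result"], info["exit_code"], info["hex"])
```

Posting the result name together with the exit code makes it easier for others to check whether the same workunit failed on other hosts.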

Rich
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 13
Message 8545 - Posted: 7 Jan 2006, 20:07:50 UTC - in response to Message 8544.  

I keep getting a message similar to: 1/7/06 11:19:59 AM|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1dtj_226_6184_2 ( - exit code -1073741819 (0xc0000005))


The "05" errors USUALLY mean you have "leave applications in memory when preempted" set to "no". If this is "yes", we'll need to dig deeper.
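A quick way to check what the client actually has, rather than what the website shows, is to read the preferences file on the host. The sketch below assumes the setting is stored as a `leave_apps_in_memory` tag inside a `global_preferences` element; the sample text is made up for illustration, and the file name and layout on your client version may differ.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; a real preferences file will contain more tags.
SAMPLE_PREFS = """
<global_preferences>
    <leave_apps_in_memory>0</leave_apps_in_memory>
</global_preferences>
"""


def leaves_apps_in_memory(xml_text: str) -> bool:
    """Return True if the (assumed) leave_apps_in_memory tag is present and non-zero."""
    root = ET.fromstring(xml_text.strip())
    node = root.find("leave_apps_in_memory")
    # A missing or empty tag is treated as "no" in this sketch.
    return node is not None and (node.text or "").strip() not in ("", "0")


print(leaves_apps_in_memory(SAMPLE_PREFS))  # False
```

If this prints False on a host where the web preference says "yes", the client probably has not picked up the new preferences yet.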

Rich Zajac
Joined: 7 Nov 05
Posts: 4
Credit: 37,323
RAC: 0
Message 8608 - Posted: 8 Jan 2006, 18:26:08 UTC - in response to Message 8545.  

I keep getting a message similar to: 1/7/06 11:19:59 AM|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1dtj_226_6184_2 ( - exit code -1073741819 (0xc0000005))


The "05" errors USUALLY mean you have "leave applications in memory when preempted" set to "no". If this is "yes", we'll need to dig deeper.


So.....what you're telling me is that unlike the other BOINC projects, Rosetta requires that I keep large amounts of memory tied up for ALL projects since the "leave in memory" option is global?!?!?! Please let me know when you come up with a fix....until then I'm going to put Rosetta "on the shelf".
Deamiter
Joined: 9 Nov 05
Posts: 26
Credit: 3,793,650
RAC: 0
Message 8618 - Posted: 9 Jan 2006, 5:33:19 UTC - in response to Message 8608.  

So.....what you're telling me is that unlike the other BOINC projects, Rosetta requires that I keep large amounts of memory tied up for ALL projects since the "leave in memory" option is global?!?!?! Please let me know when you come up with a fix....until then I'm going to put Rosetta "on the shelf".

Yes it's true, and this is a problem that's being attacked as we speak. However, you should be aware that it does not leave the application in your RAM -- when left in memory, the application is automatically paged to the virtual memory on your hard drive. Unless you're REALLY hurting for hard drive space, I can't imagine why this would be such a problem.

Though in the end, there's no question that it IS a problem. I'm sure it'll get fixed eventually, but until then, I'm happy to give a few more megabytes of my HD space to the projects since it reduces the time it takes for my computer to switch between projects with no negative side effects (besides eating another few MB out of the 60GB I have free).
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 13
Message 8620 - Posted: 9 Jan 2006, 5:54:01 UTC - in response to Message 8608.  

So.....what you're telling me is that unlike the other BOINC projects, Rosetta requires that I keep large amounts of memory tied up for ALL projects since the "leave in memory" option is global?!?!?! Please let me know when you come up with a fix....until then I'm going to put Rosetta "on the shelf".


As far as I know, SETI and Einstein are the _only_ two projects that are not harmed (much...) by being swapped out of memory. uFluids and SZTAKI won't error out - they just restart at the very beginning. ClimatePrediction and Predictor and LHC restart at the last checkpoint, which can (for them) be a significant part of the hour they by default run. SETI and Einstein _also_ restart at the last checkpoint, but as long as you have not raised the default "write to disk every" setting, they have checkpoints every 0.1% or so, so you only lose a few minutes of crunching time.
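To make the checkpoint behaviour concrete, here is a toy sketch of a checkpoint-and-resume loop. It is not how any of these science applications are actually written; the file format, function names and step counts are all assumptions chosen for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, step):
    """Write to a temp file first, then rename, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Resume from the last checkpoint; stop_after simulates being removed from memory."""
    step = load_checkpoint(path)
    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run == stop_after:
            return step  # work since the last checkpoint is lost
        step += 1
        done_this_run += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step)
    save_checkpoint(path, step)
    return step


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "state.json")
    reached = run(ckpt, total_steps=100, checkpoint_every=10, stop_after=37)
    resumed_from = load_checkpoint(ckpt)
    print(reached, resumed_from)  # 37 30
```

The run got to step 37 but restarts from 30, so frequent checkpoints (SETI, Einstein) cost little, while an application with no checkpoints restarts from 0.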

The setting is there for the very small number of people (mostly on Win9x boxes) who are VERY tight on memory. Even though the application will be swapped out to virtual memory (on disk), a few K remain in RAM. If you have a computer that meets the minimum requirements of Rosetta (512MB, Win2K and better, including Linux or Mac) then there is zero downside to setting the option to "yes". If you can't meet the minimum requirements shown on the website, then you really should never have signed up at all without understanding that you were "on your own", being below those standards.

The project staff has this bug on their list of things to fix, but frankly it's only a "medium" priority compared to several other issues.

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 8639 - Posted: 9 Jan 2006, 12:08:09 UTC - in response to Message 8620.  

Please let me know when you come up with a fix....until then I'm going to put Rosetta "on the shelf".


Could be your best move *if* it is really causing you a problem


As far as I know, SETI and Einstein are the _only_ two projects that are not harmed [by keep-in-mem=no] ... ClimatePrediction and Predictor and LHC restart at the last checkpoint, which can (for them) be a significant part of the hour they by default run.


On CPDN with the min spec 800MHz box and an earlier client, a sulphur WU takes more than 1 hour to get to a checkpoint. It runs perfectly in one sense, but never progresses in real terms unless you either set a higher interval for swaps or set keep=yes.
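The arithmetic behind "never progresses" can be sketched like this. It assumes the application is removed from memory at the end of every time slice and restarts from its last checkpoint; the 70-minute figure is a hypothetical stand-in for "more than 1 hour".

```python
def net_progress_per_slice(slice_minutes, checkpoint_minutes):
    """Minutes of work that survive one time slice if the app is removed from memory at its end."""
    completed_checkpoints = int(slice_minutes // checkpoint_minutes)
    return completed_checkpoints * checkpoint_minutes


print(net_progress_per_slice(60, 70))   # 0: the slice ends before the first checkpoint
print(net_progress_per_slice(120, 70))  # 70: a longer switch interval lets one checkpoint land
print(net_progress_per_slice(60, 6))    # 60: frequent checkpoints lose almost nothing
```

With a 60-minute slice and a checkpoint that takes longer than that to reach, every slice is thrown away, which is why a longer switch interval or keep=yes is needed.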

I believe that later clients allow crunching to continue to the next checkpoint, but only on a 1-cpu box. (I'm not certain about this - I remember it being discussed but am not sure if it was actually done)

R~~


