Message boards : Number crunching : WU errors after hibernate?
Author | Message |
---|---|
Deamiter Send message Joined: 9 Nov 05 Posts: 26 Credit: 3,793,650 RAC: 0 |
I've been following the boards here pretty regularly, so I'm well aware of the issue with the Rosetta WUs erroring on some machines when they are not set to "leave in memory when suspended." I never encountered the problem on any of my machines even when I played with the setting, but that's beside the point. Today, I set my computer to hibernate by accident; I usually try to simply suspend it (I've had trouble with hibernating in the past, though never a BOINC issue like this). When I turned it back on, the WU had errored out. Is this the same issue as above, or is it something totally different? The error was on computer 80879 on workunit NO_RANDOM_WTS_OR_FRAGS_1b72_223_5556 |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
It looks the same ... I have had some errors when I had a forced re-boot. :( I know the developers are working hard to find the issue. I also know that this is one of those hard bugs ... As an example, my boss kept telling me he would load a model and it would fail. Well, actually, it was the second model he loaded that failed. Mattered not what the model was, just that the second model would fail because of an improperly initialized variable. That bug took me nearly 6 months to find because we could not clearly identify the failure mechanism. Every model that "failed" would load cleanly for the developers ... but we would just test the "bad" model ... sigh ... Anyway, I know that they want to fix this, probably more than we do ... |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
I've been following the boards here pretty regularly, so I'm well aware of the issue with the Rosetta WUs erroring on some machines when they are not set to "leave in memory when suspended." I never encountered the problem on any of my machines even when I played with the setting, but that's beside the point. Yes, it happens, and the "leave in memory" setting (as far as I remember; I will need to search for my posts) doesn't help either, or it causes a 'zero status, no finish file' error. It happens with both Suspend and Hibernate (Windows XP). Team mauisun.org |
Keck_Komputers Send message Joined: 17 Sep 05 Posts: 211 Credit: 4,246,150 RAC: 0 |
BOINC in general tends to not like the Windows hibernate and standby settings. Both have been known to cause complete wipes of the queue occasionally, and fairly commonly loss of the active workunit. BOINC WIKI BOINCing since 2002/12/8 |
bartsob5&alicjam Send message Joined: 17 Sep 05 Posts: 6 Credit: 183,280 RAC: 0 |
and i've found, that every time i'm suspending rosetta project, the result which is being crunched got an error... is it normal? |
Rebirther Send message Joined: 17 Sep 05 Posts: 116 Credit: 41,315 RAC: 0 |
and i've found, that every time i'm suspending rosetta project, the result which is being crunched got an error... is it normal? Looks like it only happens on AMD processors; my P4 hasn't had any problems... |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
and i've found, that every time i'm suspending rosetta project, the result which is being crunched got an error... is it normal? Suspend shouldn't cause an error as long as "leave applications in memory when preempted" is "yes"... hibernate is a different level. |
bartsob5&alicjam Send message Joined: 17 Sep 05 Posts: 6 Credit: 183,280 RAC: 0 |
so why it was my fourth error like that? |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
so why it was my fourth error like that? <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> ***UNHANDLED EXCEPTION**** Reason: Access Violation (0xc0000005) at address 0x77F52B6A write attempt to address 0x40253040 Exiting... </stderr_txt> This is absolutely typical of a host where "leave applications in memory" is a "no". Have you checked your preferences? Hit "update" on the host, and verified in the Messages tab that the preferences were picked up? I looked at your last 8-10 errors on one host; about 1/3 seem to be the "bad WUs" that are floating around, in that you are not the only one to have errors with them. The others all look like this one. If you are truly getting the "left in memory" message when Rosetta switches out, then I would look for some other problem with your computer. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Errors ending in 819 (exit code -1073741819) can be caused by bad drivers, aggressive overclocking, overheating, etc. Are you seeing these types of errors, or actually any client errors, on other projects? |
sbfh Send message Joined: 6 Dec 05 Posts: 2 Credit: 1,624 RAC: 0 |
BOINC in general tends to not like the Windows hibernate and standby settings. Both have been known to cause complete wipes of the queue occasionally, and fairly commonly loss of the active workunit. I don't have these problems with my other BOINC projects, just Rosetta. I get a significantly higher rate of client errors here and have few or none on the other projects. Is this something that Rosetta is looking at? FYI, I run SETI, Einstein and Predictor as well as Rosetta. |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
BOINC in general tends to not like the Windows hibernate and standby settings. Both have been known to cause complete wipes of the queue occasionally, and fairly commonly loss of the active workunit. And no problems here with SETI or CPDN (other than a zero state file when coming out of hibernate/suspend). If you say it is BOINC, are they looking into it? After all, it is common functionality of the OS, and it is certainly going to be more so when they try to move to the 'instant on' living room appliances, etc. I use it all the time. I believe the common Rosetta 'errors' are the cause here, though, and 'leave in memory' should fix the client error for the most part, as I just checked the logs and it seems to be just the 'zero error' which all projects seem to get (I assume these are just save point files, although I cannot see why BOINC shouldn't be able to handle the hibernate/suspend calls when most other things can). Team mauisun.org |
sbfh Send message Joined: 6 Dec 05 Posts: 2 Credit: 1,624 RAC: 0 |
BOINC in general tends to not like the Windows hibernate and standby settings. Both have been known to cause complete wipes of the queue occasionally, and fairly commonly loss of the active workunit. I think I will suspend my Rosetta account for a while until this is taken care of. I just got another errored-out unit. I would rather spend my computer's CPU downtime on projects that don't have this particular problem. |
bartsob5&alicjam Send message Joined: 17 Sep 05 Posts: 6 Credit: 183,280 RAC: 0 |
@Bill Michael there were errors caused by aggressive overclocking, but on another host ;) now everything should be all right ;) since that day i haven't run any Rosetta WU and i don't know whether it's ok now or not. Anyway, thanks for the help! |
Rich Zajac Send message Joined: 7 Nov 05 Posts: 4 Credit: 37,323 RAC: 0 |
I keep getting a message similar to: 1/7/06 11:19:59 AM|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1dtj_226_6184_2 ( - exit code -1073741819 (0xc0000005)) Could someone please give me an idea what's causing this? It started only fairly recently, although I can't say exactly when. I'm going to suspend Rosetta until I get some idea, 'cause all I'm doing now is wasting cycles. Rich |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
I keep getting a message similar to: 1/7/06 11:19:59 AM|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1dtj_226_6184_2 ( - exit code -1073741819 (0xc0000005)) The "05" errors USUALLY mean you have "leave applications in memory when preempted" set to "no". If this is "yes", we'll need to dig deeper. |
Rich Zajac Send message Joined: 7 Nov 05 Posts: 4 Credit: 37,323 RAC: 0 |
I keep getting a message similar to: 1/7/06 11:19:59 AM|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1dtj_226_6184_2 ( - exit code -1073741819 (0xc0000005)) So.....what you're telling me is that unlike the other BOINC projects, Rosetta requires that I keep large amounts of memory tied up for ALL projects since the "leave in memory" option is global?!?!?! Please let me know when you come up with a fix....until then I'm going to put Rosetta "on the shelf". |
Deamiter Send message Joined: 9 Nov 05 Posts: 26 Credit: 3,793,650 RAC: 0 |
So.....what you're telling me is that unlike the other BOINC projects, Rosetta requires that I keep large amounts of memory tied up for ALL projects since the "leave in memory" option is global?!?!?! Please let me know when you come up with a fix....until then I'm going to put Rosetta "on the shelf". Yes it's true, and this is a problem that's being attacked as we speak. However, you should be aware that it does not leave the application in your RAM -- when left in memory, the application is automatically paged to the virtual memory on your hard drive. Unless you're REALLY hurting for hard drive space, I can't imagine why this would be such a problem. Though in the end, there's no question that it IS a problem. I'm sure it'll get fixed eventually, but until then, I'm happy to give a few more megabytes of my HD space to the projects since it reduces the time it takes for my computer to switch between projects with no negative side effects (besides eating another few MB out of the 60GB I have free). |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
So.....what you're telling me is that unlike the other BOINC projects, Rosetta requires that I keep large amounts of memory tied up for ALL projects since the "leave in memory" option is global?!?!?! Please let me know when you come up with a fix....until then I'm going to put Rosetta "on the shelf". As far as I know, SETI and Einstein are the _only_ two projects that are not harmed (much...) by being swapped out of memory. uFluids and SZTAKI won't error out - they just restart at the very beginning. ClimatePrediction and Predictor and LHC restart at the last checkpoint, which can (for them) be a significant part of the hour they by default run. SETI and Einstein _also_ restart at the last checkpoint, but as long as you have not raised the default "write to disk every" setting, they have checkpoints every 0.1% or so, so you only lose a few minutes of crunching time. The setting is there for the very small number of people (mostly on Win9x boxes) who are VERY tight on memory. Even though it will be swapped out to virtual memory (on disk), there is a few K that remains in RAM. If you have a computer that meets the minimum requirements of Rosetta (512MB, Win2K and better, including Linux or Mac) then there is zero downside to setting the option to "yes". If you can't meet the minimum requirements shown on the website, then you really should never have signed up at all without understanding that you were "on your own", being below those standards. The project staff has this bug on their list of things to fix, but frankly it's only a "medium" priority compared to several other issues. |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Please let me know when you come up with a fix....until then I'm going to put Rosetta "on the shelf". Could be your best move *if* it is really causing you a problem
On CPDN with the min spec 800MHz box and an earlier client, a sulphur WU takes more than 1 hour to get to a checkpoint. It runs perfectly in one sense, but never progresses in real terms unless you either set a higher interval for swaps or set keep=yes. I believe that later clients allow crunching to continue to the next checkpoint, but only on a 1-cpu box. (I'm not certain about this - I remember it being discussed but am not sure if it was actually done) R~~ |
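
For anyone puzzling over the two numbers in the error lines quoted in this thread: -1073741819 and 0xc0000005 are the same 32-bit value, read once as a signed integer and once as unsigned hex. 0xC0000005 is the Windows status code for an access violation, which matches the "Access Violation" text in the stderr output quoted above. The snippet below is only a rough sketch of that conversion; the function names and the one-entry lookup table are made up for illustration and are not part of BOINC or Rosetta.

```python
import re

# Hypothetical one-entry lookup table, for illustration only.
# 0xC0000005 is the Windows NTSTATUS value for an access violation.
STATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

# Matches the "exit code <number>" fragment seen in the quoted client messages.
EXIT_CODE_RE = re.compile(r"exit code (-?\d+)")


def to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    """Return the hex form of the exit code plus a name if we know one."""
    status = to_ntstatus(exit_code)
    name = STATUS_NAMES.get(status, "unknown")
    return f"0x{status:08x} ({name})"


def exit_code_from_message(line: str):
    """Pull the signed exit code out of a client message line, or None."""
    match = EXIT_CODE_RE.search(line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    line = ("1/7/06 11:19:59 AM|rosetta@home|Unrecoverable error for result "
            "INCREASE_CYCLES_10_1dtj_226_6184_2 "
            "( - exit code -1073741819 (0xc0000005))")
    code = exit_code_from_message(line)
    print(code, "->", describe(code))
```

This only decodes the number. It says nothing about why the access violation happened, which in this thread could be the "leave applications in memory" setting, a bad WU, or a hardware or driver problem.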
©2024 University of Washington
https://www.bakerlab.org