Message boards : Number crunching : Unrecoverable error
Author | Message |
---|---|
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
From the snippets of BOINC messages provided, I can't tell what's causing the error, as you're only showing the error line, not what BOINC did just before the error occurred. My guess (and I may be wrong) is that you're running several BOINC projects and not keeping the work units in memory on switching. Rosetta has a known problem with this. See this thread for instance. I'm not sure if it's still a problem with the 5.2.x versions of BOINC as I keep work units in memory anyway and most of my machines only run Rosetta. *** Join BOINC@Australia today *** |
mags Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
From the snippets of BOINC messages provided, I can't tell what's causing the error, as you're only showing the error line, not what BOINC did just before the error occurred. My guess (and I may be wrong) is that you're running several BOINC projects and not keeping the work units in memory on switching. Rosetta has a known problem with this. I am only running Rosetta. I was running FAD as well but have turned it off for a couple of days to test for an incompatibility when running them simultaneously. I am using the PC for internet and occasional burning. No video work etc. at the moment. I have my preferences set to: Leave applications in memory while preempted? (suspended applications will consume swap space if 'yes') no |
vavega Joined: 2 Nov 05 Posts: 82 Credit: 519,981 RAC: 0 |
mags i have that swap space checked to yes on all my machines. no problems, but then again i'm running version 5.2.6 and with Fad. you might want to try switching that preference just to check it. couldn't hurt for 1 wu. |
mags Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
mags Thanks VaVega, I'll try that and see. :) |
mags Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
mags Half an hour later and another successful WU sent up and results validated. Cheers VaVega, and anyone else who helped. join Fadbeens |
Spectre Joined: 1 Nov 05 Posts: 20 Credit: 177,671 RAC: 0 |
This is getting ridiculous....
2005-11-26 14:03:51 [rosetta@home] Unrecoverable error for result 1dtj__abrelax_rand_len10_jit02_omega_sim_04980_0 ( - exit code -1073741819 (0xc0000005))
2005-11-26 16:31:32 [rosetta@home] Unrecoverable error for result 1ogw__abrelax_rand_len10_jit02_omega_sim_05727_0 ( - exit code -1073741819 (0xc0000005))
2005-11-26 19:28:53 [rosetta@home] Unrecoverable error for result 1ogw__abrelax_rand_len10_jit02_omega_sim_05882_0 ( - exit code -1073741819 (0xc0000005))
2005-11-26 19:53:41 [rosetta@home] Unrecoverable error for result 1ogw__abrelax_rand_len10_jit02_omega_sim_14435_0 ( - exit code -1073741819 (0xc0000005))
2005-11-26 20:30:03 [rosetta@home] Unrecoverable error for result 1di2__abrelax_rand_len10_jit02_omega_sim_19778_0 ( - exit code -1073741819 (0xc0000005))
2005-11-26 22:27:02 [rosetta@home] Unrecoverable error for result 1ogw__abrelax_rand_len10_jit02_omega_sim_16037_0 ( - exit code -1073741819 (0xc0000005))
2005-11-26 23:22:21 [rosetta@home] Unrecoverable error for result 1di2__abrelax_rand_len10_jit02_omega_sim_22085_0 ( - exit code -1073741819 (0xc0000005))
2005-11-27 00:30:44 [rosetta@home] Unrecoverable error for result 1dtj__abrelax_rand_len10_jit02_omega_sim_17915_0 ( - exit code -1073741819 (0xc0000005))
2005-11-27 01:59:18 [rosetta@home] Unrecoverable error for result 1dtj__abrelax_rand_len10_jit02_omega_sim_19302_0 ( - exit code -1073741819 (0xc0000005))
2005-11-27 02:09:27 [rosetta@home] Unrecoverable error for result 1ogw__abrelax_rand_len10_jit02_omega_sim_19883_0 ( - exit code -1073741819 (0xc0000005))
2005-11-27 02:15:16 [rosetta@home] Unrecoverable error for result 1di2__abrelax_rand_len10_jit02_omega_sim_17866_1 ( - exit code -1073741819 (0xc0000005))
2005-11-27 02:55:02 [rosetta@home] Unrecoverable error for result 1dtj__abrelax_rand_len10_jit02_omega_sim_16814_1 ( - exit code -1073741819 (0xc0000005))
2005-11-27 03:59:45 [rosetta@home] Unrecoverable error for result 1dcj__abrelax_rand_len10_jit02_omega_sim_07895_0 ( - exit code -1073741819 (0xc0000005))
2005-11-27 05:06:39 [rosetta@home] Unrecoverable error for result 1di2__abrelax_rand_len10_jit02_omega_sim_05833_2 ( - exit code -1073741819 (0xc0000005))
2005-11-27 09:25:45 [rosetta@home] Unrecoverable error for result 1di2__abrelax_rand_len10_jit02_omega_sim_28802_0 ( - exit code -1073741819 (0xc0000005))
2005-11-27 09:33:51 [rosetta@home] Unrecoverable error for result 1di2__abrelax_rand_len10_jit02_omega_sim_29759_0 ( - exit code -1073741819 (0xc0000005))
2005-11-27 09:51:26 [rosetta@home] Unrecoverable error for result 1dtj__abrelax_rand_len10_jit02_omega_sim_21782_0 ( - exit code -1073741819 (0xc0000005)) |
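A note for anyone decoding the exit code in the messages above: BOINC prints the Windows exit status both as a signed 32-bit integer and in hex. The sketch below (Python, not something posted in the thread; the helper names are made up for illustration) shows that -1073741819 and 0xc0000005 are the same value, which Windows defines as STATUS_ACCESS_VIOLATION. That tells you the application crashed on a bad memory access, but not why.

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


# Tiny lookup table (hypothetical helper data); only this one value
# appears in the thread. Extend it if you meet other codes.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe(code: int) -> str:
    """Return a symbolic name for a known exit code, or 'unknown'."""
    return KNOWN_STATUS.get(code & 0xFFFFFFFF, "unknown")


print(exit_code_to_hex(-1073741819), describe(-1073741819))
```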
j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
This is getting ridiculous.... If you want to give me your phone number, I'll call and see if we can figure it out j2satx at stx dot rr dot com |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
If you have "leave applications in memory" set to "NO" then YOU WILL GET ERRORS. This is a known problem. If you can't leave applications in memory while preempted, then you might as well suspend Rosetta and work on another project until the Rosetta folks can find and fix this bug. They're working on it, but it's a difficult one to find and fix. And yes, the project staff is aware of this problem, in fact I exchanged emails with David Kim on it yesterday. Repeating - KNOWN BUG - must have "leave applications in memory" set to "YES" for Rosetta to work. Now I have to rant a little bit... first the disclaimer that I'm not "staff", just a volunteer participant like the rest of you, so if I make somebody mad, don't take it out on the project. I just re-read this thread and realized that this "leave in memory" rule had not been specifically said here, until Yoda mentioned that this sure sounded the same. I know it's been said over and over elsewhere, so I had assumed this (newer) problem thread was something else, not the known issue, or that someone had discussed it and I had missed that. Because I didn't have an answer, I didn't say anything. So - my fault for not catching that this was similar, and points to Yoda for realizing it. In defense of all of us who have been reading this and trying to help, however, I will point out that none of you with problems have YET given us a single line from the log files we've been asking for, or answered "what was it doing before that" - which is the key to knowing it's the memory issue. You just keep giving us the messages from the messages tab, which say "it broke", but not WHY. Yoda has said "we have to know what happened before that" and "post a few lines from the BOINC log". I thought "well, if they're new, they may not know where that is", so I gave the explicit file names that you have to open, and asked you to paste any error messages from those here. 
It's like we were totally ignored, nobody answered either one of us at all. I went and got the error text from the results themselves, since I could get to those, and they contain at least a LITTLE more info, and posted that here for you, in hopes that someone would be able to find an answer from that data. But none of us can get to the files that are on your computer, or read your mind to know your preference settings, timing of events etc. - and you aren't answering those questions. I don't mean to be rude to the (very welcome) new users, and I know that you haven't had a chance to become familiar with the details of BOINC. But to help you find a problem, we have to be given the information we ask for. If you don't understand what we're wanting, say so and we'll be more specific, give step-by-step instructions, whatever we need to do. I know getting errors is frustrating - you can imagine how it is on our side to know how frustrated you are, yet when we say "give us x so we can help", you don't, the errors continue, and you get madder and madder... Let's all take this as a learning experience. I will try not to assume that everyone already knows something. If you have a problem and someone is trying to help, and asks you a question, try to answer that question as best you can rather than repeating the same thing you already said. If we both do these things, we should be able to solve problems quicker, and everyone will be happier. End rant. Back to crunching! :-) |
Spectre Joined: 1 Nov 05 Posts: 20 Credit: 177,671 RAC: 0 |
@j2satx: For the moment, I've switched my memory setting from NO to YES and updated the client[s]. Will see if that helps... @Bill: Granted, I'm no BOINC genius, but after 2 years of it, I'm fairly familiar with how it works. I've provided all I have at my disposal. I've posted info from all relevant logfiles. Some files simply didn't have any error info of interest or useful info [last update was weeks ago even though I'm seeing errors]. I've got 4 boxes doing the same thing: 2 running BOINC 5.2.7 and 2 running BOINC 4.4.5. All are running Windows [1 running 98, 2 running 2000 Pro, 1 running XP]. I've been running Rosetta for a month now with no problems and this started a few days ago. I'm not bitching or complaining, but I'm not the only person with problems. If more info is needed, ask. Spectre |
mags Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
If you are running multiple BOINC projects using the BOINC Manager, we recommend setting your general preferences to "Leave applications in memory while preempted." To do this, after creating an account, do the following: Click the "Your account" link on the main home page under "Returning participants" or click the "Participants" link above. Login if you haven't already. Click the "View or edit general preferences" link on your account page. Click "Edit preferences". Select "yes" for the "Leave applications in memory while preempted?" option. I appreciate the help I have been given so far, but perhaps it would help to be more specific in the above instructions. I am only running a single BOINC and a single Rosetta. The above instruction implies that I should leave it at NO, not YES. ??? join Fadbeens |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
@Bill: Granted, I'm no BOINC genius, but after 2 years of it, I'm fairly familiar with how it works. I've provided all I have at my disposal. I've posted info from all relevant logfiles. Some files simply didn't have any error info of interest or useful info [last update was weeks ago even though I'm seeing errors]. (snip...) I've been running Rosetta for a month now with no problems and this started a few days ago. I'm not bitching or complaining, but I'm not the only person with problems. If more info is needed, ask. Spectre, I was basing my "new user" comment on your join date of this month; I didn't know you had BOINCed but not Rosetta'd before that. Also, as you say, you aren't the only one with problems - so you aren't the only one who I was referring to either! :-) I see where you have posted the contents of the "Messages" tab (or perhaps stdout) but not anything from stderrdae or stderrgui - even with no errors, those files on my system were updated yesterday as they contain even things like "deferring communication" messages. They SHOULD contain at the very least the same text that was placed in your results file, that I was able to get to and copy to the board. The "slots" directories contain stderr.txt files that are "up to the second", but nothing that would be relevant here as once the WU has errored, the slots directory is cleared, thus I didn't ask for those. It sounds from when the problem started that it may have come in with rosetta 4.79. Have you had any rosetta_graphics_beta 4.80 WUs? Did they work or error? With your multiple computers and not knowing "when you did what", or which host is the biggest problem, it would take me forever to dig through all your results. If your stderrdae.txt and stderrgui.txt haven't been updated in weeks, there may be real issues with "where" BOINC is running on your system... |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
I appreciate the help I have been given so far, but perhaps it would help to be more specific in the above instructions. I am only running a single BOINC and a single Rosetta. The above instruction implies that I should leave it at NO, not YES. ??? If you are running multiple BOINC projects, etc., it is better to have it set to YES. If you are running one project, AND THAT PROJECT IS NOT ROSETTA WHILE ROSETTA HAS A BUG, then it doesn't matter. Right now, to make Rosetta work properly, it must be at YES. |
mags Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
I appreciate the help I have been given so far, but perhaps it would help to be more specific in the above instructions. I am only running a single BOINC and a single Rosetta. The above instruction implies that I should leave it at NO, not YES. ??? Then it should say that (multiple projects and/or Rosetta) for the moment; otherwise there will be a lot like me from FAD who have problems until told otherwise. join Fadbeens |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
Then it should say that, (multiple projects and/or Rosetta) for the moment, otherwise there will be a lot like me from FAD who have problems until told otherwise. @ADMINS: the page in question is https://boinc.bakerlab.org/rosetta/rah_requirements.php - I agree that it probably should be changed pending the fix for the "leave in memory" bug. |
Spectre Joined: 1 Nov 05 Posts: 20 Credit: 177,671 RAC: 0 |
I see where you have posted the contents of the "Messages" tab (or perhaps stdout) but not anything from stderrdae or stderrgui - even with no errors, those files on my system were updated yesterday as they contain even things like "deferring communication" messages. They SHOULD contain at the very least the same text that was placed in your results file, that I was able to get to and copy to the board. The "slots" directories contain stderr.txt files that are "up to the second", but nothing that would be relevant here as once the WU has errored, the slots directory is cleared, thus I didn't ask for those. The last thing I posted was DIRECTLY from my stderrdae.txt file... If you wish to see the files, they are here: http://www.planetspectre.com/rosetta/ I've had better luck with the graphics workunits than the standard ones. Thanks, Spectre |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
The last thing I posted was DIRECTLY from my stderrdae.txt file....if you wish to see the files, they are here: I looked at the files... thank you VERY much for making them available! When I look at my own stdoutdae.txt, I see pausing messages saying "left in memory". I was expecting to see messages in your file saying "removed from memory" right before an error, and there are indeed some:
2005-11-26 22:17:10 [---] Suspending computation and network activity - running CPU benchmarks
2005-11-26 22:17:10 [rosetta@home] Pausing result 1ogw__abrelax_rand_len10_jit02_omega_sim_16037_0 (removed from memory)
2005-11-26 22:17:11 [---] request_reschedule_cpus: process exited
2005-11-26 22:17:12 [---] Running CPU benchmarks
2005-11-26 22:18:12 [---] Benchmark results:
...snip...
2005-11-26 22:18:13 [rosetta@home] Restarting result 1ogw__abrelax_rand_len10_jit02_omega_sim_16037_0 using rosetta version 479
...snip...
2005-11-26 22:27:02 [rosetta@home] Unrecoverable error for result 1ogw__abrelax_rand_len10_jit02_omega_sim_16037_0 ( - exit code -1073741819 (0xc0000005))
2005-11-26 22:27:02 [---] request_reschedule_cpus: process exited
2005-11-26 22:27:02 [rosetta@home] Computation for result 1ogw__abrelax_rand_len10_jit02_omega_sim_16037_0 finished
and I'm seeing an error right after you restart BOINC (which would also have removed from memory when you quit, no way around that) - but I'm also seeing errors because of "no child processes", "incorrect function", etc. Your benchmarks have quite a wide range of values, all within "reason", but still varying quite a bit. You have 384MB of RAM, which is below the minimum recommended, so this COULD all be due to a virtual memory "swapping" situation. Specifically the results that take "more memory" could be running into trouble. I am NOT seeing this as being the "leave in memory" problem; part could be, but the rest isn't. I'm thinking this is a PC instability issue of some type, with maybe a problem aggravated by low RAM.
Are you overclocked? Any indication of other problems with this PC? You had some of the same errors on Predictor... I'm not sure what to tell you. I'm a Mac guy who owns one Windows PC, not a Windows/Intel guru by any means. I've seen threads where RAM test routines are recommended, or Prime95, etc., to help locate/fix this type of thing. Anyone? Edit: Spealink |
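The check described in this post (was a result "removed from memory" before it errored?) can be automated. Below is a minimal sketch in Python; it is not part of BOINC, the function name is made up, and it assumes the log lines look like the stdoutdae.txt excerpt quoted above. Other client versions may word these messages differently.

```python
import re

# Patterns taken from the log excerpt quoted in the post above.
PAUSE = re.compile(r"Pausing result (\S+) \(removed from memory\)")
ERROR = re.compile(r"Unrecoverable error for result (\S+)")


def errors_after_removal(lines):
    """Return names of results that errored after being removed from memory."""
    removed = set()   # results seen with a "removed from memory" pause
    flagged = []      # results that later hit an unrecoverable error
    for line in lines:
        paused = PAUSE.search(line)
        if paused:
            removed.add(paused.group(1))
            continue
        failed = ERROR.search(line)
        if failed and failed.group(1) in removed:
            flagged.append(failed.group(1))
    return flagged


sample = [
    "2005-11-26 22:17:10 [rosetta@home] Pausing result 1ogw__abrelax_rand_len10_jit02_omega_sim_16037_0 (removed from memory)",
    "2005-11-26 22:18:13 [rosetta@home] Restarting result 1ogw__abrelax_rand_len10_jit02_omega_sim_16037_0 using rosetta version 479",
    "2005-11-26 22:27:02 [rosetta@home] Unrecoverable error for result 1ogw__abrelax_rand_len10_jit02_omega_sim_16037_0 ( - exit code -1073741819 (0xc0000005))",
]
print(errors_after_removal(sample))
```

Errors that this does not flag (such as the "no child processes" ones mentioned above) would point away from the "leave in memory" bug and toward something else, which matches the reasoning in the post.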
Spectre Joined: 1 Nov 05 Posts: 20 Credit: 177,671 RAC: 0 |
The last thing I posted was DIRECTLY from my stderrdae.txt file....if you wish to see the files, they are here: Thanks a million for looking into this for me... Odd though that I had very few problems until the past couple of days. My system is a bit OC'd. I'll ramp it down a bit and see if that helps. If not, I'll research a possible bad RAM/insufficient RAM issue. Thanks again! Spectre |
UBT - Halifax--lad Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
Where would one find the info that would need posting when a WU gives an error? I'm just used to looking at my Messages tab, so where else does BOINC store that info? No doubt in 45 minutes I will need to post an error log. Join us in Chat (see the forum) Click the Sig Join UBT |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 11 |
Where would one find the info that would need posting when a WU gives an error In some cases, the Messages tab has all the info needed; in other cases, the result itself, on the web page. In the most complex cases, there are four .txt files in the BOINC folder. Generally the most useful is stdoutdae.txt. The stdERR files might help someone who can understand what's in there, but that isn't me... |
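To answer "what was it doing before that" without pasting a whole log, one option is to pull out just the few lines preceding each error in stdoutdae.txt. This is only a sketch in Python, not a BOINC tool: the function names are made up, and the default path assumes the script is run from inside the BOINC folder, which may not match your install.

```python
from collections import deque


def context_before_errors(lines, before=5, marker="Unrecoverable error"):
    """Yield (error_line, preceding_lines) for every line containing marker."""
    recent = deque(maxlen=before)  # rolling window of the last few lines
    for raw in lines:
        line = raw.rstrip("\n")
        if marker in line:
            yield line, list(recent)
        recent.append(line)


def report(path="stdoutdae.txt", before=5):
    """Print each error with the lines that came just before it.

    The default path is an assumption; point it at the stdoutdae.txt
    in your own BOINC folder.
    """
    with open(path, encoding="utf-8", errors="replace") as log:
        for error, context in context_before_errors(log, before):
            print("\n".join(context))
            print(">>> " + error)
            print()
```

Calling `report()` and posting its output gives helpers both the error line and the pause, restart, or benchmark messages that led up to it.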
UBT - Halifax--lad Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
Just found them all just now after having a browse around Join us in Chat (see the forum) Click the Sig Join UBT |
©2024 University of Washington
https://www.bakerlab.org