Message boards : Number crunching : Computational Error
Author | Message |
---|---|
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
Holly, Answer from Einstein moved to here: Oh, was it me you posted to? People usually address me as Fuzzy, or Ms. Noodles for some. No, I couldn't use your post for anything, except the part about the person who crunched that WU before me having a Mac OS that's not compatible. I didn't know that, as I don't have a Mac. No, it was NOT a benchmark that made the WU crash! My first WU went fine, and the one I'm crunching now seems to be doing OK. It has reached the critical point of 83.33%, so we'll have to see. So I think that particular WU is bad. I don't know if you bothered to look at the specs of my computer, as I don't have them hidden here (it's only over at SETI that I have them hidden, for reasons I won't touch on now), so I don't know why you think my computer is memory limited. And no, it didn't crash because a benchmark was running. I haven't had an automatic benchmark run for several days now, but if one should come while my computer is crunching Rosetta and the WU crashes, so be it, until they solve the problem here. But thanks anyway. [b]"I'm trying to maintain a shred of dignity in this world." - Me[/b] |
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
It just happened again! OK, let's see if we can dissect this problem from my log: 10/8/2005 8:30:04 PM||Starting BOINC client version 4.72 for windows_intelx86 10/8/2005 8:30:04 PM||Data directory: C:\Programmer\BOINC 10/8/2005 8:30:04 PM||Processor Inventory: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz Processor(s) 10/8/2005 8:30:04 PM||Memory Inventory: Memory total - 503.36 MB, Swap total - 1.20 GB 10/8/2005 8:30:04 PM||Disk Inventory: Disk total - 37.25 GB, Disk available - 27.46 GB 10/8/2005 8:30:05 PM|rosetta@home|Computer ID: 12228; location: home; project prefs: home 10/8/2005 8:30:05 PM|LHC@home|Computer ID: 64638; location: home; project prefs: home 10/8/2005 8:30:05 PM|SETI@home|Computer ID: 1489784; location: home; project prefs: home 10/8/2005 8:30:05 PM||General prefs: from rosetta@home (last modified 2005-10-08 20:23:30) 10/8/2005 8:30:05 PM||General prefs: using separate prefs for home 10/8/2005 8:30:05 PM||Remote control not allowed; using loopback address 10/8/2005 8:30:05 PM|rosetta@home|Deferring computation for result 1cfyA_abrelax_13371_1 10/8/2005 8:30:05 PM|SETI@home|Deferring computation for result 18oc03ab.11910.20178.754822.194_0 10/8/2005 8:30:05 PM|LHC@home|Resuming computation for result wjun4D_v6s4hhpac_mqx__10__64.3304_59.3467__6_8__6__60_1_sixvf_boinc29130_2 using sixtrack version 4.67 10/8/2005 8:30:05 PM|SETI@home|Deferring communication with project for 1 minutes and 48 seconds 10/8/2005 8:41:35 PM||request_reschedule_cpus: project op 10/8/2005 8:41:36 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi .... A lot of contacts to LHC .... 
10/8/2005 9:59:40 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/8/2005 9:59:40 PM|LHC@home|Reason: To fetch work 10/8/2005 9:59:40 PM|LHC@home|Requesting 8450 seconds of work, returning 0 results 10/8/2005 9:59:41 PM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/8/2005 9:59:41 PM|LHC@home|No work from project 10/8/2005 9:59:42 PM|LHC@home|Deferring communication with project for 16 minutes and 7 seconds 10/8/2005 9:59:52 PM||request_reschedule_cpus: process exited 10/8/2005 9:59:52 PM|LHC@home|Computation for result wjun4D_v6s4hhpac_mqx__10__64.3304_59.3467__6_8__6__60_1_sixvf_boinc29130_2 finished 10/8/2005 9:59:52 PM|rosetta@home|Restarting result 1cfyA_abrelax_13371_1 using rosetta version 4.77 10/8/2005 9:59:53 PM|LHC@home|Started upload of wjun4D_v6s4hhpac_mqx__10__64.3304_59.3467__6_8__6__60_1_sixvf_boinc29130_2_0 10/8/2005 10:00:00 PM|LHC@home|Finished upload of wjun4D_v6s4hhpac_mqx__10__64.3304_59.3467__6_8__6__60_1_sixvf_boinc29130_2_0 10/8/2005 10:00:00 PM|LHC@home|Throughput 7152 bytes/sec 10/8/2005 10:02:42 PM||request_reschedule_cpus: project op 10/8/2005 10:02:42 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/8/2005 10:02:42 PM|LHC@home|Reason: Requested by user 10/8/2005 10:02:42 PM|LHC@home|Requesting 8640 seconds of work, returning 1 results .... a lot of contacts to LHC .... 
10/8/2005 10:29:20 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/8/2005 10:29:20 PM|LHC@home|Reason: To fetch work 10/8/2005 10:29:20 PM|LHC@home|Requesting 8640 seconds of work, returning 0 results 10/8/2005 10:29:21 PM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/8/2005 10:29:21 PM|LHC@home|No work from project 10/8/2005 10:29:22 PM|LHC@home|Deferring communication with project for 31 minutes and 9 seconds 10/8/2005 10:57:05 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 10/8/2005 10:57:05 PM|rosetta@home|Reason: To fetch work 10/8/2005 10:57:05 PM|rosetta@home|Requesting 2800 seconds of work, returning 0 results 10/8/2005 10:57:08 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded 10/8/2005 10:57:09 PM|rosetta@home|Deferring communication with project for 5 seconds 10/8/2005 10:57:09 PM|rosetta@home|Started download of aa1acf_03_05.200_v1_3.gz 10/8/2005 10:57:09 PM|rosetta@home|Started download of aa1acf_09_05.200_v1_3.gz 10/8/2005 10:58:24 PM|rosetta@home|Finished download of aa1acf_03_05.200_v1_3.gz 10/8/2005 10:58:24 PM|rosetta@home|Throughput 14749 bytes/sec 10/8/2005 10:58:24 PM|rosetta@home|Started download of 1acf_.fasta 10/8/2005 10:58:25 PM|rosetta@home|Finished download of 1acf_.fasta 10/8/2005 10:58:25 PM|rosetta@home|Throughput 188 bytes/sec 10/8/2005 10:58:25 PM|rosetta@home|Started download of 1acf_.psipred_ss2.gz 10/8/2005 10:58:26 PM|rosetta@home|Finished download of 1acf_.psipred_ss2.gz 10/8/2005 10:58:26 PM|rosetta@home|Throughput 1429 bytes/sec 10/8/2005 10:58:26 PM|rosetta@home|Started download of 1acf.pdb.gz 10/8/2005 10:58:28 PM|rosetta@home|Finished download of 1acf.pdb.gz 10/8/2005 10:58:28 PM|rosetta@home|Throughput 16846 bytes/sec 10/8/2005 10:58:28 PM|rosetta@home|Started download of 1acf_.1d1jA.3dpair.base.pairmin_fixc3.cst.gz 10/8/2005 10:58:29 PM|rosetta@home|Finished 
download of 1acf_.1d1jA.3dpair.base.pairmin_fixc3.cst.gz 10/8/2005 10:58:29 PM|rosetta@home|Throughput 4522 bytes/sec 10/8/2005 10:59:39 PM|rosetta@home|Finished download of aa1acf_09_05.200_v1_3.gz 10/8/2005 10:59:39 PM|rosetta@home|Throughput 20819 bytes/sec 10/8/2005 10:59:39 PM||request_reschedule_cpus: files downloaded 10/8/2005 10:59:39 PM|rosetta@home|Pausing result 1cfyA_abrelax_13371_1 (removed from memory) 10/8/2005 10:59:40 PM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18 10/8/2005 10:59:40 PM|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_13371_1 ( - exit code -1073741819 (0xc0000005)) 10/8/2005 10:59:41 PM||request_reschedule_cpus: process exited 10/8/2005 10:59:41 PM|rosetta@home|Deferring communication with project for 1 minutes and 0 seconds 10/8/2005 10:59:41 PM|rosetta@home|Computation for result 1cfyA_abrelax_13371_1 finished 10/8/2005 11:00:33 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/8/2005 11:00:33 PM|LHC@home|Reason: To fetch work 10/8/2005 11:00:33 PM|LHC@home|Requesting 8640 seconds of work, returning 0 results 10/8/2005 11:00:34 PM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/8/2005 11:00:34 PM|LHC@home|No work from project 10/8/2005 11:00:35 PM|LHC@home|Deferring communication with project for 18 minutes and 41 seconds .... a lot of contacts to LHC .... 
10/8/2005 11:19:18 PM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/8/2005 11:19:18 PM|LHC@home|No work from project 10/8/2005 11:19:19 PM|LHC@home|Deferring communication with project for 1 hours, 53 minutes, and 37 seconds 10/8/2005 11:25:52 PM||request_reschedule_cpus: project op 10/8/2005 11:25:52 PM|SETI@home|Pausing result 18oc03ab.11910.20178.754822.194_0 (removed from memory) 10/8/2005 11:25:53 PM|rosetta@home|Starting result 1acf__abrelax_no_cst_06323_0 using rosetta version 4.77 10/8/2005 11:25:53 PM||request_reschedule_cpus: process exited 10/8/2005 11:25:53 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 10/8/2005 11:25:53 PM|rosetta@home|Reason: Requested by user 10/8/2005 11:25:53 PM|rosetta@home|Requesting 0 seconds of work, returning 1 results // Here I return the first crashed WU 10/8/2005 11:25:55 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded 10/9/2005 12:19:20 AM|LHC@home|Deferring communication with project for 53 minutes and 37 seconds 10/9/2005 1:12:58 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/9/2005 1:12:58 AM|LHC@home|Reason: To fetch work 10/9/2005 1:12:58 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results 10/9/2005 1:12:59 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/9/2005 1:12:59 AM|LHC@home|No work from project 10/9/2005 1:13:00 AM|LHC@home|Deferring communication with project for 58 seconds .... more contacts to LHC .... 
10/9/2005 1:23:44 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/9/2005 1:23:44 AM|LHC@home|Reason: To fetch work 10/9/2005 1:23:44 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results 10/9/2005 1:23:45 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/9/2005 1:23:46 AM|LHC@home|No work from project 10/9/2005 1:23:47 AM|LHC@home|Deferring communication with project for 46 minutes and 53 seconds 10/9/2005 1:25:53 AM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18 10/9/2005 1:25:53 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_06323_0 (removed from memory) 10/9/2005 1:25:55 AM||request_reschedule_cpus: process exited 10/9/2005 2:10:41 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/9/2005 2:10:41 AM|LHC@home|Reason: To fetch work 10/9/2005 2:10:41 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results 10/9/2005 2:10:42 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/9/2005 2:10:42 AM|LHC@home|No work from project 10/9/2005 2:10:43 AM|LHC@home|Deferring communication with project for 3 minutes and 29 seconds .... again contacts to LHC .... 
10/9/2005 3:14:15 AM|LHC@home|Deferring communication with project for 1 hours, 1 minutes, and 13 seconds 10/9/2005 3:25:55 AM|SETI@home|Pausing result 18oc03ab.11910.20178.754822.194_0 (removed from memory) 10/9/2005 3:25:56 AM|rosetta@home|Restarting result 1acf__abrelax_no_cst_06323_0 using rosetta version 4.77 10/9/2005 3:25:56 AM||request_reschedule_cpus: process exited 10/9/2005 4:14:16 AM|LHC@home|Deferring communication with project for 1 minutes and 12 seconds 10/9/2005 4:15:29 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/9/2005 4:15:29 AM|LHC@home|Reason: To fetch work 10/9/2005 4:15:29 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results 10/9/2005 4:15:30 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/9/2005 4:15:30 AM|LHC@home|No work from project 10/9/2005 4:15:31 AM|LHC@home|Deferring communication with project for 58 seconds 10/9/2005 4:16:31 AM|LHC@home|Fetching master file 10/9/2005 4:16:32 AM|LHC@home|Master page download succeeded .... contacts to LHC ..... 
10/9/2005 4:43:54 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi 10/9/2005 4:43:54 AM|LHC@home|Reason: To fetch work 10/9/2005 4:43:54 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results 10/9/2005 4:43:55 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded 10/9/2005 4:43:55 AM|LHC@home|No work from project 10/9/2005 4:43:56 AM|LHC@home|Deferring communication with project for 1 hours, 54 minutes, and 20 seconds 10/9/2005 5:25:56 AM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18 10/9/2005 5:25:56 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_06323_0 (removed from memory) 10/9/2005 5:25:58 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_06323_0 ( - exit code -1073741819 (0xc0000005)) 10/9/2005 5:25:58 AM||request_reschedule_cpus: process exited 10/9/2005 5:25:58 AM|rosetta@home|Deferring communication with project for 1 minutes and 0 seconds 10/9/2005 5:25:58 AM|rosetta@home|Computation for result 1acf__abrelax_no_cst_06323_0 finished 10/9/2005 5:26:00 AM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi 10/9/2005 5:26:00 AM|SETI@home|Reason: To fetch work 10/9/2005 5:26:00 AM|SETI@home|Requesting 2171 seconds of work, returning 0 results 10/9/2005 5:26:02 AM|SETI@home|Scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi succeeded 10/9/2005 5:26:03 AM|SETI@home|Deferring communication with project for 10 minutes and 4 seconds 10/9/2005 5:26:03 AM|SETI@home|Started download of 29ap04ab.4305.12480.803406.195 10/9/2005 5:26:16 AM|SETI@home|Finished download of 29ap04ab.4305.12480.803406.195 10/9/2005 5:26:16 AM|SETI@home|Throughput 29700 bytes/sec 10/9/2005 5:26:16 AM||request_reschedule_cpus: files downloaded 10/9/2005 5:26:58 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 10/9/2005 5:26:58 AM|rosetta@home|Reason: To fetch 
work 10/9/2005 5:26:58 AM|rosetta@home|Requesting 8640 seconds of work, returning 1 results // here BOINC manager returns the second crashed WU. I didn't update this time! 10/9/2005 5:27:00 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded 10/9/2005 5:27:01 AM||request_reschedule_cpus: files downloaded 10/9/2005 5:42:58 AM||request_reschedule_cpus: project op 10/9/2005 5:42:58 AM|SETI@home|Pausing result 18oc03ab.11910.20178.754822.194_0 (removed from memory) 10/9/2005 5:42:59 AM|rosetta@home|Starting result 1acf__abrelax_no_cst_07670_0 using rosetta version 4.77 10/9/2005 5:42:59 AM||request_reschedule_cpus: process exited 10/9/2005 5:43:57 AM|LHC@home|Deferring communication with project for 54 minutes and 19 seconds I have set Rosetta to No New Work for now. And I have to look in the other threads and posts to see what I can do to help sort this out. The last WU didn't seem to be stuck at any percentage, as the first one was. The last time I looked at it, it was at about 87% with less than half an hour to go. Maybe this is another bug of some kind. I'll save the files I have in my BOINC directory right now, so David, if you're interested in them, you can contact me at fuzzy dot hollynoodles at gmail dot com. [b]"I'm trying to maintain a shred of dignity in this world." - Me[/b] |
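A side note on the exit code in the logs above: BOINC prints it both as a signed decimal (-1073741819) and as hex (0xc0000005). These are the same 32-bit value, and 0xC0000005 is the Windows status code for an access violation. The small Python sketch below only shows the conversion; the function names and the one-entry lookup table are illustrative, not part of BOINC or Rosetta.

```python
# Hypothetical helper names; only the arithmetic is the point here.

def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF

# 0xC0000005 is the Windows NTSTATUS value STATUS_ACCESS_VIOLATION.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

def describe_exit_code(code: int) -> str:
    """Format an exit code the way the log shows it, plus a name if known."""
    unsigned = to_unsigned32(code)
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"{code} = 0x{unsigned:08X} ({name})"

print(describe_exit_code(-1073741819))
# prints: -1073741819 = 0xC0000005 (STATUS_ACCESS_VIOLATION)
```

So both crashed WUs in this log ended with the same access violation, right after being paused and removed from memory.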
J D K Send message Joined: 23 Sep 05 Posts: 168 Credit: 101,266 RAC: 0 |
|
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
"You must keep Rosetta in memory when doing another project." // Where do I set that in 4.72? EDIT: Found it!!!! Let's see how it works out! Thanks! :-) [b]"I'm trying to maintain a shred of dignity in this world." - Me[/b] |
[B^S] sTrey Send message Joined: 25 Sep 05 Posts: 16 Credit: 15,524 RAC: 0 |
For what it's worth: I've crunched only 17 WUs but haven't had any problems yet, and have gone through a couple of benchmarks. I'm running it on an HT CPU, but I've told it to limit CPUs to one, since that keeps the CPU temp reasonable (and it's set to keep projects in memory). Hardware is a P4 3.2 GHz with 1 GB RAM, running XP Pro. Just another data point. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Holly, Fuzzy Holly, or whatever ... :) For whatever reason I thought the name was Holly. Then again, what do I know... sorry if it offended. But the explanation was to tell you that there is a problem when Rosetta@Home is suspended and removed from memory. Using your logs: 10/8/2005 10:59:39 PM|rosetta@home|Pausing result 1cfyA_abrelax_13371_1 (removed from memory) 10/8/2005 10:59:40 PM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18 10/8/2005 10:59:40 PM|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_13371_1 ( - exit code -1073741819 (0xc0000005)) The first and last of these are the two key lines. The first suspends the process and removes it from memory, triggering the fatal error reported in the last. That is what I was trying to say and failing. Some get this error when Rosetta has work in memory and benchmarks run. I have been running a bit of Rosetta work and I think of the 200 or so I have only lost one to client error. But I leave work in memory, as all my machines have at least 1 GB RAM ... PowerMac has 2.5 GB :) I looked at the one work unit you complained of, and looked at the result of the other user, and was just trying to say that it is not a bad work unit. He/she did not process the work because of an OS X problem, you because of the suspend problem ... anyway, it seems you are OK now ... |
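The pattern described above (a "Pausing result ... (removed from memory)" line followed by an "Unrecoverable error" line for the same result) can be checked mechanically. Here is a minimal Python sketch, assuming the message wording shown in the logs in this thread; the function name is made up for illustration and nothing here comes from BOINC itself.

```python
import re

# Patterns follow the wording of the log lines quoted in this thread.
PAUSE = re.compile(r"Pausing result (\S+) \(removed from memory\)")
ERROR = re.compile(r"Unrecoverable error for result (\S+) \( - exit code (-?\d+)")

def crashed_after_removal(lines):
    """Return (result, exit_code) pairs that errored after being removed from memory."""
    removed = set()
    crashed = []
    for line in lines:
        paused = PAUSE.search(line)
        if paused:
            removed.add(paused.group(1))
            continue
        failed = ERROR.search(line)
        if failed and failed.group(1) in removed:
            crashed.append((failed.group(1), int(failed.group(2))))
    return crashed

log = [
    "10/8/2005 10:59:39 PM|rosetta@home|Pausing result 1cfyA_abrelax_13371_1 (removed from memory)",
    "10/8/2005 10:59:40 PM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18",
    "10/8/2005 10:59:40 PM|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_13371_1 ( - exit code -1073741819 (0xc0000005))",
]
print(crashed_after_removal(log))
# prints: [('1cfyA_abrelax_13371_1', -1073741819)]
```

Run against a saved log split into one entry per line, this would list every WU that died this way, which may help tell the suspend problem apart from a genuinely bad work unit.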
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
Thanks, Mr. Buck. No, I didn't get your meaning. But let's see how things go now. I'm just puzzled that my first WU apparently went well, as I had the round-robin turns at the default 60 minutes and have two other projects running at the same time. So it has been taken in and out of memory a couple of times at least! Hmmm.... So maybe they have solved the problem and made some WUs that are not so sensitive, and I got one???? But no matter what, I'll most probably get the problem again if I'm unlucky enough that the automatic benchmarking kicks in while I have a Rosetta WU running. I can't change to any client above 4.* before LHC will let me. But then I'll know what's going on and just carry on. This seems to be the price at this project, unless they get this solved in the near future. [b]"I'm trying to maintain a shred of dignity in this world." - Me[/b] |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
It's PAUL! Geez ... :) There are other people who say they are not seeing the problem. So it may be intermittent too ... :( As for my meaning, I have not been doing all that hot for some weeks now, so I am not surprised I was not as clear as I would have liked. Heck, earlier today I was typing and I lost that skill too ... not good signs ... But I seem to be back to normal levels of bad, so we shall see if we improve. I know David Kim has been hard at work, I can hear his brain grinding away all the way over here ... and he is several states away from where I live ... or maybe it is just the trash truck ... |
devn Send message Joined: 17 Sep 05 Posts: 18 Credit: 2,063 RAC: 0 |
i have HT but have also tried setting rosetta to use 1 cpu to see if it would make a difference. auto benchmarks caused "unrecoverable error" on a wu yesterday. rosetta is set to remain in memory when preempted but auto benchmarks throws it out of memory. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,103,208 RAC: 167 |
You can run 2 CPU's with the HT CPU's & overcome the "unrecoverable error" by simply Suspending the Rosetta Project & doing a Manual BenchMark. Do this and mark down when you did it & then just do it again before 5 Days are up when the Server will ask for a Benchmark. I know this isn't probably practical for people with a Ton of Computers but for those with not so many it's a simple work around the Error until the Dev's can figure out why it's happening ... |
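For anyone trying this workaround, the "mark down when you did it" step is just date arithmetic. A small Python sketch, assuming the 5-day interval quoted in the post above; the 12-hour safety margin and the function name are hypothetical, picked only for illustration.

```python
from datetime import datetime, timedelta

BENCHMARK_INTERVAL = timedelta(days=5)  # interval quoted in the post above
SAFETY_MARGIN = timedelta(hours=12)     # hypothetical margin, not from the thread

def next_manual_benchmark(last_run: datetime) -> datetime:
    """Latest time to suspend Rosetta and run the next manual benchmark."""
    return last_run + BENCHMARK_INTERVAL - SAFETY_MARGIN

last = datetime(2005, 10, 9, 10, 0)
print(next_manual_benchmark(last))
# prints: 2005-10-13 22:00:00
```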
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
If you use BoincView it should not be much more painful than doing it to one computer. |
devn Send message Joined: 17 Sep 05 Posts: 18 Credit: 2,063 RAC: 0 |
You can run 2 CPU's with the HT CPU's & overcome the "unrecoverable error" by simply Suspending the Rosetta Project & doing a Manual BenchMark. wanted to try the 1 cpu idea since it had been noted that those computers with 1 cpu weren't having the problem. it didn't work for me but sTrey had success with it. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,103,208 RAC: 167 |
wanted to try the 1 cpu idea since it had been noted that those computers with 1 cpu weren't having the problem. it didn't work for me but sTrey had success with it. ========== Well if you have a HT CPU and run it only as 1 instead of 2 you're giving up around 15% to 25% of your Performance Crunching the WU's, may as well Crunch for a Project that's not having a Problem with the HT CPU's. But that can be hard to do also since it seems all the Projects have some sort of Problem with them. Almost any Project I've run WU's for has a Problem from time to time starting up the Second or next WU when 1 of the 2 running finishes. It's an on-again, off-again thing but all the Projects have the problem. Who knows ... :) |
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
Problem solved! For me at least! But this problem was very confusing, and yes, I have read about it in the other threads, but because my first WU went fine, I didn't think my problem was about letting the WUs stay in memory on my computer. Very confusing, when it didn't crash the first time under the very same conditions as later! But I have returned the next valid WU and have the third one crunching now. So GO Rosetta!!! [b]"I'm trying to maintain a shred of dignity in this world." - Me[/b] |
Angus Send message Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0 |
Here are 4 WUs that all failed (dual Xeon HT, host ID=1779) when the auto benchmark ran: 10/9/2005 10:00:10 AM||Suspending computation and network activity - running CPU benchmarks 10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_02636_0 (removed from memory) // set to leave in memory 10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_04179_0 (removed from memory) 10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_04241_0 (removed from memory) 10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_04274_0 (removed from memory) 10/9/2005 10:00:10 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_04179_0 ( - exit code -1073741819 (0xc0000005)) 10/9/2005 10:00:10 AM||request_reschedule_cpus: process exited 10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_02636_0 ( - exit code -1073741819 (0xc0000005)) 10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_04241_0 ( - exit code -1073741819 (0xc0000005)) 10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_04274_0 ( - exit code -1073741819 (0xc0000005)) 10/9/2005 10:00:11 AM||request_reschedule_cpus: process exited 10/9/2005 10:00:12 AM||Running CPU benchmarks 10/9/2005 10:01:09 AM||Benchmark results: 10/9/2005 10:01:09 AM|| Number of CPUs: 4 10/9/2005 10:01:09 AM|| 1222 double precision MIPS (Whetstone) per CPU 10/9/2005 10:01:09 AM|| 1044 integer MIPS (Dhrystone) per CPU 10/9/2005 10:01:09 AM||Finished CPU benchmarks 10/9/2005 10:01:09 AM||Resuming computation and network activity 10/9/2005 10:01:09 AM||request_reschedule_cpus: Resuming activities 10/9/2005 10:01:09 AM|rosetta@home|Deferring communication with project for 2 seconds 10/9/2005 10:01:09 AM|rosetta@home|Computation for result 1acf__abrelax_no_cst_02636_0 finished 10/9/2005 10:01:09 AM|rosetta@home|Computation for result 1acf__abrelax_no_cst_04241_0 finished 10/9/2005 10:01:09 
AM|rosetta@home|resume_or_start(): unexpected process state 2 10/9/2005 10:01:09 AM|rosetta@home|resume_or_start(): unexpected process state 2 10/9/2005 10:01:09 AM|rosetta@home|Starting result 1acf__abrelax_04604_0 using rosetta version 4.77 10/9/2005 10:01:10 AM|rosetta@home|Starting result 1acf__abrelax_no_cst_05010_0 using rosetta version 4.77 10/9/2005 10:01:10 AM|rosetta@home|Computation for result 1acf__abrelax_no_cst_04179_0 finished 10/9/2005 10:01:11 AM|rosetta@home|Computation for result 1acf__abrelax_04274_0 finished 10/9/2005 10:01:12 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 10/9/2005 10:01:12 AM|rosetta@home|Reason: To fetch work 10/9/2005 10:01:12 AM|rosetta@home|Requesting 73006 seconds of work, returning 5 results 10/9/2005 10:01:13 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded 10/9/2005 10:01:14 AM||request_reschedule_cpus: files downloaded 10/9/2005 10:01:14 AM||request_reschedule_cpus: files downloaded 10/9/2005 10:01:14 AM||request_reschedule_cpus: files downloaded 10/9/2005 10:01:14 AM||request_reschedule_cpus: files downloaded Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :) "You can't fix stupid" (Ron White) |
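The log above uses a simple `timestamp|project|message` layout, so it is easy to check which results failed while a benchmark was in progress. A minimal Python sketch follows, assuming that layout and the message wording seen in this thread; the `Entry` type and function names are invented for illustration, and the AM/PM parsing assumes an English locale.

```python
from datetime import datetime
from typing import NamedTuple

class Entry(NamedTuple):
    when: datetime
    project: str   # empty string for client-level messages
    message: str

def parse_line(line: str) -> Entry:
    """Split one 'timestamp|project|message' log line into its fields."""
    stamp, project, message = line.split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return Entry(when, project, message)

def errors_during_benchmark(lines):
    """Names of results that hit an unrecoverable error while benchmarks ran."""
    in_benchmark = False
    failed = []
    for entry in map(parse_line, lines):
        if "running CPU benchmarks" in entry.message:
            in_benchmark = True
        elif "Finished CPU benchmarks" in entry.message:
            in_benchmark = False
        elif in_benchmark and entry.message.startswith("Unrecoverable error for result "):
            failed.append(entry.message.split()[4])
    return failed

sample = [
    "10/9/2005 10:00:10 AM||Suspending computation and network activity - running CPU benchmarks",
    "10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_04179_0 (removed from memory)",
    "10/9/2005 10:00:10 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_04179_0 ( - exit code -1073741819 (0xc0000005))",
    "10/9/2005 10:01:09 AM||Finished CPU benchmarks",
]
print(errors_during_benchmark(sample))
# prints: ['1acf__abrelax_no_cst_04179_0']
```

The forum paste runs the entries together on one line, so a real log file (one entry per line) would be the input to use.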
devn Send message Joined: 17 Sep 05 Posts: 18 Credit: 2,063 RAC: 0 |
Well if you have a HT CPU and run it only as 1 instead of 2 you're giving up around 15% to 25% of your Performance Crunching the WU's, may as well Crunch for a Project that's not having a Problem with the HT CPU's. But that can be hard to do also since it seems all the Projects have some sort of Problem with them. Almost any Project I've run WU's for has a Problem from time to time starting up the Second or next WU when 1 of the 2 running finishes. It's an on-again, off-again thing but all the Projects have the problem. Who knows ... :) ========== i wouldn't mind giving up a little performance if using 1 cpu had worked and allowed rosetta to run w/out errors. haven't had problems with HT and other projects so far. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Angus, I am way behind on log file work, but can you zip up the TXT and OLD files and send them to me at p.d.buck@comcast.net? I hope to get at least one example out of them ... I don't like to pull from the pages here, as I always seem to be missing something (when I compare the logs to posts). Anyway, thanks! |
Mike Gelvin Send message Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
I just had an HT machine error out without benchmarks being activated. I have had several computers (4) error out; 3 were HT, 1 was single-threaded. I have successfully returned one work unit, from a single-threaded CPU. I'm pretty sure benchmarks were run on that computer, but I'm not sure Rosetta was actually interrupted (it could have been SETI). Anyway, here are the logs from the latest one that failed. 10/09/05 13:24:46||request_reschedule_cpus: files downloaded 10/09/05 13:24:46|LHC@home|Restarting result w5_lhc_coll_IP15_trip_meas__30__64.31_59.32__4_6__6__28.5_1_sixvf_boinc14060_2 using sixtrack version 4.67 10/09/05 13:24:46|rosetta@home|Restarting result 1cfyA_abrelax_18090_0 using rosetta version 4.77 10/09/05 13:24:46|SETI@home|Pausing result 19fe04ab.3149.13793.42344.183_2 (removed from memory) 10/09/05 13:24:46|SETI@home|Pausing result 19fe04ab.3149.14129.848566.207_1 (removed from memory) 10/09/05 13:24:47||request_reschedule_cpus: process exited 10/09/05 13:24:47|LHC@home|Pausing result w5_lhc_coll_IP15_trip_meas__30__64.31_59.32__4_6__6__28.5_1_sixvf_boinc14060_2 (removed from memory) 10/09/05 13:24:47|rosetta@home|Starting result 1acf__abrelax_05497_2 using rosetta version 4.77 10/09/05 13:24:48||request_reschedule_cpus: process exited 10/09/05 15:24:49|LHC@home|Restarting result w5_lhc_coll_IP15_trip_meas__30__64.31_59.32__4_6__6__28.5_1_sixvf_boinc14060_2 using sixtrack version 4.67 10/09/05 15:24:49|rosetta@home|Pausing result 1cfyA_abrelax_18090_0 (removed from memory) 10/09/05 15:24:49|SETI@home|Restarting result 19fe04ab.3149.13793.42344.183_2 using setiathome version 4.18 10/09/05 15:24:49|rosetta@home|Pausing result 1acf__abrelax_05497_2 (removed from memory) 10/09/05 15:24:51|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_18090_0 ( - exit code -1073741819 (0xc0000005)) 10/09/05 15:24:52|rosetta@home|Unrecoverable error for result 1acf__abrelax_05497_2 ( - exit code -1073741819 (0xc0000005)) 10/09/05 15:24:52||request_reschedule_cpus: 
process exited 10/09/05 15:24:52|rosetta@home|Computation for result 1cfyA_abrelax_18090_0 finished 10/09/05 15:24:52|rosetta@home|Computation for result 1acf__abrelax_05497_2 finished 10/09/05 15:24:52|LHC@home|Pausing result w5_lhc_coll_IP15_trip_meas__30__64.31_59.32__4_6__6__28.5_1_sixvf_boinc14060_2 (removed from memory) 10/09/05 15:24:52|SETI@home|Restarting result 19fe04ab.3149.14129.848566.207_1 using setiathome version 4.18 |
Angus Send message Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0 |
Angus, Paul - Sent. Let me know if they don't arrive - danged mail filters around here... edit - Since I just saw in another post that 5.x.x may fix the problem, I'll update one of the dual Xeon HT boxes to v5 to see if it changes anything. Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :) "You can't fix stupid" (Ron White) |
Jord Send message Joined: 16 Sep 05 Posts: 41 Credit: 204,120 RAC: 0 |
Best wait a day or two longer. Release client v5.2 is on its way, expected sometime this week. If only so it stops you from having to change BOINC yet again in a couple of days. |
©2024 University of Washington
https://www.bakerlab.org