Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Moderator8 Volunteer moderator Send message Joined: 10 Jan 06 Posts: 16 Credit: 0 RAC: 0 |
This thread replaces the previous "stuck WU" and "please abort" threads, which were getting a little long and unwieldy. Existing postings have been left in the original threads, and direct replies to those postings can still be made there. Thanks to everyone for the continued reports of bugs and suspicious events. |
bruce boytler Send message Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
This workunit was stuck at 1 percent for over 3 hours of CPU time. It was also recently created. NO_BARCODE_FRAGS_2reb_227_9206 |
Marc Miller Send message Joined: 30 Nov 05 Posts: 2 Credit: 18,163 RAC: 0 |
Must "Leave applications in memory while preempted?" still be set to yes? My corporate desktop is on 24/7 and could be useful for projects like this, but it is short on memory. BOINC 5.2.13 Windows XP rosetta 481 |
Marc Miller Send message Joined: 30 Nov 05 Posts: 2 Credit: 18,163 RAC: 0 |
Must "Leave applications in memory while preempted?" still be set to yes? My corporate desktop is on 24/7 and could be useful for projects like this, but it is short on memory. ...and my failed workunits ("Unrecoverable error - exit code -1073741819 (0xc0000005)") include NO_SIM_ANNEAL_1ogw_228_9701_0 MORE_FRAGS_W_BARCODE_1ogw_231_7263_0 NO_BARCODE_FRAGS_1b72_227_9863_0 NO_BARCODE_FRAGS_1dtj_227_9288_1 NO_RAND_WTS_1b72_230_7636_0 NO_RANDOM_WTS_OR_FRAGS_1b72_223_9475_1 MORE_FRAGS_W_BARCODE_1ogw_231_8193_0 NO_RAND_WTS_1mky_230_8461_0 INCREASE_CYCLES_10_1di2_226_8118_2 BARCODE_FRAG_30_1di2_234_119_0 MORE_FRAGS_W_BARCODE_1dtj_231_9106_0 |
godpiou Send message Joined: 22 Dec 05 Posts: 7 Credit: 1,373 RAC: 0 |
OK... This is my latest unit that showed an error: BARCODE_FRAG_30_1di2_234_471_0 Hope this helps! Godpiou |
godpiou Send message Joined: 22 Dec 05 Posts: 7 Credit: 1,373 RAC: 0 |
It's me again... Another unit that aborted computation with a fatal error. The unit in question is: BARCODE_FRAG_30_2tif_234_838_0 Again, hope this helps... Godpiou |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Must "Leave applications in memory while preempted?" still be set to yes? Yes, though if Rosetta is your sole project and the box is on 24/7 you may get away with a 'no' setting; you would still risk losing a WU every time the benchmarks run. Best advice is to try it with the setting at 'yes', and if it slows the box down then switch to running only when the machine is not in use. I also had some success running with a 'yes' setting and with the maximum number of CPUs set to 1 on an HT box: it got 75% of the throughput of allowing two 'cpus' to be used, but left the box responding to other tasks as well as it does when BOINC is not running. If none of these work, I think the best advice at present is to go to another project until Rosetta gets this sorted. Sorry! River~~ |
Steve Shedroff Send message Joined: 7 Nov 05 Posts: 11 Credit: 250,657 RAC: 0 |
I have a Work Unit that is at 1% processed after 21 hours, 24 minutes and counting. Id: NO_BARCODE_FRAGS_2reb_227_9692_0. The other Work Units I have seem to be running fine in parallel on this multi-CPU computer. Should I abort or let it fail on its own? |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I have a Work Unit that is at 1% processed after 21 hours, 24 minutes and counting. Id: NO_BARCODE_FRAGS_2reb_227_9692_0. The other Work Units I have seem to be running fine in parallel on this multi-CPU computer. Should I abort or let it fail on its own? My rule of thumb is to abort if a WU sticks at the same progress for more than half the time a full-length Rosetta WU takes on the same box. Sometimes stopping BOINC and restarting can save such a WU, but on a multi-CPU box you lose some crunch time on each of the other CPUs, as each of the other Rosetta WUs will revert to its last checkpoint. So unless Rosetta WUs typically take 42 hours on your box, it is time to abort, in my opinion. |
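The rule of thumb in the reply above can be written down as a one-line check. This is only a sketch of that advice, not anything built into BOINC or Rosetta; the function name and the typical run lengths passed to it are hypothetical.

```python
def should_abort(stuck_hours: float, typical_wu_hours: float) -> bool:
    """Rule of thumb from the reply above: abort when a WU has sat at the
    same progress for more than half of a typical full-length run."""
    return stuck_hours > typical_wu_hours / 2


# 21 hours 24 minutes stuck at 1%, as in the report quoted above.
stuck = 21 + 24 / 60

# The typical run lengths below are illustrative assumptions.
print(should_abort(stuck, typical_wu_hours=2))   # True
print(should_abort(stuck, typical_wu_hours=48))  # False
```

The threshold is deliberately crude: it ignores the checkpoint loss on the other CPUs that the reply mentions as a cost of restarting BOINC instead.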
bruce boytler Send message Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
INCREASE_CYCLES_10_1ogw_226_9787 This workunit errored out after 19000 seconds; that's over five hours wasted! I have been noticing that my computer will successfully compute others' failed units, and my failed units are sometimes successfully run on other computers. Question to the project scientists: what is going on here? I really don't have the time to babysit any of my computers. Mainly I check in randomly throughout the day for a couple of minutes. Cheers!! I hope I can get a reasonable answer from a project scientist as to why the problems of a few weeks ago still rear their ugly heads from time to time. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
INCREASE_CYCLES_10_1ogw_226_9787 I don't know exactly what is going on. For each work unit, we now have close to the targeted 10,000 successful completions, so there are clearly no systematic errors affecting all instances of a WU. I would love to know how many failures of the sort you had there have been. It is possible that for certain random number seeds very rare Rosetta bugs are encountered; this would have to be at less than 1 in 100, since we don't see them in our in-house tests. So, a question: what fraction of your WUs have this problem? We can search for Rosetta bugs by starting runs in house with the random number seed and command line from your run. We are doing this now. |
carl.h Send message Joined: 28 Dec 05 Posts: 555 Credit: 183,449 RAC: 0 |
1/11/2006 13:43:17|rosetta@home|Unrecoverable error for result NO_BARCODE_FRAGS_1di2_227_8993_0 ( - exit code -1073741819 (0xc0000005)) 1/11/2006 17:00:24|rosetta@home|Unrecoverable error for result DEFAULT_1n0u_218_633_9 (Incorrect function. (0x1) - exit code 1 (0x1)) Not all Czech`s bounce but I`d like to try with Barbar ;-) Make no mistake This IS the TEDDIES TEAM. |
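For anyone collecting reports like the two log lines above, the result name and exit code can be pulled out with a regular expression. This is a minimal sketch: the pattern is inferred only from the lines quoted in this thread, not from BOINC documentation, and `parse_error_line` is a hypothetical helper name.

```python
import re

# Layout assumed from the "Unrecoverable error" lines quoted in this thread.
LINE_RE = re.compile(
    r"\|rosetta@home\|Unrecoverable error for result (?P<result>\S+) "
    r"\(.*exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)


def parse_error_line(line: str):
    """Return (result_name, exit_code, hex_code), or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return m.group("result"), int(m.group("code")), m.group("hex")


line = ("1/11/2006 13:43:17|rosetta@home|Unrecoverable error for result "
        "NO_BARCODE_FRAGS_1di2_227_8993_0 ( - exit code -1073741819 (0xc0000005))")
print(parse_error_line(line))
```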
kb7rzf Send message Joined: 7 Oct 05 Posts: 16 Credit: 35,427 RAC: 0 |
Got one that errored out. 1/11/2006 4:18:28 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1r69_240_504_0 ( - exit code -1073741819 (0xc0000005)) [edit] Here's the info on the WU: stderr out <core_client_version>5.2.13</core_client_version> <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> No heartbeat from core client for 31 sec - exiting ***UNHANDLED EXCEPTION**** Reason: Access Violation (0xc0000005) at address 0x7C911E58 read attempt to address 0xBF005BE0 Exiting... </stderr_txt> Validate state Invalid Claimed credit 16.5597049633772 Granted credit 0 application version 4.81 Thanks Jeremy |
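The decimal exit code and the hex value in these reports are the same 32-bit number read two ways: -1073741819 is the signed reading of 0xc0000005, which the stderr output above labels an Access Violation. A short sketch of the conversion, with hypothetical helper names:

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


print(hex(to_unsigned32(-1073741819)))  # 0xc0000005
print(to_signed32(0xC0000005))          # -1073741819
```

This is why a report that drops the minus sign no longer matches the hex code printed beside it.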
bruce boytler Send message Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=5020591 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4952011 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4964562 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4964562 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4704097 Hi David, Thank you for responding to my question. Out of 185 results I have yet to compute 15, so I have 170 results computed. Of the 170, 21 were the ones that error out in 20 seconds. There were 5 workunits that had the large compute time and errored out after anywhere from 2 hours to five hours plus. Out of these five, the big one has errored out on another computer after an hour and is currently on a third computer. The one that errored out after three hours failed on another computer after one hour, then was successfully completed by a third computer after four hours. Of the two that errored after an hour and a half, one was successfully computed and the other is on another computer. These time-consuming workunits represent 3 percent of my 170 completions since Dec 21 2005. I have included the five workunits' addresses at the top of this post. Have a great day... despite the rain... Cheers! It is possible that for certain random number seeds very rare rosetta bugs are encountered--this would have to be at less than 1 in 100 since we don't see them in our in house tests. so question: what fraction of your WU have this problem? we can search for rosetta bugs by starting runs in house with the random number seed and command line from your run. we are doing this now |
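The fractions in this post can be checked directly against the "less than 1 in 100" figure from the project's in-house tests. A small worked calculation using only the counts given above; the variable names are illustrative.

```python
total_results = 185
pending = 15
completed = total_results - pending   # 170 results computed

quick_failures = 21   # errored out within about 20 seconds
long_failures = 5     # errored out after hours of compute time

quick_rate = quick_failures / completed
long_rate = long_failures / completed

print(f"quick failures: {quick_rate:.1%}")  # 12.4%
print(f"long failures:  {long_rate:.1%}")   # 2.9%
print(long_rate > 1 / 100)                  # True
```

On these numbers the long-running failures alone are roughly three times the 1-in-100 rate, though a sample of 170 results from one participant is small.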
Andrew Send message Joined: 19 Sep 05 Posts: 162 Credit: 105,512 RAC: 0 |
If you get a stuck WU, specifically a 1% stuck WU, and want to help diagnose the problem, follow the instructions in this thread: Help us solve the 1% bug! |
godpiou Send message Joined: 22 Dec 05 Posts: 7 Credit: 1,373 RAC: 0 |
Hi! Sorry, but... another WU aborted... Here are the details: rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1npsA_239_837_0 ( - exit code -164 (0xffffff5c)) Again, hope this helps, and have a good day, Godpiou |
Marie Lucie Send message Joined: 9 Dec 05 Posts: 5 Credit: 40,616 RAC: 0 |
Hello, for the first time I just had a problem with a wu. Here are the messages : 13/01/2006 17:25:02|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1mky_208_40_8 ( - exit code -1073741819 (0xc0000005)) 13/01/2006 17:25:02||request_reschedule_cpus: process exited 13/01/2006 17:25:02|rosetta@home|Computation for result INCREASE_CYCLES_10_1mky_208_40_8 finished Hope it helps |
Golden Turtle Send message Joined: 23 Sep 05 Posts: 34 Credit: 22,941 RAC: 0 |
ROSETTA 5.2.13. Windows XP 2Pro. WU:- No Barcode Frags - 1di2 227 9845 0. CPU Time = 08.29.21 Progress = 0% Time to completion = 7.44.04 Message: aborted via GUI RPC Unhandled exception Reason Access Violation [0xc0000005] at address 0x7c910f29 read attempt to address 0x3f8f5c2d exiting. Hope this is of use! |
kevint Send message Joined: 8 Oct 05 Posts: 84 Credit: 2,530,451 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3537724 I assume this is a 1% bug. I have only recently started crunching Rosetta, so I am unfamiliar with much of what has gone on in the past. This PC is a P4 Hyperthread (4 virtual CPUs) and normally crunches an average of about 2 hours per CPU. I noticed this one stuck today at just over 3 hours and still sitting around 1%. I was unaware of a bug and thought it might be something wrong with my PC, so I just aborted it. SETI.USA |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
This WU here hung for over 4 hours at 70% complete. I stopped BOINC, restarted, and it completed successfully. There are some error notations and debug info in the reported result at the above link. WU name - NO_RAND_WTS_1ogw_230_7724_0 Mac 1.4GHz Dual G4 Mac OS 10.4.3 BOINC 5.2.13 Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
©2024 University of Washington
https://www.bakerlab.org