Report stuck & aborted WU here please

Author	Message
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0	Message 13552 - Posted: 12 Apr 2006, 16:16:02 UTC - in response to Message 13551. Last modified: 12 Apr 2006, 16:17:14 UTC anyway--main question--are people seeing more stuck work units now than 7-10 days ago? Rom (or someone) should probably do an analysis to see what (if any) common factors there are for the errored units, and the overall frequency. Knock on wood (although with limited sampling), I have kept my run time at 8 hours and have not had any problems with 4.98. Regards, Bob P. ID: 13552 · Rating: 0 · rate: / Reply Quote

arminius Send message Joined: 23 Sep 05 Posts: 8 Credit: 805,403 RAC: 0	Message 13553 - Posted: 12 Apr 2006, 16:29:10 UTC Last modified: 12 Apr 2006, 16:34:10 UTC my first (linux box) .... stuck at 1.04% TRUNCATE_TERMINI_FULLRELAX_1enh__433_38_0 a. ID: 13553 · Rating: 0 · rate: / Reply Quote

Robert Everly Send message Joined: 8 Oct 05 Posts: 27 Credit: 665,094 RAC: 0	Message 13557 - Posted: 12 Apr 2006, 17:40:59 UTC Just got my first stuck WU. Yay me :( Anyway, it's this one: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13923483 It's currently at 8+49 CPU time and stuck at 1.042%. It has exceeded both the default run time and my run time setting. I have suspended the WU. BOINC 5.2.13. Please advise as to what to do with this WU. ID: 13557 · Rating: 0 · rate: / Reply Quote

jomebrew Send message Joined: 31 Mar 06 Posts: 2 Credit: 25,914,516 RAC: 0	Message 13559 - Posted: 12 Apr 2006, 18:02:57 UTC I have a couple of these on my Linux system. I would appreciate some help on a clean way to abort these on Linux. I have been hacking client_state.xml and deleting files in the slots directory. There has to be a better way. Warning! PRODUCTION_ABINITIO_CENTROID_PACKING_1ctf__429_247_0 was started at 2006-04-09 20:52:34 but has not finished! Warning! HBLR_1.0_2reb_426_994_0 was started at 2006-04-09 23:07:09 but has not finished! Warning! 7449_largescale_large_fullatom_relax_dec7449_1_05_6.pdb_431_53_0 was started at 2006-04-09 20:58:18 but has not finished! Warning! PRODUCTION_ABINITIO_CENTROID_PACKING_1vls__428_262_0 was started at 2006-04-09 21:19:28 but has not finished! Warning! 7485_largescale_large_fullatom_relax_dec7485_1_05_8.pdb_432_129_0 was started at 2006-04-09 21:50:01 but has not finished! Warning! TRUNCATE_TERMINI_FULLRELAX_1ptq__433_587_0 was started at 2006-04-11 17:55:43 but has not finished! ID: 13559 · Rating: 0 · rate: / Reply Quote

n7zfi Send message Joined: 7 Apr 06 Posts: 1 Credit: 4,623,875 RAC: 0	Message 13563 - Posted: 12 Apr 2006, 18:19:18 UTC Running on Windows XP Pro, I have a WU stuck at 1.04%. The graphics appear to be locked up; nothing is moving even though the CPU utilization clock keeps ticking. The WU in question is: TRUNCATE_TERMINI_FULLRELAX_1ptq_433_906_0 I have suspended it after 1:34:22 of run time. The other WUs progress past that point in a few minutes. ID: 13563 · Rating: 0 · rate: / Reply Quote

snoekbaars Send message Joined: 16 Mar 06 Posts: 2 Credit: 12,136 RAC: 0	Message 13565 - Posted: 12 Apr 2006, 18:42:22 UTC Work unit aborted at 48% - CPU time used ~24 hours. Time needed to completion only going up. Nothing moved in the graphics. WU Name "FA_RLXpt_hom003_1ptq__361_156_3" - Application "rosetta 4.98" Workunit = 11684527; Result ID = 16802748; System = Intel P4 3.0GHz, Win-XP SP 2 The workunit still reports "in progress" at the time of writing this message. The workunit was aborted manually ("Aborted via GUI RPC"). ID: 13565 · Rating: 0 · rate: / Reply Quote

Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0	Message 13569 - Posted: 12 Apr 2006, 19:17:41 UTC Just again had a WU that was running for more than 6 hours at 1.17%. When I checked it again, another one had started, which has been running for 45 minutes now at 1.06%, but I cannot find that other WU in my results. Better test-drive a project like this more thoroughly before letting so many people waste their money. If I go on this month it will be the last anyway. Rather fed up with it. No fun at all anymore. ID: 13569 · Rating: 0 · rate: / Reply Quote

Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0	Message 13575 - Posted: 12 Apr 2006, 20:28:58 UTC Hi guys... thanks very much for reporting these errors. Work units with 1b3a, 1enh, 2tif, and 1ptq appear to be wreaking havoc throughout boinc. Sorry for the trouble -- this won't happen again, as we are increasing the stringency of our local tests that precede submission to boinc. Please MANUALLY ABORT work units with 1b3a, 1enh, 2tif, or 1ptq in the title! ID: 13575 · Rating: 0 · rate: / Reply Quote

Tallguy-13088 Send message Joined: 14 Dec 05 Posts: 9 Credit: 843,378 RAC: 0	Message 13581 - Posted: 12 Apr 2006, 21:34:58 UTC Last modified: 12 Apr 2006, 21:45:57 UTC Hi, I just aborted TRUNCATE_TERMINI_FULLRELAX_2tif_433_796_0 after 11+ hours (11:50:43) at 1.042% complete. Stage=Full Atom Relaxation, Model=1 and Step= 245292. The links are: RESULT: https://boinc.bakerlab.org/result.php?resultid=17040622 WORKUNIT: https://boinc.bakerlab.org/workunit.php?wuid=13969890. Unfortunately, I was unable to download the 4.98 PDB for Windows so I can't help you there but this was running under Win2K (v5.0 Build 2195, SP4) on a Pentium R4 3.20Ghz machine. As noted earlier, Rosetta was 4.98. OOPS! Just caught the blurb about 4.83 and 4.98 being identical! Hope this helps you find the little bugger! ID: 13581 · Rating: 0 · rate: / Reply Quote

Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0	Message 13582 - Posted: 12 Apr 2006, 21:45:47 UTC - in response to Message 13581. Thanks for the post -- the percent complete is particularly interesting. The reports of 1.04%, 1.042%, 1.17% are telling us that similar work units are getting stuck at rather different points along their simulations. It's helping us focus on where to look for the bug. Please keep posting information on stuck jobs! ID: 13582 · Rating: 0 · rate: / Reply Quote

Tallguy-13088 Send message Joined: 14 Dec 05 Posts: 9 Credit: 843,378 RAC: 0	Message 13586 - Posted: 12 Apr 2006, 22:21:00 UTC - in response to Message 13582. As for the post, no problem! We are all exploring our own little corner of the universe and sharing info is just another way of getting us closer to a better understanding of the "bigger picture". Being a M/F Sys Prog makes it easier for me to understand what you guys need to find the "speed bumps". <grin>. I can even relate to the "smack" sound that will occur when you do eventually find it. Not knowing anything about your algorithm, I have to rely on an intuitive guess as to where the problem might be. I would imagine that you are probably working on some kind of descent down a "decision tree" and when you get to a dead end, you have to climb back up a level and pursue the next "branch". My guess is that the program is getting "into a bind" when the structure that is being analyzed is complex enough that the process "loses track" of its previous choices and gets into a loop re-analyzing the same sequence of molecules. Just my guess. Anyways, good luck in finding it. I suspect that you are "almost there". ID: 13586 · Rating: 0 · rate: / Reply Quote

AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 13588 - Posted: 12 Apr 2006, 22:38:38 UTC - in response to Message 13551. the only change is that we increased the default run time from 2 hours to 4 hours ... I doubt that's the problem. I think part of the problem is that there have been some bad WUs released recently (the ones Rhiju posted about). Another problem is that the bug requiring "keep in memory" has been fixed. That means a lot of people are setting "keep in memory" to "no". There are places in some WUs that require more than an hour to get to the next checkpoint, so with the default switching time of one hour the WU will keep dropping back to the last checkpoint indefinitely. ID: 13588 · Rating: 0 · rate: / Reply Quote
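The checkpoint problem described in the post above can be illustrated with a toy model. This is only a sketch under stated assumptions, not BOINC's or Rosetta's actual scheduling code: the function name and its parameters are hypothetical, checkpoints are assumed to be evenly spaced, and every preemption is assumed to unload the application unless "keep in memory" is on.

```python
def progress_after_slices(checkpoint_gap, slice_len, keep_in_memory, slices):
    """Return CPU-hours of surviving progress after `slices` run periods.

    checkpoint_gap : hours of work between two checkpoints (assumed uniform)
    slice_len      : hours the task runs before the client switches projects
    keep_in_memory : True if the task stays loaded while preempted
    slices         : number of run periods to simulate
    """
    progress = 0.0
    for _ in range(slices):
        progress += slice_len
        if not keep_in_memory:
            # The task is unloaded on preemption, so anything computed since
            # the last checkpoint is lost and must be redone next time.
            progress = (progress // checkpoint_gap) * checkpoint_gap
    return progress


# A stretch needing 1.5 h between checkpoints, with 1 h switching:
stuck = progress_after_slices(1.5, 1.0, keep_in_memory=False, slices=5)
fine = progress_after_slices(1.5, 1.0, keep_in_memory=True, slices=5)
```

In this model the task never gets past its last checkpoint when the gap is longer than the switching interval and the task is not kept in memory, which matches the "dropping back indefinitely" behaviour reported in the thread. Either a longer switching interval or keeping the task in memory lets it advance.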

Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0	Message 13591 - Posted: 12 Apr 2006, 22:46:21 UTC Found a bug! David Baker and I just tracked down the problem with these 4 workunits. It's a stupid infinite loop that only occurs with proteins with lengths of exactly 44 residues using one particular mode of Rosetta -- somehow no one in our group had ever looked at a protein exactly that size! So TallGuy-13088, you predicted right ... Please do abort these workunits (below); otherwise, your client will continue to crunch the jobs until it times out (about 48 hours on a Windows machine). The good news is that we will give credit to all the jobs that time out, and are increasing the rigor of in-house testing to prevent this from happening in the future. And this little adventure helped us track down a pernicious bug in our code. Unfortunately, we don't yet have fixes for all the stuck jobs, though -- please continue to post info on other jobs that stop moving. It helps! Jobs that should be aborted: TRUNCATE_TERMINI_FULLRELAX_1enh__433 TRUNCATE_TERMINI_FULLRELAX_1b3aA_433 TRUNCATE_TERMINI_FULLRELAX_1ptq__433 TRUNCATE_TERMINI_FULLRELAX_2tif__433 ID: 13591 · Rating: 0 · rate: / Reply Quote
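For anyone with many queued tasks, the abort list above can be checked mechanically. The following is a minimal sketch, assuming you already have the task names as plain strings (how you obtain them from your client is not covered here). The function name `should_abort` is hypothetical. Because some posts in this thread write the names with a single underscore before `433` and others with two, the pattern accepts one or more.

```python
import re

# Matches the four job families listed in the post above, e.g.
# TRUNCATE_TERMINI_FULLRELAX_1enh__433_38_0 or
# TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_297_0 (optional chain letter).
BAD_JOB = re.compile(
    r"^TRUNCATE_TERMINI_FULLRELAX_(?:1enh|1b3a|1ptq|2tif)[A-Z]?_+433(?:_|$)"
)


def should_abort(task_name):
    """Return True if the task name belongs to one of the listed job families."""
    return BAD_JOB.match(task_name) is not None


# Example: filter a list of task names down to the ones to abort by hand.
queued = [
    "TRUNCATE_TERMINI_FULLRELAX_1enh__433_38_0",
    "HBLR_1.0_2reb_426_994_0",
    "FA_RLXpt_hom003_1ptq__361_156_3",
]
to_abort = [name for name in queued if should_abort(name)]
```

Note that this deliberately matches only the `TRUNCATE_TERMINI_FULLRELAX` batch 433 jobs named in the post. Other stuck jobs reported in the thread, such as the `FA_RLXpt` ones, are not covered by the fix described there and are not matched.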

charmed Send message Joined: 2 Nov 05 Posts: 11 Credit: 1,780,440 RAC: 0	Message 13592 - Posted: 12 Apr 2006, 23:08:12 UTC Last modified: 12 Apr 2006, 23:10:23 UTC About to abort WU FA_RLXpt_hom004_1ptq__361_308_2. It's stuck at 50.242% (stage: full atom relax, model 9, step 205901) and is at 7 hours 43 minutes and counting. Here it is: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11760498 Using Win XP Home Edition Service Pack 2, client ID 62881. ID: 13592 · Rating: 0 · rate: / Reply Quote

Tallguy-13088 Send message Joined: 14 Dec 05 Posts: 9 Credit: 843,378 RAC: 0	Message 13602 - Posted: 12 Apr 2006, 23:50:57 UTC - in response to Message 13591. Last modified: 12 Apr 2006, 23:51:36 UTC CONGRATS! Just remember, given a choice between "lucky" and "good", ALWAYS choose LUCKY ... with enough luck, you may eventually get good <grin>! ID: 13602 · Rating: 0 · rate: / Reply Quote

cwangersky Send message Joined: 6 Nov 05 Posts: 6 Credit: 325,556 RAC: 0	Message 13605 - Posted: 13 Apr 2006, 0:39:58 UTC Here's an odd one... Rosetta 4.98, WU 7449_largescale_large_fullatom_relax_dec7449_1_08_2.pdb_431_25_0 running with BOINC 5.2.13 on Windows XP 64-bit SP1 on an Athlon 64 3200+ with 512MB RAM. I also have SETI@home on that machine. Starts up, 50% done, 2 hours CPU time used, runs for about an hour, at the end of that time it's still about 50% done, but has 3 hours CPU time; swaps out... SETI runs for an hour and swaps out... and then Rosetta swaps in again, 50% done, 2 hours (!) CPU time used. Caught this one because the accepted protein shape is pretty uncommon (looks sort of like a lollipop). Shall I kill it or do you want me to keep watching it for a while? It's been on here for three days now, which means ballpark 36 hours, but I think I have only 2 hours credit for it... ID: 13605 · Rating: 0 · rate: / Reply Quote

Robert Everly Send message Joined: 8 Oct 05 Posts: 27 Credit: 665,094 RAC: 0	Message 13606 - Posted: 13 Apr 2006, 0:46:06 UTC Very good news. Keep up the good work everyone! ID: 13606 · Rating: 0 · rate: / Reply Quote

Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0	Message 13607 - Posted: 13 Apr 2006, 1:07:19 UTC - in response to Message 13605. Last modified: 13 Apr 2006, 1:08:59 UTC cwangersky, these are very big WUs which take a loooong time per model. On some P4s they might even take more than 2hr PER MODEL, so unless you have "Leave in mem when pre-empted"=YES, the PC can't complete even 1 model in 2hr before Rosetta gets swapped out to run SETI, and your PC starts the WU from 0 again... Solution: increase "time between swaps" to e.g. 4hr or (if your PC has lots of RAM and/or runs few BOINC projects) set "leave in mem when preempted"=YES. I would choose the latter. This very example is why Rosetta needs a BigWU flag in preferences IMHO... AMD's explained it in a previous comment: "Another problem is that the bug requiring "keep in memory" has been fixed. That means a lot of people are setting "keep in memory" to "no". There are places in some WUs that require more than an hour to get to the next checkpoint, so with the default switching time of one hour the WU will keep dropping back to the last checkpoint indefinitely." ID: 13607 · Rating: 1 · rate: / Reply Quote

Dan Wulff Send message Joined: 17 Sep 05 Posts: 3 Credit: 21,939,262 RAC: 14,403	Message 13610 - Posted: 13 Apr 2006, 1:53:37 UTC Last modified: 13 Apr 2006, 1:57:57 UTC aborted wu After over 9.5 hours this one was still at 1.04% and showing 16 more hours to go. I manually aborted this unit. Result ID 16987331 Name TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_297_0 Workunit 13923431 ID: 13610 · Rating: 0 · rate: / Reply Quote

Kevin Send message Joined: 15 Jan 06 Posts: 21 Credit: 109,496 RAC: 0	Message 13612 - Posted: 13 Apr 2006, 2:13:21 UTC Glad to see the Truncate_Termini units were cancelled. I just noticed one of my machines was working on one of those units for a lil more than 29 hours. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13918811 ID: 13612 · Rating: 0 · rate: / Reply Quote

Report stuck & aborted WU here please - II