Report stuck & aborted WU here please

Author	Message
Cureseekers~Kristof Joined: 5 Nov 05 Posts: 80 Credit: 689,603 RAC: 0	Message 14184 - Posted: 20 Apr 2006, 17:14:20 UTC Last modified: 20 Apr 2006, 17:15:29 UTC After more than 30 hours of runtime, stuck for hours at the same percentage, I aborted the job: https://boinc.bakerlab.org/rosetta/result.php?resultid=17454155 260 credits lost... Member of Dutch Power Cows

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 14186 - Posted: 20 Apr 2006, 17:41:56 UTC - in response to Message 14131. Are there any other conditions under which you think we should abort? Looking forward to some more advice. I think we're on the way to finally bringing an end to these stuck jobs. What about when the deadline is approaching? Sometimes people crank their preference straight from 4 hrs to 24 hrs, and all of a sudden 10 WUs cannot be completed before the deadline. So, if the deadline is "near" (? how near?) then just finish the current model and end this WU so it can be reported in time. I don't know if you would call that an "abort". It's more of a normal end, in advance of the target runtime. Great progress! Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/
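The "finish the current model and end early" idea above could be sketched roughly as follows. This is only an illustration, not Rosetta's actual logic: the function name, the average model time, and the 6-hour safety margin are all assumptions made up for the example (the post itself leaves "how near?" open).

```python
from datetime import datetime, timedelta


def should_start_next_model(now, deadline, avg_model_time,
                            safety_margin=timedelta(hours=6)):
    """Return True if one more model is expected to finish before the
    report deadline with the safety margin to spare.

    All parameters are hypothetical; the margin stands in for the
    undefined "near" in the post.
    """
    return now + avg_model_time + safety_margin <= deadline


if __name__ == "__main__":
    now = datetime(2006, 4, 20, 12, 0)
    # A full day left: starting another 2-hour model is fine.
    print(should_start_next_model(now, datetime(2006, 4, 21, 12, 0),
                                  timedelta(hours=2)))
    # Only 6 hours left: better to stop and report what is done.
    print(should_start_next_model(now, datetime(2006, 4, 20, 18, 0),
                                  timedelta(hours=2)))
```

When the function returns False, the WU would end normally after the current model instead of running on to the target runtime.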

Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0	Message 14188 - Posted: 20 Apr 2006, 18:03:56 UTC - in response to Message 14172. Last modified: 20 Apr 2006, 18:13:58 UTC
AGAIN my point here is the Rosetta system needs to have in place a method to remove bad work on servers and clients. To think this will not happen again is unwise.
Lauren, I just read in the other threads, posts by Rhiju, Bin Qian and David Kim that:
1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04% and have been testing the fixes on ralph since yesterday.
2. We've coded up rosetta to do more frequent checkpointing in the modeling process. Now for the large jobs, we are expecting less than 30 minutes between two checkpoints. This code has been tested locally, and will be tested on ralph within a couple of days.
3. Rhiju has coded a watchdog thread for rosetta which will terminate the stuck jobs and return the intermediate results. See his post in this thread. This will be tested on ralph within a couple of days too.
4. One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap projects without leaving work in memory.
If I understand the new features quoted above correctly, #3 and #4 should ensure that no WUs get stuck forever, until a human operator aborts them via BOINC. Hopefully, the BOINC server code (developed by Berkeley Univ as open source) will eventually implement the other features, most notably the capability / preference flags (e.g. >512M mem, BigWU etc), so that big WUs are sent only to PCs which are capable / willing to process them.
PS: So, in light of recent changes, I'd say that a method to cancel bad WUs from volunteer PCs is much less of an issue, as bad WUs will take care of themselves. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity
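To make items 3 and 4 a little more concrete, here is a minimal sketch of the two checks as I understand them. It is not the Rosetta code: the class and function names and the 30-minute stall limit are assumptions for illustration (only the limit of 5 restarts comes from the quoted post), and in practice the stall check would be called periodically from a separate watchdog thread.

```python
class StallDetector:
    """Tracks the last time the progress percentage actually advanced."""

    def __init__(self, max_stall_seconds):
        self.max_stall_seconds = max_stall_seconds  # hypothetical limit
        self.last_progress = None
        self.last_change = None

    def report(self, progress, now):
        # Only an increase in progress resets the stall timer.
        if self.last_progress is None or progress > self.last_progress:
            self.last_progress = progress
            self.last_change = now

    def is_stuck(self, now):
        if self.last_change is None:
            return False
        return now - self.last_change > self.max_stall_seconds


MAX_RESTARTS = 5  # the restart limit mentioned in the quoted post


def should_give_up(restart_count, max_restarts=MAX_RESTARTS):
    """True once the WU has restarted the allowed number of times."""
    return restart_count >= max_restarts


if __name__ == "__main__":
    detector = StallDetector(max_stall_seconds=30 * 60)
    detector.report(1.04, now=0)
    detector.report(1.04, now=1000)      # no advance, timer keeps running
    print(detector.is_stuck(now=1500))   # still within the limit
    print(detector.is_stuck(now=2000))   # stalled for more than 30 minutes
    print(should_give_up(5))
```

In either case the job would stop and return whatever intermediate results it has, rather than waiting for a human to abort it.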

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 14190 - Posted: 20 Apr 2006, 18:29:47 UTC - in response to Message 14156. That WU was a nasty one. The first person who got it wasted 57 hours of CPU on it before noticing and aborting it, then the second person wasted another 14 hours. The third person let it sit in the queue until it went past deadline. Then you got it. At least now it's had too many errors and won't be sent out any more. Perhaps it is a good idea to turn off the mechanism of resending failed WUs at the moment (i.e. by setting the number of failures until cancellation of that WU to 1). With such a setting, bad WUs would only bother one participant.

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 14192 - Posted: 20 Apr 2006, 18:55:07 UTC It'd still be nice to have WUs that are only causing problems with say.. the Windows clients, only be sent out to the Linux/Darwin users, instead of requiring 3 failures on Windows clients before they're shelved. And if Boinc/Rosetta sends any data back and forth during the network connections that could be piggybacked.. (when we're looking to see if we need new work, returning a finished WU, or checking to see if there's a Rosetta update), so that the file would be left on the machine until the next connection to the server - it would be nice to have a list of problem WUs sent out that should be nuked.. Although, between the mentioned changes, and hopefully better pre-testing on Ralph, we can hope that the problem would not crop up any more.. :)

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0	Message 14193 - Posted: 20 Apr 2006, 19:38:01 UTC - in response to Message 14192. Or even a user-selected option for the client to report back to the servers every 3 to 6 hrs. That could give them a lot of alpha info to see what works better and how any upgrades are working. If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0	Message 14207 - Posted: 21 Apr 2006, 0:21:01 UTC - in response to Message 14188. I haven't yet got the watchdog thread into Rosetta 5.01, but we have very high hopes for it! It was a great idea from this message board. It should go into the next update, probably early next week, if the Windows build cooperates. (We're trying not to do updates during the weekend -- we seem to have had bad luck in the past!) I'm paying attention to the ideas about reverse trickle, keeping contact between client and server, etc. -- these are nice suggestions. As I explained below, those will likely require some changes in the BOINC code, and we'll need help from the BOINC crew. They've been pretty occupied with their upcoming release. I like the idea below of not passing on bad jobs to another client when they fail -- so only 1 computer will have the problem, not 4. I'm running this idea by David Baker and David Kim now. Unlike other BOINC projects, it's not critical for every single workunit to get processed. It's way more important to keep bad workunits from causing trouble! One final note: we just went through and granted credits to errored jobs in our database. I'm trying to code the watchdog so that it will gracefully abort, including the valid output of data, so that the job will automatically get credit (but will be tagged for us as a premature abort).

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0	Message 14224 - Posted: 21 Apr 2006, 3:58:30 UTC Thank you Rhiju, for listening to our needs and taking steps to fix or improve a very frustrating problem. If any of my words were at all harsh, please forgive me. It was not my intent. I just want to get my point across, and words do not come easily to me. I have checked all my nodes and not one is on 1.4, so I guess the bad ones are at an end. Again, thank you. If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0	Message 14227 - Posted: 21 Apr 2006, 5:45:07 UTC - in response to Message 14224. Your comments have been really helpful -- please continue to make suggestions. Hopefully by next week we can ensure that these stupid stuck-at-1.04% jobs never show up again on your computers. Thanks for hanging in there! Thank you Rhiju, for listening to our needs and taking steps to fix or improve a very frustrating problem. If any of my words were at all harsh, please forgive me. It was not my intent. I just want to get my point across, and words do not come easily to me. I have checked all my nodes and not one is on 1.4, so I guess the bad ones are at an end. Again, thank you.

Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 14238 - Posted: 21 Apr 2006, 7:24:45 UTC Last modified: 21 Apr 2006, 7:29:49 UTC Rhiju, ( and other development team members) I have opened a "sticky" here for you and the development team to post Rosetta application release information as new versions are deployed. It might help people find the information and they can subscribe to the thread so they can be notified when you post something there. Could you post the details on Version 5.01 to kick this off? I know a lot of people would like to see this done regularly. Moderator9 ROSETTA@home FAQ Moderator Contact

Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0	Message 14253 - Posted: 21 Apr 2006, 10:24:04 UTC Last modified: 21 Apr 2006, 10:55:05 UTC Another HUGE amount of CPU time wasted!!!! https://boinc.bakerlab.org/rosetta/result.php?resultid=17734977 CPU time 42670.640625 Claimed credit 145.838794071523 I had to abort this one as it was caught in a loop. Action done around 6AM AST. stderr out <core_client_version>5.2.13</core_client_version> <message>aborted via GUI RPC </message> <stderr_txt> # cpu_run_time_pref: 21600 # random seed: 1509912 # cpu_run_time_pref: 21600 # Exception caught in nstruct loop ii=1 i=7 # num_decoys:6 attempts:7 cpu_run_time:30500.1 # Exception caught in nstruct loop ii=1 i=7 # num_decoys:6 attempts:8 cpu_run_time:33366.1 # Exception caught in nstruct loop ii=1 i=7 # num_decoys:6 attempts:9 cpu_run_time:34263 </stderr_txt> What irks me is that I was the second computer to receive this WU. I just hope that the third one that receives it is wise enough and aborts it before a lot of his CPU time is wasted. So don't gang up on me when I say ARGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!!!!!! PS Ah, at least the new version doesn't wait too long to error out. On that one I will report in the 5.01 thread :( "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato
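For anyone who wants to spot this pattern without reading the raw output, the stderr text above can be summarized with a few lines of Python. This is just a sketch: the field names are taken from the output as posted, but the function, the regular expressions, and the idea of flagging "same decoy count across attempts" are my own assumptions, not anything the project provides.

```python
import re

# stderr text as it appears in the post above (flattened onto one line).
STDERR = (
    "# cpu_run_time_pref: 21600 # random seed: 1509912 "
    "# cpu_run_time_pref: 21600 "
    "# Exception caught in nstruct loop ii=1 i=7 "
    "# num_decoys:6 attempts:7 cpu_run_time:30500.1 "
    "# Exception caught in nstruct loop ii=1 i=7 "
    "# num_decoys:6 attempts:8 cpu_run_time:33366.1 "
    "# Exception caught in nstruct loop ii=1 i=7 "
    "# num_decoys:6 attempts:9 cpu_run_time:34263"
)

PREF = re.compile(r"cpu_run_time_pref:\s*(\d+)")
STATUS = re.compile(
    r"num_decoys:(\d+)\s+attempts:(\d+)\s+cpu_run_time:([\d.]+)")


def summarize(stderr_text):
    """Summarize a stderr dump; assumes at least one status entry."""
    pref = int(PREF.search(stderr_text).group(1))
    statuses = [(int(d), int(a), float(t))
                for d, a, t in STATUS.findall(stderr_text)]
    decoy_counts = {d for d, _, _ in statuses}
    last_cpu_time = statuses[-1][2]
    return {
        "exceptions": stderr_text.count("Exception caught in nstruct loop"),
        # Attempts keep rising while the decoy count stays the same.
        "decoys_stalled": len(statuses) > 1 and len(decoy_counts) == 1,
        # Seconds of CPU time beyond the runtime preference.
        "overrun_seconds": last_cpu_time - pref,
    }


if __name__ == "__main__":
    print(summarize(STDERR))
```

Here the runtime preference is 21600 seconds (6 hours), yet the last status line already reports 34263 seconds with the decoy count unchanged at 6.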

Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0	Message 14267 - Posted: 21 Apr 2006, 14:59:49 UTC - in response to Message 14253. Another HUGE amount of CPU time wasted!!!! ... Jose, your time is not wasted. Look at this post. According to that statement, the results are used and you will be granted credit. So perhaps not so much ARGH but more like AHHH! Regards Phil

Steven Purvis Joined: 17 Sep 05 Posts: 1 Credit: 7,157,371 RAC: 4,428	Message 14280 - Posted: 21 Apr 2006, 17:21:33 UTC I've just aborted about 6 work units for rosetta 4.98 with names starting 7486_largescale_large_full_atom_relax_XXXXXXXXXXXX They all seemed to be stuck, getting to about 1.4% but no higher. I have the "don't remove workunits from memory" option enabled, so that shouldn't cause a problem. The work unit results were: 17191225 17191227 17191336 17191339 17191352 17191374 Hope this is useful in some way.

[DPC]FOKschaap~_mcintosh_ Joined: 4 Dec 05 Posts: 5 Credit: 118,303 RAC: 0	Message 14318 - Posted: 21 Apr 2006, 23:10:15 UTC PROD_ABINITIO_FAST_1tul__447_32515 That one got aborted by BOINC. Claimed credit 251, hope to see that one day ;)

[DPC]Division_Brabant~OldButNotSoWise Joined: 23 Jan 06 Posts: 42 Credit: 371,797 RAC: 0	Message 14372 - Posted: 22 Apr 2006, 13:41:26 UTC Last modified: 22 Apr 2006, 13:42:01 UTC What should I do with this one? 1.6% after 17:30:00 hours of crunching, but still very active with the graphics. If it's not an error or a stuck WU, I don't mind that it takes its time :) http://members.lycos.nl/oldbutnotsowise/fora/rosetta_wu.png

Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 14384 - Posted: 22 Apr 2006, 15:34:58 UTC - in response to Message 14372. What should I do with this one? 1.6% after 17:30:00 hours of crunching, but still very active with the graphics. If it's not an error or a stuck WU, I don't mind that it takes its time :) http://members.lycos.nl/oldbutnotsowise/fora/rosetta_wu.png It looks like you may have a problem WU. I looked at your system but I cannot tell which WU you are running from the list. There was a batch that were identified for aborting here. If it is one of those I would abort it. I see it is at 1.6%. In the display the percent should be shown with 4 decimal places (1.xxxx %). Before you abort it, make a note of the full value of the percent and include that in your report, and provide a link to the result on your stats page. (Nice Belgian Sheepdog by the way) Moderator9 ROSETTA@home FAQ Moderator Contact

Rebel Alliance Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0	Message 14391 - Posted: 22 Apr 2006, 16:54:21 UTC https://boinc.bakerlab.org/rosetta/result.php?resultid=17824571 Aborted after 12 hours https://boinc.bakerlab.org/rosetta/result.php?resultid=17825321 7 hours for this one

Runaway1956 Joined: 5 Nov 05 Posts: 19 Credit: 535,400 RAC: 0	Message 14393 - Posted: 22 Apr 2006, 17:06:11 UTC 4/22/2006 11:59:27 AM|rosetta@home|Pausing result TRUNCATE_TERMINI_FULLRELAX_1enh__433_178_0 (left in memory) After this post, I'm going to abort this one. It seems to have run for two days before I caught it, and restarted BOINC to see what would happen. It just hung at 1.something percent, and the remaining time climbed past 30 hours. I SHOULD have copied the messages concerning this WU before resetting BOINC - all were gone when it restarted - sorry about that.

Grutte Pier [Wa Oars]~Ytsmabeer Joined: 10 Nov 05 Posts: 2 Credit: 100,205 RAC: 0	Message 14403 - Posted: 22 Apr 2006, 18:08:20 UTC Reporting a WU which I aborted because it had been running for 17 hours, and after reading about the HBLR type: HBLR_1.0_1ogw_420_8424 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13422021 It had been running for 17 hours and was 14% complete.

Rebel Alliance Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0	Message 14455 - Posted: 23 Apr 2006, 6:02:38 UTC Just aborted 4 work units from 4 different machines. The longest had been running close to 10 hours and was at 5%, the shortest 6 hours and at 1%.
#1 from 2700xp
Result ID 17772227
Name HBLR_1.0_1mky_420_9630_1
Workunit 13428053
Created 20 Apr 2006 21:42:41 UTC
Sent 21 Apr 2006 4:22:49 UTC
Received 23 Apr 2006 5:53:20 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 148992
Report deadline 5 May 2006 4:22:49 UTC
CPU time 32013.537868
#2 from 1800 xp
Result ID 17805638
Name NO_TERM_STRAND_1ogw_423_6947_2
Workunit 13496532
Created 21 Apr 2006 5:49:41 UTC
Sent 21 Apr 2006 8:05:02 UTC
Received 23 Apr 2006 5:52:38 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 105489
Report deadline 5 May 2006 8:05:02 UTC
CPU time 24477.506926
#3 from 2000 xp
Result ID 17748958
Name FACONTACTS_RECENTER_NOFILTERS_1ig5A_448_551_1
Workunit 14550587
Created 20 Apr 2006 16:34:25 UTC
Sent 20 Apr 2006 22:38:14 UTC
Received 23 Apr 2006 5:51:22 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 106748
Report deadline 4 May 2006 22:38:14 UTC
CPU time 25011.984375
#4 from 2500 Xp
Result ID 17786001
Name HBLR_1.0_1n0u_ROT_TRIALS_TRIE_449_5_0
Workunit 14630032
Created 21 Apr 2006 1:00:11 UTC
Sent 21 Apr 2006 3:09:30 UTC
Received 23 Apr 2006 5:50:36 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 107679
Report deadline 5 May 2006 3:09:30 UTC
CPU time 22721.8125
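In case the two forms of the exit status look confusing: -197 and 0xffffff3b are the same 32-bit value, once read as signed and once as unsigned. A small sketch of the conversion, plus turning the CPU time (which appears to be in seconds) into hours; the helper names are made up for the example.

```python
def to_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit integer as signed (two's complement)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def seconds_to_hours(seconds):
    return seconds / 3600.0


if __name__ == "__main__":
    print(format(to_unsigned32(-197), "#010x"))      # 0xffffff3b
    print(to_signed32(0xFFFFFF3B))                   # -197
    print(round(seconds_to_hours(32013.537868), 1))  # CPU time of result #1
```

So the CPU time of 32013.537868 for the first result works out to a little under 9 hours of CPU.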

Report stuck & aborted WU here please - II