Report stuck work units here

Author	Message
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7869 - Posted: 29 Dec 2005, 9:13:48 UTC - in response to Message 7791.
[quote]I wanted to report two cases of the "Clock Stops error" (as described by River~~ in the "Four kinds of errors" thread). They both happened on Linux a few weeks ago.[/quote]
It comes to my mind that I have occasionally seen this on other projects - but maybe four or five times in ten months. Then I suddenly see a lot of it here - is it a separate bug, or is it that the other problems are putting more stress on the boxes so that a very rare BOINC bug is triggered more often? Especially if it does turn out that the clock stopped problem is down to a dropped message between the client and the app, this is exactly where extra stress on the box is more likely to bring obscure issues to the surface. Just a thought, and it may be a red herring. River~~

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7875 - Posted: 29 Dec 2005, 10:03:23 UTC
I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code from the other short-running jobs. Is this significant?

Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0	Message 7914 - Posted: 29 Dec 2005, 17:45:17 UTC - in response to Message 7875.
[quote]I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code from the other short-running jobs. Is this significant?[/quote]
Don't know, what are the error codes?

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7922 - Posted: 29 Dec 2005, 19:43:42 UTC - in response to Message 7914.
[quote][quote]I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code from the other short-running jobs. Is this significant?[/quote]
Don't know, what are the error codes?[/quote]
The _topology_ jobs are giving [large negative number] = 0xc0000005
As far as I remember all (most?) of the other wu have been giving small positive numbers like 11, 131, etc, which is what made me notice the difference. Sorry I can't be more detailed, but I have not had a chance to note much down. If I am right the exact values will be in the db, and if my memory is wrong the db will say so too. R~~
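Editorial note on the error code above: the "large negative number" and 0xc0000005 are the same 32-bit value read as signed and unsigned respectively. The sketch below shows the conversion; the function names are made up for illustration. On Windows, 0xC0000005 is the status code for an access violation, though the thread does not confirm that this is what the _topology_ jobs hit.

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a signed (two's complement) integer."""
    value &= 0xFFFFFFFF                      # keep only the low 32 bits
    if value & 0x80000000:                   # top bit set means negative
        return value - 0x100000000
    return value


def to_unsigned32(value: int) -> int:
    """Interpret a (possibly negative) integer as an unsigned 32-bit pattern."""
    return value & 0xFFFFFFFF


print(to_signed32(0xC0000005))               # -1073741819
print(hex(to_unsigned32(-1073741819)))       # 0xc0000005
print(to_signed32(11))                       # small codes are unchanged: 11
```

So a client that reports -1073741819 and one that reports 0xc0000005 are describing the same exit status.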

doc :) Joined: 4 Oct 05 Posts: 47 Credit: 1,106,102 RAC: 0	Message 7960 - Posted: 30 Dec 2005, 5:51:24 UTC Last modified: 30 Dec 2005, 5:52:48 UTC
This WU was stuck for more than 3 hours at 1% on step 2733 of the ab initio phase; exiting and restarting BOINC got it to 10% in less than 10 minutes. I have a backup of the stdout.txt from before the restart if needed.
pic: [img=http://img522.imageshack.us/img522/670/stuck2ch.th.jpg]
edit: can't figure out how to post the thumbnail correctly, bbcode is not my friend :)

anders n Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0	Message 7974 - Posted: 30 Dec 2005, 8:56:43 UTC
I have a work unit that has been at 1% for 15 h 50 min. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4133381
What do you want me to do? Anders n

John Hetherington Joined: 7 Oct 05 Posts: 1 Credit: 28,875 RAC: 0	Message 7992 - Posted: 30 Dec 2005, 15:59:45 UTC
Not sure if this is the right place to ask - but I've had a whole batch of WUs fail (message in BOINC: "computation error") with a Windows error message mentioning "fortran error" - only since upgrading to BOINC version 5.2.13. Before that Rosetta was fine. John

Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,700,857 RAC: 9	Message 7996 - Posted: 30 Dec 2005, 17:08:35 UTC - in response to Message 7992.
[quote]Not sure if this is the right place to ask - but I've had a whole batch of WUs fail (message in BOINC: "computation error") with a Windows error message mentioning "fortran error" - only since upgrading to BOINC version 5.2.13. Before that Rosetta was fine.[/quote]
Since they aren't stuck, no, this isn't the place to ask... but if I might guess, you're also running Predictor, which has the "Fortran errors"; all the errors on Rosetta I see in your results look like the "bad batch" or Rosetta being removed from memory, possibly when Predictor failed. If you're not running Predictor, please open a new thread and we'll see what we can do to figure this out.

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7999 - Posted: 30 Dec 2005, 19:04:25 UTC - in response to Message 7974.
[quote]I have a work unit that has been at 1% for 15 h 50 min. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4133381
What do you want me to do? Anders n[/quote]
Hi Anders, the project staff are still on leave. I can't tell you what the project wants, only try to make my best guess at it.
It depends on the speed of your box. What is the longest time you have seen a Rosetta wu take on that box and still succeed in the end? Halve that time. If a wu is stuck without the progress changing for that amount of time, I call it time to abort/suspend and move on.
Eventually you will need to abort it if credit matters to you - but please don't abort yet. Go to the work tab (not the project tab), highlight the stuck wu, and click the suspend button (which then changes to become a resume button). BOINC should download more work (if need be) and start running it.
Later on, when the project team say it is time to abort the wu, you will highlight it again, first click abort, then click resume (the last to allow the wu to be reported for credit).
Hope that helps - it is what I would do. R~~
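Editorial note: River~~'s rule of thumb (take the longest successful run time seen on the box, halve it, and suspend once progress has been frozen for that long) can be written down as a tiny check. This is only a sketch of that personal heuristic, not project policy; the function names and the 8-hour figure in the example are invented.

```python
def stuck_threshold_hours(longest_successful_hours: float) -> float:
    """Half of the longest run time that still ended in success on this box."""
    return longest_successful_hours / 2.0


def should_suspend(hours_without_progress: float,
                   longest_successful_hours: float) -> bool:
    """True once progress has been frozen for at least the threshold."""
    return hours_without_progress >= stuck_threshold_hours(longest_successful_hours)


# Hypothetical box whose slowest successful work unit took 8 hours.
print(stuck_threshold_hours(8.0))            # 4.0
print(should_suspend(15 + 50 / 60, 8.0))     # 15 h 50 min at 1% -> True
print(should_suspend(1.5, 8.0))              # False, still within normal range
```

The advice in the post is to suspend first and abort only later, so the check deliberately answers "suspend?" rather than "abort?".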

anders n Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0	Message 8005 - Posted: 30 Dec 2005, 20:05:44 UTC - in response to Message 7999.
[quote]Halve that time. If a wu is stuck without the progress changing for that amount of time, I call it time to abort/suspend and move on.
Eventually you will need to abort it if credit matters to you - but please don't abort yet. Go to the work tab (not the project tab), highlight the stuck wu, and click the suspend button (which then changes to become a resume button). BOINC should download more work (if need be) and start running it.
Later on, when the project team say it is time to abort the wu, you will highlight it again, first click abort, then click resume (the last to allow the wu to be reported for credit).
Hope that helps - it is what I would do. R~~[/quote]
Thanks for the reply. I did just that after some more time. I have a snapshot of the graphics too, just in case. Anders n

cwangersky Joined: 6 Nov 05 Posts: 6 Credit: 325,556 RAC: 0	Message 8151 - Posted: 2 Jan 2006, 1:52:40 UTC
On my Windows boxes, and on my Linux boxes, I have seen a number of "clock stopped" WU. Is there a separate thread for these? Or should I report them here? Usually they stop at even multiples of 20%, and are readily identifiable by the fact that CPU usage drops to 0 when they are "running", but I had one just today that stopped (with 0 CPU and no clock update) at 99.95% (NO_BARCODE_FRAGS_1di2_227_3864_0).

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 8166 - Posted: 2 Jan 2006, 9:09:24 UTC - in response to Message 8151.
[quote]On my Windows boxes, and on my Linux boxes, I have seen a number of "clock stopped" WU. Is there a separate thread for these? Or should I report them here? Usually they stop at even multiples of 20%, and are readily identifiable by the fact that CPU usage drops to 0 when they are "running", but I had one just today that stopped (with 0 CPU and no clock update) at 99.95% (NO_BARCODE_FRAGS_1di2_227_3864_0).[/quote]
Yes, report them here please. There seem to be three cases:
[list]
[*] clock never started, progress grinds to a halt later
[*] clock runs ok but progress grinds to a halt with clock still increasing
[*] clock and progress start ok but both stop later on
[/list]
so please make clear which it is, and mention the operating system, especially in the first case. It is believed that the first case, where cpu stays at zero throughout, is confined to win95/98/ME - so reports of it happening on Win2k/XP or Linux would be particularly relevant. Thanks, R~~
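Editorial note: the three cases listed above can be told apart from a few periodic readings of CPU time and progress. The sketch below is one possible way to do that, assuming you jot down (cpu_seconds, progress_fraction) pairs by hand from the BOINC manager; the names and the sampling scheme are invented for illustration and are not part of BOINC.

```python
from enum import Enum
from typing import List, Optional, Tuple


class StuckKind(Enum):
    CLOCK_NEVER_STARTED = 1         # CPU time stayed at zero throughout
    CLOCK_RUNS_PROGRESS_STALLS = 2  # CPU time still rising, progress frozen
    CLOCK_AND_PROGRESS_STOP = 3     # both started fine, both frozen now


def classify(samples: List[Tuple[float, float]],
             window: int = 2) -> Optional[StuckKind]:
    """Classify a work unit from (cpu_seconds, progress) samples, oldest first.

    Compares the newest sample with the one `window - 1` readings earlier.
    Returns None when progress is still advancing.
    """
    if window < 2 or len(samples) < window:
        raise ValueError("need at least `window` samples, window >= 2")
    cpu_then, progress_then = samples[-window]
    cpu_now, progress_now = samples[-1]
    if progress_now > progress_then:
        return None                              # still moving, not stuck
    if all(cpu == 0 for cpu, _ in samples):
        return StuckKind.CLOCK_NEVER_STARTED
    if cpu_now > cpu_then:
        return StuckKind.CLOCK_RUNS_PROGRESS_STALLS
    return StuckKind.CLOCK_AND_PROGRESS_STOP


print(classify([(0, 0.00), (0, 0.01), (0, 0.01)]))
print(classify([(3600, 0.01), (7200, 0.01)]))
print(classify([(5000, 0.20), (5000, 0.20)]))
print(classify([(100, 0.10), (200, 0.20)]))
```

A report would then state the case and the operating system together, which is what the post asks for.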

Plum Ugly Joined: 3 Nov 05 Posts: 24 Credit: 2,005,763 RAC: 0	Message 8285 - Posted: 3 Jan 2006, 18:46:24 UTC - in response to Message 8166. Last modified: 3 Jan 2006, 19:18:09 UTC
I have one at 1% after 12-14 hrs of running. I suspended it. What is it that you need to see on it? Defaut_1b72_220_3516_0 ran 12 hrs 33 min 06 before I suspended it at 1%. It says 16 hrs and 59 minutes to completion. Pentium 4/W2k system.

Plum Ugly Joined: 3 Nov 05 Posts: 24 Credit: 2,005,763 RAC: 0	Message 8297 - Posted: 3 Jan 2006, 20:45:51 UTC Last modified: 3 Jan 2006, 20:46:42 UTC
I also have one, NEW_SOFT_CENTROID_PACKING_2reb_225_2962_0, stuck at 1% after 8 hrs 32 min of run time. I have suspended it also. W2k on a P4 2.4.

Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,700,857 RAC: 9	Message 8300 - Posted: 3 Jan 2006, 21:15:09 UTC
I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them.

hyperfusion Joined: 2 Oct 05 Posts: 1 Credit: 120 RAC: 0	Message 8309 - Posted: 3 Jan 2006, 22:42:22 UTC
I also have a workunit (INCREASE_CYCLES_10_2tif_226_3820) stuck for a few hours at 40%. I know it is not running, since a simple ps aux | grep ^boinc reveals that all the rosetta* processes are sleeping. I tried renicing the processes to 0 (from 19), but that didn't do anything.

Here's my stderr.txt:
* glibc detected * corrupted double-linked list: 0x08d728f8 *
[0x87074eb] [0x871f4bc] [0xffffe420] [0x8785674] [0x879a4c6] [0x879ee8d] [0x879f463] [0x879f8bf] [0x8770365] [0x87700d1] [0x80572dd] [0x81087e8] [0x810814a] [0x8785b7f] [0x871738f] [0x872049d] [0x87b1cba]
* glibc detected * corrupted double-linked list: 0x093a0ab0 *
[0x87074eb] [0x871f4bc] [0xffffe420] [0x8785674] [0x879a4c6] [0x879eeda] [0x879fb6c] [0x87a143d] [0x8770107] [0x86002f1] [0x85ef650] [0x85f843c] [0x860d529] [0x860df34] [0x840c3fe] [0x86a48c9] [0x85b1bbf] [0x85b36ec] [0x85b45f4] [0x83ca2af] [0x83cc2cf] [0x877e534] [0x8048121]

The first 10 lines of stdout.txt:
[2006-01-02 16:25:55] :: BOINC :: boinc_init()
command executed: rosetta_4.80_i686-pc-linux-gnu aa 2tif _ -increase_cycles 10 -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10
[STR OPT]New value for [-paths] frags400.txt.
[T/F OPT]Default FALSE value for [-version]
[T/F OPT]Default FALSE value for [-score]
[T/F OPT]Default FALSE value for [-abinitio]
[T/F OPT]Default FALSE value for [-refine]
[T/F OPT]Default FALSE value for [-assemble]
[T/F OPT]Default FALSE value for [-idealize]
[T/F OPT]Default FALSE value for [-relax]

And stdout.txt's last 10 lines:
smooth trials: 80000 accepts: 2354 %: 2.9425
-----------------------------------------------------
-----------------------------------------------------
CYCLES::number is 1 x total_residue: 59
initializing full atom coordinates
starting score 2173.93115 rms 4.0322547
starting full atom minimization
CYCLES::number is 1 x total_residue: 177
starting score -109.494316 rms 3.95362806
starting full atom simulated anealing

Looking into stdout.txt, it seems (to me, at least) that something is going wrong. Here are lines 1848-1864:
Looking for psipred file: ./2tif_.psipred_ss2
Protein type: alpha/beta Fraction beta: 0.615384638
disabling sheet filter
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: CONSTRAINT FILE NOT FOUND
Searched for: ./2tif_.cst
Running without distance constraints
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND
Searched for: ./2tif_.dpl
Dipolar constraints will not be used
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment file: ./aa2tif_03_05.400_v1_3.gz
Total Residue 59
frag size: 3 frags/residue: 400
fragment file: ./aa2tif_09_05.400_v1_3.gz

My computer is a 2.66 GHz Pentium 4 (no hyperthreading) that runs Linux kernel 2.6.12. I hope this helps you guys resolve this issue!
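Editorial note: the ps check described in this post can be automated by reading the STAT column of ps aux output, where a leading S means interruptible sleep and R means running (N marks a niced, low-priority process). The sketch below parses captured text rather than calling ps itself; the sample listing, PIDs and function name are invented for illustration.

```python
import os
from typing import List, Tuple


def sleeping_processes(ps_output: str, user: str = "boinc",
                       name_prefix: str = "rosetta") -> List[Tuple[int, str]]:
    """Return (pid, stat) for the user's matching processes that are sleeping.

    Expects `ps aux` layout: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)        # COMMAND may itself contain spaces
        if len(fields) < 11 or fields[0] != user:
            continue                         # skips the header and other users
        pid, stat, command = fields[1], fields[7], fields[10]
        executable = os.path.basename(command.split()[0])
        if executable.startswith(name_prefix) and stat.startswith("S"):
            found.append((int(pid), stat))
    return found


# Invented sample output, shaped like `ps aux | grep ^boinc` plus the header.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
boinc     4101  0.0  0.3  12000  3000 ?        SN   Jan02   0:01 ./boinc
boinc     4102  0.0  9.8 150000 98000 ?        SN   Jan02 312:40 rosetta_4.80_i686-pc-linux-gnu aa 2tif _
boinc     4103 97.5  9.9 150000 99000 ?        RN   Jan02 100:00 rosetta_4.80_i686-pc-linux-gnu aa 1di2 _
alice     5000  0.0  0.1   4000  1000 pts/0    S+   10:00   0:00 grep ^boinc
"""

print(sleeping_processes(SAMPLE))
```

A sleeping process is not proof of a hang by itself, so this is best read alongside the frozen progress figure and the stderr.txt contents quoted in the post.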

Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,700,857 RAC: 9	Message 8328 - Posted: 4 Jan 2006, 5:51:44 UTC - in response to Message 8300.
[quote]I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them.[/quote]
After making the copy, I restarted BOINC and resumed the "stuck" result; it restarted at 0 CPU time and finished in under 2 hours. I do still have the backup with it in the "stuck at 1%" state.

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 8329 - Posted: 4 Jan 2006, 6:37:18 UTC - in response to Message 8328.
[quote][quote]I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them.[/quote]
After making the copy, I restarted BOINC and resumed the "stuck" result; it restarted at 0 CPU time and finished in under 2 hours. I do still have the backup with it in the "stuck at 1%" state.[/quote]
What are the last 20 lines of stdout.txt? (I wonder if the "sticking" point is always in the simulated annealing, as in hyperfusion's case?)
Do people have an idea of what fraction of work units are getting stuck, and whether any proteins or run conditions are getting stuck more frequently than others?
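Editorial note: for anyone answering the request above, the last 20 lines of a stdout.txt can be pulled out with a few lines of Python. This is a minimal sketch; the helper names are invented, the demo writes its own throwaway file rather than touching a real BOINC slot directory, and the marker string is spelled exactly as it appears in hyperfusion's log ("simulated anealing").

```python
import os
import tempfile
from collections import deque


def tail_lines(path, n=20):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, "r", errors="replace") as handle:
        # A deque with maxlen keeps only the newest n lines while streaming.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def mentions_annealing(lines):
    """True if any line contains the annealing marker as spelled in the log."""
    return any("simulated anealing" in line for line in lines)


# Demo on a throwaway file; a real run would point at the copied stdout.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stdout.txt")
    with open(demo_path, "w") as out:
        for i in range(1, 30):
            out.write(f"line {i}\n")
        out.write("starting full atom simulated anealing\n")
    last = tail_lines(demo_path, 20)

print(len(last))
print(mentions_annealing(last))
```

Checking the tail for the annealing marker is only a rough way to compare reports; the log Tern posted next in the thread stops much earlier, during fragment insertion scoring.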

Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,700,857 RAC: 9	Message 8337 - Posted: 4 Jan 2006, 7:35:56 UTC - in response to Message 8329.
[quote][quote]I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes.[/quote]
What are the last 20 lines of stdout.txt?[/quote]
RSD_WT: 1.2
RSD_WT: 1.0961
[T/F OPT]New TRUE value for [-rand_SS_wt]
[T/F OPT]Default FALSE value for [-random_parallel_antiparallel]
SS_WT: 0.764209032 0.950430512 1.02168155 1.42587173
[T/F OPT]Default FALSE value for [-rand_cst_res_wt]
[T/F OPT]Default FALSE value for [-random_frag]
starting fragment insertions...
[T/F OPT]New TRUE value for [-jitter_frag]
[REAL OPT]Default value for [-jitter_amount] 2
[STR OPT]New value for [-jitter_variation] gauss.
score0 done: (best, low) rms 0 0 18.3866825
---------------------------------------------------------
score1 done: (best, low) rms (best,low) -5.50960827 -22.704874 11.2940054 5.82286787
standard trials: 2000 accepts: 600 %: 30
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -10.644 -10.644 -10.644 35 5.823 5.442 5.823

Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0	Message 8349 - Posted: 4 Jan 2006, 12:53:02 UTC - in response to Message 8337. Last modified: 4 Jan 2006, 12:54:04 UTC
[quote][quote][quote]I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes.[/quote]
What are the last 20 lines of stdout.txt?[/quote]
RSD_WT: 1.2
RSD_WT: 1.0961
[T/F OPT]New TRUE value for [-rand_SS_wt]
[T/F OPT]Default FALSE value for [-random_parallel_antiparallel]
SS_WT: 0.764209032 0.950430512 1.02168155 1.42587173
[T/F OPT]Default FALSE value for [-rand_cst_res_wt]
[T/F OPT]Default FALSE value for [-random_frag]
starting fragment insertions...
[T/F OPT]New TRUE value for [-jitter_frag]
[REAL OPT]Default value for [-jitter_amount] 2
[STR OPT]New value for [-jitter_variation] gauss.
score0 done: (best, low) rms 0 0 18.3866825
---------------------------------------------------------
score1 done: (best, low) rms (best,low) -5.50960827 -22.704874 11.2940054 5.82286787
standard trials: 2000 accepts: 600 %: 30
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -10.644 -10.644 -10.644 35 5.823 5.442 5.823[/quote]
COOL! Now if SETI can just find an alien to interpret this ;) Sorry, couldn't resist the urge.