Message boards : Number crunching : Report stuck work units here
Author | Message |
---|---|
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I wanted to report two cases of the "Clock Stops error" (as described by River~~ in the "Four kinds of errors" thread). They both happened on Linux a few weeks ago. It comes to my mind that I have occasionally seen this on other projects - but maybe four or five times in ten months. Then suddenly see a lot of it here - is it a separate bug, or is it that the other problems are putting more stress on the boxes so that a very rare BOINC bug is triggered more often? Especially if it does turn out that the clock stopped problem is down to a dropped message between the client and the app, this is exactly where extra stress on the box is more likely to bring obscure issues to the surface. Just a thought, and it may be a red herring. River~~ |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code to the other short running jobs. Is this significant? |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code to the other short running jobs. Is this significant? Don't know, what are the error codes? |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code to the other short running jobs. Is this significant? The _topology_ jobs are giving [large negative number] = 0xc0000005 As far as I remember all (most?) of the other wu have been giving small positive numbers like 11, 131, etc, which is what made me notice the difference. Sorry can't be more detailed but have not had a chance to note much down. If I am right the exact values will be in the db, and if my memory is wrong the db will say so too. R~~ |
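A note on the two representations mentioned above: the same 32-bit exit status can show up either as a large negative decimal or as hex. The sketch below converts between them; the helper names are made up for illustration and are not part of BOINC or Rosetta. On Windows, 0xC0000005 is the status code for an access violation, though whether these particular results came from Windows hosts is an assumption, since the post does not say.

```python
def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def to_hex32(code: int) -> str:
    """Show a (possibly negative) 32-bit exit status as unsigned hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


print(to_signed32(0xC0000005))  # -1073741819
print(to_hex32(-1073741819))    # 0xc0000005
print(to_signed32(11))          # 11 (small positive codes are unchanged)
```

So a result page that lists a large negative number and one that lists 0xc0000005 may well be reporting the same status.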
doc :) Send message Joined: 4 Oct 05 Posts: 47 Credit: 1,106,102 RAC: 0 |
this WU was stuck for more than 3 hours at 1% on step 2733 of the ab initio phase, exiting and restarting boinc got it to 10% in less than 10 minutes. i got a backup of the stdout.txt from before the restart if needed. pic: [img=http://img522.imageshack.us/img522/670/stuck2ch.th.jpg] edit: cant figure out how to post the thumbnail correct, bbcode is not my friend :) |
anders n Send message Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
I have a work unit that has been at 1% for 15H 50 min. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4133381 What do u want me to do? Anders n |
John Hetherington Send message Joined: 7 Oct 05 Posts: 1 Credit: 28,875 RAC: 0 |
Not sure if this is the right place to ask - but I've had all the batch of WUs fail (message in BOINC "computation error") but with a Windows error message mentioning "fortran error" - only since upgrading to BOINC Version 5.2.13. Before that Rosetta was fine. John |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
Not sure if this is the right place to ask - but I've had all the batch of WUs fail (message in BOINC "computation error") but with a Windows error message mentioning "fortran error" - only since upgrading to BOINC Version 5.2.13. Before that Rosetta was fine. Since they aren't stuck, no this isn't the place to ask... but if I might guess, you're also running Predictor, which has the "Fortran errors"; all the errors on Rosetta I see in your results look like the "bad batch" or Rosetta being removed from memory, possibly when Predictor failed. If you're not running Predictor, please open a new thread and we'll see what we can do to figure this out. |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I have a work unit that has been at 1% for 15H 50 min. hi Anders, the project staff are still on leave. I can't tell you what the project want, only try to make my best guess at it. It depends on the speed of your box. What is the longest time you have seen a Rosetta wu take on that box and still succeed in the end? Halve that time. If a wu is stuck without the progress changing for that amount of time, I call it time to abort/suspend and move on. Eventually you will need to abort it if credit matters to you - but please don't abort yet. Go to the work tab (not the project tab), highlight the stuck wu, and click the suspend button (which then changes to become a resume button). BOINC should download more work (if need be) and start running it. Later on, when the project team say it is time to abort the wu, you will highlight it again, first click abort, then click resume (the last to allow the wu to be reported for credit). Hope that helps - it is what I would do R~~ |
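The rule of thumb above (take the longest successful run on that box, halve it, and suspend once progress has been frozen that long) can be written down as a small sketch. The function names and the 8-hour figure are hypothetical, chosen only to illustrate the arithmetic; this is a volunteer's heuristic, not project policy.

```python
HOUR = 3600  # seconds


def stall_limit_seconds(longest_successful_run_s: float) -> float:
    """Half of the longest run that still succeeded on this host."""
    return longest_successful_run_s / 2.0


def should_suspend(seconds_without_progress: float,
                   longest_successful_run_s: float) -> bool:
    """True once progress has been frozen for at least the stall limit."""
    return seconds_without_progress >= stall_limit_seconds(longest_successful_run_s)


# Reported case: 15 h 50 min at 1%.
stuck_for = 15 * HOUR + 50 * 60
# Hypothetical: the slowest successful work unit on that box took 8 hours.
longest_ok = 8 * HOUR

print(should_suspend(stuck_for, longest_ok))  # True
```

The threshold is per host on purpose: a slow box that legitimately needs many hours per work unit gets a proportionally longer grace period.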
anders n Send message Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
Thanks for the reply. I did just that after some more time. Have a snapshot of the graphics too, just in case. Anders n |
cwangersky Send message Joined: 6 Nov 05 Posts: 6 Credit: 325,556 RAC: 0 |
On my Windows boxes, and on my Linux boxes, I have seen a number of "clock stopped" WU. Is there a separate thread for these? Or should I report them here? Usually they stop at even multiples of 20%, and are readily identifiable by the fact that CPU usage drops to 0 when they are "running", but I had one just today that stopped (with 0 CPU and no clock update) at 99.95% (NO_BARCODE_FRAGS_1di2_227_3864_0). |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
On my Windows boxes, and on my Linux boxes, I have seen a number of "clock stopped" WU. Is there a separate thread for these? Or should I report them here? Usually they stop at even multiples of 20%, and are readily identifiable by the fact that CPU usage drops to 0 when they are "running", but I had one just today that stopped (with 0 CPU and no clock update) at 99.95% (NO_BARCODE_FRAGS_1di2_227_3864_0). Yes, report them here please. There seem to be three cases: (1) clock never started, progress grinds to a halt later; (2) clock runs OK but progress grinds to a halt with the clock still increasing; (3) clock and progress start OK but both stop later on. So please make clear which it is, and mention the operating system, especially in the first case. It is believed that the first case, where CPU stays at zero throughout, is confined to Win95/98/ME - so reports of it happening on Win2k/XP or Linux would be particularly relevant. Thanks, R~~ |
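To make the three cases concrete, here is a rough sketch that sorts a series of observations into them. It assumes you have jotted down (CPU seconds, progress percent) pairs from the work tab at intervals; the function name, the labels, and the three-sample window are all illustrative choices, not anything BOINC provides.

```python
def classify_stall(samples, window=3):
    """Classify a work unit from (cpu_seconds, progress_percent) samples.

    samples: oldest first; the last `window` samples decide whether the
    CPU clock and the progress figure are still moving.
    """
    if len(samples) < window:
        raise ValueError("need at least `window` samples")

    recent = samples[-window:]
    cpu_moving = recent[-1][0] > recent[0][0]
    progress_moving = recent[-1][1] > recent[0][1]

    if progress_moving:
        return "not stuck"
    if max(cpu for cpu, _ in samples) == 0:
        return "case 1: clock never started"
    if cpu_moving:
        return "case 2: clock running, progress stalled"
    return "case 3: clock and progress both stopped"


print(classify_stall([(0, 0), (0, 1), (0, 1), (0, 1)]))
print(classify_stall([(600, 1), (1200, 1), (1800, 1)]))
print(classify_stall([(900, 20), (900, 20), (900, 20)]))
```

A report that names the case and the operating system gives the developers most of what the post above asks for.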
Plum Ugly Send message Joined: 3 Nov 05 Posts: 24 Credit: 2,005,763 RAC: 0 |
I have one at 1% with 12-14 hrs running. I suspended it. What is it that you need to see on it? Defaut_1b72_220_3516_0 ran 12 hrs 33 min 06 before I suspended it at 1%. It says 16 hrs and 59 minutes to completion. Pentium 4/W2k system. |
Plum Ugly Send message Joined: 3 Nov 05 Posts: 24 Credit: 2,005,763 RAC: 0 |
Also have one, NEW_SOFT_CENTROID_PACKING_2reb_225_2962_0, stuck at 1% after 8 hrs 32 min of run time. I have suspended it also. W2k on a P4 2.4 |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them. |
hyperfusion Send message Joined: 2 Oct 05 Posts: 1 Credit: 120 RAC: 0 |
I also have a workunit (INCREASE_CYCLES_10_2tif_226_3820) stuck for a few hours at 40%. I know it is not running, since a simple ps aux | grep ^boinc reveals that all the rosetta* processes are sleeping. I tried renicing the processes to 0 (from 19), but that didn't do anything.

Here's my stderr.txt:
*** glibc detected *** corrupted double-linked list: 0x08d728f8 ***
[0x87074eb] [0x871f4bc] [0xffffe420] [0x8785674] [0x879a4c6] [0x879ee8d] [0x879f463] [0x879f8bf] [0x8770365] [0x87700d1] [0x80572dd] [0x81087e8] [0x810814a] [0x8785b7f] [0x871738f] [0x872049d] [0x87b1cba]
*** glibc detected *** corrupted double-linked list: 0x093a0ab0 ***
[0x87074eb] [0x871f4bc] [0xffffe420] [0x8785674] [0x879a4c6] [0x879eeda] [0x879fb6c] [0x87a143d] [0x8770107] [0x86002f1] [0x85ef650] [0x85f843c] [0x860d529] [0x860df34] [0x840c3fe] [0x86a48c9] [0x85b1bbf] [0x85b36ec] [0x85b45f4] [0x83ca2af] [0x83cc2cf] [0x877e534] [0x8048121]

The first 10 lines of stdout.txt:
[2006-01-02 16:25:55] :: BOINC :: boinc_init()
command executed: rosetta_4.80_i686-pc-linux-gnu aa 2tif _ -increase_cycles 10 -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10
[STR OPT]New value for [-paths] frags400.txt.
[T/F OPT]Default FALSE value for [-version]
[T/F OPT]Default FALSE value for [-score]
[T/F OPT]Default FALSE value for [-abinitio]
[T/F OPT]Default FALSE value for [-refine]
[T/F OPT]Default FALSE value for [-assemble]
[T/F OPT]Default FALSE value for [-idealize]
[T/F OPT]Default FALSE value for [-relax]

And stdout.txt's last 10 lines:
smooth trials: 80000 accepts: 2354 %: 2.9425
-----------------------------------------------------
-----------------------------------------------------
CYCLES::number is 1 x total_residue: 59
initializing full atom coordinates
starting score 2173.93115 rms 4.0322547
starting full atom minimization
CYCLES::number is 1 x total_residue: 177
starting score -109.494316 rms 3.95362806
starting full atom simulated anealing

Looking into stdout.txt, it seems something is going wrong (to me, at least). Here are lines 1848-1864:
Looking for psipred file: ./2tif_.psipred_ss2
Protein type: alpha/beta Fraction beta: 0.615384638
disabling sheet filter
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: CONSTRAINT FILE NOT FOUND
Searched for: ./2tif_.cst
Running without distance constraints
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND
Searched for: ./2tif_.dpl
Dipolar constraints will not be used
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment file: ./aa2tif_03_05.400_v1_3.gz
Total Residue 59 frag size: 3 frags/residue: 400
fragment file: ./aa2tif_09_05.400_v1_3.gz

My computer is a 2.66 GHz Pentium 4 (no hyperthreading) that runs Linux kernel 2.6.12. I hope this helps you guys resolve this issue! |
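For Linux users who want to repeat the "are the rosetta processes asleep?" check from the post above, here is a hedged sketch that reads saved `ps aux` output instead of eyeballing it. It relies on the usual `ps aux` column order (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND) and on the client running as a user named `boinc`, as in that post; the sample text is invented for the demo.

```python
def rosetta_process_states(ps_text, user="boinc"):
    """Return (pid, stat, command) for the given user's rosetta processes."""
    rows = []
    for line in ps_text.strip().splitlines():
        fields = line.split(None, 10)  # COMMAND may contain spaces
        if len(fields) < 11:
            continue
        if fields[0] == user and "rosetta" in fields[10]:
            rows.append((int(fields[1]), fields[7], fields[10]))
    return rows


def all_sleeping(rows):
    """True if there is at least one process and every STAT starts with S or D."""
    return bool(rows) and all(stat[0] in "SD" for _, stat, _ in rows)


# Invented sample, shaped like `ps aux` output.
SAMPLE = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 4300 0.1 0.4 9000 4000 ? SN 16:20 0:03 ./boinc
boinc 4321 0.0 5.1 60000 52000 ? SN 16:25 0:00 rosetta_4.80_i686-pc-linux-gnu aa 2tif _
root 1 0.0 0.0 1500 500 ? S Jan01 0:01 init
"""

rows = rosetta_process_states(SAMPLE)
print(rows)
print(all_sleeping(rows))
```

A single snapshot proves little, since a busy process can be caught in state S between time slices. Taking a few snapshots and watching whether the TIME column ever advances is more telling.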
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them. After making the copy, I restarted BOINC and resumed the "stuck" result; it restarted at 0 CPU time and finished in under 2 hours. I do still have the backup with it in the "stuck at 1%" state. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them. What are the last 20 lines of stdout.txt? (I wonder if the "sticking" point is always in the simulated annealing, as in hyperfusion's case?) Do people have an idea of what fraction of work units are getting stuck, and whether any proteins or run conditions are getting stuck more frequently than others? |
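One way to start on the "which proteins or run conditions" question is to tally the names already reported in this thread. The sketch below guesses at the naming pattern (a run-condition prefix, a four-character protein ID, then numeric fields); that pattern is inferred from the names posted here, not from any project documentation, and the group names `batch`, `seq` and `result` are placeholders.

```python
import re
from collections import Counter

# Guessed layout: <run condition>_<protein>_<number>_<number>[_<number>]
NAME_RE = re.compile(
    r"^(?P<run>.+)_(?P<protein>[0-9][a-z0-9]{3})"
    r"_(?P<batch>\d+)_(?P<seq>\d+)(?:_(?P<result>\d+))?$"
)

# Names exactly as posted in this thread.
REPORTED = [
    "NO_BARCODE_FRAGS_1di2_227_3864_0",
    "Defaut_1b72_220_3516_0",
    "NEW_SOFT_CENTROID_PACKING_2reb_225_2962_0",
    "MORE_FRAGS_2reb_222_4379_0",
    "INCREASE_CYCLES_10_2tif_226_3820",
]


def tally(names):
    """Count stuck reports per protein and per run condition."""
    proteins, runs = Counter(), Counter()
    for name in names:
        m = NAME_RE.match(name)
        if m is None:
            continue  # name does not fit the guessed pattern
        proteins[m["protein"]] += 1
        runs[m["run"]] += 1
    return proteins, runs


proteins, runs = tally(REPORTED)
print(proteins.most_common())
print(runs.most_common())
```

Five anecdotal reports say nothing about the fraction of all work units that stick. That needs the total number issued per protein and run condition, which only the project database has.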
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes.
RSD_WT: 1.2
RSD_WT: 1.0961
[T/F OPT]New TRUE value for [-rand_SS_wt]
[T/F OPT]Default FALSE value for [-random_parallel_antiparallel]
SS_WT: 0.764209032 0.950430512 1.02168155 1.42587173
[T/F OPT]Default FALSE value for [-rand_cst_res_wt]
[T/F OPT]Default FALSE value for [-random_frag]
starting fragment insertions...
[T/F OPT]New TRUE value for [-jitter_frag]
[REAL OPT]Default value for [-jitter_amount] 2
[STR OPT]New value for [-jitter_variation] gauss.
score0 done: (best, low) rms 0 0 18.3866825
---------------------------------------------------------
score1 done: (best, low) rms (best,low) -5.50960827 -22.704874 11.2940054 5.82286787
standard trials: 2000 accepts: 600 %: 30
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -10.644 -10.644 -10.644 35 5.823 5.442 5.823 |
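For anyone else asked for the tail of stdout.txt, a minimal sketch for pulling the last N lines without opening the whole file in an editor. The function name is made up, and the location of stdout.txt (normally somewhere under the BOINC data directory) is something you will need to find on your own install.

```python
from collections import deque


def tail_lines(path, n=20):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, "r", errors="replace") as f:
        # A bounded deque keeps only the most recent n lines while reading.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```

Calling `tail_lines("stdout.txt")` from the folder that holds the file gives the 20 lines to paste into a reply. Copying the file first, as done earlier in the thread, avoids reading it while the application is still writing.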
Nothing But Idle Time Send message Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. COOL! Now if SETI can just find an alien to interpret this ;) Sorry, couldn't resist the urge. |
©2024 University of Washington
https://www.bakerlab.org