Help us solve the 1% bug!

Message boards : Number crunching : Help us solve the 1% bug!

OhioDude
Joined: 11 Dec 05
Posts: 8
Credit: 4,056,499
RAC: 0
Message 11853 - Posted: 10 Mar 2006, 12:40:30 UTC

Two more stuck at 1%

HB_BARCODE_30_1scjB_351_1354_0
HB_BARCODE_30_1a19A_351_3105_1
Visit my websites honoring some of America's heroes:
USS Rich DE-695
USS Bunch DE-694 / APD-79
ID: 11853 · Rating: 0
Mike
Joined: 21 Dec 05
Posts: 9
Credit: 35,252
RAC: 0
Message 11858 - Posted: 10 Mar 2006, 14:25:41 UTC

Computer reboot solved my problem. 8 hours stuck at 1%.
ID: 11858 · Rating: 0
Los Alcoholicos~La Muis
Joined: 4 Nov 05
Posts: 34
Credit: 1,041,724
RAC: 0
Message 11881 - Posted: 11 Mar 2006, 8:27:41 UTC
Last modified: 11 Mar 2006, 8:35:43 UTC

Stuck at 1% (model 1 - step 21731) SHORTRELAX_2tif__333_146_1

Unfortunately, it took 21:34 hrs before I noticed it.

It had already been stuck once before I got it. Maybe you could consider not sending stuck WUs out again, because R@h is losing lots of valuable crunching time.
ID: 11881 · Rating: 0
arklms
Joined: 17 Dec 05
Posts: 7
Credit: 177,488
RAC: 0
Message 11884 - Posted: 11 Mar 2006, 14:37:07 UTC

SSFEATURES_BARCODE_ABINITIO_5croA_334_286_0

66 hours, 1%. Can't get it to budge.
ID: 11884 · Rating: 0
OhioDude
Joined: 11 Dec 05
Posts: 8
Credit: 4,056,499
RAC: 0
Message 11885 - Posted: 11 Mar 2006, 14:42:09 UTC

Stuck at 1%

HB_BARCODE_30_1b3aA_351_3464_0
ID: 11885 · Rating: 0
Los Alcoholicos~La Muis
Joined: 4 Nov 05
Posts: 34
Credit: 1,041,724
RAC: 0
Message 11932 - Posted: 12 Mar 2006, 9:51:20 UTC

Stuck at 1% (model 1 - step 21615) HB_BARCODE_30_1bk2_351_1307_0
ID: 11932 · Rating: 0
OhioDude
Joined: 11 Dec 05
Posts: 8
Credit: 4,056,499
RAC: 0
Message 11938 - Posted: 12 Mar 2006, 13:59:24 UTC

Stuck at 1%

HB_BARCODE_30_1a68__351_3815

HB_BARCODE_30_1acf__351_3730
ID: 11938 · Rating: 0
OhioDude
Joined: 11 Dec 05
Posts: 8
Credit: 4,056,499
RAC: 0
Message 11975 - Posted: 13 Mar 2006, 11:52:59 UTC

More stuck at 1%:

HOMSr6_homDB027_1r69__352_1131_0


ID: 11975 · Rating: 0
Mike
Joined: 21 Dec 05
Posts: 9
Credit: 35,252
RAC: 0
Message 11977 - Posted: 13 Mar 2006, 12:22:36 UTC
Last modified: 13 Mar 2006, 12:23:16 UTC

Hi all. I'm beginning to think that this problem is caused by the BOINC screen saver. I have switched to a Windows screen saver and so far, so good. I will keep you informed.
ID: 11977 · Rating: 0
doc :)
Joined: 4 Oct 05
Posts: 47
Credit: 1,106,102
RAC: 0
Message 12003 - Posted: 14 Mar 2006, 6:34:48 UTC

Got my first stuck-at-1% WU on my Athlon XP PC (had 2 long ago on my Duron PC).
It was stuck at step 22813 of model 1 for more than 1 hour and 30 minutes; exiting and restarting BOINC brought it past that step in less than one minute.

WU - result (nothing to see there yet, its still running here while i type this :))
ID: 12003 · Rating: 0
paul and kirsty yates
Joined: 14 Jan 06
Posts: 2
Credit: 179,851
RAC: 0
Message 12099 - Posted: 16 Mar 2006, 17:55:38 UTC

Think I've got one too:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11223579
FA_RLXfn_hom024_1fna__360_141_0
Only 1 hour so far, but there is no movement on the screensaver
and no increase in the percentage.
Shall I abort?

ID: 12099 · Rating: 0
Robert
Joined: 3 Jan 06
Posts: 4
Credit: 23,755
RAC: 0
Message 12103 - Posted: 16 Mar 2006, 20:10:00 UTC

Got one stuck at 1%. Restarting BOINC did not help. After I restarted the PC, the CPU time reset to 00:00, and after 30 min it was still at 1%. Work unit: FA_RLXai_hom020_aiu_359_363_0
Aborted via GUI RPC.
ID: 12103 · Rating: 0
Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 12125 - Posted: 17 Mar 2006, 4:03:24 UTC

This thread is approaching 100 posts - Is the project ANY closer to solving the 1% bug?


Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)
ID: 12125 · Rating: -1
xrobert
Joined: 28 Oct 05
Posts: 3
Credit: 168,865
RAC: 0
Message 12130 - Posted: 17 Mar 2006, 7:07:34 UTC

The WU I have has got stuck on 1%.

FA_RLXai_hom023_1aiu__359_454_0

I've now also aborted another unit which begins with ai_hom, to be safe.


ID: 12130 · Rating: 0
Mike
Joined: 21 Dec 05
Posts: 9
Credit: 35,252
RAC: 0
Message 12135 - Posted: 17 Mar 2006, 8:15:39 UTC

Hi all. Since turning off the BOINC screen saver, I have had no problems with the 1% bug for the last 4 days.
ID: 12135 · Rating: 0
vavega
Joined: 2 Nov 05
Posts: 82
Credit: 519,981
RAC: 0
Message 12137 - Posted: 17 Mar 2006, 8:23:06 UTC

Be vigilant, Mike. I've never run the screensaver, and yesterday I had 2 stuck at 1%. I tried rebooting (which had twice in the past gotten it to crunch past 1%), but alas, this time it failed, so they were aborted. I'm with Angus on this... can we have an update from someone?
ID: 12137 · Rating: 0
Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 12138 - Posted: 17 Mar 2006, 8:24:35 UTC
Last modified: 17 Mar 2006, 8:34:13 UTC

Stuck at 1.00%
https://boinc.bakerlab.org/rosetta/result.php?resultid=13879990
02:37:08 CPU time - Pentium 4 1800 MHz
Rosetta 4.82 Windows

*I suspended this WU (removing it from memory);
it should be restarted automatically later.

If it passes 3 hours of CPU time while still stuck at 1.00%, I will abort this WU.

*My preferred run time is 2 hours :-(
*and I don't use a screen saver
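
The abort rule above can be sketched in a few lines. This is only an illustration: it assumes you note down (CPU seconds, percent done) readings by hand or with a script of your own, and the function names and the 60-second sample are hypothetical, not part of BOINC or Rosetta. It also measures CPU time since the percentage last changed, which is a slight variant of a fixed 3-hour total.

```python
def stalled_cpu_seconds(samples):
    """CPU seconds accumulated since the progress value last changed.

    samples: list of (cpu_seconds, percent_done) pairs, oldest first.
    """
    if not samples:
        return 0.0
    last_cpu, last_pct = samples[-1]
    start = last_cpu
    # Walk backwards while the reported percentage is unchanged.
    for cpu, pct in reversed(samples):
        if pct != last_pct:
            break
        start = cpu
    return last_cpu - start


def should_abort(samples, limit_seconds=3 * 3600):
    """True once the WU has burned more than limit_seconds with no progress."""
    return stalled_cpu_seconds(samples) > limit_seconds


# 02:37:08 of CPU time is 9428 seconds; the reading at 60 s is made up.
readings = [(0, 0.0), (60, 1.0), (9428, 1.0)]
print(should_abort(readings))                   # still under the 3-hour limit
print(should_abort(readings + [(11000, 1.0)]))  # over the limit
```

Counting from the last change in percentage rather than from zero means a WU that legitimately spends a long time on later models is not flagged just for being old.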
ID: 12138 · Rating: 0
Mike
Joined: 21 Dec 05
Posts: 9
Credit: 35,252
RAC: 0
Message 12139 - Posted: 17 Mar 2006, 8:28:09 UTC

Sorry. I should have said I have disabled all screen savers.
ID: 12139 · Rating: 0
Pirxx
Joined: 8 Jan 06
Posts: 2
Credit: 42,312
RAC: 0
Message 12143 - Posted: 17 Mar 2006, 11:19:59 UTC

Hi!

After crunching for quite a while, I have my first WU that has got stuck on 1%:

FA_RLXbg_hom029_1bgf__359_368_0

I noticed it after it wasted about 12 hours of my CPU time :-(

I tried all the tricks proposed in this thread to get it going again (suspending the WU, restarting BOINC, rebooting my computer) with no luck. As it turned out later, I may have been too impatient in those attempts.

I also ran the WU outside of BOINC, as proposed in David Kim's instructions.
It went OK, though it took ~30 minutes to go past 1%, so maybe it would also have run in BOINC after a restart if I had waited that long. I waited only about 20 min because I thought that, with a standard WU crunch time of 2 hrs, 1% couldn't take that long.

I use a P4 3GHz with HT, WinXP Professional, BOINC 5.2.13.
I had 'leave app in memory' set to 'no' because I did not know I should set it to 'yes'. But with a fast computer and the switching time between projects set to 120 min, it shouldn't be such a problem.

I have stdout.txt saved and I can send it, if it could be of any use.

I hope the above notes will help you.

Pirxx
ID: 12143 · Rating: 0
napolj2
Joined: 29 Dec 05
Posts: 1
Credit: 1,021,633
RAC: 0
Message 12144 - Posted: 17 Mar 2006, 11:41:46 UTC
Last modified: 17 Mar 2006, 11:45:03 UTC

Hello. Earlier I had my first 1% bug after processing many work units with little or no problems.

Result
WorkUnit

I noticed the work unit was stuck at 1% after having run for 15 hours! The graphics were completely frozen (no change in the number of steps) at Model 1, Ab initio, step 21924, while R@H was still showing high CPU usage.

From stdout.txt:

command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.82_windows_intelx86.exe xx 1cg5 B -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom023_ -frags_name_prefix hom023_ -filter1 -240 -filter2 -255 -termini -cpu_run_time 7200 -constant_seed -jran 2764725


stdout.txt ended with:

score0 done: (best, low) rms
  0  0  29.1041889
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
 -38.2760544 -41.4388504  13.1321259  12.8618298
standard                       trials: 20000 accepts: 1434 %: 7.17
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
  0   -7.574   -7.574   -7.574   77   12.862   12.137   12.862


As per instructions, I stopped the BOINC service and ran Rosetta outside of BOINC from the command line; it did not get stuck. I closed Rosetta and then restarted BOINC. BOINC restarted the workunit from the beginning and ran it to completion--with the same random seed--without trouble.

stdout.txt now continued on like this:

score0 done: (best, low) rms
  0  0  29.1041889
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
 -38.2760544 -41.4388504  13.1321259  12.8618298
standard                       trials: 20000 accepts: 1434 %: 7.17
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
  0   -7.574   -7.574   -7.574   77   12.862   12.137   12.862
converged  2.55542803 104397
  1  -11.212  -11.212  -14.466   83   12.846   12.137   12.871
converged  1.9010148 104842
  2  -18.353  -18.353  -19.036   89   12.853   12.137   12.893
converged  1.58851826 110341
  3  -12.661  -12.661  -22.760   93   12.912   12.137   12.875
  4   16.287   16.287  -24.439   96   21.553   12.137   12.886
  5   18.037   18.037  -24.986   97   22.463   10.673   12.858
converged  2.06075215 119404
  6   -2.563   -2.563  -29.275   98   13.062   10.673   12.902
converged  1.70931101 123891
  7  -12.944  -12.944  -29.275   98   13.164   10.673   12.902
  8  -10.142  -10.142  -29.275   98   15.353   10.673   12.902
  9  -21.779  -21.779  -31.775  100   17.247   10.673   18.628
converged  2.26275945 100960
 10  -31.685  -31.685  -31.763  100   18.124   10.673   18.628
standard                       trials: 97008 accepts: 6066 %: 6.25309
-----------------------------------------------------
Starting score3 moves...
kk,score3,low_score,rms_err,low_rms,rms_min,naccept
  0   40.546   40.546   18.628   18.628   10.673 7500
  1   60.275   24.222   18.322   18.137   10.673 12142
pre-computing chuck/gunn move set for frag length 3
  2   42.101   21.584   17.448   18.061   10.673 15124
  3   33.776   21.028   18.461   17.431   10.673 17792
standard                       trials: 40000 accepts: 4642 %: 11.605
smooth                         trials: 80000 accepts: 5650 %: 7.0625
-----------------------------------------------------
-----------------------------------------------------


Comparing this with the earlier output, BOINC (or Rosetta) apparently got stuck somewhere after the line
  0   -7.574   -7.574   -7.574   77   12.862   12.137   12.862

but before the line
converged  2.55542803 104397
Also, this final stdout.txt contained the exact same numbers as did the stdout.txt from when I ran Rosetta from the command line.
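
For anyone who wants to do the same comparison on their own logs, here is a minimal sketch. It assumes the stdout.txt from the stuck run is an exact line-for-line prefix of the stdout.txt from the completed run, which was the case here only because of the constant seed. The function name is hypothetical, and the two strings are just the log tails quoted above.

```python
def first_missing_line(stuck_log, full_log):
    """Return (index, line) of the first line of full_log that stuck_log lacks.

    Returns None if stuck_log is not a line-for-line prefix of full_log,
    or if nothing is missing.
    """
    stuck = stuck_log.splitlines()
    full = full_log.splitlines()
    # Every line written before the hang must match the complete run.
    if full[:len(stuck)] != stuck or len(full) == len(stuck):
        return None
    return len(stuck), full[len(stuck)]


stuck_tail = """Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
  0   -7.574   -7.574   -7.574   77   12.862   12.137   12.862"""

full_tail = stuck_tail + """
converged  2.55542803 104397
  1  -11.212  -11.212  -14.466   83   12.846   12.137   12.871"""

result = first_missing_line(stuck_tail, full_tail)
print(result)  # (3, 'converged  2.55542803 104397')
```

The line it reports is the first output the stuck run never produced, so the hang happened somewhere between the last matching line and that one.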

This is weird. I can't say the problem lies with BOINC or the BOINC-Rosetta interface, because BOINC ran the exact same command without error the second time around. I can only guess that the 1% bug is caused by some odd combination of outside events, such as threads executing in a different order. I will see if I can do anything to reproduce it.

Other info on my configuration: WinXP SP2. 2.6GHz Celeron CPU. 1GB RAM. BOINC 5.2.13 running as system service. Rosetta 4.82. Also running RALPH@Home. Applications kept in memory when swapping.
ID: 12144 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org