Help us solve the 1% bug!

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 9249 - Posted: 18 Jan 2006, 5:40:12 UTC

I encountered the 1% fault on this workunit:


PRODUCTION_ABINITIO_1a68__250_204

https://boinc.bakerlab.org/rosetta/result.php?resultid=7033619

It was stuck on 1% for 8 and a half hours with 14 hours in the time left to completion column.

This computer runs only Rosetta, although it is a dual core and runs two workunits at a time.

I also checked the graphics, and all motion had stopped except for the cpu time, which was accurately recording the time.

I went ahead and aborted the workunit.

ciao.......


David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 9254 - Posted: 18 Jan 2006, 6:49:39 UTC - in response to Message 9249.  

I encountered the 1% fault on this workunit:


PRODUCTION_ABINITIO_1a68__250_204

https://boinc.bakerlab.org/rosetta/result.php?resultid=7033619

It was stuck on 1% for 8 and a half hours with 14 hours in the time left to completion column.

This computer runs only Rosetta, although it is a dual core and runs two workunits at a time.

I also checked the graphics, and all motion had stopped except for the cpu time, which was accurately recording the time.

I went ahead and aborted the workunit.

ciao.......



Hi Bruce, can you try running this with the same random number seed outside of BOINC (see David K's instructions below)? Thanks! David


premier
Joined: 30 Dec 05
Posts: 14
Credit: 23,872,868
RAC: 0
Message 9282 - Posted: 18 Jan 2006, 16:36:03 UTC - in response to Message 9234.  

premier,

Can you email me both stdout.txt files? dekim at u dot washington dot edu


Already sent :)

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 9446 - Posted: 20 Jan 2006, 12:14:14 UTC - in response to Message 8882.  
Last modified: 20 Jan 2006, 12:15:25 UTC

I didn't encounter a single 1% bug among my 1000+ processed Rosetta WUs, so far. In fact, the only errors I had were the ones everyone else was having over the Holidays, plus initially, a couple of errors caused by local problems (use of an obsolete BOINC client version, unstable memory). Oh, and I run Rosetta on Linux 24 hours a day (so no switching between projects, suspensions, shutdowns, etc).

I also have done around 1000 WUs on my Linux machines with no 1% hang.

Hmmm... I know that Windows is more aggressive about locking in-use files, so it would be possible for a certain file usage pattern to deadlock on a Windows machine but not on a Linux machine. Does CPU usage go to zero when this bug hits?
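
The CPU-usage question above separates two kinds of hang: a task that keeps accumulating CPU time without advancing (spinning) and a task that uses no CPU at all (blocked, which would be consistent with waiting on a locked file). A minimal Python sketch of that check follows. The `Sample` type, the `classify` function, and both thresholds are hypothetical names and values chosen for illustration; nothing here is part of BOINC or Rosetta.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    wall_s: float  # wall-clock seconds since monitoring began
    cpu_s: float   # CPU seconds the task has accumulated so far
    step: int      # step counter, e.g. as read from the graphics window


def classify(samples, min_wall_s=600.0, cpu_epsilon_s=1.0):
    """Classify a window of observations of one work unit.

    Returns "unknown", "progressing", "spinning", or "blocked".
    The thresholds are illustrative assumptions, not measured values.
    """
    if len(samples) < 2:
        return "unknown"
    first, last = samples[0], samples[-1]
    if last.wall_s - first.wall_s < min_wall_s:
        return "unknown"      # window too short to judge
    if last.step > first.step:
        return "progressing"  # the step counter moved, so no hang
    if last.cpu_s - first.cpu_s > cpu_epsilon_s:
        return "spinning"     # CPU time grows but the step counter does not
    return "blocked"          # neither grows: consistent with waiting on a lock


if __name__ == "__main__":
    window = [Sample(0.0, 100.0, 21669), Sample(900.0, 990.0, 21669)]
    print(classify(window))
```

Under this sketch, a report where the cpu time keeps counting while the graphics stay frozen would come out as "spinning" rather than "blocked", assuming the displayed cpu time reflects real CPU use by the task.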

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 10041 - Posted: 27 Jan 2006, 16:20:53 UTC - in response to Message 9254.  

I encountered the 1% fault on this workunit:


PRODUCTION_ABINITIO_1a68__250_204

https://boinc.bakerlab.org/rosetta/result.php?resultid=7033619

It was stuck on 1% for 8 and a half hours with 14 hours in the time left to completion column.

This computer runs only Rosetta, although it is a dual core and runs two workunits at a time.

I also checked the graphics, and all motion had stopped except for the cpu time, which was accurately recording the time.

I went ahead and aborted the workunit.

ciao.......



Hi Bruce, can you try running this with the same random number seed outside of BOINC (see David K's instructions below)? Thanks! David



I ran it per the instructions and it worked fine.

I ran into the bug again on a different workunit, exited BOINC Manager, then restarted it, and everything worked fine.

Again the graphics screen was frozen except for the cpu time.

It appears to be a BOINC problem, since Rosetta has now run fine on its own twice.


Have a great day.......



UBT - Halifax--lad
Joined: 17 Sep 05
Posts: 157
Credit: 2,687
RAC: 0
Message 10070 - Posted: 27 Jan 2006, 22:32:33 UTC

Definitely a BOINC problem. I had one that stuck the other day; after 2 hours it was still at 1%. After playing around I reset the WU's process, and the WU ran again and never got stuck at 1%.

AMDave
Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 10162 - Posted: 29 Jan 2006, 2:19:52 UTC
Last modified: 29 Jan 2006, 2:28:00 UTC

Well, this bug finally struck my system. I've been running Rosetta since Dec 15 with no problems, not even those that struck in Dec. Rosetta upgraded to 4.81 automatically when it was released. The WU in question is NO_SIM_ANNEAL_BARCODE_30_2reb_278_8946_0. It ran in excess of 5 hours before I noticed.

I suspended Rosetta, closed BOINC, then opened it back up - no good. I then followed the instructions in David Baker's opening entry below. After about 10 minutes, the WU surpassed 1%. I closed the command window and re-opened BOINC. Just like Bruce Boytler experienced, there was no motion in the graphics window except for the cpu time.

The work unit is suspended now. Any suggestions on which course of action I should take? Should I simply abort this WU, or do something else?

[edit] Not sure if it matters, but I noticed that the random seed did not change from when the BOINC client was closed, then Rosetta was run from the command line, then back to the client. WU id is 6530186. [/edit]

Dakoina
Joined: 19 Dec 05
Posts: 1
Credit: 43,589
RAC: 0
Message 10226 - Posted: 30 Jan 2006, 17:38:04 UTC
Last modified: 30 Jan 2006, 17:39:56 UTC

Today I noticed the 1% bug too. I had this one:

NO_SIM_ANNEAL_BARCODE_30_2reb_283_9553_0

running for over 7 hours stuck at 1%... anyway, pausing did not help, but restarting the BOINC client got it going again (cpu time restarting at 0 seconds). Too bad, I forgot to check whether the "screensaver" for that WU worked fine before the client restarted. After the restart the WU worked fine again. This WU should now be completed within 50 minutes of cpu time.

Note: running the client on an AMD dual core (if useful), 1 WU per core

meckano
Joined: 4 Jan 06
Posts: 28
Credit: 16,457
RAC: 0
Message 10507 - Posted: 6 Feb 2006, 17:05:15 UTC - in response to Message 10226.  
Last modified: 6 Feb 2006, 17:06:00 UTC

Edit:
Is there another way to find out if I have had the problem?

I had one result that took 19K seconds, and another that took 12K seconds.
Are those of any interest to you?
SNAFU'ed? Turn the Page! :D

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 10508 - Posted: 6 Feb 2006, 17:09:56 UTC - in response to Message 10507.  

Edit:
Is there another way to find out if I have had the problem?

I had one result that took 19K seconds, and another that took 12K seconds.
Are those of any interest to you?

The work units vary in size, so this is not unusual.

Regards,
Bob P.

The Gas Giant
Joined: 20 Sep 05
Posts: 23
Credit: 58,591
RAC: 0
Message 10570 - Posted: 8 Feb 2006, 10:14:57 UTC
Last modified: 8 Feb 2006, 10:19:33 UTC

This WU https://boinc.bakerlab.org/rosetta/workunit.php?wuid=7601894 was stuck at 1% for over 3 hours. I followed the guide right at the bottom to get the following command to run in the terminal window on XP. Within a few minutes the progress was at 10%.

C:\Program Files\BOINC\projects\boinc.bakerlab.org_rosetta>rosetta_4.81_windows_intelx86.exe aa 2tif _ -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -number_3mer_frags 200 -number_9mer_frags 100 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10 -constant_seed -jran 1373221

Hope this helped a little.

Live long and crunch.

Paul
(S@H1 8888)

Do as I say, not as I do!

Yin Gang
Joined: 17 Sep 05
Posts: 13
Credit: 63,992
RAC: 0
Message 10692 - Posted: 12 Feb 2006, 11:58:42 UTC
Last modified: 12 Feb 2006, 12:03:31 UTC

This WU (https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8300276) was stuck at 1% (step 21669) for more than 4 hours, then after restarting the manager the wu was stuck at 1% again (step 23100). So I followed the guide to run the application in the cmd.exe and the progress went to 10% after 23 minutes.

rosetta_4.81_windows_intelx86.exe xx 1fna _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -nstruct 10 -constant_seed -jran 918021
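
The reproduction step used throughout this thread amounts to rerunning the work unit's own arguments with `-constant_seed -jran <seed>` appended. A small Python sketch of assembling such a command line is shown here. The helper name `build_repro_command` is hypothetical; the executable name, arguments, and seed are simply the ones from the command above, and the sketch only builds the argument list without running anything.

```python
def build_repro_command(executable, workunit_args, seed):
    """Return an argv list that reruns a work unit with a fixed random seed.

    Any seed flags already present in workunit_args are dropped first,
    so the seed is specified exactly once at the end.
    """
    cleaned = []
    skip_next = False
    for arg in workunit_args:
        if skip_next:              # this is the value that followed -jran
            skip_next = False
            continue
        if arg == "-constant_seed":
            continue
        if arg == "-jran":
            skip_next = True
            continue
        cleaned.append(arg)
    return [executable, *cleaned, "-constant_seed", "-jran", str(seed)]


if __name__ == "__main__":
    cmd = build_repro_command(
        "rosetta_4.81_windows_intelx86.exe",
        ["xx", "1fna", "_", "-output_silent_gz", "-silent",
         "-increase_cycles", "10", "-new_centroid_packing", "-nstruct", "10"],
        918021,
    )
    print(" ".join(cmd))
    # To actually run it, something like subprocess.run(cmd, cwd=project_dir)
    # could be used, where project_dir is the Rosetta project folder (assumed).
```

Keeping the seed flags last makes it easy to compare a command-line run against the run inside BOINC, since only the seed needs checking.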


I've encountered many other WUs that took a rather long time in the first 1%, but this is the first never-ending WU, so I aborted it...

Hope this helps ;)

Best regards,
Yin Gang


Welcome To Team China!

Biggles
Joined: 22 Sep 05
Posts: 49
Credit: 102,114
RAC: 0
Message 10695 - Posted: 12 Feb 2006, 14:48:09 UTC

This work unit has been stuck at 1% for 25 hours now. I've only just noticed. Do you still want me to test it outside BOINC? I've suspended it for now.

For what it is worth, the computer is a Pentium M based laptop, running Windows XP and the Crunch3r SSE2 optimised BOINC client, latest version.

arklms
Joined: 17 Dec 05
Posts: 7
Credit: 177,488
RAC: 0
Message 10702 - Posted: 12 Feb 2006, 21:38:21 UTC

PRODUCTION_ABINITIO_CENTROID_PACKING_4ubpA_301_2382_0

21 hours, 1%. Now running from the command line (it says 16 minutes had elapsed, I don't know if that's relevant). It's hit 10% now so it appears to be going alright.

Biggles
Joined: 22 Sep 05
Posts: 49
Credit: 102,114
RAC: 0
Message 10766 - Posted: 15 Feb 2006, 3:46:59 UTC - in response to Message 10695.  

This work unit has been stuck at 1% for 25 hours now. I've only just noticed. Do you still want me to test it outside BOINC? I've suspended it for now.

For what it is worth, the computer is a Pentium M based laptop, running Windows XP and the Crunch3r SSE2 optimised BOINC client, latest version.


Ran this via the command line with the switches xx 256b A -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -nstruct 10 -constant_seed -jran 968001 and it passed 1% fairly quickly.

Resumed in BOINC and it reset itself, but didn't get stuck this time.

Bummed about losing over a day of CPU time though.

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 10917 - Posted: 19 Feb 2006, 5:23:18 UTC
Last modified: 19 Feb 2006, 6:02:01 UTC

I attached my ole Celeron 500, win98se and 256M ram to Ralph. I was doing a 4.85 Barcode checking out a computation error that happens with CPU run time, when I noticed my % complete started at 1 immediately, then ONLY progressed past this when it completed a model. This takes anywhere up to 40 minutes, so I got to stare at 1% complete for 30 minutes anyway. So my percentages jumped from 1 to 18 to 61 then done. This is when I found out that all my hosts update the % done at the end of every model. They all start at 1% immediately after starting.

my question:

Has anyone made it past model 1 so it could advance?
Was anyone watching the graphic?
Could there be a code problem in the program preventing the hosts from completing model 1??

tony

If they set up a clock trigger with fine resolution, rather than updating % done with an event trigger, they might better locate the bug. I.e., update a thousand times per WU, and if you start seeing 4, 5, and 6% bugs you'd know (approximately) where the lockup was occurring.

[edit]If we know it only occurs in Model 1, what is different about model one that's NOT in the other models?
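
The clock-trigger idea above could look roughly like the following Python sketch: progress is reported from the step counter whenever enough time has passed, instead of only when a model finishes. Everything here is an assumption for illustration. `ProgressReporter`, `tick`, and `report` are made-up names, and the sketch assumes the total number of steps is known in advance, which may not be true for Rosetta.

```python
import time


class ProgressReporter:
    """Report fractional progress on a timer instead of once per model."""

    def __init__(self, total_steps, report, min_interval_s=1.0,
                 clock=time.monotonic):
        self.total_steps = total_steps      # assumed known up front
        self.report = report                # callback taking a fraction in [0, 1]
        self.min_interval_s = min_interval_s
        self.clock = clock                  # injectable so it can be tested
        self._last_report = None

    def tick(self, steps_done):
        """Call from the inner loop; reports only if the interval has elapsed."""
        now = self.clock()
        if (self._last_report is not None
                and now - self._last_report < self.min_interval_s):
            return False
        self._last_report = now
        self.report(min(steps_done / self.total_steps, 1.0))
        return True


if __name__ == "__main__":
    reporter = ProgressReporter(1000, lambda f: print(f"{f:.1%} done"),
                                min_interval_s=0.0)
    for step in (10, 45, 1000):
        reporter.tick(step)
```

With reporting this fine-grained, a hang would show up at whatever fraction the step counter had reached, rather than always at the 1% shown before the first model completes.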


Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 10919 - Posted: 19 Feb 2006, 6:29:20 UTC

So, part of the 1% bug seems to be related to it not getting past the first stage, or switch times causing the restart of the first stage. It would seem that even the slowest processor would get past the first stage after running for hours. I've looked through this thread and see two references to the step number present when it hung, those being 21669 and 21933. Can others post their step numbers (visible from the graphic) so we can see if they're all around 21600-21900? Might there not be some code used in this area that's different from the other stages/models?

I'm just speculating here.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 10920 - Posted: 19 Feb 2006, 6:34:50 UTC

mmciastro, this is a weird bug. Keep in mind that a restart with the same random seed runs okay so it appears to be a random event possibly caused by the interaction with the boinc client. If the bug were reproducible, it would obviously be more easily tracked down.

Thorm
Joined: 25 Sep 05
Posts: 1
Credit: 22,435
RAC: 0
Message 10925 - Posted: 19 Feb 2006, 11:29:38 UTC

Yesterday my WU was stuck at 1% for over 1.30 hours, but then the progress suddenly jumped to 25%. I do not know why, because I didn't do anything that could explain this.

Maybe I closed some programs and Windows (2000) locked/unlocked some files, or maybe it's a RAM issue? Don't know. :-(

Today I have the same problem, but I'm not sure if this is really a bug or just a very large WU. The client isn't frozen and the step counter is rising (step 1.544.555 so far), but progress has been at 1% for 1.20 hours.

greetings
Thorm

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 10926 - Posted: 19 Feb 2006, 11:52:15 UTC - in response to Message 10925.  

Today I have the same problem, but I'm not sure if this is really a bug or just a very large WU. The client isn't frozen and the step counter is rising (step 1.544.555 so far), but progress has been at 1% for 1.20 hours.

greetings
Thorm

The percentage done seems to be updated after a model/stage is completed. Your Athlon processor is slow by today's standards, so it seems appropriate that you should see longer periods between updates than someone with a faster processor. I've looked at your results and it seems to be doing fine.