Message boards : Number crunching : Help us solve the 1% bug!
Author | Message |
---|---|
bruce boytler Send message Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
I encountered the 1% fault on this workunit: PRODUCTION_ABINITIO_1a68__250_204 https://boinc.bakerlab.org/rosetta/result.php?resultid=7033619 It was stuck at 1% for eight and a half hours, with 14 hours showing in the time-left-to-completion column. This computer only runs Rosetta, although it is a dual core and runs two workunits at a time. I also checked the graphics, and all motion had stopped except for the CPU time, which was accurately recording the time. I went ahead and aborted the workunit. ciao....... |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I encountered the 1% fault on this workunit: Hi Bruce, can you try running this with the same random number seed outside of BOINC (see David K's instructions below)? Thanks! David |
premier Send message Joined: 30 Dec 05 Posts: 14 Credit: 23,872,868 RAC: 0 |
premier, Already sent :) |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I didn't encounter a single 1% bug among my 1000+ processed Rosetta WUs, so far. In fact, the only errors I had were the ones everyone else was having over the Holidays, plus initially, a couple of errors caused by local problems (use of an obsolete BOINC client version, unstable memory). Oh, and I run Rosetta on Linux 24 hours a day (so no switching between projects, suspensions, shutdowns, etc). I have also done around 1000 WUs on my Linux machines with no 1% hang. Hmmm... I know that Windows is more aggressive about locking in-use files, so it would be possible for a certain file usage pattern to deadlock on a Windows machine but not on a Linux machine. Does CPU usage go to zero when this bug hits? |
bruce boytler Send message Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
I encountered the 1% fault on this workunit: I ran it per the instructions and it worked fine. I ran into the bug again on a different workunit, so I just exited BOINC Manager and restarted it, and everything worked fine. Again the graphics screen was frozen except for the CPU time. It appears to be a BOINC problem, since Rosetta runs fine on its own, twice now. Have a great day....... |
UBT - Halifax--lad Send message Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
Definitely a BOINC problem. I had one that stuck the other day; after 2 hours it was still at 1%. After playing around, I reset the process of the WU, the WU ran again, and it never got stuck at 1%. Join us in Chat (see the forum) Click the Sig Join UBT |
AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
Well, this bug finally struck my system. I've been running Rosetta since Dec 15 with no problems, not even those that struck in Dec. Rosetta upgraded to 4.81 automatically when it was released. The WU in question is NO_SIM_ANNEAL_BARCODE_30_2reb_278_8946_0. It ran in excess of 5 hours before I noticed. I suspended Rosetta, closed BOINC, then opened it back up - no good. I then followed the instructions in David Baker's opening entry below. After about 10 minutes, the WU passed 1%. I closed the command window and re-opened BOINC. Just as Bruce Boytler experienced, there was no motion in the graphics window except for the CPU time. The work unit is suspended now. Any suggestions on which course of action I should take, like simply aborting this WU, or? [edit] Not sure if it matters, but I noticed that the random seed did not change from when the BOINC client was closed, then Rosetta was run from the command line, then back to the client. WU id is 6530186. [/edit] |
Dakoina Send message Joined: 19 Dec 05 Posts: 1 Credit: 43,589 RAC: 0 |
Today I noticed the 1% bug too. I had this one: NO_SIM_ANNEAL_BARCODE_30_2reb_283_9553_0 running for over 7 hours stuck at 1%... Anyway, pausing did not help, but restarting the BOINC client got it going again (CPU time restarting at 0 seconds). Too bad I forgot to check whether the "screensaver" for that WU worked fine or not before the client restarted. After the restart the WU worked fine again. This WU should now be completed within 50 minutes of CPU time. Note: running the client on an AMD dual core (if useful), 1 WU per core |
meckano Send message Joined: 4 Jan 06 Posts: 28 Credit: 16,457 RAC: 0 |
Edit: Is there another way to find out if I have had the problem? I had one result that took 19K secs, and another that took 12K secs. Are those of any interest to you? ----------------------- Click to see my tag My tag SNAFU'ed? Turn the Page! :D |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
Edit: The work units vary in size, so this is not unusual. Regards, Bob P. |
The Gas Giant Send message Joined: 20 Sep 05 Posts: 23 Credit: 58,591 RAC: 0 |
This WU https://boinc.bakerlab.org/rosetta/workunit.php?wuid=7601894 was stuck at 1% for over 3 hrs. I followed the guide right at the bottom to get the following command to run in the terminal window on XP. Within a few minutes the progress was at 10%. C:\Program Files\BOINC\projects\boinc.bakerlab.org_rosetta>rosetta_4.81_windows_intelx86.exe aa 2tif _ -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -number_3mer_frags 200 -number_9mer_frags 100 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10 -constant_seed -jran 1373221 Hope this helped a little. Live long and crunch. PPaul (S@H1 8888) Do as I say, not as I do! |
Yin Gang Send message Joined: 17 Sep 05 Posts: 13 Credit: 63,992 RAC: 0 |
This WU (https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8300276) was stuck at 1% (step 21669) for more than 4 hours, then after restarting the manager the WU was stuck at 1% again (step 23100). So I followed the guide to run the application in cmd.exe, and the progress went to 10% after 23 minutes. rosetta_4.81_windows_intelx86.exe xx 1fna _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -nstruct 10 -constant_seed -jran 918021 I've encountered many other WUs that took a rather long time in the first 1%, but this is the first never-ending WU, so I aborted it... Hope this helps ;) Best regards, Yin Gang Welcome To Team China! |
Biggles Send message Joined: 22 Sep 05 Posts: 49 Credit: 102,114 RAC: 0 |
This work unit has been stuck at 1% for 25 hours now. I've only just noticed. You still wanting me to test it outside BOINC? I've suspended it for now. For what it is worth, the computer is a Pentium M based laptop, running Windows XP and the Crunch3r SSE2 optimised BOINC client, latest version. |
arklms Send message Joined: 17 Dec 05 Posts: 7 Credit: 177,488 RAC: 0 |
PRODUCTION_ABINITIO_CENTROID_PACKING_4ubpA_301_2382_0 21 hours, 1%. Now running from the command line (it says 16 minutes had elapsed, I don't know if that's relevant). It's hit 10% now so it appears to be going alright. |
Biggles Send message Joined: 22 Sep 05 Posts: 49 Credit: 102,114 RAC: 0 |
This work unit has been stuck at 1% for 25 hours now. I've only just noticed. You still wanting me to test it outside BOINC? I've suspended it for now. Ran this via the command line with the switches xx 256b A -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -nstruct 10 -constant_seed -jran 968001 and it passed 1% fairly quickly. Resumed in BOINC and it reset itself, but didn't get stuck this time. Bummed about losing over a day of CPU time though. |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
I attached my ole Celeron 500, win98se and 256M RAM to Ralph. I was doing a 4.85 Barcode, checking out a computation error that happens with CPU run time, when I noticed my % complete started at 1 immediately, then ONLY progressed past this when it completed a model. This takes anywhere up to 40 minutes, so I got to stare at 1% complete for 30 minutes anyway. So my percentages jumped from 1 to 18 to 61, then done. This is when I found out that all my hosts update the % done at the end of every model. They all start at 1% immediately after starting. My question: Has anyone made it past model 1 so it could advance? Was anyone watching the graphic? Could there be a code problem in the program preventing the hosts from completing model 1?? tony If they set up a clock trigger with fine resolution, rather than updating % done with an event trigger, they might better locate the bug. I.e., update a thousand times per WU, and if you start seeing 4, 5, and 6% bugs you'd know (approximately) where the lockup was occurring. [edit]If we know it only occurs in model 1, what is different about model 1 that's NOT in the other models? |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
So, part of the 1% bug seems to be related to it not getting past the first stage, or switch times causing the restart of the first stage. It would seem that even the slowest processor would get past the first stage after running for hours. I've looked through this thread and see two references to the step number present when it hung, those being 21669 and 21933. Can others post their step numbers (visible from the graphic) so we can see if they're all around 21600-21900? Might there not be some code used in this area that's different from the other stages/models? I'm just speculating here. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
mmciastro, this is a weird bug. Keep in mind that a restart with the same random seed runs okay so it appears to be a random event possibly caused by the interaction with the boinc client. If the bug were reproducible, it would obviously be more easily tracked down. |
Thorm Send message Joined: 25 Sep 05 Posts: 1 Credit: 22,435 RAC: 0 |
Yesterday my WU was stuck at 1% for over 1.30 hours, but suddenly the progress jumped to 25%. I do not know why, because I didn't start any action which could explain this. Maybe I closed some programs and Windows (2000) locked/unlocked some files, or maybe it's a RAM issue? Don't know. :-( Today I have the same problem, but I'm not sure if this is really a bug or a very large WU. The client isn't frozen, the step counter is rising (step 1.544.555 so far), but progress has been at 1% for 1.20 hours. Greetings, Thorm |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Today i have the same problem, but i'm not sure if this is really a bug, or a very large WU? The Client isn't frozen, the step-counter is raising(Step 1.544.555 so far) but progress is at 1% for 1.20 hour The percentage done seems to be updated only after a model/stage is completed. Your Athlon processor is slow by today's standards, so it seems appropriate that you should see longer periods between updates than someone with a faster processor. I've looked at your results and it seems to be doing fine. |
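Astro's suggestion earlier in the thread, updating the percent done on a clock trigger rather than only when a model completes, can be sketched roughly as follows. This is a minimal Python illustration and not Rosetta or BOINC code: the names `TimedProgressReporter`, `FakeClock`, and `simulate_hang` are hypothetical, and it assumes the total step count is known in advance, which nothing in this thread confirms.

```python
import time
from typing import Callable, List, Tuple


class TimedProgressReporter:
    """Sample progress on a timer instead of once per completed model.

    Hypothetical helper for illustration only; not part of Rosetta or BOINC.
    """

    def __init__(self, total_steps: int, interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if total_steps <= 0:
            raise ValueError("total_steps must be positive")
        self.total_steps = total_steps
        self.interval_s = interval_s
        self.clock = clock
        self.last_report = clock()
        # Each entry is (timestamp in seconds, fraction done).
        self.history: List[Tuple[float, float]] = []

    def step(self, steps_done: int) -> None:
        """Call from the inner loop; records a sample once per interval."""
        now = self.clock()
        if now - self.last_report >= self.interval_s:
            fraction = min(steps_done / self.total_steps, 1.0)
            self.history.append((now, fraction))
            self.last_report = now

    def last_fraction(self) -> float:
        """Last sampled fraction, or 0.0 if nothing was sampled yet."""
        return self.history[-1][1] if self.history else 0.0


class FakeClock:
    """Deterministic clock for the simulation: one tick is 0.01 s."""

    def __init__(self) -> None:
        self.ticks = 0

    def __call__(self) -> float:
        return self.ticks / 100.0

    def advance(self) -> None:
        self.ticks += 1


def simulate_hang(total_steps: int, hang_at_step: int) -> float:
    """Run until a simulated hang and return the last sampled fraction."""
    clock = FakeClock()
    reporter = TimedProgressReporter(total_steps, interval_s=1.0, clock=clock)
    for step in range(1, hang_at_step + 1):
        clock.advance()          # each step costs 0.01 s of simulated time
        reporter.step(step)
    return reporter.last_fraction()
```

In this toy setup, `simulate_hang(1000, 450)` returns `0.4`: the last sample taken before the simulated hang was at step 400, so the displayed progress would point to roughly where the lockup happened instead of sitting at 1% for the whole first model.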