Report stuck work units here

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 8370 - Posted: 4 Jan 2006, 19:52:54 UTC - in response to Message 8329.  
Last modified: 4 Jan 2006, 20:20:35 UTC


Do people have an idea about what fraction of work units are getting stuck, and whether any proteins or run conditions are getting stuck more frequently than others?


In my view you won't get a reliable answer to this, David, as the variation between machines is enormous - probably luck of the draw rather than anything systematic.

I had 17 machines running Rosetta over the break. I could pick out spurious patterns on one and refute them on another. Four boxes got 90% downtime for a couple of days, one box got off unscathed, and the others varied in between.

If I'd had just one of those boxes, the story I'd give you would depend on which box it was.

I'd suggest getting an SQL wizard to coax the frequencies of errors etc. and run-times out of the returned work pile. Even better, because it is less dependent on CPU speed: frequencies of errors against credit claims for the various flavours of WU.

It won't have so much detail but at least it won't depend so much on the vagaries of what wu went to which observant/too-busy users.
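
As a rough illustration of the kind of query meant here, the sketch below counts failures per WU flavour and divides by the credit claimed. Everything in it is an assumption for illustration: the `result` table, its columns, and the idea that the flavour is whatever precedes the 4-character protein ID in the WU name (inferred only from names quoted in this thread). It is not the project's real schema.

```python
import re
import sqlite3

# Assumed naming scheme, inferred only from WU names quoted in this thread:
# <flavour>_<4-char protein id>_<batch>_<number>[_<suffix>]
WU_NAME = re.compile(
    r"(?P<flavour>.+?)_(?P<protein>\d[0-9a-z]{3})_\d+_\d+(?:_\d+)?"
)


def flavour_of(wu_name):
    """Return the flavour prefix of a WU name, or 'UNKNOWN' if it does not parse."""
    match = WU_NAME.fullmatch(wu_name)
    return match.group("flavour") if match else "UNKNOWN"


def errors_per_credit(rows):
    """rows: iterable of (wu_name, failed, claimed_credit) tuples.

    Returns (flavour, results, failures, failures_per_credit) per flavour,
    ordered by flavour name. The table layout is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE result (flavour TEXT, failed INTEGER, claimed_credit REAL)"
    )
    conn.executemany(
        "INSERT INTO result VALUES (?, ?, ?)",
        [(flavour_of(name), int(failed), credit) for name, failed, credit in rows],
    )
    query = """
        SELECT flavour,
               COUNT(*)    AS results,
               SUM(failed) AS failures,
               SUM(failed) * 1.0 / SUM(claimed_credit) AS failures_per_credit
        FROM result
        GROUP BY flavour
        ORDER BY flavour
    """
    out = conn.execute(query).fetchall()
    conn.close()
    return out
```

Normalising by claimed credit rather than by run-time is the point of the suggestion above: a slow box and a fast box claim roughly similar credit for the same work, so the ratio should be comparable across machines.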

River~~

[DPC]FOKschaap~Jumparound
Joined: 17 Dec 05
Posts: 2
Credit: 60,626
RAC: 0
Message 8410 - Posted: 5 Jan 2006, 8:20:14 UTC

Please check this WU and look at the points wasted :(
(and yes, the last 95 points were mine :()

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3761771

buffylove
Joined: 2 Nov 05
Posts: 2
Credit: 41,715
RAC: 0
Message 8411 - Posted: 5 Jan 2006, 8:44:43 UTC

Do you mean one like this?


buffylove
Joined: 2 Nov 05
Posts: 2
Credit: 41,715
RAC: 0
Message 8412 - Posted: 5 Jan 2006, 8:50:07 UTC

Along with that, the computation errors, and bandwidth required for data downloads, it's not looking so good.

Cureseekers~Nightanimal
Joined: 20 Nov 05
Posts: 19
Credit: 26,396
RAC: 0
Message 8442 - Posted: 5 Jan 2006, 23:36:08 UTC
Last modified: 5 Jan 2006, 23:53:25 UTC

Job NO_SIM_ANNEAL_1hz6_228_5495_0 got stuck here: 12 hrs at 2% on a PIII 733 MHz.
The signature is away at the moment, just leave a message after the beep

Ian_D
Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 8505 - Posted: 6 Jan 2006, 22:10:56 UTC
Last modified: 6 Jan 2006, 22:16:05 UTC

Job MORE_FRAGS_W_BARCODE_2reb_229_5862 stuck for 9hrs at 1%

MORE_FRAGS_W_BARCODE_2reb_229_5862

XP2100+ o/c 13x166 (Lister) stable prime95 / memtest86+ et al



O&O
Joined: 11 Dec 05
Posts: 25
Credit: 66,900
RAC: 0
Message 8537 - Posted: 7 Jan 2006, 14:54:18 UTC
Last modified: 7 Jan 2006, 15:34:06 UTC

Hi,...

MORE_FRAGS_W_BARCODE_2reb_229_6182_0 (Result ID 5827255) has been running for 5:30 of CPU time, with 1% progress and more than 12 hours to completion. Should I abort it? Will it be credited?

O&O

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 13
Message 8542 - Posted: 7 Jan 2006, 18:32:09 UTC

These "stuck at 1%" results _seem_ to restart and complete successfully if you exit BOINC and relaunch it. The project has not said that credit will be granted if you abort them, so I can't advise that - it would help them locate the problem, however, if you could get the last 20 lines from the stdout.txt file in the slots directory and paste them here before you relaunch BOINC.
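
For anyone unsure how to grab those lines, here is a minimal sketch in Python. The function name `last_lines` is made up for illustration, and the path you pass in is an assumption: point it at the stdout.txt inside whichever slots subdirectory your stuck result is using.

```python
from collections import deque


def last_lines(path, n=20):
    """Return the last n lines of a text file, without trailing newlines."""
    # deque with maxlen keeps only the most recent n lines while reading,
    # so the whole file is never held in memory at once.
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


# Example (hypothetical path, adjust to your own BOINC directory):
# print("\n".join(last_lines("slots/0/stdout.txt")))
```

Run it before relaunching BOINC, since the file may be rewritten once the result restarts.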

O&O
Joined: 11 Dec 05
Posts: 25
Credit: 66,900
RAC: 0
Message 8571 - Posted: 8 Jan 2006, 2:49:18 UTC - in response to Message 8542.  
Last modified: 8 Jan 2006, 3:06:52 UTC

These "stuck at 1%" results _seem_ to restart and complete successfully if you exit BOINC and relaunch it. The project has not said that credit will be granted if you abort them, so I can't advise that - it would help them locate the problem, however, if you could get the last 20 lines from the stdout.txt file in the slots directory and paste them here before you relaunch BOINC.


Thanks Bill.

I did abort the result before having the chance to read your advice, which resulted in:
07/01/2006 17:14:42|rosetta@home|Unrecoverable error for result MORE_FRAGS_W_BARCODE_2reb_229_6182_0 (aborted via GUI RPC)
07/01/2006 17:14:43|rosetta@home|Computation for result MORE_FRAGS_W_BARCODE_2reb_229_6182_0 finished

Those 5 hours of CPU time could probably have gone into 2 good results for Rosetta or PrimeGrid, or been added to a Climateprediction unit. Because this result was stuck, they turned out to be a waste of normal BOINC operation.

The project did not say that by crunching its results you may experience a lower average credits/day either, did it?

Until "they" locate the problems causing such "abnormalities", I'll allow a Rosetta result 30 minutes to progress beyond 1%; otherwise I'll abort it.
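
Written down as a check, that rule of thumb looks roughly like the sketch below. The function name, parameters and defaults are all illustrative assumptions (30 minutes of CPU time, 1% progress); nothing here is a BOINC API, and as noted earlier in the thread a restart may rescue such results, so treat a hit as a prompt to look rather than as proof.

```python
def looks_stuck(cpu_seconds, fraction_done, grace_seconds=30 * 60, threshold=0.01):
    """Return True if a result has used at least grace_seconds of CPU time
    while its reported progress is still at or below threshold.

    fraction_done is progress as a fraction (0.01 means 1%).
    """
    return cpu_seconds >= grace_seconds and fraction_done <= threshold


# The result reported above: 5:30 of CPU time at 1% progress.
print(looks_stuck(5.5 * 3600, 0.01))
```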

O&O (Systems management 101: Kill the monster while it is young)

ExtraTerrestrial Apes
Joined: 3 Jan 06
Posts: 3
Credit: 6,087,435
RAC: 1,365
Message 8657 - Posted: 9 Jan 2006, 16:59:00 UTC

Hi,
I didn't read much of this thread, so I'm just assuming you still need information. I had this WU which stayed at 1% for 2 hours. It was still in the ab initio phase, which I've never seen before. I restarted BOINC and it completed normally.

MrS
Scanning for our furry friends since Jan 2002

Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 14,945,062
RAC: 0
Message 8715 - Posted: 10 Jan 2006, 11:34:08 UTC

Wao, I got one :-)

WU-Name: INCREASE_CYCLES_10_1hz6_226_6922_0

The last 20 lines of stdout:

Size: 3 NUMBER OF FRAGS FOR POS: 53 200
Size: 3 NUMBER OF FRAGS FOR POS: 54 200
Size: 3 NUMBER OF FRAGS FOR POS: 55 200
Size: 3 NUMBER OF FRAGS FOR POS: 56 200
Size: 3 NUMBER OF FRAGS FOR POS: 57 200
Size: 3 NUMBER OF FRAGS FOR POS: 58 200
Size: 3 NUMBER OF FRAGS FOR POS: 59 200
[T/F OPT]New TRUE value for [-jitter_frag]
[REAL OPT]Default value for [-jitter_amount] 2
[STR OPT]New value for [-jitter_variation] gauss.
score0 done: (best, low) rms
2 0 10.9471054
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
-4.41134071 -17.6313858 8.64116001 4.5229826
standard trials: 20000 accepts: 585 %: 2.925
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -27.503 -12.841 -27.503 28 4.523 3.605 4.523


I have saved the whole boinc-directory, so, if you are interested, I can zip it for you and put it on one of my servers so that you can download it from me

The WU should normally take 6 hours on this machine; until now it has taken 6 hours and says 1%




Supporting BOINC, a great concept !

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 8723 - Posted: 10 Jan 2006, 15:35:26 UTC - in response to Message 8715.  

Wao, I got one :-)

WU-Name: INCREASE_CYCLES_10_1hz6_226_6922_0

The last 20 lines of stdout:

Size: 3 NUMBER OF FRAGS FOR POS: 53 200
Size: 3 NUMBER OF FRAGS FOR POS: 54 200
Size: 3 NUMBER OF FRAGS FOR POS: 55 200
Size: 3 NUMBER OF FRAGS FOR POS: 56 200
Size: 3 NUMBER OF FRAGS FOR POS: 57 200
Size: 3 NUMBER OF FRAGS FOR POS: 58 200
Size: 3 NUMBER OF FRAGS FOR POS: 59 200
[T/F OPT]New TRUE value for [-jitter_frag]
[REAL OPT]Default value for [-jitter_amount] 2
[STR OPT]New value for [-jitter_variation] gauss.
score0 done: (best, low) rms
2 0 10.9471054
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
-4.41134071 -17.6313858 8.64116001 4.5229826
standard trials: 20000 accepts: 585 %: 2.925
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -27.503 -12.841 -27.503 28 4.523 3.605 4.523


I have saved the whole boinc-directory, so, if you are interested, I can zip it for you and put it on one of my servers so that you can download it from me

The WU should normally take 6 hours on this machine; until now it has taken 6 hours and says 1%



Thanks. I'll try to reproduce this locally.


O&O
Joined: 11 Dec 05
Posts: 25
Credit: 66,900
RAC: 0
Message 8764 - Posted: 11 Jan 2006, 8:49:14 UTC

David,...

I have two INCREASE_CYCLES_10_1xxx_226_xxxx_0 results (same batch, I presume) in "ready to run" status. Should I abort them?

Thanks,
O&O

©2024 University of Washington
https://www.bakerlab.org