Report problems with Rosetta version 5.36

Message boards : Number crunching : Report problems with Rosetta version 5.36

Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 30921 - Posted: 10 Nov 2006, 19:52:18 UTC - in response to Message 30906.  
Last modified: 10 Nov 2006, 20:44:30 UTC

> ...another "disk space exceeded" error

Same here, 3 straight WU's, plenty of disk space. Darnit!

dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 30922 - Posted: 10 Nov 2006, 19:57:08 UTC - in response to Message 30869.  
Last modified: 10 Nov 2006, 19:57:26 UTC


watchdog can't kill an app that has already died for any other reason.

We call these "stopped clock" errors, or "cpu frozen", etc. What has really happened is that the app has gone to meet its maker but has been nailed to its perch by the client which has not noticed its death, early demise, etc. Perhaps we should call this the Norwegian Blue app ;-)
R~~

Why don't you call them zombies?
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 30950 - Posted: 11 Nov 2006, 14:13:48 UTC - in response to Message 30921.  
Last modified: 11 Nov 2006, 14:18:41 UTC

> ...another "disk space exceeded" error

Same here, 3 straight WU's, plenty of disk space. Darnit!


For what it's worth, they are all FRA_t369 tasks. Something is wrong with those WU's. I've lost every single one of them with this error - about 15 so far. I'm aborting the remaining 8 in queue.
Team Starfire World BOINC

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 30965 - Posted: 11 Nov 2006, 20:20:10 UTC - in response to Message 30950.  

> ...another "disk space exceeded" error

Same here, 3 straight WU's, plenty of disk space. Darnit!


For what it's worth, they are all FRA_t369 tasks. Something is wrong with those WU's. I've lost every single one of them with this error - about 15 so far. I'm aborting the remaining 8 in queue.


Not all. Over half have been FRA_, but other proteins besides t369. Just take a look through these which I listed in an earlier post:

A1 + A2 + A3 + B1 + B2 + B3 + B4 + C1 + C2 + C3

that is 5 x FRA_t362, and mix of others. So far these have just hit three of my boxes and not the other 5 that are currently running Rosetta.

R~~

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 30966 - Posted: 11 Nov 2006, 20:40:07 UTC - in response to Message 30922.  

... Perhaps we should call this the Norwegian Blue app ;-)

Why don't you call them zombies?

To me a zombie task is one that is still known by the operating system (eg still in memory, or still has open files, etc) but has either permanently stopped running or can never complete because it has lost a process on which it depends.

In linux terms, a zombie process has a pid that ps / kill / etc still recognise as valid even if we as humans can see that the pid will never be selected to run again.

I am only making my best guess about what is happening here, and I may well be wrong -- but if my guess is right then these are not zombies in that sense, they are completely dead processes as far as the OS is concerned, but the client has not yet noticed - the same kind of issue but arising at the level of the client rather than the OS.

So I'd understand a claim that these are zombies to be implying a different diagnosis to mine -- usefully so as the two diagnoses may lead on to divergent solutions. Thank you for an enlightening question.
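
To make the "client has not yet noticed" idea concrete, here is a minimal Python sketch. It is not how the BOINC client is implemented; the child command and the function name `observe_child_exit` are made up for illustration. It only shows that a supervising program's record of a child can stay stale until the supervisor actually checks. Note that on Linux the exited child in this sketch would itself be a zombie in the OS sense until the parent waits on it, so it illustrates stale bookkeeping rather than telling the two diagnoses apart.

```python
import subprocess
import sys
import time


def observe_child_exit():
    """Start a child that exits at once and report what the parent sees.

    Returns (before, after): the parent's recorded exit status before it
    checks on the child, and the status after it finally checks.
    """
    # Hypothetical stand-in for a science app that dies right after starting.
    child = subprocess.Popen([sys.executable, "-c", "pass"])

    # Give the child ample time to exit. The parent has not checked yet,
    # so its own bookkeeping still says "no exit status known".
    time.sleep(1.0)
    before = child.returncode  # stays None until poll() or wait() is called

    # Only now does the parent ask the OS and learn that the child is gone.
    after = child.wait()
    return before, after


if __name__ == "__main__":
    before, after = observe_child_exit()
    print("before checking:", before)
    print("after checking:", after)
```

A supervisor that checks rarely, or whose check fails, would go on believing the child is alive, which is the kind of "stopped clock" symptom described above.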

R~~

Buffalo Bill
Joined: 25 Mar 06
Posts: 71
Credit: 1,630,458
RAC: 0
Message 30971 - Posted: 12 Nov 2006, 1:57:12 UTC
Last modified: 12 Nov 2006, 1:58:53 UTC

Maximum disk usage exceeded error:

46387071

Another FRA_t369....

Rudy Toody
Joined: 18 Jul 06
Posts: 4
Credit: 280,134
RAC: 0
Message 30972 - Posted: 12 Nov 2006, 1:59:47 UTC

I've had to abort two WUs and one disappeared on its own. All three had "DUMMY" in the name and all three were being viewed in the graphic window when the problems occurred. I haven't had this problem with any of the other WUs.

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,236,942
RAC: 3,767
Message 30978 - Posted: 12 Nov 2006, 5:20:25 UTC

>> A follow up on my Screensaver lockup and failing workunit problems.
Since upgrading to Boinc Client Version 5.4.11, I have had 1 'lack of disc space' error and the last 3 workunits on the 4800+ machine have gone through with no problems, even with the Boinc screensaver on.
So perhaps with Boinc software updates the server side of things might no longer be 100% backward compatible with older client versions? My version was the previous stable recommended one (5.2.13).
Anyway all appears to be working. Considering that only 1 or 2 out of a dozen worked to completion (some worked less than 2 minutes before failing), in the previous batch with the screensaver on, I am a lot happier, well at least for the moment.

> We live in a world of problems, some we create ourselves, some others create for us. All I can say is "It's not my fault, I was probably asleep at the time".<

Jerry Camden
Joined: 26 Sep 05
Posts: 1
Credit: 226,493
RAC: 0
Message 30991 - Posted: 12 Nov 2006, 12:08:17 UTC
Last modified: 12 Nov 2006, 12:17:36 UTC

I just had a C++ Error dialog.

ResultID is 46651597.

Messages in BOINC manager....
11/11/2006 7:48:31 PM|rosetta@home|Restarting task 1bkrA_BOINC_ABINITIO_SAVE_ALL_OUT_DUMMYMODEL__1364_1405_0 using rosetta version 536
|
|
11/12/2006 5:28:37 AM|rosetta@home|Unrecoverable error for result 1bkrA_BOINC_ABINITIO_SAVE_ALL_OUT_DUMMYMODEL__1364_1405_0 (The system cannot find the path specified. (0x3) - exit code 3 (0x3))
11/12/2006 5:28:37 AM|rosetta@home|Deferring scheduler requests for 1 minutes and 0 seconds
11/12/2006 5:28:37 AM|rosetta@home|Computation for task 1bkrA_BOINC_ABINITIO_SAVE_ALL_OUT_DUMMYMODEL__1364_1405_0 finished
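
For anyone collecting these messages across several boxes, here is a rough Python sketch for pulling the result name and exit code out of lines in the `timestamp|project|text` shape shown above. The function name `parse_message` and the dictionary keys are my own choices, not anything BOINC provides, and the patterns are based only on the messages quoted in this thread, so other message wordings will simply not match.

```python
import re

# Patterns based only on the message wording quoted in this thread.
EXIT_CODE = re.compile(r"exit code (-?\d+) \((0x[0-9A-Fa-f]+)\)")
RESULT_NAME = re.compile(r"Unrecoverable error for result (\S+)")


def parse_message(line):
    """Split one 'timestamp|project|text' line into a dict.

    Returns None for lines that do not have all three fields,
    such as the bare '|' filler lines in the paste above.
    """
    if line.count("|") < 2:
        return None
    timestamp, project, text = line.split("|", 2)
    entry = {"timestamp": timestamp, "project": project, "text": text}

    name = RESULT_NAME.search(text)
    if name:
        entry["result"] = name.group(1)

    code = EXIT_CODE.search(text)
    if code:
        entry["exit_code"] = int(code.group(1))
        entry["exit_hex"] = code.group(2)
    return entry


if __name__ == "__main__":
    sample = (
        "11/12/2006 5:28:37 AM|rosetta@home|Unrecoverable error for result "
        "1bkrA_BOINC_ABINITIO_SAVE_ALL_OUT_DUMMYMODEL__1364_1405_0 "
        "(The system cannot find the path specified. (0x3) - exit code 3 (0x3))"
    )
    print(parse_message(sample))
```

The timestamp is left as a plain string on purpose, since the date format depends on the poster's locale.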

sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 31012 - Posted: 12 Nov 2006, 19:37:04 UTC

I just opened graphics on this WU and the screen froze--Control,Alt,Delete--and returns an error.

Tim




Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,236,942
RAC: 3,767
Message 31013 - Posted: 12 Nov 2006, 20:39:50 UTC - in response to Message 30978.  

>> A follow up on my Screensaver lockup and failing workunit problems.
Since upgrading to Boinc Client Version 5.4.11, I have had 1 'lack of disc space' error and the last 3 workunits on the 4800+ machine have gone through with no problems, even with the Boinc screensaver on.
So perhaps with Boinc software updates the server side of things might no longer be 100% backward compatible with older client versions? My version was the previous stable recommended one (5.2.13).
Anyway all appears to be working. Considering that only 1 or 2 out of a dozen worked to completion (some worked less than 2 minutes before failing), in the previous batch with the screensaver on, I am a lot happier, well at least for the moment.

> We live in a world of problems, some we create ourselves, some others create for us. All I can say is "It's not my fault, I was probably asleep at the time".<


Alas I spoke too soon, as the next 5 WU's all failed. All have debugging info with the WU:
https://boinc.bakerlab.org/rosetta/result.php?resultid=46588030
https://boinc.bakerlab.org/rosetta/result.php?resultid=46656845
these 2 had exit code 1073741819

https://boinc.bakerlab.org/rosetta/result.php?resultid=46507566
this one had 'Maximum Disk Usage Exceeded'

https://boinc.bakerlab.org/rosetta/result.php?resultid=46656798
https://boinc.bakerlab.org/rosetta/result.php?resultid=46656799
These 2 have the 'Stuck' problem, 'exit code 2147483645' also 'Breakpoint Encountered'

Guess I am not as happy as I thought I was.
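
The decimal exit codes are easier to look up once they are written as 32-bit hex. Below is a small Python sketch for that. The helper names `as_status_hex` and `describe` are made up for illustration. One assumption is stated loudly here: the two codes above would only map to the well-known Windows status values if a leading minus sign was lost when they were copied, which the 'Breakpoint Encountered' text suggests but does not prove.

```python
def as_status_hex(exit_code):
    """Render an exit code as an unsigned 32-bit hex string."""
    return "0x{:08X}".format(exit_code & 0xFFFFFFFF)


# Two standard Windows NTSTATUS values. Which of the codes in this thread
# they apply to depends on the ASSUMPTION that a minus sign was dropped.
KNOWN = {
    "0xC0000005": "STATUS_ACCESS_VIOLATION",
    "0x80000003": "STATUS_BREAKPOINT",
}


def describe(exit_code):
    """Return (hex string, status name or 'unknown') for an exit code."""
    hex_value = as_status_hex(exit_code)
    return hex_value, KNOWN.get(hex_value, "unknown")


if __name__ == "__main__":
    # As posted (positive) and as assumed (negative) for comparison.
    for code in (1073741819, -1073741819, 2147483645, -2147483645):
        print(code, describe(code))
```

With the minus sign restored, -2147483645 comes out as 0x80000003 and -1073741819 as 0xC0000005. Taken exactly as posted, the positive values do not match either entry.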

Rudy Toody
Joined: 18 Jul 06
Posts: 4
Credit: 280,134
RAC: 0
Message 31031 - Posted: 13 Nov 2006, 4:33:08 UTC - in response to Message 31012.  
Last modified: 13 Nov 2006, 4:36:16 UTC

I just opened graphics on this WU and the screen froze--Control,Alt,Delete--and returns an error.

Tim


I tried the same thing on my second PC (different graphics setup) and, within 10 seconds, it froze.
If I don't peek at them, they run to completion.

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,236,942
RAC: 3,767
Message 31043 - Posted: 13 Nov 2006, 8:49:25 UTC

> Sorry Rosetta team but I am sick of this: another 2 out of 3 failed. I will switch the screensaver back off, at least I can process some work that way.
These 2 have no debugging information.

https://boinc.bakerlab.org/rosetta/result.php?resultid=46742010
exit code 1073807364

https://boinc.bakerlab.org/rosetta/result.php?resultid=46754204
another 'Stuck' one watchdog killed with a validate error.

I could not prove the problem in Ralph as I received no Ralph WU's for many days now.

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,236,942
RAC: 3,767
Message 31085 - Posted: 13 Nov 2006, 20:58:15 UTC - in response to Message 31043.  

> Sorry Rosetta team but I am sick of this: another 2 out of 3 failed. I will switch the screensaver back off, at least I can process some work that way.
These 2 have no debugging information.

https://boinc.bakerlab.org/rosetta/result.php?resultid=46742010
exit code 1073807364

https://boinc.bakerlab.org/rosetta/result.php?resultid=46754204
another 'Stuck' one watchdog killed with a validate error.

I could not prove the problem in Ralph as I received no Ralph WU's for many days now.


This one was running when I switched and after 2 hours was at 1.02%, it eventually failed as being 'Stuck' with 'breakpoint encountered' error, has debugging data

https://boinc.bakerlab.org/rosetta/result.php?resultid=46742003

scsimodo
Joined: 17 Sep 05
Posts: 93
Credit: 946,359
RAC: 0
Message 31086 - Posted: 13 Nov 2006, 21:05:05 UTC

Dang! First V5.40-WU just crashed

Result

Host


Buffalo Bill
Joined: 25 Mar 06
Posts: 71
Credit: 1,630,458
RAC: 0
Message 31104 - Posted: 14 Nov 2006, 3:56:33 UTC

Validate error:

46722018


Killersocke@rosetta
Joined: 13 Nov 06
Posts: 29
Credit: 2,579,125
RAC: 0
Message 31111 - Posted: 14 Nov 2006, 6:16:11 UTC

I'm a newbie here.

In the morning I found these workunits in error:
Workunit 41576568
DOC_1MLC_R061030_st_model_09_1383_1382_0
as next this
Workunit 41577287
DOC_2SIC_R061030_st_model_09_1389_1428_0
Validate error
- screensaver Version 5.4 crashed
and at last
14.11.2006 06:08:44|rosetta@home|Unrecoverable error for result DOC_1MLC_R061030_st_model_10_1383_1421_0 ( - exit code 1073807364 (0x40010004))



Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 31113 - Posted: 14 Nov 2006, 6:38:58 UTC

Welcome to Killersocke. I just wanted to point out, to anyone looking, that you had three WUs so far, the first two failed, the third was successful... and all three were for the brand spankin' new v5.40. And if you check the v5.40 thread you will see Chu found a problem with docking work units running under the new application version. So, hopefully this issue is already addressed.

It's unfortunate that your first two WUs failed. But it is sorta like "mistakes". Doing it once is a "learning experience", doing it a second time is a "mistake". Well, with Rosetta, the entire project is a learning experience. We're helping break new ground in science. So they are constantly changing, enhancing and improving the application to try different approaches or test various ideas on how to devise better models. You will note that your first WU has already received credit. There is a daily job that runs to grant credit even for the failed WUs. After all, there wasn't anything that you did to cause it. It is part of the learning experience. The other two WUs should receive credit tomorrow.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

César
Joined: 8 Feb 06
Posts: 1
Credit: 14,964
RAC: 0
Message 31117 - Posted: 14 Nov 2006, 8:19:19 UTC

Work unit has been running for days, I have aborted it.
s002_BOINC_ABRELAX_SAVE_ALL_OUT_hom001__1313_31444_0

I'll try the new Rosetta version since the Moderator said it will solve this kind of problem.

If it doesn't, I'll switch to other projects. I don't think it is reasonable to donate our computers' idle time and then have to babysit the application to make it behave.

Management of tasks should be automatic (cleaning up those that exceed a reasonable time, and not requiring users to report them manually or check pages of bad work units manually).

Sharing of the computing environment with other BOINC tasks should be handled well. Sorry.


JohanDM
Joined: 22 Nov 05
Posts: 1
Credit: 219,288
RAC: 0
Message 31125 - Posted: 14 Nov 2006, 9:51:55 UTC
Last modified: 14 Nov 2006, 9:59:41 UTC

I've two WU's that reported OK but didn't get credit because of validate error:
46913918
46814172
