Message boards : Number crunching : Information on Ver 4.97 errors
Author | Message |
---|---|
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
Jeff Gilchrist spoketh: distributed folding? Yep.. that's the one. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away? Please don't throw them away if they run fine--I'm very curious about the results! |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I just got back into town an hour ago, and have not yet been able to pinpoint the source of the recent problems. But I want to apologize in any event; the scale of the problems certainly was my fault. Here is what happened: I wanted to test the effects of an improvement in sampling alternative sidechain conformations during the high-resolution stage of the search. Tests on our in-house computers showed that this improvement resulted in consistently lower energy structures being found, and there were absolutely no signs of any run time problems. David K. sent out the new version of the code to RALPH Thursday, and we submitted some test jobs. Friday afternoon we talked, and as there seemed to be no problems on RALPH, and the code change was relatively minor, David sent the new version out to Rosetta@home. I was very eager to see how the improvement in sampling would affect the searches I had been carrying out in the HBLR_1.0 series of runs you all had been doing over the past month, and as I was going out of town for a few days I submitted a large number of jobs Friday evening so that there would be a clear picture when I returned. You can imagine my horror on checking up on Rosetta and RALPH in the few minutes before leaving Saturday morning! It was clear by Saturday that the test jobs I had sent out on RALPH had a high error rate on Windows, and that I had totally jumped the gun by sending out the very large set of runs on Rosetta on Friday. I'm very sorry that I did this, and about the waste of resources and confusion this caused, and definitely learned my lesson--always make sure the RALPH tests are complete and 100% positive before submitting large-scale runs on Rosetta. |
Jose Send message Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
All I know is that almost 2 days of my computer time have resulted in errors of the kind you describe. To wit: 16811046 13764140 9 Apr 2006 10:36:23 UTC 11 Apr 2006 7:01:19 UTC Over Client error Computing 12,238.19 37.94 --- 16697013 13665863 8 Apr 2006 21:09:49 UTC 9 Apr 2006 3:20:07 UTC Over Client error Computing 18,578.25 57.60 --- 16613497 13627278 8 Apr 2006 13:05:47 UTC 8 Apr 2006 22:54:25 UTC Over Client error Computing 25,537.47 79.18 --- 16564691 13587556 8 Apr 2006 5:48:01 UTC 8 Apr 2006 15:46:50 UTC Over Client error Computing 23,689.95 73.45 To say the least, it has been frustrating. “This and no other is the root from which a Tyrant springs; when he first appears he is a protector.” Plato |
anders n Send message Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
Did the "rosetta_4.97_windows_intelx86.pdb" file give you any useful information about what happened? Anders n |
XS_Duc Send message Joined: 30 Dec 05 Posts: 17 Credit: 310,471 RAC: 0 |
I just got back into town an hour ago, and have not yet been able to pinpoint the source of the recent problems. But I want to apologize in any event, the scale of the problems certainly was my fault. Those who are free of sin may now pick up a stone and throw it... We lost some time and resources, so what? It happened before and will certainly happen again, I guess. Nothing is flawless, mistakes/errors will always be made... but they shall be forgiven and forgotten in the long run towards success. The weak shall perish... |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Sounds like "reset project" from the projects tab. This basically aborts any WUs and reloads the application code. I know it is too late for this thread, but I'd like to correct this please, Feet1st. Reset is not the same as abort and reload. Reset does a forget and reload. Often the abort is useful to a project as the error file may contain some useful info. It also allows the WU to be released to another user. For the latter reason, often with a dodgy WU reset is more useful as it does not force a re-issue until the team have had a chance to stop the WU being issued. So both have their uses, but they are not the same. Where a project wants the error reports, the short procedure is to go to the work tab and abort each existing work unit, and let it report in due course. The full procedure if you want also to force a reload is quite complicated as you have to force through the flushing of the aborted work. 1) set No New Work for that project 2) abort all WU separately from the Work tab 3) suspend all other projects from the projects tab to force the aborted WU to run (sounds contradictory, but this is where each WU generates the error report) 4) in the unlikely event that these get stuck, resume then suspend one of the other projects - sometimes you'll find you need to do this as many times as you have aborted WU 5) update this project 6) wait for aborted WU to disappear from work tab 7) *now* reset project if required 8) set allow new work 9) resume all other projects from the projects tab. It is a lot to ask users to do - which is why a project may well just ask for a reset instead - a larger percentage of users will actually do it! But it still is not the same. River~~ |
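
For anyone who would rather script River~~'s full procedure than click through the Manager tabs, here is a rough sketch. It only builds the list of commands and does not run anything. It assumes the `boinccmd` command-line tool that ships with current BOINC clients (the thread itself only describes the Manager GUI), and the project URLs and work unit names are hypothetical placeholders. Steps 4 and 6 (nudging stuck work units and waiting for them to clear) still have to be done by hand.

```python
from typing import List


def flush_and_reset_plan(
    project_url: str,
    task_names: List[str],
    other_project_urls: List[str],
    reset: bool = True,
) -> List[List[str]]:
    """Return, in order, the boinccmd invocations for the abort-then-reset procedure.

    Nothing is executed here; the caller decides whether to run the commands.
    All URLs and task names are supplied by the caller (hypothetical in the demo).
    """
    plan: List[List[str]] = []

    # Step 1: set No New Work for the affected project.
    plan.append(["boinccmd", "--project", project_url, "nomorework"])

    # Step 2: abort every work unit separately so each one reports its error.
    for name in task_names:
        plan.append(["boinccmd", "--task", project_url, name, "abort"])

    # Step 3: suspend the other projects so the aborted work gets flushed.
    for url in other_project_urls:
        plan.append(["boinccmd", "--project", url, "suspend"])

    # Step 4 (resume/suspend another project if things get stuck) is manual.

    # Step 5: update the affected project so the aborted results are reported.
    plan.append(["boinccmd", "--project", project_url, "update"])

    # Step 6 (wait for the aborted work to disappear) is manual.

    # Step 7: reset the project, only if a reload is wanted.
    if reset:
        plan.append(["boinccmd", "--project", project_url, "reset"])

    # Step 8: allow new work again.
    plan.append(["boinccmd", "--project", project_url, "allowmorework"])

    # Step 9: resume the other projects.
    for url in other_project_urls:
        plan.append(["boinccmd", "--project", url, "resume"])

    return plan


if __name__ == "__main__":
    # Hypothetical URLs and work unit names, for illustration only.
    demo = flush_and_reset_plan(
        "https://example.org/affected_project/",
        ["example_wu_1", "example_wu_2"],
        ["https://example.org/other_project/"],
    )
    for command in demo:
        print(" ".join(command))
```

Passing `reset=False` gives the short procedure (abort and report only), which matches the case where the project just wants the error reports.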
©2024 University of Washington
https://www.bakerlab.org