Message boards : Number crunching : Information on Ver 4.97 errors
Author | Message |
---|---|
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I have just received this message from David Kim, who is working on the version 4.97 error issue as I write this message. I just reverted back to the previous app. You should notice a version 4.98 now, which is really version 4.83 for Windows and Mac, and 4.82 for Linux. You should all see some relief very soon. Your systems should update by themselves when the version change takes place, but if not, please do a manual update. Moderator9 ROSETTA@home FAQ Moderator Contact |
Dave Wilson Joined: 8 Jan 06 Posts: 35 Credit: 379,049 RAC: 0 |
Should we abort the work units that are going to use 4.97? |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Sounds like "reset project" from the projects tab. This basically aborts any WUs and reloads the application code. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away? |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away? They will be used. For what it is worth, the Mac computers are not having any of these problems, so resetting the project is not universally required. There are also some Windows and Linux systems that are not having trouble at this time. Moderator9 ROSETTA@home FAQ Moderator Contact |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 274 |
My machines both run Windows, (one NT4, the other XP), both have seen errors, but both have also run 4.97 to normal completion. Before I disabled Rosetta, I had 6 failures and 4 normal with 4.97. It's running again now with 4.98, good job team. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
simpe73 Joined: 20 Feb 06 Posts: 4 Credit: 438,570 RAC: 0 |
What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3/40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs. |
Jimi@0wned.org.uk Joined: 10 Mar 06 Posts: 29 Credit: 335,252 RAC: 0 |
Tried a project reset, any new WU fails immediately with: <core_client_version>5.2.13</core_client_version> <message>CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20) </message> What's happening there? |
Cureseekers~Kristof Joined: 5 Nov 05 Posts: 80 Credit: 689,603 RAC: 0 |
What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3/40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs. As I've read, these jobs and the engine are tested on the test environment (RALPH). But later, when they were moved to the normal Rosetta environment, the errors came up. So it was unforeseen ... Every application, every DC project, every environment has its problems. We can only thank David (and others?) for reacting so quickly and restoring the previous version. This even during a weekend! I guess we'll get more comments from David on Monday in his weblog? Member of Dutch Power Cows |
Betting Slip Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0 |
As I've read, these jobs and the engine are tested on the test environment (RALPH). AMEN to that. |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
As I've read, these jobs and the engine are tested on the test environment (RALPH). People crunching Ralph saw and reported the same high error rate that people crunching Rosetta are seeing. I have no idea why they went ahead and released this stuff on Rosetta. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 274 |
What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3/40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs. Reading the other thread, it would seem that the 4.97 app worked fine with the WUs it had been given. It was then released. It was not until a different set of WUs hit that code that the problems first appeared, both in RALPH and, sadly, in the production project. It is quite possible the new WUs hit a thread of code that had not been run before. These things happen in the best software; testing for absolutely every eventuality tends to add serious delays, and is really only justifiable in safety-critical applications, which this is not. We are here to help these guys with their science. If the new science app delivers better results, then we all win! I'm sure they'll fix this quickly. The suggestion to roll out application changes early in the week is a decent idea though. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
IceQueen41 Joined: 24 Jan 06 Posts: 1 Credit: 65,113 RAC: 0 |
Not so sure that everything is working with 4.98... I've got 2 WUs going (both of the "7449_largescale..." type) that have been going for about an hour and a half, and are still only at 1.14% and 1.40% (my WU time is set to 2 hours). At this rate they won't finish even in a week. Anyone else having these problems or have any idea what's going on with these? |
Buffalo Bill Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0 |
I'm running one of those too. The protein is rather large. I believe that regardless of the time you have set for your target cpu time, it will complete one full model before it uploads. This seems to be a relax only model. I don't know why but hey, I don't have a PhD in microbiology either. :) Edit: The above post by Moderator9 is exactly why I will be staying with this project. Stuff happens with this kind of research and it's "all about the science". A little instability and a few lost credits are nothing compared to the big picture here. |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Not so sure that everything is working with 4.98... I've got 2 WUs going (both of the "7449_largescale..." type) that have been going for about an hour and a half, and are still only at 1.14% and 1.40% (my WU time is set to 2 hours). At this rate they won't finish even in a week. Anyone else having these problems or have any idea what's going on with these? A large number of the errors are work unit related. As a result the application release will fix a lot of the issues, but there will be some time required for everything to settle out. David Kim is working the problem, and I would expect a statement from Dr. Baker on Monday with more details. The application was very stable in Ralph for a number of the original bug issues and that is why they released it to the production environment. For some reason the problems have not affected all machines equally. For instance, Mac OS is not having any real problems, and the majority of Windows machines are working with some increase in error rate. The problem seems to be a mixed bag of issues with the new work unit types, and some issue with the application for particular systems. This kind of problem is why what Rosetta is trying to achieve has not been done before. Many BOINC projects are quite stable because the nature of what they are doing is well established, understood and remains the same across ALL of the work they do. Rosetta is not like that. This is a true research project, where everything from the approach to the work, to the actual work itself, and the design of the application is changing to accommodate new concepts and theories. While there are other protein research projects, the entire approach at Rosetta is different. Rosetta is trying to model whole proteins. The simple ones work fine, but the complex ones are tricky and that is where the problems come in. Last year's CASP competition showed that Rosetta is on the right track. But there will always be issues that arise in pure research such as this. 
Thanks to those of you who contacted the project directly through the moderator e-mail, the project team was able to jump on this and implement a repair. Moderator9 ROSETTA@home FAQ Moderator Contact |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
Moderator9: Last year's Casp CASP happens every 2 years. The last one finished in Oct of 2004. The results were released in December. Then they give the researchers a year to work on improvements, and they hold another competition. The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions. And after all the HBLR failures on Windows client 4.97, I picked up HB_BARCODE_30_1aiu__351_20403_1 and it's worked fine for the last 19ish hours. So I haven't been upgraded to 4.98 (4.83) yet. |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Moderator9: Last year's Casp Not my first typo of the day. You are correct. I meant to say "the last CASP". Sorry. Moderator9 ROSETTA@home FAQ Moderator Contact |
Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
I'm running one of those too. The protein is rather large. I believe that regardless of the time you have set for your target cpu time, it will complete one full model before it uploads. This seems to be a relax only model. I don't know why but hey, I don't have a PhD in microbiology either. :) You are absolutely right - these 7447_largescale_** jobs are relax only jobs of some relatively larger proteins. Since these proteins are larger, each job will take longer to finish. According to our current statistics, the average CPU time to finish such a job can be anywhere from 2 to 4 hours. |
ecafkid Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0 |
4/9/2006 10:03:52 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_425_4170_0 ( - exit code -1073741819 (0xc0000005)) 4/10/2006 12:42:42 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_426_3929_0 ( - exit code -1073741819 (0xc0000005)) These 2 errored on 4.97. I have graphics turned off and "leave in memory" on. This is the only DC project I run. Since turning off graphics these are the first errors I have encountered. Ecaf |
Jeff Gilchrist Joined: 7 Oct 05 Posts: 33 Credit: 2,398,990 RAC: 0 |
The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions. Which one is that, distributed folding? I'm not sure if they are ever coming back... |
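
For anyone puzzling over the exit codes in ecafkid's log lines: the decimal value -1073741819 and the hex value 0xc0000005 are the same 32-bit number, printed once as a signed integer and once as unsigned hex. On Windows, 0xC0000005 is the access violation status code. Below is a minimal Python sketch of that conversion, assuming the log format of the two lines quoted in this thread; the names `to_unsigned32` and `parse_exit_code` are hypothetical helpers for illustration and are not part of BOINC or Rosetta.

```python
import re

# Matches the "exit code <signed decimal> (<hex>)" fragment seen in the
# log lines quoted in this thread. The pattern is an assumption based on
# those two lines only, not on any documented BOINC log format.
EXIT_CODE_PATTERN = re.compile(r"exit code (-?\d+) \((0x[0-9a-fA-F]+)\)")


def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def parse_exit_code(line: str):
    """Return (signed, unsigned) exit codes from a log line, or None."""
    match = EXIT_CODE_PATTERN.search(line)
    if match is None:
        return None
    signed = int(match.group(1))
    return signed, to_unsigned32(signed)


if __name__ == "__main__":
    line = (
        "4/9/2006 10:03:52 PM|rosetta@home|Unrecoverable error for result "
        "HBLR_1.0_1di2_425_4170_0 ( - exit code -1073741819 (0xc0000005))"
    )
    signed, unsigned = parse_exit_code(line)
    print(signed, hex(unsigned))
```

Masking with `0xFFFFFFFF` is the usual way to get the two's-complement view of a negative number in Python, whose integers are otherwise unbounded. The code only identifies what kind of failure was reported; it says nothing about why 4.97 crashed on these work units.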
©2024 University of Washington
https://www.bakerlab.org