Message boards : Number crunching : again on computation error
Author | Message |
---|---|
Alessandro Freda Joined: 17 Dec 05 Posts: 2 Credit: 410,881 RAC: 0 |
One question: is it true, as was written somewhere (I don't remember where), that changing the "Processor usage" preference "Leave applications in memory while preempted" from No to Yes can help solve the problem? I have now made this change on my account, but I cannot yet say whether there is an improvement. In case it is useful (I hope the developers can investigate all the errored results), these are the names of the WUs that failed for me over the last day: 1hz6A_topology_sample_207_9720_3, 1ogw__topology_sample_207_489_3, 1hz6A_topology_sample_207_7251_4, DEFAULT_1di2_220_3000_0. All failed with exit code -1073741819 (0xc0000005). The first three used about 0 CPU time; the last one instead wasted 2 or 3 hours of CPU time. Regards, Alessandro |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
[quote]One question: is it true, as was written somewhere (I don't remember where), that changing the "Processor usage" preference "Leave applications in memory while preempted" from No to Yes can help solve the problem?[/quote] It solves some problems; it does not solve them all. I am getting the same exit code as you, and I use the Yes setting. |
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
[quote]One question: is it true, as was written somewhere (I don't remember where), that changing the "Processor usage" preference "Leave applications in memory while preempted" from No to Yes can help solve the problem?[/quote] The first three are clearly from the bad batch mentioned in other threads. There is nothing you can do about those. The last one is not from that batch, but without more details it's hard to tell what caused the error. Changing the setting to "yes" would probably have prevented the error if it happened during benchmarks or while switching to another application. *** Join BOINC@Australia today *** |
Mike Smith Joined: 27 Dec 05 Posts: 2 Credit: 3,913 RAC: 0 |
Not sure if I am on the same wavelength here, but not a single unit that I've done has actually given me a result; I'm always getting client errors. This happens only on Rosetta; I'm doing other projects just fine. Any ideas? |
STE\/E Joined: 17 Sep 05 Posts: 125 Credit: 4,099,913 RAC: 601 |
I had 1 WU yesterday that I decided to suspend while I ran some WUs from another project. As soon as I suspended the WU, it gave me a computation error. It seems the Rosetta WUs are very touchy and error out at the least little thing, IMO. I've found that running the Rosetta project by itself is the best way to go, and all you can do is hope and pray that you don't have to suspend the project or exit the BOINC Manager while running it. If you have to do either of those things, it's a toss-up whether the WU will crash or not. Because of the long time between checkpoints, it's best to suspend or exit the Manager only when you actually see the WU advance to the next % point; if you don't, you may lose upwards of an hour or more of crunching time anyway ... |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
[quote]Not sure if I am on the same wavelength here, but not a single unit that I've done has actually given me a result; I'm always getting client errors. This happens only on Rosetta; I'm doing other projects just fine. Any ideas?[/quote] Mike, I have a total of 28 boxes on BOINC (not all owned by me, by the way), of which 25 were crunching Rosetta over the holidays. Four of those boxes got almost no credit at all, including one that got none. Most of the rest of my boxes lost some credit at some stage, but it is very variable. Is this the luck of the draw, with the bad wu going out in clusters? Is it the case that some bad wu 'poison' the box for others? Those are two good guesses, but none of us really knows. In your position I would first make sure that the 'keep in memory' setting is 'yes'. Assuming that is so, I would be likely to try resetting the project the next time a wu fails. Before you do that, set nomorework and then update the project so the failed wu gets reported. The reset forces all the project-dependent files to be downloaded again. Then, if it happens yet again, I'd be inclined to detach and re-attach, either re-attaching right away or in mid-January if you are fed up with getting nowhere. This downloads all the project-dependent files, but also gives you a new set of host records in the Rosetta databases. (Later on you can merge the old host into the new one.) Resetting and re-attaching will only help if there is some history-dependent effect in all of this, but it won't do harm even if it turns out to be a waste of the extra downloads. In my view, even if it is a user settings issue, it is not a good time to troubleshoot settings while some of the wu are still dodgy. Thanks for your patience - I would not have stuck around if my worst box had been my only one. I *would* have come back in mid-January, because I really like the attitude of the project staff. River~~ |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
[quote]I had 1 WU yesterday that I decided to suspend while I ran some WUs from another project. As soon as I suspended the WU, it gave me a computation error.[/quote] Hmmmmm... I use suspend from time to time. I wonder how many of my problems have been caused by that. The problem with stopping BOINC is known; it is linked to the keep-in-memory issue (obviously BOINC does not keep tasks in memory if BOINC itself stops!). Likewise, when keep in memory = no, you'd expect suspend to trigger an error, as the task is taken out of memory. If you have keep in memory = yes and still see problems with suspend, then I think it is one more thing worth reporting. Your general point is just right - Rosetta is a fragile set of software at present. It will get better as Jack & DavidK find the bugs. We have certainly given the bug-hunters a target-rich environment this holiday! River~~ |
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
Further to what I wrote below... I run both SETI and Rosetta on most of my computers (currently on a 50/50 basis). I have "Leave applications in memory while preempted" set to "yes" and (apart from the recent bad batch) rarely have a problem. I can suspend individual work units. I can suspend Rosetta while it's running a work unit. I can switch to SETI. Not a problem. But if I set the above to "no", I will get many crashes, guaranteed. *** Join BOINC@Australia today *** |
Mike Smith Joined: 27 Dec 05 Posts: 2 Credit: 3,913 RAC: 0 |
I have set it to YES. I am also running SETI and Predictor at 33% each and have had no problems at all with these other two over the past several years. |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
[quote]I have set it to YES. I am also running SETI and Predictor at 33% each and have had no problems at all with these other two over the past several years.[/quote] In that case you have probably just had an unlucky selection of wu. It's worth a reset of Rosetta, I'd say. R~~ |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,177 RAC: 17 |
[quote]I have set it to YES. I am also running SETI and Predictor at 33% each and have had no problems at all with these other two over the past several years.[/quote] No project is "perfect" - but the three that have the MOST complaints seem to be SETI, Predictor, and SZTAKI. If you've been able to function well with SETI and Predictor, I would almost guarantee that your current Rosetta problems are very temporary. |
Dirk Broer Joined: 16 Nov 05 Posts: 22 Credit: 3,328,437 RAC: 3,521 |
[quote]I have set it to YES. I am also running SETI and Predictor at 33% each and have had no problems at all with these other two over the past several years.[/quote] My last 24 Rosetta WUs gave only ONE normally completed result; the rest were all computation errors. I never had that with SETI or Predictor. Their problems seem more network/hardware related, whilst Rosetta seems to have corrupt data to begin with. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,177 RAC: 17 |
[quote]My last 24 Rosetta WUs gave only ONE normally completed result; the rest were all computation errors. I never had that with SETI or Predictor. Their problems seem more network/hardware related, whilst Rosetta seems to have corrupt data to begin with.[/quote] Just looked at your results - the errors LOOK like the "application left in memory when preempted = no" errors. If that setting isn't "yes", then Rosetta _will_ fail, frequently, as covered in this and many other threads... it's a known bug that they're chasing. (Much less fatal than Predictor's "fortran error", or "cypa" WUs that ran for longer than the extremely short deadline, or...) There was also a bad batch of WUs released, which at this point are all cleared out, unless you still have one in your cache from before today. |
Dirk Broer Joined: 16 Nov 05 Posts: 22 Credit: 3,328,437 RAC: 3,521 |
[quote]My last 24 Rosetta WUs gave only ONE normally completed result; the rest were all computation errors. I never had that with SETI or Predictor. Their problems seem more network/hardware related, whilst Rosetta seems to have corrupt data to begin with.[/quote] But the setting is 'YES'! (Correction: that is to say, it appeared so when viewed in the Windows Task Manager, where Rosetta is active no matter which other BOINC project is running. I have just altered my settings in the preferences to see whether this will have some benefit.) I have no known 'bad' WUs, so I'll keep my fingers crossed. |
Mike Gelvin Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
[quote]If that setting isn't "yes", then Rosetta _will_ fail, frequently, as covered in this and many other threads...[/quote] This isn't entirely accurate... If Rosetta is the only project a computer is attached to, or if the other projects are suspended, then Rosetta will get 100% of the computer's resources and this setting does not matter. Rosetta WUs are indeed having problems right now. As more and more "bad" work units get recycled, the ratio of bad to good tilts very much in favor of the bad. We will just have to work through them. The project leaders have said they will sort this out after the holiday. |
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
[quote]Rosetta WUs are indeed having problems right now. As more and more "bad" work units get recycled, the ratio of bad to good tilts very much in favor of the bad. We will just have to work through them. The project leaders have said they will sort this out after the holiday.[/quote] Not quite true now. As said in another thread, all the bad ones appear to have been set to 'cancelled', so once sent back they should not go out again.... |
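
A note on the exit code quoted in the first post: -1073741819 and 0xc0000005 are the same 32-bit value, once read as a signed integer and once as unsigned hexadecimal. On Windows, 0xC0000005 is the NTSTATUS code STATUS_ACCESS_VIOLATION, meaning the process touched memory it was not allowed to access. That confirms the application crashed, but it does not by itself say why. The following is a minimal Python sketch of the conversion; the helper names and the small lookup table are illustrative assumptions and are not part of BOINC or Rosetta.

```python
# Illustrative helper only; these names are hypothetical and do not
# come from BOINC or Rosetta.

# A few well-known Windows NTSTATUS crash codes (deliberately incomplete).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code: int) -> str:
    """Format an exit code as decimal, hex, and a status name if known."""
    unsigned = to_unsigned32(exit_code)
    name = KNOWN_NTSTATUS.get(unsigned, "unknown status")
    return f"{exit_code} = 0x{unsigned:08x} ({name})"


if __name__ == "__main__":
    # The exit code reported for the failed work units in this thread.
    print(describe_exit_code(-1073741819))
```

Run as a script, this prints `-1073741819 = 0xc0000005 (STATUS_ACCESS_VIOLATION)`, matching the pair of numbers shown in the client error message.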
©2024 University of Washington
https://www.bakerlab.org