Message boards : Number crunching : Loads and loads of computing errors today
Author | Message |
---|---|
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
There must be something wrong at the server side. All of my machines are Macs. One is a Dual G4, one is a Laptop G4, and the third is a Dual G5. The Dual G4 is producing a very high error rate, the G5 has a few but not as many, and the Laptop is having no problems. I have changed nothing on my end. All that has changed is the type of WU (if the name means anything). The Random_length and Random_Gauss types seem to be a problem. Since the BOINC client is nothing more than a scheduler for the application, it would not be the problem here. It is either the application or the WU. Since the application has changed, I would think that is the place to start (hello David). Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
[quote]There must be something wrong at the server side. All of my machines are Macs. One is a Dual G4, one is a Laptop G4, and the third is a Dual G5. The Dual G4 is producing a very high error rate, the G5 has a few but not as many, and the Laptop is having no problems. I have changed nothing on my end. All that has changed is the type of WU (if the name means anything). The Random_length and Random_Gauss types seem to be a problem. Since the BOINC client is nothing more than a scheduler for the application, it would not be the problem here. It is either the application or the WU. Since the application has changed, I would think that is the place to start (hello David).[/quote] Can you restart your client on the Dual G4 and see what happens? |
mscharmack Send message Joined: 29 Sep 05 Posts: 2 Credit: 11,323 RAC: 0 |
About 99% of the WU's downloaded since 21 Oct 2005 ~19:15 UTC have ended with an unrecoverable error. Nothing has changed on my computers, and with so many people having the same problem, it has to be a problem with Rosetta@home work units. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Well, just to be contrary, I have not had an error since the 22nd ... Running on Windows XP Pro and OS-X (G5 - Tiger) ... Windows machines are AMD Athlon 64, Xeon (32-bit and 64-bit), P4 (HT and non-HT) ... So, I don't get it ... As I said, I had one that "stuck" but restarting the BOINC Client Software looks like it "cured" that one. For others that I suspected of taking too long, a suspend/resume seemed to work ... Obviously our mileage is varying ... |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
[quote]About 99% of the WU's downloaded since 21 Oct 2005 ~19:15 UTC have ended with an unrecoverable error. Nothing has changed on my computers, and with so many people having the same problem, it has to be a problem with Rosetta@home work units.[/quote] >I had a look at your computer benchmarks, as I have similar AMD units, and you appear to be using BOINC ver. 4.19 or are overclocked. If you are still using 4.19 you should upgrade to BOINC ver. 5.2.2 and that should solve the problem. (I had the same problems on 4 boxes.) Hope this helps....Cheers, Rog. |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
[quote]I had 3 out of 4 WUs end in computation errors.....[/quote] >If you are still using BOINC ver. 4.19 an upgrade to ver. 5.2.2 should solve your problems.....Cheers, Rog. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
David, We need to put a restart capability into the science application. I just had another couple of work units spend 2-5 hours at 1% completion... restarting the client seems to fix them (well, I have one that may still be hung: after a restart it is at 8 minutes and still 1%). I think if it has spent an hour (ha! There it went up!) at 1%, it should be halted, completely unloaded, restarted, and if it hangs 3 times like that ... well ... something is bad. But this is a pretty hefty waste of resources, as it can "sneak up" on you if you don't watch it. Heck, if I had not been unable to sleep, these might have tried to run all night doing nothing ... As most of them seem to start within 10 minutes, we might try a lower limit of 20 minutes ... but, your call ... |
[BAT] tutta55 Send message Joined: 16 Sep 05 Posts: 59 Credit: 99,832 RAC: 0 |
Sorry to contradict you, Roger. But some people running 4.45 also have the problem. Just take a look at the WU I refer to in the message I started this thread with. There are many similar cases where both 4.19 and 4.45 result in an error, albeit with a different error message. And the problem is indeed with the new Rosetta app, since I never had it with their 4.77 version. If they now require version 5 of the BOINC client software, that is fine with me. But then it should be clearly stated, and if possible imposed by the server. If not, well, I think the problem should be fixed. People running older versions of the BOINC client may have good reasons to do so. BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning Tutta55's Lair |
[BAT] tutta55 Send message Joined: 16 Sep 05 Posts: 59 Credit: 99,832 RAC: 0 |
[quote]As most of them seem to start within 10 minutes, we might try a lower limit of 20 minutes ... but, your call ...[/quote] @Paul: 20 minutes would be a bit too low. Good ol' me has a PIII 800MHz and the WUs named sim_aneal take about 50 minutes to get past the 1% barrier :-) Additionally, if this auto restart is implemented, it would be nice if the CPU time already spent was not reset, but added to the total processing time. BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning Tutta55's Lair |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
[quote]Sorry to contradict you, Roger. But some people running 4.45 also have the problem. Just take a look at the WU I refer to in the message I started this thread with. There are many similar cases where both 4.19 and 4.45 result in an error, albeit with a different error message.[/quote] >I can't argue with your logic, as my problems obviously started with R@H 4.78 as well. I edited out my comment.... I guess I should have been more precise when I said it wasn't the Rosetta app. In my experience it isn't the Rosetta app. if you upgrade to BOINC 5.x. Your point is well taken, though, as some people may not want to upgrade.....Cheers, Rog. |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
[quote]In my experience it isn't the Rosetta app. if you upgrade to BOINC 5.x. Your point is well taken, though, as some people may not want to upgrade.....Cheers, Rog.[/quote] My Athlon XP 3000+ has the 5.2.2 BOINC client and trashes about 2 out of 10 work units. Most of them are of the 0xC0000005 type but there have been a few others (error 1 and error -164). No real pattern to it - sometimes it's every second WU, sometimes it's three in a row, sometimes all is well for half a dozen. It has 1GB RAM and runs Rosetta exclusively (apart from limited normal use of the computer). It had a hardware problem (disk drive cable), which showed up in Windows event logs (Windows XP Pro SP2), but that's been fixed. I still get problems with Rosetta WUs crashing out and there are no messages in the event logs indicating there's any hardware/software problem at the time of these crashes. Maybe I should let it run dry, then reset the project or even detach, uninstall BOINC, and start from scratch? The only other thing that ~may~ be an issue, which I have seen on another PC, but not this one, was that when the clock (time) was adjusted, a WU crashed. Could it be related? *** Join BOINC@Australia today *** |
[AF>Belgique]Mamouth Send message Joined: 18 Sep 05 Posts: 4 Credit: 580,683 RAC: 0 |
My 50 cents: on my P4 1.5 GHz Win2K machine at work I have never had any error. At home, on a P4 3.0 GHz HT with WinXP, I get a lot of errors. Both are using CC 5.x. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
[quote]There must be something wrong at the server side. All of my machines are Macs. One is a Dual G4, one is a Laptop G4, and the third is a Dual G5. The Dual G4 is producing a very high error rate, the G5 has a few but not as many, and the Laptop is having no problems. I have changed nothing on my end. All that has changed is the type of WU (if the name means anything). The Random_length and Random_Gauss types seem to be a problem. Since the BOINC client is nothing more than a scheduler for the application, it would not be the problem here. It is either the application or the WU. Since the application has changed, I would think that is the place to start (hello David).[/quote] David, I have already tried that. All of the errors seem to be on WUs of the type "1hz6A_abrelaxmode_random_length05_16882" and only on the G4 Dual. The only error on the Dual G5 was of type "1hz6A_abrelaxmode_random_gauss_sim_aneal_00047". I will restart the entire system again and see if it makes a diff, but so far restarting BOINC has not changed a thing. These errors started on the 24th. Before that all was well on all three systems. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
I haven't had any problems the past many days, since I've had the WU's left in memory, for what it's worth. The WU's are fast as lightning, but they come and go without any problems. Only one WU vanished into thin air after a restore of my system. :-( [b]"I'm trying to maintain a shred of dignity in this world." - Me[/b] |
kb7rzf Send message Joined: 7 Oct 05 Posts: 16 Credit: 35,427 RAC: 0 |
Since I started with this project on Oct 15th, I have only had 2 WU's that errored, all others have been just fine. I'm running a Dell Dimension computer, with an Intel Celeron 2.6GHZ, 512mb Ram, WinXP Home, BOINC 5.2.2, and I have my preferences set to leave the project in memory. Dunno how helpful that info is but there ya have it. :-) Jeremy |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Ok, so just to be clear. I have three systems working on R@H. All three are Macs. One is a 2GHz G5 dual CPU, one is a 1.4 GHz G4 Dual CPU, and one is a Powerbook 1GHz G4. All three are running Mac OS 10.4.2. All three computers are running the same version of BOINC, MacNN 4.44 Superbench, and they are all using the R@H 4.77 client app. The G5 Dual is running E@H, R@H, P@H, and S@H. The G4 Dual is running P@H, R@H, S@H, and CP@H (it is also attached to E@H and XtremLab but they are suspended and have been for some time now). The Powerbook is running S@H, P@H and R@H. All three systems were running fine until the WUs distributed after 23 Oct 2005 23:09:31 UTC. That's when the errors started. None of the other apps are having any problems on any of the systems. The G5 and the Powerbook are not having any problems except the occasional client error, but they are not common. The G4 Dual gets a client error on every WU from R@H. I have tried restarting the BOINC client, I have tried restarting the computer (which of course restarts the BOINC client), and I have tried resetting the R@H project. I am still getting client errors on every R@H WU. Sometimes they error in just a few seconds, and sometimes they error after an hour or so. Most of these WUs are "1hz6A_abrelaxmode_random_length05_xxxx" type, and some are "1hz6A_abrelaxmode_random_gauss_cntrlx_xxxx", but all the errors are one or the other of these two types. After the reset the system downloaded two more WUs (1pvaA_abrelax_68232, 1pvaA_abrelax_66394). It would seem to me that if these two WUs process ok, that this should tell us all something. It would seem to me that the BOINC client is not the problem (all three are the same). It would also seem to me that there is some problem in terms of compatibility with a Dual G4 Mac and your application where the "Length" and "Gauss" type WUs are concerned. Perhaps some element of dual CPUs is not compatible in all cases with all WU types. All other system conditions have remained constant on these systems. Now if the G4 Dual processes the two WUs it now has with no problem, then it is most likely the WUs that are causing the problem. If they fail it would seem to me that there is something wrong with the application compile. Perhaps a DEV flag not set right for the G4 Dual system during compile. In any case the BOINC client does nothing more than scheduling and keeping score, and in my case I do not believe the BOINC client is the problem. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
It could very well be a compiler issue for the G4 dual CPU. In order to support 10.3.9 we had to recompile the gcc4 compiler (Rosetta has issues with gcc3.3) on a 10.3.9 machine, since the gcc4 compiler that comes with Xcode2 limits apps to OSX10.4 (unless the cross-dev SDK is used, but it didn't work when I tried it). This has helped overall since we are now getting results from people with 10.3.9 and the success rates have increased dramatically. The drawback is that we can't take advantage of the Mac-specific optimizations. The Rosetta BOINC code will be available soon if anyone is interested in helping us debug and optimize. I'll be sure to post it on the news when it is available. Snake_doctor, I would stop R@h on your dual G4 and dedicate it to the other projects until we get a fix (if you haven't already). |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
[quote]It could very well be a compiler issue for the G4 dual CPU. In order to support 10.3.9 we had to recompile the gcc4 compiler (Rosetta has issues with gcc3.3) on a 10.3.9 machine, since the gcc4 compiler that comes with Xcode2 limits apps to OSX10.4 (unless the cross-dev SDK is used, but it didn't work when I tried it). This has helped overall since we are now getting results from people with 10.3.9 and the success rates have increased dramatically. The drawback is that we can't take advantage of the Mac-specific optimizations. The Rosetta BOINC code will be available soon if anyone is interested in helping us debug and optimize. I'll be sure to post it on the news when it is available. Snake_doctor, I would stop R@h on your dual G4 and dedicate it to the other projects until we get a fix (if you haven't already).[/quote] David, You know one of the reasons I upgraded this machine to 10.4 was to do Rosetta. I will let it finish the two it is working on now just so we can see if the problem is WU related. Perhaps two versions of the app would solve the problem. It seems it was working fine before 4.77 on all but the 10.3.9 systems. So maybe the previous version for those with 10.4.x and the new one for 10.3.9? Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
It's uncanny. I decided to run PPAH for a while on my one PC that has regular problems with Rosetta (Athlon XP 3000+, Win XP Pro SP2, 1GB RAM, BOINC 5.2.2) to see if it's the PC, BOINC or Rosetta app causing the errors. https://boinc.bakerlab.org/rosetta/result.php?resultid=429221 Perhaps too early to tell, but it did 6 PPAH work units in a row, no problem. It went back to Rosetta and BANG - unrecoverable error, trashed the WU (over an hour's crunching wasted). And yes, I keep the WU in memory. It's started on the next Rosetta WU - will see how that goes. I'll keep an eye on it but if this keeps up, I may have to get this PC running something else, even though on paper it is more than qualified to run Rosetta. No problems on my other systems (3*P4 and 1*Athlon 64) EDIT: nothing in the Windows event logs at the time of the WU crashing and the computer did not crash or get rebooted. *** Join BOINC@Australia today *** |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
[quote]It's uncanny. I decided to run PPAH for a while on my one PC that has regular problems with Rosetta (Athlon XP 3000+, Win XP Pro SP2, 1GB RAM, BOINC 5.2.2) to see if it's the PC, BOINC or Rosetta app causing the errors.[/quote] For what it is worth, and it probably has already been thought of, the problem seems so spotty and seemingly random (and I know that it is being looked into to see if it is not random), that I wonder if R@H causes some boxes to run hot, and that trashes the unit? Just a thought... Regards, Bob P. |
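
For anyone who wants to experiment with the restart idea Paul D. Buck raised earlier in the thread (halt and restart a work unit that sits at the same percentage for an hour, and give up after three hangs), here is a rough Python sketch of just the decision logic. It is not BOINC or Rosetta code: the class name `StallWatchdog`, its fields, and the `tick` method are all made up for illustration, and the caller is assumed to poll the reported fraction done and the CPU seconds used since the last poll. The stall limit is a parameter because, as tutta55 pointed out, a slow machine can legitimately need about 50 minutes to pass 1%, and the total CPU time is kept across restarts rather than reset.

```python
from dataclasses import dataclass


@dataclass
class StallWatchdog:
    """Hypothetical stall detector for one work unit (not a BOINC API)."""

    stall_limit_s: float = 3600.0  # CPU seconds without progress before acting
    max_hangs: int = 3             # on this many hangs, stop retrying
    last_progress: float = 0.0     # highest fraction done seen so far
    stalled_s: float = 0.0         # CPU seconds since progress last increased
    total_cpu_s: float = 0.0       # accumulated across restarts, never reset
    hangs: int = 0                 # number of stalls detected so far

    def tick(self, progress: float, cpu_delta_s: float) -> str:
        """Record one poll and return 'continue', 'restart', or 'abort'."""
        self.total_cpu_s += cpu_delta_s
        if progress > self.last_progress:
            # The work unit moved forward, so the stall timer starts over.
            self.last_progress = progress
            self.stalled_s = 0.0
            return "continue"
        self.stalled_s += cpu_delta_s
        if self.stalled_s < self.stall_limit_s:
            return "continue"
        # No progress for the whole limit: count a hang and decide.
        self.stalled_s = 0.0
        self.hangs += 1
        return "abort" if self.hangs >= self.max_hangs else "restart"


if __name__ == "__main__":
    # Simulate a work unit that reaches 1% and then never moves again,
    # polled every 10 CPU minutes.
    wd = StallWatchdog(stall_limit_s=3600.0, max_hangs=3)
    print(wd.tick(0.01, 600.0))
    for _ in range(18):
        print(wd.tick(0.01, 600.0))
    print("total CPU seconds:", wd.total_cpu_s)
```

How the application would actually be unloaded and restarted, and whether it resumes from a checkpoint, is left out on purpose; that part depends on the science application and is not something this thread documents.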
©2024 University of Washington
https://www.bakerlab.org