Message boards : Number crunching : 1% for 37 hours
Author | Message |
---|---|
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
Thanks for that, will give it a try the next time I see one. |
AnRM Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
We have been 1% free for weeks now but in the last day we have had two....has something changed?? Or is it just a random thing as this thread suggests? We also lost about 20 hours total when we could have been doing useful work. Any closer to finding the cause, I wonder?? |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
It appears to be a random event. I could not reproduce this using the exact same input and random seed from the examples sent to me so it will be very hard to debug. If the source code gets released (see this thread), which will most likely happen sometime in the future along with redundancy, this bug will be a good candidate for developers out there to try to fix. |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
|
AnRM Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
It appears to be a random event. I could not reproduce this using the exact same input and random seed from the examples sent to me so it will be very hard to debug. If the source code gets released (see this thread), which will most likely happen sometime in the future along with redundancy, this bug will be a good candidate for developers out there to try to fix. >OK, David....thanks for the prompt reply.....we have dropped our connection rate to .1 days and will monitor our boxes more closely (and hope for the best)....Cheers, Rog. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
To get a better knowledge of this problem, redundancy could help. I think, it would be interesting to see if it fails on all machines, if the problem appears on all machines ... Unfortunately, in *MY* experience with the 1% work units, they have *ALL* run successfully after a restart. So, it *MAY* be due to random variations in the flux capacitors ... Even running on a different computer may not "prove" anything. Unless using the same random seed *AND* an identical CPU/FPU, well, you are going to see different behavior out of the models. And when I say identical, I mean identical down to the last transistor. That means the same stepping etc. Compliance with the IEEE 754 (and later) standards does *NOT* imply identical results output. You *WILL* see variations in the outer fringes of precision. Even more, successive runs can still result in differences *IF* the FPU's operation is partially dependent on prior states. In other words, if it is not in the same state at the restart it *MAY* perform differently the second time through even though you THOUGHT you started at the same point. Oh, and the random cosmic ray can also "flip" a bit ... :) |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I caught a stuck wu on my laptop a while back and re-ran it manually with the same seed and it didn't get stuck. |
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
I caught a stuck wu on my laptop a while back and re-ran it manually with the same seed and it didn't get stuck. I get the odd one still and (as far as I know) restarting BOINC has always fixed it. Not everybody watches progress however, so there could be CPUs out there that have been (or will be) spinning their wheels for days, weeks, perhaps longer. I'm not a programmer (I do some scripting for websites only) and I understand it may be hard to track the source of this seemingly random problem. I'm wondering though... When a WU is stuck at 1%, is it actually doing anything? Is there some way the app can trap a timeout or error condition and send a signal to BOINC to restart or resume the WU? *** Join BOINC@Australia today *** |
AnRM Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
It would be nice to get to the bottom of this vexing problem. We've had another box lock up at 1%(the third in the last 36 hours). We upgraded to BOINC 5.x on all boxes and that cured things for about 2 weeks. We are spending too much time monitoring and it is difficult to reset remotely housed boxes. Sadly, we will have to withdraw from the project until this is resolved. It's a great project and we will monitor your progress with one box and hope for the best. We will be back with the others once this bug is zapped....good hunting! |
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
I have had 3 or 4 on mine (tried everything to kick start with no luck at all). Not had one for days now tho'. |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
We are spending too much time monitoring and it is difficult to reset remotely housed boxes. You know that BOINCView can help you save a lot of time monitoring your boxes? Supporting BOINC, a great concept ! |
AnRM Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
We are spending too much time monitoring and it is difficult to reset remotely housed boxes. Thanks for the tip, Yeti. I'll give it a try. I see they have released a new science app too. Maybe that will help as well......hope springs eternal!...Cheers, Rog. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Well, maybe when the graphics are available that will give us a clue ... Rom looked for a similarly intermittent problem for like forever. Not a bad error, but it took forever to find out the cause ("no finished file" problem). |
AnRM Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
Well, maybe when the graphics are available that will give us a clue ... I hear you, Paul.....so far no problems with R@H 4.79 on 5 boxes. Keeping my fingers crossed, though, as it was such a random, annoying thing for the Devs (and everyone :))....Cheers, Rog. |
hugothehermit Joined: 26 Sep 05 Posts: 238 Credit: 314,893 RAC: 0 |
I had one that got stuck, suspending / resuming didn't work nor did a BOINC restart. A reboot did. |
AnRM Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
I had one that got stuck, suspending / resuming didn't work nor did a BOINC restart. A reboot did. Hi Hugo. I take it that it was a R@H 4.79 WU that got stuck. If that is the case then I guess we aren't out of the woods yet. Thanks for the info...Cheers, Rog. |
ralic Joined: 22 Sep 05 Posts: 16 Credit: 46,481 RAC: 0 |
I did suggest that a time "cap" be placed on the start up of a work unit, though it was pointed out that the use of a fixed amount of time is not viable... It looks like there is at least some kind of cap present. resultid=1372617 reports "Maximum CPU time exceeded" after 60,359.58 CPU time. It hasn't been sent to another user, perhaps the project team can investigate this one? |
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
Not sure whether it's relevant or not, but I had one that wedged at 1% for a couple of hours. Rather than shut it down, I left it run, but took a quick look inside the stdout.txt file. The last line was this: pre-computing chuck/gunn move set for frag length 1 It's moved on now, but whatever that chuck/gunn move set thingumy is, it sure cogitated on it for a while. I'll save the stdout.txt file in case anyone's interested. |
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
When can you say there is a problem? After how many minutes/hours should you stop and restart the client, or even stop the job? Now running for 20 minutes and still 1%. |
Rebirther Joined: 17 Sep 05 Posts: 116 Credit: 41,315 RAC: 0 |
When can you say there is a problem ? New WUs with _omega_ take much longer than the old ones; my P4 needs 1-1.5h to jump to 20%. It's a little bit confusing because the checkpoints here are 20, 40, 60, 80, 100. Finished some in 2:20h or up to 4h. |
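
Webmaster Yoda's question (can a stuck work unit be detected and restarted?) and ralic's "cap" idea can be sketched as a simple stall detector. This is only an illustration in Python, not anything from the Rosetta or BOINC code: the class name `StallDetector`, the `max_stall_seconds` threshold, and the progress samples are all hypothetical. As Rebirther's post shows, a healthy work unit can legitimately sit at a low percentage for an hour or more, so any fixed threshold would need tuning and could still misfire.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Flags a task whose reported progress has not changed for too long.

    Hypothetical helper for illustration; not part of BOINC or Rosetta.
    """
    max_stall_seconds: float                  # assumed threshold, needs tuning
    last_progress: Optional[float] = None     # last progress fraction seen
    last_change_time: Optional[float] = None  # time of the last change

    def update(self, progress: float, now: float) -> bool:
        """Record a progress sample; return True if the task looks stalled."""
        if self.last_progress is None or progress != self.last_progress:
            # First sample, or progress moved: restart the stall clock.
            self.last_progress = progress
            self.last_change_time = now
            return False
        # Progress unchanged: stalled once the threshold has elapsed.
        return (now - self.last_change_time) >= self.max_stall_seconds


# Example: samples taken at 0 s, 30 min and 60 min, all still at 1%.
detector = StallDetector(max_stall_seconds=3600)
for t in (0, 1800, 3600):
    if detector.update(0.01, now=t):
        print(f"no progress for {detector.max_stall_seconds:.0f} s, consider a restart")
```

A monitor like this only notices that the number has stopped moving; it cannot tell a genuinely wedged work unit from a slow one, which is the objection raised against a fixed time cap earlier in the thread.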
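
On Paul D. Buck's point about the "outer fringes of precision": one well-known reason why bit-identical results are hard to guarantee is that floating-point addition is not associative, so evaluating the same sum in a different order can change the last bits. The Python snippet below is a generic illustration of that rounding effect, with values chosen only for the example. It does not reproduce the 1% hang and says nothing about its actual cause.

```python
# Floating-point addition is not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # a + b is rounded first, then c is added
right = a + (b + c)   # b + c is rounded first, then a is added

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

In a long simulation driven by random moves, a difference in the last bit can be enough to send a run down a different path, which fits with the reports above that a stuck work unit often runs normally after a restart.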
©2024 University of Washington
https://www.bakerlab.org