Message boards : Number crunching : Work unit failures.
Author | Message |
---|---|
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
Since yesterday, I've had four work units run for the regular 12 hours I have set for target run time. Each has obviously run its model several times with different random start points. Each, at completion, has errored out with a status of 0x00000000. 48 hours of work dumped. I've set "No new tasks" on both systems.

Task | Work unit | Computer | Sent | Reported | Status | Run time (sec) | CPU time (sec) | Credit | Application |
---|---|---|---|---|---|---|---|---|---|
1372319356 | 1226215916 | 3117659 | 25 Apr 2021, 6:25:50 UTC | 26 Apr 2021, 12:33:06 UTC | Error while computing | 43,766.03 | 42,977.44 | --- | Rosetta v4.20 windows_x86_64 |
1372042794 | 1226012455 | 3161065 | 24 Apr 2021, 14:39:12 UTC | 25 Apr 2021, 20:18:07 UTC | Error while computing | 43,119.99 | 43,083.17 | 388.00 | Rosetta v4.20 windows_x86_64 |
1371978435 | 1225956023 | 3161065 | 24 Apr 2021, 10:46:43 UTC | 25 Apr 2021, 14:10:27 UTC | Error while computing | 43,156.42 | 43,092.81 | 388.00 | Rosetta v4.20 windows_x86_64 |
1371983541 | 1225891749 | 3161065 | 24 Apr 2021, 9:25:48 UTC | 25 Apr 2021, 13:39:11 UTC | Error while computing | 43,110.57 | 43,042.69 | 387.00 | Rosetta v4.20 windows_x86_64 |

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
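As a quick sanity check on the "48 hours of work dumped" figure, here is a minimal Python sketch that converts the run times of the four failed tasks in the post above from seconds to hours and totals them. The variable and function names are illustrative only; the numbers are the run-time values quoted in the post.

```python
# Run times in seconds, copied from the four failed tasks listed in the post.
run_times_sec = [43766.03, 43119.99, 43156.42, 43110.57]


def to_hours(seconds: float) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / 3600.0


# Per-task run time in hours, rounded to two decimals for readability.
per_task_hours = [round(to_hours(s), 2) for s in run_times_sec]

# Total run time across all four tasks, in hours.
total_hours = round(to_hours(sum(run_times_sec)), 2)

print("Per-task hours:", per_task_hours)
print("Total hours:", total_hours)
```

Each task comes out at roughly 12 hours, matching the 12-hour target run time, and the total is just over 48 hours.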
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
>>> Since yesterday, I've had four work units run for the regular 12 hours I have set for target run time. Each has obviously run its model several times with different random start points. Each, at completion, has errored out with a status of 0x00000000. 48 hours of work dumped. I've set "No new tasks" on both systems.

You're right - it's been noticed by others too. You do get credit awarded when the daily clean-up job runs. They're all "norn_struct_profile_layered_design" tasks but, oddly, it doesn't happen to all of them. When it does happen, the task only shows "upload failure: <file_xfer_error>", indicating it ran OK but something goes wrong after the server receives it back. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
>>> goes wrong after the server

So, is it still the case that the work is lost, or do the files still exist for retrying?

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
>>> goes wrong after the server

I don't honestly know - good question. If I were to guess, it looks like we run them successfully - and we do eventually get credit for them as part of the daily clean-up job the project runs - but the project likely isn't getting its results for the 75% or so that report a Computation Error. That makes it a far bigger problem for them than for us, which is a very good reason to fix it.

Others have previously mentioned it (and I confirm I see the same issue myself), so I've just reported it. I don't know how I've given myself this job - no-one's given it to me - but it seems I've got it, like it or not. I'm reluctant to keep bugging the project guys, so I wait until it's apparent it's a consistent failure rather than a one-off. As long as I don't come across as a moaner or time-waster, I get a very good response, to be fair. They just seem to avoid looking at the forums - a time killer. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
I see that my work units that were in that state, and that are still visible on my results page, have now been credited. The credit is a little "odd", i.e. it is 388 +/- 1 or 2. This is quite a lot below what I would expect; the other tasks still visible on my results page show upper 400s to lower 500s. Everyone knows, of course, that the credit is not of any use.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
>>> I see that my work units that were in that state, and that are still visible on my results page, have now been credited. The credit is a little "odd", i.e. it is 388 +/- 1 or 2. This is quite a lot below what I would expect; the other tasks still visible on my results page show upper 400s to lower 500s. Everyone knows, of course, that the credit is not of any use.

Yes, that's often the way. It's the average for a default machine rather than the host's average. Still, a lot better than last year when it was a flat zero!

Err... you may have noticed that our "norn_struct_profile_layered_design" tasks have now been aborted by the server. You were right that none of the results were getting back to them, so thanks for reporting it. It's still to be investigated, but it's very likely because of a long filename error in the results file, which might explain why some were OK and some weren't. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
>>> The average for a default machine rather than the host's average.

Probably a default machine with a default run time. Both my machines here are 4GHz i7s with 12-hour target run times. I set the run time up a while ago; our network was busy at the time, and it was a simple change that took a little load off it.

>>> aborted by the server

Yes, I saw that. I've allowed new tasks again. There were a number waiting to start anyway, so I doubt the fairly short time I blocked downloads had any noticeable effect.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
>>> The average for a default machine rather than the host's average.

My impression is that it does account for run time, hence the +/- 1, but not for the processing power you have compared to the average. It is what it is.

>>> aborted by the server

It aborted one or two of my running tasks, which was unfortunate. And then it downloaded a few more of the same type of task... It'll sort itself out before long - some stragglers are slipping through for now. Give it a day or so. |
©2024 University of Washington
https://www.bakerlab.org