Message boards : Number crunching : Minirosetta 3.73-3.78
Author | Message |
---|---|
Michael H.W. Weber Joined: 18 Sep 05 Posts: 13 Credit: 6,672,462 RAC: 0 |
On my systems and those of other team members, all WUs carrying the phrase "backrub" are breaking down with computation errors. Often after having consumed quite some CPU time. @Baker Lab: Please take a look at this WU series. Thanks. Michael. President of Rechenkraft.net e.V. http://www.rechenkraft.net - The world's first and largest distributed computing association. We make those things possible that supercomputers don't. |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
On my systems and those of other team members, all WUs carrying the phrase "backrub" are breaking down with computation errors. Often after having consumed quite some CPU time. These are my jobs and I do realize that many of them are failing with memory issues on some platforms. I will definitely look into this. The batch is almost complete so I'm going to let them continue since they are producing results which I'm very interested in. Credit should still be granted for the jobs that fail. |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
Two of my systems have started intermittently falling into 'project backoff' for 10-40 hour periods after getting this message in the logs (If I go and do a manual 'request new tasks' they successfully get more tasks but I noticed because their work queues dry out:
Is this perhaps a result of higher 'memory requirements' attached to some of those jobs? If so, no worries, I'll just keep an eye on it until that batch finishes :) .. a side note though, the backrub type jobs seem to be completing successfully on my boxes - maybe it's something to do with my target runtime being short (4 hours) and it not getting a chance to chew through so much memory? (Speculation ftw!) If that's the case maybe jobs like this should be limited to a shorter target runtime? |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Two of my systems have started intermittently falling into 'project backoff' for 10-40 hour periods after getting this message in the logs (If I go and do a manual 'request new tasks' they successfully get more tasks but I noticed because their work queues dry out: Or you could just... you know... buy 60 gigs of RAM lol |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,273,400 RAC: 1,466 |
Two of my systems have started intermittently falling into 'project backoff' for 10-40 hour periods after getting this message in the logs (If I go and do a manual 'request new tasks' they successfully get more tasks but I noticed because their work queues dry out: I'd do just that for both of my computers if their motherboards could handle more memory. They can't. |
fractal Joined: 12 Dec 08 Posts: 2 Credit: 1,000,245 RAC: 0 |
Two of my systems have started intermittently falling into 'project backoff' for 10-40 hour periods after getting this message in the logs (If I go and do a manual 'request new tasks' they successfully get more tasks but I noticed because their work queues dry out: I found two of my machines in that state this morning and several yesterday. 2/19/2016 5:54:25 PM | rosetta@home | Computation for task rb_11_07_60457_104894__t000__0_C1_beta_nov15_cart_fa_wt_0.40_SAVE_ALL_OUT_IGNORE_THE_REST_327108_852_1 finished That machine had 18 hours of backoff when I found it this morning. It still had only one work unit running across its four cores. 2/20/2016 3:04:19 AM | rosetta@home | Computation for task foldit_2001101_s003_fold_and_dock_SAVE_ALL_OUT_328024_8728_0 finished This machine was completely out of work when I found it at the same time, with over 24 hours of backoff. It got work as soon as I manually refreshed the project. My priority 0 backup project was not getting work either, but that never seems to work. 2/20/2016 7:10:56 AM | Universe@Home | Sending scheduler request: To report completed tasks. I don't mind not getting a work unit that needs 60 GiB of RAM, but please don't refuse to give my meager machine more bite-sized work just because of that. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
2/20/2016 2:07:57 AM | rosetta@home | Rosetta Mini needs 57220.46 MB RAM but only 6842.83 MB is available for use. Maybe it's time to remove "mini" from the app name... ;-) On the serious side, considering that most PCs are still sold with 8 GB or less, maybe creating another app name for this type of work would indeed be a good idea, so that only people with plenty of RAM can activate it in their profile while others won't be stopped from getting work (if that can't be solved in another way). |
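
For anyone who wants to check their own event log for this condition, here is a minimal Python sketch that pulls the two numbers out of the scheduler message quoted above and reports the shortfall. The regular expression is only an assumption based on the one message format shown in this thread; other client versions may word it differently, and the function name `parse_ram_message` is made up for illustration.

```python
import re

# Pattern modeled on the single message quoted in this thread (an assumption,
# not a documented BOINC log format).
MSG_RE = re.compile(
    r"needs (?P<needed>\d+(?:\.\d+)?) MB RAM but only "
    r"(?P<available>\d+(?:\.\d+)?) MB is available"
)


def parse_ram_message(line):
    """Return (needed_mb, available_mb), or None if the line does not match."""
    match = MSG_RE.search(line)
    if match is None:
        return None
    return float(match.group("needed")), float(match.group("available"))


line = (
    "2/20/2016 2:07:57 AM | rosetta@home | Rosetta Mini needs 57220.46 MB RAM "
    "but only 6842.83 MB is available for use."
)
needed, available = parse_ram_message(line)
shortfall = needed - available
print(f"needed {needed:.2f} MB, available {available:.2f} MB, "
      f"shortfall {shortfall:.2f} MB")
```

Feeding each line of a saved event log through `parse_ram_message` and flagging any non-`None` result would be one way to spot affected hosts before their queues run dry.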
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,273,400 RAC: 1,466 |
2/20/2016 2:07:57 AM | rosetta@home | Rosetta Mini needs 57220.46 MB RAM but only 6842.83 MB is available for use. I decided to buy another of my favorite brand of computers yesterday. They didn't offer any with more than 32 GB that fit my other requirements. |
fractal Joined: 12 Dec 08 Posts: 2 Credit: 1,000,245 RAC: 0 |
2/20/2016 2:07:57 AM | rosetta@home | Rosetta Mini needs 57220.46 MB RAM but only 6842.83 MB is available for use. You generally need server class hardware to get more than 32 GiB of memory. <begin wry humor>And, since the project shuts you down if you fail for ANY work unit, you need 60 GiB of RAM per core. That's 240 GiB for a quad core. You can get that with AMD Opterons or Intel Xeons using registered ECC RDIMMs. This is not a viable approach for most volunteers.<end wry humor> That aside, I had to manually update 8 stuck machines yesterday. I was about to say that I didn't have to restart any today but just found one on a 20 hour backoff. Fortunately I increased my buffer from half a day to a full day to give me time to find them before they run dry. Oh, and why is it called "mini rosetta?" See https://www.rosettacommons.org/content/what-minirosetta |
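
The tongue-in-cheek arithmetic above can be written out as a tiny Python sketch. It assumes, as the joke does, that every core might be handed one worst-case task at the same time; the function name and the round 60 GiB figure are illustrative, not anything the project specifies.

```python
def total_ram_gib(per_task_gib, cores):
    """Worst-case RAM if every core runs one task of the given size."""
    return per_task_gib * cores


# Figures from the post: 60 GiB per task on a quad core.
quad_core_total = total_ram_gib(60, 4)
print(f"{quad_core_total} GiB for a quad core")

# The logged requirement itself, converted from MB to GiB
# (treating the client's "MB" as 1024-based, which is an assumption).
logged_gib = 57220.46 / 1024
print(f"logged requirement is about {logged_gib:.1f} GiB per task")
```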
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,273,400 RAC: 1,466 |
2/20/2016 2:07:57 AM | rosetta@home | Rosetta Mini needs 57220.46 MB RAM but only 6842.83 MB is available for use. I might be able to afford server class hardware, but I don't feel like learning a server operating system - I've already learned enough operating systems. Also, I have rather strong electrical power limitations here. As for removing mini from minirosetta, it looks like someone doesn't know enough of the history of Rosetta@home to remember that the main application was rosetta a few years ago. Do they want the renamed application to be easily confused with the application of a few years ago? |
jjch Joined: 10 Nov 13 Posts: 14 Credit: 440,472,381 RAC: 17,419 |
It looks like there are two different things going on here but they may be related. I have a number of servers and workstations that are being used for CPU and GPU computing. These were recently set primarily to run rosetta for CPU work to help out that project. The rosetta Task status shows Ready to report but the Project Status goes to Communication Deferred for multiple hours (ex. 18 hrs) and the server runs dry. What I am seeing is that the project happily goes along for a while Requesting new tasks for CPU and gets the Scheduler request completed: got 1 task message. Then after a few hours it gets the Scheduler request completed: got 0 tasks. No work sent. Rosetta Mini for Android is not available for your type of computer. Finally, the message Rosetta Mini needs 57220.46 MB RAM but only 7363.62 MB is available for use. After that it stops updating. Remaining tasks will continue to upload until it runs out. Rosetta does not automatically download any more tasks or report any that were finished. You can manually update and get it to reset and start again; however, it will just run through to the same result in a few hours. I'm not going to babysit all of these servers every day to keep running rosetta. Also, these were purposefully only populated with 8GB memory to save on power and cooling requirements. CPU and GPU computing, remember. Please look into this and provide a resolution soon or I will have to move on to other projects. Let me know if I can be of assistance or provide any more detailed information. Thanks. |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,273,400 RAC: 1,466 |
It looks like there are two different things going on here but they may be related. It looks like all of your computers run some version of Windows and none of them run Android. |
jjch Joined: 10 Nov 13 Posts: 14 Credit: 440,472,381 RAC: 17,419 |
All of the systems are running Windows, either 2012/R2, 7 or 8.1. None of them has an Android emulator either. Had to give up my Linux servers. There were a couple of these that were left with more than 8GB memory. I am going to check if those also have the same problem. I will also check if one might already have 64 GB memory or upgrade it and see if it makes any difference. |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
All of the systems are running Windows, either 2012/R2, 7 or 8.1. There isn't any that have an android emulator either. Had to give up my Linux servers. I think your (very impressive) fleet of servers is being affected by the same memory allocation messages I posted about (seen as follows in my logs):
The above causes the box to head into 'project backoff' for 20-40 hours. Hoping David sees this thread and can take a peek sooner rather than later :). |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
All of the systems are running Windows, either 2012/R2, 7 or 8.1. There isn't any that have an android emulator either. Had to give up my Linux servers. Thanks for the heads up. I'll track this down and try to fix it on our end. |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,273,400 RAC: 1,466 |
All of the systems are running Windows, either 2012/R2, 7 or 8.1. There isn't any that have an android emulator either. Had to give up my Linux servers. Something that MIGHT be worth trying: See if your account settings allow you to turn off Android workunits, since none of your computers run Android instead of Windows. |
jjch Joined: 10 Nov 13 Posts: 14 Credit: 440,472,381 RAC: 17,419 |
I'm not seeing an option to change that setting in rosetta. It is available on a few other BOINC projects though. |
jjch Joined: 10 Nov 13 Posts: 14 Credit: 440,472,381 RAC: 17,419 |
Update - Several of the servers that had 0 work left yesterday started up again today and began processing Rosetta tasks. Probably after the communication deferred timer ran out. Seems that if you manually update the project it triggers the loop but if you leave it alone it might sort it out by itself. There are a few that still are stuck so I can check on those tomorrow. Several servers already have 32GB memory so those are reporting a similar message with slightly different memory size available. Also, there are three servers one each with 64, 128 and 256GB of memory. They need patching and BOINC updates to 7.6.22 anyway. When I restart them I will watch how they behave. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Update - Several of the servers that had 0 work left yesterday started up again today and began processing Rosetta tasks. Probably after the communication deferred timer ran out. Not to be nosy, but how do you handle the heat from the servers? You're pulling over a quarter million credits per day, that's very impressive! |
jjch Joined: 10 Nov 13 Posts: 14 Credit: 440,472,381 RAC: 17,419 |
The servers are all in a lab room that has an AC cooling unit but I'm actually close to the limit it will handle. Works pretty well during the winter and cooler months but when the weather gets hot outside I have to throttle them back during the day and only run at night. If it gets past 90 F I have had to just let them run out of work units and idle. If we get to 100+ F I have had to shut them off and let the weather cool down a bit before starting back up again. Gives me a chance to update things and reset them anyway. |
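
The temperature rules described above could be sketched as a small decision function. This is only one reading of the post: the 90 F and 100 F cut-offs come from it, but the threshold for "hot" (`hot_threshold_f`, set to 80 F here) is a made-up placeholder, as are the function name and the mode labels.

```python
def run_mode(outdoor_temp_f, is_night, hot_threshold_f=80.0):
    """Pick an operating mode for the server room from the outdoor temperature.

    hot_threshold_f is a hypothetical value; the post does not say where
    "hot" begins.
    """
    if outdoor_temp_f >= 100:
        return "shutdown"        # 100+ F: power off until it cools down
    if outdoor_temp_f > 90:
        return "drain_and_idle"  # past 90 F: let work units run out, then idle
    if outdoor_temp_f >= hot_threshold_f:
        # Hot weather: throttle back during the day, run fully at night.
        return "full" if is_night else "throttled"
    return "full"                # winter and cooler months


for temp, night in [(70, False), (85, False), (85, True), (95, True), (101, False)]:
    print(temp, "night" if night else "day", "->", run_mode(temp, night))
```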
©2024 University of Washington
https://www.bakerlab.org