Message boards : Number crunching : Discussion on increasing the default run time
Author | Message |
---|---|
Nuadormrac Joined: 27 Sep 05 Posts: 37 Credit: 202,469 RAC: 0 |
If the task failed, then for some reason it is not running well on your machine. It is more conservative to replace it with another task that may run better for your environment. In other words if model 1 or 2 failed from this task, let's not push our luck with more. Better to get word back to the project server about the failure sooner. Perhaps there is a trend that will indicate similar future work should be held until a specific issue is resolved.
If model one failed, though, it would not impact people much, and yes, it can reasonably be argued that the tasks aren't working well on the machine. But then a longer WU time wouldn't affect it much if the unit was aborted early on and a new unit needed to be downloaded (for instance 5 minutes after starting). That's well below even existing preferences. Where this could more likely be an issue is if, let's say for the sake of argument, 20 models completed successfully and, for whatever reason, model number 21 failed. Now the unit has been running 2.5 hours. Only if partial validation for the 20 models occurs would one avoid losing 20 models (vs. just 1); otherwise the user would lose the whole 2.5 hours' worth of credits, vs. just the amount lost for the one unit.
Now arguably I haven't tended to see units fail much on Rosetta (though some have, for there to be discussion, along with a recommendation on the team page for the Pentathlon challenge). But in the past I had seen it from time to time on RALPH, which is good because it means many are being caught in the alpha/beta stage, before getting released to people in general. But it can be a consideration.
For crunchers, there can be 2 big considerations with this proposed change. One is the effect on the BOINC queue. The other is a reason for which shorter run times can be chosen/preferred: less likelihood of running into the odd error if it has a smaller span of time in which to occur, and, if it does happen, a smaller impact on potential for lost credits.
For you, there's server load on the one hand, but also potential for lost models/work already completed on the other. (Given we're talking a change of 1-3 hour minimum and 3-6 hour default, units which successfully run for < 1 hour aren't a consideration with such a change, as they'd get thrown out and a new download would occur anyhow. Hence I'm presuming the first model or 2 has had a successful run, for it to now error out prior to either 1 or 3 hours respectively. And yes, I know a few models do run for 2 hours or so, though many end earlier.) |
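The 20-models-then-a-failure scenario above can be put into numbers with a minimal sketch. It assumes each model takes the same 7.5 minutes (0.125 hours), so that 20 models add up to the 2.5 hours mentioned; the function name and the `partial_credit` flag are hypothetical and only illustrate the two outcomes being compared.

```python
def lost_hours(completed_model_hours, failed_model_hours, partial_credit):
    """Hours of work lost when one model fails after others have completed.

    partial_credit=True models the case where completed models are still
    validated; False models the case where the whole task is thrown away.
    """
    lost = failed_model_hours  # the failed model is lost either way
    if not partial_credit:
        lost += sum(completed_model_hours)  # completed models are lost too
    return lost


# Assumption: 20 completed models of 0.125 h each (2.5 h), then model 21 fails.
completed = [0.125] * 20
print(lost_hours(completed, 0.125, partial_credit=True))
print(lost_hours(completed, 0.125, partial_credit=False))
```

Under these assumed numbers the gap between the two cases is exactly the 2.5 hours of completed models, which is the concern raised in the post.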
Warped Joined: 15 Jan 06 Posts: 48 Credit: 1,788,185 RAC: 0 |
Am I correct in assuming that this proposal has been shelved?
Furthermore, please excuse my ignorance about the way the project works; am I correct in the following statements?
1. Each work unit is pre-populated with 99 (or 100) models.
2. The work unit stops when the earlier of the pre-selected run time or the 99 models is run.
3. In the case that the run time causes the work unit to end, the models remaining untested are discarded and not used for future work units.
4. There are (for practical purposes) an infinite number of possible models, so, assuming point 3 to be correct, discarding the untested models is not an issue.
5. Given that the possible models are "infinite", there should never be a shortage of work units.
6. Shorter work units impact the server load but reduce the risk of crashing before completion or watchdog picking up an error.
7. Longer work units reduce the server load and reduce the risk of running out of work in the case of server issues such as we have recently experienced.
Warped |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Am I correct in assuming that this proposal has been shelved?
Let me take my best shot at these. I cannot confirm the status of the original proposal.
1. Not entirely correct. It just starts with the seed to a random number generator. It can generate any number of starting models from that. But some specific protocols were limited to producing 100 models, because the upload file sizes became quite large.
2. It won't interrupt a model in progress just to cut off at exactly the configured runtime preference. But it will try to avoid beginning the next model if it would be predicted (based on the prior models in your own task) to run too long. And if it doesn't stop running within 4 hours of the configured runtime preference, that is when the watchdog is there to wrap things up.
3. Correct. Using a Monte Carlo approach means that a sampling of the search space yields your estimate of the answer, so the specific models that would have been run if that specific task had continued, or had been run on a faster CPU, are not specifically relevant. So long as the overall search space is adequately sampled, which specific models are examined is not critical.
4. Correct.
5. Not correct. Any server is going to have some limit on the amount of outstanding and completed work it can keep track of. And any Project Team is going to have to review the results to try and gain insights. So if everyone goes on vacation for the holidays, it doesn't make any sense to be sending out work just because there are no limits to the sampling of the search space that is POSSIBLE. It only makes sense to send work you will have staff enough to review. And it only makes sense to sample the search space to some limited degree. The objective ultimately is to be able to come up with a better, more accurate, answer with fewer samples.
6. Some volunteers have reported this. "Crashing" is a relative term. Generally models completed prior to any problem encountered are reported back and granted credit, so the specific fate of the last model of the task is not going to lose the good models you completed prior to that. If a given protocol has a quirk where some fraction of models end up running longer than 4 hours, then yes, by running with a longer runtime, you increase the number of models you begin, and therefore increase your odds of encountering one that runs for a long time. But if your alternative is to pick up another short runtime work unit which has the same odds, and begin running on it... you are exposed to the same chance of hitting a long-running model that requires watchdog intervention. There have been cases where errors were not wrapped up as cleanly as desired. But many of the reports of "crashing" fail to observe the nightly credit granting script that gives credit even after the validator has run on the task.
7. A longer running task and a less frequent server contact helps reduce server loads, certainly. But if you reduce your server contact to once per day rather than 10 times, and the server is not available at that time, you are still out of work (if you have no additional buffer of work). On the other hand, the BOINC client tends to contact the server several hours before it estimates the current work will complete, and so on average, a longer runtime would tend to help you ride through short outages given the same "additional days" of work settings. Odds improve that you will still be crunching on a 24hr task during a short outage and so it will pass without you even knowing it. If you hit the server 10 times per day, odds are you will notice any 3hr or longer outage. Just a question of whether you still have the same few hours of work left.
In other words, if you have no cache of work, that cushion that the client builds in when it requests work before you absolutely run out goes a long way, because if the server is still up, you'll probably be set for another day. And if the server is down, there's a reasonable chance you still get more work before completing the tasks you have in progress. So a day-long runtime would be more similar to a short runtime with a 1-day additional buffer; say, a 3hr runtime with a 21hr additional buffer would be roughly the same as a 24hr runtime and a zero buffer.
Rosetta Moderator: Mod.Sense |
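The stopping behaviour described in point 2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Rosetta application's actual logic: the prediction is taken to be the average run time of the models already finished in the task, "too long" is taken to mean overshooting the target runtime, and both function names are hypothetical. The 4-hour watchdog margin is the only number taken from the post.

```python
from statistics import mean

WATCHDOG_GRACE_HOURS = 4.0  # from the post: watchdog steps in 4 h past the target


def should_start_next_model(elapsed_hours, finished_model_hours, target_hours):
    """Return True if one more model is predicted to fit within the target runtime."""
    if not finished_model_hours:
        return True  # no history yet, so the first model is always started
    predicted = mean(finished_model_hours)  # assumption: average of prior models
    return elapsed_hours + predicted <= target_hours


def watchdog_should_end_task(elapsed_hours, target_hours):
    """Return True once the task has overrun the target by more than the grace period."""
    return elapsed_hours > target_hours + WATCHDOG_GRACE_HOURS


# Four models of 0.5 h each are done; the target runtime is 3 h.
print(should_start_next_model(2.0, [0.5, 0.5, 0.5, 0.5], 3.0))
print(watchdog_should_end_task(7.5, 3.0))
```

A model that is already running is never interrupted by the first check; only the watchdog check applies to a model in progress.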
Warped Joined: 15 Jan 06 Posts: 48 Credit: 1,788,185 RAC: 0 |
Thanks for the detailed response, Mod.Sense. It certainly helps me understand how best I can contribute. |
John M. Kendall Joined: 8 Dec 05 Posts: 3 Credit: 6,697,075 RAC: 0 |
The "To Completion" time needs to be longer. Most of the work units end up running at High Priority. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
it looks like this is an old thread. however, i'm against increasing the default run time beyond 3 hours
the reasons are that i think many of the home community do not leave their PCs on 24x7, let alone crunch boinc round the day. i think there are also many who crunch boinc/rosetta@home occasionally and there are new users. Having too long a run time would discourage these groups, who may abandon the project altogether as it is too long a wait to see results / feedback (e.g. on the tasks web) or simply requires too long for the PC to be on to get results.
electricity is not necessarily free or cheap round the world for those who participate in the projects. and very importantly, poorly configured PCs can run with loud fans, which would irritate the participants having to put up with longer periods of having the PC on and processing data
changing that to 6 hours also does not resolve the issue where works are concurrently retrieved or submitted. i'd think the high traffic situation occurs in spikes
in addition, average consumer cpus today have staggering improvements in processing speeds compared to say even just 5 years ago; these benchmarks range from 10 to 100 times faster compared to the old single core pentiums, p4, athlons etc. that means a single iteration (decoy/model) of the same task now takes 1/10 to 1/100 of the original run time on a modern CPU
i'd think what could be done is to look at the protocols (e.g. boinc or even the rosetta app itself) to see if some improvements can be done so that submissions / retrieval may perhaps be staggered. other possibilities could be mirrors possibly hosted by partners or even a more sophisticated peer-to-peer protocol. After all, as rosetta@home is a distributed computing project, it's likely possible that there can be distributed boinc servers handling distributed work issue and submissions. there are many examples of such successes (e.g. bit-torrent file distribution networks) but it'd require some protocol changes at perhaps the boinc level and even clients perhaps |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
i'd like to present a suggestion perhaps: i reviewed some docs on boinc and apparently this can be somewhat of a challenge to implement, but i'd like to just share a thought:
a minimum run time of 3 hours and a default of 6 hours can be a standard value. however, these values can be provided as *computing preferences* which users can update in their user accounts on the 'computing preferences' page.
the idea is that the minimum run time and max run time are a sort of 'custom preferences' that are specific to the project (rosetta@home) and specific to the user (and even specific to the host)
when the user's boinc client connects, it downloads the 'custom prefs' and saves them in an xml configuration file in the project directory perhaps.
when minirosetta starts on the user's PC, it reads the 'custom prefs' as part of initialization. it can fall back to global defaults if the values are not specified or if they fall outside the 'valid ranges'
other possible custom prefs could have the users indicating their preferences for (priorities of) small/medium/large/complex tasks, which the scheduler on the server may possibly use to present the relevant tasks. however, i'm not sure if this is already part of boinc today, just that it's fully automated.
--------------
i hope this may possibly solve some problems:
1) i noted that some recent tasks/models are apparently pretty large and possibly v complex. on my pc that's running a recent Intel Haswell i7 (probably considered a decently 'fast' consumer CPU), i've seen it completing a single (or very few) models/decoys in the 3 hour time frame, while some of the simpler jobs complete close to a hundred models/decoys in that same time frame. this may result in too few results being produced for the larger complicated structures; having a longer minimum run time could help the large complex tasks complete more models/decoys
2) some users may own somewhat slower PCs that need more time to complete the tasks, or would like the tasks to produce more models/decoys for each job (this would mean needing a longer run time)
---------------
however, commitment to these 'default run time' durations, as i elaborated previously, is very much dependent on the user's specific circumstances, and the global defaults should not be so 'onerous' as to discourage the new users or the occasional 'light' volunteers group, while there are also others who probably leave a host (PC/server) on crunching boinc/rosetta round the clock 24x7
i.e. one size fits all is probably a bad idea, and custom user preferences specifying this as a custom 'computing preference' relevant to rosetta@home is probably a way to alleviate/resolve this
just 2 cents |
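The "read custom prefs, fall back to global defaults when missing or out of range" idea above could look something like the sketch below. Everything specific here is an assumption made for illustration: the `<custom_prefs>` and `<target_runtime_hours>` element names are hypothetical, and the 24-hour upper bound is chosen arbitrarily. Only the 3-hour minimum and 6-hour default come from the thread. This is not how BOINC or minirosetta actually store preferences.

```python
import xml.etree.ElementTree as ET

DEFAULT_HOURS = 6.0  # default proposed in the thread
MIN_HOURS = 3.0      # minimum proposed in the thread
MAX_HOURS = 24.0     # assumption: upper bound picked for illustration only


def load_target_runtime(xml_text):
    """Read a hypothetical <target_runtime_hours> value, or return the default.

    The default is used when the XML is malformed, the element is missing,
    the value is not a number, or the value is outside the valid range.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return DEFAULT_HOURS
    node = root.find("target_runtime_hours")
    if node is None or node.text is None:
        return DEFAULT_HOURS
    try:
        hours = float(node.text)
    except ValueError:
        return DEFAULT_HOURS
    if not MIN_HOURS <= hours <= MAX_HOURS:
        return DEFAULT_HOURS
    return hours


prefs = "<custom_prefs><target_runtime_hours>8</target_runtime_hours></custom_prefs>"
print(load_target_runtime(prefs))
```

Falling back to the default rather than clamping to the nearest bound keeps the behaviour predictable for a user who mistypes a value, though clamping would be an equally defensible choice.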
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
My first thought on reading your post was that this is what currently is supported (via the Rosetta-specific preferences, configured via the website). But I think what you are suggesting that would be different is essentially to tag the tasks with some relative size, and then allow the user to configure whether or not they want to accommodate that size. I guess even better would be to get BOINC Manager to do such selection for you. So it would be automatic that if you have a smaller machine or don't run very long each day that these tasks would not be sent. And it would seem as though this could be done by establishing appropriate memory and FLOP estimates on the tasks. Perhaps based upon reported results from Ralph@home. But that gets to be sticky with the existing runtime preference setting, and how BOINC Manager normalizes the runtimes. Rosetta Moderator: Mod.Sense |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
today:
Total queued jobs: 9,095,679
In progress: 298,286
Successes last 24h: 115,117
perhaps it is time to reduce the default run time for everyone?
thanks modsense, would try out the boinc-client/manager preferences 1st |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Seems to me like there are several issues mixed together here, and I've just spent a long time trying to sort out the thread without being able to find a clear focus on things. Let me try to break it down another way:
There seem to be three objectives:
1. Doing good by solving problems.
2. Earning credit for doing that good.
3. Productively using computer cycles that would be wasted.
I was going to start breaking it down into tradeoffs, but almost all of the cases kept coming back to wasted effort on my part (or on my computer's part). For example, the original idea of this thread was to have larger run times, but that causes more conflicts with my normal usage of my computers. I'm already noticing how the long checkpoints tend to result in lost work each time a computer is started or shut down. I've mostly been focused on projects that don't do any checkpoint for an hour or longer. For several reasons I feel it is better to shut down properly rather than sleep, but if I check the status of the in-progress work, I often find that hours of work will be discarded unless I sleep the machine...
The system is complicated and unreliable, and the only safe guideline seems to be favoring the smallest work units with the most frequent checkpoints and the longest deadlines. Just seeming too complicated and confusing, which is why I dropped my previous projects (after earning over 1 million and almost 300,000 "Work done" points). Right now I'm leaning towards doing what I can to help their bandwidth problems by dropping rosetta@home (after earning 1.4 million points). (In the ancient pre-BOINC days when seti@home was the only game, I had worked my way up to top 1% status, but I always felt that project was pointless.)
Not likely, but maybe someone can represent a BOINC project that is NOT so troublesome? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@shanen, I'll just point out that the target runtime you set on R@h has no impact on the frequency of a task checkpointing. Most R@h tasks checkpoint every 15 minutes or so. Some types of tasks can take over an hour, but they are less common. The runtime just determines how many models your machine will work on for a given protein challenge. More models completed means more credit, and more crunch time before using bandwidth to get a new task. This thread is discussing the runtime. It actually sounds like you are more interested in the frequency of checkpointing. If that is the case, feel free to open a new thread. The number crunching board is probably the best place to discuss that topic further. Rosetta Moderator: Mod.Sense |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
shanen wrote: Not likely, but maybe someone can represent a BOINC project that is NOT so troublesome?
Well, Seti@Home is still checkpointing pretty much as often as you want, or Milkyway@Home was checkpointing on the CPU as often as you want (IIRC, I run it mostly on my GPU without any checkpointing at all). You can simply use the results of WUProp@home to find a project; after all, that's what people let it run on their computers for, to help others find suitable projects for their computers.
... or you simply hibernate your computer instead of shutting it down. There's nothing wrong with that; since XP came out I restart only when really needed, for example after some updates. . |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Well, I just made a second attempt to start a thread in the new direction as suggested, but it seems I failed again. Let me try to put a quick wrapper around the problem. I do NOT want to spend a lot of time trying to figure out why a BOINC project seems to be failing to make any progress, or even to understand how the project website works. Nor do I want to substantially modify my computer usage habits for the greater convenience of the BOINC projects. I just want to donate the available cycles to do some good. The main reason I abandoned the last two BOINC projects I was supporting was because of complexities in their operations. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
Well, I just made a second attempt to start a thread in the new direction as suggested, but it seems I failed again.
Maybe I'm being very stupid, but I looked at a couple of your machines and you seem to complete 15-30 tasks a day. These will have had zero downtime. On the assumption you shut down once a day (may be wrong) you might be losing a few minutes each since the last checkpoint of just your running tasks. That's as close to zero (for the day) as makes no difference. I agree the FKRP tasks seem to struggle for their first checkpoint - some hours - but only those tasks.
Dare I say it, you seem to be wasting far more time trying to micro-manage tasks (and writing about them) than if you just let them run. If you want to save a whole heap of time, don't go checking your downloads at all and cherry-picking ones to delete. As long as there's enough time left in your day to complete them, they'll sort themselves out without any downtime and no micro-managing what Boinc and the tasks do routinely anyway.
In answer to a question you asked in one of your threads, there isn't a problem so I don't do anything about them and just get on with my day. That said, maybe I've misunderstood or completely missed the issue. It wouldn't be the first time. |
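The "a few minutes each since the last checkpoint" estimate can be checked with a rough calculation. The sketch below assumes a shutdown lands, on average, halfway between two checkpoints; the 15-minute interval is the typical figure quoted earlier in the thread, while the 4 running tasks and 8 hours of daily uptime are made-up inputs. The function names are hypothetical.

```python
def expected_lost_minutes(running_tasks, checkpoint_interval_min, shutdowns_per_day):
    """Expected minutes of work discarded per day because of shutdowns.

    Assumption: each shutdown falls, on average, halfway through a
    checkpoint interval for every task that is running at the time.
    """
    return running_tasks * (checkpoint_interval_min / 2) * shutdowns_per_day


def lost_fraction(lost_minutes, running_tasks, hours_on_per_day):
    """Share of the day's total crunching time that the lost minutes represent."""
    total_minutes = running_tasks * hours_on_per_day * 60
    if total_minutes <= 0:
        raise ValueError("the machine must run at least one task for some time")
    return lost_minutes / total_minutes


# Made-up example: 4 tasks running, 15 min checkpoints, 1 shutdown, 8 h of uptime.
lost = expected_lost_minutes(4, 15, 1)
print(lost, lost_fraction(lost, 4, 8))
```

With these inputs the loss is a small share of the day's work; tasks that go hours before their first checkpoint would change the picture considerably.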
Usuario1_S Joined: 24 Mar 14 Posts: 92 Credit: 3,059,705 RAC: 0 |
We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers. There will be a transition period where your client will adjust to the new run time which will affect the number of tasks that are queued on your client. I've created this thread for a discussion on what would be the best way to transition to an increased run time. This obviously will only affect people with default run times (people who have not bothered to set this preference) or people who have set their run time to be less than 3 hours. (edit: not 6, whoops!)
I think it's a good idea. Why have such small windows of 1 hour or 3? 6 is better. Anyway, my PC communicates only once a day, with 1 day of work reserved plus 1 additional day. It runs fine that way, and I think that is good for the servers. Maybe something similar could be done for Android too: there have been few or no WUs the last few days, and the ones that I get run for 30 mins or 2 hours. I have seen them get cut off on their run time; after a certain % they finish and report. I want to run the WU completely if possible, and 6 hours on my 4-core Galaxy Tab4 is fine for me. I'd also like to request WUs for my tablet GPU, if possible, and my PC GPU, not based on OpenCL for older cards please. Mine is about 7 years old, an ATI 4670HD, but it works fine. It supports OpenCL 1.1b (beta) or something a bit below the minimum OpenCL standard for Folding@Home computing, but it can give several 32-bit GFLOPs (giga FLOPs), and there are probably still millions of old video cards that could give you extra processing power |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Why not larger work units? Because some people don't run their computers for such long periods of time and because there are continuing problems with the checkpointing. In my own case, I have several machines that are normally only used for special purposes. In terms of constructive suggestions, I have two, though I feel both of them are more at the BOINC level than the project level. One is that the client should respond to normal shutdowns by checkpointing all active work units. I believe that is supported by all of the OSes I currently use, Windows 10, Ubuntu Linux, and the Mac OS, and I'm really unable to understand why that does not seem to be the case now. Maybe there are places where large amounts of memory would have to be saved to disk for a checkpoint? However that would actually tie into my second suggestion. The size of the work units should be related to the operating history of the machine. For a machine that tends to run for short periods of time, it should get shorter work units with more frequent checkpointing. In contrast, a machine that has a history of running without interruption should be allowed or even encouraged to work on larger work units with less checkpointing. Actually the reason I stopped by today was to see if there is any reason for the recent problems in completing work units. One of the most likely causes would be large work units, which is why I looked at this thread in the first place. Recently the project managers apparently got rid of the 4-hour work units from the rb series, and almost everything is running around 8 hours of working time. However, at least one project is creating at least some work units that can be completed in around 2 hours. Not really a suggestion, but in general I think all of the deadlines are just annoying and counterproductive from the perspective of the volunteers. Needless complication and if the project wants to impose them internally, there is no reason to bug us about them. 
The project managers frequently try to say how little they matter, but if that is true, why not make them look meaningless even if they have some internal relevance? In the "worst" case, they can just use the so-called "late" results to double-check the work that has already been accepted as part of a project's results. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
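The second suggestion above, sizing work to the machine's operating history, might be sketched as below. This is purely hypothetical and does not describe anything BOINC or Rosetta does today: the list of runtime options is invented, and using the median length of recent uptime sessions as the "typical" session is just one plausible choice.

```python
from statistics import median

RUNTIME_OPTIONS_HOURS = (1, 2, 4, 8)  # assumption: illustrative choices only


def suggest_runtime(recent_session_hours):
    """Pick the longest runtime option that fits within a typical uptime session."""
    if not recent_session_hours:
        return min(RUNTIME_OPTIONS_HOURS)  # no history: start with the shortest
    typical = median(recent_session_hours)
    fitting = [h for h in RUNTIME_OPTIONS_HOURS if h <= typical]
    return max(fitting) if fitting else min(RUNTIME_OPTIONS_HOURS)


print(suggest_runtime([1.5, 2.5, 3.0]))   # machine used in short bursts
print(suggest_runtime([24, 24, 20]))      # machine left running
```

A real scheme would also have to account for checkpoint frequency, which this sketch ignores entirely.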
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,571,918 RAC: 7,228 |
Why not larger work units? Because some people don't run their computers for such long periods of time and because there are continuing problems with the checkpointing. In my own case, I have several machines that are normally only used for special purposes.
I suggested, some time ago, creating an entry in the user profile like "big wus" (wus that need a lot of RAM, for example). Every volunteer could decide whether or not to receive these wus. With the new servers (new hw and sw), the "server load problem" is much reduced. |
Stephen "Heretic" Joined: 2 Apr 20 Posts: 21 Credit: 11,028 RAC: 0 |
Right. You are not impacted by the proposed change to default run time, because you are not using the default. And you are not impacted by the proposed change to minimum runtime, because you are over the proposed new minimum runtime.
. . Hi,
. . I am a newbie refugee from S@H and was hoping someone could explain the function/purpose/usage of target times?
. . I am at present running 2 cores of an i5 with default target times. These settings are only allowing one task to run. I originally only committed one core and one task was running AOK, but after trying 2 cores which made no difference then 3 cores which allowed a second task to run but let loose the dogs of war and sent everything into meltdown, when I returned to the original commitment of one core the initial task remained in the waiting to run state until I committed a second core. What is the secret to making things play nice? With the current mode the CPU usage is very low. Also what memory resources are required and how is it best to manage these resources?
Stephen |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1679 Credit: 17,780,029 RAC: 22,848 |
. . I am a newbie refugee from S@H and was hoping someone could explain the function/purpose/usage of target times?
The length of time a Rosetta Task runs for is fixed (the default is 8 hours but you can select other target Runtimes in your Rosetta project settings). Rosetta basically tries a whole bunch of different scenarios on the data. Some may run into a dead end and end early. Others may go on indefinitely, never ending. So the target Runtime is set & that is how long a Task will run for (if there is an issue with it & things get stuck, the Watchdog timer will kill the Task off, but you will still get Credit for work done. Unless of course it choked up right at the start of processing).
. . I am at present running 2 cores of an i5 with default target times. These settings are only allowing one task to run. I originally only committed one core and one task was running AOK, but after trying 2 cores which made no difference then 3 cores which allowed a second task to run but let loose the dogs of war and sent everything into meltdown, when I returned to the original commitment of one core the initial task remained in the waiting to run state until I commit a second core. What is the secret to making things play nice. With the current mode the CPU usage is very low. Also what memory resources are required and how is it best to manage these resources?
Having "Use at most xx% of the CPUs" at anything less than 100% and running multiple projects, due to the CPU usage restriction, Resource share settings, cache settings & any other app_config.xml project specific limitations, along with any locally set preferences that override web based ones you may have made, will result in unexpected behaviour.
Grant
Darwin NT |
Stephen "Heretic" Joined: 2 Apr 20 Posts: 21 Credit: 11,028 RAC: 0 |
The length of time a Rosetta Task runs for is fixed (the default is 8 hours but you can select other target Runtimes in your Rosetta project settings).
OK, so the target time limits the run length of the task, but how is that affected by, or how does it interact with, the minimum target setting?
Having "Use at most xx% of the CPUs" at anything less than 100% and running multiple projects, due to the CPU usage restriction, Resource share settings, cache settings & any other app_config.xml project specific limitations, along with any locally set preferences that override web based ones you may have made, will result in unexpected behaviour.
. . Actually I am finding just the opposite, with % CPU numbers running at 50% (2 cores) one task is running just fine (has just completed AOK and a second one has started) but when I increased CPUs to 75% (3 cores) all hell broke loose, it crashed BOINC totally and really screwed the pooch, so I dread to think what would happen if I tried running at 100%. I would like to understand what governs resource usage in this app (v4.12) so I can get it to play well with E@H, which I intend to keep running. Apart from CPU support for the GPU app there is no conflict because E@H is GPU app only and Rosetta is CPU app only. |
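The mapping between the "Use at most xx% of the CPUs" setting and the core counts mentioned above (50% giving 2 cores, 75% giving 3) is consistent with a 4-core machine, and can be sketched as below. The rounding rule is an assumption for illustration (round down, but never below one core); the actual BOINC client may behave differently, and the function name is hypothetical.

```python
def usable_cpus(n_cpus, max_percent):
    """Number of cores available for tasks under a 'use at most N%' setting.

    Assumption: the product is rounded down, with a floor of one core.
    """
    return max(1, int(n_cpus * max_percent / 100))


# Assumed 4-core CPU, matching the 50% -> 2 cores and 75% -> 3 cores in the post.
for percent in (25, 50, 75, 100):
    print(percent, usable_cpus(4, percent))
```

This only shows how many cores are offered to BOINC; how those cores are split between projects depends on the other settings listed in the reply, such as resource share and any app_config.xml limits.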
©2024 University of Washington
https://www.bakerlab.org