Finish a workunit in < 1 minute: What would it take?

Message boards : Number crunching : Finish a workunit in < 1 minute: What would it take?


1 · 2 · 3 · Next

Michael

Joined: 12 Oct 06
Posts: 16
Credit: 51,712
RAC: 0
Message 29307 - Posted: 13 Oct 2006, 22:03:51 UTC

What aggregate clock speed would it take to finish a workunit in less than one minute? What CPU architecture is the most efficient for Rosetta@home jobs?

Michael
ID: 29307
Profile dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,675,695
RAC: 11,002
Message 29311 - Posted: 13 Oct 2006, 22:41:20 UTC - in response to Message 29307.  

What aggregate clock speed would it take to finish a workunit in less than one minute? What CPU architecture is the most efficient for Rosetta@home jobs?

Michael

Dodgy hardware can finish a job in under a minute! If you're talking about valid jobs, then it depends on the work unit being crunched. The smallest possible WU is a single decoy. Taking a job from one of my PCs as an example: it takes around 14,000s to complete 21 decoys, which is about 11.1 minutes per decoy on an AthlonXP-M @ 2.3 GHz. I'd expect you'd need a CPU based on the Core architecture running at around 20 GHz, or maybe an Athlon64 at around 24 GHz, to complete one of those decoys in a minute. Multiple cores don't make any difference, as the processing is single-threaded (although of course you run different WUs simultaneously on multi-CPU machines).
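The back-of-the-envelope arithmetic, as a sketch (this assumes run time scales inversely with clock speed, which real CPUs only approximate):

```python
# Numbers from the example job above; scaling assumption is mine.
total_seconds = 14_000        # CPU time for the whole work unit
decoys = 21                   # decoys completed in that time
clock_ghz = 2.3               # AthlonXP-M clock speed

seconds_per_decoy = total_seconds / decoys
target_seconds = 60           # one decoy in a minute
required_ghz = clock_ghz * seconds_per_decoy / target_seconds

print(f"{seconds_per_decoy / 60:.1f} minutes per decoy")    # → 11.1 minutes per decoy
print(f"~{required_ghz:.0f} GHz on the same architecture")  # → ~26 GHz on the same architecture
```

A Core-architecture chip does more per clock than an AthlonXP, hence the lower 20 GHz guess above.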

I'm fairly sure Core/Core2 is the most efficient. Rosetta is quite FPU- and cache-intensive, so the Pentium-M and AthlonXP/Athlon64 all do well per MHz. The Core-based Xeons and the Opterons are the fastest of these.

HTH
Danny
ID: 29311
Michael

Joined: 12 Oct 06
Posts: 16
Credit: 51,712
RAC: 0
Message 29315 - Posted: 14 Oct 2006, 0:28:56 UTC

I don't understand. Your reply contradicts itself. There is no such thing as 20GHz CPU, so how exactly would a 1 minute WU be possible?

Do you know why WU are limited to a single thread?
Michael
Join Team Zenwalk
ID: 29315
Astro

Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29319 - Posted: 14 Oct 2006, 0:35:14 UTC
Last modified: 14 Oct 2006, 0:38:08 UTC

Basically, it's NOT possible. At least not with valid work units. Each type of WU is different: they take different amounts of work and have different numbers of steps per decoy/model. A WU consists of something like 10,000 (or is that 100,000?) models, so we each only do a small portion thereof. (I'm not up on the exact science of Rosetta.) Below is a chart of some of another user's WUs from one computer. A WU will run until at least ONE model is done, regardless of this setting. See how he does a different number of models with a ONE hour "CPU run time" preference.



Dag nabbit, I posted the one with the wrong headers. The credit columns (right after CPU time) should read "Claimed Credit, Granted Credit, Claimed Credit/hour, and Granted Credit/hour".

sorry
ID: 29319
Astro

Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29320 - Posted: 14 Oct 2006, 0:54:35 UTC
Last modified: 14 Oct 2006, 1:12:30 UTC

I have an AMD64 X2 4800, one of the faster computers available (short of Core 2: Conroe, Kentsfield, etc.), and here's a chart from it. You'll see how long each model/decoy takes on different WU types; I made it just now. To get that down under 1 second would take one heck of a computer, or an unheard-of small model.

ID: 29320
Astro

Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29321 - Posted: 14 Oct 2006, 1:13:38 UTC

If it could be done in less than 1 second, the host would run out of work waiting on the download of the next WU. LOL
ID: 29321
Michael

Joined: 12 Oct 06
Posts: 16
Credit: 51,712
RAC: 0
Message 29328 - Posted: 14 Oct 2006, 6:18:06 UTC

Why can't WU be divided among cores?
Michael
Join Team Zenwalk
ID: 29328
FluffyChicken

Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29331 - Posted: 14 Oct 2006, 9:23:08 UTC - in response to Message 29328.  

Why can't WU be divided among cores?


Because most of the program is a tight loop, which does not lend itself to multi-threading.

BUT there is no particular need to spread a WU across cores, since you can run two at a time, and it is more efficient to do this*. Most distributed computing projects run this way. The only ones that would benefit, if it were possible, would be the very long-running ones, like Climate Prediction and the rendering projects, where doing one thing takes a long, long time.


* In effect we are dividing among the cores as seen from the point of view of the project results.
Team mauisun.org
ID: 29331
Profile dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,675,695
RAC: 11,002
Message 29332 - Posted: 14 Oct 2006, 9:24:10 UTC - in response to Message 29328.  
Last modified: 14 Oct 2006, 9:24:36 UTC

I don't understand. Your reply contradicts itself. There is no such thing as 20GHz CPU, so how exactly would a 1 minute WU be possible?

There isn't, so no, a 1-minute WU isn't possible with the WUs we've seen recently. My reply was just a theoretical answer to your question.

Why can't WU be divided among cores?

A single WU can't, because the process is iterative: one calculation depends on the previous. I'm sure parts of the code could be rewritten to pull out some steps that could run as separate threads, but it's more efficient to run one WU per core; multithreading would have little benefit other than reducing the RAM requirement.
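To illustrate (a toy sketch, not Rosetta's actual code): each iteration consumes the previous one's output, so one trajectory is inherently serial, while separate decoys parallelise trivially.

```python
import random

def refine(state, rng):
    # Toy "move": perturb the previous result. This stands in for one
    # iterative refinement step; it is NOT the real Rosetta algorithm.
    return state + rng.uniform(-1.0, 1.0)

def run_decoy(steps, seed):
    rng = random.Random(seed)
    state = 0.0
    for _ in range(steps):
        # Each step consumes the result of the previous one, so this
        # loop cannot be split across cores mid-trajectory.
        state = refine(state, rng)
    return state

# Whole decoys are independent, though: this list is trivially
# parallel, which is why BOINC just runs one WU per core instead.
results = [run_decoy(1000, seed) for seed in (1, 2, 3)]
```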

Why would you want a sub 1m WU?
ID: 29332
Michael

Joined: 12 Oct 06
Posts: 16
Credit: 51,712
RAC: 0
Message 29344 - Posted: 14 Oct 2006, 16:41:34 UTC
Last modified: 14 Oct 2006, 16:43:35 UTC

Why would you want a sub 1m WU?

What if there was only one WU left? Or what if I wanted to push the WU queue down to zero?

I understand that with current technology using multiple cores for multiple workunits is the most efficient use of resources, but I wanted to know why it was technically impossible to subdivide a workunit (in a field based on dividing work, distributed computing). The answer seemed intuitive, the WU could not be subdivided any further, but I wanted a better answer.

If asked about it I will say that WU have reached the limit of how far work can be subdivided. Because of the computations involved WU are already as small as they can be. Dividing WU any further would decrease efficiency. If someone wants to increase the pace of the project, then faster computers will have to be invented.

Thank you for your help.
Michael
Join Team Zenwalk
ID: 29344
FluffyChicken

Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29346 - Posted: 14 Oct 2006, 16:59:54 UTC - in response to Message 29344.  

Why would you want a sub 1m WU?

What if there was only one WU left? Or what if I wanted to push the WU queue down to zero?

I understand that with current technology using multiple cores for multiple workunits is the most efficient use of resources, but I wanted to know why it was technically impossible to subdivide a workunit (in a field based on dividing work, distributed computing). The answer seemed intuitive, the WU could not be subdivided any further, but I wanted a better answer.

If asked about it I will say that WU have reached the limit of how far work can be subdivided. Because of the computations involved WU are already as small as they can be. Dividing WU any further would decrease efficiency. If someone wants to increase the pace of the project, then faster computers will have to be invented.

Thank you for your help.



The subdividing limit is true for this project, but not necessarily for others. It all really depends on the calculations being performed.
e.g. the rendering projects: sending out individual frames (or groups of frames) is still the way to go, but the process of rendering itself is suitable for multiprocessor use, so in that type of distributed project a multi-threaded configuration would be advantageous, since they get the results back faster and with (probably) the same efficiency.
A similar thing applies to climate models: these take months to complete, so if a computer could do one in even 2/3rds of the time on a dual (or more) configuration it would be well worth it, even if it was not as efficient overall, since getting the result back weeks earlier is far more beneficial.
Since we compute a single model (not a whole task/WU) pretty quickly, it's not so important here, and running one WU per core outweighs any potential dual-processor performance. (At the moment ;-)
Team mauisun.org
ID: 29346
Michael

Joined: 12 Oct 06
Posts: 16
Credit: 51,712
RAC: 0
Message 29348 - Posted: 14 Oct 2006, 17:20:01 UTC

Could the Cell use its special abilities to help Rosetta@home, or are only clock cycles useful to Rosetta? For instance, which half of Roadrunner would be more useful to Rosetta?
Michael
Join Team Zenwalk
ID: 29348
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 29459 - Posted: 16 Oct 2006, 13:46:59 UTC

Most of the calculations in Rosetta are single-precision floating point, so the Cell processor could well handle them. However, Rosetta is at the moment written as a single-threaded application, so to make use of a multi-core Cell you'd have to run several copies of Rosetta on the same machine...

The K8 (Opteron, etc.) is also pretty good at floating point, so a set of 16,000 of those wouldn't be a bad thing for Rosetta. Assuming they do 2.6 GHz like my dual-core ones, they would give about 4,000 x 1,500 => 6 million credits per day, which is pretty decent.

Clock cycles are about as useful a measure of CPU performance as the rev count is of an engine... As a number on its own, it's perfectly useless. My friend Denis's motorcycle has a red line of 22,000 RPM; 7-time World Champion Valentino Rossi's only goes to around 17,000. But I can assure you that Denis's bike ONLY beats Vale's bike on that particular measurement... because Vale's bike is a 1000cc special-built race bike producing 200+ bhp, while Denis's is an 8-or-so-year-old standard Honda CBR250RR giving around 30-odd bhp (which is quite respectable for a 250cc four-stroke).

--
Mats


ID: 29459
Michael

Joined: 12 Oct 06
Posts: 16
Credit: 51,712
RAC: 0
Message 29472 - Posted: 16 Oct 2006, 18:23:14 UTC

I have a hyperthreaded CPU and Rosetta automatically runs two jobs, one on each "processor". Couldn't I simply request 32,000 jobs if I had the Roadrunner?
Michael
Join Team Zenwalk
ID: 29472
Profile dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,675,695
RAC: 11,002
Message 29475 - Posted: 16 Oct 2006, 19:39:57 UTC - in response to Message 29472.  

I have a hyperthreaded CPU and Rosetta automatically runs two jobs, one on each "processor". Couldn't I simply request 32,000 jobs if I had the Roadrunner?

yeah - be sure to warn bakerlabs first if you get the chance tho!
ID: 29475
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 29484 - Posted: 16 Oct 2006, 22:55:40 UTC

They don't mention whether RoadRunner will have enough RAM to run 32,000 Rosetta clients (8 to 16 terabytes), or any willingness to trial-run a system built for nuclear waste research on medical research instead. And we'd still need a Cell-based client (which is where most of the floating-point ability comes from). We've got around 60k Intel and AMD CPUs (plus the IBM CPUs in the Macs) producing about 40? teraflops.
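That RAM figure works out if you assume roughly 256 to 512 MB per Rosetta task (the per-task footprint is my assumption; the real number varies by WU):

```python
clients = 32_000
mb_low, mb_high = 256, 512   # assumed per-task RAM footprint

tb_low = clients * mb_low / 1024 / 1024    # MB -> GB -> TB
tb_high = clients * mb_high / 1024 / 1024
print(f"{tb_low:.0f} to {tb_high:.0f} TB")  # → 8 to 16 TB
```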

It seems like every time you get newer hardware, the software folks find a way of using it, and not just to run something faster. My 1988 Everex 386 at 20 MHz was at least 1,000 times slower than my current system at home, yet Windows 3.0 booted up faster on that Everex than WinXP does on my current system.

When we get hardware that can crunch through the current 60- and 120-minute first decoys in a minute, the Rosetta team will come out with new approaches that do things in much more detail, with much higher computational requirements, just to produce results with lower RMSD and a higher likelihood of identifying the right result. And it'll take the same amount of time on the new hardware as the decoys take on the current hardware. Programmers and researchers... /e rolls eyes. *grin*


ID: 29484
Michael

Joined: 12 Oct 06
Posts: 16
Credit: 51,712
RAC: 0
Message 29497 - Posted: 17 Oct 2006, 8:04:07 UTC
Last modified: 17 Oct 2006, 8:06:16 UTC

I have to say operating systems are a completely different story from scientific calculations. Finding ways to waste hardware is much easier with things like virus scanners and crazy OS crap than with scientific calculations. Compare the 1988 equivalent of a Tablet PC running 40 different TSRs vs. the 1988 equivalent of a human protein; one of them has changed a lot.

Plus Windows sucks...
Michael
Join Team Zenwalk
ID: 29497
Profile River~~

Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 29506 - Posted: 17 Oct 2006, 10:48:37 UTC

Some of the initial WUs in a new series can have decoys that are very fast; somebody reported more than 200 in an hour once, if I remember rightly. The project uses these to work out the parameters for the main WUs in the run.

However, you can't choose what you get. Even if you could ask for a 60 second WU, you would not usually get one. As people with slow boxes know, the software always runs a complete decoy even if it goes well over the target run time.

Probably one of the reasons the project doesn't offer a run time of less than an hour is that it would annoy many people if they could ask for a 1-minute WU and then got one that ran for 45 minutes, or even over an hour. The longest WU I had on a 1-hour setting was 1 hr 45 min (at 697 MHz). That is how long that WU would have lasted had it been possible to ask for a 1-minute WU.
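That decoy-boundary behaviour can be modelled in a few lines (a simplification; the real client's scheduling is more involved):

```python
import math

def actual_run_time(target_s, decoy_s):
    # Model of the behaviour described above: the app checks the run-time
    # preference only at decoy boundaries, and always finishes at least
    # one whole decoy regardless of the setting.
    return max(1, math.ceil(target_s / decoy_s)) * decoy_s

# A 1-hour preference with 105-minute decoys runs 1 h 45 min, as reported:
print(actual_run_time(3600, 6300) / 60)   # → 105.0
# And a hypothetical "1-minute" preference would still cost one full decoy:
print(actual_run_time(60, 6300) / 60)     # → 105.0
```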

Another reason for having a minimum length is that there is a database overhead for every WU - so filling the server up with returned WU with only one decoy in each would stress the db servers.

If a WU has run for more than one hour, you can get it to exit at the next checkpoint by setting a 1hr run time in the prefs and updating. If it has run for less than an hour, you might have to wait for more than one checkpoint, but it should stop reasonably soon.

Can you really not wait that odd extra hour when running a project down? If not, then a bit of editing can save the checkpointed work even under an hour.

Usual safety warnings apply here - don't do it if you mind breaking something!

Preferably just after Rosetta checkpoints (as seen by a jump in the % complete figure), stop BOINC. Edit the file account_boinc.bakerlab.org.xml (the filename may or may not have _rosetta in it as well); Notepad can be used to do this.

In the lines that say something like

<cpu_run_time>3600</cpu_run_time>

edit the 3600 to 1. There could be from one to four of these lines, depending on which venues you use. Edit them all, or look at the structure of the file and edit the relevant one. You have now told the client that you want a 1-second WU. The client and the Rosetta app are both happy with this, even though it is not a value offered on the website.

Restart BOINC. Your task will restart, immediately find it has overrun, and complete nicely, upload, etc. Next time your client updates from Rosetta the settings from the website will be re-instated.
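For the curious, the manual edit could be scripted. This is only a sketch of the substitution itself, and the same warnings apply: stop BOINC first, keep a backup, and remember the filename may differ on your install.

```python
import re

def set_cpu_run_time(xml_text, seconds=1):
    # Rewrite every <cpu_run_time> value; there can be one per venue,
    # and this catches them all, as described above.
    return re.sub(r"<cpu_run_time>\d+</cpu_run_time>",
                  f"<cpu_run_time>{seconds}</cpu_run_time>", xml_text)

sample = "<cpu_run_time>3600</cpu_run_time>"
print(set_cpu_run_time(sample))   # → <cpu_run_time>1</cpu_run_time>
```

You would read the account file, pass its contents through set_cpu_run_time, and write it back, then restart BOINC as described.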

If you had any unstarted WUs, you will find they each run for a single decoy; the app does not do the run-time test till it has been round the loop once. It might be better to abort and report unstarted and uncheckpointed work so it can be given to someone else.

Please don't make a habit of doing this just for fun - it fills up the Rosetta databases unnecessarily - but if it saves some work when you have to shut down unexpectedly, then I reckon you'd be forgiven ;-)

River~~

PS

If you do find you have broken something and Rosetta will not run, reset the project. In that case you will have lost the work done for Rosetta. A bad edit of that file shouldn't harm any other project you are running, but usual disclaimers apply.

ID: 29506
Profile River~~

Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 29507 - Posted: 17 Oct 2006, 11:02:50 UTC - in response to Message 29315.  
Last modified: 17 Oct 2006, 11:40:58 UTC

Do you know why WU are limited to a single thread?


Good question. Acknowledgements to FluffyChicken for a short answer earlier in the thread. Here's a longer one for those who like the details.

Actually almost all DC projects have WU that are single threaded, and for a good reason.

Rosetta, like any DC project, actually runs in thousands of threads, one on your box, two on mine, one on his, four on hers, etc etc.

When designing any DC project the work is split into the smallest autonomous units. On Rosetta these run for a few minutes to an hour or so. On CPDN they run for weeks or months.

Usually, by the time the work is split into the smallest convenient unit, there is no more room to split it amongst threads. Threading would require part of a WU to need input from two autonomous earlier chunks of work. That is not an impossible situation, but in fact a rare one. In practice, if a WU is threadable, then the designers of that DC project have not subdivided the work as much as they might.

In addition, Rosetta runs in tight loops, so that there is the max chance that the iterated code stays in the cache. Running two threads would mean that both loops would need to live in the cache, or you'd get a cache wait every time processing swapped from one thread to another. You design for just one CPU, remember.

If you happen to have a second CPU, you don't put a thread on it, you put another WU on it entirely, perhaps even from another project. Or you let the owner use one CPU for the keyboard/Word/Excel/etc. and keep the other for the project (hence the option in the prefs).

River~~

edit: typos, added acknowledgment to FC's earlier answer
ID: 29507
Michael

Joined: 12 Oct 06
Posts: 16
Credit: 51,712
RAC: 0
Message 29508 - Posted: 17 Oct 2006, 11:52:40 UTC

It sounds like the basic unit is the decoy, and work units are made up of decoys. Are there other work unit "ingredients"? How does Rosetta decide how many ingredients go into a single work unit? Does it limit work units by the number of ingredients, the number of bytes, or both?

My inquiry is about how to complete an average work unit in under one minute using hardware upgrades and soft division. I am not interested in work units that would only take one minute on current hardware. I want to know what it would take to empty the WU queue.
Michael
Join Team Zenwalk
ID: 29508



©2024 University of Washington
https://www.bakerlab.org