Message boards : Number crunching : PURGE facility please
Author | Message |
---|---|
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
FluffyChicken said in another thread
These are clearly suggestions for BOINC rather than Rosetta, and maybe we need a separate board for BOINC suggestions. In the meantime this seems the best place to discuss this.
I strongly agree with the PURGE idea. Many projects are doubly redundant. For example, Einstein wants a minimum redundancy factor of three, so that there are three copies of each answer before they accept it as valid. This redundancy is scientifically acceptable, as the variation, if any, between answers gives a clue to the reliability of the result. But Einstein actually sends out four results for each WU. This extra redundancy is so that at least three results are likely to come back even if one fails. This makes sense for them as they have a constraint on the number of 'In Progress' WU. Similar practices make sense on Predictor because scientists are waiting to use the results, and on LHC because the results of one round will help them create the WU for the next round.
However, it also means that often, by the time someone with a larger cache starts crunching, the WU is already complete. It has been validated and has a canonical result. The donor gets credit but their crunching does not really contribute to the science -- it was just there in effect as a safety net.
The purge jobs facility would enable those projects to issue over-redundant sets of results but cancel those that had not started. Sometimes it would be too late -- all four results would still come back. Other times the slowest result would be replaced by more useful work. More throughput for no extra CPU time.
Purging a result where crunching had started would be unfair on the donor unless you could also give them credit for the work done so far. This would need a bigger change to the code and might not be worthwhile. Purging unstarted results where the WU has found its canonical result would be a good idea.
AUTO jobs. Sorry, I don't like this idea.
There is too much scope for cheating if the project loses control of who has which WU to crunch. Instead I'd favour the PURGE idea combined with a deliberate over-supply of work, knowing that much of it could be safely purged at the next connect. |
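A rough sketch of the purge rule described above: cancel only results that have not been started or returned, and only when their WU already has a canonical result. The `Result` record and the `results_to_purge` helper are hypothetical names made up for illustration; this is not BOINC's actual server code or database schema.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record, not BOINC's real schema.
    result_id: int
    workunit_id: int
    started: bool   # has the host begun crunching this result?
    returned: bool  # has the host already reported it?


def results_to_purge(results, canonical_workunits):
    """Pick results that look safe to cancel at the host's next connect.

    A result qualifies only if its WU already has a canonical result
    and the host has neither started nor returned it.
    """
    return [
        r for r in results
        if r.workunit_id in canonical_workunits
        and not r.started
        and not r.returned
    ]


results = [
    Result(1, 100, started=False, returned=False),  # unstarted, WU done: purge
    Result(2, 100, started=True, returned=False),   # already crunching: keep
    Result(3, 101, started=False, returned=False),  # WU not yet validated: keep
]
purge = results_to_purge(results, canonical_workunits={100})
print([r.result_id for r in purge])  # [1]
```

Started results are left alone here, since purging them would need the partial-credit change mentioned above.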
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
The only fly in the ointment is that the cancel signal can only be sent when the computer holding the work to be canceled contacts the scheduler. Einstein@Home uses a fairly short deadline to prevent the work from "building up" too much. Mostly what SETI@Home and Einstein@Home are doing is issuing one extra result to try to short-circuit the "reissue and extend the deadline" problem. This allows a faster purge of the result data files and the database. |
Honza Joined: 18 Sep 05 Posts: 48 Credit: 173,517 RAC: 0 |
A purge job is not a new idea. Actually, it has been implemented on CPDN already. There is a slight difference: the purge job on CPDN SpinUp is there to terminate an ongoing model (200 years or 3,000 CPU hours is a long WU). This so-called trickle-down is meant to terminate a model on the core team's decision when they see that a particular model does not need to be finished in full (e.g. 150 years is enough, 'cause it has already gone stable). I agree that some aspects of the BOINC scheduler could be improved to make computing more efficient.
Another approach to this problem might be for the scheduler not to act simply like a FIFO queue but to send WUs to hosts according to their Average turnaround time. This might be used in several cases:
- when results are needed really soon, send them to hosts with a low Average turnaround time
- send test WUs to such hosts so that the team knows fast if a new application/WU works
- if a WU fails, re-send it to such a host.
Such an approach might even shorten the "to validate" queue and hence lower disk space use. It may also prevent users from waiting too long for credit. I think that such an approach is suitable for any project, with or without WU redundancy. |
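The turnaround-based dispatch suggested above could look roughly like this. It is a minimal sketch under assumptions: the host names, the numbers and the `pick_hosts` helper are invented for illustration, and nothing here reflects how the real BOINC scheduler chooses hosts.

```python
def pick_hosts(avg_turnaround_days, count):
    """Return the `count` host names with the lowest average turnaround.

    `avg_turnaround_days` maps a host name to its average turnaround in days.
    """
    return sorted(avg_turnaround_days, key=avg_turnaround_days.get)[:count]


# Hypothetical hosts and average turnaround times (days).
hosts = {"host_a": 2.5, "host_b": 0.4, "host_c": 1.1}

# Urgent WUs, test WUs and re-issues would go to the fastest hosts first.
urgent_targets = pick_hosts(hosts, 2)
print(urgent_targets)  # ['host_b', 'host_c']
```

Ranking on a single average is the simplest possible choice; it says nothing about how a host will do on a WU of a different length.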
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,688,048 RAC: 10,544 |
I'd like to see some form of queue as standard - I've set my profile to store enough work for 1 day, but by default it's too easy to run out of work due to network problems at one end or the other. I think having a purge function is definitely a good thing too - although I think this can be controlled by smart allocation of jobs in the first place - send urgent jobs to quick/reliable crunchers. |
Ingleside Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
Well... Let's make an example with Seti_Enhanced... Host A uses cache-size 0.5 days and uses 9h-135h on a result, while host B uses cache-size 0.1 days and uses 6 days on a VHAR. If host A has just crunched through a bunch of "slow" results, and B some VHAR, they can be paired together with similar "Average turnaround time". If they get assigned a VHAR, host A will report 5 days before B. If instead they get assigned a "slow" result, host A will report 80+ days before B... Let's add another host, C, that uses 1h-15h on a result, and uses cache-setting 0.25 days. If host A has just crunched some VHAR, its average turnaround-time would be 0.875 days, while if C has crunched some "slow" results, its average turnaround-time will also be 0.875 days... Both of these computers can be seen as having fast turnaround-time, and can be chosen as test-computer, or for re-issue. But, if it's a "normal" result, host A will use 4.6 days, while host C will use 0.7 days... |
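The 0.875-day figures can be reproduced if turnaround is taken as cache size plus crunch time. That model is an assumption inferred from the numbers in the post, not a definition from BOINC, and `turnaround_days` is a made-up helper name.

```python
def turnaround_days(cache_days, crunch_hours):
    """Approximate turnaround as time waiting in the cache plus crunch time.

    Assumed model: the result sits in the cache for `cache_days`
    and then takes `crunch_hours` of crunching before it is reported.
    """
    return cache_days + crunch_hours / 24


# Host A: 0.5-day cache, 9 h on a VHAR (its fastest case).
host_a_vhar = turnaround_days(0.5, 9)    # 0.875
# Host C: 0.25-day cache, 15 h on a "slow" result (its slowest case).
host_c_slow = turnaround_days(0.25, 15)  # 0.875
# Host B: 0.1-day cache, 6 days on a VHAR.
host_b_vhar = turnaround_days(0.1, 6 * 24)

print(host_a_vhar, host_c_slow)
print(host_b_vhar - host_a_vhar)  # roughly 5 days, the gap quoted for a VHAR
```

So hosts A and C get the same average from very different workloads, which is why the average alone does not predict how long either will take on the next WU.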
Honza Joined: 18 Sep 05 Posts: 48 Credit: 173,517 RAC: 0 |
"although I think this can be controlled by smart allocation of jobs in the first place - send urgent jobs to quick/reliable crunchers."
Correct - that's what I meant to say. @ Ingleside - it is evident that "smart" scheduling doesn't work easily when there is no "smart" time-to-complete estimate. Take into account that finishing a WU takes about the same time on the same kind of machine. Once a first result is in place, the time-to-complete is better known [when erroring out, a lower time-to-complete is known]. Resending a WU to slow machines with a rare internet connection and not running 24/7 (just to name other aspects of quick/reliable) is not "smart" anyway. If there are WUs that take dozens of hours or even days, such a project should take some knowledge from CPDN and implement trickles - a partial result/progress upload and a message saying that the WU is still being processed. This is another way to help make scheduling smart. |
©2024 University of Washington
https://www.bakerlab.org