Questions and Answers : Unix/Linux : Jobs seem to complete OK but have status 'abandoned'
Author | Message |
---|---|
loris Send message Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0 |
Hi, I am running jobs on a cluster via a resource manager. The batch script I use starts BOINC in the following manner: boinc --no_gui_rpc --fetch_minimal_work --exit_when_idle --attach_project ${URL} ${AUTH} The jobs seem to complete OK and do consume CPU time on the cluster, and there are no errors in the client log. However, the status shown on the R@H website often seems to be 'abandoned'. Is the way I am calling BOINC incorrect? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Have you just changed to the new project URL with the S after the http? That has been the only time I've seen "abandoned" work units personally. Rosetta Moderator: Mod.Sense |
loris Send message Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0 |
I changed the URL to https and a single job was subsequently completed and validated. However, of an array of 10 jobs started at the same time, 6 completed almost immediately with "exiting because no more results", but I think that is a different problem. I have already added some random delay to prevent too many requests for tasks happening at the same time, but perhaps this delay needs to be longer. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Now that the DB is shared across all running R@h tasks, I doubt you need the delays. But I guess I'm not positive what you mean about happening at the same time. Do you mean starting or running? A delay wouldn't change how many eventually get running, so I think you mean you are staggering their start. I doubt you need this now with v4.20. Rosetta Moderator: Mod.Sense |
loris Send message Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0 |
Yes, I mean staggering. This does seem to be necessary, although I still got "07-May-2020 08:57:57 [Rosetta@home] Not sending work - last request too recent: 0 sec" for one of four jobs started one minute apart. What version are you referring to? I have client version 7.16.5. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I was referring to the Rosetta version. v4.20 made changes to share the large database directory across all active threads, rather than each expanding its own copy in each slot directory. Rosetta Moderator: Mod.Sense |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015 |
Please pardon my confusion, but why pair --fetch_minimal_work with --exit_when_idle? |
loris Send message Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0 |
I am not sure I understand your question but I am trying to set things up so that each job I submit to the cluster just fetches a single r@h task. Currently I am starting single jobs by hand with a separation of a couple of minutes, but each job seems to cause the previous job to be abandoned. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1680 Credit: 17,844,443 RAC: 22,972 |
With your computers hidden, helping you is pretty much impossible. Having said that, BOINC is not designed to be run on a cluster, so that is most likely where your issues are. Install BOINC on each system, attach to the project, and then things should work (as long as the hardware is sufficient). Grant Darwin NT |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@loris, if you are submitting tasks to the Robetta server, these message boards are not the place to look for help. Rosetta Moderator: Mod.Sense |
loris Send message Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0 |
"@loris, if you are submitting tasks to the Robetta server, these message boards are not the place to look for help." Where is the correct place? I thought this forum was for questions relating to "Installing and running BOINC on Unix and Linux". |
loris Send message Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0 |
In what way are my computers hidden? Regarding the cluster, the software is installed (via NFS) on all nodes of the cluster. The problem, I think, has more to do with the way I start the jobs via the scheduling system. Possibly it is because the scheduler could try to start multiple jobs on one node. Perhaps max_ncpus_pct then applies to all the jobs, so all but one get terminated. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1680 Credit: 17,844,443 RAC: 22,972 |
"In what way are my computers hidden?" Your profile shows: loris, User ID 2120609, Rosetta@home member since 26 Mar 2020, Country International, Total credit 3,588, Recent average credit 122.29, Computers hidden. In your account page, under Preferences, Preferences for this project, "Rosetta@home preferences", the option "Should Rosetta@home show your computers on its web site?" would be unselected. As I posted before, BOINC was not designed to make use of a cluster. It is for installing on individual computers, and the Manager on each computer is responsible for getting work, downloading the appropriate application as required, and returning the results and reporting them. Grant Darwin NT |
loris Send message Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0 |
Thanks for the info regarding my computers being hidden. As far as installing on a cluster is concerned, I realize that is not what BOINC was designed for. However, since every node essentially behaves as an individual computer, I thought it wouldn't be too hard to get it to work. I'll try running a number of jobs serially and see how that goes. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@loris, sorry Loris, we seem to be talking about two different things. So, it sounds like you are indeed in the right place. You just (jokingly) have to put up with all of the questions about how you went about setting this up and are trying to run it. The simplest way would be to install BOINC on each machine and let each one make its own connections to the project for work. In that sense, the project never sees a cluster. Rosetta Moderator: Mod.Sense |
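The staggered start discussed in the thread could be sketched roughly as below. This is only an illustration, not loris's actual batch script: it swaps the random delay mentioned above for a fixed spacing per array element, and the names TASK_INDEX, SPACING, BOINC_URL, BOINC_AUTH and LAUNCH_BOINC are hypothetical variables invented for the example (the index would have to come from whatever array counter the resource manager provides). Only the boinc command line itself is taken from the first post. Nothing in the thread confirms that spacing the starts cures the 'abandoned' status.

```shell
#!/bin/sh
# Sketch only: stagger BOINC client starts across a job array.
# TASK_INDEX, SPACING, BOINC_URL, BOINC_AUTH and LAUNCH_BOINC are hypothetical
# names chosen for this example; they are not BOINC or scheduler settings.

# Print the start delay in seconds for array element $1,
# with $2 seconds between consecutive starts.
stagger_delay() {
    echo $(( $1 * $2 ))
}

# Print the client command line quoted in the thread
# for project URL $1 and account key $2.
boinc_command() {
    echo "boinc --no_gui_rpc --fetch_minimal_work --exit_when_idle --attach_project $1 $2"
}

# Sleep and launch only when explicitly requested, so the file can be
# sourced or tested without starting a client.
if [ "${LAUNCH_BOINC:-0}" = "1" ]; then
    sleep "$(stagger_delay "${TASK_INDEX:-0}" "${SPACING:-120}")"
    # Left unquoted on purpose so the printed string is split into command words.
    exec $(boinc_command "$BOINC_URL" "$BOINC_AUTH")
fi
```

With LAUNCH_BOINC=1, TASK_INDEX=3 and SPACING=120, the script would wait 360 seconds before starting the client. A fixed spacing keeps two array elements from ever contacting the project in the same second, which a random delay cannot guarantee.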
©2024 University of Washington
https://www.bakerlab.org