Jobs seem to complete OK but have status 'abandoned'

loris Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0	Message 96009 - Posted: 4 May 2020, 12:29:31 UTC
Hi, I am running jobs on a cluster via a resource manager. The batch script I use starts BOINC in the following manner:
    boinc --no_gui_rpc --fetch_minimal_work --exit_when_idle --attach_project ${URL} ${AUTH}
The jobs seem to complete OK and do consume CPU time on the cluster, and there are no errors in the client log. However, the status shown on the R@H website often seems to be 'abandoned'. Is the way I am calling BOINC incorrect?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 96039 - Posted: 4 May 2020, 16:01:08 UTC - in response to Message 96009.
Have you just changed to the new project URL with the S after the http? That has been the only time I've seen "abandoned" work units personally.
Rosetta Moderator: Mod.Sense

loris Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0	Message 96154 - Posted: 6 May 2020, 7:49:00 UTC - in response to Message 96039.
I changed the URL to https and a single job was subsequently completed and validated. However, of an array of 10 jobs started at the same time, 6 completed almost immediately with "exiting because no more results", but I think that is a different problem. I have already added a random delay to prevent too many requests for tasks happening at the same time, but perhaps this delay needs to be longer.
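A random start delay of the kind described above might look roughly like the following sketch. The function names random_delay and start_boinc_staggered and the default maximum of 300 seconds are hypothetical; the boinc flags are the ones from the original post, and URL and AUTH are assumed to be set by the batch script.

```shell
# Hypothetical helper: print a random integer in [0, max_seconds).
# The seed mixes the shell's PID with the current time so that jobs
# started in the same second are unlikely to pick the same delay.
random_delay() {
    max_seconds="$1"
    seed=$(( $$ + $(date +%s) ))
    awk -v max="$max_seconds" -v seed="$seed" \
        'BEGIN { srand(seed); print int(rand() * max) }'
}

# Hypothetical wrapper: wait a random interval, then start the client
# with the flags quoted in the original post.
start_boinc_staggered() {
    sleep "$(random_delay "${1:-300}")"
    boinc --no_gui_rpc --fetch_minimal_work --exit_when_idle \
          --attach_project "${URL}" "${AUTH}"
}
```

In a batch script one would then call, for example, start_boinc_staggered 600 in place of the plain boinc command. Whether a longer delay avoids the "exiting because no more results" exits is not established in this thread.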

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 96168 - Posted: 6 May 2020, 13:04:49 UTC - in response to Message 96154.
Now that the DB is shared across all running R@h tasks, I doubt you need the delays. But I'm not sure what you mean by "happening at the same time": do you mean starting or running? A delay wouldn't change how many eventually get running, so I think you mean you are staggering their start. I doubt you need this now with v4.20.
Rosetta Moderator: Mod.Sense

loris Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0	Message 96202 - Posted: 7 May 2020, 7:15:51 UTC - in response to Message 96168.
Yes, I mean staggering. This does seem to be necessary, although I still got
    07-May-2020 08:57:57 [Rosetta@home] Not sending work - last request too recent: 0 sec
for one of four jobs started one minute apart. What version are you referring to? I have client version 7.16.5.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 96367 - Posted: 11 May 2020, 14:39:45 UTC - in response to Message 96202.
I was referring to the Rosetta version. v4.20 made changes to share the large database directory across all active threads, rather than each expanding its own copy in each slot directory.
Rosetta Moderator: Mod.Sense

Bryn Mawr Joined: 26 Dec 18 Posts: 440 Credit: 15,194,563 RAC: 692	Message 96392 - Posted: 12 May 2020, 12:28:35 UTC
Please pardon my confusion, but why pair --fetch_minimal_work with --exit_when_idle?

loris Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0	Message 96623 - Posted: 19 May 2020, 7:30:59 UTC - in response to Message 96392.
I am not sure I understand your question, but I am trying to set things up so that each job I submit to the cluster just fetches a single R@h task. Currently I am starting single jobs by hand with a separation of a couple of minutes, but each job seems to cause the previous job to be abandoned.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 96624 - Posted: 19 May 2020, 9:03:02 UTC Last modified: 19 May 2020, 9:06:01 UTC
With your computers hidden, helping you is pretty much impossible. Having said that, BOINC is not designed to be run on a cluster, so that is most likely where your issues are. Install BOINC on each system, attach to the project, and then things should work (as long as the hardware is sufficient).
Grant Darwin NT

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 96628 - Posted: 19 May 2020, 15:49:07 UTC - in response to Message 96623.
@loris, if you are submitting tasks to the Robetta server, these message boards are not the place to look for help.
Rosetta Moderator: Mod.Sense

loris Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0	Message 96645 - Posted: 20 May 2020, 6:16:37 UTC - in response to Message 96628.
"@loris, if you are submitting tasks to the Robetta server, these message boards are not the place to look for help."
Where is the correct place? I thought this forum was for questions relating to "Installing and running BOINC on Unix and Linux".

loris Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0	Message 96646 - Posted: 20 May 2020, 6:25:52 UTC - in response to Message 96624.
In what way are my computers hidden? Regarding the cluster, the software is installed (via NFS) on all nodes of the cluster. The problem, I think, has more to do with the way I start the jobs via the scheduling system. Possibly the scheduler is trying to start multiple jobs on one node. Perhaps max_ncpus_pct then applies to all the jobs, so all but one get terminated.
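One way to test the suspicion above, that jobs landing on the same node interfere with each other, would be to give every job its own BOINC data directory. The sketch below assumes the client's --dir and --allow_multiple_clients options; job_data_dir, run_boinc_isolated, BOINC_BASE and the job-id argument are hypothetical names. Nothing in this thread confirms that a shared directory is the actual cause of the abandoned tasks.

```shell
# Hypothetical helper: build a data directory path unique to this node and job.
job_data_dir() {
    base="$1"
    job_id="$2"
    echo "${base}/$(uname -n)-${job_id}"
}

# Hypothetical wrapper: run one client per job in its own data directory.
# URL and AUTH must be set by the caller, as in the original batch script.
run_boinc_isolated() {
    dir="$(job_data_dir "${BOINC_BASE:-$HOME/boinc}" "$1")"
    mkdir -p "$dir" || return 1
    boinc --dir "$dir" --allow_multiple_clients \
          --no_gui_rpc --fetch_minimal_work --exit_when_idle \
          --attach_project "${URL}" "${AUTH}"
}
```

The job id passed to run_boinc_isolated would come from whatever variable the resource manager exports for the running job.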

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 96648 - Posted: 20 May 2020, 8:40:12 UTC - in response to Message 96646. Last modified: 20 May 2020, 8:43:21 UTC
"In what way are my computers hidden?"
    loris
    User ID: 2120609
    Rosetta@home member since: 26 Mar 2020
    Country: International
    Total credit: 3,588
    Recent average credit: 122.29
    Computers: hidden
In your account page, under Preferences, Preferences for this project, "Rosetta@home preferences", the option "Should Rosetta@home show your computers on its web site?" would be unselected.
As I posted before, BOINC was not designed to make use of a cluster. It is for installing on individual computers, and the Manager on each computer is responsible for getting work, downloading the appropriate application as required, and returning and reporting the results.
Grant Darwin NT

loris Joined: 26 Mar 20 Posts: 7 Credit: 3,937 RAC: 0	Message 96656 - Posted: 20 May 2020, 13:11:04 UTC - in response to Message 96648.
Thanks for the info regarding my computers being hidden. As far as installing on a cluster is concerned, I realize that is not what BOINC was designed for. However, since every node essentially behaves as an individual computer, I thought it wouldn't be too hard to get it to work. I'll try running a number of jobs serially and see how that goes.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 96659 - Posted: 20 May 2020, 14:00:48 UTC - in response to Message 96645.
@loris, sorry, we seem to be talking about two different things. So it sounds like you are indeed in the right place. You just (jokingly) have to put up with all of the questions about how you went about setting this up and are trying to run it. The simplest way would be to install BOINC on each machine and let each one make its own connections to the project for work. In that sense, the project never sees a cluster.
Rosetta Moderator: Mod.Sense