Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Jean-David Beyer Joined: 2 Nov 05 Posts: 188 Credit: 6,431,332 RAC: 5,665
... they're sitting there (headless for the most part) doing nothing but running boinc.

My Linux machine runs lots of processes. It has 16 cores and 128 GBytes of RAM. As far as Boinc is concerned, the main process is the Boinc Client. It uses very little RAM and very little CPU time. From time to time, the Boinc client sends a message to a Boinc server and asks for work. The server sends a reply either complaining that it cannot find any work, or containing a bunch of messages describing the files the client should download. In the latter case, the client downloads the files to the proper places. Then, if the client has spare cores, it selects one and forks off a process to run the task.

So let us say there are no Boinc tasks running and the client has just received a task from the Rosetta server. The client then forks off the Rosetta task.

top - 19:12:56 up 16 days, 8:42, 2 users, load average: 13.38, 13.32, 13.29
Tasks: 483 total, 14 running, 469 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.9 us, 0.3 sy, 80.6 ni, 18.0 id, 0.0 wa, 0.2 hi, 0.1 si, 0.0 st
MiB Mem : 128086.0 total, 5047.0 free, 7395.4 used, 115643.6 buff/cache
MiB Swap: 15992.0 total, 15687.0 free, 305.0 used. 116733.0 avail Mem

PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND
3176351 2043 boinc 39 19 R 596760 0.5 99.0 13 10:12.79 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
3161135 2043 boinc 39 19 R 581420 0.4 99.3 2 121:33.16 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
3111703 2043 boinc 39 19 R 541240 0.4 99.1 9 455:40.07 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
3163687 2043 boinc 39 19 R 481148 0.4 99.2 10 103:13.41 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
3144411 2043 boinc 39 19 R 443480 0.3 99.1 6 233:56.51 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+
2043 1 boinc 30 10 S 54708 0.0 0.1 8 300278:26 /usr/bin/boinc
3171024 2043 boinc 39 19 R 39676 0.0 99.3 4 48:38.05 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3166711 2043 boinc 39 19 R 39668 0.0 99.3 11 80:07.82 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3171561 2043 boinc 39 19 R 39584 0.0 99.2 0 44:34.46 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3167425 2043 boinc 39 19 R 39520 0.0 99.3 7 75:58.11 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3176944 2043 boinc 39 19 R 39172 0.0 99.4 15 5:33.72 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3172039 2043 boinc 39 19 R 39116 0.0 99.3 3 41:39.57 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3176627 2043 boinc 39 19 R 36824 0.0 99.4 1 8:20.14 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_i686-pc-l+
3141011 2043 boinc 39 19 R 29944 0.0 99.3 5 258:04.99 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+

PID is the process Id; PPID is the PID of the process's parent. Pid 1 is the process that, directly or indirectly, starts all the other processes. One of the processes it starts is Pid 2043, which is my Boinc Client, /usr/bin/boinc. This client starts all the others.
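If you just want to see that parent/child relationship without the rest of top's output, a single ps command will draw it as a tree. A minimal sketch, assuming the client runs as the user boinc (the default for most Linux packages):

# show the BOINC client and the science tasks it has forked (assumes the client runs as user boinc)
$ ps -u boinc -o pid,ppid,ni,rss,pcpu,etime,comm --forest

The --forest option indents each child under its parent, so /usr/bin/boinc appears at the top and the Rosetta, Einstein and WCG science applications appear beneath it; the pid, ppid, ni, rss, pcpu, etime and comm columns give roughly the same information as the top listing above.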
Bill Swisher Joined: 10 Jun 13 Posts: 36 Credit: 33,183,499 RAC: 43,338
I'm running openSUSE on all of my computers. Here's the one that caused the problem:

top - 16:42:35 up 2 days, 6:10, 2 users, load average: 33.55, 33.40, 33.38
Tasks: 475 total, 34 running, 441 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 99.8 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31927.27+total, 22505.63+free, 8120.531 used, 1792.707 buff/cache
MiB Swap: 2048.062 total, 2048.062 free, 0.000 used. 23806.74+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
61822 boinc 39 19 78392 35748 2304 R 100.3 0.109 100:31.99 wcgrid_mcm1_map
61824 boinc 39 19 78232 38868 2048 R 100.3 0.119 99:54.31 wcgrid_mcm1_map
62632 boinc 39 19 825532 674520 70144 R 100.3 2.063 249:05.56 rosetta_4.20_x8
62921 boinc 39 19 903184 753668 70400 R 100.3 2.305 217:03.78 rosetta_4.20_x8
63867 boinc 39 19 78524 38568 2048 R 100.3 0.118 108:19.06 wcgrid_mcm1_map
64033 boinc 39 19 78392 34544 2304 R 100.3 0.106 89:55.77 wcgrid_mcm1_map
64063 boinc 39 19 78396 40032 2048 R 100.3 0.122 87:05.91 wcgrid_mcm1_map
60802 boinc 39 19 2272136 2.027g 76288 R 100.0 6.502 448:17.00 rosetta_4.20_x8
61312 boinc 39 19 771740 617852 70144 R 100.0 1.890 387:41.08 rosetta_4.20_x8
61344 boinc 39 19 819136 665404 70400 R 100.0 2.035 382:14.14 rosetta_4.20_x8
61727 boinc 39 19 78640 40168 2304 R 100.0 0.123 109:32.81 wcgrid_mcm1_map
62042 boinc 39 19 956000 801992 70144 R 100.0 2.453 311:50.22 rosetta_4.20_x8
63796 boinc 39 19 78232 38764 2304 R 100.0 0.119 115:34.03 wcgrid_mcm1_map
63805 boinc 39 19 78232 39828 2304 R 100.0 0.122 112:30.09 wcgrid_mcm1_map
63859 boinc 39 19 78536 38488 2304 R 100.0 0.118 110:09.20 wcgrid_mcm1_map
63873 boinc 39 19 78232 39728 2304 R 100.0 0.122 107:39.82 wcgrid_mcm1_map
63877 boinc 39 19 78392 38792 2048 R 100.0 0.119 106:19.39 wcgrid_mcm1_map
63881 boinc 39 19 78524 38808 2304 R 100.0 0.119 105:28.10 wcgrid_mcm1_map
63885 boinc 39 19 78232 38796 2304 R 100.0 0.119 104:28.67 wcgrid_mcm1_map
63976 boinc 39 19 78392 39360 2304 R 100.0 0.120 92:57.77 wcgrid_mcm1_map
64022 boinc 39 19 78468 39464 2304 R 100.0 0.121 92:03.05 wcgrid_mcm1_map
64027 boinc 39 19 78392 39200 2304 R 100.0 0.120 90:19.73 wcgrid_mcm1_map
64061 boinc 39 19 78652 39572 2304 R 100.0 0.121 87:13.84 wcgrid_mcm1_map
64067 boinc 39 19 78536 38944 2304 R 100.0 0.119 86:28.16 wcgrid_mcm1_map
64757 boinc 39 19 78260 39200 2304 R 100.0 0.120 4:00.44 wcgrid_mcm1_map
61710 boinc 39 19 78592 39564 2304 R 99.67 0.121 114:27.44 wcgrid_mcm1_map
61732 boinc 39 19 78392 39740 2304 R 99.67 0.122 108:00.30 wcgrid_mcm1_map
63854 boinc 39 19 78232 39136 2304 R 99.67 0.120 111:13.06 wcgrid_mcm1_map
64029 boinc 39 19 78580 39696 2304 R 99.67 0.121 90:13.18 wcgrid_mcm1_map
64038 boinc 39 19 78292 38760 2304 R 99.67 0.119 87:54.97 wcgrid_mcm1_map
61814 boinc 39 19 78392 39304 2304 R 99.34 0.120 102:28.23 wcgrid_mcm1_map
63851 boinc 39 19 78232 39780 2304 R 99.34 0.122 112:01.97 wcgrid_mcm1_map

Notice the size of the rosetta processes.

I've gone in and created the app_config, as root, to control how many rosetta processes can run.

# cd /var/lib/boinc/projects/boinc.bakerlab.org_rosetta/
# cat app_config.xml
<app_config>
  <app>
    <name>rosetta_beta</name>
    <max_concurrent>6</max_concurrent>
  </app>
  <app>
    <name>rosetta</name>
    <max_concurrent>6</max_concurrent>
  </app>
</app_config>

I have a newer machine running and the rosetta processes are much smaller. Dunno why. I guess the answer is to go get more memory. Unfortunately I won't be near that computer until April.
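One footnote on the app_config approach, in case it helps anyone else: the running client does not notice a new or edited app_config.xml on its own. A minimal sketch, assuming boinccmd is installed and is talking to the local client (otherwise restarting the BOINC service achieves the same thing):

# tell the running client to re-read its configuration files
# (in current clients this also re-reads project app_config.xml files)
boinccmd --read_cc_config

On older clients, or if in doubt, restarting the client (e.g. systemctl restart boinc-client, service name varies by distro) should also make the max_concurrent limits take effect.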
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
I guess my question is... that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that?
There is no BOINC process. Those BOINC settings limit the amount of RAM, disk space, network activity, CPU usage etc. available to all projects that run under BOINC.
Setting a large swap-file value allows Tasks that claim to require massive amounts of memory to start, even though they don't actually use that much memory once they are running. So systems with limited amounts of RAM can still run Tasks that nominally require a lot of it, as long as the RAM they have (and have made available to BOINC) is more than the Tasks actually need. If the Tasks really do need more RAM than BOINC can make use of, the large swap file still lets them run, but with massive amounts of swapping: with an SSD the system will be sluggish at best; with a HDD it will probably grind to a non-responsive halt, depending on just how much more RAM the Task(s) need and how many of them are trying to run.
Grant Darwin NT
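For anyone who prefers to see where those limits actually live: when you set local preferences in the Manager they are written to global_prefs_override.xml in the BOINC data directory. A rough sketch only, with example values rather than anything taken from this thread, and assuming the usual tag names and data-directory path (check your own file before copying anything):

# cat /var/lib/boinc/global_prefs_override.xml   (memory-related entries only)
<global_preferences>
  <!-- example values, not taken from this thread -->
  <ram_max_used_busy_pct>50</ram_max_used_busy_pct>   <!-- RAM BOINC may use while the computer is in use -->
  <ram_max_used_idle_pct>90</ram_max_used_idle_pct>   <!-- RAM BOINC may use while the computer is idle -->
  <vm_max_used_pct>75</vm_max_used_pct>               <!-- "Page/swap file: use at most" -->
</global_preferences>

These percentages apply to everything running under BOINC as a whole, not to any one project or task.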
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
And boinc-process is back up and running. That must be one of its shortest outages yet. Grant Darwin NT
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
Guess I broke it.
If only any of us were that powerful...

So, to continue on with the subject of memory usage, now that I've calmed down, and managed to place a limit on rosetta on the remote computers.
Personally I agree with your choice rather than the suggestion of restricting RAM so that there's even less space to run.

I guess my question is... that's the "Boinc" process right? When the "Rosetta" process kicks off how does it interact with that?
Aiui yes.

Then there's the whole swap thingie (the swap space on my computers is set at the pretty much standard 2GB per machine). Any wizards out there who can explain it to a maroon like me?
I certainly don't understand the exact mechanics of how this works, but we have had the experience in the past where disk space became a limitation and the solution wasn't entirely obvious. I do recall that my original setting was 10-20GB rather than 2GB, so that's one thing, but even that became an issue/restriction when the problem arose before. I recall raising it to 500GB and it still not solving the problem back then. The solution turned out to be not having any restriction at all; that is, on the Disk tab, unselecting "Use no more than xx GB". Combined with that I'm currently using "Use no more than 90% of total" disk space and Page/swap file: Use at most 75%.

In the situation where you're currently running headless, I hope none of these settings will be a problem on those hosts. I have no idea whether this solution to an old problem also solves your current one, but I am surprised you're reporting the problem at all as I haven't heard anyone experiencing anything similar in recent years. Nothing to lose by trying anyway.
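For what it's worth, those Disk tab settings end up in the same global_prefs_override.xml file sketched above. Again a rough sketch under assumptions, not a verified recipe: the tag names are the usual global-preferences ones, and as far as I know a value of 0 in disk_max_used_gb corresponds to unticking "Use no more than xx GB" (i.e. no fixed limit):

# cat /var/lib/boinc/global_prefs_override.xml   (disk-related entries only)
<global_preferences>
  <!-- assumed tag names and meanings; verify against your own file -->
  <disk_max_used_gb>0</disk_max_used_gb>      <!-- 0 = no "use no more than xx GB" cap, as far as I know -->
  <disk_max_used_pct>90</disk_max_used_pct>   <!-- "Use no more than 90% of total" disk -->
</global_preferences>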
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
And boinc-process is back up and running. That must be one of its shortest outages yet.
That's what I originally came here to say. Not that it matters a great deal with current task availability, but still...
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015
Guess I broke it.
No, the percentage you set covers all the programs running under the Boinc user, so the Boinc Client, the Manager and all of the (e.g. Rosetta) work units are within the same pot, and there's no "interaction" between the one and the other. The swap works the same way: the percentage covers all memory required by Boinc over and above the physical memory present.
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
And boinc-process is back up and running. That must be one of its shortest outages yet.
Some 40-50k tasks came available maybe 12-14hrs ago, and it seems another 800k in the last few hours.
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015
So far one error:-
<core_client_version>8.0.4</core_client_version>
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictApplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
Unfortunately, one of the usual ones.
Grant Darwin NT
Dr Who Fan Joined: 28 May 06 Posts: 70 Credit: 267,358 RAC: 452
Upload/Download SERVER(s) appear to be off-line again but the server status page is all green:
11/23/2024 16:25:35 Internet access OK - project servers may be temporarily down.
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015
ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictApplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
Unfortunately, one of the usual ones.
Yes, presumably a definition error for the molecule being tested.
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
I'm not having any issues at all.
Grant Darwin NT
Dr Who Fan Joined: 28 May 06 Posts: 70 Credit: 267,358 RAC: 452
Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
I'm not having any issues at all.
Did a manual retry a few minutes ago and they downloaded successfully.
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
I'm not having any issues at all.
I didn't see it here at Rosetta, but for 7 or 10 days it was happening to everyone at WCG, and each of 6 files per upload needed 5-10 tries on tasks that uploaded and downloaded 4-6 times as often. If anything happened at Rosetta in that time it was lost among the 40 files waiting to transfer to WCG at any one time.
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015
Currently experiencing transient HTTPS errors on probably half of the downloads, and this has been going on for maybe 4 days. Some downloads have taken 15 retries to clear.
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
I didn't see it here at Rosetta, but for 7 or 10 days it was happening to everyone at WCG and each of 6 files per upload needed 5-10 tries on tasks that uploaded and downloaded 4-6 times as often.
It's weird that I'm just as susceptible as anyone else to those errors coming from WCG, but don't see any here at Rosetta. The only solution I know is manually retrying for as long as it takes.
mmonnin Joined: 2 Jun 16 Posts: 59 Credit: 24,222,307 RAC: 83,030
I have to retry all the time to download tasks here at Rosetta, which is something new for Rosetta. Some retries work on the 1st attempt and others won't download after a dozen attempts. I've even aborted a task to download more work, and those new ones will download. It's typically the smaller files from Rosetta that need retries.
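If anyone gets tired of clicking Retry in the Manager, the same thing can be done from the command line. A minimal sketch, assuming boinccmd can reach the local client; the file name below is only a placeholder for whatever --get_file_transfers reports as stuck:

# list transfers that are currently pending or backed off
boinccmd --get_file_transfers

# retry one of them by project URL and file name (placeholder file name, not a real file)
boinccmd --file_transfer https://boinc.bakerlab.org/rosetta/ some_stuck_input_file retry

Wrapping the second command in a small loop over the stuck file names is straightforward if a batch of downloads is backed off at once.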
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
Still no signs of file transfer issues in my Event log; it sounds like there is some sort of network issue between ISPs. Grant Darwin NT
Bill Swisher Joined: 10 Jun 13 Posts: 36 Credit: 33,183,499 RAC: 43,338
transient http errors
As a snowbird I relocated earlier this month. Between the time I shut down one computer, packed this one in my checked baggage, and turned it on at the new location, WCG started giving me errors. LOTS of errors, on multiple computers. Thinking it was because I switched ISPs, I diddled around with it a lot before I did some real testing. First I fired up the VPN and used a place in Europe as my gateway: no change. Then I really got serious and made ssh connections to the computers back where I live most of the time. They were clogged up also. After about a week and a half things settled down and traffic to WCG went back to normal. Then Rosetta hiccuped a few times. At the moment all seems to be OK.