Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Jean-David Beyer Joined: 2 Nov 05 Posts: 188 Credit: 6,431,332 RAC: 5,665
... they're sitting there (headless for the most part) doing nothing but running boinc.

My Linux machine runs lots of processes. It has 16 cores and 128 GBytes of RAM. As far as Boinc is concerned, the main process is the Boinc Client. It uses very little RAM and very little CPU time. From time to time, the Boinc client sends a message to a Boinc server and asks for work. The server sends a reply either complaining that it cannot find any work, or containing a bunch of messages describing the files the client should download. In the latter case, the client downloads the files to the proper places. Then, if the client has spare cores, it selects one and forks off a process to run the task.

So let us say there are no Boinc tasks running and the client has just received a task from the Rosetta server. The client then forks off the Rosetta task.

top - 19:12:56 up 16 days, 8:42, 2 users, load average: 13.38, 13.32, 13.29
Tasks: 483 total, 14 running, 469 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.9 us, 0.3 sy, 80.6 ni, 18.0 id, 0.0 wa, 0.2 hi, 0.1 si, 0.0 st
MiB Mem : 128086.0 total, 5047.0 free, 7395.4 used, 115643.6 buff/cache
MiB Swap: 15992.0 total, 15687.0 free, 305.0 used. 116733.0 avail Mem

PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND
3176351 2043 boinc 39 19 R 596760 0.5 99.0 13 10:12.79 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
3161135 2043 boinc 39 19 R 581420 0.4 99.3 2 121:33.16 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
3111703 2043 boinc 39 19 R 541240 0.4 99.1 9 455:40.07 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
3163687 2043 boinc 39 19 R 481148 0.4 99.2 10 103:13.41 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
3144411 2043 boinc 39 19 R 443480 0.3 99.1 6 233:56.51 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+
2043 1 boinc 30 10 S 54708 0.0 0.1 8 300278:26 /usr/bin/boinc
3171024 2043 boinc 39 19 R 39676 0.0 99.3 4 48:38.05 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3166711 2043 boinc 39 19 R 39668 0.0 99.3 11 80:07.82 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3171561 2043 boinc 39 19 R 39584 0.0 99.2 0 44:34.46 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3167425 2043 boinc 39 19 R 39520 0.0 99.3 7 75:58.11 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3176944 2043 boinc 39 19 R 39172 0.0 99.4 15 5:33.72 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3172039 2043 boinc 39 19 R 39116 0.0 99.3 3 41:39.57 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
3176627 2043 boinc 39 19 R 36824 0.0 99.4 1 8:20.14 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_i686-pc-l+
3141011 2043 boinc 39 19 R 29944 0.0 99.3 5 258:04.99 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+

PID is the process Id; PPID is the PID of the process's parent. Pid 1 is the process that, directly or indirectly, starts all the other processes. One of the processes it starts is Pid 2043, which is my Boinc Client, /usr/bin/boinc. This client starts all the others.
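If you just want to see that parent/child relationship without the rest of top's output, a single ps command will draw it as a tree. A minimal sketch, assuming the client runs as the user boinc (the default for most Linux packages):

# show the BOINC client and the science tasks it has forked (assumes the client runs as user boinc)
$ ps -u boinc -o pid,ppid,ni,rss,pcpu,etime,comm --forest

The --forest option indents each child under its parent, so /usr/bin/boinc appears at the top and the Rosetta, Einstein and WCG science applications appear beneath it; the pid, ppid, ni, rss, pcpu, etime and comm columns give roughly the same information as the top listing above.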
Bill Swisher Joined: 10 Jun 13 Posts: 36 Credit: 33,183,499 RAC: 43,338
I'm running openSUSE on all of my computers. Here's the one that caused the problem:

top - 16:42:35 up 2 days, 6:10, 2 users, load average: 33.55, 33.40, 33.38
Tasks: 475 total, 34 running, 441 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 99.8 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31927.27+total, 22505.63+free, 8120.531 used, 1792.707 buff/cache
MiB Swap: 2048.062 total, 2048.062 free, 0.000 used. 23806.74+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
61822 boinc 39 19 78392 35748 2304 R 100.3 0.109 100:31.99 wcgrid_mcm1_map
61824 boinc 39 19 78232 38868 2048 R 100.3 0.119 99:54.31 wcgrid_mcm1_map
62632 boinc 39 19 825532 674520 70144 R 100.3 2.063 249:05.56 rosetta_4.20_x8
62921 boinc 39 19 903184 753668 70400 R 100.3 2.305 217:03.78 rosetta_4.20_x8
63867 boinc 39 19 78524 38568 2048 R 100.3 0.118 108:19.06 wcgrid_mcm1_map
64033 boinc 39 19 78392 34544 2304 R 100.3 0.106 89:55.77 wcgrid_mcm1_map
64063 boinc 39 19 78396 40032 2048 R 100.3 0.122 87:05.91 wcgrid_mcm1_map
60802 boinc 39 19 2272136 2.027g 76288 R 100.0 6.502 448:17.00 rosetta_4.20_x8
61312 boinc 39 19 771740 617852 70144 R 100.0 1.890 387:41.08 rosetta_4.20_x8
61344 boinc 39 19 819136 665404 70400 R 100.0 2.035 382:14.14 rosetta_4.20_x8
61727 boinc 39 19 78640 40168 2304 R 100.0 0.123 109:32.81 wcgrid_mcm1_map
62042 boinc 39 19 956000 801992 70144 R 100.0 2.453 311:50.22 rosetta_4.20_x8
63796 boinc 39 19 78232 38764 2304 R 100.0 0.119 115:34.03 wcgrid_mcm1_map
63805 boinc 39 19 78232 39828 2304 R 100.0 0.122 112:30.09 wcgrid_mcm1_map
63859 boinc 39 19 78536 38488 2304 R 100.0 0.118 110:09.20 wcgrid_mcm1_map
63873 boinc 39 19 78232 39728 2304 R 100.0 0.122 107:39.82 wcgrid_mcm1_map
63877 boinc 39 19 78392 38792 2048 R 100.0 0.119 106:19.39 wcgrid_mcm1_map
63881 boinc 39 19 78524 38808 2304 R 100.0 0.119 105:28.10 wcgrid_mcm1_map
63885 boinc 39 19 78232 38796 2304 R 100.0 0.119 104:28.67 wcgrid_mcm1_map
63976 boinc 39 19 78392 39360 2304 R 100.0 0.120 92:57.77 wcgrid_mcm1_map
64022 boinc 39 19 78468 39464 2304 R 100.0 0.121 92:03.05 wcgrid_mcm1_map
64027 boinc 39 19 78392 39200 2304 R 100.0 0.120 90:19.73 wcgrid_mcm1_map
64061 boinc 39 19 78652 39572 2304 R 100.0 0.121 87:13.84 wcgrid_mcm1_map
64067 boinc 39 19 78536 38944 2304 R 100.0 0.119 86:28.16 wcgrid_mcm1_map
64757 boinc 39 19 78260 39200 2304 R 100.0 0.120 4:00.44 wcgrid_mcm1_map
61710 boinc 39 19 78592 39564 2304 R 99.67 0.121 114:27.44 wcgrid_mcm1_map
61732 boinc 39 19 78392 39740 2304 R 99.67 0.122 108:00.30 wcgrid_mcm1_map
63854 boinc 39 19 78232 39136 2304 R 99.67 0.120 111:13.06 wcgrid_mcm1_map
64029 boinc 39 19 78580 39696 2304 R 99.67 0.121 90:13.18 wcgrid_mcm1_map
64038 boinc 39 19 78292 38760 2304 R 99.67 0.119 87:54.97 wcgrid_mcm1_map
61814 boinc 39 19 78392 39304 2304 R 99.34 0.120 102:28.23 wcgrid_mcm1_map
63851 boinc 39 19 78232 39780 2304 R 99.34 0.122 112:01.97 wcgrid_mcm1_map

Notice the size of the rosetta processes.

I've gone in and created the app_config, as root, to control how many rosetta processes can run.

# cd /var/lib/boinc/projects/boinc.bakerlab.org_rosetta/
# cat app_config.xml
<app_config>
  <app>
    <name>rosetta_beta</name>
    <max_concurrent>6</max_concurrent>
  </app>
  <app>
    <name>rosetta</name>
    <max_concurrent>6</max_concurrent>
  </app>
</app_config>

I have a newer machine running and the rosetta processes are much smaller. Dunno why. I guess the answer is to go get more memory. Unfortunately I won't be near that computer until April.
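One footnote on the app_config approach, in case it helps anyone else: the running client does not notice a new or edited app_config.xml on its own. A minimal sketch, assuming boinccmd is installed and is talking to the local client (otherwise restarting the BOINC service achieves the same thing):

# tell the running client to re-read its configuration files
# (in current clients this also re-reads project app_config.xml files)
boinccmd --read_cc_config

On older clients, or if in doubt, restarting the client (e.g. systemctl restart boinc-client, service name varies by distro) should also make the max_concurrent limits take effect.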
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
I guess my question is... that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that?
There is no BOINC process. Those BOINC settings limit the amount of RAM, disk space, network activity, CPU usage etc. available to all projects that run under BOINC.
Setting a large swap-file value allows Tasks that claim to require massive amounts of memory to start, even though they don't actually use that much memory once they are running. So systems with limited amounts of RAM can still run Tasks that nominally require a lot of it, as long as the RAM they have (and have made available to BOINC) is more than the Tasks actually need. If the Tasks really do need more RAM than BOINC can make use of, the large swap file still lets them run, but with massive amounts of swapping: with an SSD the system will be sluggish at best; with a HDD it will probably grind to a non-responsive halt, depending on just how much more RAM the Task(s) need and how many of them are trying to run.
Grant Darwin NT
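For anyone who prefers to see where those limits actually live: when you set local preferences in the Manager they are written to global_prefs_override.xml in the BOINC data directory. A rough sketch only, with example values rather than anything taken from this thread, and assuming the usual tag names and data-directory path (check your own file before copying anything):

# cat /var/lib/boinc/global_prefs_override.xml   (memory-related entries only)
<global_preferences>
  <!-- example values, not taken from this thread -->
  <ram_max_used_busy_pct>50</ram_max_used_busy_pct>   <!-- RAM BOINC may use while the computer is in use -->
  <ram_max_used_idle_pct>90</ram_max_used_idle_pct>   <!-- RAM BOINC may use while the computer is idle -->
  <vm_max_used_pct>75</vm_max_used_pct>               <!-- "Page/swap file: use at most" -->
</global_preferences>

These percentages apply to everything running under BOINC as a whole, not to any one project or task.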
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
And boinc-process is back up and running. That must be one of its shortest outages yet. Grant Darwin NT
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
Guess I broke it.
If only any of us were that powerful...

So, to continue on with the subject of memory usage, now that I've calmed down, and managed to place a limit on rosetta on the remote computers.
Personally I agree with your choice rather than the suggestion of restricting RAM so that there's even less space to run.

I guess my question is... that's the "Boinc" process right? When the "Rosetta" process kicks off how does it interact with that?
Aiui yes.

Then there's the whole swap thingie (the swap space on my computers is set at the pretty much standard 2GB per machine). Any wizards out there who can explain it to a maroon like me?
I certainly don't understand the exact mechanics of how this works, but we have had the experience in the past where disk space became a limitation and the solution wasn't entirely obvious. I do recall that my original setting was 10-20GB rather than 2GB, so that's one thing, but even that became an issue/restriction when the problem arose before. I recall raising it to 500GB and it still not solving the problem back then. The solution turned out to be not having any restriction at all; that is, on the Disk tab, unselecting "Use no more than xx GB". Combined with that I'm currently using "Use no more than 90% of total" disk space and Page/swap file: Use at most 75%.

In the situation where you're currently running headless, I hope none of these settings will be a problem on those hosts. I have no idea whether this solution to an old problem also solves your current one, but I am surprised you're reporting the problem at all as I haven't heard anyone experiencing anything similar in recent years. Nothing to lose by trying anyway.
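For what it's worth, those Disk tab settings end up in the same global_prefs_override.xml file sketched above. Again a rough sketch under assumptions, not a verified recipe: the tag names are the usual global-preferences ones, and as far as I know a value of 0 in disk_max_used_gb corresponds to unticking "Use no more than xx GB" (i.e. no fixed limit):

# cat /var/lib/boinc/global_prefs_override.xml   (disk-related entries only)
<global_preferences>
  <!-- assumed tag names and meanings; verify against your own file -->
  <disk_max_used_gb>0</disk_max_used_gb>      <!-- 0 = no "use no more than xx GB" cap, as far as I know -->
  <disk_max_used_pct>90</disk_max_used_pct>   <!-- "Use no more than 90% of total" disk -->
</global_preferences>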
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
And boinc-process is back up and running. That must be one of its shortest outages yet.
That's what I originally came here to say. Not that it matters a great deal with current task availability, but still...
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015
Guess I broke it.
No, the percentage you set covers all the programs running under the Boinc user, so the Boinc Client, the Manager and all of the (e.g. Rosetta) work units are within the same pot, and there's no "interaction" between the one and the other. The swap works the same way: the percentage covers all memory required by Boinc over and above the physical memory present.
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
And boinc-process is back up and running. That must be one of its shortest outages yet.
Some 40-50k tasks came available maybe 12-14hrs ago, and it seems another 800k in the last few hours.
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015
So far one error:-
<core_client_version>8.0.4</core_client_version>
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictApplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
Unfortunately, one of the usual ones.
Grant Darwin NT
Dr Who Fan Joined: 28 May 06 Posts: 70 Credit: 267,358 RAC: 452
Upload/Download SERVER(s) appear to be off-line again but the server status page is all green:
11/23/2024 16:25:35 Internet access OK - project servers may be temporarily down.
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015
ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictApplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
Unfortunately, one of the usual ones.
Yes, presumably a definition error for the molecule being tested.
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
I'm not having any issues at all.
Grant Darwin NT
Dr Who Fan Joined: 28 May 06 Posts: 70 Credit: 267,358 RAC: 452
Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
I'm not having any issues at all.
Did a manual retry a few minutes ago and they downloaded successfully.
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
I'm not having any issues at all.
I didn't see it here at Rosetta, but for 7 or 10 days it was happening to everyone at WCG, and each of 6 files per upload needed 5-10 tries on tasks that uploaded and downloaded 4-6 times as often. If anything happened at Rosetta in that time it was lost among the 40 files waiting to transfer to WCG at any one time.
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015
Currently experiencing transient HTTPS errors on probably half of the downloads, and this has been going on for maybe 4 days. Some downloads have taken 15 retries to clear.
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982
I didn't see it here at Rosetta, but for 7 or 10 days it was happening to everyone at WCG and each of 6 files per upload needed 5-10 tries on tasks that uploaded and downloaded 4-6 times as often.
It's weird that I'm just as susceptible as anyone else to those errors coming from WCG, but don't see any here at Rosetta. The only solution I know is manually retrying for as long as it takes.
mmonnin Joined: 2 Jun 16 Posts: 59 Credit: 24,222,307 RAC: 83,030
I have to retry all the time to download tasks here at Rosetta, which is something new for Rosetta. Some retries work on the 1st attempt and others won't download after a dozen attempts. I've even aborted a task to download more work, and those new ones will download. It's typically the smaller files from Rosetta that need retries.
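If anyone gets tired of clicking Retry in the Manager, the same thing can be done from the command line. A minimal sketch, assuming boinccmd can reach the local client; the file name below is only a placeholder for whatever --get_file_transfers reports as stuck:

# list transfers that are currently pending or backed off
boinccmd --get_file_transfers

# retry one of them by project URL and file name (placeholder file name, not a real file)
boinccmd --file_transfer https://boinc.bakerlab.org/rosetta/ some_stuck_input_file retry

Wrapping the second command in a small loop over the stuck file names is straightforward if a batch of downloads is backed off at once.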
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647
Still no signs of file transfer issues in my Event log; it sounds like there is some sort of network issue between ISPs. Grant Darwin NT
Bill Swisher Joined: 10 Jun 13 Posts: 36 Credit: 33,183,499 RAC: 43,338
transient http errors
As a snowbird I relocated earlier this month. Between the time I shut down one computer, packed this one in my checked baggage, and turned it on at the new location, WCG started giving me errors. LOTS of errors, on multiple computers. Thinking it was because I switched ISPs, I diddled around with it a lot before I did some real testing. First I fired up the VPN and used a place in Europe as my gateway: no change. Then I really got serious and made ssh connections to the computers back where I live most of the time. They were clogged up also. After about a week and a half things settled down and traffic to WCG went back to normal. Then Rosetta hiccuped a few times. At the moment all seems to be OK.