Problems and Technical Issues with Rosetta@home


Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1773
Credit: 18,534,891
RAC: 18
Message 110025 - Posted: 13 Nov 2024, 10:08:19 UTC - in response to Message 110024.  

boinc-process is starting to die again.
And it's now dead, Validation backlog starting to develop.
Darwin NT
ID: 110025 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2251
Credit: 42,690,214
RAC: 22,669
Message 110026 - Posted: 13 Nov 2024, 20:23:40 UTC - in response to Message 110025.  

boinc-process is starting to die again.
And it's now dead, Validation backlog starting to develop.

Lol - I've only just noticed. Fortunately there aren't many unreturned tasks it'll affect.
And there was me running my non-Rosetta buffer down...
ID: 110026 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2251
Credit: 42,690,214
RAC: 22,669
Message 110029 - Posted: 15 Nov 2024, 0:45:21 UTC - in response to Message 110026.  

boinc-process is starting to die again.
And it's now dead, Validation backlog starting to develop.

Lol - I've only just noticed. Fortunately there aren't many unreturned tasks it'll affect.
And there was me running my non-Rosetta buffer down...

Boinc-process is back at least.
A few 10s of thousands of tasks seem to have been issued through the day, but all are getting gobbled up as fast as they appear.
Most of my team have had a few tasks but only in one or two small grabs and it's a return to the backup projects pretty soon.
Hand-to-mouth again
ID: 110029 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2251
Credit: 42,690,214
RAC: 22,669
Message 110034 - Posted: 18 Nov 2024, 7:57:26 UTC - in response to Message 110029.  

A few 10s of thousands of tasks seem to have been issued through the day, but all are getting gobbled up as fast as they appear.

Just spotted another 40k come through but they've all gone already
I managed to load up one PC, so if you wondered who got them...
ID: 110034 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Bill Swisher

Send message
Joined: 10 Jun 13
Posts: 56
Credit: 43,632,241
RAC: 193,290
Message 110035 - Posted: 19 Nov 2024, 18:48:52 UTC

Looks like they've done it again. I have a machine pretty much locked up tight. The rosetta_4.20 processes were asking, the last time I could get to the machine, for 2+GB each. Basic math tells me that 32 threads won't work with 32GB of memory. Back in the old days, last month, I could have just hit the giant power button and rebooted it. Unfortunately at the present time, and until April, I'm sitting 2,410 miles (as the crow flies) away and my arm isn't that long. Guess I'll have to cripple rosetta since I can't trust it to play nice.
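
The "basic math" here can be written out as a quick check. This is only a sketch under worst-case assumptions: every one of the 32 threads runs a Rosetta task, each task needs about 2 GB, and nothing is reserved for the operating system. The function name is made up for illustration.

```python
def memory_shortfall_gib(tasks: int, per_task_gib: float, ram_gib: float) -> float:
    """Return how many GiB the tasks need beyond physical RAM (0.0 if they fit)."""
    demand = tasks * per_task_gib
    return max(0.0, demand - ram_gib)


# Figures from the post: 32 threads, roughly 2 GB per rosetta_4.20 task, 32 GB of RAM.
shortfall = memory_shortfall_gib(tasks=32, per_task_gib=2.0, ram_gib=32.0)
print(f"Shortfall: {shortfall:.0f} GiB")
```

With those figures the demand is twice the installed RAM, so the machine would have to swap heavily long before all 32 tasks reached that size.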
ID: 110035 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote

Send message
Joined: 22 Feb 11
Posts: 278
Credit: 527,663
RAC: 239
Message 110036 - Posted: 19 Nov 2024, 18:53:53 UTC

Perhaps you could tell boinc not to use all memory.

ID: 110036 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1773
Credit: 18,534,891
RAC: 18
Message 110037 - Posted: 20 Nov 2024, 5:12:21 UTC - in response to Message 110035.  

Only 500MB or so in use by each of my presently running Tasks.
Darwin NT
ID: 110037 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2251
Credit: 42,690,214
RAC: 22,669
Message 110038 - Posted: 20 Nov 2024, 21:12:33 UTC

Guess what the Server Status page is telling us right now...
ID: 110038 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1773
Credit: 18,534,891
RAC: 18
Message 110040 - Posted: 21 Nov 2024, 4:28:11 UTC - in response to Message 110038.  

Guess what the Server Status page is telling us right now...
It almost lasted for a week.
Darwin NT
ID: 110040 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Bill Swisher

Send message
Joined: 10 Jun 13
Posts: 56
Credit: 43,632,241
RAC: 193,290
Message 110041 - Posted: 21 Nov 2024, 21:49:40 UTC - in response to Message 110040.  

Guess I broke it.

So, to continue on with the subject of memory usage, now that I've calmed down, and managed to place a limit on rosetta on the remote computers.

Someone, and forgive me for not mentioning your name, suggested setting the max memory variables under the computing preferences option. I tend to set mine at 98% since, with the exception of this computer, they're sitting there (headless for the most part) doing nothing but running boinc.
I guess my question is...that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that? Do the jobs that rosetta provides indicate how much memory they will be using? Then there's the whole swap thingie (the swap space on my computers is set at the pretty much standard 2GB per machine). Any wizards out there who can explain it to a maroon like me?
ID: 110041 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jean-David Beyer

Send message
Joined: 2 Nov 05
Posts: 209
Credit: 7,231,355
RAC: 10,730
Message 110042 - Posted: 22 Nov 2024, 0:25:07 UTC - in response to Message 110041.  

... they're sitting there (headless for the most part) doing nothing but running boinc.
I guess my question is...that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that? Do the jobs that rosetta provides indicate how much memory they will be using?

My Linux machine runs lots of processes. It has 16 cores and 128 GBytes of RAM.

As far as Boinc is concerned, the main process is the Boinc Client. It uses very little RAM and very little CPU time. From time to time, the Boinc client sends a message to a Boinc server and asks for work. The server sends a reply saying it cannot find any work, or a bunch of messages describing the files the client should download. In the latter case, the client downloads the files to the proper places. Then, if the client has spare cores, it selects one and forks off a process to run the task.

So let us say there are no Boinc tasks running and the client has just received a task from the Rosetta server. The client then forks off the Rosetta task.

top - 19:12:56 up 16 days,  8:42,  2 users,  load average: 13.38, 13.32, 13.29
Tasks: 483 total,  14 running, 469 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  0.3 sy, 80.6 ni, 18.0 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem : 128086.0 total,   5047.0 free,   7395.4 used, 115643.6 buff/cache
MiB Swap:  15992.0 total,  15687.0 free,    305.0 used. 116733.0 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
3176351    2043 boinc     39  19 R 596760   0.5  99.0 13  10:12.79 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
3161135    2043 boinc     39  19 R 581420   0.4  99.3  2 121:33.16 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
3111703    2043 boinc     39  19 R 541240   0.4  99.1  9 455:40.07 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
3163687    2043 boinc     39  19 R 481148   0.4  99.2 10 103:13.41 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
3144411    2043 boinc     39  19 R 443480   0.3  99.1  6 233:56.51 ../../projects/ 
   2043       1 boinc     30  10 S  54708   0.0   0.1  8 300278:26 /usr/bin/boinc                                                            
3171024    2043 boinc     39  19 R  39676   0.0  99.3  4  48:38.05 ../../projects/ 
3166711    2043 boinc     39  19 R  39668   0.0  99.3 11  80:07.82 ../../projects/ 
3171561    2043 boinc     39  19 R  39584   0.0  99.2  0  44:34.46 ../../projects/ 
3167425    2043 boinc     39  19 R  39520   0.0  99.3  7  75:58.11 ../../projects/ 
3176944    2043 boinc     39  19 R  39172   0.0  99.4 15   5:33.72 ../../projects/ 
3172039    2043 boinc     39  19 R  39116   0.0  99.3  3  41:39.57 ../../projects/ 
3176627    2043 boinc     39  19 R  36824   0.0  99.4  1   8:20.14 ../../projects/ 
3141011    2043 boinc     39  19 R  29944   0.0  99.3  5 258:04.99 ../../projects/ 

PID is the process ID; PPID is the PID of the process's parent.
PID 1 is the process from which all the other processes descend. One of the processes it starts is PID 2043, which is my Boinc Client.
This client starts all the other processes listed.
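
The parent/child relationship described above can be seen in miniature with a few lines of Python. This is a sketch, not how the Boinc client itself is written: it starts one child process and has the child report its PPID, which should match the PID of the script that launched it, just as every task in the listing reports PPID 2043. It assumes a Linux system; the function name is hypothetical.

```python
import os
import subprocess
import sys


def spawn_child_and_get_ppid() -> int:
    """Start a child Python process and return the parent PID that it reports."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getppid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


parent_pid = os.getpid()                 # plays the role of the Boinc client
child_ppid = spawn_child_and_get_ppid()  # plays the role of one science task
print(f"parent PID {parent_pid}, child reports PPID {child_ppid}")
```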
ID: 110042 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Bill Swisher

Send message
Joined: 10 Jun 13
Posts: 56
Credit: 43,632,241
RAC: 193,290
Message 110043 - Posted: 22 Nov 2024, 2:06:41 UTC - in response to Message 110042.  

I'm running openSUSE on all of my computers. Here's the one that caused the problem:

top - 16:42:35 up 2 days, 6:10, 2 users, load average: 33.55, 33.40, 33.38
Tasks: 475 total, 34 running, 441 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 99.8 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31927.27+total, 22505.63+free, 8120.531 used, 1792.707 buff/cache
MiB Swap: 2048.062 total, 2048.062 free, 0.000 used. 23806.74+avail Mem

61822 boinc 39 19 78392 35748 2304 R 100.3 0.109 100:31.99 wcgrid_mcm1_map
61824 boinc 39 19 78232 38868 2048 R 100.3 0.119 99:54.31 wcgrid_mcm1_map
62632 boinc 39 19 825532 674520 70144 R 100.3 2.063 249:05.56 rosetta_4.20_x8
62921 boinc 39 19 903184 753668 70400 R 100.3 2.305 217:03.78 rosetta_4.20_x8
63867 boinc 39 19 78524 38568 2048 R 100.3 0.118 108:19.06 wcgrid_mcm1_map
64033 boinc 39 19 78392 34544 2304 R 100.3 0.106 89:55.77 wcgrid_mcm1_map
64063 boinc 39 19 78396 40032 2048 R 100.3 0.122 87:05.91 wcgrid_mcm1_map
60802 boinc 39 19 2272136 2.027g 76288 R 100.0 6.502 448:17.00 rosetta_4.20_x8
61312 boinc 39 19 771740 617852 70144 R 100.0 1.890 387:41.08 rosetta_4.20_x8
61344 boinc 39 19 819136 665404 70400 R 100.0 2.035 382:14.14 rosetta_4.20_x8
61727 boinc 39 19 78640 40168 2304 R 100.0 0.123 109:32.81 wcgrid_mcm1_map
62042 boinc 39 19 956000 801992 70144 R 100.0 2.453 311:50.22 rosetta_4.20_x8
63796 boinc 39 19 78232 38764 2304 R 100.0 0.119 115:34.03 wcgrid_mcm1_map
63805 boinc 39 19 78232 39828 2304 R 100.0 0.122 112:30.09 wcgrid_mcm1_map
63859 boinc 39 19 78536 38488 2304 R 100.0 0.118 110:09.20 wcgrid_mcm1_map
63873 boinc 39 19 78232 39728 2304 R 100.0 0.122 107:39.82 wcgrid_mcm1_map
63877 boinc 39 19 78392 38792 2048 R 100.0 0.119 106:19.39 wcgrid_mcm1_map
63881 boinc 39 19 78524 38808 2304 R 100.0 0.119 105:28.10 wcgrid_mcm1_map
63885 boinc 39 19 78232 38796 2304 R 100.0 0.119 104:28.67 wcgrid_mcm1_map
63976 boinc 39 19 78392 39360 2304 R 100.0 0.120 92:57.77 wcgrid_mcm1_map
64022 boinc 39 19 78468 39464 2304 R 100.0 0.121 92:03.05 wcgrid_mcm1_map
64027 boinc 39 19 78392 39200 2304 R 100.0 0.120 90:19.73 wcgrid_mcm1_map
64061 boinc 39 19 78652 39572 2304 R 100.0 0.121 87:13.84 wcgrid_mcm1_map
64067 boinc 39 19 78536 38944 2304 R 100.0 0.119 86:28.16 wcgrid_mcm1_map
64757 boinc 39 19 78260 39200 2304 R 100.0 0.120 4:00.44 wcgrid_mcm1_map
61710 boinc 39 19 78592 39564 2304 R 99.67 0.121 114:27.44 wcgrid_mcm1_map
61732 boinc 39 19 78392 39740 2304 R 99.67 0.122 108:00.30 wcgrid_mcm1_map
63854 boinc 39 19 78232 39136 2304 R 99.67 0.120 111:13.06 wcgrid_mcm1_map
64029 boinc 39 19 78580 39696 2304 R 99.67 0.121 90:13.18 wcgrid_mcm1_map
64038 boinc 39 19 78292 38760 2304 R 99.67 0.119 87:54.97 wcgrid_mcm1_map
61814 boinc 39 19 78392 39304 2304 R 99.34 0.120 102:28.23 wcgrid_mcm1_map
63851 boinc 39 19 78232 39780 2304 R 99.34 0.122 112:01.97 wcgrid_mcm1_map
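
To put a number on the size of the rosetta processes, the RES column of a listing like this can be totalled with a short script. This is a sketch that assumes the default `top` column order (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND), since the header row is not shown above, and that plain RES values are in KiB with `m`/`g` suffixes for larger ones. The function names are made up for illustration.

```python
def res_to_kib(field: str) -> float:
    """Convert a top RES field to KiB; bare numbers are KiB, suffixes scale up."""
    scale = {"m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}
    suffix = field[-1].lower()
    if suffix in scale:
        return float(field[:-1]) * scale[suffix]
    return float(field)


def total_res_gib(rows: str, command_prefix: str) -> float:
    """Sum RES (6th column) over rows whose COMMAND starts with command_prefix."""
    total_kib = 0.0
    for line in rows.strip().splitlines():
        cols = line.split()
        if len(cols) >= 12 and cols[11].startswith(command_prefix):
            total_kib += res_to_kib(cols[5])
    return total_kib / 1024 ** 2


# Three rows copied from the listing above: two rosetta tasks and one WCG task.
sample = """
62632 boinc 39 19 825532 674520 70144 R 100.3 2.063 249:05.56 rosetta_4.20_x8
60802 boinc 39 19 2272136 2.027g 76288 R 100.0 6.502 448:17.00 rosetta_4.20_x8
61822 boinc 39 19 78392 35748 2304 R 100.3 0.109 100:31.99 wcgrid_mcm1_map
"""
print(f"rosetta RES in sample: {total_res_gib(sample, 'rosetta'):.2f} GiB")
```

Feeding it the whole listing instead of the three-row sample gives the combined footprint of all the rosetta tasks at that moment.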

Notice the size of the rosetta processes. I've gone in and created the app_config, as root, to control how many rosetta processes can run.
# cd /var/lib/boinc/projects/boinc.bakerlab.org_rosetta/
# cat app_config.xml
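
The file's contents are not shown in the post. As an illustration only, the sketch below generates a minimal app_config.xml that uses BOINC's `project_max_concurrent` option to cap how many tasks from this one project run at once. The limit of 8 is a hypothetical value, not the one actually used, and the helper name is made up.

```python
def build_app_config(max_tasks: int) -> str:
    """Return app_config.xml text that caps concurrent tasks for one project."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    return (
        "<app_config>\n"
        f"    <project_max_concurrent>{max_tasks}</project_max_concurrent>\n"
        "</app_config>\n"
    )


# 8 is a hypothetical limit chosen for the example.
print(build_app_config(8), end="")
```

The output would be saved as app_config.xml in the project directory shown above, and it should take effect once the client re-reads its config files or is restarted.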

I have a newer machine running and the rosetta processes are much smaller. Dunno why.
I guess the answer is to go get more memory. Unfortunately I won't be near that computer until April.
ID: 110043 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1773
Credit: 18,534,891
RAC: 18
Message 110044 - Posted: 22 Nov 2024, 5:34:12 UTC - in response to Message 110041.  

I guess my question is...that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that?
There is no BOINC process.
Those BOINC settings limit the amount of RAM, disk space, network activity, CPU usage etc available for all projects that run under BOINC.

Setting a massive swap file value allows Tasks that claim to require massive amounts of memory to start, even though they don't actually use that amount of memory in order to run.
So systems with limited amounts of RAM can still run Tasks that claim to require a significant amount of RAM, as long as the RAM they have (and that is available for BOINC to use) is more than the Tasks actually need to run. If the Tasks do need more RAM than is available for BOINC to use, the large swap file still allows them to run, but with massive amounts of swapping. With an SSD, the system will be sluggish at best. With an HDD, it will probably grind to a non-responsive halt, depending on just how much more RAM the Task(s) need and how many of them are trying to run.
Darwin NT
ID: 110044 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1773
Credit: 18,534,891
RAC: 18
Message 110045 - Posted: 22 Nov 2024, 5:42:44 UTC

And boinc-process is back up and running. That must be one of its shortest outages yet.
Darwin NT
ID: 110045 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2251
Credit: 42,690,214
RAC: 22,669
Message 110046 - Posted: 22 Nov 2024, 6:37:20 UTC - in response to Message 110041.  

Guess I broke it.

If only any of us were that powerful...

So, to continue on with the subject of memory usage, now that I've calmed down, and managed to place a limit on rosetta on the remote computers.

Someone, and forgive me for not mentioning your name, suggested setting the max memory variables under the computing preferences option. I tend to set mine at 98% since, with the exception of this computer, they're sitting there (headless for the most part) doing nothing but running Boinc.

Personally I agree with your choice rather than the suggestion of restricting RAM so that there's even less space to run.

I guess my question is... that's the "Boinc" process right? When the "Rosetta" process kicks off how does it interact with that?
Do the jobs that Rosetta provides indicate how much memory they will be using?

Aiui yes.

Then there's the whole swap thingie (the swap space on my computers is set at the pretty much standard 2GB per machine). Any wizards out there who can explain it to a maroon like me?

I certainly don't understand the exact mechanics of how this works, but we have had the experience in the past where disk space became a limitation and the solution wasn't entirely obvious.
I do recall that my original setting was 10-20GB rather than 2GB, so that's one thing, but even that became an issue/restriction when the problem arose before.
I recall raising it to 500GB and it still not solving the problem back then.

The solution turned out to be not having any restriction at all.
That is, on the disk tab, unselecting "Use no more than xx GB"
Combined with that I'm currently using "Use no more than 90% of total" disk space and
Page/swap file: Use at most 75%

In the situation where you're currently running headless, I hope none of these settings should be a problem on those hosts.

I have no idea whether this solution to an old problem also solves your current one, but I am surprised you're reporting the problem at all as I haven't heard anyone experiencing anything similar in recent years.
Nothing to lose by trying anyway.
ID: 110046 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2251
Credit: 42,690,214
RAC: 22,669
Message 110047 - Posted: 22 Nov 2024, 6:46:13 UTC - in response to Message 110045.  

And boinc-process is back up and running. That must be one of its shortest outages yet.

That's what I originally came here to say.
Not that it matters a great deal with current task availability but still...
ID: 110047 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Bryn Mawr

Send message
Joined: 26 Dec 18
Posts: 414
Credit: 13,127,327
RAC: 19,939
Message 110048 - Posted: 22 Nov 2024, 9:19:35 UTC - in response to Message 110041.  

Guess I broke it.

So, to continue on with the subject of memory usage, now that I've calmed down, and managed to place a limit on rosetta on the remote computers.

Someone, and forgive me for not mentioning your name, suggested setting the max memory variables under the computing preferences option. I tend to set mine at 98% since, with the exception of this computer, they're sitting there (headless for the most part) doing nothing but running boinc.
I guess my question is...that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that? Do the jobs that rosetta provides indicate how much memory they will be using? Then there's the whole swap thingie (the swap space on my computers is set at the pretty much standard 2GB per machine). Any wizards out there who can explain it to a maroon like me?

No, the percentage you set covers all the programs running under the Boinc user, so, the Boinc Client, Manager and all of the e.g. Rosetta work units are within the same pot and there’s no “interaction” between the one and the other.

The swap works the same way, the percentage covers all memory required by Boinc over the physical memory present.
ID: 110048 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2251
Credit: 42,690,214
RAC: 22,669
Message 110051 - Posted: 23 Nov 2024, 8:08:07 UTC - in response to Message 110047.  

And boinc-process is back up and running. That must be one of its shortest outages yet.

That's what I originally came here to say.
Not that it matters a great deal with current task availability but still...

Some 40-50k became available maybe 12-14 hrs ago, and it seems another 800k in the last few hours.
ID: 110051 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Bryn Mawr

Send message
Joined: 26 Dec 18
Posts: 414
Credit: 13,127,327
RAC: 19,939
Message 110052 - Posted: 23 Nov 2024, 12:56:18 UTC - in response to Message 110051.  

So far one error :-

process exited with code 1 (0x1, -255)</message>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-linux-gnu @8a_hal_u_hal_8aa_4jp3235_d104_0001_1.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_f5ae1de8e1/database

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/ line: 2534
BOINC:: Error reading and gzipping output datafile: default.out
09:31:32 (83823): called boinc_finish(1)

ID: 110052 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1773
Credit: 18,534,891
RAC: 18
Message 110055 - Posted: 23 Nov 2024, 20:09:24 UTC - in response to Message 110052.  

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
Unfortunately, one of the usual ones.
Darwin NT
ID: 110055 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote

©2025 University of Washington