Unstoppable Rosetta client...

Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 1827 - Posted: 27 Oct 2005, 10:08:25 UTC

I've had this a couple of times now, and at first I thought it was just something going wrong on the odd occasion, but it seems to happen more regularly than it really should...

I'm running Fedora Core4 (64-bit), but I've also seen this on SuSE 9.3 and FC4 32-bit [yes, I've got several machines ;-)].

The process is stuck at X% (most often zero, but this one is stuck at 60%). I can't kill it, no matter what I do. Suspend doesn't stop it (it just starts another one). Abort in BOINC makes no difference either: the stuck process still shows 47.8% of the CPU time, because another task started when I suspended it and BOINC actually thinks the stuck process is suspended.

If I try "kill 3289", it still stays up there. "kill -9 3289" doesn't make any difference.

Trying to attach to 3289 in gdb doesn't work either; gdb just hangs...

Any thoughts? [Reboot, however, does solve the problem, but I HATE to reboot my machine].

I have seen a similar thing where my Windows machine got stuck for 3 days at 0% work done, and just chugged along on the same WU, so it's probably not a Linux specific problem as such.

--
Mats
ID: 1827 · Rating: 0
Desti

Joined: 16 Sep 05
Posts: 50
Credit: 3,018
RAC: 0
Message 1912 - Posted: 29 Oct 2005, 13:56:05 UTC

Do you use BOINC 4 or 5?
LUE
ID: 1912 · Rating: 0
Mats Petersson

Message 1989 - Posted: 31 Oct 2005, 15:24:07 UTC - in response to Message 1912.  

> Do you use BOINC 4 or 5?


This particular one is running BOINC 4. However, I have a feeling that this would not in itself be solved by using BOINC 5 (which I'm using on another machine).

--
Mats
ID: 1989 · Rating: 0
Crouse
Joined: 1 Nov 05
Posts: 33
Credit: 67,332
RAC: 0
Message 2302 - Posted: 4 Nov 2005, 23:04:10 UTC

ID: 2302 · Rating: 0
Mats Petersson

Message 2600 - Posted: 7 Nov 2005, 22:09:07 UTC - in response to Message 2302.  

> Try this:
>
> killall boinc


According to strace, it does "kill(9946, SIGTERM);", which is the same signal a plain kill {pid} sends, and weaker than the SIGKILL from kill -9 {pid}, which I had already tried. [I've rebooted the machine that had the stuck process by now, because it started to annoy me having a rogue process slurping up 50% of the processing power on the system].

By the way, the process that is stuck isn't boinc but rosetta, so it would be "killall rosetta-4.78-i686-linux" or whatever the executable is called. That does the same as what I'd already done, except that I did "ps ax|grep rosetta" to find the process number and killed it by number. The only difference is the amount of work needed by the "kill" command and/or the user to find the process ID, not what the OS kernel actually does.
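
The "ps ax|grep rosetta" plus kill-by-number steps can also be scripted. The following is only a rough sketch in Python, assuming a Linux-style /proc filesystem; the names find_pids and signal_all and the proc_root parameter are made up for this illustration and are not part of BOINC or Rosetta. It does nothing more than kill does, so it will not help with a process that the kernel never lets handle the signal.

```python
import os
import signal


def find_pids(name_fragment, proc_root="/proc"):
    """Return sorted PIDs whose command line contains name_fragment."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        if int(entry) == os.getpid():
            continue  # never match this script itself
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited meanwhile, or is not readable
        # Arguments in cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\x00", b" ").decode("utf-8", "replace")
        if name_fragment in cmdline:
            pids.append(int(entry))
    return sorted(pids)


def signal_all(pids, sig=signal.SIGTERM):
    """Send sig to each PID; return the PIDs that could be signalled."""
    delivered = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            delivered.append(pid)
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or owned by another user
    return delivered
```

Something like signal_all(find_pids("rosetta")) would then correspond to the killall variant, and passing sig=signal.SIGKILL to the kill -9 variant.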

But after some more looking into the problem, with the help of a Linux kernel hacker who knows a thing or two about kernels, it appears that the kernel is at fault here, probably triggered by some flaw in Rosetta that causes an exception. Looking at the call stack in the kernel, it appears to be looping around a point in signal handling, so it's probably a signal handling flaw, perhaps also influenced by the fact that the kernel happens to have auditing enabled [that's speculation].

--
Mats
ID: 2600 · Rating: 0
Crouse
Message 2647 - Posted: 8 Nov 2005, 18:01:59 UTC
Last modified: 8 Nov 2005, 18:14:04 UTC

crouse      1491        \_ -bash
crouse      1828            \_ ./boinc -return_results_immediately
crouse      1829            |   \_ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -abrelax_m
crouse      1831            |       \_ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -abrel
crouse      1832            |           \_ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -a
crouse      1988            \_ ps auxf


Well... ./boinc is the main process, so killing it seems to work for killing rosetta. It appears that if you just try to kill rosetta, boinc starts it up again. I suppose I could look up the PID for boinc and kill it that way too, but killall boinc ALWAYS works without the hassle of looking up the PID. ;)

Now, if you're trying to stop just the rosetta client and start a different project for boinc from the command line, I haven't figured that out yet. I'm only running rosetta@home on my boinc client, so I haven't had to deal with that. It's easy enough with the ./run_manager thing, but if you're running a box with no GUI, that isn't much help either.

http://boinc.berkeley.edu/client_unix.php shows some boinc commands, but none appear to just "stop" the clients from running. The detach command would work, but then you would have to "attach" again to start it up. Attaching again might create a duplicate entry of your computer in the stats (not sure, I haven't actually tried that yet).
Visit http://usalug.org
https://boinc.bakerlab.org/rosetta/team_display.php?teamid=593
ID: 2647 · Rating: 0
Mats Petersson

Message 2649 - Posted: 8 Nov 2005, 18:53:58 UTC - in response to Message 2647.  

I understand how killing the whole tree of processes would work, but in this particular case I'd already killed (and restarted) boinc, and the rosetta process was STILL running, unstoppable. Stopping (killing) boinc _SHOULD_ have killed the rosetta process, but it didn't. I think the reason is, as I explained, a combination of some bugginess in the kernel and something in rosetta causing it to take a signal, which then causes the kernel to essentially loop forever...

So, in conclusion, there are two problems: Rosetta sometimes gets upset, doesn't actually calculate the correct result and loops forever, and there's a kernel bug that causes a signal loop which makes the process unkillable. It's possible that upgrading to BOINC 5.2 would solve the initial process locking up; I'm upgrading at some point in the future.

I think I'll close this issue, as it's not really a rosetta problem...

--
Mats
ID: 2649 · Rating: 0
ColdRain~old
Joined: 1 Nov 05
Posts: 27
Credit: 33,378
RAC: 0
Message 2650 - Posted: 8 Nov 2005, 19:04:45 UTC
Last modified: 8 Nov 2005, 19:06:17 UTC

--- snip --
ID: 2650 · Rating: 0
Desti

Message 2764 - Posted: 9 Nov 2005, 22:55:49 UTC

Sounds like Rosetta turned into a zombie process.
ID: 2764 · Rating: 0
Mats Petersson

Message 2803 - Posted: 10 Nov 2005, 13:12:46 UTC - in response to Message 2764.  

> Sounds like Rosetta turned into a zombie process.


Nope, it wasn't marked as a zombie by the OS, and if nothing else was running on the system, it would consume 99.9% of the CPU time, so it was definitely still running.
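
For anyone wanting to check this on their own machine: on Linux the state is also visible in /proc/<pid>/stat, where Z means zombie, R running and D uninterruptible sleep. A minimal sketch in Python, assuming that file layout; parse_state and process_state are illustrative names, and the sample line in the comment is made up rather than taken from the stuck machine.

```python
def parse_state(stat_text):
    """Extract the one-letter state from the contents of /proc/<pid>/stat."""
    # Format: "pid (comm) state ppid ...". The comm field may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid):
    """Read the current state letter for pid (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_state(f.read())


# Hypothetical sample line, shortened:
# parse_state("3289 (rosetta_4.78_i6) R 1828 3289 1491") gives "R"
```

A process that keeps reporting R while burning CPU and ignoring SIGKILL would fit the description above; a real zombie would report Z and use no CPU time at all.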

--
Mats
ID: 2803 · Rating: 0



