Questions and Answers : Unix/Linux : Unstoppable Rosetta client...
Author | Message |
---|---|
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
I've had this a couple of times now, and at first I thought it was just something going wrong on the odd occasion, but it seems to happen more regularly than it really should... I'm running Fedora Core 4 (64-bit), but I've also seen this on SuSE 9.3 and FC4 32-bit [yes, I've got several machines ;-)]. The process is stuck at X% (most often zero, but this one is stuck at 60%). I can't kill it, no matter what I do. Suspend doesn't stop it (but starts another one). Abort in BOINC makes no difference either; it's still at 47.8% of the CPU time in BOINC (because another one started when I did the suspend, and BOINC actually thinks the process is suspended). If I try "kill 3289", it still stays up there. "kill -9 3289" doesn't make any difference. Trying to attach to 3289 in gdb doesn't work either, gdb just gets stuck... Any thoughts? [A reboot does solve the problem, but I HATE to reboot my machine]. I have seen a similar thing where my Windows machine got stuck for 3 days at 0% work done, and just chugged along on the same WU, so it's probably not a Linux-specific problem as such. -- Mats |
Desti Send message Joined: 16 Sep 05 Posts: 50 Credit: 3,018 RAC: 0 |
Do you use BOINC 4 or 5? LUE |
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Do you use BOINC 4 or 5? This particular one is running BOINC 4. However, I have a feeling this would not in itself be solved by using BOINC 5 (which I'm using on another machine). -- Mats |
Crouse Send message Joined: 1 Nov 05 Posts: 33 Credit: 67,332 RAC: 0 |
Try this: killall boinc Visit http://usalug.org https://boinc.bakerlab.org/rosetta/team_display.php?teamid=593 |
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Try this: According to strace, it does "kill(9946, SIGTERM);", which is exactly what a plain kill {pid} does (kill -9 {pid} sends SIGKILL instead, and that had no effect on the stuck process either). [I've rebooted the machine that had the stuck process by now, because it did start to annoy me having a rogue process slurping up 50% of the processing power on the system]. By the way, the process that is stuck isn't boinc but rosetta, so it would be "killall rosetta-4.78-i686-linux" or whatever the executable is called. But that does the same as what I'd already done, except that I did "ps ax|grep rosetta" to find the process number and killed it by number. That only changes the amount of work needed by the "kill" command and/or the user to find the process id, not what the OS kernel actually does. But after some more looking into the problem, with the help of a Linux kernel hacker who knows a thing or two about kernels, it appears that it's the kernel that is at fault here, though probably triggered by some sort of flaw in Rosetta that causes an exception (looking at the call stack in the kernel, it appears to be looping around a point in signal handling, so it's probably a signal handling flaw, perhaps also influenced by the fact that the kernel happens to have auditing enabled [that's speculation]). -- Mats |
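For reference, here is a minimal Python sketch of how the two signals discussed above normally differ. It is only an illustration under assumptions: the child is a hypothetical stand-in that deliberately ignores SIGTERM, not Rosetta, and `CHILD_CODE` and `term_vs_kill_demo` are names made up for this example. On a healthy kernel SIGKILL cannot be caught or ignored, which is why a process surviving "kill -9" points at the kernel rather than at the application.

```python
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a stubborn process: it ignores SIGTERM and then sleeps.
CHILD_CODE = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def term_vs_kill_demo():
    """Send SIGTERM, then SIGKILL, to a child that ignores SIGTERM.

    Returns (survived_term, exit_code).
    """
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdout.readline()            # wait until the child has set SIG_IGN
    child.send_signal(signal.SIGTERM)  # same signal a plain `kill <pid>` sends
    time.sleep(0.5)
    survived_term = child.poll() is None
    child.send_signal(signal.SIGKILL)  # same signal `kill -9 <pid>` sends
    exit_code = child.wait(timeout=5)  # negative value means "killed by that signal"
    child.stdout.close()
    return survived_term, exit_code


if __name__ == "__main__":
    print(term_vs_kill_demo())
```

The child here survives SIGTERM only because it asked to ignore it; the stuck process described in this thread did not respond even to SIGKILL, which this sketch does not reproduce.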
Crouse Send message Joined: 1 Nov 05 Posts: 33 Credit: 67,332 RAC: 0 |
crouse 1491 _ -bash crouse 1828 _ ./boinc -return_results_immediately crouse 1829 | _ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -abrelax_m crouse 1831 | _ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -abrel crouse 1832 | _ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -a crouse 1988 _ ps auxf Well.... ./boinc is the main process ...so killing it seems to work for killing rosetta. It appears that if you just try to kill rosetta....boinc starts it up again. I suppose I could look up the pid for boinc and kill it that way too...but killall boinc ALWAYS works without the hassle of looking up the pid number. ;) Now... if you're trying to stop just the rosetta client, and start a different project for boinc from the command line....... I haven't figured that out yet. I'm only running rosetta@home on my boinc client..so I haven't had to deal with that. It's easy enough with the ./run_manager thing, but if you're running a box with no GUI that isn't much help either. http://boinc.berkeley.edu/client_unix.php shows some boinc commands, but none appear to just "stop" the clients from running..... the detach command would work...but then you would have to "attach" again to start it up. Attaching again might create a duplicate entry of your computer in the stats (not sure, I haven't actually tried that yet). Visit http://usalug.org https://boinc.bakerlab.org/rosetta/team_display.php?teamid=593 |
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
crouse 1491 _ -bash crouse 1828 _ ./boinc -return_results_immediately crouse 1829 | _ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -abrelax_m crouse 1831 | _ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -abrel crouse 1832 | _ rosetta_4.78_i686-pc-linux-gnu aa 1hz7 A -silent -a crouse 1988 _ ps auxf I understand how killing the whole tree of processes would work, but in this particular case, I'd already killed (and restarted) boinc, but the rosetta process was STILL running, unstoppable. Stopping (killing) boinc _SHOULD_ have killed the rosetta process, but it didn't. I think the reason is, as I explained, a combination of some bugginess in the kernel and something in rosetta causing it to take a signal which then causes the kernel to essentially loop around infinitely... So, in conclusion: there are two problems - Rosetta sometimes gets upset, doesn't actually calculate the correct result and loops forever, and there's a kernel bug that causes a signal loop which makes the process unkillable. It's possible that upgrading to the 5.2 BOINC would solve the initial process locking up - I'm upgrading at some point in the future. I think I'll close this issue, as it's not really a rosetta problem... -- Mats |
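As an illustration of "killing the whole tree of processes", here is a rough Python sketch that signals a parent and its worker together through their process group. This is an assumption-laden example, not how boinc manages rosetta: whether boinc and its workers share a process group is not confirmed anywhere in this thread, and `PARENT_CODE` and `kill_tree_demo` are hypothetical names.

```python
import os
import signal
import subprocess
import sys

# Hypothetical parent that starts one worker, prints the worker's pid, then sleeps.
PARENT_CODE = (
    "import subprocess, sys, time\n"
    "w = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(w.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def kill_tree_demo():
    """Start parent + worker in their own process group and kill the group.

    Returns (same_group, parent_exit_code).
    """
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_CODE],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # parent becomes leader of a new process group
    )
    worker_pid = int(parent.stdout.readline())
    # The worker inherits its parent's process group unless it changes it.
    same_group = os.getpgid(worker_pid) == parent.pid
    os.killpg(parent.pid, signal.SIGKILL)  # one signal to every group member
    # The worker inherited the write end of this pipe, so read() only reaches
    # end-of-file once both the parent and the worker have exited.
    parent.stdout.read()
    parent.stdout.close()
    parent_exit_code = parent.wait(timeout=5)
    return same_group, parent_exit_code


if __name__ == "__main__":
    print(kill_tree_demo())
```

None of this would help with the case described above, where the kernel itself never delivers the signal to the stuck process.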
ColdRain~old Send message Joined: 1 Nov 05 Posts: 27 Credit: 33,378 RAC: 0 |
--- snip -- |
Desti Send message Joined: 16 Sep 05 Posts: 50 Credit: 3,018 RAC: 0 |
Sounds like Rosetta became a zombie process. |
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Sounds like Rosetta became a zombie process. Nope, it wasn't marked as a zombie by the OS, and if nothing else was running on the system, it would consume 99.9% of the CPU time, so it was definitely still running. -- Mats |
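For anyone wanting to check the zombie question themselves, here is a small Linux-only Python sketch that reads the state letter the kernel reports in /proc/<pid>/stat ("R" for running, "Z" for zombie, and so on). The helper names `process_state` and `zombie_demo` are made up for this example, and the demo child is a throwaway process, not Rosetta.

```python
import os
import subprocess
import sys
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name sits in parentheses and may itself contain spaces or ')',
    # so take the first field after the LAST closing parenthesis.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Let a child exit without reaping it and report the state the kernel shows."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    deadline = time.monotonic() + 5
    state = process_state(child.pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.05)
        state = process_state(child.pid)
    child.wait()  # reap the child so the zombie entry disappears
    return state


if __name__ == "__main__":
    print("self:", process_state(os.getpid()))
    print("exited, unreaped child:", zombie_demo())
```

A zombie has already exited and uses no CPU, so a process that is still consuming CPU time would be expected to show a state other than "Z".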
©2024 University of Washington
https://www.bakerlab.org