Questions and Answers : Unix/Linux : Bug: rosetta_4.08_x86_64-pc-linux-gnu uses unsupported CPU features.
ShimmerFairy Joined: 26 Mar 20 Posts: 3 Credit: 2,060 RAC: 0
I've been having trouble running the 64-bit Rosetta application, while the 32-bit one works fine. First I made sure to enable legacy vsyscall emulation, which I was warned I'd need, but rosetta_4.08_x86_64-pc-linux-gnu still breaks. (Judging by how different the backtraces in the coredumps look before and after that change, enabling legacy vsyscall did solve one problem; it just revealed another.)

Suspecting that the code assumes CPU features my machine doesn't have, I looked at the backtraces of all my segfaulting tasks. In every coredump, the address of the last frame before the first <signal handler called> marker is 0x00000000013bc5f8. So I disassembled the area right around it and found:

    (gdb) disassemble 0x13bc5f8,0x13bc608
    Dump of assembler code from 0x13bc5f8 to 0x13bc608:
       0x00000000013bc5f8: pshufb %xmm1,%xmm0
       0x00000000013bc5fd: movdqa %xmm0,0x0(%r13)
       0x00000000013bc603: jae 0x13bc6cc
    End of assembler dump.

A quick search shows that the pshufb instruction was introduced with SSSE3. The problem is that my poor old AMD Athlon X2 does not support the SSSE3 instruction set, so of course the program falls apart the instant it tries to execute a pshufb. As far as I can tell, there are four solutions to this problem:
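For context on the failing instruction: in its 128-bit form, `pshufb %xmm1,%xmm0` rearranges the 16 bytes of %xmm0 using %xmm1 as a table of byte indices, and any index byte with its top bit set produces a zero byte. The Python sketch below models those semantics in plain scalar code; the function name is hypothetical, and it is only meant to illustrate what a portable, non-SSSE3 code path would have to compute, not anything taken from the Rosetta sources.

```python
def pshufb(data: bytes, mask: bytes) -> bytes:
    """Scalar model of the 128-bit SSSE3 PSHUFB instruction.

    For each of the 16 result bytes: if the corresponding mask byte has its
    high bit (0x80) set, the result byte is zero; otherwise the low 4 bits of
    the mask byte select which byte of `data` is copied.
    """
    if len(data) != 16 or len(mask) != 16:
        raise ValueError("this model only covers 16-byte (xmm) operands")
    return bytes(0 if m & 0x80 else data[m & 0x0F] for m in mask)


if __name__ == "__main__":
    data = bytes(range(16))
    # Reversing byte order is a typical pshufb use (e.g. endianness swaps).
    reverse_mask = bytes(range(15, -1, -1))
    print(pshufb(data, reverse_mask).hex())  # 0f0e0d0c0b0a09080706050403020100
```

On an SSSE3-capable CPU this is a single instruction; without SSSE3 the same result has to come from ordinary loads and stores like the loop above.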
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
I've been a moderator a long time, and I think this is the first time anyone has reported a problem along with an analysis of a disassembly! I am not the developer of the code, so forgive some naive questions, but I do want to gather as much information as I can.

I notice that your machine is AMD. In one case, a work unit that you failed on was completed without incident by an Intel machine running Linux: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1020967004 Their machine reports as: GenuineIntel Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz [Family 6 Model 158 Stepping 9]. Am I correct to presume that an i5-7500 has SSSE3 support? (From a quick search, "SSSE3" appears to be Intel's term, so that's probably true.) It sounds like there may be a missing conditional compilation directive somewhere.

One question: you ran a disassembly and saw the SSSE3 instruction there... do you have any way to confirm whether your machine actually attempted to run that instruction? I believe the compiled code may contain conditional sections that are placed automatically according to the compiler options. So seeing the instruction in the code would be expected, but your machine should have branched around it. ...I'm sure you know what I'm trying to say.

This may explain an issue I've been seeing with Linux machines. I had presumed it was related to machines with only 1GB of memory per CPU. But those would typically be older machines, and older machines are typically the ones that might not support SSSE3, so perhaps that is actually the nail I've been looking for. Thank you for the fresh perspective. I have passed your information on to the Project Team.
Rosetta Moderator: Mod.Sense
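The question above is whether the SSSE3 code sits behind a runtime check. A minimal Python sketch of that runtime-dispatch pattern follows, assuming Linux's /proc/cpuinfo flag names. All function names are hypothetical, and the "SSSE3" path is only a stand-in, since Python cannot issue pshufb itself. If a binary is instead compiled with SSSE3 enabled throughout, there is no such branch to take, and a CPU without SSSE3 faults on the first such instruction.

```python
from __future__ import annotations

from typing import Callable, Iterable


def reverse16_generic(block: bytes) -> bytes:
    """Portable path: plain byte reversal that any x86-64 CPU can run."""
    return block[::-1]


def reverse16_ssse3(block: bytes) -> bytes:
    """Stand-in for a path that, in native code, would use pshufb.

    Python cannot emit pshufb, so this computes the same result; the point is
    that such a path must only be selected when the CPU reports SSSE3.
    """
    return bytes(block[15 - i] for i in range(16))


def select_reverse16(cpu_flags: Iterable[str]) -> Callable[[bytes], bytes]:
    """Choose an implementation once, at runtime, from the CPU feature flags."""
    return reverse16_ssse3 if "ssse3" in set(cpu_flags) else reverse16_generic


def read_cpu_flags(path: str = "/proc/cpuinfo") -> set[str]:
    """Return the flags of the first CPU in /proc/cpuinfo (empty if unavailable)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


if __name__ == "__main__":
    impl = select_reverse16(read_cpu_flags())
    print(impl.__name__, impl(bytes(range(16))).hex())
```

Selecting the implementation once, rather than checking the flag on every call, keeps the check out of hot loops; that is also what compiler-generated dispatch typically aims for.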
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0
Indeed this is an issue, and thanks for your detective work. We can address this in the next planned app update, which is in the works. We'll have to push out the current apps that are being tested on Ralph@h tomorrow so that we can get started on dependent COVID-19 related tasks sooner rather than later, but we also plan to do another update soon to include a specific COVID-19 protocol. For this update, which we are currently working on, we can include specific SSE and non-SSE apps (your options 2 and 3). We can set up a plan class on Ralph, which should address option 3, and test the server functionality on Ralph soon with the existing apps; after the next update, we'll test the functionality with SSE and non-SSE apps (option 2). Thanks again! And thank you, Mod.Sense, for also bringing this to our attention.
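Serving separate SSE and non-SSE apps means the server has to match each host to a build it can actually run. Here is a minimal Python sketch of that matching step, assuming the host's CPU flag list is known to the server. The build names and flag sets are hypothetical, and this does not reproduce BOINC's actual plan-class configuration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AppBuild:
    name: str                   # hypothetical label for one app version
    required: frozenset[str]    # /proc/cpuinfo flags this build needs


# Ordered from most to least demanding. The last entry needs only SSE/SSE2,
# which every x86-64 CPU provides.
BUILDS = (
    AppBuild("x86_64-ssse3", frozenset({"sse", "sse2", "pni", "ssse3"})),
    AppBuild("x86_64-baseline", frozenset({"sse", "sse2"})),
)


def pick_build(host_flags: set[str], builds: tuple[AppBuild, ...] = BUILDS) -> AppBuild | None:
    """Return the first (most demanding) build whose requirements the host meets."""
    for build in builds:
        if build.required <= host_flags:
            return build
    return None


if __name__ == "__main__":
    athlon_x2 = {"sse", "sse2", "pni", "3dnow"}  # abbreviated flag set
    print(pick_build(athlon_x2).name)             # x86_64-baseline
```

Ordering the builds from most to least demanding means a capable host still gets the faster build, while a host without SSSE3 falls through to the baseline.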
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
So, to try to generalize here a bit:
App version showing the problem: Rosetta v4.08 x86_64-pc-linux-gnu
Symptom: task ends early with signal 11
Affected systems: CPUs that do not support SSSE3, which are often AMD CPUs (especially older ones, "prior to Bulldozer").
The good news is that it often seems to trip within the first 2 minutes of a task's execution.
Rosetta Moderator: Mod.Sense
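To check whether a Linux machine falls into the affected group, one can look for the `ssse3` flag in /proc/cpuinfo. Below is a minimal Python sketch, assuming the standard Linux flag names. Keep in mind that the kernel lists SSE3 as `pni`, so a flags line containing `pni` but not `ssse3` is exactly the situation described here. The function names are hypothetical.

```python
from __future__ import annotations

# Linux /proc/cpuinfo flag name -> usual name of the extension.
# The kernel reports SSE3 as "pni" (Prescott New Instructions).
FLAG_NAMES = {
    "sse": "SSE",
    "sse2": "SSE2",
    "pni": "SSE3",
    "ssse3": "SSSE3",
    "sse4_1": "SSE4.1",
    "sse4_2": "SSE4.2",
    "avx": "AVX",
}


def simd_support(flags_line: str) -> dict[str, bool]:
    """Map a 'flags : ...' line from /proc/cpuinfo to per-extension support."""
    flags = set(flags_line.split(":", 1)[-1].split())
    return {name: flag in flags for flag, name in FLAG_NAMES.items()}


def affected_by_ssse3_issue(flags_line: str) -> bool:
    """True if the CPU lacks SSSE3, the situation reported in this thread."""
    return not simd_support(flags_line)["SSSE3"]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            line = next((ln for ln in f if ln.startswith("flags")), "")
    except OSError:
        line = ""
    print(simd_support(line), "affected:", affected_by_ssse3_issue(line))
```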
ShimmerFairy Joined: 26 Mar 20 Posts: 3 Credit: 2,060 RAC: 0
First, I want to stress that the issue is with SSSE3, not with any of the other very similarly named extensions to the x86 line of CPUs. I stress it because there's an insidious difference between SSE3 (called "pni" in the CPU flags in /proc/cpuinfo, and which my CPU does support) and SSSE3 (the instruction set my CPU doesn't support). It's also worth noting that, unsurprisingly, my CPU doesn't support any of SSE4 either. Here's the set of flags straight from my /proc/cpuinfo, for reference:

    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl cpuid extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch vmmcall lbrv

So, to sum up, my CPU can handle SSE, SSE2, and SSE3 ("pni" above), but not SSSE3 or any of SSE4. A very quick look suggests that SSE and SSE2 were in the very first x86-64 CPUs, so those don't realistically need runtime checks in 64-bit code, and SSE3 showed up not at first but fairly early on (with SSSE3 taking quite a bit longer to appear, at least on the AMD side).

About "did the instruction actually run?": I got the location in question from the backtraces in each of the coredumps generated by the segfault, with the aid of GDB. For example:

    (gdb) bt
    #0  0x0000000008389940 in ?? ()
    #1  0x0000000005f2e604 in ?? ()
    #2  <signal handler called>
    #3  0x0000000008389940 in ?? ()
    #4  0x0000000005f2e604 in ?? ()
    #5  <signal handler called>
    #6  0x00000000013bc5f8 in ?? ()
    #7  0x0000000003aec103 in ?? ()
    #8  0x0000000003b01e08 in ?? ()
    #9  0x0000000003a8bf0d in ?? ()
    #10 0x0000000003a32025 in ?? ()
    #11 0x0000000003a37de5 in ?? ()
    #12 0x0000000003a9513c in ?? ()
    #13 0x0000000003a95434 in ?? ()
    #14 0x0000000003deb5ab in ?? ()
    #15 0x00000000027480d3 in ?? ()
    #16 0x0000000002749f3a in ?? ()
    #17 0x0000000002f70884 in ?? ()
    #18 0x000000000371db00 in ?? ()
    #19 0x000000000371eccb in ?? ()
    #20 0x000000000371f44e in ?? ()
    #21 0x00000000037866f8 in ?? ()
    #22 0x0000000003788221 in ?? ()
    #23 0x0000000003826698 in ?? ()
    #24 0x00000000038261a3 in ?? ()
    #25 0x00000000004135e6 in ?? ()
    #26 0x0000000005ff3ccc in ?? ()
    #27 0x00000000006108e7 in ?? ()

As you can see, address 0x00000000013bc5f8 comes from the innermost call frame before the signal handler first gets tripped, and every single failed attempt after I fixed my vsyscall support has that same address just before the signal handler gets called.
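The reasoning above can be automated. GDB numbers frames from the innermost (#0) outward, so the outermost `<signal handler called>` marker belongs to the first signal, and the frame right after it is the code that was running when that signal arrived. The following Python sketch extracts that address from pasted `bt` output. It assumes GDB's usual `#N  0xADDR in ...` frame format, and the function name is hypothetical.

```python
from __future__ import annotations

import re

FRAME_RE = re.compile(r"#(\d+)\s+(0x[0-9a-fA-F]+)")
HANDLER_RE = re.compile(r"#(\d+)\s+<signal handler called>")


def first_signal_frame(backtrace: str) -> str | None:
    """Address of the frame just below the outermost '<signal handler called>'.

    GDB numbers frames from the innermost (#0) outwards, so the handler marker
    with the highest number belongs to the first signal delivered, and the
    frame right after it is the code that was interrupted by that signal.
    """
    handlers = [int(n) for n in HANDLER_RE.findall(backtrace)]
    if not handlers:
        return None
    frames = {int(n): addr for n, addr in FRAME_RE.findall(backtrace)}
    return frames.get(max(handlers) + 1)


if __name__ == "__main__":
    sample = """#0  0x0000000008389940 in ?? ()
#1  0x0000000005f2e604 in ?? ()
#2  <signal handler called>
#3  0x0000000008389940 in ?? ()
#4  0x0000000005f2e604 in ?? ()
#5  <signal handler called>
#6  0x00000000013bc5f8 in ?? ()
#7  0x0000000003aec103 in ?? ()"""
    print(first_signal_frame(sample))  # 0x00000000013bc5f8
```

Running this over every coredump's backtrace and comparing the results is one way to confirm that all failures stop at the same instruction.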
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0
noted, SSSE3. thanks
ShimmerFairy Joined: 26 Mar 20 Posts: 3 Credit: 2,060 RAC: 0
Just a quick addition: I got worried that there were other instruction sets in the program that my computer couldn't run, and that if the compatibility version you're making only disables SSSE3 specifically, I'd keep coming back over and over again to say "now it doesn't work for this reason". So I wrote a quick program to go through rosetta_4.08_x86_64-pc-linux-gnu and check ahead of time. Since it was a very quick project, I didn't do anything fancy like figuring out whether the too-new instructions were placed behind a conditional that ensures they only run on capable computers. With that caveat in mind, I did find other instruction sets my CPU can't use (listed at the bottom).

The easiest way to handle this would be to compile the compatibility version for a specific kind of CPU, instead of just trying to disable specific features. For example, if you want to support all x86-64 CPUs, you could compile for the K8 family of AMD processors (the first ones to implement x86-64). If you're using GCC, for instance, that should be doable by making sure that -march=k8 shows up in the options to gcc without any other options contradicting it. (This all assumes, of course, that none of the instruction sets in question come from hand-written assembler or from compiler intrinsics, which are essentially a slightly easier form of hand-written assembler.)

I also want to say that I appreciate you guys trying to support old CPUs like mine. My Athlon X2 is over a decade old now (in fact, it's part of that K8 family, though a bit later in the series), and I wouldn't be surprised if you ultimately decided "if your x86-64 CPU is too old, then it can only run the 32-bit stuff".

The specific instruction sets I found in the program that I can't handle, in case this is of interest, were SSE4.1, SSE4.2, AVX, and more niche feature sets like Restricted Transactional Memory, xgetbv, and rdrand.
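The checker described above isn't posted, but one rough way to do something similar is to disassemble the binary with `objdump -d` and look for mnemonics that belong to newer extensions. The Python sketch below is such a heuristic, not ShimmerFairy's program. Its mnemonic table is a small, incomplete sample chosen for illustration, and like the original check, it cannot tell whether an instruction is guarded by a runtime CPU check. The names are hypothetical.

```python
from __future__ import annotations

import re
from collections import Counter
from typing import Iterable

# A deliberately small, incomplete sample of mnemonics and the extension
# that introduced them. A real scanner would need a complete table.
MNEMONIC_EXT = {
    "pshufb": "SSSE3", "palignr": "SSSE3", "phaddw": "SSSE3", "pabsb": "SSSE3",
    "ptest": "SSE4.1", "pblendvb": "SSE4.1", "pminsd": "SSE4.1", "pmulld": "SSE4.1",
    "pcmpestri": "SSE4.2", "pcmpistri": "SSE4.2", "pcmpgtq": "SSE4.2",
    "vzeroupper": "AVX",
    "xbegin": "RTM", "xend": "RTM", "xabort": "RTM",
    "xgetbv": "XSAVE",
    "rdrand": "RDRAND",
}

# Identifier-like tokens; keeps symbol names such as <crc32_fast> whole so
# they are not mistaken for mnemonics.
TOKEN_RE = re.compile(r"[a-z_][a-z0-9_.]*")


def scan_disassembly(lines: Iterable[str]) -> Counter[str]:
    """Count disassembly lines whose instruction belongs to a listed extension."""
    counts: Counter[str] = Counter()
    for line in lines:
        for token in TOKEN_RE.findall(line.lower()):
            ext = MNEMONIC_EXT.get(token)
            if ext:
                counts[ext] += 1
                break  # at most one instruction per line
    return counts


if __name__ == "__main__":
    import sys
    # Usage (assumed): objdump -d rosetta_4.08_x86_64-pc-linux-gnu | python3 scan.py
    if not sys.stdin.isatty():
        for ext, count in sorted(scan_disassembly(sys.stdin).items()):
            print(f"{ext}: {count}")
```

A binary built with -march=k8, as suggested above, should produce no hits from compiler-generated code, although hand-written assembler or intrinsics could still show up.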
Germano_0x Joined: 27 Dec 13 Posts: 3 Credit: 2,493,872 RAC: 0
Hi ShimmerFairy, can you describe the whole procedure you followed so that we (other users) can do the same if we run into similar problems? For example, one thing I don't know is how to attach GDB to a work unit that is no longer running because it has failed. Thank you
sspseudoo Joined: 4 Mar 20 Posts: 7 Credit: 23,843 RAC: 0
I have problems again: https://boinc.bakerlab.org/rosetta/results.php?userid=2083373

I ran "strace" on a crashing process, and these are the last lines:

    --- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x14c77b8} ---
    ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
    ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
    ioctl(0, SNDCTL_TMR_CONTINUE or TCSETSF, {B38400 opost isig icanon echo ...}) = 0
    ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
    --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
    --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
    +++ killed by SIGSEGV (core dumped) +++

Maybe "wrong" instructions again? Some tasks seem to work; the others crash after a minute or so...

@Germano_0x: I did it as follows. Wait for a new task that is about to crash. Find its PID using "top", or look in your BOINC Manager under the task properties or similar. Attach to the process with "strace -p <PID>". Watch the crash and note the address, as above. Then (on Fedora) I just ran "coredumpctl gdb", and the coredump was automatically loaded into gdb. In gdb, I used "bt" to get a backtrace of the crash. The last address before <signal handler called> is the same address as shown by strace. (So the strace step I did at the beginning wasn't necessary at all, but it already pointed in the right direction, I suppose.) Next, I disassembled the area as described in the first post:

    (gdb) disassemble 0x14c77b8,0x14c77c8
    Dump of assembler code from 0x14c77b8 to 0x14c77c8:
       0x00000000014c77b8: pshufb %xmm1,%xmm0
       0x00000000014c77bd: movdqa %xmm0,0x0(%r13)
       0x00000000014c77c3: jae 0x14c788c
    End of assembler dump.

As for which area to look at: I don't really know; I just used an end address a bit higher than the crash address. The unsupported instruction pshufb still seems to be used in rosetta_4.20_x86_64-pc-linux-gnu on this "older" computer.
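The strace-to-gdb step above can be scripted. Here is a minimal Python sketch, assuming strace's `--- SIGxxx {... si_addr=...} ---` line format as quoted in this post. It finds the first signal with a non-NULL fault address (here the SIGILL, which comes before the SIGSEGVs) and builds the same kind of `disassemble START,END` command used in this thread. The 16-byte window is an arbitrary choice, as in the post, and the function names are hypothetical.

```python
from __future__ import annotations

import re

SIGNAL_RE = re.compile(r"--- (SIG[A-Z]+) \{[^}]*si_addr=(0x[0-9a-fA-F]+|NULL)")


def first_fault(strace_output: str) -> tuple[str, int] | None:
    """Return (signal name, address) of the first signal with a non-NULL si_addr."""
    for sig, addr in SIGNAL_RE.findall(strace_output):
        if addr != "NULL":
            return sig, int(addr, 16)
    return None


def gdb_disassemble_cmd(addr: int, length: int = 16) -> str:
    """Build a gdb 'disassemble START,END' command covering `length` bytes."""
    return f"disassemble {addr:#x},{addr + length:#x}"


if __name__ == "__main__":
    sample = "--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x14c77b8} ---"
    sig, addr = first_fault(sample)
    print(sig, gdb_disassemble_cmd(addr))  # SIGILL disassemble 0x14c77b8,0x14c77c8
```

The generated command can then be pasted into the `coredumpctl gdb` session.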
©2024 University of Washington
https://www.bakerlab.org