A simple uuid_generate() call in a statically linked program segfaults (or hangs) when running on a Xen kernel on the i686 arch (x86_64 works, and dynamically linked libuuid is also fine).

I am not sure whether this is a kernel-xen problem, but because it happens in libuuid, I am reporting it against e2fsprogs.

See this simple program:

#include <stdio.h>
#include <stdlib.h>
#include <uuid/uuid.h>

int main (int argc, char *argv[])
{
	char str[64];
	uuid_t uu;

	uuid_generate(uu);
	uuid_unparse(uu, str);
	printf("%s\n", str);

	return 0;
}

compiled with
cc -o uuidgen uuidgen.c -g -O0 -static -luuid

# gdb ./uuidgen
(gdb) list main
1	#include <stdio.h>
2	#include <stdlib.h>
3	#include <uuid/uuid.h>
4
5	int main (int argc, char *argv[])
6	{
7		char str[64];
8		uuid_t uu;
9
10		uuid_generate(uu);
(gdb) r
Starting program: /root/uuid/uuidgen

Program received signal SIGSEGV, Segmentation fault.
get_random_bytes (buf=0xbfab7268, nbytes=16) at gen_uuid.c:161
161		memcpy(tmp_seed, jrand_seed, sizeof(tmp_seed));
(gdb) bt
#0  get_random_bytes (buf=0xbfab7268, nbytes=16) at gen_uuid.c:161
#1  0x08048505 in uuid__generate_random (out=0xbfab72c4 "�r������", num=0xbfab72a4) at gen_uuid.c:540
#2  0x0804858f in uuid_generate_random (out=0xbfab72c4 "�r������") at gen_uuid.c:556
#3  0x08048244 in main () at uuidgen.c:10
(gdb) list
156		 * randomness if /dev/random/urandom is out to lunch.
157		 */
158		for (cp = buf, i = 0; i < nbytes; i++)
159			*cp++ ^= (rand() >> 7) & 0xFF;
160	#ifdef DO_JRAND_MIX
161		memcpy(tmp_seed, jrand_seed, sizeof(tmp_seed));
162		jrand_seed[2] = jrand_seed[2] ^ syscall(__NR_gettid);
163		for (cp = buf, i = 0; i < nbytes; i++)
164			*cp++ ^= (jrand48(tmp_seed) >> 7) & 0xFF;
165		memcpy(jrand_seed, tmp_seed,
(gdb) p jrand_seed
Cannot find thread-local variables on this target

# rpm -q e2fsprogs e2fsprogs-devel kernel-xen
e2fsprogs-1.39-15.el5
e2fsprogs-devel-1.39-15.el5
kernel-xen-2.6.18-92.1.18.el5

# uname -a
Linux proliant06 2.6.18-92.1.18.el5xen #1 SMP Wed Nov 5 09:30:07 EST 2008 i686 i686 i386 GNU/Linux
dmesg also contains these Xen-specific entries:

4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
Those kernel messages should be harmless. When using shared libraries on i386 Xen, the dynamic linker loads a Xen-aware libc that avoids negative segment address accesses. Such accesses are functionally fine, but they do incur a performance hit under Xen, hence the kernel warns if any app makes them. Since you are statically linking, the program does not have the option of using the Xen-optimized libc, so you see these messages. You should be able to safely ignore them.
Marking for consideration as a 5.3 blocker, since this is the root cause for 467193
Is this a regression?
I think that it doesn't work on RHEL5.2 released version either.
Hm, strange, if I build an x86 executable in a 64-bit xen guest on a 64-bit host, it seems fine:

[root@localhost ~]# gcc -m32 -o test32 test.c -g -O0 -static -luuid
[root@localhost ~]# ./test32
f8f2847d-1113-44c6-8aee-6b3be9ae992c
[root@localhost ~]# file test32
test32: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.9, statically linked, for GNU/Linux 2.6.9, not stripped

Guess I'll build an x86 guest - or was your host on x86 as well?

Thanks,
-Eric
FYI, I have tried a 32-bit executable in a 64-bit guest on a 64-bit host, and a 32-bit executable in a 32-bit guest on a 64-bit host. Both worked fine. The only time I could reproduce it is in a 32-bit Xen Dom0 host. The scenario I've not had time to test is whether a 32-bit guest on a 32-bit host is affected. In my tests the demo code always hangs; I don't see a crash.
Daniel, thanks. I guess I'll grab an x86 box out of RHTS and try it there. -Eric
Yes, this is only reproducible
- on x86 32bit Dom0
- on 32bit (paravirtualized) guest running on 32bit host *only*.

(The same machine with the same binary but a non-xen kernel works ok.)
FYI, if you don't have a 32-bit machine, but do have a 64-bit Fedora box able to run KVM fully virt, you can also install a RHEL-5 Xen 32-bit Dom0 inside the 64-bit KVM guest. That's how I reproduced it :-)
(In reply to comment #8)
> For my tests the demo code will always hang, i don't see a crash.

I saw this too - it hangs in a 16-byte memcpy() call! Another machine here just segfaults with the same binary.
Hm, I was going to give this a whirl on updated e2fsprogs, but during the make check phase:

make[1]: Entering directory `/usr/src/redhat/BUILD/e2fsprogs-1.41.3/lib/uuid'
LD_LIBRARY_PATH=../../lib DYLD_LIBRARY_PATH=../../lib ./tst_uuid
make[1]: *** [check] Segmentation fault

so I guess it persists.

One thing about this is that the uuid gen stuff uses thread local storage... not sure how that might play into this with the Xen kernel.
Well, this seems odd. With this patch:

Index: e2fsprogs-1.39/lib/uuid/gen_uuid.c
===================================================================
--- e2fsprogs-1.39.orig/lib/uuid/gen_uuid.c	2008-11-17 15:30:00.000000000 -0600
+++ e2fsprogs-1.39/lib/uuid/gen_uuid.c	2008-11-17 16:13:50.904588722 -0600
@@ -136,6 +136,7 @@ static void get_random_bytes(void *buf,
 	int lose_counter = 0;
 	unsigned char *cp = (unsigned char *) buf;
 	unsigned short tmp_seed[3];
+	int tid;
 
 	if (fd >= 0) {
 		while (n > 0) {
@@ -159,7 +160,8 @@ static void get_random_bytes(void *buf,
 		*cp++ ^= (rand() >> 7) & 0xFF;
 #ifdef DO_JRAND_MIX
 	memcpy(tmp_seed, jrand_seed, sizeof(tmp_seed));
-	jrand_seed[2] = jrand_seed[2] ^ syscall(__NR_gettid);
+	tid = syscall(__NR_gettid);
+	jrand_seed[2] = jrand_seed[2] ^ tid;
 	for (cp = buf, i = 0; i < nbytes; i++)
 		*cp++ ^= (jrand48(tmp_seed) >> 7) & 0xFF;
 	memcpy(jrand_seed, tmp_seed,

it works ok. The syscall is not failing.... I'm not sure what's going wrong here.
This sure looks like a toolchain problem to me, unless someone can point out what's wrong with the code above? (and why it would only manifest itself on a static link?)
I'm going to give a wild guess and suspect thread local storage is somehow busted on static links in a Xen guest. Try applying this pseudo-patch to lib/uuid/Makefile.in and see if the problem goes away:

 .c.o:
 	@echo "	CC $<"
-	@$(CC) $(ALL_CFLAGS) -c $< -o $@
+	@$(CC) $(ALL_CFLAGS) -c $< -o $@ -UTLS
 @PROFILE_CMT@	@$(CC) $(ALL_CFLAGS) -g -pg -o profiled/$*.o -c $<
 @CHECKER_CMT@	@$(CC) $(ALL_CFLAGS) -checker -g -o checker/$*.o -c $<
 @ELF_CMT@	@$(CC) $(ALL_CFLAGS) -fPIC -o elfshared/$*.o -c $<
 @BSDLIB_CMT@	@$(CC) $(ALL_CFLAGS) $(BSDLIB_PIC_FLAG) -o pic/$*.o -c $<

If so, this could be considered a toolchain/library bug, but some operating systems, such as Solaris, have stopped supporting static linking because things like TLS are hard to keep working with static linking. Sure seems weird that it is only blowing up in a Xen environment, though!
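The -UTLS suggestion only helps if the library declares its per-thread state through a macro that falls back to an ordinary static when TLS is undefined. A minimal sketch of that pattern follows; it assumes the build passes something like -DTLS=__thread on the command line, and the names THREAD_LOCAL, seed and bump_seed are hypothetical stand-ins, not the library's actual source.

```c
/* Assumption: the build passes -DTLS=__thread when the compiler supports
 * thread-local storage; a later -UTLS on the same command line removes it. */
#ifdef TLS
#define THREAD_LOCAL static TLS   /* one copy per thread, reached through %gs on i386 */
#else
#define THREAD_LOCAL static       /* one copy per process, ordinary data access */
#endif

/* Hypothetical stand-in for the per-thread seed used by the generator. */
THREAD_LOCAL unsigned short seed[3] = { 1, 2, 3 };

/* XOR a value into the last seed word and return the updated word. */
unsigned short bump_seed(unsigned short v)
{
	seed[2] ^= v;
	return seed[2];
}
```

Built without TLS defined, the seed access compiles to a plain memory reference, so no %gs-relative instruction is generated for it. The cost is that the seed is then shared between threads.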
Per IRC conversation, Eric pointed out that tst_uuid is dynamically compiled. So this problem of what's going on with thread-local storage and Xen may not be limited to static libraries. Also, the generated assembly with and without the patch mentioned in comment #14 is quite a bit different. Which really seems weird, and it would be good to get a compiler expert to weigh in here. Maybe we should get a simplified test case and consider this a potential toolchain bug?
Created attachment 323969 [details] disassembly of function w/o patch
Created attachment 323970 [details] disassembly of function with patch
I'm slightly hesitant to just commit the patch until we can get to the real underlying problem, to be sure we're not just papering over the problem. I'll see if I can get a simplified testcase for the toolchain folks to look at.
I tried this simplified code and it segfaults on the xen kernel... Am I overlooking something completely obvious?

compiled with (not static now!)
cc -o uuidgen uuidgen.c -g -O0 -pedantic -Wall

#include <stdio.h>

static __thread unsigned short tst_thread = 0xbabe;

int main (int argc, char *argv[])
{
	unsigned short tst = 0xdead;

	printf("tst is %0x, tst_thread is %0x\n", tst, tst_thread);

	return 0;
}
Slightly reduced reproducer from Comment 21:

volatile __thread short shortvar;

int main (void)
{
	/* movzwl %gs:0xfffffffe,%eax */
	return shortvar;
}

IMO it is a xen hypervisor problem (reassign it), as I guess it wrongly emulates (traps) `movzwl'.

One can build the userland program using `gcc -mno-tls-direct-seg-refs' to work around this problem (I do not know of any possible pitfalls of -mno-tls-direct-seg-refs otherwise, though).
Punting to Xen ...
Further punting to kernel-xen, since it's a HV problem
Bill, Just to be clear on why this is a priority for 5.3: This BZ is the root cause for bug #467193, and that bug prevents the installer from using encryption. It is not likely we can ship with that not working. Tom
I'll further note that as Thread Local Storage gets used by more and more programs as time goes on, and given that there's probably not as much testing for various packages running under Xen, it's likely that other programs may also end up seg faulting when running under the Xen kernel. So I would hazard a guess that e2fsprogs won't be the only program that will trip up against this bug.
Tom, it's not clear from bug #467193 whether this is a regression from 5.2, or whether installing to encrypted disks has been broken ever since 5.0 GA on i386 Xen. I suspect it's probably the latter case, but it would be good to have confirmation about whether this is a regression or not.
Daniel, although cryptsetup-luks has been available since 5.0 GA, RHEL 5.3 is the first to support it in the installer. Based on that I'll not consider the installer behavior as regression. Please refer to Milan if you need to know if the root cause (i.e. this bug) is a regression or not. Thanks.
For cryptsetup, technically it is not a regression: statically linked cryptsetup in LUKS mode probably never worked on kernel-xen (at least it doesn't work in RHEL5.2). But we are going to support encrypted system install in the 5.3 installer (new functionality) and it doesn't work for Dom0 and Xen guests on i686. Many people will use xen guests for prototyping, testing etc. Also, as Theodore noted, this bug can probably hit more programs using TLS.
Do we know roughly when this started happening? The initial report was made on 11/16. It would be key to know what kernel build you were using at the time and whether it had been working prior to that.
Bill, it would appear as if this has always been a problem - it was simply never noticed prior to now, because 5.3 is the first release with encryption supported in the installer. As mentioned it could also impact other apps using thread locals
I agree with Jan in Comment #22. The insn looks like:

65 0f b7...

65 is decoded properly as gs override. But the twobyte insn is not properly decoded. We see 0f, and use the twobyte opcode table, but b7 is listed as unknown. Upstream has fixed this in:

http://xenbits.xensource.com/xen-unstable.hg?rev/81aa410fa662

Important bit is here:

     /* 0xB0 - 0xBF */
-    X, X, X, X, X, X, X, X,
-    X, X, X, X, X, X, X, X,
+    X, X, X, O|M, X, X, O|M, O|M,
Jan and Chris are right on the money here. This has to do with the 32-bit segment fixups that the hypervisor does. It goes like this:

In normal, bare-metal operation, glibc uses negative segment offsets (for performance reasons) when accessing thread-local variables. However, in a Xen environment, these cause problems, because the hypervisor is mapped into the upper region of the memory address space. For this reason, under Xen, these instructions are trapped, examined, and emulated. Obviously, if this was done for every program, it would cause quite a performance penalty. So for dynamically linked programs we choose a different glibc that doesn't use the negative segment offsets, and things work pretty well. However, for statically linked programs, the only options are to either build with -mno-tls-direct-seg-refs (which is undesirable for bare-metal), or to have Xen correctly emulate the instruction (which it is not doing in this case).

To confirm, I rebooted the 5.3 HV with "loglvl=all", and ran the above test; I got:

[root@amd1 ~]# ./a.out
(XEN) seg_fixup.c:418: Unsupported two byte opcode b7
Segmentation fault

Which, again, confirms that the HV is not emulating properly. I applied the changeset mentioned in comment #32 above to the 5.3 HV and re-ran the test; instead of a crash, I got:

4gb seg fixup, process a.out (pid 5866), cs:ip 73:08048382
4gb seg fixup, process a.out (pid 5866), cs:ip 73:007ebcc6

and the program doesn't crash. That seems to be the right behavior, so the attached patch seems to be what we need. Thanks, Chris and Jan.

Chris Lalancette
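To illustrate the decode path described above, here is a rough sketch of a prefix-plus-two-byte-opcode lookup. It is not the hypervisor's code: the function classify, the result enum and the flag values are hypothetical, the meaning given to X, O and M is an assumption, and the table only covers 0xB0 to 0xB7 because that is the part of the new row quoted in the changeset excerpt.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag names follow the X / O|M entries in the quoted table.
 * Assumed meaning: X = not handled, O = recognized opcode, M = has a ModRM byte. */
enum { X = 0, O = 1, M = 2 };

/* Hypothetical excerpt: entries for two-byte opcodes 0x0f 0xb0 .. 0x0f 0xb7. */
static const uint8_t row_b0_old[8] = { X, X, X, X,   X, X, X,   X   };
static const uint8_t row_b0_new[8] = { X, X, X, O|M, X, X, O|M, O|M };

enum decode_result { DECODE_OK, DECODE_UNSUPPORTED, DECODE_TRUNCATED };

/* Classify the start of an instruction against one version of the table. */
enum decode_result classify(const uint8_t *insn, size_t len,
                            const uint8_t row_b0[8])
{
	size_t i = 0;

	/* Skip a %gs segment-override prefix (0x65) if present. */
	if (i < len && insn[i] == 0x65)
		i++;

	/* Need the 0x0f escape byte plus the opcode byte that follows it. */
	if (i + 1 >= len)
		return DECODE_TRUNCATED;
	if (insn[i] != 0x0f)
		return DECODE_UNSUPPORTED;  /* one-byte opcodes are out of scope here */

	uint8_t op = insn[i + 1];
	if (op < 0xb0 || op > 0xb7)
		return DECODE_UNSUPPORTED;  /* the sketch only models this part of the row */

	return row_b0[op - 0xb0] ? DECODE_OK : DECODE_UNSUPPORTED;
}
```

Feeding it the bytes 65 0f b7 05 fe ff ff ff (the movzwl %gs:0xfffffffe,%eax from the reduced reproducer) gives DECODE_UNSUPPORTED with the old row and DECODE_OK with the new one, which matches the behavior change the changeset is meant to produce.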
Created attachment 324850 [details] Backport of upstream Xen c/s 16407, which seems to fix this issue
Great, do we have test build with the patch anywhere?
I'm sending a scratch build through brew at the moment; I'll give you a pointer to it once it's done. Chris Lalancette
Here's one I started last night, all cooked and ready to try:
https://brewweb.devel.redhat.com/taskinfo?taskID=1590131
With the 2.6.18-124.el5bbuuidxen kernel cryptsetup works there, thanks!
Created attachment 324882 [details] More faithful backport of Xen c/s 16407

This version of the backport is from Bill Burns, and is a bit more faithful than mine. It's also the exact version that Milan tested, so we should go with this version.
*** Bug 467193 has been marked as a duplicate of this bug. ***
*** Bug 474148 has been marked as a duplicate of this bug. ***
in kernel-2.6.18-126.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
Bug successfully reproduced on 2.6.18-92.el5xen: the reproducer program exits with a segfault in a gdb session. The bug is fixed on 2.6.18-126.el5xen: the reproducer program exits normally in a gdb session.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html