Bug 471801 - statically linked uuid segfaults in uuid_generate() on Xen kernel
Summary: statically linked uuid segfaults in uuid_generate() on Xen kernel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Chris Lalancette
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 467193 474148
Depends On:
Blocks: 467193
Reported: 2008-11-16 16:47 UTC by Milan Broz
Modified: 2018-10-20 02:19 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 19:48:22 UTC
Target Upstream Version:
Embargoed:


Attachments
disassembly of function w/o patch (4.76 KB, text/plain)
2008-11-18 22:06 UTC, Eric Sandeen
disassembly of function with patch (4.34 KB, text/plain)
2008-11-18 22:06 UTC, Eric Sandeen
Backport of upstream Xen c/s 16407, which seems to fix this issue (11.27 KB, patch)
2008-11-27 10:05 UTC, Chris Lalancette
More faithful backport of Xen c/s 16407 (11.61 KB, patch)
2008-11-27 13:24 UTC, Chris Lalancette


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2009:0225 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update | 2009-01-20 16:06:24 UTC

Description Milan Broz 2008-11-16 16:47:06 UTC
A simple uuid_generate() call in a statically linked program segfaults (or hangs)
when running on a Xen kernel on the i686 architecture (x86_64 works, and a dynamically linked libuuid is also fine).

I am not sure whether this is a kernel-xen problem, but because it happens in libuuid, I am reporting it against e2fsprogs.

See this simple program:

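#include <stdio.h>
#include <stdlib.h>
#include <uuid/uuid.h>

int main (int argc, char *argv[])
{
        char str[64];
        uuid_t uu;

        /* Generate a UUID and print its textual form. */
        uuid_generate(uu);
        uuid_unparse(uu, str);
        printf("%s\n", str);

        return 0;
}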

compiled with
cc -o uuidgen uuidgen.c -g -O0 -static -luuid

# gdb ./uuidgen

(gdb) list main
1       #include <stdio.h>
2       #include <stdlib.h>
3       #include <uuid/uuid.h>
4
5       int main (int argc, char *argv[])
6       {
7               char str[64];
8               uuid_t uu;
9
10              uuid_generate(uu);
(gdb) r
Starting program: /root/uuid/uuidgen

Program received signal SIGSEGV, Segmentation fault.
get_random_bytes (buf=0xbfab7268, nbytes=16) at gen_uuid.c:161
161             memcpy(tmp_seed, jrand_seed, sizeof(tmp_seed));
(gdb) bt
#0  get_random_bytes (buf=0xbfab7268, nbytes=16) at gen_uuid.c:161
#1  0x08048505 in uuid__generate_random (out=0xbfab72c4 "�r������", num=0xbfab72a4) at gen_uuid.c:540
#2  0x0804858f in uuid_generate_random (out=0xbfab72c4 "�r������") at gen_uuid.c:556
#3  0x08048244 in main () at uuidgen.c:10
(gdb) list
156              * randomness if /dev/random/urandom is out to lunch.
157              */
158             for (cp = buf, i = 0; i < nbytes; i++)
159                     *cp++ ^= (rand() >> 7) & 0xFF;
160     #ifdef DO_JRAND_MIX
161             memcpy(tmp_seed, jrand_seed, sizeof(tmp_seed));
162             jrand_seed[2] = jrand_seed[2] ^ syscall(__NR_gettid);
163             for (cp = buf, i = 0; i < nbytes; i++)
164                     *cp++ ^= (jrand48(tmp_seed) >> 7) & 0xFF;
165             memcpy(jrand_seed, tmp_seed,
(gdb) p jrand_seed
Cannot find thread-local variables on this target

# rpm -q e2fsprogs e2fsprogs-devel kernel-xen
e2fsprogs-1.39-15.el5
e2fsprogs-devel-1.39-15.el5
kernel-xen-2.6.18-92.1.18.el5

# uname -a
Linux proliant06 2.6.18-92.1.18.el5xen #1 SMP Wed Nov 5 09:30:07 EST 2008 i686 i686 i386 GNU/Linux

Comment 2 Milan Broz 2008-11-16 17:03:33 UTC
dmesg also contains these xen specific entries:

4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c
4gb seg fixup, process uuidgen (pid 2736), cs:ip 73:0804834c

Comment 3 Daniel Berrangé 2008-11-17 11:10:19 UTC
Those kernel messages should be harmless. When using shared libraries on i386 Xen, the dynamic linker loads a Xen-aware libc that avoids negative segment address accesses. Such accesses are functionally fine, but they incur a performance hit under Xen, hence the kernel warns if any app makes them. Since you are statically linking, the program doesn't have the option of using the Xen-optimized libc, so you see these messages. You should be able to safely ignore them.

Comment 4 Denise Dumas 2008-11-17 16:49:34 UTC
Marking for consideration as a 5.3 blocker, since this is the root cause for 467193

Comment 5 Eric Sandeen 2008-11-17 17:38:32 UTC
Is this a regression?

Comment 6 Milan Broz 2008-11-17 17:57:03 UTC
I think that it doesn't work on RHEL5.2 released version either.

Comment 7 Eric Sandeen 2008-11-17 20:23:49 UTC
Hm, strange, if I build an x86 executable in a 64-bit xen guest on a 64-bit host, it seems fine:

[root@localhost ~]# gcc -m32 -o test32 test.c -g -O0 -static -luuid
[root@localhost ~]# ./test32 
f8f2847d-1113-44c6-8aee-6b3be9ae992c
[root@localhost ~]# file test32
test32: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.9, statically linked, for GNU/Linux 2.6.9, not stripped

Guess I'll build an x86 guest - or was your host on x86 as well?

Thanks,
-Eric

Comment 8 Daniel Berrangé 2008-11-17 20:32:35 UTC
FYI, I have tried a 32-bit executable in 64-bit guest on 64-bit host, and 32-bit executable in 32-bit guest, on 64-bit host. Both worked fine.

The only time I could reproduce it is in a 32-bit Xen Dom0 host. The scenario I've not had time to test is whether a 32-bit guest on a 32-bit host is affected.

For my tests the demo code will always hang, i don't see a crash.

Comment 9 Eric Sandeen 2008-11-17 20:42:45 UTC
Daniel, thanks.  I guess I'll grab an x86 box out of RHTS and try it there.

-Eric

Comment 10 Milan Broz 2008-11-17 20:51:22 UTC
Yes, this is only reproducible

- on x86 32bit Dom0
- on 32bit (paravirtualized) guest running on 32bit host *only*.

(The same machine with the same binary but non-xen kernel works ok.)

Comment 11 Daniel Berrangé 2008-11-17 20:54:17 UTC
FYI, if you don't have a 32-bit machine, but do have a 64-bit Fedora box able to run KVM fully virt, you can also install a RHEL-5 Xen 32-bit Dom0 inside the 64-bit KVM guest. That's how I reproduced it :-)

Comment 12 Milan Broz 2008-11-17 20:57:03 UTC
(In reply to comment #8)
> For my tests the demo code will always hang, i don't see a crash.

I saw this too: it hangs in a 16-byte memcpy() call!
Another machine here just segfaults with the same binary.

Comment 13 Eric Sandeen 2008-11-17 21:37:42 UTC
Hm, I was going to give this a whirl on updated e2fsprogs, but during the make check phase:

make[1]: Entering directory `/usr/src/redhat/BUILD/e2fsprogs-1.41.3/lib/uuid'
LD_LIBRARY_PATH=../../lib DYLD_LIBRARY_PATH=../../lib ./tst_uuid
make[1]: *** [check] Segmentation fault

so I guess it persists.

One thing about this is that the uuid gen stuff uses thread local storage... not sure how that might play into this with the Xen kernel.

Comment 14 Eric Sandeen 2008-11-17 22:21:51 UTC
Well, this seems odd.  With this patch:

Index: e2fsprogs-1.39/lib/uuid/gen_uuid.c
===================================================================
--- e2fsprogs-1.39.orig/lib/uuid/gen_uuid.c	2008-11-17 15:30:00.000000000 -0600
+++ e2fsprogs-1.39/lib/uuid/gen_uuid.c	2008-11-17 16:13:50.904588722 -0600
@@ -136,6 +136,7 @@ static void get_random_bytes(void *buf, 
 	int lose_counter = 0;
 	unsigned char *cp = (unsigned char *) buf;
 	unsigned short tmp_seed[3];
+	int tid;
 
 	if (fd >= 0) {
 		while (n > 0) {
@@ -159,7 +160,8 @@ static void get_random_bytes(void *buf, 
 		*cp++ ^= (rand() >> 7) & 0xFF;
 #ifdef DO_JRAND_MIX
 	memcpy(tmp_seed, jrand_seed, sizeof(tmp_seed));
-	jrand_seed[2] = jrand_seed[2] ^ syscall(__NR_gettid);
+	tid = syscall(__NR_gettid);
+	jrand_seed[2] = jrand_seed[2] ^ tid;
 	for (cp = buf, i = 0; i < nbytes; i++)
 		*cp++ ^= (jrand48(tmp_seed) >> 7) & 0xFF;
 	memcpy(jrand_seed, tmp_seed, 


it works ok.  The syscall is not failing.... I'm not sure what's going wrong here.

Comment 15 Eric Sandeen 2008-11-17 22:50:52 UTC
This sure looks like a toolchain problem to me, unless someone can point out what's wrong with the code above? (and why it would only manifest itself on a static link?)

Comment 16 Theodore Tso 2008-11-18 21:32:19 UTC
I'm going to give a wild guess and suspect thread local storage is somehow busted on static links in a Xen guest.  Try applying this pseudo-patch to lib/uuid/Makefile.in and see if the problem goes away:

 .c.o:
 	@echo "	CC $<"
- 	@$(CC) $(ALL_CFLAGS) -c $< -o $@
+ 	@$(CC) $(ALL_CFLAGS) -c $< -o $@ -UTLS
 @PROFILE_CMT@	@$(CC) $(ALL_CFLAGS) -g -pg -o profiled/$*.o -c $<
 @CHECKER_CMT@	@$(CC) $(ALL_CFLAGS) -checker -g -o checker/$*.o -c $<
 @ELF_CMT@	@$(CC) $(ALL_CFLAGS) -fPIC -o elfshared/$*.o -c $<
 @BSDLIB_CMT@	@$(CC) $(ALL_CFLAGS) $(BSDLIB_PIC_FLAG) -o pic/$*.o -c $<

If so, this could be considered a toolchain/library bug, but some operating systems, such as Solaris, have stopped supporting static linking because things like TLS are hard to keep working with static linking.  It sure seems weird that it is only blowing up in a Xen environment, though!

Comment 17 Theodore Tso 2008-11-18 21:38:04 UTC
Per IRC conversation, Eric pointed out that tst_uuid is dynamically compiled.  So this problem of what's going on with thread-local storage and Xen may not be limited to static libraries.

Also, the generated assembly with and without the patch mentioned in comment #14 is quite a bit different.  Which really seems weird, and it would be good to get a compiler expert to weigh in here.

Maybe we should get a simplified test case and consider this a potential toolchain bug?

Comment 18 Eric Sandeen 2008-11-18 22:06:22 UTC
Created attachment 323969 [details]
disassembly of function w/o patch

Comment 19 Eric Sandeen 2008-11-18 22:06:43 UTC
Created attachment 323970 [details]
disassembly of function with patch

Comment 20 Eric Sandeen 2008-11-18 22:09:38 UTC
I'm slightly hesitant to just commit the patch until we can get to the real underlying problem, to be sure we're not just papering over the problem.

I'll see if I can get a simplified testcase for the toolchain folks to look at.

Comment 21 Milan Broz 2008-11-25 20:00:15 UTC
I tried this simplified code, and it segfaults on the Xen kernel...

Am I overlooking something completely obvious?

compiled with (not static now!)
cc -o uuidgen uuidgen.c -g -O0 -pedantic -Wall

#include <stdio.h>

static __thread unsigned short tst_thread = 0xbabe;

int main (int argc, char *argv[])
{
        unsigned short tst = 0xdead;

        printf("tst is %0x, tst_thread is %0x\n", tst, tst_thread);
        return 0;
}

Comment 22 Jan Kratochvil 2008-11-25 21:25:24 UTC
A slightly reduced reproducer from comment 21:
volatile __thread short shortvar;
int
main (void)
{
  /* movzwl %gs:0xfffffffe,%eax */
  return shortvar;
}

IMO it is a Xen hypervisor problem (please reassign it), as I guess it wrongly emulates (traps) `movzwl'.  One can build the userland program with
`gcc -mno-tls-direct-seg-refs' to work around this problem (I do not know of any pitfalls of -mno-tls-direct-seg-refs otherwise, though).

Comment 23 Eric Sandeen 2008-11-25 21:30:58 UTC
Punting to Xen ...

Comment 24 Daniel Berrangé 2008-11-25 21:49:17 UTC
Further punting to kernel-xen, since it's an HV problem

Comment 25 Tom Coughlan 2008-11-25 22:48:12 UTC
Bill,

Just to be clear on why this is a priority for 5.3: This BZ is the root cause for bug #467193, and that bug prevents the installer from using encryption. It is not likely we can ship with that not working. 

Tom

Comment 26 Theodore Tso 2008-11-26 02:46:22 UTC
I'll further note that as Thread Local Storage gets used by more and more
programs as time goes on, and given that there's probably not as much testing
for various packages running under Xen, it's likely that other programs may
also end up seg faulting running under the Xen kernel.  So I would hazard a
guess that e2fsprogs won't be the only program that will trip up against this bug.

Comment 27 Daniel Berrangé 2008-11-26 10:34:06 UTC
Tom, it's not clear from bug #467193 whether this is a regression from 5.2, or whether installing to encrypted disks has been broken ever since 5.0 GA on i386 Xen.  I suspect it's probably the latter case, but it would be good to have confirmation about whether this is a regression or not.

Comment 28 Alexander Todorov 2008-11-26 10:40:52 UTC
Daniel,
although cryptsetup-luks has been available since 5.0 GA, RHEL 5.3 is the first release to support it in the installer. Based on that, I'll not consider the installer behavior a regression. Please refer to Milan if you need to know whether the root cause (i.e. this bug) is a regression or not.

Thanks.

Comment 29 Milan Broz 2008-11-26 10:54:19 UTC
For cryptsetup, technically it is not a regression; statically linked cryptsetup in LUKS mode probably never worked on kernel-xen (at least it doesn't work in RHEL 5.2).

But we are going to support encrypted system install in the 5.3 installer (new functionality), and it doesn't work for Dom0 and Xen guests on i686. Many people will use Xen guests for prototyping, testing, etc.

Also as Theodore noted, this bug can probably hit more programs using TLS.

Comment 30 Bill Burns 2008-11-26 18:59:59 UTC
Do we know roughly when this started happening? The initial report was made on 11/16. It would be key to know what kernel build you were using at the time and whether it had been working prior to that.

Comment 31 Daniel Berrangé 2008-11-26 20:36:45 UTC
Bill, it would appear as if this has always been a problem - it was simply never noticed prior to now, because 5.3 is the first release with encryption supported in the installer. As mentioned it could also impact other apps using thread locals

Comment 32 Chris Wright 2008-11-26 22:21:54 UTC
I agree with Jan in Comment #22.  The insn looks like:  65 0f b7...
65 is decoded properly as gs override.  But the twobyte insn is not properly decoded.  We see 0f, and use the twobyte opcode table, but b7 is listed as unknown.  Upstream has fixed this in:

http://xenbits.xensource.com/xen-unstable.hg?rev/81aa410fa662

Important bit is here:

    /* 0xB0 - 0xBF */
-    X, X, X, X, X, X, X, X,
-    X, X, X, X, X, X, X, X,
+    X, X, X, O|M, X, X, O|M, O|M,

Comment 33 Chris Lalancette 2008-11-27 10:04:04 UTC
Jan and Chris are right on the money here.  This has to do with 32-bit segment fixups that the hypervisor does.  It goes like:

In normal, bare-metal operation, glibc uses negative segment offsets (for performance reasons) when doing thread-local variables.  However, in a Xen environment, these cause problems, because the hypervisor is mapped into the upper region of the memory address space.  For this reason, under Xen, these instructions are trapped, examined, and emulated.  Obviously, if this was done for every program, this would cause quite a performance penalty.  So what we do is for dynamically linked programs, we choose a different glibc that doesn't use the negative segment offsets, so things work pretty well.  However, for statically linked programs, the only options are to either build with -mno-tls-direct-seg-refs (which is undesirable for bare-metal), or to have Xen correctly emulate the instruction (which it is not doing in this case).  To confirm, I rebooted the 5.3 HV with "loglvl=all", and ran the above test; I got:

[root@amd1 ~]# ./a.out
(XEN) seg_fixup.c:418: Unsupported two byte opcode b7
Segmentation fault

Which again, confirms that the HV is not emulating properly.  I applied the changeset mentioned in comment #32 above to the 5.3 HV, and re-ran the test, and it doesn't crash anymore; instead, I got:

4gb seg fixup, process a.out (pid 5866), cs:ip 73:08048382
4gb seg fixup, process a.out (pid 5866), cs:ip 73:007ebcc6

and the program doesn't crash.  That seems to be the right behavior, so the attached patch seems to be what we need.  Thanks, Chris and Jan.

Chris Lalancette

Comment 34 Chris Lalancette 2008-11-27 10:05:20 UTC
Created attachment 324850 [details]
Backport of upstream Xen c/s 16407, which seems to fix this issue

Comment 35 Milan Broz 2008-11-27 10:30:45 UTC
Great, do we have a test build with the patch anywhere?

Comment 36 Chris Lalancette 2008-11-27 10:34:09 UTC
I'm sending a scratch build through brew at the moment; I'll give you a pointer to it once it's done.

Chris Lalancette

Comment 37 Bill Burns 2008-11-27 12:35:33 UTC
Here's one I started last night, all cooked and ready to try
https://brewweb.devel.redhat.com/taskinfo?taskID=1590131

Comment 38 Milan Broz 2008-11-27 13:06:03 UTC
With the 2.6.18-124.el5bbuuidxen kernel cryptsetup works there, thanks!

Comment 41 Chris Lalancette 2008-11-27 13:24:22 UTC
Created attachment 324882 [details]
More faithful backport of Xen c/s 16407

This version of the backport is from Bill Burns, and is a bit more faithful than mine.  It's also the exact version that Milan tested, so we should go with this version.

Comment 42 Brock Organ 2008-12-02 14:50:58 UTC
*** Bug 467193 has been marked as a duplicate of this bug. ***

Comment 44 Michal Nowak 2008-12-09 10:31:44 UTC
*** Bug 474148 has been marked as a duplicate of this bug. ***

Comment 45 Don Zickus 2008-12-09 21:05:06 UTC
in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 49 Jan Tluka 2008-12-18 18:10:03 UTC
Bug successfully reproduced on 2.6.18-92.el5xen. The reproducer program exits with a segfault in a gdb session.
Bug is fixed on 2.6.18-126.el5xen. The reproducer program exits normally in a gdb session.

Comment 51 errata-xmlrpc 2009-01-20 19:48:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Comment 53 Michal Nowak 2009-07-08 12:32:56 UTC
*** Bug 474148 has been marked as a duplicate of this bug. ***

