Bug 106744
Summary: | xemacs segfaults on ia64 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Jens Petersen <petersen> |
Component: | xemacs | Assignee: | Jens Petersen <petersen> |
Status: | CLOSED ERRATA | QA Contact: | David Lawrence <dkl> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 3.0 | CC: | borgan, ckloiber, drepper, hfuchi, jakub, karlos, roland, tao, tee, woodard, wtogami |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | ia64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-05-12 05:07:15 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 106472, 107562 |
Description
Jens Petersen
2003-10-10 02:43:49 UTC
QA for 21.4.13-3.ent errata packages requested.

*** Bug 109355 has been marked as a duplicate of this bug. ***

Why didn't this bugfix make it into QU1?

I don't know. :-(

*** Bug 115262 has been marked as a duplicate of this bug. ***

Errata packages updated to 21.4.13-4.ent, which looks good on ia64. QA request updated.

After a day of gdb, AFAICT the problem seems to be occurring in canonicalize_device_connection (device.c), likely at:

    if (HAS_CONTYPE_METH_P (meths, canonicalize_device_connection))

So a recompile fixes it, but some customers use a network shared copy of xemacs. So an xemacs compiled on an Update 1 box will segfault when run on an Update 2 (beta) box, and vice-versa. This is a serious problem for users with such an environment.
J

Is it possible that glibc is involved, since installing /lib/tls/* from an Update 1 box on an Update 2 box results in xemacs working?
J

*** Bug 122044 has been marked as a duplicate of this bug. ***

The relation with glibc noted by MSDW is very interesting: adding jakub and foo to the CC list. Still it is surprising that installing U1 glibc on a U2 machine makes xemacs work, considering that xemacs segfaults on U1 anyway.

I just tried upgrading a U1 test box to U2 glibc. This didn't stop U2 xemacs segfaulting; however, after upgrading to U2 glibc the xemacs I built locally on the machine starts to segfault too. But maybe this is just a symptom of the stack corruption?

And if I build xemacs with U2 glibc on the test box and then downgrade to U1 glibc, that xemacs also starts to segfault. This suggests that the U2 xemacs was built with U1 glibc fwiw.

With U2 glibc, I see the following:

    [root@squidward root]# rpm -q glibc xemacs
    glibc-2.3.2-95.20
    xemacs-21.4.13-5.ent
    [root@squidward root]# gdb xemacs
    :
    This GDB was configured as "ia64-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

    (gdb) run
    Starting program: /usr/bin/xemacs
    [Thread debugging using libthread_db enabled]
    [New Thread 2305843009234634640 (LWP 1417)]

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 2305843009234634640 (LWP 1417)]
    0x0000000000000000 in ?? ()
    (gdb) break device.c:446
    Breakpoint 1 at 0x40000000000a8eb1: file device.c, line 445.
    (gdb) run
    The program being debugged has been started already.
    Start it from the beginning? (y or n) y

    Starting program: /usr/bin/xemacs
    [Thread debugging using libthread_db enabled]
    [New Thread 2305843009234634640 (LWP 1626)]
    [Switching to Thread 2305843009234634640 (LWP 1626)]

    Breakpoint 1, Ffind_device (connection=2305843009235953496, type=2305843009235938760) at device.c:446
    446 canon = canonicalize_device_connection (conmeths, connection,
    (gdb) s 2
    446 canon = canonicalize_device_connection (conmeths, connection,
    (gdb) print conmeths
    $9 = (struct console_methods *) 0x20000000017b91e8
    (gdb) print conmeths->canonicalize_device_connection_method
    $10 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0x2000000001486180
    (gdb) s
    canonicalize_device_connection (meths=0x2000000001486180, name=2305843009235953496, errb=ERROR_ME_NOT) at device.c:400
    400 {
    (gdb) print meths
    $11 = (struct console_methods *) 0x2000000001486180
    (gdb) print meths->canonicalize_device_connection_method
    $12 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0
    (gdb) s
    401 if (HAS_CONTYPE_METH_P (meths, canonicalize_device_connection))
    (gdb) s
    400 {
    (gdb) s
    Warning:
    Cannot insert breakpoint 0.
    Error accessing memory address 0x0: Input/output error.

NB how conmeths->canonicalize_device_connection_method is set right before the call to canonicalize_device_connection, but inside the callee meths->canonicalize_device_connection_method is apparently null??!
Here is the same thing for a local xemacs built against the same glibc:

    [root@squidward root]# rpm -q glibc xemacs
    glibc-2.3.2-95.20
    xemacs-21.4.13-5.1.squidward3
    [root@squidward root]# gdb xemacs
    GNU gdb Red Hat Linux (6.0post-0.20031117.6rh)
    Copyright 2003 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB. Type "show warranty" for details.
    This GDB was configured as "ia64-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

    (gdb) break device.c:446
    Breakpoint 1 at 0x40000000000a9170: file device.c, line 446.
    (gdb) run
    Starting program: /usr/bin/xemacs
    [Thread debugging using libthread_db enabled]
    [New Thread 2305843009234634640 (LWP 1650)]
    [Switching to Thread 2305843009234634640 (LWP 1650)]

    Breakpoint 1, Ffind_device (connection=2305843009235953496, type=2305843009235947928) at device.c:446
    446 canon = canonicalize_device_connection (conmeths, connection,
    (gdb) s 2
    446 canon = canonicalize_device_connection (conmeths, connection,
    (gdb) print conmeths
    $1 = (struct console_methods *) 0x20000000017b91e8
    (gdb) print conmeths->canonicalize_device_connection_method
    $2 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0x2000000001496170
    (gdb) s
    canonicalize_device_connection (meths=0x2000000001496170, name=2305843009235953496, errb=ERROR_ME_NOT) at device.c:400
    400 {
    (gdb) print meths
    $3 = (struct console_methods *) 0x2000000001496170
    (gdb) print meths->canonicalize_device_connection_method
    $4 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0x600000000002b9f8
    (gdb) s
    401 if (((meths)->canonicalize_device_connection_method))
    (gdb) s
    400 {
    (gdb) s
    stream_canonicalize_device_connection (connection=2305843009235953496, errb=ERROR_ME_NOT) at console-stream.c:155
    155 return stream_canonicalize_console_connection (connection, errb);
    (gdb) s
    stream_canonicalize_console_connection (connection=2305843009235953496, errb=ERROR_ME_NOT) at console-stream.c:130
    130 if (NILP (connection) || internal_equal (connection, Vstdio_str, 0))
    (gdb) c
    Continuing.

    Breakpoint 1, Ffind_device (connection=2305843009235953496, type=2305843009235906120) at device.c:446
    446 canon = canonicalize_device_connection (conmeths, connection,

And then with the previously released glibc:

    [root@squidward root]# rpm -q glibc xemacs
    glibc-2.3.2-95.6
    xemacs-21.4.13-5.1.squidward3
    [root@squidward root]# gdb xemacs
    GNU gdb Red Hat Linux (6.0post-0.20031117.6rh)
    Copyright 2003 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB. Type "show warranty" for details.
    This GDB was configured as "ia64-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

    (gdb) run
    Starting program: /usr/bin/xemacs
    [Thread debugging using libthread_db enabled]
    [New Thread 2305843009234585488 (LWP 1700)]

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 2305843009234585488 (LWP 1700)]
    0x0000000000000000 in ?? ()
    (gdb) break device.c:446
    Breakpoint 1 at 0x40000000000a9170: file device.c, line 446.
    (gdb) r
    The program being debugged has been started already.
    Start it from the beginning? (y or n) y

    Starting program: /usr/bin/xemacs
    [Thread debugging using libthread_db enabled]
    [New Thread 2305843009234585488 (LWP 1709)]
    [Switching to Thread 2305843009234585488 (LWP 1709)]

    Breakpoint 1, Ffind_device (connection=2305843009235904344, type=2305843009235898776) at device.c:446
    446 canon = canonicalize_device_connection (conmeths, connection,
    (gdb) s 2
    446 canon = canonicalize_device_connection (conmeths, connection,
    (gdb) print conmeths
    $1 = (struct console_methods *) 0x20000000017ad1e8
    (gdb) print conmeths->canonicalize_device_connection_method
    $2 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0x2000000001496170
    (gdb) s
    canonicalize_device_connection (meths=0x2000000001496170, name=2305843009235904344, errb=ERROR_ME_NOT) at device.c:400
    400 {
    (gdb) print meths
    $3 = (struct console_methods *) 0x2000000001496170
    (gdb) print meths->canonicalize_device_connection_method
    $4 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0
    (gdb) s
    401 if (((meths)->canonicalize_device_connection_method))
    (gdb) s
    400 {
    (gdb) s
    Warning:
    Cannot insert breakpoint 0.
    Error accessing memory address 0x0: Input/output error.

I note for the record here too that building with "-O0" does not help, it still segfaults with different glibc's.

I note that during the xemacs build there are asm warnings, like this:

    gcc -c -O2 -g -pipe -Demacs -I. -DHAVE_CONFIG_H -I/usr/X11R6/include doc.c
    {standard input}: Assembler messages:
    {standard input}:2297: Warning: Use of 'mov' may violate WAW dependency 'GR%, %\ in 1 - 127' (impliedf), specific resource number is 14
    {standard input}:2296: Warning: This is the location of the conflicting usage

in several places.

One temporary solution at least is to run i386 xemacs on ia64. I just confirmed with help from Uli that for example fc2 xemacs and xemacs-nox run without any problem on a U2 box and a U1 box.
This does require however installing some i386 library deps not currently shipped with RHEL3: Canna-libs, Xaw3d, libjpeg, libpng, libtiff, etc, and adding "/emul/ia32-linux/usr/X11R6/lib" to "/etc/ld.so.conf".

The temporary build xemacs-21.4.13-6.ent, which runs fine in the buildsystem buildroots, segfaults on a fresh U2 install box. Diffing the installed packages didn't turn up any relevant differences. The only vaguely relevant difference as far as the build environment goes is maybe elfutils: the buildserver has elfutils-0.94-1 vs elfutils-0.91-3 in U2. Of course on the buildserver the U2 buildroot is in a chroot and it is running kernel-2.4.21-4.EL vs 2.4.21-15.EL in U2. They are both smp boxes.

About comment 31 (which was actually coming from dired.c in the smp build):

    (p21) cmp.eq p6, p7 = r8, r14
    nop 0
    ;;
    .mii
    (p6) addl r14 = 1, r0
    (p7) mov r14 = r0

is the relevant code, and Uli told me it is just a harmless warning.

http://lists.debian.org/debian-ia64/2003/10/msg00050.html seems highly relevant.

One fix we talked about on IRC is: in the dumper, read the whole .rela.dyn section of the binary (_DYNAMIC -> find DT_RELA+DT_RELASZ) and find the R_IA64_FPTR64LSB relocations. For each such relocation call dlsym on its symbol. For symbols not defined in the binary we IMHO don't have to do anything: if xemacs relies on library addresses being the same in the dumped code, it is broken not only on IA-64/HPPA but on any other architecture as well (there are no guarantees that, say, libc.so or libX11.so are mmapped at the same addresses in each run).
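The relocation scan in that proposal could look roughly like the following. This is only a sketch, not code from the xemacs dumper: `struct rela64`, `collect_fptr_relocs` and the macro names are made up for illustration, the struct is assumed to mirror the `Elf64_Rela` layout from `<elf.h>`, and `0x47` is assumed to be the `<elf.h>` value of `R_IA64_FPTR64LSB`. The caller is assumed to have already located the table and its byte size via DT_RELA and DT_RELASZ.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to match Elf64_Rela from <elf.h>; redefined here so the
   sketch does not depend on that header being available. */
struct rela64 {
    uint64_t r_offset;  /* location the relocation applies to */
    uint64_t r_info;    /* symbol index (high 32 bits) and type (low 32 bits) */
    int64_t  r_addend;
};

#define RELA64_SYM(info)  ((uint32_t)((info) >> 32))
#define RELA64_TYPE(info) ((uint32_t)((info) & 0xffffffffu))

/* Assumed value of R_IA64_FPTR64LSB in <elf.h>. */
#define FPTR64LSB_TYPE 0x47u

/* Walk a RELA table of relasz_bytes bytes and store the symbol index of
   every function-descriptor relocation in sym_out (at most max_out).
   Returns the number of indices stored. */
size_t collect_fptr_relocs(const struct rela64 *rela, size_t relasz_bytes,
                           uint32_t *sym_out, size_t max_out)
{
    size_t count = relasz_bytes / sizeof *rela;
    size_t found = 0;

    for (size_t i = 0; i < count && found < max_out; i++) {
        if (RELA64_TYPE(rela[i].r_info) == FPTR64LSB_TYPE)
            sym_out[found++] = RELA64_SYM(rela[i].r_info);
    }
    return found;
}
```

Each returned index would then be looked up in .dynsym to get the name to hand to dlsym; that step is left out here.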
For symbols defined in the binary (but exported in .dynsym), there are two options. Either record fixups in the dumped data, so that on startup xemacs will dlsym the particular symbol and fix up all places in the dumped data where that function pointer has been used; or, if xemacs doesn't care about function pointer equality between addresses stored in data structures restored from the dump and addresses in the code not restored by dump, simply create a new official function descriptor and replace addresses in dumped data to point to this new ofd.

In the meantime, the workaround appears to be to build without linking with -export-dynamic. The only shortcoming of this is that dlopen module support will no longer work, but AFAIK this is a little-used feature of XEmacs anyway: well, I am not aware of any serious use of dl modules.

    20:50:21 <bukaj> jens: without -export-dynamic, the linker can assume the binary is the sole user of the functions defined in it
    20:50:43 <bukaj> jens: thus it can create official function descriptors when addresses of those functions are taken
    20:51:06 <bukaj> jens: in the .opd section (and those addresses cannot change between runs obviously)
    20:51:25 <jens> I see. And with it?
    20:51:41 <bukaj> jens: but when it cannot assume the binary is the sole user, it needs to leave official function descriptor creation to the dynamic linker
    20:51:58 <jens> ah
    20:52:18 <bukaj> jens: and the latter are clearly volatile between different runs, while xemacs relies on function addresses being identical

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2003-326.html
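For reference, the "record fixups in the dumped data" idea discussed earlier in this report might be sketched as below. This is a hypothetical illustration, not what the errata actually shipped (the shipped workaround was to drop -export-dynamic). All names here (`fn_fixup`, `apply_fn_fixups`, `demo_resolve`, `demo_method`) are invented, and the dlsym lookup is swapped for a caller-supplied resolver callback so the sketch does not depend on link flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic function pointer type for slots stored in the dumped image. */
typedef void (*fn_ptr)(void);

/* One fixup recorded at dump time: which slot holds a function pointer
   and which exported symbol it is supposed to point to. */
struct fn_fixup {
    size_t offset;       /* byte offset of the slot inside the image */
    const char *symbol;  /* symbol name to re-resolve at startup */
};

/* Resolver callback; a real build might wrap dlsym() here (assumption). */
typedef fn_ptr (*resolver_fn)(const char *symbol);

/* Rewrite every recorded slot with the address valid in this run.
   Returns the number of slots patched, or -1 on a bad offset or an
   unresolvable symbol. */
int apply_fn_fixups(unsigned char *image, size_t image_size,
                    const struct fn_fixup *fixups, size_t n,
                    resolver_fn resolve)
{
    int patched = 0;

    assert(image != NULL && resolve != NULL);
    for (size_t i = 0; i < n; i++) {
        fn_ptr target;

        if (image_size < sizeof(fn_ptr) ||
            fixups[i].offset > image_size - sizeof(fn_ptr))
            return -1;
        target = resolve(fixups[i].symbol);
        if (target == NULL)
            return -1;
        /* memcpy avoids alignment assumptions about the slot. */
        memcpy(image + fixups[i].offset, &target, sizeof target);
        patched++;
    }
    return patched;
}

/* Hypothetical stand-ins for an exported method and its lookup. */
static int demo_calls;
static void demo_method(void) { demo_calls++; }

static fn_ptr demo_resolve(const char *symbol)
{
    return strcmp(symbol, "demo_method") == 0 ? demo_method : NULL;
}
```

With something like this, a stale descriptor address restored from the dump would be overwritten before the first indirect call, instead of ending in a jump to 0x0 as in the gdb sessions above.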