Bug 106744

Summary:	xemacs segfaults on ia64
Product:	Red Hat Enterprise Linux 3	Reporter:	Jens Petersen <petersen>
Component:	xemacs	Assignee:	Jens Petersen <petersen>
Status:	CLOSED ERRATA	QA Contact:	David Lawrence <dkl>
Severity:	medium	Docs Contact:
Priority:	high
Version:	3.0	CC:	borgan, ckloiber, drepper, hfuchi, jakub, karlos, roland, tao, tee, woodard, wtogami
Target Milestone:	---
Target Release:	---
Hardware:	ia64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-05-12 05:07:15 UTC	Type:	---
Bug Blocks:	106472, 107562

Description Jens Petersen 2003-10-10 02:43:49 UTC

Description of problem:
xemacs segfaults at startup on ia64.

Version-Release number of selected component (if applicable):
xemacs-21.4.13-2.ent

How reproducible:
every time

Steps to Reproduce:
1. run xemacs
    
Actual results:
Fatal error (11).
:
Segmentation fault

Expected results:
Run normally

Additional info:
Rebuilding fixes the problem.

Comment 1 Jens Petersen 2003-11-06 06:00:56 UTC

QA for 21.4.13-3.ent errata packages requested.

Comment 2 Jens Petersen 2003-11-07 08:18:45 UTC

*** Bug 109355 has been marked as a duplicate of this bug. ***

Comment 3 Ben Woodard 2004-01-24 04:52:27 UTC

Why didn't this bugfix make it into QU1?

Comment 4 Jens Petersen 2004-01-26 03:31:14 UTC

I don't know. :-(

Comment 6 Jeff Needle 2004-02-13 14:51:55 UTC

*** Bug 115262 has been marked as a duplicate of this bug. ***

Comment 9 Jens Petersen 2004-02-20 18:20:01 UTC

Errata packages updated to 21.4.13-4.ent, which looks good
on ia64.  QA request updated.

Comment 14 Jens Petersen 2004-04-21 14:28:44 UTC

After a day of gdb, AFAICT the problem seems to be occurring in:

canonicalize_device_connection (device.c), likely:

  if (HAS_CONTYPE_METH_P (meths, canonicalize_device_connection))

Comment 17 Johnray Fuller 2004-04-29 18:19:35 UTC

So a recompile fixes it, but some customers use a network-shared copy
of xemacs. So an xemacs compiled on an Update 1 box will segfault
when run on an Update 2 (beta) box, and vice versa.

This is a serious problem for users with such an environment.

J

Comment 18 Johnray Fuller 2004-04-29 21:42:07 UTC

Is it possible that glibc is involved? Installing /lib/tls/*
from an Update 1 box on an Update 2 box results in xemacs working.

J

Comment 19 Jens Petersen 2004-04-30 04:19:54 UTC

*** Bug 122044 has been marked as a duplicate of this bug. ***

Comment 20 Jens Petersen 2004-04-30 04:26:51 UTC

The relation with glibc noted by MSDW is very interesting:
adding jakub and foo to the CC list.

Comment 21 Jens Petersen 2004-04-30 07:23:34 UTC

Still it is surprising that installing U1 glibc on a U2 machine makes
xemacs work, considering that xemacs segfaults on U1 anyway.

I just tried upgrading a U1 test box to U2 glibc. This didn't
stop U2 xemacs segfaulting; however, after upgrading to U2 glibc,
the xemacs I built locally on the machine starts to segfault too.
But maybe this is just a symptom of the stack corruption?

Comment 22 Jens Petersen 2004-04-30 07:44:59 UTC

And if I build xemacs with U2 glibc on the test box and then downgrade
to U1 glibc, that xemacs also starts to segfault.

Comment 23 Jens Petersen 2004-04-30 07:46:23 UTC

This suggests that the U2 xemacs was built with U1 glibc fwiw.

Comment 24 Jens Petersen 2004-05-04 12:17:31 UTC

With U2 glibc, I see the following:

[root@squidward root]# rpm -q glibc xemacs
glibc-2.3.2-95.20
xemacs-21.4.13-5.ent
[root@squidward root]# gdb xemacs
:
This GDB was configured as "ia64-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
 
(gdb) run
Starting program: /usr/bin/xemacs
[Thread debugging using libthread_db enabled]
[New Thread 2305843009234634640 (LWP 1417)]
 
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2305843009234634640 (LWP 1417)]
0x0000000000000000 in ?? ()
(gdb) break device.c:446
Breakpoint 1 at 0x40000000000a8eb1: file device.c, line 445.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/xemacs
[Thread debugging using libthread_db enabled]
[New Thread 2305843009234634640 (LWP 1626)]
[Switching to Thread 2305843009234634640 (LWP 1626)]

Breakpoint 1, Ffind_device (connection=2305843009235953496,
    type=2305843009235938760) at device.c:446
446           canon = canonicalize_device_connection (conmeths, connection,
(gdb) s 2
446           canon = canonicalize_device_connection (conmeths, connection,
(gdb) print conmeths
$9 = (struct console_methods *) 0x20000000017b91e8
(gdb) print conmeths->canonicalize_device_connection_method
$10 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0x2000000001486180
(gdb) s
canonicalize_device_connection (meths=0x2000000001486180,
    name=2305843009235953496, errb=ERROR_ME_NOT) at device.c:400
400     {
(gdb) print meths
$11 = (struct console_methods *) 0x2000000001486180
(gdb) print meths->canonicalize_device_connection_method
$12 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0
(gdb) s
401       if (HAS_CONTYPE_METH_P (meths, canonicalize_device_connection))
(gdb) s
400     {
(gdb) s
Warning:
Cannot insert breakpoint 0.
Error accessing memory address 0x0: Input/output error.


NB: conmeths->canonicalize_device_connection_method is set right
before the call to canonicalize_device_connection, but inside the
callee meths->canonicalize_device_connection_method is apparently null??!
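
For readers following the trace, the check at device.c:401 is a truth test on a function-pointer member of the methods table; gdb's listing of the locally built binary elsewhere in this report prints it as ((meths)->canonicalize_device_connection_method). Below is a minimal sketch of that pattern. The types are simplified stand-ins (Lisp_Object as a long, a two-value Error_behavior), stream_canon is a hypothetical method invented for the illustration, and the fall-through "return name" is an assumption about the real function, not a copy of the XEmacs source.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real XEmacs types are more involved. */
typedef long Lisp_Object;
typedef enum { ERROR_ME, ERROR_ME_NOT } Error_behavior;

struct console_methods {
  Lisp_Object (*canonicalize_device_connection_method) (Lisp_Object,
                                                        Error_behavior);
};

/* Mirrors the expansion gdb prints for the check at device.c:401. */
#define HAS_CONTYPE_METH_P(meths, m) ((meths)->m##_method)

/* Hypothetical method: adds 1 so that a call through the table is
   observable in a test. */
static Lisp_Object
stream_canon (Lisp_Object connection, Error_behavior errb)
{
  (void) errb;
  return connection + 1;
}

/* Call the method if the table provides one; otherwise hand the name
   back unchanged (assumed behaviour for this sketch). */
static Lisp_Object
canonicalize_device_connection (struct console_methods *meths,
                                Lisp_Object name, Error_behavior errb)
{
  assert (meths != NULL);
  if (HAS_CONTYPE_METH_P (meths, canonicalize_device_connection))
    return meths->canonicalize_device_connection_method (name, errb);
  return name;
}
```

The guard only protects against a null member. It cannot help if meths itself, or the pointer stored in the member, is stale, which is what the transcripts here seem to point at.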

Comment 25 Jens Petersen 2004-05-04 12:25:36 UTC

Here is the same thing for a local xemacs built against the same glibc:

[root@squidward root]# rpm -q glibc xemacs
glibc-2.3.2-95.20
xemacs-21.4.13-5.1.squidward3
[root@squidward root]# gdb xemacs
GNU gdb Red Hat Linux (6.0post-0.20031117.6rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) break device.c:446
Breakpoint 1 at 0x40000000000a9170: file device.c, line 446.
(gdb) run
Starting program: /usr/bin/xemacs
[Thread debugging using libthread_db enabled]
[New Thread 2305843009234634640 (LWP 1650)]
[Switching to Thread 2305843009234634640 (LWP 1650)]

Breakpoint 1, Ffind_device (connection=2305843009235953496,
    type=2305843009235947928) at device.c:446
446           canon = canonicalize_device_connection (conmeths, connection,
(gdb) s 2
446           canon = canonicalize_device_connection (conmeths, connection,
(gdb) print conmeths
$1 = (struct console_methods *) 0x20000000017b91e8
(gdb) print conmeths->canonicalize_device_connection_method
$2 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0x2000000001496170
(gdb) s
canonicalize_device_connection (meths=0x2000000001496170,
    name=2305843009235953496, errb=ERROR_ME_NOT) at device.c:400
400     {
(gdb) print meths
$3 = (struct console_methods *) 0x2000000001496170
(gdb) print meths->canonicalize_device_connection_method
$4 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0x600000000002b9f8
(gdb) s
401       if (((meths)->canonicalize_device_connection_method))
(gdb) s
400     {
(gdb) s
stream_canonicalize_device_connection (connection=2305843009235953496,
    errb=ERROR_ME_NOT) at console-stream.c:155
155       return stream_canonicalize_console_connection (connection, errb);
(gdb) s
stream_canonicalize_console_connection (connection=2305843009235953496,
    errb=ERROR_ME_NOT) at console-stream.c:130
130       if (NILP (connection) || internal_equal (connection, Vstdio_str, 0))
(gdb) c
Continuing.
 
Breakpoint 1, Ffind_device (connection=2305843009235953496,
    type=2305843009235906120) at device.c:446
446           canon = canonicalize_device_connection (conmeths, connection,

Comment 26 Jens Petersen 2004-05-04 12:34:17 UTC

And then with the previously released glibc:


[root@squidward root]# rpm -q glibc xemacs
glibc-2.3.2-95.6
xemacs-21.4.13-5.1.squidward3
[root@squidward root]# gdb xemacs
GNU gdb Red Hat Linux (6.0post-0.20031117.6rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
       
(gdb) run
Starting program: /usr/bin/xemacs
[Thread debugging using libthread_db enabled]
[New Thread 2305843009234585488 (LWP 1700)]
       
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2305843009234585488 (LWP 1700)]
0x0000000000000000 in ?? ()
(gdb) break device.c:446
Breakpoint 1 at 0x40000000000a9170: file device.c, line 446.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/xemacs
[Thread debugging using libthread_db enabled]
[New Thread 2305843009234585488 (LWP 1709)]
[Switching to Thread 2305843009234585488 (LWP 1709)]
 
Breakpoint 1, Ffind_device (connection=2305843009235904344,
    type=2305843009235898776) at device.c:446
446           canon = canonicalize_device_connection (conmeths, connection,
(gdb) s 2
446           canon = canonicalize_device_connection (conmeths, connection,
(gdb) print conmeths
$1 = (struct console_methods *) 0x20000000017ad1e8
(gdb) print conmeths->canonicalize_device_connection_method
$2 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0x2000000001496170
(gdb) s
canonicalize_device_connection (meths=0x2000000001496170,
    name=2305843009235904344, errb=ERROR_ME_NOT) at device.c:400
400     {
(gdb) print meths
$3 = (struct console_methods *) 0x2000000001496170
(gdb) print meths->canonicalize_device_connection_method
$4 = (Lisp_Object (*)(Lisp_Object, Error_behavior)) 0
(gdb) s
401       if (((meths)->canonicalize_device_connection_method))
(gdb) s
400     {
(gdb) s
Warning:
Cannot insert breakpoint 0.
Error accessing memory address 0x0: Input/output error.

Comment 28 Jens Petersen 2004-05-05 01:52:14 UTC

I note for the record here too that building with "-O0" does not help:
it still segfaults with different glibc's.

Comment 31 Jens Petersen 2004-05-07 02:11:16 UTC

I note that during the xemacs build there are asm warnings, like this:

gcc -c -O2 -g -pipe  -Demacs -I. -DHAVE_CONFIG_H -I/usr/X11R6/include doc.c
{standard input}: Assembler messages:
{standard input}:2297: Warning: Use of 'mov' may violate WAW dependency 'GR%, %\ in 1 - 127' (impliedf), specific resource number is 14
{standard input}:2296: Warning: This is the location of the conflicting usage

in several places.

Comment 34 Jens Petersen 2004-05-07 04:28:54 UTC

One temporary solution at least is to run i386 xemacs on ia64.
I just confirmed with help from Uli that for example fc2 xemacs
and xemacs-nox run without any problem on a U2 box and a U1 box.

This does, however, require installing some i386 library deps not
currently shipped with RHEL3: Canna-libs, Xaw3d, libjpeg, libpng,
libtiff, etc, and adding "/emul/ia32-linux/usr/X11R6/lib" to
"/etc/ld.so.conf".

Comment 35 Jens Petersen 2004-05-07 05:27:47 UTC

The temporary build xemacs-21.4.13-6.ent, which runs fine in the
buildsystem buildroots, segfaults on a fresh U2 install box.

Diffing the installed packages didn't turn up any relevant differences.
The only vaguely relevant difference as far as build environment goes is
maybe elfutils.  The buildserver has elfutils-0.94-1 vs elfutils-0.91-3
in U2.

Of course on the buildserver the U2 buildroot is in a chroot and it
is running kernel-2.4.21-4.EL vs 2.4.21-15.EL in U2. They are both
smp boxes.

Comment 36 Jens Petersen 2004-05-07 06:53:35 UTC

About comment 31 (which was actually coming from dired.c in the smp
build):

        (p21) cmp.eq p6, p7 = r8, r14
        nop 0
        ;;
        .mii
        (p6) addl r14 = 1, r0
        (p7) mov r14 = r0

is the relevant code, and Uli told me it is just a harmless warning.

Comment 37 Jens Petersen 2004-05-07 09:06:57 UTC

http://lists.debian.org/debian-ia64/2003/10/msg00050.html
seems highly relevant.

Comment 38 Jakub Jelinek 2004-05-07 11:56:04 UTC

One fix we talked about on IRC is:
in the dumper, read the whole .rela.dyn section of the binary
(_DYNAMIC -> find DT_RELA+DT_RELASZ), find R_IA64_FPTR64LSB relocations.
For each such relocation call dlsym on its symbol.
For symbols not defined in the binary we IMHO don't have to do anything:
if xemacs relies on library addresses being the same in the dumped
code, it fails not only on IA-64/HPPA but on any other architecture
as well (there are no guarantees that, say, libc.so or libX11.so are
mmapped at the same addresses in each run).
For symbols defined in the binary (but exported in .dynsym),
either record fixups in the dumped data, so that on startup
xemacs will dlsym the particular symbol and fix up all places in
the dumped data where the particular function pointer has been used,
or, if xemacs doesn't care about function pointer equality between
addresses stored in data structures restored from the dump and
addresses in the code not restored by the dump, simply create a new
official function descriptor (ofd) and replace addresses in the dumped
data to point to this new ofd.
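
The first step of this proposal (walk the relocation entries and pick out the R_IA64_FPTR64LSB ones) can be sketched roughly as follows. This is only an illustration under stated assumptions: Rela64 is a local mirror of Elf64_Rela so the example builds without <elf.h>, the relocation type value 0x47 is taken from glibc's <elf.h> and should be checked against your headers, and collect_fptr_symbols is a hypothetical helper name. Locating .rela.dyn through _DYNAMIC and calling dlsym on each symbol are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of Elf64_Rela so the sketch builds without <elf.h>. */
typedef struct {
  uint64_t r_offset;  /* location the relocation applies to */
  uint64_t r_info;    /* symbol index (high 32 bits), type (low 32 bits) */
  int64_t  r_addend;
} Rela64;

/* Same split as ELF64_R_SYM / ELF64_R_TYPE. */
#define RELA64_SYM(info)  ((uint32_t) ((info) >> 32))
#define RELA64_TYPE(info) ((uint32_t) ((info) & 0xffffffffu))

/* Assumed value of R_IA64_FPTR64LSB (as in glibc's <elf.h>). */
#define FPTR64LSB 0x47u

/* Collect the symbol indices of all FPTR64LSB relocations.
   Returns how many were found; at most max_out indices are stored. */
static size_t
collect_fptr_symbols (const Rela64 *rela, size_t count,
                      uint32_t *out, size_t max_out)
{
  size_t found = 0;
  assert (rela != NULL || count == 0);
  for (size_t i = 0; i < count; i++)
    if (RELA64_TYPE (rela[i].r_info) == FPTR64LSB)
      {
        if (found < max_out)
          out[found] = RELA64_SYM (rela[i].r_info);
        found++;
      }
  return found;
}
```

In a real dumper the returned indices would be looked up in .dynsym to get the names to pass to dlsym; here they are just collected.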

Comment 39 Jens Petersen 2004-05-07 12:33:12 UTC

In the meantime, the workaround appears to be to build without linking
with -export-dynamic.  The only shortcoming of this is that
dlopen module support will no longer work, but AFAIK this is a
little-used feature of XEmacs anyway: I am not aware of any serious
use of dl modules.

Comment 40 Jens Petersen 2004-05-07 13:22:03 UTC

20:50:21 <bukaj> jens: without -export-dynamic, the linker can assume
the binary is the sole user of the functions defined in it
20:50:43 <bukaj> jens: thus it can create official function
descriptors when addresses of those functions are taken
20:51:06 <bukaj> jens: in the .opd section (and those addresses cannot
change between runs obviously)
20:51:25 <jens> I see. And with it?
20:51:41 <bukaj> jens: but when it cannot assume the binary is the
sole user, it needs to leave official function descriptor creation to
the dynamic linker
20:51:58 <jens> ah
20:52:18 <bukaj> jens: and the latter are clearly volatile between
different runs, while xemacs relies on function addresses being identical
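
To make the point of the IRC log concrete: if a function-pointer value saved in dumped data is only valid for the run that produced it, one general way around that is to keep a stable key (the symbol name) next to each pointer and re-resolve on startup. The sketch below shows that idea only. It is not the XEmacs dumper: struct dumped_slot, struct symbol_entry, lookup_symbol, fixup_dumped_slots and sample_double are all hypothetical names, and the small lookup table stands in for dlsym on the running binary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef long (*method_fn) (long);

/* One slot in "dumped" data: the symbol name is stable across runs,
   the pointer is only valid for the run that resolved it. */
struct dumped_slot {
  const char *symbol;
  method_fn   fn;
};

/* Hypothetical registry standing in for dlsym on the running binary. */
struct symbol_entry {
  const char *name;
  method_fn   fn;
};

/* Hypothetical function whose address gets stored in a slot. */
static long
sample_double (long x)
{
  return 2 * x;
}

/* Linear search by name; returns NULL if the name is not registered. */
static method_fn
lookup_symbol (const struct symbol_entry *table, size_t n, const char *name)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (table[i].name, name) == 0)
      return table[i].fn;
  return NULL;
}

/* Re-resolve every slot after loading the dump.  Returns the number of
   slots whose symbol was not found (their pointer is set to NULL). */
static size_t
fixup_dumped_slots (struct dumped_slot *slots, size_t nslots,
                    const struct symbol_entry *table, size_t n)
{
  size_t missing = 0;
  assert (slots != NULL || nslots == 0);
  for (size_t i = 0; i < nslots; i++)
    {
      slots[i].fn = lookup_symbol (table, n, slots[i].symbol);
      if (slots[i].fn == NULL)
        missing++;
    }
  return missing;
}
```

Calling fixup_dumped_slots once at startup, before anything dereferences a slot, replaces whatever address was saved with the one valid for the current run.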

Comment 46 John Flanagan 2004-05-12 05:07:16 UTC

An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2003-326.html