Bug 208091 - automount dumping core
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: autofs
Version: 5.0
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Ian Kent
QA Contact: Brock Organ
Reported: 2006-09-26 08:57 EDT by Terje Rosten
Modified: 2007-11-30 17:07 EST
Fixed In Version: beta2
Doc Type: Bug Fix
Last Closed: 2006-12-22 19:25:33 EST

Attachments
/etc/auto.master file (568 bytes, text/plain)
2006-11-17 12:25 EST, Erik Jacobson
auto_master_linux NIS map from ypcat -k (802 bytes, text/plain)
2006-11-17 12:27 EST, Erik Jacobson
ypcat -k of auto_home (14.77 KB, text/plain)
2006-11-17 12:29 EST, Erik Jacobson
/etc/sysconfig/autofs (1.28 KB, text/plain)
2006-11-17 12:30 EST, Erik Jacobson
Example core file (compressed (611.68 KB, application/octet-stream)
2006-11-17 12:33 EST, Erik Jacobson
Debuginfo collection instructions (3.52 KB, text/plain)
2006-11-17 12:51 EST, Ian Kent
nsswitch.conf file per request (1.65 KB, text/plain)
2006-11-17 14:41 EST, Erik Jacobson
automount dumped core, and thne I sent this file... (260.89 KB, text/plain)
2006-11-17 15:55 EST, Erik Jacobson
Fix illegal memory access in lookup_yp.c (2.01 KB, patch)
2006-11-20 00:20 EST, Ian Kent
backtraces from another core dump (14.70 KB, text/plain)
2006-11-20 11:24 EST, Erik Jacobson
debug log (49.22 KB, text/plain)
2006-11-20 11:25 EST, Erik Jacobson
gdb output (that probably isn't useful) (12.66 KB, text/plain)
2006-11-21 16:22 EST, Erik Jacobson
daemon.debug log output (54.12 KB, text/plain)
2006-11-21 16:26 EST, Erik Jacobson
strace output (53.58 KB, text/plain)
2006-11-21 17:46 EST, Erik Jacobson
Patch to remove need to call pthread_kill when checking for task done. (1.90 KB, patch)
2006-11-22 01:50 EST, Ian Kent
Patch to remove need to call pthread_kill when checking for task done (with correction). (1.90 KB, patch)
2006-11-22 20:46 EST, Ian Kent
gzipped core from comment #47 (117.20 KB, application/octet-stream)
2006-11-23 08:33 EST, Terje Rosten
automount log from crash in comment #47 (2.16 KB, application/octet-stream)
2006-11-23 08:42 EST, Terje Rosten
Patch to fix nsswitch parser locking (1.47 KB, patch)
2006-11-23 13:00 EST, Ian Kent
Patch to fix macro table locking (5.91 KB, patch)
2006-11-23 13:01 EST, Ian Kent
Interim patch to fix null map handling semantics (22.96 KB, patch)
2006-11-23 14:05 EST, Ian Kent
automount log from crash in comment #59 (888 bytes, application/octet-stream)
2006-11-24 06:28 EST, Terje Rosten
Interim patch to fix null map handling semantics - fix (864 bytes, patch)
2006-11-24 12:27 EST, Ian Kent
logs as requested in comment #64 (696 bytes, application/octet-stream)
2006-11-27 09:34 EST, Terje Rosten

Description Terje Rosten 2006-09-26 08:57:37 EDT
Description of problem:

automount dumps core on a daily basis:

$ file /core.690 
/core.690: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
from 'automount'

$ file /core.*
/core.1815: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style,
from 'automount'
/core.2671: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style,
from 'automount'


After 
$ service autofs restart
everything is fine

$ cat /etc/auto.master| grep -v '^#'
/misc   /etc/auto.misc
/net    /etc/auto.net
/home yp:auto.home.nis -tcp
+auto.master

auto.master from yp has 10 entries, some direct and some indirect maps.

Version-Release number of selected component (if applicable):
autofs-5.0.1-0.rc1.6
kernel 2.6.17-1.2519.4.21.el5 

Seen on i686 and x86_64.
Comment 1 Ian Kent 2006-09-26 12:15:25 EDT
(In reply to comment #0)
> Description of problem:
> 
> automount dumps core on a daily basis:
> 
> $ file /core.690 
> /core.690: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
> from 'automount'
> 
> $ file /core.*
> /core.1815: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style,
> from 'automount'
> /core.2671: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style,
> from 'automount'
> 
> 
> After 
> $ service autofs restart
> everything is fine
> 
> $ cat /etc/auto.master| grep -v '^#'
> /misc   /etc/auto.misc
> /net    /etc/auto.net
> /home yp:auto.home.nis -tcp
> +auto.master
> 
> auto.master from yp has 10 entries, some direct and some indirect maps.

Can you include your NIS master map and output from syslog?

> 
> Version-Release number of selected component (if applicable):
> autofs-5.0.1-0.rc1.6
> kernel 2.6.17-1.2519.4.21.el5 
> 
> Seen on i686 and x86_64.

A debug log would be useful for me to try and locate where
this is happening. Add "--debug" to OPTIONS in
/etc/sysconfig/autofs and ensure that syslog is sending
daemon.* to a log file.

Also I would appreciate it if you could try the latest available
version in Rawhide.

Ian
Comment 2 Terje Rosten 2006-10-04 07:03:12 EDT
> Also I would appreciate it if you could try the latest available
> version in Rawhide.

Seems like this version 

$ rpm -q autofs
autofs-5.0.1-0.rc2.1

is better, no more core dumps from automount, however now I see core dumps from
umount.nfs:

$ ls -l /core.*
-rw------- 1 root root   110592 Sep 30 07:42 /core.12821
-rw------- 1 root root   110592 Oct  3 02:58 /core.5092
-rw------- 1 root root 36139008 Sep 24 20:32 /core.690
-rw------- 1 root root   110592 Oct  1 16:58 /core.7371

$ file  /core.*
/core.12821: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
from 'mount.nfs'
/core.5092:  ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
from 'umount.nfs'
/core.690:   ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
from 'automount'
/core.7371:  ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
from 'umount.nfs'
Comment 3 Ian Kent 2006-10-04 09:41:11 EDT
(In reply to comment #2)
> Seems like this version 
> 
> $ rpm -q autofs
> autofs-5.0.1-0.rc2.1

We have autofs-5.0.1-0.rc2.4 in beta2.

> 
> is better, no more core dumps from automount, however now I see core dumps from
> umount.nfs:
> 
> $ ls -l /core.*
> -rw------- 1 root root   110592 Sep 30 07:42 /core.12821
> -rw------- 1 root root   110592 Oct  3 02:58 /core.5092
> -rw------- 1 root root 36139008 Sep 24 20:32 /core.690
> -rw------- 1 root root   110592 Oct  1 16:58 /core.7371
> 
> $ file  /core.*
> /core.12821: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
> from 'mount.nfs'
> /core.5092:  ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
> from 'umount.nfs'
> /core.690:   ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
> from 'automount'
> /core.7371:  ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
> from 'umount.nfs'

There have been a couple of fixes with the latest revision of
nfs-utils. What version are you using? It would be good to
know if the latest is still a problem.

Ian
Comment 4 Ian Kent 2006-10-04 09:53:19 EDT
(In reply to comment #3)
> > 
> > is better, no more core dumps from automount, however now I see core dumps from
> > umount.nfs:
> > 

I just checked a recent RHEL5 test install I have and I don't see
any cores. I've been running tests that mount/umount several
hundred mounts over the last couple of days.

util-linux-2.13-0.42.el5
nfs-utils-1.0.9-8.fc6
nfs-utils-lib-1.0.8-7.2

The later nfs-utils version was needed to resolve a problem
with incorrect return status from mount.

Ian
Comment 5 Erik Jacobson 2006-11-17 12:04:56 EST
Still observed in RHEL5 Beta2 :(

This is actually impacting our testing since we use NFS heavily in one of
our offices to get at files in our home directories and various data
repositories.
Comment 6 Erik Jacobson 2006-11-17 12:09:37 EST
In my case, I can test like this... (rhel5 beta2, ia64 in this case)

 - Reboot machine :)
 - ls ~erikj (which mounts my home directory over NFS)
 - Wait about 30 minutes
 - Then find I can't get to ~erikj any more
 - Then find a core in /.
Comment 7 Erik Jacobson 2006-11-17 12:20:05 EST
I'm going to attach some files about how we have automount set up here.

Also... Not that this is a huge help, but here is a paste from gdb:

[root@minime1 /]# gdb --core=core.1889
GNU gdb Red Hat Linux (6.5-12.el5rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-redhat-linux-gnu".
(no debugging symbols found)
Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `automount'.
Program terminated with signal 11, Segmentation fault.
#0  0x20000008000dec60 in ?? ()
Comment 8 Erik Jacobson 2006-11-17 12:25:32 EST
Created attachment 141502 [details]
/etc/auto.master file

This mainly just shows we get auto_master_linux from NIS.
Comment 9 Erik Jacobson 2006-11-17 12:27:07 EST
Created attachment 141504 [details]
auto_master_linux NIS map from ypcat -k

This shows ypcat -k output from the map referenced in /etc/auto.master.
Comment 10 Erik Jacobson 2006-11-17 12:29:18 EST
Created attachment 141505 [details]
ypcat -k of auto_home

This shows the auto_home NIS map.  I chose this one because I referred to my
home directory earlier.
Comment 11 Erik Jacobson 2006-11-17 12:30:45 EST
Created attachment 141506 [details]
/etc/sysconfig/autofs
Comment 12 Erik Jacobson 2006-11-17 12:33:37 EST
Created attachment 141508 [details]
Example core file (compressed

Note that this file will expand to be like 500 megabytes or so.

This is an ia64 core dump file.
Comment 13 Ian Kent 2006-11-17 12:45:18 EST
(In reply to comment #12)
> Created an attachment (id=141508) [edit]
> Example core file (compressed
> 
> Note that this file will expand to be like 500 megabytes or so.
> 
> This is an ia64 core dump file.
> 

Doesn't sound good.
I see there's no backtrace and it looks like you don't have 
the autofs-debuginfo package installed.

Getting some backtrace info would be good.
And a debug log would also be good.

I'm going to attach some instructions for what I'd like
since downloading the core is likely to take a while.

Ian
Comment 14 Ian Kent 2006-11-17 12:51:17 EST
Created attachment 141510 [details]
Debuginfo collection instructions

I'll need to install the autofs version you are using to
do anything with the core, so if you have time it would be
helpful if you could post the backtrace information.

What was that version again?
And kernel version?

Ian
Comment 15 Ian Kent 2006-11-17 12:57:08 EST
(In reply to comment #9)
> Created an attachment (id=141504) [edit]
> auto_master_linux NIS map from ypcat -k
> 
> This shows ypcat -k output from the map referenced in /etc/auto.master.

Why do you like to use "soft" mounts?
Comment 16 Ian Kent 2006-11-17 13:00:59 EST
(In reply to comment #14)
> Created an attachment (id=141510) [edit]
> Debuginfo collection instructions
> 
> I'll need to install the autofs version you are using to
> do anything with the core, so if you have time it would be
> helpful if you could post the backtrace information.

But I don't have an ia64 system.
I'm afraid I'll need that debug info.
Comment 17 Erik Jacobson 2006-11-17 14:16:30 EST
> Why do you like to use "soft" mounts?

I'm not the IT department, I just use what's in those maps :)
I'm a victim :-)
Comment 18 Erik Jacobson 2006-11-17 14:19:22 EST
> But I don't have ia64 system

You can reserve one of the SGI (or non-SGI ones) out of the Westford lab if
needed.  I'll try to collect what you asked for here.  -Erik
Comment 19 Erik Jacobson 2006-11-17 14:41:06 EST
Created attachment 141523 [details]
nsswitch.conf file per request

Some other requested info so far...

[root@minime1 sysconfig]# rpm -q autofs -q kernel
autofs-5.0.1-0.rc2.15
kernel-2.6.18-1.2747.el5
[root@minime1 sysconfig]# uname -a
Linux minime1 2.6.18-1.2747.el5 #1 SMP Thu Nov 9 18:56:16 EST 2006 ia64 ia64
ia64 GNU/Linux
Comment 20 Jeff Moyer 2006-11-17 14:54:02 EST
Core was generated by `automount'.
Program terminated with signal 11, Segmentation fault.
#0  0x20000008000dec60 in pthread_barrier_init () from /lib/libpthread.so.0
(gdb) bt
#0  0x20000008000dec60 in pthread_barrier_init () from /lib/libpthread.so.0
#1  0x200000080002c520 in st_queue_handler (arg=0x2000000800411800)
    at state.c:944
#2  0x20000008000d3190 in pthread_create@@GLIBC_2.2 ()
   from /lib/libpthread.so.0
#3  0xc000000000000610 in ?? ()
#4  0x200000080002c520 in st_queue_handler (arg=0x200000000017f240)
    at state.c:944
Previous frame inner to this frame (corrupt stack?)
(gdb) thr a a bt

Thread 22 (process 1889):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008000e5810 in open64 () from /lib/libpthread.so.0
Previous frame inner to this frame (corrupt stack?)

Thread 21 (process 1890):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008000dce80 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
Previous frame inner to this frame (corrupt stack?)

Thread 20 (process 1894):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 19 (process 1897):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 18 (process 1898):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 17 (process 1899):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 16 (process 1900):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 15 (process 1901):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 14 (process 1904):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 13 (process 1905):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 12 (process 1906):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 11 (process 1907):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 10 (process 1908):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 9 (process 1909):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 8 (process 1910):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 7 (process 1911):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 6 (process 1912):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 5 (process 1913):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 4 (process 1914):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 3 (process 1915):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 2 (process 1916):
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000008002e9730 in fts_build () from /lib/libc.so.6.1
Previous frame inner to this frame (corrupt stack?)

Thread 1 (process 1891):
#0  0x20000008000dec60 in pthread_barrier_init () from /lib/libpthread.so.0
#1  0x200000080002c520 in st_queue_handler (arg=0x2000000800411800)
    at state.c:944
#2  0x20000008000d3190 in pthread_create@@GLIBC_2.2 ()
   from /lib/libpthread.so.0
#3  0xc000000000000610 in ?? ()
#4  0x200000080002c520 in st_queue_handler (arg=0x200000000017f240)
    at state.c:944
Previous frame inner to this frame (corrupt stack?)
Comment 21 Erik Jacobson 2006-11-17 15:04:59 EST
Hello.  I'm hoping the last add is what you needed.

I went searching for debuginfo packages for RHEL5 Beta2 on the public
server.  I only see selected debuginfo packages as being available and
autofs isn't one of them.

I think I recall, on the internal RH network, that I can get to these - but
I would have to get permission to export the package from RH to SGI since
we're seeing the problem on the SGI side.  Feel free to attach the appropriate
debuginfo package and I'll install it.

It seems very repeatable over here, so I can help test a fixed package too.
Comment 22 Erik Jacobson 2006-11-17 15:55:34 EST
Created attachment 141530 [details]
automount dumped core, and thne I sent this file...

It just dumped core again, but I had the logging going this time.  Here is the
log.
Comment 23 Ian Kent 2006-11-18 02:20:08 EST
(In reply to comment #21)
> Hello.  I'm hoping the last add is what you needed.

Afraid not.
With that stack corruption (real or apparent) the backtrace can't be
relied upon. I'll need to work out how to manually decode the stack
call trace for ia64.

> 
> I went searching for debuginfo packages for RHEL5 Beta2 on the public
> server.  I only see selected debuginfo packages as being available and
> autofs isn't one of them.

Is it acceptable for you to build from a source rpm?
Either I could provide a patch and you could add it and build
it or I could provide an rpm with a patch included.

This would be purely for possibility elimination testing and
would only need to be run long enough to establish if there
is any change.

Although I very much doubt that the stack of the task dispatcher
has overflowed, I'd like to eliminate this obvious, easily checked
possibility first.

> 
> It seems very repeatable over here, so I can help test a fixed package too.

That is the puzzle, maybe it is simply a stack overflow.
The task dispatcher uses a small stack (64k).
I'd like to check what happens if I increase that to 256k.
The dispatcher doesn't actually do much and threads it launches
have a much bigger stack (which probably should be smaller).
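
For illustration only, here is a minimal sketch of how a thread can be
started with an explicit stack size using the standard pthread attribute
calls. This is not the autofs dispatcher code; the helper names
start_with_stack and echo_arg are made up for the example, and the 256k
value is simply the size mentioned above.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper (not from autofs): start a thread with an explicit
 * stack size. Returns 0 on success or a pthread error number. */
int start_with_stack(pthread_t *thid, size_t stack_size,
                     void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int ret;

    ret = pthread_attr_init(&attr);
    if (ret)
        return ret;

    /* Fails with EINVAL if stack_size is below PTHREAD_STACK_MIN. */
    ret = pthread_attr_setstacksize(&attr, stack_size);
    if (!ret)
        ret = pthread_create(thid, &attr, fn, arg);

    /* The attribute object is no longer needed once the thread exists. */
    pthread_attr_destroy(&attr);
    return ret;
}

/* Trivial thread body for the example: hand the argument straight back. */
void *echo_arg(void *arg)
{
    return arg;
}
```

A caller would use something like start_with_stack(&thid, 256 * 1024,
echo_arg, NULL) and then join or detach the thread as usual.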

Ian
Comment 24 Erik Jacobson 2006-11-19 11:15:42 EST
I'm familiar with how SRPMS work and could add a patch and build if that
would be easier for you.  Or you could provide a binary RPM; whatever is best.

Just be sure we're starting from the base SRPM :)  I'm using RHEL5 Beta2 now,
autofs-5.0.1-0.rc2.15.  -Erik
Comment 25 Ian Kent 2006-11-20 00:15:39 EST
(In reply to comment #24)
> I'm familiar with how SRPMS work and could add a patch and build if that
> would be easier for you.  Or you could provide a binary RPM; whatever is best.

That's great.

> 
> Just be sure we're starting from the base SRPM :)  I'm using RHEL5 Beta2 now,
> autofs-5.0.1-0.rc2.15.  -Erik

I did some testing over the weekend and located a couple of
memory access violations. I don't think that this will resolve
the problem, but one was a use-after-free type error which could
be triggering a problem on your hardware.

Ian
Comment 26 Ian Kent 2006-11-20 00:20:18 EST
Created attachment 141618 [details]
Fix illegal memory access in lookup_yp.c

This patch applies cleanly against autofs-5.0.1-0.rc2.15.
Please try it if you can get time.
In the mean time I will continue with organizing access
to ia64 hardware to try and duplicate the problem.

Ian
Comment 27 Erik Jacobson 2006-11-20 10:40:41 EST
Hello.  I expanded the autofs SRPM and put your patch in place in the spec file.
I confirmed the new patch was indeed applied by the build.

I installed the base RPM and the debuginfo RPM.

And now we wait :)
Comment 28 Erik Jacobson 2006-11-20 11:24:19 EST
Created attachment 141661 [details]
backtraces from another core dump

It went boom a while ago.

Here are backtraces on all threads from the core file.  Not sure if it's very
useful but here it is.  I'll attach the daemon log again too.
Comment 29 Erik Jacobson 2006-11-20 11:25:38 EST
Created attachment 141662 [details]
debug log
Comment 31 Ian Kent 2006-11-21 13:53:08 EST
(In reply to comment #28)
> Created an attachment (id=141661) [edit]
> backtraces from another core dump
> 
> It went boom a while ago.

Can you note the time of the core file and verify whether
this happens before the attempted mount or as a result of
it.

So far I've not been able to duplicate this on ia64.

Ian
Comment 32 Erik Jacobson 2006-11-21 13:59:56 EST
The test machine I was using got erased; I'm going to build the RPM again with
your patch (need to build it anyway to get debuginfo installed)... I'll then
get the timing for the core dump and such.  More in a bit.
Comment 33 Erik Jacobson 2006-11-21 16:22:45 EST
Created attachment 141830 [details]
gdb output (that probably isn't useful)

OK, back to how things were...  rhel5 beta2

The autofs rpm has the patch from comment 26.

[root@minime1 /]# rpm -q autofs autofs-debuginfo kernel
autofs-5.0.1-0.rc2.15erikj
autofs-debuginfo-5.0.1-0.rc2.15erikj
kernel-2.6.18-1.2747.el5

Turned debug mode on, enabled collection of the daemon.debug syslog stuffs...

I restarted autofs.

A while later, boom.

[root@minime1 /]# ls -l /core*
-rw------- 1 root root 507248640 Nov 21 13:31 /core.1904

I'll attach the daemon log in a moment.

Attached here is the gdb output again - from the latest event.
Comment 34 Erik Jacobson 2006-11-21 16:26:18 EST
Created attachment 141833 [details]
daemon.debug log output

If I understand the question right...

Here is the time info for the core file - 13:31:22

[root@minime1 ia64]# stat /core.1904
  File: `/core.1904'
  Size: 507248640	Blocks: 4824	   IO Block: 16384  regular file
Device: 815h/2069d	Inode: 156052	   Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2006-11-21 15:20:34.000000000 -0600
Modify: 2006-11-21 13:31:22.000000000 -0600
Change: 2006-11-21 13:31:22.000000000 -0600

But I'm not sure the granularity was enough to figure out exactly what it
was paired with in the daemon debug file:

Nov 21 13:31:21 minime1 automount[1904]: expire_cleanup: got thid 2305843009746104896 path /data/lwork stat 2
Nov 21 13:31:21 minime1 automount[1904]: expire_cleanup: sigchld: exp 2305843009746104896 finished, switching from 2 to 1
Nov 21 13:31:21 minime1 automount[1904]: st_ready: st_ready(): state = 2 path /data/lwork
Nov 21 13:31:22 minime1 automount[1904]: st_expire: state 1 path /data/eagan
Nov 21 13:31:22 minime1 automount[1904]: expire_proc: exp_proc = 2305843009746104896 path /data/eagan
Nov 21 13:31:22 minime1 automount[1904]: expire_cleanup: got thid 2305843009746104896 path /data/eagan stat 0
Nov 21 13:31:22 minime1 automount[1904]: expire_cleanup: sigchld: exp 2305843009746104896 finished, switching from 2 to 1
Nov 21 13:31:22 minime1 automount[1904]: st_ready: st_ready(): state = 2 path /data/eagan
Nov 21 13:31:22 minime1 automount[1904]: expire_proc_indirect: 1 remaining in /home
Nov 21 13:31:22 minime1 automount[1904]: mount still busy /home
Nov 21 13:31:22 minime1 automount[1904]: expire_cleanup: got thid 2305843009720398400 path /home stat 2
Nov 21 13:31:22 minime1 automount[1904]: expire_cleanup: sigchld: exp 2305843009720398400 finished, switching from 2 to 1
Nov 21 13:31:22 minime1 automount[1904]: st_ready: st_ready(): state = 2 path /home
Nov 21 13:31:34 minime1 dhclient: DHCPREQUEST on eth0 to 128.162.243.246 port 67
Comment 35 Erik Jacobson 2006-11-21 17:46:16 EST
Created attachment 141845 [details]
strace output

I booted with selinux=0 to disable selinux.  That's because selinux prevented
me from attaching to the automount process.

Shortly after automount started, I started strace like this:

strace -o /tmp/automount-strace-out -f -p 2531

After it dumped core, I made this attachment of automount-strace-out in case
it's helpful somehow.

Note that I had to kill -9 the strace process that was attached to what
had become a defunct process.

I'm not sure if it's helpful :(
Comment 36 Erik Jacobson 2006-11-21 18:01:40 EST
I *think* this is unrelated.  I filed it in an SGI bug report for us to look
into and possibly file a rhel5 bug.  I panicked the system trying to use gdb
against the automount process (and threads).  Here is the text of the SGI bug
report.  Hopefully it's unrelated, as I said, but I wanted to put it here in the
interest of full disclosure.

This was observed with RHEL5 Beta2.

What was happening?  I was trying to debug the automount core dump problem.
So I used gdb to attach to the automount process.

It went along fine for some time.  Then the system went boom.

I wonder to myself if the system went boom at the same point the automount
task would have seg faulted if I didn't have gdb attached.


[root@minime1 ~]# kernel BUG at kernel/exit.c:76!
automount[2926]: bugcheck! 0 [1]
Modules linked in: nfs lockd fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth
sunrpc ipv6 vfat fat dm_mirror dm_mod button parport_pc lp parport mca_recovery
ide_cd cdrom tg3 sg mptsas scsi_transport_sas mptscsih mptbase sata_vsc libata
qla1280 sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd

Pid: 2926, CPU 1, comm:            automount
psr : 00001010085a2010 ifs : 800000000000038a ip  : [<a00000010007df60>]    Not
tainted
ip is at release_task+0x140/0x7e0
unat: 0000000000000000 pfs : 000000000000038a rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 6516606a9965a565
ldrs: 0000000000000000 ccv : 0000000000000010 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a00000010007df60 b6  : a0000001004d7840 b7  : a0000001003c88e0
f6  : 1003e00000000000000a0 f7  : 1003e20c49ba5e353f7cf
f8  : 1003e00000000000004e2 f9  : 1003e000000000fa00000
f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db
r1  : a000000100bef1f0 r2  : a000000100a06750 r3  : a000000100938860
r8  : 0000000000000023 r9  : 0000000000000026 r10 : a000000100a06780
r11 : a000000100a06780 r12 : e00000301c2efe00 r13 : e00000301c2e8000
r14 : a000000100a06750 r15 : 0000000000000000 r16 : a000000100938868
r17 : e000003071ae7e18 r18 : 0000000000000000 r19 : e0000030031171e3
r20 : 0000000000000000 r21 : a0000001009ef820 r22 : e000003003120000
r23 : a000000100845200 r24 : a0000001009ef820 r25 : a000000100a06758
r26 : a000000100a06758 r27 : 0000000000000000 r28 : 0000000000000026
r29 : 80000001fdc00000 r30 : 0000000000000000 r31 : 0000000000000000

Call Trace:
 [<a000000100014140>] show_stack+0x40/0xa0
                                sp=e00000301c2ef990 bsp=e00000301c2e9358
 [<a000000100014a40>] show_regs+0x840/0x880
                                sp=e00000301c2efb60 bsp=e00000301c2e9300
 [<a000000100037c60>] die+0x1c0/0x2c0
                                sp=e00000301c2efb60 bsp=e00000301c2e92b8
 [<a000000100037db0>] die_if_kernel+0x50/0x80
                                sp=e00000301c2efb80 bsp=e00000301c2e9288
 [<a0000001006147f0>] ia64_bad_break+0x270/0x4a0
                                sp=e00000301c2efb80 bsp=e00000301c2e9260
 [<a00000010000c700>] __ia64_leave_kernel+0x0/0x280
                                sp=e00000301c2efc30 bsp=e00000301c2e9260
 [<a00000010007df60>] release_task+0x140/0x7e0
                                sp=e00000301c2efe00 bsp=e00000301c2e9210
 [<a0000001000ef020>] check_noreap+0xa0/0x160
                                sp=e00000301c2efe00 bsp=e00000301c2e91d0
 [<a0000001000f15d0>] utrace_report_death+0x650/0x680
                                sp=e00000301c2efe00 bsp=e00000301c2e9178
 [<a000000100081b50>] do_exit+0x1330/0x14a0
                                sp=e00000301c2efe10 bsp=e00000301c2e9120
 [<a000000100081e80>] sys_exit+0x20/0x40
                                sp=e00000301c2efe30 bsp=e00000301c2e90c8
 [<a00000010000c490>] __ia64_trace_syscall+0xd0/0x110
                                sp=e00000301c2efe30 bsp=e00000301c2e90c8
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e00000301c2f0000 bsp=e00000301c2e90c8
 <0>Kernel panic - not syncing: Fatal exception
Comment 37 Erik Jacobson 2006-11-21 18:08:36 EST
It just happened again (kernel panic trying to use gdb to debug the
running automount process).

Similar kernel backtrace.
Comment 38 Ian Kent 2006-11-21 20:43:45 EST
(In reply to comment #33)
> Created an attachment (id=141830) [edit]
> gdb output (that probably isn't useful)

No, this is much better.
gdb has been able to decode the stack trace so I'm able
to make sense of it and have some confidence in it.

It's consistent with the previous cores in that it says
that autofs crashed while checking for the existence of
a thread when trying to identify completed tasks (they
are typically expires).

I checked this part of the code following your first core
but it looked ok. I have seen this type of problem before
with the dispatcher but I'm puzzled as to why it appears
to be happening when a seemingly unrelated action, a
mount request, comes in (although we can't yet be sure
that it is coincident with the mount request).

Ian
Comment 39 Ian Kent 2006-11-22 01:50:29 EST
Created attachment 141877 [details]
Patch to remove need to call pthread_kill when checking for task done. 

This assumes there is a problem with detached thread
id reuse, possibly related to the order of execution imposed
by scheduling, and that the pthread library is unable to handle
a call to pthread_kill during thread setup.

Again, as I can't reproduce the problem, this is just a
guess as to what might be happening so please give it a
try (also please continue using the first patch).
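
To show the general idea (this is a sketch under assumptions, not the
attached patch): rather than probing a thread id with pthread_kill(thid, 0),
which POSIX leaves undefined once the thread's lifetime has ended, the
worker can record its own completion in a flag that the dispatcher reads
under a mutex. The struct and function names below are hypothetical and do
not come from the autofs source.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical task record; names are illustrative, not autofs's. */
struct task {
    pthread_mutex_t lock;
    bool done;              /* set by the worker just before it returns */
};

void task_init(struct task *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->done = false;
}

/* Worker side: publish completion instead of relying on the dispatcher
 * probing the thread id. */
void *task_worker(void *arg)
{
    struct task *t = arg;

    /* ... the real work (for example an expire run) would go here ... */

    pthread_mutex_lock(&t->lock);
    t->done = true;
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* Dispatcher side: only touches memory owned by the task record, so it
 * does not matter whether the worker's thread id is still valid. */
bool task_is_done(struct task *t)
{
    bool done;

    pthread_mutex_lock(&t->lock);
    done = t->done;
    pthread_mutex_unlock(&t->lock);
    return done;
}
```

With this shape the dispatcher never passes a possibly stale or reused
thread id to the pthread library when it checks for finished tasks.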

Ian
Comment 40 Erik Jacobson 2006-11-22 09:54:23 EST
Sorry to report that on ia64 I'm having trouble building with that patch
applied.

Here is some proof that the patch applied (21 and 22 are from you)

Patch #20 (autofs-5.0.1-rc2-numeric-ldap-host-name.patch):
+ patch -p1 -s
+ echo 'Patch #21 (autofs-fixup-from-rh):'
Patch #21 (autofs-fixup-from-rh):
+ patch -p1 -s
+ echo 'Patch #22 (no-pthread-kill-from-rh):'
Patch #22 (no-pthread-kill-from-rh):
+ patch -p1 -s
+ exit 0
Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.87964


Here are lines leading up to the build failure.

Would it be helpful if I reserved one of the SGI ia64 systems in Westford and
tried to get RHEL5 Beta2 installed on it?  I can't promise I can duplicate
the problem there, but I can try that too.  That might let you fly less
blind :)

gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -D_REENTRANT -D_REENTRANT -rdynamic -fPIE
-D_GNU_SOURCE -I../include -DAUTOFS_LIB_DIR=\"/usr/lib/autofs\" 
-DAUTOFS_MAP_DIR=\"/etc\" -DAUTOFS_CONF_DIR=\"/etc/sysconfig\"
-DVERSION_STRING=\"5.0.1-0.rc2.15erikj2\" -c state.c
state.c: In function 'st_set_done':
state.c:57: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:83: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:97: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:201: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:217: warning: empty declaration
state.c:229: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:253: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:317: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:340: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:347: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:429: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:451: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:511: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:541: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:571: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:596: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:621: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:640: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:739: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:796: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:841: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:856: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:877: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:997: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
state.c:1018: error: old-style parameter declarations in prototyped function definition
state.c:1018: error: expected '{' at end of input
make[1]: *** [state.o] Error 1
make[1]: Leaving directory `/usr/src/redhat/BUILD/autofs-5.0.1/daemon'
make: *** [daemon] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.87964 (%build)


RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.87964 (%build)
Comment 41 Erik Jacobson 2006-11-22 11:03:59 EST
altix3.lab.boston.redhat.com is available.  I have caught it failing once
already in my simple test case set up there.

/etc/auto.master has a map reference to /etc/auto.test.
/etc/auto.test just mounts a couple nfs directories under
/test/t1, /test/t2, and /test/t3.

/usr/src/redhat has autofs in it with your first patch but not your second
applied.  I have just restarted this version of autofs with the patch
in place.

There are a couple example cores in /, but some of them were with the original
rhel5 beta2 automount process.

The new version of autofs that has your patch applied was installed (along
with debuginfo) Wed Nov 22 11:01:56 EST 2006 so any core in / older
than that should associate with this install.

With my small non-NIS test case on altix3, it doesn't seem to trigger
as often but still does.  I wish we could find an exact trigger.

Please feel free to log in to this machine as root to test and debug.

I hope this helps.
Comment 42 Ian Kent 2006-11-22 11:10:21 EST
(In reply to comment #40)
> Sorry to report that on ia64 I'm having trouble building with that patch
> applied.
> 

Arrgh .. I was sure I fixed that in the patch I posted.
It's a missing ";"

static void st_set_thid(struct autofs_point *, pthread_t);
+static void st_set_done(struct autofs_point *ap)

as you can see.
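
As an illustration of why one missing ";" produces the long error list in
comment #40: without the terminator, the compiler most likely reads the
prototype as the start of a function definition and then tries to treat
every following function in the file as an old-style parameter
declaration, which matches the final "old-style parameter declarations"
and "expected '{' at end of input" messages. The sketch below shows the
declaration in its terminated form. The struct contents, the function
body and the demo_mark_done helper are hypothetical and exist only so
the fragment compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the real autofs structure; the field is hypothetical. */
struct autofs_point {
    int done;
};

/* Forward declaration, terminated with the ';' that was missing. */
static void st_set_done(struct autofs_point *ap);

/* Hypothetical body, present only so that this file is self-contained. */
static void st_set_done(struct autofs_point *ap)
{
    assert(ap != NULL);
    ap->done = 1;
}

/* Hypothetical helper that exercises the function above. */
int demo_mark_done(void)
{
    struct autofs_point ap = { 0 };

    st_set_done(&ap);
    return ap.done;
}
```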
Comment 43 Ian Kent 2006-11-22 11:14:38 EST
(In reply to comment #41)
> altix3.lab.boston.redhat.com is available.  I have caught it failing once
> already in my simple test case set up there.

Excellent.
Good work.

> 
> /etc/auto.master has a map reference to /etc/auto.test.
> /etc/auto.test just mounts a couple nfs directories under
> /test/t1, /test/t2, and /test/t3.
> 
> /usr/src/redhat has autofs in it with your first patch but not your second
> applied.  I have just restarted this version of autofs with the patch
> in place.
> 
> There are a couple example cores in /, but some of them were with the original
> rhel5 beta2 automount process.
> 
> The new version of autofs that has your patch applied was installed (along
> with debuginfo) Wed Nov 22 11:01:56 EST 2006 so any core in / older
> than that should associate with this install.
> 
> With my small non-NIS test case on altix3, it doesn't seem to trigger
> as often but still does.  I wish we could find an exact trigger.
> 
> Please feel free to log in to this machine as root to test and debug.

Certainly.

> 
> I hope this helps.

Yep. This helps a lot.
Given that we know it fails I'm going to add the second patch
and do a quick test with it.
We can always take it out to get more info later.

Ian
Comment 44 Erik Jacobson 2006-11-22 16:10:50 EST
Hi Ian.  The version of the RPMs you created on altix3 have run on an SGI
machine for quite some time now and I'm fairly confident the new rpm fixes
the issue.

This being near a holiday, I'm not sure what exposure we'll get from the people
doing the MPI regression testing but I've asked them to try the patched RPMs
too.

Erik
Comment 45 Ian Kent 2006-11-22 20:46:45 EST
Created attachment 141964 [details]
Patch to remove need to call pthread_kill when checking for task done (with correction).
Comment 46 Ian Kent 2006-11-22 20:58:22 EST
(In reply to comment #44)
> Hi Ian.  The version of the RPMs you created on altix3 have run on an SGI
> machine for quite some time now and I'm fairly confident the new rpm fixes
> the issue.

That sounds promising.

> 
> This being near a holiday, I'm not sure what exposure we'll get from the people
> doing the MPI regression testing but I've asked them to try the patched RPMs
> too.

It needs some more work anyway.
I'll do some verification (I'd hate to have a slowly growing task list)
while we wait.

Ian
Comment 47 Terje Rosten 2006-11-23 08:30:07 EST
Got a new core file on RHEL 5 Beta 2 x86_64 , autofs-5.0.1-0.rc2.15.

Some gdb output (info threads) (I did not find any debuginfo package):

#0  0x000055555556caec in add_source () from /usr/sbin/automount
(gdb) list threads
No symbol table is loaded.  Use the "file" command.
(gdb) info threads
  18 process 7134  0x00002aaaaacd3b18 in do_sigwait ()
   from /lib64/libpthread.so.0
  17 process 7135  0x00002aaaaacd0607 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  16 process 7136  0x00002aaaaacd0607 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  15 process 7139  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  14 process 7142  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  13 process 7143  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  12 process 7144  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  11 process 7145  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  10 process 7146  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  9 process 7147  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  8 process 7148  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  7 process 7149  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  6 process 7150  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  5 process 7151  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  4 process 7395  0x00002aaaaacd0416 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  3 process 7398  0x00002aaaab3b1487 in mkdir () from /lib64/libc.so.6
  2 process 7405  0x00002aaaab3cb3f8 in __lll_mutex_lock_wait ()
   from /lib64/libc.so.6
* 1 process 7407  0x000055555556caec in add_source () from /usr/sbin/automount

Some other bugs too:

o /net doesn't work with -hosts, however /net with /etc/auto.net works.

o direct mounts are here, so I really miss -null (/opt is a direct mount
in nis maps and I want that to be a local file system on Linux clients...)

o man pages really need a refresher e.g. autofs(8): this is not correct:

 /etc/init.d/autofs  status
 will display the current configuration and a list of currently running
 automount daemons.



Comment 48 Terje Rosten 2006-11-23 08:33:37 EST
Created attachment 141990 [details]
gzipped core from comment #47
Comment 49 Terje Rosten 2006-11-23 08:42:03 EST
Created attachment 141992 [details]
automount log from crash in comment #47

Lots of  
failed to get group info from getgrgid_r
then:
lookup_nss_read_map: can't to read name service switch config.
segfault at 0000000000000008 rip 000055555556caec rsp 000000004202a940 error 4

BTW:
$ cat /etc/nsswitch.conf | grep -v '#' | sed '/^$/d'
passwd:     files nis
shadow:     files nis
group:	    files nis
hosts:	    files nis dns
bootparams: nisplus [NOTFOUND=return] files
ethers:     files
netmasks:   files
networks:   files
protocols:  files nis
rpc:	    files
services:   files nis
netgroup:   files nis
publickey:  nisplus
automount:  files nis
aliases:    files nisplus
Comment 50 Ian Kent 2006-11-23 09:54:16 EST
(In reply to comment #47)
> Got a new core file on RHEL 5 Beta 2 x86_64 , autofs-5.0.1-0.rc2.15.
> 
> Some gdb output (info threads) (I did not find any debuginfo package):
> 
> #0  0x000055555556caec in add_source () from /usr/sbin/automount

That's new, maybe.
How about a backtrace?

> (gdb) list threads
> No symbol table is loaded.  Use the "file" command.
> (gdb) info threads

snip ...

>   2 process 7405  0x00002aaaab3cb3f8 in __lll_mutex_lock_wait ()
>    from /lib64/libc.so.6
> * 1 process 7407  0x000055555556caec in add_source () from /usr/sbin/automount

And no line number.
It's probably the nsswitch parser locking bug.
I'll prepare a patch.
In fact it's probably better for me to update the RHEL5 package
with the patch developed previously in this bug and include the
nsswitch and macro table locking fixes. I was planning on doing
that this week anyway.

I'll sort that out tomorrow and post to the bug.

> 
> Some other bugs too:
> 
> o /net doesn't work with -hosts, however /net with /etc/auto.net works.

No information. We'll have to work on this later.

> 
> o direct mounts are here, so I really miss -null (/opt is a direct mount
> in nis maps and I want that to be a local file system on Linux clients...)

Yep. I'm aware that the implementation in 0.rc2.15 is incorrect.
I've fixed this but it's not quite finished yet, however, it does
make autofs null entries work as they do in Solaris.
Once I've done the updates above I can prepare a temporary patch
for you to use until the package is updated. Hopefully I'll be
able to complete that tomorrow as well.

> 
> o man pages really need a refresher e.g. autofs(8): this is not correct:
> 
>  /etc/init.d/autofs  status
>  will display the current configuration and a list of currently running
>  automount daemons.

Ooops!

Ian
Comment 51 Ian Kent 2006-11-23 10:05:57 EST
(In reply to comment #49)
> Created an attachment (id=141992) [edit]
> automount log from crash in comment #47
> 
> Lots of  
> failed to get group info from getgrgid_r

These messages are a bit puzzling.
We'll try with the nsswitch parser locking patch first and
see how that goes.

How long had autofs been running before the trouble?

Ian
Comment 52 Ian Kent 2006-11-23 10:10:03 EST
(In reply to comment #49)
> Created an attachment (id=141992) [edit]
> automount log from crash in comment #47
> 
> Lots of  
> failed to get group info from getgrgid_r

Oh .. another thing.
Tell me you're not running a 32 bit package on a 64 bit arch!
That won't work at this stage.

Ian
Comment 53 Terje Rosten 2006-11-23 12:32:38 EST
> It's probably the nsswitch parser locking bug.
> I'll prepare a patch.
> In fact it's probably better for me to update the RHEL5 package
> with the patch developed previously in this bug and include the
> nsswitch and macro table locking fixes. I was planning on doing
> that this week anyway.

Sounds great, I have no problems testing patches.

>> Lots of  
>> failed to get group info from getgrgid_r

>These messages are a bit puzzling.
>We'll try with the nsswitch parser locking patch first and
>see how that goes.

>How long had autofs been running before the trouble?

Not long, 24 hours or something. However, I believe the getgrgid_r
warnings are present right after startup.

>Oh .. another thing.
>Tell me you're not running a 32 bit package on a 64 bit arch!
>That won't work at this stage.

Don't think so, as I have not done anything fancy, will check later.




Comment 54 Ian Kent 2006-11-23 13:00:20 EST
Created attachment 142002 [details]
Patch to fix nsswitch parser locking
Comment 55 Ian Kent 2006-11-23 13:01:42 EST
Created attachment 142003 [details]
Patch to fix macro table locking
Comment 56 Ian Kent 2006-11-23 13:04:24 EST
Here are the two patches for nsswitch parser and macro
table locking I mentioned.

I recommend using the previous patches for the illegal
memory access and the one to avoid the use of pthread_kill
as well.

Please give them a try.
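
For context, a rough sketch of what "parser locking" usually means: a
parser that keeps its working state in globals is not safe to call from
several threads at once, so every call is serialized behind one mutex.
This is not the attached patch. The function parse_automount_line, the
parsed_source buffer and the parse_mutex name are hypothetical, and the
parsing itself is deliberately simplified to the first source on an
"automount:" line.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical global parse state, standing in for a non-reentrant parser. */
static char parsed_source[32];
static pthread_mutex_t parse_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Copy the first source listed after "automount:" into out.
 * Returns 0 on success, -1 if the line is not a usable automount entry.
 * The mutex is held for the whole time the shared buffer is in use.
 */
int parse_automount_line(const char *line, char *out, size_t outlen)
{
    static const char key[] = "automount:";
    int ret = -1;

    assert(line != NULL && out != NULL && outlen > 0);

    pthread_mutex_lock(&parse_mutex);
    if (strncmp(line, key, sizeof(key) - 1) == 0) {
        const char *p = line + sizeof(key) - 1;
        size_t n;

        p += strspn(p, " \t");          /* skip leading blanks */
        n = strcspn(p, " \t\n");        /* length of the first source */
        if (n > 0 && n < sizeof(parsed_source) && n < outlen) {
            memcpy(parsed_source, p, n);
            parsed_source[n] = '\0';
            memcpy(out, parsed_source, n + 1);
            ret = 0;
        }
    }
    pthread_mutex_unlock(&parse_mutex);
    return ret;
}
```

Copying the result out before the unlock is the important part: a caller
must never be left holding a pointer into state that another thread can
overwrite.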
Comment 57 Ian Kent 2006-11-23 14:05:00 EST
Created attachment 142005 [details]
Interim patch to fix null map handling semantics

I haven't re-tested this patch.
I had to make a few changes due to dependencies on other
updates not in 0.rc2.15.

Hopefully it will be OK.
I'll check it tomorrow.

Ian
Comment 58 Terje Rosten 2006-11-23 17:34:08 EST
> Interim patch to fix null map handling semantics

Thanks, will test tomorrow.

BTW: srpms with all 5 patches added available here:
 http://www.pvv.ntnu.no/~terjeros/rpms/autofs/
Comment 59 Terje Rosten 2006-11-24 06:20:58 EST
Some more testing with the 5 patches added:

good: 
 o -null works. Thanks

bad:
 o -hosts doesn't work (not important here)
 o more core dumps:

$ gdb -c /core.6997 /usr/sbin/automount 

[snip]

Reading symbols from /lib64/libnss_nis.so.2...done.
Loaded symbols for /lib64/libnss_nis.so.2
Core was generated by `automount'.
Program terminated with signal 11, Segmentation fault.
#0  tree_free_mnt_tree (tree=0x5555557acec0) at mounts.c:495
495                     p = p->next;
(gdb) info threads
  14 process 6997  0x00002aaaaacd3b18 in do_sigwait () from /lib64/libpthread.so.0
  13 process 6998  0x00002aaaaacd0607 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  12 process 6999  0x00002aaaaacd0416 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  11 process 7002  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  10 process 7005  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  9 process 7006  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  8 process 7007  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  7 process 7008  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  6 process 7009  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  5 process 7010  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  4 process 7011  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  3 process 7012  0x00002aaaab3b6b96 in poll () from /lib64/libc.so.6
  2 process 7013  0x00002aaaaacd2909 in __lll_mutex_unlock_wake () from
/lib64/libpthread.so.0
* 1 process 7395  tree_free_mnt_tree (tree=0x5555557acec0) at mounts.c:495
(gdb) thread 1
[Switching to thread 1 (process 7395)]#0  tree_free_mnt_tree
(tree=0x5555557acec0) at mounts.c:495
495                     p = p->next;
(gdb) bt
#0  tree_free_mnt_tree (tree=0x5555557acec0) at mounts.c:495
#1  0x0000555555561e95 in expire_proc_direct (arg=<value optimized out>)
    at /usr/include/pthread.h:579
#2  0x00002aaaaaccc305 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aaaab3bf66d in clone () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (process 7013)]#0  0x00002aaaaacd2909 in
__lll_mutex_unlock_wake ()
   from /lib64/libpthread.so.0
(gdb) bt
#0  0x00002aaaaacd2909 in __lll_mutex_unlock_wake () from /lib64/libpthread.so.0
#1  0x00002aaaaaccf9d9 in _L_mutex_unlock_59 () from /lib64/libpthread.so.0
#2  0x00002aaaaaccf69b in __pthread_mutex_unlock_usercnt () from
/lib64/libpthread.so.0
#3  0x0000555555569624 in st_add_task (ap=0x5555557a9370, state=ST_EXPIRE) at
state.c:758
#4  0x000055555555d8e6 in handle_mounts (arg=0x5555557a9370) at automount.c:626
#5  0x00002aaaaaccc305 in start_thread () from /lib64/libpthread.so.0
#6  0x00002aaaab3bf66d in clone () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
Comment 60 Terje Rosten 2006-11-24 06:23:52 EST
More good things:
  o the error msgs 
    do_mount_indirect: failed to get group info from getgrgid_r
    is not here any more.
Comment 61 Terje Rosten 2006-11-24 06:28:14 EST
Created attachment 142056 [details]
automount log from crash in comment #59

Logs from automount with debug option leading to crash.
Comment 62 Ian Kent 2006-11-24 08:49:54 EST
(In reply to comment #59)
> Some more testing with the 5 patches added:
> 
> good: 
>  o -null works. Thanks

Cool. Hopefully I'll be able to sort out the error check
I need and add it to the package soon.

> 
> bad:
>  o -hosts doesn't work (not important here)

Yep. We'll get to it.

>  o more core dumps:
> 
> $ gdb -c /core.6997 /usr/sbin/automount 
> 
> [snip]
> 
> Reading symbols from /lib64/libnss_nis.so.2...done.
> Loaded symbols for /lib64/libnss_nis.so.2
> Core was generated by `automount'.
> Program terminated with signal 11, Segmentation fault.
> #0  tree_free_mnt_tree (tree=0x5555557acec0) at mounts.c:495
> 495                     p = p->next;

Ha .. this has to be an error from altering the null map
patch. I had to make some changes in this area.
I'll check it out but I'll probably recommend using an
updated package. Still haven't managed to do that sorry.

Ian
Comment 63 Ian Kent 2006-11-24 12:27:43 EST
Created attachment 142080 [details]
Interim patch to fix null map handling semantics - fix

This patch seems to fix the error I made with the interim
patch above.

Hopefully this will allow you to continue your testing
and give me time to complete the null map patch and
consolidate the current changes into the RHEL package
and perform proper testing.

Ian
Comment 64 Ian Kent 2006-11-24 12:35:28 EST
(In reply to comment #62)
> > bad:
> >  o -hosts doesn't work (not important here)
> 
> Yep. We'll get to it.

This is a bit of a second priority but can you give some
more information on this please.

A debug log would be good including startup of autofs
and an attempt to access a server.

Output of "showmount -e <server>" for a server you are
having trouble with and its /etc/exports.

Ian
Comment 65 Terje Rosten 2006-11-27 09:34:58 EST
Created attachment 142171 [details]
logs as requested in comment #64

About automount log:
first startup, 
then /net (with -hosts map) (failure)
and last /net.program (with /etc/auto.net as map ) to the same host (works).
Comment 66 Erik Jacobson 2006-11-27 09:55:12 EST
Hi.  I'm now running with 6 patches to the base rhel5 beta2 version of
autofs.  

Hopefully that's the right number :)

I started with the SRPM with the rp22 version from altix3.lab.boston.redhat.com.
That included autofs-fixup-from-rh and autofs-5.0.1-rc2-use-task-done.patch.

From this bug, I added:
autofs-nsswitch-parser-locking (comment 54)
autofs-macro-table-locking-patch (comment 55)
autofs-null-map-handling-try1-patch (comment 57)
autofs-null-map-handling-try2-patch (comment 63)

The problem I hit was already fixed but I'm now running with all these to
be sure things keep chugging along over here.  So far so good.
Comment 67 Ian Kent 2006-11-27 10:18:18 EST
(In reply to comment #66)
> Hi.  I'm now running with 6 patches to the base rhel5 beta2 version of
> autofs.  
> 
> Hopefully that's the right number :)

Looks ok.

> 
> I started with the SRPM with the rp22 version from altix3.lab.boston.redhat.com.
> That included autofs-fixup-from-rh and autofs-5.0.1-rc2-use-task-done.patch.
> 
> From this bug, I added:
> autofs-nsswitch-parser-locking (comment 54)
> autofs-macro-table-locking-patch (comment 55)

These two would be the next segv you would see, good.

> autofs-null-map-handling-try1-patch (comment 57)
> autofs-null-map-handling-try2-patch (comment 63)

And these of course if you need the "-null" option.
The thing that remains with this patch is an error
that I'm having a little trouble working out. Other
than that it should provide the correct "-null" 
functionality.

I have applied all the above patches, except the null
map, to RHEL 5 cvs (revision 24). If we see further
problems we'll deal with them as they arise.

Thanks
Ian
Comment 68 Ian Kent 2006-11-27 10:19:51 EST
(In reply to comment #67)
> The thing that remains with this patch is an error
> that I'm having a little trouble working out. Other
> than that it should provide the correct "-null" 
> functionality.

That's error check, not error, sorry.

Ian
Comment 69 Ian Kent 2006-11-27 10:42:58 EST
(In reply to comment #65)
> Created an attachment (id=142171) [edit]
> logs as requested in comment #64
> 
> About automount log:
> first startup, 
> then /net (with -hosts map) (failure)
> and last /net.program (with /etc/auto.net as map ) to the same host (works).

I think this may be a problem that I know about with
the matching of a simple host name against FQDN in
the export list. It may also be (as well) that autofs
doesn't understand some of the Sun style export access
control syntax. I was alerted to the problem recently
and haven't yet started work on fixing it.

When I start this work it may be best for us to open
another bz specifically for it. We'll see.

So I have to recommend using the old script, as you are
doing, in the mean time.

Ian
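
To make the host name problem concrete, here is a small sketch of one way
to compare a simple host name against an FQDN from an export list. It is
an assumption about the kind of check involved, not the autofs code. The
function name host_name_matches is hypothetical, and it ignores netgroups,
wildcards and the Sun style access control syntax mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Return 1 if name and entry look like the same host, 0 otherwise.
 * Either argument may be a short name or an FQDN. They match when they
 * are equal (ignoring case) or when the shorter one is a prefix of the
 * longer one and the longer one continues with a '.'.
 */
int host_name_matches(const char *name, const char *entry)
{
    size_t nlen, elen, min;

    assert(name != NULL && entry != NULL);

    nlen = strlen(name);
    elen = strlen(entry);
    if (nlen == 0 || elen == 0)
        return 0;
    if (nlen == elen)
        return strcasecmp(name, entry) == 0;

    min = nlen < elen ? nlen : elen;
    if (strncasecmp(name, entry, min) != 0)
        return 0;

    /* Reject "fs1" against "fs10.example.com": next char must be '.'. */
    return (nlen < elen ? entry[min] : name[min]) == '.';
}
```

A check like this can give false positives across domains (for example
"fs1" against "fs1.other.example"), so a real implementation would more
likely resolve both names and compare addresses.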

Comment 70 Erik Jacobson 2006-11-27 22:32:20 EST
So far so good on our systems.  I haven't been testing the null stuff, just
making sure the changes are working together ok.

I'll note that on 3 machines we upgraded autofs on, the /etc/auto.master
was renamed to .rpmsave and _no_ auto.master file resulted - making it
so autofs would fail to restart after upgrading the autofs rpm.

Probably a separate bug.  We just copied valid auto.master's back in to place
and were back in business.
Comment 71 Ian Kent 2006-11-27 22:45:44 EST
(In reply to comment #70)
> So far so good on our systems.  I haven't been testing the null stuff, just
> making sure the changes are working together ok.
> 
> I'll note that on 3 machines we upgraded autofs on, the /etc/auto.master
> was renamed to .rpmsave and _no_ auto.master file resulted - making it
> so autofs would fail to restart after upgrading the autofs rpm.
> 

Wow .. that's no good.
I've never seen that happen before and I update my autofs
package a lot on several different installs.

> Probably a separate bug.  We just copied valid auto.master's back in to place
> and were back in business.

That would be best for tracking purposes.

Ian


Comment 72 Erik Jacobson 2006-11-27 22:59:22 EST
Yeah, it didn't always happen.  I wonder to myself if it's the missingok in
the spec file.

At SGI, we have a management step in opening red hat bugs so I proposed it on
the sgi side.  I should have a RH bug for you tomorrow.  Thanks!
Comment 73 Terje Røsten 2006-11-28 05:25:14 EST
(In reply to comment #71)
> > I'll note that on 3 machines we upgraded autofs on, the /etc/auto.master
> > was renamed to .rpmsave and _no_ auto.master file resulted - making it
> > so autofs would fail to restart after upgrading the autofs rpm.
> > 
> 
> Wow .. that's no good.
> I've never seen that happen before and I update my autofs
> package a lot on several different installs.

I have also seen this, I believe it was on some FC5 systems
in the autofs-4.1.4-32 or -33 update.

It's of course very nasty.

 

Comment 74 Terje Røsten 2006-11-28 06:32:54 EST
(In reply to comment #69)
> When I start this work it may be best for us to open
> another bz specifically for it. We'll see.

Ok.
 
> So I have to recommend using the old script, as you are
> doing, in the mean time.

Yeah, running with patch from comment #63 and fix from bz #208244
and things seem very happy (I even have DEFAULT_BROWSE_MODE="yes" now).




Comment 75 Jay Turner 2006-12-01 15:32:13 EST
QE ack for RHEL5.  Quite a mess, but we'll do some focused automount testing.
Comment 76 Terje Rosten 2006-12-01 20:55:29 EST
(In reply to comment #75)
> QE ack for RHEL5.  Quite a mess, but we'll do some focused automount testing.

Of course, it's a serious rewrite going from multi-process to
multi-threaded, and implementing lots of new features at the same time.

However, I am a happy user now.

Please test yp, ldap, files + different nfs servers: linux, solaris (9+10),
hp-ux, aix, freebsd, udp, tcp, rsize+wsize (large values with tcp
seem to be flaky/slow) and -hosts and -null options, this is not trivial stuff.

Thanks to Ian and Jeffrey, things are getting much better!

Comment 77 Ian Kent 2006-12-07 01:51:58 EST
(In reply to comment #76)
> (In reply to comment #75)
> > QE ack for RHEL5.  Quite a mess, but we'll do some focused automount testing.
> 
> Of course, it's a serious rewrite going from multi-process to
> multi-threaded, and implementing lots of new features at the same time.

Sure is, but when I look at the new features list it seems
quite short considering the amount of change.

Nevertheless, our compatibility is much better now.

> 
> However, I am a happy user now.

That's great to hear.

> 
> Please test yp, ldap, files + different nfs servers: linux, solaris (9+10),
> hp-ux, aix, freebsd, udp, tcp, rsize+wsize (large values with tcp
> seem to be flaky/slow) and -hosts and -null options, this is not trivial stuff.

Our focus has been on compatibility with Solaris, the assumption
being that if we achieve that then everything should work with
other servers since the Solaris implementation is the expected
standard behavior.

The -null map semantics should now be correct in all cases
as of autofs-5.0.1-0.rc2.27 in RHEL5.

The -hosts export list access validation still needs more
work for some of the Solaris specific options.

Certainly the sources yp and files should be fine. They've
had lots of exercise. LDAP should be fine as well but at
this time it needs a change to the configuration to tell
autofs to use the rfc2307bis schema for Solaris servers.

Ian
Comment 78 Erik Jacobson 2006-12-12 11:00:57 EST
FYI - SGI has had good results with autofs in rhel5 rc snapshot2.  thanks.
Comment 79 Ian Kent 2006-12-15 02:08:39 EST
So far the updates that have been applied to autofs
as a result of this investigation appear to have
resolved the issues (including the corrections
for "-null" map semantics).

The issue raised in comment #72 has been raised as a
separate bug in bz 217575.

So I'm setting this bug to modified.

Ian
Comment 80 RHEL Product and Program Management 2006-12-22 19:25:33 EST
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.
