432313 – Backgrounded NFS mounts dies soon after "service netfs start" command is issued

Bug 432313 - Backgrounded NFS mounts dies soon after "service netfs start" command is issued

Summary: Backgrounded NFS mounts dies soon after "service netfs start" command is issued

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	util-linux
Sub Component:
Version:	4.5
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Layton
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	445037 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-02-11 07:14 UTC by Sunil Shinde
Modified:	2018-10-20 00:24 UTC (History)
CC List:	5 users (show)
Fixed In Version:	RHBA-2008-0751
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-07-24 20:01:47 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
proposed patch (878 bytes, patch) 2008-03-20 17:56 UTC, Jeff Layton	no flags	Details \| Diff
proposed patch 2 (822 bytes, patch) 2008-03-27 15:00 UTC, Jeff Layton	no flags	Details \| Diff
patch 3 (913 bytes, patch) 2008-03-27 16:49 UTC, Jeff Layton	no flags	Details \| Diff
Show Obsolete (2) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0751	0	normal	SHIPPED_LIVE	util-linux bug fix update	2008-07-23 16:50:51 UTC

Description Sunil Shinde 2008-02-11 07:14:20 UTC

Description of problem:

I have a setup with the following configuration:
A configuration where the directories are mounted via NFS as
background mounts through /etc/fstab.
And trying to mount the file system by issuing service netfs start command.

Here are the entries in /etc/fstab file
<ip>:/work1     /work1          nfs rw,bg,intr,hard,wsize=32768,rsize=32768 0 0
<ip>:/work2     /work2          nfs rw,bg,intr,hard,wsize=32768,rsize=32768 0 0

Version-Release number of selected component (if applicable):
RHEL4U5

How reproducible:
Always

Steps to Reproduce:
1. make the entries in /etc/fstab file to mount two NFS directories as;
<ip>:/work1     /work1          nfs rw,bg,intr,hard,wsize=32768,rsize=32768 0 0
<ip>:/work2     /work2          nfs rw,bg,intr,hard,wsize=32768,rsize=32768 0 0
2. make the NFS server not reachable, either plug out the network
cable or shutdown the NFS server.
3. start the /etc/rc.d/init.d/netfs service.
4. ps -ef
  
Actual results:
The background mount for "/work2" ( second line in /etc/fstab)
disappears shortly after it is backgrounded.
We see from the "/var/log/messages" file that the mount for "/work1"
issues the following message shortly after it is backgrounded.

mount: mount to NFS server 'XXX.XXX.XXX.XXX' failed

However, we never see that message coming from the background mount
for "/work2".

But, if I manually run "mount -a" from the client, both mounts issue
the message and the background mounts remain running until the server
that is exporting the "/work1" and "/work2" file systems is up and
running.

Expected results:
When the "service netfs start" command is issued.
The mounts for "/work1" and "/work2" should get backgrounded and wait
for the NFS server to export the two file systems.

Additional info:
mount command submitted through initlog command has this behavior.
strace on the mount command shows.

Process 8119 attached - interrupt to quit

connect(3, {sa_family=AF_INET, sin_port=htons(111),
sin_addr=inet_addr("XXX.XXX.XXX.XXX")}, 16) = -1 EHOSTUNREACH (No route to host)

close(3)                                = 0

close(3)                                = -1 EBADF (Bad file descriptor)

gettimeofday({1201283232, 281438}, NULL) = 0

write(2, "mount: mount to NFS server \'172."..., 49) = -1 EPIPE (Broken pipe)

--- SIGPIPE (Broken pipe) @ a000000000010641 (1fb7) ---

Process 8119 detached

Comment 1 Bill Nottingham 2008-02-11 16:47:09 UTC

If you background yourself, you must be prepared for the fact that you are no
longer attached to the terminal. (This could also fail due to SELinux, for example.)

Comment 2 Karel Zak 2008-02-12 13:45:03 UTC

> <ip>:/work1    /work1          nfs rw,bg,intr,hard,wsize=32768,rsize=32768 0 0
> <ip>:/work2    /work2          nfs rw,bg,intr,hard,wsize=32768,rsize=32768 0 0
> We see from the "/var/log/messages" file that the mount for "/work1"
> issues the following message shortly after it is backgrounded.
> 
> mount: mount to NFS server 'XXX.XXX.XXX.XXX' failed
> 
> However, we never see that message coming from the background mount
> for "/work2".

It seems correct, you shouldn't see any initial message from /work2. See nfs(5)
man page:

   After a mount operation is backgrounded, all subsequent mounts on the same
   NFS server will be backgrounded immediately, without first attempting the
   mount.

Are you really sure that the second mount process (for /work2) really died?

Comment 3 Sunil Shinde 2008-02-13 06:55:57 UTC

(In reply to comment #2)
> > <ip>:/work1    /work1          nfs rw,bg,intr,hard,wsize=32768,rsize=32768 0 0
> > <ip>:/work2    /work2          nfs rw,bg,intr,hard,wsize=32768,rsize=32768 0 0
> > We see from the "/var/log/messages" file that the mount for "/work1"
> > issues the following message shortly after it is backgrounded.
> > 
> > mount: mount to NFS server 'XXX.XXX.XXX.XXX' failed
> > 
> > However, we never see that message coming from the background mount
> > for "/work2".
> 
> It seems correct, you shouldn't see any initial message from /work2. See nfs(5)
> man page:
> 
>    After a mount operation is backgrounded, all subsequent mounts on the same
>    NFS server will be backgrounded immediately, without first attempting the
>    mount.
> 
> Are you really sure that the second mount process (for /work2) really died?
> 

Yes i am very sure if I submit the mount command through initlog command the
second mount process dies.restarting/starting the netfs also has this behavior.

Comment 4 Jeff Layton 2008-03-20 12:34:42 UTC

Going through the different cases here. Set up 2 mounts in /etc/fstab:

server:/vol/vol0/home/jlayton/rhel4/1 /mnt/netapp1 nfs bg 0 0
server:/vol/vol0/home/jlayton/rhel4/2 /mnt/netapp2 nfs bg 0 0

Running mount -a seems to result in both daemons running after the initial mount
backgrounds them:

root      3816     1  0 08:26 pts/0    00:00:00 mount --
server:/vol/vol0/home/jlayton/rhel4/1
root      3817     1  0 08:26 pts/0    00:00:00 mount --
server:/vol/vol0/home/jlayton/rhel4/2

...both are blocked trying to open a socket to the portmapper on the server:

connect(3, {sa_family=AF_INET, sin_port=htons(111),
sin_addr=inet_addr("10.100.56.139")}, 16 <unfinished ...>

...this might be a factor. std* are still connected to the controlling terminal:

mount   3816 root    0u   CHR  136,0               2 /dev/pts/0
mount   3816 root    1u   CHR  136,0               2 /dev/pts/0
mount   3816 root    2u   CHR  136,0               2 /dev/pts/0

I'll do an attempt next by running "service netfs start"...

Comment 5 Jeff Layton 2008-03-20 13:08:34 UTC

Doing service nfs start shows a couple of processes just after it starts:

root      4074  4061  0 08:35 pts/1    00:00:00 initlog -q -c mount -a -t nfs,nfs4
root      4075  4074  0 08:35 pts/1    00:00:00 mount -a -t nfs,nfs4

...after running for a bit, I see the log messages come across saying that the
mount has been backgrounded, but both mount processes then disappear.

My suspicion is that initlog is exiting and that's causing the background mount
processes to die...

Comment 6 Jeff Layton 2008-03-20 14:08:52 UTC

yep...using gdb in follow-fork-mode:

(gdb) set follow-fork-mode child
(gdb) c
Continuing.
Attaching after fork to child process 2517.

Program received signal SIGPIPE, Broken pipe.
[Switching to process 2517]
0x000000337a2b9f12 in __write_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x000000337a2b9f12 in __write_nocancel () from /lib64/tls/libc.so.6
#1  0x000000337a265678 in _IO_new_file_write () from /lib64/tls/libc.so.6
#2  0x000000337a265867 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6
#3  0x000000337a23f6ff in buffered_vfprintf () from /lib64/tls/libc.so.6
#4  0x000000337a23f889 in vfprintf () from /lib64/tls/libc.so.6
#5  0x000000337a248576 in fprintf () from /lib64/tls/libc.so.6
#6  0x000000000040b2c8 in mount_errors (server=0x46 <Address 0x46 out of
bounds>, will_retry=1, bg=1) at nfsmount.c:139
#7  0x000000000040cf0c in nfsmount (spec=Variable "spec" is not available.
) at nfsmount.c:1263
#8  0x0000000000403fec in try_mount_one (spec0=0x51f7e0
"netapp-f87.gsslab.rdu.redhat.com:/vol/vol0/home/jlayton/rhel4/1", 
    node0=0x51ce20 "/mnt/netapp1", types0=0x51e6a0 "nfs", opts0=0x51f830 "bg",
freq=0, pass=0, bg=1, ro=0) at mount.c:859
#9  0x000000000040533a in mount_one (spec=0x51f7e0
"netapp-f87.gsslab.rdu.redhat.com:/vol/vol0/home/jlayton/rhel4/1", 
    node=0x51ce20 "/mnt/netapp1", types=0x51e6a0 "nfs", opts=0x51f830 "bg",
cmdlineopts=0x0, freq=0, pass=0) at mount.c:1301
#10 0x0000000000405b5f in main (argc=Variable "argc" is not available.
) at mount.c:1445
#11 0x000000337a21c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#12 0x00000000004030ea in _start ()
#13 0x0000007fbffffde8 in ?? ()
#14 0x000000000000001c in ?? ()
#15 0x0000000000000004 in ?? ()
#16 0x0000007fbfffff6c in ?? ()
#17 0x0000000000000000 in ?? ()

...mount_errors does a fprintf(stderr, ...) so it does look like the stderr pipe
closed here.

I suppose the question is -- is initlog expected to stick around for children of
the process that it spawns?

Comment 7 Jeff Layton 2008-03-20 14:57:03 UTC

No, it isn't. It repeatedly calls waitpid() with WNOHANG to see if it has exited.

Once the main process has exited, the parent exits and the children's stderr
fd's are closed. They then get a SIGPIPE when trying to write to stderr and
exit, but the parent process is now init and they get auto-reaped.

There are a couple of options I suppose:

1) change initlog so that it can wait for all children to exit (allow it to
follow forks in a sense). This could be problematic though since the shell
script probably won't exit if initlog doesn't exit...

2) change mount to run bg children in a silent mode (redirecting all std* fd's
to /dev/null) or to log to syslog directly. The former would probably be fairly
easy to implement. The latter may be tricky.

Comment 8 Jeff Layton 2008-03-20 15:33:09 UTC

This seems to be fixed in rawhide -- the backgrounded children dissociate from
the controlling terminal. I'm going to presume that that's ok here...

Comment 10 Jeff Layton 2008-03-20 17:56:08 UTC

Created attachment 298722 [details]
proposed patch

Proposed patch -- make backgrounded mount processes detach from controlling
terminal.

This patch seems to fix it. It just has the forked mount processes detach from
the controlling terminal. This prevents the SIGPIPE that occurs when the
process tries to write to stderr.

This does mean that there are no log messages produced from the backgrounded
processes, but this seems to be consistent with upstream behavior in this
regard.

Comment 12 RHEL Program Management 2008-03-27 11:58:59 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Jeff Layton 2008-03-27 15:00:24 UTC

Created attachment 299336 [details]
proposed patch 2

Updated patch. Make the daemonize() call a void function. There's not much we
can do differently if it fails, so just make it a best-effort thing.

Comment 14 Jeff Layton 2008-03-27 16:47:32 UTC

Committed in util-linux-2.12a-19.el4

Comment 15 Jeff Layton 2008-03-27 16:49:24 UTC

Created attachment 299358 [details]
patch 3

Patch committed to CVS. Same code, but rename function to be more clear. Also
add comment.

Comment 19 errata-xmlrpc 2008-07-24 20:01:47 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0751.html

Comment 20 Karel Zak 2008-12-23 22:46:58 UTC

*** Bug 445037 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.