From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901
Description of problem:
On the (nfs) server, there is a script (cycle_nfs.sh) which does the following
(it actually performs only step 5; steps 1-4 must be done before running the script):
1. assign ip-address to eth0:1
2. start nfs services (i.e. /etc/init.d/nfs start)
3. mount /dev/sdc1..6 onto /extdisk1..6
4. nfs export disk partitions
exportfs -o rw,no_root_squash client-ip:/extdisk1..6
5. for i = 1 to 50 do
5.1 bring down eth0:1
5.2 unexport nfs partitions (i.e. exportfs -u -a)
5.3 umount /extdisk1..6
5.4 sleep 5 (seconds)
5.5 mount /extdisk1..6
5.6 nfs export all partitions /extdisk1..6
5.7 bring up eth0:1
5.8 sleep 25 (seconds)
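For readers without the attachment, step 5 could be sketched roughly as below. This is not the attached cycle_nfs.sh; it is a minimal reconstruction from the step list. The use of ifconfig for the eth0:1 alias, the SERVER_IP/CLIENT_IP placeholder addresses, the DRY_RUN and RUN_CYCLE switches, and the function names are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of step 5 (NOT the attached cycle_nfs.sh).
# Assumptions: ifconfig manages the eth0:1 alias; SERVER_IP and CLIENT_IP
# are placeholder addresses; devices/mount points follow the description.
DRY_RUN=${DRY_RUN:-1}                 # 1 = only print the commands
SERVER_IP=${SERVER_IP:-192.0.2.10}    # placeholder address for eth0:1
CLIENT_IP=${CLIENT_IP:-192.0.2.20}    # placeholder NFS client address
IFACE=eth0:1

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# One pass through steps 5.1 - 5.8.
cycle_once() {
    run ifconfig "$IFACE" down                               # 5.1
    run exportfs -u -a                                       # 5.2
    for n in 1 2 3 4 5 6; do
        run umount "/extdisk$n"                              # 5.3
    done
    run sleep 5                                              # 5.4
    for n in 1 2 3 4 5 6; do
        run mount "/dev/sdc$n" "/extdisk$n"                  # 5.5
    done
    for n in 1 2 3 4 5 6; do
        run exportfs -o rw,no_root_squash "$CLIENT_IP:/extdisk$n"   # 5.6
    done
    run ifconfig "$IFACE" "$SERVER_IP" up                    # 5.7
    run sleep 25                                             # 5.8
}

# Repeat the cycle; the description uses 50 iterations.
cycle_nfs() {
    i=1
    while [ "$i" -le "${1:-50}" ]; do
        cycle_once
        i=$((i + 1))
    done
}

# Only start cycling when explicitly asked to.
if [ "${RUN_CYCLE:-0}" = 1 ]; then
    cycle_nfs 50
fi
```

With the defaults the sketch only prints the commands it would run. Setting DRY_RUN=0 and RUN_CYCLE=1 (as root, after steps 1-4) would execute them; there is no error handling, so a hung umount simply stalls the loop, which is the symptom described here.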
The client runs a simple program which does random reads/writes on the NFS-mounted
partitions. Along with this, on the server (where the above script is cycling the
NFS server), another program (called loopopen) repeatedly opens and closes (open(),
close()) all device files (i.e. /dev/sdcX) in a loop (attached file: loopopen.c).
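The attached loopopen.c is the real reproducer. As a rough stand-in, the same open/close pattern can be sketched in shell, where a read redirection opens the file and closes it again immediately. The function name, the bounded round count and the printed success counter are assumptions for illustration; the attached program may behave differently (for example, loop forever).

```shell
#!/bin/sh
# Hypothetical shell stand-in for loopopen.c (NOT the attached program).
# Usage: loopopen ROUNDS FILE...
# Opens each file read-only and closes it again, ROUNDS times over,
# then prints how many opens succeeded.
loopopen() {
    rounds=$1
    shift
    opened=0
    r=0
    while [ "$r" -lt "$rounds" ]; do
        for dev in "$@"; do
            # The redirection performs the open(); the descriptor is
            # closed as soon as 'true' returns.
            if { true < "$dev"; } 2>/dev/null; then
                opened=$((opened + 1))
            fi
        done
        r=$((r + 1))
    done
    echo "$opened"
}
```

For example, `loopopen 1000 /dev/sdc1 /dev/sdc2` (run as root on the server) would make 2000 open attempts; on an affected kernel one of the opens would be expected to block rather than return.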
After a couple of iterations, the following things happen:
1. the script hangs while in the loop, because the umount program in the
script hangs
2. the loopopen program on the server hangs while doing open()
on a device.
[the output of 'ps -Alef' is attached (ps.out)]
But the server is still telnet-accessible and sort of running.
Note: the above script (cycle_nfs.sh) is very similar to the one in bug report
61546, although here the filesystem is EXT2.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. perform steps 1-4 described in 'Description' section
2. have a client access the exported partitions (we wrote a simple
program to cp,rm,mv 1-MB of files to/from the server, on the client)
3. run the cycle_nfs.sh script on the server (which does step 5 mentioned in
the 'Description' section).
4. run the loopopen program on the server
5. wait for umount and loopopen program to hang!!
Actual Results: the umount program hangs while unmounting one of the devices,
and a concurrent open (being done by the loopopen program) hangs on the same
device. The script hangs, the loopopen program hangs :(
Expected Results: Neither umount nor the loopopen program should hang. Neither
does in normal operation, when the device is not being exported as an NFS share.
Starting NFS and having a client access it causes this weird and crazy behavior.
Looks to be a [ext2 + kernel + nfsd] bug.
System setup details
- RedHat 7.2 + errata
- Kernel: 2.4.9-13 (UP machine)
- uname -a ==>
Linux vcslinux20 2.4.9-13 #1 Tue Oct 30 20:11:04 EST 2001 i686 unknown
Hardware setup details
- Dell PowerEdge 1300 server and client.
- Adapter cards with HP and Quantum SCSI disks
- using aic7xxx scsi driver (ver 5.2.4), the default installed by RH
The following text files contain additional information that might be useful for
debugging. The output was taken after the script and program (loopopen) hung.
1. ps.out - full output of 'ps -Aelf' command.
2. sysrq-M.out - memory stat (output of SysRq-M, taken from /var/log/messages)
3. sysrq-T.out - task list (output of SysRq-T, taken from /var/log/messages)
4. scsi.out - output of /proc/scsi/scsi
5. aic_0/1.out - output of /proc/scsi/aic7xxx/0,1
6. cycle_nfs.sh - sample script. [change device and mount points]
7. loopopen.c - the loopy open/close c-code [change device and mount points]
Created attachment 49444 [details]
output of ps -Aelf command. shows umount and loopopen hung.
Created attachment 49445 [details]
output of magic-SysRq-'T' key (captured from dmesg)
Created attachment 49446 [details]
output of magic-SysRq-'M' key
Created attachment 49447 [details]
Created attachment 49448 [details]
cat /proc/scsi/aic7xxx/1 (aic7xxx driver info)
Created attachment 49449 [details]
script to cycle NFS server (step 5 in description section)
Created attachment 49450 [details]
'loopopen' code which does open()/close() in a loop, on set of device files.
There does indeed seem to be a deadlock in the VFS layer of 2.4.9-31, but only when
opening device nodes of a mounted fs directly. We'll consider modifying the VFS
to fix this; however, since it only triggers with actions that aren't done in any
normal use (direct device access while the same device is mounted in Linux
gives undefined results), we might not fix this in the 2.4.9 kernel
series (it's root-only, so not a security issue). Later kernels (2.4.18+) have