Bug 180029

Summary: deadlocks on ext2,sync mounted fs
Product: Red Hat Enterprise Linux 4 Reporter: Jure Pečar <pegasus>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED DUPLICATE QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0   
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-02-07 16:59:04 UTC Type: ---

Description Jure Pečar 2006-02-04 21:52:22 UTC
From Bugzilla Helper:
User-Agent: Opera/8.5 (X11; Linux i686; U; en)

Description of problem:
I usually keep my /boot on md raid1, formatted as ext2 and mounted 
sync. This works well across many different servers, but I'm noticing some 
problems with the latest RHEL4.

This problem has hit me twice so far, both times during rpm kernel installation 
and removal (of course, nothing else ever does I/O to /boot): once while rpm was 
running the postinstall grub update script, and a second time while it was 
removing the initrd of the previous kernel. What happens is that the process 
doing the I/O to /boot gets stuck in D state and never recovers. After a hard 
reset, the file it was blocked on is *gone*. At least it could have been picked 
up by fsck and dropped into lost+found ...

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-11.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. have /boot on md raid1, formatted as ext2 and mounted sync
2. do some rpm kernel install / remove
3. eventually it will deadlock
  

Actual Results:  I believe ps ax output shows the problem best:

3829 pts/0    S+     0:00 rpm -e kernel-smp-2.6.9-11.EL
3832 pts/0    S+     0:00 /bin/sh /var/tmp/rpm-tmp.43481 4
3907 pts/0    S+     0:00 /bin/bash /sbin/new-kernel-pkg --rminitrd --rmmoddep --remove 2.6.9-11.ELsmp
3931 pts/0    D+     0:00 rm -f /boot/initrd-2.6.9-11.ELsmp.img
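
To catch the same condition without reading through full ps output, a minimal sketch along these lines could list every task in state D together with the kernel symbol it is sleeping in. The function name find_blocked_tasks and the proc_root parameter are made up for illustration; the only thing relied on is the per-process status and wchan files under /proc, as used elsewhere in this report.

```python
import os


def find_blocked_tasks(proc_root="/proc"):
    """Return (pid, name, wchan) for every task in state D under proc_root.

    proc_root is a hypothetical parameter so the function can also be pointed
    at a saved copy of /proc; it is not part of any existing tool.
    """
    blocked = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "status")) as fh:
                # Each line looks like "Key:<tab>value"
                fields = dict(line.split(":", 1) for line in fh if ":" in line)
        except OSError:
            continue  # the process exited while we were scanning
        state = fields.get("State", "").strip()
        if not state.startswith("D"):
            continue
        try:
            with open(os.path.join(base, "wchan")) as fh:
                wchan = fh.read().strip()
        except OSError:
            wchan = "?"
        blocked.append((int(entry), fields.get("Name", "").strip(), wchan))
    return blocked
```

Calling find_blocked_tasks() with no arguments scans the live /proc. A task that shows up in this list on every run, always with the same wchan, is a likely candidate for the kind of hang described here.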


Expected Results:  An ext2,sync mount should work the way it is meant to work :)
Deadlocks are not acceptable on any filesystem.

Additional info:

I did some digging in /proc/3931. I believe this is the relevant data:

# cat /proc/3931/maps 
00400000-00409000 r-xp 00000000 09:02 7536693                            /bin/rm
00508000-00509000 rw-p 00008000 09:02 7536693                            /bin/rm
00509000-0052a000 rwxp 00509000 00:00 0 
2a95556000-2a95557000 rw-p 2a95556000 00:00 0 
2a95563000-2a95565000 rw-p 2a95563000 00:00 0 
2a95565000-2a97b20000 r--p 00000000 09:02 3701737                        /usr/lib/locale/locale-archive
369f700000-369f715000 r-xp 00000000 09:02 7290882                        /lib64/ld-2.3.4.so
369f814000-369f816000 rw-p 00014000 09:02 7290882                        /lib64/ld-2.3.4.so
369f900000-369fa2a000 r-xp 00000000 09:02 7291090                        /lib64/tls/libc-2.3.4.so
369fa2a000-369fb29000 ---p 0012a000 09:02 7291090                        /lib64/tls/libc-2.3.4.so
369fb29000-369fb2c000 r--p 00129000 09:02 7291090                        /lib64/tls/libc-2.3.4.so
369fb2c000-369fb2f000 rw-p 0012c000 09:02 7291090                        /lib64/tls/libc-2.3.4.so
369fb2f000-369fb33000 rw-p 369fb2f000 00:00 0 
7fbfffe000-7fc0000000 rw-p 7fbfffe000 00:00 0 
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 

# cat /proc/3931/stat
3931 (rm) D 3907 3829 3398 34816 3829 4194560 134 0 0 0 0 0 0 0 18 0 1 0 25921 42160128 108 18446744073709551615 4194304 4227748 548682070624 18446744073709551615 234606008713 0 0 0 0 18446744071563606489 0 0 17 0 0 0
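
The stat line is hard to read by eye, so here is a rough sketch of how it could be decoded. The field positions follow the layout documented in proc(5) for current kernels; that this 2.6.9 kernel uses the same positions is an assumption, although ppid, thread count and vsize do agree with the status output in this report. The helper name parse_stat and the choice of fields are illustrative only.

```python
# The stat line from this report, joined back into a single line.
SAMPLE_STAT = (
    "3931 (rm) D 3907 3829 3398 34816 3829 4194560 134 0 0 0 0 0 0 0 18 0 1 0 "
    "25921 42160128 108 18446744073709551615 4194304 4227748 548682070624 "
    "18446744073709551615 234606008713 0 0 0 0 18446744071563606489 0 0 17 0 0 0"
)


def parse_stat(line):
    """Decode a few fields of a /proc/<pid>/stat line (hypothetical helper).

    Field numbers are the 1-based positions from proc(5); assumed, not
    verified, to match the kernel in this report.
    """
    # comm (field 2) may itself contain spaces or parentheses, so split
    # around the first "(" and the last ")".
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state)

    def field(n):
        return rest[n - 3]

    return {
        "pid": int(pid),
        "comm": comm,
        "state": field(3),
        "ppid": int(field(4)),
        "num_threads": int(field(20)),
        "vsize_kb": int(field(23)) // 1024,  # vsize is reported in bytes
        "rss_pages": int(field(24)),
        "startcode": hex(int(field(26))),
        "wchan": hex(int(field(35))),  # address the task is sleeping at
    }


if __name__ == "__main__":
    for key, value in parse_stat(SAMPLE_STAT).items():
        print(f"{key:12} {value}")
```

The wchan field here is a raw kernel address, while /proc/3931/wchan gives the symbol name, so the two are complementary views of the same wait point.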

# cat /proc/3931/status 
Name:	rm
State:	D (disk sleep)
SleepAVG:	78%
Tgid:	3931
Pid:	3931
PPid:	3907
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	256
Groups:	0 1 2 3 4 6 10 
VmSize:	   41172 kB
VmLck:	       0 kB
VmRSS:	     432 kB
VmData:	     160 kB
VmStk:	       8 kB
VmExe:	      32 kB
VmLib:	    1280 kB
StaBrk:	00509000 kB
Brk:	0052a000 kB
StaStk:	7fbffffa60 kB
Threads:	1
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	0000000000000000
CapInh:	0000000000000000
CapPrm:	00000000fffffeff
CapEff:	00000000fffffeff

# cat /proc/3931/wchan 
__lock_buffer


As I have to put this machine into production early next week, I'm afraid I won't 
be able to do any more tests on it. But as it's easy to recreate the situation, 
I don't believe this is much of a problem.

Btw, it's a dual Opteron ... if SMP is a factor here at all.

Comment 1 Jason Baron 2006-02-07 16:59:04 UTC

*** This bug has been marked as a duplicate of 180028 ***