From Bugzilla Helper:
User-Agent: Opera/8.5 (X11; Linux i686; U; en)

Description of problem:
I usually keep my /boot on md raid1, formatted as ext2 and mounted sync. This works well across many different servers, but I'm noticing some problems with the latest RHEL4.

The problem has hit me twice so far, both times during rpm kernel installation and removal (of course, nothing else ever does I/O to /boot): once while rpm was running the postinstall grub update script, and once while it was removing the initrd of the previous kernel.

What happens is that the process doing the I/O to /boot gets stuck in D state and never recovers. After a hard reset, the file it locked on is *gone*. It could at least have been picked up by fsck and dropped into lost+found ...

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-11.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. have /boot on md raid1, formatted as ext2 and mounted sync
2. do some rpm kernel install / remove
3. eventually it will deadlock

Actual Results:
I believe the ps ax output shows the problem best:

3829 pts/0    S+     0:00 rpm -e kernel-smp-2.6.9-11.EL
3832 pts/0    S+     0:00 /bin/sh /var/tmp/rpm-tmp.43481 4
3907 pts/0    S+     0:00 /bin/bash /sbin/new-kernel-pkg --rminitrd --rmmoddep --remove 2.6.9-11.ELsmp
3931 pts/0    D+     0:00 rm -f /boot/initrd-2.6.9-11.ELsmp.img

Expected Results:
Any ext2,sync mount has to work the way it is meant to work :) Deadlocks are not wanted on any filesystem.

Additional info:
I did some digging in /proc/3931.
I believe these are the relevant data:

# cat /proc/3931/maps
00400000-00409000 r-xp 00000000 09:02 7536693 /bin/rm
00508000-00509000 rw-p 00008000 09:02 7536693 /bin/rm
00509000-0052a000 rwxp 00509000 00:00 0
2a95556000-2a95557000 rw-p 2a95556000 00:00 0
2a95563000-2a95565000 rw-p 2a95563000 00:00 0
2a95565000-2a97b20000 r--p 00000000 09:02 3701737 /usr/lib/locale/locale-archive
369f700000-369f715000 r-xp 00000000 09:02 7290882 /lib64/ld-2.3.4.so
369f814000-369f816000 rw-p 00014000 09:02 7290882 /lib64/ld-2.3.4.so
369f900000-369fa2a000 r-xp 00000000 09:02 7291090 /lib64/tls/libc-2.3.4.so
369fa2a000-369fb29000 ---p 0012a000 09:02 7291090 /lib64/tls/libc-2.3.4.so
369fb29000-369fb2c000 r--p 00129000 09:02 7291090 /lib64/tls/libc-2.3.4.so
369fb2c000-369fb2f000 rw-p 0012c000 09:02 7291090 /lib64/tls/libc-2.3.4.so
369fb2f000-369fb33000 rw-p 369fb2f000 00:00 0
7fbfffe000-7fc0000000 rw-p 7fbfffe000 00:00 0
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0

# cat /proc/3931/stat
3931 (rm) D 3907 3829 3398 34816 3829 4194560 134 0 0 0 0 0 0 0 18 0 1 0 25921 42160128 108 18446744073709551615 4194304 4227748 548682070624 18446744073709551615 234606008713 0 0 0 0 18446744071563606489 0 0 17 0 0 0

# cat /proc/3931/status
Name:   rm
State:  D (disk sleep)
SleepAVG:       78%
Tgid:   3931
Pid:    3931
PPid:   3907
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 256
Groups: 0 1 2 3 4 6 10
VmSize:    41172 kB
VmLck:         0 kB
VmRSS:       432 kB
VmData:      160 kB
VmStk:         8 kB
VmExe:        32 kB
VmLib:      1280 kB
StaBrk: 00509000 kB
Brk:    0052a000 kB
StaStk: 7fbffffa60 kB
Threads:        1
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff

# cat /proc/3931/wchan
__lock_buffer

As I have to put this machine into production early next week, I'm afraid I won't be able to do any more tests on it. But as the situation is easy to recreate, I don't believe this is much of a problem.
Btw, it's a dual Opteron, in case SMP is a factor here at all.
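
For anyone who wants to catch a hang like this while it is happening, the manual /proc inspection above could be scripted roughly as follows. This is only a sketch and not part of the original report: the function names parse_proc_stat and find_uninterruptible are made up for illustration, and it assumes a Linux /proc that exposes the per-process stat and wchan files shown in the dumps above.

```python
import os


def parse_proc_stat(line):
    """Extract pid, comm, state and ppid from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so find the first '(' and the last ')' instead of splitting blindly.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    rest = line[close_idx + 1:].split()
    return {"pid": pid, "comm": comm, "state": rest[0], "ppid": int(rest[1])}


def find_uninterruptible(proc_root="/proc"):
    """Return (pid, comm, wchan) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                info = parse_proc_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unparsable
        if info["state"] != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"  # wchan unreadable or not provided by this kernel
        stuck.append((info["pid"], info["comm"], wchan))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm, wchan in find_uninterruptible():
        print(pid, comm, wchan)
```

A process that shows up in this list briefly is normal; one that stays there with the same wchan (here __lock_buffer) across repeated runs matches the symptom described in this report.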
*** This bug has been marked as a duplicate of 180028 ***