Bug 433267

Summary: [Stratus 4.6.z bug] iounmap may sleep while holding vmlist_lock, causing a deadlock.
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.6.z
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: rc
Keywords: ZStream
Reporter: RHEL Program Management <pm-rhel>
Assignee: Vitaly Mayatskikh <vmayatsk>
QA Contact: Martin Jenner <mjenner>
CC: andriusb, chas.horvath, damin, dmair, jbaron, jskrabal, lwoodman, smcgrath
Fixed In Version: RHSA-2008-0167
Doc Type: Bug Fix
Last Closed: 2008-03-14 10:31:03 UTC
Bug Depends On: 361931    
Bug Blocks: 240187    

Description RHEL Program Management 2008-02-18 08:48:29 UTC
This bug has been copied from bug #361931 and has been proposed
to be backported to 4.6 z-stream (EUS).

Comment 2 Andrius Benokraitis 2008-02-18 13:48:13 UTC
Chas, this is slated to be delivered on 13-Mar-08 in the 4.6.z stream.

Comment 3 Andrius Benokraitis 2008-02-26 19:05:40 UTC
Jiri, what exactly do you need (and when) from Stratus? Bug 361931 has the exact
patch needed... Are we still on schedule for this being released 13-Mar-08?

Comment 4 Jiri Skrabal 2008-02-27 07:49:29 UTC
Hi Andrius,

I'm a little bit confused here. From the bug activity list I see that you set
the NEEDINFO flag, and now you are asking me what information I need.

It looks like you did it by mistake, or I missed something, or I made a mistake
myself. I'm still quite new here, so it may be my fault.

From the original bug history it's obvious that the patch has been tested on the
4.6 release and is working. The devel_ack is also set on the EUS bug. It looks
like the only blocking issue here is the bug status (NEEDINFO).

So I'm changing the status to ASSIGNED. The fix should still be delivered as
originally planned.



Comment 5 Andrius Benokraitis 2008-02-27 18:15:32 UTC
Thanks Jiri - I see Vitaly just spun kernel-2.6.9-67.0.7.EL, and I'm assuming
there will be another internal include/spin prior to the 13-Mar-08 GA date...

Comment 6 Vitaly Mayatskikh 2008-02-27 18:45:07 UTC
Patch included in kernel 2.6.9-67.0.7.EL

Comment 10 errata-xmlrpc 2008-03-14 10:31:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0167.html


Comment 11 damin 2008-03-24 19:26:22 UTC
Gentlemen, while this bug has been closed, I was wondering if there is any 
possible way that this issue could affect the mpt-fusion drivers and cause a 
potential Journal Abort error on an EXT3 filesystem.

[root@dmx-node5 ~]# 
Message from syslogd@dmx-node5 at Sat Mar 22 03:42:13 2008 ...
dmx-node5 kernel: journal commit I/O error

We are running under VMware ESX 3.5 w/ an iSCSI SAN, and in most cases, I would
attribute this to very high load on the SAN. In fact, I've not had issues w/ 
this until more recent kernels. At the time that this issue is happening (it is 
not isolated to this VM, but to all machines running 2.6.9-67.0.4) there is 
nominal load, and no indication of SAN timeout issues or SCSI mid-layer issues 
in the VM or the logs.

Am I chasing a red herring here? If so, any suggestions on debugging procedures
that I should use to diagnose the specific issue?

This does not seem to affect anything running RHEL5 w/ latest kernels.

Comment 12 R.H. 2008-05-02 19:04:45 UTC
We're running kernel-hugemem-2.6.9-67.EL on a RAID-10 setup, and just recently
we noticed "find" hanging in "D" state (uninterruptible sleep). Can this be
related?

time strace -c find . -type d >../find.out


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 49.44  114.793832          63   1821161           getdents64
 18.73   43.484328          16   2731259           lstat64
 11.44   26.549265          15   1820837           chdir
  5.78   13.427308          15    910426           close
  5.31   12.329499          14    910427           open
  4.78   11.103291          12    910426           fstat64
  4.42   10.251790          11    910419           fcntl64
  0.10    0.230356          31      7375           write
  0.00    0.000206          69         3           mremap
  0.00    0.000203         203         1           execve
  0.00    0.000127          21         6           read
  0.00    0.000124          31         4           munmap
  0.00    0.000103          13         8           old_mmap
  0.00    0.000061          12         5           mmap2
  0.00    0.000032          11         3           brk
  0.00    0.000029          15         2         1 access
  0.00    0.000028          14         2           mprotect
  0.00    0.000023          12         2           fchdir
  0.00    0.000012          12         1           time
  0.00    0.000012          12         1           uname
  0.00    0.000012          12         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00  232.170641              10022369         1 total

real    73m26.471s
user    1m27.163s
sys     5m54.949s
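
One way to read the timings above is to compare CPU time with wall-clock time. The following is a minimal sketch, not part of the original report, and the helper names `parse_time` and `cpu_fraction` are hypothetical. It converts values in the format printed by the shell `time` builtin to seconds and computes the share of elapsed time spent on the CPU. For the figures above the share comes to roughly 10%, so the run spent most of its time waiting rather than computing. That would be consistent with a process sitting in D state, but it does not by itself identify the cause.

```python
import re


def parse_time(text):
    """Convert a shell `time` value such as '73m26.471s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def cpu_fraction(real, user, system):
    """Fraction of wall-clock time that was spent in user plus system CPU time."""
    return (parse_time(user) + parse_time(system)) / parse_time(real)


if __name__ == "__main__":
    # Values copied from the `time` output in the comment above.
    fraction = cpu_fraction("73m26.471s", "1m27.163s", "5m54.949s")
    print(f"CPU share of wall-clock time: {fraction:.1%}")
```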

In addition, on two systems at this kernel level, "First orphan inode" shows up
in the output of tune2fs -l, like so:

First orphan inode:       9519201

On one of these hosts gconf problems appeared and df -hl reported /tmp as full.
du -sh did not agree with df, which isn't surprising since df keeps track of
things differently.

Please advise.