Bug 836095 - for kernel-2.6.18-308.8.2.el5; The user-mode processes are waiting_uninterruptible in kernel-mode. can not reboot machine.
Status: CLOSED DUPLICATE of bug 848706
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.10
Hardware: x86_64 Linux
Priority: unspecified, Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks:
Reported: 2012-06-28 02:23 EDT by Mitz Amano
Modified: 2013-04-24 00:27 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-04-24 00:27:39 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---


Attachments
First sighting of the waiting_uninterruptible processes on the unmodified ("pure") Red Hat kernel (90.60 KB, image/jpeg)
2012-06-28 02:23 EDT, Mitz Amano
After 1 hour, the tasks are still waiting_uninterruptible (84.04 KB, image/jpeg)
2012-06-28 02:25 EDT, Mitz Amano
There seems to be a deadlock between nfsd (PID 3486) and genload (PID 3517) (24.93 KB, application/x-gzip)
2012-06-28 02:46 EDT, Mitz Amano
Another coredump analysis for the merged version (also related to the NFS commit changes) (57.00 KB, application/msword)
2012-06-28 03:41 EDT, Mitz Amano
Detailed data supporting comment #4 (80.50 KB, application/octet-stream)
2012-06-28 03:43 EDT, Mitz Amano

Description Mitz Amano 2012-06-28 02:23:40 EDT
Created attachment 594938 [details]
First sighting of the waiting_uninterruptible processes on the unmodified ("pure") Red Hat kernel

Description of problem:

User-mode processes are stuck in kernel mode in the waiting_uninterruptible state (uninterruptible sleep, state D), and the machine cannot be rebooted.

The affected processes are genload and cp.


Version-Release number of selected component (if applicable):

kernel-2.6.18-308.8.2.el5

How reproducible:

Reproducible with ltp-full-20100331.gz by running /opt/ltp/testscripts/ltpstress.sh

Steps to Reproduce:
1. Install ltp-full-20100331.gz
2. cd /opt/ltp/testscripts/
3. ./ltpstress.sh -d /root/gchen/redhat_output/datafile.out -l /root/gchen/redhat_output/logfile.log -t 48 -S
  
Actual results:

4. Wait about 1 day.
5. The genload processes and some other processes are found in the waiting_uninterruptible state.
6. They cannot be killed with "kill -9", and the machine cannot be rebooted.
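
For anyone trying to confirm step 5 without a screenshot, here is a minimal sketch that lists tasks in uninterruptible sleep by reading /proc/<pid>/stat. It assumes Python 3 is available on the test machine (it is not part of the original report), and the helper names parse_stat_line and find_uninterruptible are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every task whose state letter is D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

Running it twice some minutes apart and comparing the output shows whether the same PIDs stay in state D, which is what the attached screenshots are meant to show.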

Expected results:

The stress test should pass for the full 2 days, and the machine should be able to reboot afterwards.

Additional info:

This issue occurs on the unmodified ("pure") Red Hat kernel, which I downloaded from the website.

The history is:
   1st: I merged the Asianux patches into this Red Hat kernel version (this is my job).

   2nd: I found the issue.

   3rd: I downloaded the pure Red Hat version from the website.

   4th: I tested again, and the issue reproduced.

   5th: I generated a coredump with the merged version.

   6th: I used crash to analyze it.

   7th: I think it is an NFS issue, related to these changelog entries:

* Thu Mar 22 2012 Alexander Gordeev <agordeev@redhat.com> [2.6.18-308.3.1.el5]
- [fs] nfs: allow high priority COMMITs to bypass inode commit lock (Jeff Layton) [799941 773777] [Bug7]
- [fs] nfs: don't skip COMMITs if system under is mem pressure (Jeff Layton) [799941 773777] [Bug7]

   8th: I will attach evidence of the issue.

   9th: I will attach the results of my coredump analysis.


Last: please help check this (ideally, help resolve it). Thanks.

: )
Comment 1 Mitz Amano 2012-06-28 02:25:37 EDT
Created attachment 594942 [details]
After 1 hour, the tasks are still waiting_uninterruptible

and they cannot be killed with "kill -9".

When the reboot command is issued, it hangs waiting for them.
Comment 2 Mitz Amano 2012-06-28 02:46:00 EDT
Created attachment 594948 [details]
There seems to be a deadlock between nfsd (PID 3486) and genload (PID 3517)


This is my analysis so far, not the final result.

I am reporting the current status now and will continue the analysis.


The current status is (it may not be correct):

   1) genload calls nfs_commit_inode, which takes the NFS_INO_COMMIT lock;

      A) nfs_commit_inode calls nfs_commit_list;

      B) nfs_commit_list sends an RPC to the remote side (as an ASYNC task);

      C) genload waits for the task to finish, and only then releases the lock;

   2) nfsd (the NFS export is mounted on the local machine) receives the task;

      A) it calls svc_process -> nfsd_dispatch ...

      B) eventually it calls __alloc_pages, a common kernel function;

          I) which calls .... -> nfs_commit_inode -> nfs_commit_set_lock;

          II) nfsd is now waiting for NFS_INO_COMMIT;

    3) After analyzing the data:

       A) nfsd and genload are working on the same inode;

       B) the RPC task has indeed not been released; it is ACTIVE;

       C) after kill -9, genload left the RPC module but is still waiting for the commit task;

    4) Current conclusion:

       A) It is a deadlock between nfsd and genload;

    5) Next steps:

       A) collect the full data flow between genload and nfsd;

       B) if all of the data supports the current conclusion, then we have proven it is the root cause.
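
The circular wait suspected in 1) and 2) can be sketched in user space. This is only a toy model under stated assumptions, not kernel code: a threading.Lock stands in for NFS_INO_COMMIT, an Event stands in for completion of the async COMMIT RPC, and the nfsd side gives up after a timeout so that the script terminates (the real tasks have no such timeout). The function name simulate_commit_deadlock is hypothetical.

```python
import threading


def simulate_commit_deadlock(nfsd_timeout=0.05):
    """Toy model: genload holds the commit lock while nfsd needs it to finish the RPC."""
    commit_lock = threading.Lock()     # stands in for NFS_INO_COMMIT
    rpc_done = threading.Event()       # completion of the async COMMIT RPC
    lock_taken = threading.Event()     # genload has entered its critical section
    nfsd_gave_up = threading.Event()   # simulation-only escape hatch
    result = {}

    def genload():
        # 1) take the commit lock, send the RPC, wait for it to finish
        with commit_lock:
            lock_taken.set()
            while not (rpc_done.is_set() or nfsd_gave_up.is_set()):
                rpc_done.wait(0.01)
            result["rpc_completed"] = rpc_done.is_set()

    def nfsd():
        # 2) while serving the RPC, memory reclaim needs the same commit lock
        lock_taken.wait()
        got = commit_lock.acquire(timeout=nfsd_timeout)
        result["nfsd_got_lock"] = got
        if got:
            commit_lock.release()
            rpc_done.set()             # only now could the RPC complete
        else:
            nfsd_gave_up.set()         # in the kernel, nfsd would wait forever

    threads = [threading.Thread(target=genload), threading.Thread(target=nfsd)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(simulate_commit_deadlock())
```

In this model nfsd never obtains the lock and the RPC never completes, which mirrors the state described in 3) B). Whether the kernel paths really behave this way is exactly what the remaining analysis has to confirm.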
Comment 3 Mitz Amano 2012-06-28 02:55:33 EDT
For comment 2: the coredump was generated with the merged version (not the pure version), but the result is the same (there, genload and nfsd are waiting_uninterruptible instead of cp and genload).


Also, please tell me where to find the *debuginfo* rpm for this kernel, so that I can download it, reproduce the issue, and analyze a coredump from the pure Red Hat version.
Comment 4 Mitz Amano 2012-06-28 03:41:42 EDT
Created attachment 594958 [details]
Another coredump analysis for the merged version (also related to the NFS commit changes)


Root cause: (everything happens in the fs/nfs subsystem, in write.c)
    a.	nfs_sync_inode_wait is called by one process (such as fsx-linux).
    b.	At the same time, nfs_commit_inode is called by another process (such as genload).
    c.	A deadlock occurs:
        i.	fsx-linux waits for the request to finish (which genload would do next);
        ii.	genload (and likewise any other process that synchronously commits all requests) waits for fsx-linux to release the commit lock.


See the attachment for more details.
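
The two waits in c. form a cycle in a wait-for graph. As a small illustration, assuming each task waits on at most one other task, the cycle can be found mechanically. The function name find_wait_cycle and the dictionary layout are made up for this sketch and are not taken from the coredump.

```python
def find_wait_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of task names, or None.

    waits_for maps each blocked task to the single task it is waiting on.
    """
    for start in waits_for:
        path = []
        seen = set()
        node = start
        # Follow the chain of waits until it ends or revisits a task.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain came back to a task already on the path: a deadlock.
            return path[path.index(node):]
    return None


if __name__ == "__main__":
    # fsx-linux waits for the request that genload would finish;
    # genload waits for the commit lock that fsx-linux holds.
    graph = {"fsx-linux": "genload", "genload": "fsx-linux"}
    print(find_wait_cycle(graph))
```

With the waits from c.i and c.ii as input, the function reports both tasks as one cycle. A chain that ends at a runnable task returns None.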
Comment 5 Mitz Amano 2012-06-28 03:43:32 EDT
Created attachment 594959 [details]
Detailed data supporting comment #4


This is the detailed data supporting comment #4.
Comment 6 Mitz Amano 2012-09-10 22:00:34 EDT
This can be closed; it is the same issue as bug 848706.
Comment 7 Linda Wang 2013-04-24 00:27:39 EDT

*** This bug has been marked as a duplicate of bug 848706 ***
