Created attachment 532068 [details]
Source code for the binary mentioned in the bug description.

Description of problem:
While running `ps aux`, the command hangs. strace shows that it stalls reading /proc/11270/cmdline; 11270 happens to be rhn_check.

Version-Release number of selected component (if applicable):
procps-3.2.8-17.el6.x86_64

How reproducible:
Had an NFS mount on RHEL 6.1 which talked to a GlusterFS 3.2 NFS server running on a different RHEL 6.1 instance. fs-perf-test was being run on the mount (the source for fs-perf-test, new-fs-perf.c, is attached). fs-perf-test became unresponsive, and while debugging that problem I observed that `ps aux` hangs on the machine that has the NFS mount.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

strace output:

stat("/proc/11263", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/11263/stat", O_RDONLY) = 5
read(5, "11263 (awk) S 11098 11018 11018 "..., 1023) = 228
close(5) = 0
open("/proc/11263/status", O_RDONLY) = 5
read(5, "Name:\tawk\nState:\tS (sleeping)\nTg"..., 1023) = 889
close(5) = 0
open("/proc/11263/cmdline", O_RDONLY) = 5
read(5, "awk\0-v\0progname\0/etc/cron.daily/"..., 2047) = 137
close(5) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
write(1, "root 11263 0.0 0.0 9068 "..., 143root 11263 0.0 0.0 9068 768 ? SN Nov06 0:00 awk -v progname /etc/cron.daily/mlocate.cron progname {????? print progname ) = 143
stat("/proc/11268", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/11268/stat", O_RDONLY) = 5
read(5, "11268 (updatedb) D 11262 11018 1"..., 1023) = 235
close(5) = 0
open("/proc/11268/status", O_RDONLY) = 5
read(5, "Name:\tupdatedb\nState:\tD (disk sl"..., 1023) = 898
close(5) = 0
open("/proc/11268/cmdline", O_RDONLY) = 5
read(5, "/usr/bin/updatedb\0-f\0sysfs\nrootf"..., 2047) = 214
close(5) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
write(1, "root 11268 0.0 0.0 4076 "..., 143root 11268 0.0 0.0 4076 760 ? DN Nov06 0:00 /usr/bin/updatedb -f sysfs?rootfs?bdev?proc?cgroup?cpuset?tmpfs?devtmpfs?binf ) = 143
stat("/proc/11270", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/11270/stat", O_RDONLY) = 5
read(5, "11270 (rhn_check) D 1726 1726 17"..., 1023) = 249
close(5) = 0
open("/proc/11270/status", O_RDONLY) = 5
read(5, "Name:\trhn_check\nState:\tD (disk s"..., 1023) = 883
close(5) = 0
open("/proc/11270/cmdline", O_RDONLY) = 5
read(5,
^^^^^^^^^^^^^^ stalls here forever.

/var/log/messages has the following information:

Nov 6 04:44:17 RHEL6 kernel: INFO: task rhn_check:11270 blocked for more than 120 seconds.
Nov 6 04:44:17 RHEL6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 04:44:17 RHEL6 kernel: rhn_check D 0000000000000000 0 11270 1726 0x00000080
Nov 6 04:44:17 RHEL6 kernel: ffff880046713e08 0000000000000082 0000000000000000 ffffffff8120e3d2
Nov 6 04:44:17 RHEL6 kernel: ffff88003737a6c0 ffffffff812064af ffff880046713d98 00000001610e6889
Nov 6 04:44:17 RHEL6 kernel: ffff8800792ebb38 ffff880046713fd8 000000000000f598 ffff8800792ebb38
Nov 6 04:44:17 RHEL6 kernel: Call Trace:
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8120e3d2>] ? selinux_inode_permission+0x72/0xb0
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff812064af>] ? security_inode_permission+0x1f/0x30
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff814dd3c5>] rwsem_down_failed_common+0x95/0x1d0
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff814dd523>] rwsem_down_write_failed+0x23/0x30
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8126e773>] call_rwsem_down_write_failed+0x13/0x20
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff814dca22>] ? down_write+0x32/0x40
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8112debc>] sys_mmap_pgoff+0x5c/0x2d0
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff810104e9>] sys_mmap+0x29/0x30
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Nov 7 02:11:22 RHEL6 abrt[15069]: saved core dump of pid 15068 (/usr/libexec/fprintd) to /var/spool/abrt/ccpp-1320649882-15068.new/coredump (757760 bytes)
Nov 7 02:11:22 RHEL6 abrtd: Directory 'ccpp-1320649882-15068' creation detected
Nov 7 02:11:22 RHEL6 abrtd: Crash is in database already (dup of /var/spool/abrt/ccpp-1318151537-1579)
Nov 7 02:11:22 RHEL6 abrtd: Deleting crash ccpp-1320649882-15068 (dup of ccpp-1318151537-1579), sending dbus signal
Nov 7 02:11:27 RHEL6 abrt[15076]: not dumping repeating crash in '/usr/libexec/fprintd'
Nov 7 02:11:30 RHEL6 abrt[15081]: not dumping repeating crash in '/usr/libexec/fprintd'
Hello Raghavendra. This seems to be a stalled read from the cmdline file. Please try to repeat the test and, once it gets frozen, try to read /proc/<PID>/cmdline with the cat command from a different terminal.

Expected results:
1.) The cat command gets frozen too ... -OR-
2.) The cat command returns a correct output and successfully terminates ... -OR-
3.) something else (+ describe what happens) ...

Thanks in advance.

Regards, Jaromir.
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.
I have been able to reproduce the issue. Again, the machine that mounts the NFS export from the gluster-nfs server running on another machine (glusterfs version: glusterfs-3.3.0qa25) hangs when the `ps aux` and `w` commands are executed.

Ran 'w' under strace:

read(4, "2014 (rpc.statd) S 1 2014 2014 0"..., 1023) = 250
close(4) = 0
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=1339, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9b33f94000
read(4, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1339
close(4) = 0
munmap(0x7f9b33f94000, 4096) = 0
open("/proc/2014/cmdline", O_RDONLY) = 4
read(4, "rpc.statd\0", 2047) = 10
close(4) = 0
stat("/proc/6686", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/6686/stat", O_RDONLY) = 4
read(4, "6686 (run.sh) S 32277 6686 32277"..., 1023) = 246
close(4) = 0
open("/proc/6686/cmdline", O_RDONLY) = 4
read(4, "/bin/bash\0/opt/qa/tools/system_l"..., 2047) = 67
close(4) = 0
stat("/proc/6804", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/6804/stat", O_RDONLY) = 4
read(4, "6804 (bonnie.sh) S 6686 6686 322"..., 1023) = 245
close(4) = 0
open("/proc/6804/cmdline", O_RDONLY) = 4
read(4, "/bin/bash\0/opt/qa/tools/system_l"..., 2047) = 62
close(4) = 0
stat("/proc/6806", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/6806/stat", O_RDONLY) = 4
read(4, "6806 (bonnie++) S 6804 6686 3227"..., 1023) = 236
close(4) = 0
open("/proc/6806/cmdline", O_RDONLY) = 4
read(4, "bonnie++\0-u\0root\0-d\0/nfs/run6686"..., 2047) = 34
close(4) = 0
stat("/proc/8466", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/8466/stat", O_RDONLY) = 4
read(4, "8466 (rpc.statd) S 1 8466 8466 0"..., 1023) = 252
close(4) = 0
open("/proc/8466/cmdline", O_RDONLY) = 4
read(4, "rpc.statd\0", 2047) = 10
close(4) = 0
stat("/proc/10697", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/10697/stat", O_RDONLY) = 4
read(4, "10697 (rhn_check) D 1692 1692 16"..., 1023) = 243
close(4) = 0
open("/proc/10697/cmdline", O_RDONLY) = 4
read(4,

On the NFS client, iozone was running (iozone -a), which also hung immediately after the command was executed. cat /proc/PID/cmdline hangs as well. This is the output of 'cat /proc/10697/cmdline':

cat /proc/10697/cmdline
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C

Ran strace cat /proc/10697/cmdline.
This is the output:

strace cat /proc/10697/cmdline
execve("/bin/cat", ["cat", "/proc/10697/cmdline"], [/* 28 vars */]) = 0
brk(0) = 0x14d4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4adc8fb000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=40617, ...}) = 0
mmap(NULL, 40617, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4adc8f1000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\355!\2750\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1904456, ...}) = 0
mmap(0x30bd200000, 3729576, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x30bd200000
mprotect(0x30bd386000, 2093056, PROT_NONE) = 0
mmap(0x30bd585000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x185000) = 0x30bd585000
mmap(0x30bd58a000, 18600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x30bd58a000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4adc8f0000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4adc8ef000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4adc8ee000
arch_prctl(ARCH_SET_FS, 0x7f4adc8ef700) = 0
mprotect(0x30bd585000, 16384, PROT_READ) = 0
mprotect(0x30bd01f000, 4096, PROT_READ) = 0
munmap(0x7f4adc8f1000, 40617) = 0
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=99158752, ...}) = 0
mmap(NULL, 99158752, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4ad6a5d000
close(3) = 0
brk(0) = 0x14d4000
brk(0x14f5000) = 0x14f5000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
open("/proc/10697/cmdline", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3,
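The cat above is stuck in the same uninterruptible read(2) inside the kernel, which is why even repeated ^C has no effect. As an illustration only (this program is not part of the bug report; its name and the 5-second timeout are assumptions), a monitoring tool could probe a cmdline file without wedging itself by pushing the potentially hanging read into a throwaway child and simply giving up on the child after a timeout:

/*
 * cmdline_probe.c -- illustrative sketch: probe /proc/<pid>/cmdline without
 * hanging the caller.  The blocked read sleeps uninterruptibly (D state) in
 * the kernel and cannot be interrupted by a signal, so the read is done in a
 * child process and the parent just stops waiting for it.
 *
 *   cc -o cmdline_probe cmdline_probe.c
 *   ./cmdline_probe 10697
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    char path[64], buf[2048];
    struct timeval tv;
    fd_set rfds;
    int pipefd[2];
    pid_t child;
    ssize_t n;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 2;
    }
    snprintf(path, sizeof(path), "/proc/%s/cmdline", argv[1]);

    if (pipe(pipefd) < 0) {
        perror("pipe");
        return 1;
    }

    child = fork();
    if (child == 0) {
        /* Child: do the potentially hanging read and report the result
         * back over the pipe.  If the read blocks in the kernel, the
         * child sits in D state forever, but the parent does not. */
        int fd = open(path, O_RDONLY);
        n = (fd >= 0) ? read(fd, buf, sizeof(buf)) : -1;
        if (write(pipefd[1], &n, sizeof(n)) < 0)
            _exit(1);
        _exit(0);
    }

    close(pipefd[1]);
    FD_ZERO(&rfds);
    FD_SET(pipefd[0], &rfds);
    tv.tv_sec = 5;
    tv.tv_usec = 0;

    if (select(pipefd[0] + 1, &rfds, NULL, NULL, &tv) <= 0) {
        /* No answer: the child is almost certainly stuck in the
         * uninterruptible read, just like ps and cat in this bug. */
        printf("%s: read did not complete within 5s\n", path);
        return 1;
    }

    if (read(pipefd[0], &n, sizeof(n)) == sizeof(n))
        printf("%s: read returned %zd bytes\n", path, n);
    waitpid(child, NULL, 0);
    return 0;
}

This only works around the symptom for the monitoring side; the stuck child still lingers in D state until the underlying lock is released.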
I am able to reproduce this issue again on RHEL 6.2. I have an internal system on which the bug is currently reproduced. Can you please take a look?
I have hit the same issue as well. ps hung when it got to the pid of the glusterfs client process.

This is the listing of /proc/<glusterfs pid>/task and the stack of one of its threads:

1610 1611 1612 1613 1616 1617 1618 1619 1620 3434 3435
[04/16/12 - 21:46:21 root@APP-CLIENT1 1610]# cat ../1619/stack
[<ffffffffa033ad54>] rpc_wait_bit_killable+0x24/0x40 [sunrpc]
[<ffffffffa033ad2d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
[<ffffffffa03a7c0a>] nfs_initiate_commit+0xfa/0x120 [nfs]
[<ffffffffa03a9039>] nfs_commit_inode+0x199/0x250 [nfs]
[<ffffffffa03960c6>] nfs_release_page+0x86/0xa0 [nfs]
[<ffffffff8110fe60>] try_to_release_page+0x30/0x60
[<ffffffff8112a2a1>] shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a66b>] shrink_inactive_list+0x2fb/0x740
[<ffffffff8112b37f>] shrink_zone+0x38f/0x520
[<ffffffff8112b60e>] do_try_to_free_pages+0xfe/0x520
[<ffffffff8112bc1d>] try_to_free_pages+0x9d/0x130
[<ffffffff81123b9d>] __alloc_pages_nodemask+0x40d/0x940
[<ffffffff81158c7a>] alloc_pages_vma+0x9a/0x150
[<ffffffff81171bb5>] do_huge_pmd_anonymous_page+0x145/0x370
[<ffffffff8113c52a>] handle_mm_fault+0x25a/0x2b0
[<ffffffff8113c6aa>] __get_user_pages+0x12a/0x430
[<ffffffff8113ca49>] get_user_pages+0x49/0x50
[<ffffffffa015444a>] fuse_copy_fill+0xca/0x1f0 [fuse]
[<ffffffffa0154796>] fuse_copy_args+0x1b6/0x350 [fuse]
[<ffffffffa0155524>] fuse_dev_read+0x284/0x310 [fuse]
[<ffffffff8117619b>] do_sync_readv_writev+0xfb/0x140
[<ffffffff8117722f>] do_readv_writev+0xcf/0x1f0
[<ffffffff81177563>] vfs_readv+0x43/0x60
[<ffffffff81177691>] sys_readv+0x51/0xb0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
[04/16/12 - 21:46:24 root@APP-CLIENT1 1610]#

These are the gluster NFS server logs:

[2012-04-16 21:35:50.119791] I [client.c:136:client_register_grace_timer] 0-dstore-client-8: Registering a grace timer
[2012-04-16 21:35:50.119838] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-dstore-client-2: changing port to 24009 (from 0)
[2012-04-16 21:35:50.119902] I [client.c:136:client_register_grace_timer] 0-dstore-client-2: Registering a grace timer
[2012-04-16 21:35:50.187812] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 18
[2012-04-16 21:35:50.187901] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-16 21:35:50.187947] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 21
[2012-04-16 21:35:50.187970] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-16 21:35:50.188001] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 22
[2012-04-16 21:35:50.188032] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-16 21:35:50.188200] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 24
[2012-04-16 21:35:50.188241] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-16 21:35:50.293762] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-dstore-client-1: changing port to 24009 (from 0)
[2012-04-16 21:35:50.293883] I [client.c:136:client_register_grace_timer] 0-dstore-client-1: Registering a grace timer
[2012-04-16 21:35:50.293938] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-dstore-client-4: changing port to 24010 (from 0)
[2012-04-16 21:35:50.294053] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-dstore-client-7: changing port to 24011 (from 0)
[2012-04-16 21:35:50.294163] I [client.c:136:client_register_grace_timer] 0-dstore-client-4: Registering a grace timer
[2012-04-16 21:35:50.294228] I [client.c:136:client_register_grace_timer] 0-dstore-client-7: Registering a grace timer
[2012-04-16 21:35:50.737371] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.737445] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.738060] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.738113] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.738576] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.738613] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.738804] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.738850] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.743696] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.743765] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.743911] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.743972] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.744238] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.744297] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.744769] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.744830] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.745023] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.745084] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.745264] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
Hit the issue again. This time the gluster NFS server log level was set to DEBUG. These are the NFS server logs from when the NFS client was hung:

[2012-04-25 15:52:01.750472] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-mirror-client-0: changing port to 24009 (from 0)
[2012-04-25 15:52:01.763225] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-mirror-client-1: changing port to 24010 (from 0)
[2012-04-25 15:52:01.770323] D [client-handshake.c:1665:server_has_portmap] 0-mirror-client-2: detected portmapper on server
[2012-04-25 15:52:01.772293] D [client-handshake.c:193:client_start_ping] 0-mirror-client-2: returning as transport is already disconnected OR there are no frames (1 || 1)
[2012-04-25 15:52:01.775942] D [socket.c:193:__socket_rwv] 0-mirror-client-0: EOF from peer 127.0.0.1:24007
[2012-04-25 15:52:01.782242] D [socket.c:1521:__socket_proto_state_machine] 0-mirror-client-0: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24007)
[2012-04-25 15:52:01.785826] D [socket.c:1807:socket_event_handler] 0-transport: disconnecting now
[2012-04-25 15:52:01.827678] D [client.c:2117:client_rpc_notify] 0-mirror-client-0: disconnected (skipped notify)
[2012-04-25 15:52:01.833224] D [socket.c:193:__socket_rwv] 0-mirror-client-1: EOF from peer 127.0.0.1:24007
[2012-04-25 15:52:01.834515] D [socket.c:1521:__socket_proto_state_machine] 0-mirror-client-1: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24007)
[2012-04-25 15:52:01.835679] D [socket.c:1807:socket_event_handler] 0-transport: disconnecting now
[2012-04-25 15:52:01.836723] D [client.c:2117:client_rpc_notify] 0-mirror-client-1: disconnected (skipped notify)
[2012-04-25 15:52:01.840632] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-mirror-client-2: changing port to 24013 (from 0)
[2012-04-25 15:52:01.842144] D [socket.c:193:__socket_rwv] 0-mirror-client-2: EOF from peer 127.0.0.1:24007
[2012-04-25 15:52:01.843125] D [socket.c:1521:__socket_proto_state_machine] 0-mirror-client-2: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24007)
[2012-04-25 15:52:01.844010] D [socket.c:1807:socket_event_handler] 0-transport: disconnecting now
[2012-04-25 15:52:01.845846] D [client.c:2117:client_rpc_notify] 0-mirror-client-2: disconnected (skipped notify)
[2012-04-25 15:52:04.190866] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4617fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 343515136, count: 65536, UNSTABLE
[2012-04-25 15:52:04.222722] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.227420] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.232569] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4117fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 149557248, count: 65536, UNSTABLE
[2012-04-25 15:52:04.234233] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.235567] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.240296] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4017fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 149491712, count: 65536, UNSTABLE
[2012-04-25 15:52:04.241962] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.243251] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.247706] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 5117fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 344236032, count: 65536, UNSTABLE
[2012-04-25 15:52:04.249278] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.250451] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.254747] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4717fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 343580672, count: 65536, UNSTABLE
[2012-04-25 15:52:04.256408] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.257550] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.261964] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4b17fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 343842816, count: 65536, UNSTABLE
[2012-04-25 15:52:04.263107] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
:

This is the stack trace of the glusterfs pid which is hung.
cd /proc/32262/
root@hyperspace:/proc/32262# unalias ls
root@hyperspace:/proc/32262# ls
attr       clear_refs       cpuset   fd       limits    mountinfo   numa_maps  pagemap      schedstat  stat    task
autogroup  cmdline          cwd      fdinfo   loginuid  mounts      oom_adj    personality  sessionid  statm   wchan
auxv       comm             environ  io       maps      mountstats  oom_score  root         smaps      status
cgroup     coredump_filter  exe      latency  mem       net         oom_score_adj  sched    stack      syscall
root@hyperspace:/proc/32262# cd task/
root@hyperspace:/proc/32262/task# ls
32262  32263  32264  32265  32269  32289  32290  32291  32361  32593  32644
root@hyperspace:/proc/32262/task# cat 32262/stack
[<ffffffffa05f4294>] nfs_wait_bit_killable+0x24/0x40 [nfs]
[<ffffffffa060342a>] nfs_commit_inode+0x9a/0x280 [nfs]
[<ffffffffa05f1824>] nfs_release_page+0x84/0xa0 [nfs]
[<ffffffff8110b702>] try_to_release_page+0x32/0x50
[<ffffffff8111da77>] shrink_page_list+0x537/0x5c0
[<ffffffff8111deea>] shrink_inactive_list+0x11a/0x450
[<ffffffff8111e772>] shrink_zone+0x1f2/0x2b0
[<ffffffff8111eb44>] shrink_zones+0x74/0xe0
[<ffffffff8111ec3e>] do_try_to_free_pages+0x8e/0x2d0
[<ffffffff8111f0d7>] try_to_free_pages+0x87/0x110
[<ffffffff81114654>] __alloc_pages_nodemask+0x494/0x840
[<ffffffff81149f45>] alloc_pages_current+0xa5/0x110
[<ffffffff8104428b>] pte_alloc_one+0x1b/0x50
[<ffffffff8112dd95>] __pte_alloc+0x35/0x100
[<ffffffff81131d09>] handle_mm_fault+0x129/0x250
[<ffffffff815c6a47>] do_page_fault+0x1a7/0x540
[<ffffffff815c34d5>] page_fault+0x25/0x30
[<ffffffff814c8f38>] skb_copy_datagram_iovec+0x148/0x2a0
[<ffffffff81515aea>] tcp_recvmsg+0xa3a/0xbb0
[<ffffffff81538c8b>] inet_recvmsg+0x6b/0x80
[<ffffffff814bacc4>] sock_aio_read+0x164/0x170
[<ffffffff81165383>] do_sync_readv_writev+0xd3/0x110
[<ffffffff81165654>] do_readv_writev+0xd4/0x1e0
[<ffffffff81165805>] vfs_readv+0x45/0x60
[<ffffffff81165871>] sys_readv+0x51/0xc0
[<ffffffff8100c002>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
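For reference, the per-thread kernel stacks inspected above come from /proc/<pid>/task/<tid>/stack. A small illustrative helper that automates the manual cat loop might look roughly like the following; it is an assumption of this write-up (not something attached to the bug), and it requires root plus a kernel that exposes /proc/<pid>/stack (CONFIG_STACKTRACE):

/*
 * dump_task_stacks.c -- illustrative sketch: print the kernel stack of every
 * thread of a given process, i.e. what the comments above do by hand.
 *
 *   cc -o dump_task_stacks dump_task_stacks.c
 *   ./dump_task_stacks 32262
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

static void dump_one(const char *pid, const char *tid)
{
    char path[128], line[512];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%s/task/%s/stack", pid, tid);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return;
    }
    printf("=== thread %s ===\n", tid);
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
}

int main(int argc, char **argv)
{
    char taskdir[64];
    struct dirent *de;
    DIR *d;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 2;
    }
    snprintf(taskdir, sizeof(taskdir), "/proc/%s/task", argv[1]);
    d = opendir(taskdir);
    if (!d) {
        perror(taskdir);
        return 1;
    }
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;           /* skip "." and ".." */
        dump_one(argv[1], de->d_name);
    }
    closedir(d);
    return 0;
}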
Hello guys. Sorry for the delay. As the cat command got frozen too, this is really a stalled read. I believe the problem needs to be fixed somewhere in the lower levels, so I would suggest changing the component to glibc or kernel (or selinux?). Could you try to reproduce it with SELinux disabled? Anyway ... I'm changing the component to glibc and we'll see what the glibc maintainer says.

Regards, Jaromir.
Everything here looks like a kernel bug to me; reassigning to the kernel team.
(In reply to Raghavendra Bhat from comment #0)
> open("/proc/11270/cmdline", O_RDONLY) = 5
> read(5,
>
> stalls here forever.

Apparently proc_pid_cmdline_read() sleeps in down_read(&mm->mmap_sem).

> /var/log/messages has the following information:
>
> Nov 6 04:44:17 RHEL6 kernel: INFO: task rhn_check:11270 blocked for more
> than 120 seconds.
...
> Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8126e773>] call_rwsem_down_write_failed+0x13/0x20
> Nov 6 04:44:17 RHEL6 kernel: [<ffffffff814dca22>] ? down_write+0x32/0x40
> Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8112debc>] sys_mmap_pgoff+0x5c/0x2d0
> Nov 6 04:44:17 RHEL6 kernel: [<ffffffff810104e9>] sys_mmap+0x29/0x30

and rhn_check too can't take the same semaphore.
Nagios nrpe plugin triggers this readily on fully loaded nodes.
(In reply to Shwetha Panduranga from comment #7)
> Even I have hit the same issue.
> ps has hung when it got the pid of glusterfs client process.
>
> This is the output of /proc/<glusterfs pid>/task/stack
>
> 1610 1611 1612 1613 1616 1617 1618 1619 1620 3434 3435
> [04/16/12 - 21:46:21 root@APP-CLIENT1 1610]# cat ../1619/stack
> [<ffffffffa033ad54>] rpc_wait_bit_killable+0x24/0x40 [sunrpc]
> [<ffffffffa033ad2d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
> [<ffffffffa03a7c0a>] nfs_initiate_commit+0xfa/0x120 [nfs]
> [<ffffffffa03a9039>] nfs_commit_inode+0x199/0x250 [nfs]
> [<ffffffffa03960c6>] nfs_release_page+0x86/0xa0 [nfs]
> [<ffffffff8110fe60>] try_to_release_page+0x30/0x60
> [<ffffffff8112a2a1>] shrink_page_list.clone.0+0x4f1/0x5c0
> [<ffffffff8112a66b>] shrink_inactive_list+0x2fb/0x740
> [<ffffffff8112b37f>] shrink_zone+0x38f/0x520
> [<ffffffff8112b60e>] do_try_to_free_pages+0xfe/0x520
> [<ffffffff8112bc1d>] try_to_free_pages+0x9d/0x130
> [<ffffffff81123b9d>] __alloc_pages_nodemask+0x40d/0x940
> [<ffffffff81158c7a>] alloc_pages_vma+0x9a/0x150
> [<ffffffff81171bb5>] do_huge_pmd_anonymous_page+0x145/0x370
> [<ffffffff8113c52a>] handle_mm_fault+0x25a/0x2b0
> [<ffffffff8113c6aa>] __get_user_pages+0x12a/0x430
> [<ffffffff8113ca49>] get_user_pages+0x49/0x50

And this sleeps with ->mmap_sem held for reading. If another process then tries to take it for writing (the sys_mmap() in comment #0), the down_read() in proc_pid_cmdline_read() will queue behind that waiting writer and block.
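To make the locking order concrete: mmap_sem is a reader/writer semaphore, and once a writer is queued on it, later readers wait behind that writer even though the current holder only took it for reading. The following user-space sketch is only an analogy (it uses a glibc writer-preferring pthread_rwlock via the GNU extension PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP, not the kernel's rwsem; thread roles and timings are made up) of the chain described here: the page-fault path holds the lock for reading while stuck in NFS, sys_mmap() queues for writing, and the later /proc/<pid>/cmdline read then blocks behind the writer.

/*
 * rwsem_analogy.c -- user-space analogy of the blocked chain, not kernel code.
 *
 *   cc -o rwsem_analogy rwsem_analogy.c -lpthread
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t sem;

static void *writer(void *arg)          /* plays the role of sys_mmap() */
{
    (void)arg;
    sleep(1);                           /* let the first reader get in */
    printf("writer: waiting for write lock...\n");
    pthread_rwlock_wrlock(&sem);
    printf("writer: got write lock\n");
    pthread_rwlock_unlock(&sem);
    return NULL;
}

static void *late_reader(void *arg)     /* plays proc_pid_cmdline_read() */
{
    (void)arg;
    sleep(2);                           /* arrive after the writer has queued */
    printf("late reader: trying read lock (will block behind the writer)\n");
    pthread_rwlock_rdlock(&sem);
    printf("late reader: got read lock\n");
    pthread_rwlock_unlock(&sem);
    return NULL;
}

int main(void)
{
    pthread_rwlockattr_t attr;
    pthread_t w, r;

    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setkind_np(&attr,
        PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&sem, &attr);

    /* First reader: the page-fault path that holds the lock for reading
     * while it is stuck waiting for the NFS commit. */
    pthread_rwlock_rdlock(&sem);
    printf("first reader: holding read lock (simulating stuck NFS commit)\n");

    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, late_reader, NULL);

    sleep(5);                           /* nobody makes progress until... */
    printf("first reader: releasing read lock\n");
    pthread_rwlock_unlock(&sem);

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    pthread_rwlock_destroy(&sem);
    return 0;
}

When run, the late reader does not print "got read lock" until the first reader releases the lock and the writer has gone through, which mirrors why ps and cat sit blocked for as long as the NFS commit never completes.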
(In reply to Tore H. Larsen from comment #13)
> Nagios nrpe plugin triggers this readily on fully loaded nodes.

Sorry, I don't know what nrpe is ;) I think that the gluster/nfs people should look at this. Once again, ->mmap_sem is held by handle_mm_fault() callers, and it seems that nfs_release_page() hangs. Could you re-assign this bug to someone who knows this area?
(In reply to Tore H. Larsen from comment #13)
> Nagios nrpe plugin triggers this readily on fully loaded nodes.

Tore, are you hitting this without glusterfs or fuse? It looks like a fuse-side problem, since the mmap_sem is shared. There are several upstream fixes that skip the waiting in nfs_release_page(), but I had only thought they were necessary for loopback mounts.
Is this related to solution 470653? Please add the bug # to the solution.

Yes, I'm hitting it without glusterfs. The nodes do have the FhGFS (BeeGFS) client. We moved away from Nagios because nrpe got stuck when accessing /proc; we now use Ganglia and PCP (pmcd) to monitor the nodes.
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com/