Created attachment 532068 [details]
Source code for the binary mentioned in the bug description.

Description of problem:
While running `ps aux`, the command hangs. strace shows that it stalls reading /proc/11270/cmdline; 11270 happens to be rhn_check.

Version-Release number of selected component (if applicable):
procps-3.2.8-17.el6.x86_64

How reproducible:
Had an NFS mount on RHEL 6.1 which talked to a GlusterFS 3.2 NFS server running on a different RHEL 6.1 instance. fs-perf-test was being run on the mount (the source for fs-perf-test, new-fs-perf.c, is attached). fs-perf-test became unresponsive, and while debugging that problem I observed that `ps aux` hangs on the machine that has the NFS mount.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

strace output:

stat("/proc/11263", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/11263/stat", O_RDONLY) = 5
read(5, "11263 (awk) S 11098 11018 11018 "..., 1023) = 228
close(5) = 0
open("/proc/11263/status", O_RDONLY) = 5
read(5, "Name:\tawk\nState:\tS (sleeping)\nTg"..., 1023) = 889
close(5) = 0
open("/proc/11263/cmdline", O_RDONLY) = 5
read(5, "awk\0-v\0progname\0/etc/cron.daily/"..., 2047) = 137
close(5) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
write(1, "root 11263 0.0 0.0 9068 "..., 143root 11263 0.0 0.0 9068 768 ? SN Nov06 0:00 awk -v progname /etc/cron.daily/mlocate.cron progname {????? print progname ) = 143
stat("/proc/11268", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/11268/stat", O_RDONLY) = 5
read(5, "11268 (updatedb) D 11262 11018 1"..., 1023) = 235
close(5) = 0
open("/proc/11268/status", O_RDONLY) = 5
read(5, "Name:\tupdatedb\nState:\tD (disk sl"..., 1023) = 898
close(5) = 0
open("/proc/11268/cmdline", O_RDONLY) = 5
read(5, "/usr/bin/updatedb\0-f\0sysfs\nrootf"..., 2047) = 214
close(5) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
write(1, "root 11268 0.0 0.0 4076 "..., 143root 11268 0.0 0.0 4076 760 ? DN Nov06 0:00 /usr/bin/updatedb -f sysfs?rootfs?bdev?proc?cgroup?cpuset?tmpfs?devtmpfs?binf ) = 143
stat("/proc/11270", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/11270/stat", O_RDONLY) = 5
read(5, "11270 (rhn_check) D 1726 1726 17"..., 1023) = 249
close(5) = 0
open("/proc/11270/status", O_RDONLY) = 5
read(5, "Name:\trhn_check\nState:\tD (disk s"..., 1023) = 883
close(5) = 0
open("/proc/11270/cmdline", O_RDONLY) = 5
read(5,
^^^^^^^^^^^^^^ stalls here forever.

/var/log/messages has the following information:

Nov 6 04:44:17 RHEL6 kernel: INFO: task rhn_check:11270 blocked for more than 120 seconds.
Nov 6 04:44:17 RHEL6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 04:44:17 RHEL6 kernel: rhn_check D 0000000000000000 0 11270 1726 0x00000080
Nov 6 04:44:17 RHEL6 kernel: ffff880046713e08 0000000000000082 0000000000000000 ffffffff8120e3d2
Nov 6 04:44:17 RHEL6 kernel: ffff88003737a6c0 ffffffff812064af ffff880046713d98 00000001610e6889
Nov 6 04:44:17 RHEL6 kernel: ffff8800792ebb38 ffff880046713fd8 000000000000f598 ffff8800792ebb38
Nov 6 04:44:17 RHEL6 kernel: Call Trace:
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8120e3d2>] ? selinux_inode_permission+0x72/0xb0
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff812064af>] ? security_inode_permission+0x1f/0x30
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff814dd3c5>] rwsem_down_failed_common+0x95/0x1d0
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff814dd523>] rwsem_down_write_failed+0x23/0x30
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8126e773>] call_rwsem_down_write_failed+0x13/0x20
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff814dca22>] ? down_write+0x32/0x40
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8112debc>] sys_mmap_pgoff+0x5c/0x2d0
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff810104e9>] sys_mmap+0x29/0x30
Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Nov 7 02:11:22 RHEL6 abrt[15069]: saved core dump of pid 15068 (/usr/libexec/fprintd) to /var/spool/abrt/ccpp-1320649882-15068.new/coredump (757760 bytes)
Nov 7 02:11:22 RHEL6 abrtd: Directory 'ccpp-1320649882-15068' creation detected
Nov 7 02:11:22 RHEL6 abrtd: Crash is in database already (dup of /var/spool/abrt/ccpp-1318151537-1579)
Nov 7 02:11:22 RHEL6 abrtd: Deleting crash ccpp-1320649882-15068 (dup of ccpp-1318151537-1579), sending dbus signal
Nov 7 02:11:27 RHEL6 abrt[15076]: not dumping repeating crash in '/usr/libexec/fprintd'
Nov 7 02:11:30 RHEL6 abrt[15081]: not dumping repeating crash in '/usr/libexec/fprintd'
Hello Raghavendra. This seems to be a stalled read from the cmdline file. Please try to repeat the test and, once it gets frozen, try to read /proc/<PID>/cmdline with the cat command from a different terminal.

Expected results:
1.) The cat command gets frozen too ... -OR-
2.) The cat command returns a correct output and successfully terminates ... -OR-
3.) something else (+ describe what happens) ...

Thanks in advance.

Regards, Jaromir.
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.
I have been able to reproduce the issue. Again, the machine that mounts the NFS export from the gluster-nfs server running on another machine (glusterfs version: glusterfs-3.3.0qa25) hangs when the `ps aux` and `w` commands are executed.

Ran 'w' under strace:

read(4, "2014 (rpc.statd) S 1 2014 2014 0"..., 1023) = 250
close(4) = 0
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=1339, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9b33f94000
read(4, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1339
close(4) = 0
munmap(0x7f9b33f94000, 4096) = 0
open("/proc/2014/cmdline", O_RDONLY) = 4
read(4, "rpc.statd\0", 2047) = 10
close(4) = 0
stat("/proc/6686", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/6686/stat", O_RDONLY) = 4
read(4, "6686 (run.sh) S 32277 6686 32277"..., 1023) = 246
close(4) = 0
open("/proc/6686/cmdline", O_RDONLY) = 4
read(4, "/bin/bash\0/opt/qa/tools/system_l"..., 2047) = 67
close(4) = 0
stat("/proc/6804", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/6804/stat", O_RDONLY) = 4
read(4, "6804 (bonnie.sh) S 6686 6686 322"..., 1023) = 245
close(4) = 0
open("/proc/6804/cmdline", O_RDONLY) = 4
read(4, "/bin/bash\0/opt/qa/tools/system_l"..., 2047) = 62
close(4) = 0
stat("/proc/6806", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/6806/stat", O_RDONLY) = 4
read(4, "6806 (bonnie++) S 6804 6686 3227"..., 1023) = 236
close(4) = 0
open("/proc/6806/cmdline", O_RDONLY) = 4
read(4, "bonnie++\0-u\0root\0-d\0/nfs/run6686"..., 2047) = 34
close(4) = 0
stat("/proc/8466", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/8466/stat", O_RDONLY) = 4
read(4, "8466 (rpc.statd) S 1 8466 8466 0"..., 1023) = 252
close(4) = 0
open("/proc/8466/cmdline", O_RDONLY) = 4
read(4, "rpc.statd\0", 2047) = 10
close(4) = 0
stat("/proc/10697", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/10697/stat", O_RDONLY) = 4
read(4, "10697 (rhn_check) D 1692 1692 16"..., 1023) = 243
close(4) = 0
open("/proc/10697/cmdline", O_RDONLY) = 4
read(4,

On the NFS client, iozone was running (iozone -a), which also hung immediately after the command was executed. cat /proc/PID/cmdline hangs as well. This is the output of 'cat /proc/10697/cmdline':

cat /proc/10697/cmdline
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C

Ran strace cat /proc/10697/cmdline.
This is the output:

strace cat /proc/10697/cmdline
execve("/bin/cat", ["cat", "/proc/10697/cmdline"], [/* 28 vars */]) = 0
brk(0) = 0x14d4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4adc8fb000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=40617, ...}) = 0
mmap(NULL, 40617, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4adc8f1000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\355!\2750\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1904456, ...}) = 0
mmap(0x30bd200000, 3729576, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x30bd200000
mprotect(0x30bd386000, 2093056, PROT_NONE) = 0
mmap(0x30bd585000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x185000) = 0x30bd585000
mmap(0x30bd58a000, 18600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x30bd58a000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4adc8f0000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4adc8ef000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4adc8ee000
arch_prctl(ARCH_SET_FS, 0x7f4adc8ef700) = 0
mprotect(0x30bd585000, 16384, PROT_READ) = 0
mprotect(0x30bd01f000, 4096, PROT_READ) = 0
munmap(0x7f4adc8f1000, 40617) = 0
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=99158752, ...}) = 0
mmap(NULL, 99158752, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4ad6a5d000
close(3) = 0
brk(0) = 0x14d4000
brk(0x14f5000) = 0x14f5000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
open("/proc/10697/cmdline", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3,
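The cat above is stuck in the same uninterruptible read(2) inside the kernel, which is why even repeated ^C has no effect. As an illustration only (this program is not part of the bug report; its name and the 5-second timeout are assumptions), a monitoring tool could probe a cmdline file without wedging itself by pushing the potentially hanging read into a throwaway child and simply giving up on the child after a timeout:

/*
 * cmdline_probe.c -- illustrative sketch: probe /proc/<pid>/cmdline without
 * hanging the caller.  The blocked read sleeps uninterruptibly (D state) in
 * the kernel and cannot be interrupted by a signal, so the read is done in a
 * child process and the parent just stops waiting for it.
 *
 *   cc -o cmdline_probe cmdline_probe.c
 *   ./cmdline_probe 10697
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    char path[64], buf[2048];
    struct timeval tv;
    fd_set rfds;
    int pipefd[2];
    pid_t child;
    ssize_t n;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 2;
    }
    snprintf(path, sizeof(path), "/proc/%s/cmdline", argv[1]);

    if (pipe(pipefd) < 0) {
        perror("pipe");
        return 1;
    }

    child = fork();
    if (child == 0) {
        /* Child: do the potentially hanging read and report the result
         * back over the pipe.  If the read blocks in the kernel, the
         * child sits in D state forever, but the parent does not. */
        int fd = open(path, O_RDONLY);
        n = (fd >= 0) ? read(fd, buf, sizeof(buf)) : -1;
        if (write(pipefd[1], &n, sizeof(n)) < 0)
            _exit(1);
        _exit(0);
    }

    close(pipefd[1]);
    FD_ZERO(&rfds);
    FD_SET(pipefd[0], &rfds);
    tv.tv_sec = 5;
    tv.tv_usec = 0;

    if (select(pipefd[0] + 1, &rfds, NULL, NULL, &tv) <= 0) {
        /* No answer: the child is almost certainly stuck in the
         * uninterruptible read, just like ps and cat in this bug. */
        printf("%s: read did not complete within 5s\n", path);
        return 1;
    }

    if (read(pipefd[0], &n, sizeof(n)) == sizeof(n))
        printf("%s: read returned %zd bytes\n", path, n);
    waitpid(child, NULL, 0);
    return 0;
}

This only works around the symptom for the monitoring side; the stuck child still lingers in D state until the underlying lock is released.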
I am able to reproduce this issue again on RHEL 6.2. I have an internal system on which the bug is currently reproduced. Can you please take a look?
I have hit the same issue as well. ps hung when it got to the pid of the glusterfs client process.

This is the listing of /proc/<glusterfs pid>/task and the stack of one of its threads:

1610 1611 1612 1613 1616 1617 1618 1619 1620 3434 3435
[04/16/12 - 21:46:21 root@APP-CLIENT1 1610]# cat ../1619/stack
[<ffffffffa033ad54>] rpc_wait_bit_killable+0x24/0x40 [sunrpc]
[<ffffffffa033ad2d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
[<ffffffffa03a7c0a>] nfs_initiate_commit+0xfa/0x120 [nfs]
[<ffffffffa03a9039>] nfs_commit_inode+0x199/0x250 [nfs]
[<ffffffffa03960c6>] nfs_release_page+0x86/0xa0 [nfs]
[<ffffffff8110fe60>] try_to_release_page+0x30/0x60
[<ffffffff8112a2a1>] shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a66b>] shrink_inactive_list+0x2fb/0x740
[<ffffffff8112b37f>] shrink_zone+0x38f/0x520
[<ffffffff8112b60e>] do_try_to_free_pages+0xfe/0x520
[<ffffffff8112bc1d>] try_to_free_pages+0x9d/0x130
[<ffffffff81123b9d>] __alloc_pages_nodemask+0x40d/0x940
[<ffffffff81158c7a>] alloc_pages_vma+0x9a/0x150
[<ffffffff81171bb5>] do_huge_pmd_anonymous_page+0x145/0x370
[<ffffffff8113c52a>] handle_mm_fault+0x25a/0x2b0
[<ffffffff8113c6aa>] __get_user_pages+0x12a/0x430
[<ffffffff8113ca49>] get_user_pages+0x49/0x50
[<ffffffffa015444a>] fuse_copy_fill+0xca/0x1f0 [fuse]
[<ffffffffa0154796>] fuse_copy_args+0x1b6/0x350 [fuse]
[<ffffffffa0155524>] fuse_dev_read+0x284/0x310 [fuse]
[<ffffffff8117619b>] do_sync_readv_writev+0xfb/0x140
[<ffffffff8117722f>] do_readv_writev+0xcf/0x1f0
[<ffffffff81177563>] vfs_readv+0x43/0x60
[<ffffffff81177691>] sys_readv+0x51/0xb0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
[04/16/12 - 21:46:24 root@APP-CLIENT1 1610]#

These are the gluster NFS server logs:

[2012-04-16 21:35:50.119791] I [client.c:136:client_register_grace_timer] 0-dstore-client-8: Registering a grace timer
[2012-04-16 21:35:50.119838] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-dstore-client-2: changing port to 24009 (from 0)
[2012-04-16 21:35:50.119902] I [client.c:136:client_register_grace_timer] 0-dstore-client-2: Registering a grace timer
[2012-04-16 21:35:50.187812] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 18
[2012-04-16 21:35:50.187901] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-16 21:35:50.187947] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 21
[2012-04-16 21:35:50.187970] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-16 21:35:50.188001] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 22
[2012-04-16 21:35:50.188032] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-16 21:35:50.188200] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 24
[2012-04-16 21:35:50.188241] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-16 21:35:50.293762] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-dstore-client-1: changing port to 24009 (from 0)
[2012-04-16 21:35:50.293883] I [client.c:136:client_register_grace_timer] 0-dstore-client-1: Registering a grace timer
[2012-04-16 21:35:50.293938] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-dstore-client-4: changing port to 24010 (from 0)
[2012-04-16 21:35:50.294053] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-dstore-client-7: changing port to 24011 (from 0)
[2012-04-16 21:35:50.294163] I [client.c:136:client_register_grace_timer] 0-dstore-client-4: Registering a grace timer
[2012-04-16 21:35:50.294228] I [client.c:136:client_register_grace_timer] 0-dstore-client-7: Registering a grace timer
[2012-04-16 21:35:50.737371] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.737445] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.738060] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.738113] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.738576] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.738613] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.738804] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.738850] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.743696] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.743765] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.743911] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.743972] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.744238] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.744297] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.744769] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.744830] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.745023] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
[2012-04-16 21:35:50.745084] W [rpcsvc.c:530:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-16 21:35:50.745264] E [nfs3.c:2210:nfs3_write] 0-nfs-nfsv3: Volume is disabled: dstore
Hit the issue again. This time the gluster NFS server log level was set to DEBUG. These are the NFS server logs from when the NFS client was hung:

[2012-04-25 15:52:01.750472] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-mirror-client-0: changing port to 24009 (from 0)
[2012-04-25 15:52:01.763225] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-mirror-client-1: changing port to 24010 (from 0)
[2012-04-25 15:52:01.770323] D [client-handshake.c:1665:server_has_portmap] 0-mirror-client-2: detected portmapper on server
[2012-04-25 15:52:01.772293] D [client-handshake.c:193:client_start_ping] 0-mirror-client-2: returning as transport is already disconnected OR there are no frames (1 || 1)
[2012-04-25 15:52:01.775942] D [socket.c:193:__socket_rwv] 0-mirror-client-0: EOF from peer 127.0.0.1:24007
[2012-04-25 15:52:01.782242] D [socket.c:1521:__socket_proto_state_machine] 0-mirror-client-0: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24007)
[2012-04-25 15:52:01.785826] D [socket.c:1807:socket_event_handler] 0-transport: disconnecting now
[2012-04-25 15:52:01.827678] D [client.c:2117:client_rpc_notify] 0-mirror-client-0: disconnected (skipped notify)
[2012-04-25 15:52:01.833224] D [socket.c:193:__socket_rwv] 0-mirror-client-1: EOF from peer 127.0.0.1:24007
[2012-04-25 15:52:01.834515] D [socket.c:1521:__socket_proto_state_machine] 0-mirror-client-1: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24007)
[2012-04-25 15:52:01.835679] D [socket.c:1807:socket_event_handler] 0-transport: disconnecting now
[2012-04-25 15:52:01.836723] D [client.c:2117:client_rpc_notify] 0-mirror-client-1: disconnected (skipped notify)
[2012-04-25 15:52:01.840632] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 0-mirror-client-2: changing port to 24013 (from 0)
[2012-04-25 15:52:01.842144] D [socket.c:193:__socket_rwv] 0-mirror-client-2: EOF from peer 127.0.0.1:24007
[2012-04-25 15:52:01.843125] D [socket.c:1521:__socket_proto_state_machine] 0-mirror-client-2: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24007)
[2012-04-25 15:52:01.844010] D [socket.c:1807:socket_event_handler] 0-transport: disconnecting now
[2012-04-25 15:52:01.845846] D [client.c:2117:client_rpc_notify] 0-mirror-client-2: disconnected (skipped notify)
[2012-04-25 15:52:04.190866] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4617fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 343515136, count: 65536, UNSTABLE
[2012-04-25 15:52:04.222722] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.227420] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.232569] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4117fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 149557248, count: 65536, UNSTABLE
[2012-04-25 15:52:04.234233] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.235567] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.240296] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4017fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 149491712, count: 65536, UNSTABLE
[2012-04-25 15:52:04.241962] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.243251] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.247706] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 5117fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 344236032, count: 65536, UNSTABLE
[2012-04-25 15:52:04.249278] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.250451] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.254747] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4717fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 343580672, count: 65536, UNSTABLE
[2012-04-25 15:52:04.256408] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
[2012-04-25 15:52:04.257550] W [rpcsvc.c:532:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-04-25 15:52:04.261964] D [nfs3-helpers.c:1762:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 4b17fcc6, WRITE: args: FH: hashcount 1, exportid e68ec23f-140e-46fd-9d21-e2662dc175f9, gfid 6af0e0e1-54f8-4a67-811d-57f1beed8cbe, offset: 343842816, count: 65536, UNSTABLE
[2012-04-25 15:52:04.263107] E [nfs3.c:2211:nfs3_write] 0-nfs-nfsv3: Volume is disabled: mirror
:

This is the stack trace of the glusterfs pid which is hung.
cd /proc/32262/
root@hyperspace:/proc/32262# unalias ls
root@hyperspace:/proc/32262# ls
attr       clear_refs       cpuset   fd       limits    mountinfo   numa_maps  pagemap      schedstat  stat    task
autogroup  cmdline          cwd      fdinfo   loginuid  mounts      oom_adj    personality  sessionid  statm   wchan
auxv       comm             environ  io       maps      mountstats  oom_score  root         smaps      status
cgroup     coredump_filter  exe      latency  mem       net         oom_score_adj  sched    stack      syscall
root@hyperspace:/proc/32262# cd task/
root@hyperspace:/proc/32262/task# ls
32262  32263  32264  32265  32269  32289  32290  32291  32361  32593  32644
root@hyperspace:/proc/32262/task# cat 32262/stack
[<ffffffffa05f4294>] nfs_wait_bit_killable+0x24/0x40 [nfs]
[<ffffffffa060342a>] nfs_commit_inode+0x9a/0x280 [nfs]
[<ffffffffa05f1824>] nfs_release_page+0x84/0xa0 [nfs]
[<ffffffff8110b702>] try_to_release_page+0x32/0x50
[<ffffffff8111da77>] shrink_page_list+0x537/0x5c0
[<ffffffff8111deea>] shrink_inactive_list+0x11a/0x450
[<ffffffff8111e772>] shrink_zone+0x1f2/0x2b0
[<ffffffff8111eb44>] shrink_zones+0x74/0xe0
[<ffffffff8111ec3e>] do_try_to_free_pages+0x8e/0x2d0
[<ffffffff8111f0d7>] try_to_free_pages+0x87/0x110
[<ffffffff81114654>] __alloc_pages_nodemask+0x494/0x840
[<ffffffff81149f45>] alloc_pages_current+0xa5/0x110
[<ffffffff8104428b>] pte_alloc_one+0x1b/0x50
[<ffffffff8112dd95>] __pte_alloc+0x35/0x100
[<ffffffff81131d09>] handle_mm_fault+0x129/0x250
[<ffffffff815c6a47>] do_page_fault+0x1a7/0x540
[<ffffffff815c34d5>] page_fault+0x25/0x30
[<ffffffff814c8f38>] skb_copy_datagram_iovec+0x148/0x2a0
[<ffffffff81515aea>] tcp_recvmsg+0xa3a/0xbb0
[<ffffffff81538c8b>] inet_recvmsg+0x6b/0x80
[<ffffffff814bacc4>] sock_aio_read+0x164/0x170
[<ffffffff81165383>] do_sync_readv_writev+0xd3/0x110
[<ffffffff81165654>] do_readv_writev+0xd4/0x1e0
[<ffffffff81165805>] vfs_readv+0x45/0x60
[<ffffffff81165871>] sys_readv+0x51/0xc0
[<ffffffff8100c002>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
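For reference, the per-thread kernel stacks inspected above come from /proc/<pid>/task/<tid>/stack. A small illustrative helper that automates the manual cat loop might look roughly like the following; it is an assumption of this write-up (not something attached to the bug), and it requires root plus a kernel that exposes /proc/<pid>/stack (CONFIG_STACKTRACE):

/*
 * dump_task_stacks.c -- illustrative sketch: print the kernel stack of every
 * thread of a given process, i.e. what the comments above do by hand.
 *
 *   cc -o dump_task_stacks dump_task_stacks.c
 *   ./dump_task_stacks 32262
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

static void dump_one(const char *pid, const char *tid)
{
    char path[128], line[512];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%s/task/%s/stack", pid, tid);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return;
    }
    printf("=== thread %s ===\n", tid);
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
}

int main(int argc, char **argv)
{
    char taskdir[64];
    struct dirent *de;
    DIR *d;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 2;
    }
    snprintf(taskdir, sizeof(taskdir), "/proc/%s/task", argv[1]);
    d = opendir(taskdir);
    if (!d) {
        perror(taskdir);
        return 1;
    }
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;           /* skip "." and ".." */
        dump_one(argv[1], de->d_name);
    }
    closedir(d);
    return 0;
}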
Hello guys. Sorry for the delay. As the cat command got frozen too, this is really a stalled read. I believe the problem needs to be fixed somewhere in the lower levels, so I would suggest changing the component to glibc or kernel (or selinux?). Could you try to reproduce it with SELinux disabled? Anyway ... I'm changing the component to glibc and we'll see what the glibc maintainer says.

Regards, Jaromir.
Everything here looks like a kernel bug to me; reassigning to the kernel team.
(In reply to Raghavendra Bhat from comment #0)
> open("/proc/11270/cmdline", O_RDONLY) = 5
> read(5,
>
> stalls here forever.

Apparently proc_pid_cmdline_read() sleeps in down_read(&mm->mmap_sem).

> /var/log/messages has the following information:
>
> Nov 6 04:44:17 RHEL6 kernel: INFO: task rhn_check:11270 blocked for more
> than 120 seconds.
...
> Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8126e773>] call_rwsem_down_write_failed+0x13/0x20
> Nov 6 04:44:17 RHEL6 kernel: [<ffffffff814dca22>] ? down_write+0x32/0x40
> Nov 6 04:44:17 RHEL6 kernel: [<ffffffff8112debc>] sys_mmap_pgoff+0x5c/0x2d0
> Nov 6 04:44:17 RHEL6 kernel: [<ffffffff810104e9>] sys_mmap+0x29/0x30

and rhn_check too can't take the same semaphore.
Nagios nrpe plugin triggers this readily on fully loaded nodes.
(In reply to Shwetha Panduranga from comment #7)
> Even I have hit the same issue.
> ps has hung when it got the pid of glusterfs client process.
>
> This is the output of /proc/<glusterfs pid>/task/stack
>
> 1610 1611 1612 1613 1616 1617 1618 1619 1620 3434 3435
> [04/16/12 - 21:46:21 root@APP-CLIENT1 1610]# cat ../1619/stack
> [<ffffffffa033ad54>] rpc_wait_bit_killable+0x24/0x40 [sunrpc]
> [<ffffffffa033ad2d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
> [<ffffffffa03a7c0a>] nfs_initiate_commit+0xfa/0x120 [nfs]
> [<ffffffffa03a9039>] nfs_commit_inode+0x199/0x250 [nfs]
> [<ffffffffa03960c6>] nfs_release_page+0x86/0xa0 [nfs]
> [<ffffffff8110fe60>] try_to_release_page+0x30/0x60
> [<ffffffff8112a2a1>] shrink_page_list.clone.0+0x4f1/0x5c0
> [<ffffffff8112a66b>] shrink_inactive_list+0x2fb/0x740
> [<ffffffff8112b37f>] shrink_zone+0x38f/0x520
> [<ffffffff8112b60e>] do_try_to_free_pages+0xfe/0x520
> [<ffffffff8112bc1d>] try_to_free_pages+0x9d/0x130
> [<ffffffff81123b9d>] __alloc_pages_nodemask+0x40d/0x940
> [<ffffffff81158c7a>] alloc_pages_vma+0x9a/0x150
> [<ffffffff81171bb5>] do_huge_pmd_anonymous_page+0x145/0x370
> [<ffffffff8113c52a>] handle_mm_fault+0x25a/0x2b0
> [<ffffffff8113c6aa>] __get_user_pages+0x12a/0x430
> [<ffffffff8113ca49>] get_user_pages+0x49/0x50

And this sleeps with ->mmap_sem held for reading. If another process then tries to take it for writing (the sys_mmap() in comment #0), the down_read() in proc_pid_cmdline_read() will queue behind that waiting writer and block.
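To make the locking order concrete: mmap_sem is a reader/writer semaphore, and once a writer is queued on it, later readers wait behind that writer even though the current holder only took it for reading. The following user-space sketch is only an analogy (it uses a glibc writer-preferring pthread_rwlock via the GNU extension PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP, not the kernel's rwsem; thread roles and timings are made up) of the chain described here: the page-fault path holds the lock for reading while stuck in NFS, sys_mmap() queues for writing, and the later /proc/<pid>/cmdline read then blocks behind the writer.

/*
 * rwsem_analogy.c -- user-space analogy of the blocked chain, not kernel code.
 *
 *   cc -o rwsem_analogy rwsem_analogy.c -lpthread
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t sem;

static void *writer(void *arg)          /* plays the role of sys_mmap() */
{
    (void)arg;
    sleep(1);                           /* let the first reader get in */
    printf("writer: waiting for write lock...\n");
    pthread_rwlock_wrlock(&sem);
    printf("writer: got write lock\n");
    pthread_rwlock_unlock(&sem);
    return NULL;
}

static void *late_reader(void *arg)     /* plays proc_pid_cmdline_read() */
{
    (void)arg;
    sleep(2);                           /* arrive after the writer has queued */
    printf("late reader: trying read lock (will block behind the writer)\n");
    pthread_rwlock_rdlock(&sem);
    printf("late reader: got read lock\n");
    pthread_rwlock_unlock(&sem);
    return NULL;
}

int main(void)
{
    pthread_rwlockattr_t attr;
    pthread_t w, r;

    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setkind_np(&attr,
        PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&sem, &attr);

    /* First reader: the page-fault path that holds the lock for reading
     * while it is stuck waiting for the NFS commit. */
    pthread_rwlock_rdlock(&sem);
    printf("first reader: holding read lock (simulating stuck NFS commit)\n");

    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, late_reader, NULL);

    sleep(5);                           /* nobody makes progress until... */
    printf("first reader: releasing read lock\n");
    pthread_rwlock_unlock(&sem);

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    pthread_rwlock_destroy(&sem);
    return 0;
}

When run, the late reader does not print "got read lock" until the first reader releases the lock and the writer has gone through, which mirrors why ps and cat sit blocked for as long as the NFS commit never completes.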
(In reply to Tore H. Larsen from comment #13)
> Nagios nrpe plugin triggers this readily on fully loaded nodes.

Sorry, I don't know what nrpe is ;) I think that the gluster/nfs people should look at this. Once again, ->mmap_sem is held by handle_mm_fault() callers, and it seems that nfs_release_page() hangs. Could you re-assign this bug to someone who knows this area?
(In reply to Tore H. Larsen from comment #13)
> Nagios nrpe plugin triggers this readily on fully loaded nodes.

Tore, are you hitting this without glusterfs or fuse? It looks like a fuse-side problem, since the mmap_sem is shared. There are several upstream fixes that skip the waiting in nfs_release_page(), but I had only thought they were necessary for loopback mounts.
Is this related to solution 470653? Please add the bug # to the solution.

Yes, I'm hitting it without glusterfs. The nodes do have the FhGFS (BeeGFS) client. We moved away from Nagios because nrpe got stuck when accessing /proc; we now use Ganglia and PCP (pmcd) to monitor the nodes.
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com/