Bug 1127778

Summary: DHT :- file/Directory creation fails with 'Input/output error' for all files/Directories hashing to one sub-volume
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Rachana Patel <racpatel>
Component: distribute    Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED WORKSFORME QA Contact: Matt Zywusko <mzywusko>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rhgs-3.0    CC: mzywusko, nbalacha, pkarampu
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-31 15:42:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Rachana Patel 2014-08-07 14:26:29 UTC
Description of problem:
=======================
File/directory creation fails with 'Input/output error' for all files/directories hashing to one sub-volume.
The brick log has errors:
[2014-08-06 05:28:45.042158] C [inode.c:1151:__inode_path] 0-/brick1/inode: possible infinite loop detected, forcing break. name=((null))


Version-Release number of selected component (if applicable):
=============================================================
3.6.0.27-1.el6rhs.x86_64


How reproducible:
=================
only once


Steps to Reproduce:
===================
1. Had a distributed volume. Removed all data from the mount point, stopped the volume and glusterd, and upgraded the gluster RPMs.
2. Started the volume and mounted it again.
3. Created files and directories from the mount point.
Any creation hashing to one sub-volume fails with an error; a command-level sketch of this sequence follows the example below.


[root@OVM1 brick1]# touch abc3
touch: cannot touch `abc3': Input/output error
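
A minimal sketch of the reproduction sequence described above (the volume name test1 and mount path /mnt/test1 are placeholders, not taken from this report):

    # 1. Stop the volume and glusterd, then upgrade the gluster RPMs
    gluster volume stop test1
    service glusterd stop
    yum update 'glusterfs*'

    # 2. Start glusterd and the volume again, then remount
    service glusterd start
    gluster volume start test1
    mount -t glusterfs OVM1:/test1 /mnt/test1

    # 3. Create files/directories from the mount point
    touch /mnt/test1/abc3    # fails with 'Input/output error' when the name hashes to the affected sub-volume
    mkdir /mnt/test1/dir1    # same failure for directories hashing to that sub-volume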

--> Verified that all bricks were up and the logs do not have any disconnect errors.
--> Verified on the bricks that there is no data on that sub-volume. Also verified the .glusterfs directory; it has no extra entries.
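
For reference, a hedged sketch of these checks (the volume name is a placeholder; /brick1 is the brick path seen in the logs):

    # Confirm all bricks are online
    gluster volume status test1

    # Look for disconnect messages in client and brick logs
    grep -i disconnect /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log

    # Confirm the suspect brick holds no data and .glusterfs has no stray entries
    ls -la /brick1/
    ls -la /brick1/.glusterfs/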


Actual results:
===============
Any operation hashing to one particular sub-volume fails with an I/O error.


Expected results:
=================
Creation and access of directories and files should not fail if all sub-volumes are up.

Additional info:
================

Workaround:
Restarting the brick process for that sub-volume resolved the problem.
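
A minimal sketch of that workaround (the volume name and brick PID are placeholders; 'gluster volume start <vol> force' respawns only bricks that are down):

    # Find the PID of the glusterfsd process serving the affected brick
    gluster volume status test1

    # Kill that brick process, then force-start the volume to respawn it
    kill <brick-pid>
    gluster volume start test1 force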


Brick log snippet:
[2014-08-06 05:28:45.042158] C [inode.c:1151:__inode_path] 0-/brick1/inode: possible infinite loop detected, forcing break. name=((null))
[2014-08-06 05:28:45.043287] C [inode.c:1151:__inode_path] 0-/brick1/inode: possible infinite loop detected, forcing break. name=(abc3)
[2014-08-06 05:28:45.043505] E [posix.c:134:posix_lookup] 0-brick1-posix: null gfid for path (null)
[2014-08-06 05:28:45.043543] E [posix.c:151:posix_lookup] 0-brick1-posix: lstat on (null) failed: Success
[2014-08-06 05:28:45.043571] E [server-rpc-fops.c:183:server_lookup_cbk] 0-brick1-server: 3415: LOOKUP (null) (00000000-0000-0000-0000-000000000001/abc3) ==> (Success)
[2014-08-06 05:28:45.044380] C [inode.c:1151:__inode_path] 0-/brick1/inode: possible infinite loop detected, forcing break. name=(abc3)
[2014-08-06 05:28:45.044623] E [posix.c:134:posix_lookup] 0-brick1-posix: null gfid for path (null)
[2014-08-06 05:28:45.044658] E [posix.c:151:posix_lookup] 0-brick1-posix: lstat on (null) failed: Success
[2014-08-06 05:28:45.044686] E [server-rpc-fops.c:183:server_lookup_cbk] 0-brick1-server: 3416: LOOKUP (null) (00000000-0000-0000-0000-000000000001/abc3) ==> (Success)
[2014-08-06 05:28:54.934111] C [inode.c:1151:__inode_path] 0-/brick1/inode: possible infinite loop detected, forcing break. name=((null))
[2014-08-06 05:28:54.935288] C [inode.c:1151:__inode_path] 0-/brick1/inode: possible infinite loop detected, forcing break. name=(abc3)
[2014-08-06 05:28:54.935498] E [posix.c:134:posix_lookup] 0-brick1-posix: null gfid for path (null)
[2014-08-06 05:28:54.935551] E [posix.c:151:posix_lookup] 0-brick1-posix: lstat on (null) failed: Success
[2014-08-06 05:28:54.935580] E [server-rpc-fops.c:183:server_lookup_cbk] 0-brick1-server: 3418: LOOKUP (null) (00000000-0000-0000-0000-000000000001/abc3) ==> (Success)

Comment 3 Nithya Balachandran 2014-08-12 10:37:28 UTC
Rachana, is this reproducible on a fresh setup? If it is, please let us know so we can gdb into the process and see what is going wrong.
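
For reference, a minimal sketch of attaching a debugger to the affected brick process (the volume name and PID are placeholders):

    # Find the PID of the glusterfsd process serving the affected brick
    gluster volume status test1

    # Attach gdb and dump backtraces of all brick threads
    gdb -p <brick-pid> -ex 'thread apply all bt' -ex detach -ex quit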

Comment 4 Pranith Kumar K 2014-08-13 08:45:10 UTC
posix_health_check_thread_proc keeps performing a stat on the brick root every 30 seconds. I see that stat on the brick on that xfs partition is returning Input/output error. Something must have happened to the xfs partition.
I see the following logs in the bricks. What happened to the xfs partition is something we need to debug, if we have dmesg or /var/log/messages from the machines; that info is not attached to the sosreports.

14:09:40 :) ⚡ grep-bricks -i "posix_health_check_thread_proc" | grep "2014-08"
172/var/log/glusterfs/bricks/brick0.log:[2014-08-06 08:13:27.557717] W [posix-helpers.c:1427:posix_health_check_thread_proc] 0-test1-posix: stat() on /brick0 returned: Input/output error
172/var/log/glusterfs/bricks/brick0.log:[2014-08-06 08:13:27.557800] M [posix-helpers.c:1447:posix_health_check_thread_proc] 0-test1-posix: health-check failed, going down
172/var/log/glusterfs/bricks/brick0.log:[2014-08-06 08:13:57.558133] M [posix-helpers.c:1452:posix_health_check_thread_proc] 0-test1-posix: still alive! -> SIGTERM
198/var/log/glusterfs/bricks/brick3-n1.log:[2014-08-06 05:54:28.153734] W [posix-helpers.c:1427:posix_health_check_thread_proc] 0-new-posix: stat() on /brick3/n1 returned: No such file or directory
198/var/log/glusterfs/bricks/brick3-n1.log:[2014-08-06 05:54:28.153771] M [posix-helpers.c:1447:posix_health_check_thread_proc] 0-new-posix: health-check failed, going down
198/var/log/glusterfs/bricks/brick3-n1.log:[2014-08-06 05:54:58.153968] M [posix-helpers.c:1452:posix_health_check_thread_proc] 0-new-posix: still alive! -> SIGTERM
198/var/log/glusterfs/bricks/brick3-n2.log:[2014-08-06 05:54:28.153576] W [posix-helpers.c:1427:posix_health_check_thread_proc] 0-new-posix: stat() on /brick3/n2 returned: No such file or directory
198/var/log/glusterfs/bricks/brick3-n2.log:[2014-08-06 05:54:28.153614] M [posix-helpers.c:1447:posix_health_check_thread_proc] 0-new-posix: health-check failed, going down
198/var/log/glusterfs/bricks/brick3-n2.log:[2014-08-06 05:54:58.153873] M [posix-helpers.c:1452:posix_health_check_thread_proc] 0-new-posix: still alive! -> SIGTERM
198/var/log/glusterfs/bricks/brick3-n3.log:[2014-08-06 05:54:28.153504] W [posix-helpers.c:1427:posix_health_check_thread_proc] 0-new-posix: stat() on /brick3/n3 returned: No such file or directory
198/var/log/glusterfs/bricks/brick3-n3.log:[2014-08-06 05:54:28.153605] M [posix-helpers.c:1447:posix_health_check_thread_proc] 0-new-posix: health-check failed, going down
198/var/log/glusterfs/bricks/brick3-n3.log:[2014-08-06 05:54:58.153875] M [posix-helpers.c:1452:posix_health_check_thread_proc] 0-new-posix: still alive! -> SIGTERM
198/var/log/glusterfs/bricks/brick3-n4.log:[2014-08-06 05:54:28.159848] W [posix-helpers.c:1427:posix_health_check_thread_proc] 0-new-posix: stat() on /brick3/n4 returned: No such file or directory
198/var/log/glusterfs/bricks/brick3-n4.log:[2014-08-06 05:54:28.159894] M [posix-helpers.c:1447:posix_health_check_thread_proc] 0-new-posix: health-check failed, going down
198/var/log/glusterfs/bricks/brick3-n4.log:[2014-08-06 05:54:58.160139] M [posix-helpers.c:1452:posix_health_check_thread_proc] 0-new-posix: still alive! -> SIGTERM
240/var/log/glusterfs/bricks/brick0.log:[2014-08-06 09:45:45.661795] W [posix-helpers.c:1427:posix_health_check_thread_proc] 0-test1-posix: stat() on /brick0 returned: Input/output error
240/var/log/glusterfs/bricks/brick0.log:[2014-08-06 09:45:45.661843] M [posix-helpers.c:1447:posix_health_check_thread_proc] 0-test1-posix: health-check failed, going down
240/var/log/glusterfs/bricks/brick0.log:[2014-08-06 09:46:15.662140] M [posix-helpers.c:1452:posix_health_check_thread_proc] 0-test1-posix: still alive! -> SIGTERM
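
A hedged sketch of the follow-up checks suggested above (/brick0 is taken from the logs; nothing here is from the sosreports):

    # Reproduce the failing health check by hand: stat the brick root directly
    stat /brick0

    # Look for XFS errors around the time the health check failed
    dmesg | grep -i xfs
    grep -i xfs /var/log/messages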

Comment 7 Nithya Balachandran 2015-12-31 15:42:35 UTC
This issue has not been seen again, so moving this to WORKSFORME. Please reopen if it is seen again.