Bug 1381831
Summary: | dom_md/ids is always reported in the self-heal info | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra>
Component: | arbiter | Assignee: | Ravishankar N <ravishankar>
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.2 | CC: | amukherj, jajeon, knarra, rhinduja, rhs-bugs, sasundar, storage-qa-internal, wenshi
Target Milestone: | --- | |
Target Release: | RHGS 3.2.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.8.4-3 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-03-23 06:08:12 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1351528 | |
Description by RamaKasturi, 2016-10-05 07:32:21 UTC
Extended attributes on the files which are reported in heal info:

extended attributes for the file on the engine volume from all nodes:
=====================================================================
[root@rhsqa-grafton1 ~]# getfattr -d -m . -e hex /rhgs/brick1/engine//53c84f1e-3643-45aa-805e-8c9e92ee3098/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick1/engine//53c84f1e-3643-45aa-805e-8c9e92ee3098/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x080000000000000057f3b7260004776c
trusted.gfid=0x496e047d725f4a0b87a131f47be477a9
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

[root@rhsqa-grafton2 ~]# getfattr -d -m . -e hex /rhgs/brick1/engine//53c84f1e-3643-45aa-805e-8c9e92ee3098/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick1/engine//53c84f1e-3643-45aa-805e-8c9e92ee3098/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-0=0x000000010000000000000000
trusted.bit-rot.version=0x040000000000000057f3b085000093a5
trusted.gfid=0x496e047d725f4a0b87a131f47be477a9
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

[root@rhsqa-grafton3 ~]# getfattr -d -m . -e hex /rhgs/brick1/engine//53c84f1e-3643-45aa-805e-8c9e92ee3098/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick1/engine//53c84f1e-3643-45aa-805e-8c9e92ee3098/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-0=0x000000050000000000000000
trusted.bit-rot.version=0x040000000000000057f3b0850000a228
trusted.gfid=0x496e047d725f4a0b87a131f47be477a9
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

extended attributes for the file on vmstore volume:
=================================================================
[root@rhsqa-grafton1 ~]# getfattr -d -m . -e hex /rhgs/brick3/vmstore/4f007a3a-612f-40db-8b07-6666e8259957/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick3/vmstore/4f007a3a-612f-40db-8b07-6666e8259957/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x070000000000000057f3b7260004756f
trusted.gfid=0x16a892f912294233aea514be469b926d
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

[root@rhsqa-grafton2 ~]# getfattr -d -m . -e hex /rhgs/brick3/vmstore/4f007a3a-612f-40db-8b07-6666e8259957/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick3/vmstore/4f007a3a-612f-40db-8b07-6666e8259957/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vmstore-client-0=0x000000060000000000000000
trusted.bit-rot.version=0x030000000000000057f3b08a00027c98
trusted.gfid=0x16a892f912294233aea514be469b926d
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

[root@rhsqa-grafton3 ~]# getfattr -d -m . -e hex /rhgs/brick3/vmstore/4f007a3a-612f-40db-8b07-6666e8259957/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick3/vmstore/4f007a3a-612f-40db-8b07-6666e8259957/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vmstore-client-0=0x000000060000000000000000
trusted.bit-rot.version=0x030000000000000057f3b08a00027f8e
trusted.gfid=0x16a892f912294233aea514be469b926d
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

gluster volume info details:
==================================
[root@rhsqa-grafton3 ~]# gluster volume info engine
Volume Name: engine
Type: Replicate
Volume ID: 03c68517-4be1-45e3-b788-87e10d73f3ee
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.79:/rhgs/brick1/engine
Brick2: 10.70.36.80:/rhgs/brick1/engine
Brick3: 10.70.36.81:/rhgs/brick1/engine (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
network.ping-timeout: 30
user.cifs: off
performance.strict-o-direct: on
auth.ssl-allow: 10.70.36.79,10.70.36.80,10.70.36.81
client.ssl: on
server.ssl: on

[root@rhsqa-grafton3 ~]# gluster volume info data
Volume Name: data
Type: Replicate
Volume ID: 03454b82-d4ea-4cf5-85c3-29bee7afd87f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.79:/rhgs/brick2/data
Brick2: 10.70.36.80:/rhgs/brick2/data
Brick3: 10.70.36.81:/rhgs/brick2/data (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
network.ping-timeout: 30
user.cifs: off
performance.strict-o-direct: on
auth.ssl-allow: 10.70.36.79,10.70.36.80,10.70.36.81
client.ssl: on
server.ssl: on

[root@rhsqa-grafton3 ~]# gluster volume info vmstore
Volume Name: vmstore
Type: Replicate
Volume ID: 16fb0e38-4a9c-4468-8a51-fa8dc5a8dc06
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.79:/rhgs/brick3/vmstore
Brick2: 10.70.36.80:/rhgs/brick3/vmstore
Brick3: 10.70.36.81:/rhgs/brick3/vmstore (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
network.ping-timeout: 30
user.cifs: off
performance.strict-o-direct: on
auth.ssl-allow: 10.70.36.79,10.70.36.80,10.70.36.81
client.ssl: on
server.ssl: on

sosreports are present in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1381822/

Following errors seen in engine mount log:
================================================
[2016-10-04 13:27:42.529680] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-engine-replicate-0: no subvolumes up
[2016-10-04 13:27:42.529730] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-engine-shard: stat failed: 496e047d-725f-4a0b-87a1-31f47be477a9 [Transport endpoint is not connected]
[2016-10-04 13:27:42.716866] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-engine-replicate-0: no subvolumes up
[2016-10-04 13:27:42.716918] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-engine-shard: stat failed: d27d0848-cf13-4aa6-a012-d2cc8f3b9a6a [Transport endpoint is not connected]
[2016-10-04 13:27:43.030224] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-engine-shard: stat failed: 496e047d-725f-4a0b-87a1-31f47be477a9 [Transport endpoint is not connected]
[2016-10-04 13:27:47.534122] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 339820: FSTAT() /53c84f1e-3643-45aa-805e-8c9e92ee3098/dom_md/ids => -1 (Transport endpoint is not connected)
[2016-10-04 13:27:57.229171] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 339862: FSTAT() /53c84f1e-3643-45aa-805e-8c9e92ee3098/images/f0c14312-7e49-464f-9660-3b629fb8b538/7efedd28-7a43-4142-8bf7-fe468376626f => -1 (Transport endpoint is not connected)
[2016-10-04 13:28:07.050637] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 339904: FSTAT() /53c84f1e-3643-45aa-805e-8c9e92ee3098/dom_md/ids => -1 (Transport endpoint is not connected)
[2016-10-04 13:28:10.753676] E [socket.c:2309:socket_connect_finish] 0-engine-client-0: connection to 10.70.36.79:24007 failed (Connection refused)
[2016-10-04 13:28:11.990676] E [glusterfsd-mgmt.c:1922:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.70.36.80 (No data available)
[2016-10-04 13:28:11.990716] I [glusterfsd-mgmt.c:1959:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.70.36.81
[2016-10-04 13:28:14.774318] E [socket.c:2309:socket_connect_finish] 0-engine-client-2: connection to 10.70.36.81:24007 failed (Connection refused)
[2016-10-04 13:28:14.778947] E [socket.c:2309:socket_connect_finish] 0-engine-client-1: connection to 10.70.36.80:24007 failed (Connection refused)
[2016-10-04 13:28:16.745810] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 339946: FSTAT() /53c84f1e-3643-45aa-805e-8c9e92ee3098/images/f0c14312-7e49-464f-9660-3b629fb8b538/7efedd28-7a43-4142-8bf7-fe468376626f => -1 (Transport endpoint is not connected)
[2016-10-04 13:28:22.814018] E [socket.c:2309:socket_connect_finish] 0-glusterfs: connection to 10.70.36.81:24007 failed (Connection refused)
[2016-10-04 13:28:22.814070] E [glusterfsd-mgmt.c:1922:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.70.36.81 (Transport endpoint is not connected)
[2016-10-04 13:28:22.814084] I [glusterfsd-mgmt.c:1939:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers

Following error messages are seen in vmstore mount log:
======================================================
[2016-10-04 13:28:10.152365] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:10.652808] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:10.652858] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:11.153299] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:11.153350] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:11.228410] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:11.653757] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:11.653771] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:11.990719] E [glusterfsd-mgmt.c:1922:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.70.36.80 (No data available)
[2016-10-04 13:28:11.990762] I [glusterfsd-mgmt.c:1959:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.70.36.81
[2016-10-04 13:28:12.154230] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:12.154303] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:12.654727] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:12.654780] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:13.155210] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:13.155264] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:13.610936] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:13.655747] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:13.655762] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:13.918741] E [socket.c:2309:socket_connect_finish] 0-vmstore-client-1: connection to 10.70.36.80:24007 failed (Connection refused)
[2016-10-04 13:28:14.156240] I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up
[2016-10-04 13:28:14.156317] E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]
[2016-10-04 13:28:16.928912] E [socket.c:2309:socket_connect_finish] 0-vmstore-client-2: connection to 10.70.36.81:24007 failed (Connection refused)
[2016-10-04 13:28:22.958364] E [socket.c:2309:socket_connect_finish] 0-glusterfs: connection to 10.70.36.81:24007 failed (Connection refused)
[2016-10-04 13:28:22.958424] E [glusterfsd-mgmt.c:1922:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.70.36.81 (Transport endpoint is not connected)
[2016-10-04 13:28:22.958438] I [glusterfsd-mgmt.c:1939:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "I [MSGID: 108006] [afr-common.c:4439:afr_local_init] 0-vmstore-replicate-0: no subvolumes up" repeated 24 times between [2016-10-04 13:28:14.156240] and [2016-10-04 13:28:24.665181]
The message "E [MSGID: 133014] [shard.c:1129:shard_common_stat_cbk] 0-vmstore-shard: stat failed: 16a892f9-1229-4233-aea5-14be469b926d [Transport endpoint is not connected]" repeated 21 times between [2016-10-04 13:28:14.156317] and [2016-10-04 13:28:24.665196]
[2016-10-04 13:28:24.784468] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f0a063fcdc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f0a07a92c45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f0a07a92abb] ) 0-: received signum (15), shutting down

Hi Kasturi, at first glance the afr xattrs for the 'ids' file (comment #2) seem to indicate a pending heal on the first brick, which is why heal info shows these entries. I'm not sure what the issue is. Is the file not getting healed when you run 'gluster vol heal volname' while all bricks and shds are up?

Ravi, I have not run 'gluster vol heal <vol_name>' because I was expecting the files to be healed automatically. 'gluster volume status' shows that all bricks and the shd are up. My setup is currently down due to an SSL issue; I will get back to you on this once the other issue is resolved.

Hi Kasturi, it looks like the sos reports in comment #4 are for BZ 1381822. Could you provide the links to the ones for this BZ?

Hi Ravi, the logs are the same; I created the directory name with the other BZ. Thanks, Kasturi.

In the client logs there are frequent disconnects from the bricks, and in some cases the client is unable to reconnect to client-0 after a disconnect, because glusterd fails to serve the client the port number of the brick due to BZ 1381822 (too many open files in glusterd). This is the likely reason for the constant healing needed on client-0 for the ids file. Moving the BZ to ON_QA.
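The pending-heal reading of the xattrs above can be checked by hand. As a rough sketch (assuming the conventional AFR changelog layout of three big-endian 32-bit counters for pending data, metadata, and entry operations; names below are illustrative, not from the bug report):

```python
import struct

def decode_afr_xattr(hex_value: str):
    """Decode a trusted.afr.* changelog value into (data, metadata, entry)
    pending-operation counters, assuming three big-endian uint32 fields."""
    raw = bytes.fromhex(hex_value.removeprefix("0x"))
    return struct.unpack(">III", raw[:12])

# Value reported on grafton2 for the engine volume's ids file:
data, metadata, entry = decode_afr_xattr("0x000000010000000000000000")
print(data, metadata, entry)  # 1 0 0
```

Under that assumption, grafton2's `trusted.afr.engine-client-0=0x00000001...` means one pending data operation blamed on the first brick, and grafton3's `0x00000005...` means five, which matches the "pending heal on the first brick" interpretation.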
Verified and works fine with build glusterfs-3.8.4-8.el7rhgs.x86_64.

Procedure 1:
================
1) Killed one of the bricks of the vmstore volume.
2) Created a new VM.
3) Brought the downed brick back by running 'gluster volume start <vol_name> force'.
4) Once the brick was back up, all the entries got healed and 'gluster volume heal vmstore info' reported nothing after some time.

Procedure 2:
==================
1) Created VMs on the setup.
2) Started running I/O using dd inside a VM.
3) Killed one of the data bricks.
4) After some time, brought the brick back up.
5) Heal completed successfully and heal info reported zero entries.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
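The "heal info reports zero entries" check in both procedures can be scripted rather than eyeballed. A minimal sketch, assuming heal info output of the usual "Brick ... / Number of entries: N" shape (the sample text below is a hypothetical capture, not taken from this bug's logs):

```python
import re

def total_heal_entries(heal_info_output: str) -> int:
    """Sum the 'Number of entries: N' counts across all bricks in
    'gluster volume heal <vol> info' output."""
    return sum(int(n) for n in
               re.findall(r"Number of entries:\s*(\d+)", heal_info_output))

# Hypothetical output after the downed brick was brought back up:
sample = """\
Brick 10.70.36.79:/rhgs/brick3/vmstore
Status: Connected
Number of entries: 0

Brick 10.70.36.80:/rhgs/brick3/vmstore
Status: Connected
Number of entries: 0

Brick 10.70.36.81:/rhgs/brick3/vmstore
Status: Connected
Number of entries: 0
"""
print(total_heal_entries(sample))  # 0
```

A wrapper could feed this from `subprocess.run(["gluster", "volume", "heal", "vmstore", "info"], ...)` and poll until the total drops to zero, which mirrors the "reported nothing after some time" observation.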