Bug 1335378
| Summary: | self-heal is not happening and even heal info command is hung | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RajeshReddy <rmekala> |
| Component: | replicate | Assignee: | Ashish Pandey <aspandey> |
| Status: | CLOSED NOTABUG | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | medium | | |
| Version: | rhgs-3.1 | CC: | amukherj, aspandey, mzywusko, pkarampu, rcyriac, rhs-bugs, sabose, sasundar |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-17 09:35:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1277939 | | |
Description (RajeshReddy, 2016-05-12 06:48:33 UTC)
sosreports are available on rhsqe-repo.lab.eng.blr.redhat.com @/home/repo/sosreports/bug.1335378

There seems to be an issue with the unlock not being sent to the brick, even when the connection from client to server is available. I also took some statedumps, which are important. I am not yet sure which condition can lead to this bug; it is in either the client or server translators.

As per the brick logs:

```
[root@cambridge bricks]# grep yarrow.lab.eng.blr.redhat.com rhgs-vmaddldiskbrick-vmaddl-brick1.log | grep gf_client_unref | grep -v client-0-0-0
[2016-05-11 09:42:00.587879] I [MSGID: 101055] [client_t.c:420:gf_client_unref] 0-volume2-server: Shutting down connection yarrow.lab.eng.blr.redhat.com-4889-2016/05/11-09:32:06:839412-volume2-client-0-0-1
[2016-05-11 11:56:54.945712] I [MSGID: 101055] [client_t.c:420:gf_client_unref] 0-volume2-server: Shutting down connection yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-8

[root@cambridge bricks]# grep yarrow.lab.eng.blr.redhat.com rhgs-vmaddldiskbrick-vmaddl-brick1.log | grep setvolume | grep -v client-0-0-0
[2016-05-10 09:03:09.154583] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-1 (version: 3.7.9)
[2016-05-10 09:03:09.236874] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-6805-2016/05/10-08:37:19:10587-volume2-client-0-0-1 (version: 3.7.9)
[2016-05-11 07:56:53.538451] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-2 (version: 3.7.9)
[2016-05-11 09:32:07.295064] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-3 (version: 3.7.9)
[2016-05-11 09:42:00.132986] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-4 (version: 3.7.9)
[2016-05-11 09:42:00.491152] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-4889-2016/05/11-09:32:06:839412-volume2-client-0-0-1 (version: 3.7.9)
[2016-05-11 09:50:26.619117] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-5 (version: 3.7.9)
[2016-05-11 11:27:36.296603] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-6 (version: 3.7.9)
[2016-05-11 11:35:44.525578] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-7 (version: 3.7.9)
[2016-05-11 11:53:19.309196] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-8 (version: 3.7.9)
[2016-05-11 11:56:50.256846] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-9 (version: 3.7.9)
```

The new connection to the brick happened at [2016-05-11 11:56:50.256846] and is still active.
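The correlation above (clients accepted via server_setvolume vs. connections torn down via gf_client_unref) can also be done mechanically. This is a hypothetical triage helper, not part of gluster; the regexes assume only the log line shapes quoted above, and the sample lines are taken verbatim from the brick log in this bug.

```python
import re

# "accepted client from <connection-id> (version: ...)" lines
ACCEPT_RE = re.compile(r"accepted client from (\S+) \(version")
# "Shutting down connection <connection-id>" lines
UNREF_RE = re.compile(r"Shutting down connection (\S+)$")

def live_connections(log_lines):
    """Return connection IDs that were accepted but never shut down."""
    accepted, closed = set(), set()
    for line in log_lines:
        m = ACCEPT_RE.search(line)
        if m:
            accepted.add(m.group(1))
            continue
        m = UNREF_RE.search(line)
        if m:
            closed.add(m.group(1))
    return accepted - closed

log = [
    "[2016-05-11 11:53:19.309196] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-8 (version: 3.7.9)",
    "[2016-05-11 11:56:50.256846] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-volume2-server: accepted client from yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-9 (version: 3.7.9)",
    "[2016-05-11 11:56:54.945712] I [MSGID: 101055] [client_t.c:420:gf_client_unref] 0-volume2-server: Shutting down connection yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-8",
]

# the -0-0-9 connection is accepted but never unref'd
print(live_connections(log))
```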
But both fxattrop and finodelk failed with "Transport endpoint is not connected":

```
[2016-05-11 11:56:54.944907] W [MSGID: 114031] [client-rpc-fops.c:1917:client3_3_fxattrop_cbk] 0-volume2-client-0: remote operation failed
[2016-05-11 11:56:54.945728] W [MSGID: 108001] [afr-transaction.c:729:afr_handle_quorum] 0-volume2-replicate-0: 9805e9cb-8c28-4c1d-aaf1-326e331d23f8: Failing FXATTROP as quorum is not met
[2016-05-11 11:56:54.945763] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 0-volume2-client-0: remote operation failed [Transport endpoint is not connected]
[2016-05-11 11:56:54.946131] E [MSGID: 133016] [shard.c:631:shard_update_file_size_cbk] 0-volume2-shard: Update to file size xattr failed on 9805e9cb-8c28-4c1d-aaf1-326e331d23f8 [Read-only file system]
```

This is the stale lock:

```
[xlator.features.locks.volume2-locks.inode]
path=/8c8b8d20-54d5-41cb-8401-699ee877537b/images/67da0afc-e687-435e-a16d-88c56d876dcc/ed6c954a-e5f5-4e09-959d-0759c566bb65.lease
mandatory=0
inodelk-count=4
lock-dump.domain.domain=volume2-replicate-0:self-heal
lock-dump.domain.domain=volume2-replicate-0
lock-dump.domain.domain=volume2-replicate-0:metadata
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 16355, owner=d8b8d88ed37f0000, client=0x7ffbb81136a0, connection-id=yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-9, granted at 2016-05-11 11:56:54 <<------ Unlock failed because of transport endpoint not connected. Which is the bug.
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 18446744073709551610, owner=70a43167907f0000, client=0x7ffbb80f00d0, connection-id=cambridge.lab.eng.blr.redhat.com-21500-2016/05/11-11:53:19:288932-volume2-client-0-0-0, blocked at 2016-05-12 03:28:55
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 18446744073709551610, owner=80c447afb27f0000, client=0x7ffbb80f2980, connection-id=moonshine.lab.eng.blr.redhat.com-14476-2016/05/11-11:53:21:329730-volume2-client-0-0-0, blocked at 2016-05-12 03:28:55
inodelk.inodelk[3](BLOCKED)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 15709, owner=5d3d0000, client=0x7ffbb0021200, connection-id=yarrow.lab.eng.blr.redhat.com-15709-2016/05/12-03:29:21:720440-volume2-client-0-0-0, blocked at 2016-05-12 03:29:21
```

This bug happened once, and I don't think we have clear steps to re-create the issue. So far we do not have a root cause for why it happened. Can we try to get a consistent reproducer here?

Do we have a reproducer or shall we close this?

(In reply to Sahina Bose from comment #8)
> Do we have a reproducer or shall we close this?

I am hitting this issue again, where I see that self-heal info is hung. I have tested with an RHGS 3.2.0 interim build (glusterfs-3.8.4-5.el7rhgs). These are the steps I did:

1. Created a replica 3 volume.
2. Optimized the volume for the VM store use case.
3. Created a VM (which uses gfapi to access its disk).
4. Started OS installation.
5. While the OS installation was happening, killed the brick (kill <pid>) on server1.
6. Observed heal-info reporting unhealed entries.
7. Created another VM and installed an OS in it too.
8. Brought the brick back (by force-starting the volume).
9. Once the brick was up (confirmed with gluster volume status), heal-info just did not respond.

(In reply to SATHEESARAN from comment #9)
> I am hitting this issue again, where I see that self-heal info is hung.
This issue is seen with compound-fops only, which was not the reason this bug was raised initially. I will raise a separate bug for that issue. Closing this bug.
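For anyone triaging a similar hang, the inodelk entries in a brick statedump can be parsed mechanically to surface ACTIVE locks whose connection-id no longer corresponds to a live client. A minimal sketch, assuming only the statedump line format quoted in this bug; `parse_inodelks` is a hypothetical helper, not a gluster tool.

```python
import re

# Matches lines like:
#   inodelk.inodelk[0](ACTIVE)=type=WRITE, ..., pid = 16355,
#   owner=d8b8d88ed37f0000, ..., connection-id=<id>, granted at ...
LOCK_RE = re.compile(
    r"inodelk\.inodelk\[(?P<idx>\d+)\]\((?P<state>ACTIVE|BLOCKED)\)="
    r".*?pid = (?P<pid>\d+), owner=(?P<owner>[0-9a-f]+), "
    r".*?connection-id=(?P<conn>[^,]+),"
)

def parse_inodelks(dump_lines):
    """Extract index, state, pid, owner, and connection-id of each inodelk."""
    locks = []
    for line in dump_lines:
        m = LOCK_RE.search(line)
        if m:
            locks.append(m.groupdict())
    return locks

# Sample lines taken verbatim from the statedump quoted in this bug.
dump = [
    "inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 16355, owner=d8b8d88ed37f0000, client=0x7ffbb81136a0, connection-id=yarrow.lab.eng.blr.redhat.com-7343-2016/05/10-08:37:23:721784-volume2-client-0-0-9, granted at 2016-05-11 11:56:54",
    "inodelk.inodelk[3](BLOCKED)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 15709, owner=5d3d0000, client=0x7ffbb0021200, connection-id=yarrow.lab.eng.blr.redhat.com-15709-2016/05/12-03:29:21:720440-volume2-client-0-0-0, blocked at 2016-05-12 03:29:21",
]

# ACTIVE locks are the stale-lock suspects: cross-check each connection-id
# against the brick's live connections.
stale_suspects = [l for l in parse_inodelks(dump) if l["state"] == "ACTIVE"]
print(stale_suspects)
```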