Bug 1609450
| Field | Value |
| --- | --- |
| Summary | Bricks are marked as down, after node reboot |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | glusterd |
| Version | rhgs-3.3 |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | SATHEESARAN <sasundar> |
| Assignee | Sanju <srakonde> |
| QA Contact | SATHEESARAN <sasundar> |
| CC | guillaume.pavese, moagrawa, nchilaka, rhs-bugs, sabose, sasundar, srakonde, storage-qa-internal, vbellur |
| Keywords | Reopened, ZStream |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Environment | RHHI |
| Doc Type | If docs needed, set a value |
| Clones | 1609451 (view as bug list) |
| Bug Blocks | 1609451 |
| Last Closed | 2020-07-08 07:09:20 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
Description (SATHEESARAN, 2018-07-28 01:24:48 UTC)
I have a suspicion around the following messages in the glusterd logs:

<snip>
[2018-07-27 15:49:12.961132] I [MSGID: 106493] [glusterd-rpc-ops.c:693:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 6715c775-6021-4f21-a669-83bee56e55c5
[2018-07-27 15:49:12.967504] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:12.972686] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data has disconnected from glusterd.
[2018-07-27 15:49:12.980700] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:12.986954] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine has disconnected from glusterd.
[2018-07-27 15:49:12.993857] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:13.000230] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/vmstore/vmstore has disconnected from glusterd.
</snip>

A workaround exists for this issue: after the node reboot, restart the gluster service on that node; 'gluster volume status' then reports the correct status.

Created attachment 1471171 [details]
glusterd log file from the rebooted node
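For reference, a minimal sketch of that workaround, assuming glusterd is managed by systemd on the rebooted node (the unit name and exact commands are not taken from this report):

```sh
# On the rebooted node whose bricks are shown as down:
systemctl restart glusterd      # restart the gluster management service
gluster volume status           # bricks should now be reported as online
```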
Sanju - did we try to reproduce this with the latest RHGS bits?

I did the following to reproduce the issue (the command sequence is consolidated in the sketch further below):

1. Formed a 3-node cluster running the RHGS 3.4.0 bits
2. Created three replica 3 volumes and started them
3. Enabled server quorum for all volumes: gluster volume set <volname> cluster.server-quorum-type server
4. Enabled client quorum for all volumes: gluster v set <volname> cluster.quorum-type auto
5. Rebooted one of the nodes
6. Started glusterd on the rebooted node
7. gluster v status shows all bricks online

@Sas, I couldn't reproduce this issue with the RHGS 3.4.0 bits. I'm in favour of closing this bug; need your inputs here.

I'm closing this. Please feel free to reopen if the issue persists.

I have hit the same issue while upgrading from RHHI-V 1.1 to RHHI-V 1.5:

* RHHI-V 1.1 - glusterfs-3.8.4-15.8.el7rhgs
* RHHI-V 1.5 - glusterfs-3.12.2-25.el7rhgs

Post upgrade, the RHVH node was rebooted. When the node came up, I issued gluster volume status and noticed that the bricks were reported as down, but on investigating the brick processes, they were up and running. So re-opening the bug.

Created attachment 1497834 [details]
glusterd.log
Attaching the glusterd.log as the issue is re-surfaced
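For clarity, a hedged sketch consolidating the reproduction attempt described above; the volume name, brick paths, and host names are illustrative placeholders, not values from the original comment:

```sh
# 3-node RHGS 3.4.0 cluster: create and start a replica 3 volume (placeholder names)
gluster volume create testvol replica 3 node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1
gluster volume start testvol

# Enable server-side and client-side quorum (steps 3 and 4 above)
gluster volume set testvol cluster.server-quorum-type server
gluster volume set testvol cluster.quorum-type auto

# Reboot one node; once it is back, start glusterd there and check the bricks
systemctl start glusterd
gluster volume status testvol
```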
Sas - To close down the loop, can you please provide us a setup where this can be replicated, so that we can take over and start debugging this? We need to prioritize this bug considering the nature of the problem reported.

Is it a temporary situation, that resolves itself after a short period of time?

(In reply to Yaniv Kaul from comment #29)
> Is it a temporary situation, that resolves itself after a short period of time?

No, it doesn't. It recovers after a glusterd restart on that particular HC node.
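As an aside, one way to confirm the mismatch described in this thread (glusterd reporting bricks as down while the brick processes are still running) is to compare the CLI view with the running glusterfsd processes on the affected node; the volume name is a placeholder:

```sh
# What glusterd believes about the bricks (may show them as offline)
gluster volume status testvol

# What is actually running: glusterfsd is the brick process
pgrep -af glusterfsd
```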