| Summary: | [GLUSTERD] Failed to reflect the correct volume status on the rebooted node after doing volume stop while 1 node is down |
|---|---|
| Product: | Red Hat Gluster Storage |
| Component: | glusterd |
| Version: | rhgs-3.2 |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | urgent |
| Reporter: | Manisha Saini <msaini> |
| Assignee: | Atin Mukherjee <amukherj> |
| QA Contact: | Manisha Saini <msaini> |
| CC: | amukherj, asriram, bmohanra, jijoy, msaini, nchilaka, rhinduja, rhs-bugs, sasundar, storage-qa-internal, vbellur |
| Keywords: | ZStream |
| Flags: | amukherj: needinfo? (asriram) |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | Bug Fix |
| Doc Text: | In multi-node NFS-Ganesha configurations with multiple volumes, if a node was rebooted while a volume was stopped, volume status was reported incorrectly. This is resolved as of Red Hat Gluster Storage 3.4. |
| Story Points: | --- |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| Category: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Last Closed: | 2018-10-31 08:43:54 UTC |
**Description** (Manisha Saini, 2016-12-12 10:46:26 UTC)
---

**Comment 3** (Atin Mukherjee)

I got a chance to look into the setup and found that, at the time of the friend update, the node which went through the reboot saw a higher version for volume ganeshaVol5:

[2016-12-12 09:26:13.494987] I [MSGID: 106009] [glusterd-utils.c:2914:glusterd_compare_friend_volume] 0-management: Version of volume ganeshaVol5 differ. local version = 7, remote version = 8 on peer dhcp46-241.lab.eng.blr.redhat.com

The surprising part is that glusterd did not update the volume info file with the newer data and continued with the stale volinfo. The log file does not indicate any failure for this, so to analyze the issue further I have a couple of requests:

1. Is it reproducible?
2. If yes, can we enable debug logging and share the logs?

As to whether this is a blocker for rhgs-3.2.0, my answer would be no: the test case is not something that will be executed often in production, i.e. rebooting a node and stopping a volume at the same time. Please add your thoughts.

---

**Comment 4** (Byreddy)

Just to confirm whether the issue exists on a non NFS-Ganesha setup, I tried the steps below multiple times and it worked perfectly for me.

Steps I did:

1. Created a 4-node cluster.
2. Created 4 distribute volumes using bricks from all 4 nodes and started all the volumes.
3. Rebooted one of the cluster nodes and at the same time stopped the volumes.
4. Checked the volume status on the rebooted node; it was reported correctly (the volumes were in the stopped state).

@Manisha, are the firewall rules on your setup persistent? If they are not persistent, the chances of hitting this issue are higher. Always make the firewall rules persistent for any node-reboot related testing.

---

(In reply to Byreddy from comment #4)

Byreddy, I had the same query on the firewalld rules. Manisha has configured the firewalld rules via gdeploy, and the glusterfs service was added permanently.

<snip>
<msaini_> [root@dhcp47-3 ~]# firewall-cmd --list-services
<msaini_> dhcpv6-client rpc-bind rquota high-availability mountd glusterfs nfs ssh nlm
</snip>
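For reference, a minimal sketch of making the firewalld rules persistent so they survive a node reboot, as recommended in comment 4. The service names mirror the `firewall-cmd --list-services` output quoted above; the zone and exact service list are assumptions to adapt to the actual setup.

```sh
# Add the Gluster/NFS-Ganesha related services to the permanent (on-disk)
# firewalld configuration so they are restored automatically after a reboot.
for svc in glusterfs nfs rpc-bind mountd rquota nlm high-availability; do
    firewall-cmd --permanent --add-service="$svc"
done

# Load the permanent configuration into the running firewall.
firewall-cmd --reload

# Alternatively, if the rules were only added at runtime, copy the current
# runtime configuration into the permanent one.
firewall-cmd --runtime-to-permanent

# Verify that the expected services are listed.
firewall-cmd --list-services
```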
---

(In reply to Atin Mukherjee from comment #3)

I tried reproducing the same scenario again; the issue is reproducible. With a single volume the issue is not observed. In my scenario there were 4 volumes, and with the same steps (creating the volumes and doing start and stop on them) I am able to hit the issue again.

---

Created attachment 1231084 [details]: Glusterd logs of the rebooted node

Created attachment 1231086 [details]: Glusterd logs of the node from which the volume stop was performed
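As a side note on the two requests in comment 3, here is a rough sketch of how the glusterd debug logs and the on-disk volume version could be collected on the rebooted node. The paths assume the default /var/lib/glusterd layout, the volume name is taken from the log message in comment 3, and the exact log-level knob varies between builds, so treat this as an assumption to verify.

```sh
# Raise glusterd logging to DEBUG (assumption: this build reads LOG_LEVEL
# from /etc/sysconfig/glusterd; otherwise start glusterd with -LDEBUG).
sed -i 's/^LOG_LEVEL=.*/LOG_LEVEL=DEBUG/' /etc/sysconfig/glusterd
systemctl restart glusterd

# Collect the debug log (the file name differs by version, e.g.
# glusterd.log or etc-glusterfs-glusterd.vol.log).
ls /var/log/glusterfs/

# Compare the locally stored volume version with the copy on a healthy peer.
# glusterd_compare_friend_volume() compares exactly this "version" field.
grep '^version=' /var/lib/glusterd/vols/ganeshaVol5/info
ssh dhcp46-241.lab.eng.blr.redhat.com \
    "grep '^version=' /var/lib/glusterd/vols/ganeshaVol5/info"
```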
---

Based on the issue being reproducible, it looks like the issue exists on NFS-Ganesha configured setups. The scenario works well for me on a setup where NFS-Ganesha is not configured. @Manisha, you can try the same thing on the same setup without the NFS-Ganesha configuration to isolate the problem.

---

Here are a few additional data points which Manisha and I came up with from the testing and analysis results:

1. This issue does not happen on a similar setup where NFS-Ganesha is not configured.
2. This issue does not happen on a single-volume setup.
3. This issue only happens if the node goes through a reboot; killing all gluster processes and bringing them back after performing the volume stop from another node does not cause any inconsistency in the data.
4. If more than one volume is stopped, this issue does not persist.

We still do not have enough RCA evidence for what is going wrong here. However, IMO this test does not look like a frequent use case in production, and it can be deferred from rhgs-3.2.0 given that a workaround is available to correct the state.

---

The doc text is slightly edited for the release notes. Anjana, for your awareness, this needs to be taken out of the known issues chapter.
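For completeness, a rough sketch of the reproduction flow described in comment 4 and summarized in the data points above: a multi-volume cluster, one node rebooted while a volume is stopped from another node. Host names, brick paths, and volume names are placeholders, and the NFS-Ganesha configuration itself is omitted.

```sh
# On node1: create and start several distribute volumes spanning the 4 nodes
# (placeholder hosts and brick paths; run on a cluster that is already peered).
for v in ganeshaVol1 ganeshaVol2 ganeshaVol3 ganeshaVol4; do
    gluster volume create "$v" \
        node1:/bricks/"$v" node2:/bricks/"$v" node3:/bricks/"$v" node4:/bricks/"$v" force
    gluster volume start "$v"
done

# Reboot one node and, while it is down, stop a volume from another node.
ssh node4 reboot
gluster --mode=script volume stop ganeshaVol4

# Once node4 is back, compare what it reports with the rest of the cluster;
# both should show the volume as Stopped.
ssh node4 "gluster volume info ganeshaVol4"
gluster volume info ganeshaVol4
```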