Description of problem:
When a volume is stopped while one node is rebooting, the rebooted node fails to pick up the correct status of the volume. After the rebooted node comes back up, it still shows the volume in the "Started" state.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-8.el7rhgs.x86_64
nfs-ganesha-2.4.1-2.el7rhgs.x86_64

How reproducible:

Steps to Reproduce:
1. Create a 4-node ganesha cluster on a 7-node gluster trusted pool and enable nfs-ganesha on it.
2. Create 4 distribute volumes.
3. Perform volume start and stop operations on different volumes.
4. Before rebooting any node, start all the volumes.
5. Reboot one of the nodes and, from one of the nodes that is still up, stop a volume:

[root@dhcp46-241 ganesha]# gluster v stop ganeshaVol5
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: ganeshaVol5: success
[root@dhcp46-241 ganesha]# showmount -e localhost
Export list for localhost:
/ganeshaVol4 (everyone)
/ganeshaVol3 (everyone)
/ganeshaVol1 (everyone)

When the rebooted node came back up, it still showed ganeshaVol5 in the Started state; on all the other nodes the volume is in the Stopped state. The following messages appear in gluster v status on the rebooted node:

Staging failed on dhcp46-241.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp46-219.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp47-45.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp46-232.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp47-33.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp46-110.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started

[root@dhcp47-3 ~]# gluster v status
Status of volume: ganeshaVol1
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/3    49162     0          Y       4236
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/3    49162     0          Y       30733
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/3      49152     0          Y       1820
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/3     49161     0          Y       26508
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data2/4    49163     0          Y       4256
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data2/4    49163     0          Y       30753
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data2/4      49153     0          Y       1827
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data2/4     49162     0          Y       26531
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data3/5    49164     0          Y       4276
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data3/5    49164     0          Y       30773
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data3/5      49154     0          Y       1840
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data3/5     49163     0          Y       26551

Task Status of Volume ganeshaVol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: ganeshaVol3
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/2    49159     0          Y       4137
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/2    49159     0          Y       30634
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/2      49155     0          Y       1854
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/2     49158     0          Y       26401
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data2/2    49160     0          Y       4157
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data2/2    49160     0          Y       30654
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data2/2      49156     0          Y       1847
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data2/2     49159     0          Y       26421
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data3/2    49161     0          Y       4177
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data3/2    49161     0          Y       30674
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data3/2      49157     0          Y       1872
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data3/2     49160     0          Y       26441

Task Status of Volume ganeshaVol3
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: ganeshaVol4
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/6    49156     0          Y       4060
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/6    49156     0          Y       30569
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/6      49158     0          Y       1883
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/6     49155     0          Y       26320
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data2/6    49157     0          Y       4080
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data2/6    49157     0          Y       30589
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data2/6      49159     0          Y       1889
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data2/6     49156     0          Y       26343
Brick dhcp46-219.lab.eng.blr.redhat.com:/mnt/data3/6    49158     0          Y       4100
Brick dhcp46-241.lab.eng.blr.redhat.com:/mnt/data3/6    49158     0          Y       30609
Brick dhcp47-3.lab.eng.blr.redhat.com:/mnt/data3/6      49160     0          Y       1906
Brick dhcp47-45.lab.eng.blr.redhat.com:/mnt/data3/6     49157     0          Y       26371

Task Status of Volume ganeshaVol4
------------------------------------------------------------------------------
There are no active volume tasks

Staging failed on dhcp46-241.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp46-219.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp47-45.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp46-232.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp47-33.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started
Staging failed on dhcp46-110.lab.eng.blr.redhat.com. Error: Volume ganeshaVol5 is not started

Status of volume: gluster_shared_storage
Gluster process                                                    TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp47-3.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick   49164     0          Y       1928
Brick dhcp46-219.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick 49155     0          Y       1817
Brick dhcp46-241.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick 49155     0          Y       28366
Self-heal Daemon on localhost                                      N/A       N/A        Y       6506
Self-heal Daemon on dhcp46-241.lab.eng.blr.redhat.com              N/A       N/A        Y       32053
Self-heal Daemon on dhcp46-219.lab.eng.blr.redhat.com              N/A       N/A        Y       5445
Self-heal Daemon on dhcp47-45.lab.eng.blr.redhat.com               N/A       N/A        Y       28102
Self-heal Daemon on dhcp46-232.lab.eng.blr.redhat.com              N/A       N/A        Y       12308
Self-heal Daemon on dhcp47-33.lab.eng.blr.redhat.com               N/A       N/A        Y       10595
Self-heal Daemon on dhcp46-110.lab.eng.blr.redhat.com              N/A       N/A        Y       5481

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp47-3 ~]# gluster v info

Volume Name: ganeshaVol1
Type: Distribute
Volume ID: d5568168-ec2c-445b-9747-b8ca8fcaba7c
Status: Started
Snapshot Count: 0
Number of Bricks: 12
Transport-type: tcp
Bricks:
Brick1: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/3
Brick2: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/3
Brick3: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/3
Brick4: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/3
Brick5: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data2/4
Brick6: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data2/4
Brick7: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data2/4
Brick8: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data2/4
Brick9: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data3/5
Brick10: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data3/5
Brick11: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data3/5
Brick12: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data3/5
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: off
ganesha.enable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: ganeshaVol3
Type: Distribute
Volume ID: c208643d-521d-4fcb-8768-0edd81f23ee6
Status: Started
Snapshot Count: 0
Number of Bricks: 12
Transport-type: tcp
Bricks:
Brick1: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/2
Brick2: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/2
Brick3: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/2
Brick4: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/2
Brick5: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data2/2
Brick6: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data2/2
Brick7: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data2/2
Brick8: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data2/2
Brick9: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data3/2
Brick10: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data3/2
Brick11: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data3/2
Brick12: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data3/2
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: off
ganesha.enable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: ganeshaVol4
Type: Distribute
Volume ID: e87dff35-b277-45ea-abb7-5a7e8d32f4e6
Status: Started
Snapshot Count: 0
Number of Bricks: 12
Transport-type: tcp
Bricks:
Brick1: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/6
Brick2: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/6
Brick3: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/6
Brick4: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/6
Brick5: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data2/6
Brick6: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data2/6
Brick7: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data2/6
Brick8: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data2/6
Brick9: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data3/6
Brick10: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data3/6
Brick11: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data3/6
Brick12: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data3/6
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: off
ganesha.enable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: ganeshaVol5
Type: Distribute
Volume ID: 1a6864e5-64b3-4b45-8a25-939895c630cf
Status: Started
Snapshot Count: 0
Number of Bricks: 12
Transport-type: tcp
Bricks:
Brick1: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/7
Brick2: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/7
Brick3: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/7
Brick4: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/7
Brick5: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data2/7
Brick6: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data2/7
Brick7: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data2/7
Brick8: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data2/7
Brick9: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data3/7
Brick10: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data3/7
Brick11: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data3/7
Brick12: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data3/7
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: off
ganesha.enable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: bcb7239e-1e56-41f1-a2cc-df94bb929fe9
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: dhcp47-3.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: dhcp46-219.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick3: dhcp46-241.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
nfs-ganesha: enable
cluster.enable-shared-storage: enable

[root@dhcp47-3 ~]# firewall-cmd --list-services
dhcpv6-client rpc-bind rquota high-availability mountd glusterfs nfs ssh nlm

[root@dhcp47-3 ~]# gluster peer status
Number of Peers: 6

Hostname: dhcp46-241.lab.eng.blr.redhat.com
Uuid: 1fe28c22-4b7c-4dcd-ae69-d572b66d2434
State: Peer in Cluster (Connected)

Hostname: dhcp46-219.lab.eng.blr.redhat.com
Uuid: 35ecc4c8-84b4-4ad6-a25c-8f411e1a1087
State: Peer in Cluster (Connected)

Hostname: dhcp47-45.lab.eng.blr.redhat.com
Uuid: ff3ba838-5350-44c2-954a-be74f65b4663
State: Peer in Cluster (Connected)

Hostname: dhcp46-232.lab.eng.blr.redhat.com
Uuid: 222f7028-81e4-45c6-8b2a-eac9fafef2eb
State: Peer in Cluster (Connected)

Hostname: dhcp47-33.lab.eng.blr.redhat.com
Uuid: e90fa3d9-58db-4d38-abbb-26d6158bc205
State: Peer in Cluster (Connected)

Hostname: dhcp46-110.lab.eng.blr.redhat.com
Uuid: d7c61834-17a0-430e-b27e-cf1dc4f3f3b0
State: Peer in Cluster (Connected)

[root@dhcp47-3 ganeshaVol5]# showmount -e localhost
Export list for localhost:
/ganeshaVol1 (everyone)
/ganeshaVol3 (everyone)
/ganeshaVol4 (everyone)
/ganeshaVol5 (everyone)

On the node from which the volume stop was performed:

Volume Name: ganeshaVol5
Type: Distribute
Volume ID: 1a6864e5-64b3-4b45-8a25-939895c630cf
Status: Stopped
Snapshot Count: 0
Number of Bricks: 12
Transport-type: tcp
Bricks:
Brick1: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/7
Brick2: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/7
Brick3: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/7
Brick4: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/7
Brick5: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data2/7
Brick6: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data2/7
Brick7: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data2/7
Brick8: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data2/7
Brick9: dhcp46-219.lab.eng.blr.redhat.com:/mnt/data3/7
Brick10: dhcp46-241.lab.eng.blr.redhat.com:/mnt/data3/7
Brick11: dhcp47-3.lab.eng.blr.redhat.com:/mnt/data3/7
Brick12: dhcp47-45.lab.eng.blr.redhat.com:/mnt/data3/7
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Actual results:
The rebooted node does not fetch the correct status of the volume from the other nodes.

Expected results:
The rebooted node should reflect the correct status of the volume.

Additional info:
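For reference, the reproduction flow above can be condensed into a shell sketch. Host and volume names are taken from this report; the brick paths in the create step are illustrative, and the 4-node nfs-ganesha cluster from step 1 is assumed to already be configured:

# Create and start 4 distribute volumes, exporting each via nfs-ganesha.
for i in 1 3 4 5; do
    gluster volume create ganeshaVol$i \
        dhcp46-219.lab.eng.blr.redhat.com:/mnt/data1/brick$i \
        dhcp46-241.lab.eng.blr.redhat.com:/mnt/data1/brick$i \
        dhcp47-3.lab.eng.blr.redhat.com:/mnt/data1/brick$i \
        dhcp47-45.lab.eng.blr.redhat.com:/mnt/data1/brick$i
    gluster volume start ganeshaVol$i
    gluster volume set ganeshaVol$i ganesha.enable on
done

# Exercise start/stop on different volumes, then ensure everything is started.
gluster --mode=script volume stop ganeshaVol3 && gluster volume start ganeshaVol3

# Reboot one node (run on that node):
reboot

# While it is down, stop one volume from a surviving node:
gluster --mode=script volume stop ganeshaVol5

# Once the rebooted node is back, compare the view of ganeshaVol5:
gluster volume info ganeshaVol5 | grep '^Status'   # rebooted node: Started (bug)
gluster volume info ganeshaVol5 | grep '^Status'   # any other node: Stopped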
I got a chance to look into the setup and found that, at the time of the friend update, the node that went through the reboot detected a higher version for volume ganeshaVol5 on its peer:

[2016-12-12 09:26:13.494987] I [MSGID: 106009] [glusterd-utils.c:2914:glusterd_compare_friend_volume] 0-management: Version of volume ganeshaVol5 differ. local version = 7, remote version = 8 on peer dhcp46-241.lab.eng.blr.redhat.com

The surprising part is that, after detecting this, glusterd did not update the volume info file with the latest data and continued with the stale volinfo. The log file does not indicate any failures for this. To analyze the issue further, I have a couple of requests:

1. Is it reproducible?
2. If yes, can we enable debug logging and share the logs?

As for whether this is a blocker for rhgs-3.2.0, my answer would be no: the test case is not something that will often be executed in production, i.e. rebooting a node and stopping a volume at the same time.

Please add your thoughts.
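To make the stale state visible directly, the on-disk volinfo can be compared across nodes. This is a sketch, assuming the stock glusterd working directory /var/lib/glusterd and the usual version/status keys in the per-volume info file; it is not taken from the report itself:

# Run on every node. If the analysis above is right, the rebooted node keeps
# the lower version= value and status=1 (Started), while the other peers show
# the bumped version and status=2 (Stopped).
grep -E '^(version|status)=' /var/lib/glusterd/vols/ganeshaVol5/info

# To capture the debug logs requested above, restart glusterd at DEBUG level,
# reproduce, and then collect the glusterd log from /var/log/glusterfs/:
systemctl stop glusterd
glusterd --log-level DEBUG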
To confirm whether the issue exists on a non-nfs-ganesha setup, I tried the steps below multiple times; it worked correctly for me every time.

Steps I did:
1. Created a 4-node cluster.
2. Created 4 distribute volumes using bricks from all 4 nodes and started all the volumes.
3. Rebooted one of the cluster nodes and at the same time stopped the volumes.
4. Checked the volume status on the rebooted node; it showed correctly (the volumes were in the Stopped state).

@Manisha, are the firewall rules on your setup persistent? If they are not persistent, the chances of hitting this issue are higher. Always make the firewall rules persistent for any node-reboot-related testing.
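For completeness, a minimal sketch of making the rules persistent with firewalld (service names taken from the firewall-cmd output in the description; both variants are standard firewall-cmd usage):

# Either re-add each required service permanently and reload...
for svc in glusterfs nfs mountd rpc-bind rquota nlm high-availability ssh; do
    firewall-cmd --permanent --add-service=$svc
done
firewall-cmd --reload

# ...or simply promote whatever is currently active at runtime:
firewall-cmd --runtime-to-permanent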
(In reply to Byreddy from comment #4)
> @Manisha, are the firewall rules on your setup persistent? If they are not
> persistent, the chances of hitting this issue are higher. Always make the
> firewall rules persistent for any node-reboot-related testing.

Byreddy, I had the same query about the firewalld rules. Manisha configured the firewalld rules via gdeploy, and the glusterfs service was added permanently.

<snip>
<msaini_> [root@dhcp47-3 ~]# firewall-cmd --list-services
<msaini_> dhcpv6-client rpc-bind rquota high-availability mountd glusterfs nfs ssh nlm
</snip>
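A quick way to verify that claim on any node is to compare the runtime and permanent service lists; they should match if the rules will survive a reboot:

firewall-cmd --list-services              # runtime configuration
firewall-cmd --permanent --list-services  # persistent configuration

# Empty diff output means the runtime rules are fully persisted:
diff <(firewall-cmd --list-services | tr ' ' '\n' | sort) \
     <(firewall-cmd --permanent --list-services | tr ' ' '\n' | sort)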
(In reply to Atin Mukherjee from comment #3)
> 1. Is it reproducible?
> 2. If yes, can we enable debug logging and share the logs?

I tried reproducing the same scenario again; the issue is reproducible. With a single volume the issue is not observed; in my scenario there were 4 volumes. With the same steps (creating the volumes and performing start and stop operations on them) I am able to hit the issue again.
Created attachment 1231084 [details]
Glusterd logs of the rebooted node
Created attachment 1231086 [details]
Glusterd logs of the node from which the volume stop was performed
Based on where the issue is reproducible, it looks like it exists only on an nfs-ganesha-configured setup. This scenario works well for me on a setup where nfs-ganesha is not configured.

@Manisha, you can try the same thing on the same setup without the nfs-ganesha configuration to isolate the problem.
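A sketch of taking nfs-ganesha out of the picture on the same setup, using the standard RHGS ganesha toggles (volume names from this report; run from any cluster node):

# Stop exporting each volume via ganesha, then tear down the ganesha cluster:
for v in ganeshaVol1 ganeshaVol3 ganeshaVol4 ganeshaVol5; do
    gluster volume set $v ganesha.enable off
done
gluster nfs-ganesha disable

# Rerun the reboot + volume stop scenario, then compare on the rebooted node:
gluster volume info ganeshaVol5 | grep '^Status'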
Here are a few additional data points that Manisha and I came up with from the testing and analysis:

1. The issue does not happen on a similar setup where NFS-Ganesha is not configured.
2. The issue does not happen on a single-volume setup.
3. The issue only happens if the node goes through a reboot; killing all gluster processes and bringing them back after performing the volume stop from another node does not cause any inconsistency in the data.
4. If more than one volume is stopped, the issue does not persist.

We still do not have enough RCA evidence to say what is going wrong here. However, IMO this test does not look like a frequent use case in production, and the bug can be deferred from rhgs-3.2.0 given that a workaround is available to correct the state.
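The workaround itself is not spelled out above. Based on the RCA notes (the rebooted node holds a stale volinfo while its peers have the newer version), a plausible way to correct the state is to force glusterd on the stale node to re-sync configuration from its peers; treat this as a sketch rather than a validated procedure:

# On the rebooted (stale) node:
systemctl restart glusterd

# If the stale entry survives a plain restart, a heavier variant is to move the
# stale volume's metadata aside and let glusterd import it afresh from the peers
# on startup (assumes the stock /var/lib/glusterd layout; keep the backup):
systemctl stop glusterd
mv /var/lib/glusterd/vols/ganeshaVol5 /root/ganeshaVol5.vols.bak
systemctl start glusterd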
The doc text is slightly edited for the release notes.
Anjana, for your awareness: this needs to be taken out of the Known Issues chapter.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days