Bug 1536511

Summary: Gluster pod with 850 volumes fails to come up after node reboot
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: rhgs-server-container
Reporter: Rachael <rgeorge>
Assignee: Saravanakumar <sarumuga>
QA Contact: Rachael <rgeorge>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Version: cns-3.6
Target Release: OCS 3.11
Target Milestone: ---
Keywords: Patch
Hardware: Unspecified
OS: Unspecified
CC: bmohanra, hchiramm, kramdoss, madam, ndevos, pprakash, rcyriac, rgeorge, rhs-bugs, rtalur, sankarshan, wmeng
Bug Blocks: 1627104, 1629575
Last Closed: 2018-10-24 05:57:39 UTC
Type: Bug
Attachments: Output of oc get pods

Description Rachael 2018-01-19 14:51:27 UTC
Created attachment 1383446 [details]
Output of oc get pods

Description of problem:
On a 3-node CNS setup, 850 volumes were created and one of the nodes in the storage pool was rebooted. After the reboot, the gluster pod on that node failed to come up. This behaviour was seen across two different setups. The same test was successful with 700 volumes.


Version-Release number of selected component (if applicable):
cns-deploy-5.0.0-57.el7rhgs.x86_64
heketi-client-5.0.0-19.el7rhgs.x86_64
glusterfs-3.8.4-54.el7rhgs.x86_64

How reproducible:


Steps to Reproduce:
1. Create 850 volumes
2. Restart one of the nodes from the storage pool


Actual results:
Gluster pod fails to come up


Expected results:
Gluster pod should be up and running


Additional info:

Comment 13 Raghavendra Talur 2018-01-24 06:19:36 UTC
Karthick has found a change in behavior. When LVM commands are executed on the host nodes, they don't list the PVs/VGs/LVs that were created inside the gluster pod. However, after refreshing the cache, they are listed. For example:

[root@dhcp46-91 ~]# vgs                                    
  VG                      #PV #LV #SN Attr   VSize   VFree   
  docker-vg                 1   1   0 wz--n- <50.00g   30.00g
  rhel_dhcp47-104           1   3   0 wz--n- <99.00g    4.00m
  vg_rhel_dhcp47--104-var   1   1   0 wz--n- <60.00g 1020.00m <----- missing entities that were created inside container
[root@dhcp46-91 ~]# vgscan --cache
  Reading volume groups from cache.
  Found volume group "vg_94345e8bdc012ce2fb6d02a12287d635" using metadata type lvm2
  Found volume group "vg_1bda4fff7754ee439915247d44b2c460" using metadata type lvm2
  Found volume group "vg_b4e1ea155b348fad42a9299b1821abdc" using metadata type lvm2
  Found volume group "docker-vg" using metadata type lvm2
  Found volume group "vg_2ad2fbc4d4d335f67b9420c8b6f09f94" using metadata type lvm2
  Found volume group "rhel_dhcp47-104" using metadata type lvm2
  Found volume group "vg_rhel_dhcp47--104-var" using metadata type lvm2
  Found volume group "vg_b942df6a4d890312a07848d3ae574db1" using metadata type lvm2
[root@dhcp46-91 ~]# vgs
  VG                                  #PV #LV #SN Attr   VSize   VFree   
  docker-vg                             1   1   0 wz--n- <50.00g   30.00g
  rhel_dhcp47-104                       1   3   0 wz--n- <99.00g    4.00m
  vg_1bda4fff7754ee439915247d44b2c460   1 166   0 wz--n-  99.87g    4.11g
  vg_2ad2fbc4d4d335f67b9420c8b6f09f94   1 564   0 wz--n- 299.87g   15.66g
  vg_94345e8bdc012ce2fb6d02a12287d635   1 188   0 wz--n- 599.87g   92.00m
  vg_b4e1ea155b348fad42a9299b1821abdc   1 352   0 wz--n- 299.87g <122.49g
  vg_b942df6a4d890312a07848d3ae574db1   1 192   0 wz--n- 299.87g  203.11g
  vg_rhel_dhcp47--104-var               1   1   0 wz--n- <60.00g 1020.00m <---- now they are seen


RCA:
LVM commands inside the container are explicitly configured not to rely on the metadata cache (a consequence is that they don't update the cache either).
LVM commands on the host still refer to the cache. Hence any LVM change made inside the container is not visible to host LVM commands until "--cache" is used to force a rescan.

I don't see any functional impact from this. However, it may confuse admins/engineers when debugging setups or performing recovery on the host after errors. We probably need to document this.
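For context, the container-side behavior described in the RCA is typically controlled through the container's lvm.conf. A plausible sketch is below; the exact settings shipped in the rhgs-server container image are an assumption here, not something confirmed in this bug:

```
# Assumed fragment of /etc/lvm/lvm.conf inside the gluster container:
global {
    # Do not use the lvmetad metadata caching daemon;
    # scan devices directly on each LVM command instead.
    # Side effect: the container never updates the host's cache.
    use_lvmetad = 0
}
```

The host, which still consults the cache, then needs an explicit rescan (as shown above in this comment) to see VGs created inside the pod:

```
# On the host node, force a re-read of LVM metadata from disk:
vgscan --cache
vgs
```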

Comment 23 Michael Adam 2018-09-20 20:45:11 UTC
POST since there's a proposed patch already.

Comment 25 Michael Adam 2018-09-20 20:52:24 UTC
Proposing this for 3.11.0

Comment 31 errata-xmlrpc 2018-10-24 05:57:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2990