Created attachment 1383446 [details]
Output of oc get pods

Description of problem:
On a 3-node CNS setup, 850 volumes were created and one of the nodes in the storage pool was rebooted. After the reboot, the gluster pods failed to come up. This behaviour was seen across two different setups. The same test was successful with 700 volumes.

Version-Release number of selected component (if applicable):
cns-deploy-5.0.0-57.el7rhgs.x86_64
heketi-client-5.0.0-19.el7rhgs.x86_64
glusterfs-3.8.4-54.el7rhgs.x86_64

How reproducible:

Steps to Reproduce:
1. Create 850 volumes
2. Restart one of the nodes in the storage pool

Actual results:
The gluster pod fails to come up

Expected results:
The gluster pod should be up and running

Additional info:
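For reference, one way the bulk volume creation can be scripted with heketi-cli (a minimal sketch; the server URL, admin key, and 1GiB size are placeholders, not the values from this setup):

# Point heketi-cli at the heketi service (URL and key are assumptions)
export HEKETI_CLI_SERVER=http://heketi-storage.example.com:8080
export HEKETI_CLI_USER=admin
export HEKETI_CLI_KEY=adminkey

# Create 850 replica-3 volumes of 1GiB each
for i in $(seq 1 850); do
    heketi-cli volume create --size=1 --replica=3
done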
Karthick has found a change in behavior: when LVM commands are executed on the host nodes, they do not list the PVs/VGs/LVs that were created inside the gluster pod. After refreshing the cache, they are listed. For example:

[root@dhcp46-91 ~]# vgs
  VG                      #PV #LV #SN Attr   VSize   VFree
  docker-vg                 1   1   0 wz--n- <50.00g   30.00g
  rhel_dhcp47-104           1   3   0 wz--n- <99.00g    4.00m
  vg_rhel_dhcp47--104-var   1   1   0 wz--n- <60.00g 1020.00m

<----- missing the entities that were created inside the container

[root@dhcp46-91 ~]# vgscan --cache
  Reading volume groups from cache.
  Found volume group "vg_94345e8bdc012ce2fb6d02a12287d635" using metadata type lvm2
  Found volume group "vg_1bda4fff7754ee439915247d44b2c460" using metadata type lvm2
  Found volume group "vg_b4e1ea155b348fad42a9299b1821abdc" using metadata type lvm2
  Found volume group "docker-vg" using metadata type lvm2
  Found volume group "vg_2ad2fbc4d4d335f67b9420c8b6f09f94" using metadata type lvm2
  Found volume group "rhel_dhcp47-104" using metadata type lvm2
  Found volume group "vg_rhel_dhcp47--104-var" using metadata type lvm2
  Found volume group "vg_b942df6a4d890312a07848d3ae574db1" using metadata type lvm2

[root@dhcp46-91 ~]# vgs
  VG                                  #PV #LV #SN Attr   VSize   VFree
  docker-vg                             1   1   0 wz--n- <50.00g   30.00g
  rhel_dhcp47-104                       1   3   0 wz--n- <99.00g    4.00m
  vg_1bda4fff7754ee439915247d44b2c460   1 166   0 wz--n-  99.87g    4.11g
  vg_2ad2fbc4d4d335f67b9420c8b6f09f94   1 564   0 wz--n- 299.87g   15.66g
  vg_94345e8bdc012ce2fb6d02a12287d635   1 188   0 wz--n- 599.87g   92.00m
  vg_b4e1ea155b348fad42a9299b1821abdc   1 352   0 wz--n- 299.87g <122.49g
  vg_b942df6a4d890312a07848d3ae574db1   1 192   0 wz--n- 299.87g  203.11g
  vg_rhel_dhcp47--104-var               1   1   0 wz--n- <60.00g 1020.00m

<---- now they are seen

RCA:
The LVM commands inside the container have been explicitly configured not to rely on the cache (a consequence is that they do not update the cache either). The LVM commands on the node still refer to the cache. Hence, any LVM change made inside the container will not be visible to the host's LVM commands unless "--cache" is used.

I don't see any functional impact from this, but when debugging setups or performing recovery on the host after errors, it might be confusing to admins/engineers. We probably need to document this.
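For anyone debugging on the host, a minimal sketch of refreshing the host-side view (assuming the host uses lvmetad, the RHEL 7 default; these scan commands only re-read metadata from disk and change nothing):

# Check whether the host relies on the lvmetad cache (1 = yes)
lvmconfig global/use_lvmetad

# Re-read PV and VG metadata so devices created inside the
# gluster pod become visible to host-side pvs/vgs/lvs
pvscan --cache
vgscan --cache

# Verify the container-created VGs are now listed
vgs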
POST since there's a proposed patch already.
Proposing this for 3.11.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2990