Bug 1536511 - Gluster pod with 850 volumes fails to come up after node reboot
Summary: Gluster pod with 850 volumes fails to come up after node reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhgs-server-container
Version: cns-3.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 3.11
Assignee: Saravanakumar
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks: 1627104 1629575
 
Reported: 2018-01-19 14:51 UTC by Rachael
Modified: 2018-10-24 05:59 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-24 05:57:39 UTC
Embargoed:


Attachments (Terms of Use)
Output of oc get pods (68.43 KB, image/png)
2018-01-19 14:51 UTC, Rachael


Links
System ID Private Priority Status Summary Last Updated
Github gluster gluster-containers pull 104 0 None None None 2018-09-26 08:48:27 UTC
Red Hat Bugzilla 1589277 1 None None None 2021-09-09 14:29:39 UTC
Red Hat Bugzilla 1623433 0 unspecified CLOSED Brick fails to come online after shutting down and restarting a node 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1627104 0 urgent CLOSED cant deploy gluster with crio because LVM commands fail 2022-03-13 15:32:14 UTC
Red Hat Product Errata RHBA-2018:2990 0 None None None 2018-10-24 05:59:07 UTC

Internal Links: 1589277 1623433 1627104

Description Rachael 2018-01-19 14:51:27 UTC
Created attachment 1383446 [details]
Output of oc get pods

Description of problem:
On a 3-node CNS setup, 850 volumes were created and one of the nodes in the storage pool was rebooted. After the reboot, the gluster pods failed to come up. This behaviour was seen across two different setups. The same test was successful with 700 volumes.
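For reference, a minimal sketch of how the 850-volume workload could be generated with heketi-cli (the endpoint, credentials, volume size, and naming below are assumptions, not taken from this setup):

  # Point heketi-cli at the heketi service (assumed route and port).
  export HEKETI_CLI_SERVER=http://heketi-storage.example.com:8080
  export HEKETI_CLI_USER=admin
  export HEKETI_CLI_KEY=<admin key>

  # Create 850 small volumes to reach the reported scale.
  for i in $(seq 1 850); do
      heketi-cli volume create --size=1 --name="testvol-${i}"
  done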


Version-Release number of selected component (if applicable):
cns-deploy-5.0.0-57.el7rhgs.x86_64
heketi-client-5.0.0-19.el7rhgs.x86_64
glusterfs-3.8.4-54.el7rhgs.x86_64

How reproducible:


Steps to Reproduce:
1. Create 850 volumes
2. Restart one of the nodes from the storage pool


Actual results:
Gluster pod fails to come up


Expected results:
Gluster pod should be up and running


Additional info:

Comment 13 Raghavendra Talur 2018-01-24 06:19:36 UTC
Karthick has found a change in behavior: when LVM commands are executed on the host nodes, they don't list the PVs/VGs/LVs that were created inside the gluster pod. However, after refreshing the cache, they are listed. For example:

[root@dhcp46-91 ~]# vgs                                    
  VG                      #PV #LV #SN Attr   VSize   VFree   
  docker-vg                 1   1   0 wz--n- <50.00g   30.00g
  rhel_dhcp47-104           1   3   0 wz--n- <99.00g    4.00m
  vg_rhel_dhcp47--104-var   1   1   0 wz--n- <60.00g 1020.00m <----- missing entities that were created inside container
[root@dhcp46-91 ~]# vgscan --cache
  Reading volume groups from cache.
  Found volume group "vg_94345e8bdc012ce2fb6d02a12287d635" using metadata type lvm2
  Found volume group "vg_1bda4fff7754ee439915247d44b2c460" using metadata type lvm2
  Found volume group "vg_b4e1ea155b348fad42a9299b1821abdc" using metadata type lvm2
  Found volume group "docker-vg" using metadata type lvm2
  Found volume group "vg_2ad2fbc4d4d335f67b9420c8b6f09f94" using metadata type lvm2
  Found volume group "rhel_dhcp47-104" using metadata type lvm2
  Found volume group "vg_rhel_dhcp47--104-var" using metadata type lvm2
  Found volume group "vg_b942df6a4d890312a07848d3ae574db1" using metadata type lvm2
[root@dhcp46-91 ~]# vgs
  VG                                  #PV #LV #SN Attr   VSize   VFree   
  docker-vg                             1   1   0 wz--n- <50.00g   30.00g
  rhel_dhcp47-104                       1   3   0 wz--n- <99.00g    4.00m
  vg_1bda4fff7754ee439915247d44b2c460   1 166   0 wz--n-  99.87g    4.11g
  vg_2ad2fbc4d4d335f67b9420c8b6f09f94   1 564   0 wz--n- 299.87g   15.66g
  vg_94345e8bdc012ce2fb6d02a12287d635   1 188   0 wz--n- 599.87g   92.00m
  vg_b4e1ea155b348fad42a9299b1821abdc   1 352   0 wz--n- 299.87g <122.49g
  vg_b942df6a4d890312a07848d3ae574db1   1 192   0 wz--n- 299.87g  203.11g
  vg_rhel_dhcp47--104-var               1   1   0 wz--n- <60.00g 1020.00m <---- now they are seen


RCA:
The LVM commands inside the container are explicitly configured not to rely on the cache (a consequence is that they don't update the cache either).
LVM commands on the host still refer to the cache, so any LVM change made inside the container is not visible to host LVM commands until the cache is refreshed (e.g. with vgscan --cache).
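For anyone debugging on the host, a minimal sketch of that refresh step (only vgscan --cache appears in the output above; pvscan --cache is an additional standard LVM command assumed here to also be useful):

  # Refresh the host's LVM metadata cache so that PVs/VGs/LVs created
  # inside the gluster container become visible to host-side commands.
  pvscan --cache
  vgscan --cache

  # The container-created volume groups and logical volumes now show up.
  vgs
  lvs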

I don't see any functional impact from this. It might, however, be a little confusing to admins/engineers when debugging a setup or performing recovery on the host after errors. We should probably document this.

Comment 23 Michael Adam 2018-09-20 20:45:11 UTC
POST since there's a proposed patch already.

Comment 25 Michael Adam 2018-09-20 20:52:24 UTC
Proposing this for 3.11.0

Comment 31 errata-xmlrpc 2018-10-24 05:57:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2990

