Created attachment 1383446 [details]
Output of oc get pods

Description of problem:
On a 3-node CNS setup, 850 volumes were created and one of the nodes in the storage pool was rebooted. After the reboot, the gluster pods failed to come up. This behaviour was seen across two different setups. The same test was successful with 700 volumes.

Version-Release number of selected component (if applicable):
cns-deploy-5.0.0-57.el7rhgs.x86_64
heketi-client-5.0.0-19.el7rhgs.x86_64
glusterfs-3.8.4-54.el7rhgs.x86_64

How reproducible:

Steps to Reproduce:
1. Create 850 volumes
2. Restart one of the nodes in the storage pool

Actual results:
The gluster pod fails to come up

Expected results:
The gluster pod should be up and running

Additional info:
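For reference, one way the bulk volume creation can be scripted with heketi-cli (a minimal sketch; the server URL, admin key, and 1GiB size are placeholders, not the values from this setup):

# Point heketi-cli at the heketi service (URL and key are assumptions)
export HEKETI_CLI_SERVER=http://heketi-storage.example.com:8080
export HEKETI_CLI_USER=admin
export HEKETI_CLI_KEY=adminkey

# Create 850 replica-3 volumes of 1GiB each
for i in $(seq 1 850); do
    heketi-cli volume create --size=1 --replica=3
done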
Karthick has found a change in behavior: when LVM commands are executed on the host nodes, they do not list the PVs/VGs/LVs that were created inside the gluster pod. After refreshing the cache, they are listed. For example:

[root@dhcp46-91 ~]# vgs
  VG                      #PV #LV #SN Attr   VSize   VFree
  docker-vg                 1   1   0 wz--n- <50.00g   30.00g
  rhel_dhcp47-104           1   3   0 wz--n- <99.00g    4.00m
  vg_rhel_dhcp47--104-var   1   1   0 wz--n- <60.00g 1020.00m

<----- missing the entities that were created inside the container

[root@dhcp46-91 ~]# vgscan --cache
  Reading volume groups from cache.
  Found volume group "vg_94345e8bdc012ce2fb6d02a12287d635" using metadata type lvm2
  Found volume group "vg_1bda4fff7754ee439915247d44b2c460" using metadata type lvm2
  Found volume group "vg_b4e1ea155b348fad42a9299b1821abdc" using metadata type lvm2
  Found volume group "docker-vg" using metadata type lvm2
  Found volume group "vg_2ad2fbc4d4d335f67b9420c8b6f09f94" using metadata type lvm2
  Found volume group "rhel_dhcp47-104" using metadata type lvm2
  Found volume group "vg_rhel_dhcp47--104-var" using metadata type lvm2
  Found volume group "vg_b942df6a4d890312a07848d3ae574db1" using metadata type lvm2

[root@dhcp46-91 ~]# vgs
  VG                                  #PV #LV #SN Attr   VSize   VFree
  docker-vg                             1   1   0 wz--n- <50.00g   30.00g
  rhel_dhcp47-104                       1   3   0 wz--n- <99.00g    4.00m
  vg_1bda4fff7754ee439915247d44b2c460   1 166   0 wz--n-  99.87g    4.11g
  vg_2ad2fbc4d4d335f67b9420c8b6f09f94   1 564   0 wz--n- 299.87g   15.66g
  vg_94345e8bdc012ce2fb6d02a12287d635   1 188   0 wz--n- 599.87g   92.00m
  vg_b4e1ea155b348fad42a9299b1821abdc   1 352   0 wz--n- 299.87g <122.49g
  vg_b942df6a4d890312a07848d3ae574db1   1 192   0 wz--n- 299.87g  203.11g
  vg_rhel_dhcp47--104-var               1   1   0 wz--n- <60.00g 1020.00m

<---- now they are seen

RCA:
The LVM commands inside the container have been explicitly configured not to rely on the cache (a consequence is that they do not update the cache either). The LVM commands on the node still refer to the cache. Hence, any LVM change made inside the container will not be visible to the host's LVM commands unless "--cache" is used.

I don't see any functional impact from this, but when debugging setups or performing recovery on the host after errors, it might be confusing to admins/engineers. We probably need to document this.
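For anyone debugging on the host, a minimal sketch of refreshing the host-side view (assuming the host uses lvmetad, the RHEL 7 default; these scan commands only re-read metadata from disk and change nothing):

# Check whether the host relies on the lvmetad cache (1 = yes)
lvmconfig global/use_lvmetad

# Re-read PV and VG metadata so devices created inside the
# gluster pod become visible to host-side pvs/vgs/lvs
pvscan --cache
vgscan --cache

# Verify the container-created VGs are now listed
vgs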
POST since there's a proposed patch already.
Proposing this for 3.11.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2990