Bug 1678446 - gluster pod stuck in 0/1 state for more than 4 hours during an upgrade (on large scale - > 850 volumes) - possibly LVM scale issue
Summary: gluster pod stuck in 0/1 state for more than 4 hours during an upgrade (on large scale - > 850 volumes) - possibly LVM scale issue
Keywords:
Status: CLOSED DUPLICATE of bug 1676466
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhgs-server-container
Version: ocs-3.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Raghavendra Talur
QA Contact: Prasanth
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-18 19:13 UTC by RamaKasturi
Modified: 2019-05-08 11:18 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-07 11:45:14 UTC
Embargoed:



Description RamaKasturi 2019-02-18 19:13:15 UTC
Description of problem:
When trying to upgrade a setup from OCP 3.11 + OCS 3.11 to OCP 3.11 + OCS 3.11.1 with more than 850 volumes, the gluster pod was stuck in the 0/1 state for more than 4 hours. Digging deeper, we found that pvscan was hung, since the number of LVs on the node was greater than 1000.
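For reference, a minimal sketch of how the hang can be confirmed; the namespace and pod names below are assumptions for illustration:

# Find the gluster pod stuck in 0/1 (namespace name assumed)
oc get pods -n glusterfs -o wide

# Inside the pod, look for a hung pvscan (uninterruptible sleep shows
# state "D") and count the LVs; note that lvs itself may also block
# while pvscan is stuck
oc exec -n glusterfs <gluster-pod> -- ps -eo pid,stat,etime,cmd | grep pvscan
oc exec -n glusterfs <gluster-pod> -- lvs --noheadings | wc -l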


Version-Release number of selected component (if applicable):
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.refarch311.ocsqeblr.com:443
openshift v3.11.69
kubernetes v1.11.0+d4cacc0
sh-4.2# rpm -qa | grep glusterfs
glusterfs-libs-3.12.2-32.el7rhgs.x86_64
glusterfs-3.12.2-32.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-32.el7rhgs.x86_64
glusterfs-server-3.12.2-32.el7rhgs.x86_64
glusterfs-api-3.12.2-32.el7rhgs.x86_64
glusterfs-cli-3.12.2-32.el7rhgs.x86_64
glusterfs-fuse-3.12.2-32.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-32.el7rhgs.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Have a setup of OCP 3.11 + OCS 3.11 in an AWS environment.
2. Create 850 file volumes and 50 block volumes (see the sketch after this list).
3. Create 100 cirros pods attached to the file and block volumes.
4. Upgrade the setup to the latest version, OCS 3.11.1.
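A minimal sketch of step 2 for the file volumes, assuming a storage class named glusterfs-storage (the claim names, size, and storage class are illustrative; the block volumes would be created the same way against the corresponding block storage class):

for i in $(seq 1 850); do
  oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: file-claim-$i
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
done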

Actual results:
The gluster pod remains in the 0/1 state for more than four hours, and pvscan is hung because more than 1000 LVs are present on the node.

Expected results:
The gluster pod should become ready (1/1) and pvscan should complete successfully.

Additional info:

Below is the workaround we used to get the pod up and running (see the command sketch after this list):

1. Remove the glusterfs=storage-host label from the node.
2. Reboot the node (or stop and start the AWS instance).
3. Re-apply the glusterfs=storage-host label to the node.
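In oc terms, roughly (NODE is a placeholder for the affected node's name; the label key/value match the ones named in the workaround above):

# 1. Remove the label so the daemonset no longer schedules the pod here
oc label node NODE glusterfs-

# 2. Reboot the node, or stop and start the AWS instance

# 3. Re-apply the label so the gluster pod is scheduled again
oc label node NODE glusterfs=storage-host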

Comment 8 Yaniv Kaul 2019-04-14 14:20:42 UTC
Status?

