Bug 1678446

Summary: gluster pod stuck in 0/1 state for more than 4 hours during an upgrade (on large scale - > 850 volumes) - possibly LVM scale issue
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RamaKasturi <knarra>
Component: rhgs-server-container
Assignee: Raghavendra Talur <rtalur>
Status: CLOSED DUPLICATE
QA Contact: Prasanth <pprakash>
Severity: high
Docs Contact:
Priority: unspecified
Version: ocs-3.11
CC: kramdoss, madam, pasik, puebele, rhs-bugs, rtalur, sankarshan, sarumuga
Target Milestone: ---
Keywords: Performance, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-05-07 11:45:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description RamaKasturi 2019-02-18 19:13:15 UTC
Description of problem:
While upgrading a setup from OCP 3.11 + OCS 3.11 to OCP 3.11 + OCS 3.11.1 with more than 850 volumes, the gluster pod remained stuck in the 0/1 state for more than 4 hours. Digging deeper showed that pvscan was hung, apparently because the number of LVs on the node is greater than 1000.
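One way to confirm the stall from inside the affected pod (the pod name below is a placeholder) is to look for a pvscan process that never exits and count the logical volumes it has to walk:

oc rsh <gluster-pod-name>
sh-4.2# ps -ef | grep [p]vscan      # a long-running pvscan here indicates the hang
sh-4.2# lvs --noheadings | wc -l    # on this setup the count was above 1000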


Version-Release number of selected component (if applicable):
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.refarch311.ocsqeblr.com:443
openshift v3.11.69
kubernetes v1.11.0+d4cacc0
sh-4.2# rpm -qa | grep glusterfs
glusterfs-libs-3.12.2-32.el7rhgs.x86_64
glusterfs-3.12.2-32.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-32.el7rhgs.x86_64
glusterfs-server-3.12.2-32.el7rhgs.x86_64
glusterfs-api-3.12.2-32.el7rhgs.x86_64
glusterfs-cli-3.12.2-32.el7rhgs.x86_64
glusterfs-fuse-3.12.2-32.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-32.el7rhgs.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Have a setup of OCP 3.11 + OCS 3.11 in an AWS environment.
2. Create 850 file volumes and 50 block volumes (see the PVC loop sketch after these steps).
3. Create 100 cirros pods attached to the file and block volumes.
4. Try upgrading the setup to the latest version, OCS 3.11.1.
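For step 2, the file PVCs can be created in a loop against the gluster file storage class. The sketch below is only illustrative; the storage class name (glusterfs-storage), claim size, and claim names are assumptions, not values taken from this setup, and the block volumes would be created the same way against the block storage class:

for i in $(seq 1 850); do
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: file-claim-$i
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: glusterfs-storage
EOF
done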

Actual results:
The gluster pod stays in the 0/1 state for more than four hours, and pvscan hangs because more than 1000 LVs are present.

Expected results:
The gluster pod should not remain in the 0/1 state, and pvscan should complete successfully.

Additional info:

Below is the workaround we followed to get the pod up and running (see the oc sketch after these steps):

1. Remove the glusterfs=storage-host label from the node.
2. Reboot the node (or stop and start the AWS instance).
3. Re-label the node with glusterfs=storage-host.
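In oc terms, the relabel portion of the workaround looks roughly like this (the node name is a placeholder):

oc label node <node-name> glusterfs-                  # step 1: remove the label from the node
# step 2: reboot the node, or stop and start the AWS instance
oc label node <node-name> glusterfs=storage-host      # step 3: re-add the label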

Comment 8 Yaniv Kaul 2019-04-14 14:20:42 UTC
Status?