Bug 1636912
| Summary: | PVC still in pending state after node shutdown and start - heketi kube exec layer got stuck | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rachael <rgeorge> |
| Component: | heketi | Assignee: | John Mulligan <jmulligan> |
| Status: | CLOSED ERRATA | QA Contact: | Nitin Goyal <nigoyal> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | ocs-3.11 | CC: | asriram, hchiramm, jcall, jmulligan, knarra, kramdoss, madam, mmariyan, murali.kottakota, nigoyal, nravinas, rgeorge, rhs-bugs, rtalur, storage-qa-internal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | OCS 3.11.z Batch Update 4 | Flags: | knarra: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | heketi-9.0.0-3.el7rhgs | Doc Type: | Bug Fix |
| Doc Text: | Previously, when Heketi executed commands within OpenShift/Kubernetes pods, the commands were run without a timeout, so a command that never returned could block indefinitely; this differed from the SSH executor, which always runs commands with a timeout. With this update, commands executed in the gluster containers have a timeout, and the timeout values are the same regardless of the connection type used. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-30 12:34:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1707226 | | |
Description
Rachael, 2018-10-08 09:08:03 UTC
Hi, we are also facing the following issue, mentioned in a case on OpenShift Origin, while creating PVCs for pods. (Please provide a workaround to move forward; restarting the pod does not help.)

https://bugzilla.redhat.com/show_bug.cgi?id=1630117
https://bugzilla.redhat.com/show_bug.cgi?id=1636912

```
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.
```

We create only one volume at a time, and only when in-flight operations are zero, yet as soon as a volume is requested the in-flight count reaches 8. Now not a single volume can be created, although around 10 volumes were already created on this setup. Please find the heketi db dump and log below.

```
[negroni] Completed 200 OK in 98.699µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 106.654µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 185.406µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 102.664µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 192.658µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 198.611µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 124.254µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 101.491µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 116.997µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 100.171µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 109.238µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/28 06:50:57 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 191.118µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 188.791µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 94.436µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 110.893µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 112.132µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 96.15µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 112.682µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 140.543µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 182.066µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 151.572µs
```

Kernel version:

```
[root@app2 ~]# uname -a
Linux app2.matrix.nokia.com 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
```

Hardware: HP GEN8

OS:

```
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
```

Please let me know if you need more information.

*** Bug 1658250 has been marked as a duplicate of this bug. ***

*** Bug 1656910 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3255