Bug 1652538

Summary:	upgrade of gluster pods from ocs 3.9 to ocs 3.11.1 fails
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	RamaKasturi <knarra>
Component:	rhgs-server-container	Assignee:	Saravanakumar <sarumuga>
Status:	CLOSED NOTABUG	QA Contact:	Prasanth <pprakash>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	ocs-3.11	CC:	kramdoss, madam, rhs-bugs, sankarshan
Target Milestone:	---	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-11-23 06:17:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description RamaKasturi 2018-11-22 10:38:08 UTC

Description of problem:
When I upgrade my gluster pods from 3.9 to 3.11.1, the upgrade fails with the errors below and the pod is stuck in the Terminating state.


LAST SEEN                       FIRST SEEN                      COUNT     NAME                                       KIND      SUBOBJECT                    TYPE      REASON      SOURCE                                      MESSAGE
2018-11-22 14:20:32 +0530 IST   2018-11-22 14:20:32 +0530 IST   1         glusterfs-storage-7mg42.1569661f82236cca   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Liveness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


2018-11-22 14:20:42 +0530 IST   2018-11-22 14:20:42 +0530 IST   1         glusterfs-storage-7mg42.15696621a751a461   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Readiness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


2018-11-22 14:20:57 +0530 IST   2018-11-22 14:20:32 +0530 IST   2         glusterfs-storage-7mg42.1569661f82236cca   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Liveness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


2018-11-22 14:21:07 +0530 IST   2018-11-22 14:20:42 +0530 IST   2         glusterfs-storage-7mg42.15696621a751a461   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Readiness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


2018-11-22 14:21:22 +0530 IST   2018-11-22 14:20:32 +0530 IST   3         glusterfs-storage-7mg42.1569661f82236cca   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Liveness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


2018-11-22 14:21:32 +0530 IST   2018-11-22 14:20:42 +0530 IST   3         glusterfs-storage-7mg42.15696621a751a461   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Readiness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


2018-11-22 14:21:47 +0530 IST   2018-11-22 14:20:32 +0530 IST   4         glusterfs-storage-7mg42.1569661f82236cca   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Liveness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


2018-11-22 14:21:57 +0530 IST   2018-11-22 14:20:42 +0530 IST   4         glusterfs-storage-7mg42.15696621a751a461   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Readiness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


2018-11-22 14:22:00 +0530 IST   2018-11-22 13:05:08 +0530 IST   3         glusterfs-storage-7mg42.156962022ca9bb1c   Pod       spec.containers{glusterfs}   Normal    Killing   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Killing container with id docker://glusterfs:Need to kill Pod
2018-11-22 14:22:00 +0530 IST   2018-11-22 14:22:00 +0530 IST   1         glusterfs-storage-7mg42.15696633e9d1f91b   Pod                 Warning   FailedKillPod   kubelet, dhcp46-23.lab.eng.blr.redhat.com   error killing pod: failed to "KillContainer" for "glusterfs" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

Version-Release number of selected component (if applicable):
[root@dhcp46-160 ~]# oc version

oc v3.9.51
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp46-160.lab.eng.blr.redhat.com:8443
openshift v3.9.51
kubernetes v1.9.1+a0ce1bc657



How reproducible:
Hit it once

Steps to Reproduce:
1. Install ocp3.9 + cns 3.9
2. Delete the gluster daemonset
3. Edit the gluster template to use the following image name and version:
- displayName: Daemonset Node Labels
  name: NODE_LABELS
  value: '{ "glusterfs": "storage-host" }'
- displayName: GlusterFS container image name
  name: IMAGE_NAME
  required: true
  value: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-server-rhel7
- displayName: GlusterFS container image version
  name: IMAGE_VERSION
  required: true
  value: 3.11.1
- description: A unique name to identify which heketi service manages this cluster,
    useful for running multiple heketi instances
  displayName: GlusterFS cluster name
  name: CLUSTER_NAME
  value: storage

4. Run the following command to create the ds again: oc process glusterfs | oc create -f -
5. Verify that the ds has been created.
[root@dhcp46-160 ~]# oc get ds
NAME                DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
glusterfs-storage   4         4         3         0            3           glusterfs=storage-host   1h
6. Now delete the gluster pod by running the command "oc delete pod <glusterfs_pod>"
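
The check in step 5 can also be scripted. The following is a minimal sketch, assuming the column order shown in the "oc get ds" output above (NAME, DESIRED, CURRENT, READY, UP-TO-DATE, AVAILABLE). The function names parse_oc_get_ds and not_ready_count are hypothetical helpers written for this illustration; they are not part of oc or any OpenShift tooling.

```python
def parse_oc_get_ds(output: str) -> dict:
    """Parse the plain-text table printed by `oc get ds` into a dict keyed by ds name.

    Assumes the first six whitespace-separated fields of each row are
    NAME, DESIRED, CURRENT, READY, UP-TO-DATE, AVAILABLE (as in step 5).
    """
    lines = [line for line in output.splitlines() if line.strip()]
    result = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        desired, current, ready, up_to_date, available = map(int, fields[1:6])
        result[fields[0]] = {
            "desired": desired,
            "current": current,
            "ready": ready,
            "up_to_date": up_to_date,
            "available": available,
        }
    return result


def not_ready_count(status: dict) -> int:
    """Number of desired pods that are not yet READY."""
    return max(status["desired"] - status["ready"], 0)


if __name__ == "__main__":
    # Output copied from step 5 of this report.
    sample = (
        "NAME                DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE\n"
        "glusterfs-storage   4         4         3         0            3           glusterfs=storage-host   1h\n"
    )
    ds = parse_oc_get_ds(sample)["glusterfs-storage"]
    print(not_ready_count(ds))  # prints 1
```

In this report the daemonset exists, but one of the four desired pods is not READY and none are UP-TO-DATE.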

Actual results:
The gluster pod is stuck in the Terminating state and the errors below are seen in the events.

[root@dhcp46-160 ~]# oc get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                       KIND      SUBOBJECT                    TYPE      REASON      SOURCE                                       MESSAGE
4m          2h           29        cirros-block-1-87d9l.1569625ea154505a      Pod       spec.containers{cirros}      Warning   Unhealthy   kubelet, dhcp46-160.lab.eng.blr.redhat.com   Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
3m          3h           53        glusterfs-storage-7mg42.156962022ca9bb1c   Pod       spec.containers{glusterfs}   Normal    Killing     kubelet, dhcp46-23.lab.eng.blr.redhat.com    Killing container with id docker://glusterfs:Need to kill Pod
1h          1h           4         glusterfs-storage-7mg42.1569661f82236cca   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com    Liveness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


1h        1h        4         glusterfs-storage-7mg42.15696621a751a461   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, dhcp46-23.lab.eng.blr.redhat.com   Readiness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""


1h        1h        6         glusterfs-storage-7mg42.15696633e9d1f91b   Pod                 Warning   FailedKillPod   kubelet, dhcp46-23.lab.eng.blr.redhat.com   error killing pod: failed to "KillContainer" for "glusterfs" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"


Expected results:
The pods should upgrade successfully to 3.11.1.

Additional info:

[root@dhcp46-160 ~]# oc get pods -l glusterfs
NAME                                          READY     STATUS        RESTARTS   AGE
glusterblock-storage-provisioner-dc-1-h26pt   1/1       Running       1          22h
glusterfs-storage-7mg42                       0/1       Terminating   1          4h
glusterfs-storage-878qw                       1/1       Running       1          3h
glusterfs-storage-bxv4h                       1/1       Running       1          4h
glusterfs-storage-glbfg                       1/1       Running       1          4h
heketi-storage-1-qwsxk                        1/1       Running       0          2h

[root@dhcp46-160 ~]# oc get nodes
NAME                                STATUS    ROLES     AGE       VERSION
dhcp46-107.lab.eng.blr.redhat.com   Ready     <none>    22h       v1.9.1+a0ce1bc657
dhcp46-160.lab.eng.blr.redhat.com   Ready     master    22h       v1.9.1+a0ce1bc657
dhcp46-222.lab.eng.blr.redhat.com   Ready     compute   22h       v1.9.1+a0ce1bc657
dhcp46-23.lab.eng.blr.redhat.com    Ready     compute   22h       v1.9.1+a0ce1bc657
dhcp46-236.lab.eng.blr.redhat.com   Ready     compute   22h       v1.9.1+a0ce1bc657
dhcp47-134.lab.eng.blr.redhat.com   Ready     compute   22h       v1.9.1+a0ce1bc657
dhcp47-147.lab.eng.blr.redhat.com   Ready     compute   22h       v1.9.1+a0ce1bc657
dhcp47-37.lab.eng.blr.redhat.com    Ready     compute   22h       v1.9.1+a0ce1bc657
dhcp47-80.lab.eng.blr.redhat.com    Ready     compute   22h       v1.9.1+a0ce1bc657


[root@dhcp46-160 ~]# docker images
REPOSITORY                                                                       TAG                 IMAGE ID            CREATED             SIZE
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-volmanager-rhel7   3.11.1              cba22f33b513        8 days ago          424 MB

Comment 2 RamaKasturi 2018-11-23 06:17:50 UTC

Closing this bug as it works for me. The problem was that the image version value did not have a "v" prefix.

- displayName: GlusterFS container image version
  name: IMAGE_VERSION
  required: true
  value: 3.11.1

I changed it to v3.11.1:

- displayName: GlusterFS container image version
  name: IMAGE_VERSION
  required: true
  value: v3.11.1
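
To catch this kind of mistake before creating the ds, the IMAGE_VERSION value could be checked up front. The following is a minimal sketch, assuming (based only on this bug) that the image tags in the registry carry a "v" prefix. The function names normalize_image_version and image_reference are hypothetical helpers written for this illustration; they are not part of the gluster template or of oc.

```python
def normalize_image_version(version: str) -> str:
    """Return the tag with a leading "v", e.g. "3.11.1" -> "v3.11.1".

    Assumption: registry tags use a "v" prefix, as found in this bug.
    Tags that do not start with a digit (e.g. "latest", "v3.11.1") are
    returned unchanged.
    """
    tag = version.strip()
    if not tag:
        raise ValueError("IMAGE_VERSION must not be empty")
    if tag[0].isdigit():
        return "v" + tag
    return tag


def image_reference(image_name: str, version: str) -> str:
    """Build the full image reference in name:tag form."""
    return f"{image_name}:{normalize_image_version(version)}"


if __name__ == "__main__":
    name = "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-server-rhel7"
    print(image_reference(name, "3.11.1"))
    # prints brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-server-rhel7:v3.11.1
```

With the value from the original template (3.11.1), the helper yields the tag that worked here (v3.11.1); a value that already has the prefix is left unchanged.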