Bug 1653572

Summary: On node shutdown, app pods running on that node go into "Unknown"/"ContainerCreating" state and remain that way until the node is brought back up

Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Manisha Saini <msaini>
Component: kubernetes
Assignee: Humble Chirammal <hchiramm>
Status: CLOSED WONTFIX
QA Contact: Prasanth <pprakash>
Severity: high
Docs Contact:
Priority: unspecified
Version: ocs-3.11
CC: aos-bugs, aos-storage-staff, hchiramm, jarrpa, jmulligan, jokerman, knarra, kramdoss, madam, mmccomas, pprakash, rhs-bugs, rtalur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1653606 (view as bug list)
Environment:
Last Closed: 2020-02-27 18:52:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1653606, 1707226

Description Manisha Saini 2018-11-27 07:22:12 UTC
Description of problem:

On performing a node shutdown, the app pods that were running on that node go into "Unknown"/"ContainerCreating" state. They remain in that state for as long as the node is powered off; when the node is brought back up, the pods start running as expected.

While the node is shut down, the user cannot access these pods. Ideally, shutting down a node should cause its pods to move to another working node.

Snippet of one of the app pods not coming up:

The following events were observed on the pod that went into Unknown state:

cirros003-1-nnhm2

===========

  Type     Reason                  Age              From                                        Message
  ----     ------                  ----             ----                                        -------
  Normal   Scheduled               1h               default-scheduler                           Successfully assigned app-storage/cirros003-1-nnhm2 to dhcp47-141.lab.eng.blr.redhat.com
  Normal   SuccessfulAttachVolume  1h               attachdetach-controller                     AttachVolume.Attach succeeded for volume "pvc-cef582c8-f205-11e8-8abe-005056a5bb93"
  Normal   Pulled                  1h               kubelet, dhcp47-141.lab.eng.blr.redhat.com  Container image "cirros" already present on machine
  Normal   Created                 1h               kubelet, dhcp47-141.lab.eng.blr.redhat.com  Created container
  Normal   Started                 1h               kubelet, dhcp47-141.lab.eng.blr.redhat.com  Started container
  Warning  Unhealthy               1h (x2 over 1h)  kubelet, dhcp47-141.lab.eng.blr.redhat.com  Liveness probe failed: rpc error: code = 14 desc = grpc: the connection is unavailable

================

The pod that is trying to come up again shows the following message in oc describe:

cirros003-1-64bwx

================
Events:
  Type     Reason              Age               From                                       Message
  ----     ------              ----              ----                                       -------
  Normal   Scheduled           14m               default-scheduler                          Successfully assigned app-storage/cirros003-1-64bwx to dhcp46-38.lab.eng.blr.redhat.com
  Warning  FailedAttachVolume  14m               attachdetach-controller                    Multi-Attach error for volume "pvc-cef582c8-f205-11e8-8abe-005056a5bb93" Volume is already used by pod(s) cirros003-1-nnhm2
  Warning  FailedMount         1m (x6 over 12m)  kubelet, dhcp46-38.lab.eng.blr.redhat.com  Unable to mount volumes for pod "cirros003-1-64bwx_app-storage(66887542-f211-11e8-8abe-005056a5bb93)": timeout expired waiting for volumes to attach or mount for pod "app-storage"/"cirros003-1-64bwx". list of unmounted volumes=[cirros-vol]. list of unattached volumes=[cirros-vol default-token-bdqdk]
=================
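The Multi-Attach error above typically persists because the pod on the powered-off node is never cleaned up, so the attach/detach controller keeps the volume attached to the dead node. One commonly used manual workaround (not confirmed as the fix in this bug) is to force-delete the stuck pod so the volume can be detached and mounted by the replacement pod. The sketch below only echoes the command so it can be reviewed before running; the pod name and namespace are taken from the events above:

```shell
# Build the force-delete command for a pod stuck in Unknown state on a
# powered-off node, so the attach/detach controller can release its volume.
# Echoes rather than executes, so the command can be reviewed first.
# Disruptive: only safe when the old node is confirmed down.
force_delete_cmd() {
  pod=$1
  ns=$2
  echo "oc delete pod $pod -n $ns --grace-period=0 --force"
}

force_delete_cmd cirros003-1-nnhm2 app-storage
# prints: oc delete pod cirros003-1-nnhm2 -n app-storage --grace-period=0 --force
```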

Version-Release number of selected component (if applicable):


OCS version: 3.11.1 and OCP version: 3.11.1

# oc version
oc v3.11.43
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp47-89.lab.eng.blr.redhat.com:8443
openshift v3.11.43
kubernetes v1.11.0+d4cacc0


OCS images used for testing-

openshift_storage_glusterfs_heketi_image='brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-volmanager-rhel7:3.11.1-1'
   
openshift_storage_glusterfs_block_image='brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-gluster-block-prov-rhel7:3.11.1-1'
    
openshift_storage_glusterfs_image='brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-server-rhel7:3.11.1-1'

# rpm -qa | grep heketi
heketi-client-8.0.0-1.el7rhgs.x86_64
heketi-8.0.0-1.el7rhgs.x86_64


How reproducible:
2/2

Steps to Reproduce:
1. Create an OCP 3.11.1 and OCS 3.11.1 setup using the Ansible deployment scripts
2. Create 20 block PVCs with HA count=4
3. Create 20 app pods (cirros pods) using the PVCs created in step 2
4. Log in to one gluster pod and find the hostname of the node that hosts a brick of both the heketi volume and the block-hosting volume. Power off that node.
5. Wait for the app pods that were running on that node to spin up on another node
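The stuck pods in step 5 can be spotted by filtering "oc get pods" output for anything not in Running state. A minimal sketch, assuming the default columnar output (NAME READY STATUS RESTARTS AGE); the sample pod names are illustrative, mirroring the pods in this report:

```shell
# Print name and status of pods that are not Running, from
# `oc get pods`-style columnar output on stdin.
filter_stuck_pods() {
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Example with hypothetical output captured after the node shutdown
# (in a live cluster: oc get pods -n app-storage | filter_stuck_pods):
printf '%s\n' \
  'NAME                READY   STATUS              RESTARTS   AGE' \
  'cirros003-1-nnhm2   1/1     Unknown             0          1h' \
  'cirros003-1-64bwx   0/1     ContainerCreating   0          14m' \
  'cirros004-1-abcde   1/1     Running             0          1h' |
  filter_stuck_pods
# prints:
# cirros003-1-nnhm2 Unknown
# cirros003-1-64bwx ContainerCreating
```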

Actual results:

The pods never spin up on another node; instead they go into "Unknown"/"ContainerCreating" state, leaving those pods inaccessible.

Expected results:

Pods should spin up on some other node when the original node is powered off.

Additional info:

Detailed logs and an sosreport will be provided in the next comment.

Comment 4 Yaniv Kaul 2019-04-01 06:50:39 UTC
Latest status?

Comment 6 RamaKasturi 2019-05-03 05:41:53 UTC
Hello Humble,

    Placing the needinfo on me and clearing it on Manisha; I will be working on this bug and will update the status here.

Thanks
kasturi

Comment 8 Humble Chirammal 2019-05-06 08:20:43 UTC
Have you configured 'multipath' for these block PVCs?

If yes, what is the status of multipath when one node goes down?
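Path health for gluster-block multipath devices can be inspected on the app node with "multipath -ll". A minimal sketch that tallies active vs. failed paths from that output; the device map name, WWID, and path names below are hypothetical sample data, not taken from this bug:

```shell
# Count active vs. failed paths in `multipath -ll`-style output on stdin.
# Path lines look like:  `- 4:0:0:0 sdc 8:32 failed faulty running
count_paths() {
  awk '/ active ready /  { active++ }
       / failed faulty / { failed++ }
       END { printf "active=%d failed=%d\n", active, failed }'
}

# Hypothetical output for one block device after a node shutdown
# (in a live cluster: multipath -ll | count_paths):
count_paths <<'EOF'
mpatha (360014051234567890) dm-2 LIO-ORG ,TCMU device
size=1.0G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:0:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 4:0:0:0 sdc 8:32 failed faulty running
EOF
# prints: active=1 failed=1
```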

Comment 10 Humble Chirammal 2019-07-09 09:31:25 UTC
(In reply to Humble Chirammal from comment #8)
> Have you configured 'multipath' for these block PVCs?
> 
> If yes, what is the status of multipath when one node goes down?

Kasturi, apart from the information requested above, can we try this against the latest OCS builds?

Comment 11 RamaKasturi 2019-07-09 09:41:59 UTC
Hello Humble,

   A couple of questions before we actually try this out.

1) Is this fixed in any of the builds?

2) If yes, can you please put the FIV (Fixed In Version) on the bug and move it to ON_QA so that QE can verify it?

If not, do we know whether this has been fixed, and if so, would you want QE to give this a try?


Thanks
kasturi