Description of problem:
On performing a node shutdown, the app pods that were running on that node go into "Unknown"/"ContainerCreating" state. They remain in this state for as long as the node is powered off, but when the same node is brought back up, the pods start running as expected.
While the node is shut down, the user is not able to access these pods. Ideally, shutting down a node should move its pods to another working node.
Snippet of one of the app pods not coming up:
The following message was observed on the pod that went into Unknown state:
cirros003-1-nnhm2
===========
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1h default-scheduler Successfully assigned app-storage/cirros003-1-nnhm2 to dhcp47-141.lab.eng.blr.redhat.com
Normal SuccessfulAttachVolume 1h attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-cef582c8-f205-11e8-8abe-005056a5bb93"
Normal Pulled 1h kubelet, dhcp47-141.lab.eng.blr.redhat.com Container image "cirros" already present on machine
Normal Created 1h kubelet, dhcp47-141.lab.eng.blr.redhat.com Created container
Normal Started 1h kubelet, dhcp47-141.lab.eng.blr.redhat.com Started container
Warning Unhealthy 1h (x2 over 1h) kubelet, dhcp47-141.lab.eng.blr.redhat.com Liveness probe failed: rpc error: code = 14 desc = grpc: the connection is unavailable
================
The replacement pod, which is trying to come up, shows the following message in oc describe:
cirros003-1-64bwx
================
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 14m default-scheduler Successfully assigned app-storage/cirros003-1-64bwx to dhcp46-38.lab.eng.blr.redhat.com
Warning FailedAttachVolume 14m attachdetach-controller Multi-Attach error for volume "pvc-cef582c8-f205-11e8-8abe-005056a5bb93" Volume is already used by pod(s) cirros003-1-nnhm2
Warning FailedMount 1m (x6 over 12m) kubelet, dhcp46-38.lab.eng.blr.redhat.com Unable to mount volumes for pod "cirros003-1-64bwx_app-storage(66887542-f211-11e8-8abe-005056a5bb93)": timeout expired waiting for volumes to attach or mount for pod "app-storage"/"cirros003-1-64bwx". list of unmounted volumes=[cirros-vol]. list of unattached volumes=[cirros-vol default-token-bdqdk]
=================
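For reference (not part of the original report), the usual manual workaround for this Multi-Attach situation in OCP 3.x is to force-delete the pod stuck in Unknown state, which lets the attach/detach controller detach the volume from the powered-off node. A sketch, using the pod and namespace names from the events above:

```shell
# Confirm the stuck pod and the node it was scheduled on
oc get pod cirros003-1-nnhm2 -n app-storage -o wide

# Force-delete the stuck pod; the kubelet on the powered-off node
# cannot confirm termination, so --grace-period=0 --force is required
oc delete pod cirros003-1-nnhm2 -n app-storage --grace-period=0 --force

# The replacement pod should then be able to attach the volume
oc get pods -n app-storage -w
```

This is a manual recovery step only; the bug is that this detach does not happen automatically.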
Version-Release number of selected component (if applicable):
OCS version: 3.11.1 and OCP version: 3.11.1
# oc version
oc v3.11.43
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://dhcp47-89.lab.eng.blr.redhat.com:8443
openshift v3.11.43
kubernetes v1.11.0+d4cacc0
OCS images used for testing:
openshift_storage_glusterfs_heketi_image='brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-volmanager-rhel7:3.11.1-1'
openshift_storage_glusterfs_block_image='brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-gluster-block-prov-rhel7:3.11.1-1'
openshift_storage_glusterfs_image='brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/ocs/rhgs-server-rhel7:3.11.1-1'
# rpm -qa | grep heketi
heketi-client-8.0.0-1.el7rhgs.x86_64
heketi-8.0.0-1.el7rhgs.x86_64
How reproducible:
2/2
Steps to Reproduce:
1. Create an OCP 3.11.1 and OCS 3.11.1 setup using the ansible deployment scripts.
2. Create 20 block PVCs with HA count=4.
3. Create 20 app pods (cirros pods) using the PVCs created in step 2.
4. Log in to one gluster pod and find the node hostname that hosts a replica of both the heketi volume and the block-hosting volume. Power off that node.
5. Wait for the app pods that were running on that node to spin up on another node.
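Step 2 above can be scripted roughly as below. This is a sketch under assumptions: the storage class name `glusterfs-block` and the 1Gi size are hypothetical (the HA count=4 is configured on the block storage class / heketi side, not in the PVC itself):

```shell
# Step 2: create 20 block PVCs in the app-storage namespace
# (storage class name and size are assumptions for illustration)
for i in $(seq -w 1 20); do
  oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc-$i
  namespace: app-storage
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: glusterfs-block
  resources:
    requests:
      storage: 1Gi
EOF
done
```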
Actual results:
The pods are never rescheduled to another node; instead they go into "Unknown"/"ContainerCreating" state, leaving them inaccessible.
Expected results:
Pods should spin up on some other node when their node is powered off.
Additional info:
Detailed logs and a sosreport will be provided in a follow-up comment.
Have you configured 'multipath' for these block PVCs?
If yes, what is the status of multipath when one node goes down?
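The multipath status being asked about can be checked on the initiator node with the standard device-mapper-multipath and iscsiadm commands (these are general diagnostics, not commands from this report):

```shell
# List multipath devices and path states; with one gluster node down,
# one path per gluster-block device is expected to show as "failed faulty"
multipath -ll

# Show iSCSI session details for the gluster-block targets
iscsiadm -m session -P 3
```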
Comment 10 Humble Chirammal
2019-07-09 09:31:25 UTC
(In reply to Humble Chirammal from comment #8)
> Have you configured 'mulitpath' for these block PVCs?
>
> If yes, whats the status of mulitpath when one node goes down ?
Kasturi, apart from the information requested above, can we try this against the latest OCS builds?
Hello Humble,
A couple of questions before we actually try this out:
1) Is this fixed in any of the builds?
2) If yes, can you please put the FIV (fixed-in-version) of the bug and move it to ON_QA so that QE can verify it?
If not, do we know that this has been fixed, and if so, would you want QE to give it a try?
Thanks,
kasturi