Description of problem (please be as detailed as possible and provide log snippets):
=============================================================================
When a worker node is powered off / shut down, the DC app pods on the failed node are respun on another healthy node, but the new pod is stuck in "ContainerCreating" status due to a Multi-Attach error.

Version of all relevant components (if applicable):
v4.4.0-428

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
We have a workaround at the moment.

Is there any workaround available to the best of your knowledge?
Yes. When the old pod is forcefully deleted with the command below, the Fedora-based DC app pod reaches Running state (a slightly more general sequence is sketched at the end of this report).

oc delete pod pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-tgjwt --force --grace-period=0

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
I don't think this is a regression.

Steps to Reproduce:
===================
This issue can be reproduced by running the following ocs-ci test:
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_reactive_IPI.py

The test script performs the steps below (a rough oc/aws equivalent is also sketched at the end of this report):
1) Create two Fedora-based DC app pods using a node selector.
2) Identify the node running both the DC app pods and an OSD, and increase the machineset replica count.
3) Wait until the new node comes up and label it with the OCS storage label.
4) Power off the node identified in step 2 from the AWS console.
5) Wait until the OCS pods on the failed node fail over to another node in the same AZ.
6) The Fedora-based DC app pod should automatically respin on another node and reach Running state.
7) Run the sanity and health checks.

Actual results:
===============
The Fedora-based DC app pod is stuck in ContainerCreating state due to a Multi-Attach error.

pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-jcwhf   0/1   ContainerCreating   0   78m
pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-tgjwt   1/1   Terminating         0   89m

Events:
  Type     Reason              Age                From                                                 Message
  ----     ------              ----               ----                                                 -------
  Normal   Scheduled           <unknown>          default-scheduler                                    Successfully assigned namespace-test-75a6b474bf9c4e3980dcbd5c43c11813/pod-test-rbd-9cb5a95ea8114995883f664d7dfdc5c1-1-mstsw to ip-10-0-144-141.us-east-2.compute.internal
  Warning  FailedAttachVolume  20m                attachdetach-controller                              Multi-Attach error for volume "pvc-d64c1f43-e14b-46b7-8c40-ef0316ba8639" Volume is already used by pod(s) pod-test-rbd-9cb5a95ea8114995883f664d7dfdc5c1-1-2swbx
  Warning  FailedMount         37s (x9 over 18m)  kubelet, ip-10-0-144-141.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition

Expected results:
=================
The DC app pod should reach Running status on the new node.
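For reference, roughly the oc/aws equivalent of the node power-off portion of the test above. This is only a sketch: the machineset, node, namespace, and instance IDs are placeholders, and the OCS storage label shown is an assumption based on common OCS 4.x deployments and should be verified against the version in use.

# Scale the machineset so a replacement worker comes up in the same AZ.
oc scale machineset <machineset-name> -n openshift-machine-api --replicas=<current+1>

# Once the new node is Ready, add the OCS storage label to it.
oc label node <new-node-name> cluster.ocs.openshift.io/openshift-storage=""

# Power off the node that runs the DC app pods and an OSD (AWS CLI instead of the console).
aws ec2 stop-instances --instance-ids <instance-id-of-failed-node>

# Watch the app pods get rescheduled onto another node.
oc get pods -n <app-namespace> -o wide -w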
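And the workaround in a slightly more general form; the node, namespace, and pod names below are placeholders to fill in from the actual cluster.

# List pods that are still reported against the powered-off node.
oc get pods -A -o wide --field-selector spec.nodeName=<failed-node-name>

# Force-delete the stale pod stuck in Terminating so the RBD volume can be
# attached to the replacement pod on the healthy node.
oc delete pod <stuck-pod-name> -n <app-namespace> --force --grace-period=0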
> Warning  FailedAttachVolume  20m  attachdetach-controller  Multi-Attach error for volume
> "pvc-d64c1f43-e14b-46b7-8c40-ef0316ba8639" Volume is already used by pod(s)
> pod-test-rbd-9cb5a95ea8114995883f664d7dfdc5c1-1-2swbx

Indeed, this is working as per the current design. Unless there is an acknowledgment in the API object that the pod/node has actually been removed, switching the attachment would be difficult/risky; this is an extra safety measure to make sure we don't end up with data corruption or similar problems. However, we are also looking into what improvements we could make. Fencing the failed node, as cloud providers do, is one solution.

Niels, I am assigning this bug to you based on the experiments you are planning in this area. Please feel free to reassign if required.
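For context, a minimal illustration of what fencing a failed node can look like at the RBD level, assuming administrative access to the Ceph cluster. The pool, image, and IP below are placeholders, and this only sketches the general concept rather than the actual mechanism being considered for ceph-csi.

# Check which client(s) still hold a watch on the RBD image backing the stuck PVC.
rbd status <pool>/<image-name>

# Blocklist the failed node's client address so its stale watch/lock is cleared;
# note that older Ceph releases call this "blacklist" instead of "blocklist".
ceph osd blocklist add <failed-node-ip>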
Discussed with Niels; this is not something that can be done in 4.5 (or the near future). It can be moved out.
This is not much different from https://bugzilla.redhat.com/show_bug.cgi?id=1845666
(In reply to Mudit Agarwal from comment #5)
> This is not much different from
> https://bugzilla.redhat.com/show_bug.cgi?id=1845666

So why not close as a dup?
As Humble mentioned, this issue can be resolved with node fencing, so I am duping this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1845666. Although this is the older bug and the other one would normally have been duped here, https://bugzilla.redhat.com/show_bug.cgi?id=1845666 was opened specifically to address node fencing in ceph-csi and carries a lot more detail about the problem, hence duping this one there.

*** This bug has been marked as a duplicate of bug 1845666 ***