Bug 1979561 - In OCS with Multus enabled cluster, fio command is hung on app pod after deleting the plugin pod
Summary: In OCS with Multus enabled cluster, fio command is hung on app pod after dele...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Sébastien Han
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On: 2142617
Blocks:
 
Reported: 2021-07-06 12:16 UTC by Sidhant Agrawal
Modified: 2023-08-09 17:03 UTC
CC List: 24 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Previously, when a plugin pod was deleted, the data became inaccessible until a node restart took place. The issue was caused because the `netns` of the mount was destroyed when the `csi-cephfsplugin` pod was restarted, which resulted in `csi-cephfsplugin` locking up all mounted volumes. This issue was seen only in clusters with Multus enabled. With this update, the issue is resolved by restarting the node on which `csi-cephfsplugin` was restarted after the deletion.
Clone Of:
Environment:
Last Closed: 2023-01-31 00:19:18 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 374 0 None Merged Resync from upstream release-1.9 to downstream 4.11 2022-04-27 12:57:33 UTC
Github rook rook pull 8686 0 None open ceph: run CSI pods in multus-connected host network namespace 2021-10-05 14:33:15 UTC
Github rook rook pull 9903 0 None open csi: update multus design to mitigate csi-plugin pod restart 2022-03-21 16:50:48 UTC
Github rook rook pull 9925 0 None Draft core: fix csi-cephfsplugin pod restart on non-hostnetworking env 2022-03-21 16:50:48 UTC
Red Hat Product Errata RHBA-2023:0551 0 None None None 2023-01-31 00:19:42 UTC

Description Sidhant Agrawal 2021-07-06 12:16:33 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Previously this issue was observed on non-Multus clusters and fixed in Bug 1970352.
Raising this new bug for Multus-enabled clusters, as the issue has been reproduced there.

Testcase failed:
tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephFileSystem-cephfsplugin]
tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephBlockPool-rbdplugin]

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-exp/sagrawal-exp_20210701T110421/logs/failed_testcase_ocs_logs_1625556609/

Version of all relevant components (if applicable):
OCP: 4.8.0-0.nightly-2021-07-01-043852
OCS: ocs-operator.v4.8.0-436.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, the application pod is not usable

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Raised on the first failure occurrence

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:

In a Multus enabled OCS Internal mode cluster
Run tier4 tests in tests/manage/pv_services/test_delete_plugin_pod.py
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_delete_plugin_pod.py

OR 

Manual steps
1. Create a NAD for the public network
2. Install the OCS operator and create a Storage Cluster with Multus, using the network interface from step 1 (see the sketch after these steps)
3. Create a CephFS PVC
4. Attach the PVC to a pod running on node 'node1'
5. Delete the csi-cephfsplugin pod running on node 'node1' and wait for the new csi-cephfsplugin pod to be created
6. Run I/O on the app pod
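For reference, a rough sketch of the objects from steps 1-2. The NAD mirrors the one shown under "Additional info" below; the StorageCluster fragment is an assumed illustration of the usual Multus wiring, not copied from this cluster:

$ cat <<EOF | oc create -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: default
spec:
  config: '{ "cniVersion": "0.3.0", "type": "macvlan", "master": "ens192", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.1.0/24" } }'
EOF

StorageCluster fragment pointing the Ceph public network at that NAD (assumed field layout):

  spec:
    network:
      provider: multus
      selectors:
        public: default/ocs-public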

Actual results:
Unable to get I/O results; the fio used in step 4 hung forever

Expected results:
I/O should be successful; read/write operations should succeed

Additional info:

>> Describe output of Public NAD:

Name:         ocs-public
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  k8s.cni.cncf.io/v1
Kind:         NetworkAttachmentDefinition
Metadata:
  Creation Timestamp:  2021-07-02T08:01:55Z
  Generation:          1
  Managed Fields:
    API Version:  k8s.cni.cncf.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:config:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2021-07-02T08:01:55Z
  Resource Version:  460932
  UID:               4693ec86-3e4a-4b2a-bb45-400b6bdbcffc
Spec:
  Config:  { "cniVersion": "0.3.0", "type": "macvlan", "master": "ens192", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.1.0/24" } }
Events:    <none>


>> CSI_ENABLE_HOST_NETWORK is "true" in rook-ceph-operator pod

$ oc -n openshift-storage logs rook-ceph-operator-746c4c69b6-8gqng  | egrep CSI_ENABLE_HOST_NETWORK
2021-07-05 15:11:53.651969 I | op-k8sutil: CSI_ENABLE_HOST_NETWORK="true" (default)

Comment 3 Sébastien Han 2021-07-06 13:14:44 UTC
Rohan, PTAL.

Comment 7 Rohan Gupta 2021-07-07 07:17:21 UTC
we are hitting this because of https://github.com/rook/rook/issues/8085#issuecomment-862017608

Comment 8 Michael Adam 2021-07-07 15:44:55 UTC
(In reply to Sidhant Agrawal from comment #0)
> Previously this issue was observed on non-Multus clusters and fixed in Bug 1970352.
> Raising this new bug for Multus-enabled clusters, as the issue has been reproduced there.

Just to make it clear: this bug was not really fixed for non-Multus installs, but mostly avoided by forcing CSI to host networking.

Debugging why this is happening was still an open action item: https://bugzilla.redhat.com/show_bug.cgi?id=1970352#c6

So unfortunately, we have no real fix to learn from at this point.

Comment 9 Sébastien Han 2021-07-08 09:35:32 UTC
Answering Neha and removing Travis's needinfo.

You have a good point and I know it can be confusing, but CSI_ENABLE_HOST_NETWORK is an operator setting that controls the ceph-csi deployment.
When a CephCluster is created, the operator goes on and deploys ceph-csi too. At that point, the Multus configuration comes from the CephCluster and the Rook operator changes the ceph-csi settings accordingly.
For instance, it disables host networking and applies the Multus annotations. CSI_ENABLE_HOST_NETWORK will remain "true", but the correct config is applied.
Hope that clarifies.
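For example, a quick way to confirm what actually got applied to the plugin pods (the app=csi-cephfsplugin label is the usual Rook one; treat it as an assumption for this cluster):

$ oc -n openshift-storage get pods -l app=csi-cephfsplugin -o yaml \
    | grep -E 'hostNetwork:|k8s.v1.cni.cncf.io/networks'
  # expect hostNetwork to be absent/false and the Multus annotation to reference the public NAD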

To summarize where we are:

* after running a few test using hostnetworking does not solve the issue
* the only workaround is to restart the **node**; unfortunately, the plugin is a daemonset, so we cannot really evict it

Action items:

* summarize the problem and send it to the OCP networking team for input so we can hopefully fix this in 4.9 or 4.10

As a general thought, a pod restarting is not a recurring event, even though the severity is high.
So we can ship with this as a known issue and the proposed workaround.

Comment 10 Chris Blum 2021-07-08 11:09:32 UTC
Instead of restarting the node - could we also 'oc delete' the CSI pods and have them re-created by the DaemonSet?

Comment 11 Sébastien Han 2021-07-08 12:30:33 UTC
(In reply to Chris Blum from comment #10)
> Instead of restarting the node - could we also 'oc delete' the CSI pods and
> have them re-created by the DaemonSet?

No, the result will be the same as deleting the pod.

Comment 12 Mudit Agarwal 2021-07-14 08:18:57 UTC
Moving this out of 4.8

Rohan, please fill the doc text

Comment 15 Chris Blum 2021-07-14 10:41:36 UTC
I don't agree with the workaround, because it does not tell me WHICH node to reboot. 

The CSI Pods are on all nodes in the cluster and it is vital that we hand out a way to identify which nodes need a reboot. If we tell customers that they will need to blindly reboot all their nodes they will not be happy and IMO we need to raise the Quality Impact of this BZ.

Please add steps to identify which nodes need to be rebooted to the Workaround.

Comment 16 Rohan Gupta 2021-07-14 10:55:22 UTC
Steps to identify where the csi-cephfsplugin or csi-rbdplugin pod was restarted (a sketch of these checks follows):
1. "oc get pods -o wide" (this displays the nodes on which the pods are running)
2. Check for the node where the csi-cephfsplugin or csi-rbdplugin pod was recently re-created,
or

run "oc describe pod <application pod>" on the application pod that has a locked mount and check its node (as per this bug, identify the node where the fio command is hung in the pod).

Comment 17 Mudit Agarwal 2021-07-14 11:47:26 UTC
Rohan can keep me honest here, but it basically means restarting only the node on which the csi-cephfsplugin pod is currently running (after its restart).

Comment 18 Sébastien Han 2021-07-20 12:00:19 UTC
I wonder why this issue is still on Rook when the pod responsible for mounting/mapping is ceph-csi.
It's not something Rook can fix. I believe we thought we could just use host networking instead and change that in Rook, but that turned out not to be possible.

A question for the ceph-csi people, the problem arises when the pod restarts, as we know restart simply means "terminating" and "creating" a new pod.
During the termination sequence, the ceph-csi binary could catch the termination signal (assuming it's graceful) and before returning would unmount/unmap filesystem and rbd devices so that when we create the new pod we can cleanly remount/remap what's needed.
Assuming this is a viable solution, I still have one question: can we force detach while a client is accessing the mount?

Ilya, thoughts on this workaround? Thanks!

Comment 19 Niels de Vos 2021-07-20 13:42:22 UTC
(In reply to Sébastien Han from comment #18)
> A question for the ceph-csi people, the problem arises when the pod
> restarts, as we know restart simply means "terminating" and "creating" a new
> pod.
> During the termination sequence, the ceph-csi binary could catch the
> termination signal (assuming it's graceful) and before returning would
> unmount/unmap filesystem and rbd devices so that when we create the new pod
> we can cleanly remount/remap what's needed.
> Assuming this is a viable solution, I still have one question: can we force
> detach while a client is accessing the mount?

unmounting/unmapping is not something Ceph-CSI can do when application pods are using the volume. An application with an open file on the volume will prevent unmounting.

In order to cleanly execute this process, Ceph-CSI would need to drain all Pods that use Ceph volumes, (which automatically unmounts the volumes), make sure all volumes are unmounted, and only then the Ceph-CSI Pod may continue to get terminated.

Maybe this can be automated in a higher level, I think it is unclean for Ceph-CSI to trigger the draining (I'm also not sure that Ceph-CSI can do it fast enough at all during a termination request).
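For illustration only, the kind of higher-level automation alluded to here would look roughly like draining the node before removing the plugin pod (node and pod names are placeholders; this is not an implemented flow):

$ oc adm cordon <node1>
$ oc adm drain <node1> --ignore-daemonsets      # evicts the app pods, which unmounts their Ceph volumes
$ oc -n openshift-storage delete pod <csi-cephfsplugin-pod-on-node1>
$ oc adm uncordon <node1>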

Comment 20 Sébastien Han 2021-07-20 13:47:11 UTC
(In reply to Niels de Vos from comment #19)
> (In reply to Sébastien Han from comment #18)
> > A question for the ceph-csi people, the problem arises when the pod
> > restarts, as we know restart simply means "terminating" and "creating" a new
> > pod.
> > During the termination sequence, the ceph-csi binary could catch the
> > termination signal (assuming it's graceful) and before returning would
> > unmount/unmap filesystem and rbd devices so that when we create the new pod
> > when can cleanly remount/remap what's needed.
> > Assuming this is a viable solution, I still have one question: can we force
> > detach while a client is accessing the mount?
> 
> unmounting/unmapping is not something Ceph-CSI can do when application pods
> are using the volume. An application with an open file on the volume will
> prevent unmounting.


Hence my question about "force" detach :)

> In order to cleanly execute this process, Ceph-CSI would need to drain all
> Pods that use Ceph volumes, (which automatically unmounts the volumes), make
> sure all volumes are unmounted, and only then the Ceph-CSI Pod may continue
> to get terminated.
> 
> Maybe this can be automated in a higher level, I think it is unclean for
> Ceph-CSI to trigger the draining (I'm also not sure that Ceph-CSI can do it
> fast enough at all during a termination request).

The termination request should hold a bit until the pod is done with the unmount/unmap; there is a configurable timeout IIRC.
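If the configurable timeout being referred to is the pod's termination grace period, this is roughly the knob on a pod/daemonset spec (values and the preStop command are illustrative assumptions, not the current csi-cephfsplugin manifest):

  spec:
    terminationGracePeriodSeconds: 300   # how long kubelet waits after SIGTERM before sending SIGKILL
    containers:
    - name: csi-cephfsplugin
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "echo 'unmount/unmap cleanup would run here'"]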

Comment 21 Humble Chirammal 2021-07-20 15:35:55 UTC
Just to summarize a few things here:

Until ODF 4.8.0, we were always on 'host networking' and this issue was not observed.

With the 4.8.0 Multus changes, pod networking came into the picture, and we could see this issue hit when the nodeplugin restarted while in-flight I/Os from app pods were in place.

https://bugzilla.redhat.com/show_bug.cgi?id=1970352#c12 confirms that, if we use host networking from the start of the deployment, we are good.

>> bz c#9

To summarize where we are:

* after running a few test using hostnetworking does not solve the issue
>>

However, the above scenario (bz comment) points to a situation where we started with Multus, then tried to get rid of it (removing annotations, etc.) and go back to host networking, and we are facing some issues on this rollback path. This needs to be analyzed further to find where exactly we fail: is it because we cannot get a complete (network) stack migration of the ODF components back to host networking, or is there some limitation in going back to a different network model after another model has been enabled in OCP, etc.? IIUC, this failure is also being discussed with the OCP networking team.

Regardless, if Multus (one way to turn pod networking ON) is enabled and a nodeplugin restart happens, there is an issue due to the veth detachment; in other words, 'pod networking readiness' for the ODF components (CSI, etc.) has to be explored further. This is also analogous to a break in the I/O path's communication when the nodeplugin holds host user-space mounts and the plugin pod gets restarted.

IMO, having a proper solution for pod networking readiness in the 4.8.0 release looks difficult or not viable, but we can continue exploring possible solutions.

Comment 25 Sébastien Han 2021-08-02 17:00:22 UTC
I feel like we are saying the same thing but don't understand each other, or maybe I left a typo somewhere :).
The sentence that seems to bug people, "after running a few test using hostnetworking does not solve the issue", means the following:

* multus is enabled for the Ceph network
* network annotations are not propagated to ceph-csi pods so they run on hostnetworking

Still, with that, the csi-rbd-plugin is not capable of contacting the OSD network to successfully map an rbd device.
Multus CIDR is on a different subnet than the host network.

Having the host network stack is not sufficient.
So right now, we are looking to see if we can bridge the Multus network onto the host; then, with host networking, the plugin pod will get access to the Multus network.

Hope that clarifies and that I'm making sense.
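As a concrete illustration of the subnet mismatch (the address is an assumption inside the 192.168.1.0/24 whereabouts range from the NAD in the bug description):

# From the node's host network namespace:
$ ip route get 192.168.1.10
  # expected to fail with "Network is unreachable" while the host has no interface or route into the Multus subnet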

Ilya, when you have a moment I'd like your opinion on https://bugzilla.redhat.com/show_bug.cgi?id=1979561#c18.
Thanks!

Comment 31 Renan Campos 2021-08-19 19:15:02 UTC
Rohan and I have been investigating this issue and found that it can be fixed by running the csi-cephfsplugin/csi-rbdplugin pods in the host network namespace with the addition of a macvlan network device connected to the multus network.

This macvlan device is connected to the same master interface as the other macvlan interfaces configured to use the multus network, with its IP address set to be in the network defined in the NetworkAttachmentDefinition.
The CSI plug-in pods will use this interface to send traffic through the multus network.

This solution was manually verified: 
The described macvlan interfaces were created on the worker nodes of an OCP cluster. 
Rook was deployed with the CSI pods configured to use the host network. 
While a test pod ran I/O to a mounted CephCluster PVC, the CSI pod running on that worker node was terminated. 
When the CSI pod came back up, I/O to the volume was possible again!    
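For the record, a rough sketch of that manual setup on a worker node (the master interface ens192 and the 192.168.1.0/24 range come from the NAD in this bug; the device name and the .200 address are arbitrary examples, and the address must be one whereabouts has not handed out, which is exactly problem 2 below):

$ ip link add mpub0 link ens192 type macvlan mode bridge
$ ip addr add 192.168.1.200/24 dev mpub0
$ ip link set mpub0 up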

We are currently looking into how to add this functionality into Rook. 
There are two problems to solve:
1. Ensuring the multus-connected macvlan network interface is present on the host network namespace.
2. How to determine a free IP address on the to provide to the macvlan network interface. It must not be one that is used by another interface on the multus network.

Comment 32 Renan Campos 2021-08-19 19:19:46 UTC
Fixing typo:

2. How to determine a free IP address on the multus network to provide to the macvlan network interface. It must not be one that is used by another interface on the multus network.

Comment 33 Humble Chirammal 2021-08-23 05:13:29 UTC
Hi Renan/Rohan, thanks for experimenting and for the update on possible solutions. Isn't it the same method or solution described by the OCP team to get rid of the issue, where we connect or attach the host network namespace to the multus network? Just to understand this better, are we limited to a macvlan network in this case? I was under the impression that any network device which is part of the multus network can be attached to the host network namespace, and ideally it should help us get rid of the I/O disconnect during this scenario. Please correct me if I am wrong.

Can we also capture the exact steps in a document so that we can add further discussions or thoughts on the same and make progress?

Comment 34 Rohan Gupta 2021-08-24 15:15:09 UTC
(In reply to Humble Chirammal from comment #33)
> Hi Renan/Rohan, thanks for experimenting and for the update on possible
> solutions. Isn't it the same method or solution described by the OCP team to
> get rid of the issue, where we connect or attach the host network namespace
> to the multus network?

Yes, they suggested adding a bride to connect multus network namespace to the host network namespace.

> Just to understand this better, are we limited to a macvlan network in this
> case? I was under the impression that any network device which is part of
> the multus network can be attached to the host network namespace, and
> ideally it should help us get rid of the I/O disconnect during this scenario.

This is also one of the approaches, which Renan tested later: we create a pod on the multus network and move its network interface from the pod network namespace to the host network namespace. This is explained in detail in the following doc.

> Please correct me if I am wrong.
> 
> Can we also capture the exact steps in a document so that we can add further
> discussions or thoughts on the same and make progress?

This doc captured the possible solutions, steps and the issues with the solution https://docs.google.com/document/d/1mc44IktWF_wqn6lDBlVwu9bnewDGu4O9xgKYu-TwdsQ/edit?usp=sharing

Comment 35 Rohan Gupta 2021-08-24 15:17:18 UTC
Fixing typo:
The OCP network team suggested adding a bridge to connect multus network namespace to the host network namespace.

Comment 36 Humble Chirammal 2021-08-25 11:31:41 UTC
Thanks Rohan for c#34, will revisit the doc.

Comment 41 Renan Campos 2021-10-05 12:53:14 UTC
The changes for this fix are in a Rook PR, currently under review:
https://github.com/rook/rook/pull/8686

Once this is merged, I will make the needed changes to OCS.

Comment 42 Mudit Agarwal 2021-10-05 14:33:15 UTC
(In reply to Renan Campos from comment #41)
> The changes for this fix are in a Rook PR, currently under review:
> https://github.com/rook/rook/pull/8686
> 
> Once this is merged, I will make the needed changes to OCS.

Thanks Renan, please create a clone once you raise a PR for ocs-operator changes.

Comment 43 Sébastien Han 2021-10-06 13:00:32 UTC
It feels a bit short to include this in the 4.9 time frame. Moving to 4.10 and to Rook.

Comment 46 Sébastien Han 2021-11-29 16:28:13 UTC
We don't need the blocker flag for a 4.10 bug at this point.

Comment 49 Travis Nielsen 2022-01-24 16:33:06 UTC
Moving back to assigned while it's still in development and finalizing the design.

Comment 50 Sébastien Han 2022-02-22 15:05:57 UTC
Unfortunately, due to major design changes, engineering won't be able to deliver the fix for this issue in 4.10. Hence, moving to 4.11.
Thanks for your understanding.

Comment 54 Sébastien Han 2022-04-27 12:57:33 UTC
Present in the 4.11 branch after this resync https://github.com/red-hat-storage/rook/pull/374

Comment 55 Mudit Agarwal 2022-06-07 09:08:01 UTC
No QE cycles in 4.11 for this feature.

Comment 61 Yaniv Kaul 2022-09-04 11:14:52 UTC
How does it look for 4.12?

Comment 62 Mudit Agarwal 2022-09-05 04:56:21 UTC
Already fixed by engineering in 4.11 and available in the builds since then.
I don't think this is part of QE planning for 4.12, but we will still continue to give support exceptions for this.

Comment 85 errata-xmlrpc 2023-01-31 00:19:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551

