Description of problem (please be detailed as possible and provide log snippets):

Previously this issue was observed on non-multus clusters and fixed in Bug 1970352.
Raising this new bug for a Multus-enabled cluster, as the issue is reproduced.

Testcases failed:
tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephFileSystem-cephfsplugin]
tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephBlockPool-rbdplugin]

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-exp/sagrawal-exp_20210701T110421/logs/failed_testcase_ocs_logs_1625556609/

Version of all relevant components (if applicable):
OCP: 4.8.0-0.nightly-2021-07-01-043852
OCS: ocs-operator.v4.8.0-436.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the application pod is not usable.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Raising on first failure occurrence

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
In a Multus-enabled OCS internal mode cluster, run the tier4 tests in tests/manage/pv_services/test_delete_plugin_pod.py
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_delete_plugin_pod.py

OR

Manual steps:
1. Create a NAD for the public network
2. Install the OCS operator and create a Storage Cluster with Multus (using the network interface from step 1)
3. Create a CephFS PVC
4. Attach the PVC to a pod running on the node 'node1'
5. Delete the csi-cephfsplugin pod running on the node 'node1' and wait for the new csi-cephfsplugin pod to be created
6. Run I/O on the app pod

Actual results:
Unable to get I/O results; the fio run on the app pod hung forever.

Expected results:
I/O should be successful; read/write operations should succeed.

Additional info:

>> Describe output of the public NAD:

Name:         ocs-public
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  k8s.cni.cncf.io/v1
Kind:         NetworkAttachmentDefinition
Metadata:
  Creation Timestamp:  2021-07-02T08:01:55Z
  Generation:          1
  Managed Fields:
    API Version:  k8s.cni.cncf.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:config:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2021-07-02T08:01:55Z
  Resource Version:  460932
  UID:               4693ec86-3e4a-4b2a-bb45-400b6bdbcffc
Spec:
  Config:  { "cniVersion": "0.3.0", "type": "macvlan", "master": "ens192", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.1.0/24" } }
Events:  <none>

>> CSI_ENABLE_HOST_NETWORK is "true" in the rook-ceph-operator pod

$ oc -n openshift-storage logs rook-ceph-operator-746c4c69b6-8gqng | egrep CSI_ENABLE_HOST_NETWORK
2021-07-05 15:11:53.651969 I | op-k8sutil: CSI_ENABLE_HOST_NETWORK="true" (default)
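For reference, step 1 above can be done with a NAD equivalent to the describe output in the additional info; this is only a sketch, with the name/namespace, master interface (ens192) and whereabouts range copied from this cluster and likely different elsewhere:

$ cat <<EOF | oc create -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: default
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "macvlan",
      "master": "ens192",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.1.0/24"
      }
    }'
EOF

Step 2 then points the storage cluster's public network selector at default/ocs-public; the exact storage cluster spec used in this run is not included in the report.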
Rohan, PTAL.
we are hitting this because of https://github.com/rook/rook/issues/8085#issuecomment-862017608
(In reply to Sidhant Agrawal from comment #0)
> Previously this issue was observed on non-multus clusters and fixed in Bug 1970352.
> Raising this new bug for a Multus-enabled cluster, as the issue is reproduced.

Just to make it clear: this bug was not really fixed for non-multus installs, but mostly avoided by forcing csi to host-network.
Debugging why this is happening was still an open action item: https://bugzilla.redhat.com/show_bug.cgi?id=1970352#c6

So unfortunately, we have no real fix to learn from at this point.
Answering Neha and removing Travis's needinfo.

You have a good point and I know it can be confusing, but CSI_ENABLE_HOST_NETWORK is an operator setting to control the ceph-csi deployment. When a CephCluster is injected, the operator goes on and deploys ceph-csi too. At this point, the multus configuration comes from the CephCluster and the Rook operator will change the ceph-csi settings accordingly. For instance, it will disable hostnetworking and apply the multus annotations. However, CSI_ENABLE_HOST_NETWORK will remain true even though the correct config is applied.
Hope that clarifies.

To summarize where we are:
* after running a few tests, using hostnetworking does not solve the issue
* the only workaround is to restart the **node**; unfortunately, the plugin is a daemonset so we cannot really evict it

Action items:
* summarize the problem and send it to the OCP networking team for input so we can hopefully fix this in 4.9 or 4.10

As a general thought, a pod restarting is not a recurring event, even though the severity is high. So we can ship with this as a known issue and the proposed workaround.
Instead of restarting the node - could we also 'oc delete' the CSI pods and have them re-created by the DaemonSet?
(In reply to Chris Blum from comment #10)
> Instead of restarting the node - could we also 'oc delete' the CSI pods and
> have them re-created by the DaemonSet?

No, the result will be the same as deleting the pod.
Moving this out of 4.8.

Rohan, please fill in the doc text.
I don't agree with the workaround, because it does not tell me WHICH node to reboot. The CSI pods are on all nodes in the cluster, and it is vital that we provide a way to identify which nodes need a reboot. If we tell customers that they will need to blindly reboot all their nodes, they will not be happy, and IMO we need to raise the Quality Impact of this BZ. Please add the steps to identify which nodes need to be rebooted to the Workaround.
Steps to identify the node where the csi-cephfsplugin or csi-rbdplugin pod restarted:

1. "oc get pods -o wide" (this will display the nodes on which the pods are running)
2. Check for the node where the csi-cephfsplugin or csi-rbdplugin pod was recently re-created, or run "oc describe pod <application pod>" for the pod which has a locked mount and check its node. (As per this bug, identify the node where the fio command is hung in the pod.)
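For example, a minimal sketch of those two checks plus the reboot itself; the namespace, pod and node names here are illustrative:

# 1. find the plugin pod that was recently re-created and note its NODE column
$ oc -n openshift-storage get pods -o wide | egrep 'csi-(cephfs|rbd)plugin'

# 2. or find the node of the application pod whose mount/fio is hung
$ oc describe pod <application-pod> | grep -i '^node:'

# reboot only that node, e.g. via a debug pod
$ oc debug node/<node-name> -- chroot /host systemctl reboot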
Rohan can keep me honest here, but it basically means rebooting only the node on which the csi-cephfsplugin pod was restarted.
I wonder why this issue is still on Rook where the pod responsible for mounting/mapping is ceph-csi. It's not something Rook can fix. I believe we thought we could just use hostnetworking instead and change that in Rook, but this happens not to be possible.

A question for the ceph-csi people: the problem arises when the pod restarts, as we know restart simply means "terminating" and "creating" a new pod. During the termination sequence, the ceph-csi binary could catch the termination signal (assuming it's graceful) and before returning would unmount/unmap filesystem and rbd devices so that when we create the new pod we can cleanly remount/remap what's needed.

Assuming this is a viable solution, I still have one question: can we force detach while a client is accessing the mount?

Ilya, thoughts on this workaround? Thanks!
(In reply to Sébastien Han from comment #18)
> A question for the ceph-csi people: the problem arises when the pod
> restarts, as we know restart simply means "terminating" and "creating" a new
> pod. During the termination sequence, the ceph-csi binary could catch the
> termination signal (assuming it's graceful) and before returning would
> unmount/unmap filesystem and rbd devices so that when we create the new pod
> we can cleanly remount/remap what's needed.
> Assuming this is a viable solution, I still have one question: can we force
> detach while a client is accessing the mount?

Unmounting/unmapping is not something Ceph-CSI can do when application pods are using the volume. An application with an open file on the volume will prevent unmounting.

In order to cleanly execute this process, Ceph-CSI would need to drain all Pods that use Ceph volumes (which automatically unmounts the volumes), make sure all volumes are unmounted, and only then may the Ceph-CSI Pod continue to get terminated.

Maybe this can be automated at a higher level; I think it is unclean for Ceph-CSI to trigger the draining (I'm also not sure that Ceph-CSI can do it fast enough during a termination request).
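Purely to illustrate why the unmount cannot happen while an application holds the volume open (this is not something Ceph-CSI does today; the kubelet mount path below is a placeholder):

# show processes holding files open on a CSI volume mount (path is illustrative)
$ fuser -vm /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount

# a plain unmount fails with "target is busy" as long as such processes exist;
# only draining the pods that use the volume (which unmounts it) avoids this
$ umount /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount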
(In reply to Niels de Vos from comment #19)
> (In reply to Sébastien Han from comment #18)
> > A question for the ceph-csi people: the problem arises when the pod
> > restarts, as we know restart simply means "terminating" and "creating" a new
> > pod. During the termination sequence, the ceph-csi binary could catch the
> > termination signal (assuming it's graceful) and before returning would
> > unmount/unmap filesystem and rbd devices so that when we create the new pod
> > we can cleanly remount/remap what's needed.
> > Assuming this is a viable solution, I still have one question: can we force
> > detach while a client is accessing the mount?
>
> Unmounting/unmapping is not something Ceph-CSI can do when application pods
> are using the volume. An application with an open file on the volume will
> prevent unmounting.

Hence my question about "force" detach :)

> In order to cleanly execute this process, Ceph-CSI would need to drain all
> Pods that use Ceph volumes (which automatically unmounts the volumes), make
> sure all volumes are unmounted, and only then may the Ceph-CSI Pod continue
> to get terminated.
>
> Maybe this can be automated at a higher level; I think it is unclean for
> Ceph-CSI to trigger the draining (I'm also not sure that Ceph-CSI can do it
> fast enough during a termination request).

The termination request should hold a bit until the pod is done doing the unmount/unmap; there is a configurable timeout IIRC.
Just to summarize a few things here:

Until ODF 4.8.0, we were always on 'host networking' and this issue was not observed. With the 4.8.0 multus changes, pod networking came into the picture, and we could see this issue hit when in-flight I/Os from app pods are in place and the nodeplugin is restarted.

https://bugzilla.redhat.com/show_bug.cgi?id=1970352#c12 confirms that, if we use 'host networking' from the start of the deployment, we are good.

>> bz c#9:
"To summarize where we are:
* after running a few tests, using hostnetworking does not solve the issue"

However, the above scenario (bz comment) points to a situation where we started with 'multus' and then tried to get rid of it (removing annotations, etc.) and go back to host networking, but we are facing some issues on this rollback path. This needs to be analyzed further: where exactly do we fail? Is it because we cannot get a complete (network) stack migration of ODF components back to host networking, or is there some limitation to going back to a different network model after we enabled another model in OCP, etc.? If I'm correct, this failure is also being discussed with the OCP networking team.

Regardless, if multus (one way to turn pod networking ON) is enabled and a nodeplugin restart happens, there is an issue due to the veth detachment; in other words, 'pod networking readiness' for ODF components (csi, etc.) has to be explored further. This is also analogous to a communication break in the I/O path when the nodeplugin holds user space mounts and the plugin pod gets restarted.

IMO, having a proper solution for pod networking readiness in the 4.8.0 release looks difficult or not viable. But we can continue exploring the possible solutions.
I feel like we are saying the same thing but don't understand each other, or maybe I left a typo somewhere :).

This sentence which seems to bug people, "after running a few tests, using hostnetworking does not solve the issue", means the following:
* multus is enabled for the Ceph network
* network annotations are not propagated to the ceph-csi pods, so they run on hostnetworking

Still, with that, the csi-rbd-plugin is not capable of contacting the OSD network to successfully map an rbd device. The multus CIDR is on a different subnet than the host network. Having the host network stack is not sufficient.

So right now, we are looking to see if we can bridge the multus network onto the host; then, with hostnetworking, the plugin pod will get access to the multus network.

Hope that clarifies and that I'm making sense.

Ilya, when you have a moment I'd like your opinion on https://bugzilla.redhat.com/show_bug.cgi?id=1979561#c18. Thanks!
Rohan and I have been investigating this issue and found that it can be fixed by running the csi-cephfsplugin/csi-rbdplugin pods in the host network namespace with the addition of a macvlan network device connected to the multus network. This macvlan device is connected to the same master interface as the other macvlan interfaces configured to use the multus network, with its IP address set to be in the network defined in the NetworkAttachmentDefinition. The CSI plugin pods will use this interface to send traffic through the multus network.

This solution was manually verified: the described macvlan interfaces were created on the worker nodes of an OCP cluster, and Rook was deployed with the CSI pods configured to use the host network. While a test pod ran I/O to a mounted CephCluster PVC, the CSI pod running on that worker node was terminated. When the CSI pod came back up, I/O to the volume was possible again!

We are currently looking into how to add this functionality into Rook. There are two problems to solve:
1. Ensuring the multus-connected macvlan network interface is present on the host network namespace.
2. How to determine a free IP address on the to provide to the macvlan network interface. It must not be one that is used by another interface on the multus network.
Fixing typo: 2. How to determine a free IP address on the multus network to provide to the macvlan network interface. It must not be one that is used by another interface on the multus network.
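For illustration, a rough sketch of the manually verified setup on a worker node; the shim interface name and the chosen address are assumptions, the master (ens192) and range (192.168.1.0/24) come from the NAD in the bug description, and the address must not collide with ones handed out by whereabouts on the multus network (problem 2 above):

# create a macvlan interface in the host network namespace, attached to the
# same master interface as the multus macvlan interfaces
ip link add mcv-host link ens192 type macvlan mode bridge

# assign it an unused address from the NAD's range and bring it up
ip addr add 192.168.1.200/24 dev mcv-host
ip link set mcv-host up

With such an interface in place, the host-networked CSI plugin pods can reach the Ceph public (multus) network directly, which is what the manual verification above exercised.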
Hi Renan/Rohan,

Thanks for experimenting and for the update on possible solutions. Isn't it the same method or solution described by the OCP team to get rid of the issue, where we connect or attach the host network namespace to the multus network?

Just to understand this better, are we limited to a macvlan network in this case? I was under the impression that any network device which is part of the multus network can be attached to the host network namespace, and ideally that should help us get rid of the I/O disconnect during this scenario. Please correct me if I am wrong.

Can we also capture the exact steps in a document so that we can add further discussions or thoughts on the same and make progress?
(In reply to Humble Chirammal from comment #33)
> Thanks for experimenting and for the update on possible solutions. Isn't it
> the same method or solution described by the OCP team to get rid of the
> issue, where we connect or attach the host network namespace to the multus
> network?

Yes, they suggested adding a bride to connect the multus network namespace to the host network namespace.

> Just to understand this better, are we limited to a macvlan network in this
> case? I was under the impression that any network device which is part of
> the multus network can be attached to the host network namespace, and
> ideally that should help us get rid of the I/O disconnect during this
> scenario. Please correct me if I am wrong.

This is also one of the approaches which Renan tested later. We are creating a pod on the multus network and moving its network interface from its pod network to the host network. This is explained in detail in the following doc.

> Can we also capture the exact steps in a document so that we can add further
> discussions or thoughts on the same and make progress?

This doc captures the possible solutions, the steps, and the issues with each solution:
https://docs.google.com/document/d/1mc44IktWF_wqn6lDBlVwu9bnewDGu4O9xgKYu-TwdsQ/edit?usp=sharing
Fixing typo: The OCP network team suggested adding a bridge to connect multus network namespace to the host network namespace.
Thanks Rohan for c#34, will revisit the doc.
The changes for this fix are in a Rook PR, currently under review:
https://github.com/rook/rook/pull/8686

Once this is merged, I will make the needed changes to OCS.
(In reply to Renan Campos from comment #41)
> The changes for this fix are in a Rook PR, currently under review:
> https://github.com/rook/rook/pull/8686
>
> Once this is merged, I will make the needed changes to OCS.

Thanks Renan, please create a clone once you raise a PR for the ocs-operator changes.
It feels a bit short to include this in the 4.9 time frame. Moving to 4.10 and to Rook.
We don't need the blocker flag for a 4.10 bug at this point.
Moving back to assigned while it's still in development and finalizing the design.
Unfortunately, due to major design changes, engineering won't be able to deliver the fix for this issue in 4.10. Hence, moving to 4.11. Thanks for your understanding.
Present in the 4.11 branch after this resync https://github.com/red-hat-storage/rook/pull/374
No QE cycles in 4.11 for this feature.
How does it look for 4.12?
This was already fixed by engineering in 4.11 and has been available in the builds since then. But I don't think this is part of QE planning for 4.12; we will still continue to give support exceptions for this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551