I cannot find must-gather 0010-must-gather.local.7015162253973320014.tar.gz in the customer case. I think I need a must-gather from when the issue is happening to debug further. Please attach it to this BZ or clarify which file I need to download from the customer case.

---

The only thing I could find in the "happy" must-gathers that seems off to me is that there are 5 copies of the ReplicaSet for each `rook-ceph-osd-*` deployment. This suggests the OSDs may have been updated 4 other times and the previous versions were not cleaned up properly. This may be a manifestation of an issue reported in upstream Kubernetes here: https://github.com/kubernetes/kubernetes/issues/34052. I don't believe this could cause the issue, however. It might be worth checking with the OpenShift Platform team about why those weren't deleted with their Deployments (or garbage collected).
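For reference, a minimal sketch of how to confirm the extra ReplicaSets and see what revision history the Deployments are configured to keep. This assumes the standard `openshift-storage` namespace and the usual Rook `app=rook-ceph-osd` label; the commands are illustrative, not taken from the must-gather:

```sh
# List all ReplicaSets belonging to the OSD Deployments; old revisions should
# be scaled to 0 and retained only up to the Deployment's history limit.
oc -n openshift-storage get replicasets | grep rook-ceph-osd

# Show each OSD Deployment's revisionHistoryLimit (Kubernetes keeps up to 10
# old ReplicaSets by default when the field is unset).
oc -n openshift-storage get deployments -l app=rook-ceph-osd \
  -o custom-columns=NAME:.metadata.name,HISTORY:.spec.revisionHistoryLimit
```

If the limit is unset or large, several retained (scaled-down) ReplicaSets per Deployment could simply be normal revision history rather than a garbage-collection failure, which would line up with this not being the cause of the issue.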
Ceph appears to have upgraded successfully, including the OSDs. In this instance, NooBaa is failing somehow. If I look at the events log and search "noobaa" in the RelatedObject column, I see failures at the bottom.

> error killing pod: [failed to "KillContainer" for "db" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "1b4a429a-1699-4a2c-80e3-0d3cce2f8d6c" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox a8f69ae5ad86cbb5c543e29233531f20a24a749c709537b2c3ef8e1a9ec56c1b: failed to stop container k8s_db_noobaa-db-pg-0_openshift-storage_1b4a429a-1699-4a2c-80e3-0d3cce2f8d6c_0: context deadline exceeded"]

> error killing pod: [failed to "KillContainer" for "db" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container cf4897c53ea7a19d6bc69faa6e763e0f518a6574c0805e1b4413a3805029a1e9: context deadline exceeded", failed to "KillPodSandbox" for "1b4a429a-1699-4a2c-80e3-0d3cce2f8d6c" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

And a lot of these:

> failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API

---

These could still be failures from the API server, especially given the CPU utilization error. I think it's likely there are no ODF bugs. Let's keep waiting on the OCP BZ.
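If someone has live access to the cluster, a rough sketch of where I would look next (the pod name is taken from the errors above; the metrics check assumes the cluster serves the resource metrics API through `v1beta1.metrics.k8s.io`, which is the usual setup):

```sh
# Pull the events that reference the noobaa DB pod directly.
oc -n openshift-storage get events \
  --field-selector involvedObject.name=noobaa-db-pg-0 --sort-by=.lastTimestamp

# The "no metrics returned from resource metrics API" errors suggest the
# metrics API itself may be unhealthy, so check its availability.
oc get apiservice v1beta1.metrics.k8s.io
oc adm top nodes
```

If `oc adm top nodes` fails or the APIService shows unavailable, that would support the theory that this is an API-server/metrics problem rather than anything in ODF.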
Small update: the OpenShift team has finally begun looking at the API-server-related issues in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=2101397
I still believe there is something wrong with OpenShift or the customer's environment configuration, as the error messages on the pods aren't coming from anything in ODF.

In the first iteration of this bug, there were problems deleting the multus network:

> rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_rook-ceph-osd-0-597cc9488-8kqts_openshift-storage_1d241513-5aea-4e1a-9cc6-5999e8e08cb7_0(f545dc81649f96ada4b1121b2b0fcfc4d34723e41b0506e58a8e3b60f1130b20): error removing pod openshift-storage_rook-ceph-osd-0-597cc9488-8kqts from CNI network \"multus-cni-network\": netplugin failed:\"

In the second iteration of this bug, OpenShift failed to kill pod sandboxes:

> error killing pod: [failed to "KillContainer" for "db" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container cf4897c53ea7a19d6bc69faa6e763e0f518a6574c0805e1b4413a3805029a1e9: context deadline exceeded", failed to "KillPodSandbox" for "1b4a429a-1699-4a2c-80e3-0d3cce2f8d6c" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

When looking into issues related to failing to kill pod sandboxes, a common suggestion is that the nodes may be overloaded, making the kubelet unresponsive. If nodes are overloaded at the same time Ceph daemons are being updated (adding even more load), that could certainly cause system instability, but I'm not sure how to prove that is what's happening on these systems. ODF is deployed with default resource requests and limits that help reduce issues with resource overuse, so something else is likely going on.

Are there any processes on the nodes that are using a lot of resources? This could be a process on the host itself or a running pod that doesn't have resource requests/limits set. The customer can get all pods on a node with the command below. If they do this for an overloaded node, we could see whether there are any pods without resource requests/limits set that could be an issue:

> kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>

And a `top` on the node itself might help show whether there are host applications consuming more resources than expected.

I believe the OpenShift team should have better experience getting to the bottom of what's going on, but they have reported they're having staffing issues. Knowing that, I will do what I can to try to narrow down the possibilities. Since the user's environment probably isn't showing the issue right now, I can take a look at the `ssh <node> top -n 1 -b` and `kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>` output for each user node to try to find out if there is a heavy application running.
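A minimal sketch of that collection pass, assuming direct SSH access as the `core` user and `jq` available locally (on clusters without direct SSH, `oc debug node/<node>` is the usual substitute):

```sh
# For every node: a one-shot `top`, the pods scheduled there, and any pods
# whose containers set no resource requests/limits at all.
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${node} ==="
  ssh core@"${node}" top -b -n 1 | head -n 20
  oc get pods --all-namespaces -o wide --field-selector spec.nodeName="${node}"
  oc get pods --all-namespaces -o json --field-selector spec.nodeName="${node}" \
    | jq -r '.items[] | select(all(.spec.containers[]; .resources == {}))
             | "\(.metadata.namespace)/\(.metadata.name) has no requests/limits"'
done
```

Anything flagged by the last command, or any host process sitting near the top of `top`, would be a candidate for the load that makes the kubelet unresponsive during Ceph daemon updates.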