Description of problem:

Prometheus component in Pending state on ROSA 4.11 on RHODF addon deployer version 2.0.9 provider cluster

Version-Release number of selected component (if applicable):

How reproducible:
2/2

Steps to Reproduce:
1. Install provider cluster with ROSA 4.11 with deployer version v2.0.9
2. Terminate the node where the alertmanager pod is running

Actual results:
RHODF Deployer showing installing state with alertmanager pod in ContainerCreating state

Expected results:
Deployer in successfully installed state with all pods ready

Additional info:

Must Gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-p10pr/sgatfane-p10pr_20221110T062544/logs/ocs-must-gather
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-p10pr/sgatfane-p10pr_20221110T062544/logs/must-gather.local.7739378137757196546

From oc describe alertmanager pod:
...
Events:
  Type     Reason                  Age                  From     Message
  ----     ------                  ----                 ----     -------
  Warning  FailedCreatePodSandBox  16s (x459 over 17h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-managed-ocs-alertmanager-0_openshift-storage_3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc_0(88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78): error adding pod openshift-storage_alertmanager-managed-ocs-alertmanager-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/alertmanager-managed-ocs-alertmanager-0/3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-storage/alertmanager-managed-ocs-alertmanager-0 88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78] [openshift-storage/alertmanager-managed-ocs-alertmanager-0 88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

$ oc get managedocs managedocs -oyaml
apiVersion: ocs.openshift.io/v1alpha1
kind: ManagedOCS
metadata:
  creationTimestamp: "2022-11-10T07:16:30Z"
  finalizers:
  - managedocs.ocs.openshift.io
  generation: 1
  name: managedocs
  namespace: openshift-storage
  resourceVersion: "1423586"
  uid: a9ac1395-9e15-4981-9f2f-bf6643c36512
spec: {}
status:
  components:
    alertmanager:
      state: Pending
    prometheus:
      state: Ready
    storageCluster:
      state: Ready
  reconcileStrategy: strict

$ oc get pods
NAME                                                              READY   STATUS              RESTARTS       AGE
addon-ocs-provider-qe-catalog-2cw6w                               1/1     Running             0              16h
alertmanager-managed-ocs-alertmanager-0                           0/2     ContainerCreating   0              16h
csi-addons-controller-manager-699689f4bb-jgcnx                    2/2     Running             0              16h
must-gather-b76dv-helper                                          1/1     Running             0              11s
ocs-metrics-exporter-74948d7ff9-ldm4q                             1/1     Running             0              16h
ocs-operator-67c7958cfc-dssbv                                     1/1     Running             0              16h
ocs-osd-aws-data-gather-6f5fbcc998-ksbfw                          1/1     Running             0              16h
ocs-osd-controller-manager-5f48d88445-g47q4                       2/3     Running             0              16h
ocs-provider-server-7df6f5d569-4h4rw                              1/1     Running             0              16h
odf-console-759fff6766-hcwp9                                      1/1     Running             0              16h
odf-operator-controller-manager-d98d8f7b6-vcvh6                   2/2     Running             0              16h
prometheus-managed-ocs-prometheus-0                               0/3     Init:0/1            0              16h
prometheus-operator-c74f5f6c9-8ww4j                               1/1     Running             0              16h
rook-ceph-crashcollector-ip-10-0-143-31.ec2.internal-5c7d4vprgs   1/1     Running             0              16h
rook-ceph-crashcollector-ip-10-0-153-47.ec2.internal-5bcf6xsqqv   1/1     Running             0              17h
rook-ceph-crashcollector-ip-10-0-165-251.ec2.internal-59978s4dc   1/1     Running             0              16h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66f5dd4bsqg22   2/2     Running             0              16h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-76bcff8fksdhz   2/2     Running             0              17h
rook-ceph-mgr-a-65864cdbd4-9mjt8                                  2/2     Running             0              17h
rook-ceph-mon-a-64b468d5b9-nzcmt                                  2/2     Running             0              17h
rook-ceph-mon-d-59958bdbf-r5kpn                                   2/2     Running             0              16h
rook-ceph-mon-e-75f87b8c79-k8ttt                                  2/2     Running             0              14h
rook-ceph-operator-5b68c8775-4jvql                                1/1     Running             18 (14h ago)   16h
rook-ceph-osd-0-5777c6f849-hpkwq                                  2/2     Running             0              16h
rook-ceph-osd-1-575777cd6f-dp8pz                                  2/2     Running             0              17h
rook-ceph-osd-2-68869b774b-ksdnq                                  2/2     Running             0              17h
rook-ceph-tools-c5846444b-srm7m                                   1/1     Running             0              16h

$ oc get storagecluster
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   29h   Ready              2022-11-10T07:23:00Z

$ oc get managedocs -A
NAMESPACE           NAME         AGE
openshift-storage   managedocs   29h

$ oc get storageconsumers
NAME                                                   AGE
storageconsumer-93207550-4f4a-4e2a-a454-e9cf23f25286   25h
storageconsumer-fd9cae87-7395-4563-8b8a-450bdab052d1   25h

$ ocm list cluster | grep pr10
1vt4rb5pimbsdte0ummlggm61riiac5e   sgatfane-pr10   https://api.sgatfane-pr10.z0ah.s1.devshift.org:6443   4.11.12   rosa   aws   us-east-1   ready

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.3                      NooBaa Operator               4.11.3            mcg-operator.v4.11.2                      Succeeded
observability-operator.v0.0.15            Observability Operator        0.0.15            observability-operator.v0.0.15-rc         Succeeded
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded
ocs-osd-deployer.v2.0.9                   OCS OSD Deployer              2.0.9             ocs-osd-deployer.v2.0.8                   Installing
odf-csi-addons-operator.v4.10.5           CSI Addons                    4.10.5            odf-csi-addons-operator.v4.10.4           Succeeded
odf-operator.v4.10.5                      OpenShift Data Foundation     4.10.5            odf-operator.v4.10.4                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.450-6e98c37   Route Monitor Operator        0.1.450-6e98c37   route-monitor-operator.v0.1.448-b25b8ee   Succeeded
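For reference, a rough sketch of how the reproduction step and the follow-up checks above can be driven from the CLI. The instance ID below is only a placeholder; on ROSA the backing EC2 instance can equally be terminated from the AWS console.

# Find which node hosts the alertmanager pod (NODE column).
$ oc -n openshift-storage get pod alertmanager-managed-ocs-alertmanager-0 -o wide

# Terminate that node's backing EC2 instance (placeholder instance ID).
$ aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

# After the replacement node joins, watch the pods and the ManagedOCS component states.
$ oc -n openshift-storage get pods -w
$ oc -n openshift-storage get managedocs managedocs -o yaml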
*** Bug 2142513 has been marked as a duplicate of this bug. ***
Found a similar issue, https://bugzilla.redhat.com/show_bug.cgi?id=2073452#c23, where a pod gets stuck in the ContainerCreating state on OVN 4.11 clusters.
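The CNI error above says the sandbox creation timed out waiting for the pod's network annotation from ovn-kubernetes. A hedged way to confirm the same symptom on this cluster (the annotation key and pod label are the ones commonly used by OVN-Kubernetes on OCP 4.11; adjust if they differ on a given build):

# Check whether ovn-kubernetes ever wrote the pod-networks annotation on the stuck pod.
$ oc -n openshift-storage get pod alertmanager-managed-ocs-alertmanager-0 -o yaml | grep -A2 'k8s.ovn.org/pod-networks'

# Look for the pod in the ovnkube-master logs to see why the annotation was never set.
$ oc -n openshift-ovn-kubernetes logs -l app=ovnkube-master -c ovnkube-master --tail=200 | grep alertmanager-managed-ocs-alertmanager-0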
Copying the content from the duplicate-marked closed bug https://bugzilla.redhat.com/show_bug.cgi?id=2142513
------------------------------------------------------------------------------------------------------------
Description of problem:
After terminating a worker node on the provider, the pod "alertmanager-managed-ocs-alertmanager-0" is stuck in a "ContainerCreating" state, and the pod "prometheus-managed-ocs-prometheus-0" is stuck in an "Init:0/1" state.

Version-Release number of selected component (if applicable):
ROSA cluster OCP 4.11, ODF 4.10

How reproducible:
Yes, on node termination the pods "alertmanager-managed-ocs-alertmanager-0" and "prometheus-managed-ocs-prometheus-0" are not recovered.

Is there any workaround available to the best of your knowledge?
Yes, after restarting the pods "alertmanager-managed-ocs-alertmanager-0" and "prometheus-managed-ocs-prometheus-0", they went back to a "Running" state.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI? Yes

If this is a regression, please provide more details to justify this:
Yes, I didn't see this issue in the previous versions.

Steps to Reproduce:
Terminate one of the worker nodes on the provider.

Actual results:
The pod "alertmanager-managed-ocs-alertmanager-0" is stuck in a "ContainerCreating" state, and/or the pod "prometheus-managed-ocs-prometheus-0" is stuck in an "Init:0/1" state.

Expected results:
All the pods should be in a Completed or Running state.

Additional info:
Jenkins job link to the provider cluster:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17960/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.12
Kubernetes Version: v1.24.6+5157800

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.12   True        False         5h36m   Error while reconciling 4.11.12: the cluster operator monitoring has not yet successfully rolled out

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
----------------------------------------------------------------------------------------
I was able to reproduce the bug by terminating the node on which the alertmanager pod is running. We can mark it as tracking https://issues.redhat.com/browse/OCPBUGS-681.
The workaround would be to restart the alertmanager pod.
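A minimal sketch of that workaround: both pods are owned by StatefulSets, so deleting them simply forces the controllers to recreate them. Pod names are taken from the output above; as noted in the duplicate bug, prometheus-managed-ocs-prometheus-0 may need the same treatment.

# Delete the stuck pods; their StatefulSets recreate them on a healthy node.
$ oc -n openshift-storage delete pod alertmanager-managed-ocs-alertmanager-0 prometheus-managed-ocs-prometheus-0

# Verify they return to Running and the ManagedOCS components report Ready.
$ oc -n openshift-storage get pods alertmanager-managed-ocs-alertmanager-0 prometheus-managed-ocs-prometheus-0
$ oc -n openshift-storage get managedocs managedocs -o yaml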
The ODF Managed Service project has been sunset and is now considered obsolete.