Bug 2142461 - [MS] ROSA 4.11 with RHODF addon deployer version 2.0.9 provider cluster results in Prometheus component in Pending state and alertmanager pod stuck in ContainerCreating state
Summary: [MS] Rosa4.11 with RHODF addon deployer version 2.0.9 provider cluster result...
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Rewant
QA Contact: suchita
URL:
Whiteboard:
Duplicates: 2142513
Depends On:
Blocks:
 
Reported: 2022-11-14 05:14 UTC by suchita
Modified: 2024-07-11 10:26 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-11 10:26:45 UTC
Embargoed:


Attachments: None


Links:
System ID                            Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker OCPBUGS-3739   0        None      None    None     2022-11-16 07:03:57 UTC

Description suchita 2022-11-14 05:14:25 UTC
Description of problem:
The Prometheus component is in a Pending state on a ROSA 4.11 provider cluster with RHODF addon deployer version 2.0.9.

Version-Release number of selected component (if applicable):


How reproducible:
2/2

Steps to Reproduce:
1. Install a provider cluster with ROSA 4.11 and deployer version v2.0.9.
2. Terminate the node where the alertmanager pod is running (one way to do this is sketched below).
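
A minimal sketch of one way to perform step 2, assuming an AWS-backed ROSA cluster and AWS CLI access; the node name and instance ID placeholders are illustrative and not taken from this bug's logs:

$ oc -n openshift-storage get pod alertmanager-managed-ocs-alertmanager-0 -o wide    # note the NODE column
$ oc get node <node-name> -o jsonpath='{.spec.providerID}'                           # prints aws:///<zone>/<instance-id>
$ aws ec2 terminate-instances --instance-ids <instance-id>                           # terminates the backing EC2 instance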

Actual results:
The RHODF deployer remains in the Installing state, with the alertmanager pod stuck in the ContainerCreating state.

Expected results:
The deployer reaches the successfully installed state, with all pods Ready.

Additional info:
Must Gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-p10pr/sgatfane-p10pr_20221110T062544/logs/ocs-must-gather

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-p10pr/sgatfane-p10pr_20221110T062544/logs/must-gather.local.7739378137757196546


From oc describe alertmanager pod:
...
Events:
  Type     Reason                  Age                  From     Message
  ----     ------                  ----                 ----     -------
  Warning  FailedCreatePodSandBox  16s (x459 over 17h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-managed-ocs-alertmanager-0_openshift-storage_3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc_0(88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78): error adding pod openshift-storage_alertmanager-managed-ocs-alertmanager-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/alertmanager-managed-ocs-alertmanager-0/3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-storage/alertmanager-managed-ocs-alertmanager-0 88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78] [openshift-storage/alertmanager-managed-ocs-alertmanager-0 88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
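
The sandbox failure above times out waiting for the pod's network annotation from ovn-kubernetes. A hedged way to check whether that annotation was ever written, and whether the OVN and multus pods on the affected node are healthy (the commands and annotation key follow OVN-Kubernetes conventions and are not taken from the must-gather above):

$ oc -n openshift-storage get pod alertmanager-managed-ocs-alertmanager-0 -o yaml | grep k8s.ovn.org/pod-networks
$ oc -n openshift-ovn-kubernetes get pods -o wide
$ oc -n openshift-multus get pods -o wide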



$ oc get managedocs managedocs -oyaml
apiVersion: ocs.openshift.io/v1alpha1
kind: ManagedOCS
metadata:
  creationTimestamp: "2022-11-10T07:16:30Z"
  finalizers:
  - managedocs.ocs.openshift.io
  generation: 1
  name: managedocs
  namespace: openshift-storage
  resourceVersion: "1423586"
  uid: a9ac1395-9e15-4981-9f2f-bf6643c36512
spec: {}
status:
  components:
    alertmanager:
      state: Pending
    prometheus:
      state: Ready
    storageCluster:
      state: Ready
  reconcileStrategy: strict

$ oc get pods
NAME                                                              READY   STATUS              RESTARTS       AGE
addon-ocs-provider-qe-catalog-2cw6w                               1/1     Running             0              16h
alertmanager-managed-ocs-alertmanager-0                           0/2     ContainerCreating   0              16h
csi-addons-controller-manager-699689f4bb-jgcnx                    2/2     Running             0              16h
must-gather-b76dv-helper                                          1/1     Running             0              11s
ocs-metrics-exporter-74948d7ff9-ldm4q                             1/1     Running             0              16h
ocs-operator-67c7958cfc-dssbv                                     1/1     Running             0              16h
ocs-osd-aws-data-gather-6f5fbcc998-ksbfw                          1/1     Running             0              16h
ocs-osd-controller-manager-5f48d88445-g47q4                       2/3     Running             0              16h
ocs-provider-server-7df6f5d569-4h4rw                              1/1     Running             0              16h
odf-console-759fff6766-hcwp9                                      1/1     Running             0              16h
odf-operator-controller-manager-d98d8f7b6-vcvh6                   2/2     Running             0              16h
prometheus-managed-ocs-prometheus-0                               0/3     Init:0/1            0              16h
prometheus-operator-c74f5f6c9-8ww4j                               1/1     Running             0              16h
rook-ceph-crashcollector-ip-10-0-143-31.ec2.internal-5c7d4vprgs   1/1     Running             0              16h
rook-ceph-crashcollector-ip-10-0-153-47.ec2.internal-5bcf6xsqqv   1/1     Running             0              17h
rook-ceph-crashcollector-ip-10-0-165-251.ec2.internal-59978s4dc   1/1     Running             0              16h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66f5dd4bsqg22   2/2     Running             0              16h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-76bcff8fksdhz   2/2     Running             0              17h
rook-ceph-mgr-a-65864cdbd4-9mjt8                                  2/2     Running             0              17h
rook-ceph-mon-a-64b468d5b9-nzcmt                                  2/2     Running             0              17h
rook-ceph-mon-d-59958bdbf-r5kpn                                   2/2     Running             0              16h
rook-ceph-mon-e-75f87b8c79-k8ttt                                  2/2     Running             0              14h
rook-ceph-operator-5b68c8775-4jvql                                1/1     Running             18 (14h ago)   16h
rook-ceph-osd-0-5777c6f849-hpkwq                                  2/2     Running             0              16h
rook-ceph-osd-1-575777cd6f-dp8pz                                  2/2     Running             0              17h
rook-ceph-osd-2-68869b774b-ksdnq                                  2/2     Running             0              17h
rook-ceph-tools-c5846444b-srm7m                                   1/1     Running             0              16h


$ oc get storagecluster
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   29h   Ready              2022-11-10T07:23:00Z   

$ oc get managedocs -A
NAMESPACE           NAME         AGE
openshift-storage   managedocs   29h

$ oc get storageconsumers
NAME                                                   AGE
storageconsumer-93207550-4f4a-4e2a-a454-e9cf23f25286   25h
storageconsumer-fd9cae87-7395-4563-8b8a-450bdab052d1   25h

$ ocm list cluster | grep pr10
1vt4rb5pimbsdte0ummlggm61riiac5e  sgatfane-pr10                 https://api.sgatfane-pr10.z0ah.s1.devshift.org:6443         4.11.12             rosa            aws             us-east-1       ready

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.3                      NooBaa Operator               4.11.3            mcg-operator.v4.11.2                      Succeeded
observability-operator.v0.0.15            Observability Operator        0.0.15            observability-operator.v0.0.15-rc         Succeeded
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded
ocs-osd-deployer.v2.0.9                   OCS OSD Deployer              2.0.9             ocs-osd-deployer.v2.0.8                   Installing
odf-csi-addons-operator.v4.10.5           CSI Addons                    4.10.5            odf-csi-addons-operator.v4.10.4           Succeeded
odf-operator.v4.10.5                      OpenShift Data Foundation     4.10.5            odf-operator.v4.10.4                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.450-6e98c37   Route Monitor Operator        0.1.450-6e98c37   route-monitor-operator.v0.1.448-b25b8ee   Succeeded

Comment 1 suchita 2022-11-15 07:58:53 UTC
*** Bug 2142513 has been marked as a duplicate of this bug. ***

Comment 2 Rewant 2022-11-15 08:01:38 UTC
Found a similar issue, https://bugzilla.redhat.com/show_bug.cgi?id=2073452#c23, where a pod gets stuck in the ContainerCreating state on OVN 4.11 clusters.

Comment 3 suchita 2022-11-15 08:05:38 UTC
Copying the content from the duplicate, closed bug https://bugzilla.redhat.com/show_bug.cgi?id=2142513
------------------------------------------------------------------------------------------------------------
Description of problem:
After terminating a worker node on the provider, the pod "alertmanager-managed-ocs-alertmanager-0" is stuck in a "ContainerCreating" state, and the pod "prometheus-managed-ocs-prometheus-0" is stuck in an "Init:0/1" state

Version-Release number of selected component (if applicable):
ROSA cluster OCP4.11, ODF4.10

How reproducible:
Yes. On node termination, the pods "alertmanager-managed-ocs-alertmanager-0" and "prometheus-managed-ocs-prometheus-0" do not recover.

Is there any workaround available to the best of your knowledge?
Yes, after restarting the pods "alertmanager-managed-ocs-alertmanager-0" and "prometheus-managed-ocs-prometheus-0", they went back to a "Running" state.


Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Yes, I didn't see this issue in the previous versions

Steps to Reproduce:
Terminate one of the worker nodes on the provider. 

Actual results:
the pod "alertmanager-managed-ocs-alertmanager-0" is stuck in a "ContainerCreating" state, and/or the pod "prometheus-managed-ocs-prometheus-0" is stuck in an "Init:0/1" state


Expected results:
All the pods should be in a Completed or Running state.

Additional info:

Jenkins job link to the provider cluster: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17960/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.12
Kubernetes Version: v1.24.6+5157800

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.12   True        False         5h36m   Error while reconciling 4.11.12: the cluster operator monitoring has not yet successfully rolled out

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable) 
----------------------------------------------------------------------------------------

Comment 4 Rewant 2022-11-15 08:20:46 UTC
I was able to reproduce the bug by terminating the node on which the alertmanager pod is running. We can mark this as tracking https://issues.redhat.com/browse/OCPBUGS-681.

Comment 5 Rewant 2022-11-15 12:26:51 UTC
The workaround would be to restart the alertmanager pod.
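
A minimal sketch of that workaround, assuming the pods are owned by their StatefulSets so that deleting them forces recreation on a healthy node (pod names taken from the oc get pods output above):

$ oc -n openshift-storage delete pod alertmanager-managed-ocs-alertmanager-0
$ oc -n openshift-storage delete pod prometheus-managed-ocs-prometheus-0
$ oc -n openshift-storage get pods -w    # watch until both pods are Running and Ready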

Comment 15 Ohad 2024-07-11 10:26:45 UTC
The ODF Managed Service project has been sunset and is now considered obsolete.

