gathered data URL: shorturl.at/clCQS

Hello,

The customer upgraded to 4.7.6, but ended up with a degraded monitoring operator.

[mchebbi@fedora namespaces]$ omg get pod -n openshift-monitoring -o wide
NAME                                           READY   STATUS    RESTARTS   AGE     IP            NODE
alertmanager-main-0                            0/5     Pending   0          10m                   openshift-dev-hw62p-worker-xbrn9   <=== no ip address
alertmanager-main-1                            5/5     Running   0          11m     10.131.0.17   openshift-dev-hw62p-worker-cch82
alertmanager-main-2                            0/5     Pending   0          10m                   openshift-dev-hw62p-worker-lc5wk   <=== no ip address
cluster-monitoring-operator-68bd597889-k9ftp   2/2     Running   0          1h16m   10.128.0.18   openshift-dev-hw62p-master-1
grafana-7874c696f6-j9z6p                       2/2     Running   0          1h13m   10.131.0.14   openshift-dev-hw62p-worker-cch82
kube-state-metrics-67c786b7f6-n9vrh            3/3     Running   0          1h13m   10.129.2.18   openshift-dev-hw62p-worker-lc5wk
node-exporter-48z57                            2/2     Running   0          1h53m   172.21.8.53   openshift-dev-hw62p-master-2
node-exporter-5kms7                            2/2     Running   0          1h52m   172.21.8.55   openshift-dev-hw62p-master-1
node-exporter-7ljnl                            2/2     Running   0          1h53m   172.21.8.59   openshift-dev-hw62p-master-0
node-exporter-k5647                            2/2     Running   0          1h52m   172.21.8.72   openshift-dev-hw62p-worker-cch82
node-exporter-svjf4                            2/2     Running   0          1h52m   172.21.8.73   openshift-dev-hw62p-worker-lc5wk
node-exporter-x7vf4                            2/2     Running   0          1h51m   172.21.8.80   openshift-dev-hw62p-worker-xbrn9
openshift-state-metrics-56b7d9ff8-8tjnj        3/3     Running   0          1h13m   10.129.2.12   openshift-dev-hw62p-worker-lc5wk
prometheus-adapter-77db96648b-jz6dt            1/1     Running   0          1h13m   10.129.2.19   openshift-dev-hw62p-worker-lc5wk
prometheus-adapter-77db96648b-xdbcr            1/1     Running   0          1h13m   10.129.2.9    openshift-dev-hw62p-worker-lc5wk
prometheus-k8s-0                               0/7     Pending   0          33m                   openshift-dev-hw62p-worker-xbrn9   <=== no ip address
prometheus-k8s-1                               7/7     Running   1          1h22m   10.131.0.10   openshift-dev-hw62p-worker-cch82
prometheus-operator-67bff998f9-5lhtq           2/2     Running   0          1h16m   10.128.0.16   openshift-dev-hw62p-master-1
telemeter-client-78fdcf7976-tn5f4              3/3     Running   0          1h13m   10.129.2.13   openshift-dev-hw62p-worker-lc5wk
thanos-querier-856ffddfcf-9vdjc                5/5     Running   0          1h13m   10.129.2.15   openshift-dev-hw62p-worker-lc5wk
thanos-querier-856ffddfcf-wzzr8                5/5     Running   0          1h13m   10.131.0.11   openshift-dev-hw62p-worker-cch82

$ oc get po -n openshift-monitoring
NAMESPACE              NAME                                           READY   STATUS              RESTARTS   AGE
openshift-monitoring   alertmanager-main-0                            0/5     ContainerCreating   0          16h
openshift-monitoring   alertmanager-main-1                            0/5     ContainerCreating   0          16h
openshift-monitoring   alertmanager-main-2                            5/5     Running             0          16h
openshift-monitoring   cluster-monitoring-operator-68bd597889-6h6bh   2/2     Running             0          16h
openshift-monitoring   grafana-7874c696f6-4gk9v                       2/2     Running             0          16h
openshift-monitoring   kube-state-metrics-67c786b7f6-6x9lp            3/3     Running             0          16h
openshift-monitoring   node-exporter-48z57                            2/2     Running             0          21h
openshift-monitoring   node-exporter-5kms7                            2/2     Running             0          21h
openshift-monitoring   node-exporter-7ljnl                            2/2     Running             0          21h
openshift-monitoring   node-exporter-k5647                            2/2     Running             0          21h
openshift-monitoring   node-exporter-svjf4                            2/2     Running             0          21h
openshift-monitoring   node-exporter-x7vf4                            2/2     Running             0          21h
openshift-monitoring   openshift-state-metrics-56b7d9ff8-mcmdn        3/3     Running             0          16h
openshift-monitoring   prometheus-adapter-6f9548f67f-fbn2v            1/1     Running             0          3h59m
openshift-monitoring   prometheus-adapter-6f9548f67f-twg7c            1/1     Running             0          3h59m
openshift-monitoring   prometheus-k8s-0                               0/7     ContainerCreating   0          16h
openshift-monitoring   prometheus-k8s-1                               7/7     Running             0          16h
openshift-monitoring   prometheus-operator-67bff998f9-9d7bx           2/2     Running             0          16h
openshift-monitoring   telemeter-client-78fdcf7976-zzkmm              3/3     Running             0          16h
openshift-monitoring   thanos-querier-856ffddfcf-c2pwv                5/5     Running             0          16h
openshift-monitoring   thanos-querier-856ffddfcf-pnmbp                5/5     Running             0          16h

=======================================================================================================================================================================================
[root@ocp-dev-console ocp]# oc get events -n openshift-monitoring
NAMESPACE              LAST SEEN   TYPE      REASON                   OBJECT                    MESSAGE
openshift-monitoring   20s         Warning   FailedCreatePodSandBox   pod/alertmanager-main-0   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-0_openshift-monitoring_28ace80c-c675-43f9-a5b5-7cc6b8589eef_0(dd714fe33a86d802c9b16523a6fca8f574f59a9e31bfe70ed6bdbf147dc0a84a): [openshift-monitoring/alertmanager-main-0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-0 dd714fe33a86d802c9b16523a6fca8f574f59a9e31bfe70ed6bdbf147dc0a84a] [openshift-monitoring/alertmanager-main-0 dd714fe33a86d802c9b16523a6fca8f574f59a9e31bfe70ed6bdbf147dc0a84a] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows '
openshift-monitoring   35s         Warning   FailedCreatePodSandBox   pod/alertmanager-main-1   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-1_openshift-monitoring_6fe22f18-6d30-449b-b6fc-949992ed0688_0(97449af2ab65893a4925852a00b3beac360626dc220cf44789489fc59620b85f): [openshift-monitoring/alertmanager-main-1:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-1 97449af2ab65893a4925852a00b3beac360626dc220cf44789489fc59620b85f] [openshift-monitoring/alertmanager-main-1 97449af2ab65893a4925852a00b3beac360626dc220cf44789489fc59620b85f] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows '
openshift-monitoring   2m24s       Warning   FailedCreatePodSandBox   pod/prometheus-k8s-0      (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-0_openshift-monitoring_b305a9e5-0b53-4c72-9d2e-e040bfe6629d_0(e0321fe6a997fc5129072f5c5c59a271b87a80cdc7b091180da0f6f20ecb1609): [openshift-monitoring/prometheus-k8s-0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/prometheus-k8s-0 e0321fe6a997fc5129072f5c5c59a271b87a80cdc7b091180da0f6f20ecb1609] [openshift-monitoring/prometheus-k8s-0 e0321fe6a997fc5129072f5c5c59a271b87a80cdc7b091180da0f6f20ecb1609] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
=======================================================================================================================================================================================
[root@ocp-dev-console ocp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.6     True        False         20h     Error while reconciling 4.7.6: the cluster operator monitoring has not yet successfully rolled out
=======================================================================================================================================================================================
[root@ocp-dev-console ocp]# oc get co
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.7.6     False       True          True       16h
=======================================================================================================================================================================================
[mchebbi@fedora logs]$ omg get pods -n openshift-monitoring | grep -i pending
alertmanager-main-0   0/5   Pending   0   10m
alertmanager-main-2   0/5   Pending   0   10m
prometheus-k8s-0      0/7   Pending   0   33m

cluster-monitoring-operator logs:

2021-04-19T12:42:45.003983515Z I0419 12:42:45.003948 1 tasks.go:46] ran task 15 of 15: Updating User Workload Thanos Ruler
2021-04-19T12:46:48.337049930Z I0419 12:46:48.336997 1 operator.go:237] ConfigMap to configure stack does not exist. Reconciling with default config every 15m0s.
2021-04-19T12:47:35.051205028Z I0419 12:47:35.051173 1 tasks.go:46] ran task 7 of 15: Updating Alertmanager
2021-04-19T12:47:37.478192708Z I0419 12:47:37.478159 1 tasks.go:46] ran task 5 of 15: Updating Prometheus-k8s
2021-04-19T12:47:37.478228208Z I0419 12:47:37.478188 1 operator.go:374] Updating ClusterOperator status to failed. Err: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 3 replicas, got 1 updated replicas
2021-04-19T12:47:37.486270474Z E0419 12:47:37.486240 1 operator.go:306] Syncing "openshift-monitoring/cluster-monitoring-config" failed
2021-04-19T12:47:37.486297985Z E0419 12:47:37.486266 1 operator.go:307] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 3 replicas, got 1 updated replicas
2021-04-19T12:47:37.486297985Z W0419 12:47:37.486292 1 operator.go:427] No Cluster Monitoring ConfigMap was found. Using defaults.
2021-04-19T12:47:37.503450533Z I0419 12:47:37.503400 1 operator.go:366] Updating ClusterOperator status to in progress.
2021-04-19T12:47:48.581634518Z I0419 12:47:48.581600 1 tasks.go:46] ran task 15 of 15: Updating User Workload Thanos Ruler
2021-04-19T12:51:48.146898447Z I0419 12:51:48.146852 1 operator.go:279] Triggering an update due to ConfigMap or Secret: kube-system/extension-apiserver-authentication
2021-04-19T12:51:48.224846771Z I0419 12:51:48.224815 1 operator.go:279] Triggering an update due to ConfigMap or Secret: openshift-monitoring/prometheus-adapter-tls
2021-04-19T12:51:48.224936825Z I0419 12:51:48.224911 1 operator.go:279] Triggering an update due to ConfigMap or Secret: openshift-monitoring/grpc-tls
2021-04-19T12:51:48.226328174Z I0419 12:51:48.226304 1 operator.go:279] Triggering an update due to ConfigMap or Secret: openshift-config-managed/kubelet-serving-ca
2021-04-19T12:51:48.291335021Z I0419 12:51:48.291300 1 operator.go:279] Triggering an update due to ConfigMap or Secret: openshift-monitoring/alertmanager-trusted-ca-bundle
2021-04-19T12:51:48.291335021Z I0419 12:51:48.291325 1 operator.go:279] Triggering an update due to ConfigMap or Secret: openshift-monitoring/telemeter-trusted-ca-bundle
2021-04-19T12:51:48.337058104Z I0419 12:51:48.336998 1 operator.go:237] ConfigMap to configure stack does not exist. Reconciling with default config every 15m0s.
clusterversion status conditions:

- lastTransitionTime: "2021-04-19T09:15:31Z"
  message: Done applying 4.7.6
  status: "True"
  type: Available
- lastTransitionTime: "2021-04-19T11:54:47Z"
  message: Cluster operator monitoring is not available
  reason: ClusterOperatorNotAvailable
  status: "True"
  type: Failing
- lastTransitionTime: "2021-04-19T11:42:23Z"
  message: 'Error while reconciling 4.7.6: the cluster operator monitoring has not yet successfully rolled out'
  reason: ClusterOperatorNotAvailable
  status: "False"
  type: Progressing

===========================================
alertmanager-main-0

- lastProbeTime: null
  lastTransitionTime: '2021-04-19T12:41:42Z'
  message: 'containers with unready status: [alertmanager config-reloader alertmanager-proxy kube-rbac-proxy prom-label-proxy]'
  reason: ContainersNotReady
  status: 'False'
  type: Ready

===========================================
alertmanager-main-2

- lastProbeTime: null
  lastTransitionTime: '2021-04-19T12:42:02Z'
  message: 'containers with unready status: [alertmanager config-reloader alertmanager-proxy kube-rbac-proxy prom-label-proxy]'
  reason: ContainersNotReady
  status: 'False'

-------------------------
Thanks for your help.
Looking at the logs:

ovnkube-master-7dl6w/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:35:24.233947454Z I0419 11:35:24.233935 1 pods.go:297] LSP already exists for port: openshift-monitoring_prometheus-k8s-0
ovnkube-master-7dl6w/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:35:24.327140074Z I0419 11:35:24.327114 1 pods.go:261] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 93.1917ms
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:01.923797846Z I0419 11:40:01.923775 1 pods.go:60] Stale logical port found: openshift-monitoring_prometheus-k8s-0. This logical port will be deleted.
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:01.925549283Z 2021-04-19T11:40:01.925Z|00290|nbctl|INFO|Running command run --if-exists -- lsp-del openshift-monitoring_prometheus-k8s-0
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.156372889Z I0419 11:40:29.156364 1 pods.go:297] LSP already exists for port: openshift-monitoring_prometheus-k8s-0
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.156415283Z I0419 11:40:29.156403 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.2.17/23"],"mac_address":"0a:58:0a:81:02:11","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.129.2.3"}],"ip_address":"10.129.2.17/23","gateway_ip":"10.129.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.208380617Z I0419 11:40:29.208352 1 pods.go:261] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 51.992078ms
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.208508428Z E0419 11:40:29.208494 1 ovn.go:627] error while creating logical port openshift-monitoring_prometheus-k8s-0 error: connection is shut down

One issue I see here is that we sync the pod via an nbctl command, but then try to add the port again using the go-ovn bindings. It's possible we could run into a cache-out-of-sync issue where the go-ovn cache thinks the port is still there, but it has really been deleted via nbctl. We should fix the sync method to use go-ovn as well. I'm not saying that's the root cause of this bug, just noting it while I'm looking at the logs. A minimal sketch of that concern follows.
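To illustrate the mixed nbctl/go-ovn concern, here is a minimal, hypothetical Go sketch. The nbCache type and syncStalePort helper are illustrative stand-ins, not the actual ovn-kubernetes or go-ovn code: the point is only that a client-side cache updated by nbdb events can keep claiming a port exists after it was deleted out-of-band with an exec'd ovn-nbctl call.

// Hypothetical sketch: deleting an LSP via an exec'd ovn-nbctl call bypasses
// the client that owns the cache, so an existence check against the cache can
// return true for a port that is already gone from the nbdb.
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// nbCache stands in for the go-ovn bindings' in-memory mirror of the nbdb.
// It is only refreshed by update/delete notifications.
type nbCache struct {
	mu    sync.Mutex
	ports map[string]bool
}

func (c *nbCache) has(port string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.ports[port]
}

// onDelete would be driven by nbdb delete notifications; if the delete was
// done out-of-band (or the notification is lost), it is never called and the
// entry stays in the cache.
func (c *nbCache) onDelete(port string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.ports, port)
}

// syncStalePort mimics the sync path: it removes the port with an exec'd
// ovn-nbctl instead of going through the same client that owns the cache.
func syncStalePort(port string) error {
	return exec.Command("ovn-nbctl", "--if-exists", "lsp-del", port).Run()
}

func main() {
	cache := &nbCache{ports: map[string]bool{
		"openshift-monitoring_prometheus-k8s-0": true,
	}}

	// Sync deletes the LSP out-of-band...
	_ = syncStalePort("openshift-monitoring_prometheus-k8s-0")

	// ...but until a delete notification calls onDelete, the cache still
	// claims the port exists, matching the "LSP already exists for port"
	// log line followed by a failed add.
	fmt.Println("cache thinks LSP exists:",
		cache.has("openshift-monitoring_prometheus-k8s-0"))
}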
What looks odd from these logs is the timing between the binding and the annotation:

namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.156415283Z I0419 11:40:29.156403 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.2.17/23"],"mac_address":"0a:58:0a:81:02:11","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.129.2.3"}],"ip_address":"10.129.2.17/23","gateway_ip":"10.129.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0

[trozet@trozet ci.]$ grep -r prometheus-k8s-0 * | grep ovn-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-whdjz/ovn-controller/ovn-controller/logs/current.log:2021-04-19T11:26:22.817976433Z 2021-04-19T11:26:22Z|00038|binding|INFO|Releasing lport openshift-monitoring_prometheus-k8s-0 from this chassis.
namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-vmzzl/ovn-controller/ovn-controller/logs/current.log:2021-04-19T11:26:36.082279873Z 2021-04-19T11:26:36Z|00028|binding|INFO|Claiming lport openshift-monitoring_prometheus-k8s-0 for this chassis.
namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-vmzzl/ovn-controller/ovn-controller/logs/current.log:2021-04-19T11:26:36.082279873Z 2021-04-19T11:26:36Z|00029|binding|INFO|openshift-monitoring_prometheus-k8s-0: Claiming 0a:58:0a:80:02:09 10.128.2.9
namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-vmzzl/ovn-controller/ovn-controller/logs/current.log:2021-04-19T11:38:58.508607415Z 2021-04-19T11:38:58Z|00049|binding|INFO|Releasing lport openshift-monitoring_prometheus-k8s-0 from this chassis.

We can see that at 11:26:36 ovn-controller bound the port with 10.128.2.9, but we don't see that new IP annotated until:

namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T12:19:31.964007339Z I0419 12:19:31.963998 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.2.9/23"],"mac_address":"0a:58:0a:80:02:09","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.128.2.3"}],"ip_address":"10.128.2.9/23","gateway_ip":"10.128.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0
We think we found what happened here. When ovnkube-master started, it found the prometheus pod not running and deleted its port:

2021-04-19T11:40:01.923797846Z I0419 11:40:01.923775 1 pods.go:60] Stale logical port found: openshift-monitoring_prometheus-k8s-0. This logical port will be deleted.
2021-04-19T11:40:01.925549283Z 2021-04-19T11:40:01.925Z|00290|nbctl|INFO|Running command run --if-exists -- lsp-del openshift-monitoring_prometheus-k8s-0

There was then a disconnection between ovnkube-master and the nbdb:

2021-04-19T11:40:26.193749278Z 2021/04/19 11:40:26 rpc2: client protocol error: read tcp 172.21.8.55:52212->172.21.8.53:9642: read: connection reset by peer
2021-04-19T11:40:26.193812317Z 2021/04/19 11:40:26 ssl:172.21.8.53:9642,ssl:172.21.8.55:9642,ssl:172.21.8.59:9642 disconnected. Reconnecting ...
2021-04-19T11:40:26.701238456Z 2021/04/19 11:40:26 ssl:172.21.8.53:9642,ssl:172.21.8.55:9642,ssl:172.21.8.59:9642 reconnected.
2021-04-19T11:40:28.754648075Z 2021/04/19 11:40:28 rpc2: client protocol error: read tcp 172.21.8.55:45308->172.21.8.53:9641: read: connection reset by peer
2021-04-19T11:40:28.754903588Z 2021/04/19 11:40:28 ssl:172.21.8.53:9641,ssl:172.21.8.55:9641,ssl:172.21.8.59:9641 disconnected. Reconnecting ...

which recovered here:

2021-04-19T11:40:29.286480541Z 2021/04/19 11:40:29 ssl:172.21.8.53:9641,ssl:172.21.8.55:9641,ssl:172.21.8.59:9641 reconnected.

In the meantime, the prometheus pod came back (with its fixed name, which the logical switch port is associated with):

2021-04-19T11:40:29.156372889Z I0419 11:40:29.156364 1 pods.go:297] LSP already exists for port: openshift-monitoring_prometheus-k8s-0
2021-04-19T11:40:29.156415283Z I0419 11:40:29.156403 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":[
2021-04-19T11:40:29.208380617Z I0419 11:40:29.208352 1 pods.go:261] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 51.992078ms
2021-04-19T11:40:29.208508428Z E0419 11:40:29.208494 1 ovn.go:627] error while creating logical port openshift-monitoring_prometheus-k8s-0 error: connection is shut down

What's odd is that the master thinks the LSP is there even though the connection is not up. This is because the go-ovn binding library we use holds a cache of the contents of the nbdb. The cache gets refreshed when it receives update (or deletion) events, and gets realigned when the master starts OR when it reconnects. We found an issue where deletion events that occur while the connection is down are not processed when reconnecting, which causes ovnkube-master to think the port is still there while it's not. The sketch below illustrates the failure mode.
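A minimal sketch of the reconnect issue described above, with hypothetical types and function names (not the real go-ovn API): if the resync done on reconnect only merges rows from the fresh nbdb snapshot into the cache and never prunes rows that disappeared while the client was disconnected, an LSP deleted during the outage survives in the cache and later lookups report it as still existing.

// Hypothetical sketch of a reconnect resync that misses deletions.
package main

import "fmt"

// resyncKeepingStale mimics a reconnect handler that only merges rows from
// the fresh nbdb snapshot into the cache: rows added server-side show up,
// but rows deleted while the client was disconnected are never removed.
func resyncKeepingStale(cache map[string]bool, snapshot []string) {
	for _, lsp := range snapshot {
		cache[lsp] = true
	}
	// BUG: nothing prunes cache entries that are absent from the snapshot.
}

// resyncFixed rebuilds the cache from the snapshot, so deletions that
// happened during the disconnection are reflected too.
func resyncFixed(cache map[string]bool, snapshot []string) {
	for lsp := range cache {
		delete(cache, lsp)
	}
	for _, lsp := range snapshot {
		cache[lsp] = true
	}
}

func main() {
	cache := map[string]bool{"openshift-monitoring_prometheus-k8s-0": true}

	// The LSP was removed (lsp-del) while the client was disconnected, so
	// it is missing from the snapshot fetched on reconnect.
	snapshot := []string{"openshift-monitoring_prometheus-k8s-1"}

	resyncKeepingStale(cache, snapshot)
	fmt.Println("stale entry still cached:",
		cache["openshift-monitoring_prometheus-k8s-0"]) // true -> "LSP already exists"

	resyncFixed(cache, snapshot)
	fmt.Println("after full rebuild:",
		cache["openshift-monitoring_prometheus-k8s-0"]) // false
}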
*** Bug 1983109 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438