Bug 1801089 - [OVN] Installation failed and monitoring pod not created due to some network error.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ricardo Carrillo Cruz
QA Contact: huirwang
URL:
Whiteboard: SDN-CI-IMPACT
Duplicates: 1836376
Depends On:
Blocks:
 
Reported: 2020-02-10 08:51 UTC by huirwang
Modified: 2023-09-15 00:29 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:10:53 UTC
Target Upstream Version:
Embargoed:


Attachments
ovn-master logs (1.08 MB, application/gzip), 2020-05-21 02:22 UTC, zhaozhanqi


Links
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:11:46 UTC)

Description huirwang 2020-02-10 08:51:10 UTC
Description of problem:
Installed an OSP cluster with the OVN network type; the installation failed. One monitoring pod was not created successfully due to network errors.

How reproducible:
Sometimes

Version-Release number of selected component (if applicable):
 4.4.0-0.nightly-2020-02-09-220310
 
Steps to Reproduce:
Set up an OSP cluster with networkType: "OVNKubernetes".
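
For reference, the networking stanza this maps to in install-config.yaml (a minimal sketch; the CIDRs shown are the installer defaults, and the 10.131.0.x / 10.128.2.x pod subnets seen below are consistent with them):

networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16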

The installation failed with the errors below:
level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is still updating"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: waiting for Alertmanager: expected 3 replicas, updated 2 and available 2"
level=fatal msg="failed to initialize the cluster: Cluster operator monitoring is still updating"
tools/launch_instance.rb:621:in `installation_task': shell command failed execution, see logs (RuntimeError)
        from tools/launch_instance.rb:748:in `block in launch_template'
        from tools/launch_instance.rb:747:in `each'
        from tools/launch_instance.rb:747:in `launch_template'
        from tools/launch_instance.rb:55:in `block (2 levels) in run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/command.rb:182:in `call'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/command.rb:153:in `run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/runner.rb:446:in `run_active_command'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/runner.rb:68:in `run!'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/delegates.rb:15:in `run!'
        from tools/launch_instance.rb:92:in `run'
        from tools/launch_instance.rb:880:in `<main>'
waiting for operation up to 36000 seconds..


oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      90m
cloud-credential                           4.4.0-0.nightly-2020-02-09-220310   True        False         False      112m
cluster-autoscaler                         4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
console                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      95m
csi-snapshot-controller                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      99m
dns                                        4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
etcd                                       4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m
image-registry                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
ingress                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
insights                                   4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
kube-apiserver                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      105m
kube-controller-manager                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
kube-scheduler                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
kube-storage-version-migrator              4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
machine-api                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
machine-config                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
marketplace                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m
monitoring                                                                     False       True          True       95m
network                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
node-tuning                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
openshift-apiserver                        4.4.0-0.nightly-2020-02-09-220310   True        False         False      103m
openshift-controller-manager               4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
openshift-samples                          4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-02-09-220310   True        False         False      104m
service-ca                                 4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
service-catalog-apiserver                  4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
service-catalog-controller-manager         4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
storage                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m

oc get pod -n openshift-monitoring -o wide
NAME                                           READY   STATUS              RESTARTS   AGE   IP             NODE                              NOMINATED NODE   READINESS GATES
alertmanager-main-0                            3/3     Running             0          71m   10.131.0.19    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
alertmanager-main-1                            3/3     Running             0          71m   10.131.0.20    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
alertmanager-main-2                            0/3     ContainerCreating   0          66m   <none>         huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
cluster-monitoring-operator-797b964b77-4mjxt   1/1     Running             0          73m   10.130.0.17    huir-osp-ovn-tkzlp-master-1       <none>           <none>
grafana-659f665879-gtt6k                       2/2     Running             0          66m   10.129.2.8     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
kube-state-metrics-bd8f6d6cf-8rjtb             3/3     Running             0          73m   10.131.0.15    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
node-exporter-bbmcc                            2/2     Running             0          71m   192.168.0.41   huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
node-exporter-cbq44                            2/2     Running             0          72m   192.168.0.34   huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
node-exporter-dhj5p                            2/2     Running             0          73m   192.168.0.25   huir-osp-ovn-tkzlp-master-0       <none>           <none>
node-exporter-jrqnw                            2/2     Running             0          73m   192.168.0.19   huir-osp-ovn-tkzlp-master-1       <none>           <none>
node-exporter-l7hj9                            2/2     Running             0          72m   192.168.0.13   huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>
node-exporter-n686p                            2/2     Running             0          73m   192.168.0.36   huir-osp-ovn-tkzlp-master-2       <none>           <none>
openshift-state-metrics-cdfb76f97-mkzgl        3/3     Running             0          73m   10.131.0.17    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-adapter-7775dc5c69-4j4j5            1/1     Running             0          73m   10.131.0.6     huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-adapter-7775dc5c69-q5nxh            1/1     Running             0          73m   10.131.0.9     huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-k8s-0                               7/7     Running             1          55m   10.129.2.6     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
prometheus-k8s-1                               7/7     Running             1          55m   10.128.2.7     huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>
prometheus-operator-c4db55b77-42g5z            1/1     Running             0          66m   10.130.0.22    huir-osp-ovn-tkzlp-master-1       <none>           <none>
telemeter-client-5976f9cc6f-8qzfd              3/3     Running             0          66m   10.129.2.7     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
thanos-querier-69576b85bf-cm5wz                4/4     Running             0          60m   10.129.2.10    huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
thanos-querier-69576b85bf-gvrts                4/4     Running             0          61m   10.128.2.8     huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>


oc describe pod alertmanager-main-2  -n openshift-monitoring
Snippet of the error events:
  Warning  FailedCreatePodSandBox  42m  kubelet, huir-osp-ovn-tkzlp-worker-mtx5g  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-2_openshift-monitoring_97be41f4-2448-49fd-9d2c-2a13287a14a6_0(35fe942fb6d74e80de1ab71f9494d2a1d1d298b0f7aef0d80173d02601db18fb): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
'
  Warning  FailedCreatePodSandBox  46s (x87 over 41m)  kubelet, huir-osp-ovn-tkzlp-worker-mtx5g  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-2_openshift-monitoring_97be41f4-2448-49fd-9d2c-2a13287a14a6_0(9791771ff159b59ae95d1642569af85dd42445dc5859f8ef4103c6f83bd4f5b7): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition


Expected Result:
Installation completes successfully without the above errors.


Additional info: this issue can be worked around by recreating the pod.
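
A minimal sketch of that workaround, assuming the stuck pod is alertmanager-main-2 in openshift-monitoring (the StatefulSet controller recreates the pod, and the new sandbox gets a fresh CNI ADD):

# delete the stuck pod; the StatefulSet controller recreates it
oc -n openshift-monitoring delete pod alertmanager-main-2
# watch until the recreated pod reaches Running 3/3
oc -n openshift-monitoring get pod alertmanager-main-2 -w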

The kubeconfig:
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/80058/console

Comment 1 Alexander Constantinescu 2020-02-11 15:51:08 UTC
This seems to be another informer problem:

alertmanager-main-2 (belonging to statefulset.apps/alertmanager-main) is created twice. However, we only receive the first creation event, which we can prove by looking at the ovnkube-master logs:

oc logs -c ovnkube-master ovnkube-master-k9nqb | grep alertmanager-main-2
time="2020-02-10T06:49:29Z" level=info msg="Setting annotations map[k8s.ovn.org/pod-networks:{\"default\":{\"ip_address\":\"10.131.0.21/23\",\"mac_address\":\"2a:fc:56:83:00:16\",\"gateway_ip\":\"10.131.0.1\"}} ovn:{\"ip_address\":\"10.131.0.21/23\",\"mac_address\":\"2a:fc:56:83:00:16\",\"gateway_ip\":\"10.131.0.1\"}] on pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:49:29Z" level=info msg="[openshift-monitoring/alertmanager-main-2] addLogicalPort took 398.127906ms"

We can thus see that its creation was received at 06:49:29 and it was assigned the IP 10.131.0.21 on node huir-osp-ovn-tkzlp-worker-pkwnn (which owns the 10.131.0.0/23 subnet). Looking at the ovnkube-node logs from that node, we can see that the CNI request for this pod is handled correctly:

oc logs -c ovnkube-node ovnkube-node-gwd84  | grep alertmanager-main-2
time="2020-02-10T06:49:34Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:49:34Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{ADD openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc0001292b0}"
time="2020-02-10T06:49:34Z" level=warning msg="failed to clear stale OVS port \"\" iface-id \"openshift-monitoring_alertmanager-main-2\": failed to run 'ovs-vsctl --timeout=30 remove Interface  external-ids iface-id': exit status 1\n  \"ovs-vsctl: no row \\\"\\\" in table Interface\\n\"\n  \"\""
time="2020-02-10T06:49:36Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{ADD openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc0001292b0}, result \"{\\\"Result\\\":{\\\"interfaces\\\":[{\\\"name\\\":\\\"10c6c81e6ba2284\\\",\\\"mac\\\":\\\"36:34:35:63:c3:7f\\\"},{\\\"name\\\":\\\"eth0\\\",\\\"mac\\\":\\\"2a:fc:56:83:00:16\\\",\\\"sandbox\\\":\\\"/proc/6782/ns/net\\\"}],\\\"ips\\\":[{\\\"version\\\":\\\"4\\\",\\\"interface\\\":1,\\\"address\\\":\\\"10.131.0.21/23\\\",\\\"gateway\\\":\\\"10.131.0.1\\\"}],\\\"dns\\\":{}},\\\"PodIFInfo\\\":null}\", err <nil>"
time="2020-02-10T06:54:59Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:54:59Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc000128000}"
time="2020-02-10T06:54:59Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc000128000}, result \"\", err <nil>"


We can also see that it is deleted at 06:54:59 (see the DEL CNI request).

The second alertmanager-main-2 pod is re-created on node huir-osp-ovn-tkzlp-worker-mtx5g. The ovnkube-node logs from that node show the following for the pod:

oc logs -c ovnkube-node ovnkube-node-njhvs  | grep alertmanager-main-2 | head -n 10
time="2020-02-10T06:55:11Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:11Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{ADD openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0002e8680}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{ADD openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0002e8680}, result \"\", err failed to get pod annotation: timed out waiting for the condition"
time="2020-02-10T06:55:33Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0000ac270}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0000ac270}, result \"\", err <nil>"
time="2020-02-10T06:55:33Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc000343380}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc000343380}, result \"\", err <nil>"
time="2020-02-10T06:55:40Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"

The pod re-created at 06:55:11 is thus never annotated by ovnkube-master (all ovnkube-master log lines mentioning alertmanager-main-2 are shown above).

Our pod watcher in ovnkube-master seems to never have received the new pod creation notification.  
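
A quick way to confirm this symptom on a live cluster (a sketch; assumes the stuck pod is openshift-monitoring/alertmanager-main-2): the k8s.ovn.org/pod-networks annotation that ovnkube-master should have written is simply missing on the re-created pod:

# prints nothing while the pod is stuck; on a healthy pod it shows the assigned IP/MAC/gateway
oc -n openshift-monitoring get pod alertmanager-main-2 -o yaml | grep 'k8s.ovn.org/pod-networks'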

Assigning to Dan Williams, because I am seeing this PR in upstream ovn-kubernetes: https://github.com/ovn-org/ovn-kubernetes/pull/1043, which might be fixing this issue?

Let me know @dcbw


/Alex

Comment 2 zhaozhanqi 2020-02-21 07:55:31 UTC
Met the same issue again in registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-02-21-011943

Comment 3 Anping Li 2020-02-21 07:56:01 UTC
Hit it with payload 4.4.0-0.nightly-2020-02-21-011943

Events:
  Type     Reason                  Age        From                                                 Message
  ----     ------                  ----       ----                                                 -------
  Normal   Scheduled               <unknown>  default-scheduler                                    Successfully assigned openshift-monitoring/alertmanager-main-1 to anli22-9w7w9-w-a-0.c.openshift-qe.internal
  Warning  FailedCreatePodSandBox  41m        kubelet, anli22-9w7w9-w-a-0.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-1_openshift-monitoring_9214cc8d-3191-423b-a145-12286d8132f0_0(9f86ea23d8583395c5a4135dfe77d7fe8b6c789dbe7268cd5cbdf4958a2aa4cc): Multus: [openshift-monitoring/alertmanager-main-1]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-1] failed to get pod annotation: timed out waiting for the condition

Comment 4 Dan Williams 2020-04-01 17:11:52 UTC
This is likely to be caused by missed events in the DeltaFIFO on watch disconnect, which was addressed here:

1) upstream kubernetes client-go PR https://github.com/kubernetes/kubernetes/pull/83911
2) upstream ovn-kube issue about it: https://github.com/ovn-org/ovn-kubernetes/issues/1202
3) upstream ovn-kube PR to revendor client-go with the fix from (1): https://github.com/ovn-org/ovn-kubernetes/pull/1199
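
To identify which ovn-kubernetes build a cluster is actually running (useful for checking whether the re-vendored client-go from (3) is included; a sketch assuming the default openshift-ovn-kubernetes namespace and an ovnkube-master daemonset):

# image running for the ovnkube-master container
oc -n openshift-ovn-kubernetes get daemonset ovnkube-master \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="ovnkube-master")].image}'
# image pinned for ovn-kubernetes in the current release payload
oc adm release info --image-for=ovn-kubernetes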

Comment 5 Yang Yang 2020-04-29 04:01:12 UTC
Facing it on GCP installation with 4.4.0-rc.13

level=info msg="Waiting up to 30m0s for the cluster at https://api.yy4348.qe.gcp.devcluster.openshift.com:6443 to initialize..."
level=error msg="Cluster operator authentication Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route: dial tcp 34.66.248.230:443: connect: connection refused"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.4.0-rc.13"
level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus: expected 2 replicas, updated 1 and available 1"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, monitoring"

# oc -n openshift-monitoring describe pod/prometheus-k8s-1
/prometheus-k8s-1 to yy4348-sxqkf-w-c-2.c.openshift-qe.internal
  Warning  FailedCreatePodSandBox  48m   kubelet, yy4348-sxqkf-w-c-2.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_34941812-c4dc-409b-8a9a-89135bd847d3_0(66aa175c3fe56a54cb5364f0fe5007462461ed74a897d47feae7f8d2d2c51e12): Multus: [openshift-monitoring/prometheus-k8s-1]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-monitoring/prometheus-k8s-1] failed to get pod annotation: timed out waiting for the condition

Comment 6 Dan Williams 2020-05-11 14:19:10 UTC
(In reply to yangyang from comment #5)
> Facing it on GCP installation with 4.4.0-rc.13
> 
> level=info msg="Waiting up to 30m0s for the cluster at
> https://api.yy4348.qe.gcp.devcluster.openshift.com:6443 to initialize..."
> level=error msg="Cluster operator authentication Degraded is True with
> RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route: dial tcp
> 34.66.248.230:443: connect: connection refused"
> level=info msg="Cluster operator authentication Progressing is Unknown with
> NoData: "
> level=info msg="Cluster operator authentication Available is Unknown with
> NoData: "
> level=info msg="Cluster operator console Progressing is True with
> SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward
> version 4.4.0-rc.13"
> level=info msg="Cluster operator console Available is False with
> Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for
> console deployment"
> level=info msg="Cluster operator insights Disabled is False with : "
> level=info msg="Cluster operator monitoring Available is False with : "
> level=info msg="Cluster operator monitoring Progressing is True with
> RollOutInProgress: Rolling out the stack."
> level=error msg="Cluster operator monitoring Degraded is True with
> UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: running
> task Updating Prometheus-k8s failed: waiting for Prometheus object changes
> failed: waiting for Prometheus: expected 2 replicas, updated 1 and available
> 1"
> level=fatal msg="failed to initialize the cluster: Some cluster operators
> are still updating: authentication, console, monitoring"
> 
> # oc -n openshift-monitoring describe pod/prometheus-k8s-1
> /prometheus-k8s-1 to yy4348-sxqkf-w-c-2.c.openshift-qe.internal
>   Warning  FailedCreatePodSandBox  48m   kubelet,
> yy4348-sxqkf-w-c-2.c.openshift-qe.internal  Failed to create pod sandbox:
> rpc error: code = Unknown desc = failed to create pod network sandbox
> k8s_prometheus-k8s-1_openshift-monitoring_34941812-c4dc-409b-8a9a-
> 89135bd847d3_0(66aa175c3fe56a54cb5364f0fe5007462461ed74a897d47feae7f8d2d2c51e
> 12): Multus: [openshift-monitoring/prometheus-k8s-1]: error adding container
> to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd -
> "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request
> failed with status 400: '[openshift-monitoring/prometheus-k8s-1] failed to
> get pod annotation: timed out waiting for the condition

If this happens again, can you grab all three ovnkube-master pod logs for the 'ovnkube-master' container?
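
A sketch of one way to collect those (assumes the default openshift-ovn-kubernetes namespace and the app=ovnkube-master pod label):

# dump the ovnkube-master container log from each of the three masters
for p in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o name); do
  oc -n openshift-ovn-kubernetes logs "$p" -c ovnkube-master > "$(basename "$p")-ovnkube-master.log"
done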

Comment 7 zhaozhanqi 2020-05-21 02:20:42 UTC
Met this issue again on 4.5.0-0.nightly-2020-05-19-041951.
Please check the attached OVN logs, thanks.

Since OVN goes GA in 4.5, I think we should fix this issue in 4.5; moving the target to 4.5.

Comment 8 zhaozhanqi 2020-05-21 02:22:58 UTC
Created attachment 1690452 [details]
ovn-master logs

Comment 11 Ben Bennett 2020-05-29 13:13:35 UTC
If this is still happening, can you please gather the requested information next time you see it? Thanks!

Comment 12 zhaozhanqi 2020-06-01 02:05:02 UTC
Hi Ben,
We provided the ovn-master logs in comment 8 when this issue happened. Please let me know if that is not enough. Thanks.

Comment 13 Ryan Phillips 2020-06-05 17:58:59 UTC
*** Bug 1836376 has been marked as a duplicate of this bug. ***

Comment 14 Dan Williams 2020-06-08 14:14:56 UTC
I0520 10:00:03.379234       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-6ccr7] addLogicalPort took 263.076653ms
I0520 10:00:03.582160       1 kube.go:46] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.3.254/23"],"mac_address":"c6:0d:c6:80:03:ff","gateway_ips":["10.128.2.1"],"ip_address":"10.128.3.254/23","gateway_ip":"10.128.2.1"}}] on pod openshift-authentication/oauth-openshift-fd47d7f7f-gt9sr
I0520 10:00:03.621251       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-gt9sr] addLogicalPort took 241.941919ms
I0520 10:00:03.816231       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-8wl2q] addLogicalPort took 194.904483ms
E0520 10:00:03.816260       1 ovn.go:413] Error while obtaining addresses for openshift-authentication_oauth-openshift-fd47d7f7f-8wl2q: Error while obtaining addresses for openshift-authentication_oauth-openshift-fd47d7f7f-8wl2q

This means we'd need ovsdb-server and ovn-northd logs too... any chance you can get those, or a full must-gather for the cluster when it enters this state?
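
A sketch of how those could be gathered (container names in the ovnkube-master pod vary by release, so list them first; oc adm must-gather collects the full cluster state in one shot):

# full cluster must-gather
oc adm must-gather
# or, per ovnkube-master pod: list its containers, then dump them all
oc -n openshift-ovn-kubernetes get pod <ovnkube-master-pod> -o jsonpath='{.spec.containers[*].name}'
oc -n openshift-ovn-kubernetes logs <ovnkube-master-pod> --all-containers > ovnkube-master-all.log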

Comment 18 Dan Williams 2020-08-03 16:01:29 UTC
In Anurag's case, the master assigned the pod annotation 3 minutes before the node started the container:

Master:
2020-06-23T19:26:57.450184542Z I0623 19:26:57.450127       1 kube.go:46] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.2.8/23"],"mac_address":"66:74:33:80:02:09","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.8/23","gateway_ip":"10.128.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0
2020-06-23T19:26:57.47887959Z I0623 19:26:57.478821       1 pods.go:230] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 128.448259ms

Node:
2020-06-23T19:29:03.28464272Z I0623 19:29:03.284587    2276 cniserver.go:148] Waiting for ADD result for pod openshift-monitoring/prometheus-k8s-0
2020-06-23T19:29:03.28464272Z I0623 19:29:03.284617    2276 cni.go:147] [openshift-monitoring/prometheus-k8s-0] dispatching pod network request &{ADD openshift-monitoring prometheus-k8s-0 586af48f424fafd6b641052ceec3b71d3cd1244eb4943efe48c597c555364550 /proc/4450/ns/net eth0 0xc0002e0270}
...
2020-06-23T19:29:24.955635752Z I0623 19:29:24.955496    2276 cni.go:157] [openshift-monitoring/prometheus-k8s-0] CNI request &{ADD openshift-monitoring prometheus-k8s-0 586af48f424fafd6b641052ceec3b71d3cd1244eb4943efe48c597c555364550 /proc/4450/ns/net eth0 0xc0002e0270}, result "", err failed to get pod annotation: timed out waiting for the condition
2020-06-23T19:29:24.983550434Z I0623 19:29:24.983238    2276 cniserver.go:148] Waiting for DEL result for pod openshift-monitoring/prometheus-k8s-0

But in between 19:26:57 and 19:29:03 the apiserver hiccupped and likely the node couldn't reach it... apiserver logs from that time (z28zv-master-2) show a lot of initialization-type activity, but nothing interesting.

Comment 19 Ricardo Carrillo Cruz 2020-08-26 11:39:46 UTC
Hi there

Just got this ticket; it seems the last logs are from June.
Any chance of recreating this with fresh logs, since multiple OVN-Kubernetes enhancements have since been merged downstream?

Thanks

Comment 20 zhaozhanqi 2020-08-28 06:15:59 UTC
@huirwang
Could you help check whether this issue can still be reproduced on 4.6? If not, please move this bug to 'verified', thanks.

Comment 22 Mustafa UYSAL 2020-08-30 00:50:11 UTC
Hi all,

Looks like the same issue still exists in 4.5.6.

#oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       Unknown     Unknown       True       3h17m


#oc describe deployment authentication-operator

  Normal  OperatorStatusChanged  138m  cluster-authentication-operator-status-controller-statussyncer_authentication  Status for clusteroperator/authentication changed: Degraded message changed from "ConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nRouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ocp.dataserv.local: []" to "RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ocp.dataserv.local: []"

Comment 23 hhuebler@2innovate.at 2020-10-22 17:31:17 UTC
Hello @zzhao,
I'm still facing the issue as per https://bugzilla.redhat.com/show_bug.cgi?id=1836376, which was flagged as a duplicate of this one. I uploaded the OCP install gather files to https://www.dropbox.com/s/siu5xm3zk9karmu/log-bundle-20201017172330.tar.gz?dl=0 and https://www.dropbox.com/s/gxgxdsr5tjccdus/os45_install.debug?dl=0. Maybe that helps?

Comment 24 zhaozhanqi 2020-10-26 03:28:34 UTC
Hi hhuebler, could you also provide the install platform? UPI or IPI? openshift-sdn or OVN? Thanks.

Comment 25 hhuebler@2innovate.at 2020-10-27 09:13:16 UTC
Hey Zhaozhanqi,
Sorry for skipping that info. This was a UPI installation using KVM on CentOS 8 as the hypervisor.

Thanks!

Comment 29 errata-xmlrpc 2021-02-24 15:10:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 30 Red Hat Bugzilla 2023-09-15 00:29:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

