Description of problem:
Installing an OSP cluster with the OVN network type fails. One monitoring pod is not created successfully due to network errors.

How reproducible:
Sometimes

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-09-220310

Steps to Reproduce:
1. Set up an OSP cluster with networkType: "OVNKubernetes".
2. The installation fails with the errors below:

level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is still updating"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: waiting for Alertmanager: expected 3 replicas, updated 2 and available 2"
level=fatal msg="failed to initialize the cluster: Cluster operator monitoring is still updating"

tools/launch_instance.rb:621:in `installation_task': shell command failed execution, see logs (RuntimeError)
	from tools/launch_instance.rb:748:in `block in launch_template'
	from tools/launch_instance.rb:747:in `each'
	from tools/launch_instance.rb:747:in `launch_template'
	from tools/launch_instance.rb:55:in `block (2 levels) in run'
	from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/command.rb:182:in `call'
	from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/command.rb:153:in `run'
	from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/runner.rb:446:in `run_active_command'
	from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/runner.rb:68:in `run!'
	from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/delegates.rb:15:in `run!'
	from tools/launch_instance.rb:92:in `run'
	from tools/launch_instance.rb:880:in `<main>'
waiting for operation up to 36000 seconds..
oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      90m
cloud-credential                           4.4.0-0.nightly-2020-02-09-220310   True        False         False      112m
cluster-autoscaler                         4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
console                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      95m
csi-snapshot-controller                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      99m
dns                                        4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
etcd                                       4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m
image-registry                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
ingress                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
insights                                   4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
kube-apiserver                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      105m
kube-controller-manager                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
kube-scheduler                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
kube-storage-version-migrator              4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
machine-api                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
machine-config                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
marketplace                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m
monitoring                                                                     False       True          True       95m
network                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
node-tuning                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
openshift-apiserver                        4.4.0-0.nightly-2020-02-09-220310   True        False         False      103m
openshift-controller-manager               4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
openshift-samples                          4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-02-09-220310   True        False         False      104m
service-ca                                 4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
service-catalog-apiserver                  4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
service-catalog-controller-manager         4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
storage                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m

oc get pod -n openshift-monitoring -o wide
NAME                                           READY   STATUS              RESTARTS   AGE   IP             NODE                              NOMINATED NODE   READINESS GATES
alertmanager-main-0                            3/3     Running             0          71m   10.131.0.19    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
alertmanager-main-1                            3/3     Running             0          71m   10.131.0.20    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
alertmanager-main-2                            0/3     ContainerCreating   0          66m   <none>         huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
cluster-monitoring-operator-797b964b77-4mjxt   1/1     Running             0          73m   10.130.0.17    huir-osp-ovn-tkzlp-master-1       <none>           <none>
grafana-659f665879-gtt6k                       2/2     Running             0          66m   10.129.2.8     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
kube-state-metrics-bd8f6d6cf-8rjtb             3/3     Running             0          73m   10.131.0.15    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
node-exporter-bbmcc                            2/2     Running             0          71m   192.168.0.41   huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
node-exporter-cbq44                            2/2     Running             0          72m   192.168.0.34   huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
node-exporter-dhj5p                            2/2     Running             0          73m   192.168.0.25   huir-osp-ovn-tkzlp-master-0       <none>           <none>
node-exporter-jrqnw                            2/2     Running             0          73m   192.168.0.19   huir-osp-ovn-tkzlp-master-1       <none>           <none>
node-exporter-l7hj9                            2/2     Running             0          72m   192.168.0.13   huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>
node-exporter-n686p                            2/2     Running             0          73m   192.168.0.36   huir-osp-ovn-tkzlp-master-2       <none>           <none>
openshift-state-metrics-cdfb76f97-mkzgl        3/3     Running             0          73m   10.131.0.17    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-adapter-7775dc5c69-4j4j5            1/1     Running             0          73m   10.131.0.6     huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-adapter-7775dc5c69-q5nxh            1/1     Running             0          73m   10.131.0.9     huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-k8s-0                               7/7     Running             1          55m   10.129.2.6     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
prometheus-k8s-1                               7/7     Running             1          55m   10.128.2.7     huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>
prometheus-operator-c4db55b77-42g5z            1/1     Running             0          66m   10.130.0.22    huir-osp-ovn-tkzlp-master-1       <none>           <none>
telemeter-client-5976f9cc6f-8qzfd              3/3     Running             0          66m   10.129.2.7     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
thanos-querier-69576b85bf-cm5wz                4/4     Running             0          60m   10.129.2.10    huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
thanos-querier-69576b85bf-gvrts                4/4     Running             0          61m   10.128.2.8     huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>

oc describe pod alertmanager-main-2 -n openshift-monitoring (snippet of the error events):

Warning  FailedCreatePodSandBox  42m  kubelet, huir-osp-ovn-tkzlp-worker-mtx5g  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-2_openshift-monitoring_97be41f4-2448-49fd-9d2c-2a13287a14a6_0(35fe942fb6d74e80de1ab71f9494d2a1d1d298b0f7aef0d80173d02601db18fb): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition'

Warning  FailedCreatePodSandBox  46s (x87 over 41m)  kubelet, huir-osp-ovn-tkzlp-worker-mtx5g  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-2_openshift-monitoring_97be41f4-2448-49fd-9d2c-2a13287a14a6_0(9791771ff159b59ae95d1642569af85dd42445dc5859f8ef4103c6f83bd4f5b7): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition'

Expected Result:
Installation completes successfully without the above error.

Additional info:
This issue can be worked around by recreating the affected pod.

The kubeconfig: https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/80058/console
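For reference, deleting the stuck pod (e.g. oc delete pod alertmanager-main-2 -n openshift-monitoring) lets the StatefulSet controller recreate it, which gives ovnkube-master a fresh creation event to act on. Below is a minimal client-go sketch of the same workaround; it assumes a recent client-go and a kubeconfig at the default location, and the namespace/pod names are simply the ones from this report:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Delete the stuck pod; the alertmanager-main StatefulSet recreates it,
	// which produces a new pod-creation event for ovnkube-master to handle.
	err = client.CoreV1().Pods("openshift-monitoring").Delete(
		context.TODO(), "alertmanager-main-2", metav1.DeleteOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("deleted openshift-monitoring/alertmanager-main-2; the StatefulSet will recreate it")
}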
This seems to be another informer problem: alertmanager-main-2 (belonging to statefulset.apps/alertmanager-main) is created twice. However, we only receive the first creation event. We can prove this by looking at the ovnkube-master logs:

oc logs -c ovnkube-master ovnkube-master-k9nqb | grep alertmanager-main-2
time="2020-02-10T06:49:29Z" level=info msg="Setting annotations map[k8s.ovn.org/pod-networks:{\"default\":{\"ip_address\":\"10.131.0.21/23\",\"mac_address\":\"2a:fc:56:83:00:16\",\"gateway_ip\":\"10.131.0.1\"}} ovn:{\"ip_address\":\"10.131.0.21/23\",\"mac_address\":\"2a:fc:56:83:00:16\",\"gateway_ip\":\"10.131.0.1\"}] on pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:49:29Z" level=info msg="[openshift-monitoring/alertmanager-main-2] addLogicalPort took 398.127906ms"

We can thus see that its creation event was received at 06:49:29 and that it was assigned the IP 10.131.0.21 on node huir-osp-ovn-tkzlp-worker-pkwnn (which has the subnet 10.131.0.0). If we look at the ovnkube-node logs from that node, we can see that the CNI request for this pod is handled correctly:

oc logs -c ovnkube-node ovnkube-node-gwd84 | grep alertmanager-main-2
time="2020-02-10T06:49:34Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:49:34Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{ADD openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc0001292b0}"
time="2020-02-10T06:49:34Z" level=warning msg="failed to clear stale OVS port \"\" iface-id \"openshift-monitoring_alertmanager-main-2\": failed to run 'ovs-vsctl --timeout=30 remove Interface external-ids iface-id': exit status 1\n \"ovs-vsctl: no row \\\"\\\" in table Interface\\n\"\n \"\""
time="2020-02-10T06:49:36Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{ADD openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc0001292b0}, result \"{\\\"Result\\\":{\\\"interfaces\\\":[{\\\"name\\\":\\\"10c6c81e6ba2284\\\",\\\"mac\\\":\\\"36:34:35:63:c3:7f\\\"},{\\\"name\\\":\\\"eth0\\\",\\\"mac\\\":\\\"2a:fc:56:83:00:16\\\",\\\"sandbox\\\":\\\"/proc/6782/ns/net\\\"}],\\\"ips\\\":[{\\\"version\\\":\\\"4\\\",\\\"interface\\\":1,\\\"address\\\":\\\"10.131.0.21/23\\\",\\\"gateway\\\":\\\"10.131.0.1\\\"}],\\\"dns\\\":{}},\\\"PodIFInfo\\\":null}\", err <nil>"
time="2020-02-10T06:54:59Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:54:59Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc000128000}"
time="2020-02-10T06:54:59Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc000128000}, result \"\", err <nil>"

We can also see that at 06:54:59 this first instance is deleted (see the DEL CNI request). The second alertmanager-main-2 pod is then re-created on node huir-osp-ovn-tkzlp-worker-mtx5g.
The ovnkube-node logs on that node show the following for this pod:

oc logs -c ovnkube-node ovnkube-node-njhvs | grep alertmanager-main-2 | head -n 10
time="2020-02-10T06:55:11Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:11Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{ADD openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0002e8680}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{ADD openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0002e8680}, result \"\", err failed to get pod annotation: timed out waiting for the condition"
time="2020-02-10T06:55:33Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0000ac270}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0000ac270}, result \"\", err <nil>"
time="2020-02-10T06:55:33Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc000343380}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc000343380}, result \"\", err <nil>"
time="2020-02-10T06:55:40Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"

The pod newly created at 06:55:11 is thus never annotated by ovnkube-master (the ovnkube-master output above is the complete set of log lines mentioning alertmanager-main-2). Our pod watcher in ovnkube-master seems to have never received the new pod-creation notification.

Assigning to Dan Williams, because I am seeing this PR in upstream ovn-kubernetes: https://github.com/ovn-org/ovn-kubernetes/pull/1043, which could be fixing this issue? Let me know @dcbw / Alex
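For context on the node-side error text, here is a minimal sketch of the waiting logic (not the actual ovn-kubernetes source; waitForPodAnnotation is a made-up name, and it assumes a recent client-go with a kubeconfig at the default location). The CNI ADD path polls for the k8s.ovn.org/pod-networks annotation, so if ovnkube-master never writes it, the poll expires with exactly "failed to get pod annotation: timed out waiting for the condition":

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const podNetworksAnnotation = "k8s.ovn.org/pod-networks"

// waitForPodAnnotation polls until ovnkube-master has written the pod-networks
// annotation (IP, MAC, gateway) that the CNI plugin needs to wire up eth0.
// If the master never saw the pod-creation event, the annotation never appears
// and wait.PollImmediate returns "timed out waiting for the condition".
func waitForPodAnnotation(client kubernetes.Interface, namespace, name string, timeout time.Duration) (string, error) {
	var annotation string
	err := wait.PollImmediate(200*time.Millisecond, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // retry on transient apiserver errors
		}
		annotation = pod.Annotations[podNetworksAnnotation]
		return annotation != "", nil
	})
	if err != nil {
		return "", fmt.Errorf("failed to get pod annotation: %v", err)
	}
	return annotation, nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ann, err := waitForPodAnnotation(client, "openshift-monitoring", "alertmanager-main-2", 30*time.Second)
	if err != nil {
		fmt.Println(err) // e.g. "failed to get pod annotation: timed out waiting for the condition"
		return
	}
	fmt.Println("pod-networks annotation:", ann)
}

The timeout value here is only illustrative; the point is that the same error fires both when the annotation was never written (the missed-event case analysed above) and when the node cannot read it from the apiserver within the window.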
Met the same issue again with registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-02-21-011943
Hit it with payload 4.4.0-0.nightly-2020-02-21-011943.

Events:
  Type     Reason                  Age        From                Message
  ----     ------                  ----       ----                -------
  Normal   Scheduled               <unknown>  default-scheduler   Successfully assigned openshift-monitoring/alertmanager-main-1 to anli22-9w7w9-w-a-0.c.openshift-qe.internal
  Warning  FailedCreatePodSandBox  41m        kubelet, anli22-9w7w9-w-a-0.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-1_openshift-monitoring_9214cc8d-3191-423b-a145-12286d8132f0_0(9f86ea23d8583395c5a4135dfe77d7fe8b6c789dbe7268cd5cbdf4958a2aa4cc): Multus: [openshift-monitoring/alertmanager-main-1]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-1] failed to get pod annotation: timed out waiting for the condition
This is likely to be caused by missed events in the DeltaFIFO on watch disconnect, which was addressed here:
1) upstream kubernetes client-go PR: https://github.com/kubernetes/kubernetes/pull/83911
2) upstream ovn-kubernetes issue about it: https://github.com/ovn-org/ovn-kubernetes/issues/1202
3) upstream ovn-kubernetes PR to revendor client-go with the fix from (1): https://github.com/ovn-org/ovn-kubernetes/pull/1199
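To make the failure mode concrete: ovnkube-master only annotates pods in reaction to pod informer events, so a single dropped Add means addLogicalPort never runs for that pod and the node-side CNI ADD times out. A rough, hypothetical sketch of that wiring (not the actual ovn-kubernetes source; addLogicalPort here is just a stand-in, the resync period is illustrative, and it assumes a kubeconfig at the default location):

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

// addLogicalPort stands in for the real ovnkube-master handler that creates the
// OVN logical switch port and writes the k8s.ovn.org/pod-networks annotation.
func addLogicalPort(pod *corev1.Pod) {
	fmt.Printf("would allocate IP/MAC and annotate %s/%s\n", pod.Namespace, pod.Name)
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Pod handling hangs entirely off informer callbacks: if the Add event for a
	// recreated pod is lost inside the DeltaFIFO (e.g. around a watch disconnect),
	// AddFunc is never invoked and the pod is never annotated.
	factory := informers.NewSharedInformerFactory(client, 5*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			addLogicalPort(obj.(*corev1.Pod))
		},
		UpdateFunc: func(_, newObj interface{}) {
			// Re-check pods on update/resync so a missed Add can still be repaired.
			addLogicalPort(newObj.(*corev1.Pod))
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // keep reacting to pod events
}

A resync period only re-delivers objects that made it into the local cache, so it is at best a partial mitigation; the real fix is the client-go DeltaFIFO change referenced in (1).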
Facing it on a GCP installation with 4.4.0-rc.13

level=info msg="Waiting up to 30m0s for the cluster at https://api.yy4348.qe.gcp.devcluster.openshift.com:6443 to initialize..."
level=error msg="Cluster operator authentication Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route: dial tcp 34.66.248.230:443: connect: connection refused"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.4.0-rc.13"
level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus: expected 2 replicas, updated 1 and available 1"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, monitoring"

# oc -n openshift-monitoring describe pod/prometheus-k8s-1
/prometheus-k8s-1 to yy4348-sxqkf-w-c-2.c.openshift-qe.internal
Warning  FailedCreatePodSandBox  48m  kubelet, yy4348-sxqkf-w-c-2.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_34941812-c4dc-409b-8a9a-89135bd847d3_0(66aa175c3fe56a54cb5364f0fe5007462461ed74a897d47feae7f8d2d2c51e12): Multus: [openshift-monitoring/prometheus-k8s-1]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-monitoring/prometheus-k8s-1] failed to get pod annotation: timed out waiting for the condition
(In reply to yangyang from comment #5)
> Facing it on GCP installation with 4.4.0-rc.13
> [...]

If this happens again, can you grab the logs from all three ovnkube-master pods for the 'ovnkube-master' container?
Met this issue again on 4.5.0-0.nightly-2020-05-19-041951. Please check the attached OVN logs, thanks.

Since OVN goes GA in 4.5, I think we should fix this issue in 4.5; moving the target to 4.5.
Created attachment 1690452 [details] ovn-master logs
If this is still happening, can you please gather the requested information the next time you see it? Thanks!
Hi Ben, we provided the ovn-master logs in comment 8 when this issue happened. Please let me know if that is not enough. Thanks.
*** Bug 1836376 has been marked as a duplicate of this bug. ***
I0520 10:00:03.379234       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-6ccr7] addLogicalPort took 263.076653ms
I0520 10:00:03.582160       1 kube.go:46] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.3.254/23"],"mac_address":"c6:0d:c6:80:03:ff","gateway_ips":["10.128.2.1"],"ip_address":"10.128.3.254/23","gateway_ip":"10.128.2.1"}}] on pod openshift-authentication/oauth-openshift-fd47d7f7f-gt9sr
I0520 10:00:03.621251       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-gt9sr] addLogicalPort took 241.941919ms
I0520 10:00:03.816231       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-8wl2q] addLogicalPort took 194.904483ms
E0520 10:00:03.816260       1 ovn.go:413] Error while obtaining addresses for openshift-authentication_oauth-openshift-fd47d7f7f-8wl2q: Error while obtaining addresses for openshift-authentication_oauth-openshift-fd47d7f7f-8wl2q

This means we'd need ovsdb-server and ovn-northd logs too... any chance you can get those, or a full must-gather for the cluster when it enters this state?
In Anurag's case, the master assigned the pod annotation about two minutes before the node started the container:

Master:
2020-06-23T19:26:57.450184542Z I0623 19:26:57.450127       1 kube.go:46] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.2.8/23"],"mac_address":"66:74:33:80:02:09","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.8/23","gateway_ip":"10.128.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0
2020-06-23T19:26:57.47887959Z I0623 19:26:57.478821       1 pods.go:230] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 128.448259ms

Node:
2020-06-23T19:29:03.28464272Z I0623 19:29:03.284587    2276 cniserver.go:148] Waiting for ADD result for pod openshift-monitoring/prometheus-k8s-0
2020-06-23T19:29:03.28464272Z I0623 19:29:03.284617    2276 cni.go:147] [openshift-monitoring/prometheus-k8s-0] dispatching pod network request &{ADD openshift-monitoring prometheus-k8s-0 586af48f424fafd6b641052ceec3b71d3cd1244eb4943efe48c597c555364550 /proc/4450/ns/net eth0 0xc0002e0270}
...
2020-06-23T19:29:24.955635752Z I0623 19:29:24.955496    2276 cni.go:157] [openshift-monitoring/prometheus-k8s-0] CNI request &{ADD openshift-monitoring prometheus-k8s-0 586af48f424fafd6b641052ceec3b71d3cd1244eb4943efe48c597c555364550 /proc/4450/ns/net eth0 0xc0002e0270}, result "", err failed to get pod annotation: timed out waiting for the condition
2020-06-23T19:29:24.983550434Z I0623 19:29:24.983238    2276 cniserver.go:148] Waiting for DEL result for pod openshift-monitoring/prometheus-k8s-0

But between 19:26:57 and 19:29:03 the apiserver hiccupped, and the node likely couldn't reach it... the apiserver logs from that time (z28zv-master-2) show a lot of initialization-type activity, but nothing interesting.
Hi there, I just got this ticket, and it seems the last logs are from June. Any chance you could recreate this with fresh logs, since multiple OVN-K enhancements have been merged downstream since then? Thanks
@huirwang Could you help check whether this issue can still be reproduced on 4.6? If not, please move this bug to 'verified'. Thanks
Hi all,

Looks like the same issue still exists in 4.5.6.

# oc get co
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication             Unknown     Unknown       True       3h17m

# oc describe deployment authentication-operator
Normal  OperatorStatusChanged  138m  cluster-authentication-operator-status-controller-statussyncer_authentication  Status for clusteroperator/authentication changed: Degraded message changed from "ConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nRouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ocp.dataserv.local: []" to "RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ocp.dataserv.local: []"
Hello @zzhao, I'm still facing the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1836376, which was flagged as a duplicate of this one. I uploaded the OCP install gather files to https://www.dropbox.com/s/siu5xm3zk9karmu/log-bundle-20201017172330.tar.gz?dl=0 and https://www.dropbox.com/s/gxgxdsr5tjccdus/os45_install.debug?dl=0. Maybe that helps?
Hi hhuebler, could you also provide the install platform? UPI or IPI? openshift-sdn or OVN? Thanks.
Hey Zhaozhanqi, sorry for skipping that info. This was a UPI installation using KVM on CentOS 8 as the hypervisor. Thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days