Bug 1801089

Summary: [OVN] Installation failed and monitoring pod not created due to some network error.
Product: OpenShift Container Platform Reporter: huirwang
Component: Networking Assignee: Ricardo Carrillo Cruz <ricarril>
Networking sub component: ovn-kubernetes QA Contact: huirwang
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aconstan, anusaxen, bbennett, dcbw, hhuebler, juzhao, lmohanty, mustafa.uysal, ricarril, sumehta, yanyang, zzhao
Version: 4.4 Keywords: UpcomingSprint
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: SDN-CI-IMPACT
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:10:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
ovn-master logs (flags: none)

Description huirwang 2020-02-10 08:51:10 UTC
Description of problem:
Installed an OSP cluster with the OVN network type; the installation failed. One monitoring pod was not created successfully due to network errors.

How reproducible:
Sometimes

Version-Release number of selected component (if applicable):
 4.4.0-0.nightly-2020-02-09-220310
 
Steps to Reproduce:
Set up an OSP cluster with networkType: "OVNKubernetes".
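For reference, the network type is selected in install-config.yaml before running the installer. A minimal excerpt is sketched below; OVNKubernetes replaces the 4.4 default OpenShiftSDN, and the clusterNetwork/serviceNetwork values shown are the usual installer defaults, consistent with the /23-per-node pod addresses (10.128.x-10.131.x) seen later in this report:

networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16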

The installation failed with the errors below:
level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is still updating"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: waiting for Alertmanager: expected 3 replicas, updated 2 and available 2"
level=fatal msg="failed to initialize the cluster: Cluster operator monitoring is still updating"
tools/launch_instance.rb:621:in `installation_task': shell command failed execution, see logs (RuntimeError)
        from tools/launch_instance.rb:748:in `block in launch_template'
        from tools/launch_instance.rb:747:in `each'
        from tools/launch_instance.rb:747:in `launch_template'
        from tools/launch_instance.rb:55:in `block (2 levels) in run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/command.rb:182:in `call'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/command.rb:153:in `run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/runner.rb:446:in `run_active_command'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/runner.rb:68:in `run!'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.4.7/lib/commander/delegates.rb:15:in `run!'
        from tools/launch_instance.rb:92:in `run'
        from tools/launch_instance.rb:880:in `<main>'
waiting for operation up to 36000 seconds..


oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      90m
cloud-credential                           4.4.0-0.nightly-2020-02-09-220310   True        False         False      112m
cluster-autoscaler                         4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
console                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      95m
csi-snapshot-controller                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      99m
dns                                        4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
etcd                                       4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m
image-registry                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
ingress                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
insights                                   4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
kube-apiserver                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      105m
kube-controller-manager                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
kube-scheduler                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
kube-storage-version-migrator              4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
machine-api                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
machine-config                             4.4.0-0.nightly-2020-02-09-220310   True        False         False      106m
marketplace                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m
monitoring                                                                     False       True          True       95m
network                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
node-tuning                                4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
openshift-apiserver                        4.4.0-0.nightly-2020-02-09-220310   True        False         False      103m
openshift-controller-manager               4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
openshift-samples                          4.4.0-0.nightly-2020-02-09-220310   True        False         False      100m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-02-09-220310   True        False         False      107m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-02-09-220310   True        False         False      104m
service-ca                                 4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
service-catalog-apiserver                  4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
service-catalog-controller-manager         4.4.0-0.nightly-2020-02-09-220310   True        False         False      108m
storage                                    4.4.0-0.nightly-2020-02-09-220310   True        False         False      101m

 oc get pod  -n openshift-monitoring -o wide
NAME                                           READY   STATUS              RESTARTS   AGE   IP             NODE                              NOMINATED NODE   READINESS GATES
alertmanager-main-0                            3/3     Running             0          71m   10.131.0.19    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
alertmanager-main-1                            3/3     Running             0          71m   10.131.0.20    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
alertmanager-main-2                            0/3     ContainerCreating   0          66m   <none>         huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
cluster-monitoring-operator-797b964b77-4mjxt   1/1     Running             0          73m   10.130.0.17    huir-osp-ovn-tkzlp-master-1       <none>           <none>
grafana-659f665879-gtt6k                       2/2     Running             0          66m   10.129.2.8     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
kube-state-metrics-bd8f6d6cf-8rjtb             3/3     Running             0          73m   10.131.0.15    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
node-exporter-bbmcc                            2/2     Running             0          71m   192.168.0.41   huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
node-exporter-cbq44                            2/2     Running             0          72m   192.168.0.34   huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
node-exporter-dhj5p                            2/2     Running             0          73m   192.168.0.25   huir-osp-ovn-tkzlp-master-0       <none>           <none>
node-exporter-jrqnw                            2/2     Running             0          73m   192.168.0.19   huir-osp-ovn-tkzlp-master-1       <none>           <none>
node-exporter-l7hj9                            2/2     Running             0          72m   192.168.0.13   huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>
node-exporter-n686p                            2/2     Running             0          73m   192.168.0.36   huir-osp-ovn-tkzlp-master-2       <none>           <none>
openshift-state-metrics-cdfb76f97-mkzgl        3/3     Running             0          73m   10.131.0.17    huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-adapter-7775dc5c69-4j4j5            1/1     Running             0          73m   10.131.0.6     huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-adapter-7775dc5c69-q5nxh            1/1     Running             0          73m   10.131.0.9     huir-osp-ovn-tkzlp-worker-pkwnn   <none>           <none>
prometheus-k8s-0                               7/7     Running             1          55m   10.129.2.6     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
prometheus-k8s-1                               7/7     Running             1          55m   10.128.2.7     huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>
prometheus-operator-c4db55b77-42g5z            1/1     Running             0          66m   10.130.0.22    huir-osp-ovn-tkzlp-master-1       <none>           <none>
telemeter-client-5976f9cc6f-8qzfd              3/3     Running             0          66m   10.129.2.7     huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
thanos-querier-69576b85bf-cm5wz                4/4     Running             0          60m   10.129.2.10    huir-osp-ovn-tkzlp-worker-mtx5g   <none>           <none>
thanos-querier-69576b85bf-gvrts                4/4     Running             0          61m   10.128.2.8     huir-osp-ovn-tkzlp-worker-mwcn5   <none>           <none>


oc describe pod alertmanager-main-2  -n openshift-monitoring
Snippet of the error events:
  Warning  FailedCreatePodSandBox  42m  kubelet, huir-osp-ovn-tkzlp-worker-mtx5g  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-2_openshift-monitoring_97be41f4-2448-49fd-9d2c-2a13287a14a6_0(35fe942fb6d74e80de1ab71f9494d2a1d1d298b0f7aef0d80173d02601db18fb): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
'
  Warning  FailedCreatePodSandBox  46s (x87 over 41m)  kubelet, huir-osp-ovn-tkzlp-worker-mtx5g  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-2_openshift-monitoring_97be41f4-2448-49fd-9d2c-2a13287a14a6_0(9791771ff159b59ae95d1642569af85dd42445dc5859f8ef4103c6f83bd4f5b7): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition


Expected Result:
Installation completes successfully without the above errors.


Additional info: this issue can be worked around by recreating (deleting) the affected pod.

The kubeconfig:
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/80058/console

Comment 1 Alexander Constantinescu 2020-02-11 15:51:08 UTC
This seems to be another informer problem:

alertmanager-main-2 (belonging to statefulset.apps/alertmanager-main) is created twice. However, we only receive the first creation event, which we can prove by looking at the ovnkube-master logs:

oc logs -c ovnkube-master ovnkube-master-k9nqb | grep alertmanager-main-2
time="2020-02-10T06:49:29Z" level=info msg="Setting annotations map[k8s.ovn.org/pod-networks:{\"default\":{\"ip_address\":\"10.131.0.21/23\",\"mac_address\":\"2a:fc:56:83:00:16\",\"gateway_ip\":\"10.131.0.1\"}} ovn:{\"ip_address\":\"10.131.0.21/23\",\"mac_address\":\"2a:fc:56:83:00:16\",\"gateway_ip\":\"10.131.0.1\"}] on pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:49:29Z" level=info msg="[openshift-monitoring/alertmanager-main-2] addLogicalPort took 398.127906ms"

We can thus see that we received its creation at 06:49:29 and assigned it the IP 10.131.0.21 on node huir-osp-ovn-tkzlp-worker-pkwnn (which has the 10.131.0.0 subnet). The ovnkube-node logs from that node show that the CNI request for this pod is handled correctly:

oc logs -c ovnkube-node ovnkube-node-gwd84  | grep alertmanager-main-2
time="2020-02-10T06:49:34Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:49:34Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{ADD openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc0001292b0}"
time="2020-02-10T06:49:34Z" level=warning msg="failed to clear stale OVS port \"\" iface-id \"openshift-monitoring_alertmanager-main-2\": failed to run 'ovs-vsctl --timeout=30 remove Interface  external-ids iface-id': exit status 1\n  \"ovs-vsctl: no row \\\"\\\" in table Interface\\n\"\n  \"\""
time="2020-02-10T06:49:36Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{ADD openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc0001292b0}, result \"{\\\"Result\\\":{\\\"interfaces\\\":[{\\\"name\\\":\\\"10c6c81e6ba2284\\\",\\\"mac\\\":\\\"36:34:35:63:c3:7f\\\"},{\\\"name\\\":\\\"eth0\\\",\\\"mac\\\":\\\"2a:fc:56:83:00:16\\\",\\\"sandbox\\\":\\\"/proc/6782/ns/net\\\"}],\\\"ips\\\":[{\\\"version\\\":\\\"4\\\",\\\"interface\\\":1,\\\"address\\\":\\\"10.131.0.21/23\\\",\\\"gateway\\\":\\\"10.131.0.1\\\"}],\\\"dns\\\":{}},\\\"PodIFInfo\\\":null}\", err <nil>"
time="2020-02-10T06:54:59Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:54:59Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc000128000}"
time="2020-02-10T06:54:59Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 10c6c81e6ba2284a2721c8d9666e51ad049149bdb340286a18f4dec6b9395dea /proc/6782/ns/net eth0 0xc000128000}, result \"\", err <nil>"


We can also see that it is deleted at 06:54:59 (see the DEL CNI request).

The second alertmanager-main-2 pod is re-created on node huir-osp-ovn-tkzlp-worker-mtx5g. The ovnkube-node logs from that node show the following for the pod:

oc logs -c ovnkube-node ovnkube-node-njhvs  | grep alertmanager-main-2 | head -n 10
time="2020-02-10T06:55:11Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:11Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{ADD openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0002e8680}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{ADD openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0002e8680}, result \"\", err failed to get pod annotation: timed out waiting for the condition"
time="2020-02-10T06:55:33Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0000ac270}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc0000ac270}, result \"\", err <nil>"
time="2020-02-10T06:55:33Z" level=info msg="Waiting for DEL result for pod openshift-monitoring/alertmanager-main-2"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] dispatching pod network request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc000343380}"
time="2020-02-10T06:55:33Z" level=info msg="[openshift-monitoring/alertmanager-main-2] CNI request &{DEL openshift-monitoring alertmanager-main-2 616f59cb52429a740f6c48c9868bf963960c3c3cade7d2e431eeb84ed5bb09cd /proc/13786/ns/net eth0 0xc000343380}, result \"\", err <nil>"
time="2020-02-10T06:55:40Z" level=info msg="Waiting for ADD result for pod openshift-monitoring/alertmanager-main-2"

The pod newly created at 06:55:11 is thus never annotated by ovnkube-master (the ovnkube-master logs quoted above are all of its lines mentioning alertmanager-main-2).

Our pod watcher in ovnkube-master seems to never have received the new pod creation notification.  
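For illustration only, here is a minimal client-go sketch of the informer pattern such a pod watcher is built on (this is not the actual ovn-kubernetes code; the handler bodies are placeholders). If the Added event for the recreated pod is never delivered, the code that would allocate an IP and set the k8s.ovn.org/pod-networks annotation simply never runs for that pod:

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shared informer factory; 0 means no periodic resync of the local cache.
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			// This is where a master would allocate an IP from the node's
			// subnet and set the k8s.ovn.org/pod-networks annotation
			// (compare the "addLogicalPort" lines in the logs above).
			fmt.Printf("pod added: %s/%s\n", pod.Namespace, pod.Name)
		},
		DeleteFunc: func(obj interface{}) {
			// Release the allocation for the deleted pod.
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	// The handlers above only fire for events the watch actually delivers;
	// an Added event lost around a watch disconnect leaves the new pod
	// unannotated, which matches what is observed in this comment.
	time.Sleep(time.Hour)
}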

Assigning to Dan Williams, because I see this PR in upstream ovn-kubernetes: https://github.com/ovn-org/ovn-kubernetes/pull/1043, which could be fixing this issue.

Let me know, @dcbw.


/Alex

Comment 2 zhaozhanqi 2020-02-21 07:55:31 UTC
Met the same issue again with registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-02-21-011943.

Comment 3 Anping Li 2020-02-21 07:56:01 UTC
Hit it with payload 4.4.0-0.nightly-2020-02-21-011943.

Events:
  Type     Reason                  Age        From                                                 Message
  ----     ------                  ----       ----                                                 -------
  Normal   Scheduled               <unknown>  default-scheduler                                    Successfully assigned openshift-monitoring/alertmanager-main-1 to anli22-9w7w9-w-a-0.c.openshift-qe.internal
  Warning  FailedCreatePodSandBox  41m        kubelet, anli22-9w7w9-w-a-0.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-1_openshift-monitoring_9214cc8d-3191-423b-a145-12286d8132f0_0(9f86ea23d8583395c5a4135dfe77d7fe8b6c789dbe7268cd5cbdf4958a2aa4cc): Multus: [openshift-monitoring/alertmanager-main-1]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-1] failed to get pod annotation: timed out waiting for the condition

Comment 4 Dan Williams 2020-04-01 17:11:52 UTC
This is likely to be caused by missed events in the DeltaFIFO on watch disconnect, which was addressed here:

1) upstream kubernetes client-go PR https://github.com/kubernetes/kubernetes/pull/83911
2) upstream ovn-kube issue about it: https://github.com/ovn-org/ovn-kubernetes/issues/1202
3) upstream ovn-kube PR to revendor client-go with the fix from (1): https://github.com/ovn-org/ovn-kubernetes/pull/1199

Comment 5 Yang Yang 2020-04-29 04:01:12 UTC
Facing it on GCP installation with 4.4.0-rc.13

level=info msg="Waiting up to 30m0s for the cluster at https://api.yy4348.qe.gcp.devcluster.openshift.com:6443 to initialize..."
level=error msg="Cluster operator authentication Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route: dial tcp 34.66.248.230:443: connect: connection refused"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.4.0-rc.13"
level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus: expected 2 replicas, updated 1 and available 1"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, monitoring"

# oc -n openshift-monitoring describe pod/prometheus-k8s-1
/prometheus-k8s-1 to yy4348-sxqkf-w-c-2.c.openshift-qe.internal
  Warning  FailedCreatePodSandBox  48m   kubelet, yy4348-sxqkf-w-c-2.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_34941812-c4dc-409b-8a9a-89135bd847d3_0(66aa175c3fe56a54cb5364f0fe5007462461ed74a897d47feae7f8d2d2c51e12): Multus: [openshift-monitoring/prometheus-k8s-1]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-monitoring/prometheus-k8s-1] failed to get pod annotation: timed out waiting for the condition

Comment 6 Dan Williams 2020-05-11 14:19:10 UTC
(In reply to yangyang from comment #5)
> Facing it on GCP installation with 4.4.0-rc.13
> 
> level=info msg="Waiting up to 30m0s for the cluster at
> https://api.yy4348.qe.gcp.devcluster.openshift.com:6443 to initialize..."
> level=error msg="Cluster operator authentication Degraded is True with
> RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route: dial tcp
> 34.66.248.230:443: connect: connection refused"
> level=info msg="Cluster operator authentication Progressing is Unknown with
> NoData: "
> level=info msg="Cluster operator authentication Available is Unknown with
> NoData: "
> level=info msg="Cluster operator console Progressing is True with
> SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward
> version 4.4.0-rc.13"
> level=info msg="Cluster operator console Available is False with
> Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for
> console deployment"
> level=info msg="Cluster operator insights Disabled is False with : "
> level=info msg="Cluster operator monitoring Available is False with : "
> level=info msg="Cluster operator monitoring Progressing is True with
> RollOutInProgress: Rolling out the stack."
> level=error msg="Cluster operator monitoring Degraded is True with
> UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: running
> task Updating Prometheus-k8s failed: waiting for Prometheus object changes
> failed: waiting for Prometheus: expected 2 replicas, updated 1 and available
> 1"
> level=fatal msg="failed to initialize the cluster: Some cluster operators
> are still updating: authentication, console, monitoring"
> 
> # oc -n openshift-monitoring describe pod/prometheus-k8s-1
> /prometheus-k8s-1 to yy4348-sxqkf-w-c-2.c.openshift-qe.internal
>   Warning  FailedCreatePodSandBox  48m   kubelet,
> yy4348-sxqkf-w-c-2.c.openshift-qe.internal  Failed to create pod sandbox:
> rpc error: code = Unknown desc = failed to create pod network sandbox
> k8s_prometheus-k8s-1_openshift-monitoring_34941812-c4dc-409b-8a9a-
> 89135bd847d3_0(66aa175c3fe56a54cb5364f0fe5007462461ed74a897d47feae7f8d2d2c51e
> 12): Multus: [openshift-monitoring/prometheus-k8s-1]: error adding container
> to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd -
> "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request
> failed with status 400: '[openshift-monitoring/prometheus-k8s-1] failed to
> get pod annotation: timed out waiting for the condition

If this happens again, can you grab all three ovnkube-master pod logs for the 'ovnkube-master' container?

Comment 7 zhaozhanqi 2020-05-21 02:20:42 UTC
Met this issue again on 4.5.0-0.nightly-2020-05-19-041951
Please check the attached OVN logs. Thanks.

Since OVN goes GA in 4.5, I think we should fix this issue in 4.5; moving the target to 4.5.

Comment 8 zhaozhanqi 2020-05-21 02:22:58 UTC
Created attachment 1690452 [details]
ovn-master logs

Comment 11 Ben Bennett 2020-05-29 13:13:35 UTC
If this is still happening, can you please gather the requested information next time you see it? Thanks!

Comment 12 zhaozhanqi 2020-06-01 02:05:02 UTC
Hi Ben,
We provided the ovn-master logs in comment 8 when this issue happened. Please let me know if that is not enough. Thanks.

Comment 13 Ryan Phillips 2020-06-05 17:58:59 UTC
*** Bug 1836376 has been marked as a duplicate of this bug. ***

Comment 14 Dan Williams 2020-06-08 14:14:56 UTC
I0520 10:00:03.379234       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-6ccr7] addLogicalPort took 263.076653ms
I0520 10:00:03.582160       1 kube.go:46] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.3.254/23"],"mac_address":"c6:0d:c6:80:03:ff","gateway_ips":["10.128.2.1"],"ip_addres
s":"10.128.3.254/23","gateway_ip":"10.128.2.1"}}] on pod openshift-authentication/oauth-openshift-fd47d7f7f-gt9sr
I0520 10:00:03.621251       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-gt9sr] addLogicalPort took 241.941919ms
I0520 10:00:03.816231       1 pods.go:231] [openshift-authentication/oauth-openshift-fd47d7f7f-8wl2q] addLogicalPort took 194.904483ms
E0520 10:00:03.816260       1 ovn.go:413] Error while obtaining addresses for openshift-authentication_oauth-openshift-fd47d7f7f-8wl2q: Error while obtaining addresses for openshift-authentication_oauth-openshif
t-fd47d7f7f-8wl2q

This means we'd need ovsdb-server and ovn-northd logs too... any chance you can get those, or a full must-gather for the cluster when it enters this state?

Comment 18 Dan Williams 2020-08-03 16:01:29 UTC
In Anurag's case, the master assigned the pod annotation about two minutes before the node dispatched the CNI ADD for the pod:

Master:
2020-06-23T19:26:57.450184542Z I0623 19:26:57.450127       1 kube.go:46] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.2.8/23"],"mac_address":"66:74:33:80:02:09","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.8/23","gateway_ip":"10.128.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0
2020-06-23T19:26:57.47887959Z I0623 19:26:57.478821       1 pods.go:230] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 128.448259ms

Node:
2020-06-23T19:29:03.28464272Z I0623 19:29:03.284587    2276 cniserver.go:148] Waiting for ADD result for pod openshift-monitoring/prometheus-k8s-0
2020-06-23T19:29:03.28464272Z I0623 19:29:03.284617    2276 cni.go:147] [openshift-monitoring/prometheus-k8s-0] dispatching pod network request &{ADD openshift-monitoring prometheus-k8s-0 586af48f424fafd6b641052ceec3b71d3cd1244eb4943efe48c597c555364550 /proc/4450/ns/net eth0 0xc0002e0270}
...
2020-06-23T19:29:24.955635752Z I0623 19:29:24.955496    2276 cni.go:157] [openshift-monitoring/prometheus-k8s-0] CNI request &{ADD openshift-monitoring prometheus-k8s-0 586af48f424fafd6b641052ceec3b71d3cd1244eb4943efe48c597c555364550 /proc/4450/ns/net eth0 0xc0002e0270}, result "", err failed to get pod annotation: timed out waiting for the condition
2020-06-23T19:29:24.983550434Z I0623 19:29:24.983238    2276 cniserver.go:148] Waiting for DEL result for pod openshift-monitoring/prometheus-k8s-0

But in between 19:26:57 and 19:29:03 the apiserver hiccupped and likely the node couldn't reach it... apiserver logs from that time (z28zv-master-2) show a lot of initialization-type activity, but nothing interesting.
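As context for the error text: "timed out waiting for the condition" is the standard error returned by the apimachinery wait package. Below is a rough, illustrative approximation (not the actual ovnkube-node CNI handler; waitForPodAnnotation, the 500ms interval and the 30s timeout are assumptions) of how a CNI ADD path could wait for the k8s.ovn.org/pod-networks annotation that ovnkube-master is expected to set, written against a recent client-go:

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForPodAnnotation polls until the k8s.ovn.org/pod-networks annotation
// appears on the pod, or gives up. On timeout the wait package returns
// "timed out waiting for the condition", which is the error string that
// surfaces through the CNI ADD events quoted in this bug.
func waitForPodAnnotation(client kubernetes.Interface, namespace, name string, timeout time.Duration) (string, error) {
	var annotation string
	err := wait.PollImmediate(500*time.Millisecond, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			// Keep polling through transient errors (e.g. the apiserver
			// hiccup suspected above).
			return false, nil
		}
		value, ok := pod.Annotations["k8s.ovn.org/pod-networks"]
		if !ok {
			return false, nil // the master has not annotated the pod yet
		}
		annotation = value
		return true, nil
	})
	if err != nil {
		return "", fmt.Errorf("failed to get pod annotation: %v", err)
	}
	return annotation, nil
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	a, err := waitForPodAnnotation(client, "openshift-monitoring", "prometheus-k8s-0", 30*time.Second)
	fmt.Println(a, err)
}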

Comment 19 Ricardo Carrillo Cruz 2020-08-26 11:39:46 UTC
Hi there

Just got this ticket; it seems the last logs are from June.
Any chance you can recreate this with fresh logs, since multiple OVN-Kubernetes enhancements have been merged downstream?

Thanks

Comment 20 zhaozhanqi 2020-08-28 06:15:59 UTC
@huirwang
Could you help check whether this issue can still be reproduced on 4.6? If not, please move this bug to 'Verified'. Thanks.

Comment 22 Mustafa UYSAL 2020-08-30 00:50:11 UTC
Hi all,

Looks like the same issue still exists in 4.5.6.

#oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       Unknown     Unknown       True       3h17m


#oc describe deployment authentication-operator

  Normal  OperatorStatusChanged  138m  cluster-authentication-operator-status-controller-statussyncer_authentication  Status for clusteroperator/authentication changed: Degraded message changed from "ConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nRouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ocp.dataserv.local: []" to "RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ocp.dataserv.local: []"

Comment 23 hhuebler@2innovate.at 2020-10-22 17:31:17 UTC
Hello @zzhao,
I'm still facing the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1836376, which was flagged as a duplicate of this one. I uploaded the OCP install gather files to https://www.dropbox.com/s/siu5xm3zk9karmu/log-bundle-20201017172330.tar.gz?dl=0 and https://www.dropbox.com/s/gxgxdsr5tjccdus/os45_install.debug?dl=0. Maybe that helps?

Comment 24 zhaozhanqi 2020-10-26 03:28:34 UTC
Hi hhuebler, could you also provide the install platform? UPI or IPI? openshift-sdn or OVN? Thanks.

Comment 25 hhuebler@2innovate.at 2020-10-27 09:13:16 UTC
Hey Zhaozhanqi,
Sorry for skipping that info. This was a UPI installation using KVM on CentOS 8 as the hypervisor.

Thanks!

Comment 29 errata-xmlrpc 2021-02-24 15:10:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 30 Red Hat Bugzilla 2023-09-15 00:29:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days