Created attachment 1713431 [details]
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-operator-release-4.6-e2e-ovirt/1301067010062422016

Description of problem:

When running the OCP installer on oVirt CI, most of the PRs fail in the bootstrap phase. When I sshed into the nodes I saw that on some masters (sometimes one of them, sometimes all of them; it is flaky), the directory /etc/kubernetes/manifests/ is missing some or all of the following yamls:
- kube-apiserver-pod.yaml
- kube-controller-manager-pod.yaml
- kube-scheduler-pod.yaml
- etcd-pod.yaml

The jobs have the pod logs on them; here are examples:
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301063351337488384
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-release-4.6-e2e-ovirt/1301067009575882752
3. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-operator-release-4.6-e2e-ovirt/1301067010062422016

Attached are the bootstrap gather logs from the VMs.
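For anyone triaging a similar run, a quick way to confirm which static pod manifests actually landed on each master is something like the sketch below (the hostnames and the core user are placeholders, not taken from these jobs):

# Check which static pod manifests exist on each master
# (substitute the real master addresses for the example hostnames)
for host in master-0 master-1 master-2; do
  echo "== $host =="
  ssh core@"$host" 'ls -l /etc/kubernetes/manifests/ 2>/dev/null || echo "no manifests directory"'
done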
Created attachment 1713432 [details] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-release-4.6-e2e-ovirt/1301067009575882752
Created attachment 1713433 [details] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301063351337488384 Bootstrap gather of job https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301063351337488384
I have debugged the third attachment. This is the last kubelet message about starting the kube-apiserver-operator pod (the operator pod, because I think one existed before at least one kube-apiserver instance was started on one of the masters):

Sep 02 07:58:25 ovirt16-p65wd-master-0 hyperkube[1514]: I0902 07:58:25.377279 1514 event.go:291] "Event occurred" object="openshift-kube-apiserver-operator/kube-apiserver-operator-d778db9d5-h8jh2" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox" message="Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-operator-d778db9d5-h8jh2_openshift-kube-apiserver-operator_64f349ef-ed46-4937-bd5d-34d90d38ebcf_0(eae40f5beee1d0b9fc89b8a77a0ef31dc31ed9ab58d455559a3df1ca5af3daab): [openshift-kube-apiserver-operator/kube-apiserver-operator-d778db9d5-h8jh2:openshift-sdn]: error adding container to network \"openshift-sdn\": failed to send CNI request: Post \"http://dummy/\": dial unix /var/run/openshift-sdn/cniserver/socket: connect: connection refused"

During this time the SDN pod on the same node looks normal:

I0902 07:57:46.283726   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-storage-version-migrator-operator/metrics:https to [10.130.0.11:8443]
I0902 07:57:46.289541   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-apiserver-operator/metrics:https to [10.130.0.15:8443]
I0902 07:57:46.300435   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-service-ca-operator/metrics:https to [10.130.0.4:8443]
I0902 07:57:46.322426   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics:https to [10.130.0.7:8443]
I0902 07:57:46.336344   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-scheduler-operator/metrics:https to [10.130.0.10:8443]
I0902 07:57:46.347094   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-authentication-operator/metrics:https to [10.130.0.13:8443]
I0902 07:57:46.355551   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-controller-manager-operator/metrics:https to [10.130.0.12:8443]
I0902 07:57:46.373038   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-controller-manager-operator/metrics:https to [10.130.0.14:8443]
I0902 07:57:46.393659   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:57:46.393924   13189 proxier.go:349] userspace syncProxyRules took 36.282019ms
I0902 07:57:46.500531   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:57:46.500759   13189 proxier.go:349] userspace syncProxyRules took 34.21629ms
I0902 07:58:20.357715   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-kube-storage-version-migrator-operator/metrics:https
I0902 07:58:20.361694   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-apiserver-operator/metrics:https
I0902 07:58:20.396329   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics:https
I0902 07:58:20.466033   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:20.466286   13189 proxier.go:349] userspace syncProxyRules took 34.029282ms
I0902 07:58:20.574713   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:20.574954   13189 proxier.go:349] userspace syncProxyRules took 34.081956ms
I0902 07:58:21.297235   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-apiserver-operator/metrics:https to [10.130.0.15:8443]
I0902 07:58:21.305968   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-controller-manager-operator/metrics:https
I0902 07:58:21.312992   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-service-ca-operator/metrics:https
I0902 07:58:21.323129   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-kube-scheduler-operator/metrics:https
I0902 07:58:21.336831   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-storage-version-migrator-operator/metrics:https to [10.130.0.11:8443]
I0902 07:58:21.346417   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics:https to [10.130.0.7:8443]
I0902 07:58:21.352938   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-authentication-operator/metrics:https
I0902 07:58:21.404645   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-kube-controller-manager-operator/metrics:https
I0902 07:58:21.406431   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:21.406683   13189 proxier.go:349] userspace syncProxyRules took 34.342713ms
I0902 07:58:21.512147   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:21.512371   13189 proxier.go:349] userspace syncProxyRules took 33.799784ms
I0902 07:58:22.319558   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-controller-manager-operator/metrics:https to [10.130.0.14:8443]
I0902 07:58:22.329539   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-controller-manager-operator/metrics:https to [10.130.0.12:8443]
I0902 07:58:22.342064   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-service-ca-operator/metrics:https to [10.130.0.4:8443]
I0902 07:58:22.359268   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-scheduler-operator/metrics:https to [10.130.0.10:8443]
I0902 07:58:22.371214   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-authentication-operator/metrics:https to [10.130.0.13:8443]
I0902 07:58:22.425066   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:22.425294   13189 proxier.go:349] userspace syncProxyRules took 33.617906ms
I0902 07:58:22.531322   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:22.531528   13189 proxier.go:349] userspace syncProxyRules took 33.74158ms

A few minutes earlier I saw, in another SDN pod instance:

I0902 07:56:47.438088    5363 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)
F0902 07:56:47.659315    5363 cmd.go:111] Failed to start sdn: node SDN setup failed: Link not found

But this got resolved somehow.
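To narrow down whether the SDN CNI server the kubelet is trying to reach is actually listening, the socket from the error message can be poked directly on the node; a rough sketch (the socket path comes from the error above, the curl invocation is just an example):

# Is the openshift-sdn CNI server socket present and accepting connections?
ls -l /var/run/openshift-sdn/cniserver/socket
# curl can speak HTTP over a unix socket; "dummy" is a placeholder host, as in the error message
curl -s -o /dev/null -w '%{http_code}\n' --unix-socket /var/run/openshift-sdn/cniserver/socket http://dummy/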
The operator pod in question stays Unready in the ContainerCreating state and never makes progress (the message above is the last entry in the kubelet logs about that pod):

"conditions": [
    {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-09-02T07:57:03Z",
        "status": "True",
        "type": "Initialized"
    },
    {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-09-02T07:57:03Z",
        "message": "containers with unready status: [kube-apiserver-operator]",
        "reason": "ContainersNotReady",
        "status": "False",
        "type": "Ready"
    },
    {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-09-02T07:57:03Z",
        "message": "containers with unready status: [kube-apiserver-operator]",
        "reason": "ContainersNotReady",
        "status": "False",
        "type": "ContainersReady"
    },
    {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-09-02T07:57:03Z",
        "status": "True",
        "type": "PodScheduled"
    }
],
"containerStatuses": [
    {
        "image": "registry.svc.ci.openshift.org/ci-op-ytwm9308/stable@sha256:f719fcaf44d4ad8c530c0e6d2e9101db199623cf7ab35bbafa2a62e42f9b90a9",
        "imageID": "",
        "lastState": {},
        "name": "kube-apiserver-operator",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": {
            "waiting": {
                "reason": "ContainerCreating"
            }
        }
    }
],
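The status block above can be pulled straight from the API with something along these lines (a sketch; the pod name is the one from the event above, and jq is assumed to be available):

# Show the conditions and container statuses of the stuck operator pod
oc -n openshift-kube-apiserver-operator get pod kube-apiserver-operator-d778db9d5-h8jh2 \
  -o json | jq '.status | {conditions, containerStatuses}'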
The second linked test run looks different:

F0902 08:38:56.213044       1 cmd.go:129] unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 172.30.0.1:443: connect: no route to host

One of the openvswitch pods looks broken, https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-release-4.6-e2e-ovirt/1301067009575882752/artifacts/e2e-ovirt/pods/openshift-sdn_ovs-vbd2n_openvswitch.log:

2020-09-02T08:25:05.839Z|00013|jsonrpc|WARN|unix#76: receive error: Connection reset by peer
2020-09-02T08:25:05.839Z|00014|reconnect|WARN|unix#76: connection dropped (Connection reset by peer)
2020-09-02T08:27:05.956Z|00015|jsonrpc|WARN|unix#108: receive error: Connection reset by peer
2020-09-02T08:27:05.956Z|00016|reconnect|WARN|unix#108: connection dropped (Connection reset by peer)
2020-09-02T08:27:35.984Z|00017|jsonrpc|WARN|unix#116: receive error: Connection reset by peer
2020-09-02T08:27:35.984Z|00018|reconnect|WARN|unix#116: connection dropped (Connection reset by peer)
2020-09-02T08:33:06.297Z|00019|jsonrpc|WARN|unix#204: receive error: Connection reset by peer
2020-09-02T08:33:06.297Z|00020|reconnect|WARN|unix#204: connection dropped (Connection reset by peer)
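The repeated "Connection reset by peer" messages are clients dropping off the ovsdb unix socket. To check whether the OVS daemons themselves are up on that node, one could run something like this inside the ovs pod or on the host (a sketch, assuming the default /var/run/openvswitch control sockets):

# Are ovsdb-server and ovs-vswitchd answering on their control sockets?
ovs-appctl -t ovsdb-server ovsdb-server/list-dbs
ovs-appctl -t ovs-vswitchd version
# High-level view of the bridge configuration
ovs-vsctl show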
And finally, the first test run also looks network related, https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301063351337488384/artifacts/e2e-ovirt/pods/openshift-sdn_sdn-wsjmf_sdn.log:

I0902 08:22:56.322605   57831 node.go:338] Starting openshift-sdn network plugin
I0902 08:22:56.429902   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:22:56.934274   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:22:57.563815   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:22:58.349635   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:22:59.330725   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:00.556999   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:02.087161   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:03.999185   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:06.388090   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:09.372697   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:09.372730   57831 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I0902 08:23:39.378852   57831 ovs.go:180] Error executing ovs-vsctl: 2020-09-02T08:23:39Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0902 08:24:09.913966   57831 ovs.go:180] Error executing ovs-vsctl: 2020-09-02T08:24:09Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0902 08:24:10.424441   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:10.929233   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:11.558516   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:12.345301   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:13.327428   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:14.552918   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:16.083183   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:17.995740   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:20.384328   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:23.368930   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
F0902 08:24:23.368992   57831 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
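The ovs-ofctl failures above mean /var/run/openvswitch/br0.mgmt does not exist, i.e. ovs-vswitchd never brought up the br0 bridge. A quick sanity check on the affected node could look like this (a sketch; the -O OpenFlow13 flag is an assumption about how br0 is configured, not taken from these logs):

# Does br0 exist and is its OpenFlow management socket there?
ovs-vsctl list-br
ls -l /var/run/openvswitch/
# If br0 exists, try talking OpenFlow to it
ovs-ofctl -O OpenFlow13 show br0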
I don't know how to deploy on ovirt. Is it possible for you to reproduce in your lab and give me connection details to the machines?
(In reply to Ricardo Carrillo Cruz from comment #6)
> I don't know how to deploy on ovirt.
> Is it possible for you to reproduce in your lab and give me connection
> details to the machines?

It is very flaky and we mainly see it on CI (I think I saw it on my env before, but it is hard to hit).
To debug it I can suggest the following:
1. Look at the logs of the jobs; all the journals (masters and workers) and all pod logs are collected.
2. Look at the bootstrap gather output.
Once we have some direction we can:
3. The easiest way to reproduce it is to initiate a dummy PR and use the CI env (you can log into your CI job namespace and interact with the cluster). If you want, I can create a PR that will not destroy the cluster and ping you once it is reproduced, and then we can log into the machines and debug it.
4. If you want an env to deploy a cluster, I can give you access to my env, but it is hard to hit and you will probably waste time on reproducing.

Also, I noticed you changed the target release to 4.7. This is very urgent for us and really paints our CI red; I think we should try to resolve it in 4.6.
Per your comments, this is hard to reproduce, is very flaky, and is only hit on CI.
Therefore, it's not realistic that I can fix this in 4.6, which is why I moved it to 4.7.

It is better to have an env to repro, as I can trigger jobs without the urgency of CI cleaning up the env after all of them finish.

Thanks
(In reply to Ricardo Carrillo Cruz from comment #8)
> Per your comments, this is hard to reproduce, is very flaky, and is only
> hit on CI.
> Therefore, it's not realistic that I can fix this in 4.6, which is why I
> moved it to 4.7.

We see it in almost 50% of the cases on CI. I have no problem reproducing it, preventing the cluster from being destroyed, and giving you access to the env. Can we schedule a meeting Monday for a first debug?

> It is better to have an env to repro, as I can trigger jobs without the
> urgency of CI cleaning up the env after all of them finish.

I'll try to reproduce it on my env.

> Thanks
Hi Dominik,

Sandro told me that you can assist us here. We have a lot of information from the logs, plus this happens in almost any PR that we send, so it is easy to reproduce.
At the openvswitch level, it looks like openvswitch fails to start. From log-bundle-20200902083855/control-plane/192.168.210.104/containers/openvswitch-ba45bf46266963f3091ca3408fea1c7339cabe99664af4573401fae6861d1905.log:

Failed to connect to bus: No data available
openvswitch is running in container
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db.
ovsdb-server: /var/run/openvswitch/ovsdb-server.pid: pidfile check failed (No such process), aborting
Starting ovsdb-server ... failed!

The failing shell code in the openvswitch container might be:

# launch OVS
# Start the ovsdb so that we can prep it before we start the ovs-vswitchd
/usr/share/openvswitch/scripts/ovs-ctl start --ovs-user=openvswitch:openvswitch --no-ovs-vswitchd --system-id=random --no-monitor

I guess the openvswitch maintainers can help interpret the error.
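If this reproduces, a hedged sketch for checking the same things by hand from inside the openvswitch container (ovs-ctl status and the paths are the standard ones; the exact container layout is an assumption):

# Did the database file get created, and is ovsdb-server actually running?
ls -l /etc/openvswitch/conf.db /var/run/openvswitch/
/usr/share/openvswitch/scripts/ovs-ctl status
# The pidfile the start script complained about
cat /var/run/openvswitch/ovsdb-server.pid 2>/dev/null || echo "no ovsdb-server pidfile"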
Very likely a dup of BZ 1874696 and should potentially get fixed by https://github.com/openshift/cluster-network-operator/pull/785
*** This bug has been marked as a duplicate of bug 1874696 ***