Bug 1874820 - oVirt: Installer fails on bootstrap phase
Summary: oVirt: Installer fails on bootstrap phase
Keywords:
Status: CLOSED DUPLICATE of bug 1874696
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Juan Luis de Sousa-Valadas
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-02 10:48 UTC by Gal Zaidman
Modified: 2020-09-15 13:41 UTC (History)
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 13:41:39 UTC
Target Upstream Version:
Embargoed:


Attachments
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-operator-release-4.6-e2e-ovirt/1301067010062422016 (3.87 MB, application/gzip)
2020-09-02 10:48 UTC, Gal Zaidman
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-release-4.6-e2e-ovirt/1301067009575882752 (10.96 MB, application/gzip)
2020-09-02 10:51 UTC, Gal Zaidman
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301063351337488384 (13.18 MB, application/gzip)
2020-09-02 10:56 UTC, Gal Zaidman

Description Gal Zaidman 2020-09-02 10:48:00 UTC
Created attachment 1713431 [details]
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-operator-release-4.6-e2e-ovirt/1301067010062422016

Description of problem:

When running the OCP installer on oVirt CI, most of the PRs fail during the bootstrap phase.
When I SSHed into the nodes I saw that on some masters (sometimes one of them, sometimes all of them; it is flaky) the directory /etc/kubernetes/manifests/ is missing some or all of the following YAMLs (a quick check is sketched after the list):
- kube-apiserver-pod.yaml
- kube-controller-manager-pod.yaml
- kube-scheduler-pod.yaml
- etcd-pod.yaml
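
A minimal way to check this on a master, assuming SSH access as in the description:

  # On each master, as root (or via sudo):
  ls -l /etc/kubernetes/manifests/
  # Expected once the control plane is rendered: etcd-pod.yaml,
  # kube-apiserver-pod.yaml, kube-controller-manager-pod.yaml,
  # kube-scheduler-pod.yaml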

The jobs have pod logs on them; here are examples:
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301063351337488384

2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-release-4.6-e2e-ovirt/1301067009575882752

3. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-operator-release-4.6-e2e-ovirt/1301067010062422016


Attached are the bootstrap gather logs from the VMs.

Comment 3 Stefan Schimanski 2020-09-02 13:54:06 UTC
I have debugged the third attachment.

This is the last kubelet message about starting a kube-apiserver-operator pod (I say a pod because I think there was an earlier one, before at least one kube-apiserver instance was started on one of the masters):

Sep 02 07:58:25 ovirt16-p65wd-master-0 hyperkube[1514]: I0902 07:58:25.377279    1514 event.go:291] "Event occurred" object="openshift-kube-apiserver-operator/kube-apiserver-operator-d778db9d5-h8jh2" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox" message="Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-operator-d778db9d5-h8jh2_openshift-kube-apiserver-operator_64f349ef-ed46-4937-bd5d-34d90d38ebcf_0(eae40f5beee1d0b9fc89b8a77a0ef31dc31ed9ab58d455559a3df1ca5af3daab): [openshift-kube-apiserver-operator/kube-apiserver-operator-d778db9d5-h8jh2:openshift-sdn]: error adding container to network \"openshift-sdn\": failed to send CNI request: Post \"http://dummy/\": dial unix /var/run/openshift-sdn/cniserver/socket: connect: connection refused"
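
A minimal sketch of how this could be checked from the affected master, reusing the paths from the error message above (the pod name there is just the instance that happened to fail):

  # Does the openshift-sdn CNI server socket the plugin is dialing exist?
  ls -l /var/run/openshift-sdn/cniserver/socket

  # If it exists, does anything answer on it? (The CNI plugin POSTs to
  # http://dummy/ over this unix socket, as in the error above.)
  curl -sv --unix-socket /var/run/openshift-sdn/cniserver/socket http://dummy/

  # Is the sdn pod that should own the socket actually running on this node?
  crictl ps --name sdn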

During this time the SDN pods on the same node look normal.

I0902 07:57:46.283726   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-storage-version-migrator-operator/metrics:https to [10.130.0.11:8443]
I0902 07:57:46.289541   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-apiserver-operator/metrics:https to [10.130.0.15:8443]
I0902 07:57:46.300435   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-service-ca-operator/metrics:https to [10.130.0.4:8443]
I0902 07:57:46.322426   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics:https to [10.130.0.7:8443]
I0902 07:57:46.336344   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-scheduler-operator/metrics:https to [10.130.0.10:8443]
I0902 07:57:46.347094   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-authentication-operator/metrics:https to [10.130.0.13:8443]
I0902 07:57:46.355551   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-controller-manager-operator/metrics:https to [10.130.0.12:8443]
I0902 07:57:46.373038   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-controller-manager-operator/metrics:https to [10.130.0.14:8443]
I0902 07:57:46.393659   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:57:46.393924   13189 proxier.go:349] userspace syncProxyRules took 36.282019ms
I0902 07:57:46.500531   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:57:46.500759   13189 proxier.go:349] userspace syncProxyRules took 34.21629ms
I0902 07:58:20.357715   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-kube-storage-version-migrator-operator/metrics:https
I0902 07:58:20.361694   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-apiserver-operator/metrics:https
I0902 07:58:20.396329   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics:https
I0902 07:58:20.466033   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:20.466286   13189 proxier.go:349] userspace syncProxyRules took 34.029282ms
I0902 07:58:20.574713   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:20.574954   13189 proxier.go:349] userspace syncProxyRules took 34.081956ms
I0902 07:58:21.297235   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-apiserver-operator/metrics:https to [10.130.0.15:8443]
I0902 07:58:21.305968   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-controller-manager-operator/metrics:https
I0902 07:58:21.312992   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-service-ca-operator/metrics:https
I0902 07:58:21.323129   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-kube-scheduler-operator/metrics:https
I0902 07:58:21.336831   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-storage-version-migrator-operator/metrics:https to [10.130.0.11:8443]
I0902 07:58:21.346417   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics:https to [10.130.0.7:8443]
I0902 07:58:21.352938   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-authentication-operator/metrics:https
I0902 07:58:21.404645   13189 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-kube-controller-manager-operator/metrics:https
I0902 07:58:21.406431   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:21.406683   13189 proxier.go:349] userspace syncProxyRules took 34.342713ms
I0902 07:58:21.512147   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:21.512371   13189 proxier.go:349] userspace syncProxyRules took 33.799784ms
I0902 07:58:22.319558   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-controller-manager-operator/metrics:https to [10.130.0.14:8443]
I0902 07:58:22.329539   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-controller-manager-operator/metrics:https to [10.130.0.12:8443]
I0902 07:58:22.342064   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-service-ca-operator/metrics:https to [10.130.0.4:8443]
I0902 07:58:22.359268   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-kube-scheduler-operator/metrics:https to [10.130.0.10:8443]
I0902 07:58:22.371214   13189 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-authentication-operator/metrics:https to [10.130.0.13:8443]
I0902 07:58:22.425066   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:22.425294   13189 proxier.go:349] userspace syncProxyRules took 33.617906ms
I0902 07:58:22.531322   13189 proxier.go:370] userspace proxy: processing 0 service events
I0902 07:58:22.531528   13189 proxier.go:349] userspace syncProxyRules took 33.74158ms

A few minutes earlier I saw this in another SDN pod instance:

  I0902 07:56:47.438088    5363 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)
  F0902 07:56:47.659315    5363 cmd.go:111] Failed to start sdn: node SDN setup failed: Link not found

But this got resolved somehow.
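
A hedged sketch of what could be checked on the node if that "Link not found" shows up again; the interface names below are the usual openshift-sdn ones and are an assumption here:

  # On the affected master:
  ovs-vsctl list-br            # should list br0 once OVS is up
  ip link show br0             # bridge the SDN plugs pods into
  ip link show tun0            # OVS internal port used for host/service traffic
  ip link show vxlan_sys_4789  # VXLAN netdev created by openvswitch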

The operator pod in question stays in an Unready, ContainerCreating state and never makes progress (the message above is the last one in the kubelet logs about that pod):

 "conditions": [
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2020-09-02T07:57:03Z",
                        "status": "True",
                        "type": "Initialized"
                    },
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2020-09-02T07:57:03Z",
                        "message": "containers with unready status: [kube-apiserver-operator]",
                        "reason": "ContainersNotReady",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2020-09-02T07:57:03Z",
                        "message": "containers with unready status: [kube-apiserver-operator]",
                        "reason": "ContainersNotReady",
                        "status": "False",
                        "type": "ContainersReady"
                    },
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2020-09-02T07:57:03Z",
                        "status": "True",
                        "type": "PodScheduled"
                    }
                ],
                "containerStatuses": [
                    {
                        "image": "registry.svc.ci.openshift.org/ci-op-ytwm9308/stable@sha256:f719fcaf44d4ad8c530c0e6d2e9101db199623cf7ab35bbafa2a62e42f9b90a9",
                        "imageID": "",
                        "lastState": {},
                        "name": "kube-apiserver-operator",
                        "ready": false,
                        "restartCount": 0,
                        "started": false,
                        "state": {
                            "waiting": {
                                "reason": "ContainerCreating"
                            }
                        }
                    }
                ],
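
The same stuck state can also be pulled straight from the API; a sketch, reusing the pod name from the event above (it will differ per run):

  oc -n openshift-kube-apiserver-operator get pods -o wide
  oc -n openshift-kube-apiserver-operator get pod kube-apiserver-operator-d778db9d5-h8jh2 \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.state}{"\n"}{end}'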

Comment 4 Stefan Schimanski 2020-09-07 11:59:26 UTC
The second linked test run looks different:

  F0902 08:38:56.213044       1 cmd.go:129] unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 172.30.0.1:443: connect: no route to host
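
A hedged way to narrow down that "no route to host" from the affected node; 172.30.0.1 is the kubernetes service VIP, which is virtual and only reachable once the SDN/kube-proxy rules for it are in place, so the checks below target the node rather than the API server itself:

  ip route get 172.30.0.1                    # route the kernel would pick for the VIP
  curl -k --connect-timeout 5 https://172.30.0.1:443/healthz   # expect a TLS/HTTP answer, not "no route to host"
  iptables-save | grep -c 172.30.0.1         # rough check that proxy rules for the VIP exist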

One of the openvswitch pods looks broken; see https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/11473/rehearse-11473-pull-ci-openshift-ovirt-csi-driver-release-4.6-e2e-ovirt/1301067009575882752/artifacts/e2e-ovirt/pods/openshift-sdn_ovs-vbd2n_openvswitch.log:

  2020-09-02T08:25:05.839Z|00013|jsonrpc|WARN|unix#76: receive error: Connection reset by peer
  2020-09-02T08:25:05.839Z|00014|reconnect|WARN|unix#76: connection dropped (Connection reset by peer)
  2020-09-02T08:27:05.956Z|00015|jsonrpc|WARN|unix#108: receive error: Connection reset by peer
  2020-09-02T08:27:05.956Z|00016|reconnect|WARN|unix#108: connection dropped (Connection reset by peer)
  2020-09-02T08:27:35.984Z|00017|jsonrpc|WARN|unix#116: receive error: Connection reset by peer
  2020-09-02T08:27:35.984Z|00018|reconnect|WARN|unix#116: connection dropped (Connection reset by peer)
  2020-09-02T08:33:06.297Z|00019|jsonrpc|WARN|unix#204: receive error: Connection reset by peer
  2020-09-02T08:33:06.297Z|00020|reconnect|WARN|unix#204: connection dropped (Connection reset by peer)

Comment 5 Stefan Schimanski 2020-09-07 12:02:01 UTC
And finally, the first test run also looks network-related; see https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301063351337488384/artifacts/e2e-ovirt/pods/openshift-sdn_sdn-wsjmf_sdn.log:

I0902 08:22:56.322605   57831 node.go:338] Starting openshift-sdn network plugin
I0902 08:22:56.429902   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:22:56.934274   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:22:57.563815   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:22:58.349635   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:22:59.330725   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:00.556999   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:02.087161   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:03.999185   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:06.388090   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:09.372697   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:23:09.372730   57831 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I0902 08:23:39.378852   57831 ovs.go:180] Error executing ovs-vsctl: 2020-09-02T08:23:39Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0902 08:24:09.913966   57831 ovs.go:180] Error executing ovs-vsctl: 2020-09-02T08:24:09Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0902 08:24:10.424441   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:10.929233   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:11.558516   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:12.345301   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:13.327428   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:14.552918   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:16.083183   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:17.995740   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:20.384328   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0902 08:24:23.368930   57831 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
F0902 08:24:23.368992   57831 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
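
A hedged sketch of how the OVS daemons behind those "Connection refused" errors could be checked from the cluster; the pod name is taken from the second run's logs above and will differ for this run:

  oc -n openshift-sdn get pods -o wide | grep ovs
  oc -n openshift-sdn exec ovs-vbd2n -- ovs-vsctl --timeout=5 show           # talks to ovsdb-server
  oc -n openshift-sdn exec ovs-vbd2n -- ovs-appctl -t ovs-vswitchd version   # talks to ovs-vswitchd
  oc -n openshift-sdn exec ovs-vbd2n -- ls -l /var/run/openvswitch/          # br0.mgmt and db.sock should be here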

Comment 6 Ricardo Carrillo Cruz 2020-09-08 11:15:11 UTC
I don't know how to deploy on ovirt.
Is it possible for you to reproduce in your lab and give me connection details to the machines?

Comment 7 Gal Zaidman 2020-09-08 11:31:33 UTC
(In reply to Ricardo Carrillo Cruz from comment #6)
> I don't know how to deploy on ovirt.
> Is it possible for you to reproduce in your lab and give me connection
> details to the machines?

It is very flaky and we mainly see it in CI (I think I saw it on my env before, but it is hard to hit).

To debug it I can suggest the following:
1. Look at the logs of the jobs; all the journals (masters and workers) and all pod logs are collected.
2. Look at the bootstrap gather output.
Once we have some direction we can:
3. The easiest way to reproduce it is to initiate a dummy PR and use the CI env (you can log into your CI job namespace and interact with the cluster). If you want, I can create a PR that will not destroy the cluster and ping you once it is reproduced, and then we can log into the machines and debug it.
4. If you want an env to deploy a cluster in, I can give you access to my env, but it is hard to hit and you will probably waste time on reproducing.

Also, I noticed you changed the target release to 4.7. This is very urgent for us and really paints our CI red; I think we should try to resolve it in 4.6.

Comment 8 Ricardo Carrillo Cruz 2020-09-09 10:26:41 UTC
Per your comments, this is hard to reproduce, very flaky, and only hit in CI.
Therefore, it's not realistic that I can fix this in 4.6, which is why I moved it to 4.7.

It is better to have an env to repro in than to trigger CI jobs, as then there is no urgency of CI cleaning up the env after all the jobs.

Thanks

Comment 9 Gal Zaidman 2020-09-10 13:53:53 UTC
(In reply to Ricardo Carrillo Cruz from comment #8)
> Per your comments, this is hard to reproduce, very flaky, and only hit in CI.
> Therefore, it's not realistic that I can fix this in 4.6, which is why I
> moved it to 4.7.
> 

We see it in almost 50% of the cases in CI. I have no problem reproducing it, preventing the cluster from being destroyed, and giving you access to the env. Can we schedule a meeting on Monday for a first debug?

> It is better to have an env to repro in than to trigger CI jobs, as then there
> is no urgency of CI cleaning up the env after all the jobs.
> 

I'll try to reproduce it on my env

> Thanks

Comment 10 Gal Zaidman 2020-09-14 12:26:46 UTC
Hi Dominik, Sandro told me that you can assist us here. We have a lot of information from the logs, plus this happens on almost any PR that we send, so it is easy to reproduce.

Comment 11 Dominik Holler 2020-09-14 13:22:16 UTC
At the openvswitch level, it looks like openvswitch fails to start; see
log-bundle-20200902083855/control-plane/192.168.210.104/containers/openvswitch-ba45bf46266963f3091ca3408fea1c7339cabe99664af4573401fae6861d1905.log:

Failed to connect to bus: No data available
openvswitch is running in container
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db.
ovsdb-server: /var/run/openvswitch/ovsdb-server.pid: pidfile check failed (No such process), aborting
Starting ovsdb-server ... failed!

The failing shell code in the openvswitch container might be:

  # launch OVS
  # Start the ovsdb so that we can prep it before we start the ovs-vswitchd
  /usr/share/openvswitch/scripts/ovs-ctl start --ovs-user=openvswitch:openvswitch --no-ovs-vswitchd --system-id=random --no-monitor
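
As an approximation of what the image's script does (not the exact code), that ovs-ctl call boils down to creating the database if it is missing and starting ovsdb-server; retrying the step by hand inside the container could look like:

  # Approximation of `ovs-ctl start --no-ovs-vswitchd ...` (not the exact script):
  ovsdb-tool create /etc/openvswitch/conf.db /usr/share/openvswitch/vswitch.ovsschema
  ovsdb-server /etc/openvswitch/conf.db \
      --remote=punix:/var/run/openvswitch/db.sock \
      --pidfile=/var/run/openvswitch/ovsdb-server.pid \
      --detach --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log

The "pidfile check failed (No such process), aborting" line suggests ovsdb-server found a pidfile whose process is gone, so checking for and clearing a stale /var/run/openvswitch/ovsdb-server.pid before retrying is one thing worth testing.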

I guess the maintainer of openvswitch can help interpret the error.

Comment 12 Aniket Bhat 2020-09-14 19:51:04 UTC
This is very likely a dup of BZ 1874696 and should potentially be fixed by https://github.com/openshift/cluster-network-operator/pull/785.

Comment 13 Juan Luis de Sousa-Valadas 2020-09-15 13:41:39 UTC

*** This bug has been marked as a duplicate of bug 1874696 ***

