Bug 1915295

Summary: [BM][IP][Dualstack] Installation failed - operators report dial tcp 172.30.0.1:443: i/o timeout
Product: OpenShift Container Platform Reporter: Yurii Prokulevych <yprokule>
Component: Networking Assignee: Antonio Ojea <aojeagar>
Networking sub component: ovn-kubernetes QA Contact: Yurii Prokulevych <yprokule>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: high CC: achernet, anbhat, aojeagar, bbennett, kquinn, mcornea, sasha, wsun, zzhao
Version: 4.7 Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: On startup, the OVN-Kubernetes master performs an initial synchronization to keep the OVN and Kubernetes databases in sync; however, OVN-Kubernetes was using the wrong source of truth. This issue only affected clusters deployed in dual-stack mode. Consequence: A race condition during OVN-Kubernetes startup made some Kubernetes services unreachable, because the bootstrap logic considered them orphans and deleted them. Fix: Use Kubernetes as the source of truth. Result: OVN-Kubernetes starts correctly and keeps both OVN and Kubernetes in sync on startup.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:52:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yurii Prokulevych 2021-01-12 12:07:42 UTC
Description of problem:
-----------------------
Dual-stack installation of 4.7 failed:
...
time="2021-01-12T10:38:41Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20210112103755.tar.gz"
time="2021-01-12T10:38:41Z" level=info msg="Bootstrap gather logs captured here \"/home/kni/clusterconfigs/log-bundle-20210112103755.tar.gz\""
time="2021-01-12T10:38:41Z" level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition"

------------------------------------------
oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 False       False         True       106m
baremetal                                  4.7.0-0.nightly-2021-01-10-070949   True        True          False      106m
cloud-credential                           4.7.0-0.nightly-2021-01-10-070949   True        False         False      123m
cluster-autoscaler                         4.7.0-0.nightly-2021-01-10-070949   True        False         False      81m
config-operator                            4.7.0-0.nightly-2021-01-10-070949   True        False         False      106m
console
csi-snapshot-controller                    4.7.0-0.nightly-2021-01-10-070949   True        False         False      105m
dns                                        4.7.0-0.nightly-2021-01-10-070949   True        False         False      80m
etcd                                       4.7.0-0.nightly-2021-01-10-070949   True        True          False      102m
image-registry                             4.7.0-0.nightly-2021-01-10-070949   True        False         False      76m
ingress                                                                        False       True          True       75m
insights                                   4.7.0-0.nightly-2021-01-10-070949   True        False         False      98m
kube-apiserver                             4.7.0-0.nightly-2021-01-10-070949   True        True          False      80m
kube-controller-manager                    4.7.0-0.nightly-2021-01-10-070949   True        True          False      102m
kube-scheduler                             4.7.0-0.nightly-2021-01-10-070949   True        True          False      79m
kube-storage-version-migrator              4.7.0-0.nightly-2021-01-10-070949   False       False         False      106m
machine-api                                4.7.0-0.nightly-2021-01-10-070949   True        False         False      78m
machine-approver                           4.7.0-0.nightly-2021-01-10-070949   True        False         False      105m
machine-config                             4.7.0-0.nightly-2021-01-10-070949   True        False         False      79m
marketplace                                4.7.0-0.nightly-2021-01-10-070949   True        False         False      80m
monitoring                                                                     False       True          True       98m
network                                    4.7.0-0.nightly-2021-01-10-070949   True        True          False      105m
node-tuning                                4.7.0-0.nightly-2021-01-10-070949   True        False         False      81m
openshift-apiserver                        4.7.0-0.nightly-2021-01-10-070949   True        False         False      75m
openshift-controller-manager               4.7.0-0.nightly-2021-01-10-070949   True        False         False      102m
openshift-samples
operator-lifecycle-manager                 4.7.0-0.nightly-2021-01-10-070949   True        False         False      81m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-01-10-070949   True        False         False      81m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-01-10-070949   True        False         False      75m
service-ca                                 4.7.0-0.nightly-2021-01-10-070949   True        False         False      106m
storage                                    4.7.0-0.nightly-2021-01-10-070949   True        False         False      106m

----------------------------------------------
oc get po -A | grep Crash
openshift-authentication-operator                  authentication-operator-f6987fb-6vgml                                      0/1     CrashLoopBackOff        19         125m
openshift-console-operator                         console-operator-6b77dd4477-lht5h                                          0/1     CrashLoopBackOff        25         75m
openshift-machine-api                              cluster-autoscaler-operator-c48656855-zrjjq                                1/2     CrashLoopBackOff        26         125m
openshift-machine-api                              machine-api-controllers-d46575786-bwfnx                                    5/7     CrashLoopBackOff        71         82m
openshift-machine-api                              metal3-bd4968444-nz5xh                                                     9/10    CrashLoopBackOff        17         106m
openshift-machine-api                              metal3-image-cache-bxq7k                                                   0/1     Init:CrashLoopBackOff   16         106m
openshift-machine-api                              metal3-image-cache-v6frq                                                   0/1     Init:CrashLoopBackOff   16         106m

-----------------------------------------------------------------------
oc logs console-operator-6b77dd4477-lht5h -n openshift-console-operator
I0112 11:56:55.637630       1 cmd.go:200] Using service-serving-cert provided certificates
I0112 11:56:55.638602       1 observer_polling.go:159] Starting file observer
W0112 11:57:20.640921       1 builder.go:207] unable to get owner reference (falling back to namespace): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-console-operator/pods/console-operator-6b77dd4477-lht5h": dial tcp 172.30.0.1:443: i/o timeout
I0112 11:57:20.641295       1 builder.go:238] console-operator version -
I0112 11:57:20.642440       1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
I0112 11:57:27.198344       1 cmd.go:88] Received SIGTERM or SIGINT signal, shutting down controller.
I0112 11:57:30.639053       1 observer_polling.go:162] Shutting down file observer
F0112 11:57:46.203622       1 cmd.go:129] unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 172.30.0.1:443: i/o timeout
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000132001, 0xc0008741c0, 0xf7, 0x1af)
        /go/src/github.com/openshift/console-operator/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x36224a0, 0xc000000003, 0x0, 0x0, 0xc00067c150, 0x353db58, 0x6, 0x81, 0x414300)
        /go/src/github.com/openshift/console-operator/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
k8s.io/klog/v2.(*loggingT).printDepth(0x36224a0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000662090, 0x1, 0x1)
        /go/src/github.com/openshift/console-operator/vendor/k8s.io/klog/v2/klog.go:732 +0x16f
k8s.io/klog/v2.(*loggingT).print(...)
        /go/src/github.com/openshift/console-operator/vendor/k8s.io/klog/v2/klog.go:714
k8s.io/klog/v2.Fatal(...)
        /go/src/github.com/openshift/console-operator/vendor/k8s.io/klog/v2/klog.go:1482
github.com/openshift/library-go/pkg/controller/controllercmd.(*ControllerCommandConfig).NewCommandWithContext.func1(0xc000268dc0, 0xc00090cd60, 0x0, 0x2)
        /go/src/github.com/openshift/console-operator/vendor/github.com/openshift/library-go/pkg/controller/controllercmd/cmd.go:129 +0x785
github.com/spf13/cobra.(*Command).execute(0xc000268dc0, 0xc00090cd40, 0x2, 0x2, 0xc000268dc0, 0xc00090cd40)
        /go/src/github.com/openshift/console-operator/vendor/github.com/spf13/cobra/command.go:846 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc000268000, 0xc000134120, 0x3621ee0, 0x0)
        /go/src/github.com/openshift/console-operator/vendor/github.com/spf13/cobra/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/console-operator/vendor/github.com/spf13/cobra/command.go:887
main.main()
        /go/src/github.com/openshift/console-operator/cmd/console/main.go:38 +0x155

----------------------------------------------------------------------------------
oc logs authentication-operator-f6987fb-6vgml -n openshift-authentication-operator
Copying system trust bundle
I0112 11:58:03.869496       1 observer_polling.go:159] Starting file observer
I0112 11:58:03.869774       1 cmd.go:200] Using service-serving-cert provided certificates
I0112 11:58:03.870600       1 observer_polling.go:159] Starting file observer
W0112 11:58:28.872689       1 builder.go:207] unable to get owner reference (falling back to namespace): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-authentication-operator/pods/authentication-operator-f6987fb-6vgml": dial tcp 172.30.0.1:443: i/o timeout
I0112 11:58:28.872862       1 builder.go:238] cluster-authentication-operator version v4.7.0-202101090212.p0-0-gda548f8-
I0112 11:58:28.874389       1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
F0112 11:59:20.828040       1 cmd.go:129] unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 172.30.0.1:443: i/o timeout
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc00031cc80, 0xf7, 0x260)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x365ffc0, 0xc000000003, 0x0, 0x0, 0xc00083c230, 0x349da6c, 0x6, 0x81, 0x0)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/k8s.io/klog/v2/klog.go:945 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x365ffc0, 0x3, 0x0, 0x0, 0x1, 0xc000e69c30, 0x1, 0x1)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/k8s.io/klog/v2/klog.go:718 +0x165
k8s.io/klog/v2.(*loggingT).print(...)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/k8s.io/klog/v2/klog.go:703
k8s.io/klog/v2.Fatal(...)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/k8s.io/klog/v2/klog.go:1443
github.com/openshift/library-go/pkg/controller/controllercmd.(*ControllerCommandConfig).NewCommandWithContext.func1(0xc000622000, 0xc0005c4ae0, 0x0, 0x3)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/github.com/openshift/library-go/pkg/controller/controllercmd/cmd.go:129 +0x6fe
github.com/spf13/cobra.(*Command).execute(0xc000622000, 0xc0005c4ab0, 0x3, 0x3, 0xc000622000, 0xc0005c4ab0)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/github.com/spf13/cobra/command.go:846 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc00049db80, 0xc0000b8180, 0x365faa0, 0x0)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/github.com/spf13/cobra/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/cluster-authentication-operator/vendor/github.com/spf13/cobra/command.go:887
main.main()
        /go/src/github.com/openshift/cluster-authentication-operator/cmd/authentication-operator/main.go:29 +0x155


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
4.7.0-0.nightly-2021-01-10-070949



Steps to Reproduce:
-------------------
1. Try to deploy a dual-stack 4.7 cluster with bare metal IPI (an example install-config networking stanza is sketched below)
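
For illustration only, here is a minimal sketch of the dual-stack networking stanza in an install-config.yaml for bare metal IPI. The machineNetwork and clusterNetwork CIDRs are example values, not taken from this report; the serviceNetwork uses the default 172.30.0.0/16, which is where the 172.30.0.1:443 in-cluster API VIP seen in the errors above comes from.

networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.111.0/24            # IPv4 machine network (example value)
  - cidr: fd2e:6f44:5dd8:c956::/120   # IPv6 machine network (example value)
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd01::/48
    hostPrefix: 64
  serviceNetwork:
  - 172.30.0.0/16                     # 172.30.0.1:443 is the kubernetes Service VIP
  - fd02::/112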


Actual results:
---------------
Deployment failed


Expected results:
-----------------
Deployment succeeded

Additional info:
----------------
Virtual deployment, 3 masters + 2 workers, deployment with

must-gather failed as well:
---------------------------
[must-gather-mwm9m] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD WARNING: Collecting one or more service logs on ALL linux master nodes in your cluster. This could take a large amount of time.
[must-gather-mwm9m] POD INFO: Collecting host service logs for kubelet
[must-gather-mwm9m] POD INFO: Collecting host service logs for crio
[must-gather-mwm9m] POD INFO: Collecting host service logs for machine-config-daemon-firstboot
[must-gather-mwm9m] POD INFO: Collecting host service logs for machine-config-daemon-host
[must-gather-mwm9m] POD INFO: Waiting for worker host service log collection to complete ...
[must-gather-mwm9m] POD error: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD error: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: i/o timeout
[must-gather-mwm9m] POD error: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: i/o timeout

Comment 2 Antonio Ojea 2021-01-12 18:08:44 UTC
It is a race condition in the new service controller: it runs a function to clean up stale data, but it wrongly deletes valid services.
We can see it in the logs:

2021-01-12T10:44:42.718Z|00363|nbctl|INFO|Running command run --if-exists -- remove load_balancer 0d44f11a-6c61-457b-8065-251f39c9ce22 vips "\"172.30.0.10:9154\""
I0112 10:44:42.721772       1 repair.go:106] Deleting non-existing Kubernetes vip 172.30.0.1:443 from OVN load balancer [0d44f11a-6c61-457b-8065-251f39c9ce22 58d9cf35-1960-4220-b224-51a0158cc12d c0d0698d-1e05-43ec-85c2-6458d25d46a9 7186b91e-cfa0-4509-b68a-b4f7a6fdba67]
2021-01-12T10:44:42.731Z|00368|nbctl|INFO|Running command run --if-exists -- remove load_balancer 0d44f11a-6c61-457b-8065-251f39c9ce22 vips "\"172.30.0.1:443\""
I0112 10:44:42.917421       1 repair.go:106] Deleting non-existing Kubernetes vip 172.30.0.10:53 from OVN load balancer [342e0ee1-7d07-4990-bb0a-60bc2cc5a3ad adb10c2a-8bd5-4ba9-9400-f5df73286251 cfde3642-134d-4e82-9e73-2e20471d4fb3 ac843e5a-89ac-475b-ab61-997c13ba8087]

Fix in
https://github.com/ovn-org/ovn-kubernetes/pull/1945
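
For illustration only, here is a minimal sketch, in Go, of how a startup repair pass can treat the Kubernetes API as the source of truth before pruning OVN load-balancer VIPs. This is not the code from the PR above; listOVNLoadBalancerVIPs and removeVIP are hypothetical stand-ins for the real OVN northbound DB operations.

// Minimal sketch, not the actual ovn-kubernetes code: prune OVN load-balancer
// VIPs using the Kubernetes API as the source of truth.
package repairsketch

import (
	"context"
	"fmt"
	"net"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical helpers standing in for the real OVN northbound DB operations.
func listOVNLoadBalancerVIPs() ([]string, error) { return nil, nil }
func removeVIP(vip string) error                 { return nil }

// repairStaleVIPs deletes only VIPs that have no matching Kubernetes Service,
// instead of trusting possibly-unsynced local state.
func repairStaleVIPs(ctx context.Context, client kubernetes.Interface) error {
	// 1. Build the set of valid "ip:port" pairs from the Kubernetes API.
	valid := map[string]bool{}
	svcs, err := client.CoreV1().Services(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing services: %w", err) // bail out, never guess
	}
	for _, svc := range svcs.Items {
		// In a dual-stack cluster a Service can carry both an IPv4 and an
		// IPv6 ClusterIP; every family must be treated as valid.
		ips := svc.Spec.ClusterIPs
		if len(ips) == 0 {
			ips = []string{svc.Spec.ClusterIP}
		}
		for _, ip := range ips {
			if ip == "" || ip == corev1.ClusterIPNone {
				continue // headless services have no VIP
			}
			for _, port := range svc.Spec.Ports {
				valid[net.JoinHostPort(ip, strconv.Itoa(int(port.Port)))] = true
			}
		}
	}

	// 2. Remove only the OVN VIPs that Kubernetes really does not know about.
	ovnVIPs, err := listOVNLoadBalancerVIPs()
	if err != nil {
		return err
	}
	for _, vip := range ovnVIPs {
		if !valid[vip] {
			if err := removeVIP(vip); err != nil {
				return err
			}
		}
	}
	return nil
}

The point of the fix, as described above, is the ordering: first read the full Service list from Kubernetes (counting every ClusterIP of a dual-stack Service as valid), and only then remove the OVN VIPs that are genuinely absent from that set.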

Comment 4 Wei Sun 2021-01-19 06:29:16 UTC
Adding the TestBlocker keyword since it is blocking testing of IPI on bare metal with dual stack.

Comment 6 zhaozhanqi 2021-02-02 08:48:39 UTC
This issue should be fixed. Moving this issue to Verified.

Comment 9 errata-xmlrpc 2021-02-24 15:52:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633