Created attachment 1761731 [details]
installer-gather

Description of problem:

While bootstrapping the cluster during an assisted-installer installation, bootkube timed out and was restarted. That caused the openshift-sdn-controller to crash, and after it came back up it failed to connect to the kube-apiserver on the bootstrap node.

Version-Release number of selected component (if applicable):

How reproducible:
Customer issue, happened once.

Steps to Reproduce:
1.
2.
3.

Actual results:

Controller crash:

I0304 08:31:15.604011 1 vnids.go:117] Allocated netid 6334540 for namespace "openshift-operators"
E0304 08:34:17.264576 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:27.266235 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:37.266221 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:47.265726 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:51.244940 1 event.go:329] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "pkg/runtime/scheme.go:100"'. Will not report event: 'Normal' 'LeaderElection' 'openshift2 stopped leading'
I0304 08:34:51.245028 1 leaderelection.go:278] failed to renew lease openshift-sdn/openshift-network-controller: timed out waiting for the condition
F0304 08:34:51.245044 1 network_controller.go:97] leaderelection lost
goroutine 29 [running]:
k8s.io/klog/v2.stacks(0xc0000cc001, 0xc0005665a0, 0x4c, 0xe5)
    k8s.io/klog/v2.0/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x28da000, 0xc000000003, 0x0, 0x0, 0xc000936930, 0x282da78, 0x15, 0x61, 0x0)
    k8s.io/klog/v2.0/klog.go:975 +0x19b
k8s.io/klog/v2.(*loggingT).printf(0x28da000, 0x3, 0x0, 0x0, 0x0, 0x0, 0x1b147b4, 0x13, 0x0, 0x0, ...)
and it failed to connect once it started again.

Linked Jira ticket (KNI Management, MGMT-4409): cloud.redhat.com failure: 2021-03-04_08:11:03_6e4004a3-7899-46e3-a50e-ebacf3c0dbd7

Type: Bug
Status: IN PROGRESS
Priority: Blocker
Resolution: Unresolved
Affects Version/s: OpenShift 4.7
Fix Version/s: None
Component/s: Assisted-installer Triage
Labels: AI_CLOUD_TRIAGE AI_CLUSTER_6e4004a3-7899-46e3-a50e-ebacf3c0dbd7 AI_DOMAIN_staff.voo.be AI_USER_luca.belforte.be no-qe
QE Status: ToDo

Cluster Info
Cluster ID: 6e4004a3-7899-46e3-a50e-ebacf3c0dbd7
Username: luca.belforte.be
Email domain: staff.voo.be
Created_at: 2021-03-02 16:02:48
Installation started at: 2021-03-04 08:11:03
Failed on: 2021-03-04 09:15:19
status: error
status_info: cluster has hosts in error
OpenShift version: 4.7
links: logs, Metrics, Kibana, DM Elastic, DM Elastic Dashboard

Eran Cohen added a comment:
Same issue on the other master node at the same time:

Mar 04 08:34:09 openshift2 hyperkube[1827]: E0304 08:34:09.924949 1827 pod_workers.go:191] Error syncing pod cf50a3bc-aaa3-4852-b547-5ca666e72553 ("network-check-target-clcc9_openshift-network-diagnostics(cf50a3bc-aaa3-4852-b547-5ca666e72553)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.785060 1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.786281 1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: W0304 08:34:11.786805 1827 reflector.go:436] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: watch of *v1.Pod ended with: very short watch: k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Unexpected watch close - watch lasted less than a second and no items received
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.787960 1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.788777 1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.789546 1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: I0304 08:34:11.789587 1827 controller.go:114] failed to update lease using latest lease, fallback to ensure lease, err: failed 5 attempts to update lease

Eran Cohen added a comment:
Keepalived log on the bootstrap:

The client sent: stop
Thu Mar 4 08:34:15 2021: Stopping
Thu Mar 4 08:34:15 2021: (devtest_API) sent 0 priority
Thu Mar 4 08:34:15 2021: (devtest_API) removing VIPs.
Thu Mar 4 08:34:16 2021: Stopped - used 0.035811 user time, 0.061405 system time
Thu Mar 4 08:34:16 2021: Stopped Keepalived v2.0.10 (11/12,2018)
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload

Eran Cohen added a comment (edited):
OK, it seems that the kubelet failed to talk to the API because bootkube got restarted (taking keepalived and the API down). The root issue seems to be that we fail to configure the CNI in time.

Eran Cohen added a comment:
openshift-sdn-controller log:

I0304 08:31:15.604011 1 vnids.go:117] Allocated netid 6334540 for namespace "openshift-operators"
E0304 08:34:17.264576 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:27.266235 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:37.266221 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:47.265726 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:51.244940 1 event.go:329] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "pkg/runtime/scheme.go:100"'. Will not report event: 'Normal' 'LeaderElection' 'openshift2 stopped leading'
I0304 08:34:51.245028 1 leaderelection.go:278] failed to renew lease openshift-sdn/openshift-network-controller: timed out waiting for the condition
F0304 08:34:51.245044 1 network_controller.go:97] leaderelection lost
goroutine 29 [running]:
k8s.io/klog/v2.stacks(0xc0000cc001, 0xc0005665a0, 0x4c, 0xe5)
    k8s.io/klog/v2.0/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x28da000, 0xc000000003, 0x0, 0x0, 0xc000936930, 0x282da78, 0x15, 0x61, 0x0)
    k8s.io/klog/v2.0/klog.go:975 +0x19b
k8s.io/klog/v2.(*loggingT).printf(0x28da000, 0x3, 0x0, 0x0, 0x0, 0x0, 0x1b147b4, 0x13, 0x0, 0x0, ...)
    k8s.io/klog/v2.0/klog.go:750 +0x191
k8s.io/klog/v2.Fatalf(...)
    k8s.io/klog/v2.0/klog.go:1502
github.com/openshift/sdn/pkg/openshift-network-controller.RunOpenShiftNetworkController.func2()
    github.com/openshift/sdn/pkg/openshift-network-controller/network_controller.go:97 +0x8f
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000145b00)
    k8s.io/client-go.6/tools/leaderelection/leaderelection.go:199 +0x29
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000145b00, 0x1d95c00, 0xc0000491c0)
    k8s.io/client-go.6/tools/leaderelection/leaderelection.go:209 +0x15d
k8s.io/client-go/tools/leaderelection.RunOrDie(0x1d95c40, 0xc0000c8010, 0x1da0fe0, 0xc00032d040, 0xdf8475800, 0x826299e00, 0x2540be400, 0xc00030b660, 0x1c14fb0, 0x0, ...)
    k8s.io/client-go.6/tools/leaderelection/leaderelection.go:222 +0x9c
created by github.com/openshift/sdn/pkg/openshift-network-controller.RunOpenShiftNetworkController
    github.com/openshift/sdn/pkg/openshift-network-controller/network_controller.go:88 +0x538

goroutine 1 [select (no cases), 3 minutes]:
github.com/openshift/sdn/pkg/openshift-network-controller.(*OpenShiftNetworkController).StartNetworkController(0xc000193440, 0x0, 0x1c170a8)
    github.com/openshift/sdn/pkg/openshift-network-controller/cmd.go:75 +0x7e
github.com/openshift/sdn/pkg/openshift-network-controller.NewOpenShiftNetworkControllerCommand.func1(0xc00013f680, 0xc0009194c0, 0x0, 0x1)
    github.com/openshift/sdn/pkg/openshift-network-controller/cmd.go:41 +0x7c
github.com/spf13/cobra.(*Command).execute(0xc00013f680, 0xc0000c0030, 0x1, 0x1, 0xc00013f680, 0xc0000c0030)
    github.com/spf13/cobra.1/command.go:766 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc00013f680, 0x1c, 0x1b13132, 0x12)
    github.com/spf13/cobra.1/command.go:850 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
    github.com/spf13/cobra.1/command.go:800
main.main()
    github.com/openshift/sdn/cmd/network-controller/network-controller.go:29 +0x1b5

Eran Cohen added a comment:
And after that it just failed to reconnect:

I0304 08:34:52.124004 1 leaderelection.go:243] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
E0304 08:34:52.125279 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:35:07.282756 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:35:26.573311 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:35:46.391482 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:36:08.201106 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:36:26.349852 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:36:46.409535 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:37:00.201621 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-

Expected results:

Additional info:
Hi, I've been having a look at the problem but I can't find a root cause.

> The root issue seems to be that we fail to configure the CNI in time

I don't think that's true. The problem is that everything is trying to talk to the configured VIP 10.32.218.178, which does not seem to be working. This is all host networking, so the SDN is not involved, and everything that is reporting errors (kubelet/openshift-sdn) is running with hostNetwork: true.

I think the root cause might be the keepalived process on the bootstrap node; I am seeing the following logs:

$ less keepalived-118c6c1c28880f9953846f458578a1b5d74e29d6908a3637439906be773f540e.log
+ declare -r keepalived_sock=/var/run/keepalived/keepalived.sock
+ export -f msg_handler
+ export -f reload_keepalived
+ export -f stop_keepalived
+ '[' -s /var/run/keepalived/stopped ']'
+ '[' -s /etc/keepalived/keepalived.conf ']'
+ rm -f /var/run/keepalived/keepalived.sock
+ socat UNIX-LISTEN:/var/run/keepalived/keepalived.sock,fork 'system:bash -c msg_handler'
The client sent: reload
Thu Mar 4 08:13:50 2021: Starting Keepalived v2.0.10 (11/12,2018)
Thu Mar 4 08:13:50 2021: Running on Linux 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 (built for Linux 4.18.0)
Thu Mar 4 08:13:50 2021: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '--dont-fork' '--vrrp'
Thu Mar 4 08:13:50 2021: '--log-detail' '--log-console'
Thu Mar 4 08:13:50 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar 4 08:14:00 2021: Starting VRRP child process, pid=14
Thu Mar 4 08:14:00 2021: Registering Kernel netlink reflector
Thu Mar 4 08:14:00 2021: Registering Kernel netlink command channel
Thu Mar 4 08:14:00 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar 4 08:14:00 2021: (Line 22) Truncating auth_pass to 8 characters
Thu Mar 4 08:14:00 2021: Assigned address 10.32.218.173 for interface eth0
Thu Mar 4 08:14:00 2021: Registering gratuitous ARP shared channel
Thu Mar 4 08:14:00 2021: (devtest_API) removing VIPs.
Thu Mar 4 08:14:00 2021: (devtest_API) Entering BACKUP STATE (init)
Thu Mar 4 08:14:00 2021: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(0), fd(11,12)]
Thu Mar 4 08:14:04 2021: (devtest_API) Receive advertisement timeout
Thu Mar 4 08:14:04 2021: (devtest_API) Entering MASTER STATE
Thu Mar 4 08:14:04 2021: (devtest_API) setting VIPs.
Thu Mar 4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:04 2021: (devtest_API) Sending/queueing gratuitous ARPs on eth0 for 10.32.218.178
Thu Mar 4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:09 2021: (devtest_API) Sending/queueing gratuitous ARPs on eth0 for 10.32.218.178
Thu Mar 4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
The client sent: stop
The client sent: stop
Thu Mar 4 08:14:15 2021: Stopping
Thu Mar 4 08:14:15 2021: (devtest_API) sent 0 priority
Thu Mar 4 08:14:15 2021: (devtest_API) removing VIPs.
Thu Mar 4 08:14:16 2021: Stopped - used 0.002099 user time, 0.002962 system time
Thu Mar 4 08:14:16 2021: Stopped Keepalived v2.0.10 (11/12,2018)
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload
Thu Mar 4 08:14:41 2021: Starting Keepalived v2.0.10 (11/12,2018)
Thu Mar 4 08:14:41 2021: Running on Linux 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 (built for Linux 4.18.0)
Thu Mar 4 08:14:41 2021: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '--dont-fork' '--vrrp'
Thu Mar 4 08:14:41 2021: '--log-detail' '--log-console'
Thu Mar 4 08:14:41 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar 4 08:14:55 2021: Starting VRRP child process, pid=72
Thu Mar 4 08:14:55 2021: Registering Kernel netlink reflector
Thu Mar 4 08:14:55 2021: Registering Kernel netlink command channel
Thu Mar 4 08:14:55 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar 4 08:14:55 2021: (Line 22) Truncating auth_pass to 8 characters
Thu Mar 4 08:14:55 2021: Assigned address 10.32.218.173 for interface eth0
Thu Mar 4 08:14:55 2021: Registering gratuitous ARP shared channel
Thu Mar 4 08:14:55 2021: (devtest_API) removing VIPs.
Thu Mar 4 08:14:55 2021: (devtest_API) Entering BACKUP STATE (init)
Thu Mar 4 08:14:55 2021: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(0), fd(11,12)]
Thu Mar 4 08:14:59 2021: (devtest_API) Receive advertisement timeout
Thu Mar 4 08:14:59 2021: (devtest_API) Entering MASTER STATE
Thu Mar 4 08:14:59 2021: (devtest_API) setting VIPs.
Thu Mar 4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:59 2021: (devtest_API) Sending/queueing gratuitous ARPs on eth0 for 10.32.218.178
Thu Mar 4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:15:04 2021: (devtest_API) Sending/queueing gratuitous ARPs on eth0 for 10.32.218.178
Thu Mar 4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar 4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
The client sent: reload
Thu Mar 4 08:28:33 2021: Reloading ...
Thu Mar 4 08:28:33 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar 4 08:28:33 2021: Reloading
Thu Mar 4 08:28:33 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar 4 08:28:33 2021: (Line 23) Truncating auth_pass to 8 characters
Thu Mar 4 08:28:33 2021: Assigned address 10.32.218.173 for interface eth0
Thu Mar 4 08:28:33 2021: (devtest_API) setting VIPs.
Thu Mar 4 08:28:33 2021: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(1), fd(11,12)]
The client sent: reload
Thu Mar 4 08:29:48 2021: Reloading ...
Thu Mar 4 08:29:48 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar 4 08:29:48 2021: Reloading
Thu Mar 4 08:29:48 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar 4 08:29:48 2021: (Line 24) Truncating auth_pass to 8 characters
Thu Mar 4 08:29:48 2021: Assigned address 10.32.218.173 for interface eth0
Thu Mar 4 08:29:48 2021: (devtest_API) setting VIPs.
Thu Mar 4 08:29:48 2021: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(1), fd(11,12)]
The client sent: stop
Thu Mar 4 08:34:15 2021: Stopping
Thu Mar 4 08:34:15 2021: (devtest_API) sent 0 priority
Thu Mar 4 08:34:15 2021: (devtest_API) removing VIPs.
Thu Mar 4 08:34:16 2021: Stopped - used 0.035811 user time, 0.061405 system time
Thu Mar 4 08:34:16 2021: Stopped Keepalived v2.0.10 (11/12,2018)
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload
Thu Mar 4 08:34:25 2021: Starting Keepalived v2.0.10 (11/12,2018)
Thu Mar 4 08:34:25 2021: Running on Linux 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 (built for Linux 4.18.0)
Thu Mar 4 08:34:25 2021: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '--dont-fork' '--vrrp'
Thu Mar 4 08:34:25 2021: '--log-detail' '--log-console'
Thu Mar 4 08:34:25 2021: Opening file '/etc/keepalived/keepalived.conf'.
The client sent: stop
The client sent: stop
The client sent: stop
environment: line 2: 939 Terminated /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload
Thu Mar 4 08:35:37 2021: Starting Keepalived v2.0.10 (11/12,2018)
Thu Mar 4 08:35:37 2021: Running on Linux 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 (built for Linux 4.18.0)
Thu Mar 4 08:35:37 2021: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '--dont-fork' '--vrrp'
Thu Mar 4 08:35:37 2021: '--log-detail' '--log-console'
Thu Mar 4 08:35:37 2021: Opening file '/etc/keepalived/keepalived.conf'.
The client sent: reload
The client sent: stop
environment: line 2: 1053 Hangup /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload

Given that everything (openshift-sdn and kubelet) was running fine until roughly 08:34:17, see:

$ less sdn-controller-918392caa9dc02c3e9b6ec8f80979f12807efdbbab6f950747f0157d084e0540.log
I0304 08:31:15.004467 1 vnids.go:117] Allocated netid 4140068 for namespace "openshift-kube-controller-manager"
I0304 08:31:15.204021 1 vnids.go:117] Allocated netid 5758811 for namespace "openshift-ovirt-infra"
I0304 08:31:15.404404 1 vnids.go:117] Allocated netid 16102858 for namespace "openshift-insights"
I0304 08:31:15.604011 1 vnids.go:117] Allocated netid 6334540 for namespace "openshift-operators"
E0304 08:34:17.264576 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:27.266235 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:37.266221 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:47.265726 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:51.244940 1 event.go:329] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "pkg/runtime/scheme.go:100"'. Will not report event: 'Normal' 'LeaderElection' 'openshift2 stopped leading'
I0304 08:34:51.245028 1 leaderelection.go:278] failed to renew lease openshift-sdn/openshift-network-controller: timed out waiting for the condition
F0304 08:34:51.245044 1 network_controller.go:97] leaderelection lost
goroutine 29 [running]:

I suspect there is a problem with those restarts seen in the keepalived process; assigning to runtimecfg for further analysis.
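Given how noisy the monitor log is, it can help to quantify the churn instead of eyeballing it. A minimal sketch (the helper name is made up for illustration; pass it the extracted keepalived container log from the installer-gather tarball) that counts how often the monitor asked keepalived to stop or reload, and how often a fresh keepalived process started:

```shell
# Hypothetical helper: count control messages and process starts in a
# keepalived monitor log like the one pasted above.
count_keepalived_signals() {
    local log="$1"
    # grep -c counts matching lines; every "The client sent: ..." line is one
    # command received on /var/run/keepalived/keepalived.sock.
    echo "stops:   $(grep -c 'The client sent: stop' "$log")"
    echo "reloads: $(grep -c 'The client sent: reload' "$log")"
    echo "starts:  $(grep -c 'Starting Keepalived' "$log")"
}
```

Dozens of stops and several fresh starts inside a few minutes, as in the log above, would point at restart churn around the VIP rather than at the CNI.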
I successfully deployed OCP 4.7.3 with OpenShiftSDN using the IPI baremetal installer flow, so let me know if you think I can help with this bug.
The "connection refused" errors strongly suggest this is one of the double-VIP bugs (which are now fixed). Is this still happening in recent builds?
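For reference, a double-VIP condition means more than one node answers for 10.32.218.178 at once. A rough way to check, given `ip -4 -o addr show` output collected from each node (the helper name and the "<node> <addr line>" input format are made up for illustration, not part of any OpenShift tooling):

```shell
# Hypothetical check for the double-VIP condition: given a file where each
# line is "<node> <one line of ip -4 -o addr show output>", print the nodes
# that currently hold the VIP. More than one holder means a double VIP.
vip_holders() {
    local vip="$1" file="$2"
    # Note: the dots in the VIP are treated as regex wildcards by awk, which
    # is acceptable for a quick check like this.
    awk -v vip="$vip" '$0 ~ vip { print $1 }' "$file" | sort -u
}
```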
Yep, it seems like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1957708
*** This bug has been marked as a duplicate of bug 1957708 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days