Bug 1936502 - openshift-sdn-controller fail to connect to kube-apiserver on the bootstrap
Summary: openshift-sdn-controller fail to connect to kube-apiserver on the bootstrap
Keywords:
Status: CLOSED DUPLICATE of bug 1957708
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.8.0
Assignee: Antoni Segura Puimedon
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-03-08 16:12 UTC by Eran Cohen
Modified: 2023-09-15 01:03 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-20 07:25:20 UTC
Target Upstream Version:
Embargoed:
ercohen: needinfo-


Attachments (Terms of Use)
installer-gather (1.92 MB, application/x-tar)
2021-03-08 16:12 UTC, Eran Cohen
no flags Details

Description Eran Cohen 2021-03-08 16:12:13 UTC
Created attachment 1761731 [details]
installer-gather

Description of problem:
When installing using assisted-installer while bootstrapping the cluster bootkube timed-out and got restarted.
That caused the openshift-sdn-controller to crash and after it came back it failed to connect to the kube-apiserver on the bootstrap node

Version-Release number of selected component (if applicable):


How reproducible:
Customer issue, happened once.

Steps to Reproduce:
1.
2.
3.

Actual results:
controller crash:

I0304 08:31:15.604011       1 vnids.go:117] Allocated netid 6334540 for namespace "openshift-operators"
E0304 08:34:17.264576       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:27.266235       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:37.266221       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:47.265726       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:51.244940       1 event.go:329] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "pkg/runtime/scheme.go:100"'. Will not report event: 'Normal' 'LeaderElection' 'openshift2 stopped leading'
I0304 08:34:51.245028       1 leaderelection.go:278] failed to renew lease openshift-sdn/openshift-network-controller: timed out waiting for the condition
F0304 08:34:51.245044       1 network_controller.go:97] leaderelection lost
goroutine 29 [running]:
k8s.io/klog/v2.stacks(0xc0000cc001, 0xc0005665a0, 0x4c, 0xe5)
	k8s.io/klog/v2.0/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x28da000, 0xc000000003, 0x0, 0x0, 0xc000936930, 0x282da78, 0x15, 0x61, 0x0)
	k8s.io/klog/v2.0/klog.go:975 +0x19b
k8s.io/klog/v2.(*loggingT).printf(0x28da000, 0x3, 0x0, 0x0, 0x0, 0x0, 0x1b147b4, 0x13, 0x0, 0x0, ...)

and fail to connect once it starts again




Uploaded image for project: 'KNI Management'
KNI ManagementMGMT-4409
cloud.redhat.com failure: 2021-03-04_08:11:03_6e4004a3-7899-46e3-a50e-ebacf3c0dbd7
 Edit
 Comment
Assign
More
To Do
In Progress
Workflow
 Start Scrum Poker
 Export
Details
Type: Bug
Status:IN PROGRESS (View Workflow)
Priority: Blocker
Resolution:Unresolved
Affects Version/s:
OpenShift 4.7
Fix Version/s:None
Component/s:
Assisted-installer Triage
Labels:
AI_CLOUD_TRIAGE AI_CLUSTER_6e4004a3-7899-46e3-a50e-ebacf3c0dbd7 AI_DOMAIN_staff.voo.be AI_USER_luca.belforte.be no-qe
QEStatus:ToDo
Description
Do not manually edit this description, it will get automatically over-written

Cluster Info
Cluster ID: 6e4004a3-7899-46e3-a50e-ebacf3c0dbd7
Username: luca.belforte.be
Email domain: staff.voo.be
Created_at: 2021-03-02 16:02:48
Installation started at: 2021-03-04 08:11:03
Failed on: 2021-03-04 09:15:19
status: error
status_info: cluster has hosts in error
OpenShift version: 4.7

links:

logs
Metrics
Kibana
DM Elastic
DM Elastic Dashboard
Options
Attachments
Drop files to attach, or browse.
Activity
All
Comments
Slack
Work Log
History
Activity
Links Hierarchy
Ascending order - Click to sort in descending order
8 older comments
Permalink Edit Delete
ercohen
Eran Cohen added a comment - 18 minutes ago
Same issue on the other master node at the same time!!:

Mar 04 08:34:09 openshift2 hyperkube[1827]: E0304 08:34:09.924949    1827 pod_workers.go:191] Error syncing pod cf50a3bc-aaa3-4852-b547-5ca666e72553 ("network-check-target-clcc9_openshift-network-diagnostics(cf50a3bc-aaa3-4852-b547-5ca666e72553)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.785060    1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.786281    1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: W0304 08:34:11.786805    1827 reflector.go:436] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: watch of *v1.Pod ended with: very short watch: k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Unexpected watch close - watch lasted less than a second and no items received
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.787960    1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.788777    1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: E0304 08:34:11.789546    1827 controller.go:187] failed to update lease, error: Put "https://api-int.devtest.openshift.dev.voo:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/openshift2?timeout=10s": dial tcp 10.32.218.178:6443: connect: connection refused
Mar 04 08:34:11 openshift2 hyperkube[1827]: I0304 08:34:11.789587    1827 controller.go:114] failed to update lease using latest lease, fallback to ensure lease, err: failed 5 attempts to update lease
Permalink Edit Delete
ercohen
Eran Cohen added a comment - 16 minutes ago
Keepalived log on the bootstrap:

The client sent: stop
Thu Mar  4 08:34:15 2021: Stopping
Thu Mar  4 08:34:15 2021: (devtest_API) sent 0 priority
Thu Mar  4 08:34:15 2021: (devtest_API) removing VIPs.
Thu Mar  4 08:34:16 2021: Stopped - used 0.035811 user time, 0.061405 system time
Thu Mar  4 08:34:16 2021: Stopped Keepalived v2.0.10 (11/12,2018)
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload
Permalink Edit Delete
ercohen
Eran Cohen added a comment - 13 minutes ago - edited
OK seems that kubelet failed to talk to the API because bootkube got restarted (taking the keepalived and the API down)
The root issue seems to be that we fail to configure the CNI in time.

Permalink Edit Delete
ercohen
Eran Cohen added a comment - 7 minutes ago
openshift-sdn-controller log:

I0304 08:31:15.604011       1 vnids.go:117] Allocated netid 6334540 for namespace "openshift-operators"
E0304 08:34:17.264576       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:27.266235       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:37.266221       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:47.265726       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:51.244940       1 event.go:329] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "pkg/runtime/scheme.go:100"'. Will not report event: 'Normal' 'LeaderElection' 'openshift2 stopped leading'
I0304 08:34:51.245028       1 leaderelection.go:278] failed to renew lease openshift-sdn/openshift-network-controller: timed out waiting for the condition
F0304 08:34:51.245044       1 network_controller.go:97] leaderelection lost
goroutine 29 [running]:
k8s.io/klog/v2.stacks(0xc0000cc001, 0xc0005665a0, 0x4c, 0xe5)
	k8s.io/klog/v2.0/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x28da000, 0xc000000003, 0x0, 0x0, 0xc000936930, 0x282da78, 0x15, 0x61, 0x0)
	k8s.io/klog/v2.0/klog.go:975 +0x19b
k8s.io/klog/v2.(*loggingT).printf(0x28da000, 0x3, 0x0, 0x0, 0x0, 0x0, 0x1b147b4, 0x13, 0x0, 0x0, ...)
	k8s.io/klog/v2.0/klog.go:750 +0x191
k8s.io/klog/v2.Fatalf(...)
	k8s.io/klog/v2.0/klog.go:1502
github.com/openshift/sdn/pkg/openshift-network-controller.RunOpenShiftNetworkController.func2()
	github.com/openshift/sdn/pkg/openshift-network-controller/network_controller.go:97 +0x8f
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000145b00)
	k8s.io/client-go.6/tools/leaderelection/leaderelection.go:199 +0x29
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000145b00, 0x1d95c00, 0xc0000491c0)
	k8s.io/client-go.6/tools/leaderelection/leaderelection.go:209 +0x15d
k8s.io/client-go/tools/leaderelection.RunOrDie(0x1d95c40, 0xc0000c8010, 0x1da0fe0, 0xc00032d040, 0xdf8475800, 0x826299e00, 0x2540be400, 0xc00030b660, 0x1c14fb0, 0x0, ...)
	k8s.io/client-go.6/tools/leaderelection/leaderelection.go:222 +0x9c
created by github.com/openshift/sdn/pkg/openshift-network-controller.RunOpenShiftNetworkController
	github.com/openshift/sdn/pkg/openshift-network-controller/network_controller.go:88 +0x538

goroutine 1 [select (no cases), 3 minutes]:
github.com/openshift/sdn/pkg/openshift-network-controller.(*OpenShiftNetworkController).StartNetworkController(0xc000193440, 0x0, 0x1c170a8)
	github.com/openshift/sdn/pkg/openshift-network-controller/cmd.go:75 +0x7e
github.com/openshift/sdn/pkg/openshift-network-controller.NewOpenShiftNetworkControllerCommand.func1(0xc00013f680, 0xc0009194c0, 0x0, 0x1)
	github.com/openshift/sdn/pkg/openshift-network-controller/cmd.go:41 +0x7c
github.com/spf13/cobra.(*Command).execute(0xc00013f680, 0xc0000c0030, 0x1, 0x1, 0xc00013f680, 0xc0000c0030)
	github.com/spf13/cobra.1/command.go:766 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc00013f680, 0x1c, 0x1b13132, 0x12)
	github.com/spf13/cobra.1/command.go:850 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra.1/command.go:800
main.main()
	github.com/openshift/sdn/cmd/network-controller/network-controller.go:29 +0x1b5

Permalink Edit Delete
ercohen
Eran Cohen added a comment - 5 minutes ago
And after that it just failed to reconnect:

I0304 08:34:52.124004       1 leaderelection.go:243] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
E0304 08:34:52.125279       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:35:07.282756       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:35:26.573311       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:35:46.391482       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:36:08.201106       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:36:26.349852       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:36:46.409535       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:37:00.201621       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-


Expected results:


Additional info:

Comment 1 Alexander Constantinescu 2021-03-22 10:29:44 UTC
Hi

I've been having a look at the problem but I can't find a root cause. 

> The root issue seems to be that we fail to configure the CNI in time

I don't think that's true. The problem is that everything is trying to talk to the configured VIP 10.32.218.178, which seems to not be working. This is all host networking, so the SDN is not involved and everything which is reporting errors (kubelet/openshift-sdn) is running hostNetwork: true

I think the root cause might be the keepalive process on the bootstrap node, I am seeing the following logs:

$ less keepalived-118c6c1c28880f9953846f458578a1b5d74e29d6908a3637439906be773f540e.log

+ declare -r keepalived_sock=/var/run/keepalived/keepalived.sock
+ export -f msg_handler
+ export -f reload_keepalived
+ export -f stop_keepalived
+ '[' -s /var/run/keepalived/stopped ']'
+ '[' -s /etc/keepalived/keepalived.conf ']'
+ rm -f /var/run/keepalived/keepalived.sock
+ socat UNIX-LISTEN:/var/run/keepalived/keepalived.sock,fork 'system:bash -c msg_handler'
The client sent: reload
Thu Mar  4 08:13:50 2021: Starting Keepalived v2.0.10 (11/12,2018)
Thu Mar  4 08:13:50 2021: Running on Linux 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 (built for Linux 4.18.0)
Thu Mar  4 08:13:50 2021: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '--dont-fork' '--vrrp'
Thu Mar  4 08:13:50 2021:               '--log-detail' '--log-console'
Thu Mar  4 08:13:50 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar  4 08:14:00 2021: Starting VRRP child process, pid=14
Thu Mar  4 08:14:00 2021: Registering Kernel netlink reflector
Thu Mar  4 08:14:00 2021: Registering Kernel netlink command channel
Thu Mar  4 08:14:00 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar  4 08:14:00 2021: (Line 22) Truncating auth_pass to 8 characters
Thu Mar  4 08:14:00 2021: Assigned address 10.32.218.173 for interface eth0
Thu Mar  4 08:14:00 2021: Registering gratuitous ARP shared channel
Thu Mar  4 08:14:00 2021: (devtest_API) removing VIPs.
Thu Mar  4 08:14:00 2021: (devtest_API) Entering BACKUP STATE (init)
Thu Mar  4 08:14:00 2021: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(0), fd(11,12)]
Thu Mar  4 08:14:04 2021: (devtest_API) Receive advertisement timeout
Thu Mar  4 08:14:04 2021: (devtest_API) Entering MASTER STATE
Thu Mar  4 08:14:04 2021: (devtest_API) setting VIPs.
Thu Mar  4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:04 2021: (devtest_API) Sending/queueing gratuitous ARPs on eth0 for 10.32.218.178
Thu Mar  4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:09 2021: (devtest_API) Sending/queueing gratuitous ARPs on eth0 for 10.32.218.178
Thu Mar  4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:09 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
The client sent: stop
The client sent: stop
Thu Mar  4 08:14:15 2021: Stopping
Thu Mar  4 08:14:15 2021: (devtest_API) sent 0 priority
Thu Mar  4 08:14:15 2021: (devtest_API) removing VIPs.
Thu Mar  4 08:14:16 2021: Stopped - used 0.002099 user time, 0.002962 system time
Thu Mar  4 08:14:16 2021: Stopped Keepalived v2.0.10 (11/12,2018)
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload
Thu Mar  4 08:14:41 2021: Starting Keepalived v2.0.10 (11/12,2018)
Thu Mar  4 08:14:41 2021: Running on Linux 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 (built for Linux 4.18.0)
Thu Mar  4 08:14:41 2021: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '--dont-fork' '--vrrp'
Thu Mar  4 08:14:41 2021:               '--log-detail' '--log-console'
Thu Mar  4 08:14:41 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar  4 08:14:55 2021: Starting VRRP child process, pid=72
Thu Mar  4 08:14:55 2021: Registering Kernel netlink reflector
Thu Mar  4 08:14:55 2021: Registering Kernel netlink command channel
Thu Mar  4 08:14:55 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar  4 08:14:55 2021: (Line 22) Truncating auth_pass to 8 characters
Thu Mar  4 08:14:55 2021: Assigned address 10.32.218.173 for interface eth0
Thu Mar  4 08:14:55 2021: Registering gratuitous ARP shared channel
Thu Mar  4 08:14:55 2021: (devtest_API) removing VIPs.
Thu Mar  4 08:14:55 2021: (devtest_API) Entering BACKUP STATE (init)
Thu Mar  4 08:14:55 2021: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(0), fd(11,12)]
Thu Mar  4 08:14:59 2021: (devtest_API) Receive advertisement timeout
Thu Mar  4 08:14:59 2021: (devtest_API) Entering MASTER STATE
Thu Mar  4 08:14:59 2021: (devtest_API) setting VIPs.
Thu Mar  4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:59 2021: (devtest_API) Sending/queueing gratuitous ARPs on eth0 for 10.32.218.178
Thu Mar  4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:14:59 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:15:04 2021: (devtest_API) Sending/queueing gratuitous ARPs on eth0 for 10.32.218.178
Thu Mar  4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
Thu Mar  4 08:15:04 2021: Sending gratuitous ARP on eth0 for 10.32.218.178
The client sent: reload
Thu Mar  4 08:28:33 2021: Reloading ...
Thu Mar  4 08:28:33 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar  4 08:28:33 2021: Reloading
Thu Mar  4 08:28:33 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar  4 08:28:33 2021: (Line 23) Truncating auth_pass to 8 characters
Thu Mar  4 08:28:33 2021: Assigned address 10.32.218.173 for interface eth0
Thu Mar  4 08:28:33 2021: (devtest_API) setting VIPs.
Thu Mar  4 08:28:33 2021: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(1), fd(11,12)]
The client sent: reload
Thu Mar  4 08:29:48 2021: Reloading ...
Thu Mar  4 08:29:48 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar  4 08:29:48 2021: Reloading
Thu Mar  4 08:29:48 2021: Opening file '/etc/keepalived/keepalived.conf'.
Thu Mar  4 08:29:48 2021: (Line 24) Truncating auth_pass to 8 characters
Thu Mar  4 08:29:48 2021: Assigned address 10.32.218.173 for interface eth0
Thu Mar  4 08:29:48 2021: (devtest_API) setting VIPs.
Thu Mar  4 08:29:48 2021: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(1), fd(11,12)]
The client sent: stop
Thu Mar  4 08:34:15 2021: Stopping
Thu Mar  4 08:34:15 2021: (devtest_API) sent 0 priority
Thu Mar  4 08:34:15 2021: (devtest_API) removing VIPs.
Thu Mar  4 08:34:16 2021: Stopped - used 0.035811 user time, 0.061405 system time
Thu Mar  4 08:34:16 2021: Stopped Keepalived v2.0.10 (11/12,2018)
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload
Thu Mar  4 08:34:25 2021: Starting Keepalived v2.0.10 (11/12,2018)
Thu Mar  4 08:34:25 2021: Running on Linux 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 (built for Linux 4.18.0)
Thu Mar  4 08:34:25 2021: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '--dont-fork' '--vrrp'
Thu Mar  4 08:34:25 2021:               '--log-detail' '--log-console'
Thu Mar  4 08:34:25 2021: Opening file '/etc/keepalived/keepalived.conf'.
The client sent: stop
The client sent: stop
The client sent: stop
environment: line 2:   939 Terminated              /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload
Thu Mar  4 08:35:37 2021: Starting Keepalived v2.0.10 (11/12,2018)
Thu Mar  4 08:35:37 2021: Running on Linux 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 (built for Linux 4.18.0)
Thu Mar  4 08:35:37 2021: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '--dont-fork' '--vrrp'
Thu Mar  4 08:35:37 2021:               '--log-detail' '--log-console'
Thu Mar  4 08:35:37 2021: Opening file '/etc/keepalived/keepalived.conf'.
The client sent: reload
The client sent: stop
environment: line 2:  1053 Hangup                  /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console
The client sent: stop
The client sent: stop
The client sent: stop
The client sent: reload

Given that everything (openshift-sdn and kubelet) was running fine until roughly 08:34:17, see:

$ less sdn-controller-918392caa9dc02c3e9b6ec8f80979f12807efdbbab6f950747f0157d084e0540.log

I0304 08:31:15.004467       1 vnids.go:117] Allocated netid 4140068 for namespace "openshift-kube-controller-manager"
I0304 08:31:15.204021       1 vnids.go:117] Allocated netid 5758811 for namespace "openshift-ovirt-infra"
I0304 08:31:15.404404       1 vnids.go:117] Allocated netid 16102858 for namespace "openshift-insights"
I0304 08:31:15.604011       1 vnids.go:117] Allocated netid 6334540 for namespace "openshift-operators"
E0304 08:34:17.264576       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-netw
ork-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:27.266235       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-netw
ork-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:37.266221       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-netw
ork-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:47.265726       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: Get "https://api-int.devtest.openshift.dev.voo:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-netw
ork-controller": dial tcp 10.32.218.178:6443: connect: connection refused
E0304 08:34:51.244940       1 event.go:329] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVers
ion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[stri
ng]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' du
e to: 'no kind is registered for the type v1.ConfigMap in scheme "pkg/runtime/scheme.go:100"'. Will not report event: 'Normal' 'LeaderElection' 'openshift2 stopped leading'
I0304 08:34:51.245028       1 leaderelection.go:278] failed to renew lease openshift-sdn/openshift-network-controller: timed out waiting for the condition
F0304 08:34:51.245044       1 network_controller.go:97] leaderelection lost
goroutine 29 [running]:

I suspect there is a problem with those restarts seen in the keepalive process, assigning to runtimecfg for further analysis

Comment 2 Victor Voronkov 2021-03-25 09:05:50 UTC
I successfully deployed OCP 4.7.3 with OpenshiftSDN using IPI baremetal installer flow, so point me if you think I can help you with this bug

Comment 3 Ben Nemec 2021-05-14 21:12:57 UTC
The connection refused errors strongly suggest this is one of the double vip bugs (that are now fixed). Is this still happening in recent builds?

Comment 4 Yossi Boaron 2021-05-18 13:36:43 UTC
Yep, it seems like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1957708 .

Comment 5 Eran Cohen 2021-05-20 07:25:20 UTC

*** This bug has been marked as a duplicate of bug 1957708 ***

Comment 6 Red Hat Bugzilla 2023-09-15 01:03:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days


Note You need to log in before you can comment on or make changes to this bug.