Bug 1854249 - API fails to come up on all on-premise clusters in latest CI builds
Summary: API fails to come up on all on-premise clusters in latest CI builds
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Antoni Segura Puimedon
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-06 23:36 UTC by Stephen Benjamin
Modified: 2020-10-27 16:12 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In OCP 4.6, keepalived was updated to v2.0.10. Keepalived v2.0.10 contains a bug (fixed in later keepalived releases) that causes track scripts to stop running once they time out. Consequence: No node owned the API VIP, and as a result the deployment broke. Fix: Wrap the keepalived track script with the timeout command. Result: Deployment succeeds.
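For illustration, a minimal sketch of what the workaround looks like in a keepalived.conf vrrp_script block (the check command, timeout value, and interval below are assumptions for illustration, not the exact machine-config-operator change):

vrrp_script chk_ocp {
    # Wrapping the check in `timeout` guarantees the script itself always
    # exits within the interval, so keepalived v2.0.10 never classifies it
    # as hung and never stops scheduling it.
    script "timeout 4.9 /usr/bin/curl -o /dev/null -kLsf https://localhost:6443/readyz"
    interval 5
    weight 50
}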
Clone Of:
Environment:
Last Closed: 2020-10-27 16:12:24 UTC
Target Upstream Version:
Embargoed:
asegurap: needinfo-


Attachments: none


Links
GitHub openshift/machine-config-operator pull 1909 (closed): "Bug 1854249: [On-Prem] Workaround issues with keepalived v2.0.10" (last updated 2021-01-18 02:40:56 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:12:46 UTC)

Description Stephen Benjamin 2020-07-06 23:36:16 UTC
Description of problem:

All on-premise platforms seem to be failing with similar errors.

See https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_ironic-inspector-image/36/pull-ci-openshift-ironic-inspector-image-master-e2e-metal-ipi/1280216231084822528, https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_ironic-inspector-image/36/pull-ci-openshift-ironic-inspector-image-master-e2e-metal-ipi/1280216231084822528/build-log.txt

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.6 CI build (note: nightlies are so far OK)

Actual results:


E0706 21:04:36.483676   71188 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ClusterVersion: Get https://api.ostest.test.metalkube.org:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0: dial tcp [fd2e:6f44:5dd8:c956::5]:6443: connect: connection refused

Expected results:

Additional info:

The discussion on Slack from Martin seemed to indicate the VIP was moving to the cluster too early, but I haven't dug into this myself yet.

Comment 1 Stephen Benjamin 2020-07-06 23:38:38 UTC
There still doesn't appear to be any Networking component for baremetal-runtimecfg :-\

Comment 2 Martin André 2020-07-07 06:45:20 UTC
What we've observed so far:
- it's affecting all on-prem platforms
- it's affecting pre-submit jobs (and likely origin release jobs as well), but not OCP release jobs
- the failure was introduced in an api.ci release payload between 2020-07-02 22:18:36 +0000 UTC and 2020-07-02 23:49:22 +0000 UTC, according to oVirt CI
- the API VIP moves to the master nodes, but there is no kube-apiserver listening. Somehow, the chk_ocp keepalived check succeeded, meaning it got a response from https://localhost:6443/readyz (a manual probe of that endpoint is sketched below).
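For reference, a quick way to reproduce by hand what the check sees, assuming the stock curl-based readiness probe (run on the node in question):

$ curl -ks -o /dev/null -w '%{http_code}\n' https://localhost:6443/readyz
200
# Anything other than 200 (or a hang/timeout) should make chk_ocp fail.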

From keepalived logs on a master node:
Mon Jul  6 12:22:38 2020: VRRP_Script(chk_ocp) failed (exited with status 7)
Mon Jul  6 12:23:15 2020: Script `chk_ocp` now returning 0
Mon Jul  6 12:23:15 2020: VRRP_Script(chk_ocp) succeeded
Mon Jul  6 12:23:15 2020: (mandre_API) Changing effective priority from 40 to 90
Mon Jul  6 12:23:16 2020: (mandre_API) received lower priority (50) advert from 10.0.128.16 - discarding
Mon Jul  6 12:23:17 2020: (mandre_API) received lower priority (50) advert from 10.0.128.16 - discarding
Mon Jul  6 12:23:18 2020: (mandre_API) received lower priority (50) advert from 10.0.128.16 - discarding
Mon Jul  6 12:23:18 2020: (mandre_API) Receive advertisement timeout
Mon Jul  6 12:23:18 2020: (mandre_API) Entering MASTER STATE
Mon Jul  6 12:23:18 2020: (mandre_API) setting VIPs.

However, nothing is listening on port 6443:
$ sudo crictl ps -a 
CONTAINER           IMAGE                                                                                                                         CREATED             STATE               NAME                ATTEMPT             POD ID
d61e8e1b48004       325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                              18 hours ago        Running             haproxy-monitor     0                   cfef4da781c28
331b62a2b45fc       registry.svc.ci.openshift.org/ci-op-g0k7x20c/stable@sha256:81bae9207a6e0061fae165873e94213b0e7b6329d8055ef814c053bd97de35a4   18 hours ago        Running             haproxy             0                   cfef4da781c28
3559b35650405       registry.svc.ci.openshift.org/ci-op-g0k7x20c/stable@sha256:ccae9f3d7d869575c829b91e4dbd09cffdcf79d9510c6d94bc1dbf694e3e14c0   18 hours ago        Running             mdns-publisher      0                   babdf5c21d29f
340fabbdd9373       registry.svc.ci.openshift.org/ci-op-g0k7x20c/stable@sha256:a4c71e5c5e0f9ec0818a7367922cab6d0f08944473468e26521a5cd5ec5e60d8   18 hours ago        Running             keepalived          0                   7b4a3a5e1e20a
22d21b46d5f3c       registry.svc.ci.openshift.org/ci-op-g0k7x20c/stable@sha256:4d630ee97ef6a8b43138f050fee51899b07c8bf72655d1e0337ff50ca89589f3   18 hours ago        Running             coredns             0                   3a7ab40059fac
e1b3bdd1c1fa4       325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                              18 hours ago        Exited              render-config       0                   babdf5c21d29f
fcd58c6b5a9ca       325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                              18 hours ago        Exited              verify-hostname     0                   babdf5c21d29f
af577e6bbcc2c       325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                              18 hours ago        Exited              render-config       0                   3a7ab40059fac
7ef2f7a6a91d0       325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                              18 hours ago        Exited              render-config       0                   7b4a3a5e1e20a

$ ss -tnl 
State                        Recv-Q                       Send-Q                                             Local Address:Port                                              Peer Address:Port                       
LISTEN                       0                            128                                                      0.0.0.0:111                                                    0.0.0.0:*                          
LISTEN                       0                            128                                                      0.0.0.0:49585                                                  0.0.0.0:*                          
LISTEN                       0                            128                                                      0.0.0.0:22                                                     0.0.0.0:*                          
LISTEN                       0                            128                                                  10.0.128.28:10010                                                  0.0.0.0:*                          
LISTEN                       0                            128                                                    127.0.0.1:10248                                                  0.0.0.0:*                          
LISTEN                       0                            128                                                  10.0.128.28:10250                                                  0.0.0.0:*                          
LISTEN                       0                            128                                                         [::]:111                                                       [::]:*                          
LISTEN                       0                            128                                                        [::1]:50000                                                     [::]:*                          
LISTEN                       0                            128                                                            *:53                                                           *:*                          
LISTEN                       0                            128                                                         [::]:22                                                        [::]:*                          
LISTEN                       0                            128                                                         [::]:55735                                                     [::]:*                          
LISTEN                       0                            128                                                            *:50936                                                        *:*                          
LISTEN                       0                            128                                                            *:18080                                                        *:*                          
LISTEN                       0                            128                                                            *:9537                                                         *:*                          
LISTEN                       0                            128                                                            *:9443                                                         *:*

Comment 3 Yossi Boaron 2020-07-07 08:07:04 UTC
I also hit the same problem on my setup. It seems the root cause is that the kube-*.yaml files are missing from the static pod directory (/etc/kubernetes/manifests) on the master nodes.


[1] is the list of files from master-2 in my setup
[2] is the list of files from a master in a working cluster

I don't think it's related to keepalived/HAProxy.

[1] 
[core@master-2 ~]$ sudo ls  -l /etc/kubernetes/manifests/
total 20
-rw-r--r--. 1 root root 3012 Jul  6 19:36 coredns.yaml
-rw-r--r--. 1 root root 4029 Jul  6 19:36 haproxy.yaml
-rw-r--r--. 1 root root 3164 Jul  6 19:36 keepalived.yaml
-rw-r--r--. 1 root root 2635 Jul  6 19:36 mdns-publisher.yaml
-rw-r--r--. 1 root root  706 Jul  6 19:36 recycler-pod.yaml
[core@master-2 ~]$ 


[2] 

[core@master-0-0 ~]$ sudo ls -l /etc/kubernetes/manifests/
total 68
-rw-r--r--. 1 root root  2994 Jul  2 18:15 coredns.yaml
-rw-r--r--. 1 root root 24738 Jul  2 18:20 etcd-pod.yaml
-rw-r--r--. 1 root root  4024 Jul  2 18:15 haproxy.yaml
-rw-r--r--. 1 root root  3155 Jul  2 18:15 keepalived.yaml
-rw-r--r--. 1 root root  5476 Jul  3 18:00 kube-apiserver-pod.yaml
-rw-r--r--. 1 root root  5825 Jul  2 18:44 kube-controller-manager-pod.yaml
-rw-r--r--. 1 root root  3558 Jul  2 18:45 kube-scheduler-pod.yaml
-rw-r--r--. 1 root root  2626 Jul  2 18:15 mdns-publisher.yaml
-rw-r--r--. 1 root root   706 Jul  2 18:15 recycler-pod.yaml
[core@master-0-0 ~]$

Comment 4 Yossi Boaron 2020-07-07 11:54:04 UTC
In addition, it seems that the CVO never started during bootkube --> the API server operator was not started on one of the masters --> the kube-apiserver static pod manifest was never written to /etc/kubernetes/manifests.

I also noticed the following error in the kubelet journal on the bootstrap node:

"
2511 container_manager_linux.go:512] failed to find cgroups of kubelet - cpu and memory cgroup hierarchy not unified. cpu: /, memory: /system.slice/kubelet.service
"

Comment 5 Karim Boumedhel 2020-07-07 13:58:19 UTC
In my case, I could see that the API VIP was actually running on both the bootstrap node and one of the masters.
Also, as stated in the Slack conversation, there was a major version bump of keepalived in the openshift/origin-keepalived-ipfailover container image, from v1.3.5 to v2.0.10.
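For reference, one way to confirm which keepalived version a node is actually running (a sketch, assuming crictl is available and the container is named keepalived, as in the crictl ps output above):

$ sudo crictl exec $(sudo crictl ps --name keepalived -q) keepalived --version
# Prints the version banner, matching the "Keepalived vX.Y.Z" line seen
# at the top of the keepalived logs.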

Comment 6 Abhinav Dahiya 2020-07-07 16:44:34 UTC
*** Bug 1854355 has been marked as a duplicate of this bug. ***

Comment 7 jima 2020-07-08 05:25:06 UTC
The issue is only reproduced with IPI on vSphere.
The API VIP is running on the bootstrap node at the beginning, then it is removed but never moves to a master node.
Logs can be found on bug 1854355.

Comment 10 jima 2020-07-13 09:46:29 UTC
On the vSphere platform, installing OCP IPI with nightly build 4.6.0-0.nightly-2020-07-12-014740 still failed.

On the bootstrap server, the keepalived log shows the bootstrap server in MASTER state, but the API VIP is not bound to interface ens192 on the bootstrap node, and it was not found on any master node either.
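A quick way to confirm whether the VIP is actually bound to the interface (a sketch; 10.0.0.5 is the API VIP and ens192 the interface from the logs and nmcli output below):

$ ip -br addr show dev ens192
ens192           UP             10.0.0.207/24 fe80::250:56ff:feab:b3ab/64
# 10.0.0.5 is absent from the address list, so the VIP was never set.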

[root@localhost ~]# crictl logs c9b2bd27a8d57
Starting Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Opening file '/etc/keepalived/keepalived.conf'.
Starting VRRP child process, pid=9
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Registering gratuitous ARP shared channel
Opening file '/etc/keepalived/keepalived.conf'.
Truncating auth_pass to 8 characters
VRRP_Instance(qeci-5999check1_API) removing protocol VIPs.
Using LinkWatch kernel netlink reflector...
VRRP_Instance(qeci-5999check1_API) Entering BACKUP STATE
VRRP sockpool: [ifindex(2), proto(112), unicast(0), fd(9,10)]
VRRP_Instance(qeci-5999check1_API) Transition to MASTER STATE
VRRP_Instance(qeci-5999check1_API) Entering MASTER STATE
VRRP_Instance(qeci-5999check1_API) setting protocol VIPs.
Sending gratuitous ARP on ens192 for 10.0.0.5
VRRP_Instance(qeci-5999check1_API) Sending/queueing gratuitous ARPs on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
VRRP_Instance(qeci-5999check1_API) Sending/queueing gratuitous ARPs on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5

[root@localhost ~]# nmcli d show ens192
GENERAL.DEVICE:                         ens192
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         00:50:56:AB:B3:AB
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     Wired Connection
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/19
WIRED-PROPERTIES.CARRIER:               on
IP4.ADDRESS[1]:                         10.0.0.207/24
IP4.GATEWAY:                            10.0.0.1
IP4.ROUTE[1]:                           dst = 0.0.0.0/0, nh = 10.0.0.1, mt = 100
IP4.ROUTE[2]:                           dst = 10.0.0.0/24, nh = 0.0.0.0, mt = 100
IP4.DNS[1]:                             10.0.0.51
IP6.ADDRESS[1]:                         fe80::250:56ff:feab:b3ab/64
IP6.GATEWAY:                            --
IP6.ROUTE[1]:                           dst = fe80::/64, nh = ::, mt = 100
IP6.ROUTE[2]:                           dst = ff00::/8, nh = ::, mt = 256, table=255

Comment 15 Ben Nemec 2020-07-16 18:03:18 UTC
That's strange: you've still got the old 1.x version of keepalived. That should never have broken, and I wouldn't expect the changes made to fix this bug to affect it either.

It's not clear to me from the logs what would be causing this, but it's very suspicious that api-int isn't resolving. That's a static record in coredns so it shouldn't be affected by any VIP issues. Is the prepender script correctly setting the first DNS server to be the node's own IP? If that's failing for some reason it might give us a clue to what is going on here.
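A quick check, assuming the prepender works by putting the node's own IP ahead of the DHCP-provided server in /etc/resolv.conf (addresses taken from the nmcli output in comment 10):

$ cat /etc/resolv.conf
nameserver 10.0.0.207    # expected first entry: the node's own IP
nameserver 10.0.0.51     # DHCP-provided DNS (IP4.DNS[1] above)
# If the node's own IP is missing here, api-int cannot resolve against
# the node-local coredns, which would match the symptom described above.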

Comment 17 Martin André 2020-07-17 14:39:57 UTC
vSphere is still using dhclient.conf [1]; however, dhclient was removed from the RHCOS image. It needs a patch similar to https://github.com/openshift/installer/pull/3789.

[1] https://github.com/openshift/installer/blob/30682e7/data/data/bootstrap/vsphere/files/etc/dhcp/dhclient.conf.template
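For context, a schematic of the NetworkManager dispatcher approach that the linked PR represents (the path, filename, and script body below are hypothetical illustrations, not the PR's actual contents):

# /etc/NetworkManager/dispatcher.d/30-resolv-prepender  (hypothetical name)
#!/bin/bash
# NetworkManager invokes dispatcher scripts with the interface name and
# the event type (up, dhcp4-change, dhcp6-change, ...).
IFACE="$1"; ACTION="$2"
case "$ACTION" in
up|dhcp4-change|dhcp6-change)
    # Prepend this node's own IP as the first nameserver so the static
    # api-int record served by the node-local coredns always resolves.
    NODE_IP="$(ip route get 1.1.1.1 | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}')"
    if ! grep -q "nameserver ${NODE_IP}" /etc/resolv.conf; then
        sed -i "0,/^nameserver/s//nameserver ${NODE_IP}\nnameserver/" /etc/resolv.conf
    fi
    ;;
esac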

Comment 18 Ben Nemec 2020-07-17 15:57:11 UTC
I am going to propose that we re-close this bug and start tracking the vsphere issue in a new one. Based on Martin's comment and the fact that vsphere is still on the old keepalived it seems highly unlikely that their problem is the same as this one.

Jinyun, can you open a new bug with the information you've provided here? At the very least we can track the dhclient change there, and if that doesn't fix the deployment we can figure out where to go after that's resolved.

Comment 19 jima 2020-07-20 06:04:50 UTC
Thanks for your debugging, Ben and Martin.
I've opened new bug 1858683 and am closing this bug.

Comment 22 errata-xmlrpc 2020-10-27 16:12:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

