Description of problem:

All on-premise platforms seem to be failing with similar errors. See
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_ironic-inspector-image/36/pull-ci-openshift-ironic-inspector-image-master-e2e-metal-ipi/1280216231084822528
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_ironic-inspector-image/36/pull-ci-openshift-ironic-inspector-image-master-e2e-metal-ipi/1280216231084822528/build-log.txt

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.6 CI build (note: nightlies are so far OK)

Actual results:
E0706 21:04:36.483676   71188 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ClusterVersion: Get https://api.ostest.test.metalkube.org:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0: dial tcp [fd2e:6f44:5dd8:c956::5]:6443: connect: connection refused

Expected results:

Additional info:
The discussion on Slack from Martin seemed to indicate the VIP was moving to the cluster too early, but I haven't dug into this myself yet.
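For anyone trying to confirm where the VIP ended up, a quick sanity check to run on each node (a sketch; the VIP fd2e:6f44:5dd8:c956::5 comes from the error above and will differ per cluster):

# Check whether the API VIP is bound to any interface on this node:
$ ip -br addr show | grep -F 'fd2e:6f44:5dd8:c956::5'
# Check whether anything actually answers on the API port behind the VIP:
$ curl -k --connect-timeout 5 https://[fd2e:6f44:5dd8:c956::5]:6443/readyz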
There still doesn't appear to be any Networking component for baremetal-runtimecfg :-\
What we've observed so far:
- it's affecting all on-prem platforms
- it's affecting pre-submit jobs (and likely origin release jobs as well), but not OCP release jobs
- the failure was introduced in an api.ci release payload between 2020-07-02 22:18:36 +0000 UTC and 2020-07-02 23:49:22 +0000 UTC, according to oVirt CI
- the API VIP moves to the master nodes but there is no kube-apiserver listening there. Somehow, the chk_ocp keepalived check succeeded, meaning that it got a response from https://localhost:6443/readyz.

From the keepalived logs on a master node:

Mon Jul 6 12:22:38 2020: VRRP_Script(chk_ocp) failed (exited with status 7)
Mon Jul 6 12:23:15 2020: Script `chk_ocp` now returning 0
Mon Jul 6 12:23:15 2020: VRRP_Script(chk_ocp) succeeded
Mon Jul 6 12:23:15 2020: (mandre_API) Changing effective priority from 40 to 90
Mon Jul 6 12:23:16 2020: (mandre_API) received lower priority (50) advert from 10.0.128.16 - discarding
Mon Jul 6 12:23:17 2020: (mandre_API) received lower priority (50) advert from 10.0.128.16 - discarding
Mon Jul 6 12:23:18 2020: (mandre_API) received lower priority (50) advert from 10.0.128.16 - discarding
Mon Jul 6 12:23:18 2020: (mandre_API) Receive advertisement timeout
Mon Jul 6 12:23:18 2020: (mandre_API) Entering MASTER STATE
Mon Jul 6 12:23:18 2020: (mandre_API) setting VIPs.

However, nothing is listening on port 6443:

$ sudo crictl ps -a
CONTAINER      IMAGE                                                                                                                       CREATED       STATE    NAME             ATTEMPT  POD ID
d61e8e1b48004  325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                            18 hours ago  Running  haproxy-monitor  0        cfef4da781c28
331b62a2b45fc  registry.svc.ci.openshift.org/ci-op-g0k7x20c/stable@sha256:81bae9207a6e0061fae165873e94213b0e7b6329d8055ef814c053bd97de35a4  18 hours ago  Running  haproxy          0        cfef4da781c28
3559b35650405  registry.svc.ci.openshift.org/ci-op-g0k7x20c/stable@sha256:ccae9f3d7d869575c829b91e4dbd09cffdcf79d9510c6d94bc1dbf694e3e14c0  18 hours ago  Running  mdns-publisher   0        babdf5c21d29f
340fabbdd9373  registry.svc.ci.openshift.org/ci-op-g0k7x20c/stable@sha256:a4c71e5c5e0f9ec0818a7367922cab6d0f08944473468e26521a5cd5ec5e60d8  18 hours ago  Running  keepalived       0        7b4a3a5e1e20a
22d21b46d5f3c  registry.svc.ci.openshift.org/ci-op-g0k7x20c/stable@sha256:4d630ee97ef6a8b43138f050fee51899b07c8bf72655d1e0337ff50ca89589f3  18 hours ago  Running  coredns          0        3a7ab40059fac
e1b3bdd1c1fa4  325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                            18 hours ago  Exited   render-config    0        babdf5c21d29f
fcd58c6b5a9ca  325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                            18 hours ago  Exited   verify-hostname  0        babdf5c21d29f
af577e6bbcc2c  325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                            18 hours ago  Exited   render-config    0        3a7ab40059fac
7ef2f7a6a91d0  325dc9bca6a0a7fb8c93fa456f287404ce11c62198a7daeb4f90a1ba66e3c11a                                                            18 hours ago  Exited   render-config    0        7b4a3a5e1e20a

$ ss -tnl
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
LISTEN  0       128     0.0.0.0:111          0.0.0.0:*
LISTEN  0       128     0.0.0.0:49585        0.0.0.0:*
LISTEN  0       128     0.0.0.0:22           0.0.0.0:*
LISTEN  0       128     10.0.128.28:10010    0.0.0.0:*
LISTEN  0       128     127.0.0.1:10248      0.0.0.0:*
LISTEN  0       128     10.0.128.28:10250    0.0.0.0:*
LISTEN  0       128     [::]:111             [::]:*
LISTEN  0       128     [::1]:50000          [::]:*
LISTEN  0       128     *:53                 *:*
LISTEN  0       128     [::]:22              [::]:*
LISTEN  0       128     [::]:55735           [::]:*
LISTEN  0       128     *:50936              *:*
LISTEN  0       128     *:18080              *:*
LISTEN  0       128     *:9537               *:*
LISTEN  0       128     *:9443               *:*
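Note that exit status 7 is curl's "failed to connect to host" code, which suggests chk_ocp is a curl-based check. It can be reproduced by hand on a master with something like the following (a sketch; the actual track script is rendered by baremetal-runtimecfg and may differ):

# Should print 0 when the apiserver answers on /readyz, and 7 on
# connection refused (matching the 12:22:38 failure above):
$ curl -o /dev/null -kLfs https://localhost:6443/readyz; echo $?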
I also hit the same problem on my setup. It seems the root cause is that the kube-*.yaml files are missing from the static pod directory (/etc/kubernetes/manifests) on the master nodes. [1] is the list of files from master-2 in my setup, [2] is the list of files from a master in a working cluster. I don't think it's related to keepalived/HAProxy.

[1]
[core@master-2 ~]$ sudo ls -l /etc/kubernetes/manifests/
total 20
-rw-r--r--. 1 root root  3012 Jul  6 19:36 coredns.yaml
-rw-r--r--. 1 root root  4029 Jul  6 19:36 haproxy.yaml
-rw-r--r--. 1 root root  3164 Jul  6 19:36 keepalived.yaml
-rw-r--r--. 1 root root  2635 Jul  6 19:36 mdns-publisher.yaml
-rw-r--r--. 1 root root   706 Jul  6 19:36 recycler-pod.yaml
[core@master-2 ~]$

[2]
[core@master-0-0 ~]$ sudo ls -l /etc/kubernetes/manifests/
total 68
-rw-r--r--. 1 root root  2994 Jul  2 18:15 coredns.yaml
-rw-r--r--. 1 root root 24738 Jul  2 18:20 etcd-pod.yaml
-rw-r--r--. 1 root root  4024 Jul  2 18:15 haproxy.yaml
-rw-r--r--. 1 root root  3155 Jul  2 18:15 keepalived.yaml
-rw-r--r--. 1 root root  5476 Jul  3 18:00 kube-apiserver-pod.yaml
-rw-r--r--. 1 root root  5825 Jul  2 18:44 kube-controller-manager-pod.yaml
-rw-r--r--. 1 root root  3558 Jul  2 18:45 kube-scheduler-pod.yaml
-rw-r--r--. 1 root root  2626 Jul  2 18:15 mdns-publisher.yaml
-rw-r--r--. 1 root root   706 Jul  2 18:15 recycler-pod.yaml
[core@master-0-0 ~]$
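A quick way to spot which masters are missing the control-plane manifests (a sketch; the hostnames are examples from this setup):

$ for node in master-0 master-1 master-2; do echo "== $node =="; ssh core@"$node" sudo ls /etc/kubernetes/manifests/; done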
In addition, it seems that the CVO never started on bootkube --> the api-operator was not started on one of the masters --> the kube-apiserver static pod manifest was never placed in /etc/kubernetes/manifests.

Also noticed the following error in the kubelet journal on the bootstrap node:

2511 container_manager_linux.go:512] failed to find cgroups of kubelet - cpu and memory cgroup hierarchy not unified. cpu: /, memory: /system.slice/kubelet.service
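To confirm whether the CVO ever came up during bootstrapping, something like this on the bootstrap node should show it (a sketch; container names can vary by release):

$ journalctl -b -u bootkube.service | tail -n 50
$ sudo crictl ps -a | grep -i cluster-version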
In my case, I could see that the API VIP was actually active on both the bootstrap node and one of the masters. Also, as stated in the Slack conversation, there was a major version bump of keepalived in the container image openshift/origin-keepalived-ipfailover, from 1.3.5 to 2.0.10.
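To verify which keepalived version a node is actually running (a sketch; substitute the container ID printed by the first command):

$ sudo crictl ps --name keepalived -q
$ sudo crictl exec <container-id> keepalived --version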
*** Bug 1854355 has been marked as a duplicate of this bug. ***
The issue is only reproduced with IPI on vSphere. The API VIP runs on the bootstrap node at the beginning, then is removed but not moved to any master node. Logs can be found in bug 1854355.
On the vSphere platform, installed OCP IPI with nightly build 4.6.0-0.nightly-2020-07-12-014740 and it still failed. On the bootstrap server, the keepalived log shows the bootstrap server in MASTER state, but the API VIP is not bound to interface ens192 on the bootstrap node, and it is not found on any master node either.

[root@localhost ~]# crictl logs c9b2bd27a8d57
Starting Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Opening file '/etc/keepalived/keepalived.conf'.
Starting VRRP child process, pid=9
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Registering gratuitous ARP shared channel
Opening file '/etc/keepalived/keepalived.conf'.
Truncating auth_pass to 8 characters
VRRP_Instance(qeci-5999check1_API) removing protocol VIPs.
Using LinkWatch kernel netlink reflector...
VRRP_Instance(qeci-5999check1_API) Entering BACKUP STATE
VRRP sockpool: [ifindex(2), proto(112), unicast(0), fd(9,10)]
VRRP_Instance(qeci-5999check1_API) Transition to MASTER STATE
VRRP_Instance(qeci-5999check1_API) Entering MASTER STATE
VRRP_Instance(qeci-5999check1_API) setting protocol VIPs.
Sending gratuitous ARP on ens192 for 10.0.0.5
VRRP_Instance(qeci-5999check1_API) Sending/queueing gratuitous ARPs on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
VRRP_Instance(qeci-5999check1_API) Sending/queueing gratuitous ARPs on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5
Sending gratuitous ARP on ens192 for 10.0.0.5

[root@localhost ~]# nmcli d show ens192
GENERAL.DEVICE:             ens192
GENERAL.TYPE:               ethernet
GENERAL.HWADDR:             00:50:56:AB:B3:AB
GENERAL.MTU:                1500
GENERAL.STATE:              100 (connected)
GENERAL.CONNECTION:         Wired Connection
GENERAL.CON-PATH:           /org/freedesktop/NetworkManager/ActiveConnection/19
WIRED-PROPERTIES.CARRIER:   on
IP4.ADDRESS[1]:             10.0.0.207/24
IP4.GATEWAY:                10.0.0.1
IP4.ROUTE[1]:               dst = 0.0.0.0/0, nh = 10.0.0.1, mt = 100
IP4.ROUTE[2]:               dst = 10.0.0.0/24, nh = 0.0.0.0, mt = 100
IP4.DNS[1]:                 10.0.0.51
IP6.ADDRESS[1]:             fe80::250:56ff:feab:b3ab/64
IP6.GATEWAY:                --
IP6.ROUTE[1]:               dst = fe80::/64, nh = ::, mt = 100
IP6.ROUTE[2]:               dst = ff00::/8, nh = ::, mt = 256, table=255
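For completeness, `ip addr` lists every address bound to the interface regardless of how it was added, so it is a more direct way than nmcli output to confirm whether keepalived actually set the VIP:

# The VIP (10.0.0.5 here) should show up as an additional address
# on ens192 while this node is VRRP MASTER:
$ ip addr show dev ens192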
That's strange: you've still got the old 1.x version of keepalived. That should never have broken, and I wouldn't expect the changes made to fix this bug to affect it either. It's not clear to me from the logs what would be causing this, but it's very suspicious that api-int isn't resolving. That's a static record in coredns, so it shouldn't be affected by any VIP issues. Is the prepender script correctly setting the first DNS server to be the node's own IP? If that's failing for some reason, it might give us a clue as to what is going on here.
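To check the prepender's effect, the first nameserver in /etc/resolv.conf should be the node's own IP, and api-int should resolve through the local coredns (a sketch; the cluster domain below is the one from this job's logs):

# First resolver should be this node's own IP:
$ head -n 5 /etc/resolv.conf
# api-int is a static record served by the local coredns:
$ getent hosts api-int.ostest.test.metalkube.org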
vSphere is still using dhclient.conf [1]; however, dhclient was removed from the RHCOS image. It needs a patch similar to https://github.com/openshift/installer/pull/3789.

[1] https://github.com/openshift/installer/blob/30682e7/data/data/bootstrap/vsphere/files/etc/dhcp/dhclient.conf.template
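For reference, the dhclient.conf template relied on dhclient's prepend hook to put the node's own IP first in resolv.conf. With dhclient gone, the same effect has to come from NetworkManager; below is a minimal sketch of the idea as a dispatcher script (hypothetical path and logic, not the actual change in PR 3789):

#!/bin/bash
# Sketch: /etc/NetworkManager/dispatcher.d/30-resolv-prepender
# Prepend this node's IP as the first nameserver so the local
# coredns answers api-int queries before the upstream DNS.
IFACE="$1" ACTION="$2"
[ "$ACTION" = "up" ] || exit 0
# The node's primary IPv4 address (source address of the default route):
NODE_IP=$(ip -4 route get 1.1.1.1 | awk '{print $7; exit}')
# Insert it ahead of the existing first nameserver, if not already present:
grep -q "^nameserver $NODE_IP$" /etc/resolv.conf || \
  sed -i "0,/^nameserver/s//nameserver $NODE_IP\nnameserver/" /etc/resolv.conf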
I am going to propose that we re-close this bug and start tracking the vSphere issue in a new one. Based on Martin's comment and the fact that vSphere is still on the old keepalived, it seems highly unlikely that their problem is the same as this one. Jinyun, can you open a new bug with the information you've provided here? At the very least we can track the dhclient change there, and if that doesn't fix the deployment we can figure out where to go after that's resolved.
Thanks for your debugging, Ben and Martin. New bug 1858683 has been opened, so I'm closing this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196