Created attachment 1717525 [details] openshift-openstack-infra_worker.log Description of problem: OSP installation failed due to OAuthRouteCheckEndpointAccessibleController_SyncError. Using oc describe to check status of bad core clusteroperators ... Name: authentication Status: Conditions: Last Transition Time: 2020-09-29T03:40:00Z Message: OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.xiyuan09291.qe.devcluster.openshift.com/healthz": dial tcp 192.168.0.7:443: connect: connection refused Reason: OAuthRouteCheckEndpointAccessibleController_SyncError Status: True Type: Degraded Last Transition Time: 2020-09-29T03:41:48Z Message: OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps.xiyuan09291.qe.devcluster.openshift.com/healthz" not successfull yet Reason: OAuthVersionRoute_WaitingForRoute Status: True Type: Progressing Last Transition Time: 2020-09-29T03:41:26Z Message: OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps.xiyuan09291.qe.devcluster.openshift.com/healthz" failed: dial tcp 192.168.0.7:443: connect: connection refused OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.xiyuan09291.qe.devcluster.openshift.com/healthz": dial tcp 192.168.0.7:443: connect: connection refused Reason: OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed Status: False Type: Available $ oc get co | grep -v '.*True.*False.*False' NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.6.0-0.nightly-2020-09-28-212756 False True True 6h32m console 4.6.0-0.nightly-2020-09-28-212756 False True True 6h29m oc get nodes -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME xiyuan09291-8kdg9-master-0 Ready master 7h30m v1.19.0+e465e66 192.168.0.253 <none> Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa) 4.18.0-193.23.1.el8_2.x86_64 cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8 xiyuan09291-8kdg9-master-1 Ready master 7h30m v1.19.0+e465e66 192.168.3.227 <none> Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa) 4.18.0-193.23.1.el8_2.x86_64 cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8 xiyuan09291-8kdg9-master-2 Ready master 7h30m v1.19.0+e465e66 192.168.3.176 <none> Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa) 4.18.0-193.23.1.el8_2.x86_64 cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8 xiyuan09291-8kdg9-worker-0 Ready worker 7h12m v1.19.0+e465e66 192.168.2.104 <none> Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa) 4.18.0-193.23.1.el8_2.x86_64 cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8 xiyuan09291-8kdg9-worker-1 Ready worker 7h12m v1.19.0+e465e66 192.168.2.55 <none> Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa) 4.18.0-193.23.1.el8_2.x86_64 cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8 xiyuan09291-8kdg9-worker-2 Ready worker 7h12m v1.19.0+e465e66 192.168.0.121 <none> Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa) 4.18.0-193.23.1.el8_2.x86_64 cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8 $ oc get pods -n openshift-ingress -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES router-default-7cf995f6d7-5rrsb 1/1 Running 0 6h34m 192.168.0.121 xiyuan09291-8kdg9-worker-2 <none> <none> router-default-7cf995f6d7-fbgwc 1/1 Running 0 6h29m 192.168.2.104 xiyuan09291-8kdg9-worker-0 <none> <none> $ for i in `oc get nodes -l node-role.kubernetes.io/worker= -o name`; do oc debug $i -- chroot /host hostname -I ; done Starting pod/xiyuan09291-8kdg9-worker-0-debug ... To use host binaries, run `chroot /host` 10.128.2.2 169.254.0.1 192.168.2.104 Removing debug pod ... Starting pod/xiyuan09291-8kdg9-worker-1-debug ... To use host binaries, run `chroot /host` 10.131.0.2 169.254.0.1 192.168.2.55 Removing debug pod ... Starting pod/xiyuan09291-8kdg9-worker-2-debug ... To use host binaries, run `chroot /host` 10.129.2.2 169.254.0.1 192.168.0.121 Removing debug pod ... Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2020-09-28-212756 How reproducible: Not sure Steps to Reproduce: 1. Install 4.6.0-0.nightly-2020-09-28-212756 with matrix upi-on-osp/versioned-installer-disconnected-osp16-ci with OVN enalbed Actual results: installation failed Expected results: installation succeeded. Additional info: Logs for pods in namespace openshift-openstack-infra related to worker was captured, seen from attachment. $ for i in `oc get pods -n openshift-openstack-infra -o name | grep worker`; do oc logs $i --all-containers -n openshift-openstack-infra >> openshift-openstack-infra_worker.log ; done
any reason this is assigned to MCO? can you clarify and move accordingly please
(In reply to Antonio Murdaca from comment #1) > any reason this is assigned to MCO? can you clarify and move accordingly > please Seems it is ralated to keepalived, and keepalived in MCO. So assigned to MCO. Please help to correct if I am wrong. Thanks.
From what I could trace from the logs: 03:12:14 ingress VIP on 192.168.2.55 (worker1) 03:34:39 ingress VIP on 192.168.2.104 (worker2) 03:38:38 ingress VIP on 192.168.0.121 (worker0) The output of oc get pods -n openshift-ingress -o wide also shows that the routers are running on worker2 and worker0 at the time the command ran (deployment + 6h) which seems in line with what we're seeing on keepalived logs. Could you gather the logs from router-default-7cf995f6d7-fbgwc on worker0 and also the security group rules for master and worker security groups? I wonder if this could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1842876.
As @mandre mentioned I also see that VIP was set 3 times (see [1] ) but in all cases for some reason the Keepalived track script chk_ingress failed after few minutes and VIP moves to another node. Also noticed that Keepalived containers were stopped (search for Stopping in the log) for some reason in two nodes Since I can't reproduce it locally, I think the best thing will be to access your env, @xiyuan, is it possible ? [1] =========================================================================== Tue Sep 29 03:12:11 2020: VRRP_Script(chk_ingress) succeeded Tue Sep 29 03:12:11 2020: (xiyuan09291_INGRESS) Changing effective priority from 40 to 90 Tue Sep 29 03:12:11 2020: (xiyuan09291_INGRESS) received lower priority (40) advert from 192.168.3.227 - discarding Tue Sep 29 03:12:12 2020: (xiyuan09291_INGRESS) received lower priority (40) advert from 192.168.3.227 - discarding Tue Sep 29 03:12:13 2020: (xiyuan09291_INGRESS) received lower priority (40) advert from 192.168.3.227 - discarding Tue Sep 29 03:12:14 2020: (xiyuan09291_INGRESS) Receive advertisement timeout Tue Sep 29 03:12:14 2020: (xiyuan09291_INGRESS) Entering MASTER STATE Tue Sep 29 03:12:14 2020: (xiyuan09291_INGRESS) setting VIPs. Tue Sep 29 03:12:14 2020: Sending gratuitous ARP on br-ex for 192.168.0.7 Tue Sep 29 03:12:14 2020: (xiyuan09291_INGRESS) Sending/queueing gratuitous ARPs on br-ex for 192.168.0.7 Tue Sep 29 03:34:35 2020: Script `chk_ingress` now returning 22 Tue Sep 29 03:34:35 2020: VRRP_Script(chk_ingress) failed (exited with status 22) Tue Sep 29 03:34:35 2020: (xiyuan09291_INGRESS) Changing effective priority from 90 to 40 Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) Master received advert from 192.168.2.104 with higher priority 90, ours 40 Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) Entering BACKUP STATE Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) removing VIPs. ========================================================================================= Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) Entering MASTER STATE Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) setting VIPs. Tue Sep 29 03:34:39 2020: Sending gratuitous ARP on br-ex for 192.168.0.7 Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) Sending/queueing gratuitous ARPs on br-ex for 192.168.0.7 Tue Sep 29 03:34:39 2020: Sending gratuitous ARP on br-ex for 192.168.0.7 Tue Sep 29 03:38:35 2020: Script `chk_ingress` now returning 22 Tue Sep 29 03:38:35 2020: VRRP_Script(chk_ingress) failed (exited with status 22) Tue Sep 29 03:38:35 2020: (xiyuan09291_INGRESS) Changing effective priority from 90 to 40 Tue Sep 29 03:38:36 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Master received advert from 192.168.0.121 with higher priority 90, ours 40 Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Entering BACKUP STATE =================================================================================== Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Entering MASTER STATE Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) setting VIPs. Tue Sep 29 03:38:38 2020: Sending gratuitous ARP on br-ex for 192.168.0.7 Tue Sep 29 03:41:24 2020: Script `chk_ingress` now returning 22 Tue Sep 29 03:41:24 2020: VRRP_Script(chk_ingress) failed (exited with status 22) Tue Sep 29 03:41:24 2020: (xiyuan09291_INGRESS) Changing effective priority from 90 to 40 [2] Tue Sep 29 03:38:36 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Master received advert from 192.168.0.121 with higher priority 90, ours 40 Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Entering BACKUP STATE Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) removing VIPs. Tue Sep 29 03:38:46 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:38:56 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:06 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:06 2020: Script `chk_ingress` now returning 7 Tue Sep 29 03:39:16 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:17 2020: Stopping Tue Sep 29 03:38:43 2020: (xiyuan09291_INGRESS) Sending/queueing gratuitous ARPs on br-ex for 192.168.0.7 Tue Sep 29 03:38:43 2020: Sending gratuitous ARP on br-ex for 192.168.0.7 Tue Sep 29 03:38:43 2020: Sending gratuitous ARP on br-ex for 192.168.0.7 Tue Sep 29 03:38:43 2020: Sending gratuitous ARP on br-ex for 192.168.0.7 Tue Sep 29 03:38:43 2020: Sending gratuitous ARP on br-ex for 192.168.0.7 Tue Sep 29 03:38:52 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:02 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:12 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:22 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:32 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:42 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:39:52 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:40:02 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:40:12 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:40:22 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:40:32 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:40:42 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:40:52 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:41:02 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:41:12 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:41:21 2020: Interface 8db07e04c0d8239 deleted Tue Sep 29 03:41:22 2020: Interface 1e6453fec8291c3 deleted Tue Sep 29 03:41:22 2020: Interface 88d979115ff0e59 deleted Tue Sep 29 03:41:22 2020: Interface 74d0a882801cf38 deleted Tue Sep 29 03:41:22 2020: Interface 36e43da848d6a88 deleted Tue Sep 29 03:41:22 2020: Interface d42fb12648c0f1f deleted Tue Sep 29 03:41:22 2020: Interface 662c80f2e2c6f46 deleted Tue Sep 29 03:41:22 2020: Interface d7f246f63a2fc04 deleted Tue Sep 29 03:41:22 2020: Interface bd0720372a5061b deleted Tue Sep 29 03:41:22 2020: Interface 4d75082878d9ec6 deleted Tue Sep 29 03:41:22 2020: Interface 74d4a917f35e5c4 deleted Tue Sep 29 03:41:22 2020: Interface 40b043be53966e2 deleted Tue Sep 29 03:41:22 2020: Interface 202d021f53a01d3 deleted Tue Sep 29 03:41:22 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:41:22 2020: Interface 0043abd856310c3 deleted Tue Sep 29 03:41:22 2020: Interface 22257b16092e2dd deleted Tue Sep 29 03:41:22 2020: Interface c97d64f738ada53 deleted Tue Sep 29 03:41:24 2020: Script `chk_ingress` now returning 22 Tue Sep 29 03:41:24 2020: VRRP_Script(chk_ingress) failed (exited with status 22) Tue Sep 29 03:41:24 2020: (xiyuan09291_INGRESS) Changing effective priority from 90 to 40 Tue Sep 29 03:41:32 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:41:42 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:41:52 2020: Printing VRRP data for process(14) on signal Tue Sep 29 03:41:55 2020: Script `chk_ingress` now returning 7 Tue Sep 29 03:42:02 2020: Stopping 2020/09/29 03:42:02 socat[8] E waitpid(): child 9 exited with status 143
@Weiwei @David could someone from your team help verify this BZ?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days