Bug 1883508 - [OSP] cluster installation failed due to OAuthRouteCheckEndpointAccessibleController_SyncError
Summary: [OSP] cluster installation failed due to OAuthRouteCheckEndpointAccessibleController_SyncError
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Yossi Boaron
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-29 12:43 UTC by xiyuan
Modified: 2024-03-25 16:36 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-01 15:49:57 UTC
Target Upstream Version:
Embargoed:
dsanzmor: needinfo-


Attachments
openshift-openstack-infra_worker.log (6.66 MB, text/plain), attached 2020-09-29 12:43 UTC by xiyuan


Links
Github openshift/machine-config-operator pull 2134 (closed): "Bug 1883508: [Baremetal and friends] wrap Keepalived chk_ingress script with timeout", last updated 2021-02-21 20:04:35 UTC
Red Hat Product Errata RHBA-2020:4196, 2020-10-27 16:46:52 UTC

Description xiyuan 2020-09-29 12:43:29 UTC
Created attachment 1717525 [details]
openshift-openstack-infra_worker.log

Description of problem:
OSP installation failed due to OAuthRouteCheckEndpointAccessibleController_SyncError.

Using oc describe to check the status of the degraded cluster operators:
Name: authentication
Status:
  Conditions:
    Last Transition Time:  2020-09-29T03:40:00Z
    Message:               OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.xiyuan09291.qe.devcluster.openshift.com/healthz": dial tcp 192.168.0.7:443: connect: connection refused
    Reason:                OAuthRouteCheckEndpointAccessibleController_SyncError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-09-29T03:41:48Z
    Message:               OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps.xiyuan09291.qe.devcluster.openshift.com/healthz" not successfull yet
    Reason:                OAuthVersionRoute_WaitingForRoute
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-09-29T03:41:26Z
    Message:               OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps.xiyuan09291.qe.devcluster.openshift.com/healthz" failed: dial tcp 192.168.0.7:443: connect: connection refused
OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.xiyuan09291.qe.devcluster.openshift.com/healthz": dial tcp 192.168.0.7:443: connect: connection refused
    Reason:                OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed
    Status:                False
    Type:                  Available

$  oc get co | grep -v '.*True.*False.*False'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-28-212756   False       True          True       6h32m
console                                    4.6.0-0.nightly-2020-09-28-212756   False       True          True       6h29m

oc get nodes -o wide
NAME                         STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
xiyuan09291-8kdg9-master-0   Ready    master   7h30m   v1.19.0+e465e66   192.168.0.253   <none>        Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa)   4.18.0-193.23.1.el8_2.x86_64   cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8
xiyuan09291-8kdg9-master-1   Ready    master   7h30m   v1.19.0+e465e66   192.168.3.227   <none>        Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa)   4.18.0-193.23.1.el8_2.x86_64   cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8
xiyuan09291-8kdg9-master-2   Ready    master   7h30m   v1.19.0+e465e66   192.168.3.176   <none>        Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa)   4.18.0-193.23.1.el8_2.x86_64   cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8
xiyuan09291-8kdg9-worker-0   Ready    worker   7h12m   v1.19.0+e465e66   192.168.2.104   <none>        Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa)   4.18.0-193.23.1.el8_2.x86_64   cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8
xiyuan09291-8kdg9-worker-1   Ready    worker   7h12m   v1.19.0+e465e66   192.168.2.55    <none>        Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa)   4.18.0-193.23.1.el8_2.x86_64   cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8
xiyuan09291-8kdg9-worker-2   Ready    worker   7h12m   v1.19.0+e465e66   192.168.0.121   <none>        Red Hat Enterprise Linux CoreOS 46.82.202009281740-0 (Ootpa)   4.18.0-193.23.1.el8_2.x86_64   cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8
$ oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP              NODE                         NOMINATED NODE   READINESS GATES
router-default-7cf995f6d7-5rrsb   1/1     Running   0          6h34m   192.168.0.121   xiyuan09291-8kdg9-worker-2   <none>           <none>
router-default-7cf995f6d7-fbgwc   1/1     Running   0          6h29m   192.168.2.104   xiyuan09291-8kdg9-worker-0   <none>           <none>

$ for i in `oc get nodes -l node-role.kubernetes.io/worker= -o name`; do oc debug $i -- chroot /host hostname -I ; done
Starting pod/xiyuan09291-8kdg9-worker-0-debug ...
To use host binaries, run `chroot /host`
10.128.2.2 169.254.0.1 192.168.2.104 Removing debug pod ...
Starting pod/xiyuan09291-8kdg9-worker-1-debug ...
To use host binaries, run `chroot /host`
10.131.0.2 169.254.0.1 192.168.2.55 Removing debug pod ...
Starting pod/xiyuan09291-8kdg9-worker-2-debug ...
To use host binaries, run `chroot /host`
10.129.2.2 169.254.0.1 192.168.0.121 Removing debug pod ...
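
None of the workers above report the ingress VIP 192.168.0.7 among their addresses, which matches the connection-refused errors. A more direct check of which node, if any, currently holds the VIP (a sketch; the interface name br-ex and the VIP address are taken from the keepalived logs and error messages in this report):

$ for i in `oc get nodes -l node-role.kubernetes.io/worker= -o name`; do oc debug $i -- chroot /host ip addr show br-ex | grep 192.168.0.7 ; done
# no output for a node means it does not hold the ingress VIP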

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-28-212756

How reproducible:
Not sure

Steps to Reproduce:
1. Install 4.6.0-0.nightly-2020-09-28-212756 with the upi-on-osp/versioned-installer-disconnected-osp16-ci matrix, with OVN enabled.

Actual results:
Installation failed.

Expected results:
Installation succeeds.

Additional info:
Logs for the worker-related pods in the openshift-openstack-infra namespace were captured; see the attachment:
$ for i in `oc get pods -n openshift-openstack-infra -o name | grep worker`; do oc logs $i --all-containers -n openshift-openstack-infra >> openshift-openstack-infra_worker.log ; done

Comment 1 Antonio Murdaca 2020-09-29 12:51:05 UTC
Any reason this is assigned to the MCO? Can you clarify and move it accordingly, please?

Comment 3 xiyuan 2020-09-29 13:08:16 UTC
(In reply to Antonio Murdaca from comment #1)
> Any reason this is assigned to the MCO? Can you clarify and move it
> accordingly, please?

It seems related to keepalived, and keepalived is part of the MCO, so I assigned it to the MCO.
Please correct me if I am wrong. Thanks.

Comment 5 Martin André 2020-10-01 07:28:20 UTC
From what I could trace from the logs:

03:12:14 ingress VIP on 192.168.2.55 (worker-1)
03:34:39 ingress VIP on 192.168.2.104 (worker-0)
03:38:38 ingress VIP on 192.168.0.121 (worker-2)

The output of oc get pods -n openshift-ingress -o wide also shows that the routers were running on worker-2 and worker-0 at the time the command ran (deployment + 6h), which is in line with what we see in the keepalived logs.

Could you gather the logs from router-default-7cf995f6d7-fbgwc on worker-0, and also the security group rules for the master and worker security groups?
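
For reference, commands along these lines should collect what's being asked for here (the security group names are a guess derived from the infra ID, so adjust as needed):

$ oc logs router-default-7cf995f6d7-fbgwc -n openshift-ingress > router-worker0.log
$ openstack security group rule list xiyuan09291-8kdg9-master
$ openstack security group rule list xiyuan09291-8kdg9-worker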

I wonder if this could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1842876.

Comment 6 Yossi Boaron 2020-10-01 10:02:18 UTC
As @mandre mentioned, I also see that the VIP was set 3 times (see [1]), but in all cases the Keepalived track script chk_ingress failed after a few minutes for some reason, and the VIP moved to another node.

I also noticed that the Keepalived containers were stopped on two nodes for some reason (search for "Stopping" in the log).

Since I can't reproduce this locally, I think the best option is to access your environment. @xiyuan, is that possible?


[1]
===========================================================================
Tue Sep 29 03:12:11 2020: VRRP_Script(chk_ingress) succeeded
Tue Sep 29 03:12:11 2020: (xiyuan09291_INGRESS) Changing effective priority from 40 to 90
Tue Sep 29 03:12:11 2020: (xiyuan09291_INGRESS) received lower priority (40) advert from 192.168.3.227 - discarding
Tue Sep 29 03:12:12 2020: (xiyuan09291_INGRESS) received lower priority (40) advert from 192.168.3.227 - discarding
Tue Sep 29 03:12:13 2020: (xiyuan09291_INGRESS) received lower priority (40) advert from 192.168.3.227 - discarding
Tue Sep 29 03:12:14 2020: (xiyuan09291_INGRESS) Receive advertisement timeout
Tue Sep 29 03:12:14 2020: (xiyuan09291_INGRESS) Entering MASTER STATE
Tue Sep 29 03:12:14 2020: (xiyuan09291_INGRESS) setting VIPs.
Tue Sep 29 03:12:14 2020: Sending gratuitous ARP on br-ex for 192.168.0.7
Tue Sep 29 03:12:14 2020: (xiyuan09291_INGRESS) Sending/queueing gratuitous ARPs on br-ex for 192.168.0.7
Tue Sep 29 03:34:35 2020: Script `chk_ingress` now returning 22
Tue Sep 29 03:34:35 2020: VRRP_Script(chk_ingress) failed (exited with status 22)
Tue Sep 29 03:34:35 2020: (xiyuan09291_INGRESS) Changing effective priority from 90 to 40
Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) Master received advert from 192.168.2.104 with higher priority 90, ours 40
Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) Entering BACKUP STATE
Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) removing VIPs.


=========================================================================================
Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) Entering MASTER STATE
Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) setting VIPs.
Tue Sep 29 03:34:39 2020: Sending gratuitous ARP on br-ex for 192.168.0.7
Tue Sep 29 03:34:39 2020: (xiyuan09291_INGRESS) Sending/queueing gratuitous ARPs on br-ex for 192.168.0.7
Tue Sep 29 03:34:39 2020: Sending gratuitous ARP on br-ex for 192.168.0.7
Tue Sep 29 03:38:35 2020: Script `chk_ingress` now returning 22
Tue Sep 29 03:38:35 2020: VRRP_Script(chk_ingress) failed (exited with status 22)
Tue Sep 29 03:38:35 2020: (xiyuan09291_INGRESS) Changing effective priority from 90 to 40
Tue Sep 29 03:38:36 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Master received advert from 192.168.0.121 with higher priority 90, ours 40
Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Entering BACKUP STATE

===================================================================================
Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Entering MASTER STATE
Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) setting VIPs.
Tue Sep 29 03:38:38 2020: Sending gratuitous ARP on br-ex for 192.168.0.7

Tue Sep 29 03:41:24 2020: Script `chk_ingress` now returning 22
Tue Sep 29 03:41:24 2020: VRRP_Script(chk_ingress) failed (exited with status 22)
Tue Sep 29 03:41:24 2020: (xiyuan09291_INGRESS) Changing effective priority from 90 to 40



[2]
Tue Sep 29 03:38:36 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Master received advert from 192.168.0.121 with higher priority 90, ours 40
Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) Entering BACKUP STATE
Tue Sep 29 03:38:38 2020: (xiyuan09291_INGRESS) removing VIPs.
Tue Sep 29 03:38:46 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:38:56 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:06 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:06 2020: Script `chk_ingress` now returning 7
Tue Sep 29 03:39:16 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:17 2020: Stopping


Tue Sep 29 03:38:43 2020: (xiyuan09291_INGRESS) Sending/queueing gratuitous ARPs on br-ex for 192.168.0.7
Tue Sep 29 03:38:43 2020: Sending gratuitous ARP on br-ex for 192.168.0.7
Tue Sep 29 03:38:43 2020: Sending gratuitous ARP on br-ex for 192.168.0.7
Tue Sep 29 03:38:43 2020: Sending gratuitous ARP on br-ex for 192.168.0.7
Tue Sep 29 03:38:43 2020: Sending gratuitous ARP on br-ex for 192.168.0.7
Tue Sep 29 03:38:52 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:02 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:12 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:22 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:32 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:42 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:39:52 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:40:02 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:40:12 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:40:22 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:40:32 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:40:42 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:40:52 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:41:02 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:41:12 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:41:21 2020: Interface 8db07e04c0d8239 deleted
Tue Sep 29 03:41:22 2020: Interface 1e6453fec8291c3 deleted
Tue Sep 29 03:41:22 2020: Interface 88d979115ff0e59 deleted
Tue Sep 29 03:41:22 2020: Interface 74d0a882801cf38 deleted
Tue Sep 29 03:41:22 2020: Interface 36e43da848d6a88 deleted
Tue Sep 29 03:41:22 2020: Interface d42fb12648c0f1f deleted
Tue Sep 29 03:41:22 2020: Interface 662c80f2e2c6f46 deleted
Tue Sep 29 03:41:22 2020: Interface d7f246f63a2fc04 deleted
Tue Sep 29 03:41:22 2020: Interface bd0720372a5061b deleted
Tue Sep 29 03:41:22 2020: Interface 4d75082878d9ec6 deleted
Tue Sep 29 03:41:22 2020: Interface 74d4a917f35e5c4 deleted
Tue Sep 29 03:41:22 2020: Interface 40b043be53966e2 deleted
Tue Sep 29 03:41:22 2020: Interface 202d021f53a01d3 deleted
Tue Sep 29 03:41:22 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:41:22 2020: Interface 0043abd856310c3 deleted
Tue Sep 29 03:41:22 2020: Interface 22257b16092e2dd deleted
Tue Sep 29 03:41:22 2020: Interface c97d64f738ada53 deleted
Tue Sep 29 03:41:24 2020: Script `chk_ingress` now returning 22
Tue Sep 29 03:41:24 2020: VRRP_Script(chk_ingress) failed (exited with status 22)
Tue Sep 29 03:41:24 2020: (xiyuan09291_INGRESS) Changing effective priority from 90 to 40
Tue Sep 29 03:41:32 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:41:42 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:41:52 2020: Printing VRRP data for process(14) on signal
Tue Sep 29 03:41:55 2020: Script `chk_ingress` now returning 7
Tue Sep 29 03:42:02 2020: Stopping
2020/09/29 03:42:02 socat[8] E waitpid(): child 9 exited with status 143
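
If chk_ingress is a curl-based check (an assumption; the actual script lives in the MCO templates), exit status 22 would be curl's "HTTP error returned with -f/--fail" and 7 its "failed to connect to host". The linked PR (openshift machine-config-operator pull 2134) wraps the chk_ingress script with a timeout, presumably so a hung check fails fast instead of wedging keepalived. A minimal sketch of that shape of keepalived configuration; the script body, port, interval, and weight are illustrative, not the MCO's actual template:

vrrp_script chk_ingress {
    # Wrap the health check in timeout(1) so a hung curl exits nonzero
    # quickly instead of blocking keepalived's check scheduling.
    # The URL and port are placeholders for the real router health check.
    script "/usr/bin/timeout 0.9 /usr/bin/curl -o /dev/null -Lfs http://localhost:1936/healthz"
    interval 1
    weight 50
}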

Comment 8 Micah Abbott 2020-10-05 13:36:59 UTC
@Weiwei @David: could someone from your team help verify this BZ?

Comment 13 errata-xmlrpc 2020-10-27 16:46:19 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 16 Red Hat Bugzilla 2023-09-15 00:48:56 UTC
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.

