Bug 1835100 - Openshift installation fails - Waiting on authentication
Summary: Openshift installation fails - Waiting on authentication
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Pierre Prinetti
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-13 05:59 UTC by Itzik Brown
Modified: 2020-10-07 07:37 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-21 03:57:18 UTC
Target Upstream Version:
Embargoed:


Description Itzik Brown 2020-05-13 05:59:19 UTC
Description of problem:
The OpenShift installation fails with:
time="2020-05-13T01:01:21-04:00" level=fatal msg="failed to initialize the cluster: Working towards 4.5.0-0.nightly-2020-05-12-224129: 87% complete, waiting on authentication"

After the installer exits with an error, the cluster is functional:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-12-224129   True        False         49m     Cluster version is 4.5.0-0.nightly-2020-05-12-224129
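
The fatal message names the operator the installer was still waiting on when it gave up. As a rough sketch (not part of the original report), that name could be pulled out of the installer log with a small shell helper. `blocking_operator` is a hypothetical name, and the pattern assumes the message format quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read installer log lines on stdin and print whatever
# follows "waiting on " in each "failed to initialize the cluster" message.
# Assumes the message format shown in this report.
blocking_operator() {
  # Capture the text after "waiting on " up to the closing quote of msg="...".
  sed -n 's/.*waiting on \([^"]*\)"*$/\1/p'
}
```

For example, `blocking_operator < .openshift_install.log` run in the installation directory should print `authentication` for the failure above, and `oc get clusteroperator authentication` then shows that operator's current state.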

Version-Release number of the following components:
4.5.0-0.nightly-2020-05-12-224129
Openshift on OSP16.1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag


Comment 2 Abhinav Dahiya 2020-05-14 02:31:58 UTC
moving to authentication operator since that was the failing operator

Comment 3 Standa Laznicka 2020-05-14 07:46:14 UTC
The authentication operator is actually up and running at the end, and it reports that correctly in both the logs and its clusteroperator resource, but it's probably too late at that time. The operator took way too long to come up though, which suggests other issues in the cluster.

I can see many etcd leader changes:
```
2020-05-13T05:00:31.985381733Z I0513 05:00:31.985343       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"b3b84ca7-baae-4828-90de-1b75d18d6b5b", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EtcdLeaderChangeMetrics' Detected 2.5 leader changes in last 5 minutes on "OpenStack" disk metrics are: etcd-ostest-sqszc-master-1=0.014960000000000005,etcd-ostest-sqszc-master-0=0.015760000000000003,etcd-ostest-sqszc-master-2=0.015466666666666663
```

I am going to move this to etcd, although the root cause might be elsewhere.
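
To see how often the etcd operator reports leader changes, the count can be extracted from event lines like the one quoted above. This is a minimal sketch that assumes the event wording stays as shown; `leader_changes` is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read etcd-operator log lines on stdin and print the
# leader-change count from each EtcdLeaderChangeMetrics event.
# Assumes the "Detected N leader changes in last ..." wording seen above.
leader_changes() {
  sed -n 's/.*Detected \([0-9.]*\) leader changes in last.*/\1/p'
}
```

Assuming the events are also present in the operator's pod log, `oc logs -n openshift-etcd-operator deployment/etcd-operator | leader_changes` would print one number per event (the namespace and deployment name are taken from the event above).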

Comment 7 Pierre Prinetti 2020-05-22 14:09:59 UTC
Waiting on reporter's feedback

Comment 9 Pierre Prinetti 2020-05-25 09:43:57 UTC
The Control Plane nodes need fast disks to comply with etcd requirements:

https://github.com/openshift/installer/tree/master/docs/user/openstack#disk-requirements

Comment 12 Martin André 2020-06-25 14:44:28 UTC
Is it still an issue with the current 4.6 branch?

Comment 13 W. Trevor King 2020-07-02 20:33:34 UTC
Seems rare in mainline release-promotion jobs, although 4.5, OpenStack, and a few other less-common platforms are getting hit reasonably often:

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?search=failed+to+initialize+the+cluster.*waiting+on.*authentication&maxAge=336h&context=1&type=bug%2Bjunit&name=release-openshift-ocp&groupBy=job' | grep 'failures match'
release-openshift-ocp-installer-e2e-gcp-rt-4.5 - 8 runs, 63% failed, 20% of failures match
release-openshift-ocp-installer-e2e-openstack-4.5 - 89 runs, 63% failed, 9% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 60 runs, 77% failed, 11% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.6 - 60 runs, 67% failed, 10% of failures match
release-openshift-ocp-installer-e2e-openstack-4.2 - 17 runs, 71% failed, 17% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.3 - 34 runs, 76% failed, 8% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.5 - 88 runs, 51% failed, 7% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 145 runs, 30% failed, 2% of failures match
release-openshift-ocp-installer-e2e-azure-4.3 - 43 runs, 37% failed, 6% of failures match
release-openshift-ocp-installer-e2e-gcp-serial-4.5 - 74 runs, 26% failed, 11% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.5 - 114 runs, 46% failed, 2% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 8 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-azure-serial-4.4 - 47 runs, 43% failed, 5% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.4 - 47 runs, 45% failed, 5% of failures match
release-openshift-ocp-installer-e2e-vsphere-upi-4.4 - 47 runs, 87% failed, 2% of failures match
release-openshift-ocp-installer-e2e-gcp-4.5 - 82 runs, 30% failed, 20% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 93 runs, 85% failed, 1% of failures match
release-openshift-ocp-installer-e2e-openstack-4.3 - 34 runs, 94% failed, 3% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.2 - 17 runs, 35% failed, 17% of failures match
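
The list above is unsorted, which makes it hard to see which jobs are hit most often. A small sketch for ranking the lines by the share of failures that match, assuming the "N runs, X% failed, Y% of failures match" layout shown here (`rank_matches` is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "failures match" lines on stdin and print
# "<match percent><TAB><job name>", highest match percentage first.
rank_matches() {
  awk '/failures match/ {
         pct = $(NF-3)        # e.g. "20%" in "... 20% of failures match"
         sub(/%/, "", pct)    # drop the percent sign so sort sees a number
         print pct "\t" $1    # $1 is the job name
       }' | sort -rn
}
```

Appending `| rank_matches` to the `w3m ... | grep 'failures match'` pipeline above would list the jobs in descending order of match rate.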

Comment 16 Itzik Brown 2020-07-21 03:57:18 UTC
It seems to be working now.
If it happens again, I'll reopen.

