Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1835100

Summary: Openshift installation fails - Waiting on authentication
Product: OpenShift Container Platform Reporter: Itzik Brown <itbrown>
Component: Installer Assignee: Pierre Prinetti <pprinett>
Installer sub component: OpenShift on OpenStack QA Contact: David Sanz <dsanzmor>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, bleanhar, ltomasbo, m.andre, mfojtik, pprinett, slaznick, smaitra, wking
Version: 4.5 Keywords: TestBlockerForLayeredProduct, UpcomingSprint
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-21 03:57:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Itzik Brown 2020-05-13 05:59:19 UTC
Description of problem:
Openshift installation fails
time="2020-05-13T01:01:21-04:00" level=fatal msg="failed to initialize the cluster: Working towards 4.5.0-0.nightly-2020-05-12-224129: 87% complete, waiting on authentication"

After the installation exits with an error, the cluster is functional:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-12-224129   True        False         49m     Cluster version is 4.5.0-0.nightly-2020-05-12-224129

Version-Release number of the following components:
4.5.0-0.nightly-2020-05-12-224129
Openshift on OSP16.1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 2 Abhinav Dahiya 2020-05-14 02:31:58 UTC
Moving to the authentication operator, since that was the failing operator.

Comment 3 Standa Laznicka 2020-05-14 07:46:14 UTC
The authentication operator is actually up and running at the end, and it reports that correctly in both its logs and its clusteroperator resource, but that probably happens too late. The operator took far too long to come up, though, which suggests other issues in the cluster.

I can see many etcd leader changes:
```
2020-05-13T05:00:31.985381733Z I0513 05:00:31.985343       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"b3b84ca7-baae-4828-90de-1b75d18d6b5b", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EtcdLeaderChangeMetrics' Detected 2.5 leader changes in last 5 minutes on "OpenStack" disk metrics are: etcd-ostest-sqszc-master-1=0.014960000000000005,etcd-ostest-sqszc-master-0=0.015760000000000003,etcd-ostest-sqszc-master-2=0.015466666666666663
```

I am going to move this to etcd, although the root cause might be elsewhere.
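The disk metrics in the event above appear to be per-member fsync durations in seconds (~15 ms each), which would exceed etcd's commonly cited guidance that p99 WAL fsync latency stay below 10 ms. A minimal sketch of that check, with the values copied from the event above (the 10 ms ceiling and the seconds interpretation are assumptions, not confirmed by this bug):

```python
# Flag etcd members whose disk metric exceeds etcd's recommended
# 10 ms fsync ceiling. Values copied from the EtcdLeaderChangeMetrics
# event above; the metric is assumed to be in seconds.
FSYNC_P99_CEILING = 0.010  # 10 ms, per etcd hardware guidance

disk_metrics = {
    "etcd-ostest-sqszc-master-1": 0.014960000000000005,
    "etcd-ostest-sqszc-master-0": 0.015760000000000003,
    "etcd-ostest-sqszc-master-2": 0.015466666666666663,
}

slow = {m: v for m, v in disk_metrics.items() if v > FSYNC_P99_CEILING}
for member, p99 in sorted(slow.items()):
    print(f"{member}: fsync {p99 * 1000:.1f} ms exceeds 10 ms ceiling")
```

Under that reading, all three masters are over the threshold, which is consistent with the frequent leader elections.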

Comment 7 Pierre Prinetti 2020-05-22 14:09:59 UTC
Waiting on reporter's feedback

Comment 9 Pierre Prinetti 2020-05-25 09:43:57 UTC
The Control Plane nodes need fast disks to comply with etcd requirements:

https://github.com/openshift/installer/tree/master/docs/user/openstack#disk-requirements
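One way to meet that requirement on OpenStack is to back the control-plane machines with root volumes of a fast Cinder volume type via `install-config.yaml`. A hedged sketch of the relevant fragment (the flavor and volume-type names are placeholders, not taken from this bug):

```yaml
controlPlane:
  name: master
  replicas: 3
  platform:
    openstack:
      type: m1.xlarge        # placeholder flavor name
      rootVolume:
        size: 100            # GiB
        type: ssd            # placeholder: a fast Cinder volume type
```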

Comment 12 Martin André 2020-06-25 14:44:28 UTC
Is it still an issue with current 4.6 branch?

Comment 13 W. Trevor King 2020-07-02 20:33:34 UTC
Seems rare in mainline release-promotion jobs, although 4.5, OpenStack, and a few other less-common platforms are getting hit reasonably often:

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?search=failed+to+initialize+the+cluster.*waiting+on.*authentication&maxAge=336h&context=1&type=bug%2Bjunit&name=release-openshift-ocp&groupBy=job' | grep 'failures match'
release-openshift-ocp-installer-e2e-gcp-rt-4.5 - 8 runs, 63% failed, 20% of failures match
release-openshift-ocp-installer-e2e-openstack-4.5 - 89 runs, 63% failed, 9% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 60 runs, 77% failed, 11% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.6 - 60 runs, 67% failed, 10% of failures match
release-openshift-ocp-installer-e2e-openstack-4.2 - 17 runs, 71% failed, 17% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.3 - 34 runs, 76% failed, 8% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.5 - 88 runs, 51% failed, 7% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 145 runs, 30% failed, 2% of failures match
release-openshift-ocp-installer-e2e-azure-4.3 - 43 runs, 37% failed, 6% of failures match
release-openshift-ocp-installer-e2e-gcp-serial-4.5 - 74 runs, 26% failed, 11% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.5 - 114 runs, 46% failed, 2% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 8 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-azure-serial-4.4 - 47 runs, 43% failed, 5% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.4 - 47 runs, 45% failed, 5% of failures match
release-openshift-ocp-installer-e2e-vsphere-upi-4.4 - 47 runs, 87% failed, 2% of failures match
release-openshift-ocp-installer-e2e-gcp-4.5 - 82 runs, 30% failed, 20% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 93 runs, 85% failed, 1% of failures match
release-openshift-ocp-installer-e2e-openstack-4.3 - 34 runs, 94% failed, 3% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.2 - 17 runs, 35% failed, 17% of failures match

Comment 16 Itzik Brown 2020-07-21 03:57:18 UTC
It seems that it's working now.
If it happens again, I'll reopen.