test: [sig-auth][Feature:HTPasswdAuth] HTPasswd IDP should successfully configure htpasswd and be responsive [Suite:openshift/conformance/parallel] failed, see job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.5/73
It seems multiple tests are failing due to an OAuth server issue.
testgrid job view: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-azure-compact-4.5&sort-by-flakiness=
The test is consistently failing.
Job failed consistently: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.5/74
Looks like a whole block of oauth tests is consistently failing in that test job specifically; it's unclear whether it's a compact cluster issue or an Azure issue. I also see the etcd leader change test is failing, meaning we probably had etcd issues that need to be investigated, but it's odd they'd only/specifically impact the oauth tests every time.
> Job failed consistently: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.5/74

The data exposed from the event is very useful. Unfortunately, even with the current raft tuning, we are seeing leader elections. I triaged this BZ before, and at the time the load balancers were causing the test to fail on compact clusters. Since this test has never passed to my knowledge, and support for compact is pending for Azure, I am moving this to the Installer team for clarification. Once support exists we can try to resolve it.
The fix for compact was merged on May 7, and since May 8 `HTPasswd IDP should successfully configure htpasswd and be responsive` is no longer consistently failing: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-azure-compact-4.5&sort-by-flakiness=10
There is one failure: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.5/81
```
fail [github.com/openshift/origin/test/extended/util/client.go:693]: May 10 15:20:30.771: the server is currently unable to handle the request (get users.user.openshift.io e2e-test-htpasswd-idp-7pv5k-user)
```
which could be etcd stalling the kube-apiserver. So moving back to etcd to triage whether that's so; otherwise close it as a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1794839
We have made some improvements to reduce the number of leader elections during an upgrade and regular installations. Can the reporter verify if the etcd failures are still found (not the problem reported in BZ 1794839)?
We took a look at the isolated failure[1] and noticed a few things:

1. etcd seemed to be okay, although the operator was misreporting one of the members as unhealthy
2. the oauth server couldn't talk to the openshift apiserver, but it's not entirely clear why

   2020-05-10T15:25:55.982108229Z I0510 15:25:55.982033 1 log.go:172] http: TLS handshake error from 10.128.0.50:51094: read tcp 10.129.0.42:6443->10.128.0.50:51094: read: connection timed out

3. the openshift apiserver logs didn't reveal much of interest about why
4. the kube apiserver logs didn't either

Given that failure was part of a huge vertical cascade of other test failures, we'd like to continue observing to see if things remain stable. More analysis is required to root cause the last specific failure, but it's not at all clear yet there's an etcd issue there. Can't rule out networking yet.

Moving to 4.6.

[1] https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.5/81
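To check how widespread the handshake timeouts quoted above are in a given log, something like the following might help. It is only a minimal Python sketch: the regular expression is modeled on the single log line quoted in this comment, and the function names `parse_handshake_error` and `count_by_client` are hypothetical helpers, not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Assumption: lines follow the shape of the one excerpt quoted in this bug, i.e.
# "http: TLS handshake error from <client>:<port>: read tcp <server>:<port>-><client>:<port>: <reason>"
HANDSHAKE_RE = re.compile(
    r"http: TLS handshake error from (?P<client>[\d.]+):(?P<client_port>\d+): "
    r"read tcp (?P<server>[\d.]+):(?P<server_port>\d+)->[\d.]+:\d+: (?P<reason>.+)$"
)


def parse_handshake_error(line):
    """Return a dict of client/server/reason fields, or None if the line does not match."""
    match = HANDSHAKE_RE.search(line.strip())
    if match is None:
        return None
    return match.groupdict()


def count_by_client(lines):
    """Count handshake errors per (client IP, reason) pair across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_handshake_error(line)
        if parsed is not None:
            counts[(parsed["client"], parsed["reason"])] += 1
    return counts
```

Feeding the lines of the relevant pod log into `count_by_client` would show whether the timeouts cluster on one client IP, which might help narrow down whether networking is involved.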
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days