Bug 1820800 - release-openshift-origin-installer-e2e-azure-compact-4.4: Regularly Failing: ingress routes not available
Summary: release-openshift-origin-installer-e2e-azure-compact-4.4: Regularly Failing: ingress routes not available
Keywords:
Status: CLOSED DUPLICATE of bug 1794839
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-03 22:36 UTC by Kirsten Garrison
Modified: 2020-04-15 21:39 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-15 21:18:41 UTC
Target Upstream Version:
Embargoed:



Description Kirsten Garrison 2020-04-03 22:36:02 UTC
Description of problem:
This release-informing Azure test is in a permafail state: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.4

Seeing the following errors in these failed runs:
fail [k8s.io/kubernetes/test/e2e/framework/volume/fixtures.go:390]: Unexpected error:
    <*errors.StatusError | 0xc0015cde00>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "etcdserver: request timed out",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    etcdserver: request timed out
occurred

Apr 02 23:11:06.343 E kube-apiserver Kube API started failing: etcdserver: leader changed
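
For context, both of those messages reach clients as HTTP 500 StatusErrors from the apiserver. A minimal, hypothetical Go sketch (not part of the e2e suite) of how a triage helper could recognize these transient etcd failures, assuming the k8s.io/apimachinery module is available:

    // Hypothetical triage helper, not part of the e2e suite: the failures above
    // surface to clients as *apierrors.StatusError values with HTTP code 500.
    package main

    import (
        "fmt"
        "strings"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // isTransientEtcdError reports whether err matches the 500-level etcd
    // failures seen in the failed runs.
    func isTransientEtcdError(err error) bool {
        statusErr, ok := err.(*apierrors.StatusError)
        if !ok {
            return false
        }
        status := statusErr.Status()
        return status.Code == 500 &&
            (strings.Contains(status.Message, "etcdserver: request timed out") ||
                strings.Contains(status.Message, "etcdserver: leader changed"))
    }

    func main() {
        // Reconstruct the error from the e2e log above for illustration.
        err := &apierrors.StatusError{ErrStatus: metav1.Status{
            Status:  metav1.StatusFailure,
            Message: "etcdserver: request timed out",
            Code:    500,
        }}
        fmt.Println(isTransientEtcdError(err)) // prints "true"
    }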

Examples: 
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.4/70/artifacts/e2e-azure/e2e.log


Also see:
2020-04-02 23:19:44.148718 W | wal: sync duration of 1.204969s, expected less than 1s
2020-04-02 23:19:44.148793 W | etcdserver: failed to send out heartbeat on time (exceeded the 500ms timeout for 205.9901ms, to e2fe3bf11f2491c5)
2020-04-02 23:19:44.148801 W | etcdserver: server is likely overloaded
2020-04-02 23:19:44.148808 W | etcdserver: failed to send out heartbeat on time (exceeded the 500ms timeout for 206.0069ms, to 382c74853240f5cf)
2020-04-02 23:19:44.148812 W | etcdserver: server is likely overloaded

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.4/72/artifacts/e2e-azure/pods/openshift-etcd_etcd-ci-op-02852rlc-5ab2f-4w8zd-master-0_etcd.log

And a lot of:
2020-04-02 23:01:50.926685 W | etcdserver: read-only range request "key:\"/kubernetes.io/networkpolicies/e2e-test-router-scoped-d2ftl/\" range_end:\"/kubernetes.io/networkpolicies/e2e-test-router-scoped-d2ftl0\" " with result "range_response_count:0 size:6" took too long (105.9526ms) to execute
2020-04-02 23:01:51.556599 W | etcdserver: read-only range request "key:\"/kubernetes.io/rolebindings/openshift-machine-api/cluster-autoscaler-operator\" " with result "range_response_count:1 size:405" took too long (106.0961ms) to execute
2020-04-02 23:01:51.556716 W | etcdserver: read-only range request "key:\"/kubernetes.io/monitoring.coreos.com/prometheusrules/openshift-marketplace/marketplace-alert-rules\" " with result "range_response_count:1 size:3965" took too long (105.7011ms) to execute

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.4/72/artifacts/e2e-azure/pods/openshift-etcd_etcd-ci-op-02852rlc-5ab2f-4w8zd-master-1_etcd.log
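
Both linked etcd logs show the same slow-fsync / missed-heartbeat / "took too long" pattern. A rough stand-alone Go sketch (hypothetical, stdlib only) for counting those warnings in a downloaded pod log:

    // Rough triage sketch (hypothetical, stdlib only): count the slow-disk
    // symptoms in a downloaded etcd pod log such as the ones linked above.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open(os.Args[1]) // path to a downloaded etcd pod log
        if err != nil {
            panic(err)
        }
        defer f.Close()

        patterns := map[string]string{
            "slow wal fsync (>1s)":      "wal: sync duration",
            "missed heartbeat (>500ms)": "failed to send out heartbeat on time",
            "server likely overloaded":  "server is likely overloaded",
            "slow range request":        "took too long",
        }
        counts := map[string]int{}

        scanner := bufio.NewScanner(f)
        scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // etcd log lines can be long
        for scanner.Scan() {
            line := scanner.Text()
            for label, needle := range patterns {
                if strings.Contains(line, needle) {
                    counts[label]++
                }
            }
        }
        if err := scanner.Err(); err != nil {
            panic(err)
        }
        for label, n := range counts {
            fmt.Printf("%-28s %d\n", label, n)
        }
    }

Running it against the master-0 and master-1 logs above gives a quick per-symptom count for each member.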

Another run from Feb: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.4/53/artifacts/e2e-azure/e2e.log

I did see https://bugzilla.redhat.com/show_bug.cgi?id=1819907, but I'm unsure whether this is a dupe.

Comment 1 Stefan Schimanski 2020-04-07 12:42:26 UTC
This is a 4.4 blocker. Moving back to 4.4 release.

Comment 2 Daneyon Hansen 2020-04-10 18:20:18 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24858/pull-ci-openshift-origin-master-e2e-gcp-builds/1845 is failing and I see:

2020-04-09T21:34:46.03330722Z 2020-04-09 21:34:46.033252 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/serviceaccount-ca\" " with result "range_response_count:1 size:6299" took too long (456.196812ms) to execute

xref https://bugzilla.redhat.com/show_bug.cgi?id=1817588#c2, where I see other etcd/apiserver errors, so we may have a dupe.

Comment 3 Sam Batschelet 2020-04-14 00:32:25 UTC
Tests seem to be failing more recently because of a failure in the image registry operator.

> E0413 22:43:19.318721      14 controller.go:252] unable to sync: Config.imageregistry.operator.openshift.io "cluster" is invalid: spec.storage.azure.container: Invalid value: "": spec.storage.azure.container in body should be at least 3 chars long, requeuing


This was fixed in 4.5 but also needs to be backported to 4.4:
https://bugzilla.redhat.com/show_bug.cgi?id=1823590
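
For reference, the check it trips corresponds to the Azure blob-container naming constraint (3-63 characters, lowercase letters, digits, and hyphens), which the empty spec.storage.azure.container value violates. An illustrative Go approximation, not the operator's actual validation code:

    // Illustrative only, not the operator's actual validation: approximate the
    // Azure blob-container naming rules (3-63 chars, lowercase letters, digits,
    // hyphens) that the empty spec.storage.azure.container value fails.
    package main

    import (
        "fmt"
        "regexp"
    )

    var azureContainerName = regexp.MustCompile(`^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$`)

    func validAzureContainer(name string) bool {
        return azureContainerName.MatchString(name)
    }

    func main() {
        fmt.Println(validAzureContainer(""))               // false: the value in the failing config
        fmt.Println(validAzureContainer("image-registry")) // true
    }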

