Description of problem:

release-openshift-ocp-installer-e2e-metal-compact-4.7 has been failing for the past month on the test:

openshift-tests.[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]

The first job that failed [0] was 12/02 and the previous job that passed this test is here [1]. Looks like the job only runs every 2 days. The failure is not happening in 4.6 [2] or 4.5 [3].

A couple of things I noticed:

[4] the kube-rbac-proxy container in the cluster-monitoring-operator pod complains about initializing the certificate reloader:

  Failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory
  goroutine 1 [running]:

[5] I see "summary: Half or more of the Alertmanager instances within the same cluster are crashlooping." in the configmaps.json artifact [6], and the alertmanager-main-2 log [5] has this warn/error:

  level=warn ts=2020-12-16T20:16:55.534Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-2.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n"

That warn/error message is also there in a 4.6 job [7], but only at the beginning of the log, and eventually it seems to settle.

[0] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1334219109352607744
[1] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1333494216650657792
[2] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-metal-compact-4.6
[3] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-ocp-installer-e2e-metal-compact-4.5
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/pods/openshift-monitoring_cluster-monitoring-operator-74dbf47bff-hgccp_kube-rbac-proxy_previous.log
[5] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/pods/openshift-monitoring_alertmanager-main-2_alertmanager.log
[6] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/configmaps.json
[7] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1341603354345738240/artifacts/e2e-metal/pods/openshift-monitoring_alertmanager-main-1_alertmanager.log

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Seems that as long as the job gets off the ground, this will fail every time. There were two hiccups recently where the job didn't really start.
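Both observations above can be re-checked straight from the job artifacts. A quick sketch (it assumes curl against the gcsweb links returns the raw file contents, the same way pods.json is fetched further down):

$ # count occurrences of the reloader fatal in the CMO kube-rbac-proxy previous log [4]
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/pods/openshift-monitoring_cluster-monitoring-operator-74dbf47bff-hgccp_kube-rbac-proxy_previous.log \
    | grep -c 'Failed to initialize certificate reloader'

$ # search every string value in configmaps.json [6] for the Alertmanager alert summary
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/configmaps.json \
    | jq -r '.. | strings' | grep -i 'crashlooping'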
Confirming the tls.crt crashloop cause in one of those jobs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1341603354345738240/artifacts/e2e-metal/pods.json | jq -r '
    .items[]
    | .metadata as $m
    | .status.containerStatuses[]
    | select(.restartCount > 2)
    | $m.namespace + " " + $m.name + " " + .name + " " + (.restartCount | tostring) + "\n" + .lastState.terminated.message
  '
openshift-monitoring cluster-monitoring-operator-5f697b97cf-zxk8z kube-rbac-proxy 3
I1223 05:08:52.636459 1 main.go:188] Valid token audiences:
I1223 05:08:52.636553 1 main.go:261] Reading certificate files
F1223 05:08:52.636578 1 main.go:265] Failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory
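Side note: the same query can also surface why a container died, since containerStatuses[].lastState.terminated carries standard exitCode, reason, and finishedAt fields. Something like this (a sketch against the same pods.json):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1341603354345738240/artifacts/e2e-metal/pods.json | jq -r '
    .items[]
    | .metadata as $m
    | .status.containerStatuses[]
    | select(.restartCount > 2)
    | [$m.namespace, $m.name, .name,
       (.restartCount | tostring),
       (.lastState.terminated.exitCode | tostring),
       (.lastState.terminated.reason // "-"),
       (.lastState.terminated.finishedAt // "-")]
    | @tsv
  '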
Ah, looking at one of the 4.7 jobs [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1334219109352607744/artifacts/e2e-metal/pods.json | jq -r '
    .items[]
    | .metadata as $m
    | .status.containerStatuses[]
    | select(.restartCount > 2)
    | $m.namespace + " " + $m.name + " " + .name + " " + (.restartCount | tostring) + "\n" + .lastState.terminated.message
  '
openshift-controller-manager-operator openshift-controller-manager-operator-7d84c45df8-25l9v openshift-controller-manager-operator 3

openshift-etcd-operator etcd-operator-5f8d959d79-vhjxg etcd-operator 3

openshift-kube-controller-manager-operator kube-controller-manager-operator-696fffdf8-2mvf5 kube-controller-manager-operator 3

openshift-kube-scheduler openshift-kube-scheduler-master-2.ci-op-44cfs12v-22d79.origin-ci-int-aws.dev.rhcloud.com kube-scheduler-recovery-controller 7
127.0.0.1:10443 0.0.0.0:* ' ']'
+ sleep 1
++ ss -Htanop '(' sport = 10443 ')'
+ '[' -n 'LISTEN 0 128 127.0.0.1:10443 0.0.0.0:* ' ']'
+ sleep 1
...
++ ss -Htanop '(' sport = 10443 ')'
+ '[' -n 'LISTEN 0 128 127.0.0.1:10443 0.0.0.0:* ' ']'
+ sleep 1

openshift-kube-storage-version-migrator-operator kube-storage-version-migrator-operator-c87ff95dc-4k6jh kube-storage-version-migrator-operator 3

*That* is bug 1908145. I'm going to guess this is a dup of that one (which is still POST), since this is a compact cluster. We can re-open if I'm wrong ;).

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1334219109352607744

*** This bug has been marked as a duplicate of bug 1908145 ***
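Addendum for anyone mapping this back to bug 1908145: the "++ ss ..." / "+ sleep 1" lines in the kube-scheduler-recovery-controller message above are a set -x trace of a wait loop, roughly the shape below (an assumed reconstruction from the trace, not the actual pod command). It keeps looping as long as something is still listening on 10443, which is exactly what the repeated LISTEN lines show:

# Assumed reconstruction of the loop behind the set -x trace above;
# not the actual manifest command from the kube-scheduler pod.
while [ -n "$(ss -Htanop '(' sport = 10443 ')')" ]; do
  # port 10443 is still bound by another listener; wait and re-check
  sleep 1
done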