Description of problem:

release-openshift-ocp-installer-e2e-metal-compact-4.7 has been failing for the past month on the test:

openshift-tests.[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]

The first job that failed [0] was 12/02 and the previous job that passed this test is here [1]. Looks like the job only runs every 2 days. The failure is not happening in 4.6 [2] or 4.5 [3].

A couple of things I noticed:

[4] the kube-rbac-proxy container in the cluster-monitoring-operator pod complains about initializing the certificate reloader:

  Failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory
  goroutine 1 [running]:

[5] I see "summary: Half or more of the Alertmanager instances within the same cluster are crashlooping." in the configmaps.json artifact [6], and the alertmanager-main-2 log [5] has this warn/error:

  level=warn ts=2020-12-16T20:16:55.534Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-2.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n"

That warn/error message is also there in a 4.6 job [7], but only at the beginning of the log, and eventually it seems to settle.

[0] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1334219109352607744
[1] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1333494216650657792
[2] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-metal-compact-4.6
[3] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-ocp-installer-e2e-metal-compact-4.5
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/pods/openshift-monitoring_cluster-monitoring-operator-74dbf47bff-hgccp_kube-rbac-proxy_previous.log
[5] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/pods/openshift-monitoring_alertmanager-main-2_alertmanager.log
[6] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/configmaps.json
[7] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1341603354345738240/artifacts/e2e-metal/pods/openshift-monitoring_alertmanager-main-1_alertmanager.log

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Seems that as long as the job gets off the ground, this will fail every time. There were two hiccups recently where the job didn't really start.
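Both observations above can be re-checked straight from the job artifacts. A quick sketch (it assumes curl against the gcsweb links returns the raw file contents, the same way pods.json is fetched further down):

$ # count occurrences of the reloader fatal in the CMO kube-rbac-proxy previous log [4]
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/pods/openshift-monitoring_cluster-monitoring-operator-74dbf47bff-hgccp_kube-rbac-proxy_previous.log \
    | grep -c 'Failed to initialize certificate reloader'

$ # search every string value in configmaps.json [6] for the Alertmanager alert summary
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1339294198414708736/artifacts/e2e-metal/configmaps.json \
    | jq -r '.. | strings' | grep -i 'crashlooping'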
Confirming the tls.crt crashloop cause in one of those jobs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1341603354345738240/artifacts/e2e-metal/pods.json | jq -r '
    .items[]
    | .metadata as $m
    | .status.containerStatuses[]
    | select(.restartCount > 2)
    | $m.namespace + " " + $m.name + " " + .name + " " + (.restartCount | tostring) + "\n" + .lastState.terminated.message
  '
openshift-monitoring cluster-monitoring-operator-5f697b97cf-zxk8z kube-rbac-proxy 3
I1223 05:08:52.636459 1 main.go:188] Valid token audiences:
I1223 05:08:52.636553 1 main.go:261] Reading certificate files
F1223 05:08:52.636578 1 main.go:265] Failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory
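Side note: the same query can also surface why a container died, since containerStatuses[].lastState.terminated carries standard exitCode, reason, and finishedAt fields. Something like this (a sketch against the same pods.json):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1341603354345738240/artifacts/e2e-metal/pods.json | jq -r '
    .items[]
    | .metadata as $m
    | .status.containerStatuses[]
    | select(.restartCount > 2)
    | [$m.namespace, $m.name, .name,
       (.restartCount | tostring),
       (.lastState.terminated.exitCode | tostring),
       (.lastState.terminated.reason // "-"),
       (.lastState.terminated.finishedAt // "-")]
    | @tsv
  '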
Ah, looking at one of the 4.7 jobs [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1334219109352607744/artifacts/e2e-metal/pods.json | jq -r '
    .items[]
    | .metadata as $m
    | .status.containerStatuses[]
    | select(.restartCount > 2)
    | $m.namespace + " " + $m.name + " " + .name + " " + (.restartCount | tostring) + "\n" + .lastState.terminated.message
  '
openshift-controller-manager-operator openshift-controller-manager-operator-7d84c45df8-25l9v openshift-controller-manager-operator 3

openshift-etcd-operator etcd-operator-5f8d959d79-vhjxg etcd-operator 3

openshift-kube-controller-manager-operator kube-controller-manager-operator-696fffdf8-2mvf5 kube-controller-manager-operator 3

openshift-kube-scheduler openshift-kube-scheduler-master-2.ci-op-44cfs12v-22d79.origin-ci-int-aws.dev.rhcloud.com kube-scheduler-recovery-controller 7
127.0.0.1:10443 0.0.0.0:* ' ']'
+ sleep 1
++ ss -Htanop '(' sport = 10443 ')'
+ '[' -n 'LISTEN 0 128 127.0.0.1:10443 0.0.0.0:* ' ']'
+ sleep 1
...
++ ss -Htanop '(' sport = 10443 ')'
+ '[' -n 'LISTEN 0 128 127.0.0.1:10443 0.0.0.0:* ' ']'
+ sleep 1

openshift-kube-storage-version-migrator-operator kube-storage-version-migrator-operator-c87ff95dc-4k6jh kube-storage-version-migrator-operator 3

*That* is bug 1908145. I'm going to guess this is a dup of that one (which is still POST), since this is a compact cluster. We can re-open if I'm wrong ;).

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.7/1334219109352607744

*** This bug has been marked as a duplicate of bug 1908145 ***
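Addendum for anyone mapping this back to bug 1908145: the "++ ss ..." / "+ sleep 1" lines in the kube-scheduler-recovery-controller message above are a set -x trace of a wait loop, roughly the shape below (an assumed reconstruction from the trace, not the actual pod command). It keeps looping as long as something is still listening on 10443, which is exactly what the repeated LISTEN lines show:

# Assumed reconstruction of the loop behind the set -x trace above;
# not the actual manifest command from the kube-scheduler pod.
while [ -n "$(ss -Htanop '(' sport = 10443 ')')" ]; do
  # port 10443 is still bound by another listener; wait and re-check
  sleep 1
done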