+++ This bug was initially created as a clone of Bug #1814458 +++

One of the more common 4.5 failure modes in the past 24h:

$ curl -s 'https://search.svc.ci.openshift.org/search?name=^release-openshift-ocp-installer-.*-4.5&search=promQL+query:+count_over_time.*reported+incorrect+results&type=build-log&maxAge=24h&context=0' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric.alertname' | sort | uniq -c | sort -n | tail -n3
     17 TargetDown
     33 KubePodCrashLooping
    107 FailingOperator

Reasonably well distributed over our flavors:

$ curl -s 'https://search.svc.ci.openshift.org/search?name=^release-openshift-ocp-installer-.*-4.5&search=promQL+query:+count_over_time.*reported+incorrect+results.*KubePodCrashLooping&maxAge=24h' | jq -r '. | keys[]' | sed 's|/[^/]*$||' | sort | uniq -c
     12 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5
      6 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.5
      4 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5
      2 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5
      1 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.5
      1 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5
      1 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.5
      4 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.5
      2 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.5

AWS jobs to dig into:

$ curl -s 'https://search.svc.ci.openshift.org/search?name=^release-openshift-ocp-installer-e2e-aws-4.5&search=promQL+query:+count_over_time.*reported+incorrect+results.*KubePodCrashLooping&maxAge=24h' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/435
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/436
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/444
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/447
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/449
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/451
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/455
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/457
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/471
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/473
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/478
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/479

Picking the most recent (479), it is also impacted by bug 1812261 (iptables segfaulting) and bug 1785023 (ResourceQuota life of a secret). From the pod JSON [1], I don't see a pod in CrashLoopBackOff, and none of the restartCount values seem higher than 2, so I'm not sure what is crashing or causing the crashes.
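
For anyone repeating the pod JSON check, here is a minimal sketch of one way to do it. It assumes pods.json is a standard Kubernetes v1 PodList (items[].status.containerStatuses[]) and that jq 1.5 or later is installed. The function name list_restarting_containers and its threshold argument are made up for illustration and are not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical helper: read a PodList on stdin and print one tab-separated line
# (namespace, pod, container, restartCount, waiting reason) for every container
# that restarted more than MAX times or is currently waiting in CrashLoopBackOff.
# Assumption: the input follows the standard v1 PodList layout.
list_restarting_containers() {
  max="${1:-2}"   # restart threshold; defaults to the "2" mentioned above
  jq -r --argjson max "$max" '
    .items[]
    | .metadata as $m
    | (.status.containerStatuses // [])[]
    | select(.restartCount > $max or .state.waiting.reason == "CrashLoopBackOff")
    | [$m.namespace, $m.name, .name, (.restartCount | tostring), (.state.waiting.reason // "-")]
    | @tsv
  '
}
```

With the artifact from [1] this could be run as: curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/479/artifacts/e2e-aws/pods.json | list_restarting_containers 2. Empty output would match the observation above. Note that pods.json is a snapshot, so pods that were deleted or replaced before it was gathered will not show up.
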
[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/479/artifacts/e2e-aws/pods.json

--- Additional comment from Stefan Schimanski on 2020-03-18 11:41:45 CET ---

curl -s 'https://search.svc.ci.openshift.org/search?name=^release-openshift-ocp-installer-.*-4.5&search=promQL+query:+count_over_time.*reported+incorrect+results&type=build-log&maxAge=24h&context=0' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric | select(.alertname == "KubePodCrashLooping") | (.namespace + "/" + .pod)' | sort | uniq -c | sort -n
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-547f4bb8b5-sb2lj
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-594d844779-whtx9
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-2jf85
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-8cgqd
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-8lkfx
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-99gr4
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-bknnk
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-czf9t
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-h5bgz
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-k5xzn
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-kcg98
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-l42g5
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-l94sh
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-m57tm
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-mtvpp
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-mxx8n
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-ps846
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-qrb7n
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-rsgdg
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-tq74j
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-tvhdn
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-65d54b4b4c-wrw6v
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-7cc54dbd4d-z4cl2
      1 openshift-csi-snapshot-controller/csi-snapshot-controller-7f9d66fc78-94q9p
      1 openshift-ovn-kubernetes/ovnkube-node-9vhvs
      1 openshift-ovn-kubernetes/ovnkube-node-v45nr
      1 openshift-ovn-kubernetes/ovnkube-node-vlwj6
      1 openshift-ovn-kubernetes/ovnkube-node-vmvzz
      1 openshift-sdn/sdn-controller-fp8nt
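
Because every pod name is unique, each line in the listing above counts 1, which hides how the failures cluster per workload. A rough sketch of collapsing pod names to their owning workload before counting follows. The suffix patterns are a heuristic assumption, not an API guarantee: a 5-character generated pod suffix, optionally preceded by an 8 to 10 character ReplicaSet hash, both drawn from the vowel-free alphabet Kubernetes uses for generated names. The function names are hypothetical.

```shell
#!/bin/sh
# Heuristic (assumption): strip a trailing generated pod suffix such as "-8cgqd",
# then a trailing ReplicaSet hash such as "-65d54b4b4c" if one is present.
# The character class excludes vowels so that ordinary words like "controller"
# are not mistaken for a hash.
strip_pod_suffix() {
  sed -E \
    -e 's/-[bcdfghjklmnpqrstvwxz2456789]{5}$//' \
    -e 's/-[bcdfghjklmnpqrstvwxz2456789]{8,10}$//'
}

# Read "namespace/pod" lines on stdin and print counts per "namespace/workload".
count_by_workload() {
  strip_pod_suffix | sort | uniq -c | sort -n
}
```

To use it, replace the trailing "sort | uniq -c | sort -n" of the pipeline above with "count_by_workload". The listing should then collapse to one line each for csi-snapshot-controller, ovnkube-node, and sdn-controller.
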
I checked past release-openshift-ocp-installer-e2e-aws-4.5 runs, and the iptables issue seems to be gone. I also don't see KubePodCrashLooping. Closing; if this recurs, please reopen.
Checking again (over the past 48h), I agree that the csi-snapshot-controller-..., ovnkube-node-..., and sdn-controller-... issues seem to be gone in CI:

$ curl -Ls 'https://search.svc.ci.openshift.org/search?search=promQL+query:+count_over_time.*reported+incorrect+results.*KubePodCrashLooping&type=junit&maxAge=48h&context=0' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric | select(.alertname == "KubePodCrashLooping").namespace' | sort | uniq -c | sort -n
parse error: Invalid numeric literal at line 47, column 1698
      1 openshift-cloud-credential-operator
      1 openshift-cluster-storage-operator
      1 openshift-kube-scheduler
      4 openshift-console
     43 openshift-kube-apiserver
     47 openshift-kube-controller-manager
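
The parse error suggests that at least one extracted line was not valid JSON. jq stops reading at the first such line, so the counts above may be incomplete. A possible workaround, sketched here on the assumption that each line produced by the sed stages is meant to be a standalone JSON array of alert samples, is to parse line by line and skip anything that fails. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical replacement for the final jq stage: read raw lines (-R), try to
# parse each one as JSON, and silently drop lines that do not parse instead of
# aborting the whole run. Non-array documents and entries without a metric
# object are ignored as well.
count_crashlooping_namespaces() {
  jq -R -r '
    fromjson?
    | arrays
    | .[]
    | .metric?
    | select(type == "object" and .alertname == "KubePodCrashLooping")
    | .namespace
  ' | sort | uniq -c | sort -n
}
```

It would take the place of the last "jq -r ... | sort | uniq -c | sort -n" part of the pipeline. Dropped lines are not reported, so it is still worth looking at the offending line (47 here) to see why the sed extraction produced invalid JSON.
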