Description of problem:

During an upgrade from 4.7.32 to 4.8.28, a large number of pods on the cluster failed probes and were killed multiple times. Picking one pod, it looks like it reattached to the SDN fine:

./var/log/pods/openshift-sdn_sdn-6kswb_6b203465-74aa-4bad-bf9e-4cebf2d6f9f5/sdn/0.log:2022-02-07T17:54:43.652746769+00:00 stderr F I0207 17:54:43.652718 2569999 node.go:457] Reattaching pod 'e69aae-prod/formsflowai-1' to SDN
./var/log/pods/openshift-sdn_sdn-6kswb_6b203465-74aa-4bad-bf9e-4cebf2d6f9f5/sdn/0.log:2022-02-07T17:54:44.495121899+00:00 stderr F I0207 17:54:44.495024 2569999 pod.go:508] CNI_ADD e69aae-prod/formsflowai-1 got IP 10.97.88.69, ofport 3

Probes start failing here:

./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:06 app-33.dmz hyperkube[2873]: I0207 17:56:06.889468 2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:16 app-33.dmz hyperkube[2873]: I0207 17:56:16.890603 2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:26 app-33.dmz hyperkube[2873]: I0207 17:56:26.889115 2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:36 app-33.dmz hyperkube[2873]: I0207 17:56:36.889232 2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:40 app-33.dmz hyperkube[2873]: I0207 17:56:40.477176 2873 kubelet.go:1957] SyncLoop (PLEG): "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f)", event: &pleg.PodLifecycleEvent{ID:"fd27d089-b58d-42d7-bcf2-e52f689b961f", Type:"ContainerDied", Data:"e2f08e9e2e02648cab71f2c43b0eaa8e03120e3de3f1a7ea77de4436bf59cde5"}
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:42 app-33.dmz hyperkube[2873]: I0207 17:56:42.557766 2873 kubelet.go:1957] SyncLoop (PLEG): "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f)", event: &pleg.PodLifecycleEvent{ID:"fd27d089-b58d-42d7-bcf2-e52f689b961f", Type:"ContainerStarted", Data:"e90b80fe75131bda9fffed4405ff3cf74597f71088ed60b6d194e6583eb79a91"}
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:46 app-33.dmz hyperkube[2873]: I0207 17:56:46.887203 2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:56 app-33.dmz hyperkube[2873]: I0207 17:56:56.887232 2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:57:06 app-33.dmz hyperkube[2873]: I0207 17:57:06.887499 2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:57:16 app-33.dmz hyperkube[2873]: I0207 17:57:16.916475 2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500

This seems to be triggered by a lot of other pods having context-deadline-exceeded issues with their probes:

./journalctl_--no-pager:Feb 07 17:58:57 app-33.dmz hyperkube[2873]: I0207 17:58:57.915325 2873 prober.go:117] Startup probe for "pr-1848-webapi-15-qnxv7_9c33a9-dev(75a47973-2984-4976-a620-0cea6fd35346):pr-1848-webapi" failed (failure): Get "http://10.97.89.11:8080/api/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
./journalctl_--no-pager:Feb 07 18:13:42 app-33.dmz hyperkube[2873]: I0207 18:13:42.061162 2873 prober.go:117] Liveness probe for "docman-51-6t4ng_4c2ba9-prod(1b359cd6-b8fd-47b5-8da8-6e307368b277):app" failed (failure): Get "http://10.97.89.203:5001/document-manager/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
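For context, both failure modes above are just the two ways an HTTP probe can fail: a response with a status outside 2xx/3xx (the 500s from postgres), or no response headers within the probe timeout (the context-deadline-exceeded errors). A minimal Go sketch of that check follows; it is not the kubelet's prober code, the URL is copied from the startup probe failure above, and the 1s timeout assumes the default timeoutSeconds.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeHTTP returns nil on success (2xx/3xx), otherwise an error describing the failure.
func probeHTTP(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		// A slow or wedged endpoint ends up here, e.g.
		//   Get "http://10.97.89.11:8080/api/healthcheck": context deadline exceeded
		//   (Client.Timeout exceeded while awaiting headers)
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 200 && resp.StatusCode < 400 {
		return nil // probe success
	}
	// An unhealthy-but-responding endpoint ends up here, e.g. the statuscode 500 failures above.
	return fmt.Errorf("HTTP probe failed with statuscode: %d", resp.StatusCode)
}

func main() {
	// URL taken from the startup probe log above; timeout assumed to be the 1s default.
	if err := probeHTTP("http://10.97.89.11:8080/api/healthcheck", 1*time.Second); err != nil {
		fmt.Println("probe failed:", err)
	}
}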
There were also a number of 'the server has received too many requests' messages on secrets:

./journalctl_--no-pager:Feb 07 17:51:17 app-33.dmz hyperkube[2873]: E0207 17:51:17.248954 2873 reflector.go:138] object-"e69aae-prod"/"formsflowai-edb-app-user": Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)

Looking at some stats, we see some very large spikes in p&f currently-executing requests for global-default and service-accounts:workload-low; they went from pretty much zero to 150-200. A number of the others have smaller increases, but are higher in 4.8 than in 4.7. We also saw a large spike in p&f rejected requests, around 7200, right when the upgrade started, for system-nodes:system, but it flattened out. There were also a handful for openshift-sdn:system. Requests in flight have gone up significantly for read-only requests. (A sketch of pulling these p&f metrics is included after the fields below.)

Version-Release number of selected component (if applicable):
4.8.28

How reproducible:
Once, during the upgrade

Steps to Reproduce:
1.
2.
3.

Actual results:
Lots of pod instability and restarting
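For reference, the p&f figures discussed above come from the API server's apiserver_flowcontrol_* metrics (currently executing per priority level, plus apiserver_flowcontrol_rejected_requests_total for the rejections). A hedged Go sketch of pulling the currently-executing counts from the cluster Prometheus; the route URL is a placeholder and authentication is omitted:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder Prometheus route; a real query also needs a bearer token.
	prom := "https://prometheus-k8s-openshift-monitoring.apps.example.com"
	// Currently-executing requests per p&f priority level.
	query := "sum by (priority_level) (apiserver_flowcontrol_current_executing_requests)"

	resp, err := http.Get(prom + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	// JSON result keyed by priority_level, e.g. global-default, workload-low, system.
	fmt.Println(string(body))
}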
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069