Bug 2052332 - Probe failures and pod restarts during 4.7 to 4.8 upgrade
Summary: Probe failures and pod restarts during 4.7 to 4.8 upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Mohamed Mahmoud
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 2087763
 
Reported: 2022-02-09 01:30 UTC by Matthew Robson
Modified: 2022-10-18 06:39 UTC
CC: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2087763 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:48:33 UTC
Target Upstream Version:
Embargoed:
mmahmoud: needinfo-
mmahmoud: needinfo-


Attachments


Links
- Github openshift/sdn pull 411 (open): Bug 2052332: use pods cache during reattach processing (last updated 2022-05-16 15:20:15 UTC)
- Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:49:09 UTC)

Description Matthew Robson 2022-02-09 01:30:19 UTC
Description of problem:

During an upgrade from 4.7.32 to 4.8.28, a large number of pods on the cluster failed probes and were killed multiple times.

Picking one pod as an example, it appears to have reattached to the SDN without issue:
./var/log/pods/openshift-sdn_sdn-6kswb_6b203465-74aa-4bad-bf9e-4cebf2d6f9f5/sdn/0.log:2022-02-07T17:54:43.652746769+00:00 stderr F I0207 17:54:43.652718 2569999 node.go:457] Reattaching pod 'e69aae-prod/formsflowai-1' to SDN
./var/log/pods/openshift-sdn_sdn-6kswb_6b203465-74aa-4bad-bf9e-4cebf2d6f9f5/sdn/0.log:2022-02-07T17:54:44.495121899+00:00 stderr F I0207 17:54:44.495024 2569999 pod.go:508] CNI_ADD e69aae-prod/formsflowai-1 got IP 10.97.88.69, ofport 3
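
For context, the fix eventually tracked in the linked openshift/sdn pull 411 takes the approach its title describes: resolve pod details from a local pod cache during reattach processing instead of querying the API server per pod. Below is a rough sketch of that general technique using a client-go informer lister; the helper name and flow are illustrative assumptions, not the actual openshift-sdn code.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	corev1listers "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// reattachPods is a hypothetical helper: it resolves each pod from the shared
// informer cache instead of issuing one GET per pod against the API server,
// which is the kind of per-pod API traffic that can trip priority-and-fairness
// throttling during an upgrade.
func reattachPods(lister corev1listers.PodLister, namespaces, names []string) {
	for i := range names {
		pod, err := lister.Pods(namespaces[i]).Get(names[i]) // served from the local cache
		if err != nil {
			fmt.Printf("skipping %s/%s: %v\n", namespaces[i], names[i], err)
			continue
		}
		fmt.Printf("reattaching pod %s/%s with IP %s\n", pod.Namespace, pod.Name, pod.Status.PodIP)
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared informer populates the local pod cache once; lookups after the
	// initial sync do not hit the API server.
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	podLister := factory.Core().V1().Pods().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	reattachPods(podLister, []string{"e69aae-prod"}, []string{"formsflowai-1"})
}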

Probes failing here:
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:06 app-33.dmz hyperkube[2873]: I0207 17:56:06.889468    2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:16 app-33.dmz hyperkube[2873]: I0207 17:56:16.890603    2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:26 app-33.dmz hyperkube[2873]: I0207 17:56:26.889115    2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:36 app-33.dmz hyperkube[2873]: I0207 17:56:36.889232    2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:40 app-33.dmz hyperkube[2873]: I0207 17:56:40.477176    2873 kubelet.go:1957] SyncLoop (PLEG): "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f)", event: &pleg.PodLifecycleEvent{ID:"fd27d089-b58d-42d7-bcf2-e52f689b961f", Type:"ContainerDied", Data:"e2f08e9e2e02648cab71f2c43b0eaa8e03120e3de3f1a7ea77de4436bf59cde5"}

./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:42 app-33.dmz hyperkube[2873]: I0207 17:56:42.557766    2873 kubelet.go:1957] SyncLoop (PLEG): "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f)", event: &pleg.PodLifecycleEvent{ID:"fd27d089-b58d-42d7-bcf2-e52f689b961f", Type:"ContainerStarted", Data:"e90b80fe75131bda9fffed4405ff3cf74597f71088ed60b6d194e6583eb79a91"}
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:46 app-33.dmz hyperkube[2873]: I0207 17:56:46.887203    2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:56:56 app-33.dmz hyperkube[2873]: I0207 17:56:56.887232    2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:57:06 app-33.dmz hyperkube[2873]: I0207 17:57:06.887499    2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500
./sos_commands/logs/journalctl_--no-pager:Feb 07 17:57:16 app-33.dmz hyperkube[2873]: I0207 17:57:16.916475    2873 prober.go:117] Readiness probe for "formsflowai-1_e69aae-prod(fd27d089-b58d-42d7-bcf2-e52f689b961f):postgres" failed (failure): HTTP probe failed with statuscode: 500

This seems to be triggered by many other pods hitting 'context deadline exceeded' errors on their probes:

./journalctl_--no-pager:Feb 07 17:58:57 app-33.dmz hyperkube[2873]: I0207 17:58:57.915325    2873 prober.go:117] Startup probe for "pr-1848-webapi-15-qnxv7_9c33a9-dev(75a47973-2984-4976-a620-0cea6fd35346):pr-1848-webapi" failed (failure): Get "http://10.97.89.11:8080/api/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
./journalctl_--no-pager:Feb 07 18:13:42 app-33.dmz hyperkube[2873]: I0207 18:13:42.061162    2873 prober.go:117] Liveness probe for "docman-51-6t4ng_4c2ba9-prod(1b359cd6-b8fd-47b5-8da8-6e307368b277):app" failed (failure): Get "http://10.97.89.203:5001/document-manager/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
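
The "(Client.Timeout exceeded while awaiting headers)" suffix is the wording Go's net/http client produces when its overall timeout expires before response headers arrive; the kubelet prober then logs that error verbatim. A self-contained sketch that reproduces the same error text (the slow handler and 2-second timeout are made up for illustration):

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A deliberately slow health endpoint: it sleeps longer than the client's timeout.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	// A probe-style client with a short overall timeout, analogous to a probe's timeoutSeconds.
	client := &http.Client{Timeout: 2 * time.Second}

	_, err := client.Get(srv.URL + "/api/healthcheck")
	// Prints an error ending in "context deadline exceeded (Client.Timeout
	// exceeded while awaiting headers)", matching the journal entries above.
	fmt.Println(err)
}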

There were also a number of 'the server has received too many requests' messages on secret watches:

./journalctl_--no-pager:Feb 07 17:51:17 app-33.dmz hyperkube[2873]: E0207 17:51:17.248954    2873 reflector.go:138] object-"e69aae-prod"/"formsflowai-edb-app-user": Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)
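
That reflector message corresponds to an HTTP 429 from API Priority and Fairness: the apiserver rejects the request and asks the client to retry later. A small sketch of how that condition is represented and detected with apimachinery's errors package (the error is constructed locally here rather than coming from a live apiserver):

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func main() {
	// What the apiserver effectively returns when a priority level is saturated:
	// a 429 status asking the client to retry after a delay (1 second here).
	err := apierrors.NewTooManyRequests("the server has received too many requests and has asked us to try again later", 1)

	// client-go reflectors log the message and back off; callers can detect the
	// condition like this.
	if apierrors.IsTooManyRequests(err) {
		fmt.Println("throttled by the API server:", err)
	}
}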

Looking at some stats, we see very large spikes in API Priority and Fairness (p&f) currently-executing requests for global-default and service-accounts:workload-low: they went from essentially zero to 150-200. A number of the others show smaller increases, but are higher in 4.8 than in 4.7.

We also saw a large spike in p&f rejected requests, around 7200, for system-nodes:system right when the upgrade started, but it flattened out. There were also a handful for openshift-sdn:system.

Requests in flight have also gone up significantly for read-only requests.
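
For anyone trying to reproduce these observations, the sketch below queries what I believe are the corresponding kube-apiserver metrics via the Prometheus Go client; the endpoint, auth handling, and exact label selections are assumptions to adapt to your cluster.

package main

import (
	"context"
	"fmt"
	"time"

	promapi "github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed endpoint: the in-cluster monitoring stack normally also needs a
	// bearer token and TLS config; adjust for your environment.
	client, err := promapi.NewClient(promapi.Config{Address: "https://prometheus-k8s.openshift-monitoring.svc:9091"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Queries roughly matching the observations above; the label values come
	// from the flow schemas / priority levels named in this report.
	queries := []string{
		`apiserver_flowcontrol_current_executing_requests{priority_level="global-default"}`,
		`apiserver_flowcontrol_current_executing_requests{priority_level="workload-low"}`,
		`sum(rate(apiserver_flowcontrol_rejected_requests_total[5m])) by (flow_schema, priority_level)`,
		`apiserver_current_inflight_requests{request_kind="readOnly"}`,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	for _, q := range queries {
		val, _, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			fmt.Println(q, "error:", err)
			continue
		}
		fmt.Println(q, "=>", val)
	}
}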

Version-Release number of selected component (if applicable):
4.8.28

How reproducible:
Once during upgrade

Steps to Reproduce:
1.
2.
3.

Actual results:
Lots of pod instability and restarts during the upgrade.

Comment 23 errata-xmlrpc 2022-08-10 10:48:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

