Bug 1856250 - OpenShift router is failing readiness probe regularly
Summary: OpenShift router is failing readiness probe regularly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Andrew McDermott
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1870134
 
Reported: 2020-07-13 07:44 UTC by Simon Reber
Modified: 2024-06-13 22:47 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1870134
Environment:
Last Closed: 2020-08-19 12:11:08 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 25274 0 None closed Bug 1856250: UPSTREAM: 84792: Fixes for the `No ref for container` in probes after kubelet restart 2021-02-18 03:12:58 UTC
Red Hat Knowledge Base (Solution) 5218841 0 None None None 2020-07-13 08:10:25 UTC
Red Hat Product Errata RHBA-2020:2990 0 None None None 2020-07-27 13:49:22 UTC

Description Simon Reber 2020-07-13 07:44:25 UTC
Description of problem:

We are seeing readiness probes for the OpenShift-provided router (haproxy) fail intermittently (approx. once every minute). Interestingly, all the additional checks are successful.

On closer investigation, we found that every time the readiness probe fails, we also see a "No ref for container" warning.

Jul 06 15:26:34 node.example.com atomic-openshift-node[8313]: W0706 15:26:34.200801    8313 prober.go:103] No ref for container "docker://a572168853266b01558abd857eecfd4f25aebd810d1bb34c18d663f20340185d" (router-16-zm4wv_default(47a37213-bc4f-11ea-90da-fa163ed1f711):router)
Jul 06 15:26:34 node.example.com atomic-openshift-node[8313]: I0706 15:26:34.200842    8313 prober.go:111] Readiness probe for "router-16-zm4wv_default(47a37213-bc4f-11ea-90da-fa163ed1f711):router" failed (failure): Get http://localhost:1936/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: W0710 11:46:56.349034   97315 prober.go:103] No ref for container "docker://6270fcef05054557af91ed786e2de65bd0845eb372b99e4976c89487c371bd59" (router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router)
Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: I0710 11:46:56.349075   97315 prober.go:111] Readiness probe for "router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router" failed (failure): Get http://localhost:1936/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
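
The correlation described above can be checked mechanically against a node journal. The following is a minimal sketch, assuming log lines in exactly the format quoted above; the function name `pair_probe_events` and the regular expressions are illustrative and not part of any OpenShift tooling. It counts readiness probe failures and how many of them were preceded, within the same second and for the same pod and container, by a "No ref for container" warning.

```python
import re

# Patterns derived from the two kinds of log line quoted above.
# Assumption: other kubelet versions may word these messages differently.
KLOG_TS = re.compile(r'[IWEF](\d{4} \d{2}:\d{2}:\d{2})\.\d+')
NO_REF = re.compile(r'No ref for container "[^"]+" \((.+)\)\s*$')
FAILED = re.compile(r'Readiness probe for "([^"]+)" failed')


def pair_probe_events(lines):
    """Return (failures, failures_with_no_ref) for an iterable of journal lines.

    A failure counts as paired when a "No ref for container" warning for the
    same pod/container key was logged earlier with the same klog timestamp,
    compared at one-second resolution.
    """
    no_ref_seen = set()
    failures = 0
    paired = 0
    for line in lines:
        ts = KLOG_TS.search(line)
        if ts is None:
            continue  # not a klog-formatted line
        stamp = ts.group(1)
        match = NO_REF.search(line)
        if match:
            no_ref_seen.add((stamp, match.group(1)))
            continue
        match = FAILED.search(line)
        if match:
            failures += 1
            if (stamp, match.group(1)) in no_ref_seen:
                paired += 1
    return failures, paired
```

Feeding it the output of `journalctl --no-pager --unit atomic-openshift-node` line by line would show whether every failure is paired or whether some failures occur without the warning.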

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 3.11

How reproducible:

 - N/A

Steps to Reproduce:
1. N/A


Actual results:

Router readiness probes are failing with the warning below, while the same readiness probe succeeds afterwards:

Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: W0710 11:46:56.349034   97315 prober.go:103] No ref for container "docker://6270fcef05054557af91ed786e2de65bd0845eb372b99e4976c89487c371bd59" (router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router)

Expected results:

All readiness probes should succeed as long as the application is considered healthy and ready.

Additional info:

Comment 13 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990

Comment 16 Seth Jennings 2020-08-10 16:06:04 UTC
Ryan is on leave

Comment 17 Seth Jennings 2020-08-10 21:25:18 UTC
QE note: please test on v3.11.249-1 or later.

Comment 20 Seth Jennings 2020-08-12 15:28:16 UTC
After re-reviewing this bug, I think the fix eliminates the "No ref for container" warning but not the actual failure of the readiness probe, which is the reported bug.

Comment 21 Seth Jennings 2020-08-12 16:20:39 UTC
In the sos from comment 19

$ grep "Readiness probe for" journalctl_--no-pager_--unit_atomic-openshift-node  | grep router | grep failed | wc -l
514

514 failures over about 70h of logs.  So about one failure every 8m on average but not periodic in nature.
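
The average quoted above follows directly from the two figures in this comment. As a quick check, here is a minimal sketch of the arithmetic; the helper name `mean_interval_minutes` is illustrative, and the 70-hour span is the approximate value stated above, not a measured one.

```python
def mean_interval_minutes(failures, hours):
    """Average number of minutes between failures over the given log span."""
    return hours * 60 / failures


# 514 failed readiness probes over roughly 70 hours of logs.
print(round(mean_interval_minutes(514, 70), 1))  # 8.2
```

This gives only the mean; it says nothing about how the failures are distributed, which matches the observation that they are not periodic.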

Sending to Network Edge and reducing severity to match ticket since this doesn't cause an outage, just event noise.

Network Edge, if you don't want to address this please close with wontfix and don't bounce back.

Comment 22 Luke Meyer 2020-08-19 12:11:08 UTC
Please don't reopen ERRATA bugs. There is a reason for this message. Please make a clone.

(In reply to errata-xmlrpc from comment #13)
> Since the problem described in this bug report should be
> resolved in a recent advisory, it has been closed with a
> resolution of ERRATA.
> 
> If the solution does not work for you, open a new bug report.

