Bug 1856250

Summary: OpenShift router is failing readiness probe regularly
Product: OpenShift Container Platform Reporter: Simon Reber <sreber>
Component: Networking Assignee: Andrew McDermott <amcdermo>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: anowak, aos-bugs, ckoep, jokerman, rphillips, skrenger, talessio
Version: 3.11.0 Keywords: Reopened
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1870134 (view as bug list) Environment:
Last Closed: 2020-08-19 12:11:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1870134    

Description Simon Reber 2020-07-13 07:44:25 UTC
Description of problem:

We are seeing readiness probes fail intermittently for the OpenShift-provided router (HAProxy), approx. once every minute. Interestingly, all the additional checks are successful.

Investigating in more detail, we found that whenever the readiness probe fails, a "No ref for container" warning is logged as well:

Jul 06 15:26:34 node.example.com atomic-openshift-node[8313]: W0706 15:26:34.200801    8313 prober.go:103] No ref for container "docker://a572168853266b01558abd857eecfd4f25aebd810d1bb34c18d663f20340185d" (router-16-zm4wv_default(47a37213-bc4f-11ea-90da-fa163ed1f711):router)
Jul 06 15:26:34 node.example.com atomic-openshift-node[8313]: I0706 15:26:34.200842    8313 prober.go:111] Readiness probe for "router-16-zm4wv_default(47a37213-bc4f-11ea-90da-fa163ed1f711):router" failed (failure): Get http://localhost:1936/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: W0710 11:46:56.349034   97315 prober.go:103] No ref for container "docker://6270fcef05054557af91ed786e2de65bd0845eb372b99e4976c89487c371bd59" (router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router)
Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: I0710 11:46:56.349075   97315 prober.go:111] Readiness probe for "router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router" failed (failure): Get http://localhost:1936/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
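
The "Client.Timeout exceeded while awaiting headers" part of the message means the HTTP client's deadline passed before the router returned response headers. A minimal sketch for reproducing that check by hand from the node is shown below. The function name `probe_once` is hypothetical, and the 1-second default is an assumption based on the Kubernetes default `timeoutSeconds`; the value actually configured for this router is not stated in the report.

```python
import time
import urllib.error
import urllib.request


def probe_once(url, timeout_s=1.0):
    """Issue one HTTP GET and return (ok, elapsed_seconds, detail).

    timeout_s=1.0 is an assumption (the Kubernetes default timeoutSeconds),
    not a value taken from this bug report.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            # An HTTP probe counts status codes 200-399 as success.
            ok = 200 <= resp.status < 400
            detail = f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        ok, detail = False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers timeouts, refused connections and other socket errors.
        ok, detail = False, f"error: {exc}"
    return ok, time.monotonic() - start, detail
```

Calling `probe_once("http://localhost:1936/healthz/ready")` in a loop on the affected node and logging the elapsed time may show whether the endpoint is occasionally slow rather than actually unhealthy.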

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 3.11

How reproducible:

 - N/A

Steps to Reproduce:
1. N/A


Actual results:

Router readiness probes fail with the following warning, while the same readiness probe succeeds afterwards:

Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: W0710 11:46:56.349034   97315 prober.go:103] No ref for container "docker://6270fcef05054557af91ed786e2de65bd0845eb372b99e4976c89487c371bd59" (router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router)

Expected results:

All readiness probes succeed as long as the application is considered healthy and ready.

Additional info:

Comment 13 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990

Comment 16 Seth Jennings 2020-08-10 16:06:04 UTC
Ryan is on leave

Comment 17 Seth Jennings 2020-08-10 21:25:18 UTC
QE note: please test on v3.11.249-1 or later.

Comment 20 Seth Jennings 2020-08-12 15:28:16 UTC
After re-reviewing this bug, I think the fix eliminates the "No ref for container" warning but not the actual failure of the readiness probe, which is the reported bug.

Comment 21 Seth Jennings 2020-08-12 16:20:39 UTC
In the sos from comment 19

$ grep "Readiness probe for" journalctl_--no-pager_--unit_atomic-openshift-node  | grep router | grep failed | wc -l
514

514 failures over about 70h of logs.  So about one failure every 8m on average but not periodic in nature.
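
The average works out as 70 h × 60 / 514 ≈ 8.2 minutes per failure. To check the "not periodic" observation against a journal export, the gaps between failures can be computed from the klog timestamps. The following is a rough sketch, not tooling used in this bug: the helper names `failure_times` and `gap_summary` are hypothetical, and the year is an assumption because the klog header omits it.

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, then HH:MM:SS.microseconds
KLOG_RE = re.compile(r"\b[IWEF](\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d+)")


def failure_times(lines, year=2020):
    """Return sorted timestamps of failed router readiness probes.

    The year is an assumption; klog timestamps do not carry one.
    """
    times = []
    for line in lines:
        # Same filter as the grep pipeline in this comment.
        if not ("Readiness probe for" in line and "router" in line and "failed" in line):
            continue
        m = KLOG_RE.search(line)
        if m is None:
            continue
        month, day, hh, mm, ss = (int(g) for g in m.groups()[:5])
        micro = int(m.group(6)[:6].ljust(6, "0"))
        times.append(datetime(year, month, day, hh, mm, ss, micro))
    return sorted(times)


def gap_summary(times):
    """Summarize the gaps (in seconds) between consecutive failures."""
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if not gaps:
        return None
    return {
        "count": len(times),
        "mean_s": sum(gaps) / len(gaps),
        "min_s": min(gaps),
        "max_s": max(gaps),
    }
```

A large spread between `min_s` and `max_s` relative to `mean_s` would be consistent with failures that are not periodic.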

Sending to Network Edge and reducing severity to match ticket since this doesn't cause an outage, just event noise.

Network Edge, if you don't want to address this please close with wontfix and don't bounce back.

Comment 22 Luke Meyer 2020-08-19 12:11:08 UTC
Please don't reopen ERRATA bugs. There is a reason for this message. Please make a clone.

(In reply to errata-xmlrpc from comment #13)
> Since the problem described in this bug report should be
> resolved in a recent advisory, it has been closed with a
> resolution of ERRATA.
> 
> If the solution does not work for you, open a new bug report.