Description of problem:
We are seeing readiness probes fail for the OpenShift-provided router (HAProxy) intermittently (approx. once every minute). Interestingly, all the additional checks are successful. On closer investigation we found that whenever the readiness probe fails, a "No ref for container" warning is logged with it.

Jul 06 15:26:34 node.example.com atomic-openshift-node[8313]: W0706 15:26:34.200801 8313 prober.go:103] No ref for container "docker://a572168853266b01558abd857eecfd4f25aebd810d1bb34c18d663f20340185d" (router-16-zm4wv_default(47a37213-bc4f-11ea-90da-fa163ed1f711):router)
Jul 06 15:26:34 node.example.com atomic-openshift-node[8313]: I0706 15:26:34.200842 8313 prober.go:111] Readiness probe for "router-16-zm4wv_default(47a37213-bc4f-11ea-90da-fa163ed1f711):router" failed (failure): Get http://localhost:1936/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: W0710 11:46:56.349034 97315 prober.go:103] No ref for container "docker://6270fcef05054557af91ed786e2de65bd0845eb372b99e4976c89487c371bd59" (router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router)
Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: I0710 11:46:56.349075 97315 prober.go:111] Readiness probe for "router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router" failed (failure): Get http://localhost:1936/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 3.11

How reproducible:
- N/A

Steps to Reproduce:
1. N/A

Actual results:
Router readiness probes fail with

Jul 10 11:46:56 node-foo.example.com atomic-openshift-node[97315]: W0710 11:46:56.349034 97315 prober.go:103] No ref for container "docker://6270fcef05054557af91ed786e2de65bd0845eb372b99e4976c89487c371bd59" (router-37-shpnl_default(2f347f86-c0ab-11ea-a8c5-005056a97e00):router)

while the same readiness probe succeeds on the next run.

Expected results:
All readiness probes succeed as long as the application is considered healthy and ready.

Additional info:
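One way to check whether the endpoint itself is occasionally slow, independent of the kubelet, is to poll it from inside the router pod or the node's network namespace and record the latency. The following is only a rough sketch: `probe_once` and `poll` are hypothetical helper names, the 1 second timeout is the Kubernetes default for `timeoutSeconds` and is assumed here (the configured value for this router is not stated in this report), and `urllib` applies the timeout per socket operation, so it only approximates the kubelet's overall client timeout.

```python
import time
import urllib.error
import urllib.request


def probe_once(url, timeout_s=1.0):
    """Issue one GET and return (ok, elapsed_seconds, detail).

    Success is any status in 200..399, which is how the kubelet
    treats HTTP probes. timeout_s is an assumed value.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 400
            detail = f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError by urllib.
        ok, detail = False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Connection errors and socket timeouts end up here.
        ok, detail = False, f"error: {exc}"
    return ok, time.monotonic() - start, detail


def poll(url, count, interval_s=10.0, timeout_s=1.0):
    """Probe `count` times and return only the failed attempts."""
    failures = []
    for attempt in range(count):
        ok, elapsed, detail = probe_once(url, timeout_s)
        if not ok:
            failures.append((attempt, round(elapsed, 3), detail))
        if attempt + 1 < count:
            time.sleep(interval_s)
    return failures
```

For example, `poll("http://localhost:1936/healthz/ready", count=360)` would cover roughly an hour at the assumed 10 second interval. Failures that cluster at the timeout value would point at a slow response from the router rather than at the prober.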
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2990
Ryan is on leave
QE note: please test on v3.11.249-1 or later.
After re-reviewing this bug, I think the fix eliminates the "No ref for container" warning but not the actual failure of the readiness probe, which is the reported bug.
In the sos from comment 19

$ grep "Readiness probe for" journalctl_--no-pager_--unit_atomic-openshift-node | grep router | grep failed | wc -l
514

514 failures over about 70h of logs. So about one failure every 8m on average, but not periodic in nature.

Sending to Network Edge and reducing severity to match the ticket, since this doesn't cause an outage, just event noise.

Network Edge, if you don't want to address this, please close with WONTFIX and don't bounce it back.
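To see how the failures are distributed rather than only counted, the same filter can be applied in a short script that also parses the klog timestamps. This is a sketch under assumptions: the function names are hypothetical, the substring filter mirrors the three `grep` stages above, and the year defaults to 2020 because the klog header (`I0706 15:26:34.200842`) carries no year.

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, then HH:MM:SS.microseconds
KLOG = re.compile(r"\b[IWEF](?P<md>\d{4}) (?P<ts>\d{2}:\d{2}:\d{2}\.\d+)")


def router_readiness_failures(lines, year=2020):
    """Return timestamps of failed router readiness probes.

    `year` is an assumption; klog lines do not include it.
    """
    stamps = []
    for line in lines:
        # Same filter as: grep "Readiness probe for" | grep router | grep failed
        if not ("Readiness probe for" in line and "router" in line and "failed" in line):
            continue
        m = KLOG.search(line)
        if m:
            stamps.append(
                datetime.strptime(f"{year}{m['md']} {m['ts']}", "%Y%m%d %H:%M:%S.%f")
            )
    return stamps


def gap_summary(stamps):
    """Count failures and summarize the gaps between them, in seconds."""
    stamps = sorted(stamps)
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return {
        "count": len(stamps),
        "min_gap_s": min(gaps) if gaps else None,
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else None,
        "max_gap_s": max(gaps) if gaps else None,
    }


def mean_interval_minutes(hours, failures):
    """Average spacing if failures were spread evenly over the window."""
    return hours * 60 / failures
```

Run against the journal file from the sos report, `gap_summary(router_readiness_failures(open(path)))` would show whether the gaps are tightly grouped or widely scattered. `mean_interval_minutes(70, 514)` reproduces the "about 8m" estimate.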
Please don't reopen ERRATA bugs. There is a reason for this message. Please make a clone.

(In reply to errata-xmlrpc from comment #13)
> Since the problem described in this bug report should be
> resolved in a recent advisory, it has been closed with a
> resolution of ERRATA.
>
> If the solution does not work for you, open a new bug report.