Bug 1958718 - Hawkular cassandra pod readiness probe failed when run on the CRIO node.
Summary: Hawkular cassandra pod readiness probe failed when run on the CRIO node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Peter Hunt
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-10 01:35 UTC by Vijay Samanthapuri
Modified: 2024-10-01 18:08 UTC
CC List: 5 users

Fixed In Version: cri-o-1.11.16-0.16.rhaos3.11.git54f9e69.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-25 15:16:51 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github cri-o/cri-o pull 4981 | 0 | None | open | 1.11 conmon: kill process group if timed out | 2021-08-11 16:10:51 UTC
Red Hat Product Errata RHSA-2021:3193 | 0 | None | None | None | 2021-08-25 15:17:05 UTC

Description Vijay Samanthapuri 2021-05-10 01:35:06 UTC
Created attachment 1781468 [details]
crio logs

Description of problem: Hawkular cassandra pod readiness probe failed when run on the CRIO node.

The liveness probe fails continuously for the pods when the timeout is set to 1s; the number of failures drops as the timeout is increased, and with a 30s timeout we have observed no failures. This happens specifically with the CRI-O container runtime, whereas Docker works fine.

I am attaching the crio logs and the sosreport from the node.
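
For reference, one way to check the probe timeouts currently configured on the affected pod is shown below; this is a hedged sketch, and the pod name and namespace are placeholders rather than values from the customer environment:

```
# Placeholder pod name/namespace; point these at the actual hawkular-cassandra pod.
oc get pod hawkular-cassandra-1-abcde -n openshift-infra \
  -o jsonpath='{.spec.containers[0].readinessProbe.timeoutSeconds}{"\n"}'
oc get pod hawkular-cassandra-1-abcde -n openshift-infra \
  -o jsonpath='{.spec.containers[0].livenessProbe.timeoutSeconds}{"\n"}'
```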

Comment 3 Vijay Samanthapuri 2021-05-11 12:36:12 UTC
Hello,

Any update on this?

Thanks,
Vijay

Comment 4 Peter Hunt 2021-05-11 12:43:38 UTC
I will try to take a look today

Comment 5 Vijay Samanthapuri 2021-05-13 10:28:54 UTC
Hello,

Any update on this request?

Thanks,
Vijay

Comment 6 Peter Hunt 2021-05-13 14:12:50 UTC
Finally got a moment to look at this.

How low can you get the liveness probe timeout to work with cri-o? Does 2s work? I am not sure this is a bug per se; it may be a perf difference, but I don't know whether that qualifies for a fix in 3.11. If the containers are eventually coming up, then the difference in how long they take to become live may be due to an architectural difference.

Comment 7 Peter Hunt 2021-05-13 16:03:21 UTC
This actually looks like it's working because of an age-old bug with the dockershim code: https://github.com/kubernetes/kubernetes/pull/94115
In other words: exec probes are not respecting the timeout with dockershim. cri-o enforces its own timeouts, so the probes are appropriately failing.

I am closing this as NOTABUG. I suggest bumping the liveness probe timeouts.
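
A minimal sketch of bumping the timeouts as suggested, assuming the pod is managed by a ReplicationController named hawkular-cassandra-1 in openshift-infra (both placeholders) and that your oc build supports --timeout-seconds on `oc set probe`; otherwise edit timeoutSeconds in the pod template directly:

```
# Placeholder resource name/namespace; point this at whatever manages the cassandra pod.
oc set probe rc/hawkular-cassandra-1 -n openshift-infra \
  --readiness --liveness --timeout-seconds=30
```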

Comment 8 Vijay Samanthapuri 2021-05-17 00:58:53 UTC
Hello,

I understood that increasing the timeout for the liveness probe is the solution here. But the user needs to change it from the current value of 1s to 30s to avoid any failures; with a lower timeout value (i.e. less than 30s) he sees intermittent failures.

Do you think this is normal behaviour on a cri-o node?

Thanks,
Vijay

Comment 9 Vijay Samanthapuri 2021-05-18 06:06:19 UTC
Hello,

Can you please confirm whether the scenario mentioned in my previous comment is normal behaviour?

Thanks,
Vijay

Comment 10 Peter Hunt 2021-05-18 13:16:37 UTC
That seems very dependent on the container itself.
The old Docker behavior may have given a false impression of how quickly the app was coming up; it could be perfectly realistic for the app to need 30 seconds to come up. I don't know of any bugs we've had that would affect container creation time so drastically that a cri-o change would dramatically reduce it.

Comment 11 Peter Hunt 2021-05-20 19:13:31 UTC
Are there still issues? Are we good to close this?

Comment 12 Vijay Samanthapuri 2021-05-26 02:27:31 UTC
I have gone back to the customer with the update. He has the following queries:

1) When he runs the same readiness probe script manually inside the pod, it responds within a second, whereas through the probe it takes too long.

2) When the readiness probe fails, the container process goes into a defunct state and can be seen as an orphan on the node.

Any idea on this?
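
As a side note on query 2, a hedged way to confirm the defunct (zombie) probe processes on the node is plain ps/awk; nothing here is cri-o specific:

```
# List zombie processes (STAT starting with Z) together with their parents.
ps -eo pid,ppid,stat,etime,comm | awk 'NR==1 || $3 ~ /^Z/'
```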

Comment 13 Peter Hunt 2021-05-27 18:21:39 UTC
Can you describe reproducer steps to get that behavior?

Comment 14 Vijay Samanthapuri 2021-06-01 01:08:27 UTC
The customer has all his nodes running on Docker and one node with cri-o. The issue occurs only when the cassandra pod is scheduled on the cri-o node.
The readiness probe fails even after the application has started serving requests.

Comment 15 Vijay Samanthapuri 2021-06-01 01:11:04 UTC
I also asked the customer to enable debug logging on cri-o. Attaching the cri-o and system logs from the problematic node.

Comment 17 Vijay Samanthapuri 2021-06-04 00:46:31 UTC
Hello,

Any update on this?

Thanks,
Vijay

Comment 19 Peter Hunt 2021-06-04 16:04:19 UTC
I don't know what to say here. The timeout is too short:
```
time="2021-05-28 20:30:30.459667483+02:00" level=debug msg="ExecSyncRequest &ExecSyncRequest{ContainerId:cb5cd71c228dfdb0e7344b2cf247faa3f938d2443137315878e179fbb6b4be94,Cmd:[/opt/apache-cassandra/bin/cassandra-docker-ready.sh],Timeout:2,}"
time="2021-05-28 20:30:32.685043456+02:00" level=debug msg="Received container exit code: -1, message: command timed out"
```
I really think the timeout should just be bumped.

> The readiness probe fails even after the application has started serving requests.

Note: the readiness probe timing out does not indicate the command would not have eventually succeeded; it just means the command takes longer than the timeout to complete. If you bump the timeout, the readiness probe will succeed.
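
To pick a sensible bumped value, one option is to time the probe command in the running container; this is a sketch under assumptions (placeholder pod name/namespace, and the image's /bin/sh supporting the time keyword), not a verified command from this environment:

```
# Placeholder pod name; the script path is the one from the ExecSyncRequest log above.
oc exec hawkular-cassandra-1-abcde -n openshift-infra -- \
  sh -c 'time /opt/apache-cassandra/bin/cassandra-docker-ready.sh'
```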

Unless I can get evidence there is a bug here, I will close again...

Comment 20 Vijay Samanthapuri 2021-06-08 11:55:53 UTC
I will go back to the customer and ask to increase the timeout, but can you please tell me why the failed readiness probe process is going into a defunct state?

Comment 21 Peter Hunt 2021-06-08 14:25:25 UTC
Hm, I need to know the conmon version to know for sure, but it's possible we're missing the attached patch in our conmon build.

Can you get me the output of `cat /etc/crio/crio.conf | grep 'conmon ='` from a node? If the output is not empty, get the output of `$path --version`. If it's empty, get me the output of `/usr/bin/conmon --version`.
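
The same check as a copy-pasteable sketch; it assumes the default `conmon = "..."` form in crio.conf and that grep supports -P (it does on RHEL 7):

```
# Extract the configured conmon path, falling back to /usr/bin/conmon if unset.
conmon_path=$(grep -Po '(?<=conmon = ")[^"]+' /etc/crio/crio.conf)
"${conmon_path:-/usr/bin/conmon}" --version
```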

Comment 23 Peter Hunt 2021-06-09 16:00:51 UTC
I see, I believe the attached PR will help this, then

Comment 24 Peter Hunt 2021-06-11 18:46:26 UTC
waiting on downstream merge+packaging, hopefully will have this in soon.

Comment 25 Peter Hunt 2021-07-02 20:36:25 UTC
ci is still wonky, hopefully I'll have cycles to fix it next sprint

Comment 26 Peter Hunt 2021-07-23 19:56:48 UTC
alas, I did not

Comment 27 Peter Hunt 2021-08-11 16:11:02 UTC
pr merged!

Comment 32 errata-xmlrpc 2021-08-25 15:16:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 3.11.z security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3193

