Created attachment 1781468 [details]
crio logs

Description of problem: The Hawkular Cassandra pod's readiness probe fails when the pod runs on a CRI-O node. The liveness probe fails continuously when its timeout is set to 1s; the number of failures decreases as the timeout is increased, and with a 30s timeout we observed no failures. This happens specifically with the CRI-O container runtime; the same pods work fine on Docker nodes. I am attaching the crio logs and an sosreport from the node.
Hello, Any update on this? Thanks, Vijay
I will try to take a look today
Hello, Any update on this request? Thanks, Vijay
Finally got a moment to look at this. How low can you get the liveness probe timeouts on cri-o? Does 2s work? I am not sure this is a bug per se. Maybe it's a perf difference, but I don't know if that qualifies for a fix in 3.11. If the containers are eventually coming up, then the difference in how long they take to become live may be due to an architectural difference.
This actually looks like it only worked on Docker because of an age-old bug in the dockershim code: https://github.com/kubernetes/kubernetes/pull/94115 In other words, exec probes do not respect the timeout with dockershim. cri-o enforces its own timeouts, so the probes are appropriately failing. I am closing this as NOTABUG. I suggest bumping the liveness probe timeouts.
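For reference, the timeout in question is the `timeoutSeconds` field of the exec liveness probe in the pod (or template) spec. A minimal sketch of bumping it, where the probe command and surrounding values are illustrative assumptions rather than the actual Hawkular Cassandra template:

```yaml
# Sketch only: the command and surrounding values are placeholders.
# With dockershim, an exec probe's timeoutSeconds was effectively ignored
# (see kubernetes/kubernetes#94115); cri-o enforces it, so it must be
# long enough for the probe command to actually finish.
livenessProbe:
  exec:
    command: ["/bin/sh", "-c", "exit 0"]   # placeholder for the real check
  initialDelaySeconds: 30
  timeoutSeconds: 30      # raised from 1s so cri-o does not kill the exec early
  periodSeconds: 30
  failureThreshold: 3
```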
Hello, I understand that increasing the liveness probe timeout is the solution here, but the user needs to change it from the current value of 1s to 30s to avoid any failures; with a lower timeout value (i.e. less than 30s) they still see intermittent failures. Do you think this is normal behaviour on a cri-o node? Thanks, Vijay
Hello, Can you please confirm whether the scenario mentioned in my previous comment is normal behaviour? Thanks, Vijay
That seems very dependent on the container itself. The old Docker behavior may have given a false idea of how quickly the app was coming up; it could be perfectly realistic for the app to need 30 seconds to become ready. I don't know of any bug we've had that affects container creation time so drastically that a cri-o change would dramatically reduce it.
Are there still issues? Are we good to close this?
I have gone back to the customer with the update. They have the following queries: 1) When they run the same readiness probe script manually inside the pod, it responds within a second, whereas the probe itself takes too long. 2) When the readiness probe fails, the probe's container process goes into a defunct state and can be seen as an orphan on the node. Any idea on this?
Can you describe reproducer steps to get that behavior?
The customer has all of their nodes running on Docker and one node running cri-o. The issue occurs only when the Cassandra pod is scheduled on the cri-o node. The readiness probe fails even after the application has started serving requests.
I also asked the customer to enable debug logging on cri-o. Attaching the cri-o and system logs from the problematic node.
I don't know what to say here. The timeout is too short:
```
time="2021-05-28 20:30:30.459667483+02:00" level=debug msg="ExecSyncRequest &ExecSyncRequest{ContainerId:cb5cd71c228dfdb0e7344b2cf247faa3f938d2443137315878e179fbb6b4be94,Cmd:[/opt/apache-cassandra/bin/cassandra-docker-ready.sh],Timeout:2,}"
time="2021-05-28 20:30:32.685043456+02:00" level=debug msg="Received container exit code: -1, message: command timed out"
```
I really think the timeout should just be bumped.

> Readiness probe fails even after the application started serving the requests.

Note: the readiness probe timing out does not indicate the command wouldn't have eventually succeeded; it just means the command is taking longer than the timeout to complete. If you bump the timeout, the readiness probe will succeed. Unless I can get evidence there is a bug here, I will close again...
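Concretely, for the probe shown in the log above, raising `timeoutSeconds` on the exec readiness probe is the change being suggested. A hedged sketch of the relevant snippet follows; the script path is the one from the log, 30s is the value reported to work in the description, and the remaining fields are assumptions for illustration:

```yaml
readinessProbe:
  exec:
    command:
      - /opt/apache-cassandra/bin/cassandra-docker-ready.sh
  # cri-o aborts the exec at the configured timeout (2s in the log above)
  # and reports "command timed out"; per the description, 30s showed no failures.
  timeoutSeconds: 30
  periodSeconds: 10       # assumption, not from the log
  failureThreshold: 3     # assumption, not from the log
```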
I will go back to the customer and ask them to increase the timeout, but can you please tell me why the failed readiness probe process is going into a defunct state?
Hm, I need to know the conmon version to know for sure, but it's possible our conmon version is missing the attached patch. Can you get me the output of `cat /etc/crio/crio.conf | grep 'conmon ='` from a node? If the output is not empty, get the output of `$path --version` for the path it returns. If it's empty, can you get me the output of `/usr/bin/conmon --version`?
I see, I believe the attached PR will help this, then
Waiting on the downstream merge and packaging; hopefully we will have this in soon.
CI is still wonky; hopefully I'll have cycles to fix it next sprint.
alas, I did not
PR merged!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 3.11.z security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3193