Description of problem:

An OCP 4.6 cluster with RHEL 7.9 worker nodes. After upgrading the cluster from 4.6.20 to 4.6.27, readiness and liveness probes cluster-wide (for containers on the RHEL worker nodes) frequently fail with timeouts, seemingly at random.

Installed RPMs on the worker nodes:
~~~
$ egrep 'cri|conmon|podman' ./sosreport-tklis-terra1478-02976483-2021-06-29-nxqghol/installed-rpms
conmon-2.0.20-1.rhaos4.5.el7.x86_64                            Tue Nov 17 14:13:41 2020
criu-3.12-2.el7.x86_64                                         Tue Nov 17 14:13:50 2020
cri-o-1.19.2-4.rhaos4.6.git4f7cb5e.el7.x86_64                  Thu Jun 24 09:34:56 2021
cri-tools-1.19.0-2.el7.x86_64                                  Thu Jun 24 09:25:52 2021
initscripts-9.49.53-1.el7_9.1.x86_64                           Tue Nov 17 13:57:38 2020
podman-1.6.4-29.el7_9.x86_64                                   Thu Mar 18 12:25:42 2021
subscription-manager-1.24.42-1.el7.x86_64                      Tue Nov 17 13:57:57 2020
subscription-manager-rhsm-1.24.42-1.el7.x86_64                 Tue Nov 17 13:57:56 2020
subscription-manager-rhsm-certificates-1.24.42-1.el7.x86_64    Tue Nov 17 13:57:56 2020
~~~

After downgrading from cri-o-1.19.2-4.rhaos4.6.git4f7cb5e.el7.x86_64 to cri-o-1.19.1-16.rhaos4.6.git130633b.el7.x86_64, the liveness and readiness probe issues go away.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Further details:
===========================================================

cri-o-1.19.2-4.rhaos4.6.git4f7cb5e.el7.x86_64 is not the newest RPM that we have, so I asked the customer to bump to -6 and try it.

The newest changelog:
https://access.redhat.com/downloads/content/cri-o/1.19.2-6.rhaos4.6.git686e6d9.el7/x86_64/fd431d51/package-changelog
~~~
2021-06-18 Peter Hunt (Bot) <pehunt+bot> - 1.19.2-6.rhaos4.6.git686e6d9
- autobuilt 686e6d9

2021-06-16 Peter Hunt (Bot) <pehunt+bot> - 1.19.2-5.rhaos4.6.gite9d5cb8
- autobuilt e9d5cb8

2021-06-11 Peter Hunt (Bot) <pehunt+bot> - 1.19.2-4.rhaos4.6.git4f7cb5e
- autobuilt 4f7cb5e
~~~

Cloning the repository to analyze the changelog:
~~~
git clone https://github.com/cri-o/cri-o.git
cd cri-o
git checkout origin/release-1.19
git log --oneline
~~~

Quite a few changes have landed since the -4 version. Here are the changes between 1.19.2-6.rhaos4.6.git686e6d9 and 1.19.2-4.rhaos4.6.git4f7cb5e:
~~~
686e6d9f9 Merge pull request #4969 from haircommander/gh-actions-1.19-2
bc4bbe6d6 test: drop critest from gh-actions
ba7da0e27 test/image.bats: pull the image to be used
70bb9b494 test: fix mocks for oci
8acde4f0b test: tidy image prefetch
4a5796bf3 test: rm "run ctr with image with Config.Volumes"
91acd0fbd test: add no-pull-on-run=true
68e9e900b test/apparmor: remove specific failure check
bacb55d5b gh actions: build correctly for 1.19
1694cd3c1 scripts: bump cri-tools version
c7d5ba2da setup packages for lint
399379640 use go for lint
ef3b19b69 fix vendor and bump golangcilint version
91678fbba scripts: drop unneeded
851d8d9ac Makefile: drop unneeded targets
34c28f8f3 golangcilint: exclude correctly libraries for ci images
99180c4db add libbtrfs-dev to unit tests
92c76e917 gh actions: update to 1.20 branch
553ad689d drop gocyclo nolint
df9b93ef5 fix resource store test
413673364 Make config tests work rootless
5a4bcfc67 lib/sandbox/namespaces: correctly set always running pid
3ed8bf35a update github actions to sync with main branch
e9d5cb813 Merge pull request #5001 from openshift-cherrypick-robot/cherry-pick-4999-to-release-1.19
99e62cb8d oci: kill runtime process on exec if exec pid isn't written yet
bfbea6994 oci: don't pre-create pid file
4f7cb5e32 Merge pull request #4953 from saschagrunert/release-1.19-4943
~~~

And here are the changes between 1.19.2-4.rhaos4.6.git4f7cb5e and cri-o-1.19.1-16.rhaos4.6.git130633b.el7.x86_64, which would have introduced the issue:
~~~
4f7cb5e32 Merge pull request #4953 from saschagrunert/release-1.19-4943
ad76f443d Merge pull request #4987 from haircommander/storage-segfault-1.19
1662da737 vendor: bump storage to pickup segfault fix
9575c6978 Merge pull request #4963 from haircommander/bump-1.19.2
ad3dd83b9 bump to v1.19.2
6c6152cda oci: do not use conmon for exec sync
130633b58 Merge pull request #4848 from openshift-cherrypick-robot/cherry-pick-4834-to-release-1.19
~~~
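The commit lists above can be narrowed with a revision range instead of reading the full `git log --oneline` output. The following is a minimal sketch, assuming the `.git<hash>` suffix seen in the RPM names above always carries the short commit hash; the helper names `commit_from_nvr` and `range_from_nvrs` are hypothetical and only for illustration.

```shell
# Extract the short commit hash embedded in a cri-o RPM version-release string.
# Assumption: the string follows the "...git<hash>[.el7.x86_64]" pattern shown above.
commit_from_nvr() {
    nvr=$1
    # Drop everything up to and including the last ".git" ...
    hash=${nvr##*.git}
    # ... then drop any trailing ".el7.x86_64"-style suffix.
    printf '%s\n' "${hash%%.*}"
}

# Build a git revision range "old..new" from two version-release strings.
range_from_nvrs() {
    printf '%s..%s\n' "$(commit_from_nvr "$1")" "$(commit_from_nvr "$2")"
}

range_from_nvrs 1.19.2-4.rhaos4.6.git4f7cb5e 1.19.2-6.rhaos4.6.git686e6d9
# prints: 4f7cb5e..686e6d9
```

Inside the cri-o checkout, the printed range can be passed to git, for example `git log --oneline 4f7cb5e..686e6d9`, which lists commits reachable from the newer build but not from the older one.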
The commit range:
```
e9d5cb813 Merge pull request #5001 from openshift-cherrypick-robot/cherry-pick-4999-to-release-1.19
99e62cb8d oci: kill runtime process on exec if exec pid isn't written yet
bfbea6994 oci: don't pre-create pid file
```
contains some pretty important bug fixes for execs (used in liveness probes). I would not be surprised if the issues were caused by missing them. I would wait until we have feedback from the customer on a version that ships that cri-o.
did not have a moment to look further
Hey folks, as an update: we did find an issue with cri-o 1.19.2-6.rhaos4.6.git686e6d9. Basically, we're giving execs about 10-20 ms less time than we were before. The attached PR (when it works as expected) should help the situation.
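To illustrate why a few milliseconds matter, here is a rough sketch of how a command that finishes close to its deadline flips from passing to timing out once the budget shrinks. It does not use cri-o at all: `sleep` stands in for the probe command, GNU coreutils `timeout` (with fractional seconds) stands in for the exec timeout, and `probe_with_budget` is a hypothetical helper name.

```shell
# Run a stand-in "probe command" (sleep) under a time budget.
# Prints "ok" if it finished in time and "timed out" otherwise.
# Assumption: GNU coreutils timeout and sleep, which accept fractional seconds.
probe_with_budget() {
    budget_s=$1
    work_s=$2
    if timeout "$budget_s" sleep "$work_s"; then
        echo ok
    else
        echo "timed out"
    fi
}

probe_with_budget 1 0.5      # command finishes well inside the budget
probe_with_budget 0.48 0.5   # budget is 20 ms shorter than the command needs
```

A probe whose command normally completes just under its configured timeout behaves like the second call when the effective budget is reduced, which would match the seemingly random failures reported here.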
*** Bug 1976387 has been marked as a duplicate of this bug. ***
upstream PR merged
Ran a 4.9 cluster for a long time and did not find any "probe failed: command timed out" events. Also, https://search.ci.openshift.org/?search=probe+failed%3A+command+timed+out&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job shows no similar error in the last 2 days. Setting to verified.
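A check like the one described above could be scripted roughly as follows. This is only a sketch: `count_probe_timeouts` is a hypothetical helper that counts matching lines on standard input, and the `oc` invocation in the comment assumes a logged-in session with permission to list events in all namespaces.

```shell
# Count lines on stdin that contain the probe timeout message from this bug.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the helper
# usable under "set -e"; the count (possibly 0) is still printed.
count_probe_timeouts() {
    grep -c 'probe failed: command timed out' || true
}

# Example against a live cluster (requires cluster access, not run here):
#   oc get events --all-namespaces | count_probe_timeouts
```

A result of 0 over a long-running cluster is consistent with the verification above, although events are retained only for a limited time, so the check is best repeated periodically.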
*** Bug 1982448 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759