Bug 1978268 - Exec probes fail clusterwide after upgrade to cri-o-1.19.2-4.rhaos4.6.git4f7cb5e.el7.x86_64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1976387 1982448
Depends On:
Blocks: 1982725 1983127
Reported: 2021-07-01 12:21 UTC by Andreas Karis
Modified: 2024-10-01 18:51 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1982725
Environment:
Last Closed: 2021-10-18 17:37:30 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | cri-o cri-o pull 5067 | 0 | None | closed | oci: wait until pidfile to be written before starting timer | 2021-07-16 18:25:35 UTC
Red Hat Knowledge Base (Solution) | 6194881 | 0 | None | None | None | 2021-07-19 08:40:22 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:37:51 UTC

Description Andreas Karis 2021-07-01 12:21:38 UTC
Description of problem:

An OCP 4.6 cluster with RHEL 7.9 nodes

After upgrading the cluster from 4.6.20 to 4.6.27, readiness and liveness probes for containers on the RHEL worker nodes frequently fail with timeouts, cluster-wide and seemingly at random.

Installed RPMs on the worker nodes:
~~~
$ egrep 'cri|conmon|podman' ./sosreport-tklis-terra1478-02976483-2021-06-29-nxqghol/installed-rpms
conmon-2.0.20-1.rhaos4.5.el7.x86_64                         Tue Nov 17 14:13:41 2020
criu-3.12-2.el7.x86_64                                      Tue Nov 17 14:13:50 2020
cri-o-1.19.2-4.rhaos4.6.git4f7cb5e.el7.x86_64               Thu Jun 24 09:34:56 2021
cri-tools-1.19.0-2.el7.x86_64                               Thu Jun 24 09:25:52 2021
initscripts-9.49.53-1.el7_9.1.x86_64                        Tue Nov 17 13:57:38 2020
podman-1.6.4-29.el7_9.x86_64                                Thu Mar 18 12:25:42 2021
subscription-manager-1.24.42-1.el7.x86_64                   Tue Nov 17 13:57:57 2020
subscription-manager-rhsm-1.24.42-1.el7.x86_64              Tue Nov 17 13:57:56 2020
subscription-manager-rhsm-certificates-1.24.42-1.el7.x86_64 Tue Nov 17 13:57:56 2020
~~~

After downgrading from cri-o-1.19.2-4.rhaos4.6.git4f7cb5e.el7.x86_64 to cri-o-1.19.1-16.rhaos4.6.git130633b.el7.x86_64, the liveness and readiness probe issues go away.




Comment 3 Andreas Karis 2021-07-01 12:58:32 UTC
Further details:
===========================================================

cri-o-1.19.2-4.rhaos4.6.git4f7cb5e.el7.x86_64 is not the newest RPM that we have, so I asked the customer to bump to -6 and try it.

The newest changelog:
https://access.redhat.com/downloads/content/cri-o/1.19.2-6.rhaos4.6.git686e6d9.el7/x86_64/fd431d51/package-changelog
~~~
2021-06-18 Peter Hunt (Bot) <pehunt+bot> - 1.19.2-6.rhaos4.6.git686e6d9

    - autobuilt 686e6d9
2021-06-16 Peter Hunt (Bot) <pehunt+bot> - 1.19.2-5.rhaos4.6.gite9d5cb8

    - autobuilt e9d5cb8
2021-06-11 Peter Hunt (Bot) <pehunt+bot> - 1.19.2-4.rhaos4.6.git4f7cb5e

    - autobuilt 4f7cb5e
~~~

Cloning the repository to analyze the changelog:
~~~
git clone https://github.com/cri-o/cri-o.git
cd cri-o
git checkout origin/release-1.19
git log --oneline
~~~

Quite a few changes have landed since the -4 version. Here are the changes between 1.19.2-4.rhaos4.6.git4f7cb5e and 1.19.2-6.rhaos4.6.git686e6d9:
~~~
686e6d9f9 Merge pull request #4969 from haircommander/gh-actions-1.19-2
bc4bbe6d6 test: drop critest from gh-actions
ba7da0e27 test/image.bats: pull the image to be used
70bb9b494 test: fix mocks for oci
8acde4f0b test: tidy image prefetch
4a5796bf3 test: rm "run ctr with image with Config.Volumes"
91acd0fbd test: add no-pull-on-run=true
68e9e900b test/apparmor: remove specific failure check
bacb55d5b gh actions: build correctly for 1.19
1694cd3c1 scripts: bump cri-tools version
c7d5ba2da setup packages for lint
399379640 use go for lint
ef3b19b69 fix vendor and bump golangcilint version
91678fbba scripts: drop unneeded
851d8d9ac Makefile: drop unneeded targets
34c28f8f3 golangcilint: exclude correctly libraries for ci images
99180c4db add libbtrfs-dev to unit tests
92c76e917 gh actions: update to 1.20 branch
553ad689d drop gocyclo nolint
df9b93ef5 fix resource store test
413673364 Make config tests work rootless
5a4bcfc67 lib/sandbox/namespaces: correctly set always running pid
3ed8bf35a update github actions to sync with main branch
e9d5cb813 Merge pull request #5001 from openshift-cherrypick-robot/cherry-pick-4999-to-release-1.19
99e62cb8d oci: kill runtime process on exec if exec pid isn't written yet
bfbea6994 oci: don't pre-create pid file
4f7cb5e32 Merge pull request #4953 from saschagrunert/release-1.19-4943
~~~

And here are the changes between cri-o-1.19.1-16.rhaos4.6.git130633b.el7.x86_64 and 1.19.2-4.rhaos4.6.git4f7cb5e, which is the range that would have introduced the issue:
~~~
4f7cb5e32 Merge pull request #4953 from saschagrunert/release-1.19-4943
ad76f443d Merge pull request #4987 from haircommander/storage-segfault-1.19
1662da737 vendor: bump storage to pickup segfault fix
9575c6978 Merge pull request #4963 from haircommander/bump-1.19.2
ad3dd83b9 bump to v1.19.2
6c6152cda oci: do not use conmon for exec sync
130633b58 Merge pull request #4848 from openshift-cherrypick-robot/cherry-pick-4834-to-release-1.19
~~~
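
The mapping from RPM builds to commit ranges used above can be sketched in Python. This is only an illustration: it assumes the `.git<hash>` component of the RPM release is the abbreviated commit the package was built from (which the "autobuilt <hash>" changelog entries suggest), and the helper names `build_commit` and `git_log_range` are made up for this example.

```python
import re

# Assumed layout: <name>-<version>-<release>.<arch>, where the release may
# contain a ".git<hash>" component. Names here are hypothetical helpers.
NVRA = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)$"
)
GIT_SUFFIX = re.compile(r"\.git(?P<commit>[0-9a-f]{7,40})(?:\.|$)")


def build_commit(package):
    """Return the abbreviated commit embedded in an RPM name, or None."""
    match = NVRA.match(package)
    if match is None:
        return None
    suffix = GIT_SUFFIX.search(match.group("release"))
    return suffix.group("commit") if suffix else None


def git_log_range(old_package, new_package):
    """Build the git command listing commits reachable from new but not old."""
    old, new = build_commit(old_package), build_commit(new_package)
    if old is None or new is None:
        raise ValueError("package name carries no .git<hash> component")
    return f"git log --oneline {old}..{new}"


if __name__ == "__main__":
    # Prints: git log --oneline 130633b..4f7cb5e
    print(git_log_range(
        "cri-o-1.19.1-16.rhaos4.6.git130633b.el7.x86_64",
        "cri-o-1.19.2-4.rhaos4.6.git4f7cb5e.el7.x86_64",
    ))
```

Note that `old..new` excludes the older commit itself, whereas the listings above also show the boundary merge commit as their last line.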

Comment 4 Peter Hunt 2021-07-01 13:57:42 UTC
the commit range:
```
e9d5cb813 Merge pull request #5001 from openshift-cherrypick-robot/cherry-pick-4999-to-release-1.19
99e62cb8d oci: kill runtime process on exec if exec pid isn't written yet
bfbea6994 oci: don't pre-create pid file
```

are some pretty important bug fixes for execs (used in liveness probes). I would not be surprised if the issues were caused by missing them.

I would wait until we have feedback from the customer after they run a cri-o version that includes those commits.

Comment 6 Peter Hunt 2021-07-02 20:41:37 UTC
did not have a moment to look further

Comment 8 Peter Hunt 2021-07-13 15:03:20 UTC
hey folks, as an update, we did find an issue with cri-o 1.19.2-6.rhaos4.6.git686e6d9
basically, we're giving execs about 10-20 ms less time than we were before.
The attached PR (when it works as expected) should help the situation
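
To see why 10-20 ms can matter, here is a small arithmetic sketch in Python. It is a simplified model, not cri-o's code: it assumes the lost time is the startup delay before the exec pid file is written (inferred from the linked PR title, "oci: wait until pidfile to be written before starting timer"), and the function names and the 990 ms command duration are invented for illustration.

```python
def command_budget_ms(timeout_ms, startup_ms, timer_starts_after_pidfile):
    """Milliseconds the probe command itself gets before the timeout fires."""
    if timer_starts_after_pidfile:
        # Startup cost is paid before the clock starts ticking.
        return timeout_ms
    # Startup cost is charged against the probe timeout.
    return max(timeout_ms - startup_ms, 0)


def probe_times_out(command_ms, timeout_ms, startup_ms, timer_starts_after_pidfile):
    """True if a command needing command_ms would exceed its budget."""
    budget = command_budget_ms(timeout_ms, startup_ms, timer_starts_after_pidfile)
    return command_ms > budget


if __name__ == "__main__":
    # Hypothetical numbers: 1000 ms timeout (Kubernetes' default
    # timeoutSeconds of 1), 20 ms startup delay, command needs 990 ms.
    for after_pidfile in (False, True):
        print(after_pidfile, probe_times_out(990, 1000, 20, after_pidfile))
```

Under these assumed numbers, a probe command that finishes just inside a 1 second timeout fails when the timer starts early and passes when the timer starts only after the pid file exists, which would be consistent with the seemingly random timeouts reported.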

Comment 20 Peter Hunt 2021-07-15 14:09:37 UTC
*** Bug 1976387 has been marked as a duplicate of this bug. ***

Comment 29 Peter Hunt 2021-07-16 18:25:46 UTC
upstream PR merged

Comment 31 MinLi 2021-07-20 09:03:59 UTC
Ran a 4.9 cluster for a long time and did not find any "probe failed: command timed out" events.
Also, https://search.ci.openshift.org/?search=probe+failed%3A+command+timed+out&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job shows no similar errors in the last 2 days.

Setting to verified.
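
A check like the one above could be scripted roughly as follows. This Python sketch assumes the input is the JSON produced by something like `oc get events -A -o json` (a list object with `items`, each event having `message` and `involvedObject`); the function name and the sample data are invented for illustration.

```python
import json

# Substring quoted in the verification comment above.
NEEDLE = "probe failed: command timed out"


def probe_timeout_events(events_json):
    """Return (namespace, object name, message) for each matching event."""
    events = json.loads(events_json).get("items", [])
    hits = []
    for event in events:
        message = event.get("message", "")
        if NEEDLE in message:
            obj = event.get("involvedObject", {})
            hits.append((obj.get("namespace"), obj.get("name"), message))
    return hits


if __name__ == "__main__":
    # Made-up sample input, not taken from a real cluster.
    sample = json.dumps({"items": [
        {"message": "Liveness probe failed: command timed out",
         "involvedObject": {"namespace": "demo", "name": "web-0"}},
        {"message": "Started container web",
         "involvedObject": {"namespace": "demo", "name": "web-0"}},
    ]})
    for namespace, name, message in probe_timeout_events(sample):
        print(f"{namespace}/{name}: {message}")
```

An empty result over a long enough window corresponds to the "no events found" observation used for verification here.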

Comment 32 Peter Hunt 2021-07-23 13:23:36 UTC
*** Bug 1982448 has been marked as a duplicate of this bug. ***

Comment 35 errata-xmlrpc 2021-10-18 17:37:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

