Bug 1826895 - run crio-wipe on reboots to solve error reserving pod name across reboots
Summary: run crio-wipe on reboots to solve error reserving pod name across reboots
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks: 1826896
Reported: 2020-04-22 17:39 UTC by Ryan Phillips
Modified: 2020-07-13 17:30 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1826896
Environment:
Last Closed: 2020-07-13 17:30:21 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github cri-o cri-o pull 3647 0 None closed [1.18] crio wipe: add version-file-persist 2021-02-15 12:02:17 UTC
Github openshift installer pull 3509 0 None closed Bug 1826895: rhcos: bump RHCOS boot image to 44.81.202004250133-0 2021-02-15 12:02:18 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:30:39 UTC

Description Ryan Phillips 2020-04-22 17:39:17 UTC
Description of problem:
This BZ splits out Casey's reported symptom of pods not starting after reboots:
https://bugzilla.redhat.com/show_bug.cgi?id=1785399#c19

We have looked into the logs and are going to wipe the CRI-O state on reboot.
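
For reference, the detection works roughly as follows (a minimal sketch of the idea, not the actual CRI-O source): /var/run is a tmpfs, so the version marker CRI-O writes there disappears on reboot, and its absence tells `crio wipe` that the node rebooted and container state should be cleared.

# Illustrative sketch of the decision inside `crio wipe` (not the real
# implementation; the paths match the crio-wipe logs later in this bug):
if [ ! -f /var/run/crio/version ]; then
    # /var/run is a tmpfs, so this marker cannot survive a reboot;
    # its absence means the node rebooted since CRI-O last wrote it.
    echo "wiping containers"
    # ...remove all pods/containers recorded in CRI-O's storage...
fi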

Version-Release number of selected component (if applicable):
4.5 and 4.4

How reproducible:
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1670/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1940/artifacts/e2e-gcp-op/

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Peter Hunt 2020-04-24 14:06:11 UTC
This will be fixed in 4.5 solely with a fix in CRI-O. The corresponding 4.4 bug has an MCO fix, but in 4.5 we have https://github.com/openshift/machine-config-operator/pull/1660, which gives control of the required config values to CRI-O. Thus, once the attached CRI-O PR merges into 4.5, this will be fixed.

Comment 2 Peter Hunt 2020-04-24 14:11:50 UTC
This can be verified as follows:

Reboot any node with the new version of CRI-O

Check `journalctl -u crio-wipe` and verify the string 'wiping containers' is present, ideally followed by a list of the containers that were wiped.

Also, `crictl pods` should list only pods that are Ready and newer than the node's `uptime` (all pods created before the reboot were wiped).
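
Both checks can be run in one pass from the node (a sketch assuming the `oc debug node/...` plus `chroot /host` workflow used later in this bug):

sh-4.4# journalctl -u crio-wipe | grep -i "wiping containers" || echo "no wipe recorded"
sh-4.4# crictl pods    # every pod listed should be Ready and created after the reboot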

Comment 3 Kirsten Garrison 2020-04-24 17:39:10 UTC
The CRI-O PR is currently blocked on failing CI and an lgtm (pending passing CI).

Comment 6 MinLi 2020-04-30 07:23:55 UTC
 
I checked `journalctl -u crio-wipe` but cannot find the string 'wiping containers'. Is this acceptable? @Peter Hunt and @Ryan Phillips
And the `crictl pods` output is as expected.



sh-4.4# crictl version 
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.18.0-4.dev.rhaos4.5.git7d79f42.el8
RuntimeApiVersion:  v1alpha1

sh-4.4# journalctl -u crio-wipe | grep -i "wiping containers"
sh-4.4#

sh-4.4# crictl pods
POD ID              CREATED             STATE               NAME                                           NAMESPACE                                ATTEMPT
f22850e77b659       42 seconds ago      Ready               ip-10-0-129-17us-east-2computeinternal-debug   default                                  0
8d2d0f0558d0f       2 minutes ago       Ready               alertmanager-main-0                            openshift-monitoring                     0
f17eb7427cf47       2 minutes ago       Ready               prometheus-k8s-0                               openshift-monitoring                     0
0baf4c962fb11       2 minutes ago       Ready               machine-config-daemon-s2n67                    openshift-machine-config-operator        0
c3bd5f41dff2c       2 minutes ago       Ready               dns-default-vc27l                              openshift-dns                            0
c909f85eba305       2 minutes ago       Ready               node-ca-fzcb7                                  openshift-image-registry                 0

Comment 7 Peter Hunt 2020-04-30 14:02:15 UTC
What is the output of `journalctl -u crio-wipe`?

Comment 8 MinLi 2020-05-06 09:12:13 UTC
@Peter 
# journalctl -u crio-wipe                           
-- Logs begin at Thu 2020-04-30 03:26:01 UTC, end at Thu 2020-04-30 07:07:02 UTC. --
Apr 30 03:26:30 ip-10-0-160-164.us-east-2.compute.internal systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 03:26:32 ip-10-0-160-164 crio[1517]: time="2020-04-30 03:26:32.937377364Z" level=info msg="version file /var/run/crio/version not>
Apr 30 03:26:32 ip-10-0-160-164 crio[1517]: time="2020-04-30 03:26:32.937840377Z" level=info msg="version file /var/lib/crio/version not>
Apr 30 03:26:33 ip-10-0-160-164 systemd[1]: Started CRI-O Auto Update Script.
Apr 30 03:26:33 ip-10-0-160-164 systemd[1]: crio-wipe.service: Consumed 279ms CPU time
-- Reboot --
Apr 30 03:28:37 localhost systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 03:28:38 ip-10-0-160-164 crio[1238]: version file /var/run/crio/version not found: open /var/run/crio/version: no such file or di>
Apr 30 03:28:38 ip-10-0-160-164 systemd[1]: Started CRI-O Auto Update Script.
Apr 30 03:28:38 ip-10-0-160-164 systemd[1]: crio-wipe.service: Consumed 236ms CPU time
-- Reboot --
Apr 30 06:55:29 localhost systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: version file /var/run/crio/version not found: open /var/run/crio/version: no such file or di>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.565332213Z" level=info msg="Deleted container c4503e24070ce11c8758>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.583378488Z" level=info msg="Deleted container aa84d465d0e0815d6a3f>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.596691724Z" level=info msg="Deleted container 80038c408b2fc4879efb>

Comment 9 MinLi 2020-05-06 09:22:31 UTC
@Peter, I think the problem was due to the loss of /var/run/crio/version. I tested on another cluster, and it verified.

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-05-205255   True        False         4h55m   Cluster version is 4.5.0-0.nightly-2020-05-05-205255

$ oc debug node/ip-10-0-137-125.us-east-2.compute.internal
Starting pod/ip-10-0-137-125us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.137.125
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host 
sh-4.4# journalctl -u crio-wipe | grep -i "wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.394018783Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.412021493Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.424264527Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.435701280Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.452223046Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.460930258Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.472280840Z" level=info msg="Wiping containers"
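
The tmpfs behavior behind the missing marker is easy to confirm on the node (assuming the standard RHCOS layout, where /var/run resolves to /run):

sh-4.4# findmnt -n -o FSTYPE /run    # tmpfs: cleared on every boot, taking /var/run/crio/version with it
sh-4.4# cat /var/lib/crio/version    # the persistent on-disk copy (see the version-file-persist PR above)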

Comment 10 errata-xmlrpc 2020-07-13 17:30:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409