Description of problem:
This BZ splits out Casey's reported symptom of pods not starting on reboots: https://bugzilla.redhat.com/show_bug.cgi?id=1785399#c19. We have looked into the logs and are going to wipe the CRI-O state on reboot.

Version-Release number of selected component (if applicable):
4.5 and 4.4

How reproducible:
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1670/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1940/artifacts/e2e-gcp-op/

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This will be fixed in 4.5 solely with a fix in CRI-O. The corresponding 4.4 bug has an MCO fix, but in 4.5 we have https://github.com/openshift/machine-config-operator/pull/1660, which gives control of the required config values to CRI-O. Thus, once the attached CRI-O PR merges into 4.5, this will be fixed.
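For anyone poking around on a node, a minimal sketch of how to inspect the pieces involved. The `crio-wipe.service` unit name comes from the logs below; the `version_file` config keys are assumed from crio.conf defaults rather than confirmed here:

# On the node (e.g. via `oc debug node/<node>` followed by `chroot /host`)
systemctl cat crio-wipe.service              # the boot-time unit that performs the wipe
crio config 2>/dev/null | grep version_file  # version files CRI-O compares to decide whether to wipe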
This can be verified as follows: reboot any node running the new version of CRI-O, check `journalctl -u crio-wipe`, and verify the string 'wiping containers' is there, hopefully followed by a list of containers that were wiped. Also, `crictl pods` should list only pods that are Ready and newer than `uptime` (all pods created before the node reboot were wiped).
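In command form, the verification above roughly boils down to this sketch (run on the rebooted node):

journalctl -u crio-wipe | grep -i "wiping containers"   # should match, ideally followed by "Deleted container ..." lines
crictl pods                                             # every pod should be Ready
uptime                                                  # compare against the CREATED column: all pods should be newer than the last boot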
The CRI-O PR is currently blocked on failing CI and lgtm (pending a passing CI run).
Checked `journalctl -u crio-wipe` but cannot find the string 'wiping containers'. Is this acceptable? @Peter Hunt and @Ryan Phillips
The `crictl pods` result is as expected.

sh-4.4# crictl version
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.18.0-4.dev.rhaos4.5.git7d79f42.el8
RuntimeApiVersion:  v1alpha1
sh-4.4# journalctl -u crio-wipe | grep -i "wiping containers"
sh-4.4#
sh-4.4# crictl pods
POD ID          CREATED          STATE   NAME                                            NAMESPACE                           ATTEMPT
f22850e77b659   42 seconds ago   Ready   ip-10-0-129-17us-east-2computeinternal-debug   default                             0
8d2d0f0558d0f   2 minutes ago    Ready   alertmanager-main-0                             openshift-monitoring                0
f17eb7427cf47   2 minutes ago    Ready   prometheus-k8s-0                                openshift-monitoring                0
0baf4c962fb11   2 minutes ago    Ready   machine-config-daemon-s2n67                     openshift-machine-config-operator   0
c3bd5f41dff2c   2 minutes ago    Ready   dns-default-vc27l                               openshift-dns                       0
c909f85eba305   2 minutes ago    Ready   node-ca-fzcb7                                   openshift-image-registry            0
What is the output of `journalctl -u crio-wipe`?
@Peter
# journalctl -u crio-wipe
-- Logs begin at Thu 2020-04-30 03:26:01 UTC, end at Thu 2020-04-30 07:07:02 UTC. --
Apr 30 03:26:30 ip-10-0-160-164.us-east-2.compute.internal systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 03:26:32 ip-10-0-160-164 crio[1517]: time="2020-04-30 03:26:32.937377364Z" level=info msg="version file /var/run/crio/version not>
Apr 30 03:26:32 ip-10-0-160-164 crio[1517]: time="2020-04-30 03:26:32.937840377Z" level=info msg="version file /var/lib/crio/version not>
Apr 30 03:26:33 ip-10-0-160-164 systemd[1]: Started CRI-O Auto Update Script.
Apr 30 03:26:33 ip-10-0-160-164 systemd[1]: crio-wipe.service: Consumed 279ms CPU time
-- Reboot --
Apr 30 03:28:37 localhost systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 03:28:38 ip-10-0-160-164 crio[1238]: version file /var/run/crio/version not found: open /var/run/crio/version: no such file or di>
Apr 30 03:28:38 ip-10-0-160-164 systemd[1]: Started CRI-O Auto Update Script.
Apr 30 03:28:38 ip-10-0-160-164 systemd[1]: crio-wipe.service: Consumed 236ms CPU time
-- Reboot --
Apr 30 06:55:29 localhost systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: version file /var/run/crio/version not found: open /var/run/crio/version: no such file or di>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.565332213Z" level=info msg="Deleted container c4503e24070ce11c8758>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.583378488Z" level=info msg="Deleted container aa84d465d0e0815d6a3f>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.596691724Z" level=info msg="Deleted container 80038c408b2fc4879efb>
@Peter, I think the problem is due to the loss of /var/run/crio/version. I tested on another cluster, and it is verified.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-05-205255   True        False         4h55m   Cluster version is 4.5.0-0.nightly-2020-05-05-205255

$ oc debug node/ip-10-0-137-125.us-east-2.compute.internal
Starting pod/ip-10-0-137-125us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.137.125
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -u crio-wipe | grep -i "wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.394018783Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.412021493Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.424264527Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.435701280Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.452223046Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.460930258Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.472280840Z" level=info msg="Wiping containers"
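A note on why the container wipe fires on every reboot: the /var/run/crio/version path seen in the logs lives under /var/run (a symlink to the tmpfs-backed /run), so it disappears across reboots, and its absence is what prompts crio-wipe to remove containers created before the reboot. A quick, illustrative check on a node (paths taken from the logs above; the commands are just a sketch):

# Via `oc debug node/<node>` + `chroot /host`:
ls -l /var/run/crio/version /var/lib/crio/version
# the /var/run copy is recreated each boot, while the /var/lib copy persists across reboots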
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409