Description of problem:
This BZ splits out Casey's reported symptom of pods not starting on reboots: https://bugzilla.redhat.com/show_bug.cgi?id=1785399#c19. We have looked into the logs and are going to wipe the CRI-O state on reboot.

Version-Release number of selected component (if applicable):
4.5 and 4.4

How reproducible:
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1670/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1940/artifacts/e2e-gcp-op/

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This will be fixed in 4.5 solely with a fix in CRI-O. The corresponding 4.4 bug has an MCO fix, but in 4.5 we have https://github.com/openshift/machine-config-operator/pull/1660, which gives control of the required config values to CRI-O. Thus, once the attached CRI-O PR merges into 4.5, this will be fixed.
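For anyone poking around on a node, a minimal sketch of how to inspect the pieces involved. The `crio-wipe.service` unit name comes from the logs below; the `version_file` config keys are assumed from crio.conf defaults rather than confirmed here:

# On the node (e.g. via `oc debug node/<node>` followed by `chroot /host`)
systemctl cat crio-wipe.service              # the boot-time unit that performs the wipe
crio config 2>/dev/null | grep version_file  # version files CRI-O compares to decide whether to wipe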
This can be verified as follows: reboot any node running the new version of CRI-O, check `journalctl -u crio-wipe`, and verify the string 'wiping containers' is there, hopefully followed by a list of containers that were wiped. Also, `crictl pods` should list only pods that are Ready and newer than `uptime` (all pods created before the node reboot were wiped).
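In command form, the verification above roughly boils down to this sketch (run on the rebooted node):

journalctl -u crio-wipe | grep -i "wiping containers"   # should match, ideally followed by "Deleted container ..." lines
crictl pods                                             # every pod should be Ready
uptime                                                  # compare against the CREATED column: all pods should be newer than the last boot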
The CRI-O PR is currently blocked on failing CI and lgtm (pending a passing CI run).
Checked `journalctl -u crio-wipe` but cannot find the string 'wiping containers'. Is this acceptable? @Peter Hunt and @Ryan Phillips
The `crictl pods` result is as expected.

sh-4.4# crictl version
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.18.0-4.dev.rhaos4.5.git7d79f42.el8
RuntimeApiVersion:  v1alpha1
sh-4.4# journalctl -u crio-wipe | grep -i "wiping containers"
sh-4.4#
sh-4.4# crictl pods
POD ID          CREATED          STATE   NAME                                            NAMESPACE                           ATTEMPT
f22850e77b659   42 seconds ago   Ready   ip-10-0-129-17us-east-2computeinternal-debug   default                             0
8d2d0f0558d0f   2 minutes ago    Ready   alertmanager-main-0                             openshift-monitoring                0
f17eb7427cf47   2 minutes ago    Ready   prometheus-k8s-0                                openshift-monitoring                0
0baf4c962fb11   2 minutes ago    Ready   machine-config-daemon-s2n67                     openshift-machine-config-operator   0
c3bd5f41dff2c   2 minutes ago    Ready   dns-default-vc27l                               openshift-dns                       0
c909f85eba305   2 minutes ago    Ready   node-ca-fzcb7                                   openshift-image-registry            0
What is the output of `journalctl -u crio-wipe`?
@Peter
# journalctl -u crio-wipe
-- Logs begin at Thu 2020-04-30 03:26:01 UTC, end at Thu 2020-04-30 07:07:02 UTC. --
Apr 30 03:26:30 ip-10-0-160-164.us-east-2.compute.internal systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 03:26:32 ip-10-0-160-164 crio[1517]: time="2020-04-30 03:26:32.937377364Z" level=info msg="version file /var/run/crio/version not>
Apr 30 03:26:32 ip-10-0-160-164 crio[1517]: time="2020-04-30 03:26:32.937840377Z" level=info msg="version file /var/lib/crio/version not>
Apr 30 03:26:33 ip-10-0-160-164 systemd[1]: Started CRI-O Auto Update Script.
Apr 30 03:26:33 ip-10-0-160-164 systemd[1]: crio-wipe.service: Consumed 279ms CPU time
-- Reboot --
Apr 30 03:28:37 localhost systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 03:28:38 ip-10-0-160-164 crio[1238]: version file /var/run/crio/version not found: open /var/run/crio/version: no such file or di>
Apr 30 03:28:38 ip-10-0-160-164 systemd[1]: Started CRI-O Auto Update Script.
Apr 30 03:28:38 ip-10-0-160-164 systemd[1]: crio-wipe.service: Consumed 236ms CPU time
-- Reboot --
Apr 30 06:55:29 localhost systemd[1]: Starting CRI-O Auto Update Script...
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: version file /var/run/crio/version not found: open /var/run/crio/version: no such file or di>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.565332213Z" level=info msg="Deleted container c4503e24070ce11c8758>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.583378488Z" level=info msg="Deleted container aa84d465d0e0815d6a3f>
Apr 30 06:55:30 ip-10-0-160-164 crio[1215]: time="2020-04-30 06:55:30.596691724Z" level=info msg="Deleted container 80038c408b2fc4879efb>
@Peter, I think the problem is due to the loss of /var/run/crio/version. I tested on another cluster, and it is verified.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-05-205255   True        False         4h55m   Cluster version is 4.5.0-0.nightly-2020-05-05-205255

$ oc debug node/ip-10-0-137-125.us-east-2.compute.internal
Starting pod/ip-10-0-137-125us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.137.125
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -u crio-wipe | grep -i "wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.394018783Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.412021493Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.424264527Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.435701280Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.452223046Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.460930258Z" level=info msg="Wiping containers"
May 06 09:16:23 ip-10-0-137-125 crio[1214]: time="2020-05-06 09:16:23.472280840Z" level=info msg="Wiping containers"
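A note on why the container wipe fires on every reboot: the /var/run/crio/version path seen in the logs lives under /var/run (a symlink to the tmpfs-backed /run), so it disappears across reboots, and its absence is what prompts crio-wipe to remove containers created before the reboot. A quick, illustrative check on a node (paths taken from the logs above; the commands are just a sketch):

# Via `oc debug node/<node>` + `chroot /host`:
ls -l /var/run/crio/version /var/lib/crio/version
# the /var/run copy is recreated each boot, while the /var/lib copy persists across reboots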
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409