Bug 1826896

Summary: [4.4] run crio-wipe on reboots to solve error reserving pod name across reboots
Product: OpenShift Container Platform
Reporter: Ryan Phillips <rphillips>
Component: Node
Assignee: Peter Hunt <pehunt>
Status: CLOSED ERRATA
QA Contact: MinLi <minmli>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.4
CC: aos-bugs, jhou, jokerman, kgarriso, mpatel, pehunt, schoudha, scuppett, wking, xtian
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1826895
Environment:
Last Closed: 2020-05-04 11:50:00 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1826895
Bug Blocks:

Description Ryan Phillips 2020-04-22 17:41:13 UTC
+++ This bug was initially created as a clone of Bug #1826895 +++

Description of problem:
This BZ is splitting out Casey's reported symptom of pods not starting on reboots.
https://bugzilla.redhat.com/show_bug.cgi?id=1785399#c19

We have looked into the logs and plan to wipe the CRI-O state on reboot.

Version-Release number of selected component (if applicable):
4.5 and 4.4

How reproducible:
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1670/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1940/artifacts/e2e-gcp-op/

Comment 1 Peter Hunt 2020-04-22 21:22:29 UTC
The PR to fix this in 4.4 is linked.
(The machine config operator will need another, smaller fix.)

Comment 3 Peter Hunt 2020-04-24 14:10:36 UTC
This can be verified as follows:

Upgrade any node with the new version of CRI-O and MCO

Check `journalctl -u crio-wipe` and verify that the string 'wiping containers' is there, ideally followed by a list of the containers that were wiped.

Also, `crictl pods` should list only pods that are Ready and newer than the node's `uptime` (all pods created before the node reboot were wiped).

Comment 4 Peter Hunt 2020-04-24 14:12:50 UTC
s/Upgrade/Reboot/g
in the above comment. Upgrading the node to that version involves a reboot, but the real test is whether containers are wiped on a reboot alone, without an upgrade.
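
The journal check described above could be scripted roughly as follows. This is a minimal sketch, not part of the fix: it assumes the output of `journalctl -u crio-wipe` has already been captured as text, and the function name `count_wipe_messages` is hypothetical.

```python
def count_wipe_messages(journal_text):
    """Count journal lines that contain 'wiping containers' (case-insensitive)."""
    needle = "wiping containers"
    return sum(
        1 for line in journal_text.splitlines() if needle in line.lower()
    )


def wipe_ran(journal_text):
    """Return True if at least one 'wiping containers' message was logged."""
    return count_wipe_messages(journal_text) > 0
```

On a rebooted node, the captured journal text would be passed to `wipe_ran`; a result of `False` would suggest that crio-wipe did not wipe containers on that boot.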

Comment 5 Kirsten Garrison 2020-04-24 19:53:02 UTC
CRI-O PR merged; waiting for the MCO PR.

Comment 6 Kirsten Garrison 2020-04-24 20:07:57 UTC
We need the following:

1. code merged to upstream - DONE
2. RPM built in brew - Peter will ask Lokesh to do this
3. ART makes a puddle - they pinged in Slack
4. RHCOS pulls RPMs from puddle - runs on a timer, doesn't need a person

Comment 7 Kirsten Garrison 2020-04-24 20:09:10 UTC
After those four are done, the MCO PR can pass tests and merge.

Comment 8 Kirsten Garrison 2020-04-24 23:32:59 UTC
1. crio code merged to upstream (https://github.com/cri-o/cri-o/pull/3635) - DONE
2. RPM built in brew (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1177645) - DONE
3. ART signs/pulls into puddle - in progress (Luke)
4. RHCOS pulls RPMs from puddle - runs on a timer, doesn't need a person
5. Installer PR: ashcrow/mrunal
6. MCO PR merges (https://github.com/openshift/machine-config-operator/pull/1679)
7. QE verifies this BZ

Comment 9 Kirsten Garrison 2020-04-25 01:46:18 UTC
1. crio code merged to upstream (https://github.com/cri-o/cri-o/pull/3635) - DONE
2. RPM built in brew (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1177645) - DONE
3. ART signs/pulls into puddle - DONE
4. RHCOS pulls RPMs from puddle - runs on a timer, doesn't need a person
5. Installer PR: ashcrow/mrunal
6. MCO PR merges (https://github.com/openshift/machine-config-operator/pull/1679)
7. QE verifies this BZ

Comment 10 Kirsten Garrison 2020-04-25 04:01:58 UTC
1. crio code merged to upstream (https://github.com/cri-o/cri-o/pull/3635) - DONE
2. RPM built in brew (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1177645) - DONE
3. ART signs/pulls into puddle - DONE
4. RHCOS pulls RPMs from puddle - DONE
5. Installer PR: https://github.com/openshift/installer/pull/3508
6. MCO PR merges (https://github.com/openshift/machine-config-operator/pull/1679)
7. QE verifies this BZ

Comment 15 MinLi 2020-04-27 07:23:00 UTC
Verified with version: 4.4.0-0.nightly-2020-04-26-205915

sh-4.4# crictl version
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
RuntimeApiVersion:  v1alpha1

$ oc adm  release info --commit-urls | grep machine-config-operator
  machine-config-operator                        https://github.com/openshift/machine-config-operator/commit/c83f295e07d1cfd5c3124dc140bcdb10f6e094ae  (pr#1679 merged) 

After the worker node reboots, check the logs as follows:
sh-4.4# journalctl -u crio-wipe | grep -i "wiping containers"
Apr 27 07:02:13 ip-10-0-165-102 crio[1163]: time="2020-04-27 07:02:13.934775271Z" level=info msg="wiping containers"
Apr 27 07:02:13 ip-10-0-165-102 crio[1163]: time="2020-04-27 07:02:13.950494123Z" level=info msg="wiping containers"
Apr 27 07:02:13 ip-10-0-165-102 crio[1163]: time="2020-04-27 07:02:13.963384553Z" level=info msg="wiping containers"

sh-4.4# crictl pods
POD ID              CREATED              STATE               NAME                                            NAMESPACE                                ATTEMPT
de3b1d5ee6597       About a minute ago   Ready               ip-10-0-165-102us-east-2computeinternal-debug   default                                  0
72a40a58315fd       2 minutes ago        Ready               node-ca-wp874                                   openshift-image-registry                 0
0db1c6909c49c       2 minutes ago        Ready               multus-5f9gs                                    openshift-multus                         0
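
A table like the one above could be checked mechanically. The following is a rough sketch that assumes the default human-readable `crictl pods` layout shown here, with columns separated by two or more spaces. The helper names `parse_crictl_pods` and `not_ready_pods` are hypothetical, and the sketch covers only the "all pods are Ready" part of the verification, not the comparison against `uptime`.

```python
import re


def parse_crictl_pods(table_text):
    """Parse the human-readable `crictl pods` table into one dict per pod."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Columns are separated by runs of two or more spaces; values such as
    # "About a minute ago" contain only single spaces, so they stay intact.
    header = re.split(r"\s{2,}", lines[0].strip())
    return [
        dict(zip(header, re.split(r"\s{2,}", ln.strip()))) for ln in lines[1:]
    ]


def not_ready_pods(pods):
    """Return the names of pods whose STATE column is not 'Ready'."""
    return [p["NAME"] for p in pods if p.get("STATE") != "Ready"]
```

An empty list from `not_ready_pods` would match the expectation that only Ready pods remain after the reboot.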

Comment 17 errata-xmlrpc 2020-05-04 11:50:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581