Description of problem: During an upgrade, the MCO cluster version is bumped before a rendered config is generated. Working theory is that this has to do with preexisting function: isMachineConfigPoolConfigurationValid()L https://github.com/openshift/machine-config-operator/blob/4245fea0c3c16a49c68207049b3a9a245d8b53b3/pkg/operator/status.go#L356 and probably has always existed just got exposed recently due to the work done in 4.7 What we see happening: 1. OCP update triggered 2. CVO bumps the image ConfigMap 3. Outgoing MCC leaves it alone; thanks version-mismatch guard (https://github.com/openshift/machine-config-operator/commit/60c5ee214b5220e5499c1b687594dcf1a16ed9e8#diff-8047e5e8697bdac51[…]922c47ca937f03508f24fa) 4. MCO grabs the pool, sees version.Hash match (MCC commit same before and after this particular OCP update), assumes that means the rendered config is up to date :bug:. Sees Updated=True. Bumps ClusterOperator version 5. everything else starts rolling out to the pools after the bump
This was discovered when working on: https://bugzilla.redhat.com/show_bug.cgi?id=1932105 We have a PR up adding logging to confirm root cause: https://github.com/openshift/machine-config-operator/pull/2444
Example job where we saw this [1], which had the fix from bug 1932105. Timeline: 1. 18:14Z, machine-config-operator-65985fcd74-jd724 exists [2]. 2. 18:15, machine-config-controller-7b69b565bf-wh95z exists [2]. 3. 18:53:20.484613, CVO bumps the MCO Deployment [3]. 4. 18:53:20.506900, CVO bumps the machine-config-osimageurl ConfigMap [4]. 5. 18:53:20.530438, first reference to machine-config-operator-6d7485cd55-6v46c in the kubelet logs [2]. 6. 18:53:25.408063, last reference to machine-config-operator-65985fcd74-jd724 in kubelet logs [2]. 7. 18:55:19.317356, machine-config-operator-6d7485cd55-6v46c acquires the leader lease [7]. 8. 18:55:26.145255, machine-config-operator-6d7485cd55-6v46c bumps ClusterOperator version BUG [7]. 9. 18:55:28, new rendered master [5], with a new osImageURL [6] (pulled from the machine-config-osimageurl ConfigMap). 10. 18:55:42.744856, machine-config-operator-6d7485cd55-bd2zd shows up in kubelet logs [2]. 11. 18:55:51.148309, last reference to machine-config-operator-6d7485cd55-6v46c in kubelet logs [2]. 12. 18:59:58.169961, first reference to machine-config-controller-7b69b565bf-zbjf5 in the kubelet logs [2]. 13. 19:00:05.096181, last reference to machine-config-controller-7b69b565bf-wh95z in the kubelet logs [2]. So seems like there was no need to bump the MCC, because the MCO commit hash, base image, and other bits that feed the image digest had not changed. Took a bit for the MCC to render the new MachineConfig after the machine-config-osimageurl ConfigMap bump; we're not entirely clear on why. The MCO came through just before the new MachineConfig was rendered and said "yeah, looks like the 'master' pool is up to date", and leveled the ClusterOperator, not realizing that seconds later a new MachineConfig would come out with the new RHCOS. Distinguishing between this failure mode and bug 1932105 seems hard. Probably best to fix them both, and then check CI-search [8] and confirm that it is empty. [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632 [2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632/artifacts/e2e-aws-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xz $ cd quay-io-* $ grep '^Mar 01 18:5' host_service_logs/masters/kubelet_service.log | grep -o 'machine-config-\(controller\|operator\)-[0-9a-z]*-[0-9a-z]*' | sort | uniq | while read POD; do echo "${POD}"; LOGS="$(grep "${POD}" host_service_logs/masters/kubelet_service.log | sed 's/^/ /')"; echo "${LOGS}" | head -n1; echo "${LOGS}" | tail -n1; done machine-config-controller-7b69b565bf-wh95z Mar 01 18:15:46.256012 ip-10-0-131-96 hyperkube[1438]: I0301 18:15:46.255704 1438 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-controller-7b69b565bf-wh95z_openshift-machine-config-operator(13f9f6de-6f02-4e8f-a6dc-dd74aa61415f)" Mar 01 19:00:05.096181 ip-10-0-131-96 hyperkube[1438]: I0301 19:00:05.096036 1438 kubelet.go:2120] Failed to delete pod "machine-config-controller-7b69b565bf-wh95z_openshift-machine-config-operator(13f9f6de-6f02-4e8f-a6dc-dd74aa61415f)", err: pod not found machine-config-controller-7b69b565bf-zbjf5 Mar 01 18:59:58.169961 ip-10-0-176-61 hyperkube[1525]: I0301 18:59:58.169773 1525 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-controller-7b69b565bf-zbjf5_openshift-machine-config-operator(02991a83-f0ec-4aec-b926-c9f0cde18d01)" Mar 01 19:20:02.347659 ip-10-0-176-61 hyperkube[1525]: I0301 19:20:02.347632 1525 logs.go:317] Finish parsing log file "/var/log/pods/openshift-machine-config-operator_machine-config-controller-7b69b565bf-zbjf5_02991a83-f0ec-4aec-b926-c9f0cde18d01/machine-config-controller/0.log" machine-config-controller-dockercfg-jsclw Mar 01 18:59:58.172216 ip-10-0-176-61 hyperkube[1525]: I0301 18:59:58.171873 1525 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-machine-config-operator"/"machine-config-controller-dockercfg-jsclw" Mar 01 18:59:58.172216 ip-10-0-176-61 hyperkube[1525]: I0301 18:59:58.171873 1525 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-machine-config-operator"/"machine-config-controller-dockercfg-jsclw" machine-config-controller-token-f5mml Mar 01 18:15:46.256230 ip-10-0-131-96 hyperkube[1438]: I0301 18:15:46.256209 1438 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-machine-config-operator"/"machine-config-controller-token-f5mml" Mar 01 19:00:05.096181 ip-10-0-131-96 hyperkube[1438]: I0301 19:00:05.096089 1438 reflector.go:225] Stopping reflector *v1.Secret (0s) from object-"openshift-machine-config-operator"/"machine-config-controller-token-f5mml" machine-config-operator-65985fcd74-jd724 Mar 01 18:14:51.271796 ip-10-0-131-96 hyperkube[1438]: I0301 18:14:51.271763 1438 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-operator-65985fcd74-jd724_openshift-machine-config-operator(002bf91d-228d-4359-a8dc-a4201dc16112)" Mar 01 18:53:25.408063 ip-10-0-131-96 hyperkube[1438]: I0301 18:53:25.407957 1438 kubelet.go:2120] Failed to delete pod "machine-config-operator-65985fcd74-jd724_openshift-machine-config-operator(002bf91d-228d-4359-a8dc-a4201dc16112)", err: pod not found machine-config-operator-6d7485cd55-6v46c Mar 01 18:53:20.530438 ip-10-0-176-61 hyperkube[1451]: I0301 18:53:20.530406 1451 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-operator-6d7485cd55-6v46c_openshift-machine-config-operator(f823ee29-733c-42ee-a767-0584085267bc)" Mar 01 18:55:51.148309 ip-10-0-176-61 hyperkube[1451]: I0301 18:55:51.148261 1451 kubelet.go:2120] Failed to delete pod "machine-config-operator-6d7485cd55-6v46c_openshift-machine-config-operator(f823ee29-733c-42ee-a767-0584085267bc)", err: pod not found machine-config-operator-6d7485cd55-bd2zd Mar 01 18:55:42.744856 ip-10-0-131-96 hyperkube[1438]: I0301 18:55:42.731755 1438 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-operator-6d7485cd55-bd2zd_openshift-machine-config-operator(a77662f0-2a3b-4722-a8cd-da27d192d209)" Mar 01 19:00:05.227097 ip-10-0-131-96 hyperkube[1438]: I0301 19:00:05.226955 1438 kubelet.go:2120] Failed to delete pod "machine-config-operator-6d7485cd55-bd2zd_openshift-machine-config-operator(a77662f0-2a3b-4722-a8cd-da27d192d209)", err: pod not found machine-config-operator-6d7485cd55-xgntn Mar 01 18:59:58.272571 ip-10-0-176-61 hyperkube[1525]: I0301 18:59:58.272549 1525 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-operator-6d7485cd55-xgntn_openshift-machine-config-operator(0b5c7d83-1b80-4011-a279-3d28c56014bb)" Mar 01 19:20:02.554286 ip-10-0-176-61 hyperkube[1525]: I0301 19:20:02.554261 1525 logs.go:317] Finish parsing log file "/var/log/pods/openshift-machine-config-operator_machine-config-operator-6d7485cd55-xgntn_0b5c7d83-1b80-4011-a279-3d28c56014bb/machine-config-operator/0.log" [3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632/artifacts/e2e-aws-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar xz $ cd quay-io-* $ zgrep -h '"verb":"update".*"resource":"deployments".*"name":"machine-config-operator"' audit_logs/*/*.gz 2>/dev/null | jq -r '.stageTimestamp + " " + .user.username' | sort | grep cluster-version ... 2021-03-01T18:36:44.179249Z system:serviceaccount:openshift-cluster-version:default 2021-03-01T18:53:20.484613Z system:serviceaccount:openshift-cluster-version:default 2021-03-01T19:09:13.465341Z system:serviceaccount:openshift-cluster-version:default ... [4]: In the audit log dir from [3]: $ zgrep -h '"verb":"update".*"name":"machine-config-osimageurl"' audit_logs/*/*.gz 2>/dev/null | jq -r '.stageTimestamp + " " + .user.username' | sort 2021-03-01T18:53:20.506900Z system:serviceaccount:openshift-cluster-version:default [5]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | sort ... 2021-03-01T18:15:49Z 99-worker-generated-registries 2021-03-01T18:15:54Z rendered-master-07c510edac4d5ef5cff3307709d3c444 2021-03-01T18:15:54Z rendered-worker-d7f92e25364c243bc7e49131b2d574eb 2021-03-01T18:55:28Z rendered-master-4ef29138db42bdfe5d43b28d7f77798f 2021-03-01T18:55:28Z rendered-worker-c7448314bb99bafccae83da0b28ff9a9 [6]: $ MCS="$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json)" $ diff -U0 <(echo "${MCS}" | jq '.items[] | select(.metadata.name == "rendered-master-07c510edac4d5ef5cff3307709d3c444 ").spec') <(echo "${MCS}" | jq '.items[] | select(.metadata.name == "rendered-master-4ef29138db42bdfe5d43b28d7f77798f").spec') --- /dev/fd/63 2021-03-01 17:16:00.425435297 -0800 +++ /dev/fd/62 2021-03-01 17:16:00.427435297 -0800 @@ -374 +374 @@ - "osImageURL": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96be060a7824bed1eae6431f2209457a0263aa7cb70f495a68d23f734ba384d3" + "osImageURL": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c879c26bc14a2347d8a37dd677601cc0d012f8ce89807fce98abe043a509e5b8" [7]: {container_name="machine-config-operator",invoker=~".*/1366447972245573632",pod_name="machine-config-operator-6d7485cd55-6v46c"} https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Bcontainer_name%3D%5C%22machine-config-operator%5C%22,invoker%3D~%5C%22.*%2F1366447972245573632%5C%22,pod_name%3D%5C%22machine-config-operator-6d7485cd55-6v46c%5C%22%7D%22%7D%5D 2021-03-01 18:55:19 I0301 18:55:19.317356 1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config ...warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition 2021-03-01 18:55:19 I0301 18:55:19.941596 1 operator.go:255] Starting MachineConfigOperator 2021-03-01 18:55:19 I0301 18:55:19.946063 1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"8da8fd3c-184e-4218-9db6-99ffd24675ce", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.8.0-0.nightly-2021-03-01-143026}] to [{operator 4.8.0-0.nightly-2021-03-01-175301}] 2021-03-01 18:55:19 W0301 18:55:19.964905 1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition 2021-03-01 18:55:25 I0301 18:55:25.092948 1 sync.go:569] syncing Required MachineConfigPools 2021-03-01 18:55:26 I0301 18:55:26.145255 1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"8da8fd3c-184e-4218-9db6-99ffd24675ce", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 4.8.0-0.nightly-2021-03-01-143026}] to [{operator 4.8.0-0.nightly-2021-03-01-175301}] [8]: https://search.ci.openshift.org/?search=the+%22master%22+pool+should+be+updated+before+the+CVO+reports+available+at+the+new+version&maxAge=12h
Mentioning the test-case this fails, to get this bug linked from Sippy (as one of potentially many causes for that test-case to fail).
*** This bug has been marked as a duplicate of bug 1955929 ***