Bug 1933897 - MCO required-pool check for MachineConfig version too broad: cluster operator bumped before renderedconfig generated
Summary: MCO required-pool check for MachineConfig version too broad: cluster operator...
Keywords:
Status: CLOSED DUPLICATE of bug 1955929
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.8.0
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-03-02 00:33 UTC by Kirsten Garrison
Modified: 2021-05-25 20:05 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
Last Closed: 2021-05-25 20:05:16 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description Kirsten Garrison 2021-03-02 00:33:47 UTC
Description of problem:

During an upgrade, the MCO cluster version is bumped before a rendered config is generated.

Working theory is that this has to do with preexisting function: isMachineConfigPoolConfigurationValid()L
https://github.com/openshift/machine-config-operator/blob/4245fea0c3c16a49c68207049b3a9a245d8b53b3/pkg/operator/status.go#L356 

and probably has always existed just got exposed recently due to the work done in 4.7

What we see happening:
1. OCP update triggered
2. CVO bumps the image ConfigMap
3. Outgoing MCC leaves it alone; thanks version-mismatch guard (https://github.com/openshift/machine-config-operator/commit/60c5ee214b5220e5499c1b687594dcf1a16ed9e8#diff-8047e5e8697bdac51[…]922c47ca937f03508f24fa)
4. MCO grabs the pool, sees version.Hash match (MCC commit same before and after this particular OCP update), assumes that means the rendered config is up to date :bug:.  Sees Updated=True.  Bumps ClusterOperator version
5. everything else starts rolling out to the pools after the bump

Comment 1 Kirsten Garrison 2021-03-02 00:36:26 UTC
This was discovered when working on: https://bugzilla.redhat.com/show_bug.cgi?id=1932105

We have a PR up adding logging to confirm root cause: https://github.com/openshift/machine-config-operator/pull/2444

Comment 2 W. Trevor King 2021-03-02 01:30:30 UTC
Example job where we saw this [1], which had the fix from bug 1932105.  Timeline:

1. 18:14Z, machine-config-operator-65985fcd74-jd724 exists [2].
2. 18:15, machine-config-controller-7b69b565bf-wh95z exists [2].
3. 18:53:20.484613, CVO bumps the MCO Deployment [3].
4. 18:53:20.506900, CVO bumps the machine-config-osimageurl ConfigMap [4].
5. 18:53:20.530438, first reference to machine-config-operator-6d7485cd55-6v46c in the kubelet logs [2].
6. 18:53:25.408063, last reference to machine-config-operator-65985fcd74-jd724 in kubelet logs [2].
7. 18:55:19.317356, machine-config-operator-6d7485cd55-6v46c acquires the leader lease [7].
8. 18:55:26.145255, machine-config-operator-6d7485cd55-6v46c bumps ClusterOperator version BUG [7].
9. 18:55:28, new rendered master [5], with a new osImageURL [6] (pulled from the machine-config-osimageurl ConfigMap).
10. 18:55:42.744856, machine-config-operator-6d7485cd55-bd2zd shows up in kubelet logs [2].
11. 18:55:51.148309, last reference to machine-config-operator-6d7485cd55-6v46c in kubelet logs [2].
12. 18:59:58.169961, first reference to machine-config-controller-7b69b565bf-zbjf5 in the kubelet logs [2].
13. 19:00:05.096181, last reference to machine-config-controller-7b69b565bf-wh95z in the kubelet logs [2].

So seems like there was no need to bump the MCC, because the MCO commit hash, base image, and other bits that feed the image digest had not changed.  Took a bit for the MCC to render the new MachineConfig after the machine-config-osimageurl ConfigMap bump; we're not entirely clear on why.  The MCO came through just before the new MachineConfig was rendered and said "yeah, looks like the 'master' pool is up to date", and leveled the ClusterOperator, not realizing that seconds later a new MachineConfig would come out with the new RHCOS.

Distinguishing between this failure mode and bug 1932105 seems hard.  Probably best to fix them both, and then check CI-search [8] and confirm that it is empty.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632
[2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632/artifacts/e2e-aws-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xz
     $ cd quay-io-*
     $ grep '^Mar 01 18:5' host_service_logs/masters/kubelet_service.log | grep -o 'machine-config-\(controller\|operator\)-[0-9a-z]*-[0-9a-z]*' | sort | uniq | while read POD; do echo "${POD}"; LOGS="$(grep "${POD}" host_service_logs/masters/kubelet_service.log | sed 's/^/  /')"; echo "${LOGS}" | head -n1; echo "${LOGS}" | tail -n1; done
     machine-config-controller-7b69b565bf-wh95z
       Mar 01 18:15:46.256012 ip-10-0-131-96 hyperkube[1438]: I0301 18:15:46.255704    1438 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-controller-7b69b565bf-wh95z_openshift-machine-config-operator(13f9f6de-6f02-4e8f-a6dc-dd74aa61415f)"
       Mar 01 19:00:05.096181 ip-10-0-131-96 hyperkube[1438]: I0301 19:00:05.096036    1438 kubelet.go:2120] Failed to delete pod "machine-config-controller-7b69b565bf-wh95z_openshift-machine-config-operator(13f9f6de-6f02-4e8f-a6dc-dd74aa61415f)", err: pod not found
     machine-config-controller-7b69b565bf-zbjf5
       Mar 01 18:59:58.169961 ip-10-0-176-61 hyperkube[1525]: I0301 18:59:58.169773    1525 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-controller-7b69b565bf-zbjf5_openshift-machine-config-operator(02991a83-f0ec-4aec-b926-c9f0cde18d01)"
       Mar 01 19:20:02.347659 ip-10-0-176-61 hyperkube[1525]: I0301 19:20:02.347632    1525 logs.go:317] Finish parsing log file "/var/log/pods/openshift-machine-config-operator_machine-config-controller-7b69b565bf-zbjf5_02991a83-f0ec-4aec-b926-c9f0cde18d01/machine-config-controller/0.log"
     machine-config-controller-dockercfg-jsclw
       Mar 01 18:59:58.172216 ip-10-0-176-61 hyperkube[1525]: I0301 18:59:58.171873    1525 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-machine-config-operator"/"machine-config-controller-dockercfg-jsclw"
       Mar 01 18:59:58.172216 ip-10-0-176-61 hyperkube[1525]: I0301 18:59:58.171873    1525 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-machine-config-operator"/"machine-config-controller-dockercfg-jsclw"
     machine-config-controller-token-f5mml
       Mar 01 18:15:46.256230 ip-10-0-131-96 hyperkube[1438]: I0301 18:15:46.256209    1438 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-machine-config-operator"/"machine-config-controller-token-f5mml"
       Mar 01 19:00:05.096181 ip-10-0-131-96 hyperkube[1438]: I0301 19:00:05.096089    1438 reflector.go:225] Stopping reflector *v1.Secret (0s) from object-"openshift-machine-config-operator"/"machine-config-controller-token-f5mml"
     machine-config-operator-65985fcd74-jd724
       Mar 01 18:14:51.271796 ip-10-0-131-96 hyperkube[1438]: I0301 18:14:51.271763    1438 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-operator-65985fcd74-jd724_openshift-machine-config-operator(002bf91d-228d-4359-a8dc-a4201dc16112)"
       Mar 01 18:53:25.408063 ip-10-0-131-96 hyperkube[1438]: I0301 18:53:25.407957    1438 kubelet.go:2120] Failed to delete pod "machine-config-operator-65985fcd74-jd724_openshift-machine-config-operator(002bf91d-228d-4359-a8dc-a4201dc16112)", err: pod not found
     machine-config-operator-6d7485cd55-6v46c
       Mar 01 18:53:20.530438 ip-10-0-176-61 hyperkube[1451]: I0301 18:53:20.530406    1451 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-operator-6d7485cd55-6v46c_openshift-machine-config-operator(f823ee29-733c-42ee-a767-0584085267bc)"
       Mar 01 18:55:51.148309 ip-10-0-176-61 hyperkube[1451]: I0301 18:55:51.148261    1451 kubelet.go:2120] Failed to delete pod "machine-config-operator-6d7485cd55-6v46c_openshift-machine-config-operator(f823ee29-733c-42ee-a767-0584085267bc)", err: pod not found
     machine-config-operator-6d7485cd55-bd2zd
       Mar 01 18:55:42.744856 ip-10-0-131-96 hyperkube[1438]: I0301 18:55:42.731755    1438 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-operator-6d7485cd55-bd2zd_openshift-machine-config-operator(a77662f0-2a3b-4722-a8cd-da27d192d209)"
       Mar 01 19:00:05.227097 ip-10-0-131-96 hyperkube[1438]: I0301 19:00:05.226955    1438 kubelet.go:2120] Failed to delete pod "machine-config-operator-6d7485cd55-bd2zd_openshift-machine-config-operator(a77662f0-2a3b-4722-a8cd-da27d192d209)", err: pod not found
     machine-config-operator-6d7485cd55-xgntn
       Mar 01 18:59:58.272571 ip-10-0-176-61 hyperkube[1525]: I0301 18:59:58.272549    1525 kubelet.go:1920] SyncLoop (ADD, "api"): "machine-config-operator-6d7485cd55-xgntn_openshift-machine-config-operator(0b5c7d83-1b80-4011-a279-3d28c56014bb)"
       Mar 01 19:20:02.554286 ip-10-0-176-61 hyperkube[1525]: I0301 19:20:02.554261    1525 logs.go:317] Finish parsing log file "/var/log/pods/openshift-machine-config-operator_machine-config-operator-6d7485cd55-xgntn_0b5c7d83-1b80-4011-a279-3d28c56014bb/machine-config-operator/0.log"
[3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632/artifacts/e2e-aws-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar xz
     $ cd quay-io-*
     $ zgrep -h '"verb":"update".*"resource":"deployments".*"name":"machine-config-operator"' audit_logs/*/*.gz 2>/dev/null | jq -r '.stageTimestamp + " " + .user.username' | sort | grep cluster-version
     ...
     2021-03-01T18:36:44.179249Z system:serviceaccount:openshift-cluster-version:default
     2021-03-01T18:53:20.484613Z system:serviceaccount:openshift-cluster-version:default
     2021-03-01T19:09:13.465341Z system:serviceaccount:openshift-cluster-version:default
     ...
[4]: In the audit log dir from [3]:
     $ zgrep -h '"verb":"update".*"name":"machine-config-osimageurl"' audit_logs/*/*.gz 2>/dev/null | jq -r '.stageTimestamp + " " + .user.username' | sort
     2021-03-01T18:53:20.506900Z system:serviceaccount:openshift-cluster-version:default
[5]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | sort
     ...
     2021-03-01T18:15:49Z 99-worker-generated-registries
     2021-03-01T18:15:54Z rendered-master-07c510edac4d5ef5cff3307709d3c444
     2021-03-01T18:15:54Z rendered-worker-d7f92e25364c243bc7e49131b2d574eb
     2021-03-01T18:55:28Z rendered-master-4ef29138db42bdfe5d43b28d7f77798f
     2021-03-01T18:55:28Z rendered-worker-c7448314bb99bafccae83da0b28ff9a9
[6]: $ MCS="$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1366447972245573632/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json)"
     $ diff -U0 <(echo "${MCS}" | jq '.items[] | select(.metadata.name == "rendered-master-07c510edac4d5ef5cff3307709d3c444
").spec') <(echo "${MCS}" | jq '.items[] | select(.metadata.name == "rendered-master-4ef29138db42bdfe5d43b28d7f77798f").spec')
    --- /dev/fd/63  2021-03-01 17:16:00.425435297 -0800
    +++ /dev/fd/62  2021-03-01 17:16:00.427435297 -0800
    @@ -374 +374 @@
    -  "osImageURL": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96be060a7824bed1eae6431f2209457a0263aa7cb70f495a68d23f734ba384d3"
    +  "osImageURL": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c879c26bc14a2347d8a37dd677601cc0d012f8ce89807fce98abe043a509e5b8"
[7]: {container_name="machine-config-operator",invoker=~".*/1366447972245573632",pod_name="machine-config-operator-6d7485cd55-6v46c"}
     https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Bcontainer_name%3D%5C%22machine-config-operator%5C%22,invoker%3D~%5C%22.*%2F1366447972245573632%5C%22,pod_name%3D%5C%22machine-config-operator-6d7485cd55-6v46c%5C%22%7D%22%7D%5D
     2021-03-01 18:55:19	I0301 18:55:19.317356       1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config
     ...warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
     2021-03-01 18:55:19	I0301 18:55:19.941596       1 operator.go:255] Starting MachineConfigOperator
     2021-03-01 18:55:19	I0301 18:55:19.946063       1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"8da8fd3c-184e-4218-9db6-99ffd24675ce", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.8.0-0.nightly-2021-03-01-143026}] to [{operator 4.8.0-0.nightly-2021-03-01-175301}]
     2021-03-01 18:55:19	W0301 18:55:19.964905       1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
     2021-03-01 18:55:25	I0301 18:55:25.092948       1 sync.go:569] syncing Required MachineConfigPools
     2021-03-01 18:55:26	I0301 18:55:26.145255       1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"8da8fd3c-184e-4218-9db6-99ffd24675ce", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 4.8.0-0.nightly-2021-03-01-143026}] to [{operator 4.8.0-0.nightly-2021-03-01-175301}]
[8]: https://search.ci.openshift.org/?search=the+%22master%22+pool+should+be+updated+before+the+CVO+reports+available+at+the+new+version&maxAge=12h

Comment 4 W. Trevor King 2021-03-09 17:56:09 UTC
Mentioning the test-case this fails, to get this bug linked from Sippy (as one of potentially many causes for that test-case to fail).

Comment 5 Kirsten Garrison 2021-05-25 20:05:16 UTC

*** This bug has been marked as a duplicate of bug 1955929 ***


Note You need to log in before you can comment on or make changes to this bug.