Bug 2050946 - Cluster-version operator fails to notice TechPreviewNoUpgrade featureSet change after initialization-lookup error
Summary: Cluster-version operator fails to notice TechPreviewNoUpgrade featureSet chan...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.11.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 2053359
TreeView+ depends on / blocked
 
Reported: 2022-02-05 07:10 UTC by W. Trevor King
Modified: 2022-08-10 10:47 UTC (History)
0 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2053359 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:47:24 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 736 0 None Merged Bug 2050946: pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState 2022-02-10 19:00:27 UTC
Github openshift cluster-version-operator pull 739 0 None Merged Bug 2050946: Fix wrong informer for feature-gate-stopper 2022-02-11 04:28:13 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:47:40 UTC

Description W. Trevor King 2022-02-05 07:10:43 UTC
Poking around in rc.1 to look at the tech-preview, cluster-API bits:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.10.0-rc.1-x86_64
  Extracted release payload from digest sha256:eadfb8892c466438995e979f9a112c454286b4c0888eeb90da19354e8410fb24 created at 2022-02-04T20:00:57Z
  $ jq . manifests/image-references | jq -r '.spec.tags[] | select(.name | contains("cluster-api")) | .name + " " + .from.name + " " + .annotations["io.openshift.build.source-location"]'
  cluster-api quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c https://github.com/openshift/cluster-api
  $ grep -r a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c manifests
  manifests/image-references:          "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c"

nothing consuming that image?  Anyhow, there are some tech-preview manifests:

  $ grep -r feature-gate manifests | grep namespace
  manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml:    release.openshift.io/feature-gate: "TechPreviewNoUpgrade"

which seems to be from [1].

  $ grep name: manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml
    name: openshift-cluster-api

Looking in a recent tech-preview CI job [2], I don't see that namespace:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep cluster
  openshift-cluster-csi-drivers
  ...

Because the CVO doen't think it's a tech-preview cluster?

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep tech
  I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

Checking the gate:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar tzv | grep featuregates.yaml
  -rw------- 1015690000/root     594 2022-02-04 02:59 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-90979cdd2bba6c5a63b62d3fa2cc3aca15e3e7c684198b40e915be82f4fdd175/cluster-scoped-resources/config.openshift.io/featuregates.yaml
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-90979cdd2bba6c5a63b62d3fa2cc3aca15e3e7c684198b40e915be82f4fdd175/cluster-scoped-resources/config.openshift.io/featuregates.yaml | yaml2json | jq -r '.items[] | {creationTimestamp: .metadata.creationTimestamp, featureSet: .spec.featureSet}'
  {
    "creationTimestamp": "2022-02-04T10:05:23Z",
    "featureSet": "TechPreviewNoUpgrade"
  }

Aha, it's because we are edge triggered, but we're missing the initial state sync:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
  W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
  I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

Not clear to me why our informer logic doesn't eventually resync and let us know it was confused.  Perhaps we need some sort of post-cache-sync check before we enter the event-driven loop?  Or, if we can't figure that graceful approach out, we could revert [3], and just crash-loop until the initial FeatureGate fetch succeeds.

[1]: https://github.com/openshift/cluster-capi-operator/blob/ef673ccce35453a8ae8a9ef9016e322c93d8c03a/manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
[3]: https://github.com/openshift/cluster-version-operator/commit/90b14542aed4134eb4054780f8550191dde01381

Comment 4 W. Trevor King 2022-02-11 04:27:35 UTC
Fix landed in [1], which ran TechPreview CI like [2].  Checking its CVO logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7785c85b54-l9t4s_cluster-version-operator_previous.log | grep 'ClusterVersionOperator\|Error getting featuregate value\|techpreview'
I0211 02:11:23.577531       1 start.go:23] ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82
W0211 02:11:23.579038       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
I0211 02:30:49.795387       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
I0211 02:30:49.795936       1 techpreviewchangestopper.go:78] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
I0211 02:30:49.796010       1 techpreviewchangestopper.go:109] Shutting down stop-on-techpreview-change controller
I0211 02:30:49.795882       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 3m8.377710572s
I0211 02:30:49.796588       1 start.go:282] Collected stop-on-techpreview-change controller goroutine.
I0211 02:30:49.807000       1 cvo.go:440] Shutting down ClusterVersionOperator
I0211 02:30:49.814421       1 start.go:31] Graceful shutdown complete for ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82.

So that's the initial control-plane CVO, and hit the early-install Kube-API hiccup, assumed the gate was off, Kube-API came up, CVO got a lease, launched controllers, noticed the gate was actually TechPreviewNoUpgrade, and gracefully shutdown.  Lovely.  When the replacement container comes up:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7785c85b54-l9t4s_cluster-version-operator.log | grep 'ClusterVersionOperator\|Error getting featuregate value\|techpreview'
I0211 02:30:50.795928       1 start.go:23] ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82
I0211 02:30:51.413209       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true.
I0211 02:30:51.413803       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 2m15.086131489s

So it noticed when it was starting up that it was TechPreviewNoUpgrade, no reboot required.  Also lovely.  Checking the namespace:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep cluster
openshift-cluster-api
openshift-cluster-csi-drivers
...

Hooray :).  And the ClusterOperator for good measure:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cluster-api").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-11T02:09:50Z Available=True AsExpected: Cluster CAPI Operator is available at 4.11.0-0.nightly-2022-02-11-014337
2022-02-11T02:09:50Z Progressing=False AsExpected: 
2022-02-11T02:09:50Z Degraded=False AsExpected: 
2022-02-11T02:09:50Z Upgradeable=True AsExpected: 

Hooray :).  Looks verified to me, but I'll wait for a second set of eyes from QE :)

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-02-11-014337
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488

Comment 6 liujia 2022-03-07 08:57:15 UTC
From the description, the TechPreviewNoUpgrade featureset is created during cluster's initialization stage. But the cluster is not accessible for users before the installation finish(this should be the only supported way).

So I tried to check if it could be reproduced by following way based on a normal cluster(getting the featureset enabled before cvo sync).
1. scale down cvo to stop the sync.
2. create TechPreviewNoUpgrade featureset
# ./oc get featuregate cluster -oyaml|grep 'creationTimestamp\|featureSet'
  creationTimestamp: "2022-03-07T02:21:20Z"
  featureSet: TechPreviewNoUpgrade
3. check that no tp operator installed even with TechPreviewNoUpgrade featureset
# ./oc get ns openshift-cluster-api
Error from server (NotFound): namespaces "openshift-cluster-api" not found
# ./oc get co cluster-api
Error from server (NotFound): clusteroperators.config.openshift.io "cluster-api" not found
4. scale up cvo to start the sync.
5. check that tp operator installed successfully.
# ./oc get co cluster-api
NAME          VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
cluster-api   4.10.2    True        False         False      10s     
# ./oc logs cluster-version-operator-567756b458-88hcg|grep TechPreviewNoUpgrade
I0307 03:18:22.338574       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true.

It works well, so can not reproduce the issue. Verified the bug through the ci job periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview as comment4.

Version:4.11.0-0.nightly-2022-03-04-063157

TechPreviewNoUpgrade featureset was created 06:46:22Z.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-e80ffaaae914124522da3865034d497aa44b64a52b0c70ff5e8d24925147bbbd/cluster-scoped-resources/config.openshift.io/featuregates.yaml | yaml2json | jq -r '.items[] | {creationTimestamp: .metadata.creationTimestamp, featureSet: .spec.featureSet}'
{
  "creationTimestamp": "2022-03-04T06:46:22Z",
  "featureSet": "TechPreviewNoUpgrade"
}

During initial stage, no tp operator installed due to kube-api is not reday even the TechPreviewNoUpgrade featureset is enabled.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-856d46654d-8pgsw_cluster-version-operator_previous.log|grep 'ClusterVersionOperator\|TechPreviewNoUpgrade\|featuregate'
I0304 06:51:25.859053       1 start.go:23] ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e
W0304 06:51:25.860400       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
...
I0304 07:04:24.490835       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
I0304 07:04:24.490869       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 2m16.675706003s
I0304 07:04:24.490903       1 techpreviewchangestopper.go:78] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
I0304 07:04:24.496164       1 cvo.go:440] Shutting down ClusterVersionOperator
I0304 07:04:24.502419       1 start.go:31] Graceful shutdown complete for ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e.

After initial finish, tp operator is installed successfully.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-856d46654d-8pgsw_cluster-version-operator.log|grep 'ClusterVersionOperator\|TechPreviewNoUpgrade\|featuregate'|head -n5
I0304 07:04:25.470420       1 start.go:23] ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e
I0304 07:04:25.995604       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true.
I0304 07:04:25.997350       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 3m16.4555211s
I0304 07:04:34.092211       1 sync_worker.go:824] Running sync for customresourcedefinition "featuregates.config.openshift.io" (42 of 814)
I0304 07:04:34.459782       1 sync_worker.go:824] Running sync for featuregate "cluster" (24 of 814)

From above, verified the bug.

Comment 9 errata-xmlrpc 2022-08-10 10:47:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069


Note You need to log in before you can comment on or make changes to this bug.