Bug 2053359

Summary: Cluster-version operator fails to notice TechPreviewNoUpgrade featureSet change after initialization-lookup error
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Cluster Version OperatorAssignee: W. Trevor King <wking>
Status: CLOSED ERRATA QA Contact: liujia <jiajliu>
Severity: high Docs Contact:
Priority: high    
Version: 4.10CC: aos-bugs, jiajliu
Target Milestone: ---   
Target Release: 4.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2050946 Environment:
Last Closed: 2022-03-21 12:40:05 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2050946    
Bug Blocks:    

Description W. Trevor King 2022-02-11 04:30:23 UTC
+++ This bug was initially created as a clone of Bug #2050946 +++

Poking around in rc.1 to look at the tech-preview, cluster-API bits:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.10.0-rc.1-x86_64
  Extracted release payload from digest sha256:eadfb8892c466438995e979f9a112c454286b4c0888eeb90da19354e8410fb24 created at 2022-02-04T20:00:57Z
  $ jq . manifests/image-references | jq -r '.spec.tags[] | select(.name | contains("cluster-api")) | .name + " " + .from.name + " " + .annotations["io.openshift.build.source-location"]'
  cluster-api quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c https://github.com/openshift/cluster-api
  $ grep -r a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c manifests
  manifests/image-references:          "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c"

nothing consuming that image?  Anyhow, there are some tech-preview manifests:

  $ grep -r feature-gate manifests | grep namespace
  manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml:    release.openshift.io/feature-gate: "TechPreviewNoUpgrade"

which seems to be from [1].

  $ grep name: manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml
    name: openshift-cluster-api

Looking in a recent tech-preview CI job [2], I don't see that namespace:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep cluster
  openshift-cluster-csi-drivers
  ...

Because the CVO doen't think it's a tech-preview cluster?

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep tech
  I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

Checking the gate:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar tzv | grep featuregates.yaml
  -rw------- 1015690000/root     594 2022-02-04 02:59 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-90979cdd2bba6c5a63b62d3fa2cc3aca15e3e7c684198b40e915be82f4fdd175/cluster-scoped-resources/config.openshift.io/featuregates.yaml
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-90979cdd2bba6c5a63b62d3fa2cc3aca15e3e7c684198b40e915be82f4fdd175/cluster-scoped-resources/config.openshift.io/featuregates.yaml | yaml2json | jq -r '.items[] | {creationTimestamp: .metadata.creationTimestamp, featureSet: .spec.featureSet}'
  {
    "creationTimestamp": "2022-02-04T10:05:23Z",
    "featureSet": "TechPreviewNoUpgrade"
  }

Aha, it's because we are edge triggered, but we're missing the initial state sync:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
  W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
  I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

Not clear to me why our informer logic doesn't eventually resync and let us know it was confused.  Perhaps we need some sort of post-cache-sync check before we enter the event-driven loop?  Or, if we can't figure that graceful approach out, we could revert [3], and just crash-loop until the initial FeatureGate fetch succeeds.

[1]: https://github.com/openshift/cluster-capi-operator/blob/ef673ccce35453a8ae8a9ef9016e322c93d8c03a/manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
[3]: https://github.com/openshift/cluster-version-operator/commit/90b14542aed4134eb4054780f8550191dde01381

--- Additional comment from W. Trevor King on 2022-02-11 04:27:35 UTC ---

Fix landed in [1], which ran TechPreview CI like [2].  Checking its CVO logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7785c85b54-l9t4s_cluster-version-operator_previous.log | grep 'ClusterVersionOperator\|Error getting featuregate value\|techpreview'
I0211 02:11:23.577531       1 start.go:23] ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82
W0211 02:11:23.579038       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
I0211 02:30:49.795387       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
I0211 02:30:49.795936       1 techpreviewchangestopper.go:78] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
I0211 02:30:49.796010       1 techpreviewchangestopper.go:109] Shutting down stop-on-techpreview-change controller
I0211 02:30:49.795882       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 3m8.377710572s
I0211 02:30:49.796588       1 start.go:282] Collected stop-on-techpreview-change controller goroutine.
I0211 02:30:49.807000       1 cvo.go:440] Shutting down ClusterVersionOperator
I0211 02:30:49.814421       1 start.go:31] Graceful shutdown complete for ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82.

So that's the initial control-plane CVO, and hit the early-install Kube-API hiccup, assumed the gate was off, Kube-API came up, CVO got a lease, launched controllers, noticed the gate was actually TechPreviewNoUpgrade, and gracefully shutdown.  Lovely.  When the replacement container comes up:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7785c85b54-l9t4s_cluster-version-operator.log | grep 'ClusterVersionOperator\|Error getting featuregate value\|techpreview'
I0211 02:30:50.795928       1 start.go:23] ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82
I0211 02:30:51.413209       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true.
I0211 02:30:51.413803       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 2m15.086131489s

So it noticed when it was starting up that it was TechPreviewNoUpgrade, no reboot required.  Also lovely.  Checking the namespace:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep cluster
openshift-cluster-api
openshift-cluster-csi-drivers
...

Hooray :).  And the ClusterOperator for good measure:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cluster-api").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-11T02:09:50Z Available=True AsExpected: Cluster CAPI Operator is available at 4.11.0-0.nightly-2022-02-11-014337
2022-02-11T02:09:50Z Progressing=False AsExpected: 
2022-02-11T02:09:50Z Degraded=False AsExpected: 
2022-02-11T02:09:50Z Upgradeable=True AsExpected: 

Hooray :).  Looks verified to me, but I'll wait for a second set of eyes from QE :)

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-02-11-014337
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488

Comment 3 liujia 2022-03-14 06:16:01 UTC
Version:4.10.0-0.nightly-2022-03-11-131656

TechPreviewNoUpgrade featureset was created 13:40:00Z.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1502274584513089536/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-de53ed4750def9ed611ca460c780cc78eece525162747213cfdbb4c26687460c/cluster-scoped-resources/config.openshift.io/featuregates.yaml | yaml2json | jq -r '.items[] | {creationTimestamp: .metadata.creationTimestamp, featureSet: .spec.featureSet}'
{
  "creationTimestamp": "2022-03-11T13:40:00Z",
  "featureSet": "TechPreviewNoUpgrade"
}

During initial stage, no tp operator installed due to kube-api is not reday even the TechPreviewNoUpgrade featureset is enabled.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1502274584513089536/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-68577f66bd-vkl26_cluster-version-operator_previous.log|grep 'ClusterVersionOperator\|TechPreviewNoUpgrade\|featuregate'
I0311 13:42:34.298586       1 start.go:23] ClusterVersionOperator 4.10.0-202203110747.p0.g50246cb.assembly.stream-50246cb
W0311 13:42:34.299262       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
...
I0311 13:47:30.989340       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
I0311 13:47:30.989386       1 techpreviewchangestopper.go:78] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
I0311 13:47:30.989471       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 2m15.951374588s
I0311 13:47:30.995186       1 cvo.go:440] Shutting down ClusterVersionOperator
I0311 13:47:31.000454       1 start.go:31] Graceful shutdown complete for ClusterVersionOperator 4.10.0-202203110747.p0.g50246cb.assembly.stream-50246cb.

After initial finish, tp operator is installed successfully.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1502274584513089536/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-68577f66bd-vkl26_cluster-version-operator.log|grep 'ClusterVersionOperator\|TechPreviewNoUpgrade\|featuregate'|head -n7
I0311 13:47:32.320511       1 start.go:23] ClusterVersionOperator 4.10.0-202203110747.p0.g50246cb.assembly.stream-50246cb
I0311 13:47:32.927407       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true.
I0311 13:47:32.932766       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 3m57.431406309s
I0311 13:47:41.589515       1 sync_worker.go:772] Running sync for customresourcedefinition "featuregates.config.openshift.io" (41 of 835)
I0311 13:47:41.637945       1 sync_worker.go:784] Done syncing for customresourcedefinition "featuregates.config.openshift.io" (41 of 835)
I0311 13:47:42.536406       1 sync_worker.go:772] Running sync for featuregate "cluster" (24 of 835)
I0311 13:47:43.508424       1 sync_worker.go:784] Done syncing for featuregate "cluster" (24 of 835)

Comment 6 errata-xmlrpc 2022-03-21 12:40:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.5 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0928