Bug 2050946

Summary: Cluster-version operator fails to notice TechPreviewNoUpgrade featureSet change after initialization-lookup error
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Cluster Version OperatorAssignee: W. Trevor King <wking>
Status: CLOSED ERRATA QA Contact: liujia <jiajliu>
Severity: high Docs Contact:
Priority: high    
Version: 4.10   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2053359 (view as bug list) Environment:
Last Closed: 2022-08-10 10:47:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2053359    

Description W. Trevor King 2022-02-05 07:10:43 UTC
Poking around in rc.1 to look at the tech-preview, cluster-API bits:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.10.0-rc.1-x86_64
  Extracted release payload from digest sha256:eadfb8892c466438995e979f9a112c454286b4c0888eeb90da19354e8410fb24 created at 2022-02-04T20:00:57Z
  $ jq . manifests/image-references | jq -r '.spec.tags[] | select(.name | contains("cluster-api")) | .name + " " + .from.name + " " + .annotations["io.openshift.build.source-location"]'
  cluster-api quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c https://github.com/openshift/cluster-api
  $ grep -r a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c manifests
  manifests/image-references:          "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c"

nothing consuming that image?  Anyhow, there are some tech-preview manifests:

  $ grep -r feature-gate manifests | grep namespace
  manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml:    release.openshift.io/feature-gate: "TechPreviewNoUpgrade"

which seems to be from [1].

  $ grep name: manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml
    name: openshift-cluster-api

Looking in a recent tech-preview CI job [2], I don't see that namespace:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep cluster
  openshift-cluster-csi-drivers
  ...

Because the CVO doen't think it's a tech-preview cluster?

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep tech
  I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

Checking the gate:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar tzv | grep featuregates.yaml
  -rw------- 1015690000/root     594 2022-02-04 02:59 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-90979cdd2bba6c5a63b62d3fa2cc3aca15e3e7c684198b40e915be82f4fdd175/cluster-scoped-resources/config.openshift.io/featuregates.yaml
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-90979cdd2bba6c5a63b62d3fa2cc3aca15e3e7c684198b40e915be82f4fdd175/cluster-scoped-resources/config.openshift.io/featuregates.yaml | yaml2json | jq -r '.items[] | {creationTimestamp: .metadata.creationTimestamp, featureSet: .spec.featureSet}'
  {
    "creationTimestamp": "2022-02-04T10:05:23Z",
    "featureSet": "TechPreviewNoUpgrade"
  }

Aha, it's because we are edge triggered, but we're missing the initial state sync:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
  W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
  I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

Not clear to me why our informer logic doesn't eventually resync and let us know it was confused.  Perhaps we need some sort of post-cache-sync check before we enter the event-driven loop?  Or, if we can't figure that graceful approach out, we could revert [3], and just crash-loop until the initial FeatureGate fetch succeeds.

[1]: https://github.com/openshift/cluster-capi-operator/blob/ef673ccce35453a8ae8a9ef9016e322c93d8c03a/manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
[3]: https://github.com/openshift/cluster-version-operator/commit/90b14542aed4134eb4054780f8550191dde01381

Comment 4 W. Trevor King 2022-02-11 04:27:35 UTC
Fix landed in [1], which ran TechPreview CI like [2].  Checking its CVO logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7785c85b54-l9t4s_cluster-version-operator_previous.log | grep 'ClusterVersionOperator\|Error getting featuregate value\|techpreview'
I0211 02:11:23.577531       1 start.go:23] ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82
W0211 02:11:23.579038       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
I0211 02:30:49.795387       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
I0211 02:30:49.795936       1 techpreviewchangestopper.go:78] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
I0211 02:30:49.796010       1 techpreviewchangestopper.go:109] Shutting down stop-on-techpreview-change controller
I0211 02:30:49.795882       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 3m8.377710572s
I0211 02:30:49.796588       1 start.go:282] Collected stop-on-techpreview-change controller goroutine.
I0211 02:30:49.807000       1 cvo.go:440] Shutting down ClusterVersionOperator
I0211 02:30:49.814421       1 start.go:31] Graceful shutdown complete for ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82.

So that's the initial control-plane CVO, and hit the early-install Kube-API hiccup, assumed the gate was off, Kube-API came up, CVO got a lease, launched controllers, noticed the gate was actually TechPreviewNoUpgrade, and gracefully shutdown.  Lovely.  When the replacement container comes up:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7785c85b54-l9t4s_cluster-version-operator.log | grep 'ClusterVersionOperator\|Error getting featuregate value\|techpreview'
I0211 02:30:50.795928       1 start.go:23] ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82
I0211 02:30:51.413209       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true.
I0211 02:30:51.413803       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 2m15.086131489s

So it noticed when it was starting up that it was TechPreviewNoUpgrade, no reboot required.  Also lovely.  Checking the namespace:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep cluster
openshift-cluster-api
openshift-cluster-csi-drivers
...

Hooray :).  And the ClusterOperator for good measure:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cluster-api").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-11T02:09:50Z Available=True AsExpected: Cluster CAPI Operator is available at 4.11.0-0.nightly-2022-02-11-014337
2022-02-11T02:09:50Z Progressing=False AsExpected: 
2022-02-11T02:09:50Z Degraded=False AsExpected: 
2022-02-11T02:09:50Z Upgradeable=True AsExpected: 

Hooray :).  Looks verified to me, but I'll wait for a second set of eyes from QE :)

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-02-11-014337
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488

Comment 6 liujia 2022-03-07 08:57:15 UTC
From the description, the TechPreviewNoUpgrade featureset is created during cluster's initialization stage. But the cluster is not accessible for users before the installation finish(this should be the only supported way).

So I tried to check if it could be reproduced by following way based on a normal cluster(getting the featureset enabled before cvo sync).
1. scale down cvo to stop the sync.
2. create TechPreviewNoUpgrade featureset
# ./oc get featuregate cluster -oyaml|grep 'creationTimestamp\|featureSet'
  creationTimestamp: "2022-03-07T02:21:20Z"
  featureSet: TechPreviewNoUpgrade
3. check that no tp operator installed even with TechPreviewNoUpgrade featureset
# ./oc get ns openshift-cluster-api
Error from server (NotFound): namespaces "openshift-cluster-api" not found
# ./oc get co cluster-api
Error from server (NotFound): clusteroperators.config.openshift.io "cluster-api" not found
4. scale up cvo to start the sync.
5. check that tp operator installed successfully.
# ./oc get co cluster-api
NAME          VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
cluster-api   4.10.2    True        False         False      10s     
# ./oc logs cluster-version-operator-567756b458-88hcg|grep TechPreviewNoUpgrade
I0307 03:18:22.338574       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true.

It works well, so can not reproduce the issue. Verified the bug through the ci job periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview as comment4.

Version:4.11.0-0.nightly-2022-03-04-063157

TechPreviewNoUpgrade featureset was created 06:46:22Z.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-e80ffaaae914124522da3865034d497aa44b64a52b0c70ff5e8d24925147bbbd/cluster-scoped-resources/config.openshift.io/featuregates.yaml | yaml2json | jq -r '.items[] | {creationTimestamp: .metadata.creationTimestamp, featureSet: .spec.featureSet}'
{
  "creationTimestamp": "2022-03-04T06:46:22Z",
  "featureSet": "TechPreviewNoUpgrade"
}

During initial stage, no tp operator installed due to kube-api is not reday even the TechPreviewNoUpgrade featureset is enabled.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-856d46654d-8pgsw_cluster-version-operator_previous.log|grep 'ClusterVersionOperator\|TechPreviewNoUpgrade\|featuregate'
I0304 06:51:25.859053       1 start.go:23] ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e
W0304 06:51:25.860400       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
...
I0304 07:04:24.490835       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
I0304 07:04:24.490869       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 2m16.675706003s
I0304 07:04:24.490903       1 techpreviewchangestopper.go:78] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
I0304 07:04:24.496164       1 cvo.go:440] Shutting down ClusterVersionOperator
I0304 07:04:24.502419       1 start.go:31] Graceful shutdown complete for ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e.

After initial finish, tp operator is installed successfully.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-856d46654d-8pgsw_cluster-version-operator.log|grep 'ClusterVersionOperator\|TechPreviewNoUpgrade\|featuregate'|head -n5
I0304 07:04:25.470420       1 start.go:23] ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e
I0304 07:04:25.995604       1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true.
I0304 07:04:25.997350       1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 3m16.4555211s
I0304 07:04:34.092211       1 sync_worker.go:824] Running sync for customresourcedefinition "featuregates.config.openshift.io" (42 of 814)
I0304 07:04:34.459782       1 sync_worker.go:824] Running sync for featuregate "cluster" (24 of 814)

From above, verified the bug.

Comment 9 errata-xmlrpc 2022-08-10 10:47:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069