Bug 1861431 - Cannot upgrade a cluster when adding Performance Profile Operator
Summary: Cannot upgrade a cluster when adding Performance Profile Operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1862156
Reported: 2020-07-28 15:34 UTC by Yolanda Robla
Modified: 2020-10-27 16:21 UTC

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: A bug in the feature gate upgradeability logic.
Consequence: The CVO was marking the cluster as not upgradeable when LatencySensitive FeatureGate was in use.
Workaround (if any): Force the upgrade to a version that has this bug fixed.
Result: Upgrade is performed and the upgraded version includes this bug fix so CVO no longer treats LatencySensitive FeatureGate as blocking for upgrades.
Clone Of:
: 1862156
Environment:
Last Closed: 2020-10-27 16:21:16 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-kube-apiserver-operator pull 920 | 0 | None | closed | Bug 1861431: LatencySensitive feature gate allows upgrades | 2021-01-26 14:58:32 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:21:40 UTC

Description Yolanda Robla 2020-07-28 15:34:39 UTC
Description of problem:

When deploying a cluster with OpenShift 4.4 or later, a Performance Profile Addon operator can be applied to get a realtime kernel, CPU pinning, etc.: https://github.com/openshift-kni/performance-addon-operators

This works fine, but it has a side effect on the cluster: it blocks upgrades. As soon as I apply this operator, I receive this error on ClusterVersion:

  - lastTransitionTime: "2020-07-28T08:48:15Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: FeatureGatesUpgradeable:
      "LatencySensitive" does not allow updates'
    reason: FeatureGates_RestrictedFeatureGates_LatencySensitive
    status: "False"
    type: Upgradeable

This seems to be caused by https://github.com/openshift/cluster-kube-apiserver-operator/blob/f73bebb6361c3649dab5305d8c7d1cd9753e61aa/pkg/operator/featureupgradablecontroller/feature_upgradeable_controller.go#L18 . It needs to list "LatencySensitive" on that line, to allow upgrades of this component.

This bug applies to 4.4 and later.
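
The allow-list check suspected above can be illustrated with a minimal sketch. This is not the operator's Go code: the function name, the set name, and the shape of the returned condition are assumptions made for illustration, and only the reason and message strings are copied from the ClusterVersion error in this report.

```python
from typing import Dict, Set

# Assumption: feature sets that do not block upgrades. The empty string stands
# for the default feature set; "LatencySensitive" models the requested fix.
FEATURE_SETS_ALLOWING_UPGRADE: Set[str] = {"", "LatencySensitive"}


def upgradeable_condition(
    feature_set: str,
    allowed: Set[str] = FEATURE_SETS_ALLOWING_UPGRADE,
) -> Dict[str, str]:
    """Return an Upgradeable-style condition for the given feature set."""
    if feature_set in allowed:
        return {"type": "Upgradeable", "status": "True"}
    # Reason and message mimic the strings seen on ClusterVersion in this bug.
    return {
        "type": "Upgradeable",
        "status": "False",
        "reason": "FeatureGates_RestrictedFeatureGates_" + feature_set,
        "message": 'FeatureGatesUpgradeable: "%s" does not allow updates' % feature_set,
    }


# Allow-list without LatencySensitive (the behavior reported here).
print(upgradeable_condition("LatencySensitive", allowed={""}))
# Allow-list that includes LatencySensitive (the behavior being requested).
print(upgradeable_condition("LatencySensitive"))
```

With the narrower allow-list the sketch reports status "False", and with "LatencySensitive" included it reports status "True".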

Comment 1 Martin Sivák 2020-07-29 10:17:31 UTC
LatencySensitive should not be blocking release according to https://github.com/openshift/api/blob/7192180f496aab1f7659d8660fc360498bab498b/config/v1/types_feature.go#L38

This feature gate is needed for enabling TopologyManager in OCP 4.4 and is explicitly mentioned in the OCP docs here (with no warning about upgrades being blocked): https://docs.openshift.com/container-platform/4.4/scalability_and_performance/using-topology-manager.html#seting_up_topology_manager_using-topology-manager

Comment 2 Federico Simoncelli 2020-07-29 12:15:12 UTC
What's the next step here?

Should Yolanda reproduce this and give you access to the setup?
Martin, meanwhile, should you/QE try to reproduce and investigate this in parallel?

Comment 3 Martin Sivák 2020-07-29 12:41:11 UTC
We already know what happened. The next step is the kube-apiserver team reviewing our findings.

1. PAO enabled LatencySensitive FG
2. kube-apiserver set status.upgradeable to False as it does not recognize this feature gate as allowed
3. CVO interrogated all operators and noticed it can't upgrade

I believe step 2 is a bug, as another place in the sources explicitly says that the LatencySensitive FG does not block upgrades.
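
Steps 2 and 3 can be pictured with a small sketch of how a version operator might fold per-operator Upgradeable conditions into one cluster-level condition. This is an illustration only, not the CVO's implementation: the function name and data layout are assumptions, and the message wording follows the ClusterVersion output quoted in the description.

```python
from typing import Dict, List

Condition = Dict[str, str]


def aggregate_upgradeable(operators: Dict[str, List[Condition]]) -> Condition:
    """Combine each operator's Upgradeable condition into a single condition."""
    blockers = []
    for name in sorted(operators):
        for cond in operators[name]:
            if cond.get("type") == "Upgradeable" and cond.get("status") == "False":
                blockers.append(
                    "Cluster operator %s cannot be upgraded: %s"
                    % (name, cond.get("message", ""))
                )
    if not blockers:
        return {"type": "Upgradeable", "status": "True"}
    # A single operator reporting Upgradeable=False is enough to block.
    return {"type": "Upgradeable", "status": "False", "message": "\n".join(blockers)}


operators = {
    "etcd": [{"type": "Upgradeable", "status": "True"}],
    "kube-apiserver": [{
        "type": "Upgradeable",
        "status": "False",
        "message": 'FeatureGatesUpgradeable: "LatencySensitive" does not allow updates',
    }],
}
print(aggregate_upgradeable(operators)["message"])
```

The point of the sketch is that the cluster-level result only reflects what each operator reports, so the fix belongs in kube-apiserver's operator rather than in the CVO.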

Comment 5 Denys Shchedrivyi 2020-07-30 04:58:17 UTC
I want to add that the upgrade failed only between patch (z-stream) versions, for example from 4.4.10 to 4.4.15.

The upgrade between minor versions (4.4.z -> 4.5.z) completed successfully:

>    Initially I had 4.4.15
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.15    True        False         92m     Cluster version is 4.4.15

>    Start upgrading to 4.5.4
# oc adm upgrade --to-image "registry.svc.ci.openshift.org/ocp/release:4.5.4" --allow-explicit-upgrade --force
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.5.4

>    Upgrade in process
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.15    True        True          16m     Working towards 4.5.4: 76% complete

>    Successfully finished after ~2hrs
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.4     True        False         2m31s   Cluster version is 4.5.4

>    All operators are active and upgraded:
# oc get clusteroperator
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.4     True        False         False      170m
cloud-credential                           4.5.4     True        False         False      3h32m
cluster-autoscaler                         4.5.4     True        False         False      3h13m
config-operator                            4.5.4     True        False         False      76m
console                                    4.5.4     True        False         False      32m
csi-snapshot-controller                    4.5.4     True        False         False      38m
dns                                        4.5.4     True        False         False      3h20m
etcd                                       4.5.4     True        False         False      3h19m
image-registry                             4.5.4     True        False         False      3h14m
ingress                                    4.5.4     True        False         False      133m
insights                                   4.5.4     True        False         False      3h14m
kube-apiserver                             4.5.4     True        False         False      3h18m
kube-controller-manager                    4.5.4     True        False         False      3h19m
kube-scheduler                             4.5.4     True        False         False      3h18m
kube-storage-version-migrator              4.5.4     True        False         False      38m
machine-api                                4.5.4     True        False         False      3h14m
machine-approver                           4.5.4     True        False         False      63m
machine-config                             4.5.4     True        False         False      7m3s
marketplace                                4.5.4     True        False         False      37m
monitoring                                 4.5.4     True        False         False      7m7s
network                                    4.5.4     True        False         False      3h21m
node-tuning                                4.5.4     True        False         False      65m
openshift-apiserver                        4.5.4     True        False         False      26m
openshift-controller-manager               4.5.4     True        False         False      64m
openshift-samples                          4.5.4     True        False         False      65m
operator-lifecycle-manager                 4.5.4     True        False         False      3h21m
operator-lifecycle-manager-catalog         4.5.4     True        False         False      3h21m
operator-lifecycle-manager-packageserver   4.5.4     True        False         False      26m
service-ca                                 4.5.4     True        False         False      3h21m
storage                                    4.5.4     True        False         False      65m
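
For reference, the two upgrade paths mentioned in this comment differ in which version component changes. The helper below is a hypothetical sketch (its names are not part of any OpenShift tooling) that classifies an upgrade by comparing the x.y.z parts of the version strings and ignoring any pre-release suffix.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'x.y.z' or 'x.y.z-suffix' into integer components."""
    core = version.split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def upgrade_kind(current: str, target: str) -> str:
    """Name the most significant version component that changes."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt[0] != cur[0]:
        return "major"
    if tgt[1] != cur[1]:
        return "minor"
    if tgt[2] != cur[2]:
        return "patch"
    return "same"


print(upgrade_kind("4.4.10", "4.4.15"))
print(upgrade_kind("4.4.15", "4.5.4"))
```

Under this classification, 4.4.10 to 4.4.15 changes only the last component, while 4.4.15 to 4.5.4 changes the middle one.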

Comment 11 Ke Wang 2020-07-31 14:22:37 UTC
Verified with OCP 4.6.0-0.nightly-2020-07-31-080025; see the steps below.

$ oc edit featuregate/cluster

$ oc describe featuregate/cluster
Name:         cluster
Namespace:    
Labels:       <none>
Annotations:  release.openshift.io/create-only: true
API Version:  config.openshift.io/v1
Kind:         FeatureGate
...
Spec:
  Feature Set:  LatencySensitive
Events:         <none>

$ oc create -f topologymanager-kubeletconfig.yaml 
kubeletconfig.machineconfiguration.openshift.io/cpumanager-enabled created

$ oc get KubeletConfig
NAME                 AGE
cpumanager-enabled   17s

Looking for a 4.6 nightly payload that includes the bug fix PR:
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-31-080025 | grep kube-apiserver
  cluster-kube-apiserver-operator         https://github.com/openshift/cluster-kube-apiserver-operator                19c2ecc4e39d7da2388265c3e85dbd17e8b1fd1c

$ git log --date local --pretty="%h %an %cd - %s" 19c2ecc4e  | grep '#920'
19c2ecc4 OpenShift Merge Robot Thu Jul 30 22:31:42 2020 - Merge pull request #920 from MarSik/bug_1861431

The build 4.6.0-0.nightly-2020-07-31-080025 is just what we wanted.

$  oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

$ oc adm upgrade
Cluster version is 4.5.4

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.


Because the build 4.6.0-0.nightly-2020-07-31-080025 has not been signed, we have to upgrade with the --force parameter:

$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-31-080025 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to preceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-31-080025

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.4     True        True          35s     Working towards 4.6.0-0.nightly-2020-07-31-080025: 0% complete

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-31-080025   True        False         64m     Cluster version is 4.6.0-0.nightly-2020-07-31-080025

$ oc get clusterversion -o json|jq ".items[0].spec"
{
  "channel": "stable-4.5",
  "clusterID": "607a8084-b37d-4f17-9f43-122d38d382e4",
  "desiredUpdate": {
    "force": true,
    "image": "registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-31-080025",
    "version": ""
  },
  "upstream": "https://openshift-release.svc.ci.openshift.org/graph"
}

$ oc get clusterversion -o json|jq ".items[0].status.history"
[
  {
    "completionTime": "2020-07-31T11:46:08Z",
    "image": "registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-31-080025",
    "startedTime": "2020-07-31T10:28:03Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.0-0.nightly-2020-07-31-080025"
  },
  {
    "completionTime": "2020-07-31T09:57:40Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:02dfcae8f6a67e715380542654c952c981c59604b1ba7f569b13b9e5d0fbbed3",
    "startedTime": "2020-07-31T09:27:53Z",
    "state": "Completed",
    "verified": false,
    "version": "4.5.4"
  }
]

$ oc describe featuregate/cluster
Name:         cluster
Namespace:    
Labels:       <none>
Annotations:  release.openshift.io/create-only: true
API Version:  config.openshift.io/v1
Kind:         FeatureGate
Metadata:
  Creation Timestamp:  2020-07-31T09:28:23Z
  Generation:          3
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:release.openshift.io/create-only:
      f:spec:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-07-31T09:28:23Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:featureSet:
    Manager:         oc
    Operation:       Update
    Time:            2020-07-31T10:15:49Z
  Resource Version:  28617
  Self Link:         /apis/config.openshift.io/v1/featuregates/cluster
  UID:               70c3346d-5ef3-4238-a64b-4cf00545ac37
Spec:
  Feature Set:  LatencySensitive
Events:         <none>

$ oc get KubeletConfig
NAME                 AGE
cpumanager-enabled   154m


$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-07-31-080025   True        False         False      161m
cloud-credential                           4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h51m
cluster-autoscaler                         4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h36m
config-operator                            4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h37m
console                                    4.6.0-0.nightly-2020-07-31-080025   True        False         False      158m
csi-snapshot-controller                    4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h31m
dns                                        4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h41m
etcd                                       4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h41m
image-registry                             4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h31m
ingress                                    4.6.0-0.nightly-2020-07-31-080025   True        False         False      3h46m
insights                                   4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h37m
kube-apiserver                             4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h39m
kube-controller-manager                    4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h40m
kube-scheduler                             4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h39m
kube-storage-version-migrator              4.6.0-0.nightly-2020-07-31-080025   True        False         False      163m
machine-api                                4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h34m
machine-approver                           4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h38m
machine-config                             4.6.0-0.nightly-2020-07-31-080025   True        False         False      154m
marketplace                                4.6.0-0.nightly-2020-07-31-080025   True        False         False      162m
monitoring                                 4.6.0-0.nightly-2020-07-31-080025   True        False         False      175m
network                                    4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h43m
node-tuning                                4.6.0-0.nightly-2020-07-31-080025   True        False         False      3h45m
openshift-apiserver                        4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h38m
openshift-controller-manager               4.6.0-0.nightly-2020-07-31-080025   True        False         False      3h45m
openshift-samples                          4.6.0-0.nightly-2020-07-31-080025   True        False         False      3h45m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h41m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h41m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-07-31-080025   True        False         False      158m
service-ca                                 4.6.0-0.nightly-2020-07-31-080025   True        False         False      4h42m
storage                                    4.6.0-0.nightly-2020-07-31-080025   True        False         False      3h45m

We can see all is well.

Comment 12 Ke Wang 2020-07-31 14:44:15 UTC
I also tried upgrading OCP 4.5.4 to a 4.6 nightly without the fix PR and ran into the same problem; see below.

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

$ oc adm upgrade
Cluster version is 4.5.4

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.

$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-30-112525 --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to preceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-30-112525

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.4     True        True          17s     Unable to apply registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-30-112525: could not download the update

$ oc get clusterversion -o json
{
   ...
            "status": {
                "availableUpdates": null,
                "conditions": [
                    {
                        "lastTransitionTime": "2020-07-31T14:24:15Z",
                        "message": "Done applying 4.5.4",
                        "status": "True",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2020-07-31T14:35:37Z",
                        "status": "False",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2020-07-31T14:35:12Z",
                        "message": "Working towards 4.6.0-0.nightly-2020-07-30-112525: 1% complete",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2020-07-31T13:55:07Z",
                        "status": "True",
                        "type": "RetrievedUpdates"
                    },
                    {
                        "lastTransitionTime": "2020-07-31T14:26:57Z",
                        "message": "Multiple cluster operators cannot be upgraded between minor versions:\n* Cluster operator kube-apiserver cannot be upgraded between minor versions: FeatureGates_RestrictedFeatureGates_LatencySensitive: FeatureGatesUpgradeable: \"LatencySensitive\" does not allow updates\n* Cluster operator marketplace cannot be upgraded between minor versions: DeprecatedAPIsInUse: The cluster has custom OperatorSource, which is deprecated in future versions. Please visit this link for further details: https://docs.openshift.com/container-platform/4.4/release_notes/ocp-4-4-release-notes.html#ocp-4-4-marketplace-apis-deprecated",
                        "reason": "ClusterOperatorsNotUpgradeable",
                        "status": "False",
                        "type": "Upgradeable"
                    }
                ],
 ...


By comparing the above test results, we can see that the problem has been fixed, so I am moving the bug to Verified.
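
When comparing runs like these, it can help to pull only the blocking entries out of the `oc get clusterversion -o json` output. The sketch below assumes that output has been saved as a string; the function name is hypothetical, and the sample document is an abridged copy of the JSON shown above (the marketplace message is shortened).

```python
import json
from typing import List

# Abridged sample based on the ClusterVersion output quoted in this comment.
SAMPLE = r'''
{"items": [{"status": {"conditions": [
  {"type": "Available", "status": "True", "message": "Done applying 4.5.4"},
  {"type": "Upgradeable", "status": "False",
   "reason": "ClusterOperatorsNotUpgradeable",
   "message": "Multiple cluster operators cannot be upgraded between minor versions:\n* Cluster operator kube-apiserver cannot be upgraded between minor versions: FeatureGates_RestrictedFeatureGates_LatencySensitive: FeatureGatesUpgradeable: \"LatencySensitive\" does not allow updates\n* Cluster operator marketplace cannot be upgraded between minor versions: DeprecatedAPIsInUse: The cluster has custom OperatorSource, which is deprecated in future versions."}
]}}]}
'''


def upgrade_blockers(cluster_version_json: str) -> List[str]:
    """Return the per-operator messages from Upgradeable=False conditions."""
    doc = json.loads(cluster_version_json)
    blockers: List[str] = []
    for item in doc.get("items", []):
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") == "Upgradeable" and cond.get("status") == "False":
                # Multi-operator messages are a header followed by "* " bullets;
                # a single-operator message has no bullets and is kept whole.
                parts = cond.get("message", "").split("\n* ")
                blockers.extend(parts[1:] if len(parts) > 1 else parts)
    return blockers


for line in upgrade_blockers(SAMPLE):
    print(line)
```

An empty result would mean no Upgradeable=False condition was present in the saved output.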

Comment 13 Ke Wang 2020-07-31 16:30:50 UTC
Tried one upgrade from 4.6 nightly 4.6.0-0.nightly-2020-07-25-091217 to 4.6.0-0.nightly-2020-07-31-080025 and hit the bug again; see the details below.

$ oc describe featuregate/cluster
Name:         cluster
Namespace:    
Labels:       <none>
Annotations:  release.openshift.io/create-only: true
API Version:  config.openshift.io/v1
Kind:         FeatureGate
...
Spec:
  Feature Set:  LatencySensitive

$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-31-080025  --force=true --allow-explicit-upgrade=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to preceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-07-31-080025

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-25-091217   True        False         36m     Cluster version is 4.6.0-0.nightly-2020-07-25-091217

$ oc get clusterversion -o json
...
                    {
                        "lastTransitionTime": "2020-07-31T16:08:54Z",
                        "message": "Cluster operator kube-apiserver cannot be upgraded between minor versions: FeatureGatesUpgradeable: \"LatencySensitive\" does not allow updates",
                        "reason": "FeatureGates_RestrictedFeatureGates_LatencySensitive",
                        "status": "False",
                        "type": "Upgradeable"
                    }

Can anyone take a look at this problem?

Comment 15 Artyom 2020-08-02 11:46:12 UTC
The PR was merged 3 days ago, so you should probably use a newer starting version for the upgrade. Can you try the upgrade from 4.6.0-0.nightly-2020-07-31-080025 -> 4.6.0-0.nightly-2020-08-02-044648?
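
A rough way to reason about which nightly can serve as a starting version is to compare the timestamp embedded in the nightly name with the merge time of PR #920 from the git log in Comment 11. The sketch below is a heuristic under stated assumptions: the helper names are hypothetical, time zones are ignored (the git log used local dates), and a build made after the merge may still lack the fix, so `oc adm release info --commits` remains the reliable check.

```python
import re
from datetime import datetime

# Merge time of PR #920 as printed by git log in Comment 11 (time zone unknown).
FIX_MERGED = datetime(2020, 7, 30, 22, 31, 42)


def nightly_build_time(version: str) -> datetime:
    """Extract the build timestamp from a '...nightly-YYYY-MM-DD-HHMMSS' name."""
    match = re.search(r"nightly-(\d{4}-\d{2}-\d{2}-\d{6})$", version)
    if match is None:
        raise ValueError("not a nightly version: %r" % version)
    return datetime.strptime(match.group(1), "%Y-%m-%d-%H%M%S")


def may_contain_fix(version: str, merged: datetime = FIX_MERGED) -> bool:
    """A nightly built before the merge cannot contain the fix."""
    return nightly_build_time(version) > merged


for nightly in (
    "4.6.0-0.nightly-2020-07-25-091217",
    "4.6.0-0.nightly-2020-07-31-080025",
):
    print(nightly, may_contain_fix(nightly))
```

Since the Upgradeable condition is reported by the operator in the currently running version, a starting version built before the merge would be expected to still report the restriction.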

Comment 21 errata-xmlrpc 2020-10-27 16:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

