Bug 1731228

Summary: The cluster should be reporting an alert if it is not upgradeable due to tech preview features being enabled
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: kube-apiserver
Assignee: Clayton Coleman <ccoleman>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: low
Docs Contact:
Priority: medium
Version: 4.2.0
CC: aos-bugs, calfonso, maszulik, mfojtik, sttts, vlaad, xtian
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Release Note
Doc Text:
The cluster will report a TechPreviewNoUpgrade Prometheus alert if it is not upgradeable due to tech preview features being enabled.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-23 11:04:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1766518, 1769779
Bug Blocks:

Description Clayton Coleman 2019-07-18 17:17:53 UTC
This may end up being one or more alerts, but if the cluster cannot be upgraded there should be a firing alert on the cluster (it is an abnormal, potentially dangerous, definitely alertable condition).

I would suggest starting with the highest level alert we can create around the feature gate and the CVO.

After discussion with David, we proposed exposing a metric:

cluster_feature_set{name="TechPreviewNoUpgrade"} 1

or

cluster_feature_set{name="CustomNoUpgrade"} 1

or

cluster_feature_set{name=""} 1

depending on which feature set was returned.

This would allow an alert "Cluster is using custom unsupported features that will block upgrade" or "Cluster has enabled tech preview features that will prevent upgrades."
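The mapping from the proposed `name` label to an alert message can be sketched as below. This is only an illustration of the proposal: the message strings are quoted from this description, while the `ALERT_MESSAGES` table and the `alert_message` helper are hypothetical names, not the operator's actual code or alert rule.

```python
from typing import Optional

# Message strings are quoted from the description above. The table and the
# helper below are illustrative assumptions, not the operator's real code.
ALERT_MESSAGES = {
    "TechPreviewNoUpgrade": "Cluster has enabled tech preview features that will prevent upgrades.",
    "CustomNoUpgrade": "Cluster is using custom unsupported features that will block upgrade",
}


def alert_message(feature_set_name: str) -> Optional[str]:
    """Return the alert text for a cluster_feature_set `name` label.

    The default feature set is reported with an empty name and yields None,
    meaning no alert is expected.
    """
    return ALERT_MESSAGES.get(feature_set_name)
```

Under this sketch, only the two no-upgrade feature sets produce a message; the empty (default) name produces none.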

A key piece of feedback from people debugging 4.1 clusters is that they cannot tell whether custom features are enabled.

In addition, we need to add an upgrade-blocked alert via the CVO, which can be handled separately.

Comment 2 Xingxing Xia 2019-09-19 11:36:53 UTC
Was verifying other bugs today and had no time for this one yet. It needs some investigation and discussion with monitoring QE.

Comment 3 Xingxing Xia 2019-09-20 06:54:23 UTC
Tested in 4.2.0-0.nightly-2019-09-19-014231 :
oc edit FeatureGate cluster # add below and save
...
spec:
  featureSet: "TechPreviewNoUpgrade"

Then waited about 5 minutes for the related KAS/KS pods to restart.
Then queried cluster_feature_set and found the value is 0:
cluster_feature_set{endpoint="https",instance="10.129.0.7:8443",job="metrics",name="TechPreviewNoUpgrade",namespace="openshift-kube-apiserver-operator",pod="kube-apiserver-operator-8678cd9449-lgwvz",service="metrics"}	0
Checked "Alerts" in the Prometheus UI, and also ran `oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -k -H "Authorization: Bearer $token" https://localhost:9095/api/v1/alerts | python -mjson.tool` as described in http://pastebin.test.redhat.com/793268 , but did not find the alert message from comment 0.
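Instead of reading the pretty-printed JSON by eye, the alert list can be filtered by name. The following is a minimal sketch, assuming the Alertmanager v1 response shape (a top-level `data` list whose entries carry a `labels.alertname` field); `find_alerts` is a hypothetical helper name.

```python
def find_alerts(payload: dict, alert_name: str) -> list:
    """Return the alerts in an /api/v1/alerts payload whose alertname matches.

    Assumes the Alertmanager v1 shape: {"status": ..., "data": [ {...}, ... ]},
    where each alert has a "labels" mapping that includes "alertname".
    """
    alerts = payload.get("data") or []
    return [
        alert
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == alert_name
    ]
```

To use it, save the curl output from the command above to a file, load it with `json.load`, and call `find_alerts(payload, "TechPreviewNoUpgrade")`. An empty result means the alert is not firing.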

Comment 4 Xingxing Xia 2019-09-20 07:01:05 UTC
Discussed with monitoring QE, who suggested checking:
oc get prometheusrules --all-namespaces
NAMESPACE                   NAME                                    AGE
openshift-cluster-version   cluster-version-operator                27h
openshift-machine-api       machine-api-operator-prometheus-rules   27h
openshift-marketplace       marketplace-alert-rules                 27h
openshift-monitoring        prometheus-k8s-rules                    27h
openshift-sdn               networking-rules                        27h

oc get prometheusrules --all-namespaces -o yaml | grep -i -e feature -e upgrade # found no rules related to the alert from comment 0
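A more structured variant of the grep above can be sketched in Python. It assumes the input is the parsed output of `oc get prometheusrules --all-namespaces -o json` (a list object with `items`, each holding `spec.groups[].rules[]`); `find_alert_rules` is a hypothetical helper. Unlike the grep, it looks only at alert names and expressions, not at annotations.

```python
import re
from typing import Iterable, List, Tuple


def find_alert_rules(rule_list: dict, keywords: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (namespace, object name, alert name) for alert rules that
    mention any keyword, case-insensitively, in the alert name or expr."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for item in rule_list.get("items", []):
        meta = item.get("metadata", {})
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                alert = rule.get("alert")
                if alert is None:
                    continue  # recording rule, not an alert
                text = f"{alert} {rule.get('expr', '')}"
                if pattern.search(text):
                    hits.append((meta.get("namespace", ""), meta.get("name", ""), alert))
    return hits
```

Calling `find_alert_rules(rules, ["feature", "upgrade"])` on the loaded JSON returns an empty list when no matching alert rule is installed, which corresponds to the result reported here.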

Comment 5 Xingxing Xia 2019-09-20 07:05:00 UTC
For the BZ flag qe_test_coverage:
It has a corresponding doc: https://docs.openshift.com/container-platform/4.1/nodes/clusters/nodes-cluster-enabling-features.html
It also has an existing test case: OCP-23389
Per comment 0, this bug is to improve monitoring so that developers can debug a customer cluster once an upgrade is blocked by enabling these features.

Comment 6 Maciej Szulik 2019-09-20 17:00:51 UTC
Spoke with Clayton; this is a potential backport, so I'll lower the priority for now.

Comment 8 Xingxing Xia 2019-10-08 09:30:41 UTC
Tested in the latest 4.2.0-0.nightly-2019-10-07-203748 env and got the same result as in comments 3 and 4. Then checked and found that the above KASO PR 569 is not included in the 4.2 branch. If 4.2.0 is not the intended target release, please modify it. Thanks.

Comment 15 Xingxing Xia 2019-11-08 02:37:05 UTC
Verification is blocked by bug 1766518. Not sure whether it is fine to keep this bug ON_QA for a long time, so moving it to ASSIGNED temporarily. Once bug 1766518 is fixed, will check this bug again.

Comment 16 Xingxing Xia 2019-11-27 07:12:23 UTC
The blocker bug is verified. Will verify this bug later.

Comment 17 Xingxing Xia 2019-12-11 13:24:15 UTC
Verified in 4.3.0-0.nightly-2019-12-10-235659 with both feature sets:
oc edit FeatureGate cluster
...
spec:
  featureSet: "TechPreviewNoUpgrade"

oc edit FeatureGate cluster
...
spec:
  customNoUpgrade:
    disabled:
    - LegacyNodeRoleBehavior
    enabled:
    - RotateKubeletServerCertificate
    - SupportPodPidsLimit
    - NodeDisruptionExclusion
    - ServiceNodeExclusion
  featureSet: CustomNoUpgrade

Both set the cluster_feature_set metric and fire the alert named TechPreviewNoUpgrade (CustomNoUpgrade also fires this alert because the alert expr is defined that way).

Comment 19 errata-xmlrpc 2020-01-23 11:04:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062