This may end up being one or more alerts, but if the cluster cannot be upgraded, there should be a firing alert on the cluster: it is an abnormal, potentially dangerous, and definitely alertable condition. I would suggest starting with the highest-level alert we can create around the feature gate and the CVO. After discussion with David, we proposed exposing `cluster_feature_set{name="TechPreviewNoUpgrade"} 1`, `cluster_feature_set{name="CustomNoUpgrade"} 1`, or `cluster_feature_set{name=""} 1`, depending on which feature set was returned. This would allow an alert such as "Cluster is using custom unsupported features that will block upgrade" or "Cluster has enabled tech preview features that will prevent upgrades." A key piece of feedback from people debugging 4.1 clusters is that they cannot tell whether custom features are enabled. In addition, we need to add an upgrade-blocked alert via the CVO, which can be handled separately.
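As a sketch, a PrometheusRule implementing the proposed alert could look like the following. The rule name, namespace, `for` duration, and message text here are illustrative assumptions based on the proposal above, not a shipped rule:

```yaml
# Illustrative sketch only: names, duration, and message are assumptions,
# not the rule actually shipped by any operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: feature-gate-alerts
  namespace: openshift-kube-apiserver-operator
spec:
  groups:
  - name: feature-gates
    rules:
    - alert: TechPreviewNoUpgrade
      # Fires when the cluster reports a tech-preview feature set.
      expr: cluster_feature_set{name="TechPreviewNoUpgrade"} == 1
      for: 10m
      annotations:
        message: Cluster has enabled tech preview features that will prevent upgrades.
```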
Tried verifying other bugs. No time yet on this today. Need some investigation and discussion with monitoring QE
Tested in 4.2.0-0.nightly-2019-09-19-014231:

```
oc edit FeatureGate cluster
# add below and save
...
spec:
  featureSet: "TechPreviewNoUpgrade"
```

Waited about 5 minutes for the related KAS/KS pods to restart. Then queried `cluster_feature_set` and found the value is 0:

```
cluster_feature_set{endpoint="https",instance="10.129.0.7:8443",job="metrics",name="TechPreviewNoUpgrade",namespace="openshift-kube-apiserver-operator",pod="kube-apiserver-operator-8678cd9449-lgwvz",service="metrics"} 0
```

Checked "Alerts" in the Prometheus UI, and also ran `oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -k -H "Authorization: Bearer $token" https://localhost:9095/api/v1/alerts | python -mjson.tool` as in http://pastebin.test.redhat.com/793268 ; did not find the alert message of comment 0.
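For reference, the `oc edit` step above amounts to a FeatureGate object of roughly this shape (a sketch of the edited resource, assuming the default `cluster` object):

```yaml
# Sketch of the edited resource from the steps above.
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
```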
Discussed with monitoring QE, who suggested checking:

```
oc get prometheusrules --all-namespaces
NAMESPACE                   NAME                                    AGE
openshift-cluster-version   cluster-version-operator                27h
openshift-machine-api       machine-api-operator-prometheus-rules   27h
openshift-marketplace       marketplace-alert-rules                 27h
openshift-monitoring        prometheus-k8s-rules                    27h
openshift-sdn               networking-rules                        27h

oc get prometheusrules --all-namespaces -o yaml | grep -i -e feature -e upgrade
```

Found no rules related to the alert of comment 0.
For the BZ flag qe_test_coverage: there is corresponding documentation at https://docs.openshift.com/container-platform/4.1/nodes/clusters/nodes-cluster-enabling-features.html and an existing test case, OCP-23389. Per comment 0, this bug is to improve the monitoring so that developers can debug a customer cluster when upgrades are blocked by the enabled features.
Spoke with Clayton; this is a potential backport, so I'll lower the priority for now.
Tested in the latest 4.2.0-0.nightly-2019-10-07-203748 env and got the same result as comments 3 and 4. Then checked and found that the KASO PR 569 mentioned above is not included in the 4.2 branch. If 4.2.0 is not the intended target release, please update it. Thanks.
Verification is blocked by bug 1766518. Not sure it is fine to keep such a bug ON_QA for a long time, so moving it to ASSIGNED temporarily. Once bug 1766518 is fixed, I will check this bug again.
The blocker bug is verified. Will verify this bug later.
Verified in 4.3.0-0.nightly-2019-12-10-235659 with both feature gates:

```
oc edit FeatureGate cluster
...
spec:
  featureSet: "TechPreviewNoUpgrade"
```

```
oc edit FeatureGate cluster
...
spec:
  customNoUpgrade:
    disabled:
    - LegacyNodeRoleBehavior
    enabled:
    - RotateKubeletServerCertificate
    - SupportPodPidsLimit
    - NodeDisruptionExclusion
    - ServiceNodeExclusion
  featureSet: CustomNoUpgrade
```

Both can set `cluster_feature_set` and fire the "alert: TechPreviewNoUpgrade" (CustomNoUpgrade also fires this named alert because the alert expr defines it so).
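One rule shape consistent with these observations (the metric value of 0 seen in comment 3 once a feature set was applied, and both feature sets firing the same named alert) is an expression that matches any named feature set rather than a specific one. This is an assumption about the rule's shape, not the exact shipped expression:

```yaml
# Assumed shape of the alert rule; the actual shipped expression may differ.
# cluster_feature_set appears to report 0 when the active feature set blocks
# upgrades (see comment 3), so any sample with a non-empty name and value 0
# would fire the alert, which matches CustomNoUpgrade also firing the alert
# named "TechPreviewNoUpgrade".
- alert: TechPreviewNoUpgrade
  expr: cluster_feature_set{name!=""} == 0
```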
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062