This may end up being one or more alerts, but if the cluster cannot be upgraded, there should be a firing alert on the cluster: it is an abnormal, potentially dangerous, and definitely alertable condition. I would suggest starting with the highest-level alert we can create around the feature gate and the CVO. After discussion with David, we proposed exposing `cluster_feature_set{name="TechPreviewNoUpgrade"} 1`, `cluster_feature_set{name="CustomNoUpgrade"} 1`, or `cluster_feature_set{name=""} 1`, depending on which feature set was returned. This would allow an alert such as "Cluster is using custom unsupported features that will block upgrade" or "Cluster has enabled tech preview features that will prevent upgrades." A key piece of feedback from people debugging 4.1 clusters is that they cannot tell whether custom features are enabled. In addition, we need to add an upgrade-blocked alert via the CVO, which can be handled separately.
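As a sketch, a PrometheusRule implementing the proposed alert could look like the following. The rule name, namespace, `for` duration, and message text here are illustrative assumptions based on the proposal above, not a shipped rule:

```yaml
# Illustrative sketch only: names, duration, and message are assumptions,
# not the rule actually shipped by any operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: feature-gate-alerts
  namespace: openshift-kube-apiserver-operator
spec:
  groups:
  - name: feature-gates
    rules:
    - alert: TechPreviewNoUpgrade
      # Fires when the cluster reports a tech-preview feature set.
      expr: cluster_feature_set{name="TechPreviewNoUpgrade"} == 1
      for: 10m
      annotations:
        message: Cluster has enabled tech preview features that will prevent upgrades.
```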
Tried verifying other bugs. No time yet on this today. Need some investigation and discussion with monitoring QE
Tested in 4.2.0-0.nightly-2019-09-19-014231:

```
oc edit FeatureGate cluster
# add below and save
...
spec:
  featureSet: "TechPreviewNoUpgrade"
```

Waited about 5 minutes for the related KAS/KS pods to restart. Then queried `cluster_feature_set` and found the value is 0:

```
cluster_feature_set{endpoint="https",instance="10.129.0.7:8443",job="metrics",name="TechPreviewNoUpgrade",namespace="openshift-kube-apiserver-operator",pod="kube-apiserver-operator-8678cd9449-lgwvz",service="metrics"} 0
```

Checked "Alerts" in the Prometheus UI, and also ran `oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -k -H "Authorization: Bearer $token" https://localhost:9095/api/v1/alerts | python -mjson.tool` as in http://pastebin.test.redhat.com/793268 ; did not find the alert message of comment 0.
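For reference, the `oc edit` step above amounts to a FeatureGate object of roughly this shape (a sketch of the edited resource, assuming the default `cluster` object):

```yaml
# Sketch of the edited resource from the steps above.
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
```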
Discussed with monitoring QE, who suggested checking:

```
oc get prometheusrules --all-namespaces
NAMESPACE                   NAME                                    AGE
openshift-cluster-version   cluster-version-operator                27h
openshift-machine-api       machine-api-operator-prometheus-rules   27h
openshift-marketplace       marketplace-alert-rules                 27h
openshift-monitoring        prometheus-k8s-rules                    27h
openshift-sdn               networking-rules                        27h

oc get prometheusrules --all-namespaces -o yaml | grep -i -e feature -e upgrade
```

Found no rules related to the alert of comment 0.
For the BZ flag qe_test_coverage: there is corresponding documentation at https://docs.openshift.com/container-platform/4.1/nodes/clusters/nodes-cluster-enabling-features.html and an existing test case, OCP-23389. Per comment 0, this bug is to improve the monitoring so that developers can debug a customer cluster when upgrades are blocked by the enabled features.
Spoke with Clayton; this is a potential backport, so I'll lower the priority for now.
Tested in the latest 4.2.0-0.nightly-2019-10-07-203748 env and got the same result as comments 3 and 4. Then checked and found that the KASO PR 569 mentioned above is not included in the 4.2 branch. If 4.2.0 is not the intended target release, please update it. Thanks.
Verification is blocked by bug 1766518. Not sure it is fine to keep such a bug ON_QA for a long time, so moving it to ASSIGNED temporarily. Once bug 1766518 is fixed, I will check this bug again.
The blocker bug is verified. Will verify this bug later.
Verified in 4.3.0-0.nightly-2019-12-10-235659 with both feature gates:

```
oc edit FeatureGate cluster
...
spec:
  featureSet: "TechPreviewNoUpgrade"
```

```
oc edit FeatureGate cluster
...
spec:
  customNoUpgrade:
    disabled:
    - LegacyNodeRoleBehavior
    enabled:
    - RotateKubeletServerCertificate
    - SupportPodPidsLimit
    - NodeDisruptionExclusion
    - ServiceNodeExclusion
  featureSet: CustomNoUpgrade
```

Both can set `cluster_feature_set` and fire the "alert: TechPreviewNoUpgrade" (CustomNoUpgrade also fires this named alert because the alert expr defines it so).
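One rule shape consistent with these observations (the metric value of 0 seen in comment 3 once a feature set was applied, and both feature sets firing the same named alert) is an expression that matches any named feature set rather than a specific one. This is an assumption about the rule's shape, not the exact shipped expression:

```yaml
# Assumed shape of the alert rule; the actual shipped expression may differ.
# cluster_feature_set appears to report 0 when the active feature set blocks
# upgrades (see comment 3), so any sample with a non-empty name and value 0
# would fire the alert, which matches CustomNoUpgrade also firing the alert
# named "TechPreviewNoUpgrade".
- alert: TechPreviewNoUpgrade
  expr: cluster_feature_set{name!=""} == 0
```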
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062