Bug 1828608
| Summary: | Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: ImageRegistryRemoved firing on vsphere IPI | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Abhinav Dahiya <adahiya> |
| Component: | Image Registry | Assignee: | Ricardo Maraschini <rmarasch> |
| Status: | CLOSED ERRATA | QA Contact: | Wenjing Zheng <wzheng> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.5 | CC: | adam.kaplan, aos-bugs |
| Target Milestone: | --- | ||
| Target Release: | 4.5.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-13 17:31:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
@Abhinav, what is the expectation for the image registry on vSphere IPI installs? Are we supposed to configure block PVC storage?

The vSphere IPI job configures the cluster to use emptyDir storage (https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/vsphere/registry/ipi-install-vsphere-registry-commands.sh#L9 — a sketch of that kind of patch appears below, after the verification output), and if you look at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/8582/rehearse-8582-pull-ci-openshift-installer-master-e2e-vsphere/1/artifacts/e2e-vsphere/gather-must-gather/must-gather.tar from the CI run, the cluster was configured with it:

```
---
apiVersion: imageregistry.operator.openshift.io/v1
kind: Config
metadata:
  creationTimestamp: "2020-04-27T19:23:53Z"
  finalizers:
  - imageregistry.operator.openshift.io/finalizer
  generation: 2
  name: cluster
  resourceVersion: "25390"
  selfLink: /apis/imageregistry.operator.openshift.io/v1/configs/cluster
  uid: a8cfa785-9f6d-421e-ab14-c7e552e7ea6f
spec:
  httpSecret: 9737c6a7b448333192d11ef44a420d0bf339cb6596018e807c53c9e8685fdbd0b4050f9fbb63367bd5ad678ce52e5158c4426a3375fe7e2d720f50a05e837b6a
  logging: 2
  managementState: Managed
  proxy: {}
  replicas: 1
  requests:
    read:
      maxWaitInQueue: 0s
    write:
      maxWaitInQueue: 0s
  rolloutStrategy: RollingUpdate
  storage:
    emptyDir: {}
status:
  conditions:
  - lastTransitionTime: "2020-04-27T19:40:24Z"
    message: The registry is ready
    reason: Ready
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-04-27T19:40:24Z"
    message: The registry is ready
    reason: Ready
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-27T19:23:54Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-04-27T19:40:17Z"
    status: "False"
    type: Removed
  - lastTransitionTime: "2020-04-27T19:40:15Z"
    message: EmptyDir storage successfully created
    reason: Creation Successful
    status: "True"
    type: StorageExists
  generations:
  - group: apps
    hash: ""
    lastGeneration: 2
    name: image-registry
    namespace: openshift-image-registry
    resource: deployments
  observedGeneration: 2
  readyReplicas: 0
  storage:
    emptyDir: {}
  storageManaged: false
```

So my expectation would be that, since we explicitly set the registry to emptyDir, there would be no alerts fired.

This alert is triggered when the Image Registry is "Removed" for more than 5 minutes, and it takes up to 1 minute to clear itself after the configuration is changed. Might it be that this test happened less than 1 minute after setting up the Registry Operator? That is the only explanation for the issue we are seeing.

Removing the NEEDINFO flag, the fix is on its way.

Verified on 4.5.0-0.nightly-2020-05-27-202943:
```
$ oc get PrometheusRule -n openshift-image-registry image-registry-operator-alerts -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2020-05-28T03:20:32Z"
  generation: 1
  labels:
    name: image-registry-operator-alerts
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:name: {}
      f:spec:
        .: {}
        f:groups: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-05-28T03:20:32Z"
  name: image-registry-operator-alerts
  namespace: openshift-image-registry
  resourceVersion: "13377"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-image-registry/prometheusrules/image-registry-operator-alerts
  uid: 52a6a268-f61e-4711-b93f-8b6e160ce5f2
spec:
  groups:
  - name: ImageRegistryOperator
    rules:
    - alert: ImageRegistryStorageReconfigured
      annotations:
        message: |
          Image Registry Storage configuration has changed in the last 30
          minutes. This change may have caused data loss.
      expr: increase(image_registry_operator_storage_reconfigured_total[30m]) > 0
      labels:
        severity: warning
  - name: ImagePruner
    rules:
    - alert: ImagePruningDisabled
      annotations:
        message: |
          Automatic image pruning is not enabled. Regular pruning of images
          no longer referenced by ImageStreams is strongly recommended to
          ensure your cluster remains healthy.
          To remove this warning, install the image pruner by creating an
          imagepruner.imageregistry.operator.openshift.io resource with the
          name `cluster`. Ensure that the `suspend` field is set to `false`.
      expr: image_registry_operator_image_pruner_install_status < 2
      labels:
        severity: warning
```
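The ImagePruningDisabled message above describes its own remediation: create an imagepruner.imageregistry.operator.openshift.io resource named `cluster` with `suspend` set to `false`. A minimal sketch of doing that (only the fields named by the alert message; everything else is left to defaults) might be:

```
# Sketch: install the image pruner that the ImagePruningDisabled alert asks for.
$ oc create -f - <<'EOF'
apiVersion: imageregistry.operator.openshift.io/v1
kind: ImagePruner
metadata:
  name: cluster
spec:
  suspend: false
EOF
```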
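For reference, the vSphere CI step linked in the discussion above switches the registry to emptyDir storage. A rough sketch of that kind of patch (an illustration of the approach, not copied from the script itself):

```
# Sketch: put the cluster image registry on ephemeral emptyDir storage, as the
# vSphere IPI CI step does. Only suitable for test clusters, since images are
# lost when the registry pod restarts.
$ oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
    --patch '{"spec":{"managementState":"Managed","storage":{"emptyDir":{}}}}'
```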
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
Description of problem:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] (1m35s)

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
<map[string]error | len:1>: {
    "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
        s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n
        [{\"metric\":{\"alertname\":\"ImageRegistryRemoved\",\"alertstate\":\"firing\",\"condition\":\"Available\",\"endpoint\":\"metrics\",\"instance\":\"139.178.87.251:9099\",\"job\":\"cluster-version-operator\",\"name\":\"image-registry\",\"namespace\":\"openshift-cluster-version\",\"pod\":\"cluster-version-operator-86d78876fc-6qq8j\",\"reason\":\"Removed\",\"service\":\"cluster-version-operator\",\"severity\":\"warning\"},\"value\":[1588018257.367,\"7\"]},
        {\"metric\":{\"alertname\":\"KubeAPIErrorsHigh\",\"alertstate\":\"firing\",\"resource\":\"builds\",\"severity\":\"warning\",\"verb\":\"LIST\"},\"value\":[1588018257.367,\"25\"]},
        {\"metric\":{\"alertname\":\"KubeAPIErrorsHigh\",\"alertstate\":\"firing\",\"resource\":\"clusterversions\",\"severity\":\"warning\",\"verb\":\"LIST\"},\"value\":[1588018257.367,\"25\"]},
        {\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"kube-apiserver\",\"endpoint\":\"https-main\",\"instance\":\"10.128.2.9:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-apiserver\",\"pod\":\"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-0\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1588018257.367,\"24\"]},
        {\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"kube-apiserver\",\"endpoint\":\"https-main\",\"instance\":\"10.128.2.9:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-apiserver\",\"pod\":\"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-1\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1588018257.367,\"20\"]},
        {\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"kube-apiserver\",\"endpoint\":\"https-main\",\"instance\":\"10.128.2.9:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-apiserver\",\"pod\":\"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-2\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1588018257.367,\"12\"]},
        {\"metric\":{\"alertname\":\"TargetDown\",\"alertstate\":\"firing\",\"job\":\"metrics\",\"namespace\":\"openshift-console-operator\",\"service\":\"metrics\",\"severity\":\"warning\"},\"value\":[1588018257.367,\"45\"]}]",
    },
}
to be empty

Version-Release number of selected component (if applicable):

How reproducible:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/8582/rehearse-8582-pull-ci-openshift-installer-master-e2e-vsphere/1

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
For vSphere IPI there shouldn't be any alerts firing on a default cluster. The installer CI job does make sure it uses emptyDir so that the CI tests can work, but it seems like there are still alerts being fired.
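To check by hand which alerts are firing in the way the conformance test does, the same ALERTS metric can be queried against the in-cluster Prometheus. This is a rough sketch; the `prometheus-k8s` route name and bearer-token access are assumptions based on a default openshift-monitoring stack and are not taken from this report:

```
# Sketch: query firing alerts, excluding the ones the conformance test tolerates.
# Route name and auth method assume the default monitoring deployment.
$ HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
$ curl -skG -H "Authorization: Bearer $(oc whoami -t)" \
    "https://$HOST/api/v1/query" \
    --data-urlencode 'query=ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate="firing"}'
```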