Description of problem:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

The test fails after 1m35s at github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174. The following promQL query was expected to return an empty result:

```
count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1
```

Instead it reported these firing alerts:

```
[{"metric":{"alertname":"ImageRegistryRemoved","alertstate":"firing","condition":"Available","endpoint":"metrics","instance":"139.178.87.251:9099","job":"cluster-version-operator","name":"image-registry","namespace":"openshift-cluster-version","pod":"cluster-version-operator-86d78876fc-6qq8j","reason":"Removed","service":"cluster-version-operator","severity":"warning"},"value":[1588018257.367,"7"]},
 {"metric":{"alertname":"KubeAPIErrorsHigh","alertstate":"firing","resource":"builds","severity":"warning","verb":"LIST"},"value":[1588018257.367,"25"]},
 {"metric":{"alertname":"KubeAPIErrorsHigh","alertstate":"firing","resource":"clusterversions","severity":"warning","verb":"LIST"},"value":[1588018257.367,"25"]},
 {"metric":{"alertname":"KubePodCrashLooping","alertstate":"firing","container":"kube-apiserver","endpoint":"https-main","instance":"10.128.2.9:8443","job":"kube-state-metrics","namespace":"openshift-kube-apiserver","pod":"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-0","service":"kube-state-metrics","severity":"critical"},"value":[1588018257.367,"24"]},
 {"metric":{"alertname":"KubePodCrashLooping","alertstate":"firing","container":"kube-apiserver","endpoint":"https-main","instance":"10.128.2.9:8443","job":"kube-state-metrics","namespace":"openshift-kube-apiserver","pod":"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-1","service":"kube-state-metrics","severity":"critical"},"value":[1588018257.367,"20"]},
 {"metric":{"alertname":"KubePodCrashLooping","alertstate":"firing","container":"kube-apiserver","endpoint":"https-main","instance":"10.128.2.9:8443","job":"kube-state-metrics","namespace":"openshift-kube-apiserver","pod":"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-2","service":"kube-state-metrics","severity":"critical"},"value":[1588018257.367,"12"]},
 {"metric":{"alertname":"TargetDown","alertstate":"firing","job":"metrics","namespace":"openshift-console-operator","service":"metrics","severity":"warning"},"value":[1588018257.367,"45"]}]
```

Version-Release number of selected component (if applicable):

How reproducible:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/8582/rehearse-8582-pull-ci-openshift-installer-master-e2e-vsphere/1

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
For vSphere IPI there shouldn't be any alerts firing on a default cluster. The installer CI job does make sure it uses emptyDir storage so the CI tests work, but it seems alerts are still being fired.
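For anyone triaging similar failures, the query the test runs can be reproduced by hand against the in-cluster Prometheus. This is a sketch, not part of the original report; it assumes the prometheus-k8s-0 pod exists in openshift-monitoring and that curl is available inside the prometheus container:

```
# Run the same query the conformance test uses (sketch; assumes curl is
# available inside the prometheus container):
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
  curl -s --data-urlencode \
  'query=count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1' \
  http://localhost:9090/api/v1/query
```

An empty `result` array in the response means the test would pass.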
@Abhinav, what is the expectation for the image registry on vSphere IPI installs? Should we configure block PVC storage?
The vSphere IPI job configures the cluster to use emptyDir: https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/vsphere/registry/ipi-install-vsphere-registry-commands.sh#L9

And if you look at the must-gather from the CI run, https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/8582/rehearse-8582-pull-ci-openshift-installer-master-e2e-vsphere/1/artifacts/e2e-vsphere/gather-must-gather/must-gather.tar

```
---
apiVersion: imageregistry.operator.openshift.io/v1
kind: Config
metadata:
  creationTimestamp: "2020-04-27T19:23:53Z"
  finalizers:
  - imageregistry.operator.openshift.io/finalizer
  generation: 2
  name: cluster
  resourceVersion: "25390"
  selfLink: /apis/imageregistry.operator.openshift.io/v1/configs/cluster
  uid: a8cfa785-9f6d-421e-ab14-c7e552e7ea6f
spec:
  httpSecret: 9737c6a7b448333192d11ef44a420d0bf339cb6596018e807c53c9e8685fdbd0b4050f9fbb63367bd5ad678ce52e5158c4426a3375fe7e2d720f50a05e837b6a
  logging: 2
  managementState: Managed
  proxy: {}
  replicas: 1
  requests:
    read:
      maxWaitInQueue: 0s
    write:
      maxWaitInQueue: 0s
  rolloutStrategy: RollingUpdate
  storage:
    emptyDir: {}
status:
  conditions:
  - lastTransitionTime: "2020-04-27T19:40:24Z"
    message: The registry is ready
    reason: Ready
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-04-27T19:40:24Z"
    message: The registry is ready
    reason: Ready
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-27T19:23:54Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-04-27T19:40:17Z"
    status: "False"
    type: Removed
  - lastTransitionTime: "2020-04-27T19:40:15Z"
    message: EmptyDir storage successfully created
    reason: Creation Successful
    status: "True"
    type: StorageExists
  generations:
  - group: apps
    hash: ""
    lastGeneration: 2
    name: image-registry
    namespace: openshift-image-registry
    resource: deployments
  observedGeneration: 2
  readyReplicas: 0
  storage:
    emptyDir: {}
  storageManaged: false
```

the cluster was configured with it. So my expectation would be that since we explicitly set the registry to emptyDir, there would be no alerts fired.
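For reference, the linked CI step boils down to a patch along these lines. This is a sketch of the gist, not a verbatim copy of the script; see the linked file for the authoritative version:

```
# Switch the image registry to ephemeral emptyDir storage and keep the
# operator Managed (roughly what the vSphere CI step does):
oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
  --patch '{"spec":{"managementState":"Managed","storage":{"emptyDir":{}}}}'
```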
This alert is triggered when the Image Registry is "Removed" for more than 5 minutes, and it takes up to 1 minute to clear itself after the configuration is changed. Could this test have run less than 1 minute after the Registry Operator was configured? That is the only explanation I have for the issue we are seeing.
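If the timing theory is right, the gap between the Removed condition's lastTransitionTime and the test start should be under a minute. A quick way to check, as a hypothetical triage step not taken in the original report:

```
# Print when the registry's Removed condition last changed; if the conformance
# test started within ~1 minute of this, the stale alert had not cleared yet:
oc get configs.imageregistry.operator.openshift.io cluster \
  -o jsonpath='{.status.conditions[?(@.type=="Removed")].lastTransitionTime}'
```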
Removing the NEEDINFO flag; a fix is on its way.
Verified on 4.5.0-0.nightly-2020-05-27-202943:

```
$ oc get PrometheusRule -n openshift-image-registry image-registry-operator-alerts -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2020-05-28T03:20:32Z"
  generation: 1
  labels:
    name: image-registry-operator-alerts
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:name: {}
      f:spec:
        .: {}
        f:groups: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-05-28T03:20:32Z"
  name: image-registry-operator-alerts
  namespace: openshift-image-registry
  resourceVersion: "13377"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-image-registry/prometheusrules/image-registry-operator-alerts
  uid: 52a6a268-f61e-4711-b93f-8b6e160ce5f2
spec:
  groups:
  - name: ImageRegistryOperator
    rules:
    - alert: ImageRegistryStorageReconfigured
      annotations:
        message: |
          Image Registry Storage configuration has changed in the last 30
          minutes. This change may have caused data loss.
      expr: increase(image_registry_operator_storage_reconfigured_total[30m]) > 0
      labels:
        severity: warning
  - name: ImagePruner
    rules:
    - alert: ImagePruningDisabled
      annotations:
        message: |
          Automatic image pruning is not enabled. Regular pruning of images no
          longer referenced by ImageStreams is strongly recommended to ensure
          your cluster remains healthy. To remove this warning, install the
          image pruner by creating an imagepruner.imageregistry.operator.openshift.io
          resource with the name `cluster`. Ensure that the `suspend` field is
          set to `false`.
      expr: image_registry_operator_image_pruner_install_status < 2
      labels:
        severity: warning
```
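As an extra sanity check (my own suggestion, not part of the verification above), one can also confirm that the old ImageRegistryRemoved rule no longer exists anywhere in the cluster:

```
# Count occurrences of the old alert name across all rule definitions;
# 0 means the alert was removed as expected (grep exits non-zero on no match):
oc get prometheusrules --all-namespaces -o yaml | grep -c ImageRegistryRemoved
```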
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409