Bug 1828608 - Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: ImageRegistryRemoved firing on vSphere IPI
Summary: Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: ImageRegistryRemoved firing on vSphere IPI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ricardo Maraschini
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-27 21:55 UTC by Abhinav Dahiya
Modified: 2020-07-13 17:32 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:31:58 UTC
Target Upstream Version:
Embargoed:




Links
System / ID | Status | Summary | Last Updated
Github openshift cluster-image-registry-operator pull 561 | closed | Bug 1828608: Removing ImageRegistryRemoved alert | 2020-08-24 12:23:32 UTC
Red Hat Product Errata RHBA-2020:2409 | None | None | 2020-07-13 17:32:11 UTC

Description Abhinav Dahiya 2020-04-27 21:55:06 UTC
Description of problem:


[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] 	1m35s
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"ImageRegistryRemoved\",\"alertstate\":\"firing\",\"condition\":\"Available\",\"endpoint\":\"metrics\",\"instance\":\"139.178.87.251:9099\",\"job\":\"cluster-version-operator\",\"name\":\"image-registry\",\"namespace\":\"openshift-cluster-version\",\"pod\":\"cluster-version-operator-86d78876fc-6qq8j\",\"reason\":\"Removed\",\"service\":\"cluster-version-operator\",\"severity\":\"warning\"},\"value\":[1588018257.367,\"7\"]},{\"metric\":{\"alertname\":\"KubeAPIErrorsHigh\",\"alertstate\":\"firing\",\"resource\":\"builds\",\"severity\":\"warning\",\"verb\":\"LIST\"},\"value\":[1588018257.367,\"25\"]},{\"metric\":{\"alertname\":\"KubeAPIErrorsHigh\",\"alertstate\":\"firing\",\"resource\":\"clusterversions\",\"severity\":\"warning\",\"verb\":\"LIST\"},\"value\":[1588018257.367,\"25\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"kube-apiserver\",\"endpoint\":\"https-main\",\"instance\":\"10.128.2.9:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-apiserver\",\"pod\":\"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-0\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1588018257.367,\"24\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"kube-apiserver\",\"endpoint\":\"https-main\",\"instance\":\"10.128.2.9:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-apiserver\",\"pod\":\"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-1\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1588018257.367,\"20\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"kube-apiserver\",\"endpoint\":\"https-main\",\"instance\":\"10.128.2.9:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-apiserver\",\"pod\":\"kube-apiserver-ci-op-cyr77ydz-3b021-lctgx-master-2\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1588018257.367,\"12\"]},{\"metric\":{\"alertname\":\"TargetDown\",\"alertstate\":\"firing\",\"job\":\"metrics\",\"namespace\":\"openshift-console-operator\",\"service\":\"metrics\",\"severity\":\"warning\"},\"value\":[1588018257.367,\"45\"]}]",
        },
    }
to be empty
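
For anyone trying to reproduce this by hand, the same PromQL query the test runs can be issued against the cluster's Prometheus. A minimal sketch, assuming the default prometheus-k8s route in openshift-monitoring and a logged-in user whose token is allowed to query Prometheus:

```
# Re-run the test's query against the in-cluster Prometheus
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
curl -skG "https://$HOST/api/v1/query" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1'
```

An empty result array means no unexpected alerts fired in the last two hours.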

Version-Release number of selected component (if applicable):


How reproducible:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/8582/rehearse-8582-pull-ci-openshift-installer-master-e2e-vsphere/1


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

For vSphere IPI there shouldn't be any alerts firing on a default cluster. The installer CI job does make sure it uses emptyDir storage so that the CI tests can pass, but it seems like there are still alerts being fired.

Comment 2 Adam Kaplan 2020-05-18 14:38:03 UTC
@Abhinav what is the expectation for the image registry on vSphere IPI installs? Are we to configure block PVC storage?

Comment 3 Abhinav Dahiya 2020-05-18 17:49:08 UTC
The vSphere IPI job configures the cluster to use emptyDir:

https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/vsphere/registry/ipi-install-vsphere-registry-commands.sh#L9
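
What that script does, in essence, is patch the registry config over to ephemeral emptyDir storage. A rough sketch of the equivalent command (the exact script is in the link above):

```
# Switch the image registry to emptyDir storage, as the CI step does
oc patch configs.imageregistry.operator.openshift.io cluster \
  --type merge \
  -p '{"spec":{"managementState":"Managed","storage":{"emptyDir":{}}}}'
```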

And if you look at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/8582/rehearse-8582-pull-ci-openshift-installer-master-e2e-vsphere/1/artifacts/e2e-vsphere/gather-must-gather/must-gather.tar from the CI run, the registry config shows:

```
---
apiVersion: imageregistry.operator.openshift.io/v1
kind: Config
metadata:
  creationTimestamp: "2020-04-27T19:23:53Z"
  finalizers:
  - imageregistry.operator.openshift.io/finalizer
  generation: 2
  name: cluster
  resourceVersion: "25390"
  selfLink: /apis/imageregistry.operator.openshift.io/v1/configs/cluster
  uid: a8cfa785-9f6d-421e-ab14-c7e552e7ea6f
spec:
  httpSecret: 9737c6a7b448333192d11ef44a420d0bf339cb6596018e807c53c9e8685fdbd0b4050f9fbb63367bd5ad678ce52e5158c4426a3375fe7e2d720f50a05e837b6a
  logging: 2
  managementState: Managed
  proxy: {}
  replicas: 1
  requests:
    read:
      maxWaitInQueue: 0s
    write:
      maxWaitInQueue: 0s
  rolloutStrategy: RollingUpdate
  storage:
    emptyDir: {}
status:
  conditions:
  - lastTransitionTime: "2020-04-27T19:40:24Z"
    message: The registry is ready
    reason: Ready
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-04-27T19:40:24Z"
    message: The registry is ready
    reason: Ready
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-27T19:23:54Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-04-27T19:40:17Z"
    status: "False"
    type: Removed
  - lastTransitionTime: "2020-04-27T19:40:15Z"
    message: EmptyDir storage successfully created
    reason: Creation Successful
    status: "True"
    type: StorageExists
  generations:
  - group: apps
    hash: ""
    lastGeneration: 2
    name: image-registry
    namespace: openshift-image-registry
    resource: deployments
  observedGeneration: 2
  readyReplicas: 0
  storage:
    emptyDir: {}
  storageManaged: false
```

So the cluster was configured with emptyDir storage.

My expectation would be that, since we explicitly set the registry to emptyDir, no alerts would be fired.
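
For reference, the same thing can be checked on a live cluster without a must-gather; a small sketch (the jsonpath expressions are just illustrative):

```
# Confirm the registry is configured with emptyDir storage
oc get configs.imageregistry.operator.openshift.io cluster \
  -o jsonpath='{.spec.storage}{"\n"}'

# Confirm the operator is not reporting the Removed condition
oc get configs.imageregistry.operator.openshift.io cluster \
  -o jsonpath='{.status.conditions[?(@.type=="Removed")].status}{"\n"}'
```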

Comment 4 Ricardo Maraschini 2020-05-19 07:59:51 UTC
This alert is triggered when the Image Registry is "Removed" for more than 5 minutes, and it takes up to 1 minute to clear itself after the configuration is changed. Could it be that this test ran less than 1 minute after the Registry Operator was configured? That is the only explanation I can see for the issue we are seeing.

Comment 7 Ricardo Maraschini 2020-05-27 12:42:03 UTC
Removing the NEEDINFO flag; the fix is on its way.

Comment 8 Wenjing Zheng 2020-05-28 07:10:04 UTC
Verified on 4.5.0-0.nightly-2020-05-27-202943: 
$ oc get PrometheusRule -n openshift-image-registry image-registry-operator-alerts -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2020-05-28T03:20:32Z"
  generation: 1
  labels:
    name: image-registry-operator-alerts
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:name: {}
      f:spec:
        .: {}
        f:groups: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-05-28T03:20:32Z"
  name: image-registry-operator-alerts
  namespace: openshift-image-registry
  resourceVersion: "13377"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-image-registry/prometheusrules/image-registry-operator-alerts
  uid: 52a6a268-f61e-4711-b93f-8b6e160ce5f2
spec:
  groups:
  - name: ImageRegistryOperator
    rules:
    - alert: ImageRegistryStorageReconfigured
      annotations:
        message: |
          Image Registry Storage configuration has changed in the last 30
          minutes. This change may have caused data loss.
      expr: increase(image_registry_operator_storage_reconfigured_total[30m]) > 0
      labels:
        severity: warning
  - name: ImagePruner
    rules:
    - alert: ImagePruningDisabled
      annotations:
        message: |
          Automatic image pruning is not enabled. Regular pruning of images
          no longer referenced by ImageStreams is strongly recommended to
          ensure your cluster remains healthy.

          To remove this warning, install the image pruner by creating an
          imagepruner.imageregistry.operator.openshift.io resource with the
          name `cluster`. Ensure that the `suspend` field is set to `false`.
      expr: image_registry_operator_image_pruner_install_status < 2
      labels:
        severity: warning
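
To double-check that the ImageRegistryRemoved alert is really gone, listing the alert names defined by the rule is enough; something along these lines:

```
# List the alert names in the operator's PrometheusRule
oc get prometheusrule -n openshift-image-registry image-registry-operator-alerts \
  -o jsonpath='{range .spec.groups[*].rules[*]}{.alert}{"\n"}{end}'
```

Only ImageRegistryStorageReconfigured and ImagePruningDisabled should be listed.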

Comment 9 errata-xmlrpc 2020-07-13 17:31:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

