Bug 2115821 - FIO Target Down Alert firing
Summary: FIO Target Down Alert firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: File Integrity Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Matt Rogers
QA Contact: xiyuan
Docs Contact: Jeana Routh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-05 13:28 UTC by Brian Jarvis
Modified: 2022-12-22 21:48 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, underlying dependencies of the File Integrity Operator changed how alerts and notifications were handled, and as a result the Operator did not send metrics. With this release, the Operator ensures that the metrics endpoint is correct and reachable on startup. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2115821[*2115821*])
Clone Of:
Environment:
Last Closed: 2022-11-09 22:27:14 UTC
Target Upstream Version:
Embargoed:




Links:
- Github: openshift file-integrity-operator pull 278, "Bug 2115821: Fix controller metrics port" (open; last updated 2022-08-09 18:53:58 UTC)
- Red Hat Product Errata: RHBA-2022:7095 (last updated 2022-11-09 22:27:17 UTC)

Description Brian Jarvis 2022-08-05 13:28:46 UTC
Description of problem:
The FIO operator is firing this alert:

33.33% of the metrics/metrics targets in openshift-file-integrity namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.

The pod is not listening on port 8686, the port configured in the ServiceMonitor and Service.
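
A quick way to confirm the mismatch is to compare the port the monitoring stack scrapes against what the operator pod actually serves. The Service is named "metrics" (per the "metrics/metrics" job/service pair in the alert text); the ServiceMonitor and Deployment names below are assumptions based on a default install:

$ oc -n openshift-file-integrity get service metrics -o jsonpath='{.spec.ports[*].port}'
$ oc -n openshift-file-integrity get servicemonitor metrics -o jsonpath='{.spec.endpoints[*].port}'
# (a ServiceMonitor endpoint port is a named Service port, not a number)
$ oc -n openshift-file-integrity rsh deploy/file-integrity-operator curl -s http://localhost:8686/metrics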

https://access.redhat.com/solutions/6337121 mentions restarting the pod, but that has not helped in this case.

Comment 1 Brian Jarvis 2022-08-05 13:29:01 UTC
Port 8686 is not available in the pod. The CR metrics appear to be available on the 8383 endpoint:

sh-4.4$ curl http://localhost:8686/metrics
curl: (7) Failed to connect to localhost port 8686: Connection refused
sh-4.4$ curl http://localhost:8383/metrics
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="configmap-controller"} 0
controller_runtime_active_workers{controller="fileintegrity-controller"} 0
controller_runtime_active_workers{controller="node-controller"} 0
controller_runtime_active_workers{controller="status-controller"} 0
# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
# TYPE controller_runtime_max_concurrent_reconciles gauge
controller_runtime_max_concurrent_reconciles{controller="configmap-controller"} 1
controller_runtime_max_concurrent_reconciles{controller="fileintegrity-controller"} 1
controller_runtime_max_concurrent_reconciles{controller="node-controller"} 1
controller_runtime_max_concurrent_reconciles{controller="status-controller"} 1

Comment 2 Lance Bragstad 2022-08-05 14:44:40 UTC
Hi Brian, do you know what FIO version was used to reproduce this?

Comment 3 Brian Jarvis 2022-08-05 14:48:33 UTC
OCP 4.10 with FIO 0.1.30.

Comment 4 Jakub Hrozek 2022-08-08 12:14:22 UTC
Matt, would https://github.com/openshift/file-integrity-operator/commit/be830f7dfe3cd6223f26f24a4103b75befd1485f also resolve this bug?

Comment 7 Matt Rogers 2022-08-09 18:53:24 UTC
I was able to reproduce it and work on a fix. Since the operator-sdk update, there were some changes to the controller-runtime metrics defaults, and I needed to adjust the port for the controller metrics (as well as make sure that the Service is updated on operator start).
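
As a minimal sanity check of the fix (assuming the deployment is named file-integrity-operator and the corrected controller metrics listener matches the 8686 port referenced by the Service), the endpoint that previously refused connections should now answer:

$ oc -n openshift-file-integrity rsh deploy/file-integrity-operator curl -s http://localhost:8686/metrics | head -n 3
# Expect controller_runtime_* metrics rather than "Connection refused".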

Comment 8 xiyuan 2022-08-10 09:27:55 UTC
Verification passed with the pre-merge process; the TargetDown alert ("33.33% of the metrics/metrics targets ...") was not reproduced with the PR.
$ oc get ip
NAME            CSV                               APPROVAL    APPROVED
install-wz4dx   file-integrity-operator.v0.1.30   Automatic   true
$ oc get csv -w
NAME                              DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0     OpenShift Elasticsearch Operator   5.5.0                Succeeded
file-integrity-operator.v0.1.30   File Integrity Operator            0.1.30               Installing
file-integrity-operator.v0.1.30   File Integrity Operator            0.1.30               Succeeded
^C
$ oc get fileintegrity example-fileintegrity -o=jsonpath={.status}
{"phase":"Active"}
$ oc get fileintegritynodestatus
NAME                                                                            NODE                                                      STATUS
example-fileintegrity-xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal         xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal         Failed
example-fileintegrity-xiyuan10-2-hbvnf-master-1.c.openshift-qe.internal         xiyuan10-2-hbvnf-master-1.c.openshift-qe.internal         Succeeded
example-fileintegrity-xiyuan10-2-hbvnf-master-2.c.openshift-qe.internal         xiyuan10-2-hbvnf-master-2.c.openshift-qe.internal         Succeeded
example-fileintegrity-xiyuan10-2-hbvnf-worker-a-rznfc.c.openshift-qe.internal   xiyuan10-2-hbvnf-worker-a-rznfc.c.openshift-qe.internal   Succeeded
example-fileintegrity-xiyuan10-2-hbvnf-worker-b-kv6fc.c.openshift-qe.internal   xiyuan10-2-hbvnf-worker-b-kv6fc.c.openshift-qe.internal   Succeeded
example-fileintegrity-xiyuan10-2-hbvnf-worker-c-f47mt.c.openshift-qe.internal   xiyuan10-2-hbvnf-worker-c-f47mt.c.openshift-qe.internal   Succeeded

$ ALERT_MANAGER=$(oc get route alertmanager-main -n openshift-monitoring -o jsonpath='{@.spec.host}')
$ curl -k -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)"  https://$ALERT_MANAGER/api/v1/alerts  | jq -r
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4484    0  4484    0     0   5962      0 --:--:-- --:--:-- --:--:--  5962
{
  "status": "success",
  "data": [
    {
      "labels": {
        "alertname": "NodeHasIntegrityFailure",
        "namespace": "fio",
        "node": "xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal",
        "openshift_io_alert_source": "platform",
        "prometheus": "openshift-monitoring/k8s",
        "severity": "warning"
      },
      "annotations": {
        "description": "Node xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal has an integrity check status of Failed for more than 1 second.",
        "summary": "Node xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal has a file integrity failure"
      },
      "startsAt": "2022-08-10T03:50:27.748Z",
      "endsAt": "2022-08-10T06:56:27.748Z",
      "generatorURL": "https:///console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/monitoring/graph?g0.expr=file_integrity_operator_node_failed%7Bnode%3D~%22.%2B%22%7D+%2A+on%28node%29+kube_node_info+%3E+0&g0.tab=1",
      "status": {
        "state": "active",
        "silencedBy": null,
        "inhibitedBy": null
      },
      "receivers": [
        "Default"
      ],
      "fingerprint": "5a5ba900c24cb04e"
    },
    {
      "labels": {
        "alertname": "Watchdog",
        "namespace": "openshift-monitoring",
        "openshift_io_alert_source": "platform",
        "prometheus": "openshift-monitoring/k8s",
        "severity": "none"
      },
      "annotations": {
        "description": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n",
        "summary": "An alert that should always be firing to certify that Alertmanager is working properly."
      },
      "startsAt": "2022-08-10T02:17:21.095Z",
      "endsAt": "2022-08-10T06:56:21.095Z",
      "generatorURL": "https:///console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/monitoring/graph?g0.expr=vector%281%29&g0.tab=1",
      "status": {
        "state": "active",
        "silencedBy": null,
        "inhibitedBy": null
      },
      "receivers": [
        "Watchdog"
      ],
      "fingerprint": "6934731368443c07"
    },
    {
      "labels": {
        "alertname": "AlertmanagerReceiversNotConfigured",
        "namespace": "openshift-monitoring",
        "openshift_io_alert_source": "platform",
        "prometheus": "openshift-monitoring/k8s",
        "severity": "warning"
      },
      "annotations": {
        "description": "Alerts are not configured to be sent to a notification system, meaning that you may not be notified in a timely fashion when important failures occur. Check the OpenShift documentation to learn how to configure notifications with Alertmanager.",
        "summary": "Receivers (notification integrations) are not configured on Alertmanager"
      },
      "startsAt": "2022-08-10T02:27:29.532Z",
      "endsAt": "2022-08-10T06:56:59.532Z",
      "generatorURL": "https:///console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/monitoring/graph?g0.expr=cluster%3Aalertmanager_integrations%3Amax+%3D%3D+0&g0.tab=1",
      "status": {
        "state": "active",
        "silencedBy": null,
        "inhibitedBy": null
      },
      "receivers": [
        "Default"
      ],
      "fingerprint": "72bc0ebbd3167d00"
    },
    {
      "labels": {
        "alertname": "CannotRetrieveUpdates",
        "endpoint": "metrics",
        "instance": "10.0.0.6:9099",
        "job": "cluster-version-operator",
        "namespace": "openshift-cluster-version",
        "openshift_io_alert_source": "platform",
        "pod": "cluster-version-operator-64f9cddbcd-xczmp",
        "prometheus": "openshift-monitoring/k8s",
        "service": "cluster-version-operator",
        "severity": "warning"
      },
      "annotations": {
        "description": "Failure to retrieve updates means that cluster administrators will need to monitor for available updates on their own or risk falling behind on security or other bugfixes. If the failure is expected, you can clear spec.channel in the ClusterVersion object to tell the cluster-version operator to not retrieve updates. Failure reason VersionNotFound .  For more information refer to https://console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/settings/cluster/.",
        "summary": "Cluster version operator has not retrieved updates in 4h 37m 30s."
      },
      "startsAt": "2022-08-10T03:15:32.736Z",
      "endsAt": "2022-08-10T06:57:02.736Z",
      "generatorURL": "https:///console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/monitoring/graph?g0.expr=%28time%28%29+-+cluster_version_operator_update_retrieval_timestamp_seconds%29+%3E%3D+3600+and+ignoring%28condition%2C+name%2C+reason%29+cluster_operator_conditions%7Bcondition%3D%22RetrievedUpdates%22%2Cendpoint%3D%22metrics%22%2Cname%3D%22version%22%2Creason%21%3D%22NoChannel%22%7D&g0.tab=1",
      "status": {
        "state": "active",
        "silencedBy": null,
        "inhibitedBy": null
      },
      "receivers": [
        "Default"
      ],
      "fingerprint": "bf378e174b4c1929"
    }
  ]
}

Comment 17 xiyuan 2022-10-21 12:37:29 UTC
Verification passed with file-integrity-operator.v0.1.31 + 4.12.0-0.nightly-2022-10-20-104328.
Verified with the following steps:
1. Install File Integrity Operator v0.1.30.
2. Create a FileIntegrity CR and make NodeHasIntegrityFailure fire from a node (see the sketch after the transcript below). Wait about 10 minutes, then check TargetDown alerts:
$ ALERT_MANAGER=$(oc get route alertmanager-main -n openshift-monitoring -o jsonpath='{@.spec.host}')
$ curl -k -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)"  https://$ALERT_MANAGER/api/v1/alerts  | jq -r  | grep -i "TargetDown" -A 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20370    0 20370    0     0  17899      0 --:--:--  0:00:01 --:--:-- 17915
        "alertname": "TargetDown",
        "job": "metrics",
        "namespace": "openshift-file-integrity",
        "openshift_io_alert_source": "platform",
        "prometheus": "openshift-monitoring/k8s",
        "service": "metrics",
        "severity": "warning"
      },
      "annotations": {
        "description": "33.33% of the metrics/metrics targets in openshift-file-integrity namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.",
        "summary": "Some targets were not reachable from the monitoring server for an extended period of time."

3. Upgrade to file-integrity-operator.v0.1.31, retrigger a NodeHasIntegrityFailure, and check TargetDown alerts.
No TargetDown alert fired for the openshift-file-integrity namespace.
$ oc get csv
NAME                              DISPLAY                   VERSION   REPLACES                          PHASE
file-integrity-operator.v0.1.31   File Integrity Operator   0.1.31    file-integrity-operator.v0.1.30   Succeeded

$ curl -k -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)"  https://$ALERT_MANAGER/api/v1/alerts  | jq -r  | grep -i "TargetDown" -A 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16276    0 16276    0     0  15976      0 --:--:--  0:00:01 --:--:-- 15988
$
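
For reference in steps 2 and 3, one common way to make NodeHasIntegrityFailure fire is to modify a file on a node that falls under the default AIDE watch set (the file path below is hypothetical; any watched file works, and the node name is a placeholder):

$ oc debug node/<node-name> -- chroot /host touch /etc/fio-test-file
# After the next AIDE scan, the matching node status should report Failed:
$ oc get fileintegritynodestatus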

Comment 20 errata-xmlrpc 2022-11-09 22:27:14 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift File Integrity Operator bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7095

