Description of problem:
The File Integrity Operator (FIO) is firing a TargetDown alert: "33.33% of the metrics/metrics targets in the openshift-file-integrity namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support."

The pod is not listening on port 8686, which is the port configured in the ServiceMonitor and Service. https://access.redhat.com/solutions/6337121 suggests restarting the pod, but that has not helped in this case.
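For anyone triaging the same alert, this is roughly how to confirm which port the scrape is configured against (assuming the Service and ServiceMonitor in openshift-file-integrity are both named "metrics"; the names may differ on other versions):

$ oc -n openshift-file-integrity get service metrics -o jsonpath='{.spec.ports}{"\n"}'
$ oc -n openshift-file-integrity get servicemonitor metrics -o jsonpath='{.spec.endpoints}{"\n"}'

Both show the endpoint(s) Prometheus is told to scrape; in this case one of them points at port 8686.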
Port 8686 is not available in the pod; the CR metrics appear to be available on the 8383 endpoint instead:

sh-4.4$ curl http://localhost:8686/metrics
curl: (7) Failed to connect to localhost port 8686: Connection refused

sh-4.4$ curl http://localhost:8383/metrics
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="configmap-controller"} 0
controller_runtime_active_workers{controller="fileintegrity-controller"} 0
controller_runtime_active_workers{controller="node-controller"} 0
controller_runtime_active_workers{controller="status-controller"} 0
# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
# TYPE controller_runtime_max_concurrent_reconciles gauge
controller_runtime_max_concurrent_reconciles{controller="configmap-controller"} 1
controller_runtime_max_concurrent_reconciles{controller="fileintegrity-controller"} 1
controller_runtime_max_concurrent_reconciles{controller="node-controller"} 1
controller_runtime_max_concurrent_reconciles{controller="status-controller"} 1
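To compare against what the operator container actually exposes, something like the following works (the name=file-integrity-operator label and the deployment name are assumptions; adjust the selector as needed):

$ oc -n openshift-file-integrity get pods -l name=file-integrity-operator \
    -o jsonpath='{.items[*].spec.containers[*].ports}{"\n"}'
$ oc -n openshift-file-integrity exec deploy/file-integrity-operator -- ss -tlnp   # if ss is available in the image

That makes the mismatch visible: the Service/ServiceMonitor point at 8686, while 8383 is the port actually serving metrics.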
Hi Brian, do you know what FIO version was used to reproduce this?
OCP 4.10 with FIO 0.1.30
Matt, would https://github.com/openshift/file-integrity-operator/commit/be830f7dfe3cd6223f26f24a4103b75befd1485f also resolve this bug?
I was able to reproduce it and work on a fix. Since the operator-sdk update, the controller-runtime metrics defaults have changed, so I needed to adjust the port for the controller metrics (and also make sure the Service is updated on operator start).
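With a build that contains that change, a quick sanity check is that the Service's targetPort and the port the operator actually serves metrics on agree again, e.g. (Service/deployment names as above are assumptions, and this assumes the endpoint is plain HTTP as in the original report):

$ oc -n openshift-file-integrity get service metrics \
    -o jsonpath='{range .spec.ports[*]}{.name}{" -> "}{.targetPort}{"\n"}{end}'
$ PORT=$(oc -n openshift-file-integrity get service metrics -o jsonpath='{.spec.ports[0].targetPort}')
$ oc -n openshift-file-integrity exec deploy/file-integrity-operator -- curl -s http://localhost:$PORT/metrics | head -n 5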
Pre-merge verification passed: the "33.33% of the metrics/metrics targets in openshift-file-integrity namespace" TargetDown alert was not reproduced with the PR.

$ oc get ip
NAME            CSV                               APPROVAL    APPROVED
install-wz4dx   file-integrity-operator.v0.1.30   Automatic   true

$ oc get csv -w
NAME                              DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0     OpenShift Elasticsearch Operator   5.5.0                Succeeded
file-integrity-operator.v0.1.30   File Integrity Operator            0.1.30               Installing
file-integrity-operator.v0.1.30   File Integrity Operator            0.1.30               Succeeded
^C

$ oc get fileintegrity example-fileintegrity -o=jsonpath={.status}
{"phase":"Active"}

$ oc get fileintegritynodestatus
NAME                                                                             NODE                                                      STATUS
example-fileintegrity-xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal          xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal         Failed
example-fileintegrity-xiyuan10-2-hbvnf-master-1.c.openshift-qe.internal          xiyuan10-2-hbvnf-master-1.c.openshift-qe.internal         Succeeded
example-fileintegrity-xiyuan10-2-hbvnf-master-2.c.openshift-qe.internal          xiyuan10-2-hbvnf-master-2.c.openshift-qe.internal         Succeeded
example-fileintegrity-xiyuan10-2-hbvnf-worker-a-rznfc.c.openshift-qe.internal    xiyuan10-2-hbvnf-worker-a-rznfc.c.openshift-qe.internal   Succeeded
example-fileintegrity-xiyuan10-2-hbvnf-worker-b-kv6fc.c.openshift-qe.internal    xiyuan10-2-hbvnf-worker-b-kv6fc.c.openshift-qe.internal   Succeeded
example-fileintegrity-xiyuan10-2-hbvnf-worker-c-f47mt.c.openshift-qe.internal    xiyuan10-2-hbvnf-worker-c-f47mt.c.openshift-qe.internal   Succeeded

$ ALERT_MANAGER=$(oc get route alertmanager-main -n openshift-monitoring -o jsonpath='{@.spec.host}')
$ curl -k -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" https://$ALERT_MANAGER/api/v1/alerts | jq -r
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4484    0  4484    0     0   5962      0 --:--:-- --:--:-- --:--:--  5962
{
  "status": "success",
  "data": [
    {
      "labels": {
        "alertname": "NodeHasIntegrityFailure",
        "namespace": "fio",
        "node": "xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal",
        "openshift_io_alert_source": "platform",
        "prometheus": "openshift-monitoring/k8s",
        "severity": "warning"
      },
      "annotations": {
        "description": "Node xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal has an integrity check status of Failed for more than 1 second.",
        "summary": "Node xiyuan10-2-hbvnf-master-0.c.openshift-qe.internal has a file integrity failure"
      },
      "startsAt": "2022-08-10T03:50:27.748Z",
      "endsAt": "2022-08-10T06:56:27.748Z",
      "generatorURL": "https:///console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/monitoring/graph?g0.expr=file_integrity_operator_node_failed%7Bnode%3D~%22.%2B%22%7D+%2A+on%28node%29+kube_node_info+%3E+0&g0.tab=1",
      "status": { "state": "active", "silencedBy": null, "inhibitedBy": null },
      "receivers": [ "Default" ],
      "fingerprint": "5a5ba900c24cb04e"
    },
    {
      "labels": {
        "alertname": "Watchdog",
        "namespace": "openshift-monitoring",
        "openshift_io_alert_source": "platform",
        "prometheus": "openshift-monitoring/k8s",
        "severity": "none"
      },
      "annotations": {
        "description": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n",
        "summary": "An alert that should always be firing to certify that Alertmanager is working properly."
      },
      "startsAt": "2022-08-10T02:17:21.095Z",
      "endsAt": "2022-08-10T06:56:21.095Z",
      "generatorURL": "https:///console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/monitoring/graph?g0.expr=vector%281%29&g0.tab=1",
      "status": { "state": "active", "silencedBy": null, "inhibitedBy": null },
      "receivers": [ "Watchdog" ],
      "fingerprint": "6934731368443c07"
    },
    {
      "labels": {
        "alertname": "AlertmanagerReceiversNotConfigured",
        "namespace": "openshift-monitoring",
        "openshift_io_alert_source": "platform",
        "prometheus": "openshift-monitoring/k8s",
        "severity": "warning"
      },
      "annotations": {
        "description": "Alerts are not configured to be sent to a notification system, meaning that you may not be notified in a timely fashion when important failures occur. Check the OpenShift documentation to learn how to configure notifications with Alertmanager.",
        "summary": "Receivers (notification integrations) are not configured on Alertmanager"
      },
      "startsAt": "2022-08-10T02:27:29.532Z",
      "endsAt": "2022-08-10T06:56:59.532Z",
      "generatorURL": "https:///console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/monitoring/graph?g0.expr=cluster%3Aalertmanager_integrations%3Amax+%3D%3D+0&g0.tab=1",
      "status": { "state": "active", "silencedBy": null, "inhibitedBy": null },
      "receivers": [ "Default" ],
      "fingerprint": "72bc0ebbd3167d00"
    },
    {
      "labels": {
        "alertname": "CannotRetrieveUpdates",
        "endpoint": "metrics",
        "instance": "10.0.0.6:9099",
        "job": "cluster-version-operator",
        "namespace": "openshift-cluster-version",
        "openshift_io_alert_source": "platform",
        "pod": "cluster-version-operator-64f9cddbcd-xczmp",
        "prometheus": "openshift-monitoring/k8s",
        "service": "cluster-version-operator",
        "severity": "warning"
      },
      "annotations": {
        "description": "Failure to retrieve updates means that cluster administrators will need to monitor for available updates on their own or risk falling behind on security or other bugfixes. If the failure is expected, you can clear spec.channel in the ClusterVersion object to tell the cluster-version operator to not retrieve updates. Failure reason VersionNotFound . For more information refer to https://console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/settings/cluster/.",
        "summary": "Cluster version operator has not retrieved updates in 4h 37m 30s."
      },
      "startsAt": "2022-08-10T03:15:32.736Z",
      "endsAt": "2022-08-10T06:57:02.736Z",
      "generatorURL": "https:///console-openshift-console.apps.xiyuan10-2.qe.gcp.devcluster.openshift.com/monitoring/graph?g0.expr=%28time%28%29+-+cluster_version_operator_update_retrieval_timestamp_seconds%29+%3E%3D+3600+and+ignoring%28condition%2C+name%2C+reason%29+cluster_operator_conditions%7Bcondition%3D%22RetrievedUpdates%22%2Cendpoint%3D%22metrics%22%2Cname%3D%22version%22%2Creason%21%3D%22NoChannel%22%7D&g0.tab=1",
      "status": { "state": "active", "silencedBy": null, "inhibitedBy": null },
      "receivers": [ "Default" ],
      "fingerprint": "bf378e174b4c1929"
    }
  ]
}
Verification passed with file-integrity-operator.v0.1.31 + 4.12.0-0.nightly-2022-10-20-104328, using the following steps:

1. Install File Integrity Operator v0.1.30.

2. Create a FileIntegrity and make NodeHasIntegrityFailure fire from a node. Wait for about 10 minutes, then check TargetDown alerts:

$ ALERT_MANAGER=$(oc get route alertmanager-main -n openshift-monitoring -o jsonpath='{@.spec.host}')
$ curl -k -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" https://$ALERT_MANAGER/api/v1/alerts | jq -r | grep -i "TargetDown" -A 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20370    0 20370    0     0  17899      0 --:--:--  0:00:01 --:--:-- 17915
        "alertname": "TargetDown",
        "job": "metrics",
        "namespace": "openshift-file-integrity",
        "openshift_io_alert_source": "platform",
        "prometheus": "openshift-monitoring/k8s",
        "service": "metrics",
        "severity": "warning"
      },
      "annotations": {
        "description": "33.33% of the metrics/metrics targets in openshift-file-integrity namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.",
        "summary": "Some targets were not reachable from the monitoring server for an extended period of time."

3. Upgrade to file-integrity-operator.v0.1.31, retrigger a NodeHasIntegrityFailure, and check TargetDown alerts again. No TargetDown fired for the openshift-file-integrity namespace:

$ oc get csv
NAME                              DISPLAY                   VERSION   REPLACES                          PHASE
file-integrity-operator.v0.1.31   File Integrity Operator   0.1.31    file-integrity-operator.v0.1.30   Succeeded

$ curl -k -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" https://$ALERT_MANAGER/api/v1/alerts | jq -r | grep -i "TargetDown" -A 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16276    0 16276    0     0  15976      0 --:--:--  0:00:01 --:--:-- 15988
$
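For reference, the same thing can be confirmed from the Prometheus side by querying the targets API directly (this uses the standard openshift-monitoring route and service account; the jq filter just narrows the output to the openshift-file-integrity targets):

$ PROM=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
$ TOKEN=$(oc create token prometheus-k8s -n openshift-monitoring)
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$PROM/api/v1/targets?state=active" \
    | jq '.data.activeTargets[] | select(.labels.namespace=="openshift-file-integrity") | {scrapeUrl, health, lastError}'

With the fix in place, these targets should all report "health": "up".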
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift File Integrity Operator bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:7095