Description of problem:
According to the style guide https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide, an alert should set the namespace label to identify which component is raising it. However, NodeHasIntegrityFailure does not set the namespace label: https://github.com/openshift/file-integrity-operator/blob/master/cmd/manager/operator.go#L266

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install the File Integrity Operator
2. Make NodeHasIntegrityFailure fire from a node
3. Inspect the alert labels

Actual results:
NodeHasIntegrityFailure doesn't set the namespace label

Expected results:
NodeHasIntegrityFailure should be fired from the namespace the operator is installed in (by default openshift-file-integrity)

Additional info:
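For reference, the fix amounts to adding a namespace label to the alert rule the operator deploys. A minimal sketch of what the corrected PrometheusRule could look like; the metric name, expression, severity, and annotation here are illustrative assumptions, not the operator's actual values (the real rule is built in cmd/manager/operator.go):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: file-integrity
  namespace: openshift-file-integrity
spec:
  groups:
  - name: node-failed
    rules:
    - alert: NodeHasIntegrityFailure
      # Hypothetical expression; the operator's real rule may differ.
      expr: file_integrity_operator_node_failed > 0
      for: 1s
      labels:
        severity: warning
        # The label this bug is about: identifies the raising component's
        # namespace, per the alerting consistency style guide.
        namespace: openshift-file-integrity
      annotations:
        summary: Node {{ $labels.node }} has a file integrity failure
```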
Verification passed with the latest code and 4.11.0-0.nightly-2022-06-30-005428.

Applied the PR patch, deployed the FileIntegrity, and triggered a file integrity failure on a node. Then checked the alert in the GUI, filtering by label namespace=openshift-file-integrity; the alert is displayed, as shown in https://bugzilla.redhat.com/attachment.cgi?id=1893920

# oc get fileintegritynodestatus
NAME                                                               NODE                                         STATUS
example-fileintegrity-ip-10-0-141-72.us-east-2.compute.internal    ip-10-0-141-72.us-east-2.compute.internal    Failed
example-fileintegrity-ip-10-0-145-158.us-east-2.compute.internal   ip-10-0-145-158.us-east-2.compute.internal   Succeeded
example-fileintegrity-ip-10-0-169-133.us-east-2.compute.internal   ip-10-0-169-133.us-east-2.compute.internal   Succeeded
example-fileintegrity-ip-10-0-177-171.us-east-2.compute.internal   ip-10-0-177-171.us-east-2.compute.internal   Succeeded
example-fileintegrity-ip-10-0-197-182.us-east-2.compute.internal   ip-10-0-197-182.us-east-2.compute.internal   Succeeded
example-fileintegrity-ip-10-0-209-136.us-east-2.compute.internal   ip-10-0-209-136.us-east-2.compute.internal   Succeeded

# oc extract cm/aide-example-fileintegrity-ip-10-0-141-72.us-east-2.compute.internal-failed --confirm
integritylog

# cat integritylog
Start timestamp: 2022-07-01 13:35:25 +0000 (AIDE 0.16)
AIDE found differences between database and filesystem!!

Summary:
  Total number of entries:  35777
  Added entries:            1
  Removed entries:          0
  Changed entries:          0

---------------------------------------------------
Added entries:
---------------------------------------------------

d++++++++++++++++: /hostroot/root/test

---------------------------------------------------
The attributes of the (uncompressed) database(s):
---------------------------------------------------

/hostroot/etc/kubernetes/aide.db.gz
  MD5      : u4xAmCIexJayzdvPcSdujQ==
  SHA1     : kV+goC3QvMpxMFxhlyQ04Pb0M2w=
  RMD160   : HnckhVVNj3jsMZB24CyKGEm9LFY=
  TIGER    : K23fBlZ+nH1MBp6avqDXuxGV32LVLrYv
  SHA256   : YgdronUu0anW0xmVuOl2bxto4r+VW6fx
             /n6WdbXqeQ8=
  SHA512   : tPFTZMJnJLWsd+C6x6akOYAzDDa0nYSw
             fhXjvpiG4EaJ61TyBrmX5K58ohVnsBuo
             59rcZZfnYUYTH559AaaRIQ==

End timestamp: 2022-07-01 13:35:55 +0000 (run time: 0m 30s)
Verification failed with FIO v0.1.26-2. It is a little odd, as verification passed with the pre-merge build but failed with FIO v0.1.26-2.

1. Install FIO v0.1.26-2:

$ oc get ip
NAME            CSV                               APPROVAL    APPROVED
install-84sxn   file-integrity-operator.v0.1.26   Automatic   true
$ oc get csv
NAME                              DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0     OpenShift Elasticsearch Operator   5.5.0                Succeeded
file-integrity-operator.v0.1.26   File Integrity Operator            0.1.26               Succeeded
$ oc get pod
NAME                                       READY   STATUS    RESTARTS      AGE
aide-example-fileintegrity-bjptx           1/1     Running   0             19m
aide-example-fileintegrity-jpz4n           1/1     Running   0             19m
aide-example-fileintegrity-qjqhs           1/1     Running   0             19m
aide-example-fileintegrity-s2nhk           1/1     Running   0             19m
aide-example-fileintegrity-sfqmd           1/1     Running   0             19m
aide-example-fileintegrity-zdtj7           1/1     Running   0             19m
file-integrity-operator-764d7bf547-c5lf7   1/1     Running   1 (30m ago)   30m
$ oc describe pod/file-integrity-operator-764d7bf547-c5lf7 | grep Image
    Image:       registry.redhat.io/compliance/openshift-file-integrity-rhel8-operator@sha256:0694bab8e31fd75558c3fcba2e17bf548e668948fbf6124844ce57063a1ad00e
    Image ID:    registry.redhat.io/compliance/openshift-file-integrity-rhel8-operator@sha256:0694bab8e31fd75558c3fcba2e17bf548e668948fbf6124844ce57063a1ad00e

Per https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2076094, it is from build FIO v0.1.26-2.

2. Trigger failures:

$ oc get fileintegritynodestatus
NAME                                                               NODE                                         STATUS
example-fileintegrity-ip-10-0-136-242.us-east-2.compute.internal   ip-10-0-136-242.us-east-2.compute.internal   Failed
example-fileintegrity-ip-10-0-159-42.us-east-2.compute.internal    ip-10-0-159-42.us-east-2.compute.internal    Failed
example-fileintegrity-ip-10-0-175-108.us-east-2.compute.internal   ip-10-0-175-108.us-east-2.compute.internal   Succeeded
example-fileintegrity-ip-10-0-190-195.us-east-2.compute.internal   ip-10-0-190-195.us-east-2.compute.internal   Succeeded
example-fileintegrity-ip-10-0-193-242.us-east-2.compute.internal   ip-10-0-193-242.us-east-2.compute.internal   Succeeded
example-fileintegrity-ip-10-0-202-200.us-east-2.compute.internal   ip-10-0-202-200.us-east-2.compute.internal   Succeeded

3. Check in the GUI: the alert shows when filtered by label alertname=NodeHasIntegrityFailure (see https://bugzilla.redhat.com/attachment.cgi?id=1894946), but it does not show when filtered by label namespace=openshift-file-integrity (see https://bugzilla.redhat.com/attachment.cgi?id=1894947).
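The GUI filter check above can also be expressed mechanically. A minimal, self-contained sketch: the JSON below is a hypothetical label set for a firing NodeHasIntegrityFailure alert (trimmed to the shape the Alertmanager API would return for one alert's labels), and the grep simply verifies the namespace label is present:

```shell
# Hypothetical label set of a firing alert (sample data, not a live query).
labels='{"alertname":"NodeHasIntegrityFailure","node":"ip-10-0-136-242.us-east-2.compute.internal","namespace":"openshift-file-integrity","severity":"warning"}'

# The bug: before the fix, the "namespace" key was absent entirely,
# so filtering by namespace=openshift-file-integrity matched nothing.
if printf '%s' "$labels" | grep -q '"namespace":"openshift-file-integrity"'; then
  echo "namespace label present"
else
  echo "namespace label missing"
fi
# prints "namespace label present"
```

Against a live cluster, the same check would be run on the labels of the actual firing alert rather than on sample data.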
I see what happened: the fix for this bug merged right after our 0.1.26 release commit.

https://code.engineering.redhat.com/gerrit/c/openshift-file-integrity/+/418181 updated our source to e0f24ac1804d10c08da56acf606c1ba67ae5c709. https://github.com/openshift/file-integrity-operator/commits/master shows the fix merged in b188b1b7afd958eca33364e28e1327f8d16e16f2, which comes after that commit in the git history.

We'll increment the release to 0.1.27 and rebuild.
Verification passed with payload 4.12.0-0.nightly-2022-07-17-215842 and file-integrity-operator.v0.1.28.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-17-215842   True        False         32m     Cluster version is 4.12.0-0.nightly-2022-07-17-215842
$ oc get ip
NAME            CSV                               APPROVAL    APPROVED
install-trbq9   file-integrity-operator.v0.1.28   Automatic   true
$ oc get csv
NAME                              DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0     OpenShift Elasticsearch Operator   5.5.0                Succeeded
file-integrity-operator.v0.1.28   File Integrity Operator            0.1.28               Succeeded

Verification steps:

1. Install file-integrity-operator.v0.1.28.

2. Create a FileIntegrity:

$ oc apply -f -<<EOF
apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: example-fileintegrity
spec:
  config:
    gracePeriod: 20
    maxBackups: 5
  debug: true
EOF
fileintegrity.fileintegrity.openshift.io/example-fileintegrity created
$ oc get fileintegrity example-fileintegrity -o=jsonpath={.status}
{"phase":"Active"}

3. Trigger a file integrity check failure:

$ oc debug node/xiyuan19-2-wmvpn-worker-us-east-1a-p7w76 -- chroot /host mkdir /root/test
Starting pod/xiyuan19-2-wmvpn-worker-us-east-1a-p7w76-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
$ oc get fileintegritynodestatus
NAME                                                             NODE                                       STATUS
example-fileintegrity-xiyuan19-2-wmvpn-master-0                  xiyuan19-2-wmvpn-master-0                  Succeeded
example-fileintegrity-xiyuan19-2-wmvpn-master-1                  xiyuan19-2-wmvpn-master-1                  Succeeded
example-fileintegrity-xiyuan19-2-wmvpn-master-2                  xiyuan19-2-wmvpn-master-2                  Succeeded
example-fileintegrity-xiyuan19-2-wmvpn-worker-us-east-1a-p7w76   xiyuan19-2-wmvpn-worker-us-east-1a-p7w76   Failed
example-fileintegrity-xiyuan19-2-wmvpn-worker-us-east-1b-5x64v   xiyuan19-2-wmvpn-worker-us-east-1b-5x64v   Succeeded
example-fileintegrity-xiyuan19-2-wmvpn-worker-us-east-1b-gz6v6   xiyuan19-2-wmvpn-worker-us-east-1b-gz6v6   Succeeded
$ oc get cm
NAME                                                                         DATA   AGE
962a0cf2.openshift.io                                                        0      14m
aide-example-fileintegrity-xiyuan19-2-wmvpn-worker-us-east-1a-p7w76-failed   1      88s
aide-pause                                                                   1      6m10s
aide-reinit                                                                  1      6m10s
example-fileintegrity                                                        1      6m10s
kube-root-ca.crt                                                             1      15m
openshift-service-ca.crt                                                     1      15m
$ oc extract cm/aide-example-fileintegrity-xiyuan19-2-wmvpn-worker-us-east-1a-p7w76-failed --confirm
integritylog
$ cat integritylog
Start timestamp: 2022-07-19 06:14:35 +0000 (AIDE 0.16)
AIDE found differences between database and filesystem!!

Summary:
  Total number of entries:  35766
  Added entries:            1
  Removed entries:          0
  Changed entries:          0

---------------------------------------------------
Added entries:
---------------------------------------------------

d++++++++++++++++: /hostroot/root/test

---------------------------------------------------
The attributes of the (uncompressed) database(s):
---------------------------------------------------

/hostroot/etc/kubernetes/aide.db.gz
  MD5      : H6vsxWQgCXCONNpNAsbK8A==
  SHA1     : 7izTbkVOjc1dKr26GYzBI0VsGvA=
  RMD160   : VkTQP+6k8oYdu3PddmF50vp3NCw=
  TIGER    : L34A9vnYnQkrqUFgvToKaUn4kQ35DGw7
  SHA256   : nxtcnEDrMBVgzwhysqTLpBQIdeVsOiPg
             n4W0aPRWC5U=
  SHA512   : OQun2sGxusc97s/DRcoZ21FffQGPyaAb
             w3SdWtpzQGp6UUxJegkRZReVHOD8fZxk
             WAYL0+Th3XJ7Hko/SxhV8w==

End timestamp: 2022-07-19 06:15:05 +0000 (run time: 0m 30s)

4. Check in the GUI: details are shown in https://bugzilla.redhat.com/attachment.cgi?id=1898038
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift File Integrity Operator bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5538