Bug 2072058

Summary: Worker file integrity remains in initializing state.
Product: OpenShift Container Platform
Reporter: German Parente <gparente>
Component: File Integrity Operator
Assignee: Matt Rogers <mrogers>
Status: CLOSED ERRATA
QA Contact: xiyuan
Severity: high
Priority: high
Version: 4.8
CC: achernet, alosingh, antaylor, ddelcian, dseals, eglottma, jhrozek, jmittapa, lbragsta, mrogers, suprs, wenshen
Flags: xiyuan: needinfo-
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-05-23 09:57:12 UTC
Type: Bug

Description German Parente 2022-04-05 13:57:55 UTC
Description of problem:

The worker FileIntegrity scans are running and producing results. However, the status still shows:

 oc get fileintegrity worker-fileintegrity -o=jsonpath='{.status}'
{"phase":"Initializing"}

The aide worker pods are running:

aide-worker-fileintegrity-l4nx6                                  1/1     Running   2          25h
aide-worker-fileintegrity-nmc6d                                  1/1     Running   2          25h
aide-worker-fileintegrity-pjpbg                                  1/1     Running   2          25h
aide-worker-fileintegrity-twt25                                  1/1     Running   2          25h
aide-worker-fileintegrity-zq86j                                  1/1     Running   2          25h
file-integrity-operator-6d877d8c59-vmvt6                         1/1     Running   0          22h

More details will follow internally.

Comment 23 xiyuan 2022-05-11 08:21:24 UTC
Hi Matt,

The bug was reproduced with a v0.1.21 > v0.1.22 FIO upgrade, so I verified it with a v0.1.21 > v0.1.24 FIO upgrade.
Generally it is fine: the aide-reinit ConfigMap was updated after the FIO upgrade, and the database re-init succeeded when a manual re-init was triggered after the upgrade completed.
The only problem is that /hostroot/run/aide.reinit is missing on the node. Is that expected? Thanks.
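For reference, a minimal local simulation of the re-init flag lifecycle. This assumes (not verified here) that the aide daemon consumes and removes the flag file after re-initializing the database, which would explain why the file is absent on the node once the re-init completes; a temp directory stands in for the node root:

```shell
# Hypothetical sketch: paths under a temp dir stand in for the node's filesystem.
tmp=$(mktemp -d)
mkdir -p "$tmp/run"
# What the updated aide-reinit script does (touch the flag under /run):
touch "$tmp/run/aide.reinit"
# Assumed daemon behavior: re-initialize the db, then remove the flag.
rm -f "$tmp/run/aide.reinit"
# Afterwards the flag is gone, matching the `ls` failure seen on the node:
ls "$tmp/run/aide.reinit" 2>/dev/null || echo "aide.reinit not present"
rm -rf "$tmp"
```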

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.39    True        False         72m     Cluster version is 4.8.39

1. Install FIO v0.1.21, create a FileIntegrity, and trigger a failure:
$ oc get ip
NAME            CSV                               APPROVAL    APPROVED
install-pzl7d   file-integrity-operator.v0.1.21   Automatic   true
$ oc get csv -w
NAME                               DISPLAY                            VERSION     REPLACES   PHASE
elasticsearch-operator.5.2.10-20   OpenShift Elasticsearch Operator   5.2.10-20              
file-integrity-operator.v0.1.21    File Integrity Operator            0.1.21                 Installing
file-integrity-operator.v0.1.21    File Integrity Operator            0.1.21                 Succeeded
^C$ oc get pod
NAME                                       READY   STATUS    RESTARTS   AGE
file-integrity-operator-748cf55bbd-s4v59   1/1     Running   0          30s
$ oc create -f - << EOF
> apiVersion: fileintegrity.openshift.io/v1alpha1
> kind: FileIntegrity
> metadata:
>   name: example-fileintegrity
>   namespace: openshift-file-integrity
> spec:
>   # Change to debug: true to enable more verbose logging from the logcollector
>   # container in the aide pods
>   debug: false
>   config: 
>     gracePeriod: 90
> EOF

fileintegrity.fileintegrity.openshift.io/example-fileintegrity created

$ oc extract cm/aide-reinit --confirm 
aide.sh
$ cat aide.sh 
#!/bin/sh
    touch /hostroot/etc/kubernetes/aide.reinit
$ oc extract cm/aide-pause --confirm
pause.sh
$ cat pause.sh 
#!/bin/sh
	sleep infinity & PID=$!
	trap "kill $PID" INT TERM
	wait $PID || true
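The pause.sh script above uses a common container-entrypoint idiom: background an indefinite sleep, trap INT/TERM to kill it, and wait, so the container's main process exits promptly when the kubelet sends SIGTERM. A self-contained demo of the same pattern (with a simulated signal instead of the kubelet):

```shell
# Run the pause pattern in a subshell, then send it SIGTERM from outside.
sh -c '
  sleep 30 & PID=$!
  trap "kill $PID" INT TERM
  wait $PID || true
  echo "pause loop exited cleanly"
' &
CHILD=$!
sleep 1                 # give the subshell time to install its trap
kill -TERM "$CHILD"     # simulate the kubelet terminating the pod
wait "$CHILD" 2>/dev/null || true
```

Without the trap, the TERM signal would leave the backgrounded sleep orphaned; with it, the script tears down its child and exits within a second instead of waiting out the sleep.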
$ oc get fileintegritynodestatuses
NAME                                                     NODE                               STATUS
example-fileintegrity-xiyuan11-48-bpggz-master-0         xiyuan11-48-bpggz-master-0         Failed
example-fileintegrity-xiyuan11-48-bpggz-master-1         xiyuan11-48-bpggz-master-1         Succeeded
example-fileintegrity-xiyuan11-48-bpggz-master-2         xiyuan11-48-bpggz-master-2         Succeeded
example-fileintegrity-xiyuan11-48-bpggz-worker-0-6rc8j   xiyuan11-48-bpggz-worker-0-6rc8j   Failed
example-fileintegrity-xiyuan11-48-bpggz-worker-0-mnh67   xiyuan11-48-bpggz-worker-0-mnh67   Succeeded
example-fileintegrity-xiyuan11-48-bpggz-worker-0-t4spn   xiyuan11-48-bpggz-worker-0-t4spn   Succeeded

2. Upgrade to v0.1.24:

$ oc get ip
NAME            CSV                               APPROVAL    APPROVED
install-7rvcj   file-integrity-operator.v0.1.24   Automatic   true
install-pzl7d   file-integrity-operator.v0.1.21   Automatic   true
[xiyuan@MiWiFi-RA69-srv func]$ oc get csv
NAME                               DISPLAY                            VERSION     REPLACES   PHASE
elasticsearch-operator.5.2.10-20   OpenShift Elasticsearch Operator   5.2.10-20              Succeeded
file-integrity-operator.v0.1.21    File Integrity Operator            0.1.21                 Succeeded
$ oc get csv
NAME                               DISPLAY                            VERSION     REPLACES                          PHASE
elasticsearch-operator.5.2.10-20   OpenShift Elasticsearch Operator   5.2.10-20                                     Succeeded
file-integrity-operator.v0.1.24    File Integrity Operator            0.1.24      file-integrity-operator.v0.1.21   Succeeded


$ oc extract cm/aide-reinit --confirm
aide.sh
$ cat aide.sh 
#!/bin/sh
    touch /hostroot/run/aide.reinit

3. Trigger a re-init manually:
$ oc debug node/xiyuan11-48-bpggz-master-0 -- chroot /host ls -ltr /etc/kubernetes
Starting pod/xiyuan11-48-bpggz-master-0-debug ...
To use host binaries, run `chroot /host`
total 3860
-rw-r--r--.  1 root root    9179 May 11 06:16 kubeconfig
drwxr-xr-x.  3 root root      19 May 11 06:17 cni
drwxr-xr-x.  3 root root      20 May 11 06:17 kubelet-plugins
drwxr-xr-x. 19 root root    4096 May 11 06:44 static-pod-resources
-rw-r--r--.  1 root root     101 May 11 06:50 apiserver-url.env
drwxr-xr-x.  2 root root     192 May 11 06:50 manifests
-rw-r--r--.  1 root root    5875 May 11 06:50 kubelet-ca.crt
-rw-r--r--.  1 root root    1123 May 11 06:50 ca.crt
-rw-r--r--.  1 root root      94 May 11 06:50 cloud.conf
-rw-r--r--.  1 root root    1076 May 11 06:50 kubelet.conf
-rw-------.  1 root root      67 May 11 07:23 aide.log.backup-20220511T07_23_30
-rw-------.  1 root root 1946990 May 11 07:24 aide.db.gz.new
-rw-------.  1 root root 1946990 May 11 07:24 aide.db.gz
-rw-------.  1 root root     877 May 11 07:45 aide.log.new
-rw-------.  1 root root     877 May 11 07:45 aide.log

Removing debug pod ...
$ oc annotate fileintegrities/example-fileintegrity  file-integrity.openshift.io/re-init=
fileintegrity.fileintegrity.openshift.io/example-fileintegrity annotated

$ oc get fileintegrity example-fileintegrity -o=jsonpath={.status}
{"phase":"Initializing"}[xiyuan@MiWiFi-RA69-srv func]$ 
$ oc get fileintegrity example-fileintegrity -o=jsonpath={.status}
{"phase":"Active"}
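The phase transition can be checked in a script rather than eyeballed. A minimal local sketch of the extraction (no cluster needed; the JSON literal stands in for what `oc get fileintegrity ... -o=jsonpath={.status}` returns):

```shell
# Pull .phase out of the status JSON with POSIX sed -- the same field the
# jsonpath query above selects.
status='{"phase":"Active"}'
phase=$(printf '%s' "$status" | sed -n 's/.*"phase":"\([^"]*\)".*/\1/p')
echo "$phase"    # prints: Active
```

In practice one could loop on this until the phase leaves Initializing, which is exactly the behavior this bug broke.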

$ oc debug node/xiyuan11-48-bpggz-master-0 -- ls -ltr /hostroot/run/aide.reinit
Starting pod/xiyuan11-48-bpggz-master-0-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/hostroot/run/aide.reinit': No such file or directory

Removing debug pod ...
error: non-zero exit code from debug container


$ oc debug node/xiyuan11-48-bpggz-master-0 -- chroot /host ls -ltr /etc/kubernetes
Starting pod/xiyuan11-48-bpggz-master-0-debug ...
To use host binaries, run `chroot /host`
total 5764
-rw-r--r--.  1 root root    9179 May 11 06:16 kubeconfig
drwxr-xr-x.  3 root root      19 May 11 06:17 cni
drwxr-xr-x.  3 root root      20 May 11 06:17 kubelet-plugins
drwxr-xr-x. 19 root root    4096 May 11 06:44 static-pod-resources
-rw-r--r--.  1 root root     101 May 11 06:50 apiserver-url.env
drwxr-xr-x.  2 root root     192 May 11 06:50 manifests
-rw-r--r--.  1 root root    5875 May 11 06:50 kubelet-ca.crt
-rw-r--r--.  1 root root    1123 May 11 06:50 ca.crt
-rw-r--r--.  1 root root      94 May 11 06:50 cloud.conf
-rw-r--r--.  1 root root    1076 May 11 06:50 kubelet.conf
-rw-------.  1 root root      67 May 11 07:23 aide.log.backup-20220511T07_23_30
-rw-------.  1 root root 1946990 May 11 07:47 aide.db.gz.backup-20220511T07_47_50
-rw-------.  1 root root     877 May 11 07:47 aide.log.backup-20220511T07_47_50
-rw-------.  1 root root 1947002 May 11 07:48 aide.db.gz.new
-rw-------.  1 root root 1947002 May 11 07:48 aide.db.gz
-rw-------.  1 root root     651 May 11 07:52 aide.log
-rw-------.  1 root root       0 May 11 07:53 aide.log.new

Removing debug pod ...

Comment 24 xiyuan 2022-05-11 13:11:33 UTC
Correcting the command used to check /hostroot/run/aide.reinit in https://bugzilla.redhat.com/show_bug.cgi?id=2072058#c23; the result is the same.
$ oc debug node/xiyuan11-48-bpggz-master-0 -- chroot /host  ls -ltr /run/aide.reinit
Starting pod/xiyuan11-48-bpggz-master-0-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/run/aide.reinit': No such file or directory

Removing debug pod ...
error: non-zero exit code from debug container

Comment 27 errata-xmlrpc 2022-05-23 09:57:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift File Integrity Operator bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1331