Bug 1867482

Summary: [OCP v46] The 'log-collector' container of api-checks pod shows 'Error' state with exit code 66 during Platform scan
Product: OpenShift Container Platform
Reporter: Prashant Dhamdhere <pdhamdhe>
Component: Compliance Operator
Assignee: Juan Antonio Osorio <josorior>
Status: CLOSED ERRATA
QA Contact: Prashant Dhamdhere <pdhamdhe>
Severity: low
Priority: low
Version: 4.6
CC: jhrozek, josorior, mrogers, xiyuan
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-10-27 16:26:34 UTC
Type: Bug
Attachments:
api-checks pod logs from all containers

Description Prashant Dhamdhere 2020-08-10 07:08:48 UTC
Created attachment 1710924 [details]
api-checks pod logs from all containers

Description of problem: 

The 'log-collector' container of the api-checks pod sometimes shows an 'Error' state with exit code 66 during a Platform scan. 
However, the scan completes without any error in the pod logs.

$ oc get pods -w 
NAME                                   READY   STATUS              RESTARTS   AGE 
aggregator-pod-platform-scan           0/1     Completed           0          4m30s 
compliance-operator-869646dd4f-25bmr   1/1     Running             0          6m1s 
compliance-operator-869646dd4f-4d5hd   1/1     Running             0          6m1s 
compliance-operator-869646dd4f-qlqqm   1/1     Running             0          6m1s 
ocp4-pp-dcb8bc5b5-j74ks                1/1     Running             0          5m12s 
platform-new-scan-api-checks-pod       1/2     NotReady            0          14s 
platform-new-scan-rs-568894d96-qm2nw   0/1     ContainerCreating   0          14s 
platform-scan-api-checks-pod           0/2     Completed           0          5m10s 
rhcos4-pp-58466496cf-qpmxl             1/1     Running             0          5m12s 
platform-new-scan-rs-568894d96-qm2nw   0/1     ContainerCreating   0          15s 
platform-new-scan-rs-568894d96-qm2nw   1/1     Running             0          18s 
platform-new-scan-api-checks-pod       0/2     Error               0          21s  <<---- 
platform-new-scan-api-checks-pod       1/2     NotReady            1          23s 

$ oc get pods/platform-new-scan-api-checks-pod 
NAME                               READY   STATUS      RESTARTS   AGE 
platform-new-scan-api-checks-pod   0/2     Completed   1          42s 


$ oc describe pods/platform-new-scan-api-checks-pod |grep -A 30 "log-collector" 
  log-collector: 
    Container ID:  cri-o://aaaf86327a35f5cc12ff066c89f9f73718dda42e95b7bb815fd671fe141967f0 
    Image:         quay.io/compliance-operator/compliance-operator:latest 
    Image ID:      quay.io/compliance-operator/compliance-operator@sha256:1d03532fcc762d64d1dc3a05d54023105c44298a4337ce1f385050d978019c92 
    Port:          <none> 
    Host Port:     <none> 
    Command: 
      compliance-operator 
      resultscollector 
      --arf-file=/reports/report-arf.xml 
      --results-file=/reports/report.xml 
      --exit-code-file=/reports/exit_code 
      --oscap-output-file=/reports/cmd_output 
      --config-map-name=platform-new-scan-api-checks-pod 
      --owner=platform-new-scan 
      --namespace=openshift-compliance 
      --resultserveruri=https://platform-new-scan-rs:8443/ 
      --tls-client-cert=/etc/pki/tls/tls.crt 
      --tls-client-key=/etc/pki/tls/tls.key 
      --tls-ca=/etc/pki/tls/ca.crt 
    State:          Terminated 
      Reason:       Completed 
      Exit Code:    0 
      Started:      Mon, 10 Aug 2020 09:56:03 +0530 
      Finished:     Mon, 10 Aug 2020 09:56:13 +0530 
    Last State:     Terminated 
      Reason:       Error    <<------- 
      Exit Code:    66           <<------- 
      Started:      Mon, 10 Aug 2020 09:55:51 +0530 
      Finished:     Mon, 10 Aug 2020 09:56:01 +0530 
    Ready:          False 


Version-Release number of selected component (if applicable): 

4.6.0-0.nightly-2020-08-07-202945 

How reproducible: 

Intermittent: the issue reproduces roughly once within the first 2 scans after the operator is deployed, or once in every 3 to 4 consecutive scans.

Steps to Reproduce: 

1. Clone the compliance-operator git repo  

$ git clone https://github.com/openshift/compliance-operator.git  

2. Create the 'openshift-compliance' namespace  

$ oc create -f compliance-operator/deploy/ns.yaml    

3. Switch to the 'openshift-compliance' namespace  

$ oc project openshift-compliance  

4. Deploy the CustomResourceDefinitions.  

$ for f in $(ls -1 compliance-operator/deploy/crds/*crd.yaml); do oc create -f $f; done  

5. Deploy the compliance-operator.  

$ oc create -f compliance-operator/deploy/  

6. Deploy a ComplianceSuite CR to perform a Platform scan on a schedule of every 3 or 5 minutes  

$ oc create -f - << EOF 
apiVersion: compliance.openshift.io/v1alpha1 
kind: ComplianceSuite 
metadata: 
  name: example-compliancesuite 
spec: 
  autoApplyRemediations: false 
  schedule: "0/3 * * * *" 
  scans: 
    - name: platform-scan 
      scanType: Platform 
      profile: xccdf_org.ssgproject.content_profile_moderate 
      content: ssg-ocp4-ds.xml 
      contentImage: quay.io/complianceascode/ocp4:latest 
      debug: true 
EOF 

7. Monitor the api-checks pod status continuously; it shows an 'Error' state and a restart count of 1 

$ oc get pods -w 

8. Once the issue reproduces, edit the ComplianceSuite object and change the schedule parameter 
to "0 1 * * *" so that the api-checks pod does not get terminated (a non-interactive alternative is shown after the command below).

$ oc edit compliancesuite example-compliancesuite
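
For a scripted reproduction, the same change can be applied non-interactively with oc patch; a sketch, targeting the spec.schedule field shown in the CR above:

$ oc patch compliancesuite example-compliancesuite --type merge -p '{"spec":{"schedule":"0 1 * * *"}}'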

9. Describe the api-checks pod to see the log-collector container state 

$ oc describe pods/platform-new-scan-api-checks-pod 


Actual results: 

The 'log-collector' container of the api-checks pod shows an 'Error' state with exit code 66 during a Platform scan. 

Expected results: 

The 'log-collector' container of the api-checks pod should not go into an 'Error' state with exit code 66 during a Platform scan. 

Additional info:

Attaching the api-checks pod logs ('api-checks-pod.log')

Comment 1 Prashant Dhamdhere 2020-08-12 06:08:29 UTC
Noticed a similar issue (scan pods terminating with exit code 66) during a Node scan as well; here are some details:

$ oc get pods -w
NAME                                                        READY   STATUS     RESTARTS   AGE
compliance-operator-869646dd4f-47dm7                        1/1     Running    0          25m
compliance-operator-869646dd4f-d88hm                        1/1     Running    0          25m
compliance-operator-869646dd4f-zc27r                        1/1     Running    0          25m
ocp4-pp-dcb8bc5b5-9ql2c                                     1/1     Running    0          25m
rhcos4-pp-58466496cf-q2lhn                                  1/1     Running    0          25m
worker-scan-ip-10-0-48-121.us-east-2.compute.internal-pod   1/2     NotReady   0          75s
worker-scan-ip-10-0-50-160.us-east-2.compute.internal-pod   1/2     NotReady   1          74s  <<---- 
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod   1/2     NotReady   0          75s
worker-scan-ip-10-0-67-26.us-east-2.compute.internal-pod    1/2     NotReady   0          75s
worker-scan-rs-66564cb46-lpdsw                              1/1     Running    0          75s
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod   0/2     Error      0          78s  <<----
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod   1/2     NotReady   1          82s  <<----
worker-scan-ip-10-0-67-26.us-east-2.compute.internal-pod    0/2     Completed   0          88s

$ oc describe pods/worker-scan-ip-10-0-50-160.us-east-2.compute.internal-pod |grep -A 30 "log-collector" 
  log-collector:
    Container ID:  cri-o://5f332ba3b257f97abf84bbe78fbb13ea806c72ba3ba73014a9f6651d5eb41cdb
    Image:         quay.io/compliance-operator/compliance-operator:latest
    Image ID:      quay.io/compliance-operator/compliance-operator@sha256:1d03532fcc762d64d1dc3a05d54023105c44298a4337ce1f385050d978019c92
    Port:          <none>
    Host Port:     <none>
    Command:
      compliance-operator
      resultscollector
    .....
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 12 Aug 2020 11:25:43 +0530
      Finished:     Wed, 12 Aug 2020 11:26:26 +0530
    Last State:     Terminated
      Reason:       Error    <<------- 
      Exit Code:    66         <<------- 
      Started:      Wed, 12 Aug 2020 11:25:00 +0530
      Finished:     Wed, 12 Aug 2020 11:25:41 +0530


$ oc describe pods/worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod |grep -A 30 "log-collector" 
  log-collector:
    Container ID:  cri-o://46c0fc65ac67744106bf9261fea227b5629f6cdc079f7b597f756bc4c1056698
    Image:         quay.io/compliance-operator/compliance-operator:latest
    Image ID:      quay.io/compliance-operator/compliance-operator@sha256:1d03532fcc762d64d1dc3a05d54023105c44298a4337ce1f385050d978019c92
    Port:          <none>
    Host Port:     <none>
    Command:
      compliance-operator
      resultscollector
      ....
      --tls-ca=/etc/pki/tls/ca.crt
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 12 Aug 2020 11:26:16 +0530
      Finished:     Wed, 12 Aug 2020 11:27:17 +0530
    Last State:     Terminated    
      Reason:       Error         <<------- 
      Exit Code:    66             <<------- 
      Started:      Wed, 12 Aug 2020 11:25:01 +0530
      Finished:     Wed, 12 Aug 2020 11:26:12 +0530

Comment 2 Juan Antonio Osorio 2020-08-12 11:35:55 UTC
Got any more logs about this? Or could we check an environment where this is reproducible? I haven't been able to reproduce this myself.

Comment 3 Juan Antonio Osorio 2020-08-12 11:53:29 UTC
Was able to reproduce it after a while...

Seems we've got a race condition:

oc logs platform-scan-api-checks-pod -c log-collector -f                                                      
I0812 11:50:18.364531       1 request.go:621] Throttling request took 1.021754151s, request: GET:https://172.30.0.1:443/apis/operator.openshift.io/v1?timeout=32s
{"level":"info","ts":1597233019.677537,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/exit_code"}
{"level":"info","ts":1597233019.677665,"logger":"cmd","msg":"Got exit-code from file","exit-code":"2"}
{"level":"info","ts":1597233019.6778421,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597233024.225037,"logger":"cmd","msg":"File needs compression","results-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597233024.2251403,"logger":"cmd","msg":"Compressed results bytes size","bytes":92185}
{"level":"info","ts":1597233024.2253785,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report.xml"}
{"level":"info","ts":1597233024.226219,"logger":"cmd","msg":"Trying to upload to resultserver","url":"https://platform-scan-rs:8443/"}
{"level":"info","ts":1597233024.2280633,"logger":"cmd","msg":"Trying to upload results ConfigMap"}
{"level":"info","ts":1597233024.2559106,"logger":"cmd","msg":"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nDate: Wed, 12 Aug 2020 11:50:24 GMT\r\n\r\n"}
{"level":"info","ts":1597233024.2560139,"logger":"cmd","msg":"Uploaded to resultserver"}
==================
WARNING: DATA RACE
Write at 0x00c0005f5640 by goroutine 42:
  main.handleCompleteSCAPResults.func2()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:362 +0x8c
Previous write at 0x00c0005f5640 by goroutine 12:
  main.handleCompleteSCAPResults.func1()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:352 +0x68
Goroutine 42 (running) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:361 +0x32c
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e
Goroutine 12 (finished) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:351 +0x2bc
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e
==================
{"level":"info","ts":1597233024.2864218,"logger":"cmd","msg":"Uploaded ConfigMap"}
Found 1 data race(s)
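
For context: Go binaries built with the race detector print reports like the one above and, with the default GORACE exitcode setting, exit with status 66 after a race has been reported, which matches the log-collector container's exit code. The trace shows two anonymous goroutines created in handleCompleteSCAPResults (resultcollector.go:351 and :361) writing the same address. Below is a minimal, hypothetical sketch of that class of bug and one conventional channel-based fix; the names are illustrative, not the operator's actual code (the real fix is in the PR linked in the next comment).

// race_sketch.go -- run with: go run -race race_sketch.go
// (exits with status 66 if the race detector fires)
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Racy version: both goroutines assign to the same captured 'err' variable,
// producing the "write ... previous write at the same address" pattern above.
func handleResultsRacy(uploadServer, uploadConfigMap func() error) error {
	var wg sync.WaitGroup
	var err error
	wg.Add(2)
	go func() { defer wg.Done(); err = uploadServer() }()
	go func() { defer wg.Done(); err = uploadConfigMap() }()
	wg.Wait()
	return err
}

// Fixed version: each goroutine sends its result over a channel, so no
// single memory location is ever written concurrently.
func handleResultsFixed(uploadServer, uploadConfigMap func() error) error {
	errs := make(chan error, 2)
	go func() { errs <- uploadServer() }()
	go func() { errs <- uploadConfigMap() }()
	for i := 0; i < 2; i++ {
		if err := <-errs; err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ok := func() error { return nil }
	fail := func() error { return errors.New("upload failed") }
	fmt.Println(handleResultsRacy(ok, ok))    // may trip the race detector
	fmt.Println(handleResultsFixed(ok, fail)) // deterministic and race-free
}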

Comment 4 Juan Antonio Osorio 2020-08-12 12:00:47 UTC
Fixing here https://github.com/openshift/compliance-operator/pull/399

Comment 5 Jakub Hrozek 2020-08-12 15:34:09 UTC
Prashant, do you know if the logs you attached to the opening comment captured the issue? I don't see the debug logs that warn about the race in that log file.
Alternatively, can you still reproduce the issue and check whether you see the same log messages as in comment #3?

Comment 6 Prashant Dhamdhere 2020-08-13 07:49:27 UTC
Hi Jakub,

I captured those logs for reference purposes using the command $ oc logs platform-scan-api-checks-pod --all-containers, which did not capture the issue.
But yes, I reproduced the issue and noticed similar debug logs warning about the race condition.

$ oc logs platform-scan-api-checks-pod -c log-collector -f
I0813 07:15:13.771123       1 request.go:621] Throttling request took 1.037091702s, request: GET:https://172.30.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
{"level":"info","ts":1597302915.081068,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/exit_code"}
{"level":"info","ts":1597302915.0811741,"logger":"cmd","msg":"Got exit-code from file","exit-code":"2"}
{"level":"info","ts":1597302915.0813122,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597302919.6576288,"logger":"cmd","msg":"File needs compression","results-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597302919.6577134,"logger":"cmd","msg":"Compressed results bytes size","bytes":92227}
{"level":"info","ts":1597302919.6579232,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report.xml"}
{"level":"info","ts":1597302919.6586392,"logger":"cmd","msg":"Trying to upload to resultserver","url":"https://platform-scan-rs:8443/"}
{"level":"info","ts":1597302919.6615288,"logger":"cmd","msg":"Trying to upload results ConfigMap"}
{"level":"info","ts":1597302919.6868777,"logger":"cmd","msg":"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nDate: Thu, 13 Aug 2020 07:15:19 GMT\r\n\r\n"}
{"level":"info","ts":1597302919.6870172,"logger":"cmd","msg":"Uploaded to resultserver"}
==================
WARNING: DATA RACE
Write at 0x00c0001aab10 by goroutine 9:
  main.handleCompleteSCAPResults.func2()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:362 +0x8c

Previous write at 0x00c0001aab10 by goroutine 164:
  main.handleCompleteSCAPResults.func1()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:352 +0x68

Goroutine 9 (running) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:361 +0x32c
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e

Goroutine 164 (finished) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:351 +0x2bc
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e
==================
{"level":"info","ts":1597302919.710159,"logger":"cmd","msg":"Uploaded ConfigMap"}
Found 1 data race(s)

Comment 10 Prashant Dhamdhere 2020-08-27 12:03:42 UTC
It looks good. The api-checks pod no longer reports an Error state after performing multiple 
scans via the cronjob, and it no longer restarts.


Verified on: 
OCP 4.6.0-0.nightly-2020-08-27-005538

$ oc get cronjobs -w
NAME                               SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
example-compliancesuite-rerunner   0/3 * * * *   False     0        116s            19m


$ oc get jobs --watch
NAME                                          COMPLETIONS   DURATION   AGE
example-compliancesuite-rerunner-1598528700   1/1           8s         8m17s
example-compliancesuite-rerunner-1598528880   1/1           7s         5m16s
example-compliancesuite-rerunner-1598529060   1/1           8s         2m16s

$ oc get pods -w
NAME                                   READY   STATUS              RESTARTS   AGE
compliance-operator-869646dd4f-jvhkk   1/1     Running             0          140m
ocp4-pp-7f89f556cc-jg99r               1/1     Running             0          139m
platform-scan-api-checks-pod           0/2     Init:1/2            0          7s
platform-scan-rs-697c8bb994-t7nzk      0/1     ContainerCreating   0          7s
rhcos4-pp-7c44999587-pc8cz             1/1     Running             0          139m
platform-scan-api-checks-pod           0/2     PodInitializing     0          7s
platform-scan-api-checks-pod           2/2     Running             0          9s
platform-scan-api-checks-pod           1/2     NotReady            0          10s
platform-scan-rs-697c8bb994-t7nzk      0/1     ContainerCreating   0          24s
platform-scan-rs-697c8bb994-t7nzk      1/1     Running             0          27s
platform-scan-api-checks-pod           0/2     Completed           0          33s

$ oc get pods -w
NAME                                   READY   STATUS      RESTARTS   AGE
aggregator-pod-platform-scan           0/1     Completed   0          105s
compliance-operator-869646dd4f-jvhkk   1/1     Running     0          142m
ocp4-pp-7f89f556cc-jg99r               1/1     Running     0          141m
platform-scan-api-checks-pod           0/2     Completed   0          2m25s
rhcos4-pp-7c44999587-pc8cz             1/1     Running     0          141m

$ oc get pods -w
NAME                                                READY   STATUS      RESTARTS   AGE
aggregator-pod-platform-scan                        0/1     Completed   0          97s
compliance-operator-869646dd4f-jvhkk                1/1     Running     0          151m
example-compliancesuite-rerunner-1598528160-2m52r   0/1     Completed   0          8m7s
example-compliancesuite-rerunner-1598528340-9wh9m   0/1     Completed   0          5m6s
example-compliancesuite-rerunner-1598528520-vd7xl   0/1     Completed   0          2m6s
ocp4-pp-7f89f556cc-jg99r                            1/1     Running     0          150m
platform-scan-api-checks-pod                        0/2     Completed   0          118s
rhcos4-pp-7c44999587-pc8cz                          1/1     Running     0          150m


$ oc get pods/platform-scan-api-checks-pod -w
NAME                           READY   STATUS      RESTARTS   AGE
platform-scan-api-checks-pod   0/2     Completed   0          92s
platform-scan-api-checks-pod   0/2     Terminating   0          2m58s
platform-scan-api-checks-pod   0/2     Terminating   0          2m58s
platform-scan-api-checks-pod   0/2     Pending       0          0s
platform-scan-api-checks-pod   0/2     Pending       0          0s
platform-scan-api-checks-pod   0/2     Init:0/2      0          0s
platform-scan-api-checks-pod   0/2     Init:0/2      0          3s
platform-scan-api-checks-pod   0/2     Init:1/2      0          7s
platform-scan-api-checks-pod   0/2     Init:1/2      0          9s
platform-scan-api-checks-pod   0/2     PodInitializing   0          11s
platform-scan-api-checks-pod   2/2     Running           0          13s
platform-scan-api-checks-pod   1/2     NotReady          0          14s
platform-scan-api-checks-pod   0/2     Completed         0          21s


$ oc get pods/platform-scan-api-checks-pod -w
NAME                           READY   STATUS      RESTARTS   AGE
platform-scan-api-checks-pod   0/2     Completed   0          2m18s
platform-scan-api-checks-pod   0/2     Terminating   0          2m58s
platform-scan-api-checks-pod   0/2     Terminating   0          2m58s
platform-scan-api-checks-pod   0/2     Pending       0          0s
platform-scan-api-checks-pod   0/2     Pending       0          0s
platform-scan-api-checks-pod   0/2     Init:0/2      0          0s
platform-scan-api-checks-pod   0/2     Init:0/2      0          2s
platform-scan-api-checks-pod   0/2     Init:1/2      0          3s
platform-scan-api-checks-pod   0/2     Init:1/2      0          5s
platform-scan-api-checks-pod   0/2     PodInitializing   0          7s
platform-scan-api-checks-pod   2/2     Running           0          9s
platform-scan-api-checks-pod   1/2     NotReady          0          10s
platform-scan-api-checks-pod   0/2     Completed         0          17s

Comment 12 errata-xmlrpc 2020-10-27 16:26:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196