Created attachment 1710924 [details]
api-checks pod logs from all containers

Description of problem:
The 'log-collector' container of the api-checks pod sometimes shows an 'Error' state with exit code 66 during a Platform scan. However, the scan completes without any error in the pod logs.

$ oc get pods -w
NAME                                   READY   STATUS              RESTARTS   AGE
aggregator-pod-platform-scan           0/1     Completed           0          4m30s
compliance-operator-869646dd4f-25bmr   1/1     Running             0          6m1s
compliance-operator-869646dd4f-4d5hd   1/1     Running             0          6m1s
compliance-operator-869646dd4f-qlqqm   1/1     Running             0          6m1s
ocp4-pp-dcb8bc5b5-j74ks                1/1     Running             0          5m12s
platform-new-scan-api-checks-pod       1/2     NotReady            0          14s
platform-new-scan-rs-568894d96-qm2nw   0/1     ContainerCreating   0          14s
platform-scan-api-checks-pod           0/2     Completed           0          5m10s
rhcos4-pp-58466496cf-qpmxl             1/1     Running             0          5m12s
platform-new-scan-rs-568894d96-qm2nw   0/1     ContainerCreating   0          15s
platform-new-scan-rs-568894d96-qm2nw   1/1     Running             0          18s
platform-new-scan-api-checks-pod       0/2     Error               0          21s   <<----
platform-new-scan-api-checks-pod       1/2     NotReady            1          23s

$ oc get pods/platform-new-scan-api-checks-pod
NAME                               READY   STATUS      RESTARTS   AGE
platform-new-scan-api-checks-pod   0/2     Completed   1          42s

$ oc describe pods/platform-new-scan-api-checks-pod | grep -A 30 "log-collector"
  log-collector:
    Container ID:  cri-o://aaaf86327a35f5cc12ff066c89f9f73718dda42e95b7bb815fd671fe141967f0
    Image:         quay.io/compliance-operator/compliance-operator:latest
    Image ID:      quay.io/compliance-operator/compliance-operator@sha256:1d03532fcc762d64d1dc3a05d54023105c44298a4337ce1f385050d978019c92
    Port:          <none>
    Host Port:     <none>
    Command:
      compliance-operator
      resultscollector
      --arf-file=/reports/report-arf.xml
      --results-file=/reports/report.xml
      --exit-code-file=/reports/exit_code
      --oscap-output-file=/reports/cmd_output
      --config-map-name=platform-new-scan-api-checks-pod
      --owner=platform-new-scan
      --namespace=openshift-compliance
      --resultserveruri=https://platform-new-scan-rs:8443/
      --tls-client-cert=/etc/pki/tls/tls.crt
      --tls-client-key=/etc/pki/tls/tls.key
      --tls-ca=/etc/pki/tls/ca.crt
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 10 Aug 2020 09:56:03 +0530
      Finished:     Mon, 10 Aug 2020 09:56:13 +0530
    Last State:     Terminated
      Reason:       Error       <<-------
      Exit Code:    66          <<-------
      Started:      Mon, 10 Aug 2020 09:55:51 +0530
      Finished:     Mon, 10 Aug 2020 09:56:01 +0530
    Ready:          False

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-07-202945

How reproducible:
Intermittent -- sometimes the issue reproduces in the first 2 scans after deploying the operator, other times once in 3 to 4 consecutive scans.

Steps to Reproduce:
1. Clone the compliance-operator git repo:
   $ git clone https://github.com/openshift/compliance-operator.git
2. Create the 'openshift-compliance' namespace:
   $ oc create -f compliance-operator/deploy/ns.yaml
3. Switch to the 'openshift-compliance' namespace:
   $ oc project openshift-compliance
4. Deploy the CustomResourceDefinitions:
   $ for f in $(ls -1 compliance-operator/deploy/crds/*crd.yaml); do oc create -f $f; done
5. Deploy the compliance-operator:
   $ oc create -f compliance-operator/deploy/
6. Deploy a ComplianceSuite CR that performs a Platform scan on a schedule of every 3 or 5 minutes:
   $ oc create -f - << EOF
   apiVersion: compliance.openshift.io/v1alpha1
   kind: ComplianceSuite
   metadata:
     name: example-compliancesuite
   spec:
     autoApplyRemediations: false
     schedule: "0/3 * * * *"
     scans:
       - name: platform-scan
         scanType: Platform
         profile: xccdf_org.ssgproject.content_profile_moderate
         content: ssg-ocp4-ds.xml
         contentImage: quay.io/complianceascode/ocp4:latest
         debug: true
   EOF
7. Monitor the api-checks pod status continuously; it shows an 'Error' state and a restart count of 1:
   $ oc get pods -w
8. Once the issue reproduces, edit the ComplianceSuite object and change the schedule parameter to "0 1 * * *" so that the api-checks pod is not terminated:
   $ oc edit compliancesuite example-compliancesuite
9. Describe the api-checks pod to see the log-collector container state:
   $ oc describe pods/platform-new-scan-api-checks-pod

Actual results:
The 'log-collector' container of the api-checks pod shows an 'Error' state with exit code 66 during the Platform scan.

Expected results:
The 'log-collector' container of the api-checks pod should not enter an 'Error' state with exit code 66 during the Platform scan.

Additional info:
Attaching api-checks pod logs 'api-checks-pod.log'
Noticed a similar issue (scan pods terminating with exit code 66) during a Node scan as well; here are some details:

$ oc get pods -w
NAME                                                        READY   STATUS      RESTARTS   AGE
compliance-operator-869646dd4f-47dm7                        1/1     Running     0          25m
compliance-operator-869646dd4f-d88hm                        1/1     Running     0          25m
compliance-operator-869646dd4f-zc27r                        1/1     Running     0          25m
ocp4-pp-dcb8bc5b5-9ql2c                                     1/1     Running     0          25m
rhcos4-pp-58466496cf-q2lhn                                  1/1     Running     0          25m
worker-scan-ip-10-0-48-121.us-east-2.compute.internal-pod   1/2     NotReady    0          75s
worker-scan-ip-10-0-50-160.us-east-2.compute.internal-pod   1/2     NotReady    1          74s   <<----
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod   1/2     NotReady    0          75s
worker-scan-ip-10-0-67-26.us-east-2.compute.internal-pod    1/2     NotReady    0          75s
worker-scan-rs-66564cb46-lpdsw                              1/1     Running     0          75s
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod   0/2     Error       0          78s   <<----
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod   1/2     NotReady    1          82s   <<----
worker-scan-ip-10-0-67-26.us-east-2.compute.internal-pod    0/2     Completed   0          88s

$ oc describe pods/worker-scan-ip-10-0-50-160.us-east-2.compute.internal-pod | grep -A 30 "log-collector"
  log-collector:
    Container ID:  cri-o://5f332ba3b257f97abf84bbe78fbb13ea806c72ba3ba73014a9f6651d5eb41cdb
    Image:         quay.io/compliance-operator/compliance-operator:latest
    Image ID:      quay.io/compliance-operator/compliance-operator@sha256:1d03532fcc762d64d1dc3a05d54023105c44298a4337ce1f385050d978019c92
    Port:          <none>
    Host Port:     <none>
    Command:
      compliance-operator
      resultscollector
      .....
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 12 Aug 2020 11:25:43 +0530
      Finished:     Wed, 12 Aug 2020 11:26:26 +0530
    Last State:     Terminated
      Reason:       Error       <<-------
      Exit Code:    66          <<-------
      Started:      Wed, 12 Aug 2020 11:25:00 +0530
      Finished:     Wed, 12 Aug 2020 11:25:41 +0530

$ oc describe pods/worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod | grep -A 30 "log-collector"
  log-collector:
    Container ID:  cri-o://46c0fc65ac67744106bf9261fea227b5629f6cdc079f7b597f756bc4c1056698
    Image:         quay.io/compliance-operator/compliance-operator:latest
    Image ID:      quay.io/compliance-operator/compliance-operator@sha256:1d03532fcc762d64d1dc3a05d54023105c44298a4337ce1f385050d978019c92
    Port:          <none>
    Host Port:     <none>
    Command:
      compliance-operator
      resultscollector
      ....
      --tls-ca=/etc/pki/tls/ca.crt
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 12 Aug 2020 11:26:16 +0530
      Finished:     Wed, 12 Aug 2020 11:27:17 +0530
    Last State:     Terminated
      Reason:       Error       <<-------
      Exit Code:    66          <<-------
      Started:      Wed, 12 Aug 2020 11:25:01 +0530
      Finished:     Wed, 12 Aug 2020 11:26:12 +0530
Got any more logs about this? Or could we check an environment where this is reproducible? I haven't been able to reproduce this myself.
Was able to reproduce it after a while... Seems we've got a race condition:

$ oc logs platform-scan-api-checks-pod -c log-collector -f
I0812 11:50:18.364531       1 request.go:621] Throttling request took 1.021754151s, request: GET:https://172.30.0.1:443/apis/operator.openshift.io/v1?timeout=32s
{"level":"info","ts":1597233019.677537,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/exit_code"}
{"level":"info","ts":1597233019.677665,"logger":"cmd","msg":"Got exit-code from file","exit-code":"2"}
{"level":"info","ts":1597233019.6778421,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597233024.225037,"logger":"cmd","msg":"File needs compression","results-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597233024.2251403,"logger":"cmd","msg":"Compressed results bytes size","bytes":92185}
{"level":"info","ts":1597233024.2253785,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report.xml"}
{"level":"info","ts":1597233024.226219,"logger":"cmd","msg":"Trying to upload to resultserver","url":"https://platform-scan-rs:8443/"}
{"level":"info","ts":1597233024.2280633,"logger":"cmd","msg":"Trying to upload results ConfigMap"}
{"level":"info","ts":1597233024.2559106,"logger":"cmd","msg":"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nDate: Wed, 12 Aug 2020 11:50:24 GMT\r\n\r\n"}
{"level":"info","ts":1597233024.2560139,"logger":"cmd","msg":"Uploaded to resultserver"}
==================
WARNING: DATA RACE
Write at 0x00c0005f5640 by goroutine 42:
  main.handleCompleteSCAPResults.func2()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:362 +0x8c

Previous write at 0x00c0005f5640 by goroutine 12:
  main.handleCompleteSCAPResults.func1()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:352 +0x68

Goroutine 42 (running) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:361 +0x32c
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e

Goroutine 12 (finished) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:351 +0x2bc
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e
==================
{"level":"info","ts":1597233024.2864218,"logger":"cmd","msg":"Uploaded ConfigMap"}
Found 1 data race(s)
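The trace shows two closures spawned by handleCompleteSCAPResults (func1 at resultcollector.go:352 and func2 at :362) making unsynchronized writes to the same address. A minimal Go sketch of that shape and its race-free variant follows; the function names and error-variable layout are hypothetical stand-ins, not the actual resultcollector code:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the two concurrent uploads.
func uploadToResultServer() error  { return nil }
func uploadResultConfigMap() error { return fmt.Errorf("configmap upload failed") }

// The racy shape flagged by the detector looks roughly like this: both
// goroutines assign to a single `err` variable captured by both closures,
// so the two writes race with each other:
//
//	var err error
//	go func() { err = uploadToResultServer() }()  // write #1
//	go func() { err = uploadResultConfigMap() }() // write #2
//
// Race-free shape: give each goroutine its own variable and only read
// them after wg.Wait(), which establishes a happens-before edge.
func uploadBoth() error {
	var wg sync.WaitGroup
	var serverErr, cmErr error // one variable per goroutine -- no shared write

	wg.Add(2)
	go func() {
		defer wg.Done()
		serverErr = uploadToResultServer()
	}()
	go func() {
		defer wg.Done()
		cmErr = uploadResultConfigMap()
	}()
	wg.Wait() // after this, reading serverErr and cmErr is safe

	if serverErr != nil {
		return serverErr
	}
	return cmErr
}

func main() {
	if err := uploadBoth(); err != nil {
		fmt.Println("upload error:", err)
	}
}
```

Running the racy variant under `go run -race` (or `go test -race`) produces a report with exactly this "Write at ... / Previous write at ..." structure; the fixed variant passes cleanly.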
Fixing here https://github.com/openshift/compliance-operator/pull/399
Prashant, do you know if the logs you attached to the opening comment captured the issue? I don't see the debug logs that warn about the race in that log file. Alternatively, can you still reproduce the issue and check whether you see the same log messages as in comment #3?
Hi Jakub,

I captured those logs for reference purposes using the command

$ oc logs platform-scan-api-checks-pod --all-containers

which did not capture the issue. But yes, I reproduced the issue again and noticed similar debug logs that warn about the race condition.

$ oc logs platform-scan-api-checks-pod -c log-collector -f
I0813 07:15:13.771123       1 request.go:621] Throttling request took 1.037091702s, request: GET:https://172.30.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
{"level":"info","ts":1597302915.081068,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/exit_code"}
{"level":"info","ts":1597302915.0811741,"logger":"cmd","msg":"Got exit-code from file","exit-code":"2"}
{"level":"info","ts":1597302915.0813122,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597302919.6576288,"logger":"cmd","msg":"File needs compression","results-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597302919.6577134,"logger":"cmd","msg":"Compressed results bytes size","bytes":92227}
{"level":"info","ts":1597302919.6579232,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report.xml"}
{"level":"info","ts":1597302919.6586392,"logger":"cmd","msg":"Trying to upload to resultserver","url":"https://platform-scan-rs:8443/"}
{"level":"info","ts":1597302919.6615288,"logger":"cmd","msg":"Trying to upload results ConfigMap"}
{"level":"info","ts":1597302919.6868777,"logger":"cmd","msg":"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nDate: Thu, 13 Aug 2020 07:15:19 GMT\r\n\r\n"}
{"level":"info","ts":1597302919.6870172,"logger":"cmd","msg":"Uploaded to resultserver"}
==================
WARNING: DATA RACE
Write at 0x00c0001aab10 by goroutine 9:
  main.handleCompleteSCAPResults.func2()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:362 +0x8c

Previous write at 0x00c0001aab10 by goroutine 164:
  main.handleCompleteSCAPResults.func1()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:352 +0x68

Goroutine 9 (running) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:361 +0x32c
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e

Goroutine 164 (finished) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:351 +0x2bc
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e
==================
{"level":"info","ts":1597302919.710159,"logger":"cmd","msg":"Uploaded ConfigMap"}
Found 1 data race(s)
Merged with https://github.com/openshift/compliance-operator/commit/cd341d5f7e7d0888bcca7acebb155674669303f8
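For reference, another idiomatic race-free pattern for this kind of concurrent upload is to funnel each goroutine's error through a channel, so that a single receiver is the only reader. This is a generic sketch with hypothetical names, not necessarily how the linked commit fixes it:

```go
package main

import "fmt"

// Hypothetical stand-in: pretend only the ConfigMap upload fails.
func uploadArtifact(name string) error {
	if name == "configmap" {
		return fmt.Errorf("%s: upload failed", name)
	}
	return nil
}

// Each goroutine sends its result over the channel; the loop below is
// the only reader, so no variable is ever written by two goroutines.
func uploadAll(names ...string) error {
	errc := make(chan error, len(names)) // buffered so senders never block
	for _, n := range names {
		n := n // per-iteration copy (needed before Go 1.22 loop-var semantics)
		go func() { errc <- uploadArtifact(n) }()
	}
	var first error
	for range names {
		if err := <-errc; err != nil && first == nil {
			first = err // keep the first error, drain the rest
		}
	}
	return first
}

func main() {
	fmt.Println(uploadAll("resultserver", "configmap"))
}
```

Compared with per-goroutine variables plus a WaitGroup, the channel version scales more naturally when the number of concurrent uploads is not fixed at compile time.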
It looks good. Now the api-checks pod does not report an Error state after performing multiple scans via the cronjob, and it does not restart.

Verified on: OCP 4.6.0-0.nightly-2020-08-27-005538

$ oc get cronjobs -w
NAME                               SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
example-compliancesuite-rerunner   0/3 * * * *   False     0        116s            19m

$ oc get jobs --watch
NAME                                          COMPLETIONS   DURATION   AGE
example-compliancesuite-rerunner-1598528700   1/1           8s         8m17s
example-compliancesuite-rerunner-1598528880   1/1           7s         5m16s
example-compliancesuite-rerunner-1598529060   1/1           8s         2m16s

$ oc get pods -w
NAME                                   READY   STATUS              RESTARTS   AGE
compliance-operator-869646dd4f-jvhkk   1/1     Running             0          140m
ocp4-pp-7f89f556cc-jg99r               1/1     Running             0          139m
platform-scan-api-checks-pod           0/2     Init:1/2            0          7s
platform-scan-rs-697c8bb994-t7nzk      0/1     ContainerCreating   0          7s
rhcos4-pp-7c44999587-pc8cz             1/1     Running             0          139m
platform-scan-api-checks-pod           0/2     PodInitializing     0          7s
platform-scan-api-checks-pod           2/2     Running             0          9s
platform-scan-api-checks-pod           1/2     NotReady            0          10s
platform-scan-rs-697c8bb994-t7nzk      0/1     ContainerCreating   0          24s
platform-scan-rs-697c8bb994-t7nzk      1/1     Running             0          27s
platform-scan-api-checks-pod           0/2     Completed           0          33s

$ oc get pods -w
NAME                                   READY   STATUS      RESTARTS   AGE
aggregator-pod-platform-scan           0/1     Completed   0          105s
compliance-operator-869646dd4f-jvhkk   1/1     Running     0          142m
ocp4-pp-7f89f556cc-jg99r               1/1     Running     0          141m
platform-scan-api-checks-pod           0/2     Completed   0          2m25s
rhcos4-pp-7c44999587-pc8cz             1/1     Running     0          141m

$ oc get pods -w
NAME                                                READY   STATUS      RESTARTS   AGE
aggregator-pod-platform-scan                        0/1     Completed   0          97s
compliance-operator-869646dd4f-jvhkk                1/1     Running     0          151m
example-compliancesuite-rerunner-1598528160-2m52r   0/1     Completed   0          8m7s
example-compliancesuite-rerunner-1598528340-9wh9m   0/1     Completed   0          5m6s
example-compliancesuite-rerunner-1598528520-vd7xl   0/1     Completed   0          2m6s
ocp4-pp-7f89f556cc-jg99r                            1/1     Running     0          150m
platform-scan-api-checks-pod                        0/2     Completed   0          118s
rhcos4-pp-7c44999587-pc8cz                          1/1     Running     0          150m

$ oc get pods/platform-scan-api-checks-pod -w
NAME                           READY   STATUS            RESTARTS   AGE
platform-scan-api-checks-pod   0/2     Completed         0          92s
platform-scan-api-checks-pod   0/2     Terminating       0          2m58s
platform-scan-api-checks-pod   0/2     Terminating       0          2m58s
platform-scan-api-checks-pod   0/2     Pending           0          0s
platform-scan-api-checks-pod   0/2     Pending           0          0s
platform-scan-api-checks-pod   0/2     Init:0/2          0          0s
platform-scan-api-checks-pod   0/2     Init:0/2          0          3s
platform-scan-api-checks-pod   0/2     Init:1/2          0          7s
platform-scan-api-checks-pod   0/2     Init:1/2          0          9s
platform-scan-api-checks-pod   0/2     PodInitializing   0          11s
platform-scan-api-checks-pod   2/2     Running           0          13s
platform-scan-api-checks-pod   1/2     NotReady          0          14s
platform-scan-api-checks-pod   0/2     Completed         0          21s

$ oc get pods/platform-scan-api-checks-pod -w
NAME                           READY   STATUS            RESTARTS   AGE
platform-scan-api-checks-pod   0/2     Completed         0          2m18s
platform-scan-api-checks-pod   0/2     Terminating       0          2m58s
platform-scan-api-checks-pod   0/2     Terminating       0          2m58s
platform-scan-api-checks-pod   0/2     Pending           0          0s
platform-scan-api-checks-pod   0/2     Pending           0          0s
platform-scan-api-checks-pod   0/2     Init:0/2          0          0s
platform-scan-api-checks-pod   0/2     Init:0/2          0          2s
platform-scan-api-checks-pod   0/2     Init:1/2          0          3s
platform-scan-api-checks-pod   0/2     Init:1/2          0          5s
platform-scan-api-checks-pod   0/2     PodInitializing   0          7s
platform-scan-api-checks-pod   2/2     Running           0          9s
platform-scan-api-checks-pod   1/2     NotReady          0          10s
platform-scan-api-checks-pod   0/2     Completed         0          17s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196