Bug 1867482
| Summary: | [OCP v46] The 'log-collector' container of api-checks pod shows 'Error' state with exit code 66 during Platform scan |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Compliance Operator |
| Version: | 4.6 |
| Target Release: | 4.6.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Status: | CLOSED ERRATA |
| Severity: | low |
| Priority: | low |
| Reporter: | Prashant Dhamdhere <pdhamdhe> |
| Assignee: | Juan Antonio Osorio <josorior> |
| QA Contact: | Prashant Dhamdhere <pdhamdhe> |
| CC: | jhrozek, josorior, mrogers, xiyuan |
| Type: | Bug |
| Last Closed: | 2020-10-27 16:26:34 UTC |
Description
Prashant Dhamdhere
2020-08-10 07:08:48 UTC
Noticed a similar issue (scan pods terminating with exit code 66) during a Node scan as well; here are some details:

$ oc get pods -w
NAME                                                         READY   STATUS      RESTARTS   AGE
compliance-operator-869646dd4f-47dm7                         1/1     Running     0          25m
compliance-operator-869646dd4f-d88hm                         1/1     Running     0          25m
compliance-operator-869646dd4f-zc27r                         1/1     Running     0          25m
ocp4-pp-dcb8bc5b5-9ql2c                                      1/1     Running     0          25m
rhcos4-pp-58466496cf-q2lhn                                   1/1     Running     0          25m
worker-scan-ip-10-0-48-121.us-east-2.compute.internal-pod    1/2     NotReady    0          75s
worker-scan-ip-10-0-50-160.us-east-2.compute.internal-pod    1/2     NotReady    1          74s   <<----
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod    1/2     NotReady    0          75s
worker-scan-ip-10-0-67-26.us-east-2.compute.internal-pod     1/2     NotReady    0          75s
worker-scan-rs-66564cb46-lpdsw                               1/1     Running     0          75s
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod    0/2     Error       0          78s   <<----
worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod    1/2     NotReady    1          82s   <<----
worker-scan-ip-10-0-67-26.us-east-2.compute.internal-pod     0/2     Completed   0          88s

$ oc describe pods/worker-scan-ip-10-0-50-160.us-east-2.compute.internal-pod | grep -A 30 "log-collector"
log-collector:
  Container ID:  cri-o://5f332ba3b257f97abf84bbe78fbb13ea806c72ba3ba73014a9f6651d5eb41cdb
  Image:         quay.io/compliance-operator/compliance-operator:latest
  Image ID:      quay.io/compliance-operator/compliance-operator@sha256:1d03532fcc762d64d1dc3a05d54023105c44298a4337ce1f385050d978019c92
  Port:          <none>
  Host Port:     <none>
  Command:
    compliance-operator
    resultscollector
    .....
  State:          Terminated
    Reason:       Completed
    Exit Code:    0
    Started:      Wed, 12 Aug 2020 11:25:43 +0530
    Finished:     Wed, 12 Aug 2020 11:26:26 +0530
  Last State:     Terminated
    Reason:       Error    <<-------
    Exit Code:    66       <<-------
    Started:      Wed, 12 Aug 2020 11:25:00 +0530
    Finished:     Wed, 12 Aug 2020 11:25:41 +0530

$ oc describe pods/worker-scan-ip-10-0-65-217.us-east-2.compute.internal-pod | grep -A 30 "log-collector"
log-collector:
  Container ID:  cri-o://46c0fc65ac67744106bf9261fea227b5629f6cdc079f7b597f756bc4c1056698
  Image:         quay.io/compliance-operator/compliance-operator:latest
  Image ID:      quay.io/compliance-operator/compliance-operator@sha256:1d03532fcc762d64d1dc3a05d54023105c44298a4337ce1f385050d978019c92
  Port:          <none>
  Host Port:     <none>
  Command:
    compliance-operator
    resultscollector
    ....
    --tls-ca=/etc/pki/tls/ca.crt
  State:          Terminated
    Reason:       Completed
    Exit Code:    0
    Started:      Wed, 12 Aug 2020 11:26:16 +0530
    Finished:     Wed, 12 Aug 2020 11:27:17 +0530
  Last State:     Terminated
    Reason:       Error    <<-------
    Exit Code:    66       <<-------
    Started:      Wed, 12 Aug 2020 11:25:01 +0530
    Finished:     Wed, 12 Aug 2020 11:26:12 +0530

Got any more logs about this? Or could we check an environment where this is reproducible? I haven't been able to reproduce this myself.

Was able to reproduce it after a while...
Seems we've got a race condition:

$ oc logs platform-scan-api-checks-pod -c log-collector -f
I0812 11:50:18.364531       1 request.go:621] Throttling request took 1.021754151s, request: GET:https://172.30.0.1:443/apis/operator.openshift.io/v1?timeout=32s
{"level":"info","ts":1597233019.677537,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/exit_code"}
{"level":"info","ts":1597233019.677665,"logger":"cmd","msg":"Got exit-code from file","exit-code":"2"}
{"level":"info","ts":1597233019.6778421,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597233024.225037,"logger":"cmd","msg":"File needs compression","results-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597233024.2251403,"logger":"cmd","msg":"Compressed results bytes size","bytes":92185}
{"level":"info","ts":1597233024.2253785,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report.xml"}
{"level":"info","ts":1597233024.226219,"logger":"cmd","msg":"Trying to upload to resultserver","url":"https://platform-scan-rs:8443/"}
{"level":"info","ts":1597233024.2280633,"logger":"cmd","msg":"Trying to upload results ConfigMap"}
{"level":"info","ts":1597233024.2559106,"logger":"cmd","msg":"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nDate: Wed, 12 Aug 2020 11:50:24 GMT\r\n\r\n"}
{"level":"info","ts":1597233024.2560139,"logger":"cmd","msg":"Uploaded to resultserver"}
==================
WARNING: DATA RACE
Write at 0x00c0005f5640 by goroutine 42:
  main.handleCompleteSCAPResults.func2()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:362 +0x8c

Previous write at 0x00c0005f5640 by goroutine 12:
  main.handleCompleteSCAPResults.func1()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:352 +0x68

Goroutine 42 (running) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:361 +0x32c
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e

Goroutine 12 (finished) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:351 +0x2bc
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e
==================
{"level":"info","ts":1597233024.2864218,"logger":"cmd","msg":"Uploaded ConfigMap"}
Found 1 data race(s)
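The trace above comes from Go's race detector, which also explains the exit status seen in the pod: a binary built with -race prints "Found N data race(s)" and terminates with the detector's default exit code, 66. The flagged pattern is two goroutines, started one after the other in handleCompleteSCAPResults, both writing to the same captured variable. A minimal sketch of that pattern (the names below are hypothetical stand-ins, not the operator's actual code) looks like this:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the two upload steps; not the operator's code.
func uploadToResultServer() error   { return nil }
func uploadResultsConfigMap() error { return nil }

func main() {
	var wg sync.WaitGroup
	var uploadErr error // shared by both goroutines: this is the racy write

	wg.Add(2)
	go func() { // analogous to handleCompleteSCAPResults.func1
		defer wg.Done()
		uploadErr = uploadToResultServer() // write #1
	}()
	go func() { // analogous to handleCompleteSCAPResults.func2
		defer wg.Done()
		uploadErr = uploadResultsConfigMap() // write #2, unsynchronized with #1
	}()
	wg.Wait()

	if uploadErr != nil {
		fmt.Println("upload failed:", uploadErr)
	}
}
```

Running the sketch with `go run -race` produces the same kind of "WARNING: DATA RACE" report, and once a race has been observed the process exits with status 66.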
Prashant, do you know if the logs you attached to the opening comment captured the issue? I don't see the debug logs that warn about the race in that log file. Alternatively, can you still reproduce the issue and check if you see the same log messages as in comment #3?

Hi Jakub,

I captured those logs for reference purposes using the command

$ oc logs platform-scan-api-checks-pod --all-containers

which did not capture the issue. But yes, I reproduced the issue and noticed similar debug logs that warn about the race condition.

$ oc logs platform-scan-api-checks-pod -c log-collector -f
I0813 07:15:13.771123       1 request.go:621] Throttling request took 1.037091702s, request: GET:https://172.30.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
{"level":"info","ts":1597302915.081068,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/exit_code"}
{"level":"info","ts":1597302915.0811741,"logger":"cmd","msg":"Got exit-code from file","exit-code":"2"}
{"level":"info","ts":1597302915.0813122,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597302919.6576288,"logger":"cmd","msg":"File needs compression","results-file":"/reports/report-arf.xml"}
{"level":"info","ts":1597302919.6577134,"logger":"cmd","msg":"Compressed results bytes size","bytes":92227}
{"level":"info","ts":1597302919.6579232,"logger":"cmd","msg":"Results file found, will upload it.","resuts-file":"/reports/report.xml"}
{"level":"info","ts":1597302919.6586392,"logger":"cmd","msg":"Trying to upload to resultserver","url":"https://platform-scan-rs:8443/"}
{"level":"info","ts":1597302919.6615288,"logger":"cmd","msg":"Trying to upload results ConfigMap"}
{"level":"info","ts":1597302919.6868777,"logger":"cmd","msg":"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nDate: Thu, 13 Aug 2020 07:15:19 GMT\r\n\r\n"}
{"level":"info","ts":1597302919.6870172,"logger":"cmd","msg":"Uploaded to resultserver"}
==================
WARNING: DATA RACE
Write at 0x00c0001aab10 by goroutine 9:
  main.handleCompleteSCAPResults.func2()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:362 +0x8c

Previous write at 0x00c0001aab10 by goroutine 164:
  main.handleCompleteSCAPResults.func1()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:352 +0x68

Goroutine 9 (running) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:361 +0x32c
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e

Goroutine 164 (finished) created at:
  main.handleCompleteSCAPResults()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:351 +0x2bc
  main.resultCollectorMain()
      /go/src/github.com/openshift/compliance-operator/cmd/manager/resultcollector.go:455 +0x23c
  github.com/spf13/cobra.(*Command).execute()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:846 +0x8e0
  github.com/spf13/cobra.(*Command).ExecuteC()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:950 +0x499
  main.main()
      /go/src/github.com/openshift/compliance-operator/vendor/github.com/spf13/cobra/command.go:887 +0x4e
==================
{"level":"info","ts":1597302919.710159,"logger":"cmd","msg":"Uploaded ConfigMap"}
Found 1 data race(s)

Merged with https://github.com/openshift/compliance-operator/commit/cd341d5f7e7d0888bcca7acebb155674669303f8
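The linked commit is the authoritative fix. As a hedged sketch of the usual way to remove this kind of race, each goroutine gets its own error variable (so no variable is written by more than one goroutine), and the results are combined only after the WaitGroup has been waited on (again with hypothetical names, not the actual change):

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the two upload steps, as in the sketch above.
func uploadToResultServer() error   { return nil }
func uploadResultsConfigMap() error { return nil }

func uploadAll() error {
	var wg sync.WaitGroup
	var serverErr, cmErr error // one variable per goroutine: no shared write

	wg.Add(2)
	go func() {
		defer wg.Done()
		serverErr = uploadToResultServer()
	}()
	go func() {
		defer wg.Done()
		cmErr = uploadResultsConfigMap()
	}()
	wg.Wait() // Wait establishes happens-before, so both writes are visible here

	if serverErr != nil {
		return serverErr
	}
	return cmErr
}

func main() {
	if err := uploadAll(); err != nil {
		fmt.Println("upload failed:", err)
	}
}
```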
It looks good. Now the api-checks pod does not report an Error state after performing multiple scans via the cronjob, and it does not restart.

Verified on: OCP 4.6.0-0.nightly-2020-08-27-005538

$ oc get cronjobs -w
NAME                               SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
example-compliancesuite-rerunner   0/3 * * * *   False     0        116s            19m

$ oc get jobs --watch
NAME                                          COMPLETIONS   DURATION   AGE
example-compliancesuite-rerunner-1598528700   1/1           8s         8m17s
example-compliancesuite-rerunner-1598528880   1/1           7s         5m16s
example-compliancesuite-rerunner-1598529060   1/1           8s         2m16s

$ oc get pods -w
NAME                                   READY   STATUS              RESTARTS   AGE
compliance-operator-869646dd4f-jvhkk   1/1     Running             0          140m
ocp4-pp-7f89f556cc-jg99r               1/1     Running             0          139m
platform-scan-api-checks-pod           0/2     Init:1/2            0          7s
platform-scan-rs-697c8bb994-t7nzk      0/1     ContainerCreating   0          7s
rhcos4-pp-7c44999587-pc8cz             1/1     Running             0          139m
platform-scan-api-checks-pod           0/2     PodInitializing     0          7s
platform-scan-api-checks-pod           2/2     Running             0          9s
platform-scan-api-checks-pod           1/2     NotReady            0          10s
platform-scan-rs-697c8bb994-t7nzk      0/1     ContainerCreating   0          24s
platform-scan-rs-697c8bb994-t7nzk      1/1     Running             0          27s
platform-scan-api-checks-pod           0/2     Completed           0          33s

$ oc get pods -w
NAME                                   READY   STATUS      RESTARTS   AGE
aggregator-pod-platform-scan           0/1     Completed   0          105s
compliance-operator-869646dd4f-jvhkk   1/1     Running     0          142m
ocp4-pp-7f89f556cc-jg99r               1/1     Running     0          141m
platform-scan-api-checks-pod           0/2     Completed   0          2m25s
rhcos4-pp-7c44999587-pc8cz             1/1     Running     0          141m

$ oc get pods -w
NAME                                                READY   STATUS      RESTARTS   AGE
aggregator-pod-platform-scan                        0/1     Completed   0          97s
compliance-operator-869646dd4f-jvhkk                1/1     Running     0          151m
example-compliancesuite-rerunner-1598528160-2m52r   0/1     Completed   0          8m7s
example-compliancesuite-rerunner-1598528340-9wh9m   0/1     Completed   0          5m6s
example-compliancesuite-rerunner-1598528520-vd7xl   0/1     Completed   0          2m6s
ocp4-pp-7f89f556cc-jg99r                            1/1     Running     0          150m
platform-scan-api-checks-pod                        0/2     Completed   0          118s
rhcos4-pp-7c44999587-pc8cz                          1/1     Running     0          150m

$ oc get pods/platform-scan-api-checks-pod -w
NAME                           READY   STATUS            RESTARTS   AGE
platform-scan-api-checks-pod   0/2     Completed         0          92s
platform-scan-api-checks-pod   0/2     Terminating       0          2m58s
platform-scan-api-checks-pod   0/2     Terminating       0          2m58s
platform-scan-api-checks-pod   0/2     Pending           0          0s
platform-scan-api-checks-pod   0/2     Pending           0          0s
platform-scan-api-checks-pod   0/2     Init:0/2          0          0s
platform-scan-api-checks-pod   0/2     Init:0/2          0          3s
platform-scan-api-checks-pod   0/2     Init:1/2          0          7s
platform-scan-api-checks-pod   0/2     Init:1/2          0          9s
platform-scan-api-checks-pod   0/2     PodInitializing   0          11s
platform-scan-api-checks-pod   2/2     Running           0          13s
platform-scan-api-checks-pod   1/2     NotReady          0          14s
platform-scan-api-checks-pod   0/2     Completed         0          21s

$ oc get pods/platform-scan-api-checks-pod -w
NAME                           READY   STATUS            RESTARTS   AGE
platform-scan-api-checks-pod   0/2     Completed         0          2m18s
platform-scan-api-checks-pod   0/2     Terminating       0          2m58s
platform-scan-api-checks-pod   0/2     Terminating       0          2m58s
platform-scan-api-checks-pod   0/2     Pending           0          0s
platform-scan-api-checks-pod   0/2     Pending           0          0s
platform-scan-api-checks-pod   0/2     Init:0/2          0          0s
platform-scan-api-checks-pod   0/2     Init:0/2          0          2s
platform-scan-api-checks-pod   0/2     Init:1/2          0          3s
platform-scan-api-checks-pod   0/2     Init:1/2          0          5s
platform-scan-api-checks-pod   0/2     PodInitializing   0          7s
platform-scan-api-checks-pod   2/2     Running           0          9s
platform-scan-api-checks-pod   1/2     NotReady          0          10s
platform-scan-api-checks-pod   0/2     Completed         0          17s

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196