Bug 1862939
| Summary: | The compliance-operator hangs when a single node taints | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | xiyuan |
| Component: | Compliance Operator | Assignee: | Juan Antonio Osorio <josorior> |
| Status: | CLOSED ERRATA | QA Contact: | Prashant Dhamdhere <pdhamdhe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.6 | CC: | dahernan, jhrozek, josorior, mrogers, xiyuan |
| Target Milestone: | --- | ||
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-10-27 16:22:34 UTC | Type: | Bug |
Description
xiyuan
2020-08-03 09:26:09 UTC
I'm actually unsure about the expected result. If it wasn't possible to scan a node due to taints, seems to me like we should surface an error. Any opinions on this?

(In reply to Juan Antonio Osorio from comment #1)
> I'm actually unsure about the expected result. If it wasn't possible to scan
> a node due to taints, seems to me like we should surface an error. Any
> opinions on this?

I guess so, the only q. would be how to detect that the pod was not scheduled due to a taint, I guess we'd have to look into events?

(In reply to Jakub Hrozek from comment #2)
> (In reply to Juan Antonio Osorio from comment #1)
> > I'm actually unsure about the expected result. If it wasn't possible to scan
> > a node due to taints, seems to me like we should surface an error. Any
> > opinions on this?
>
> I guess so, the only q. would be how to detect that the pod was not
> scheduled due to a taint, I guess we'd have to look into events?

Gotta research that. Not sure if that info is available as part of the pod status, or if we gotta listen for events.

Regardless, I think the resulting state should be an error and an indication that not all the nodes could be checked. Do you agree?

(In reply to Juan Antonio Osorio from comment #3)
> (In reply to Jakub Hrozek from comment #2)
> > (In reply to Juan Antonio Osorio from comment #1)
> > > I'm actually unsure about the expected result. If it wasn't possible to scan
> > > a node due to taints, seems to me like we should surface an error. Any
> > > opinions on this?
> >
> > I guess so, the only q. would be how to detect that the pod was not
> > scheduled due to a taint, I guess we'd have to look into events?
>
> Gotta research that. Not sure if that info is available as part of the pod
> status, or if we gotta listen for events.
>
> Regardless, I think the resulting state should be an error and an indication
> that not all the nodes could be checked. Do you agree?

Yes, if the scan cannot be reconciled, the end result should be an error.
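On the question of whether the pod status alone is enough to tell that a scan pod was not scheduled: Kubernetes records a `PodScheduled` condition on the pod, and the scheduler sets it to `False` with reason `Unschedulable` when no node fits. The following is only a sketch of that check against the JSON shape returned by `oc get pod -o json`; it is not the code from the operator, and the function name and the sample pod are made up for illustration.

```python
from typing import Optional


def unschedulable_reason(pod: dict) -> Optional[str]:
    """Return the scheduler's message if the pod is pending because it
    could not be scheduled, otherwise None.

    `pod` is assumed to be a dict shaped like `oc get pod <name> -o json`.
    """
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return cond.get("message", "")
    return None


# Hypothetical sample pod, reduced to the fields the check reads.
pending_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/6 nodes are available: 1 node(s) had taint "
                           "{key1: value1}, that the pod didn't tolerate, "
                           "5 node(s) didn't match node selector.",
            }
        ],
    }
}

completed_pod = {"status": {"phase": "Succeeded", "conditions": []}}

if __name__ == "__main__":
    print(unschedulable_reason(pending_pod))
    print(unschedulable_reason(completed_pod))
```

A check like this avoids watching events, at the cost of relying on the condition being populated by the time the pod is inspected.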
Got it, took that into account in https://github.com/openshift/compliance-operator/pull/394
Now the compliance-operator does not hang when a single node is tainted, and the scan result reports ERROR in both the compliancesuite and the compliancescan.
Verified on: 4.6.0-0.nightly-2020-09-05-015624
$ oc get pods
NAME READY STATUS RESTARTS AGE
aggregator-pod-worker-scan 0/1 Completed 0 66s
compliance-operator-869646dd4f-7dzh7 1/1 Running 0 3h24m
ocp4-pp-6786c5f5b-tz6xr 1/1 Running 0 9m26s
rhcos4-pp-78c8cc9d44-sj2p2 1/1 Running 0 9m26s
worker-scan-pdhamdhe07-l8gp6-compute-0-pod 0/2 Pending 0 4m57s <<--- tainted node pod
worker-scan-pdhamdhe07-l8gp6-compute-1-pod 0/2 Completed 0 4m57s
worker-scan-pdhamdhe07-l8gp6-compute-2-pod 0/2 Completed 0 4m57s
Monday 07 September 2020 10:26:45 PM IST
$ oc get compliancescan
NAME PHASE RESULT
worker-scan DONE ERROR
$ oc get compliancesuite
NAME PHASE RESULT
worker-compliancesuite DONE ERROR
$ oc get cm
NAME DATA AGE
compliance-operator-lock 0 3h24m
worker-scan-openscap-container-entrypoint 1 5m32s
worker-scan-openscap-env-map 5 5m32s
worker-scan-openscap-env-map-platform 4 5m32s
worker-scan-pdhamdhe07-l8gp6-compute-0-pod 2 5m31s <<-- tainted
worker-scan-pdhamdhe07-l8gp6-compute-1-pod 2 3m28s
worker-scan-pdhamdhe07-l8gp6-compute-2-pod 2 111s
The scan reports an error message for the tainted node.
$ oc extract cm/worker-scan-pdhamdhe07-l8gp6-compute-0-pod --confirm
error-msg
exit-code
$ cat error-msg
Couldn't schedule scan pod 'worker-scan-pdhamdhe07-l8gp6-compute-0-pod': 0/6 nodes are available: 1 node(s) had taint {key1: value1}, that the pod didn't tolerate, 5 node(s) didn't match node selector.
$ cat exit-code
unschedulable
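The `error-msg` above embeds the scheduler's own summary of why no node was usable. As a rough illustration of how that text breaks down, the sketch below splits it into the available/total node counts and the per-reason counts. The scheduler message is free-form text rather than a stable interface, so the regular expressions here are an assumption based only on the one message shown in this report, and `parse_scheduler_message` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# "<available>/<total> nodes are available: <reasons>"
HEADER = re.compile(r"(\d+)/(\d+) nodes are available: (.*)")
# Each reason starts with "<count> node(s) " and runs until the next
# ", <count> node(s) " or the trailing period.
REASON = re.compile(r"(\d+) node\(s\) (.+?)(?=, \d+ node\(s\) |\.?$)")


def parse_scheduler_message(msg: str) -> Tuple[int, int, List[Tuple[int, str]]]:
    """Split a scheduler message into (available, total, [(count, reason), ...])."""
    header = HEADER.search(msg)
    if header is None:
        raise ValueError("not a recognised scheduler message")
    available = int(header.group(1))
    total = int(header.group(2))
    reasons = [(int(n), why) for n, why in REASON.findall(header.group(3))]
    return available, total, reasons


# The error-msg content quoted in this report.
ERROR_MSG = (
    "Couldn't schedule scan pod 'worker-scan-pdhamdhe07-l8gp6-compute-0-pod': "
    "0/6 nodes are available: 1 node(s) had taint {key1: value1}, "
    "that the pod didn't tolerate, 5 node(s) didn't match node selector."
)

if __name__ == "__main__":
    available, total, reasons = parse_scheduler_message(ERROR_MSG)
    print(f"{available} of {total} nodes available")
    for count, why in reasons:
        print(f"  {count}: {why}")
```

Read this way, the message says that the one node matching the pod's node selector was rejected for its taint, and the other five were never candidates.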
The scan was performed on the non-tainted nodes.
$ oc extract cm/worker-scan-pdhamdhe07-l8gp6-compute-1-pod --confirm
exit-code
results
$ cat exit-code
2
$ head results
<?xml version="1.0" encoding="UTF-8"?>
<TestResult xmlns="http://checklists.nist.gov/xccdf/1.2" id="xccdf_org.open-scap_testresult_xccdf_org.ssgproject.content_profile_moderate" start-time="2020-09-07T16:52:04+00:00" end-time="2020-09-07T16:53:13+00:00" version="0.1.52" test-system="cpe:/a:redhat:openscap:1.3.3">
<benchmark href="/content/ssg-rhcos4-ds.xml" id="xccdf_org.ssgproject.content_benchmark_RHCOS-4"/>
<title>OSCAP Scan Result</title>
<profile idref="xccdf_org.ssgproject.content_profile_moderate"/>
<target>pdhamdhe07-l8gp6-compute-1</target>
<target-facts>
<fact name="urn:xccdf:fact:identifier" type="string">chroot:///host</fact>
<fact name="urn:xccdf:fact:scanner:name" type="string">OpenSCAP</fact>
<fact name="urn:xccdf:fact:scanner:version" type="string">1.3.3</fact>
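To confirm which node and profile a `results` file belongs to without reading the XML by eye, the header fields can be pulled out with a namespace-aware parse. This is a minimal sketch using Python's standard library; the sample document is the `head` output above, trimmed and closed by hand so that it parses, and `summarize_test_result` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# XCCDF 1.2 namespace, as declared on the TestResult element above.
XCCDF_NS = {"xccdf": "http://checklists.nist.gov/xccdf/1.2"}


def summarize_test_result(xml_text: str) -> dict:
    """Extract target, profile and timing from an XCCDF TestResult document."""
    root = ET.fromstring(xml_text)
    target = root.find("xccdf:target", XCCDF_NS)
    profile = root.find("xccdf:profile", XCCDF_NS)
    return {
        "target": target.text if target is not None else None,
        "profile": profile.get("idref") if profile is not None else None,
        "start": root.get("start-time"),
        "end": root.get("end-time"),
    }


# Trimmed sample: only the elements shown in the report, with closing
# tags added by hand. A real results file contains many more elements.
SAMPLE = """<TestResult xmlns="http://checklists.nist.gov/xccdf/1.2"
    id="xccdf_org.open-scap_testresult_xccdf_org.ssgproject.content_profile_moderate"
    start-time="2020-09-07T16:52:04+00:00" end-time="2020-09-07T16:53:13+00:00">
  <title>OSCAP Scan Result</title>
  <profile idref="xccdf_org.ssgproject.content_profile_moderate"/>
  <target>pdhamdhe07-l8gp6-compute-1</target>
</TestResult>"""

if __name__ == "__main__":
    for key, value in summarize_test_result(SAMPLE).items():
        print(f"{key}: {value}")
```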
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196