Description of problem:
The compliance-operator hangs when a single node is tainted.

Version-Release:
Cluster version 4.6.0-0.nightly-2020-07-25-091217

How reproducible:
Always

Steps to Reproduce:
1. Install the compliance operator:
   1.1 Clone the compliance-operator git repo
       $ git clone https://github.com/openshift/compliance-operator.git
   1.2 Create the 'openshift-compliance' namespace
       $ oc create -f compliance-operator/deploy/ns.yaml
   1.3 Switch to the 'openshift-compliance' namespace
       $ oc project openshift-compliance
   1.4 Deploy the CustomResourceDefinitions
       $ for f in $(ls -1 compliance-operator/deploy/crds/*crd.yaml); do oc create -f $f; done
   1.5 Deploy the compliance-operator
       $ oc create -f compliance-operator/deploy/

2. Add a taint to one worker node
   $ kubectl taint nodes ip-10-0-135-150.us-east-2.compute.internal key1=value1:NoSchedule
   node/ip-10-0-135-150.us-east-2.compute.internal tainted

3. Deploy a ComplianceSuite CR with a schedule:
   $ oc create -f - <<EOF
   > apiVersion: compliance.openshift.io/v1alpha1
   > kind: ComplianceSuite
   > metadata:
   >   name: example-compliancesuite
   > spec:
   >   autoApplyRemediations: true
   >   schedule: "*/15 * * * *"
   >   scans:
   >     - name: rhcos-scan
   >       profile: xccdf_org.ssgproject.content_profile_moderate
   >       content: ssg-rhcos4-ds.xml
   >       contentImage: quay.io/complianceascode/ocp4:latest
   >       debug: true
   >       nodeSelector:
   >         node-role.kubernetes.io/worker: ""
   > EOF
   compliancesuite.compliance.openshift.io/example-compliancesuite created

Actual results:
The compliance-operator hangs when a single node is tainted. The scan pod for the tainted node stays in Pending status, no remediations are applied, and no cron jobs are scheduled.

$ oc get all
NAME                                                             READY   STATUS     RESTARTS   AGE
pod/compliance-operator-869646dd4f-58lnl                         1/1     Running    0          19m
pod/compliance-operator-869646dd4f-kpz7g                         1/1     Running    0          19m
pod/compliance-operator-869646dd4f-nvfpw                         1/1     Running    0          19m
pod/ocp4-pp-dcb8bc5b5-cm5nv                                      1/1     Running    0          19m
pod/rhcos-scan-ip-10-0-135-150.us-east-2.compute.internal-pod    0/2     Pending    0          118s
pod/rhcos-scan-ip-10-0-187-86.us-east-2.compute.internal-pod     1/2     NotReady   0          118s
pod/rhcos-scan-ip-10-0-213-238.us-east-2.compute.internal-pod    2/2     Running    0          118s
pod/rhcos-scan-rs-65bb84784c-x9bdb                               1/1     Running    0          119s
pod/rhcos4-pp-58466496cf-n77r4                                   1/1     Running    0          19m

NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/compliance-operator-metrics   ClusterIP   172.30.134.140   <none>        8383/TCP,8686/TCP   19m
service/rhcos-scan-rs                 ClusterIP   172.30.126.50    <none>        8443/TCP            2m1s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/compliance-operator   3/3     3            3           19m
deployment.apps/ocp4-pp               1/1     1            1           19m
deployment.apps/rhcos-scan-rs         1/1     1            1           2m2s
deployment.apps/rhcos4-pp             1/1     1            1           19m

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/compliance-operator-869646dd4f   3         3         3       20m
replicaset.apps/ocp4-pp-dcb8bc5b5                1         1         1       19m
replicaset.apps/rhcos-scan-rs-65bb84784c         1         1         1       2m3s
replicaset.apps/rhcos4-pp-58466496cf             1         1         1       19m

$ oc describe pod/rhcos-scan-ip-10-0-135-150.us-east-2.compute.internal-pod
...
Events:
  Type     Reason            Age        From   Message
  ----     ------            ----       ----   -------
  Warning  FailedScheduling  <unknown>          0/6 nodes are available: 1 node(s) had taint {key1: value1}, that the pod didn't tolerate, 5 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>          0/6 nodes are available: 1 node(s) had taint {key1: value1}, that the pod didn't tolerate, 5 node(s) didn't match node selector.
$ oc get compliancesuite
NAME                      PHASE     RESULT
example-compliancesuite   RUNNING   NOT-AVAILABLE

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
00-worker                                          057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
01-master-container-runtime                        057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
01-master-kubelet                                  057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
01-worker-container-runtime                        057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
01-worker-kubelet                                  057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
99-master-generated-registries                     057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
99-master-ssh                                                                                 3.1.0             167m
99-worker-generated-registries                     057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
99-worker-ssh                                                                                 3.1.0             167m
rendered-master-39aea8e699d444a3dae231b65fcb8d65   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             144m
rendered-master-95b05172633be01453ce238cfd32c4e0   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
rendered-worker-56723acf0aa81071a5040a098638632b   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             144m
rendered-worker-64411a4005e12b1abf365c37ec9665db   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m

Expected results:
The compliance-operator should not hang when a single node is tainted. The remediations and cron jobs should get scheduled.
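Additional info:
For context, the scan pod for the tainted node stays Pending because the scan pods carry no toleration matching the taint added in step 2. Purely as a sketch of the scheduling mechanics (not an operator feature), a pod spec carrying a toleration like the following, assuming the key1=value1:NoSchedule taint from the reproducer, would be allowed onto that node:

   tolerations:
   - key: "key1"
     operator: "Equal"
     value: "value1"
     effect: "NoSchedule"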
I'm actually unsure about the expected result. If it wasn't possible to scan a node due to taints, seems to me like we should surface an error. Any opinions on this?
(In reply to Juan Antonio Osorio from comment #1)
> I'm actually unsure about the expected result. If it wasn't possible to scan
> a node due to taints, seems to me like we should surface an error. Any
> opinions on this?

I guess so, the only q. would be how to detect that the pod was not scheduled
due to a taint, I guess we'd have to look into events?
(In reply to Jakub Hrozek from comment #2)
> (In reply to Juan Antonio Osorio from comment #1)
> > I'm actually unsure about the expected result. If it wasn't possible to scan
> > a node due to taints, seems to me like we should surface an error. Any
> > opinions on this?
>
> I guess so, the only q. would be how to detect that the pod was not
> scheduled due to a taint, I guess we'd have to look into events?

Gotta research that. Not sure if that info is available as part of the pod
status, or if we gotta listen for events.

Regardless, I think the resulting state should be an error and an indication
that not all the nodes could be checked. Do you agree?
(In reply to Juan Antonio Osorio from comment #3)
> (In reply to Jakub Hrozek from comment #2)
> > (In reply to Juan Antonio Osorio from comment #1)
> > > I'm actually unsure about the expected result. If it wasn't possible to scan
> > > a node due to taints, seems to me like we should surface an error. Any
> > > opinions on this?
> >
> > I guess so, the only q. would be how to detect that the pod was not
> > scheduled due to a taint, I guess we'd have to look into events?
>
> Gotta research that. Not sure if that info is available as part of the pod
> status, or if we gotta listen for events.
>
> Regardless, I think the resulting state should be an error and an indication
> that not all the nodes could be checked. Do you agree?

Yes, if the scan cannot be reconciled, the end result should be an error.
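As a side note, the scheduling failure does appear in the pod status itself, not only in events: the PodScheduled condition is set to False with reason Unschedulable and a message naming the taint. A quick way to inspect it, using the pending pod name from the reproducer:

$ oc get pod rhcos-scan-ip-10-0-135-150.us-east-2.compute.internal-pod \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")]}'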
Got it, took that into account in https://github.com/openshift/compliance-operator/pull/394
Now, the compliance-operator does not hang on a single node taint, and the scan result reports ERROR in both the compliancesuite and the compliancescan.

Verified on:
4.6.0-0.nightly-2020-09-05-015624

$ oc get pods
NAME                                         READY   STATUS      RESTARTS   AGE
aggregator-pod-worker-scan                   0/1     Completed   0          66s
compliance-operator-869646dd4f-7dzh7        1/1     Running     0          3h24m
ocp4-pp-6786c5f5b-tz6xr                      1/1     Running     0          9m26s
rhcos4-pp-78c8cc9d44-sj2p2                   1/1     Running     0          9m26s
worker-scan-pdhamdhe07-l8gp6-compute-0-pod   0/2     Pending     0          4m57s   <<--- tainted node pod
worker-scan-pdhamdhe07-l8gp6-compute-1-pod   0/2     Completed   0          4m57s
worker-scan-pdhamdhe07-l8gp6-compute-2-pod   0/2     Completed   0          4m57s

Monday 07 September 2020 10:26:45 PM IST

$ oc get compliancescan
NAME          PHASE   RESULT
worker-scan   DONE    ERROR

$ oc get compliancesuite
NAME                     PHASE   RESULT
worker-compliancesuite   DONE    ERROR

$ oc get cm
NAME                                         DATA   AGE
compliance-operator-lock                     0      3h24m
worker-scan-openscap-container-entrypoint    1      5m32s
worker-scan-openscap-env-map                 5      5m32s
worker-scan-openscap-env-map-platform        4      5m32s
worker-scan-pdhamdhe07-l8gp6-compute-0-pod   2      5m31s   <<-- tainted
worker-scan-pdhamdhe07-l8gp6-compute-1-pod   2      3m28s
worker-scan-pdhamdhe07-l8gp6-compute-2-pod   2      111s

The scan reports an error message for the tainted node:

$ oc extract cm/worker-scan-pdhamdhe07-l8gp6-compute-0-pod --confirm
error-msg
exit-code

$ cat error-msg
Couldn't schedule scan pod 'worker-scan-pdhamdhe07-l8gp6-compute-0-pod': 0/6 nodes are available: 1 node(s) had taint {key1: value1}, that the pod didn't tolerate, 5 node(s) didn't match node selector.

$ cat exit-code
unschedulable

The scan was performed on the non-tainted nodes:

$ oc extract cm/worker-scan-pdhamdhe07-l8gp6-compute-1-pod --confirm
exit-code
results

$ cat exit-code
2

$ head results
<?xml version="1.0" encoding="UTF-8"?>
<TestResult xmlns="http://checklists.nist.gov/xccdf/1.2" id="xccdf_org.open-scap_testresult_xccdf_org.ssgproject.content_profile_moderate" start-time="2020-09-07T16:52:04+00:00" end-time="2020-09-07T16:53:13+00:00" version="0.1.52" test-system="cpe:/a:redhat:openscap:1.3.3">
  <benchmark href="/content/ssg-rhcos4-ds.xml" id="xccdf_org.ssgproject.content_benchmark_RHCOS-4"/>
  <title>OSCAP Scan Result</title>
  <profile idref="xccdf_org.ssgproject.content_profile_moderate"/>
  <target>pdhamdhe07-l8gp6-compute-1</target>
  <target-facts>
    <fact name="urn:xccdf:fact:identifier" type="string">chroot:///host</fact>
    <fact name="urn:xccdf:fact:scanner:name" type="string">OpenSCAP</fact>
    <fact name="urn:xccdf:fact:scanner:version" type="string">1.3.3</fact>
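For cleanup after verification (assuming the node name implied by the pending pod name above and the key1=value1:NoSchedule taint from the reproducer), the taint can be removed by re-issuing the taint command with a trailing '-':

$ kubectl taint nodes pdhamdhe07-l8gp6-compute-0 key1=value1:NoSchedule-

Once the node is schedulable again, a subsequent scan run should be able to place a scan pod on it.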
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196