Bug 1862939

Summary: The compliance-operator hangs when a single node is tainted
Product: OpenShift Container Platform
Reporter: xiyuan
Component: Compliance Operator
Assignee: Juan Antonio Osorio <josorior>
Status: CLOSED ERRATA
QA Contact: Prashant Dhamdhere <pdhamdhe>
Severity: medium
Priority: medium
Version: 4.6
CC: dahernan, jhrozek, josorior, mrogers, xiyuan
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-10-27 16:22:34 UTC
Type: Bug

Description xiyuan 2020-08-03 09:26:09 UTC
Description of problem:
The compliance-operator hangs when a single node is tainted.

Version-Release (cluster version):
4.6.0-0.nightly-2020-07-25-091217

How reproducible:
Always

Steps to Reproduce:
1. install compliance operator:
 1.1 clone compliance-operator git repo
 $ git clone https://github.com/openshift/compliance-operator.git
 1.2 Create 'openshift-compliance' namespace
 $ oc create -f compliance-operator/deploy/ns.yaml  
 1.3 Switch to 'openshift-compliance' namespace
 $ oc project openshift-compliance
 1.4 Deploy CustomResourceDefinition.
 $ for f in $(ls -1 compliance-operator/deploy/crds/*crd.yaml); do oc create -f $f; done
 1.5 Deploy compliance-operator.
 $ oc create -f compliance-operator/deploy/

2. add taint for one worker node
$ kubectl taint nodes ip-10-0-135-150.us-east-2.compute.internal key1=value1:NoSchedule
node/ip-10-0-135-150.us-east-2.compute.internal tainted

3. Deploy ComplianceSuite CR with schedule:
$  oc create -f - <<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ComplianceSuite
> metadata:
>   name: example-compliancesuite
> spec:
>   autoApplyRemediations: true
>   schedule: "*/15 * * * *"
>   scans:
>     - name: rhcos-scan
>       profile: xccdf_org.ssgproject.content_profile_moderate
>       content: ssg-rhcos4-ds.xml
>       contentImage: quay.io/complianceascode/ocp4:latest
>       debug: true
>       nodeSelector:
>         node-role.kubernetes.io/worker: ""
> EOF
compliancesuite.compliance.openshift.io/example-compliancesuite created
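
(Not part of the original steps: the operator rollout and the taint added in step 2 can be double-checked, and the test taint removed again after the test, with standard oc/kubectl commands; the node name is the one used above.)

$ oc get pods -n openshift-compliance
$ oc get node ip-10-0-135-150.us-east-2.compute.internal -o jsonpath='{.spec.taints}{"\n"}'
$ kubectl taint nodes ip-10-0-135-150.us-east-2.compute.internal key1=value1:NoSchedule-

The trailing '-' on the last command removes the taint.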


Actual results:

The compliance-operator hangs when a single node is tainted. The scan pod for the tainted node stays in Pending, and no remediations or cron jobs are scheduled.

$ oc get all
NAME                                                            READY   STATUS     RESTARTS   AGE
pod/compliance-operator-869646dd4f-58lnl                        1/1     Running    0          19m
pod/compliance-operator-869646dd4f-kpz7g                        1/1     Running    0          19m
pod/compliance-operator-869646dd4f-nvfpw                        1/1     Running    0          19m
pod/ocp4-pp-dcb8bc5b5-cm5nv                                     1/1     Running    0          19m
pod/rhcos-scan-ip-10-0-135-150.us-east-2.compute.internal-pod   0/2     Pending    0          118s
pod/rhcos-scan-ip-10-0-187-86.us-east-2.compute.internal-pod    1/2     NotReady   0          118s
pod/rhcos-scan-ip-10-0-213-238.us-east-2.compute.internal-pod   2/2     Running    0          118s
pod/rhcos-scan-rs-65bb84784c-x9bdb                              1/1     Running    0          119s
pod/rhcos4-pp-58466496cf-n77r4                                  1/1     Running    0          19m

NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/compliance-operator-metrics   ClusterIP   172.30.134.140   <none>        8383/TCP,8686/TCP   19m
service/rhcos-scan-rs                 ClusterIP   172.30.126.50    <none>        8443/TCP            2m1s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/compliance-operator   3/3     3            3           19m
deployment.apps/ocp4-pp               1/1     1            1           19m
deployment.apps/rhcos-scan-rs         1/1     1            1           2m2s
deployment.apps/rhcos4-pp             1/1     1            1           19m

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/compliance-operator-869646dd4f   3         3         3       20m
replicaset.apps/ocp4-pp-dcb8bc5b5                1         1         1       19m
replicaset.apps/rhcos-scan-rs-65bb84784c         1         1         1       2m3s
replicaset.apps/rhcos4-pp-58466496cf             1         1         1       19m

$ oc describe pod/rhcos-scan-ip-10-0-135-150.us-east-2.compute.internal-pod
...
Events:
  Type     Reason            Age        From  Message
  ----     ------            ----       ----  -------
  Warning  FailedScheduling  <unknown>        0/6 nodes are available: 1 node(s) had taint {key1: value1}, that the pod didn't tolerate, 5 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>        0/6 nodes are available: 1 node(s) had taint {key1: value1}, that the pod didn't tolerate, 5 node(s) didn't match node selector.
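
For reference, the same scheduling failure is also visible in the pod status itself: the scheduler sets the PodScheduled condition to False with reason Unschedulable and the message shown above, so it can be read straight from the pod object rather than from events, e.g.:

$ oc get pod/rhcos-scan-ip-10-0-135-150.us-east-2.compute.internal-pod \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].reason}{"\n"}'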

$ oc get compliancesuite
NAME                      PHASE     RESULT
example-compliancesuite   RUNNING   NOT-AVAILABLE
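
The suite never progresses past RUNNING in this state; leaving a watch on the object (standard -w flag; not part of the original report) confirms the hang:

$ oc get compliancesuite example-compliancesuite -w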

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
00-worker                                          057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
01-master-container-runtime                        057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
01-master-kubelet                                  057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
01-worker-container-runtime                        057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
01-worker-kubelet                                  057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
99-master-generated-registries                     057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
99-master-ssh                                                                                 3.1.0             167m
99-worker-generated-registries                     057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
99-worker-ssh                                                                                 3.1.0             167m
rendered-master-39aea8e699d444a3dae231b65fcb8d65   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             144m
rendered-master-95b05172633be01453ce238cfd32c4e0   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m
rendered-worker-56723acf0aa81071a5040a098638632b   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             144m
rendered-worker-64411a4005e12b1abf365c37ec9665db   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             159m


Expected results:
The compliance-operator should not hang when a single node is tainted. The remediations and cron jobs should still be scheduled.

Comment 1 Juan Antonio Osorio 2020-08-05 06:20:11 UTC
I'm actually unsure about the expected result. If it wasn't possible to scan a node due to taints, seems to me like we should surface an error. Any opinions on this?

Comment 2 Jakub Hrozek 2020-08-05 07:26:37 UTC
(In reply to Juan Antonio Osorio from comment #1)
> I'm actually unsure about the expected result. If it wasn't possible to scan
> a node due to taints, seems to me like we should surface an error. Any
> opinions on this?

I guess so, the only q. would be how to detect that the pod was not scheduled due to a taint, I guess we'd have to look into events?

Comment 3 Juan Antonio Osorio 2020-08-05 08:09:51 UTC
(In reply to Jakub Hrozek from comment #2)
> (In reply to Juan Antonio Osorio from comment #1)
> > I'm actually unsure about the expected result. If it wasn't possible to scan
> > a node due to taints, seems to me like we should surface an error. Any
> > opinions on this?
> 
> I guess so, the only q. would be how to detect that the pod was not
> scheduled due to a taint, I guess we'd have to look into events?

Gotta research that. Not sure if that info is available as part of the pod status, or if we gotta listen for events.

Regardless, I think the resulting state should be an error and an indication that not all the nodes could be checked. Do you agree?

Comment 4 Jakub Hrozek 2020-08-05 09:45:46 UTC
(In reply to Juan Antonio Osorio from comment #3)
> (In reply to Jakub Hrozek from comment #2)
> > (In reply to Juan Antonio Osorio from comment #1)
> > > I'm actually unsure about the expected result. If it wasn't possible to scan
> > > a node due to taints, seems to me like we should surface an error. Any
> > > opinions on this?
> > 
> > I guess so, the only q. would be how to detect that the pod was not
> > scheduled due to a taint, I guess we'd have to look into events?
> 
> Gotta research that. Not sure if that info is available as part of the pod
> status, or if we gotta listen for events.
> 
> Regardless, I think the resulting state should be an error and an indication
> that not all the nodes could be checked. Do you agree?

Yes, if the scan cannot be reconciled, the end result should be an error.

Comment 5 Juan Antonio Osorio 2020-08-12 10:02:18 UTC
Got it, took that into account in https://github.com/openshift/compliance-operator/pull/394

Comment 8 Prashant Dhamdhere 2020-09-07 17:23:09 UTC
Now, the compliance-operator does not hang on a single node taint, and the scan result
reports ERROR in both the compliancesuite and the compliancescan.

Verified on: 4.6.0-0.nightly-2020-09-05-015624


$ oc get pods
NAME                                         READY   STATUS      RESTARTS   AGE
aggregator-pod-worker-scan                   0/1     Completed   0          66s
compliance-operator-869646dd4f-7dzh7         1/1     Running     0          3h24m
ocp4-pp-6786c5f5b-tz6xr                      1/1     Running     0          9m26s
rhcos4-pp-78c8cc9d44-sj2p2                   1/1     Running     0          9m26s
worker-scan-pdhamdhe07-l8gp6-compute-0-pod   0/2     Pending     0          4m57s  <<--- tainted node pod
worker-scan-pdhamdhe07-l8gp6-compute-1-pod   0/2     Completed   0          4m57s
worker-scan-pdhamdhe07-l8gp6-compute-2-pod   0/2     Completed   0          4m57s

$ oc get compliancescan
NAME          PHASE   RESULT
worker-scan   DONE    ERROR

$ oc get compliancesuite
NAME                     PHASE   RESULT
worker-compliancesuite   DONE    ERROR
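
For completeness (not part of the original verification output), the same error can also be read from the scan and suite objects themselves; oc describe prints their full status, including the result and any error message the operator recorded:

$ oc describe compliancescan worker-scan
$ oc describe compliancesuite worker-compliancesuite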

$ oc get cm
NAME                                         DATA   AGE
compliance-operator-lock                     0      3h24m
worker-scan-openscap-container-entrypoint    1      5m32s
worker-scan-openscap-env-map                 5      5m32s
worker-scan-openscap-env-map-platform        4      5m32s
worker-scan-pdhamdhe07-l8gp6-compute-0-pod   2      5m31s  <<-- tainted
worker-scan-pdhamdhe07-l8gp6-compute-1-pod   2      3m28s
worker-scan-pdhamdhe07-l8gp6-compute-2-pod   2      111s

The scan reports an error message for the tainted node.
 
$ oc extract cm/worker-scan-pdhamdhe07-l8gp6-compute-0-pod --confirm
error-msg
exit-code

$ cat error-msg 
Couldn't schedule scan pod 'worker-scan-pdhamdhe07-l8gp6-compute-0-pod': 0/6 nodes are available: 1 node(s) had taint {key1: value1}, that the pod didn't tolerate, 5 node(s) didn't match node selector.

$ cat exit-code 
unschedulable
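
A small helper loop (not part of the original verification; it only assumes the per-node ConfigMap names shown above) to dump the contents of all per-node result ConfigMaps in one pass:

$ for cm in $(oc get cm -o name | grep 'worker-scan-.*-pod$'); do echo "== $cm"; oc extract "$cm" --to=-; done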

The scan was performed on the non-tainted nodes.

$ oc extract cm/worker-scan-pdhamdhe07-l8gp6-compute-1-pod --confirm
exit-code
results

$ cat exit-code 
2

$ head results 
<?xml version="1.0" encoding="UTF-8"?>
<TestResult xmlns="http://checklists.nist.gov/xccdf/1.2" id="xccdf_org.open-scap_testresult_xccdf_org.ssgproject.content_profile_moderate" start-time="2020-09-07T16:52:04+00:00" end-time="2020-09-07T16:53:13+00:00" version="0.1.52" test-system="cpe:/a:redhat:openscap:1.3.3">
          <benchmark href="/content/ssg-rhcos4-ds.xml" id="xccdf_org.ssgproject.content_benchmark_RHCOS-4"/>
          <title>OSCAP Scan Result</title>
          <profile idref="xccdf_org.ssgproject.content_profile_moderate"/>
          <target>pdhamdhe07-l8gp6-compute-1</target>
          <target-facts>
            <fact name="urn:xccdf:fact:identifier" type="string">chroot:///host</fact>
            <fact name="urn:xccdf:fact:scanner:name" type="string">OpenSCAP</fact>
            <fact name="urn:xccdf:fact:scanner:version" type="string">1.3.3</fact>

Comment 10 errata-xmlrpc 2020-10-27 16:22:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196