Bug 1988259 - Rules missing from compliance operator after upgrade from 4.6.30 to 4.6.34
Summary: Rules missing from compliance operator after upgrade from 4.6.30 to 4.6.34
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Compliance Operator
Version: 4.6.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Jakub Hrozek
QA Contact: Prashant Dhamdhere
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-30 07:53 UTC by Sayali Bhavsar
Modified: 2021-11-15 09:38 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When an updated profile is parsed, each object parsed from it is annotated with the new profile version. Any object still missing this annotation after the whole profile has been parsed is considered removed from the profile and is deleted. Because the profileparser ignored errors, rules might not have been annotated and would then be considered removed.
Consequence: If a transient error occurred during a cluster upgrade, the object being processed at that moment, such as a compliance rule, was considered removed from the profile and deleted.
Fix: The profileparser now propagates errors instead of ignoring them, which restarts the whole profile parsing operation so that the content is eventually processed correctly.
Result: No rules, variables, or profiles are removed during a Compliance Operator upgrade, even when transient errors occur, for example when connecting to the API server fails.
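The annotate-then-prune behaviour described in the Doc Text can be modelled in a few lines. This is only an illustrative sketch, not the operator's code: `reconcile`, `flaky_apply`, `TransientError`, and the plain version string used as an "annotation" are all hypothetical names chosen for the example.

```python
# Illustrative model only: names such as `reconcile`, `flaky_apply`, and the
# version marker are hypothetical, not the compliance-operator API.

class TransientError(Exception):
    """Stands in for a failed API call, e.g. 'connection refused'."""


def reconcile(store, parsed_names, new_version, apply_object, propagate_errors):
    """Annotate every parsed object, then delete objects left unannotated."""
    for name in parsed_names:
        try:
            apply_object(name)
            store[name] = new_version  # mark object as part of the new profile
        except TransientError:
            if propagate_errors:
                raise  # abort before pruning; the caller restarts parsing
            # Old behaviour: error is swallowed, object keeps the stale version.
    # Prune step: anything without the new version marker counts as removed.
    for name in [n for n, v in store.items() if v != new_version]:
        del store[name]
    return store


def flaky_apply(failing):
    """Build an apply function that fails once for each name in `failing`."""
    remaining = set(failing)

    def apply_object(name):
        if name in remaining:
            remaining.discard(name)
            raise TransientError(name)

    return apply_object


rules = ["rule-a", "rule-b", "rule-c"]

# Errors ignored: rule-b is never annotated and gets pruned.
ignored = reconcile({r: "v1" for r in rules}, rules, "v2",
                    flaky_apply({"rule-b"}), propagate_errors=False)

# Errors propagated: the whole pass is retried until it completes.
store = {r: "v1" for r in rules}
apply_object = flaky_apply({"rule-b"})
while True:
    try:
        reconcile(store, rules, "v2", apply_object, propagate_errors=True)
        break
    except TransientError:
        continue  # restart the whole parsing operation

print(sorted(ignored))  # rule-b is missing
print(sorted(store))    # all three rules survive
```

In the first run the swallowed error leaves one object with the stale marker, so the prune step deletes it. In the second run the error aborts the pass before pruning, and the retried pass annotates everything.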
Clone Of:
Environment:
Last Closed: 2021-11-10 07:37:22 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift compliance-operator pull 692 0 None None None 2021-09-08 15:07:28 UTC
Red Hat Product Errata RHBA-2021:4530 0 None None None 2021-11-10 07:37:28 UTC

Description Sayali Bhavsar 2021-07-30 07:53:36 UTC
Description of problem:
After RHOCP upgrade from v4.6.30 to v4.6.34, 2 compliance operator `rules` were deleted by `profile parser`


Version-Release number of selected component (if applicable):
RHOCP 4.6.34
compliance operator 0.1.35


How reproducible:
Always


Steps to Reproduce:
1. Install compliance operator v0.1.35 on RHOCP 4.6.30 and upgrade to v4.6.34

$  oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.30    True        False         13m     Cluster version is 4.6.30

$ oc get rules.compliance -n openshift-compliance | wc -l
750

$  oc get profilebundles.compliance
NAME     CONTENTIMAGE                                                                                                                               CONTENTFILE         STATUS
ocp4     registry.redhat.io/compliance/openshift-compliance-content-rhel8@sha256:5058d9130943ddf3d8a0e64dae5f81bf3fd612767d708e4f52e6ff19d6accf8f   ssg-ocp4-ds.xml     VALID
rhcos4   registry.redhat.io/compliance/openshift-compliance-content-rhel8@sha256:5058d9130943ddf3d8a0e64dae5f81bf3fd612767d708e4f52e6ff19d6accf8f   ssg-rhcos4-ds.xml   VALID

2. After upgrade check the rules

$  oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.34    True        False         13m     Cluster version is 4.6.34

$ oc get rules.compliance -n openshift-compliance | wc -l
748

$ oc get profilebundles.compliance
NAME     CONTENTIMAGE                                                                                                                               CONTENTFILE         STATUS
ocp4     registry.redhat.io/compliance/openshift-compliance-content-rhel8@sha256:5058d9130943ddf3d8a0e64dae5f81bf3fd612767d708e4f52e6ff19d6accf8f   ssg-ocp4-ds.xml     INVALID
rhcos4   registry.redhat.io/compliance/openshift-compliance-content-rhel8@sha256:5058d9130943ddf3d8a0e64dae5f81bf3fd612767d708e4f52e6ff19d6accf8f   ssg-rhcos4-ds.xml   VALID

$ oc describe profilebundles.compliance ocp4
Name:         ocp4
Namespace:    openshift-compliance
Labels:       <none>
Annotations:  <none>
API Version:  compliance.openshift.io/v1alpha1
Kind:         ProfileBundle
Metadata:
  Creation Timestamp:  2021-07-30T04:36:26Z
  Finalizers:
    profilebundle.finalizers.compliance.openshift.io
  Generation:  1
  Managed Fields:
    API Version:  compliance.openshift.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"profilebundle.finalizers.compliance.openshift.io":
      f:spec:
        .:
        f:contentFile:
        f:contentImage:
      f:status:
        .:
        f:conditions:
        f:dataStreamStatus:
        f:errorMessage:
    Manager:         compliance-operator
    Operation:       Update
    Time:            2021-07-30T05:58:02Z
  Resource Version:  292580
  Self Link:         /apis/compliance.openshift.io/v1alpha1/namespaces/openshift-compliance/profilebundles/ocp4
  UID:               e7674866-a3f9-4f97-af70-f970cce326c6
Spec:
  Content File:   ssg-ocp4-ds.xml
  Content Image:  registry.redhat.io/compliance/openshift-compliance-content-rhel8@sha256:5058d9130943ddf3d8a0e64dae5f81bf3fd612767d708e4f52e6ff19d6accf8f
Status:
  Conditions:
    Last Transition Time:  2021-07-30T05:58:02Z
    Message:               Couldn't parse profile bundle
    Reason:                Invalid
    Status:                False
    Type:                  Ready
  Data Stream Status:      INVALID
  Error Message:           Operation cannot be fulfilled on profiles.compliance.openshift.io "ocp4-cis-node": the object has been modified; please apply your changes to the latest version and try again
Events:                    <none>

$ oc logs rhcos4-openshift-compliance-pp-xxxx -c profileparser
~~~
{"level":"info","ts":1627626752.1480162,"logger":"profileparser","msg":"Deleting object no longer used by the current profileBundle","kind":"Rule","name":"rhcos4-auditd-data-disk-full-action"}
{"level":"info","ts":1627626752.358363,"logger":"profileparser","msg":"Deleting object no longer used by the current profileBundle","kind":"Rule","name":"rhcos4-network-nmcli-permissions"}
~~~

[quicklab@upi-0 ~]$ oc get rules.compliance -n openshift-compliance | grep rhcos4-network-nmcli-permissions
[quicklab@upi-0 ~]$ oc get rules.compliance -n openshift-compliance | grep rhcos4-auditd-data-disk-full-action
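Comparing `wc -l` counts shows that rules went missing but not which ones (and if a header line is printed, it is included in the count). A minimal sketch of diffing two saved listings follows. It assumes the first line of each listing is a header row and the first column is the object name; the sample listings are placeholders with made-up AGE values, not output captured from the affected cluster.

```python
def rule_names(listing):
    """Return the set of names from `oc get`-style table output.

    Assumes the first line is a header row and the first column is the name.
    """
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    return {ln.split()[0] for ln in lines[1:]}


def missing_after_upgrade(before, after):
    """Names present before the upgrade but absent afterwards, sorted."""
    return sorted(rule_names(before) - rule_names(after))


# Placeholder listings: the AGE values and the shortened rule set are made up.
before = """NAME   AGE
rhcos4-auditd-data-disk-full-action   13m
rhcos4-network-nmcli-permissions   13m
ocp4-api-server-client-ca   13m
"""
after = """NAME   AGE
ocp4-api-server-client-ca   90m
"""

print(missing_after_upgrade(before, after))
```

Saving the listing to a file before the upgrade and again afterwards would give the two inputs.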


Actual results:
- The number of rules belonging to a profile changes after an upgrade.
- rules are deleted


Expected results:
- The number of rules belonging to a profile does not change after an upgrade 
- rules are not deleted 

Additional info:

Comment 1 Jakub Hrozek 2021-07-30 08:20:09 UTC
Thank you for the detailed bug report.

(In reply to Sayali Bhavsar from comment #0)
> [quicklab@upi-0 ~]$ oc get rules.compliance -n openshift-compliance | grep
> rhcos4-network-nmcli-permissions

I suspect that this rule got removed because at one point we removed the NCP profile, which was shipped by accident and the NCP profile was the only one using this rule.

> [quicklab@upi-0 ~]$ oc get rules.compliance -n openshift-compliance | grep
> rhcos4-auditd-data-disk-full-action

Here I'm not sure. I do see the rule being used in the rhcos4-moderate profile, so I need to check things locally.

However, as a general note, the rhcos4-moderate profile, which is the only OCP-related profile using this rule, is not production-ready yet. Also...

> 
> 
> Actual results:
> - The number of rules belonging to a profile change after an upgrade.
> - rules are deleted
> 
> 
> Expected results:
> - The number of rules belonging to a profile does not change after an
> upgrade 
> - rules are not deleted

...this expectation doesn't have to be necessarily true. Rules can become obsolete and get removed. I tend to agree that it should be documented, though, to avoid surprises.

Comment 5 Jakub Hrozek 2021-07-30 10:12:16 UTC
Here seems to be another issue:

Status:
  Conditions:
    Last Transition Time:  2021-07-30T05:58:02Z
    Message:               Couldn't parse profile bundle
    Reason:                Invalid
    Status:                False
    Type:                  Ready
  Data Stream Status:      INVALID
  Error Message:           Operation cannot be fulfilled on profiles.compliance.openshift.io "ocp4-cis-node": the object has been modified; please apply your changes to the latest version and try again
Events:                    <none>

We probably shouldn't mark the profile bundle as invalid, but just retry.
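One generic way to "just retry" on this kind of conflict is to re-read the object and re-apply the change. The sketch below illustrates that pattern only. It is not the fix that was eventually merged, and `ConflictError`, `update_with_retry`, and the in-memory store are hypothetical stand-ins rather than real API calls.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for the 'object has been modified' API error."""


def update_with_retry(fetch, mutate, save, attempts=5):
    """Re-read the object and re-apply the change whenever the save conflicts."""
    for _ in range(attempts):
        obj = fetch()          # always start from the latest stored version
        mutate(obj)
        try:
            return save(obj)
        except ConflictError:
            continue           # someone else wrote first; try again
    raise ConflictError(f"still conflicting after {attempts} attempts")


# Tiny in-memory store that rejects writes based on a stale resourceVersion.
stored = {"resourceVersion": 1, "annotations": {}}
concurrent_writes = [True]  # simulate one competing writer


def fetch():
    return {"resourceVersion": stored["resourceVersion"],
            "annotations": dict(stored["annotations"])}


def save(obj):
    if concurrent_writes and concurrent_writes.pop():
        stored["resourceVersion"] += 1  # competing update lands first
    if obj["resourceVersion"] != stored["resourceVersion"]:
        raise ConflictError("the object has been modified")
    stored.update(annotations=obj["annotations"],
                  resourceVersion=obj["resourceVersion"] + 1)
    return stored


update_with_retry(fetch, lambda o: o["annotations"].update(seen="v2"), save)
print(stored)
```

The first save fails because the stored version moved on. The second attempt starts from the fresh copy and succeeds, so the bundle never needs to be marked invalid for a conflict alone.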

Comment 6 Sayali Bhavsar 2021-07-30 10:51:04 UTC
(In reply to Jakub Hrozek from comment #5)
> Here seems to be another issue:
> 
> Status:
>   Conditions:
>     Last Transition Time:  2021-07-30T05:58:02Z
>     Message:               Couldn't parse profile bundle
>     Reason:                Invalid
>     Status:                False
>     Type:                  Ready
>   Data Stream Status:      INVALID
>   Error Message:           Operation cannot be fulfilled on
> profiles.compliance.openshift.io "ocp4-cis-node": the object has been
> modified; please apply your changes to the latest version and try again
> Events:                    <none>
> 
> We shouldn't probably mark the profile bundle as invalid, but just retry..

No manual changes were made to the profile bundle at any point: not before, during, or after the upgrade. After the upgrade it became invalid; I'm not sure why.

Comment 7 Jakub Hrozek 2021-07-30 11:06:57 UTC
btw just to set the severity right: the fact that the Rule object disappeared is very annoying but does not mean that the check itself is gone, especially if the whole datastream and not a subset (via tailoring) is used. Looking at the CO codebase, the only place where the Rule objects are actually used is when a tailoring profile is created. Existing tailored profiles wouldn't be affected because the configmap they produce already contains an XCCDF reference; "only" the creation of new tailored profiles would be.

So there seems to be a bug, but it's not like we're suddenly not performing that check.

Comment 8 Jakub Hrozek 2021-07-30 11:26:11 UTC
Sayali was kind enough to have reproduced the bug. In his environment, I saw this in the profileparser logs:

{"level":"info","ts":1627626705.0728796,"logger":"profileparser","msg":"Creating object","kind":"Rule","key":{"namespace":"openshift-compliance","name":"rhcos4-auditd-data-disk-full-action"}}
{"level":"error","ts":1627626705.2833045,"logger":"profileparser","msg":"couldn't execute action for rule","error":"Get \"https://172.30.0.1:443/apis/compliance.openshift.io/v1alpha1/namespaces/openshift-compliance/rules/rhcos4-auditd-data-disk-full-action\": dial tcp 172.30.0.1:443: connect: connection refused"}

--> here

{"level":"info","ts":1627626705.2833915,"logger":"profileparser","msg":"Found rule","id":"xccdf_org.ssgproject.content_rule_network_nmcli_permissions"}
{"level":"info","ts":1627626705.3059528,"logger":"profileparser","msg":"Creating object","kind":"Rule","key":{"namespace":"openshift-compliance","name":"rhcos4-network-nmcli-permissions"}}
{"level":"error","ts":1627626705.318621,"logger":"profileparser","msg":"couldn't execute action for rule","error":"Get \"https://172.30.0.1:443/apis/compliance.openshift.io/v1alpha1/namespaces/openshift-compliance/rules/rhcos4-network-nmcli-permissions\": dial tcp 172.30.0.1:443: connect: connection refused"}

--> and here

{"level":"info","ts":1627626705.3187175,"logger":"profileparser","msg":"Found rule","id":"xccdf_org.ssgproject.content_rule_sysctl_net_ipv4_conf_default_secure_redirects"}

So it appears we're not retrying errors in the profileparser at all, which sort of makes sense since it's not a controller loop. But it also means that any error leaves the affected rules without the new digest annotation, and they are subsequently removed.
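The link between the failed requests and the later deletions can be checked mechanically. The sketch below parses profileparser log lines copied from this report and lists the rules that hit an error and were later deleted. It assumes both excerpts come from the same parser run and that the rule name can be read from the `/rules/<name>` part of the failing URL. The helper name `failed_then_deleted` is made up for the example.

```python
import json
import re

# Log lines copied from this report (comment 8 and the description).
LOG = r"""
{"level":"error","ts":1627626705.2833045,"logger":"profileparser","msg":"couldn't execute action for rule","error":"Get \"https://172.30.0.1:443/apis/compliance.openshift.io/v1alpha1/namespaces/openshift-compliance/rules/rhcos4-auditd-data-disk-full-action\": dial tcp 172.30.0.1:443: connect: connection refused"}
{"level":"info","ts":1627626705.2833915,"logger":"profileparser","msg":"Found rule","id":"xccdf_org.ssgproject.content_rule_network_nmcli_permissions"}
{"level":"error","ts":1627626705.318621,"logger":"profileparser","msg":"couldn't execute action for rule","error":"Get \"https://172.30.0.1:443/apis/compliance.openshift.io/v1alpha1/namespaces/openshift-compliance/rules/rhcos4-network-nmcli-permissions\": dial tcp 172.30.0.1:443: connect: connection refused"}
{"level":"info","ts":1627626752.1480162,"logger":"profileparser","msg":"Deleting object no longer used by the current profileBundle","kind":"Rule","name":"rhcos4-auditd-data-disk-full-action"}
{"level":"info","ts":1627626752.358363,"logger":"profileparser","msg":"Deleting object no longer used by the current profileBundle","kind":"Rule","name":"rhcos4-network-nmcli-permissions"}
"""

# Assumption: the rule name is the path segment after "/rules/" in the URL.
RULE_IN_URL = re.compile(r"/rules/([a-z0-9-]+)")


def failed_then_deleted(log_text):
    """Return rules that logged an API error and were also deleted."""
    failed, deleted = set(), set()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("level") == "error":
            match = RULE_IN_URL.search(entry.get("error", ""))
            if match:
                failed.add(match.group(1))
        elif entry.get("msg", "").startswith("Deleting object"):
            deleted.add(entry["name"])
    return sorted(failed & deleted)


print(failed_then_deleted(LOG))
```

For these excerpts, both rules that failed with "connection refused" also appear in the deletion messages about 47 seconds later.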

Comment 9 Jakub Hrozek 2021-07-30 11:31:49 UTC
Oh this one is even better:

{"level":"info","ts":1627624679.4095025,"logger":"profileparser","msg":"Creating object","kind":"Rule","key":{"namespace":"openshift-compliance","name":"ocp4-api-server-client-ca"}}
{"level":"error","ts":1627624682.5629635,"logger":"profileparser","msg":"couldn't execute action","error":"Operation cannot be fulfilled on profiles.compliance.openshift.io \"ocp4-cis-node\": the object has been modified; please apply you

--> here we get a conflict

{"level":"info","ts":1627624682.563104,"logger":"profileparser","msg":"Checking for unused object","kind":"Profile","owner":"ocp4","namespace":"openshift-compliance"}
{"level":"info","ts":1627624682.6403625,"logger":"profileparser","msg":"Deleting object no longer used by the current profileBundle","kind":"Profile","name":"ocp4-cis"}

--> and we just nuke the whole profile, oops

Comment 10 Jakub Hrozek 2021-07-30 13:58:07 UTC
WIP patch: https://github.com/openshift/compliance-operator/pull/675

It's totally untested, but I wanted to throw something out there before I leave for 2 weeks. Maybe someone else can prettify the PR in the meantime or create their own version.

Comment 11 Jakub Hrozek 2021-08-27 14:50:16 UTC
After a bit of discussion, here is a different approach: https://github.com/openshift/compliance-operator/pull/692

Comment 12 Jakub Hrozek 2021-09-08 15:05:16 UTC
Apparently this BZ was not linked properly with the GH PR, but the PR was merged and the fix will be released in 0.1.40 upstream.

Comment 23 errata-xmlrpc 2021-11-10 07:37:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Compliance Operator bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4530

