Bug 1970959

Summary: Node upgrade stuck due to not writing through dangling symlink '/etc/machine-config-daemon/orig/etc/issue.mcdorig'
Product: OpenShift Container Platform Reporter: Neil Girard <ngirard>
Component: Compliance Operator Assignee: Jakub Hrozek <jhrozek>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Prashant Dhamdhere <pdhamdhe>
Severity: high Docs Contact:
Priority: high    
Version: 4.6.z CC: aos-bugs, jkyros, mkrejci, mrogers, xiyuan
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2020003 Environment:
Last Closed: 2022-02-28 17:21:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Neil Girard 2021-06-11 14:41:14 UTC
Description of problem:

If an upgrade from one cluster version to another hits an issue, such as a local file that was modified and no longer matches the MC, the MCD breaks the link at /etc/machine-config-daemon/orig/etc/issue.mcdorig. The upgrade then stays stuck even after running "touch /run/machine-config-daemon-force".


Version-Release number of selected component (if applicable):
4.6.x


How reproducible:
Always


Steps to Reproduce:
1. Create an MC that overrides /etc/motd and /etc/issue
2. Let the MC roll out motd and issue
3. Modify a controlled file.  In my case I added a newline to /etc/containers/registries.conf
4. Perform a cluster upgrade (I upgraded from 4.6.31 to 4.6.32)
5. When the MCD attempts to perform the upgrade, you will hit an error about registries.conf.  Run "touch /run/machine-config-daemon-force" to discard the manual changes.  NOTE: Even before doing the touch, you can see that /etc/machine-config-daemon/orig/etc/issue.mcdorig is marked as broken
6. The MCD will continue but will then mark the upgrade as failed due to the broken link. Also, depending on the node, removing the broken link may not allow the node to progress.  You may need to uncordon the node to retrigger it to reapply the changes.
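
To confirm the broken link noted in step 5, a small check like the following could be run. This is a minimal sketch: the helper name find_dangling_symlinks is hypothetical, and the demo uses a scratch directory rather than a real node. On a node the directory of interest would be /etc/machine-config-daemon/orig.

```python
import os
import tempfile


def find_dangling_symlinks(root):
    """Return sorted paths under root that are symlinks whose target is missing."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself, while exists() follows it,
            # so "is a link but does not exist" means the target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)


if __name__ == "__main__":
    # Demo in a scratch directory with one deliberately broken link.
    with tempfile.TemporaryDirectory() as scratch:
        os.symlink("no-such-target", os.path.join(scratch, "issue.mcdorig"))
        for path in find_dangling_symlinks(scratch):
            print("dangling:", path)
```

The check only reports links. It does not remove anything, since removing the broken link alone may not be enough for the node to progress.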

Actual results:
Node fails with a broken link at /etc/machine-config-daemon/orig/etc/issue.mcdorig

Expected results:
Node should not have broken /etc/machine-config-daemon/orig/etc/issue.mcdorig link and should continue upgrade.

Additional info:
I was SSH'd into node to watch MCD crio logs.  I am not sure if this is required for this to happen.  I doubt it but wanted to call it out just in case.

The linked case also has a must-gather that contains an MC for 60-motd-master that has issue and motd override values.  I used those.  The values work on initial application, so I believe there is nothing wrong in these files.

Comment 11 Neil Girard 2021-10-26 11:55:06 UTC
Hello @jkyros, I have not seen any progress on the pull request in quite some time.  Are there any updates on this fix?

Comment 12 John Kyros 2021-11-02 16:51:51 UTC
Yes, sorry. We had paused this work as we were thinking we might be able to find a...more complete way to fix this, but time has not permitted, so we're going to try to get this in for 4.10.

Comment 13 John Kyros 2021-11-12 01:58:34 UTC
I'd appreciate it if someone from the compliance operator team could have a look -- the MCO is fixing the problem with "backing up a symlink" (https://bugzilla.redhat.com/show_bug.cgi?id=2020003) that causes the degradation, but the compliance operator is encouraging users to change /etc/issue on RHCOS via the 'banner-etc-issue' rule, which is maybe less than ideal. 

1.) On an RHCOS host /etc/issue is a symlink to /usr/lib/issue, just like on RHEL, but on RHCOS /usr/lib is read-only
sh-4.4# ls -l /etc/issue
lrwxrwxrwx. 1 root root 16 Nov 10 19:14 /etc/issue -> ../usr/lib/issue

2.) The MCO currently does not support the symlink portion of the ignition spec (so you can't write your own symlink), so right now, if a machine config is specified that modifies /etc/issue, the MCO overwrites the /etc/issue symlink with a file, which is how this bug happens. 

3.) The preferred method to modify the issue message would be to write files to /etc/issue.d/ since that is user-writeable.
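
One way to picture why "backing up a symlink" can leave a dangling .mcdorig is sketched below. It assumes the backup step recreates the link with its relative target text unchanged. The thread does not show the MCO code, so backup_preserving_link is a hypothetical stand-in, not the MCO's implementation, and the layout is built in a scratch directory.

```python
import os
import tempfile


def backup_preserving_link(src, dest):
    """Hypothetical backup: recreate src at dest as a symlink with the same link text."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.symlink(os.readlink(src), dest)


def is_dangling(path):
    """True if path is a symlink whose target does not exist."""
    return os.path.islink(path) and not os.path.exists(path)


def demo(root):
    """Mimic the RHCOS layout under root; return (issue_dangling, backup_dangling)."""
    os.makedirs(os.path.join(root, "usr", "lib"))
    os.makedirs(os.path.join(root, "etc"))
    with open(os.path.join(root, "usr", "lib", "issue"), "w") as f:
        f.write("placeholder banner\n")

    # Same relative link text as seen in the ls -l output on the node.
    issue = os.path.join(root, "etc", "issue")
    os.symlink("../usr/lib/issue", issue)

    # The relative target is now resolved from the backup directory, where
    # ../usr/lib/issue does not exist.
    backup = os.path.join(
        root, "etc", "machine-config-daemon", "orig", "etc", "issue.mcdorig"
    )
    backup_preserving_link(issue, backup)
    return is_dangling(issue), is_dangling(backup)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(demo(scratch))
```

Under that assumption the original /etc/issue link still resolves, while the copy under the orig directory points at a path that was never created.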

I'm not an expert on the compliance operator or where the rules come from, but can the rules be updated to encourage writing to /etc/issue.d/ instead or is that not a possibility? Thanks!

Comment 14 Jakub Hrozek 2021-11-26 10:08:38 UTC
First, I'm sorry that this bugzilla went unanswered for 2 weeks.

But to actually reply to it, I'm looking at the compliance rules definitions:
https://github.com/ComplianceAsCode/content/blob/master/linux_os/guide/system/accounts/accounts-banners/banner_etc_issue/rule.yml
and the rule text already says (except for the first sentence, which is confusing and which I'll fix) not to edit /etc/issue directly on RHCOS, but to use /etc/issue.d instead.

At the same time, there is no automated remediation (because we can't presume the text), so adding the remediation is left as an exercise for the admin. As you can see, the rule.yml even suggests a MachineConfig.
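
As an illustration of a remediation that stays out of /etc/issue, the sketch below builds a MachineConfig that writes a banner into /etc/issue.d/ instead. The function name, the object name 99-worker-banner, the file name 50-banner.issue, the banner text, and the Ignition version 3.1.0 are all assumptions made for the example. It is not the MachineConfig suggested in rule.yml.

```python
import base64
import json


def issue_d_machineconfig(name, role, banner_text, filename="50-banner.issue"):
    """Build a MachineConfig dict that drops banner_text into /etc/issue.d/."""
    encoded = base64.b64encode(banner_text.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": name,
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {
            "config": {
                # Assumed Ignition spec version; adjust to the cluster release.
                "ignition": {"version": "3.1.0"},
                "storage": {
                    "files": [
                        {
                            # A regular file under /etc/issue.d/, so the
                            # /etc/issue symlink is left untouched.
                            "path": "/etc/issue.d/" + filename,
                            "mode": 420,  # decimal for octal 0644
                            "overwrite": True,
                            "contents": {
                                "source": "data:text/plain;charset=utf-8;base64,"
                                + encoded
                            },
                        }
                    ]
                },
            }
        },
    }


if __name__ == "__main__":
    mc = issue_d_machineconfig("99-worker-banner", "worker", "Authorized use only\n")
    print(json.dumps(mc, indent=2))
```

The printed JSON could be saved to a file and reviewed before being applied to a cluster.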

Is it possible that the customer applied a custom MachineConfig that touched /etc/issue? To check that, it would be nice to see the output of "oc get complianceremediations" and "oc get mc -lcompliance.openshift.io/scan-name" to look for remediations or MachineConfigs created from remediations.

Or is it possible that they use a very old compliance operator version and/or compliance content version? What is it that they're actually running?

Comment 15 Jakub Hrozek 2021-11-26 10:15:08 UTC
btw rule amend PR: https://github.com/ComplianceAsCode/content/pull/7921

Comment 21 Jakub Hrozek 2022-02-28 17:21:01 UTC
Closing due to insufficient data. Please see comments 14 and 16 in case you decide to reopen this bug.

Comment 22 Red Hat Bugzilla 2023-09-15 01:09:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days