1970959 – Node upgrade stuck due to not writing through dangling symlink '/etc/machine-config-daemon/orig/etc/issue.mcdorig'

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Compliance Operator
Sub Component:
Version:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Jakub Hrozek
QA Contact:	Prashant Dhamdhere
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-11 14:41 UTC by Neil Girard
Modified:	2023-09-15 01:09 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2020003
Environment:
Last Closed:	2022-02-28 17:21:01 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2681	0	None	open	Bug 1970959: Node upgrade stuck due to not writing through dangling symlink '/etc/machine-config-daemon/orig/etc/issue.m...	2021-07-19 15:54:44 UTC
Red Hat Knowledge Base (Solution)	6128271	0	None	None	None	2021-06-21 18:43:53 UTC

Description Neil Girard 2021-06-11 14:41:14 UTC

Description of problem:

If an upgrade from one cluster version to another hits an issue, such as a local file that was modified and no longer matches the MC, the MCD breaks the link at /etc/machine-config-daemon/orig/etc/issue.mcdorig. The upgrade then stays stuck, even after running "touch /run/machine-config-daemon-force".

Version-Release number of selected component (if applicable):
4.6.x

How reproducible:
Always

Steps to Reproduce:
1. Create an MC that overrides /etc/motd and /etc/issue.
2. Let the MC roll out motd and issue.
3. Modify a controlled file. In my case I added a newline to /etc/containers/registries.conf.
4. Perform a cluster upgrade (I upgraded from 4.6.31 to 4.6.32).
5. When the MCD attempts the upgrade, it reports an error about registries.conf. Run "touch /run/machine-config-daemon-force" to discard the manual changes. NOTE: Even before the touch, /etc/machine-config-daemon/orig/etc/issue.mcdorig already shows as a broken link.
6. The MCD continues but then marks the upgrade as failed due to the broken link. Depending on the node, removing the broken link may not be enough for the node to progress; you may need to uncordon the node to trigger it to reapply the changes.
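
To confirm the state described in the NOTE in step 5, a small check along the following lines could be run against the backup path. This is only a sketch: the helper name is hypothetical, and the demonstration uses a throwaway directory rather than a real node.

```python
import os
import tempfile


def is_dangling_symlink(path: str) -> bool:
    """Return True when path is a symlink whose target does not exist."""
    # os.path.islink inspects the link itself; os.path.exists follows it.
    return os.path.islink(path) and not os.path.exists(path)


if __name__ == "__main__":
    # Demonstrate on a temporary directory, not on a real node path.
    with tempfile.TemporaryDirectory() as root:
        link = os.path.join(root, "issue.mcdorig")
        os.symlink("../usr/lib/issue", link)  # the target is never created
        print(is_dangling_symlink(link))  # prints True
```

On a node, the same function would be called with /etc/machine-config-daemon/orig/etc/issue.mcdorig as the argument.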

Actual results:
Node fails with a broken link at /etc/machine-config-daemon/orig/etc/issue.mcdorig

Expected results:
Node should not have broken /etc/machine-config-daemon/orig/etc/issue.mcdorig link and should continue upgrade.

Additional info:
I was SSH'd into node to watch MCD crio logs. I am not sure if this is required for this to happen. I doubt it but wanted to call it out just in case.

The linked case also has a must-gather that contains an MC for 60-motd-master with issue and motd override values. I used those. The values work on initial application, so I believe there is nothing wrong with these files.

Comment 11 Neil Girard 2021-10-26 11:55:06 UTC

Hello @jkyros, I have not seen any progress on the pull request in quite some time. Are there any updates on this fix?

Comment 12 John Kyros 2021-11-02 16:51:51 UTC

Yes, sorry. We had paused this work as we were thinking we might be able to find a...more complete way to fix this, but time has not permitted, so we're going to try to get this in for 4.10.

Comment 13 John Kyros 2021-11-12 01:58:34 UTC

I'd appreciate it if someone from the compliance operator team could have a look -- the MCO is fixing the problem with "backing up a symlink" (https://bugzilla.redhat.com/show_bug.cgi?id=2020003) that causes the degradation, but the compliance operator is encouraging users to change /etc/issue on RHCOS via the 'banner-etc-issue' rule, which is maybe less than ideal. 

1.) On an RHCOS host /etc/issue is a symlink to /usr/lib/issue, just like on RHEL, but on RHCOS /usr/lib is read-only
sh-4.4# ls -l /etc/issue
lrwxrwxrwx. 1 root root 16 Nov 10 19:14 /etc/issue -> ../usr/lib/issue

2.) The MCO currently does not support the symlink portion of the ignition spec (so you can't write your own symlink), so right now if a machine config is specified that modifies /etc/issue, the MCO overwrites the /etc/issue symlink with a file, which is how this bug happens.

3.) The preferred method to modify the issue message would be to write files to /etc/issue.d/ since that is user-writeable.

I'm not an expert on the compliance operator or where the rules come from, but can the rules be updated to encourage writing to /etc/issue.d/ instead or is that not a possibility? Thanks!
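
For illustration, here is a minimal sketch of one way "backing up a symlink" can leave a dangling link. It assumes the backup copies the link itself (keeping the relative target string) rather than the file it points to; that assumption, the helper names, and the temporary directory layout are hypothetical and not taken from the MCO code.

```python
import os
import tempfile


def backup_preserving_symlink(src: str, dest: str) -> None:
    """Recreate the symlink at dest with the same target string (hypothetical helper)."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # Assumption: the backup keeps the link as a link instead of following it.
    os.symlink(os.readlink(src), dest)


def demo(root: str) -> tuple:
    # Build a miniature layout under root that mirrors the paths in this bug.
    os.makedirs(os.path.join(root, "usr/lib"))
    os.makedirs(os.path.join(root, "etc"))
    with open(os.path.join(root, "usr/lib/issue"), "w") as f:
        f.write("placeholder banner\n")
    issue = os.path.join(root, "etc/issue")
    os.symlink("../usr/lib/issue", issue)

    backup = os.path.join(
        root, "etc/machine-config-daemon/orig/etc/issue.mcdorig"
    )
    backup_preserving_symlink(issue, backup)

    # The relative target now resolves against the backup's directory,
    # where no usr/lib/issue exists, so the backup link dangles.
    return os.path.exists(issue), os.path.islink(backup), os.path.exists(backup)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(demo(root))  # prints (True, True, False)
```

The original link resolves, while the copy is still a symlink but no longer points at an existing file.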

Comment 14 Jakub Hrozek 2021-11-26 10:08:38 UTC

First, I'm sorry that this bugzilla went unanswered for 2 weeks.

But to actually reply to it, I'm looking at the compliance rules definitions:
https://github.com/ComplianceAsCode/content/blob/master/linux_os/guide/system/accounts/accounts-banners/banner_etc_issue/rule.yml
and the rule text already says (except for the first sentence, which is confusing and which I'll fix) not to edit /etc/issue directly on RHCOS, but to use /etc/issue.d instead.

At the same time, there is no automated remediation (because we can't presume the text), so adding the remediation is left as an exercise for the admin, as you can see the rule.yml even suggests a MachineConfig.
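
As a rough sketch of what such an admin-written MachineConfig could look like when it targets /etc/issue.d instead of /etc/issue, the snippet below builds one as a Python dictionary and prints it as JSON. The object name, the file name 50-banner.issue, the Ignition version, and the banner text are all assumptions chosen for illustration; they are not taken from rule.yml or from the customer case.

```python
import base64
import json


def issue_d_machine_config(role: str, banner: str) -> dict:
    """Build a MachineConfig that drops a banner file into /etc/issue.d (sketch)."""
    encoded = base64.b64encode(banner.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            # Hypothetical name; pick one that fits the cluster's conventions.
            "name": f"99-{role}-issue-banner",
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {
            "config": {
                # Assumed Ignition spec version; adjust to what the cluster supports.
                "ignition": {"version": "3.1.0"},
                "storage": {
                    "files": [
                        {
                            # Hypothetical file name under the writable drop-in directory.
                            "path": "/etc/issue.d/50-banner.issue",
                            "mode": 0o644,  # serialized as decimal 420 in JSON
                            "overwrite": True,
                            "contents": {
                                "source": "data:text/plain;charset=utf-8;base64,"
                                + encoded
                            },
                        }
                    ]
                },
            }
        },
    }


if __name__ == "__main__":
    mc = issue_d_machine_config("worker", "Authorized users only.\n")
    print(json.dumps(mc, indent=2))
```

Because the file lives under /etc/issue.d, the /etc/issue symlink itself is left untouched.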

Is it possible that the customer applied a custom MachineConfig that touched /etc/issue? To check that, it would be nice to see the output of "oc get complianceremediations" and "oc get mc -lcompliance.openshift.io/scan-name" to look for remediations or MachineConfigs created from remediations.

Or is it possible that they use a very old compliance operator version and/or compliance content version? What is it that they're actually running?

Comment 15 Jakub Hrozek 2021-11-26 10:15:08 UTC

btw rule amend PR: https://github.com/ComplianceAsCode/content/pull/7921

Comment 21 Jakub Hrozek 2022-02-28 17:21:01 UTC

Closing due to insufficient data. Please see comments 14 and 16 in case you decide to reopen this bug.

Comment 22 Red Hat Bugzilla 2023-09-15 01:09:43 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days