Description of problem:
If an upgrade from one cluster version to another hits an issue, such as a local file that was modified and no longer matches the MC, the MCD leaves /etc/machine-config-daemon/orig/etc/issue.mcdorig as a broken link. The upgrade then stays stuck even after running "touch /run/machine-config-daemon-force".

Version-Release number of selected component (if applicable):
4.6.x

How reproducible:
Always

Steps to Reproduce:
1. Create an MC that overrides /etc/motd and /etc/issue.
2. Let the MC roll out motd and issue.
3. Modify a controlled file. In my case I added a newline to /etc/containers/registries.conf.
4. Perform a cluster upgrade (I upgraded from 4.6.31 to 4.6.32).
5. When the MCD attempts the upgrade, it reports a problem with registries.conf. Run "touch /run/machine-config-daemon-force" to discard the manual changes.
   NOTE: Even before the touch, you can see that /etc/machine-config-daemon/orig/etc/issue.mcdorig is marked as broken.
6. The MCD continues but then marks the upgrade as failed because of the broken link. Depending on the node, removing the broken link may not be enough to let the node progress. You may need to uncordon the node to trigger it to reapply the changes.

Actual results:
Node fails with a broken link at /etc/machine-config-daemon/orig/etc/issue.mcdorig.

Expected results:
Node should not have a broken /etc/machine-config-daemon/orig/etc/issue.mcdorig link and should continue the upgrade.

Additional info:
I was SSH'd into the node to watch the MCD crio logs. I am not sure whether this is required for the problem to occur. I doubt it, but wanted to call it out just in case.
The linked case also has a must-gather that contains an MC for 60-motd-master with issue and motd override values. I used those. The values work on initial application, so I believe there is nothing wrong with these files.
Hello @jkyros, I have not seen any progress on the pull request in quite some time. Are there any updates on this fix?
Yes, sorry. We had paused this work as we were thinking we might be able to find a...more complete way to fix this, but time has not permitted, so we're going to try to get this in for 4.10.
I'd appreciate it if someone from the compliance operator team could have a look. The MCO is fixing the problem with "backing up a symlink" (https://bugzilla.redhat.com/show_bug.cgi?id=2020003) that causes the degradation, but the compliance operator is encouraging users to change /etc/issue on RHCOS via the 'banner-etc-issue' rule, which is maybe less than ideal.

1.) On an RHCOS host /etc/issue is a symlink to /usr/lib/issue, just like on RHEL, but on RHCOS /usr/lib is read-only:

    sh-4.4# ls -l /etc/issue
    lrwxrwxrwx. 1 root root 16 Nov 10 19:14 /etc/issue -> ../usr/lib/issue

2.) The MCO currently does not support the symlink portion of the ignition spec (so you can't write your own symlink). As a result, if a machine config is specified that modifies /etc/issue, the MCO overwrites the /etc/issue symlink with a regular file, which is how this bug happens.

3.) The preferred method to modify the issue message would be to write files to /etc/issue.d/, since that is user-writeable.

I'm not an expert on the compliance operator or where the rules come from, but can the rules be updated to encourage writing to /etc/issue.d/ instead, or is that not a possibility? Thanks!
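For anyone trying to picture why a backed-up /etc/issue ends up dangling, here is a minimal Python sketch of one plausible mechanism. It is not the MCO's code: it assumes the backup step moves the link itself (rather than copying the file contents) into /etc/machine-config-daemon/orig/etc/, and it works in a throwaway directory. The function name simulate_symlink_backup is made up for illustration.

```python
import os
import tempfile


def simulate_symlink_backup(root: str) -> dict:
    """Recreate the /etc/issue layout under `root`, then move the symlink
    into a backup directory (ASSUMPTION: the backup preserves the link)."""
    usr_lib = os.path.join(root, "usr", "lib")
    etc = os.path.join(root, "etc")
    backup_dir = os.path.join(etc, "machine-config-daemon", "orig", "etc")
    os.makedirs(usr_lib)
    os.makedirs(backup_dir)  # also creates <root>/etc

    # The real file lives under /usr/lib, as on RHCOS.
    with open(os.path.join(usr_lib, "issue"), "w") as f:
        f.write("\\S\n")

    # /etc/issue -> ../usr/lib/issue (relative target, as in the ls output).
    link = os.path.join(etc, "issue")
    os.symlink("../usr/lib/issue", link)
    resolves_before = os.path.exists(link)  # exists() follows the link

    # Moving the link keeps its relative target, which is now resolved
    # from the backup directory instead of from /etc.
    backup = os.path.join(backup_dir, "issue.mcdorig")
    os.rename(link, backup)

    return {
        "resolves_before": resolves_before,
        "backup_is_symlink": os.path.islink(backup),
        "backup_target": os.readlink(backup),
        "backup_resolves": os.path.exists(backup),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for key, value in simulate_symlink_backup(tmp).items():
            print(f"{key}: {value}")
```

In this simulation the relative target ../usr/lib/issue is resolved from the backup directory after the move, so it points at a path under /etc/machine-config-daemon/orig/ that does not exist. Whether the MCD takes exactly this path is an assumption; the sketch only shows that a relative symlink does not survive being relocated.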
First, I'm sorry that this bugzilla went unanswered for 2 weeks.

But to actually reply to it, I'm looking at the compliance rule definition:
https://github.com/ComplianceAsCode/content/blob/master/linux_os/guide/system/accounts/accounts-banners/banner_etc_issue/rule.yml
The rule text already says (except for the first sentence, which is confusing and which I'll fix) not to edit /etc/issue directly on RHCOS, but to use /etc/issue.d instead.

At the same time, there is no automated remediation (because we can't presume the text), so adding the remediation is left as an exercise for the admin. As you can see, the rule.yml even suggests a MachineConfig.

Is it possible that the customer applied a custom MachineConfig that touched /etc/issue? To check that, it would be nice to see the output of "oc get complianceremediations" and "oc get mc -lcompliance.openshift.io/scan-name" to look for remediations or MachineConfigs created from remediations.

Or is it possible that they use a very old compliance operator version and/or compliance content version? What is it that they're actually running?
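As a rough illustration of the /etc/issue.d approach, the following Python sketch builds a MachineConfig that drops a banner file there instead of touching /etc/issue. Everything specific in it is an assumption and not taken from the rule.yml or the customer's configuration: the object name, the file name 99-legal-banner.issue, the worker role, and Ignition spec version 3.1.0. The .issue suffix is used because, as far as I know, agetty only reads files with that extension from /etc/issue.d.

```python
import base64
import json


def build_issue_d_machineconfig(banner: str, role: str = "worker",
                                filename: str = "99-legal-banner.issue") -> dict:
    """Return a MachineConfig (as a dict) that writes `banner` to
    /etc/issue.d/<filename>. All names and versions are illustrative."""
    # Ignition file contents are passed as a data URL; base64 avoids
    # any escaping problems with newlines or special characters.
    encoded = base64.b64encode(banner.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": f"75-{role}-etc-issue-d",  # hypothetical name
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {
            "config": {
                "ignition": {"version": "3.1.0"},  # assumed spec version
                "storage": {
                    "files": [{
                        "path": f"/etc/issue.d/{filename}",
                        "mode": 0o644,  # serialized as decimal 420
                        "overwrite": True,
                        "contents": {
                            "source": "data:text/plain;charset=utf-8;base64,"
                                      + encoded,
                        },
                    }]
                },
            }
        },
    }


if __name__ == "__main__":
    mc = build_issue_d_machineconfig("Authorized use only.\n")
    print(json.dumps(mc, indent=2))
```

The printed JSON could be saved to a file and applied with "oc apply -f", after adjusting the role, name, and banner text for the cluster in question.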
btw rule amend PR: https://github.com/ComplianceAsCode/content/pull/7921
Closing due to insufficient data. Please see comments 14 and 16 in case you decide to reopen this bug.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days