Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2109546

Summary: 4.8 RHCOS pipeline jobs failing
Product: OpenShift Container Platform
Reporter: ximhan
Component: RHCOS
Assignee: Renata Ravanelli <rravanel>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.8
CC: dornelas, dustymabe, jligon, jschinta, kdsouza, mrussell, nstielau, roarora, smilner, travier
Target Milestone: ---
Target Release: 4.8.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-24 08:05:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2111107
Bug Blocks:

Description ximhan 2022-07-21 14:09:39 UTC
Thanks for reporting your issue!

In order for the CoreOS team to be able to quickly and successfully triage your issue, please fill out the following template as completely as possible.

Be ready for follow-up questions and please respond in a timely manner.

If we can't reproduce a bug, we might close your issue.

---

OCP Version at Install Time: 4.8
RHCOS Version at Install Time: 4.8
OCP Version after Upgrade (if applicable):
RHCOS Version after Upgrade (if applicable):
Platform: AWS, Azure, bare metal, GCP, vSphere, etc
Architecture: x86_64/ppc64le/s390x


What are you trying to do? What is your use case?
It causes the 4.8 RHCOS pipeline build jobs to fail:
https://jenkins-rhcos-art.cloud.privileged.psi.redhat.com/job/rhcos-art/job/rhcos-art-rhcos-4.8/


What happened? What went wrong or what did you expect?


What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.


If you're having problems booting/installing RHCOS, please provide:
- the full contents of the serial console showing disk initialization, network configuration, and Ignition stage (see https://access.redhat.com/articles/7212 for information about configuring your serial console)
- Ignition JSON
- output of `journalctl -b`


If you're having problems post-upgrade, please provide:
- A complete must-gather (`oc adm must-gather`)


If you're having SELinux related issues, please provide:
- The full `/var/log/audit/audit.log` file
- Were any SELinux modules or booleans changed from the default configuration?
- The output of `ostree admin config-diff | grep selinux/targeted` on impacted nodes


Please add anything else that might be useful, for example:
- kernel command line (`cat /proc/cmdline`)
- contents of `/etc/NetworkManager/system-connections/`
- contents of `/etc/sysconfig/network-scripts/`

Comment 1 ximhan 2022-07-21 14:12:23 UTC
first found in https://coreos.slack.com/archives/C999USB0D/p1658410283899959

some logs:
https://coreos.slack.com/archives/C999USB0D/p1658411872844739?thread_ts=1658410283.899959&cid=C999USB0D
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- Logs begin at Thu 2022-07-21 13:55:36 UTC, end at Thu 2022-07-21 13:56:23 UTC. --
Jul 21 13:55:43 localhost systemd[1]: Started Monitor console-login-helper-messages runtime issue snippets directory for changes.
Jul 21 13:55:43 localhost systemd[1]: console-login-helper-messages-issuegen.path: Failed with result 'unit-condition-failed'.

===============================================
https://coreos.slack.com/archives/C999USB0D/p1658411760671249?thread_ts=1658410283.899959&cid=C999USB0D
Red Hat Enterprise Linux CoreOS 48.84.202207210747-0 (Ootpa) 4.8
cosa-devsh login: core (automatic login)

Red Hat Enterprise Linux CoreOS 48.84.202207210747-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
[systemd]
Failed Units: 1
  console-login-helper-messages-issuegen.path
[core@cosa-devsh ~]$

Comment 3 Michael Nguyen 2022-07-21 16:49:56 UTC
I rolled back systemd to systemd-239-45.el8_4.10.x86_64 and the error went away. I will continue to investigate.

Comment 4 Michael Nguyen 2022-07-21 22:44:24 UTC
The changelog for systemd is here:

* Thu Jun 23 2022 systemd maintenance team <systemd-maint> - 239-45.11
- unit: don't emit PropertiesChanged signal if adding a dependency to a unit is a no-op (#2091591)
- core: rename unit_{start_limit|condition|assert}_test() to unit_test_xyz() (#2095950)
- core: Check unit start rate limiting earlier (#2095950)
- sd-event: introduce callback invoked when event source ratelimit expires (#2095950)
- core: rename/generalize UNIT(u)->test_start_limit() hook (#2095950)
- mount: make mount units start jobs not runnable if /p/s/mountinfo ratelimit is in effect (#2095950)
- mount: retrigger run queue after ratelimit expired to run delayed mount start jobs (#2095950)
- pid1: add a manager_trigger_run_queue() helper (#2095950)
- unit: add jobs that were skipped because of ratelimit back to run_queue (#2095950)
- core: propagate triggered unit in more load states (#2095950)
- core: propagate unit start limit hit state to triggering path unit (#2095950)
- core: Move 'r' variable declaration to start of unit_start() (#2095950)
- core: Delay start rate limit check when starting a unit (#2095950)
- core: Propagate condition failed state to triggering units. (#2095950)
- unit: check for mount rate limiting before checking active state (#2097337)

I think `Propagate condition failed state to triggering units. (#2095950)` is the culprit.

console-login-helper-messages-issuegen.service has these trigger conditions:
ConditionPathExistsGlob=|/run/console-login-helper-messages/issue.d/*.issue
ConditionPathExistsGlob=|/etc/console-login-helper-messages/issue.d/*.issue

If no *.issue files exist, the failed condition gets propagated up to console-login-helper-messages-issuegen.path as a failure. On my test system /etc/console-login-helper-messages/issue.d/ is empty and /run/console-login-helper-messages/issue.d/ has issue files. There may be a race condition where the files in /run/console-login-helper-messages/issue.d/ have not been populated yet, causing console-login-helper-messages-issuegen.path to fail. The path unit doesn't restart automatically, so it just remains failed.
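The ORed ConditionPathExistsGlob checks can be mimicked with plain bash globbing. The helper below is a hypothetical sketch (not systemd's actual implementation, and it uses a temp directory rather than the real /run and /etc paths) showing why the condition fails during the race window before any *.issue snippet has been written, and passes once one appears:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of systemd's ConditionPathExistsGlob semantics: the ORed
# conditions pass if ANY glob matches at least one existing path.
condition_path_exists_glob() {
  compgen -G "$1" > /dev/null
}

# Stand-in for the real issue.d directories (illustrative paths only).
root=$(mktemp -d)
mkdir -p "$root/run/console-login-helper-messages/issue.d" \
         "$root/etc/console-login-helper-messages/issue.d"

# Race window: the .path unit triggers before any *.issue file is written,
# so both globs match nothing and the condition fails. With the new systemd,
# that failure is propagated back to the triggering .path unit.
if condition_path_exists_glob "$root/run/console-login-helper-messages/issue.d/*.issue" \
   || condition_path_exists_glob "$root/etc/console-login-helper-messages/issue.d/*.issue"; then
  echo "condition passed"
else
  echo "condition failed"
fi

# Once a snippet exists, the same check passes.
touch "$root/run/console-login-helper-messages/issue.d/20_banner.issue"
if condition_path_exists_glob "$root/run/console-login-helper-messages/issue.d/*.issue"; then
  echo "condition passed"
fi

rm -rf "$root"
```

Under the older systemd the condition failure was simply recorded on the service; the backported patch additionally stops the triggering .path unit, which is why the failure now surfaces at login.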

Comment 5 Michael Nguyen 2022-07-21 23:44:51 UTC
The patch has this explanation: 

+Subject: [PATCH] core: Propagate condition failed state to triggering units.
+
+Alternative to https://github.com/systemd/systemd/pull/20531.
+
+Whenever a service triggered by another unit fails condition checks,
+stop the triggering unit to prevent systemd busy looping trying to
+start the triggered unit.
+
+(cherry picked from commit 12ab94a1e4961a39c32efb60b71866ab588d3ea2)

Comment 8 Dusty Mabe 2022-07-25 14:51:17 UTC
Interestingly enough, the patch you called out was reverted upstream:

- https://github.com/systemd/systemd/commit/40f41f3
- https://github.com/systemd/systemd/pull/21808

Maybe we need to work with the systemd maintainers in RHEL to figure out what the story is.

Comment 10 Renata Ravanelli 2022-07-25 18:18:46 UTC
Ignore the PR in comment #9. The 4.10 builds are not being affected by this issue.

Even though the other RHCOS versions do use the same package version, systemd is only causing issues in RHCOS 4.8, on all arches.

This PR pins the systemd version for 4.8 until we can fix it: https://github.com/openshift/os/pull/911

Comment 12 Michael Nguyen 2022-07-28 21:16:01 UTC
I reproduced it on a 4.10 build.  


[coreos-assembler]$ cosa run --qemu-image rhcos-410.84.202207271903-0-qemu.x86_64.qcow2 
[EVENT | QEMU guest is ready for SSH] [  OK  ] Started Login Service.
Red Hat Enterprise Linux CoreOS 410.84.202207271903-0
  Part of OpenShift 4.10, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.10/architecture/architecture-rhcos.html

---
Last login: Thu Jul 28 21:12:30 2022
[systemd]
Failed Units: 1
  console-login-helper-messages-issuegen.path
[core@cosa-devsh ~]$ rpm -qa systemd
systemd-239-45.el8_4.11.x86_64

Comment 16 Michael Nguyen 2022-08-08 12:27:31 UTC
4.8 pipeline jobs are passing as of August 3rd.

Comment 19 Timothée Ravier 2022-08-22 08:53:32 UTC
This is an issue that we had internally and is unlikely to be one that a customer faced. In any case, we fixed it in 410.84.202208030316-0, which has been shipped with https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.10.26.

Comment 20 Timothée Ravier 2022-08-22 09:01:22 UTC
Checking again, we shipped the systemd update (239-45.el8_4.10 → 239-45.el8_4.11) in 410.84.202207262020-0, which was included in https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.10.25.
So this might indeed be the issue they are facing. Updating should resolve it.

Comment 23 errata-xmlrpc 2022-08-24 08:05:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.48 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6099