Bug 2109546
| Summary: | 4.8 RHCOS pipeline jobs failing | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | ximhan |
| Component: | RHCOS | Assignee: | Renata Ravanelli <rravanel> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.8 | CC: | dornelas, dustymabe, jligon, jschinta, kdsouza, mrussell, nstielau, roarora, smilner, travier |
| Target Milestone: | --- | ||
| Target Release: | 4.8.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-08-24 08:05:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2111107 | ||
| Bug Blocks: | |||
Description
ximhan
2022-07-21 14:09:39 UTC
First found in https://coreos.slack.com/archives/C999USB0D/p1658410283899959

Some logs: https://coreos.slack.com/archives/C999USB0D/p1658411872844739?thread_ts=1658410283.899959&cid=C999USB0D

-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- Logs begin at Thu 2022-07-21 13:55:36 UTC, end at Thu 2022-07-21 13:56:23 UTC. --
Jul 21 13:55:43 localhost systemd[1]: Started Monitor console-login-helper-messages runtime issue snippets directory for changes.
Jul 21 13:55:43 localhost systemd[1]: console-login-helper-messages-issuegen.path: Failed with result 'unit-condition-failed'.

===============================================

https://coreos.slack.com/archives/C999USB0D/p1658411760671249?thread_ts=1658410283.899959&cid=C999USB0D

Red Hat Enterprise Linux CoreOS 48.84.202207210747-0 (Ootpa) 4.8
cosa-devsh login: core (automatic login)

Red Hat Enterprise Linux CoreOS 48.84.202207210747-0
Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead, make configuration changes via `machineconfig` objects:
https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
[systemd]
Failed Units: 1
console-login-helper-messages-issuegen.path
[core@cosa-devsh ~]$

I rolled back systemd to systemd-239-45.el8_4.10.x86_64 and the error went away. I will continue to investigate.

The changelog for systemd is here:
* Thu Jun 23 2022 systemd maintenance team <systemd-maint> - 239-45.11
- unit: don't emit PropertiesChanged signal if adding a dependency to a unit is a no-op (#2091591)
- core: rename unit_{start_limit|condition|assert}_test() to unit_test_xyz() (#2095950)
- core: Check unit start rate limiting earlier (#2095950)
- sd-event: introduce callback invoked when event source ratelimit expires (#2095950)
- core: rename/generalize UNIT(u)->test_start_limit() hook (#2095950)
- mount: make mount units start jobs not runnable if /p/s/mountinfo ratelimit is in effect (#2095950)
- mount: retrigger run queue after ratelimit expired to run delayed mount start jobs (#2095950)
- pid1: add a manager_trigger_run_queue() helper (#2095950)
- unit: add jobs that were skipped because of ratelimit back to run_queue (#2095950)
- core: propagate triggered unit in more load states (#2095950)
- core: propagate unit start limit hit state to triggering path unit (#2095950)
- core: Move 'r' variable declaration to start of unit_start() (#2095950)
- core: Delay start rate limit check when starting a unit (#2095950)
- core: Propagate condition failed state to triggering units. (#2095950)
- unit: check for mount rate limiting before checking active state (#2097337)
I think `Propagate condition failed state to triggering units (#2095950)` is the culprit.
console-login-helper-messages-issuegen.service has these conditions to trigger:
ConditionPathExistsGlob=|/run/console-login-helper-messages/issue.d/*.issue
ConditionPathExistsGlob=|/etc/console-login-helper-messages/issue.d/*.issue
If no issue files match, the failed condition is propagated up to console-login-helper-messages-issuegen.path, which is then marked as failed. On my test system, nothing matches /etc/console-login-helper-messages/issue.d/*.issue, while /run/console-login-helper-messages/issue.d/*.issue does match files. There may be a race condition in which the files under /run/console-login-helper-messages/issue.d/ have not been populated yet when the condition is checked, which fails console-login-helper-messages-issuegen.path. The path unit doesn't restart automatically, so it just remains failed.
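To make the suspected behaviour concrete, here is a minimal Python sketch of it. This is not systemd code. It assumes that the two `|`-prefixed conditions act as triggering conditions (at least one must hold) and that a failed check is propagated to the path unit with the result `unit-condition-failed` seen in the log above. The function names, the `root` prefix, and the `"success"` label are hypothetical and exist only for illustration.

```python
import glob
import tempfile
from pathlib import Path

# Patterns copied from the ConditionPathExistsGlob= lines of the service.
ISSUE_GLOBS = [
    "/run/console-login-helper-messages/issue.d/*.issue",
    "/etc/console-login-helper-messages/issue.d/*.issue",
]


def triggering_conditions_met(patterns, root=""):
    """Return True if at least one glob pattern matches an existing path.

    `root` is a hypothetical prefix so the sketch can run against a
    scratch directory instead of the real /run and /etc.
    """
    return any(glob.glob(root + pattern) for pattern in patterns)


def path_unit_result(patterns, root=""):
    """Model the propagation described above: if the service's condition
    check fails, the triggering .path unit is marked as failed."""
    if triggering_conditions_met(patterns, root):
        return "success"  # hypothetical label for the non-failing case
    return "unit-condition-failed"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Before any snippet is written: models the suspected race window.
        print(path_unit_result(ISSUE_GLOBS, root))

        # After a snippet appears under the /run directory.
        run_dir = Path(root + "/run/console-login-helper-messages/issue.d")
        run_dir.mkdir(parents=True)
        (run_dir / "10_example.issue").write_text("example\n")
        print(path_unit_result(ISSUE_GLOBS, root))
```

In this model the first call stands in for the window before the files under /run exist. Because the real path unit does not restart on its own, a failure in that window would stick even after the files appear.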
The patch has this explanation:

+Subject: [PATCH] core: Propagate condition failed state to triggering units.
+
+Alternative to https://github.com/systemd/systemd/pull/20531.
+
+Whenever a service triggered by another unit fails condition checks,
+stop the triggering unit to prevent systemd busy looping trying to
+start the triggered unit.
+
+(cherry picked from commit 12ab94a1e4961a39c32efb60b71866ab588d3ea2)

Interestingly enough, the patch you called out was reverted:
- https://github.com/systemd/systemd/commit/40f41f3
- https://github.com/systemd/systemd/pull/21808

Maybe we need to work with the systemd maintainers in RHEL to figure out what the story is.

Ignore PR in comment#9.

The 4.10 builds are not affected by this issue. Even though the other RHCOS versions use the same package version, systemd is only causing issues in RHCOS 4.8, on all arches. This PR pins the systemd version for 4.8 until we can fix it: https://github.com/openshift/os/pull/911

I reproduced it on a 4.10 build.

[coreos-assembler]$ cosa run --qemu-image rhcos-410.84.202207271903-0-qemu.x86_64.qcow2
[EVENT | QEMU guest is ready for SSH]
[  OK  ] Started Login Service.

Red Hat Enterprise Linux CoreOS 410.84.202207271903-0
Part of OpenShift 4.10, RHCOS is a Kubernetes native operating system managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead, make configuration changes via `machineconfig` objects:
https://docs.openshift.com/container-platform/4.10/architecture/architecture-rhcos.html

---
Last login: Thu Jul 28 21:12:30 2022
[systemd]
Failed Units: 1
console-login-helper-messages-issuegen.path
[core@cosa-devsh ~]$ rpm -qa systemd
systemd-239-45.el8_4.11.x86_64

4.8 pipeline jobs are passing as of August 3rd.

This is an issue that we had internally; it is unlikely to be one that a customer faced.
In any case, we fixed this in 410.84.202208030316-0, which has been shipped with https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.10.26.

Checking again, we shipped the systemd update (239-45.el8_4.10 → 239-45.el8_4.11) in 410.84.202207262020-0, which was included in https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.10.25. So this might indeed be the issue they are facing. Updating should resolve it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.48 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6099
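For anyone checking a node against the systemd update mentioned above, the Python sketch below shows one way to test whether a string printed by `rpm -qa systemd` is the build reported in this bug. It only recognises 239-45.el8_4.11, the build named in the comments, and says nothing about later builds. The helper names are hypothetical.

```python
import re

# Build reported in this bug as showing the failed .path unit.
AFFECTED_VERSION_RELEASE = "239-45.el8_4.11"


def systemd_version_release(nvra):
    """Extract 'version-release' from a string such as
    'systemd-239-45.el8_4.11.x86_64'. Return None for anything else,
    including subpackages such as systemd-libs."""
    match = re.fullmatch(r"systemd-(\d+)-(.+)\.([^.]+)", nvra.strip())
    if match is None:
        return None
    return f"{match.group(1)}-{match.group(2)}"


def is_affected(nvra):
    """True only for the exact build named in this report."""
    return systemd_version_release(nvra) == AFFECTED_VERSION_RELEASE


if __name__ == "__main__":
    for pkg in ("systemd-239-45.el8_4.10.x86_64",
                "systemd-239-45.el8_4.11.x86_64"):
        print(pkg, is_affected(pkg))
```

The comparison is an exact string match rather than an RPM version ordering, since the report only identifies one affected build and one known-good rollback.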