Bug 2116418 - mdcheck_continue.timer Failed with result 'unit-condition-failed'.
Summary: mdcheck_continue.timer Failed with result 'unit-condition-failed'.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: mdadm
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: rc
Target Release: 8.9
Assignee: XiaoNi
QA Contact: Fine Fan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-08 13:42 UTC by James Hartsock
Modified: 2023-11-14 18:07 UTC (History)
5 users (show)

Fixed In Version: mdadm-4.2-8.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-14 15:50:02 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-130496 0 None None None 2022-08-08 13:49:16 UTC
Red Hat Product Errata RHBA-2023:7128 0 None None None 2023-11-14 15:50:08 UTC

Description James Hartsock 2022-08-08 13:42:01 UTC
Description of problem:
mdcheck_continue.timer enters a failed state because it fails to start mdcheck_continue.service when the service's [Unit] condition is not met.

Version-Release number of selected component (if applicable):
mdadm-4.2-2.el8.x86_64

How reproducible:
Very

Steps to Reproduce:
1. Install mdadm and make sure timer is enabled
2. Make sure the following condition is NOT met
   ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*
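
A quick way to confirm the condition is unmet before the timer next fires (a sketch; the glob is the same one quoted above):

$ systemctl cat mdcheck_continue.service | grep Cond
ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*

$ ls /var/lib/mdcheck/MD_UUID_* 2>/dev/null || echo "no state files - condition not met"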

Actual results:
$ sudo systemctl list-units --failed
  UNIT                   LOAD   ACTIVE SUB    DESCRIPTION                      
● mdcheck_continue.timer loaded failed failed MD array scrubbing - continuation

$ sudo systemctl status mdcheck_continue.timer
● mdcheck_continue.timer - MD array scrubbing - continuation
   Loaded: loaded (/usr/lib/systemd/system/mdcheck_continue.timer; enabled; vendor preset: disabled)
   Active: failed (Result: unit-condition-failed) since Mon 2022-08-08 01:05:07 CDT; 6h ago
  Trigger: n/a

Aug 06 09:27:51 lager.jjhartsock.com systemd[1]: Started MD array scrubbing - continuation.
Aug 08 01:05:07 lager.jjhartsock.com systemd[1]: mdcheck_continue.timer: Failed with result 'unit-condition-failed'.



Expected results:
Timer should NOT be failed



Additional info:

I am adding the same condition to the timer, as I believe that should be a valid work-around.

# cat /etc/systemd/system/mdcheck_continue.timer.d/james.conf
[Unit]
ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*
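
For the drop-in to take effect, systemd needs to reload its unit files:

# systemctl daemon-reload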

Comment 3 XiaoNi 2022-10-23 02:30:14 UTC
Hi James

If "ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*" is not met, the timer
should not run. It's what ConditionPathExistsGlob wants, right?

So it's an expected result that the timer fails when /var/lib/mdcheck/MD_UUID_* doesn't exist.

By the way, the MD_UUID_* is left by the last mdcheck. When the raid is too big, it
can't finish the check at a time. So the file MD_UUID_* is left. The mdcheck_continue.service
will do the check from the last interruption place.

From my side, this is not a bug. What's your opinion?

Regards
Xiao

Comment 4 James Hartsock 2022-10-23 13:37:01 UTC
My view is that a service or timer should only fail if it had an issue that needs administrative action (a skip should be a clean exit). Adding the same condition to the timer allows the timer to be skipped (like the service) rather than going into a failed state.

So only items needing administrative action should be reported in the output of:
systemctl list-units --failed

Comment 5 XiaoNi 2022-10-24 07:02:08 UTC
(In reply to James Hartsock from comment #4)
> My view is that a service or timer should only fail if it had an issue that
> needs administrative action (a skip should be a clean exit). Adding the same
> condition to the timer allows the timer to be skipped (like the service)
> rather than going into a failed state.

Hi James

You mean adding ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*
to mdcheck_continue.timer, right? If the condition isn't met, then
mdcheck_continue.timer doesn't run, so mdcheck_continue.service doesn't
run either. Right?

I'm not familiar with systemd. Do you know how to skip a service/timer without
it going into a failed state? Could you provide a patch
for this?

Regards
Xiao

Comment 6 XiaoNi 2022-10-25 06:44:52 UTC
From my side, this looks like a feature rather than a bug. Moving this to the next release.

Comment 8 James Hartsock 2022-10-25 12:03:13 UTC
Yes, adding the same Conditions to both the .service & .timer.  This way both are skipped and the .timer does not end up in a failed state when it triggers only to fail to start the service.

And I suspect customers getting alerts on failed services across their enterprise for this issue would not consider this a feature.

Comment 9 James Hartsock 2022-10-25 12:15:05 UTC
BTW ... I am now using same work-around on RHEL 9

$ rpm -q mdadm
mdadm-4.2-2.el9.x86_64

$ sudo systemctl cat mdcheck_continue.service | grep Cond
ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*

$ sudo systemctl cat mdcheck_continue.timer | tail -n 3
# /etc/systemd/system/mdcheck_continue.timer.d/james.conf
[Unit]
ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*

Comment 10 XiaoNi 2022-10-25 15:14:23 UTC
(In reply to James Hartsock from comment #9)
> BTW ... I am now using same work-around on RHEL 9
> 
> $ rpm -q mdadm
> mdadm-4.2-2.el9.x86_64
> 
> $ sudo systemctl cat mdcheck_continue.service | grep Cond
> ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*
> 
> $ sudo systemctl cat mdcheck_continue.timer | tail -n 3
> # /etc/systemd/system/mdcheck_continue.timer.d/james.conf
> [Unit]
> ConditionPathExistsGlob = /var/lib/mdcheck/MD_UUID_*

There is a problem with this work-around. The mdcheck_continue.timer
is started by mdcheck_start.timer. If we add the check to mdcheck_continue.timer,
the continue timer can't be started. Can it be started again?

Comment 11 XiaoNi 2022-10-25 15:27:14 UTC
(In reply to XiaoNi from comment #10)
> 
> There is a problem with this work-around. The mdcheck_continue.timer
> is started by mdcheck_start.timer. If we add the check to mdcheck_continue.timer,
> the continue timer can't be started. Can it be started again?

If mdcheck_continue.timer can't be started again, the mdcheck action
can't resume checking from the interrupted position.

Comment 12 James Hartsock 2022-10-27 12:03:11 UTC
Thank you for letting me know about the flaw in the work-around.  The system it is currently in use on has no MD array, so that is not a condition I faced.

Comment 13 XiaoNi 2022-10-27 13:00:48 UTC
We need to start mdcheck_continue.timer when mdcheck_start.timer starts.
So the only place to add the check is in mdcheck_continue.service.

Do you agree with closing this bug?

Comment 14 James Hartsock 2022-10-27 13:55:09 UTC
No, but I agree my work-around is flawed.  A service or timer should not go to a failed state when it should simply be skipped. It is standard for an admin to run systemctl list-units --failed and assume action is needed on anything listed.

Comment 15 XiaoNi 2022-10-27 15:00:38 UTC
That makes sense. But as mentioned, we can't add the condition to mdcheck_continue.timer.
The only place to add the condition is mdcheck_continue.service. I'll try to find a
way to skip mdcheck_continue.service cleanly when the condition is not met.
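
One generic way to get that clean skip (a sketch only, not necessarily the change that later
shipped in mdadm-4.2-8.el8; the drop-in name, the mdcheck script path, and the duration value
are assumptions here) is to clear the unit-level condition and re-test the glob inside the
service body, so an unmet check exits 0 instead of propagating 'unit-condition-failed' to the timer:

# cat /etc/systemd/system/mdcheck_continue.service.d/skip-cleanly.conf
[Unit]
# Reset the condition so the triggering timer never records 'unit-condition-failed'
ConditionPathExistsGlob=

[Service]
# Re-check the glob in the service itself; exit 0 (a clean skip) when there is nothing to continue
# (the mdcheck path and the "6 hours" duration below are assumed, not copied from the shipped unit)
ExecStart=
ExecStart=/bin/sh -c 'ls /var/lib/mdcheck/MD_UUID_* >/dev/null 2>&1 || exit 0; exec /usr/share/mdadm/mdcheck --continue --duration "6 hours"'

On systemd 243 and later (RHEL 9), ExecCondition= would be a cleaner way to express the same skip.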

Comment 16 XiaoNi 2022-10-27 15:28:41 UTC
(In reply to James Hartsock from comment #12)
> Thank you for letting me know about the flaw in the work-around.  The system it is
> currently in use on has no MD array, so that is not a condition I faced.

Hi James

The mdcheck_start.timer is not enabled by default. If your system doesn't have
RAID, why did you start mdcheck_start.timer?

Comment 17 James Hartsock 2022-10-27 17:06:10 UTC
Sorry, I do have RAID ... I was on wrong host when I checked before saying that yesterday.

$ cat /proc/mdstat 
Personalities : [raid1] 
md126 : active raid1 sda[1] sdb[0]
      927881216 blocks super external:/md127/0 [2/2] [UU]
      
md127 : inactive sdb[1](S) sda[0](S)
      10402 blocks super external:imsm
       
unused devices: <none>

Comment 21 Fine Fan 2023-06-02 09:41:03 UTC
With mdadm-4.2-8.el8, when triggered, mdcheck_continue.timer didn't fail anymore.

[root@storageqe-26 ~]# systemctl status mdcheck_continue.timer
● mdcheck_continue.timer - MD array scrubbing - continuation
   Loaded: loaded (/usr/lib/systemd/system/mdcheck_continue.timer; disabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2023-06-02 05:39:00 EDT; 24s ago
  Trigger: n/a

Jun 02 05:38:41 storageqe-26.sqe.lab.eng.bos.redhat.com systemd[1]: Started MD array scrubbing - continuation.
Jun 02 05:39:00 storageqe-26.sqe.lab.eng.bos.redhat.com systemd[1]: mdcheck_continue.timer: Succeeded.
Jun 02 05:39:00 storageqe-26.sqe.lab.eng.bos.redhat.com systemd[1]: Stopped MD array scrubbing - continuation.
[root@storageqe-26 ~]# 

[root@storageqe-26 ~]# systemctl list-units --failed | grep mdcheck
[root@storageqe-26 ~]#

Comment 25 errata-xmlrpc 2023-11-14 15:50:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (mdadm bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7128

