Bug 2023092
| Summary: | mdadm.service fails to start: mdmonitor.service: Can't open PID file /run/mdadm/mdadm.pid (yet?) after start: Operation not permitted | | |
| --- | --- | --- | --- |
| Product: | [Fedora] Fedora | Reporter: | Jan Vesely <jan.vesely> |
| Component: | mdadm | Assignee: | XiaoNi <xni> |
| Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 37 | CC: | agk, dave, dledford, evansj3000, jes.sorensen, rhel, xni |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-12-05 21:02:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jan Vesely
2021-11-14 16:56:52 UTC
Hi Jan,

I tried to reproduce this in my environment and could not reproduce it. The steps I did are:

1. Install f35
2. Create some loop devices and create a raid1
3. mdadm -Es > /etc/mdadm.conf (the mdmonitor service needs the config file)
4. systemctl start mdmonitor
5. systemctl status mdmonitor

● mdmonitor.service - Software RAID monitoring and management
     Loaded: loaded (/usr/lib/systemd/system/mdmonitor.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-11-15 20:10:36 EST; 5s ago
    Process: 81616 ExecStart=/sbin/mdadm --monitor --scan --syslog -f --pid-file=/run/mdadm/mdadm.pid (code=exited, status=0/>
   Main PID: 81617 (mdadm)
      Tasks: 1 (limit: 14247)
     Memory: 456.0K
        CPU: 6ms
     CGroup: /system.slice/mdmonitor.service
             └─81617 /sbin/mdadm --monitor --scan --syslog -f --pid-file=/run/mdadm/mdadm.pid

The mdadm.pid file is created automatically after running `systemctl start mdmonitor`.

How about restarting your mdmonitor service? Does that succeed?

Thanks
Xiao
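(As a reference for step 2 above, a raid1 on loop devices can be set up roughly like this. This is a minimal sketch only; the backing-file paths, sizes, and the /dev/md0 name are arbitrary choices, not taken from this report.)

# Back two loop devices with sparse files and build a raid1 from them
truncate -s 1G /var/tmp/loop0.img /var/tmp/loop1.img
sudo losetup -f --show /var/tmp/loop0.img   # prints the device it attached, e.g. /dev/loop0
sudo losetup -f --show /var/tmp/loop1.img   # e.g. /dev/loop1
# Use the device names losetup printed; answer 'y' to mdadm's confirmation prompt, or add --run
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1
# Write the array description to the config file mdmonitor needs (step 3)
sudo sh -c 'mdadm -Es > /etc/mdadm.conf'
sudo systemctl start mdmonitor
systemctl status mdmonitor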
From the opening post:

"the service starts ok when restarted manually after the machine boot completes"

so yes, manually restarting the mdmonitor service works.

(In reply to Jan Vesely from comment #2)
> From the opening post:
>
> "the service starts ok when restarted manually after the machine boot
> completes"
>
> so yes, manually restarting the mdmonitor service works.

How about lsblk after boot? Then I can try to reproduce this problem on my machine.

NAME                                            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                                               8:0    0 931.5G  0 disk
├─md126                                           9:126  0   1.8T  0 raid0
│ ├─md126p1                                     259:4    0   260M  0 part
│ ├─md126p2                                     259:5    0    16M  0 part
│ ├─md126p3                                     259:6    0 325.8G  0 part
│ ├─md126p4                                     259:7    0   995M  0 part
│ └─md126p5                                     259:8    0   1.5T  0 part
│   └─luks-f3e4f35e-5a93-4c2a-a1b2-56da7dc057b6
│                                               253:4    0   1.5T  0 crypt /mnt/big-data
└─md127                                           9:127  0     0B  0 md
sdb                                               8:16   0 931.5G  0 disk
├─md126                                           9:126  0   1.8T  0 raid0
│ ├─md126p1                                     259:4    0   260M  0 part
│ ├─md126p2                                     259:5    0    16M  0 part
│ ├─md126p3                                     259:6    0 325.8G  0 part
│ ├─md126p4                                     259:7    0   995M  0 part
│ └─md126p5                                     259:8    0   1.5T  0 part
│   └─luks-f3e4f35e-5a93-4c2a-a1b2-56da7dc057b6
│                                               253:4    0   1.5T  0 crypt /mnt/big-data
└─md127                                           9:127  0     0B  0 md
zram0                                           252:0    0  15.5G  0 disk  [SWAP]
nvme0n1                                         259:0    0 238.5G  0 disk
├─nvme0n1p1                                     259:1    0   200M  0 part  /boot/efi
├─nvme0n1p2                                     259:2    0     1G  0 part  /boot
└─nvme0n1p3                                     259:3    0 237.3G  0 part
  └─luks-a99fc9de-32dd-43f7-9133-699710a861ef  253:0    0 237.3G  0 crypt
    ├─fedora-root                               253:1    0    50G  0 lvm   /
    ├─fedora-swap                               253:2    0  15.7G  0 lvm   [SWAP]
    └─fedora-home                               253:3    0 171.6G  0 lvm   /home

sorry for the delay

I have this same problem on a virtual private server (VPS) at OVH that I just upgraded from Fedora 34 to 35.

NAME      MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
sda         8:0    0  1.8T  0 disk
├─sda1      8:1    0  511M  0 part
├─sda2      8:2    0  1.8T  0 part
│ └─md2     9:2    0  3.6T  0 raid0 /var/lib/containers/storage/overlay
│                                   /
├─sda3      8:3    0  512M  0 part  [SWAP]
└─sda4      8:4    0    2M  0 part
sdb         8:16   0  1.8T  0 disk
├─sdb1      8:17   0  511M  0 part  /boot/efi
├─sdb2      8:18   0  1.8T  0 part
│ └─md2     9:2    0  3.6T  0 raid0 /var/lib/containers/storage/overlay
│                                   /
└─sdb3      8:19   0  512M  0 part  [SWAP]

Still fails the same way in f36.

I see similar failures in at least one more service:

dnsmasq[1675]: chown of PID file /run/nm-dnsmasq-wlo1.pid failed: Operation not permitted

but in the case of dnsmasq it doesn't result in service failure. It looks like the "chown of PID file" message is coming from systemd rather than from the mdadm daemon.

I tried to follow the "yet?" part and changed the mdmonitor service file to include:

ExecStart=sh -c "/sbin/mdadm --monitor --scan --syslog -f --pid-file=/run/mdadm/mdadm.pid && sleep 5"

This instead makes the issue 100% reproducible even when starting mdmonitor manually. The experiment points to the issue being a race between systemd accessing the file set up in "PIDFile=" and mdadm deleting the pid file after exiting.

Removing the "PIDFile=" part of mdmonitor.service file works around the issue.
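(For reference, that workaround can be applied without modifying the packaged unit file by using a systemd drop-in. This is a minimal sketch; the drop-in file name no-pidfile.conf is an arbitrary choice, and it assumes that an empty PIDFile= assignment resets the directive from the original unit, which is systemd's usual reset behaviour for single-value settings.)

# Create a drop-in override instead of editing the packaged unit file;
# the empty PIDFile= line clears the value set by the original unit.
sudo mkdir -p /etc/systemd/system/mdmonitor.service.d
sudo tee /etc/systemd/system/mdmonitor.service.d/no-pidfile.conf <<'EOF'
[Service]
PIDFile=
EOF
sudo systemctl daemon-reload
sudo systemctl restart mdmonitor

(Alternatively, `sudo systemctl edit mdmonitor` without --full opens an editor on exactly such a drop-in.)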
This message is a reminder that Fedora Linux 35 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 35 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.

Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13. Fedora Linux 35 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field.

If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.

Reopening, the issue is still present in Fedora 37.

This message is a reminder that Fedora Linux 37 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '37'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 37 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.

Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05. Fedora Linux 37 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field.

If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.

Hello all. I am writing in this ticket as I have experienced this issue today in Fedora 41.
I am running on a Mac Pro 5,1 and have four 1 TB HDDs, one in each SATA2 port.
>>> Removing the "PIDFile=" part of mdmonitor.service file works around the issue.
Jan's advice still holds to this day. There seems to be a race condition when the service starts: systemd apparently tries to read the PID file before mdadm has written it, so the service fails.
systemd[1]: mdmonitor.service: Can't open PID file /run/mdadm/mdadm.pid (yet?) after start: No such file or directory
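(For context, the two directives involved can be inspected on the affected machine. This is a sketch; the ExecStart and PIDFile values shown are the ones quoted earlier in this bug, not verified against the Fedora 41 package.)

# Show the unit file systemd actually loaded for mdmonitor
systemctl cat mdmonitor.service
# The directives this bug is about (as quoted earlier in the thread):
#   ExecStart=/sbin/mdadm --monitor --scan --syslog -f --pid-file=/run/mdadm/mdadm.pid
#   PIDFile=/run/mdadm/mdadm.pid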
Here are the steps to reproduce:
# Create the RAID
sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sd[a-d]1
# Create ext4 fs
sudo mkfs.ext4 /dev/md0
# Create /etc/mdadm.conf from assembled array
sudo mdadm --detail --scan --verbose | sudo tee -a /etc/mdadm.conf
# Add to /etc/fstab and set nofail, otherwise feel pain
sudo blkid /dev/md0
# Append an entry like this (using the UUID from blkid) to /etc/fstab:
UUID=36149ec5-d782-4ad7-9dc0-962537b9870b /mnt/data ext4 defaults,nofail 0 0
# Create the mountpoint and mount it
sudo mkdir -p /mnt/data
sudo mount -a
# Make it mine :)
sudo chown $(whoami) /mnt/data
# I am unsure whether this really helps, but I've seen that it should be done
sudo dracut --force --add mdraid
# Reboot
sudo systemctl reboot
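(After the reboot, the array, the service, and the mount can be checked with something like the following. This is a sketch that assumes the array keeps the /dev/md0 name recorded in /etc/mdadm.conf and the /mnt/data mountpoint used above.)

# Confirm the array assembled
cat /proc/mdstat
sudo mdadm --detail /dev/md0
# Confirm the monitor service and the mount
systemctl status mdmonitor
findmnt /mnt/data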
Intended result: the 'ConditionPathExists=/etc/mdadm.conf' line in the service config should let the service start at the next boot, and therefore the RAID should mount.
Problem: the service does start now that the config is there, but it dies:
Feb 15 16:07:07 hostname systemd[1]: Starting mdmonitor.service - Software RAID monitoring and management...
Feb 15 16:07:07 hostname systemd[1]: mdmonitor.service: Can't open PID file /run/mdadm/mdadm.pid (yet?) after start: No such file or directory
Feb 15 16:07:07 hostname systemd[1]: mdmonitor.service: Failed with result 'protocol'.
Feb 15 16:07:07 hostname systemd[1]: Failed to start mdmonitor.service - Software RAID monitoring and management.
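(A quick way to confirm this is the timing issue discussed above rather than a broken config: restart the unit by hand once the system is fully up; per the earlier comments, the PID file is then created and the service keeps running. A sketch:)

# After boot has finished, restart the unit manually
sudo systemctl restart mdmonitor
systemctl status mdmonitor
# The PID file the unit points at should now exist
ls -l /run/mdadm/mdadm.pid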
Workaround/Solution:
sudo systemctl edit --full mdmonitor.service
Comment out this line:
#PIDFile=/run/mdadm/mdadm.pid
The service will start on the next boot and the RAID will be mounted at the correct mountpoint. :D
However, if something goes wrong I assume systemd not knowing the PID will cause trouble.
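(If that is a concern: assuming the unit is Type=forking, which the status output earlier in this bug suggests, systemd will usually still determine a main PID on its own even without PIDFile=, because GuessMainPID= defaults to yes and the forked mdadm daemon is the only process left in the unit's control group. This is general systemd behaviour, not something verified for this specific unit; it can be checked with, for example:)

# Show which PID systemd considers the main process for the unit
systemctl show -p MainPID -p Type mdmonitor.service
# The CGroup section of status should list the single mdadm process
systemctl status mdmonitor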