Bug 1901449

Summary: Disabling service via Ansible `service` module does not work, systemctl show aborts with "Failed to parse bus message: Invalid argument"
Product: Red Hat Enterprise Linux 8 Reporter: Miroslav Vadkerti <mvadkert>
Component: systemdAssignee: David Tardon <dtardon>
Status: CLOSED CURRENTRELEASE QA Contact: Frantisek Sumsal <fsumsal>
Severity: high Docs Contact:
Priority: unspecified    
Version: 8.4CC: abokovoy, ajb, arachman, bperkins, bstinson, bugzilla-redhat, carl, dkarpele, dtardon, farrotin, fsumsal, ftrivino, jpena, lmiksik, msekleta, ngompa13, nhosoi, pcahyna, rbeyel, rpm, sgoveas, skozina, stefw, systemd-maint-list, todoleza, toracat, twoerner, ykaliuta
Target Milestone: rcKeywords: Triaged
Target Release: 8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: systemd-239-43.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-04-07 10:16:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1921845    

Description Miroslav Vadkerti 2020-11-25 09:34:47 UTC
Description of problem:

NOTE: This is a regression from RHEL-8.3 where this problem is not present. So I believe the problem is not ansible side.

We are able to reproduce this problem with both ansible-2.8 and ansible-2.9.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. have a playbook

```
 - hosts: all
   tasks:
     - name: Stop firewall (firewalld)
       service:
         name: firewalld
         state: stopped
```


2. Run it against an RHEL-8.4 machine

Actual results:
fatal: [root.139.202]: FAILED! => {"changed": false, "msg": "Service is in unknown state", "status": {}}


Expected results:
ok: [root.136.223]

Additional info:
Because this is a regression from RHEL-8.3, we do not see it with other services we interact with, I believe the problem is not in Ansible.

Comment 1 Miroslav Vadkerti 2020-11-25 09:36:18 UTC
Hmm downgrading `firewalld` does not help, moving to `systemd`

Comment 2 Pavel Cahyna 2020-11-25 10:10:58 UTC
Probably because systemctl show aborts.

8.4: 
# systemctl show firewalld
(...)
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SyslogLevel=6
SyslogFacility=3
LogLevelMax=-1
LogRateLimitIntervalUSec=0
LogRateLimitBurst=0
SecureBits=0
Failed to parse bus message: Invalid argument
# echo $?
1

8.3:
# systemctl show firewalld
AssertTimestampMonotonic=3105543614
Transient=no
Perpetual=no
StartLimitIntervalUSec=10s
StartLimitBurst=5
StartLimitAction=none
FailureAction=none
SuccessAction=none
InvocationID=88d31d5c400d4a219ca6387cc0474332
CollectMode=inactive
# echo $?
0

Comment 3 Pavel Cahyna 2020-11-25 10:28:24 UTC
8.4

./systemctl show firewalld | grep LoadState
Failed to parse bus message: Invalid argument

8.3

systemctl show firewalld  | grep LoadState
LoadState=loaded

The Ansible systemd module expects a LoadState key in systemctl output.

Minimal Ansible reproducer:

ansible -m systemd -a 'name=firewalld state=started' localhost

Comment 4 Pavel Cahyna 2020-11-25 11:36:48 UTC
In particular, the above means that ignoring the return code will not work, as the module will hit a problem in a different place and fail with "Could not find the requested service firewalld: host".

I don't see the problem on Rawhide - some backport gone wrong?

Comment 5 Pavel Cahyna 2020-11-25 11:57:21 UTC
Not just firewalld:

# systemctl show sshd | grep LoadState
Failed to parse bus message: Invalid argument

Comment 6 David Tardon 2020-11-25 12:14:45 UTC
*** Bug 1898591 has been marked as a duplicate of this bug. ***

Comment 7 Miroslav Vadkerti 2020-11-25 12:51:57 UTC
I forgot to mention that this issues is blocking currently RHEL CI. Please let us know how serious this issue is (in terms of breaking a lot of tests), so we can plan accordingly. The current nightly will expire in 2 days.

Comment 8 David Tardon 2020-11-25 15:17:09 UTC
This is caused by a kernel change. I get the failure with 4.18.0-252.el8 but not with 4.18.0-240.1.1.el8_3.

Comment 9 Pavel Cahyna 2020-11-25 17:27:30 UTC
Interestingly, strace does not show any syscall returning EINVAL (except the very first one in the process: arch_prctl, but this is the same for working kernels.)

Comment 10 Alexander Bokovoy 2020-11-26 11:09:10 UTC
This blocks any ability to run gating tests for RHEL IdM while attempting to integrate to RHEL 8.4.

Comment 13 Trivino 2020-11-26 12:50:57 UTC
(In reply to Pavel Cahyna from comment #9)
> Interestingly, strace does not show any syscall returning EINVAL (except the
> very first one in the process: arch_prctl, but this is the same for working
> kernels.)

I see the following error after comparing strace output from both rhel8.4 (4.18.0-252.el8.x86_64) and rhel8.3 (4.18.0-240.1.1.el8_3.x86_64)

recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=24, tv_nsec=999880000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {tv_sec=24, tv_nsec=992329163})


I'm not sure which sysfs attribute file is the one that is busy...

Comment 14 Frantisek Sumsal 2020-11-26 14:26:09 UTC
I played around a bit and I think I know where the problem is.

The issue happens when systemctl is trying to format the CapabilityBoundingSet= property:

# systemctl show tuned --no-pager -p CapabilityBoundingSet
Failed to parse bus message: Invalid argument

The ltrace shows that it dies somewhere around checking the last valid capability number from /proc/sys/kernel/cap_last_cap:
   bus_message_print_all_properties(0x563e0dbe6820, 0x563e0c1330d0, 0x563e0dbe2c60, 0 <unfinished ...>
        gettid@SYS(2, 0xffffffff, 0, 0x7ffd2adc5e30)                                                                                                = 5975
        SYS_318@SYS(0x7f0156fb68b0, 16, 1, 0)                                                                                                       = 16
        SYS_318@SYS(0x7f0156fb6890, 16, 1, 0x563e0dbe8260)                                                                                          = 16
        SYS_318@SYS(0x7f0156fb6890, 16, 1, 0x563e0dbe8260)                                                                                          = 16
        SYS_318@SYS(0x7f0156fb6890, 16, 1, 0x563e0dbe8260)                                                                                          = 16
        SYS_318@SYS(0x7f0156fb6890, 16, 1, 0x563e0dbe8490)                                                                                          = 16
        SYS_318@SYS(0x7f0156fb6890, 16, 1, 0x563e0dbe8490)                                                                                          = 16
        sd_bus_message_peek_type(0x563e0dbe6820, 0x7ffd2adc5d37, 0x7ffd2adc5d38, 0)                                                                 = 1
        openat@SYS(AT_FDCWD, "/proc/sys/kernel/cap_last_cap", 0x80000, 00)                                                                          = 4
        fstat@SYS(4, 0x7ffd2adc5a70)                                                                                                                = 0
        read@SYS(4, "39\n", 1024)                                                                                                                   = 3
        close@SYS(4)                                                                                                                                = 0
    <... bus_message_print_all_properties resumed> )                                                                                                = 0xffffffea
    bus_log_parse_error(0xffffffea, 2, 0xffffffea, 0 <unfinished ...>

Comparing the capability number with 8.3, there are clearly two new capabilities present in the new kernel:
## 8.3
# cat /proc/sys/kernel/cap_last_cap
37

## 8.4
# cat /proc/sys/kernel/cap_last_cap
39

After a brief poking around it's clear that the systemd internal capability structure is missing the newly introduced capabilities, since the list is built during compilation and the compilation was done with an older kernel:
Breakpoint 1, capability_to_name (id=id@entry=37) at ../src/basic/cap-list.c:20
20	const char *capability_to_name(int id) {
(gdb) 
Run till exit from #0  capability_to_name (id=id@entry=37) at ../src/basic/cap-list.c:20
0x00007ffff79f2d5b in capability_set_to_string_alloc (set=set@entry=18446744073709551615, s=s@entry=0x7fffffffd6d8) at ../src/basic/cap-list.c:70
70	                        p = capability_to_name(i);
Value returned is $38 = 0x7ffff7adcf0d "cap_audit_read"
(gdb) p capability_names
$39 = {0x7ffff7adccfe "cap_chown", 0x7ffff7adcd08 "cap_dac_override", 0x7ffff7adcd19 "cap_dac_read_search", 0x7ffff7adcd2d "cap_fowner", 0x7ffff7adcd38 "cap_fsetid", 0x7ffff7adcd43 "cap_kill", 0x7ffff7adcd4c "cap_setgid", 
  0x7ffff7adcd57 "cap_setuid", 0x7ffff7adcd62 "cap_setpcap", 0x7ffff7adcd6e "cap_linux_immutable", 0x7ffff7adcd82 "cap_net_bind_service", 0x7ffff7adcd97 "cap_net_broadcast", 0x7ffff7adcda9 "cap_net_admin", 0x7ffff7adcdb7 "cap_net_raw", 
  0x7ffff7adcdc3 "cap_ipc_lock", 0x7ffff7adcdd0 "cap_ipc_owner", 0x7ffff7adcdde "cap_sys_module", 0x7ffff7adcded "cap_sys_rawio", 0x7ffff7adcdfb "cap_sys_chroot", 0x7ffff7adce0a "cap_sys_ptrace", 0x7ffff7adce19 "cap_sys_pacct", 
  0x7ffff7adce27 "cap_sys_admin", 0x7ffff7adce35 "cap_sys_boot", 0x7ffff7adce42 "cap_sys_nice", 0x7ffff7adce4f "cap_sys_resource", 0x7ffff7adce60 "cap_sys_time", 0x7ffff7adce6d "cap_sys_tty_config", 0x7ffff7adce80 "cap_mknod", 
  0x7ffff7adce8a "cap_lease", 0x7ffff7adce94 "cap_audit_write", 0x7ffff7adcea4 "cap_audit_control", 0x7ffff7adceb6 "cap_setfcap", 0x7ffff7adcec2 "cap_mac_override", 0x7ffff7adced3 "cap_mac_admin", 0x7ffff7adcee1 "cap_syslog", 
  0x7ffff7adceec "cap_wake_alarm", 0x7ffff7adcefb "cap_block_suspend", 0x7ffff7adcf0d "cap_audit_read"}
(gdb) p sizeof(capability_names)/sizeof(*capability_names)
$41 = 38
(gdb) p capability_names[37]
$43 = 0x7ffff7adcf0d "cap_audit_read"
(gdb) p capability_names[38]
$44 = 0x0


So, the loop in capability_set_to_string_alloc() [src/basic/cap-list.c] is trying to loop over two capabilities, which are not present in the lookup table, causing capability_to_name() [src/basic/cap-list.c] to return NULL and subsequently capability_set_to_string_alloc() to return -EINVAL, leading up to the systemctl error:
(gdb) frame
#0  capability_to_name (id=id@entry=38) at ../src/basic/cap-list.c:20
20	const char *capability_to_name(int id) {
(gdb) s
25	        if (id >= (int) ELEMENTSOF(capability_names))
(gdb) s
capability_set_to_string_alloc (set=set@entry=18446744073709551615, s=s@entry=0x7fffffffd6d8) at ../src/basic/cap-list.c:71
71	                        if (!p)
(gdb) p p
$46 = 0x0
(gdb) s
Failed to parse bus message: Invalid argument


I suspect the issue should be easily resolvable by just rebuilding the systemd RPM with the new kernel.

Comment 15 David Tardon 2020-11-26 14:44:02 UTC
(In reply to Frantisek Sumsal from comment #14)
> I suspect the issue should be easily resolvable by just rebuilding the
> systemd RPM with the new kernel.

Right, it is.

Comment 16 Stanislav Kozina 2020-11-26 15:40:40 UTC
Yes, kernel-4.18.0-251.el8 adds CAP_BPF (at least). Thanks Frantisek!

Comment 17 Yauheni Kaliuta 2020-11-27 10:08:48 UTC
(In reply to David Tardon from comment #15)
> (In reply to Frantisek Sumsal from comment #14)
> > I suspect the issue should be easily resolvable by just rebuilding the
> > systemd RPM with the new kernel.
> 
> Right, it is.

Well, as a workaround. This compile-time dependency of build system kernel headers must be removed.
New capabilities can come, AFAIK one more is coming soon.

Comment 18 Alexander Bokovoy 2020-11-27 10:51:46 UTC
After change of the component keywords were removed. The bug still prevents proper gating testing.

Comment 19 Michal Sekletar 2020-11-27 11:38:17 UTC
(In reply to Alexander Bokovoy from comment #18)
> After change of the component keywords were removed. The bug still prevents
> proper gating testing.

systemd was rebuilt yesterday against kernel-headers-4.18.0-252.el8. That should mitigate the problem for now. Can you please retest and report back?

Comment 20 Alexander Bokovoy 2020-11-27 12:30:00 UTC
(In reply to Michal Sekletar from comment #19)
> (In reply to Alexander Bokovoy from comment #18)
> > After change of the component keywords were removed. The bug still prevents
> > proper gating testing.
> 
> systemd was rebuilt yesterday against kernel-headers-4.18.0-252.el8. That
> should mitigate the problem for now. Can you please retest and report back?

I asked RHEL IdM QE to verify it today.

Comment 21 Stanislav Kozina 2020-11-27 14:17:09 UTC
> Well, as a workaround. This compile-time dependency of build system kernel
> headers must be removed.
> New capabilities can come, AFAIK one more is coming soon.

Indeed -- I suppose it would also be required for eg. UBI container running on a non-RHEL kernel?
That's not a supported scenario, but we say it should work.
Michal, Frantisek, do we plan proper fix for this?

Comment 22 Pavel Cahyna 2020-11-27 14:32:44 UTC
(In reply to Stanislav Kozina from comment #21)
> > Well, as a workaround. This compile-time dependency of build system kernel
> > headers must be removed.
> > New capabilities can come, AFAIK one more is coming soon.
> 
> Indeed -- I suppose it would also be required for eg. UBI container running
> on a non-RHEL kernel?
> That's not a supported scenario, but we say it should work.

Why only non-RHEL? Shouldn't running container images of older RHEL releases (userland) be working and even supported on newer RHEL kernels? Wouldn't the issue affect this scenario too?

Comment 23 Stanislav Kozina 2020-11-27 14:47:23 UTC
(In reply to Pavel Cahyna from comment #22)
> Why only non-RHEL? Shouldn't running container images of older RHEL releases
> (userland) be working and even supported on newer RHEL kernels? Wouldn't the
> issue affect this scenario too?

Yes, I think you should be able to run RHEL-7 UBI on RHEL-8 host / kernel, and vice versa, and in such case I think it's even fully supported installation. Sorry, the non-RHEL example for misleading simplification, it really is any other kernel.

Comment 24 Javier Peña 2020-12-16 14:34:35 UTC
This was seen a few months ago in Fedora [1]. There is a systemd fix [2], and also a workaround from the Ansible point of view [3].

We are also hitting the same issue in CentOS 8 Stream, with systemd 239-43 and kernel 4.18.0-257.

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1853736
[2] - https://github.com/systemd/systemd/pull/16420/files
[3] - https://github.com/ansible/ansible/issues/71528

Comment 25 Stef Walter 2020-12-21 16:39:22 UTC
CentOS Stream bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1908275

Comment 27 David Tardon 2021-01-04 15:19:01 UTC
*** Bug 1910430 has been marked as a duplicate of this bug. ***

Comment 28 David Tardon 2021-01-05 09:47:34 UTC
*** Bug 1908275 has been marked as a duplicate of this bug. ***

Comment 31 Asaf Rachmani 2021-02-04 15:07:04 UTC
I've encountered the same issue with systemd-239-43.el8.x86_64 on CentOS Stream 8 but using systemd-239-44.el8.x86_64 fixed that.