Description of problem:

- When running `sos report --all-logs` on an overcloud node, the underlying OS storage appears to hang for more than 120 seconds.
- In this situation the customer has no way to collect sosreports from any Red Hat OpenStack 16.2.z4 (RHEL 8.4) node, mainly in the production environment, as highlighted in the last comment added to the customer case: the real problem is that when this happens on OpenStack controller nodes, I/O-bound processes managed by Pacemaker such as galera and rabbitmq get marked as unresponsive and killed by the Pacemaker monitor. Sometimes even Pacemaker itself hangs, and the other nodes invoke the STONITH agent and fence the affected system via IPMI.

How reproducible:

- The sos version is sos-4.0-19.el8_4.noarch (the last version released for 8.4).
- The issue occurs with the latest kernel, 4.18.0-305.72.1.el8_4.x86_64.
- The problem is not observed with the older kernel 4.18.0-305.65.1.el8_4.x86_64.
- The problem was observed on several hardware configurations (Intel Xeon versus AMD EPYC CPUs, Dell PERC H730P versus Dell PERC H740P Mini, SAS spinning disks versus SAS SSDs).

Steps to Reproduce:

Run "sos report --all-logs" to generate an sosreport.

Customer request:

A fix that solves this issue and permits log collection without impacting any Red Hat OpenStack resource.
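For reference, a minimal sketch for confirming the hung-task symptom on an affected node, assuming only standard RHEL 8 tooling (journalctl, ps); it greps the kernel log for the 120-second blocked-task messages and lists processes stuck in uninterruptible sleep:

  # Look for the kernel hung-task warnings reported during the hang.
  journalctl -k | grep -i "blocked for more than 120 seconds"
  # List processes currently in D state (uninterruptible sleep) and what they wait on.
  ps -eo pid,ppid,stat,wchan:32,comm | awk '$3 ~ /^D/'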
1) Do we have a reproducer? Or how reliably does the hang happen at the customer?

2) "Not seen on the older kernel" - with respect to the reliability of the reproducer, is it "not seen (probably yet)" or "confirmed it never happened in many tests on the older kernel, while it hangs almost every time on the newer kernel"?

3) Would it be possible to test the newest sos-4.5.1-1? Or is sos running inside a container that does not have this version yet?

4) The "(ostnamed)" process - it is possible that sos calls "hostnamectl status", which (if the service is not already running) starts the /usr/lib/systemd/systemd-hostnamed process, whose name is "(ostnamed)" at the beginning. Can't the process be blocked by an AVC denial per https://access.redhat.com/solutions/6978096 ? The "(ostnamed)" process backtrace does contain:

  #3 [ffffab48656b3e90] do_mount at ffffffff8ab41e08

(Anyway, is sos behaving well here with respect to the requirement "sos should not trigger a new daemon process / service to be running"? When systemd-hostnamed.service is (usually) inactive, running "hostnamectl status" does start it - for a while, until it automatically stops. I think this is still OK.)

5) Do I get #c2 right that the semaphore lock is between grub2-probe and the "(ostnamed)" process, while the sos and rpm processes just wait in the queue? sosreport can trigger grub2-probe processes when it calls:

  GRUB_DISABLE_OS_PROBER=true /bin/sh /usr/sbin/grub2-mkconfig --no-grubenv-update

which triggers a lot of child processes like:

  /usr/sbin/grub2-probe --device /dev/sda1 --target=abstraction
  /usr/sbin/grub2-probe --device /dev/sda1 --target=fs_uuid
  /usr/sbin/grub2-probe --device /dev/sda1 --target=compatibility_hint
  /usr/sbin/grub2-probe --device /dev/sda1 --target=partmap
  .. (and similar targets for all devices)

If so, THEN we should be able to reproduce the hang by concurrently running these two commands on an overcloud node:

  GRUB_DISABLE_OS_PROBER=true /bin/sh /usr/sbin/grub2-mkconfig --no-grubenv-update
  hostnamectl status

grub2-mkconfig is called from the grub2 plugin, hostnamectl status from the host plugin; the plugins are alphabetically close to each other, so they are usually executed concurrently.

Again iterating question 1: can we reproduce it somewhere (with / via sosreport)? If so, can we try running the two commands concurrently, repeatedly, to hit various interleavings of their execution, with delays between the attempts to let /usr/lib/systemd/systemd-hostnamed stop first (to really get the "(ostnamed)" process name)? See the sketch after this comment.

Possible workaround: run "sos report --all-logs --threads 1" to prevent sosreport from calling plugins concurrently. Does this prevent the system hang? (If so, it is another argument for the two commands stepping on each other's toes.)
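A rough sketch of the concurrency test I have in mind; the offsets and the 60-second settle delay are assumptions, and this should only be run on a node where a hang is acceptable:

  #!/bin/bash
  # Run the two suspected commands concurrently with varying offsets, waiting
  # between attempts so systemd-hostnamed auto-stops again (and the next start
  # shows up under the "(ostnamed)" process name). Test node only - may hang.
  for delay in 0 0.2 0.5 1 2; do
      GRUB_DISABLE_OS_PROBER=true /bin/sh /usr/sbin/grub2-mkconfig --no-grubenv-update >/dev/null 2>&1 &
      sleep "$delay"
      hostnamectl status >/dev/null &
      wait                     # both commands finished (or we are stuck right here)
      echo "delay=$delay: no hang observed"
      sleep 60                 # let systemd-hostnamed stop before the next attempt
  done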
Hello,

The kernel versions used in the old RHOSP releases (16.2.3 in TEST and SILVER, 16.2.0 in GOLD) are listed below [1].

About the questions raised in the previous post:

- The issue can be reproduced on the customer environment (overcloud nodes). From the feedback received, this is a systematic issue.
- I can check with the customer whether we are hitting https://access.redhat.com/solutions/6978096
- It should be possible to test sos-4.5.1-1 in the TEST environment. I'm going to share this request with the customer.
- About the last point [2], both tests should be feasible in the customer environment. I'm going to share both requests with the customer.

First test: "Again iterating question 1: can we reproduce it somewhere (with / via sosreport)? If so, can we try running the two commands concurrently, repeatedly, to hit various interleavings of their execution, with delays between the attempts to let /usr/lib/systemd/systemd-hostnamed stop first (to really have the "(ostnamed)" process name)?"

Second test: "Possible workaround: run "sos report --all-logs --threads 1" to prevent sosreport from calling plugins concurrently. Does this prevent the system hang?" (See the sketch after this comment.)

BR
Riccardo

[1]
# TEST and SILVER (pre-production) RHOSP environments were updated from RHOSP 16.2.3 (Z3) to RHOSP 16.2.4 (Z4)
- from Red Hat OpenStack Platform release 16.2.3 (Train)
- Kernel PKGs history up to 16.2.3 (before the RHOSP upgrade to 16.2.4):

kernel-4.18.0-305.49.1.el8_4.x86_64             Wed Jul 27 10:08:28 2022  <= this is the last installed
kernel-core-4.18.0-305.19.1.el8_4.x86_64        Fri Oct 8 13:24:43 2021
kernel-core-4.18.0-305.30.1.el8_4.x86_64        Fri Feb 11 00:32:52 2022
kernel-core-4.18.0-305.49.1.el8_4.x86_64        Wed Jul 27 10:07:39 2022
kernel-modules-4.18.0-305.19.1.el8_4.x86_64     Fri Oct 8 13:24:47 2021
kernel-modules-4.18.0-305.30.1.el8_4.x86_64     Fri Feb 11 00:32:57 2022
kernel-modules-4.18.0-305.49.1.el8_4.x86_64     Wed Jul 27 10:07:46 2022
kernel-tools-4.18.0-305.49.1.el8_4.x86_64       Wed Jul 27 10:07:36 2022
kernel-tools-libs-4.18.0-305.49.1.el8_4.x86_64  Wed Jul 27 10:07:36 2022

# GOLD (production) RHOSP environment was updated from RHOSP 16.2.0 to RHOSP 16.2.4 (Z4)
- from Red Hat OpenStack Platform release 16.2.0 GA (Train)
- Kernel PKGs history up to 16.2.0 (before the RHOSP upgrade to 16.2.4):

kernel-4.18.0-305.12.1.el8_4.x86_64             Thu Sep 2 14:35:07 2021
kernel-4.18.0-305.19.1.el8_4.x86_64             Thu Oct 21 12:39:12 2021  <= this is the last installed
kernel-core-4.18.0-305.12.1.el8_4.x86_64        Thu Sep 2 14:34:26 2021
kernel-core-4.18.0-305.19.1.el8_4.x86_64        Thu Oct 21 12:38:08 2021
kernel-modules-4.18.0-305.12.1.el8_4.x86_64     Thu Sep 2 14:34:33 2021
kernel-modules-4.18.0-305.19.1.el8_4.x86_64     Thu Oct 21 12:38:12 2021
kernel-tools-4.18.0-305.19.1.el8_4.x86_64       Thu Oct 21 12:38:43 2021
kernel-tools-libs-4.18.0-305.19.1.el8_4.x86_64  Thu Oct 21 12:38:41 2021

[2] "THEN we should be able to reproduce the hang by concurrently running these two commands on an overcloud node:

  GRUB_DISABLE_OS_PROBER=true /bin/sh /usr/sbin/grub2-mkconfig --no-grubenv-update
  hostnamectl status

grub2-mkconfig is called from the grub2 plugin, hostnamectl status from the host plugin; the plugins are alphabetically close to each other, so they are usually executed concurrently. Again iterating question 1: can we reproduce it somewhere (with / via sosreport)? If so, can we try running the two commands concurrently, repeatedly, to hit various interleavings of their execution, with delays between the attempts to let /usr/lib/systemd/systemd-hostnamed stop first (to really have the "(ostnamed)" process name)?

Possible workaround: run "sos report --all-logs --threads 1" to prevent sosreport from calling plugins concurrently. Does this prevent the system hang? (If so, it is another argument for the two commands stepping on each other's toes.)"
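For the second test, a minimal sketch of how the workaround could be verified; the --batch flag (non-interactive mode) and the one-hour journal window are my additions, not part of the original suggestion:

  # Run sos single-threaded so plugins cannot overlap, then check the kernel
  # log of the last hour for hung-task messages.
  sos report --all-logs --threads 1 --batch
  journalctl -k --since "-1h" | grep "blocked for more than" || echo "no hung-task messages"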
Hello,

These are the AVC events observed in the sosreport (sosreport-20230314-095845, generated on Mar 14 2023) attached to the customer case 03461361. As highlighted below, no denial related to "(ostnamed)" was collected.

BR
Riccardo

[rbruzzon@supportshell-1 audit]$ pwd
/home/remote/rbruzzon/03461361/sosreport-20230314-095845/host0010-cdm/var/log/audit
[rbruzzon@supportshell-1 audit]$ ll
total 29084
-rw-rw-rw-. 1 yank yank 4597752 Mar 14 10:04 audit.log
-rw-rw-rw-. 1 yank yank 6291540 Mar 13 20:57 audit.log.1
-rw-rw-rw-. 1 yank yank 6291498 Mar 13 02:59 audit.log.2
-rw-rw-rw-. 1 yank yank 6291504 Mar 12 08:55 audit.log.3
-rw-rw-rw-. 1 yank yank 6291501 Mar 11 14:48 audit.log.4
[rbruzzon@supportshell-1 audit]$ for i in $(ls -la | awk '{print $9}' | grep audit); do echo "#----------------- $i ---------------#"; ausearch -m AVC -if $i; done
#----------------- audit.log ---------------#
<no matches>
#----------------- audit.log.1 ---------------#
----
time->Mon Mar 13 11:01:54 2023
type=PROCTITLE msg=audit(1678705314.936:1045162): proctitle=2F7573722F6C6962657865632F71656D752D6B766D002D6E616D650067756573743D696E7374616E63652D30303031333062322C64656275672D746872656164733D6F6E002D53002D6F626A656374007365637265742C69643D6D61737465724B6579302C666F726D61743D7261772C66696C653D2F7661722F6C69622F6C69
type=PATH msg=audit(1678705314.936:1045162): item=1 name="/var/lib/libvirt/qemu/org.qemu.guest_agent.0.instance-000130b2.sock" inode=1413482438 dev=08:03 mode=0140775 ouid=107 ogid=107 rdev=00:00 obj=system_u:object_r:container_file_t:s0 nametype=DELETE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1678705314.936:1045162): item=0 name="/var/lib/libvirt/qemu/" inode=1413481460 dev=08:03 mode=040755 ouid=107 ogid=107 rdev=00:00 obj=system_u:object_r:container_file_t:s0 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1678705314.936:1045162): cwd="/"
type=SYSCALL msg=audit(1678705314.936:1045162): arch=c000003e syscall=87 success=no exit=-13 a0=7f58e80098c0 a1=4 a2=0 a3=a items=2 ppid=56355 pid=153384 auid=4294967295 uid=107 gid=107 euid=107 suid=107 fsuid=107 egid=107 sgid=107 fsgid=107 tty=(none) ses=4294967295 comm="qemu-kvm" exe="/usr/libexec/qemu-kvm" subj=system_u:system_r:svirt_t:s0:c667,c789 key=(null)
type=AVC msg=audit(1678705314.936:1045162): avc: denied { unlink } for pid=153384 comm="qemu-kvm" name="org.qemu.guest_agent.0.instance-000130b2.sock" dev="sda3" ino=1413482438 scontext=system_u:system_r:svirt_t:s0:c667,c789 tcontext=system_u:object_r:container_file_t:s0 tclass=sock_file permissive=0
#----------------- audit.log.2 ---------------#
<no matches>
#----------------- audit.log.3 ---------------#
<no matches>
#----------------- audit.log.4 ---------------#
----
time->Fri Mar 10 21:39:16 2023
type=PROCTITLE msg=audit(1678484356.962:966412): proctitle=2F7573722F6C6962657865632F71656D752D6B766D002D6E616D650067756573743D696E7374616E63652D30303031336261332C64656275672D746872656164733D6F6E002D53002D6F626A656374007365637265742C69643D6D61737465724B6579302C666F726D61743D7261772C66696C653D2F7661722F6C69622F6C69
type=PATH msg=audit(1678484356.962:966412): item=1 name="/var/lib/libvirt/qemu/org.qemu.guest_agent.0.instance-00013ba3.sock" inode=1413482434 dev=08:03 mode=0140775 ouid=107 ogid=107 rdev=00:00 obj=system_u:object_r:container_file_t:s0 nametype=DELETE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1678484356.962:966412): item=0 name="/var/lib/libvirt/qemu/" inode=1413481460 dev=08:03 mode=040755 ouid=107 ogid=107 rdev=00:00 obj=system_u:object_r:container_file_t:s0 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1678484356.962:966412): cwd="/"
type=SYSCALL msg=audit(1678484356.962:966412): arch=c000003e syscall=87 success=no exit=-13 a0=7f732400c040 a1=4 a2=0 a3=c items=2 ppid=56355 pid=877319 auid=4294967295 uid=107 gid=107 euid=107 suid=107 fsuid=107 egid=107 sgid=107 fsgid=107 tty=(none) ses=4294967295 comm="qemu-kvm" exe="/usr/libexec/qemu-kvm" subj=system_u:system_r:svirt_t:s0:c4,c688 key=(null)
type=AVC msg=audit(1678484356.962:966412): avc: denied { unlink } for pid=877319 comm="qemu-kvm" name="org.qemu.guest_agent.0.instance-00013ba3.sock" dev="sda3" ino=1413482434 scontext=system_u:system_r:svirt_t:s0:c4,c688 tcontext=system_u:object_r:container_file_t:s0 tclass=sock_file permissive=0
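A more targeted variant of the check above, a sketch that filters the same ausearch output for only the processes involved in the hang (the process names in the grep pattern are the ones discussed in this bug):

  # Search all rotated audit logs for AVC denials mentioning systemd-hostnamed
  # ("(ostnamed)") or grub2-probe.
  for f in /var/log/audit/audit.log*; do
      echo "### $f"
      ausearch -m AVC -if "$f" 2>/dev/null | grep -E 'hostnamed|ostnamed|grub2' || echo "<no matches>"
  done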
Maybe the AVC part is unrelated to this hang, then.

A small correction to the suspected triggering commands:

- "hostnamectl status" was really triggered (most probably via sosreport), but it succeeded fine:

Mar 14 10:59:04 HOSTNAME dbus-daemon[3256]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.512127' (uid=0 pid=231589 comm="hostnamectl status " label="unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023")
Mar 14 10:59:04 HOSTNAME systemd[1]: Starting Hostname Service...
Mar 14 10:59:04 HOSTNAME dbus-daemon[3256]: [system] Successfully activated service 'org.freedesktop.hostname1'
Mar 14 10:59:04 HOSTNAME systemd[1]: Started Hostname Service.
Mar 14 10:59:34 HOSTNAME systemd[1]: systemd-hostnamed.service: Succeeded.

- BUT "systemd-analyze plot" (also one of the commands collected by sosreport, so most probably triggered by it) failed:

Mar 14 11:01:49 HOSTNAME dbus-daemon[3256]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.512217' (uid=0 pid=238905 comm="systemd-analyze plot " label="unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023")
Mar 14 11:01:49 HOSTNAME systemd[1]: Starting Hostname Service...
Mar 14 11:02:14 HOSTNAME dbus-daemon[3256]: [system] Failed to activate service 'org.freedesktop.hostname1': timed out (service_start_timeout=25000ms)
Mar 14 11:03:19 HOSTNAME systemd[1]: systemd-hostnamed.service: start operation timed out. Terminating.
Mar 14 11:04:27 HOSTNAME kernel: INFO: task (ostnamed):238906 blocked for more than 120 seconds.
Mar 14 11:04:27 HOSTNAME kernel: (ostnamed) D 0 238906 1 0x00000084
Mar 14 11:04:38 HOSTNAME systemd[1]: systemd-hostnamed.service: Failed with result 'timeout'.
Mar 14 11:04:38 HOSTNAME systemd[1]: Failed to start Hostname Service.

("systemd-analyze plot" also activates the service whose process name starts out as "(ostnamed)".)

- grub2-probe was triggered at 11:00:24, since:

Mar 14 11:02:24 HOSTNAME kernel: INFO: task grub2-probe:231489 blocked for more than 120 seconds.

But why did it take so long to execute, if the "offending" hostnamed service was only initiated at 11:01:49? The whole grub2-mkconfig run takes 1-2 seconds on my (small) system. This smells fishy to me. Is it normal that grub2-probe takes so much time? How long does

  time GRUB_DISABLE_OS_PROBER=true /bin/sh /usr/sbin/grub2-mkconfig --no-grubenv-update

take? If it takes over a minute, then (after that minute) try running "systemd-analyze plot" concurrently.

Anyway, the more I investigate, the more it seems the problem is outside sosreport - calling some system commands in this environment seems to be the trigger (maybe sos calls them inappropriately? The maintainers of grub2 and/or hostnamectl/systemd-analyze should tell us / should be in charge of explaining the backtraces of their processes).
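A rough sketch of that timing test; the 60-second offset and the /tmp output path are illustrative, and it should only be run on a test node since it may reproduce the hang:

  # Time the grub2-mkconfig invocation sos uses; start "systemd-analyze plot"
  # roughly a minute in to mimic the overlap seen in the logs above.
  ( sleep 60 && systemd-analyze plot > /tmp/boot-plot.svg ) &
  time GRUB_DISABLE_OS_PROBER=true /bin/sh /usr/sbin/grub2-mkconfig --no-grubenv-update > /dev/null
  wait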
Hello,

Below is the customer's feedback received in case 03461361.

BR
Riccardo

#-------------- Customer feedback -----------------#

ausearch reports no matches.

As for the reproducers, we were unable to reproduce the issue by running those two commands or by using sos report in single-thread mode, while running sos report --all-logs continues to trigger the issue in an unpredictable way.

In general, we have been observing a general slowdown of I/O-bound processes on our controllers since we upgraded to z4, to the point that we get timeouts from nova when cinder-volume needs to process a 6GB qcow2 image to build a volume, or Galera ISTs take so long after a node restart that Pacemaker marks the receiving Galera instance as FAILED and picks it back up as Master only after running pcs resource cleanup. We are also observing sporadic cluster-wide deaths of Galera caused by Pacemaker monitor timeouts without any specific reason (no fencing, no network issues, just monitor timeouts followed by Galera unceremoniously getting killed by Pacemaker, at which point we have to set safe_to_bootstrap in grastate.dat for the correct node to bring the cluster back online). We should probably open a separate case (for which collecting a sosreport dump would be difficult, to say the least), but I fear these I/O problems and the hangs caused by sosreport are just different manifestations of the same root cause.
It was revealed that (part of) the root cause was the "none" block device scheduler. I expect that a combination of:

- that scheduler
- a concurrent run of the commands:

  hostnamectl status
  GRUB_DISABLE_OS_PROBER=true /bin/sh /usr/sbin/grub2-mkconfig --no-grubenv-update

with the *right* timing is the reproducer for this bug. This is not an sos-specific issue.
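A quick sketch for checking (and, non-persistently, changing) the scheduler; "sda" and "mq-deadline" are illustrative choices, and whether switching schedulers is the right mitigation is for the kernel/storage side to confirm:

  # Show the active I/O scheduler (in brackets) for each block device.
  grep . /sys/block/*/queue/scheduler
  # Non-persistent, per-device switch away from "none" (example device: sda).
  echo mq-deadline > /sys/block/sda/queue/scheduler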
(In reply to Pavel Moravec from comment #8)
> It was revealed that (part of) the root cause was the "none" block device
> scheduler. I expect that a combination of:
>
> - that scheduler
> - a concurrent run of the commands:
>
>   hostnamectl status
>   GRUB_DISABLE_OS_PROBER=true /bin/sh /usr/sbin/grub2-mkconfig
>   --no-grubenv-update
>
> with the *right* timing is the reproducer for this bug. This is not an
> sos-specific issue.

Further investigation revealed that a logical volume was mounted with improper options, which triggered a kernel/systemd mishap independent of sos. Therefore I am closing this, as there is nothing wrong from the sos perspective.
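For completeness, a sketch for reviewing the mount options of the logical volumes on an affected node; the filesystem types are assumptions, and the specific improper options were not detailed in this bug:

  # List mounted LVM logical volumes with filesystem type and mount options,
  # to spot any unexpected option on the affected volume.
  findmnt -t xfs,ext4 -o TARGET,SOURCE,FSTYPE,OPTIONS | grep -E '^TARGET|mapper'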