Bug 1845456 - systemctl try-restart command hangs indefinitely while being executed during a yum update
Summary: systemctl try-restart command hangs indefinitely while being executed during a yum update
Keywords:
Status: CLOSED DUPLICATE of bug 1829754
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: systemd-maint
QA Contact: Frantisek Sumsal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-09 09:50 UTC by Renaud Métrich
Modified: 2023-10-06 20:30 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-11 12:14:15 UTC
Target Upstream Version:
Embargoed:



Description Renaud Métrich 2020-06-09 09:50:26 UTC
Description of problem:

A customer is seeing several of their systems hang during a "yum update" transaction on RHEL 7.6 EUS while the following packages are updated:
- lvm2
- glibc
- systemd

The "lvm2-lvmpolld.service" unit is stopped initially

Coredump analysis shows that "systemctl try-restart lvm2-lvmpolld.service" waits forever for a reply from systemd (it waits for the start to happen).

On the systemd side, systemd sits in its event wait loop.
Still, the manager shows there is 1 job pending, but I don't know how to read the hashmap to confirm that this is the job systemctl is waiting for.


Version-Release number of selected component (if applicable):

systemd-219-62.el7_6.12
glibc-2.17-260.el7_6.5 -> glibc-2.17-260.el7_6.6


How reproducible:

Often, on the customer's systems


Steps to Reproduce:
1. Run "yum update" (the transaction includes updates of lvm2, glibc and systemd)


Additional info:

systemctl coredump:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) bt
#0  0x00007f47887843f9 in __GI_ppoll (fds=fds@entry=0x7ffc1594dae0, nfds=nfds@entry=1, 
    timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:51
#1  0x00007f47887a8afe in __ppoll_chk (fds=fds@entry=0x7ffc1594dae0, nfds=nfds@entry=1, 
    timeout=<optimized out>, ss=ss@entry=0x0, fdslen=fdslen@entry=16) at ppoll_chk.c:28
#2  0x00005569d014b8b4 in ppoll (__ss=0x0, __timeout=<optimized out>, __nfds=1, __fds=0x7ffc1594dae0)
    at /usr/include/bits/poll2.h:71
#3  bus_poll (bus=bus@entry=0x5569d1c51030, need_more=need_more@entry=false, 
    timeout_usec=timeout_usec@entry=18446744073709551615) at src/libsystemd/sd-bus/sd-bus.c:2839
#4  0x00005569d014b9d3 in sd_bus_wait (bus=bus@entry=0x5569d1c51030, 
    timeout_usec=timeout_usec@entry=18446744073709551615) at src/libsystemd/sd-bus/sd-bus.c:2860
#5  0x00005569d016d201 in bus_process_wait (bus=0x5569d1c51030) at src/libsystemd/sd-bus/bus-util.c:1719
#6  bus_wait_for_jobs (d=0x5569d1c51920, quiet=false) at src/libsystemd/sd-bus/bus-util.c:1851
#7  0x00005569d01369c9 in start_unit (bus=0x5569d1c51030, args=<optimized out>)
    at src/systemctl/systemctl.c:2801
#8  0x00005569d012b246 in systemctl_main (bus_error=0, argv=0x7ffc1594ded8, argc=<optimized out>, 
    bus=0x5569d1c51030) at src/systemctl/systemctl.c:7436
#9  main (argc=<optimized out>, argv=0x7ffc1594ded8) at src/systemctl/systemctl.c:7694

(gdb) f 6
#6  bus_wait_for_jobs (d=0x5569d1c51920, quiet=false) at src/libsystemd/sd-bus/bus-util.c:1851
1851	                q = bus_process_wait(d->bus);

(gdb) p *d->jobs 
$4 = {b = {hash_ops = 0x5569d03c94e0 <string_hash_ops>, {indirect = {
        storage = 0x5569d1d6a480 "/org/freedesktop/systemd1/job/15207", 
        hash_key = '\000' <repeats 15 times>, n_entries = 0, n_buckets = 0, 
        idx_lowest_entry = 4294967040, _pad = "\000\000"}, direct = {
        storage = "\200\244\326\321iU", '\000' <repeats 27 times>, "\377\377\377\000\000"}}, 
    type = HASHMAP_TYPE_SET, has_indirect = false, n_direct_entries = 1, from_pool = true}}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

The systemctl command is waiting for job 15207 (which is "try-restart ...") to complete. There is nothing more to check here.
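
For context, bus_wait_for_jobs() (frames #5 and #6 above) essentially does the following until a JobRemoved signal has been seen for every queued job path. This is a simplified sketch, not the verbatim systemd 219 source; "pending" stands in for the set of job paths shown in the dump:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
/* Simplified sketch of the client-side wait loop; "pending" would be
 * decremented by the match callback installed for the Manager's
 * JobRemoved signal. "bus" is the sd_bus * seen in the backtrace. */
while (pending > 0) {
        r = sd_bus_process(bus, NULL);     /* dispatch queued messages/signals */
        if (r < 0)
                return r;
        if (r > 0)                         /* made progress, try again */
                continue;
        r = sd_bus_wait(bus, UINT64_MAX);  /* no timeout: this is the
                                              18446744073709551615 in frame #4 */
        if (r < 0)
                return r;
}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
So if PID 1 never emits JobRemoved for job 15207, systemctl blocks forever by design.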

--------------------------------------------------------------------------------

systemd coredump:

Initially, loading the coredump makes gdb complain and yields an unusable backtrace:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
warning: .dynamic section for "/lib64/libc.so.6" is not at the expected address (wrong library or version mismatch?)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib/systemd/systemd'.
#0  0x00007f1645df2483 in capget () at ../sysdeps/unix/syscall-template.S:83
83	T_PSEUDO_END (SYSCALL_SYMBOL)
(gdb) bt
#0  0x00007f1645df2483 in capget () at ../sysdeps/unix/syscall-template.S:83
#1  0x0000000000000000 in ?? ()
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This is because systemd was still running on the older glibc, "glibc-2.17-260.el7_6.5.x86_64", which was replaced during the update.
The "(deleted)" tags in the output below confirm this:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# eu-unstrip -n --core ./systemd-1591655006.gcore.1 
0x7f164588e000+0x204000 357693c8f1f49d93010c4e31529c07cdd2bd3d08@0x7f164588e1d8 . - /usr/lib64/libdl-2.17.so;5ed564d6 (deleted)
0x7f1645cf4000+0x3cd000 753c29c26bff7e166f7564555bdfb1e26e3c2682@0x7f1645cf4280 . - /usr/lib64/libc-2.17.so;5ed564d6 (deleted)
0x7f16460c1000+0x21c000 96900cb0ff25b26f2bbdf247de1408242e4773d8@0x7f16460c1248 . - /usr/lib64/libpthread-2.17.so;5ed564d6 (deleted)
0x7f16462dd000+0x216000 179f202998e429aa1215907f6d4c5c1bb9c90136@0x7f16462dd1d8 . - /usr/lib64/libgcc_s-4.8.5-20150702.so.1;5ed564d6 (deleted)
0x7f16464f3000+0x208000 efde2029c9a4a20be5b8d8ae7e6551ff9b5755d2@0x7f16464f31d8 . - /usr/lib64/librt-2.17.so;5ed564d6 (deleted)
0x7f16473b8000+0x224000 a527fe72908703c5972ae384e78d1850d1881ee7@0x7f16473b81d8 /usr/lib64/ld-2.17.so /usr/lib/debug/usr/lib64/ld-2.17.so.debug ld-linux-x86-64.so.2
0x7ffd2819e000+0x1000 61b336ff6146a0f51f0950a0823407cf3804bfa6@0x7ffd2819e328 . - linux-vdso.so.1
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

I doubt this is the cause of the issue, but it is clearly not ideal anyway: systemd keeps executing the old, deleted glibc until a "systemctl daemon-reexec" happens.
That re-exec may happen later during the yum transaction, but I'm not sure at all, since the newer systemd is already running here, which indicates that a daemon-reexec has already happened.
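
For reference, a daemon still mapping replaced libraries can also be spotted directly from the "(deleted)" markers in its maps file (illustrative; PID 1 shown here):
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# grep '(deleted)' /proc/1/maps
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------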

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# yum downgrade glibc*-2.17-260.el7_6.5.x86_64 *gcc*-4.8.5-36.el7_6.2.x86_64
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

After downgrading the glibc and gcc packages, I can now analyze the coredump properly.
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) bt
#0  0x00007f1645df2483 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000555ecd0addd9 in sd_event_wait (e=e@entry=0x555ecd53cf90, 
    timeout=timeout@entry=18446744073709551615) at src/libsystemd/sd-event/sd-event.c:2372
#2  0x0000555ecd0ae8ed in sd_event_run (e=0x555ecd53cf90, timeout=18446744073709551615)
    at src/libsystemd/sd-event/sd-event.c:2499
#3  0x0000555ecd00f1a3 in manager_loop (m=0x555ecd53caf0) at src/core/manager.c:2267
#4  0x0000555ecd00365b in main (argc=4, argv=0x7ffd280248f8) at src/core/main.c:1773

(gdb) frame 3
#3  0x0000555ecd00f1a3 in manager_loop (m=0x555ecd53caf0) at src/core/manager.c:2267
2267	                r = sd_event_run(m->event, wait_usec);

(gdb) p *m->jobs
$8 = {b = {hash_ops = 0x555ecd3616f0 <trivial_hash_ops>, {indirect = {storage = 0x555ecd5ab5d0 "", 
        hash_key = "\275\\\234=(-=D+\203\070\354(\212\360\062", n_entries = 1, n_buckets = 30, 
        idx_lowest_entry = 13, _pad = "\000\000"}, direct = {
        storage = "еZ\315^U\000\000\275\\\234=(-=D+\203\070\354(\212\360\062\001\000\000\000\036\000\000\000\r\000\000\000\000\000"}}, type = HASHMAP_TYPE_PLAIN, has_indirect = true, n_direct_entries = 0, 
    from_pool = true}}

(gdb) p m->jobs_in_progress_event_source 
$27 = (sd_event_source *) 0x0
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

In the backtrace above, we can see that systemd just waits for events.
In the "manager", we can see that there is one job (probably the job systemctl is waiting for), but the hashmap is too hard for me to read to confirm this.

Nevertheless, source code analysis turned up the following:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
2954 void manager_check_finished(Manager *m) {
2955         Unit *u = NULL;
2956         Iterator i;
2957 
2958         assert(m);
2959 
2960         if (hashmap_size(m->jobs) > 0) {
2961 
2962                 if (m->jobs_in_progress_event_source)
2963                         sd_event_source_set_time(m->jobs_in_progress_event_source, now(CLOCK_MONOTONIC) + JOBS_IN_PROGRESS_WAIT_USEC);
2964 
2965                 return;
2966         }
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

In a nutshell, there is still a job in m->jobs (line 2960), so we enter the branch on lines 2960-2966; but because m->jobs_in_progress_event_source is NULL (see $27 above), the sd_event_source_set_time() call on line 2963 is skipped and we return immediately on line 2965 without arming any wakeup.

What is unclear is why the job hasn't been deleted.

Comment 3 David Tardon 2020-06-10 14:50:33 UTC
(In reply to Renaud Métrich from comment #0)
> (gdb) p *m->jobs
> $8 = {b = {hash_ops = 0x555ecd3616f0 <trivial_hash_ops>, {indirect = {storage = 0x555ecd5ab5d0 "",
>         hash_key = "\275\\\234=(-=D+\203\070\354(\212\360\062", n_entries = 1, n_buckets = 30,
>         idx_lowest_entry = 13, _pad = "\000\000"}, direct = {
>         storage = "еZ\315^U\000\000\275\\\234=(-=D+\203\070\354(\212\360\062\001\000\000\000\036\000\000\000\r\000\000\000\000\000"}}, type = HASHMAP_TYPE_PLAIN, has_indirect = true, n_direct_entries = 0,
>     from_pool = true}}
> 
> (gdb) p m->jobs_in_progress_event_source 
> $27 = (sd_event_source *) 0x0
> -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
> 
> In the backtrace above, we can see that systemd just waits for events.
> In the "manager", we can see that there is one job (probably the job
> systemctl is waiting for), but the hashmap is too hard for me to read
> to confirm this.
Try

(gdb) p (Job*) hashmap_first(m->jobs)

> 
> Nevertheless, source code analysis turned up the following:
> -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
> 2954 void manager_check_finished(Manager *m) {
> 2955         Unit *u = NULL;
> 2956         Iterator i;
> 2957 
> 2958         assert(m);
> 2959 
> 2960         if (hashmap_size(m->jobs) > 0) {
> 2961 
> 2962                 if (m->jobs_in_progress_event_source)
> 2963                         sd_event_source_set_time(m->jobs_in_progress_event_source, now(CLOCK_MONOTONIC) + JOBS_IN_PROGRESS_WAIT_USEC);
> 2964 
> 2965                 return;
> 2966         }
> -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
> 
> In a nutshell, there is still a job in m->jobs (line 2960), so we enter
> the branch on lines 2960-2966; but because m->jobs_in_progress_event_source
> is NULL (see $27 above), the sd_event_source_set_time() call on line 2963
> is skipped and we return immediately on line 2965 without arming any wakeup.

This is actually possible if the job is in state JOB_WAITING. But normally a job should only get into this state if it waits for another job, so it would be a highly suspicious state for a single job :-) What's the value of m->n_running_jobs and m->run_queue?
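
Those values can be read from the existing coredump along these lines, assuming the el7 systemd 219 struct layout seen in the dumps above (if gdb refuses to evaluate the hashmap_first() call against a core file, the bucket storage has to be walked by hand instead):
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) p m->n_running_jobs
(gdb) p m->run_queue
(gdb) set $j = (Job *) hashmap_first(m->jobs)
(gdb) p $j->state
(gdb) p $j->unit->id
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------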

Comment 7 David Tardon 2020-06-11 12:14:15 UTC

*** This bug has been marked as a duplicate of bug 1829754 ***

