Bug 1466875 - [pacemaker/libqb integration] "crm_mon -d" cannot outlive particular pacemaker session, reconnecting to new one [NEEDINFO]
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Jan Pokorný
QA Contact: cluster-qe@redhat.com
Reported: 2017-06-30 11:39 EDT by Jan Pokorný
Modified: 2017-08-10 10:42 EDT (History)
Type: Bug
mnovacek: needinfo? (jpokorny)

Description Jan Pokorný 2017-06-30 11:39:11 EDT
# pcs cluster start --wait
# crm_mon -d --as-html /tmp/mon.html
# pcs cluster stop --wait

# pcs cluster start --wait

# strace -p $(pidof crm_mon)
> [...]
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> [...]

# gcore $(pidof crm_mon)
# gdb $(which crm_mon) core.$(pidof crm_mon)

(gdb) bt
> #0  0x00007fa5c5c1fa20 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007fa5c61497ac in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x1a9f5a0, timeout=500, context=0x1ab3080) at gmain.c:4226
> #2  g_main_context_iterate (context=0x1ab3080, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3922
> #3  0x00007fa5c6149aea in g_main_loop_run (loop=0x1a9f750) at gmain.c:4123
> #4  0x000000000040381f in main (argc=<optimized out>, argv=<optimized out>) at crm_mon.c:812

(gdb) info proc mappings
>          Start Addr           End Addr       Size     Offset objfile
[...]
>      0x7fa5c9c57000     0x7fa5c9c58000     0x1000        0x0 /dev/shm/qb-cib_ro-control-30389-30758-20 (deleted)
[...]

# ls -l /dev/shm/qb-cib_ro-control-30389-30758-20
> ls: cannot access /dev/shm/qb-cib_ro-control-30389-30758-20: No such file or directory

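After the restart, crm_mon keeps polling the stale descriptors and never
reconnects to the new pacemaker session, even though the crm_mon source
seems to suggest the connection is expected to be re-established just
fine in such a scenario.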
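
The check for stale shared-memory mappings can also be scripted without
taking a core dump. A minimal sketch, assuming a Linux /proc filesystem;
the helper name deleted_qb_mappings is hypothetical, and it takes a maps
file (such as /proc/<pid>/maps) rather than a PID so that it can be run
against captured output as well:

```shell
#!/bin/sh
# Print libqb shared-memory files that are still mapped by a process
# although they have already been unlinked from /dev/shm.
# Usage: deleted_qb_mappings <maps-file>
deleted_qb_mappings() {
    # In /proc/<pid>/maps the 6th field is the pathname; the kernel
    # appends " (deleted)" once the backing file has been unlinked.
    awk '$NF == "(deleted)" && $6 ~ /^\/dev\/shm\/qb-/ { print $6 }' "$1" |
        sort -u
}
```

Possible usage: deleted_qb_mappings /proc/$(pidof crm_mon)/maps
Any output after the cluster has been restarted would indicate that the
process still holds on to the IPC buffers of the previous session.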
Comment 1 Jan Pokorný 2017-06-30 11:44:15 EDT
Furthermore:

# lsof -p $(pidof crm_mon)
> COMMAND   PID USER   FD      TYPE             DEVICE SIZE/OFF      NODE NAME
> [...]
> crm_mon 30758 root  DEL       REG               0,17             599553 /dev/shm/qb-cib_ro-control-30389-30758-20
> [...]
> crm_mon 30758 root    5u     unix 0xffff88008ae0bc00      0t0    600577 @qb-cib_ro-30389-30758-20-response
> crm_mon 30758 root    6u     unix 0xffff88008ae0e000      0t0    600578 @qb-cib_ro-30389-30758-20-event

# kill -0 30389
> -bash: kill: (30389) - No such process

PID 30389 (judging by the names, the server side of the connection) no
longer exists, so the UNIX sockets should be long gone at this point,
yet crm_mon still holds them open.
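
The liveness check of the peer can be derived from the IPC names
themselves. A minimal sketch, under the ASSUMPTION (inferred only from
the names observed above, not from libqb documentation) that the three
trailing dash-separated fields are <server pid>-<client pid>-<fd>,
optionally followed by -response or -event; both helper names are
hypothetical:

```shell
#!/bin/sh
# Extract the presumed server PID from a libqb IPC name.
# ASSUMPTION: the name ends in <server pid>-<client pid>-<fd>, with an
# optional "-response" or "-event" suffix on socket names.
qb_server_pid() {
    printf '%s\n' "$1" |
        sed -e 's/-response$//' -e 's/-event$//' |
        awk -F- '{ print $(NF - 2) }'
}

# Succeed if the presumed server process still exists.
# Note: kill -0 also fails when the caller lacks permission to signal
# the process, so run this as root (as in the transcript above).
qb_server_alive() {
    kill -0 "$(qb_server_pid "$1")" 2>/dev/null
}
```

Possible usage:
qb_server_alive @qb-cib_ro-30389-30758-20-event || echo "peer is gone"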
Comment 2 Jan Pokorný 2017-07-04 13:31:17 EDT
Observations:

- it actually worked correctly for me several times, but most of the
  time it does not

- in the failing cases, mon_cib_connection_destroy never gets called

- this means that client->destroy_fn, from the mainloop's perspective,
  does not get called

- this in turn means that mainloop_gio_destroy does not get called


Need to find a link between a graceful cib daemon shutdown and an
attempt to run the connection, at least partially, through the
above-mentioned call chain.  Hopefully there is one.
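
One way to confirm which of the functions named above actually run
during a cib daemon shutdown is to attach gdb with breakpoints that log
and continue. A minimal sketch that only generates the gdb command
file; the helper name gdb_trace_script is hypothetical, and it is
assumed that debuginfo for pacemaker is installed so that gdb can
resolve the (possibly static) function names:

```shell
#!/bin/sh
# Emit a gdb command file that logs each hit of the given functions
# together with a short backtrace, then lets the process continue.
# Usage: gdb_trace_script <function>...
gdb_trace_script() {
    for fn in "$@"; do
        # "\\n" leaves a literal \n for gdb's echo command to interpret.
        printf 'break %s\ncommands\n  silent\n  echo hit %s\\n\n  backtrace 5\n  continue\nend\n' \
            "$fn" "$fn"
    done
    # Resume the attached process once all breakpoints are set.
    printf 'continue\n'
}
```

Possible usage:
gdb_trace_script mon_cib_connection_destroy mainloop_gio_destroy > /tmp/trace.gdb
gdb -p $(pidof crm_mon) -x /tmp/trace.gdb
followed by "pcs cluster stop --wait" from another terminal. No "hit"
line in the gdb output would match the failing case described above.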
