Bug 1466875 - [pacemaker/libqb integration] "crm_mon -d" cannot outlive particular pacemaker session, reconnecting to new one [NEEDINFO]
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Jan Pokorný
QA Contact: cluster-qe@redhat.com
Docs Contact:
Depends On:
Blocks:
Reported: 2017-06-30 11:39 EDT by Jan Pokorný
Modified: 2017-10-09 13:32 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mnovacek: needinfo? (jpokorny)


Attachments: None

Description Jan Pokorný 2017-06-30 11:39:11 EDT
# pcs cluster start --wait
# crm_mon -d --as-html /tmp/mon.html
# pcs cluster stop --wait

# pcs cluster start --wait

# strace -p $(pidof crm_mon)
> [...]
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> [...]

# gcore $(pidof crm_mon)
# gdb $(which crm_mon) core.$(pidof crm_mon)

(gdb) bt
> #0  0x00007fa5c5c1fa20 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007fa5c61497ac in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x1a9f5a0, timeout=500, context=0x1ab3080) at gmain.c:4226
> #2  g_main_context_iterate (context=0x1ab3080, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3922
> #3  0x00007fa5c6149aea in g_main_loop_run (loop=0x1a9f750) at gmain.c:4123
> #4  0x000000000040381f in main (argc=<optimized out>, argv=<optimized out>) at crm_mon.c:812

(gdb) info proc mappings
>          Start Addr           End Addr       Size     Offset objfile
> [...]
>      0x7fa5c9c57000     0x7fa5c9c58000     0x1000        0x0 /dev/shm/qb-cib_ro-control-30389-30758-20 (deleted)
> [...]

# ls -l /dev/shm/qb-cib_ro-control-30389-30758-20
> ls: cannot access /dev/shm/qb-cib_ro-control-30389-30758-20: No such file or directory

crm_mon itself seems to suggest that the connection is expected to be
re-established just fine in such a scenario.
Comment 1 Jan Pokorný 2017-06-30 11:44:15 EDT
Furthermore:

# lsof -p $(pidof crm_mon)
> COMMAND   PID USER   FD      TYPE             DEVICE SIZE/OFF      NODE NAME
> [...]
> crm_mon 30758 root  DEL       REG               0,17             599553 /dev/shm/qb-cib_ro-control-30389-30758-20
> [...]
> crm_mon 30758 root    5u     unix 0xffff88008ae0bc00      0t0    600577 @qb-cib_ro-30389-30758-20-response
> crm_mon 30758 root    6u     unix 0xffff88008ae0e000      0t0    600578 @qb-cib_ro-30389-30758-20-event

# kill -0 30389
> -bash: kill: (30389) - No such process

Looks like the UNIX socket should be long gone at this point, and it is not.
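
The manual check above (read the PIDs out of the endpoint name, then
"kill -0" the first one) can be automated.  What follows is only a
minimal sketch: the name layout "qb-<service>-[control-]<pid>-<pid>-<number>"
is an assumption inferred from the names in this report (30389 being the
process that no longer exists, 30758 being crm_mon), not a documented
libqb format, and the helper names are made up for illustration.

```python
import os
import re

# ASSUMPTION: layout inferred only from the names seen in this report,
#   qb-<service>-[control-]<server pid>-<client pid>-<number>
# The meaning of the trailing number is not established here.
QB_NAME = re.compile(
    r"qb-(?P<service>[A-Za-z_]+)-(?:control-)?"
    r"(?P<server_pid>\d+)-(?P<client_pid>\d+)-(?P<number>\d+)"
)


def parse_qb_name(name):
    """Return (service, server_pid, client_pid), or None if name does not match."""
    match = QB_NAME.search(name)
    if match is None:
        return None
    return (
        match.group("service"),
        int(match.group("server_pid")),
        int(match.group("client_pid")),
    )


def pid_alive(pid):
    """Equivalent of `kill -0 PID`: True if a process with that PID exists."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def stale_endpoints(names):
    """Yield the names whose (assumed) server-side PID no longer exists."""
    for name in names:
        parsed = parse_qb_name(name)
        if parsed is not None and not pid_alive(parsed[1]):
            yield name
```

Feeding it the NAME column of the lsof output would list the endpoints
that crm_mon still holds although their other end is gone.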
Comment 2 Jan Pokorný 2017-07-04 13:31:17 EDT
Observation:

- it actually worked correctly for me several times, but most of the
  time it does not

- in failing cases, mon_cib_connection_destroy never gets called

- this in turn means that client->destroy_fn, from the mainloop's
  perspective, never gets called

- which in turn means that mainloop_gio_destroy never gets called


Need to find a link between a graceful cib daemon shutdown and an
attempt to run the connection, at least partially, through the
above-mentioned call chain.  Hopefully there is one.
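
For illustration only, here is the generic pattern the chain above relies
on: a poll loop notices that the peer went away and only then invokes a
destroy callback, which is where a reconnect would be scheduled.  This is
a sketch on a plain socketpair, not pacemaker or libqb code;
run_until_hangup and on_destroy are hypothetical names, and the 500 ms
timeout merely mirrors the strace output.

```python
import select
import socket


def run_until_hangup(sock, on_destroy, timeout_ms=500, max_iterations=10):
    """Poll sock; call on_destroy() once the peer is gone.

    Returns True if the destroy callback ran, False if the loop gave up
    after max_iterations without seeing a hangup.
    """
    poller = select.poll()
    # Only POLLIN is requested; POLLHUP and POLLERR are reported regardless.
    poller.register(sock.fileno(), select.POLLIN)
    for _ in range(max_iterations):
        for _fd, revents in poller.poll(timeout_ms):
            peer_gone = bool(revents & (select.POLLHUP | select.POLLERR))
            if not peer_gone and revents & select.POLLIN:
                # A zero-length read on a stream socket also means EOF.
                peer_gone = sock.recv(4096) == b""
            if peer_gone:
                poller.unregister(sock.fileno())
                on_destroy()  # a reconnect attempt would be scheduled here
                return True
    return False


if __name__ == "__main__":
    server, client = socket.socketpair()
    server.close()  # simulate the daemon going away
    destroyed = run_until_hangup(client, lambda: print("connection destroyed"))
    print("destroy callback ran:", destroyed)
    client.close()
```

If no hangup or EOF is ever delivered on the polled descriptors, the loop
keeps timing out and the callback is never reached, which matches the
symptom seen in the strace output.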
Comment 4 Ken Gaillot 2017-10-09 13:32:24 EDT
Due to time constraints, this will not make 7.5.