Bug 1447916 - cleanup of a bundle resource may result in the crmd process on the DC node taking 100 % of a CPU
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-04 08:47 UTC by Tomas Jelinek
Modified: 2019-05-01 23:53 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Tomas Jelinek 2017-05-04 08:47:01 UTC
Description of problem:
Running "crm_resource --resource <bundle resource> --cleanup" may
result in the crmd process on the DC node taking 100 % of a CPU.


Version-Release number of selected component (if applicable):
pacemaker-1.1.16-8.el7.x86_64


How reproducible:
easily, most of the time


Steps to Reproduce:
1. create a bundle (the configuration used is in Additional info; a pcs sketch follows this list)
2. run "crm_resource --resource <bundle resource> --cleanup"
3. the command may need to be run several times
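
For step 1, a sketch of how a bundle like the one in Additional info could be created with pcs. The image name and paths are taken from the configuration below; exact pcs bundle syntax varies between pcs versions, so treat this as illustrative rather than exact:

# pcs resource bundle create http-bundle \
      container image=pcmktest:http options="--log-driver=journald" replicas=2 \
      network ip-range-start=192.168.122.250 host-interface=ens3 host-netmask=24 \
      port-map port=80 \
      storage-map source-dir-root=/root/docker/www target-dir=/var/www/html options=rw \
      storage-map source-dir-root=/root/docker/logs target-dir=/etc/httpd/logs options=rw
# pcs resource create dummy1 ocf:pacemaker:Stateful \
      op monitor interval=10 role=Master timeout=20 \
      op monitor interval=11 role=Slave timeout=20 \
      bundle http-bundle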


Actual results:
# crm_resource --resource http-bundle --cleanup
Cleaning up http-bundle-docker-0 on rh73-node1, removing fail-count-http-bundle-docker-0
Cleaning up http-bundle-docker-0 on rh73-node2, removing fail-count-http-bundle-docker-0
Cleaning up http-bundle-ip-192.168.122.250 on rh73-node1, removing fail-count-http-bundle-ip-192.168.122.250
Cleaning up http-bundle-ip-192.168.122.250 on rh73-node2, removing fail-count-http-bundle-ip-192.168.122.250
Cleaning up http-bundle-0 on rh73-node1, removing fail-count-http-bundle-0
Cleaning up http-bundle-0 on rh73-node2, removing fail-count-http-bundle-0
Cleaning up http-bundle-docker-1 on rh73-node1, removing fail-count-http-bundle-docker-1
Cleaning up http-bundle-docker-1 on rh73-node2, removing fail-count-http-bundle-docker-1
Cleaning up http-bundle-ip-192.168.122.251 on rh73-node1, removing fail-count-http-bundle-ip-192.168.122.251
Cleaning up http-bundle-ip-192.168.122.251 on rh73-node2, removing fail-count-http-bundle-ip-192.168.122.251
Cleaning up http-bundle-1 on rh73-node1, removing fail-count-http-bundle-1
Cleaning up http-bundle-1 on rh73-node2, removing fail-count-http-bundle-1
Cleaning up dummy1:0 on rh73-node1, removing fail-count-dummy1
Cleaning up dummy1:0 on rh73-node2, removing fail-count-dummy1
Waiting for 14 replies from the CRMd.............. OK

Then crmd takes 100 % of a CPU. Sometimes the "OK" on the last line is missing.
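
(Not part of the original report: one quick way to confirm the runaway process on the DC node, assuming standard tools, is e.g.:

# top -b -n 1 -p "$(pidof crmd)"

which should show crmd pinned at roughly 100 % CPU.)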


Expected results:
crmd does not take 100 % of a CPU.


Additional info:
Bundle configuration:
      <bundle id="http-bundle">                                                 
        <docker image="pcmktest:http" options="--log-driver=journald" replicas="2"/>
        <network host-netmask="24" host-interface="ens3" ip-range-start="192.168.122.250">
          <port-mapping port="80" id="http-bundle-port-map-80"/>                
        </network>                                                              
        <storage>                                                               
          <storage-mapping source-dir-root="/root/docker/www" target-dir="/var/www/html" options="rw" id="http-bundle-storage-map"/>
          <storage-mapping source-dir-root="/root/docker/logs" target-dir="/etc/httpd/logs" options="rw" id="http-bundle-storage-map-1"/>
        </storage>                                                              
        <primitive class="ocf" id="dummy1" provider="pacemaker" type="Stateful">
          <operations>                                                          
            <op id="dummy1-monitor-interval-10" interval="10" name="monitor" role="Master" timeout="20"/>
            <op id="dummy1-monitor-interval-11" interval="11" name="monitor" role="Slave" timeout="20"/>
            <op id="dummy1-start-interval-0s" interval="0s" name="start" timeout="20"/>
            <op id="dummy1-stop-interval-0s" interval="0s" name="stop" timeout="20"/>
          </operations>                                                         
        </primitive>                                                            
      </bundle>
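
If it helps reproduction, raw XML like the above can also be applied by editing an offline copy of the CIB and pushing it back (a sketch; cibadmin would work as well):

# pcs cluster cib > tmp-cib.xml
# (add the <bundle> element under <resources> in tmp-cib.xml)
# pcs cluster cib-push tmp-cib.xml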

Comment 2 Andrew Beekhof 2017-05-05 01:37:31 UTC
I was talking to Klaus about this yesterday; it's something to do with the remote connection resource.

If you 'killall -TRAP crmd' a few times you see thousands of copies of:

trace   May 04 03:58:04 mainloop_gio_callback(673):0: New message from remote-lrmd-pcmk-2:1111[0x7f9def161c60] 1
trace   May 04 03:58:04 lrmd_tls_dispatch(366):0: tls dispatch triggered after disconnect

And if you think I'm exaggerating:

[root@pcmk-2 /]# qb-blackbox /var/lib/pacemaker/blackbox/crmd-349.2 | grep "trace   May 04 03:58:04 mainloop_gio_callback" | wc -l
4317


The message in particular suggests we're not correctly cleaning up dead connections.
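
For illustration only (this is not pacemaker's code, just a minimal GLib sketch of the suspected mechanism): if an I/O watch stays registered after its fd disconnects, the fd is permanently "ready" (G_IO_HUP), so the main loop re-dispatches the callback forever and spins at 100 % CPU:

/* Build: gcc demo.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>
#include <unistd.h>

static gboolean
dispatch(GIOChannel *source, GIOCondition cond, gpointer data)
{
    /* Analogous to "tls dispatch triggered after disconnect": the peer is
     * gone, but we neither drain the channel nor return FALSE, so GLib
     * re-dispatches immediately and the main loop never sleeps. */
    static unsigned long n = 0;

    if (++n % 1000000 == 0)
        fprintf(stderr, "dispatched %lu times\n", n);
    return TRUE; /* BUG: keeps the watch alive on a dead connection */
}

int
main(void)
{
    int fds[2];

    if (pipe(fds) < 0)
        return 1;
    close(fds[1]); /* simulate the remote end going away */

    GIOChannel *ch = g_io_channel_unix_new(fds[0]);
    g_io_add_watch(ch, G_IO_IN | G_IO_HUP, dispatch, NULL);

    g_main_loop_run(g_main_loop_new(NULL, FALSE)); /* spins at 100 % CPU */
    return 0;
}

Removing the watch on disconnect (returning FALSE, or g_source_remove()) stops the spin, which matches the suspicion that dead remote connections are not being cleaned up.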

Comment 3 Andrew Beekhof 2017-05-05 01:38:30 UTC
Oh, and the other symptom is that all IPC to the crmd (e.g. subsequent cleanup calls) fails.
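
(An easy way to observe the IPC symptom, not from the original report: crmadmin queries the crmd, so on a healthy node the following should return promptly, while on the stuck DC it hangs or times out.)

# crmadmin -S rh73-node1
# crmadmin -D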

Comment 4 Ken Gaillot 2017-11-01 22:38:08 UTC
This is unlikely to be solved in the 7.5 timeframe.

Comment 6 Ken Gaillot 2017-11-01 22:57:27 UTC
Whoops, picked the wrong drop-down.

