Description of problem:

After using pcs resource restart haproxy-bundle to restart the haproxy containers, the containers are killed on controller-1 and controller-2.

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:
100%

Steps to Reproduce:
1. Restart haproxy with pcs resource restart haproxy-bundle
2. Use pcs status to view the status

Actual results:
haproxy containers killed on controller-1 and controller-2

Expected results:
haproxy containers should be restarted on all controllers

Additional info:

[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr 9 23:44:07 2018
Last change: Mon Apr 9 23:44:05 2018 by hacluster via crmd on overcloud-controller-2

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud-controller-0
   rabbitmq-bundle-1   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud-controller-1
   rabbitmq-bundle-2   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud-controller-2
 Docker container set: galera-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0     (ocf::heartbeat:galera):             Master overcloud-controller-0
   galera-bundle-1     (ocf::heartbeat:galera):             Master overcloud-controller-1
   galera-bundle-2     (ocf::heartbeat:galera):             Master overcloud-controller-2
 Docker container set: redis-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0      (ocf::heartbeat:redis):              Master overcloud-controller-0
   redis-bundle-1      (ocf::heartbeat:redis):              Slave overcloud-controller-1
   redis-bundle-2      (ocf::heartbeat:redis):              Slave overcloud-controller-2
 ip-192.168.24.54     (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.21.0.100      (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.16.0.10       (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.16.0.14       (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.18.0.18       (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.19.0.13       (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 Docker container set: haproxy-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0   (ocf::heartbeat:docker):   Started overcloud-controller-0
   haproxy-bundle-docker-1   (ocf::heartbeat:docker):   Stopped
   haproxy-bundle-docker-2   (ocf::heartbeat:docker):   Stopped
 Docker container: openstack-cinder-volume [docker-registry.engineering.redhat.com/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0   (ocf::heartbeat:docker):   Started overcloud-controller-1

==============================================================================

In /var/log/messages:

Apr 9 19:49:09 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886021]: ERROR: Newly created docker container exited after start
Apr 9 19:49:09 overcloud-controller-1 lrmd[20848]: notice: haproxy-bundle-docker-2_start_0:886021:stderr [ ocf-exit-reason:waiting on monitor_cmd to pass after start ]
Apr 9 19:49:09 overcloud-controller-1 lrmd[20848]: notice: haproxy-bundle-docker-2_start_0:886021:stderr [ ocf-exit-reason:Newly created docker container exited after start ]
Apr 9 19:49:09 overcloud-controller-1 crmd[20851]: notice: Result of start operation for haproxy-bundle-docker-2 on overcloud-controller-1: 1 (unknown error)
Apr 9 19:49:09 overcloud-controller-1 crmd[20851]: notice: overcloud-controller-1-haproxy-bundle-docker-2_start_0:159 [ ocf-exit-reason:waiting on monitor_cmd to pass after start\nocf-exit-reason:Newly created docker container exited after start\n ]
Apr 9 19:49:10 overcloud-controller-1 dockerd-current: time="2018-04-09T23:49:10.004764059Z" level=error msg="Handler for POST /v1.26/containers/haproxy-bundle-docker-2/stop?t=10 returned error: Container haproxy-bundle-docker-2 is already stopped"
Apr 9 19:49:10 overcloud-controller-1 dockerd-current: time="2018-04-09T23:49:10.005303162Z" level=error msg="Handler for POST /v1.26/containers/haproxy-bundle-docker-2/stop returned error: Container haproxy-bundle-docker-2 is already stopped"
Apr 9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: INFO: haproxy-bundle-docker-2
Apr 9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: NOTICE: Cleaning up inactive container, haproxy-bundle-docker-2.
Apr 9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: INFO: haproxy-bundle-docker-2
Apr 9 19:49:10 overcloud-controller-1 crmd[20851]: notice: Result of stop operation for haproxy-bundle-docker-2 on overcloud-controller-1: 0 (ok)
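In case it helps with triage, a quick way to see what happened to the container after the failed start (run on the affected controller; note from the log above that the docker resource agent cleans up the stopped container almost immediately, so it may already be gone):

    [root@overcloud-controller-1 heat-admin]# docker ps -a | grep haproxy-bundle
    [root@overcloud-controller-1 heat-admin]# docker logs haproxy-bundle-docker-2

If the container has already been removed, starting it by hand (as done in the comments below) reproduces the same error.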
EDIT: I have noticed that the haproxy containers are also stopped on controller-1 and controller-2 on a fresh deployment.
On the machines we can see that the problem is really specific to the container, and it can be reproduced by starting the container by hand:

[ALERT] 099/133036 (10) : Starting proxy opendaylight_ws: cannot bind socket [172.16.0.15:8185]
[ALERT] 099/133036 (10) : Starting proxy opendaylight_ws: cannot bind socket [192.168.24.59:8185]

This would suggest that the ports haproxy wants to use are already occupied by something else, but what we actually have on the controller is:

[root@overcloud-controller-1 heat-admin]# netstat -nlp | grep 8185
tcp        0      0 172.16.0.20:8185        0.0.0.0:*               LISTEN      496289/java

So the local IP of the machine, 172.16.0.20, correctly listens with the opendaylight service (driven by the container), and nothing else is on that port. One notable detail is that controller-1 does not have any VIP on it, and the problem does not happen on controller-0, where the VIP lives. Commenting out the opendaylight_ws section in /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg on the machine makes haproxy start, but it remains to be understood why it cannot bind the port.
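For reference, the failing section in /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg looks roughly like this (a sketch: the bind addresses are the ones from the ALERT messages above, backend server lines omitted):

    listen opendaylight_ws
      bind 172.16.0.15:8185          # not assigned to controller-1, so the bind fails
      bind 192.168.24.59:8185        # not assigned to controller-1 either
      ...

Since neither address is configured on controller-1, bind() fails with "cannot assign requested address" and haproxy exits at startup, which is why the container dies right after start.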
Just a quick update: when the VIP moves to controller-2, the haproxy container is started on controller-2 and stopped on the others. So haproxy appears to run only on the controller that currently holds the VIP.
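To confirm where a VIP currently lives you can cross-check pcs with the local interfaces, e.g. for one of the internal VIPs from the status output above (adjust the address to your deployment):

    [root@overcloud-controller-0 heat-admin]# pcs status | grep IPaddr2
    [root@overcloud-controller-1 heat-admin]# ip -o addr show | grep 172.16.0.10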
The reason it is not starting on the non-VIP controller nodes is that the haproxy bind for this service is not set to transparent. Since those nodes do not have the VIP configured locally, haproxy cannot bind to those addresses and therefore will not start. Most other services use transparent mode, which allows haproxy to start even when the referenced bind address is not present on the node. So the behavior here is expected.

However, is the behavior correct? For both the Zaqar websocket and ODL websocket services we are not using transparent binding, with a note from Juan indicating it was done intentionally:

  if $zaqar_ws {
    ::tripleo::haproxy::endpoint { 'zaqar_ws':
      public_virtual_ip         => $public_virtual_ip,
      internal_ip               => hiera('zaqar_ws_vip', $controller_virtual_ip),
      service_port              => $ports[zaqar_ws_port],
      ip_addresses              => hiera('zaqar_ws_node_ips', $controller_hosts_real),
      server_names              => hiera('zaqar_ws_node_names', $controller_hosts_names_real),
      mode                      => 'http',
      haproxy_listen_bind_param => [],  # We don't use a transparent proxy here
      ...

I'm guessing there is some issue with using a transparent proxy with websockets, but we need Juan to tell us what the original issue was.
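If websockets can in fact tolerate transparent binding, the fix would presumably be to pass 'transparent' in haproxy_listen_bind_param for the opendaylight_ws (and possibly zaqar_ws) endpoint instead of the empty list, i.e. a sketch only, other parameters unchanged:

      haproxy_listen_bind_param => ['transparent'],

which renders the bind lines in haproxy.cfg as, for example:

      bind 172.16.0.15:8185 transparent

and allows haproxy to bind an address that is not currently assigned to the node. This is the same mechanism the other endpoints rely on.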
Changed the HAProxy configuration for opendaylight_ws to include transparent on the controllers and restarted haproxy-bundle (a sketch of the change is shown after the Rally results below). Tried the VM boot/ping scenario: VMs go into ACTIVE as expected and are pingable.

--------------------------------------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------------------+
|                                                        Response Times (sec)                                                        |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                         | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_router          | 1.62      | 1.899        | 3.281        | 3.89         | 4.327     | 2.259     | 100.0%  | 50    |
| neutron.create_network         | 0.246     | 0.462        | 0.612        | 0.689        | 0.774     | 0.444     | 100.0%  | 50    |
| neutron.create_subnet          | 0.582     | 0.849        | 1.014        | 1.038        | 1.421     | 0.854     | 100.0%  | 50    |
| neutron.add_interface_router   | 2.029     | 2.42         | 2.859        | 2.966        | 3.152     | 2.453     | 100.0%  | 50    |
| nova.boot_server               | 38.003    | 77.645       | 90.082       | 91.992       | 92.988    | 75.522    | 100.0%  | 50    |
| vm.attach_floating_ip          | 3.779     | 5.075        | 5.671        | 5.786        | 6.557     | 5.051     | 100.0%  | 50    |
| -> neutron.create_floating_ip  | 1.374     | 1.711        | 2.074        | 2.132        | 2.156     | 1.752     | 100.0%  | 50    |
| -> nova.associate_floating_ip  | 2.029     | 3.24         | 3.978        | 4.175        | 4.655     | 3.298     | 100.0%  | 50    |
| vm.wait_for_ping               | 0.019     | 0.023        | 0.028        | 0.029        | 121.23    | 4.851     | 96.0%   | 50    |
| total                          | 47.474    | 88.715       | 101.95       | 103.039      | 215.239   | 91.436    | 96.0%   | 50    |
| -> duration                    | 46.474    | 87.715       | 100.95       | 102.039      | 214.239   | 90.436    | 96.0%   | 50    |
| -> idle_duration               | 1.0       | 1.0          | 1.0          | 1.0          | 1.0       | 1.0       | 96.0%   | 50    |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
FYI, this test was to launch and delete 50 VMs at a concurrency of 8.
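For the record, the manual change was along these lines (a sketch; the actual bind addresses come from each controller's own haproxy.cfg): edit /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg on each controller, add the transparent keyword to the opendaylight_ws bind lines, and restart the bundle:

    listen opendaylight_ws
      bind 172.16.0.15:8185 transparent
      bind 192.168.24.59:8185 transparent
      ...

    [root@overcloud-controller-0 heat-admin]# pcs resource restart haproxy-bundle

Note this is only a workaround on the live config-data copy; a proper fix needs to land in puppet-tripleo so the setting survives a redeploy.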
haproxy-bundle is started on all 3 controllers after this change.

[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr 16 20:10:46 2018
Last change: Mon Apr 16 19:49:34 2018 by hacluster via crmd on overcloud-controller-1

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud-controller-0
   rabbitmq-bundle-1   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud-controller-1
   rabbitmq-bundle-2   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud-controller-2
 Docker container set: galera-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0     (ocf::heartbeat:galera):             Master overcloud-controller-0
   galera-bundle-1     (ocf::heartbeat:galera):             Master overcloud-controller-1
   galera-bundle-2     (ocf::heartbeat:galera):             Master overcloud-controller-2
 Docker container set: redis-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0      (ocf::heartbeat:redis):              Master overcloud-controller-0
   redis-bundle-1      (ocf::heartbeat:redis):              Slave overcloud-controller-1
   redis-bundle-2      (ocf::heartbeat:redis):              Slave overcloud-controller-2
 ip-192.168.24.60     (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.21.0.100      (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.16.0.19       (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.16.0.10       (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.18.0.18       (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 ip-172.19.0.12       (ocf::heartbeat:IPaddr2):    Started overcloud-controller-0
 Docker container set: haproxy-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0   (ocf::heartbeat:docker):   Started overcloud-controller-0
   haproxy-bundle-docker-1   (ocf::heartbeat:docker):   Started overcloud-controller-1
   haproxy-bundle-docker-2   (ocf::heartbeat:docker):   Started overcloud-controller-2
 Docker container: openstack-cinder-volume [docker-registry.engineering.redhat.com/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0   (ocf::heartbeat:docker):   Started overcloud-controller-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
The upstream change was merged on Apr 20th; moving this to POST.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086