Bug 1565376

Summary: [Deployment] haproxy containers stopped on two controllers on fresh deployment
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: puppet-tripleo
Assignee: Tim Rozet <trozet>
Status: CLOSED ERRATA
QA Contact: Tomas Jamrisko <tjamrisk>
Severity: high
Priority: urgent
Version: 13.0 (Queens)
CC: aadam, atelang, bperkins, jjoyce, josorior, jschluet, mkolesni, nyechiel, ojanas, rscarazz, slinaber, tvignaud
Target Milestone: beta
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard: odl_deployment
Fixed In Version: puppet-tripleo-8.3.2-2.el7ost
Environment: N/A
Last Closed: 2018-06-27 13:50:52 UTC
Type: Bug

Description Sai Sindhur Malleni 2018-04-09 23:55:54 UTC
Description of problem:

After using pcs resource restart haproxy-bundle to restart haproxy containers, the containers are killed on controller-1 and controller-2

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:
100%

Steps to Reproduce:
1. Restart haproxy with "pcs resource restart haproxy-bundle"
2. Run "pcs status" to view the cluster status

Actual results:
haproxy containers killed on controller-1 and controller-2

Expected results:
haproxy containers should be restarted on all controllers

Additional info:

[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr  9 23:44:07 2018
Last change: Mon Apr  9 23:44:05 2018 by hacluster via crmd on overcloud-controller-2

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-2
 Docker container set: galera-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master overcloud-controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master overcloud-controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master overcloud-controller-2
 Docker container set: redis-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master overcloud-controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave overcloud-controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave overcloud-controller-2
 ip-192.168.24.54       (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.21.0.100        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.0.10 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.0.14 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.18.0.18 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.19.0.13 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Docker container set: haproxy-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started overcloud-controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 Docker container: openstack-cinder-volume [docker-registry.engineering.redhat.com/rhosp13/openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started overcloud-controller-1

==============================================================================

In /var/log/messages
Apr  9 19:49:09 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886021]: ERROR: Newly created docker container exited after start
Apr  9 19:49:09 overcloud-controller-1 lrmd[20848]:  notice: haproxy-bundle-docker-2_start_0:886021:stderr [ ocf-exit-reason:waiting on monitor_cmd to pass after start ]
Apr  9 19:49:09 overcloud-controller-1 lrmd[20848]:  notice: haproxy-bundle-docker-2_start_0:886021:stderr [ ocf-exit-reason:Newly created docker container exited after start ]
Apr  9 19:49:09 overcloud-controller-1 crmd[20851]:  notice: Result of start operation for haproxy-bundle-docker-2 on overcloud-controller-1: 1 (unknown error)
Apr  9 19:49:09 overcloud-controller-1 crmd[20851]:  notice: overcloud-controller-1-haproxy-bundle-docker-2_start_0:159 [ ocf-exit-reason:waiting on monitor_cmd to pass after start\nocf-exit-reason:Newly created docker container exited after start\n ]
Apr  9 19:49:10 overcloud-controller-1 dockerd-current: time="2018-04-09T23:49:10.004764059Z" level=error msg="Handler for POST /v1.26/containers/haproxy-bundle-docker-2/stop?t=10 returned error: Container haproxy-bundle-docker-2 is already stopped"
Apr  9 19:49:10 overcloud-controller-1 dockerd-current: time="2018-04-09T23:49:10.005303162Z" level=error msg="Handler for POST /v1.26/containers/haproxy-bundle-docker-2/stop returned error: Container haproxy-bundle-docker-2 is already stopped"
Apr  9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: INFO: haproxy-bundle-docker-2
Apr  9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: NOTICE: Cleaning up inactive container, haproxy-bundle-docker-2.
Apr  9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: INFO: haproxy-bundle-docker-2
Apr  9 19:49:10 overcloud-controller-1 crmd[20851]:  notice: Result of stop operation for haproxy-bundle-docker-2 on overcloud-controller-1: 0 (ok)

Comment 1 Sai Sindhur Malleni 2018-04-10 12:33:51 UTC
I have noticed that the haproxy containers are also stopped on controller-1 and controller-2 on a fresh deployment.

Comment 2 Raoul Scarazzini 2018-04-10 14:10:56 UTC
On the machines we can see that the problem with the container is quite specific; it shows up when starting the container by hand:

[ALERT] 099/133036 (10) : Starting proxy opendaylight_ws: cannot bind socket [172.16.0.15:8185]
[ALERT] 099/133036 (10) : Starting proxy opendaylight_ws: cannot bind socket [192.168.24.59:8185]

This would suggest that the ports haproxy wants to use are already occupied by something else, but in fact what we have on the controller is:

[root@overcloud-controller-1 heat-admin]# netstat -nlp|grep 8185
tcp        0      0 172.16.0.20:8185        0.0.0.0:*               LISTEN      496289/java

So the machine's local IP, 172.16.0.20, is correctly listening for the opendaylight service (run by the container), and nothing else is using the port. One notable detail is that controller-1 does not have any VIP on it, and the problem does not happen on controller-0, where the VIP lives.

Commenting out the opendaylight_ws section in /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg on the machine makes haproxy start, but it remains to be understood why it cannot bind the port.
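For reference, the generated opendaylight_ws section presumably looks something like this (a sketch, not the actual file: the bind addresses are the VIPs from the ALERT messages above, and the remaining options are assumptions). Plain bind lines like these can only succeed on the node that currently holds the VIPs:

  listen opendaylight_ws
    bind 172.16.0.15:8185
    bind 192.168.24.59:8185
    mode http
    # server lines for each controller omitted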

Comment 3 Sai Sindhur Malleni 2018-04-10 18:24:00 UTC
Just a quick update: when the VIP moves to controller-2, the haproxy container is started on controller-2 and stopped on the others. So haproxy appears to run only on the controller that currently holds the VIP.

Comment 4 Tim Rozet 2018-04-13 19:51:07 UTC
The reason it is not starting on non-VIP controller nodes is that the binding for haproxy is not set to transparent. Since those nodes do not have the VIP, haproxy cannot bind to the VIP addresses and therefore will not start. Most other services use transparent mode, which allows haproxy to start even when the referenced bind address is not present locally. So the behavior here is expected. However, is the behavior correct?

For both the Zaqar websocket and ODL websocket services we are not using transparent binding, with a note from Juan indicating this was done intentionally:

  if $zaqar_ws {
    ::tripleo::haproxy::endpoint { 'zaqar_ws':
      public_virtual_ip         => $public_virtual_ip,
      internal_ip               => hiera('zaqar_ws_vip', $controller_virtual_ip),
      service_port              => $ports[zaqar_ws_port],
      ip_addresses              => hiera('zaqar_ws_node_ips', $controller_hosts_real),
      server_names              => hiera('zaqar_ws_node_names', $controller_hosts_names_real),
      mode                      => 'http',
      haproxy_listen_bind_param => [],  # We don't use a transparent proxy here

I'm guessing there is some issue with using transparent proxy with websocket, but we need Juan to tell us what the original issue here was.
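For illustration, a sketch of what the ODL websocket endpoint could look like with transparent binding enabled (hypothetical: the parameter layout mirrors the zaqar_ws excerpt above, and the hiera keys and port variable are assumptions, not the actual puppet-tripleo code):

  if $opendaylight {
    ::tripleo::haproxy::endpoint { 'opendaylight_ws':
      internal_ip               => hiera('opendaylight_api_vip', $controller_virtual_ip),
      service_port              => $ports[opendaylight_ws_port],
      ip_addresses              => hiera('opendaylight_api_node_ips', $controller_hosts_real),
      server_names              => hiera('opendaylight_api_node_names', $controller_hosts_names_real),
      mode                      => 'http',
      haproxy_listen_bind_param => ['transparent'],  # allow binding to non-local VIPs
    }
  }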

Comment 7 Sai Sindhur Malleni 2018-04-16 20:08:28 UTC
Changed the HAProxy configuration for opendaylight_ws to include transparent on the controllers and restarted haproxy-bundle. Tried the VM boot-and-ping scenario; VMs go into ACTIVE as expected and are pingable.
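The by-hand change amounts to adding the transparent keyword to the bind lines of the opendaylight_ws section in haproxy.cfg (a sketch: the addresses are the VIPs from the ALERT messages in comment 2, and the other options are assumptions):

  listen opendaylight_ws
    bind 172.16.0.15:8185 transparent
    bind 192.168.24.59:8185 transparent
    mode http
    # server lines for each controller unchanged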

--------------------------------------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------------------+
|                                                       Response Times (sec)                                                        |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                         | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_router          | 1.62      | 1.899        | 3.281        | 3.89         | 4.327     | 2.259     | 100.0%  | 50    |
| neutron.create_network         | 0.246     | 0.462        | 0.612        | 0.689        | 0.774     | 0.444     | 100.0%  | 50    |
| neutron.create_subnet          | 0.582     | 0.849        | 1.014        | 1.038        | 1.421     | 0.854     | 100.0%  | 50    |
| neutron.add_interface_router   | 2.029     | 2.42         | 2.859        | 2.966        | 3.152     | 2.453     | 100.0%  | 50    |
| nova.boot_server               | 38.003    | 77.645       | 90.082       | 91.992       | 92.988    | 75.522    | 100.0%  | 50    |
| vm.attach_floating_ip          | 3.779     | 5.075        | 5.671        | 5.786        | 6.557     | 5.051     | 100.0%  | 50    |
|  -> neutron.create_floating_ip | 1.374     | 1.711        | 2.074        | 2.132        | 2.156     | 1.752     | 100.0%  | 50    |
|  -> nova.associate_floating_ip | 2.029     | 3.24         | 3.978        | 4.175        | 4.655     | 3.298     | 100.0%  | 50    |
| vm.wait_for_ping               | 0.019     | 0.023        | 0.028        | 0.029        | 121.23    | 4.851     | 96.0%   | 50    |
| total                          | 47.474    | 88.715       | 101.95       | 103.039      | 215.239   | 91.436    | 96.0%   | 50    |
|  -> duration                   | 46.474    | 87.715       | 100.95       | 102.039      | 214.239   | 90.436    | 96.0%   | 50    |
|  -> idle_duration              | 1.0       | 1.0          | 1.0          | 1.0          | 1.0       | 1.0       | 96.0%   | 50    |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+

Comment 8 Sai Sindhur Malleni 2018-04-16 20:10:12 UTC
FYI, this test was to launch and delete 50 VMs at a concurrency of 8.

Comment 9 Sai Sindhur Malleni 2018-04-16 20:11:14 UTC
haproxy-bundle is started on all 3 controllers after this change.


[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr 16 20:10:46 2018
Last change: Mon Apr 16 19:49:34 2018 by hacluster via crmd on overcloud-controller-1

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-2
 Docker container set: galera-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master overcloud-controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master overcloud-controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master overcloud-controller-2
 Docker container set: redis-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master overcloud-controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave overcloud-controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave overcloud-controller-2
 ip-192.168.24.60       (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.21.0.100        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.0.19 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.0.10 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.18.0.18 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.19.0.12 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Docker container set: haproxy-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started overcloud-controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started overcloud-controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started overcloud-controller-2
 Docker container: openstack-cinder-volume [docker-registry.engineering.redhat.com/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started overcloud-controller-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 10 Mike Kolesnik 2018-04-22 11:45:04 UTC
The upstream change was merged on April 20th; moving this to POST.

Comment 17 errata-xmlrpc 2018-06-27 13:50:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086