Bug 1565376 - [Deployment] haproxy containers stopped on two controllers on fresh deployment
Summary: [Deployment] haproxy containers stopped on two controllers on fresh deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Tim Rozet
QA Contact: Tomas Jamrisko
URL:
Whiteboard: odl_deployment
Depends On:
Blocks:
 
Reported: 2018-04-09 23:55 UTC by Sai Sindhur Malleni
Modified: 2022-03-13 15:41 UTC
12 users

Fixed In Version: puppet-tripleo-8.3.2-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-06-27 13:50:52 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID  Status  Summary  Last Updated
Launchpad 1764514  None  None  2018-04-16 20:17:13 UTC
OpenStack gerrit 562581  MERGED  Fixes binding type for OpenDaylight Websocket  2021-01-18 02:07:24 UTC
Red Hat Issue Tracker ODL-255  None  None  2022-03-13 15:41:38 UTC
Red Hat Issue Tracker OSP-13652  None  None  2022-03-13 15:41:41 UTC
Red Hat Product Errata RHEA-2018:2086  None  None  2018-06-27 13:51:34 UTC

Internal Links: 1656411

Description Sai Sindhur Malleni 2018-04-09 23:55:54 UTC
Description of problem:

After using pcs resource restart haproxy-bundle to restart the haproxy containers, the containers are killed on controller-1 and controller-2.

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:
100%

Steps to Reproduce:
1. Restart haproxy: pcs resource restart haproxy-bundle
2. Check the resource status: pcs status

Actual results:
haproxy containers killed on controller-1 and controller-2

Expected results:
haproxy containers should be restarted on all controllers

Additional info:

[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr  9 23:44:07 2018
Last change: Mon Apr  9 23:44:05 2018 by hacluster via crmd on overcloud-controller-2

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-2
 Docker container set: galera-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master overcloud-controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master overcloud-controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master overcloud-controller-2
 Docker container set: redis-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master overcloud-controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave overcloud-controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave overcloud-controller-2
 ip-192.168.24.54       (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.21.0.100        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.0.10 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.0.14 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.18.0.18 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.19.0.13 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Docker container set: haproxy-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started overcloud-controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 Docker container: openstack-cinder-volume [docker-registry.engineering.redhat.com/rhosp13/openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started overcloud-controller-1

==============================================================================

In /var/log/messages
Apr  9 19:49:09 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886021]: ERROR: Newly created docker container exited after start
Apr  9 19:49:09 overcloud-controller-1 lrmd[20848]:  notice: haproxy-bundle-docker-2_start_0:886021:stderr [ ocf-exit-reason:waiting on monitor_cmd to pass after start ]
Apr  9 19:49:09 overcloud-controller-1 lrmd[20848]:  notice: haproxy-bundle-docker-2_start_0:886021:stderr [ ocf-exit-reason:Newly created docker container exited after start ]
Apr  9 19:49:09 overcloud-controller-1 crmd[20851]:  notice: Result of start operation for haproxy-bundle-docker-2 on overcloud-controller-1: 1 (unknown error)
Apr  9 19:49:09 overcloud-controller-1 crmd[20851]:  notice: overcloud-controller-1-haproxy-bundle-docker-2_start_0:159 [ ocf-exit-reason:waiting on monitor_cmd to pass after start\nocf-exit-reason:Newly created docker container exited after start\n ]
Apr  9 19:49:10 overcloud-controller-1 dockerd-current: time="2018-04-09T23:49:10.004764059Z" level=error msg="Handler for POST /v1.26/containers/haproxy-bundle-docker-2/stop?t=10 returned error: Container haproxy-bundle-docker-2 is already stopped"
Apr  9 19:49:10 overcloud-controller-1 dockerd-current: time="2018-04-09T23:49:10.005303162Z" level=error msg="Handler for POST /v1.26/containers/haproxy-bundle-docker-2/stop returned error: Container haproxy-bundle-docker-2 is already stopped"
Apr  9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: INFO: haproxy-bundle-docker-2
Apr  9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: NOTICE: Cleaning up inactive container, haproxy-bundle-docker-2.
Apr  9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: INFO: haproxy-bundle-docker-2
Apr  9 19:49:10 overcloud-controller-1 crmd[20851]:  notice: Result of stop operation for haproxy-bundle-docker-2 on overcloud-controller-1: 0 (ok)

Comment 1 Sai Sindhur Malleni 2018-04-10 12:33:51 UTC
EDIT

I have noticed that the haproxy containers are stopped on controller-1 and controller-2 on a fresh deployment as well.

Comment 2 Raoul Scarazzini 2018-04-10 14:10:56 UTC
On the machines we can see that the problem with the container is quite specific, and you can reproduce it by starting the container by hand:

[ALERT] 099/133036 (10) : Starting proxy opendaylight_ws: cannot bind socket [172.16.0.15:8185]
[ALERT] 099/133036 (10) : Starting proxy opendaylight_ws: cannot bind socket [192.168.24.59:8185]

This would suggest that the ports haproxy wants to use are already occupied by something else, but what we actually see on the controller is:

[root@overcloud-controller-1 heat-admin]# netstat -nlp|grep 8185
tcp        0      0 172.16.0.20:8185        0.0.0.0:*               LISTEN      496289/java

So on the machine's local IP, 172.16.0.20, only the opendaylight service (run by its container) is listening on port 8185, and nothing else.
One notable detail is that controller-1 does not have any VIP on it, while the problem does not happen on controller-0, where the VIP lives.

Commenting out the opendaylight_ws section in /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg on the machine makes haproxy start, but it remains to be understood why it cannot bind the port.
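For reference, the section being commented out looks roughly like the following (a sketch only; the bind addresses are taken from the ALERT messages above and all other options are omitted):

  listen opendaylight_ws
    bind 172.16.0.15:8185
    bind 192.168.24.59:8185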

Comment 3 Sai Sindhur Malleni 2018-04-10 18:24:00 UTC
Just a quick update: when the VIP moves to controller-2, the haproxy container is started on controller-2 and stopped on the others. So haproxy appears to run only on the controller that currently holds the VIP.
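A quick way to verify this correlation (hypothetical commands; the VIP addresses differ between the deployments shown in this bug) is to check where Pacemaker placed the VIPs and whether the address haproxy tries to bind is actually configured on the failing node:

  # where Pacemaker placed the VIP resources
  pcs status | grep IPaddr2
  # whether the VIP haproxy tries to bind (e.g. 172.16.0.15) exists locally
  ip addr | grep 172.16.0.15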

Comment 4 Tim Rozet 2018-04-13 19:51:07 UTC
The reason it is not starting on the non-VIP controller nodes is that the haproxy binding is not set to transparent. Since those nodes do not have the VIP configured locally, haproxy cannot bind to those IPs and therefore will not start. Most other services use transparent mode, which allows haproxy to start even when the referenced bind address is not present on the node. So the behavior is expected here. However, is the behavior correct?
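To illustrate at the haproxy.cfg level (a sketch, not the rendered config; service_x and port 9999 are placeholders): with the default bind parameters a frontend is rendered with the transparent option, which lets haproxy bind an address that is not configured on the local node, whereas the ODL websocket frontend is rendered without it:

  # typical service, rendered with the default bind parameters
  listen service_x
    bind 172.16.0.15:9999 transparent

  # opendaylight_ws, rendered with haproxy_listen_bind_param => []
  listen opendaylight_ws
    bind 172.16.0.15:8185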

For both the Zaqar websocket and ODL websocket services we are not using transparent binding, and a note from Juan indicates this was done intentionally:

  if $zaqar_ws {
    ::tripleo::haproxy::endpoint { 'zaqar_ws':
      public_virtual_ip         => $public_virtual_ip,
      internal_ip               => hiera('zaqar_ws_vip', $controller_virtual_ip),
      service_port              => $ports[zaqar_ws_port],
      ip_addresses              => hiera('zaqar_ws_node_ips', $controller_hosts_real),
      server_names              => hiera('zaqar_ws_node_names', $controller_hosts_names_real),
      mode                      => 'http',
      haproxy_listen_bind_param => [],  # We don't use a transparent proxy here

I'm guessing there is some issue with using a transparent proxy with websockets, but we need Juan to tell us what the original issue was.

Comment 7 Sai Sindhur Malleni 2018-04-16 20:08:28 UTC
Changed the HAProxy configuration for opendaylight_ws on the controllers to include transparent and restarted haproxy-bundle. Tried the VM boot-and-ping scenario; VMs go to ACTIVE as expected and are pingable.
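For the record, the manual workaround amounted to roughly the following (a sketch; the config edit was done by hand on each controller, and the bundle restart only needs to be issued once from one node):

  # add "transparent" to the opendaylight_ws bind lines
  vi /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg
  # then restart the haproxy bundle via Pacemaker
  pcs resource restart haproxy-bundle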

--------------------------------------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------------------+
|                                                       Response Times (sec)                                                        |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                         | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_router          | 1.62      | 1.899        | 3.281        | 3.89         | 4.327     | 2.259     | 100.0%  | 50    |
| neutron.create_network         | 0.246     | 0.462        | 0.612        | 0.689        | 0.774     | 0.444     | 100.0%  | 50    |
| neutron.create_subnet          | 0.582     | 0.849        | 1.014        | 1.038        | 1.421     | 0.854     | 100.0%  | 50    |
| neutron.add_interface_router   | 2.029     | 2.42         | 2.859        | 2.966        | 3.152     | 2.453     | 100.0%  | 50    |
| nova.boot_server               | 38.003    | 77.645       | 90.082       | 91.992       | 92.988    | 75.522    | 100.0%  | 50    |
| vm.attach_floating_ip          | 3.779     | 5.075        | 5.671        | 5.786        | 6.557     | 5.051     | 100.0%  | 50    |
|  -> neutron.create_floating_ip | 1.374     | 1.711        | 2.074        | 2.132        | 2.156     | 1.752     | 100.0%  | 50    |
|  -> nova.associate_floating_ip | 2.029     | 3.24         | 3.978        | 4.175        | 4.655     | 3.298     | 100.0%  | 50    |
| vm.wait_for_ping               | 0.019     | 0.023        | 0.028        | 0.029        | 121.23    | 4.851     | 96.0%   | 50    |
| total                          | 47.474    | 88.715       | 101.95       | 103.039      | 215.239   | 91.436    | 96.0%   | 50    |
|  -> duration                   | 46.474    | 87.715       | 100.95       | 102.039      | 214.239   | 90.436    | 96.0%   | 50    |
|  -> idle_duration              | 1.0       | 1.0          | 1.0          | 1.0          | 1.0       | 1.0       | 96.0%   | 50    |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+

Comment 8 Sai Sindhur Malleni 2018-04-16 20:10:12 UTC
FYI, this test was to launch and delete 50 VMs at a concurrency of 8.

Comment 9 Sai Sindhur Malleni 2018-04-16 20:11:14 UTC
haproxy-bundle is started on all 3 controllers after this change.


[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr 16 20:10:46 2018
Last change: Mon Apr 16 19:49:34 2018 by hacluster via crmd on overcloud-controller-1

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-2
 Docker container set: galera-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master overcloud-controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master overcloud-controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master overcloud-controller-2
 Docker container set: redis-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master overcloud-controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave overcloud-controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave overcloud-controller-2
 ip-192.168.24.60       (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.21.0.100        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.0.19 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.0.10 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.18.0.18 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.19.0.12 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Docker container set: haproxy-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started overcloud-controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started overcloud-controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started overcloud-controller-2
 Docker container: openstack-cinder-volume [docker-registry.engineering.redhat.com/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started overcloud-controller-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 10 Mike Kolesnik 2018-04-22 11:45:04 UTC
The upstream change was merged on Apr 20th; moving this to POST.
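For context, the merged change (gerrit 562581, "Fixes binding type for OpenDaylight Websocket") adjusts how puppet-tripleo declares the ODL websocket endpoint. A plausible sketch of the direction of the change, modeled on the zaqar_ws excerpt in comment 4 (the hiera key and port lookup below are illustrative, not necessarily the ones in the actual patch): stop overriding haproxy_listen_bind_param with an empty list, so the endpoint inherits the default bind parameters, which include transparent.

  ::tripleo::haproxy::endpoint { 'opendaylight_ws':
    internal_ip  => hiera('opendaylight_api_vip', $controller_virtual_ip),
    service_port => $ports[opendaylight_ws_port],
    mode         => 'http',
    # no haproxy_listen_bind_param => [] override here, so the
    # default bind parameters (including 'transparent') apply
  }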

Comment 17 errata-xmlrpc 2018-06-27 13:50:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

