Description of problem:

In an HA controller setup (deployed using director - tested on OSP7/OSP8), when the VIP used by the haproxy galera proxy is moved, the nova services on the controller where the IP was previously located go into the down state until a SQL timeout is reached.

Version-Release number of selected component (if applicable):
* Reported on OSP7, but also reproduced on OSP8 using python-nova-12.0.3-1.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1. Move the VIP to a new controller:
   $ pcs resource move ip-172.16.2.5
2. Check 'nova service-list'; after the configured timeout, nova-scheduler, nova-conductor and nova-consoleauth go into the down state.

Actual results:
* nova services in down state since 2016-09-21 09:26:54
* nova-conductor service up again at 2016-09-21 09:44:33
* From the nova-conductor log with sqlalchemy debug enabled:

# egrep "MySQL|outlasted" /var/log/nova/nova-conductor.log
2016-09-21 09:24:38.258 15949 DEBUG oslo_db.sqlalchemy.engines [req-3f89cecb-38bf-4f32-bae4-8214756d5a89 - - - - -] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py:256
2016-09-21 09:27:11.463 15948 INFO sqlalchemy.pool.QueuePool [req-dd51f26e-83de-4cc2-91af-ec33975e037d - - - - -] Invalidate connection <pymysql.connections.Connection object at 0x625b2d0> (reason: OperationalError:(2006, "MySQL server has gone away (error(110, 'Connection timed out'))"))
2016-09-21 09:44:33.737 15946 INFO sqlalchemy.pool.QueuePool [-] Invalidate connection <pymysql.connections.Connection object at 0x5c43d50> (reason: OperationalError:(2013, 'Lost connection to MySQL server during query ([Errno 110] Connection timed out)'))
2016-09-21 09:44:33.813 15946 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 1039.21 sec
2016-09-21 09:44:34.761 15949 INFO sqlalchemy.pool.QueuePool [-] Invalidate connection <pymysql.connections.Connection object at 0x5c43d50> (reason: OperationalError:(2013, 'Lost connection to MySQL server during query ([Errno 110] Connection timed out)'))
2016-09-21 09:44:34.839 15949 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 1040.28 sec
2016-09-21 09:44:38.857 15948 INFO sqlalchemy.pool.QueuePool [-] Invalidate connection <pymysql.connections.Connection object at 0x6240dd0> (reason: OperationalError:(2013, 'Lost connection to MySQL server during query ([Errno 110] Connection timed out)'))
2016-09-21 09:44:39.008 15948 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 1044.49 sec
2016-09-21 09:44:42.953 15947 INFO sqlalchemy.pool.QueuePool [-] Invalidate connection <pymysql.connections.Connection object at 0x5c43d50> (reason: OperationalError:(2013, 'Lost connection to MySQL server during query ([Errno 110] Connection timed out)'))
2016-09-21 09:44:43.049 15947 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 1048.53 sec

Expected results:
Controller services do not go into the down state when the VIP used by galera switches to another node.

Additional info:
* Setting pool_timeout in nova.conf seems to have a positive impact on nova-conductor. When this is set to e.g. 30s, we no longer see nova-conductor reported as down.
* pool_timeout seems to have no impact on nova-scheduler.
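The outage window is visible in the "outlasted interval" warnings above: each conductor worker reports how long its _report_state loop stalled. A small stdlib-only helper to pull those numbers out of a log; the regex is an assumption based on the exact oslo log lines shown above, not a general-purpose parser:

```python
import re
from datetime import datetime

# Matches the oslo.service warning seen in the excerpt above, e.g.
# "2016-09-21 09:44:33.813 15946 WARNING oslo.service.loopingcall [-]
#  Function '...' run outlasted interval by 1039.21 sec"
OUTLASTED = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ (\d+) WARNING "
    r"oslo\.service\.loopingcall .* outlasted interval by ([\d.]+) sec"
)

def outage_windows(log_lines):
    """Yield (worker_pid, recovery_time, stall_seconds) per stalled worker."""
    for line in log_lines:
        match = OUTLASTED.match(line)
        if match:
            recovered = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            yield int(match.group(2)), recovered, float(match.group(3))
```

Feeding it the lines from the egrep above yields stalls of roughly 1040 seconds per worker, matching the ~17-minute gap between the 09:27 invalidation and the 09:44 recovery.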
Forgot to mention: when we configure nova not to use the VIP for the haproxy galera proxy and instead go directly to one of the DB nodes, we do not see this behavior when moving the VIP.

For reference, this is the haproxy galera proxy config:

global
  daemon
  group  haproxy
  log  /dev/log local0
  maxconn  20480
  pidfile  /var/run/haproxy.pid
  ssl-default-bind-ciphers  !SSLv2:kEECDH:kRSA:kEDH:kPSK:+3DES:!aNULL:!eNULL:!MD5:!EXP:!RC4:!SEED:!IDEA:!DES
  ssl-default-bind-options  no-sslv3
  user  haproxy

defaults
  log  global
  maxconn  4096
  mode  tcp
  retries  3
  timeout  http-request 10s
  timeout  queue 1m
  timeout  connect 10s
  timeout  client 1m
  timeout  server 1m
  timeout  check 10s

...

listen mysql
  bind 172.16.2.5:3306 transparent
  option tcpka
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server overcloud-controller-0 172.16.2.6:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-1 172.16.2.7:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-2 172.16.2.8:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
I don't have enough happy-path experience using haproxy/galera/pacemaker/tripleo to help debug this much, but I wonder if auditing the differences in the recommended config vs the actual config might help.

- How can I verify my haproxy.cfg is correctly configured to load balance openstack services?
  https://access.redhat.com/solutions/1599813

Here's a diff between the recommended haproxy config and the config pasted above. Do any of those differences jump out as problematic to anyone else with more experience?

http://paste.openstack.org/show/582652/

global
  maxconn 20480
  ssl-default-bind-ciphers !SSLv2:kEECDH:kRSA:kEDH:kPSK:+3DES:!aNULL:!eNULL:!MD5:!EXP:!RC4:!SEED:!IDEA:!DES
  ssl-default-bind-options no-sslv3

defaults
  maxconn 4096

listen mysql
  bind 172.16.2.5:3306 transparent
  option tcpka
(In reply to Diana Clarke from comment #3)
> I don't have enough happy-path experience using
> haproxy/galera/pacemaker/tripleo to help debug this much, but I wonder if
> auditing the differences in the recommended config vs the actual config
> might help.
>
> - How can I verify my haproxy.cfg is correctly configured to load balance
>   openstack services?
>   https://access.redhat.com/solutions/1599813
>
> Here's a diff between the recommended haproxy config and the config pasted
> above. Do any of those differences jump out as problematic to anyone else
> with more experience?
>
> http://paste.openstack.org/show/582652/
>
> global
>   maxconn 20480
>   ssl-default-bind-ciphers !SSLv2:kEECDH:kRSA:kEDH:kPSK:+3DES:!aNULL:!eNULL:!MD5:!EXP:!RC4:!SEED:!IDEA:!DES
>   ssl-default-bind-options no-sslv3
>
> defaults
>   maxconn 4096
>
> listen mysql
>   bind 172.16.2.5:3306 transparent
>   option tcpka

Thanks for the review. It is possible to reproduce the issue on an empty, idle overcloud, so connection limits should not matter here. I have verified that the tcpka and transparent options make no difference to the behavior.
(In reply to Martin Schuppert from comment #4)
> (In reply to Diana Clarke from comment #3)
> > [...]
>
> Thanks for the review. It is possible to reproduce the issue on an empty,
> idle overcloud, so connection limits should not matter here.
> I have verified that the tcpka and transparent options make no difference
> to the behavior.

I agree that 'tcpka' probably has no impact, but 'transparent' allows haproxy to bind to an IP address (VIP) that does not exist on the system. Best to leave that intact.
(In reply to Ryan O'Hara from comment #10)
> (In reply to Martin Schuppert from comment #4)
> > (In reply to Diana Clarke from comment #3)
> > > [...]
> >
> > Thanks for the review. It is possible to reproduce the issue on an empty,
> > idle overcloud, so connection limits should not matter here.
> > I have verified that the tcpka and transparent options make no difference
> > to the behavior.
>
> I agree that 'tcpka' probably has no impact, but 'transparent' allows
> haproxy to bind to an IP address (VIP) that does not exist on the system.
> Best to leave that intact.

Yes, that makes sense, but it makes no difference in this case.
I am able to reproduce the failure reported in "nova service-list" on OSP8 by moving the VIP away from a node. Before the move, haproxy is up on all nodes, each instance listening on the mysql VIP.

In my test, at 15:55 I'm moving the VIP away from ctrl1 to ctrl0:

[root@overcloud-controller-1 heat-admin]# pcs resource move ip-172.16.2.5
Warning: Creating location constraint cli-ban-ip-172.16.2.5-on-overcloud-controller-1 with a score of -INFINITY for resource ip-172.16.2.5 on node overcloud-controller-1.
This will prevent ip-172.16.2.5 from running on overcloud-controller-1 until the constraint is removed. This will be the case even if overcloud-controller-1 is the last node in the cluster.

A few seconds later, various nova services are seen down with "nova service-list":

[root@overcloud-controller-1 heat-admin]# nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-09-30T16:00:34.000000 | -               |
| 6  | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-09-30T16:00:32.000000 | -               |
| 9  | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | down  | 2016-09-30T15:55:06.000000 | -               |
| 12 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | down  | 2016-09-30T15:55:03.000000 | -               |
| 15 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-09-30T16:00:33.000000 | -               |
| 18 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-09-30T16:00:37.000000 | -               |
| 21 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | down  | 2016-09-30T15:55:09.000000 | -               |
| 33 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-09-30T16:00:32.000000 | -               |
| 36 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-09-30T16:00:40.000000 | -               |
| 39 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2016-09-30T16:00:35.000000 | -               |
| 42 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2016-09-30T16:00:40.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

Oddly, the nova service instances reported as "down" are those located on the controller which originally hosted the VIP (ctrl1). Nova services located on the other nodes don't seem to be impacted by the VIP move, so at this stage it doesn't look like haproxy is causing the failure of the nova services.

We have captured tcpdump on both ctrl1 and ctrl0 to analyze traffic to the mysql VIP and see whether the failure could be related to TCP keepalive packets on the nova service side, or whether anything else is involved. Analysis is still ongoing.
Update to comment #21: the failure of the nova services with long-running DB connections (consoleauth, scheduler, conductor) comes from the way the services initially connect to the VIP.

In the example from comment #21 (where the VIP is located on ctrl1), when a service is started on ctrl1, a socket is created and connected to the mysql VIP. Since the socket has a default source address of *, the kernel selects the source address with the most direct path to the destination address (i.e. the VIP). On ctrl1, this results in a connected socket whose source address is bound to the VIP itself.

When the failover occurs, the VIP disappears from ctrl1. Even once a keepalive probe would eventually detect that the peer is unreachable, the kernel cannot send an RST packet to close the client side of the connection, because the VIP it would send from is already gone.
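The source-address selection described above can be demonstrated with a plain TCP socket: with no explicit bind the kernel picks the source address from the route to the destination, while `source_address` pins it before connect(). A minimal local sketch; the loopback listener merely stands in for the MySQL VIP endpoint, so both results are 127.0.0.1 here, whereas on a controller holding the VIP the implicit case is what ends up bound to the VIP:

```python
import socket
import threading

def connect_source(dest, source=None):
    """Connect a TCP socket to `dest` and return the local (source) address
    the kernel ended up using. With source=None the kernel chooses it from
    the route to the destination -- on a node holding the VIP, that choice
    can be the VIP itself."""
    if source is not None:
        sock = socket.create_connection(dest, source_address=(source, 0))
    else:
        sock = socket.create_connection(dest)
    addr = sock.getsockname()[0]
    sock.close()
    return addr

# Loopback listener standing in for the haproxy mysql frontend.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(5)
threading.Thread(target=lambda: (listener.accept(), listener.accept()),
                 daemon=True).start()

dest = listener.getsockname()
kernel_choice = connect_source(dest)               # problem case: implicit bind
pinned_choice = connect_source(dest, "127.0.0.1")  # fix direction: explicit bind
```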
Guys, since we have some clarity on what's causing this peculiar behavior, do we know what's needed in order to get it fixed? If I'm understanding this correctly, it looks like an underlying issue that affects all versions of OpenStack?
Irina,

When the DB VIP fails over from one node to another, the OpenStack services with a long-running connection to the DB automatically recreate a new connection the next time they need to access the DB.

The failover failure impacting the OpenStack services colocated with the VIP (same host) is due to the kernel's inability to notify the clients of the socket disconnection, because both the source and the destination IP of the socket are bound to the VIP. So in order to fix this failure we need either other means to detect the socket connection failure, or means to prevent the socket from binding its source IP to the VIP when the service and the VIP are colocated. We're currently evaluating the best options to resolve the issue.

As you said, this is more of a network concern, so it impacts all versions of OpenStack.
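As a sketch of the first option above ("other means to detect socket connection failure"), a client socket's TCP keepalive timers can be shortened so a dead peer is noticed within seconds rather than after the long system defaults; note that this alone would not have helped the colocated case described here, since the probes would egress from the vanished VIP. The option names are Linux-specific and the timer values are illustrative, not recommendations:

```python
import socket

def enable_fast_keepalive(sock, idle=5, interval=5, count=3):
    """Enable TCP keepalive with short timers: first probe after `idle`
    seconds of silence, then `count` probes `interval` seconds apart
    before the connection is declared dead (Linux socket options)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_fast_keepalive(sock)
```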
It appears the best way to solve this issue for both IPv4 and IPv6 is to configure the OpenStack clients to connect to the DB from a specific IP address, for example the one from the control plane network. In order to achieve that, we need at least to:

1) add support in PyMySQL upstream for binding to a specific source IP before connect()-ing to the DB
2) update the tripleo packages to generate the proper configuration flags for the OpenStack services

For 1), I just created a pull request upstream [1]. Once it is discussed and approved, we will design 2) accordingly.

[1] https://github.com/PyMySQL/PyMySQL/pull/529
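Assuming the pull request lands as a `bind_address` connect argument (the name other MySQL client libraries use for this; whether PR review keeps it is open), client-side usage would look roughly like the following. The addresses are illustrative, taken from the haproxy config earlier in this report, and the credentials are made up; the sketch only assembles the arguments, so it does not require pymysql to be installed:

```python
# DB VIP from the haproxy 'listen mysql' frontend shown earlier; the
# controller's own control plane address is an assumed example value.
DB_VIP = "172.16.2.5"
CONTROL_PLANE_IP = "172.16.2.6"

connect_kwargs = {
    "host": DB_VIP,                    # still connect through the VIP
    "user": "nova",                    # illustrative credentials
    "password": "secret",
    "database": "nova",
    "bind_address": CONTROL_PLANE_IP,  # pin the socket source, not the VIP
}

# With a PyMySQL carrying the patch this would become:
#   import pymysql
#   conn = pymysql.connect(**connect_kwargs)
```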
So, one wrinkle is that oslo.db has an opinion about the arguments that can pass through to create_engine(). I think we might have to change oslo.db to support this new argument (SQLAlchemy already allows pass-through options for the MySQL DBAPIs).
Oh wait, I think I'm mistaken. We'd send this as a query-string variable in the URL, and oslo will let that pass through:

mysql+pymysql://user:pass@hostname/dbname?host=<hostname>
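The mechanism relied on here is that SQLAlchemy forwards unrecognized query-string arguments in the URL to the DBAPI's connect() call. A simplified stdlib-only sketch of that forwarding; real SQLAlchemy/oslo.db also handle type coercion and option filtering, and the `bind_address` key in the test is just one example of an argument riding through:

```python
from urllib.parse import urlsplit, parse_qsl

def dbapi_connect_args(db_uri):
    """Split an oslo.db-style connection URI into base connect arguments
    plus whatever extra query-string arguments pass through to the driver."""
    parts = urlsplit(db_uri)
    args = {
        "host": parts.hostname,
        "user": parts.username,
        "password": parts.password,
        "database": parts.path.lstrip("/"),
    }
    # Query-string variables extend/override the base arguments, mirroring
    # how they reach the DBAPI connect() unchanged.
    args.update(parse_qsl(parts.query))
    return args
```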
OK, the fix is merged in all upstream branches; I'll now prepare the backport for downstream.
As a heads-up, the fix merged upstream unfortunately conflicts with Nova's cells_v2 feature in Ocata, so it has been reverted [1]. We will shortly propose another version of the patch, which still pursues the strategy from comment #36 but is deployed differently in tripleo.

[1] https://review.openstack.org/430183
The new fix has been merged upstream in master [1] and Ocata [2]. It now needs to be adapted for Mitaka and Liberty; once that is done, we'll focus on backporting those patches.

[1] https://review.openstack.org/431424
[2] https://review.openstack.org/436144
So this bug is fixed starting with OSP11. The fix is a fairly big change in that it introduces a new mysql config file and changes the DB URI in all OpenStack services' configs. While we could backport it down to OSP10, we believe this would be a bit risky as it could impact FFU from OSP10 to OSP13. Consequently, I'm closing this now; feel free to reopen if you disagree with the reasoning.