Created attachment 1013778 [details]
one of the pacemaker logs

Description of problem:
Running boot storm tests in the scalability lab. The testing was automated to run overnight. After a few successful runs, we saw vms in ERROR state. One of the vms failed due to:

"message": "Remote error: DBDuplicateEntry (IntegrityError) (1062, \"Duplicate entry '1503535' for key 'PRIMARY'\") 'INSERT INTO instance_system_metadata (deleted_at, deleted, created_at, updated_at, `key`, value, instance_uuid) VALUES (%s, %s, %s, %s, %s, %s, %s)' (No", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 314, in decorated_function "

Note this was the second-to-last A2 puddle, configured with staypuft. The cloud had been up and running for over a week.

Version-Release number of selected component (if applicable):
openstack-ceilometer-alarm.noarch 2014.2.2-2.el7ost
openstack-ceilometer-api.noarch 2014.2.2-2.el7ost
openstack-ceilometer-central.noarch 2014.2.2-2.el7ost
openstack-ceilometer-collector.noarch 2014.2.2-2.el7ost
openstack-ceilometer-common.noarch 2014.2.2-2.el7ost
openstack-ceilometer-notification.noarch 2014.2.2-2.el7ost
openstack-cinder.noarch 2014.2.2-2.el7ost
openstack-dashboard.noarch 2014.2.2-2.el7ost
openstack-dashboard-theme.noarch 2014.2.2-2.el7ost
openstack-glance.noarch 2014.2.2-1.el7ost
openstack-heat-api.noarch 2014.2.2-1.el7ost
openstack-heat-api-cfn.noarch 2014.2.2-1.el7ost
openstack-heat-api-cloudwatch.noarch 2014.2.2-1.el7ost
openstack-heat-common.noarch 2014.2.2-1.el7ost
openstack-heat-engine.noarch 2014.2.2-1.el7ost
openstack-keystone.noarch 2014.2.2-1.el7ost
openstack-neutron.noarch 2014.2.2-5.el7ost
openstack-neutron-ml2.noarch 2014.2.2-5.el7ost
openstack-neutron-openvswitch.noarch 2014.2.2-5.el7ost
openstack-nova-api.noarch 2014.2.2-19.el7ost
openstack-nova-cert.noarch 2014.2.2-19.el7ost
openstack-nova-common.noarch 2014.2.2-19.el7ost
openstack-nova-conductor.noarch 2014.2.2-19.el7ost
openstack-nova-console.noarch 2014.2.2-19.el7ost
openstack-nova-novncproxy.noarch 2014.2.2-19.el7ost
openstack-nova-scheduler.noarch 2014.2.2-19.el7ost
openstack-selinux.noarch 0.6.25-1.el7ost
openstack-utils.noarch 2014.2-1.el7ost
galera.x86_64 25.3.5-6.el7ost
mariadb.x86_64 1:5.5.41-2.el7_0
mariadb-galera-common.x86_64 1:5.5.40-6.el7ost
mariadb-galera-server.x86_64 1:5.5.40-6.el7ost
mariadb-libs.x86_64 1:5.5.41-2.el7_0
corosync.x86_64 2.3.4-4.el7
corosynclib.x86_64 2.3.4-4.el7

How reproducible:
First time I have seen it. Preserving state until Monday PM (EDT) in case someone wants to poke.

Steps to Reproduce:
1.
2.
3.

Actual results:
Instance creation fails due to duplicate primary key.

Expected results:
It just works...

Additional info:
Created attachment 1013779 [details]
output of the nova show command

This includes the creation time.
This looks like a collision of the MySQL autoincrement, as the PK of that table is an autoincrement column and no value is specified in the INSERT statement. As is often the case, this is an unusual occurrence and we'd have to see if it is known to occur with Galera clusters under certain circumstances; a quick google suggests we should at least look for nodes that are involved in SST or IST processes. It would also be good to know which galera masters were in play when the error occurred, though at the moment I don't know that our standard logging setup gives us this info (we'd need HAProxy logging set up).

So let's start with what we should be able to know. Can you please attach the following logs / directories from *all* nodes:

/etc/my.cnf
/etc/my.cnf.d/*
/var/log/mariadb/*
/var/log/mysqld.log
/var/log/pacemaker.log
/var/log/nova/*

If any haproxy logging was turned on, that would help also (any /var/log/haproxy* files), though I'm not sure those are going to be present (I should look into figuring out how our test environments can have these set up).
The issue itself may be expected behavior when many cluster partitions are occurring, and therefore not preventable at the Galera level: https://groups.google.com/forum/#!topic/codership-team/lDRB9Z6swV8

> Well, with the current implementation it is possible to get such errors at ANY cluster reconfiguration (like momentary network partitions and such - any time auto_increment_increment *OR* auto_increment_offset changes). Making it bullet-proof would not be that easy: all ongoing transactions which already got their autoincrements are vulnerable.
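To make that failure mode concrete, here is a small stdlib-only Python sketch (the function name and the numbers are illustrative, not taken from the logs) of how a mid-flight change to auto_increment_increment / auto_increment_offset can hand out the same primary key twice:

```python
def next_autoinc(current_max, increment, offset):
    """Smallest value > current_max that is congruent to offset modulo
    increment -- mirroring how MySQL picks the next auto_increment value."""
    v = current_max + 1
    while (v - offset) % increment != 0:
        v += 1
    return v

# Before the partition: a 3-node cluster, so wsrep_auto_increment_control
# has set auto_increment_increment=3 with per-node offsets 1..3.
# Node 2 (offset=2) reserves a key for an in-flight INSERT:
in_flight = next_autoinc(10, 3, 2)   # -> 11

# A momentary partition shrinks the cluster to 2 nodes; Galera rewrites
# increment=2 with offsets 1..2.  A fresh INSERT on node 1 now computes:
fresh = next_autoinc(10, 2, 1)       # -> 11, colliding with the reserved key

print(in_flight, fresh)  # 11 11 -> "Duplicate entry ... for key 'PRIMARY'"
```

This is exactly the "ongoing transactions which already got their autoincrements" case the quoted post describes: the in-flight value was valid under the old settings and collides under the new ones.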
Created attachment 1013982 [details]
requested logs from controller1
Created attachment 1013983 [details]
requested logs from controller - note missing pacemaker log
Created attachment 1013984 [details]
requested logs from controller3
Created attachment 1014025 [details]
collated logs of all mysqld's / nova instances

This attachment contains a grep of all the startup/shutdown and synchronization state changes from the galera nodes, merged with all the duplicate key errors within all the nova logs (there were many, not just the one), collated by date and tagged with node and app name.

This file shows the series of events very simply: the MariaDB-Galera nodes would disconnect and re-synchronize on up/down without any surprises; however, at the point where a node was taken down (usually node1), the nova processes would undergo a flurry of dupe PK errors. As the SQL in all of these illustrates, the INSERT statements rely on autoincrement, so the autoincrement feature seems to fail during this small window, more or less as the codership thread linked above suggests.

It seems like there's not too much surprise here; however, I do want to try to reproduce this in isolation and see if the choice of fallback nodes is involved. E.g., when HAProxy fails over, does failing over to exactly one node, or to two nodes at the same time (e.g. two masters), have any impact? How about wsrep_auto_increment_control, which we have at its default of true for now? Learning more about what that flag does and does not do might be significant.

If we then say that this is just Galera's behavior, the fact that Nova had many failures of this nature but only one of them seems to have harmed the cluster might be where we go with this; I don't think we should add tons more retry decorators to every Nova API method, though when we move to enginefacade that would be something we could enable transparently. Or maybe we say that when a node is taken offline, Pacemaker and/or HAProxy should do something else that could possibly help; maybe point HAProxy to some kind of "waiting" target for 30 seconds so that all new transactions are paused.
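For reference, the collation described above can be done along these lines. This is a stdlib-only sketch under my own assumptions (function and tag names are mine, and real mysqld/nova logs use varying timestamp formats that would need per-source parsing): merge several already-filtered, time-ordered log excerpts into one stream ordered by timestamp and tagged with node/app.

```python
import heapq
import re
from datetime import datetime

# Assumes each kept line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp.
TS = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')

def collate(sources):
    """Merge (tag, lines) sources, each already sorted by time, into one
    chronologically ordered stream with each line prefixed by its tag."""
    def keyed(tag, lines):
        for line in lines:
            m = TS.match(line)
            if m:  # drop lines without a recognizable timestamp
                ts = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
                yield ts, '%s | %s' % (tag, line)
    streams = [keyed(tag, lines) for tag, lines in sources]
    return [line for _, line in heapq.merge(*streams)]

merged = collate([
    ('node1/mysqld', ['2015-06-01 10:00:01 WSREP: state change SYNCED -> DONOR']),
    ('node2/nova',   ['2015-06-01 10:00:00 DBDuplicateEntry on instance_system_metadata',
                      '2015-06-01 10:00:02 DBDuplicateEntry on instance_system_metadata']),
])
for line in merged:
    print(line)
```

heapq.merge is a good fit here because each per-node excerpt is already in time order, so the merge is a single streaming pass.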
Created attachment 1014065 [details]
load testing script

This is a Python script adapted from the one I wrote for https://bugzilla.redhat.com/show_bug.cgi?id=1195226. It shows which nodes HAProxy is directing connections towards, as well as changes in node and all errors.
Created attachment 1014066 [details]
output run with current HAProxy settings
The autoincrement issue here is easily reproduced, so this is normal behavior for Galera; see the attached output, which shows it occurring when stopping a node while running a load script. However, the way HAProxy is configured seems to have an impact on how likely the dupe PK issue is to occur. This configuration is part of some other behaviors I've been suspicious of with our HAProxy setup, so I will be posting that as a separate issue. The main factor when a node is taken down is whether HAProxy sends the remaining connections to multiple nodes or to just one node: if it sends the connections to multiple nodes, there seems to be a greater chance that this occurs. However, I can also observe it occurring when HAProxy bounces connections to just one other node.

There are several avenues we could take with this issue. We can try to get HAProxy to redirect bounces to just one node and greatly reduce the issue, or we can try to address it on the Nova side or even the oslo.db side. Avenues include:

1. Adding more "retry" features to Nova, such that duplicate key errors on autoinc columns constitute a "retry" in the same way a DBDeadlock does right now. The autoinc collision here is not that different.

2. Trying to work with time delays; the window in which the dupe PK errors occur is only the first half-second after the first node fails and connections are rerouted. We could either get HAProxy to delay when connections are failed over (though I didn't see any option for this), or perhaps work with time delays in oslo.db, though on the oslo.db side it means we'd have to detect a "failover" at the application level.

3. Reducing the failures here by improving HAProxy, and then considering the remaining failures as normal behavior when nodes are being rebooted under load.
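A minimal sketch of what avenue #1 could look like. DBDuplicateEntry here is a local stand-in for the oslo.db exception of the same name, and the decorator name and parameters are illustrative, not an existing Nova or oslo.db API:

```python
import functools
import time

class DBDuplicateEntry(Exception):
    """Local stand-in for oslo.db's DBDuplicateEntry exception."""

def retry_on_dup_pk(retries=3, delay=0.25):
    """Retry a DB operation on an autoinc duplicate-PK error, analogous
    to how DBDeadlock is already treated as retriable."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return fn(*args, **kwargs)
                except DBDuplicateEntry:
                    if attempt == retries - 1:
                        raise  # exhausted; surface the error as today
                    time.sleep(delay)  # let the failover window pass
        return wrapper
    return deco

attempts = []

@retry_on_dup_pk(retries=3, delay=0.01)
def insert_row():
    # Simulate the failover window: the first two INSERTs collide,
    # the third succeeds once the cluster has resettled.
    attempts.append(1)
    if len(attempts) < 3:
        raise DBDuplicateEntry("Duplicate entry '1503535' for key 'PRIMARY'")
    return 'inserted'

print(insert_row())  # prints "inserted" after two retried collisions
```

Because the collision window is under a second, even a couple of short retries would cover it; the real question from the comment above is whether we want this decorator spread across every Nova API method or applied transparently once enginefacade lands.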
Created attachment 1014068 [details]
load testing script

This is a new version of the script where I've adapted the "delay" parameter to be a fixed delay per worker. Run it like this:

.venv/bin/python autoinc_hammer.py -u root -H 192.168.1.200 -p root --delay 1
Created attachment 1014070 [details]
load testing script

This version uses the delay in conjunction with a "proceed cautiously" flag. When a disconnect or other error occurs, the flag is set, which causes a one-second delay before new operations, including reconnects, begin. It seems to eliminate the PK issues even when HAProxy sends connections to multiple nodes.
Created attachment 1014071 [details]
output run with one second "proceed cautiously" flag
Anyway, this solution suggests that we can at least try #2: add a time delay option to oslo.db that takes effect when we detect any kind of "disconnect", before we reconnect. I will try reproducing using oslo.db for connectivity next.
I was able to confirm that for the issue to occur, the node that is actively receiving database traffic needs to be the one brought down. If a node other than the active master is brought down while INSERT traffic proceeds on a different node, there is no corruption of the autoincrement synchronization. So dealing with this at the HAProxy or application level is an option.
I was hoping to address this with either HAProxy or HAAlchemy; however, the most expedient solution is to finish up the oslo.db patch I have at https://review.openstack.org/#/c/173421/. The approach there needs to be tested to confirm it works against a cluster where a node goes down under INSERT load, and we also need to figure out how to get the "reconnect_interval" parameter set when we roll out OpenStack applications.
It looks like this was fixed upstream and released in 5.5.42: https://mariadb.atlassian.net/browse/MDEV-7129

Related rebase issue: https://bugzilla.redhat.com/show_bug.cgi?id=1240114
Setting to POST, since the current proposed solution is to rebase to 5.5.42. OSP 6 is currently using 5.5.41, which is ahead of the 5.5.40 currently in OSP 7. 5.5.42 has changes that are supposed to address this issue.
Used the load testing script attached to this bug to try to reproduce the issue; was unable to reproduce with mariadb-galera-5.5.42-1.el7ost.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1548