Created attachment 1013778 [details]
one of the pacemaker logs

Description of problem:
Running boot storm tests in the scalability lab. The testing was automated to run overnight. After a few successful runs, we saw vms in ERROR state. One of the vms failed due to:

"message": "Remote error: DBDuplicateEntry (IntegrityError) (1062, \"Duplicate entry '1503535' for key 'PRIMARY'\") 'INSERT INTO instance_system_metadata (deleted_at, deleted, created_at, updated_at, `key`, value, instance_uuid) VALUES (%s, %s, %s, %s, %s, %s, %s)' (No", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 314, in decorated_function "

Note this was the second-to-last A2 puddle, configured with staypuft. The cloud had been up and running for over a week.

Version-Release number of selected component (if applicable):
openstack-ceilometer-alarm.noarch 2014.2.2-2.el7ost
openstack-ceilometer-api.noarch 2014.2.2-2.el7ost
openstack-ceilometer-central.noarch 2014.2.2-2.el7ost
openstack-ceilometer-collector.noarch 2014.2.2-2.el7ost
openstack-ceilometer-common.noarch 2014.2.2-2.el7ost
openstack-ceilometer-notification.noarch 2014.2.2-2.el7ost
openstack-cinder.noarch 2014.2.2-2.el7ost
openstack-dashboard.noarch 2014.2.2-2.el7ost
openstack-dashboard-theme.noarch 2014.2.2-2.el7ost
openstack-glance.noarch 2014.2.2-1.el7ost
openstack-heat-api.noarch 2014.2.2-1.el7ost
openstack-heat-api-cfn.noarch 2014.2.2-1.el7ost
openstack-heat-api-cloudwatch.noarch 2014.2.2-1.el7ost
openstack-heat-common.noarch 2014.2.2-1.el7ost
openstack-heat-engine.noarch 2014.2.2-1.el7ost
openstack-keystone.noarch 2014.2.2-1.el7ost
openstack-neutron.noarch 2014.2.2-5.el7ost
openstack-neutron-ml2.noarch 2014.2.2-5.el7ost
openstack-neutron-openvswitch.noarch 2014.2.2-5.el7ost
openstack-nova-api.noarch 2014.2.2-19.el7ost
openstack-nova-cert.noarch 2014.2.2-19.el7ost
openstack-nova-common.noarch 2014.2.2-19.el7ost
openstack-nova-conductor.noarch 2014.2.2-19.el7ost
openstack-nova-console.noarch 2014.2.2-19.el7ost
openstack-nova-novncproxy.noarch 2014.2.2-19.el7ost
openstack-nova-scheduler.noarch 2014.2.2-19.el7ost
openstack-selinux.noarch 0.6.25-1.el7ost
openstack-utils.noarch 2014.2-1.el7ost
galera.x86_64 25.3.5-6.el7ost
mariadb.x86_64 1:5.5.41-2.el7_0
mariadb-galera-common.x86_64 1:5.5.40-6.el7ost
mariadb-galera-server.x86_64 1:5.5.40-6.el7ost
mariadb-libs.x86_64 1:5.5.41-2.el7_0
corosync.x86_64 2.3.4-4.el7
corosynclib.x86_64 2.3.4-4.el7

How reproducible:
First time I have seen it. Preserving state until Monday PM (EDT) in case someone wants to poke.

Steps to Reproduce:
1.
2.
3.

Actual results:
Instance creation fails due to duplicate primary key.

Expected results:
It just works...

Additional info:
Created attachment 1013779 [details]
output of the nova show command

This includes the creation time.
This looks like a collision of the MySQL autoincrement, as the PK of that table is an autoincrement column and no value is specified in the INSERT statement. As is often the case, this is an unusual occurrence and we'd have to see if it is known to occur with Galera clusters under certain circumstances; a quick google suggests we should at least look for nodes that are involved in SST or IST processes. It would also be good to know which galera masters were in play when the error occurred, though at the moment I don't know that our standard logging setup gives us this info (we'd need HAProxy logging set up).

So let's start with what we should be able to know. Can you please attach the following logs / directories from *all* nodes:

/etc/my.cnf
/etc/my.cnf.d/*
/var/log/mariadb/*
/var/log/mysqld.log
/var/log/pacemaker.log
/var/log/nova/*

If any haproxy logging was turned on, that would help also (any /var/log/haproxy* files), though I'm not sure those are going to be present (I should look into figuring out how our test environments can have these set up).
The issue itself may be expected behavior when many cluster partitions are occurring, and therefore not preventable at the Galera level: https://groups.google.com/forum/#!topic/codership-team/lDRB9Z6swV8

> Well, with the current implementation it is possible to get such errors at ANY cluster reconfiguration (like momentary network partitions and such - any time auto_increment_increment *OR* auto_increment_offset changes). Making it bullet-proof would not be that easy: all ongoing transactions which already got their autoincrements are vulnerable.
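To make that failure mode concrete, here is a small stdlib-only Python sketch (the function name and the numbers are illustrative, not taken from the logs) of how a mid-flight change to auto_increment_increment / auto_increment_offset can hand out the same primary key twice:

```python
def next_autoinc(current_max, increment, offset):
    """Smallest value > current_max that is congruent to offset modulo
    increment -- mirroring how MySQL picks the next auto_increment value."""
    v = current_max + 1
    while (v - offset) % increment != 0:
        v += 1
    return v

# Before the partition: a 3-node cluster, so wsrep_auto_increment_control
# has set auto_increment_increment=3 with per-node offsets 1..3.
# Node 2 (offset=2) reserves a key for an in-flight INSERT:
in_flight = next_autoinc(10, 3, 2)   # -> 11

# A momentary partition shrinks the cluster to 2 nodes; Galera rewrites
# increment=2 with offsets 1..2.  A fresh INSERT on node 1 now computes:
fresh = next_autoinc(10, 2, 1)       # -> 11, colliding with the reserved key

print(in_flight, fresh)  # 11 11 -> "Duplicate entry ... for key 'PRIMARY'"
```

This is exactly the "ongoing transactions which already got their autoincrements" case the quoted post describes: the in-flight value was valid under the old settings and collides under the new ones.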
Created attachment 1013982 [details]
requested logs from controller1
Created attachment 1013983 [details]
requested logs from controller - note missing pacemaker log
Created attachment 1013984 [details]
requested logs from controller3
Created attachment 1014025 [details]
collated logs of all mysqld's / nova instances

This attachment contains a grep of all the startup/shutdown and synchronization state changes from the galera nodes, merged with all the duplicate key errors within all the nova logs (there were many, not just the one), collated by date and tagged with node and app name.

This file shows the series of events very simply: the MariaDB-Galera nodes would disconnect and re-synchronize on up/down without any surprises; however, at the point where a node was taken down (usually node1), the nova processes would undergo a flurry of dupe PK errors. As the SQL in all of these illustrates, the INSERT statements rely on autoincrement, so the autoincrement feature seems to fail during this small window, more or less as the codership thread linked above suggests.

It seems like there's not too much surprise here; however, I do want to try to reproduce this in isolation and see if the choice of fallback nodes is involved. E.g., when HAProxy fails over, does failing over to exactly one node, or to two nodes at the same time (e.g. two masters), have any impact? How about wsrep_auto_increment_control, which we have at its default of true for now? Learning more about what that flag does and does not do might be significant.

If we then say that this is just Galera's behavior, the fact that Nova had many failures of this nature but only one of them seems to have harmed the cluster might be where we go with this; I don't think we should add tons more retry decorators to every Nova API method, though when we move to enginefacade that would be something we could enable transparently. Or maybe we say that when a node is taken offline, Pacemaker and/or HAProxy should do something else that could possibly help; maybe point HAProxy to some kind of "waiting" target for 30 seconds so that all new transactions are paused.
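For reference, the collation described above can be done along these lines. This is a stdlib-only sketch under my own assumptions (function and tag names are mine, and real mysqld/nova logs use varying timestamp formats that would need per-source parsing): merge several already-filtered, time-ordered log excerpts into one stream ordered by timestamp and tagged with node/app.

```python
import heapq
import re
from datetime import datetime

# Assumes each kept line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp.
TS = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')

def collate(sources):
    """Merge (tag, lines) sources, each already sorted by time, into one
    chronologically ordered stream with each line prefixed by its tag."""
    def keyed(tag, lines):
        for line in lines:
            m = TS.match(line)
            if m:  # drop lines without a recognizable timestamp
                ts = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
                yield ts, '%s | %s' % (tag, line)
    streams = [keyed(tag, lines) for tag, lines in sources]
    return [line for _, line in heapq.merge(*streams)]

merged = collate([
    ('node1/mysqld', ['2015-06-01 10:00:01 WSREP: state change SYNCED -> DONOR']),
    ('node2/nova',   ['2015-06-01 10:00:00 DBDuplicateEntry on instance_system_metadata',
                      '2015-06-01 10:00:02 DBDuplicateEntry on instance_system_metadata']),
])
for line in merged:
    print(line)
```

heapq.merge is a good fit here because each per-node excerpt is already in time order, so the merge is a single streaming pass.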
Created attachment 1014065 [details]
load testing script

This is a Python script adapted from the one I wrote for https://bugzilla.redhat.com/show_bug.cgi?id=1195226. It shows which nodes HAProxy is directing connections towards, as well as changes in node and all errors.
Created attachment 1014066 [details]
output run with current HAProxy settings
The autoincrement issue here is easily reproduced, so this is normal behavior for Galera; see the attached output, which shows it occurring when stopping a node while running a load script. However, the way HAProxy is configured seems to have an impact on how likely the dupe PK issue is to occur. This configuration is part of some other behaviors I've been suspicious of with our HAProxy setup, so I will be posting that as a separate issue. The main factor when a node is taken down is whether HAProxy sends the remaining connections to multiple nodes or to just one node: if it sends the connections to multiple nodes, there seems to be a greater chance that this occurs. However, I can also observe it occurring when HAProxy bounces connections to just one other node.

There are several avenues we could take with this issue. We can try to get HAProxy to redirect bounces to just one node and greatly reduce the issue, or we can try to address it on the Nova side or even the oslo.db side. Avenues include:

1. Adding more "retry" features to Nova, such that duplicate key errors on autoinc columns constitute a "retry" in the same way a DBDeadlock does right now. The autoinc collision here is not that different.

2. Trying to work with time delays; the window in which the dupe PK errors occur is only the first half-second after the first node fails and connections are rerouted. We could either get HAProxy to delay when connections are failed over (though I didn't see any option for this), or perhaps work with time delays in oslo.db, though on the oslo.db side it means we'd have to detect a "failover" at the application level.

3. Reducing the failures here by improving HAProxy, and then considering the remaining failures as normal behavior when nodes are being rebooted under load.
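A minimal sketch of what avenue #1 could look like. DBDuplicateEntry here is a local stand-in for the oslo.db exception of the same name, and the decorator name and parameters are illustrative, not an existing Nova or oslo.db API:

```python
import functools
import time

class DBDuplicateEntry(Exception):
    """Local stand-in for oslo.db's DBDuplicateEntry exception."""

def retry_on_dup_pk(retries=3, delay=0.25):
    """Retry a DB operation on an autoinc duplicate-PK error, analogous
    to how DBDeadlock is already treated as retriable."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return fn(*args, **kwargs)
                except DBDuplicateEntry:
                    if attempt == retries - 1:
                        raise  # exhausted; surface the error as today
                    time.sleep(delay)  # let the failover window pass
        return wrapper
    return deco

attempts = []

@retry_on_dup_pk(retries=3, delay=0.01)
def insert_row():
    # Simulate the failover window: the first two INSERTs collide,
    # the third succeeds once the cluster has resettled.
    attempts.append(1)
    if len(attempts) < 3:
        raise DBDuplicateEntry("Duplicate entry '1503535' for key 'PRIMARY'")
    return 'inserted'

print(insert_row())  # prints "inserted" after two retried collisions
```

Because the collision window is under a second, even a couple of short retries would cover it; the real question from the comment above is whether we want this decorator spread across every Nova API method or applied transparently once enginefacade lands.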
Created attachment 1014068 [details]
load testing script

This is a new version of the script where I've adapted the "delay" parameter to be a fixed delay per worker. Run it like this:

.venv/bin/python autoinc_hammer.py -u root -H 192.168.1.200 -p root --delay 1
Created attachment 1014070 [details]
load testing script

This version uses the delay in conjunction with a "proceed cautiously" flag. When a disconnect or other error occurs, the flag is set, which causes a one-second delay before new operations, including reconnects, begin. It seems to eliminate the PK issues even when HAProxy sends connections to multiple nodes.
Created attachment 1014071 [details]
output run with one second "proceed cautiously" flag
Anyway, this solution suggests that we can at least try #2: add a time delay option to oslo.db that takes effect when we detect any kind of "disconnect", before we reconnect. I will try reproducing using oslo.db for connectivity next.
I was able to confirm that for the issue to occur, the node that is actively receiving database traffic needs to be the one brought down. If a node other than the active master is brought down while INSERT traffic proceeds on a different node, there is no corruption of the autoincrement synchronization. So dealing with this at the HAProxy or application level is an option.
I was hoping to address this with either HAProxy or HAAlchemy; however, the most expedient solution is to finish up the oslo.db patch I have at https://review.openstack.org/#/c/173421/. The approach there needs to be tested to confirm it works against a cluster where a node goes down under INSERT load, and we also need to figure out how to get the "reconnect_interval" parameter set when we roll out OpenStack applications.
It looks like this was fixed upstream and released in 5.5.42: https://mariadb.atlassian.net/browse/MDEV-7129

Related rebase issue: https://bugzilla.redhat.com/show_bug.cgi?id=1240114
Setting to POST, since the current proposed solution is to rebase to 5.5.42. OSP 6 is currently using 5.5.41, which is ahead of the 5.5.40 currently in OSP 7. 5.5.42 has changes that are supposed to address this issue.
Used the load testing script attached to this bug to try to reproduce the issue; was unable to reproduce with mariadb-galera-5.5.42-1.el7ost.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1548