Bug 1281994

Summary:	Slow SQL replication under the 4.2 kernels
Product:	[Fedora] Fedora	Reporter:	LukasH <k-rh-bugzilla>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED NOTABUG	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	22	CC:	gansalmon, hhorak, itamar, jonathan, kernel-maint, madhu.chinakonda, mchehab
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-02-15 11:47:36 UTC	Type:	Bug

Description LukasH 2015-11-14 01:24:29 UTC

Description of problem:

After upgrading from F21 to F22, I observed delayed replication between master and slave (both ends run MariaDB 10.0.21). Slave_IO_Running and Slave_SQL_Running both report Yes and nothing looks wrong, but replication starts lagging immediately and Seconds_Behind_Master increases continuously.

It must be a regression or something similar in the 4.2 kernel series (see below), as MariaDB is almost the same version in the stable repositories for F21, F22 and F23.
When I reboot back into a 4.1 series kernel, replication returns to normal speed and Seconds_Behind_Master gradually decreases to zero.
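
The symptom above can be checked mechanically. As a minimal sketch, assuming the text output of `SHOW SLAVE STATUS\G` has been captured from the mysql client, the following Python extracts the three fields named in this report and tells whether the lag grows across successive samples. The function names and the sample text are hypothetical and made up for illustration; they are not taken from the affected servers.

```python
def parse_slave_status(text):
    """Parse `SHOW SLAVE STATUS\\G` client output into a dict of field -> value."""
    fields = {}
    for line in text.splitlines():
        # Field lines look like "   Slave_IO_Running: Yes"; skip the row banner.
        if ":" not in line or line.lstrip().startswith("*"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def replication_threads_ok(fields):
    """True when both replication threads report Yes."""
    return (fields.get("Slave_IO_Running") == "Yes"
            and fields.get("Slave_SQL_Running") == "Yes")


def lag_seconds(fields):
    """Seconds_Behind_Master as an int, or None when the server reports NULL."""
    raw = fields.get("Seconds_Behind_Master")
    if raw is None or raw == "NULL":
        return None
    return int(raw)


def lag_is_growing(samples):
    """True when every successive lag sample is larger than the previous one."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


# Made-up sample in the shape the mysql client prints; not data from this report.
SAMPLE = """*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 125
"""
```

Sampling `lag_seconds(parse_slave_status(...))` every few seconds and passing the list to `lag_is_growing` would distinguish a slave that is catching up from one that keeps falling behind while both threads still report Yes.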


Version-Release number of selected component (if applicable):

Affected kernels: 4.2.5-201.fc22, 4.2.3-200.fc22.

Unaffected kernels: 4.1.10-100.fc21 and also 4.1.10-200.fc22, so it's not "fc22 specific" but more likely kernel (4.2) specific.


How reproducible:

Always. Upgrade to F22 (probably F23 too) with a 4.2.X kernel, then start SQL replication. The delay starts increasing almost immediately.


Steps to Reproduce:
1. Upgrade system to F22.
2. Reboot into any newer kernel (4.2.X).
3. Start mariadb.service / SQL replication.

Actual results:

The slave lags behind the master.


Expected results:

Master and slave at the same log position; Seconds_Behind_Master should be zero.


Additional info:

Comment 1 Josh Boyer 2015-11-18 13:39:24 UTC

Which filesystem is in use and are the processes hung or sleeping?
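
One way to answer the second question is to read each process's state from /proc. A minimal sketch, assuming a Linux /proc filesystem; the helper names are hypothetical. State `D` (uninterruptible sleep) usually means a task is blocked on I/O, `S` is ordinary sleep and `R` is running.

```python
import os


def parse_proc_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def processes_in_state(wanted="D", proc_root="/proc"):
    """List (pid, comm) for processes currently in the given state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_proc_stat(handle.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == wanted:
            found.append((pid, comm))
    return found
```

Calling `processes_in_state("D")` a few times while replication is lagging would show whether mysqld or a journal thread is repeatedly stuck waiting on disk.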

Comment 2 LukasH 2015-11-20 16:27:50 UTC

The filesystem is ext4 with no special FS flags (and nothing unusual about the Fedora install either: it's a normal "F22 Server Edition", with SELinux not enabled).

But I was wrong about the 4.2 kernels; the problem is elsewhere. It looks like the replication delay starts at some point (regardless of kernel version), and this point is probably "systemctl restart mariadb.service". A reboot (again, regardless of kernel version) fixes it. I have two F22 SQL slaves; I observed it on the first one and will try to verify the same behaviour on the second one this evening.

In any case, I'm pretty sure that the same action (`systemctl restart mariadb.service') works fine (without causing any delay) on F21; the configuration of replication and of MariaDB as a whole was exactly the same many months ago.

So this bug is very likely related to (something in) fc22. But I really don't know why or where, or what I should trace or observe to better specify this bug report.

Comment 3 LukasH 2016-02-11 13:02:51 UTC

Some additional info: on 4.3 kernels, the replication situation is the same or slightly worse. SQL/replication is probably not the cause, just a side effect of occasional huge load and a huge number of forks/processes on the server(s). So it's probably really kernel (4.2/4.3) and/or systemd related. I found similar trouble tickets about systemd (a too-small limit on processes/tasks/threads per user) on the Arch Linux forum, but I tried their recommended "stress test" and this is probably not the case on F23 with current versions of systemd, kernel, etc.

It's definitely not a hardware fault, as I observe exactly the same problem (huge CPU/fork load and lagging replication) on several servers with various architectures (nVidia, AMD, Intel based, one VM under VBox...). And finally, exactly the same configuration (of SQL replication and everything else) works like a charm on an F21 / 4.1.x kernel instance...

Comment 4 LukasH 2016-02-15 11:47:36 UTC

Problem solved. The cause was heavy iowait/I/O load from the jbd2 journal process, and on the SQL side it was solved by this:

innodb_flush_log_at_trx_commit=2
innodb_flush_log_at_timeout=30

(instead of the default innodb_flush_log_at_trx_commit=1, which syncs the log to disk after every commit).

In any case, something must be different in the ext4/jbd2 code in the kernel, because the same config, SQL traffic and everything else work fine under 4.1.X kernels.
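
The difference between the two settings can be illustrated outside the database. This is a rough sketch, not InnoDB's actual implementation: it appends records to a file and either calls fsync after every record (analogous to innodb_flush_log_at_trx_commit=1) or defers the fsync to the end of the batch (loosely analogous to value 2 with a periodic flush). Function and parameter names are hypothetical. On a journaling filesystem such as ext4, each fsync can force a journal commit, which is one plausible source of the jbd2 load.

```python
import os
import tempfile
import time


def write_log(path, records, sync_every):
    """Append records to path, calling fsync after every `sync_every` records.

    Returns (number_of_fsync_calls, elapsed_seconds).
    """
    syncs = 0
    start = time.perf_counter()
    with open(path, "wb") as log:
        for index, record in enumerate(records, start=1):
            log.write(record)
            if index % sync_every == 0:
                log.flush()             # push Python's buffer to the OS
                os.fsync(log.fileno())  # ask the OS to push it to disk
                syncs += 1
        log.flush()
        os.fsync(log.fileno())          # final sync so both variants end durable
        syncs += 1
    return syncs, time.perf_counter() - start


def compare(record_count=200, record_size=128):
    """Run both strategies on temporary files and return their results."""
    records = [b"x" * record_size for _ in range(record_count)]
    results = {}
    with tempfile.TemporaryDirectory() as workdir:
        for label, sync_every in (("sync_each_commit", 1),
                                  ("sync_per_batch", record_count)):
            path = os.path.join(workdir, label + ".log")
            syncs, elapsed = write_log(path, records, sync_every)
            results[label] = {"fsyncs": syncs, "seconds": elapsed,
                              "bytes": os.path.getsize(path)}
    return results
```

Running `compare()` on the affected filesystem and comparing the "seconds" values under a 4.1 and a 4.2 kernel would be one way to test whether per-commit fsync cost changed between kernel series; the trade-off of the batched variant is that the most recent commits can be lost on a crash or power failure.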