Red Hat Bugzilla – Bug 1281994
Slow SQL replication under the 4.2 kernels
Last modified: 2016-02-15 06:47:36 EST
Description of problem:
After upgrading from F21 to F22, I observed delayed replication between master and slave (both ends run MariaDB 10.0.21). Slave_IO_Running and Slave_SQL_Running both report Yes and nothing looks wrong, but replication starts lagging immediately and the Seconds_Behind_Master value increases continuously.
This looks like a regression in the 4.2 kernel series (see below), since MariaDB is nearly identical in the stable repositories for F21, F22 and F23.
When I reboot back into a 4.1 series kernel, replication returns to normal speed and the Seconds_Behind_Master counter gradually decreases to zero.
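One way to confirm that the lag really keeps growing is to capture the output of SHOW SLAVE STATUS\G several times and compare the Seconds_Behind_Master values. The Python sketch below works on text that has already been captured. The function names and the sample output are illustrative assumptions, not data from the affected servers.

```python
# Sample of the vertical (\G) output format; values are made up for illustration.
SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 120
"""


def parse_slave_status(text):
    """Turn 'Key: value' lines of SHOW SLAVE STATUS output into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = line.partition(":")
        if sep:  # the '*** 1. row ***' separator has no colon and is skipped
            status[key.strip()] = value.strip()
    return status


def lag_seconds(status):
    """Return Seconds_Behind_Master as an int, or None if it is missing or NULL."""
    value = status.get("Seconds_Behind_Master")
    if value is None or value == "NULL":
        return None
    return int(value)


def lag_is_growing(samples):
    """True if the lag strictly increases across all consecutive samples."""
    lags = [lag_seconds(s) for s in samples]
    if len(lags) < 2 or None in lags:
        return False
    return all(later > earlier for earlier, later in zip(lags, lags[1:]))
```

Feeding it a handful of status dumps taken a minute apart is enough to tell a steadily growing delay from a counter that is draining back towards zero.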
Version-Release number of selected component (if applicable):
Affected kernels: 4.2.5-201.fc22, 4.2.3-200.fc22.
Unaffected kernels: 4.1.10-100.fc21 and also 4.1.10-200.fc22, so the problem is not "fc22 specific" but more likely specific to the 4.2 kernel.
How reproducible:
Always. Upgrade to F22 (or probably F23 as well) with a 4.2.X kernel, then start SQL replication. The delay starts growing almost immediately.
Steps to Reproduce:
1. Upgrade system to F22.
2. Reboot into any newer kernel (4.2.X).
3. Start mariadb.service / SQL replication.
Actual results:
The slave lags behind the master.
Expected results:
The relay log position is the same on master and slave, and the Seconds_Behind_Master value is zero.
Which filesystem is in use and are the processes hung or sleeping?
The filesystem is ext4, with nothing unusual in the FS flags. The Fedora install is also unremarkable: a normal "F22 Server Edition", without SELinux enabled.
But I was wrong about the 4.2 kernels; the problem is elsewhere. It looks like the replication delay starts at some point regardless of kernel version, and that point is probably "systemctl restart mariadb.service". A reboot (again, regardless of kernel version) fixes it. I have two F22 SQL slaves. I observed this on the first one and will try to verify the same behaviour on the second one this evening.
In any case, I'm pretty sure that the same action (`systemctl restart mariadb.service') works fine, with no effect on replication delay, on F21. The replication and overall MariaDB configuration there was exactly the same many months ago.
So this bug is very likely related to something in fc22, but I really don't know what or where, or what I should trace or observe to make this bug report more specific.
Some additional info: on 4.3 kernels the replication situation is the same or slightly worse. SQL replication is most likely not the cause, just a side effect of occasional heavy load and a huge number of forks/processes on the server(s). So it is probably kernel (4.2/4.3) and/or systemd related after all. I found similar trouble tickets involving systemd (a limit on processes/tasks/threads per user that was set too low) on the Arch Linux forum, but I tried their recommended "stress test" and this is probably not the case on F23 with current versions of systemd, kernel, etc.
It's definitely not a hardware fault, as I observe exactly the same problem (heavy CPU/fork load and lagging replication) on several servers with various architectures (nVidia, AMD and Intel based, plus one VM under VBox). And finally, exactly the same configuration (of SQL replication and everything else) works like a charm on an F21 / 4.1.x kernel instance.
Problem solved. The cause was heavy iowait/I/O load from the jbd2 journal process and, as it was triggered by SQL, it has been solved by this:
(instead of the default innodb_flush_log_at_trx_commit=1, which forces a sync after every commit).
In any case, something must be different in the ext4/jbd2 code in the kernel, because the same config, SQL traffic and everything else work fine under 4.1.X kernels.
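To check whether a host shows the iowait pattern described above, the aggregate "cpu" line of /proc/stat can be sampled twice and the share of time spent in iowait computed from the difference. This is only a minimal Python sketch: the function names and the sample counter values are assumptions made for illustration, and it measures iowait for the whole system without attributing it to jbd2.

```python
def cpu_fields(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints.

    Field order per proc(5): user, nice, system, idle, iowait, irq,
    softirq, steal, guest, guest_nice.
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(x) for x in parts[1:]]


def iowait_share(before, after):
    """Fraction of CPU time spent in iowait between two 'cpu' line samples."""
    # Only the first eight fields are summed; guest time is already
    # included in user time, so adding it would count it twice.
    b = cpu_fields(before)[:8]
    a = cpu_fields(after)[:8]
    deltas = [new - old for old, new in zip(b, a)]
    total = sum(deltas)
    return deltas[4] / total if total > 0 else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


# Made-up counter values standing in for two samples taken some seconds apart.
BEFORE = "cpu 100 0 50 800 50 0 0 0 0 0"
AFTER = "cpu 110 0 60 820 110 0 0 0 0 0"
```

On a live system, calling read_cpu_line() twice with a pause in between and passing both results to iowait_share() gives the figure to compare between a 4.1 and a 4.2 boot under the same SQL traffic.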