Bug 1957058

Summary: out of tree module dattobd 0.10.15 blocks all IO when trying to take a snapshot of /boot
Product: Red Hat Enterprise Linux 8 Reporter: Neal Gompa <ngompa13>
Component: kernelAssignee: fs-maint
kernel sub component: XFS QA Contact: Zorro Lang <zlang>
Status: CLOSED NOTABUG Docs Contact:
Severity: unspecified    
Priority: unspecified CC: bstinson, cmaiolin, daxelrod, drwilliams, esandeen, jwboyer, pdelcol, xzhou
Version: CentOS Stream   
Target Milestone: beta   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-07 03:49:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
kernel log events from hang+crash of xfs when trying to take a backup none

Description Neal Gompa 2021-05-04 23:33:19 UTC
Created attachment 1779602 [details]
kernel log events from hang+crash of xfs when trying to take a backup

Created attachment 1779602 [details]
kernel log events from hang+crash of xfs when trying to take a backup

Description of problem:
When trying to use dattobd-0.10.15 to take a snapshot of /boot, XFS hangs and I/O stops working.

Version-Release number of selected component (if applicable):
4.18.0-301.1.el8

How reproducible:
Always

Steps to Reproduce:
1. Install CentOS-Stream-8-20210503.n.0 with default storage layout (/boot on XFS, / and /home on XFS+LVM)
2. Install repositories
2a. Install EPEL repo, per https://fedoraproject.org/wiki/EPEL#Quickstart
2b. Install datto repo: sudo yum install https://cpkg.datto.com/datto-rpm/repoconfig/datto-el-rpm-release-$(rpm -E %rhel)-latest.noarch.rpm
3. Install dattobd-utils (which also installs dkms-dattobd)
4. Load dattobd module: sudo modprobe dattobd
5. Start to take a backup (More details on the process: https://github.com/datto/dattobd/blob/master/README.md#performing-incremental-backups)
5a. Setup snapshot: sudo dbdctl setup-snapshot /dev/sda1 /boot/.datto 0
5b. Take the backup: sudo dd if=/dev/datto0 of=/root/boot-bkp.img bs=1M
5c. Destroy the datto device: sudo dbdctl destroy 0

Actual results:
It hangs when trying to run "dd" and XFS crashes. Log of the results attached.

Expected results:
Backup is successfully taken.

Additional info:
This worked with RHEL/CentOS 8.3 and does not work with RHEL 8.4 beta either.

Comment 1 Neal Gompa 2021-05-04 23:34:36 UTC
Also, note, dattobd is open source and available here: https://github.com/datto/dattobd

Comment 2 Neal Gompa 2021-05-04 23:35:19 UTC
I know that RHEL kernel bugs automatically are private by default, but I'm fine with this bug being public.

Comment 3 Carlos Maiolino 2021-05-05 09:32:17 UTC
Hi.

I am not familiar with datto, could you please let us know what this error message means?

datto: error finding original_mrf: -14

This is present on several places in the log you provided. A quick search on internet didn't return
anything useful, but a quick look at the datto says this is an error when datto fails to call the original
request queue function.

This is just a guessing here, but it appears datto intercepts bio submissions then pass it back to the original
driver once it does it snapshot thing, and at some point it fails to pass it back to the original device driver.
I didn't read through the whole source code to confirm it though.
But all the hung threads are waiting for IO (saving one which is waiting for a buf lock), which would explain them
waiting forever if the ditto driver fails to submit these IO requests to the original driver.

You did mention it hangs while running dd, but at the logs you provided `dd` itself does not appear there.
The logs you provided are the whole logs or a snippet of the whole kernel logs?

Also, all the hung tasks showed in the logs provided appears to be unrelated with the volume being snapshotted (i.e. /boot),
so, I wonder what influence the datto driver has on the other volumes beyond the one configured to be snapshotted.

Comment 4 Neal Gompa 2021-05-05 10:48:54 UTC
This is the function that makes that error: https://github.com/datto/dattobd/blob/master/src/dattobd.c#L3078-L3108

Your guess is pretty much on point on how the driver works.

So what makes this confusing is that I can take backups of / and /home this way, but the hang occurs when I try to do /boot. I don't know what's different about it being on LVM (DM) vs plain partition.

Comment 5 Neal Gompa 2021-05-05 10:51:19 UTC
I should also point out that I don't actually get far enough to run dd when trying to backup /boot. I/O hangs start pretty much right after setting up the snapshot device.

Comment 6 Eric Sandeen 2021-05-06 19:01:20 UTC
You have shown no "crash" here, you have only timeout messages AFAICT:

May 04 17:24:07 localhost.localdomain kernel: INFO: task xfsaild/dm-0:663 blocked for more than 120 seconds.
May 04 17:24:07 localhost.localdomain kernel: INFO: task in:imjournal:1435 blocked for more than 120 seconds.
May 04 17:26:10 localhost.localdomain kernel: INFO: task xfsaild/dm-0:663 blocked for more than 120 seconds.
May 04 17:26:10 localhost.localdomain kernel: INFO: task NetworkManager:1120 blocked for more than 120 seconds.
May 04 17:26:10 localhost.localdomain kernel: INFO: task in:imjournal:1435 blocked for more than 120 seconds.
May 04 17:26:10 localhost.localdomain kernel: INFO: task systemd-tmpfile:3145 blocked for more than 120 seconds.

You've also loaded an out of tree driver which seems to intercept and alter normal system IO behavior, and it is emitting error messages.

The system (XFS) is now waiting for IO to complete; it's far longer than expected, and so you get these messages.

I don't see any indication that this is an XFS or RHEL bug; somethin - likely your datto module - has stopped IO from completing as expected.

Comment 7 Neal Gompa 2021-05-07 03:06:02 UTC
(In reply to Eric Sandeen from comment #6)
> You have shown no "crash" here, you have only timeout messages AFAICT:
> 
> May 04 17:24:07 localhost.localdomain kernel: INFO: task xfsaild/dm-0:663
> blocked for more than 120 seconds.
> May 04 17:24:07 localhost.localdomain kernel: INFO: task in:imjournal:1435
> blocked for more than 120 seconds.
> May 04 17:26:10 localhost.localdomain kernel: INFO: task xfsaild/dm-0:663
> blocked for more than 120 seconds.
> May 04 17:26:10 localhost.localdomain kernel: INFO: task NetworkManager:1120
> blocked for more than 120 seconds.
> May 04 17:26:10 localhost.localdomain kernel: INFO: task in:imjournal:1435
> blocked for more than 120 seconds.
> May 04 17:26:10 localhost.localdomain kernel: INFO: task
> systemd-tmpfile:3145 blocked for more than 120 seconds.
> 
> You've also loaded an out of tree driver which seems to intercept and alter
> normal system IO behavior, and it is emitting error messages.
> 
> The system (XFS) is now waiting for IO to complete; it's far longer than
> expected, and so you get these messages.
> 
> I don't see any indication that this is an XFS or RHEL bug; somethin -
> likely your datto module - has stopped IO from completing as expected.

So I only indicated a crash based on the backtrace in the log. That may have been a miscalculation on my part.

That being said, I'm willing to accept that it might be dattobd's fault, but I don't understand how that's the case when it works fine with -270 kernel from RHEL 8.3 backing up /boot (regular XFS partition), / (XFS on DM), and /home (XFS on DM). With RHEL 8.4 beta with -293 kernel, / and /home work, but /boot causes this behavior. CentOS Stream 8 with -301 kernel also exhibits this behavior.

Comment 8 Eric Sandeen 2021-05-07 03:49:17 UTC
I don't know what to tell you about the change in behavior, or why it seems unique to /boot. I think you probably need to have datto debug this problem, we won't be able to dig into it.  If you find an actual RHEL bug in the process, feel free to re-open this. But closing NOTABUG from the RHEL perspective for now. Everything points to your hung task messages stemming from the non-RHEL module.