Bug 1957058

Summary:

out of tree module dattobd 0.10.15 blocks all IO when trying to take a snapshot of /boot

Product:

Red Hat Enterprise Linux 8

Reporter:

Neal Gompa <ngompa13>

Component:

kernel

Assignee:

fs-maint

kernel sub component:

XFS

QA Contact:

Zorro Lang <zlang>

Status:

CLOSED NOTABUG

Docs Contact:

Severity:

unspecified

Priority:

unspecified

CC:

bstinson, cmaiolin, daxelrod, drwilliams, esandeen, jwboyer, pdelcol, xzhou

Version:

CentOS Stream

Target Milestone:

beta

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-05-07 03:49:17 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
kernel log events from hang+crash of xfs when trying to take a backup	none

Description Neal Gompa 2021-05-04 23:33:19 UTC

Created attachment 1779602 [details]
kernel log events from hang+crash of xfs when trying to take a backup

Created attachment 1779602 [details]
kernel log events from hang+crash of xfs when trying to take a backup

Description of problem:
When trying to use dattobd-0.10.15 to take a snapshot of /boot, XFS hangs and I/O stops working.

Version-Release number of selected component (if applicable):
4.18.0-301.1.el8

How reproducible:
Always

Steps to Reproduce:
1. Install CentOS-Stream-8-20210503.n.0 with default storage layout (/boot on XFS, / and /home on XFS+LVM)
2. Install repositories
2a. Install EPEL repo, per https://fedoraproject.org/wiki/EPEL#Quickstart
2b. Install datto repo: sudo yum install https://cpkg.datto.com/datto-rpm/repoconfig/datto-el-rpm-release-$(rpm -E %rhel)-latest.noarch.rpm
3. Install dattobd-utils (which also installs dkms-dattobd)
4. Load dattobd module: sudo modprobe dattobd
5. Start to take a backup (More details on the process: https://github.com/datto/dattobd/blob/master/README.md#performing-incremental-backups)
5a. Setup snapshot: sudo dbdctl setup-snapshot /dev/sda1 /boot/.datto 0
5b. Take the backup: sudo dd if=/dev/datto0 of=/root/boot-bkp.img bs=1M
5c. Destroy the datto device: sudo dbdctl destroy 0

Actual results:
It hangs when trying to run "dd" and XFS crashes. Log of the results attached.

Expected results:
Backup is successfully taken.

Additional info:
This worked with RHEL/CentOS 8.3 and does not work with RHEL 8.4 beta either.

Comment 1 Neal Gompa 2021-05-04 23:34:36 UTC

Also, note, dattobd is open source and available here: https://github.com/datto/dattobd

Comment 2 Neal Gompa 2021-05-04 23:35:19 UTC

I know that RHEL kernel bugs automatically are private by default, but I'm fine with this bug being public.

Comment 3 Carlos Maiolino 2021-05-05 09:32:17 UTC

Hi.

I am not familiar with datto, could you please let us know what this error message means?

datto: error finding original_mrf: -14

This is present on several places in the log you provided. A quick search on internet didn't return
anything useful, but a quick look at the datto says this is an error when datto fails to call the original
request queue function.

This is just a guessing here, but it appears datto intercepts bio submissions then pass it back to the original
driver once it does it snapshot thing, and at some point it fails to pass it back to the original device driver.
I didn't read through the whole source code to confirm it though.
But all the hung threads are waiting for IO (saving one which is waiting for a buf lock), which would explain them
waiting forever if the ditto driver fails to submit these IO requests to the original driver.

You did mention it hangs while running dd, but at the logs you provided `dd` itself does not appear there.
The logs you provided are the whole logs or a snippet of the whole kernel logs?

Also, all the hung tasks showed in the logs provided appears to be unrelated with the volume being snapshotted (i.e. /boot),
so, I wonder what influence the datto driver has on the other volumes beyond the one configured to be snapshotted.

Comment 4 Neal Gompa 2021-05-05 10:48:54 UTC

This is the function that makes that error: https://github.com/datto/dattobd/blob/master/src/dattobd.c#L3078-L3108

Your guess is pretty much on point on how the driver works.

So what makes this confusing is that I can take backups of / and /home this way, but the hang occurs when I try to do /boot. I don't know what's different about it being on LVM (DM) vs plain partition.

Comment 5 Neal Gompa 2021-05-05 10:51:19 UTC

I should also point out that I don't actually get far enough to run dd when trying to backup /boot. I/O hangs start pretty much right after setting up the snapshot device.

Comment 6 Eric Sandeen 2021-05-06 19:01:20 UTC

You have shown no "crash" here, you have only timeout messages AFAICT:

May 04 17:24:07 localhost.localdomain kernel: INFO: task xfsaild/dm-0:663 blocked for more than 120 seconds.
May 04 17:24:07 localhost.localdomain kernel: INFO: task in:imjournal:1435 blocked for more than 120 seconds.
May 04 17:26:10 localhost.localdomain kernel: INFO: task xfsaild/dm-0:663 blocked for more than 120 seconds.
May 04 17:26:10 localhost.localdomain kernel: INFO: task NetworkManager:1120 blocked for more than 120 seconds.
May 04 17:26:10 localhost.localdomain kernel: INFO: task in:imjournal:1435 blocked for more than 120 seconds.
May 04 17:26:10 localhost.localdomain kernel: INFO: task systemd-tmpfile:3145 blocked for more than 120 seconds.

You've also loaded an out of tree driver which seems to intercept and alter normal system IO behavior, and it is emitting error messages.

The system (XFS) is now waiting for IO to complete; it's far longer than expected, and so you get these messages.

I don't see any indication that this is an XFS or RHEL bug; somethin - likely your datto module - has stopped IO from completing as expected.

Comment 7 Neal Gompa 2021-05-07 03:06:02 UTC

(In reply to Eric Sandeen from comment #6)
> You have shown no "crash" here, you have only timeout messages AFAICT:
> 
> May 04 17:24:07 localhost.localdomain kernel: INFO: task xfsaild/dm-0:663
> blocked for more than 120 seconds.
> May 04 17:24:07 localhost.localdomain kernel: INFO: task in:imjournal:1435
> blocked for more than 120 seconds.
> May 04 17:26:10 localhost.localdomain kernel: INFO: task xfsaild/dm-0:663
> blocked for more than 120 seconds.
> May 04 17:26:10 localhost.localdomain kernel: INFO: task NetworkManager:1120
> blocked for more than 120 seconds.
> May 04 17:26:10 localhost.localdomain kernel: INFO: task in:imjournal:1435
> blocked for more than 120 seconds.
> May 04 17:26:10 localhost.localdomain kernel: INFO: task
> systemd-tmpfile:3145 blocked for more than 120 seconds.
> 
> You've also loaded an out of tree driver which seems to intercept and alter
> normal system IO behavior, and it is emitting error messages.
> 
> The system (XFS) is now waiting for IO to complete; it's far longer than
> expected, and so you get these messages.
> 
> I don't see any indication that this is an XFS or RHEL bug; somethin -
> likely your datto module - has stopped IO from completing as expected.

So I only indicated a crash based on the backtrace in the log. That may have been a miscalculation on my part.

That being said, I'm willing to accept that it might be dattobd's fault, but I don't understand how that's the case when it works fine with -270 kernel from RHEL 8.3 backing up /boot (regular XFS partition), / (XFS on DM), and /home (XFS on DM). With RHEL 8.4 beta with -293 kernel, / and /home work, but /boot causes this behavior. CentOS Stream 8 with -301 kernel also exhibits this behavior.

Comment 8 Eric Sandeen 2021-05-07 03:49:17 UTC

I don't know what to tell you about the change in behavior, or why it seems unique to /boot. I think you probably need to have datto debug this problem, we won't be able to dig into it.  If you find an actual RHEL bug in the process, feel free to re-open this. But closing NOTABUG from the RHEL perspective for now. Everything points to your hung task messages stemming from the non-RHEL module.