Bug 1711787 - [abrt] btrfs-progs: btrfs_scan_kernel(): btrfs killed by SIGABRT
Summary: [abrt] btrfs-progs: btrfs_scan_kernel(): btrfs killed by SIGABRT
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: btrfs-progs
Version: 29
Hardware: x86_64
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Josef Bacik
QA Contact: Fedora Extras Quality Assurance
URL: https://retrace.fedoraproject.org/faf...
Whiteboard: abrt_hash:ce568361bf4c140a22f1ee24f7b...
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-05-20 06:30 UTC by Peter Hjalmarsson
Modified: 2019-09-06 03:23 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-06 03:23:16 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
File: backtrace (9.12 KB, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: cgroup (351 bytes, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: core_backtrace (1.99 KB, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: cpuinfo (1.13 KB, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: dso_list (732 bytes, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: environ (2.12 KB, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: limits (1.29 KB, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: maps (5.11 KB, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: mountinfo (5.15 KB, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: open_fds (194 bytes, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details
File: proc_pid_status (1.28 KB, text/plain)
2019-05-20 06:30 UTC, Peter Hjalmarsson
no flags Details

Description Peter Hjalmarsson 2019-05-20 06:30:39 UTC
Description of problem:


Version-Release number of selected component:
btrfs-progs-4.20.2-1.fc29

Additional info:
reporter:       libreport-2.10.0
backtrace_rating: 4
cmdline:        btrfs fi sh
crash_function: btrfs_scan_kernel
executable:     /usr/sbin/btrfs
journald_cursor: s=48a6382f23744f27b20631a08fe07737;i=6db05f;b=128dba0ccab24bfab3a1f1b0022be241;m=34068607f4;t=5894baf7b728d;x=f71d228bc0c0d5de
kernel:         5.0.16-200.fc29.x86_64
rootdir:        /
runlevel:       N 3
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (2 frames)
 #5 btrfs_scan_kernel at cmds-filesystem.c:391
 #6 cmd_filesystem_show at cmds-filesystem.c:763

Comment 1 Peter Hjalmarsson 2019-05-20 06:30:42 UTC
Created attachment 1571124 [details]
File: backtrace

Comment 2 Peter Hjalmarsson 2019-05-20 06:30:43 UTC
Created attachment 1571125 [details]
File: cgroup

Comment 3 Peter Hjalmarsson 2019-05-20 06:30:44 UTC
Created attachment 1571126 [details]
File: core_backtrace

Comment 4 Peter Hjalmarsson 2019-05-20 06:30:45 UTC
Created attachment 1571127 [details]
File: cpuinfo

Comment 5 Peter Hjalmarsson 2019-05-20 06:30:46 UTC
Created attachment 1571128 [details]
File: dso_list

Comment 6 Peter Hjalmarsson 2019-05-20 06:30:47 UTC
Created attachment 1571129 [details]
File: environ

Comment 7 Peter Hjalmarsson 2019-05-20 06:30:48 UTC
Created attachment 1571130 [details]
File: limits

Comment 8 Peter Hjalmarsson 2019-05-20 06:30:49 UTC
Created attachment 1571131 [details]
File: maps

Comment 9 Peter Hjalmarsson 2019-05-20 06:30:51 UTC
Created attachment 1571132 [details]
File: mountinfo

Comment 10 Peter Hjalmarsson 2019-05-20 06:30:52 UTC
Created attachment 1571133 [details]
File: open_fds

Comment 11 Peter Hjalmarsson 2019-05-20 06:30:53 UTC
Created attachment 1571134 [details]
File: proc_pid_status

Comment 12 Chris Murphy 2019-05-20 16:40:36 UTC
What was the btrfs command? And can you attach dmesg following the command crash? For whatever reason abrt didn't include kernel messages at all.

Comment 13 Peter Hjalmarsson 2019-05-20 19:16:05 UTC
The command was "btrfs fi sh" or "btrfs filesystem show" as it becomes if its printed out.
I have been running "watch btrfs fi sh" to monitor a "btrfs device delete" running.
The crash seems to happened while the "btrfs device delete" was finishing out.

journalctl lists the following in the logs at the time (note the timestamp of "device deleted"):
May 20 08:17:16 <hostname> kernel: BTRFS info (device bcache4): found 80 extents
May 20 08:17:22 <hostname> audit[32681]: ANOM_ABEND auid=1000 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=32681 comm="btrfs" exe="/usr/sbin/btrfs" sig=6 res=1
May 20 08:17:22 <hostname> systemd[1]: Created slice system-systemd\x2dcoredump.slice.
May 20 08:17:22 <hostname> audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@0-32682-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 08:17:22 <hostname> systemd[1]: Started Process Core Dump (PID 32682/UID 0).
May 20 08:17:23 <hostname> systemd-coredump[32683]: Process 32681 (btrfs) of user 0 dumped core.

                                                       Stack trace of thread 32681:
                                                       #0  0x00007f18ae71057f raise (libc.so.6)
                                                       #1  0x00007f18ae6fa895 abort (libc.so.6)
                                                       #2  0x00007f18ae7539c7 __libc_message (libc.so.6)
                                                       #3  0x00007f18ae75a2cc malloc_printerr (libc.so.6)
                                                       #4  0x00007f18ae75bcdc _int_free (libc.so.6)
                                                       #5  0x0000559b98ccb0d6 n/a (btrfs)
                                                       #6  0x0000559b98c92ad2 main (btrfs)
                                                       #7  0x00007f18ae6fc413 __libc_start_main (libc.so.6)
                                                       #8  0x0000559b98c92cce _start (btrfs)
May 20 08:17:23 <hostname> audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@0-32682-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 08:17:23 <hostname> kernel: BTRFS info (device bcache4): device deleted: /dev/bcache6
May 20 08:17:27 <hostname> dbus-daemon[944]: [system] Activating service name='org.freedesktop.problems' requested by ':1.5918' (uid=0 pid=32753 comm="/usr/bin/python3 /usr/bin/abrt-action-notify -d /v" label="system_u:system_r:abrt_t:s0-s0:c0.c1023") (using servicehelper)
May 20 08:17:27 <hostname> dbus-daemon[944]: [system] Successfully activated service 'org.freedesktop.problems'
May 20 08:17:29 <hostname> abrt-notification[303]: Process 32681 (btrfs) crashed in ??()

Comment 14 Eric Sandeen 2019-05-20 19:17:31 UTC
Looks like the command was:

# btrfs fi sh

And somehow this was freeing an invalid pointer?

        while ((mnt = getmntent(f)) != NULL) {
                free(dev_info_arg);


If you can reproduce this, it might be interesting to run it under valgrind.

Overall, you'll probably have more luck reporting it to the upstream btrfs list.

Comment 15 Chris Murphy 2019-05-20 19:26:23 UTC
Forwarded it to Btrfs list as: 'watch btrfs fi show' crash while 'btrfs device delete'
https://lore.kernel.org/linux-btrfs/CAJCQCtR4NsGc7uA+j2yuZkuQ4NYpvdZ5enkh_-zDd2FhV0pJ7A@mail.gmail.com/

Comment 16 Chris Murphy 2019-05-23 18:38:04 UTC
I tried this with btrfs-progs 5.1 and kernel 5.2.0-0.rc0.git6.1.fc31.x86_64, and could not reproduce the crash.

$ sudo ./mkfs.btrfs -dsingle -msingle /dev/vdb /dev/vdc

$ sudo watch ./btrfs fi show
$ sudo btrfs dev del /dev/vdb /mnt   ##in a separate shell

Every 2.0s: ./btrfs fi show                                                                                    localhost.localdomain: Thu May 23 12:33:30 2019

Label: none  uuid: d82d6a87-6fc0-4e3f-b0d0-942088f51b19
        Total devices 2 FS bytes used 11.98GiB
        devid    1 size 0.00B used 1.02GiB path /dev/vdb
        devid    2 size 200.00GiB used 12.00GiB path /dev/vdc


Every 2.0s: ./btrfs fi show                                                                                    localhost.localdomain: Thu May 23 12:33:47 2019

Label: none  uuid: d82d6a87-6fc0-4e3f-b0d0-942088f51b19
        Total devices 1 FS bytes used 10.98GiB
        devid    2 size 200.00GiB used 12.03GiB path /dev/vdc


I'm not sure if the bug causing the crash is fixed or if it just has transient manifestation.

Comment 17 Peter Hjalmarsson 2019-05-24 12:18:56 UTC
So after some trashing I think I figured this one out, and it seems like something "btrfs filesystem show" is not just done to run while a filesystem is resizing.
Triggering ths seems to be a race-condition.

I am not 100% sure this is the same, since I cannot easily trash around on the original system as easy (backed up data, but restoring is a job I rather skip).
The original system is a x86_64 based system, while I made a test setup of a raspberry pi, a dual-slotted USB-HDD-dock, and two 320G HDDs
According to abrt-cli the crash is a duplicate of this bug, and the stacktraces are rather similar.

The current setup I can hit this with is:
A Raspberry pi 3 running Fedora 30 aarch64 from SD-card
The two HDDs partitioned in two equal sized partitions for a total of four partitions

After this I ran:
# mkfs.btrfs -d raid1 -m raid1 /dev/sd[a-b][1-2]
<..>
# mkdir /mnt/test && mount /dev/sda1 /mnt/test
# btrfs fi df /mnt/test/
Data, RAID1: total=1.00GiB, used=0.00B
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
# btrfs fi sh
Label: none  uuid: c34e4190-674b-4111-ba37-8128c1f120f4
        Total devices 4 FS bytes used 128.00KiB
        devid    1 size 149.04GiB used 1.00GiB path /dev/sda1
        devid    2 size 149.04GiB used 1.00GiB path /dev/sda2
        devid    3 size 149.04GiB used 1.01GiB path /dev/sdb1
        devid    4 size 149.04GiB used 1.01GiB path /dev/sdb2
# btrfs dev del /dev/sda2 /mnt/test
# btrfs dev del /dev/sdb2 /mnt/test
# btrfs dev add /dev/sda2 /mnt/test
# btrfs dev add /dev/sdb2 /mnt/test
# btrfs fi sh
Label: none  uuid: c34e4190-674b-4111-ba37-8128c1f120f4
        Total devices 4 FS bytes used 640.00KiB
        devid    1 size 149.04GiB used 2.03GiB path /dev/sda1
        devid    3 size 149.04GiB used 2.03GiB path /dev/sdb1
        devid    4 size 149.04GiB used 0.00B path /dev/sda2
        devid    5 size 149.04GiB used 0.00B path /dev/sdb2


This makes it possible to maximize the amount of device add/remove from a volume, as removeing any of sda2 or sdb2 does not require moving any big amount of data, and the add/remove seems to be what triggered the behaviour from "btrfs fi sh".
Then I ditch "watch" and run a sime "while true"-loop for "btrfs fi sh" to prevent that the device add/remove happends while watch does it 2 s sleep.

So after this I start the following in one shell:
---
i="0"
while true
do echo $((i++))
btrfs dev del /dev/sda2 /mnt/test/
btrfs dev add /dev/sda2 /mnt/test/
btrfs dev del /dev/sdb2 /mnt/test/
btrfs dev add /dev/sdb2 /mnt/test/
done
---

And in the other shell:
---
while btrfs fi sh ; do true ; done
---


Often the last commando does not need to go for more then 3 to 4 times before a message like the following:

corrupted size vs. prev_size
Aborted (core dumped)

This one leave the following in the journal:
May 24 13:57:14 localhost systemd-coredump[3198]: Process 3193 (btrfs) of user 0 dumped core.

                                                      Stack trace of thread 3193:
                                                      #0  0x0000ffffb669fca0 raise (libc.so.6)
                                                      #1  0x0000ffffb668daa8 abort (libc.so.6)
                                                      #2  0x0000ffffb66d9a0c __libc_message (libc.so.6)
                                                      #3  0x0000ffffb66dffd4 malloc_printerr (libc.so.6)
                                                      #4  0x0000ffffb66e0730 unlink_chunk.isra.0 (libc.so.6)
                                                      #5  0x0000ffffb66e193c _int_free (libc.so.6)
                                                      #6  0x0000ffffb6709c40 closedir (libc.so.6)
                                                      #7  0x0000aaaab1debf48 close_file_or_dir (btrfs)
                                                      #8  0x0000aaaab1dece00 get_fs_info (btrfs)
                                                      #9  0x0000aaaab1e027cc btrfs_scan_kernel (btrfs)
                                                      #10 0x0000aaaab1dcc8dc main (btrfs)
                                                      #11 0x0000ffffb668deec __libc_start_main (libc.so.6)
                                                      #12 0x0000aaaab1dccad8 .annobin_stubs.c_end.startup (btrfs)
                                                      #13 0x0000aaaab1dccad8 .annobin_stubs.c_end.startup (btrfs)
May 24 13:57:14 localhost kernel: BTRFS info (device sda1): device deleted: /dev/sdb2


I also get this from time to time:

free(): invalid next size (normal)
Aborted (core dumped)

May 24 14:02:32 localhost systemd-coredump[5153]: Process 5148 (btrfs) of user 0 dumped core.

                                                      Stack trace of thread 5148:
                                                      #0  0x0000ffffa1491ca0 raise (libc.so.6)
                                                      #1  0x0000ffffa147faa8 abort (libc.so.6)
                                                      #2  0x0000ffffa14cba0c __libc_message (libc.so.6)
                                                      #3  0x0000ffffa14d1fd4 malloc_printerr (libc.so.6)
                                                      #4  0x0000ffffa14d3920 _int_free (libc.so.6)
                                                      #5  0x0000aaaac0ed18c8 btrfs_scan_kernel (btrfs)
                                                      #6  0x0000aaaac0e9b8dc main (btrfs)
                                                      #7  0x0000ffffa147feec __libc_start_main (libc.so.6)
                                                      #8  0x0000aaaac0e9bad8 .annobin_stubs.c_end.startup (btrfs)
                                                      #9  0x0000aaaac0e9bad8 .annobin_stubs.c_end.startup (btrfs)
May 24 14:02:32 localhost systemd[1]: systemd-coredump: Succeeded.
May 24 14:02:32 localhost kernel: BTRFS info (device sda1): device deleted: /dev/sda2


Seems like the btrfs-code some place does a couple of size-checks, and if these tests fails - like for example while resizing your filesystem - you get a SIGABRT.


@chris murphy, are you able to reproduce doing this?

Comment 18 Chris Murphy 2019-05-24 20:09:06 UTC
I've only tested a couple times in a VM on SSD, so it might be the resize and device removal happen fast enough to avoid the race. I still think the conversation is best handled on the upstream Btrfs list because ultimately the fix must come from them. The Red Hat/Fedora bug reporter, at least as far as Btrfs stuff goes, is mainly about packaging problems rather than bugs.

Comment 19 Peter Hjalmarsson 2019-09-02 08:57:01 UTC
Fixed in btrfs-progs v5.1.1

So fixed in Fedora30 since there is now a newer version in the repo.


Note You need to log in before you can comment on or make changes to this bug.