Bug 1474503 - btrfs filesystem corruption after using service making heavy use of btrfs subvolumes
Status: NEW
Product: Fedora
Classification: Fedora
Component: kernel
Version: 27
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2017-07-24 15:19 EDT by Bernhard Schuster
Modified: 2018-03-04 11:29 EST

Type: Bug

Description Bernhard Schuster 2017-07-24 15:19:02 EDT
Description of problem:
Whenever I run Concourse CI with a btrfs backend, which creates many btrfs subvolumes and snapshots and then deletes them again, the btrfs filesystem gets corrupted.



Version-Release number of selected component (if applicable):

# uname -a
Linux compute 4.11.12-20170712.amdstg.fc25.x86_64 #1 SMP Sun Jul 23 10:54:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.6.1



How reproducible:
Happened a total of 6 times over a period of 2 months on two different machines and a total of 4 different disks.

Steps to Reproduce:
1. Install the Concourse CI worker as a Docker container.
2. docker pull concourse/concourse
3. Use the following systemd unit file:
# cat /etc/systemd/system/concourse-worker.service
[Unit]
Description=concourse worker
After=docker.service
Requires=docker.service

[Service]
Type=simple
Restart=always
RestartSec=60s
Environment=CONCOURSE_TSA_HOST=ci.spearow.io
Environment=CONCOURSE_TSA_PORT=....
Environment=CONCOURSE_TSA_PUBLIC_KEY=...
Environment=CONCOURSE_WORKER_PRIVATE_KEY=...
Environment=CONCOURSE_WORK_DIR=/var/cache/concourse
Environment=CONCOURSE_KEY_DIR=...
ExecStartPre=-/usr/bin/docker stop concourse-worker
ExecStartPre=-/usr/bin/docker rm -f concourse-worker
ExecStart=/usr/bin/docker run --rm \
	-v ${CONCOURSE_WORK_DIR}:${CONCOURSE_WORK_DIR}:z \
	-v ${CONCOURSE_KEY_DIR}:${CONCOURSE_KEY_DIR}:z \
	--name=concourse-worker \
	--privileged \
	-e CONCOURSE_GARDEN_DNS_SERVER=8.8.8.8 \
	concourse/concourse:latest worker \
		--work-dir=${CONCOURSE_WORK_DIR} \
		--tsa-host=${CONCOURSE_TSA_HOST} \
		--tsa-port=${CONCOURSE_TSA_PORT} \
		--tsa-public-key=${CONCOURSE_KEY_DIR}/${CONCOURSE_TSA_PUBLIC_KEY} \
		--tsa-worker-private-key=${CONCOURSE_KEY_DIR}/${CONCOURSE_WORKER_PRIVATE_KEY}

[Install]
WantedBy=multi-user.target

4. Use it for a while: launch jobs from the UI, do stuff.
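
The subvolume churn described above could perhaps be approximated without Concourse. The following is a minimal sketch, not the workload that triggered this bug: it only assumes that repeated create/snapshot/delete cycles are the relevant pattern. The function names and the vol-N / snap-N naming are hypothetical; the btrfs subcommands are the standard btrfs-progs ones.

```python
import subprocess
from pathlib import PurePosixPath


def churn_commands(root, rounds):
    """Return the btrfs commands for `rounds` create/snapshot/delete cycles.

    `root` is assumed to be a directory on a mounted btrfs filesystem.
    The vol-N and snap-N names are made up for this sketch.
    """
    root = PurePosixPath(root)
    commands = []
    for i in range(rounds):
        vol = str(root / f"vol-{i}")
        snap = str(root / f"snap-{i}")
        commands.append(["btrfs", "subvolume", "create", vol])
        commands.append(["btrfs", "subvolume", "snapshot", vol, snap])
        commands.append(["btrfs", "subvolume", "delete", snap])
        commands.append(["btrfs", "subvolume", "delete", vol])
    return commands


def run(root, rounds):
    """Execute the cycles. Needs root privileges; use a scratch filesystem."""
    for argv in churn_commands(root, rounds):
        subprocess.run(argv, check=True)
```

Calling run() with a directory on a throwaway btrfs filesystem and a round count executes the cycles. Whether this alone reproduces the corruption is untested.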

Actual results:

At some point the worker fails to launch after a reboot, and dmesg shows many btrfs errors:
[   23.539787] BTRFS: device label physalis devid 1 transid 136947 /dev/dm-2
[   23.547491] BTRFS info (device dm-2): disk space caching is enabled
[   23.547491] BTRFS info (device dm-2): has skinny extents
[   23.570775] BTRFS info (device dm-2): detected SSD devices, enabling SSD mode
[   23.579172] BTRFS info (device dm-2): checking UUID tree
[   23.581963] BTRFS error (device dm-2): parent transid verify failed on 834453504 wanted 123019 found 136313
[   23.581976] BTRFS warning (device dm-2): iterating uuid_tree failed -5
[  178.481637] BTRFS info (device dm-2): qgroup_rescan_init failed with -22
[  179.408012] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
[  179.411793] BTRFS error (device dm-2): qgroup scan failed with -5
[  375.669368] BTRFS info (device dm-2): qgroup_rescan_init failed with -22
[  421.194210] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
[  421.194272] BTRFS error (device dm-2): qgroup scan failed with -5
[  577.103169] BTRFS info (device dm-2): qgroup_rescan_init failed with -22
[  577.159385] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
[  577.159409] BTRFS error (device dm-2): qgroup scan failed with -5
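
For triage, the "parent transid verify failed" lines can be grouped by block address to see how many distinct blocks are affected. The sketch below is only an illustration: the regular expression is written against the message format shown in this log, and the function name is hypothetical.

```python
import re

# Matches the message format seen in the dmesg output of this report.
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def summarize_transid_failures(log_text):
    """Map each failing block address to (wanted, found, occurrences)."""
    summary = {}
    for match in TRANSID_RE.finditer(log_text):
        bytenr, wanted, found = (int(g) for g in match.groups())
        _, _, count = summary.get(bytenr, (wanted, found, 0))
        summary[bytenr] = (wanted, found, count + 1)
    return summary


# The four transid lines from the log in this report.
sample = """\
[   23.581963] BTRFS error (device dm-2): parent transid verify failed on 834453504 wanted 123019 found 136313
[  179.408012] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
[  421.194210] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
[  577.159385] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
"""

for bytenr, (wanted, found, count) in summarize_transid_failures(sample).items():
    print(f"block {bytenr}: wanted {wanted}, found {found}, seen {count}x")
```

On this log it reports two distinct blocks: 834453504 once and 31337414656 three times, the latter each time the qgroup scan is retried.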

Expected results:
No btrfs corruption.

Additional info:
There is a Concourse ticket about migrating away from btrfs: https://github.com/concourse/concourse/issues/1045
Comment 1 Laura Abbott 2018-02-27 22:53:07 EST
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale. The kernel moves very fast so bugs may get fixed as part of a kernel update. Due to this, we are doing a mass bug update across all of the Fedora 26 kernel bugs.
 
Fedora 26 has now been rebased to 4.15.4-200.fc26. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 27, and are still experiencing this issue, please change the version to Fedora 27.
 
If you experience different issues, please open a new bug report for those.
