Bug 1474503 - btrfs filesystem corruption after using service making heavy use of btrfs subvolumes
Status: NEW
Product: Fedora
Classification: Fedora
Component: kernel
Version: 26
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks:
Reported: 2017-07-24 15:19 EDT by Bernhard Schuster
Modified: 2017-07-24 15:19 EDT (History)
CC List: 7 users

Type: Bug


Attachments: None

Description Bernhard Schuster 2017-07-24 15:19:02 EDT
Description of problem:
Whenever I run Concourse CI with a btrfs backend, which creates a lot of btrfs subvolumes and snapshots and deletes them again, the btrfs filesystem gets corrupted.
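
The subvolume churn described above can be approximated without Concourse. The following is only a minimal sketch: the function name churn_subvolumes, the vol-N/snap-N naming and the scratch mount point are assumptions made for illustration, it is not what Concourse itself runs, and it has not been confirmed to trigger the corruption. Setting BTRFS="echo btrfs" prints the commands instead of executing them.

```shell
# Hypothetical stress loop: create a subvolume, snapshot it, delete both.
# BTRFS may be overridden (e.g. BTRFS="echo btrfs") for a dry run.
BTRFS=${BTRFS:-btrfs}

churn_subvolumes() {
    target=$1          # directory on a disposable btrfs mount (assumption)
    rounds=${2:-100}   # number of create/snapshot/delete cycles
    i=1
    while [ "$i" -le "$rounds" ]; do
        # $BTRFS is intentionally unquoted so "echo btrfs" splits into words
        $BTRFS subvolume create "$target/vol-$i" || return 1
        $BTRFS subvolume snapshot "$target/vol-$i" "$target/snap-$i" || return 1
        $BTRFS subvolume delete "$target/snap-$i" || return 1
        $BTRFS subvolume delete "$target/vol-$i" || return 1
        i=$((i + 1))
    done
}
```

Run as root against a filesystem that holds no data worth keeping, for example: churn_subvolumes /mnt/scratch 1000 (the path is a placeholder).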



Version-Release number of selected component (if applicable):

# uname -a
Linux compute 4.11.12-20170712.amdstg.fc25.x86_64 #1 SMP Sun Jul 23 10:54:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version   
btrfs-progs v4.6.1



How reproducible:
Happened a total of 6 times over a period of 2 months on two different machines and a total of 4 different disks.

Steps to Reproduce:
1. Install the Concourse CI worker as a Docker container.
2. docker pull concourse/concourse
3. Use the following systemd unit file:
# cat /etc/systemd/system/concourse-worker.service
[Unit]
Description=concourse worker
After=docker.service
Requires=docker.service

[Service]
Type=simple
Restart=always
RestartSec=60s
Environment=CONCOURSE_TSA_HOST=ci.spearow.io
Environment=CONCOURSE_TSA_PORT=....
Environment=CONCOURSE_TSA_PUBLIC_KEY=...
Environment=CONCOURSE_WORKER_PRIVATE_KEY=...
Environment=CONCOURSE_WORK_DIR=/var/cache/concourse
Environment=CONCOURSE_KEY_DIR=...
ExecStartPre=-/usr/bin/docker stop concourse-worker
ExecStartPre=-/usr/bin/docker rm -f concourse-worker
ExecStart=/usr/bin/docker run --rm \
	-v ${CONCOURSE_WORK_DIR}:${CONCOURSE_WORK_DIR}:z \
	-v ${CONCOURSE_KEY_DIR}:${CONCOURSE_KEY_DIR}:z \
	--name=concourse-worker \
	--privileged \
	-e CONCOURSE_GARDEN_DNS_SERVER=8.8.8.8 \
	concourse/concourse:latest worker \
		--work-dir=${CONCOURSE_WORK_DIR} \
		--tsa-host=${CONCOURSE_TSA_HOST} \
		--tsa-port=${CONCOURSE_TSA_PORT} \
		--tsa-public-key=${CONCOURSE_KEY_DIR}/${CONCOURSE_TSA_PUBLIC_KEY} \
		--tsa-worker-private-key=${CONCOURSE_KEY_DIR}/${CONCOURSE_WORKER_PRIVATE_KEY}

[Install]
WantedBy=multi-user.target

4. Use it for a while: launch jobs from the UI and so on.

Actual results:

At some point the worker fails to launch after a reboot, and dmesg shows a lot of btrfs errors:
[   23.539787] BTRFS: device label physalis devid 1 transid 136947 /dev/dm-2
[   23.547491] BTRFS info (device dm-2): disk space caching is enabled
[   23.547491] BTRFS info (device dm-2): has skinny extents
[   23.570775] BTRFS info (device dm-2): detected SSD devices, enabling SSD mode
[   23.579172] BTRFS info (device dm-2): checking UUID tree
[   23.581963] BTRFS error (device dm-2): parent transid verify failed on 834453504 wanted 123019 found 136313
[   23.581976] BTRFS warning (device dm-2): iterating uuid_tree failed -5
[  178.481637] BTRFS info (device dm-2): qgroup_rescan_init failed with -22
[  179.408012] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
[  179.411793] BTRFS error (device dm-2): qgroup scan failed with -5
[  375.669368] BTRFS info (device dm-2): qgroup_rescan_init failed with -22
[  421.194210] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
[  421.194272] BTRFS error (device dm-2): qgroup scan failed with -5
[  577.103169] BTRFS info (device dm-2): qgroup_rescan_init failed with -22
[  577.159385] BTRFS error (device dm-2): parent transid verify failed on 31337414656 wanted 123062 found 136980
[  577.159409] BTRFS error (device dm-2): qgroup scan failed with -5
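
To make the repeated messages above easier to compare, the transid failures can be reduced to one line per affected block. This is a minimal sketch; the function name summarize_transid_errors and the output layout are made up for illustration. As far as I understand the message, "wanted" is the generation the parent pointer expects and "found" is the generation stored in the block that was actually read, so the gap is how many transactions newer the block is than the tree referencing it.

```shell
# Summarise "parent transid verify failed" lines from kernel log text on stdin.
# Prints one line per distinct (block, wanted, found) triple, with found - wanted.
summarize_transid_errors() {
    sed -n 's/.*parent transid verify failed on \([0-9]*\) wanted \([0-9]*\) found \([0-9]*\).*/\1 \2 \3/p' |
        sort -u |
        while read -r bytenr wanted found; do
            echo "$bytenr wanted=$wanted found=$found gap=$((found - wanted))"
        done
}
```

Usage: dmesg | summarize_transid_errors. For the log above this reports two distinct blocks, 834453504 (gap 13294) and 31337414656 (gap 13918).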

Expected results:
No btrfs corruption.

Additional info:
There is a ticket for Concourse to migrate away from btrfs: https://github.com/concourse/concourse/issues/1045
