Bug 1286500 - Tool thin_dump failing to show 'mappings'
Summary: Tool thin_dump failing to show 'mappings'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.2
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: rc
: ---
Assignee: Mike Snitzer
QA Contact: Bruno Goncalves
URL:
Whiteboard:
Duplicates: 1290824 (view as bug list)
Depends On:
Blocks: 1295577 1313485
 
Reported: 2015-11-30 03:56 UTC by Gudge
Modified: 2021-09-06 14:59 UTC (History)
CC List: 12 users

Fixed In Version: kernel-3.10.0-366.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1290912 (view as bug list)
Environment:
Last Closed: 2016-11-03 14:25:32 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch on Kernel 3.10.0-327.4.5 (6.50 KB, patch)
2016-02-16 16:37 UTC, Gudge
shankhabanerjee: review? (thornber)
Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:2574 0 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2016-11-03 12:06:10 UTC

Description Gudge 2015-11-30 03:56:03 UTC
Description of problem:
Tool thin_dump failing to show 'mappings'

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 7.2

How reproducible:
Every time.

Steps to Reproduce: (Reproducible on a Virtual Machine)
1. Create a new Volume Group and a thin pool and thin volume within the thin pool.

   pvcreate /dev/sdc       # Size of the physical disk is 75GB
   vgcreate VG /dev/sdc
   lvcreate -y --extents 100%free --thin VG/POOL
   lvcreate -y --name LV1 --virtualsize 30GB --thinpool VG/POOL 

2. Create xfs file system on the logical volume and mount it.
    mkfs.xfs -L STP /dev/mapper/VG-LV1
    mkdir -p /LV1
    mount -L STP /LV1
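
For reference, the device-mapper names that the step 3 scripts rely on can be confirmed at this point (an optional check, assuming the VG/POOL names above):

    dmsetup ls | grep VG-POOL   # should include VG-POOL-tpool and VG-POOL_tmeta
    lvs -a VG                   # shows POOL, [POOL_tdata], [POOL_tmeta] and LV1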

3. Two Shell Scripts:

a. Script for writing data to the logical volume: dd.sh
 
#!/bin/bash
# Write zeros into four files (4 GB of data each, starting 2 GB into the
# file), syncing after each one so the thin pool keeps allocating blocks.

i=0
while true; do
    i=$((i + 1))
    if [ $i -eq 5 ]; then
        exit 0
    fi
    t=t_$i.txt
    dd if=/dev/zero of=$t bs=2GB count=2 seek=1
    sync
done

b. Script for dumping thin_dump output to a file: thin_dump.sh
#!/bin/bash
# Repeatedly reserve a metadata snapshot, dump it with thin_dump to an
# XML file, then release the snapshot again.

thin_dump_data='/lvm_scripts/thin_dump_data'

POOL='/dev/mapper/VG-POOL'
POOL_TMETA=${POOL}_tmeta     # /dev/mapper/VG-POOL_tmeta
POOL_TPOOL=${POOL}-tpool     # /dev/mapper/VG-POOL-tpool
i=0
while true; do
    i=$((i + 1))

    if [ $i -eq 100 ]; then
        exit 0
    fi

    t=thinDump_$i.xml
    echo "Iteration : " $i

    dmsetup message $POOL_TPOOL 0 reserve_metadata_snap
    echo "Reserve metadata snapshot message status : " $?

    # Field 7 of 'dmsetup status' for the pool is the held metadata root
    block_no=$(dmsetup status $POOL_TPOOL | cut -f 7 -d " ")
    thin_dump -r -f xml $POOL_TMETA -m $block_no > $thin_dump_data/$t
    echo "Thin dump exit status : " $?

    dmsetup message $POOL_TPOOL 0 release_metadata_snap
    echo "Release snapshot message status : " $?

    sleep 3
done


4. Create directory structure:

mkdir -p /lvm_scripts/thin_dump_data

5. Place scripts 3(a) and 3(b) in the directory /lvm_scripts.

6. In one shell, inside the mounted volume (/LV1), run dd.sh.

This will start writing data to the logical volume.

7. In the other shell, run thin_dump.sh.

This will start writing thin_dump output in XML format to /lvm_scripts/thin_dump_data.


8. Once the run of dd.sh finishes, you can kill thin_dump.sh.

9. If you look at the XML files generated by thin_dump, you will notice that some of them are empty:
no single mappings or range mappings are listed.
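
A quick way to spot the empty dumps (a one-off check, not part of the original scripts) is to list the XML files that contain neither mapping element:

    grep -L -E 'single_mapping|range_mapping' /lvm_scripts/thin_dump_data/*.xml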
 

10. I have written another script that parses each XML file, counts the mappings, and prints the size based on the number of mappings and the chunk size.


#!/usr/bin/python

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
import os

class Device:
    def __init__(self, device_id, mapped_blocks, chunk_size):
        self.device_id = int(device_id)
        self.mapped_blocks = int(mapped_blocks)
        self.chunk_size = chunk_size
        self.block_tuple_list = []

    def add_block_info(self, data, length='1'):
        self.block_tuple_list.append((int(data), int(length)))


def get_size(device):
    total_blocks = sum(x[1] for x in device.block_tuple_list)
    total_size = total_blocks * device.chunk_size * 1024
    total_size_kb = total_size / 1024
    total_size_mb = total_size / (1024 * 1024)
    total_size_gb = total_size / (1024 * 1024 * 1024)
    print 'device_id = {} chunk_size = {} KB total_blocks = {} mapped_blocks = {} size = {} {} KB {} MB {} GB'.format(device.device_id, device.chunk_size, total_blocks, device.mapped_blocks, total_size, total_size_kb, total_size_mb, total_size_gb)


def lvm_size_wrapper(xfile):
    if not os.path.exists(xfile):
        return
    print 'File = {}'.format(xfile)
    try:
        tree = ET.parse(xfile)
        root = tree.getroot()
    except ParseError:
        print 'Could not parse file'
        return
    nr_data_blocks = root.attrib['nr_data_blocks']
    data_block_size = root.attrib['data_block_size']
    lvm_chunk_size = (int(data_block_size) * 512)/1024
    print 'nr_data_blocks = {} data_block_size = {} chunk_size = {} KB'.format(nr_data_blocks, data_block_size, lvm_chunk_size)
    device_id_list = []
    device_id_map = {}
    for child in root:
        device = Device(child.attrib['dev_id'], child.attrib['mapped_blocks'], lvm_chunk_size)
        device_id_list.append(device)
        device_id_map[device.device_id] = device

        for e in child:
            if e.tag == 'single_mapping':
                device.add_block_info(e.attrib['data_block'])
            elif e.tag == 'range_mapping':
                device.add_block_info(e.attrib['data_begin'], e.attrib['length'])

    for d in device_id_list:
        get_size(d)


def lvm_helper():
    path='/lvm_scripts/thin_dump_data'
    list_of_files = []
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if filename.endswith('.xml'):
                list_of_files.append(os.sep.join([dirpath, filename]))

    for f in list_of_files:
        lvm_size_wrapper(f)

lvm_helper()



11. This will show that the size gradually increases, then suddenly drops to zero, and then picks up again from where it dropped.
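
As an illustration of the arithmetic the script performs (example numbers only, not taken from this run): with data_block_size = 128 sectors, the chunk size is 128 * 512 / 1024 = 64 KB, so a device reporting 32768 mapped blocks corresponds to 32768 * 64 KB = 2097152 KB = 2048 MB = 2 GB.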


Actual results:
thin_dump does not report mappings correctly even when run on a snapshot of the metadata.


Expected results:
thin_dump should report all the block mappings correctly every time it is run on a snapshot of the metadata.


Additional info:
It takes barely 2 minutes to reproduce the issue, and it is 100% reproducible every time. This happens not just on RHEL 7 but on CentOS 7 as well.

Comment 1 Gudge 2015-12-01 14:57:04 UTC
Please do let me know if you are able to reproduce the issue. If need be, I can share my VM; that may be helpful in reproducing the issue.

Thanks for all your help.

Comment 2 Zdenek Kabelac 2015-12-01 16:02:15 UTC
I've tried a simpler reproducer myself locally:

shell1:

lvcreate -T -L20 vg/pool
while :
do
  lvcreate -V 20 vg/pool -n th
  dd if=/dev/zero of=/dev/vg/th bs=1M count=1 conv=fdatasync
  lvremove -ff vg/th
done

----

shell2

while :
do
        dmsetup message vg-pool-tpool 0 reserve_metadata_snap
        thin_dump -r -f xml /dev/mapper/vg-pool_tmeta -m $(dmsetup status vg-pool-tpool | cut -f 7 -d " ")
        dmsetup message vg-pool-tpool 0 release_metadata_snap
done

----


The end result was a machine deadlock on a bare-metal T61 within a few seconds of the parallel run:


device-mapper: block manager: recursive lock detected in metadata
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map common: unable to decrement a reference count below 0
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map common: unable to decrement a reference count below 0
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map common: unable to decrement a reference count below 0
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 1
device-mapper: thin: dm_thin_get_highest_mapped_block returned -22
device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 1
device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 1
device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 1
device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 1
device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 1
device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 1
device-mapper: thin: dm_thin_get_highest_mapped_block returned -22
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: bufio: leaked buffer 6, hold count 1, list 0
------------[ cut here ]------------
kernel BUG at drivers/md/dm-bufio.c:1484!
invalid opcode: 0000 [#1] SMP 
Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_analog snd_hda_codec_generic iTCO_wdt iTCO_vendor_support arc4 ppdev coretemp kvm_intel kvm mt7601u iwl3945 iwlegacy joydev mac80211 snd_hda_intel snd_hda_codec i2c_i801 cfg80211 snd_hda_core lpc_ich snd_hwdep r592 snd_seq memstick snd_seq_device snd_pcm e1000e snd_timer shpchp thinkpad_acpi ptp snd pps_core wmi parport_pc soundcore tpm_tis fjes parport rfkill tpm acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc sunrpc loop sdhci_pci i915 sdhci i2c_algo_bit drm_kms_helper drm mmc_core serio_raw ata_generic yenta_socket pata_acpi video
CPU: 1 PID: 103 Comm: kworker/u4:7 Not tainted 4.3.0-0.rc7.git2.2.fc24.x86_64 #1
Hardware name: LENOVO 6464CTO/6464CTO, BIOS 7LETC9WW (2.29 ) 03/18/2011
Workqueue: dm-thin do_worker [dm_thin_pool]
task: ffff8800b9b80000 ti: ffff880135e30000 task.ti: ffff880135e30000
RIP: 0010:[<ffffffff8160b2da>]  [<ffffffff8160b2da>] dm_bufio_client_destroy+0x14a/0x1d0
RSP: 0018:ffff880135e33be8  EFLAGS: 00010287
RAX: ffff880135def238 RBX: ffff880135def200 RCX: 0000000000000000
RDX: ffff880135def220 RSI: ffff88013bb0dff8 RDI: ffff88013bb0dff8
RBP: ffff880135e33c10 R08: 0000000000000001 R09: 0000000000000523
R10: ffff8800ac637000 R11: 0000000000000523 R12: ffff880135def248
R13: 0000000000000002 R14: ffff880135def228 R15: ffff880135def210
FS:  0000000000000000(0000) GS:ffff88013bb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fd6f4874c30 CR3: 0000000001c0b000 CR4: 00000000000006e0
Stack:
 ffff88013204d0f0 ffff88013217b000 00000000ffffffea ffff8800ac636800
 ffff8800a9e37e20 ffff880135e33c28 ffffffffa07561f5 ffff88013217b000
 ffff880135e33c40 ffffffffa077124a ffff88013217b158 ffff880135e33c68
Call Trace:
 [<ffffffffa07561f5>] dm_block_manager_destroy+0x15/0x20 [dm_persistent_data]
 [<ffffffffa077124a>] __destroy_persistent_data_objects+0x3a/0x40 [dm_thin_pool]
 [<ffffffffa0773593>] dm_pool_abort_metadata+0x63/0xa0 [dm_thin_pool]
 [<ffffffffa076cfbe>] metadata_operation_failed+0x5e/0x100 [dm_thin_pool]
 [<ffffffffa076e0bb>] alloc_data_block.isra.48+0x8b/0x190 [dm_thin_pool]
 [<ffffffffa07708c0>] process_cell+0x2b0/0x510 [dm_thin_pool]
 [<ffffffff810d4008>] ? dequeue_entity+0x3b8/0xa80
 [<ffffffff81655fd7>] ? skb_release_data+0xa7/0xd0
 [<ffffffffa076b075>] ? process_prepared+0x75/0xc0 [dm_thin_pool]
 [<ffffffffa076fd3e>] do_worker+0x26e/0x830 [dm_thin_pool]
 [<ffffffff810b759e>] process_one_work+0x19e/0x3f0
 [<ffffffff810b783e>] worker_thread+0x4e/0x450
 [<ffffffff8177a801>] ? __schedule+0x371/0x980
 [<ffffffff810b77f0>] ? process_one_work+0x3f0/0x3f0
 [<ffffffff810b77f0>] ? process_one_work+0x3f0/0x3f0
 [<ffffffff810bd6a8>] kthread+0xd8/0xf0
 [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
 [<ffffffff8177f25f>] ret_from_fork+0x3f/0x70
 [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
Code: d2 75 7e 48 8b 53 50 48 85 d2 75 54 48 8b bb 80 00 00 00 e8 49 af ff ff 48 89 df e8 51 73 bf ff 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 49 89 d7 41 8b 57 40 49 8b 77 28 44 89 
RIP  [<ffffffff8160b2da>] dm_bufio_client_destroy+0x14a/0x1d0
 RSP <ffff880135e33be8>
---[ end trace 8ef1e20cefdaef36 ]---
BUG: unable to handle kernel paging request at ffffffffffffffd8
IP: [<ffffffff810bdd00>] kthread_data+0x10/0x20
PGD 1c0e067 PUD 1c10067 PMD 0 
Oops: 0000 [#2] SMP 
Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_analog snd_hda_codec_generic iTCO_wdt iTCO_vendor_support arc4 ppdev coretemp kvm_intel kvm mt7601u iwl3945 iwlegacy joydev mac80211 snd_hda_intel snd_hda_codec i2c_i801 cfg80211 snd_hda_core lpc_ich snd_hwdep r592 snd_seq memstick snd_seq_device snd_pcm e1000e snd_timer shpchp thinkpad_acpi ptp snd pps_core wmi parport_pc soundcore tpm_tis fjes parport rfkill tpm acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc sunrpc loop sdhci_pci i915 sdhci i2c_algo_bit drm_kms_helper drm mmc_core serio_raw ata_generic yenta_socket pata_acpi video
CPU: 0 PID: 103 Comm: kworker/u4:7 Tainted: G      D         4.3.0-0.rc7.git2.2.fc24.x86_64 #1
Hardware name: LENOVO 6464CTO/6464CTO, BIOS 7LETC9WW (2.29 ) 03/18/2011
task: ffff8800b9b80000 ti: ffff880135e30000 task.ti: ffff880135e30000
RIP: 0010:[<ffffffff810bdd00>]  [<ffffffff810bdd00>] kthread_data+0x10/0x20
RSP: 0018:ffff880135e338b8  EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81f27e40
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800b9b80000
RBP: ffff880135e338b8 R08: ffff8800b9b80088 R09: 00000020e0e3ef77
R10: ffff8800b9b80060 R11: 0000000000000003 R12: 0000000000016c80
R13: ffff8800b9b80000 R14: ffff88013ba16c80 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 0000000001c0b000 CR4: 00000000000006f0
Stack:
 ffff880135e338d0 ffffffff810b8401 ffff88013ba16c80 ffff880135e33918
 ffffffff8177aab0 ffff880100000000 ffff8800b9b80000 ffff880135e34000
 ffff880135e33970 ffff880135e33970 ffff880135e33468 ffff880135e33468
Call Trace:
 [<ffffffff810b8401>] wq_worker_sleeping+0x11/0x90
 [<ffffffff8177aab0>] __schedule+0x620/0x980
 [<ffffffff8177ae43>] schedule+0x33/0x80
 [<ffffffff810a229a>] do_exit+0x80a/0xae0
 [<ffffffff8101888a>] oops_end+0x9a/0xd0
 [<ffffffff81018d4b>] die+0x4b/0x70
 [<ffffffff81015d11>] do_trap+0xb1/0x140
 [<ffffffff810160b9>] do_error_trap+0x89/0x110
 [<ffffffff8160b2da>] ? dm_bufio_client_destroy+0x14a/0x1d0
 [<ffffffff8177e96e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
 [<ffffffff8118901e>] ? irq_work_queue+0x8e/0xa0
 [<ffffffff810f47f1>] ? console_unlock+0x201/0x520
 [<ffffffff81016670>] do_invalid_op+0x20/0x30
 [<ffffffff81780a9e>] invalid_op+0x1e/0x30
 [<ffffffff8160b2da>] ? dm_bufio_client_destroy+0x14a/0x1d0
 [<ffffffff8160b2fc>] ? dm_bufio_client_destroy+0x16c/0x1d0
 [<ffffffffa07561f5>] dm_block_manager_destroy+0x15/0x20 [dm_persistent_data]
 [<ffffffffa077124a>] __destroy_persistent_data_objects+0x3a/0x40 [dm_thin_pool]
 [<ffffffffa0773593>] dm_pool_abort_metadata+0x63/0xa0 [dm_thin_pool]
 [<ffffffffa076cfbe>] metadata_operation_failed+0x5e/0x100 [dm_thin_pool]
 [<ffffffffa076e0bb>] alloc_data_block.isra.48+0x8b/0x190 [dm_thin_pool]
 [<ffffffffa07708c0>] process_cell+0x2b0/0x510 [dm_thin_pool]
 [<ffffffff810d4008>] ? dequeue_entity+0x3b8/0xa80
 [<ffffffff81655fd7>] ? skb_release_data+0xa7/0xd0
 [<ffffffffa076b075>] ? process_prepared+0x75/0xc0 [dm_thin_pool]
 [<ffffffffa076fd3e>] do_worker+0x26e/0x830 [dm_thin_pool]
 [<ffffffff810b759e>] process_one_work+0x19e/0x3f0
 [<ffffffff810b783e>] worker_thread+0x4e/0x450
 [<ffffffff8177a801>] ? __schedule+0x371/0x980
 [<ffffffff810b77f0>] ? process_one_work+0x3f0/0x3f0
 [<ffffffff810b77f0>] ? process_one_work+0x3f0/0x3f0
 [<ffffffff810bd6a8>] kthread+0xd8/0xf0
 [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
 [<ffffffff8177f25f>] ret_from_fork+0x3f/0x70
 [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
Code: cf 48 89 e7 e8 92 e4 6b 00 e9 55 ff ff ff e8 18 12 fe ff 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 80 05 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 
RIP  [<ffffffff810bdd00>] kthread_data+0x10/0x20
 RSP <ffff880135e338b8>
CR2: ffffffffffffffd8
---[ end trace 8ef1e20cefdaef37 ]---
Fixing recursive fault but reboot is needed!

Comment 3 Mike Snitzer 2015-12-01 16:58:22 UTC
(In reply to Zdenek Kabelac from comment #2)
> I've tried simplier reproducer myself locally:
> 
> shell1:
> 
> lvcreate -T -L20 vg/pool
> while :
> do
>   lvcreate -V 20 vg/pool -n th
>   dd if=/dev/zero of=/dev/vg/th bs=1M count=1 conv=fdatasync
>   lvremove -ff vg/th
> done
> 
> ----
> 
> shell2
> 
> while :
> do
>         dmsetup message vg-pool-tpool 0 reserve_metadata_snap
>         thin_dump -r -f xml /dev/mapper/vg-pool_tmeta -m $(dmsetup status
> vg-pool-tpool | cut -f 7 -d " ")
>         dmsetup message vg-pool-tpool 0 release_metadata_snap
> done
> 
> ----
> 
> 
> end result was machine deadlock on bare metal T61 in a few seconds of
> parallel run:

We obviously need to add much more safety to the kernel interfaces.
But IIRC the thin-pool should be suspended _before_ the 'reserve_metadata_snap'.
(So it seems like the kernel code should fail the 'reserve_metadata_snap' if the pool isn't suspended).

That said, I could be mistaken.  Joe is really the one who needs to weigh in here.

Comment 4 Joe Thornber 2015-12-01 17:05:41 UTC
I'll take the bug.  Not able to look at it for a couple of days however.

Comment 5 Gudge 2015-12-01 17:17:06 UTC
> We obviously need to add much more safety to the kernel interfaces.
> But IIRC the thin-pool should be suspended _before_ the
> 'reserve_metadata_snap'.
> (So it seems like the kernel code should fail the 'reserve_metadata_snap' if
> the pool isn't suspended).
> 
> That said, I could be mistaken.  Joe is really the one who needs to weigh-in
> here.

I did try suspending the metadata volume. On my setup, if I suspend the metadata volume, I am not able to take a snapshot; the command hangs.

dmsetup suspend /dev/mapper/vg-pool_tmeta
dmsetup message vg-pool-tpool 0 reserve_metadata_snap  ---> Hangs

Comment 6 Mike Snitzer 2015-12-02 00:06:44 UTC
(In reply to Gudge from comment #5)
> > We obviously need to add much more safety to the kernel interfaces.
> > But IIRC the thin-pool should be suspended _before_ the
> > 'reserve_metadata_snap'.
> > (So it seems like the kernel code should fail the 'reserve_metadata_snap' if
> > the pool isn't suspended).
> > 
> > That said, I could be mistaken.  Joe is really the one who needs to weigh-in
> > here.
> 
> I did try suspending the metadata volume. On my setup If you suspend the
> metadata volume I am not able to take a snapshot. The command hangs.
> 
> dmsetup suspend /dev/mapper/vg-pool_tmeta
> dmsetup message vg-pool-tpool 0 reserve_metadata_snap  ---> Hangs

No, I was saying suspend the thin-pool (not its underlying metadata device).

But I checked with Joe and he said that the suspend is only needed to get consistent usage info (and to have all mappings associated with outstanding IO on disk).  The thin-pool suspend isn't required to avoid the crashes reported in comment 2.

So there are 2 different things related to this BZ that need to be explored.

Comment 7 Gudge 2015-12-02 01:06:21 UTC
(In reply to Mike Snitzer from comment #6)
> No, I was saying suspend the thin-pool (not its underlying metadata device).
> 

Side question:

If one suspends the thin pool, would that not affect the I/Os on the logical volumes that are part of the thin pool?

Will they wait for the thin pool to be resumed, or will the I/Os go through?

Comment 8 Joe Thornber 2015-12-04 14:10:47 UTC
Addressing the original issue here rather than Kabi's crash (which I haven't reproduced yet):

If the pool is active, the kernel can be updating the metadata at any time.  This means userland tools cannot expect a consistent view of the metadata from an active pool (think of btree nodes being updated by the kernel at the same time as thin_dump is reading them).

There is a facility to get around this called metadata snapshots.  Basically, the pool is told, via a dmsetup message, to take a snapshot of the metadata.  You then pass the -m switch to thin_dump (I recommend not giving the snap location; thin_dump will pick up the snap location itself).  Make sure you drop the metadata snap once you've finished examining the metadata, since holding it causes a small performance penalty for thin IO.

I've written an example test for you that takes you through this process:

https://github.com/jthornber/device-mapper-test-suite/commit/d46300f61bae34f242fe63a7a77ccb343d86c1d5

Note the results of thin_dump reflect the metadata as it was when the 'reserve_metadata_snap' message was sent to the pool.

The thin devices and pools do not need to be suspended before taking the metadata snap.
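
A minimal sketch of that sequence, using the vg/pool device names from comment 2 (output file name is arbitrary; thin_dump picks up the metadata-snap location itself when -m is given without a block number):

    dmsetup message vg-pool-tpool 0 reserve_metadata_snap
    thin_dump -r -f xml -m /dev/mapper/vg-pool_tmeta > /tmp/metadata_snap_dump.xml
    # release promptly; holding the snap costs a little thin IO performance
    dmsetup message vg-pool-tpool 0 release_metadata_snap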

Comment 9 Joe Thornber 2015-12-04 14:45:36 UTC
Kabi's bug reproduced with this test:

https://github.com/jthornber/device-mapper-test-suite/commit/3e2c2df1a79eda6cb640e2fd5466b708df3d56c3

Comment 12 Bruno Goncalves 2015-12-11 14:46:03 UTC
*** Bug 1290824 has been marked as a duplicate of this bug. ***

Comment 14 Gudge 2016-02-16 16:37:23 UTC
Created attachment 1127630 [details]
Patch on Kernel 3.10.0-327.4.5

Hi,
Could someone please review the patch and let me know if it is correct?
I have ported the patches mentioned in comment 10 to 3.10.0-327.4.5.

Thanks

Comment 15 Gudge 2016-02-16 17:18:40 UTC
(In reply to Gudge from comment #14)
> Created attachment 1127630 [details]
> Patch on Kernel 3.10.0-327.4.5
> 
> Hi,
> Could someone please review the patch and let me know if it is correct. 
> I have ported the patches mentioned on Comment 10 to 3.10.0-327.4.5
> 
> Thanks

The patch does fix the issue reported in Comment 2.

I do still see metadata corruption while taking a snap and writing to the thin volume in parallel (on patched 3.10.0-327.4.5).

I do not have a small reproducible test case.

Thanks

Comment 16 Rafael Aquini 2016-03-21 11:20:15 UTC
Patch(es) available on kernel-3.10.0-366.el7

Comment 19 Bruno Goncalves 2016-05-19 14:54:18 UTC
The patch worked well.
Tested on RHEL-7.2 with kernel 3.10.0-366.el7


# dmtest run --profile mytest --suite thin-provisioning -t /ToolsTests/
ToolsTests
  metadata_snap_stress1...PASS
  metadata_snap_stress2...iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
PASS
<snip>

Comment 22 errata-xmlrpc 2016-11-03 14:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2574.html

