Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user-management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED" with resolution "MIGRATED" and "MigratedToJIRA" set in "Keywords". The link to the successor Jira issue will be found under "Links", will have a little "two-footprint" icon next to it, and will direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 894334

Summary: mirror leg not restored after failure: lvm[5786]: Secondary mirror device 253:4 sync failed.
Product: Red Hat Enterprise Linux 6 Reporter: Marian Csontos <mcsontos>
Component: lvm2    Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED INSUFFICIENT_DATA QA Contact: cluster-qe <cluster-qe>
Severity: high Docs Contact:
Priority: medium    
Version: 6.4    CC: agk, cmarthal, dwysocha, heinzm, jbrassow, mcsontos, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-14 00:08:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1075263    
Attachments:
Description Flags
logs etc none

Description Marian Csontos 2013-01-11 13:17:11 UTC
Created attachment 676851 [details]
logs etc

Description of problem:
After removing a leg[1], the mirror was not repaired by dmeventd. The only messages dmeventd logged to syslog about the leg failure were these two lines:

    Jan 10 13:15:18 zaphodc1-node01 lvm[5786]: Monitoring mirror device helter_skelter-nonsyncd_secondary_2legs_1 for events.
    Jan 10 13:15:35 zaphodc1-node01 lvm[5786]: Secondary mirror device 253:4 sync failed.

dmeventd is stuck waiting in an ioctl; see the backtrace below.

[1] Used the recommended `echo 1 > .../delete`; a udev event was generated (and `pvscan --cache` was executed, though lvmetad was not configured).

Version-Release number of selected component (if applicable):
lvm2-2.02.98-6.18

How reproducible:
???

Steps to Reproduce:
1. ???

Actual results:
The mirror was not repaired.

Expected results:
The mirror should have been repaired.

Additional info:

    Core was generated by `/sbin/dmeventd'.
    #0  0x00007f085a829a47 in ioctl () at ../sysdeps/unix/syscall-template.S:82
    82      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
    Missing separate debuginfos, use: debuginfo-install libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libsepol-2.0.41-4.el6.x86_64

    (gdb) info threads
      2 Thread 0x7f085b76f7a0 (LWP 5786)  0x00007f085a82a4f3 in select () at ../sysdeps/unix/syscall-template.S:82
    * 1 Thread 0x7f0859e90700 (LWP 13968)  0x00007f085a829a47 in ioctl () at ../sysdeps/unix/syscall-template.S:82

    (gdb) thread 2
    [Switching to thread 2 (Thread 0x7f085b76f7a0 (LWP 5786))]#0  0x00007f085a82a4f3 in select () at ../sysdeps/unix/syscall-template.S:82
    82      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
    (gdb) bt full
    #0  0x00007f085a82a4f3 in select () at ../sysdeps/unix/syscall-template.S:82
    No locals.
    #1  0x000000000040514b in _client_read (fifos=0x7fff8389b170, msg=0x7fff8389b120) at dmeventd.c:1335
            t = {tv_sec = 0, tv_usec = 155468}
            bytes = 0
            ret = 0
            fds = {fds_bits = {32, 0 <repeats 15 times>}}
            size = 8
            header = 0x7fff8389b000
            buf = 0x7fff8389b000 "v0LTlDB6mZTMC4h205wLR 65280"
    #2  0x0000000000405777 in _process_request (fifos=0x7fff8389b170) at dmeventd.c:1478
            die = 0
            msg = {cmd = 0, size = 0, data = 0x0}
    #3  0x000000000040695d in main (argc=1, argv=0x7fff8389b278) at dmeventd.c:2007
            opt = -1 '\377'
            fifos = {client = 5, server = 4, client_path = 0x406c2b "/var/run/dmeventd-client",
              server_path = 0x406c44 "/var/run/dmeventd-server"}

    (gdb) thread 1
    [Switching to thread 1 (Thread 0x7f0859e90700 (LWP 13968))]#0  0x00007f085a829a47 in ioctl () at ../sysdeps/unix/syscall-template.S:82
    82      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
    (gdb) bt full
    #0  0x00007f085a829a47 in ioctl () at ../sysdeps/unix/syscall-template.S:82
    No locals.
    #1  0x00007f085ad28049 in _do_dm_ioctl (dmt=0x7f0844000b20, command=3241737480, buffer_repeat_count=0, retry_repeat_count=1,
        retryable=0x7f0859e8fc18) at ioctl/libdm-iface.c:1726
            dmi = 0x7f0844003100
            ioctl_with_uevent = 0
    #2  0x00007f085ad288ad in dm_task_run (dmt=0x7f0844000b20) at ioctl/libdm-iface.c:1844
            dmi = 0x0
            command = 3241737480
            check_udev = 0
            rely_on_udev = 0
            suspended_counter = 0
            ioctl_retry = 1
            retryable = 0
            dev_name = 0x0
            dev_uuid = 0x7f0844000bf0 "LVM-EUowLYZxEkWGJWJr6pY77z70j03YTf2VIwo2fi8k6Yav0LTlDB6mZTMC4h205wLR"
    #3  0x0000000000403afd in _event_wait (thread=0x1671180, task=0x7f0859e8fde8) at dmeventd.c:651
            set = {__val = {18446744067266821880, 69, 140735400226736, 139673854582850, 139673844910528, 1, 139673477319664, 139673860032013,
                0, 139673844906944, 23218096, 139673477319456, 3, 4294967296, 8606191168529454668, 5140695445782813004}}
            ret = 0
            dmt = 0x7f0844000b20
            info = {exists = 1508441520, suspended = 32520, live_table = -2088128592, inactive_table = 32767, open_count = 1508444608,
              event_nr = 32520, major = 0, minor = 0, read_only = 3, target_count = 0}
    #4  0x0000000000403faf in _monitor_thread (arg=0x1671180) at dmeventd.c:787
            __clframe = {__cancel_routine = 0x403d71 <_monitor_unregister>, __cancel_arg = 0x1671180, __do_it = 1, __cancel_type = 0}
            thread = 0x1671180
            wait_error = 1
            task = 0x0
    #5  0x00007f085af3f851 in start_thread (arg=0x7f0859e90700) at pthread_create.c:301
            __res = <value optimized out>
            pd = 0x7f0859e90700
            now = <value optimized out>
            unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139673844909824, -8766475169563209079, 140735400226736, 139673844910528, 0, 3,
                    8667048386050331273, 8667050627825078921}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0,
                  cleanup = 0x0, canceltype = 0}}}
            not_first_call = <value optimized out>
            pagesize_m1 = <value optimized out>
            sp = <value optimized out>
            freesize = <value optimized out>
    #6  0x00007f085a83190d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
    No locals.

Comment 2 Marian Csontos 2013-01-12 07:23:43 UTC
So far the reproducibility is 2/3, and the problem happens only when a non-synchronized mirror loses a leg. I will run more tests to get
a) more accurate reproducibility numbers, and
b) more debugging information.

Comment 6 Jonathan Earl Brassow 2014-04-10 15:57:21 UTC
We are talking about killing a leg before the mirror is fully initialized, right?

Which device are you killing?  You say "leg[1]".  Does that imply *_mimage_1?

I've performed this test, and I got it to hang for a while, but only as long as error messages were streaming to the console.  After that, it seems to complete.

Any more advice for reproducing?  Do you have any other I/O happening at the same time as the background sync?

Comment 7 Marian Csontos 2014-04-14 17:12:11 UTC
(In reply to Jonathan Earl Brassow from comment #6)
> We are talking about killing a leg before the mirror is fully initialized,
> right?

Yes. The secondary leg is killed before the mirror is fully synced.

> 
> Which device are you killing?  You say "leg[1]".  Does that imply *_mimage_1?
> 

No, '[1]' refers to this footnote:

> [1] Used recommended `echo 1 > .../delete` and udev event was generated (and pvscan --cache executed though lvmetad not configured.)

> I've performed this test, and I've got it to hang for me for a while, but
> only as long as error messages are streaming to the console.  After that, it
> seems to complete.
> 
> Any more advice for reproducing?  Do you have any other I/O happening at the
> same time as the background sync?

Yes, there were two processes writing to the FS on the LV.

Comment 9 Jonathan Earl Brassow 2014-06-10 03:08:47 UTC
I've tried this a couple of different ways with RHEL 6.6 RPMs and upstream code.  I've not been able to reproduce it.  Here are the steps I use:
1) create fresh VG
2) create 10G mirror
3) kill secondary image while mirror recovery is happening

I also start I/O (I've used buffered and direct) before and after killing the device.  If I start I/O after killing the device, it ensures that dmeventd first responds to a 'sync' error.  If I start I/O before killing the device, it gives a chance of hitting an 'i/o' error instead of a sync error.
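The steps above can be sketched as a shell function. This is a hypothetical reconstruction, not the tester's actual script: all device, VG, and LV names (/dev/sdb, /dev/sdc, /dev/sdd, vg, mlv) are placeholders, and the function is destructive, so it should only be run as root against scratch devices.

```shell
# Sketch of the reproduction attempt: fresh VG, 10G two-way mirror,
# kill the secondary image while the initial sync is still in progress.
# All names below are hypothetical placeholders.
repro_mirror_leg_failure() {
    vgcreate vg /dev/sdb /dev/sdc /dev/sdd      # 1) create a fresh VG
    lvcreate -m 1 -L 10G -n mlv vg              # 2) create a 10G mirror
    # Optionally start I/O before the kill to bias toward an 'i/o' error
    # rather than a 'sync' error:
    dd if=/dev/zero of=/dev/vg/mlv bs=1M count=100 oflag=direct &
    # 3) while "Cpy%Sync" in `lvs` is still below 100%, remove the device
    #    backing the secondary image (here assumed to be /dev/sdc):
    echo 1 > /sys/block/sdc/device/delete
    wait
}
```

Whether dmeventd then repairs the mirror, or hangs in the ioctl as in the original report, is what this bug is about.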

I'll need some pointers to reproduce this issue.  Have a script?

Comment 12 Red Hat Bugzilla 2023-09-14 01:40:10 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days