Bug 682767 - GFS2 mount hangs after some time
Summary: GFS2 mount hangs after some time
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Ben Marzinski
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-07 14:36 UTC by Jakub Hrozek
Modified: 2011-07-15 15:24 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-15 15:24:05 UTC
Target Upstream Version:


Attachments (Terms of Use)
/etc/multipath.conf (2.54 KB, text/plain)
2011-07-11 08:06 UTC, Jakub Hrozek

Description Jakub Hrozek 2011-03-07 14:36:58 UTC
Description of problem:
We use the GFS2 filesystem on an IBM DS3000/1726. From time to time we see a kernel call trace in the logs, and the mountpoint stalls afterwards (ls hangs, the mountpoint cannot be unmounted, etc.).

We use the RHEL6 packaged ql2400-firmware-5.03.02-1 firmware for the SAN.

Version-Release number of selected component (if applicable):
kernel-2.6.32-71.el6.x86_64
ql2400-firmware-5.03.02-1.el6.noarch

How reproducible:
Not reliably; sometimes it works fine for a week, sometimes it crashes sooner.

Steps to Reproduce:
1. multipath the disks provided by the SAN
2. format the multipathed device with GFS2
3. mount the GFS2 FS
4. write to the FS
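
The steps above can be sketched as shell commands. This is an illustrative sketch only: the cluster name ("mycluster") and file-system name ("storage") are assumptions that must match your cluster.conf, and -j 8 matches the journal count mentioned in comment 3.

```shell
# Illustrative sketch -- cluster name and FS name are assumptions;
# adjust to match your cluster.conf before running.
DEV=/dev/mapper/banana_tree

# 2. format the multipathed device with GFS2 (8 journals, DLM locking)
mkfs.gfs2 -p lock_dlm -t mycluster:storage -j 8 "$DEV"

# 3. mount the GFS2 FS
mount -t gfs2 "$DEV" /mnt/storage

# 4. write to the FS to generate load
dd if=/dev/zero of=/mnt/storage/load.img bs=1M count=1024
```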
  
Actual results:
Mar  3 05:30:11 sam kernel: INFO: task scsi_wq_1:317 blocked for more than 120 seconds.
Mar  3 05:30:11 sam kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  3 05:30:11 sam kernel: scsi_wq_1     D ffff880551777ac0     0   317      2 0x00000000
Mar  3 05:30:11 sam kernel: ffff8805517779e0 0000000000000046 0000000000000000 ffff880551590000
Mar  3 05:30:11 sam kernel: ffff88055151f240 0000000000000246 ffff880551777a00 ffff880557191880
Mar  3 05:30:11 sam kernel: ffff880552dc1ad8 ffff880551777fd8 0000000000010518 ffff880552dc1ad8
Mar  3 05:30:11 sam kernel: Call Trace:
Mar  3 05:30:11 sam kernel: [<ffffffff814c8fc5>] schedule_timeout+0x225/0x2f0
Mar  3 05:30:11 sam kernel: [<ffffffff81255ecf>] ? cfq_set_request+0x18f/0x520
Mar  3 05:30:11 sam kernel: [<ffffffff8110e635>] ? mempool_alloc_slab+0x15/0x20
Mar  3 05:30:11 sam kernel: [<ffffffff814c8c33>] wait_for_common+0x123/0x180
Mar  3 05:30:11 sam kernel: [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
Mar  3 05:30:11 sam kernel: [<ffffffff814c8d4d>] wait_for_completion+0x1d/0x20
Mar  3 05:30:12 sam kernel: [<ffffffff812464ec>] blk_execute_rq+0x8c/0xf0
Mar  3 05:30:12 sam kernel: [<ffffffff81241240>] ? blk_rq_bio_prep+0x30/0xc0
Mar  3 05:30:12 sam kernel: [<ffffffff81246066>] ? blk_rq_map_kern+0xd6/0x150
Mar  3 05:30:12 sam kernel: [<ffffffff8134ab5c>] scsi_execute+0xfc/0x160
Mar  3 05:30:12 sam kernel: [<ffffffff8134adb6>] scsi_execute_req+0xb6/0x190
Mar  3 05:30:12 sam kernel: [<ffffffff8134d380>] __scsi_scan_target+0x2c0/0x750
Mar  3 05:30:12 sam kernel: [<ffffffff81056630>] ? __dequeue_entity+0x30/0x50
Mar  3 05:30:12 sam kernel: [<ffffffff81059d12>] ? finish_task_switch+0x42/0xd0
Mar  3 05:30:12 sam kernel: [<ffffffff8134df40>] scsi_scan_target+0xd0/0xe0
Mar  3 05:30:12 sam kernel: [<ffffffffa016b8fd>] fc_scsi_scan_rport+0xbd/0xc0 [scsi_transport_fc]
Mar  3 05:30:12 sam kernel: [<ffffffffa016b840>] ? fc_scsi_scan_rport+0x0/0xc0 [scsi_transport_fc]
Mar  3 05:30:12 sam kernel: [<ffffffff8108c610>] worker_thread+0x170/0x2a0
Mar  3 05:30:12 sam kernel: [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
Mar  3 05:30:12 sam kernel: [<ffffffff8108c4a0>] ? worker_thread+0x0/0x2a0
Mar  3 05:30:12 sam kernel: [<ffffffff81091936>] kthread+0x96/0xa0
Mar  3 05:30:12 sam kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Mar  3 05:30:12 sam kernel: [<ffffffff810918a0>] ? kthread+0x0/0xa0
Mar  3 05:30:12 sam kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20
Mar  3 05:30:12 sam kernel: INFO: task gfs2_logd:3735 blocked for more than 120 seconds.
Mar  3 05:30:12 sam kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  3 05:30:12 sam kernel: gfs2_logd     D 0000000000000002     0  3735      2 0x00000000
Mar  3 05:30:12 sam kernel: ffff88032ba2dc80 0000000000000046 ffff88032ba2dc40 ffffffffa000471c
Mar  3 05:30:12 sam kernel: ffff8805511a2380 ffff8802e7568200 ffff88032ba2dd20 0000000000000002
Mar  3 05:30:12 sam kernel: ffff8803a78ca678 ffff88032ba2dfd8 0000000000010518 ffff8803a78ca678
Mar  3 05:30:12 sam kernel: Call Trace:
Mar  3 05:30:12 sam kernel: [<ffffffffa000471c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Mar  3 05:30:12 sam kernel: [<ffffffff8109b9a9>] ? ktime_get_ts+0xa9/0xe0
Mar  3 05:30:12 sam kernel: [<ffffffff8119dcf0>] ? sync_buffer+0x0/0x50
Mar  3 05:30:12 sam kernel: [<ffffffff814c8a23>] io_schedule+0x73/0xc0
Mar  3 05:30:12 sam kernel: [<ffffffff8119dd30>] sync_buffer+0x40/0x50
Mar  3 05:30:12 sam kernel: [<ffffffff814c929f>] __wait_on_bit+0x5f/0x90
Mar  3 05:30:12 sam kernel: [<ffffffff8119dcf0>] ? sync_buffer+0x0/0x50
Mar  3 05:30:12 sam kernel: [<ffffffff814c9348>] out_of_line_wait_on_bit+0x78/0x90
Mar  3 05:30:12 sam kernel: [<ffffffff81091ce0>] ? wake_bit_function+0x0/0x50
Mar  3 05:30:12 sam kernel: [<ffffffff8119dce6>] __wait_on_buffer+0x26/0x30
Mar  3 05:30:12 sam kernel: [<ffffffffa0625c1f>] log_write_header+0x1cf/0x520 [gfs2]
Mar  3 05:30:13 sam kernel: [<ffffffffa062653a>] gfs2_log_flush+0x2ea/0x6b0 [gfs2]
Mar  3 05:30:13 sam kernel: [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
Mar  3 05:30:13 sam kernel: [<ffffffffa06269e1>] gfs2_logd+0xe1/0x150 [gfs2]
Mar  3 05:30:13 sam kernel: [<ffffffffa0626900>] ? gfs2_logd+0x0/0x150 [gfs2]
Mar  3 05:30:13 sam kernel: [<ffffffff81091936>] kthread+0x96/0xa0
Mar  3 05:30:13 sam kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Mar  3 05:30:13 sam kernel: [<ffffffff810918a0>] ? kthread+0x0/0xa0
Mar  3 05:30:13 sam kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20


Expected results:
mountpoint keeps working

Additional info:
The GFS2 filesystem is mounted with default options:
$ grep gfs2 /etc/fstab 
/dev/mapper/banana_tree             /mnt/storage            gfs2    defaults        0 0

The multipath device is configured as follows:
multipaths {
        multipath {
                wwid                    3600a0b80005a5a69000003614a1d19ee
                alias                   banana_tree
                path_grouping_policy    multibus
                path_checker            readsector0
                path_selector           "round-robin 0"
                failback                manual
                rr_weight               priorities
                no_path_retry           5
        }
}

Comment 2 Steve Whitehouse 2011-03-07 15:14:10 UTC
At first glance this looks like it is a block device issue. I would suggest adding noatime to the mount flags, although that is not going to solve this particular issue.

Could we have a brief description of the setup here? How many nodes are there, what are the specs, and what is the application?

Comment 3 Jakub Hrozek 2011-03-07 15:56:01 UTC
(In reply to comment #2)
> At first glance this looks like it is a block device issue. I would suggest
> adding noatime to the mount flags, although that is not going to solve this
> particular issue.
> 

Good suggestion, thanks.

> Could we have a brief description of the set up here? How many nodes are there,
> what are the specs and what is the application?

I could see the behaviour with 5 nodes, which is my desired full setup, and I can reproduce the issue with 2 nodes; I downscaled the cluster because every time this bug happened I had to reboot the machines. The GFS2 filesystem has 8 journals: one per machine plus 3 more just in case.

Not sure which specs in particular you would like to know; all machines are IBM eServer BladeCenter HS21 blades running RHEL 6.0. The cluster (and right now the machines individually) is used to host VM images, so the storage contains a smallish number of big files.

Here's what multipath -ll has got to say about the SAN:

# multipath -ll
banana_tree (3600a0b80005a5a69000003614a1d19ee) dm-2 IBM,1726-4xx  FAStT
size=1.1T features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 2:0:1:0 sde 8:64 active ghost  running
  |- 1:0:1:0 sdc 8:32 active ready  running
  |- 2:0:0:0 sdd 8:48 active ready  running
  `- 1:0:0:0 sdb 8:16 active ghost  running

Comment 4 Steve Whitehouse 2011-03-09 12:05:22 UTC
Ben, does this look familiar to you? I think it is a multipath/scsi/storage/block layer issue, but I'd like some confirmation of that if possible.

Comment 5 Ben Marzinski 2011-07-05 15:41:09 UTC
This multipath setup looks wrong.  The ghost paths should not be in the same pathgroup as the active paths. Ghost paths are passive paths that are working but need a special command sent to the hardware to activate them.  These commands get sent when multipath switches pathgroups. However, in this setup they are in the same pathgroup, which keeps multipath from activating them before they are used.
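
This condition is visible directly in the `multipath -ll` output quoted in comment 3. As a minimal sketch (not part of multipath-tools), assuming the output format shown above, one can scan for pathgroups that mix ghost and ready paths:

```python
import re

def mixed_pathgroups(multipath_ll: str) -> list[int]:
    """Return indices of pathgroups that mix ghost (passive) and
    ready (active) paths -- the misconfiguration described above,
    which keeps multipath from activating a ghost path before use."""
    groups, current = [], None
    for line in multipath_ll.splitlines():
        if "policy=" in line:          # a policy line starts a new pathgroup
            current = []
            groups.append(current)
        elif current is not None and re.search(r"\b(ghost|ready)\b", line):
            current.append("ghost" if "ghost" in line else "ready")
    return [i for i, g in enumerate(groups) if "ghost" in g and "ready" in g]

# The topology reported in comment 3: one pathgroup with two ghost
# and two ready paths.
SAMPLE = """\
banana_tree (3600a0b80005a5a69000003614a1d19ee) dm-2 IBM,1726-4xx  FAStT
size=1.1T features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 2:0:1:0 sde 8:64 active ghost  running
  |- 1:0:1:0 sdc 8:32 active ready  running
  |- 2:0:0:0 sdd 8:48 active ready  running
  `- 1:0:0:0 sdb 8:16 active ghost  running
"""

print(mixed_pathgroups(SAMPLE))  # -> [0]: the single pathgroup mixes both
```

A correctly autoconfigured RDAC array would instead place the ready paths and the ghost paths in separate pathgroups, so this function would return an empty list.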

This device should autoconfigure to set up the pathgroups correctly, unless you have overridden the settings for it in /etc/multipath.conf.

If you are still having this problem, could you please post your /etc/multipath.conf file?

Comment 6 Jakub Hrozek 2011-07-11 08:06:24 UTC
Created attachment 512142 [details]
/etc/multipath.conf

Comment 7 Jakub Hrozek 2011-07-11 08:06:59 UTC
(In reply to comment #5)
> This multipath setup looks wrong.  The ghost paths should not be in the same
> pathgroup as the active paths. ghost paths are passive paths that are working
> but need a special command sent to the hardware to activate them.  These
> commands get sent when multipath switches pathgroups. However, in this setup,
> they are in the same pathgroup, which will keep multipath from activating them
> before they are used.
> 
> This device should autoconfigure to correctly setup the pathgroups, unless you
> have overridden the settings for it in /etc/multipath.conf.
> 
> If you are still having this problem, could you please post your
> /etc/multipath.conf file.

Sorry for the vacation-induced delay. The /etc/multipath.conf file is attached.

Comment 8 Ben Marzinski 2011-07-12 14:55:53 UTC
This device is autoconfigured by device-mapper-multipath, based on guidance from IBM, so you shouldn't need to change anything to have it work correctly.

I would suggest changing your multipaths section to look like this:

multipaths {
        multipath {
                wwid                    3600a0b80005a5a69000003614a1d19ee
                alias                   banana_tree
        }
}

That just sets the alias the way you want it.

If you really need it configured the way it was, the following removes only the parts of your configuration that were causing problems. But I would definitely try the default configuration first.

defaults {
	user_friendly_names yes
}

devices {
        device {
                vendor "IBM"
                product "1724"
                path_checker readsector0
                failback manual
                no_path_retry 5
        }
}
multipaths {
        multipath {
                wwid                    3600a0b80005a5a69000003614a1d19ee
                alias                   banana_tree
        }
}

Comment 9 Steve Whitehouse 2011-07-15 15:00:50 UTC
Is this issue resolved now? If so, can we close the bug, assuming that this really is just a config issue and not a real bug?

Jakub, let us know if you need anything more from us.

Comment 10 Jakub Hrozek 2011-07-15 15:18:54 UTC
(In reply to comment #9)
> Is this issue resolved now? If so can we close the bug assuming that this
> really is just a config issue and not a real bug?
> 
> Jakub, let us know if you need anything more from us.

So far so good -- thanks a lot for the pointer!

I wasn't able to deploy the change to the full cluster yet, only to 2 nodes, but I haven't seen any hiccups so far.

Feel free to close this issue, I can reopen later if the issue persists.

Comment 11 Steve Whitehouse 2011-07-15 15:24:05 UTC
OK, sounds good; I will close this for now.

