Bug 435417 - Device-mapper-multipath not working correctly with GNBD devices
Summary: Device-mapper-multipath not working correctly with GNBD devices
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath   
Version: 5.1
Hardware: i386
OS: Linux
Target Milestone: rc
: ---
Assignee: Ben Marzinski
QA Contact: Corey Marthaler
Keywords: Reopened
Depends On:
Reported: 2008-02-29 07:31 UTC by Sveto
Modified: 2010-01-12 02:40 UTC (History)
14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2008-03-17 16:08:49 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Sveto 2008-02-29 07:31:27 UTC
I am trying to configure failover multipathing across 2 GNBD devices.

I have a 4-node Red Hat Cluster Suite (RCS) cluster. Three of the nodes are
used for running services, one for central storage. In the future I am going to
introduce another machine for central storage. The two storage machines are
going to share/export the same disk. The idea is not to have a single point of
failure on the machine exporting the storage.

For proof-of-concept testing I am using one machine on which I have configured
2 GNBD exports, both exporting exactly the same disk. They are configured with:

# /sbin/gnbd_export -d /dev/sdb1 -e gnbd0 -u gnbd
# /sbin/gnbd_export -d /dev/sdb1 -e gnbd1 -u gnbd

They are exported with the same uid, so the multipath driver will automatically
configure them as alternative paths to the same storage.

Now on one of the cluster nodes used for running services I am importing these
GNBD devices with:

# /sbin/gnbd_import -i gnbd1

where gnbd1 is the hostname of the machine exporting the GNBD devices.

And I have these imported ok:

# gnbd_import -l
Device name : gnbd1
    Minor # : 0
 sysfs name : /block/gnbd0
     Server : gnbd11
       Port : 14567
      State : Open Connected Clear
   Readonly : No
    Sectors : 41941688

Device name : gnbd0
    Minor # : 1
 sysfs name : /block/gnbd1
     Server : gnbd1
       Port : 14567
      State : Open Connected Clear
   Readonly : No
    Sectors : 41941688


After, I have configured the device-mapper multipath by commenting the
"blacklist" section in /etc/multipath.conf and adding this "defaults" section:

defaults {
        user_friendly_names yes
        polling_interval 5
        #path_grouping_policy failover
        path_grouping_policy multibus
        rr_min_io 1
        failback immediate
        #failback manual
        no_path_retry fail
        #no_path_retry queue
}
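After editing /etc/multipath.conf, the maps have to be rebuilt for the new
defaults to take effect. A minimal sketch, assuming the RHEL 5 service name
and that the multipath device is not mounted while being flushed:

```shell
# Flush the existing multipath maps (only safe while the device is unused)
multipath -F

# Re-read /etc/multipath.conf and rebuild the maps, verbosely
multipath -v2

# Restart the daemon so the path checker picks up the new settings
service multipathd restart
```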

Now I have the mpath device configured correctly (IMHO):

# multipath -ll
mpath0 (gnbd) dm-2 GNBD,GNBD
\_ round-robin 0 [prio=2][enabled]
 \_ #:#:#:# gnbd0 252:0 [active][ready]
 \_ #:#:#:# gnbd1 252:1 [active][ready]

# dmsetup ls
mpath0 (253, 2)
VolGroup00-LogVol01 (253, 1)
VolGroup00-LogVol00 (253, 0)

Next I run mkfs.ext3 on the mpath0 device to create a filesystem, then mount
it. Then I start copying a file (with scp, to get a progress bar), and during
the copy I shut down one of the exported GNBD devices on the exporting machine
with:

# gnbd_export -r gnbd1 -O

After a while in the maillog:

gnbd_recvd[3357]: client lost connection with gnbd11 : Broken pipe
gnbd_recvd[3357]: reconnecting
kernel: gnbd1: Receive control failed (result -32)
kernel: gnbd1: shutting down socket
kernel: exiting GNBD_DO_IT ioctl
kernel: gnbd1: Attempted send on closed socket
gnbd_recvd[3357]: ERROR [gnbd_recvd.c:292] login refused by the server : No such
gnbd_recvd[3357]: reconnecting
kernel: device-mapper: multipath: Failing path 252:1.
multipathd: gnbd1: directio checker reports path is down
multipathd: checker failed path 252:1 in map mpath0
multipathd: mpath0: remaining active paths: 1
gnbd_recvd[3357]: ERROR [gnbd_recvd.c:292] login refused by the server : No such
gnbd_recvd[3357]: reconnecting

Now the copy process is frozen. It stays that way until the GNBD device is
exported again. I try some commands on the multipath machine:

# multipath -ll
gnbd1: checker msg is "directio checker reports path is down"
mpath0 (gnbd) dm-2 GNBD,GNBD
\_ round-robin 0 [prio=1][active]
 \_ #:#:#:# gnbd0 252:0 [active][ready]
 \_ #:#:#:# gnbd1 252:1 [failed][faulty]
<frozen; the prompt does not return>

The prompt comes back after the GNBD device is exported again.

My expectation was that in such a scenario the multipath driver would switch
the requests to the other path and everything would continue to work. Am I
wrong?

I have upgraded to the latest versions of all the RPMs.

I have tried different multipath settings (the ones commented out in the
multipath.conf "defaults" section pasted above), but nothing changed.

This may be useful: when the machine starts, the log shows:

multipathd: gnbd0: add path (uevent)
kernel: device-mapper: multipath round-robin: version 1.0.0 loaded
multipathd: mpath0: load table [0 41941688 multipath 0 0 1 1 round-robin 0 1 1
252:0 1000]
multipathd: mpath0: event checker started
multipathd: dm-2: add map (uevent)
multipathd: dm-2: devmap already registered
gnbd_recvd[3357]: gnbd_recvd started
kernel: resending requests
multipathd: gnbd1: add path (uevent)
multipathd: mpath0: load table [0 41941688 multipath 0 0 1 1 round-robin 0 2 1
252:0 1000 252:1 1000]
multipathd: dm-2: add map (uevent)
multipathd: dm-2: devmap already registered

Maybe this is a bug in GNBD rather than in multipath? If so, please excuse the
wrong category I assigned to the issue.

Comment 1 Sveto 2008-03-10 07:13:40 UTC
Anyone ?

Comment 2 Ben Marzinski 2008-03-14 15:35:14 UTC
Sorry about the delay.

What's happening here is that GNBD is not as smart as it should be (or as smart
as I thought it was).  In the normal case, where you have gnbd exports from
various servers, when you lose connection to a gnbd export on one node for
longer than the allowed timeout, gnbd will fail the IOs back, so that if you are
using multipath, you can retry them on a different node.  However, before gnbd
can fail the IOs back, it needs to make absolutely sure that the IO requests
that it sent to the failed node will never reach the disk after gnbd fails over.
 Otherwise, you could have the case where you:

1. lose connection to a gnbd_server node with outstanding IO requests
2. fail the IO requests over to a new gnbd_server node
3. write new IO requests to the same blocks on the new gnbd_server node
4. have the old gnbd_server node overwrite those blocks with stale data from
   the original IO requests.

which results in data corruption.  Here's where GNBD isn't smart enough.  It
always makes sure that the IOs won't make it to disk by fencing the old
gnbd_server node.  The reason it's not fencing the node in your case is that it
realizes there aren't any other nodes to fail over to, so fencing your last
node would be pointless.

If gnbd were smarter, it could simply query the gnbd_server and find out that
the device had been unexported cleanly.  Then it would know that it can safely
fail the IOs.  But GNBD doesn't do this.

So while this means that your tests won't work correctly, once you have the
gnbd servers on different nodes, failover should work correctly.  And as long
as you don't forcibly unexport gnbd devices with the override option, you
shouldn't be unnecessarily fencing your gnbd servers.
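The production layout described above might look like the following sketch.
The hostnames (storA, storB) and the uid string are hypothetical; only the
gnbd_export/gnbd_import flags already used in this report are assumed:

```shell
# On each of the two storage servers, export the shared disk under a
# different export name but the SAME uid, so the client's multipath
# layer groups them as two paths to one device:

# on storA:
gnbd_export -d /dev/sdb1 -e gnbd_a -u shared-disk

# on storB:
gnbd_export -d /dev/sdb1 -e gnbd_b -u shared-disk

# On the client node, import from both servers:
gnbd_import -i storA
gnbd_import -i storB
```

With the exports on separate nodes, losing one server lets gnbd fence it and
fail the outstanding IOs over to the surviving path.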

Comment 3 Sveto 2008-03-18 07:13:55 UTC
I've been testing this too. 

The configuration was 2 hosts exporting the same partition over GNBD. The
partitions were synchronized with DRBD 0.8 (both primaries mode).

On the multipath machine I have 2 GNBD devices imported from 2 different hosts.
All the hosts are part of the cluster with manual fencing configured. 

When I bring down the network interface of one of the GNBD exporting machines,
the same hang happens (I am using virtual machines for all these, if this matters).

Do you think the cause of all this is the manual fencing method: GNBD is
waiting for the failing node to be fenced, and only after that will it continue
and trigger the path change?

Comment 4 Sveto 2008-03-19 08:58:24 UTC
Manual fencing turns out to be the issue. GNBD simply waits for the failing
node to be fenced manually; after that, the path is switched.
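With manual fencing, the hang clears once the fence is acknowledged by hand.
A sketch of the acknowledgement step (the node name is an example; run this
only after physically verifying the failed node is down or disconnected):

```shell
# Acknowledge the pending manual fence so gnbd can fail the
# outstanding IOs over to the surviving path:
fence_ack_manual -n failed-node-name
```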
