Bug 1704468 - [ISCSI] vmware snapshots getting corrupted following hardware move and network change
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: iSCSI
Version: 3.1
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 4.0
Assignee: Mike Christie
QA Contact: Madhavi Kasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-29 20:32 UTC by jquinn
Modified: 2019-12-02 13:54 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-03 16:18:21 UTC
Embargoed:


Attachments

Comment 1 Mike Christie 2019-04-29 23:23:27 UTC
How are you alerted that a VMware snapshot is corrupted? Is it when you try to open it? Could you attach the vmkernel.log from when this happens?

Does the issue happen with snapshots you have already taken, or with ones taken while the lock error messages are being reported?


Could you give me the output of:

esxcli storage core device vaai status get -d your_device
esxcli storage core device list -d your_device

from one of the hosts.
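
If it is easier, something along these lines (an untested sketch; the device ID below is only a placeholder for your RBD image's naa identifier) will capture both outputs into files you can attach here:

# placeholder - substitute the real naa ID for the device
DEV=naa.<your_device_id>
esxcli storage core device vaai status get -d "$DEV" > "/tmp/${DEV}_vaai.txt"
esxcli storage core device list -d "$DEV" > "/tmp/${DEV}_device.txt"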

For the lock bouncing messages, you should also run this command:

esxcli storage nmp path list -d your_device

on all the ESX hosts.
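
If SSH is enabled on the hosts, a loop along these lines (sketch only; the host names and device ID are placeholders) will gather the path lists into one file so they can be compared side by side:

# placeholders - substitute your ESX host names and the device naa ID
DEV=naa.<your_device_id>
for host in esx1 esx2 esx3; do
    echo "== $host =="
    ssh root@"$host" "esxcli storage nmp path list -d $DEV"
done > nmp_paths_all_hosts.txt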


The lock bouncing messages are likely due to the ESX hosts seeing igw-ssd-01.vmw-ssd-05 through different paths. The hosts will then try to access the disk through different iSCSI gateways, and the lock will bounce between the gateways. This commonly happens when the hosts have network issues on the iSCSI paths. Running the nmp path list command should show some of the hosts trying to use different paths.
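
As a rough first pass over that collected output (the exact field names can vary between ESXi releases, so treat this as a sketch), grepping the per-path group state for each host will show whether the hosts agree on which gateway they are actively using:

grep -E '^== |Group State:' nmp_paths_all_hosts.txt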

Note that this should not cause corruption, but it could cause IO failures: the lock bouncing can cause a command to be retried too many times, at which point ESX marks the command as failed.

