Bug 1565311

Summary: vdo status command shows index is online after index fails to start
Product: Red Hat Enterprise Linux 7 Reporter: Bryan Gurney <bgurney>
Component: kmod-kvdoAssignee: Thomas Jaskiewicz <tjaskiew>
Status: CLOSED ERRATA QA Contact: Jakub Krysl <jkrysl>
Severity: unspecified Docs Contact:
Priority: high    
Version: 7.5CC: awalsh, bgurney, jkrysl, limershe, rhandlin, vdo-qe
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 6.1.1.60 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1578421 (view as bug list) Environment:
Last Closed: 2018-10-30 09:39:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1578421    

Description Bryan Gurney 2018-04-09 20:03:36 UTC
Description of problem:

After powering up a test system this morning, I found that a VDO volume's index had stopped.  This was instigated by a hard disk drive that was returning a "Medium Error - Unrecovered read error - auto reallocate failed" sense on reading a couple of sectors in the index area:

Apr  9 10:25:59 charitable-view kernel: uds: kvdo0:dedupeQ: loading or rebuilding index: dev=/dev/sdc offset=4096 size=2781704192
Apr  9 10:25:59 charitable-view kernel: kvdo0:dmsetup: device 'vdo1' started
Apr  9 10:25:59 charitable-view kernel: kvdo0:dmsetup: resuming device 'vdo1'
Apr  9 10:25:59 charitable-view kernel: kvdo0:dmsetup: device 'vdo1' resumed
Apr  9 10:26:00 charitable-view kernel: uds: kvdo0:dedupeQ: Using 16 indexing zones for concurrency.
...
Apr  9 10:26:24 charitable-view kernel: sd 3:0:0:0: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr  9 10:26:24 charitable-view kernel: sd 3:0:0:0: [sdc] Sense Key : Medium Error [current] [descriptor] 
Apr  9 10:26:24 charitable-view kernel: sd 3:0:0:0: [sdc] Add. Sense: Unrecovered read error - auto reallocate failed
Apr  9 10:26:24 charitable-view kernel: sd 3:0:0:0: [sdc] CDB: Read(10) 28 00 00 47 45 f0 00 00 08 00
Apr  9 10:26:24 charitable-view kernel: blk_update_request: I/O error, dev sdc, sector 4670960
Apr  9 10:26:24 charitable-view kernel: ata3: EH complete
Apr  9 10:26:24 charitable-view kernel: uds: kvdo0:dedupeQ: readBufferedData got readFromRegion error: Input/output error (5)
Apr  9 10:26:24 charitable-view kernel: uds: kvdo0:dedupeQ: failed to read delta list data: Input/output error (5)
Apr  9 10:26:24 charitable-view kernel: uds: kvdo0:dedupeQ: index component master index: Input/output error (5)
Apr  9 10:26:24 charitable-view kernel: uds: kvdo0:dedupeQ: index_0: index could not be loaded: Input/output error (5)
Apr  9 10:26:25 charitable-view kernel: uds: kvdo0:dedupeQ: index_0: Replaying volume from chapter 0 through chapter 262
...
Apr  9 10:26:58 charitable-view kernel: sd 3:0:0:0: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr  9 10:26:58 charitable-view kernel: sd 3:0:0:0: [sdc] Sense Key : Medium Error [current] [descriptor] 
Apr  9 10:26:58 charitable-view kernel: sd 3:0:0:0: [sdc] Add. Sense: Unrecovered read error - auto reallocate failed
Apr  9 10:26:58 charitable-view kernel: sd 3:0:0:0: [sdc] CDB: Read(10) 28 00 00 0c 2a 18 00 00 40 00
Apr  9 10:26:58 charitable-view kernel: blk_update_request: I/O error, dev sdc, sector 797216
Apr  9 10:26:58 charitable-view kernel: ata3: EH complete
Apr  9 10:26:58 charitable-view kernel: uds: kvdo0:dedupeQ: error reading physical page 12456: Input/output error (5)
Apr  9 10:26:58 charitable-view kernel: uds: kvdo0:dedupeQ: Error reading page 12456 from volume
Apr  9 10:26:58 charitable-view kernel: uds: kvdo0:dedupeQ: index_0: could not get page 65: Input/output error (5)
Apr  9 10:26:58 charitable-view kernel: uds: kvdo0:dedupeQ: index_0: index could not be rebuilt: Unrecoverable error : Input/output error (131077)
Apr  9 10:26:58 charitable-view kernel: uds: kvdo0:dedupeQ: index_0: fatal error in makeIndex: Unrecoverable error: Input/output error (131077)
Apr  9 10:26:58 charitable-view kernel: uds: kvdo0:dedupeQ: failed to create index: Unrecoverable error: Input/outpu t error (131077)
Apr  9 10:26:58 charitable-view kernel: uds: kvdo0:dedupeQ: Failed to make router: Unrecoverable error: Input/output error (131077)
Apr  9 10:26:58 charitable-view kernel: uds: kvdo0:dedupeQ: Failed loading or rebuilding index: Unrecoverable error: Input/output error (131077)
Apr  9 10:26:58 charitable-view kernel: kvdo0:dedupeQ: Error opening index dev=/dev/sdc offset=4096 size=2781704192: System error 5 (5)
Apr  9 10:26:58 charitable-view kernel: kvdo0:dedupeQ: Setting UDS index target state to error

After this, the VDO volume was online, but the index appeared to be no longer running (the space savings was reduced, and the "posts found", "posts not found", and other index-specific statistics were showing zero values); however, "vdo stats" reported "Index status: online".  Also, the "dmsetup status" output for this VDO volume was "vdo1: 0 1942442600 vdo /dev/sdc albserver online cpu=2,bio=4,ack=1,bioRotationInterval=64".


Apr  9 13:17:44 charitable-view kernel: uds: vdostats: getBaseContext() failed.: UDS Error: Could not find the requested library context (1028)
Apr  9 13:17:44 charitable-view kernel: kvdo1:vdostats: Error reading index stats: UDS Error: Could not find the requested library context (1028)
Apr  9 13:17:44 charitable-view kernel: uds: vdostats: getBaseContext() failed: UDS Error: Could not find the requested library context (1028)
Apr  9 13:17:44 charitable-view kernel: kvdo1:vdostats: Error reading context stats: UDS Error: Could not find the requested library context (1028)


Version-Release number of selected component (if applicable):
kmod-kvdo-6.1.0.149-13.el7.x86_64 (running on kernel 3.10.0-855.el7.x86_64)

How reproducible:
Very difficult to reproduce without defective media.


Steps to Reproduce:
1. Create a VDO volume on a device.
2. Execute mkfs.xfs on the VDO volume, then mount the VDO volume to a test directory
3. Write test data to the mounted filesystem on the VDO volume.
4. At some point after the creation of the VDO volume, a sector on disk in the range of the index starts returning unrecoverable read errors, causing the index to stop.
5. When running "vdo status", the "Error reading index stats" error messages appear.

Actual results:
Running "vdo status" for a VDO volume with a UDS index unexpectedly knocked offline outputs "Index status: online".

Expected results:
Running "vdo status" for a VDO volume with a UDS index unexpectedly knocked offline outputs "Index status: offline".


Additional info:
I did see the following messages when running "vdo status" or "vdostats --verbose", which indicate that either program was trying to access the information of the UDS index, but was unable to do so:
Apr  9 13:17:44 charitable-view kernel: uds: vdostats: getBaseContext() failed.: UDS Error: Could not find the requested library context (1028)
Apr  9 13:17:44 charitable-view kernel: kvdo1:vdostats: Error reading index stats: UDS Error: Could not find the requested library context (1028)
Apr  9 13:17:44 charitable-view kernel: uds: vdostats: getBaseContext() failed: UDS Error: Could not find the requested library context (1028)
Apr  9 13:17:44 charitable-view kernel: kvdo1:vdostats: Error reading context stats: UDS Error: Could not find the requested library context (1028)

This could potentially be used to detect a situation where the UDS index is offline, even though "vdo status" is reporting that the index is "Online".

Comment 2 Bryan Gurney 2018-04-10 21:37:59 UTC
I did try to make an 8 GB scsi_debug device with the "opts=0x2" option, which configures the device to return a medium error on reads of sector 0x1234.  My test system has 32 GB of RAM, so I ran the following command:

# modprobe scsi_debug dev_size_mb=8192 opts=0x2

...which created an 8 GB "scsi_debug" device appearing as /dev/sdd.

- Create a VDO volume on this device. (There will be about 4 GB of physical space usable, since the scsi_debug device is so small; make the logical size 40 GB.)

# vdo create --name=scsidebugvdo --device=/dev/sdd --vdoLogicalSize=40G --verbose

- Create an XFS filesystem on this device.

# mkfs.xfs -K /dev/mapper/scsidebugvdo
# mount -o discard /dev/mapper/scsidebugvdo /mnt/scsidebugvdo/

- Start copying and deleting data.  I chose two datasets: a high-dedupe 1 GB file, and a 2 GB directory of documents.  I kept copying the files in, and then removing the files and/or directories, which triggered XFS to send discards for the blocks vacated after the removal.


Apr 10 16:47:29 charitable-view systemd-logind: New session 244 of user root.
Apr 10 16:47:29 charitable-view systemd: Started Session 244 of user root.
Apr 10 16:47:29 charitable-view systemd: Starting Session 244 of user root.
Apr 10 16:47:41 charitable-view kernel: sd 10:0:0:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 10 16:47:41 charitable-view kernel: sd 10:0:0:0: [sdd] Sense Key : Medium Error [current]
Apr 10 16:47:41 charitable-view kernel: sd 10:0:0:0: [sdd] Add. Sense: Unrecovered read error
Apr 10 16:47:41 charitable-view kernel: sd 10:0:0:0: [sdd] CDB: Read(10) 28 00 00 00 12 18 00 00 40 00
Apr 10 16:47:41 charitable-view kernel: blk_update_request: critical medium error, dev sdd, sector 4632
Apr 10 16:47:41 charitable-view kernel: uds: kvdo12:reader: error reading physical page 72: Unknown error 61 (61)
Apr 10 16:47:41 charitable-view kernel: uds: kvdo12:reader: Error reading page 72 from volume
Apr 10 16:47:49 charitable-view systemd-logind: Removed session 244.
Apr 10 16:47:50 charitable-view kernel: uds: kvdo12:writer: Discarding saved open chapter
Apr 10 16:47:53 charitable-view kernel: uds: vdostats: getBaseContext() failed.: UDS Error: UDS library context is disabled (1029)
Apr 10 16:47:53 charitable-view kernel: kvdo12:vdostats: Error reading index stats: UDS Error: UDS library context is disabled (1029)
Apr 10 16:47:53 charitable-view kernel: uds: vdostats: getBaseContext() failed: UDS Error: UDS library context is disabled (1029)
Apr 10 16:47:53 charitable-view kernel: kvdo12:vdostats: Error reading context stats: UDS Error: UDS library context is disabled (1029)

I saw the "getBaseContext()" errors when I ran "vdostats --verbose".  However, I did still see the sysfs directory for the index.  This isn't enti
rely unexpected, as the original scenario for this bug was during VDO volume startup, and this scenario is well after startup has completed, but d
uring normal operation.

Comment 6 Thomas Jaskiewicz 2018-05-14 22:08:29 UTC
Does QA need any more info?

Comment 7 Jakub Krysl 2018-05-15 07:39:46 UTC
No thanks, I will try some reproducers and if they fail to reproduce, will do just SanityOnly.

Comment 10 Jakub Krysl 2018-07-03 13:18:57 UTC
Tested on:
RHEL-7.6-20180626.0
kernel-3.10.0-915.el7
kmod-vdo-6.1.1.99-1.el7
vdo-6.1.1.99-2.el7

I could not get the index to failed state as vdo managed to rebuild the index each time. Regression testing passed.

Comment 12 errata-xmlrpc 2018-10-30 09:39:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3094