Bug 1903346 - PV backed by FC lun is not being unmounted properly and this leads to IO errors / xfs corruption.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jan Safranek
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks: 1903524
 
Reported: 2020-12-01 20:49 UTC by emahoney
Modified: 2021-04-07 06:14 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Kubernetes FibreChannel (FC) volume plugin did not flush the multipath device before deleting it. Consequence: In rare cases, the filesystem on a multipath FC device was corrupted during pod destruction. Fix: Kubernetes now flushes data before deleting the FC multipath device. Result: No filesystem corruption.
Clone Of:
: 1903524
Environment:
Last Closed: 2021-02-24 15:37:21 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 489 0 None closed Bug 1903346: UPSTREAM: 97013: Fix FibreChannel volume plugin corrupting filesystem on detach 2021-02-01 08:28:42 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:37:55 UTC

Description emahoney 2020-12-01 20:49:27 UTC
Description of problem: A PV backed by an FC LUN is not being unmounted properly, which leads to IO errors / XFS corruption. The logs and the Kubernetes 1.12 code show that the individual SCSI paths are removed before the multipath map is flushed and removed. IO errors are expected once the device is in this state. Will provide notes/logs.
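The correct teardown ordering can be sketched as follows. This is a hedged illustration, not the actual plugin code (the real fix landed in the FC volume plugin via upstream PR 97013): the helper name `detachMultipathDevice` and the exact command strings are assumptions for demonstration. The point it shows is the ordering that avoids corruption: flush the multipath map first, delete the underlying SCSI paths afterwards.

```go
package main

import "fmt"

// detachMultipathDevice returns, in order, the commands a node agent would
// run to tear down a multipath FC device safely. The key point of the fix:
// flush the multipath map (forcing queued IO to disk and removing the dm
// device) BEFORE deleting the underlying SCSI paths. Deleting the paths
// first strands dirty buffers and corrupts the filesystem, which is the
// failure mode reported in this bug.
// NOTE: hypothetical helper for illustration; not the actual k8s code.
func detachMultipathDevice(dmName string, paths []string) []string {
	cmds := []string{
		// 1. Flush and remove the multipath map first.
		fmt.Sprintf("multipath -f %s", dmName),
	}
	// 2. Only then delete each underlying SCSI path device.
	for _, p := range paths {
		cmds = append(cmds, fmt.Sprintf("echo 1 > /sys/block/%s/device/delete", p))
	}
	return cmds
}

func main() {
	for _, c := range detachMultipathDevice("mpatha", []string{"sdb", "sdc"}) {
		fmt.Println(c)
	}
}
```

The buggy code path did the reverse: it deleted `/sys/block/sdX/device/delete` entries while the dm map still held unflushed data, so the `multipath -f` (or implicit map removal) that followed had nowhere to write the remaining IO.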

Version-Release number of selected component (if applicable):
3.11.306-1

How reproducible:
Happens any time the node is drained. 

Steps to Reproduce:
1. https://docs.openshift.com/container-platform/3.11/admin_guide/manage_nodes.html#evacuating-pods-on-nodes

Actual results:
After a node drain, the FC-backed PVs report consistency errors such as:
~~~
[ 6489.218162] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.218456] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.222223] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.222589] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.222890] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.223154] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.505911] XFS (dm-152): Internal error XFS_WANT_CORRUPTED_GOTO at line 1637 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x140 [xfs]
[ 6489.506006] [<ffffffffc03a13db>] xfs_error_report+0x3b/0x40 [xfs]
~~~

Expected results:
Pods are drained and storage is consistent.


Additional info:

Comment 4 Jan Safranek 2020-12-02 09:18:00 UTC
Thanks for a great explanation and links to code! The fix looks straightforward now.

Comment 5 Jan Safranek 2020-12-02 14:08:34 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/97013

Comment 6 Jan Safranek 2020-12-04 15:01:49 UTC
Waiting for upstream to un-freeze.

Comment 9 Qin Ping 2021-02-03 08:02:59 UTC
Verified with: 4.7.0-0.nightly-2021-02-02-223803

Comment 12 errata-xmlrpc 2021-02-24 15:37:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

