Bug 1903346

Summary: PV backed by FC lun is not being unmounted properly and this leads to IO errors / xfs corruption.
Product: OpenShift Container Platform Reporter: emahoney
Component: Storage Assignee: Jan Safranek <jsafrane>
Storage sub component: Kubernetes QA Contact: Qin Ping <piqin>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: andcosta, aos-bugs, apanagio, bjarolim, chaoyang, openshift-bugs-escalate, piqin, plambri, rheinzma
Version: 3.11.0   
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The Kubernetes FibreChannel (FC) volume plugin did not properly flush the multipath device before deleting it.
Consequence: In rare cases, the filesystem on a multipath FC device was corrupted during pod destruction.
Fix: Kubernetes now flushes data before deleting an FC multipath device.
Result: The filesystem is no longer corrupted.
Story Points: ---
Clone Of:
: 1903524 Environment:
Last Closed: 2021-02-24 15:37:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1903524    

Description emahoney 2020-12-01 20:49:27 UTC
Description of problem: A PV backed by an FC LUN is not unmounted properly, which leads to I/O errors and XFS corruption. The logs and the Kubernetes 1.12 code show an unclean removal of the paths before the multipath map is removed. The I/O errors are expected once the device is in this state. Will provide notes/logs.

Version-Release number of selected component (if applicable):
3.11.306-1

How reproducible:
Happens any time the node is drained. 

Steps to Reproduce:
1. Drain (evacuate) a node as described in https://docs.openshift.com/container-platform/3.11/admin_guide/manage_nodes.html#evacuating-pods-on-nodes

Actual results:
After a node drain, the FC-backed PVs start reporting consistency errors such as the following:
~~~
[ 6489.218162] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.218456] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.222223] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.222589] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.222890] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.223154] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.505911] XFS (dm-152): Internal error XFS_WANT_CORRUPTED_GOTO at line 1637 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x140 [xfs]
[ 6489.506006] [<ffffffffc03a13db>] xfs_error_report+0x3b/0x40 [xfs]
~~~

Expected results:
Pods are drained and storage is consistent.


Additional info:

Comment 4 Jan Safranek 2020-12-02 09:18:00 UTC
Thanks for a great explanation and links to code! The fix looks straightforward now.

Comment 5 Jan Safranek 2020-12-02 14:08:34 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/97013

Comment 6 Jan Safranek 2020-12-04 15:01:49 UTC
Waiting for upstream to un-freeze.

Comment 9 Qin Ping 2021-02-03 08:02:59 UTC
Verified with: 4.7.0-0.nightly-2021-02-02-223803

Comment 12 errata-xmlrpc 2021-02-24 15:37:21 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633