Bug 1903346

Summary:	PV backed by FC lun is not being unmounted properly and this leads to IO errors / xfs corruption.
Product:	OpenShift Container Platform	Reporter:	emahoney
Component:	Storage	Assignee:	Jan Safranek <jsafrane>
Storage sub component:	Kubernetes	QA Contact:	Qin Ping <piqin>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	andcosta, aos-bugs, apanagio, bjarolim, chaoyang, openshift-bugs-escalate, piqin, plambri, rheinzma
Version:	3.11.0
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Kubernetes FibreChannel (FC) volume plugin did not properly flush multipath device before deleting it. Consequence: In rare cases, a filesystem on multipath FC device was corrupted during pod destruction. Fix: Kubernetes flushes data before deleting FC multipath device. Result: No filesystem corruption.	Story Points:	---
Clone Of:
Clones:	1903524		Environment:
Last Closed:	2021-02-24 15:37:21 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1903524

Description emahoney 2020-12-01 20:49:27 UTC

Description of problem: A PV backed by an FC LUN is not unmounted properly, which leads to I/O errors and XFS corruption. The logs and the Kubernetes 1.12 code show that the paths are removed uncleanly before the multipath map is removed. The I/O errors are the expected result of that state. Notes and logs will be provided.
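
For illustration, the teardown order that avoids this state can be sketched in Go. This is a minimal sketch under stated assumptions, not the Kubernetes FC plugin code: the `Step` type, the `TeardownPlan` function, the mount point `/mnt/pv`, and the path names `sdb` and `sdc` are all hypothetical. It assumes the map is flushed with `multipath -f` and that each SCSI path is then removed by writing `1` to `/sys/block/<dev>/device/delete`. The sketch only builds the ordered plan and prints it; it does not touch any device.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Step is one action in the teardown sequence (hypothetical type, for illustration only).
type Step struct {
	Desc string
	Cmd  []string // command and arguments; nil when the step is a sysfs write
	Path string   // sysfs file that "1" is written to; empty when the step is a command
}

// TeardownPlan returns the steps for removing a multipath FC device in an order
// that flushes the multipath map before any underlying SCSI path is deleted.
// Assumption: dmDevice is something "multipath -f" accepts, such as /dev/dm-N.
func TeardownPlan(mountPoint, dmDevice string, paths []string) []Step {
	steps := []Step{
		{Desc: "unmount filesystem", Cmd: []string{"umount", mountPoint}},
		{Desc: "flush multipath map", Cmd: []string{"multipath", "-f", dmDevice}},
	}
	// Only after the map is flushed are the individual paths removed.
	for _, p := range paths {
		steps = append(steps, Step{
			Desc: "delete SCSI path " + p,
			Path: filepath.Join("/sys/block", p, "device", "delete"),
		})
	}
	return steps
}

func main() {
	// Hypothetical inputs; dm-152 is the device named in the kernel log of this report.
	plan := TeardownPlan("/mnt/pv", "/dev/dm-152", []string{"sdb", "sdc"})
	for i, s := range plan {
		if len(s.Cmd) > 0 {
			fmt.Printf("%d. %s: run %v\n", i+1, s.Desc, s.Cmd)
		} else {
			fmt.Printf("%d. %s: write 1 to %s\n", i+1, s.Desc, s.Path)
		}
	}
}
```

The point of the ordering is that the flush step comes before any path deletion, so outstanding writes can still reach the LUN through a live path.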

Version-Release number of selected component (if applicable):
3.11.306-1

How reproducible:
Happens any time the node is drained. 

Steps to Reproduce:
1. Drain (evacuate) a node as described in https://docs.openshift.com/container-platform/3.11/admin_guide/manage_nodes.html#evacuating-pods-on-nodes

Actual results:
After a node drain, the FC-backed PVs start reporting consistency errors such as the following:
~~~
[ 6489.218162] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.218456] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.222223] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.222589] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.222890] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.223154] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.505911] XFS (dm-152): Internal error XFS_WANT_CORRUPTED_GOTO at line 1637 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x140 [xfs]
[ 6489.506006] [<ffffffffc03a13db>] xfs_error_report+0x3b/0x40 [xfs]
~~~

Expected results:
Pods are drained and storage is consistent.


Additional info:

Comment 4 Jan Safranek 2020-12-02 09:18:00 UTC

Thanks for a great explanation and links to code! The fix looks straightforward now.

Comment 5 Jan Safranek 2020-12-02 14:08:34 UTC

Upstream PR: https://github.com/kubernetes/kubernetes/pull/97013

Comment 6 Jan Safranek 2020-12-04 15:01:49 UTC

Waiting for upstream to un-freeze.

Comment 9 Qin Ping 2021-02-03 08:02:59 UTC

Verified with: 4.7.0-0.nightly-2021-02-02-223803

Comment 12 errata-xmlrpc 2021-02-24 15:37:21 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633