Bug 1903346 - PV backed by FC lun is not being unmounted properly and this leads to IO errors / xfs corruption.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jan Safranek
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks: 1903524
 
Reported: 2020-12-01 20:49 UTC by emahoney
Modified: 2021-04-07 06:14 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Kubernetes FibreChannel (FC) volume plugin did not flush the multipath device before deleting it. Consequence: In rare cases, the filesystem on a multipath FC device was corrupted during pod destruction. Fix: Kubernetes now flushes data before deleting the FC multipath device. Result: No filesystem corruption.
Clone Of:
: 1903524
Environment:
Last Closed: 2021-02-24 15:37:21 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 489 0 None closed Bug 1903346: UPSTREAM: 97013: Fix FibreChannel volume plugin corrupting filesystem on detach 2021-02-01 08:28:42 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:37:55 UTC

Description emahoney 2020-12-01 20:49:27 UTC
Description of problem: A PV backed by an FC LUN is not being unmounted properly, which leads to IO errors / XFS corruption. The logs and the Kubernetes 1.12 code show that the individual SCSI paths are removed before the multipath map is flushed and removed. IO errors are expected once the device is in this state. Will provide notes/logs.
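The correct teardown ordering can be sketched as follows. This is a hedged illustration, not the actual plugin code (the real fix landed in the FC volume plugin via upstream PR 97013): the helper name `detachMultipathDevice` and the exact command strings are assumptions for demonstration. The point it shows is the ordering that avoids corruption: flush the multipath map first, delete the underlying SCSI paths afterwards.

```go
package main

import "fmt"

// detachMultipathDevice returns, in order, the commands a node agent would
// run to tear down a multipath FC device safely. The key point of the fix:
// flush the multipath map (forcing queued IO to disk and removing the dm
// device) BEFORE deleting the underlying SCSI paths. Deleting the paths
// first strands dirty buffers and corrupts the filesystem, which is the
// failure mode reported in this bug.
// NOTE: hypothetical helper for illustration; not the actual k8s code.
func detachMultipathDevice(dmName string, paths []string) []string {
	cmds := []string{
		// 1. Flush and remove the multipath map first.
		fmt.Sprintf("multipath -f %s", dmName),
	}
	// 2. Only then delete each underlying SCSI path device.
	for _, p := range paths {
		cmds = append(cmds, fmt.Sprintf("echo 1 > /sys/block/%s/device/delete", p))
	}
	return cmds
}

func main() {
	for _, c := range detachMultipathDevice("mpatha", []string{"sdb", "sdc"}) {
		fmt.Println(c)
	}
}
```

The buggy code path did the reverse: it deleted `/sys/block/sdX/device/delete` entries while the dm map still held unflushed data, so the `multipath -f` (or implicit map removal) that followed had nowhere to write the remaining IO.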

Version-Release number of selected component (if applicable):
3.11.306-1

How reproducible:
Happens any time the node is drained. 

Steps to Reproduce:
1. https://docs.openshift.com/container-platform/3.11/admin_guide/manage_nodes.html#evacuating-pods-on-nodes

Actual results:
After a node drain, the FC-backed PVs report consistency errors such as:
~~~
[ 6489.218162] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.218456] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.222223] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.222589] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.222890] XFS (dm-152): Metadata CRC error detected at xfs_inobt_read_verify+0x79/0xb0 [xfs], xfs_inobt block 0x18
[ 6489.223154] XFS (dm-152): metadata I/O error: block 0x18 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 6489.505911] XFS (dm-152): Internal error XFS_WANT_CORRUPTED_GOTO at line 1637 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x140 [xfs]
[ 6489.506006] [<ffffffffc03a13db>] xfs_error_report+0x3b/0x40 [xfs]
~~~

Expected results:
Pods are drained and storage is consistent.


Additional info:

Comment 4 Jan Safranek 2020-12-02 09:18:00 UTC
Thanks for a great explanation and links to code! The fix looks straightforward now.

Comment 5 Jan Safranek 2020-12-02 14:08:34 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/97013

Comment 6 Jan Safranek 2020-12-04 15:01:49 UTC
Waiting for upstream to un-freeze.

Comment 9 Qin Ping 2021-02-03 08:02:59 UTC
Verified with: 4.7.0-0.nightly-2021-02-02-223803

Comment 12 errata-xmlrpc 2021-02-24 15:37:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

