Cause: The OpenShift iSCSI volume plugin scanned the whole iSCSI session and discovered and mapped all of its LUNs, including LUNs that were neither required nor used. These unused LUNs could be added to the local multipath configuration.
Consequence: When such an unrelated LUN was deleted on the storage backend and a new volume was created with the same LUN number, multipath running on a node could get confused and report that the filesystem on the volume was corrupted.
Fix: The OpenShift iSCSI volume plugin now uses manual iSCSI scanning and discovers and maps only the volumes that actually need to be attached to a node.
Result: Unrelated volumes are no longer added to multipath.
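As a rough illustration of the difference between the two scanning approaches (the host number and LUN are taken from the lsscsi output below; the commands are an illustrative sketch, not the plugin's actual code):

# Session-wide rescan: every LUN exported on the session appears on the node
iscsiadm -m session --rescan

# Targeted (manual) scan: only the required LUN is added, e.g. LUN 49 on host461
echo "0 0 49" > /sys/class/scsi_host/host461/scan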
Description of problem:
This is a NetApp Trident setup, using a virtual NetApp ONTAP device.
We have been seeing random fsck issues with NetApp block storage causing corruption errors:
Apr 16 12:15:04 server.dmz atomic-openshift-node[17920]: I0416 12:15:04.470996 17920 mount_linux.go:488] `fsck` error fsck from util-linux 2.23.2
Apr 16 12:15:04 server.dmz atomic-openshift-node[17920]: fsck.ext2: Bad magic number in super-block while trying to open /dev/mapper/3600a09805a506576375d4f4e754d5434
Apr 16 12:15:04 server.dmz atomic-openshift-node[17920]: /dev/mapper/3600a09805a506576375d4f4e754d5434:
Apr 16 12:15:04 server.dmz atomic-openshift-node[17920]: The superblock could not be read or does not describe a correct ext2
Apr 16 12:15:04 server.dmz atomic-openshift-node[17920]: filesystem. If the device is valid and it really contains an ext2
Apr 16 12:15:04 server.dmz atomic-openshift-node[17920]: filesystem (and not swap or ufs or something else), then the superblock
Apr 16 12:15:04 server.dmz atomic-openshift-node[17920]: is corrupt, and you might try running e2fsck with an alternate superblock:
Apr 16 12:15:04 server.dmz atomic-openshift-node[17920]: e2fsck -b 8193 <device>
Looking at the device, we could see it already had data and a filesystem on it, which is why the fsck in mount_linux.go was failing.
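For illustration, that existing-filesystem check can be reproduced manually (these commands are an assumption, not taken from the case data; the device name comes from the log above):

# Show any existing filesystem signature on the multipath device
blkid /dev/mapper/3600a09805a506576375d4f4e754d5434
file -s /dev/mapper/3600a09805a506576375d4f4e754d5434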
We tried to reproduce the issue and collect data by running the following in a loop until it failed (a rough script sketch follows the list):
1) Collect pre logs
2) Delete PVC / PV
3) Collect deleted logs
4) Create PVC / PV
5) Collect created logs
6) Scale up pod
7) Wait for success / error
8) Scale down pod
9) Collect logs
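A minimal sketch of that loop, assuming the rh-test PVC from the dump below, a DeploymentConfig of the same name, and log file names chosen purely for illustration:

#!/bin/bash
# Rough reproduction loop; resource and file names are assumptions.
while true; do
  oc get pv,pvc -o wide > pre.log            # 1) collect pre logs
  oc delete pvc rh-test                      # 2) delete PVC (PV goes away via reclaimPolicy: Delete)
  oc get pv,pvc -o wide > deleted.log        # 3) collect deleted logs
  oc create -f pvc.yaml                      # 4) create PVC / PV
  oc get pv,pvc -o wide > created.log        # 5) collect created logs
  oc scale dc/rh-test --replicas=1           # 6) scale up pod
  sleep 120                                  # 7) wait for success / error
  oc scale dc/rh-test --replicas=0           # 8) scale down pod
  oc get events > post.log                   # 9) collect logs
done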
What we found was the following:
1) Before the failure mount, the device already existed:
dmsetup ls (pre creation):
3600a09805a506576375d4f4e754d5434 (253:50)
2) Before the failure mount, the dm-50 already existed in multipath:
multipath (pre creation):
3600a09805a506576375d4f4e754d5434 dm-50 NETAPP ,LUN C-Mode
size=5.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 462:0:0:49 sdhe 133:64 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 461:0:0:49 sdbh 67:176 active ready running
3) Before the failure mount, sdhe and sdbh existed
lsscsi (pre creation)
[462:0:0:49] disk NETAPP LUN C-Mode 9600 /dev/sdhe
[461:0:0:49] disk NETAPP LUN C-Mode 9600 /dev/sdbh
4) After deleting the old PVC / PV, the device was not removed:
dmsetup ls (pvc deleted):
3600a09805a506576375d4f4e754d5434 (253:50)
5) After deleting the old PVC / PV, dm-50 still existed and went into an active/faulty/running state
multipath (pvc deleted):
3600a09805a506576375d4f4e754d5434 dm-50 NETAPP ,LUN C-Mode
size=5.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- 462:0:0:49 sdhe 133:64 active faulty running
`-+- policy='service-time 0' prio=0 status=enabled
`- 461:0:0:49 sdbh 67:176 active faulty running
6) After deleting the old PVC / PV, sdhe and sdbh still existed
lsscsi (pvc deleted)
[462:0:0:49] disk NETAPP LUN C-Mode 9600 /dev/sdhe
[461:0:0:49] disk NETAPP LUN C-Mode 9600 /dev/sdbh
7) On the netapp, LUN 49 was removed
8) After recreating the PVC / PV, the device was still there:
dmsetup ls (pvc create):
3600a09805a506576375d4f4e754d5434 (253:50)
9) After recreating the PVC / PV, it reconnected to dm-50 / 3600a09805a506576375d4f4e754d5434 and went back to active/ready/running:
multipath (pvc create):
3600a09805a506576375d4f4e754d5434 dm-50 NETAPP ,LUN C-Mode
size=5.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 462:0:0:49 sdhe 133:64 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 461:0:0:49 sdbh 67:176 active ready running
10) After recreating the PVC / PV, lsscsi is the same
lsscsi (pvc created)
[462:0:0:49] disk NETAPP LUN C-Mode 9600 /dev/sdhe
[461:0:0:49] disk NETAPP LUN C-Mode 9600 /dev/sdbh
11) On the NetApp, a new LUN 49 was created
Another interesting observation: the DM device is 5.0G, but the LUN / PVC is only 1Gi in size.
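For reference, the size mismatch can be confirmed on the node with standard tools (illustrative commands; the device name is taken from the outputs above):

# Reported size of the multipath device vs. the 1Gi requested by the PVC
blockdev --getsize64 /dev/mapper/3600a09805a506576375d4f4e754d5434
lsblk /dev/mapper/3600a09805a506576375d4f4e754d5434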
Version-Release number of selected component (if applicable):
3.11.146
How reproducible:
Random
Steps to Reproduce:
1) Collect pre logs
2) Delete PVC / PV
3) Collect deleted logs
4) Create PVC / PV
5) Collect created logs
6) Scale up pod
7) Wait for success / error
8) Scale down pod
9) Collect logs
10) Repeat
Actual results:
Mostly successful, but randomly you will hit this issue.
Expected results:
Devices should not be getting reused.
Master Log:
Node Log (of failed PODs):
PV Dump:
PVC Dump:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rh-test
  annotations:
    volume.beta.kubernetes.io/storage-class: netapp-block-standard
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
StorageClass Dump (if StorageClass used by PV/PVC):
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: 2019-11-25T18:43:15Z
  name: netapp-block-standard
  resourceVersion: "1171291242"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/netapp-block-standard
  uid: 58473621-0fb3-11ea-abeb-1948765234cc
parameters:
  backendType: ontap-san-economy
provisioner: netapp.io/trident
reclaimPolicy: Delete
volumeBindingMode: Immediate
Additional info:
Some related issues / fixes that have happened:
https://github.com/NetApp/trident/issues/101
https://github.com/NetApp/trident/issues/133
https://github.com/kubernetes/kubernetes/issues/59946
https://github.com/kubernetes/kubernetes/issues/60894
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2409