Bug 1746111 - RWO ceph RBD PV was staged on multiple worker nodes, causing filesystem header corruption
Summary: RWO ceph RBD PV was staged on multiple worker nodes, causing filesystem header corruption
Keywords:
Status: CLOSED DUPLICATE of bug 1745776
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Fabio Bertinatto
QA Contact: Liang Xia
URL:
Whiteboard: ocs-monkey
Depends On:
Blocks:
 
Reported: 2019-08-27 17:02 UTC by Shyamsundar
Modified: 2019-09-05 13:28 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-05 08:25:36 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg logs from 2 nodes that were used to map RBD devices (285.18 KB, application/gzip)
2019-08-27 17:02 UTC, Shyamsundar

Description Shyamsundar 2019-08-27 17:02:35 UTC
Created attachment 1608688 [details]
dmesg logs from 2 nodes that were used to map RBD devices

Description of problem:

Two rbd images mapped to a set of Kubernetes nodes were stuck during umount. Around the time the umount was issued, there were kernel hung task stack traces in dmesg (the dmesg logs from the nodes are attached).

Also, one of the images that was stuck umounting on a node was mounted on another node a while later, because there were no watchers reported on the image.
NOTE: For the above statement, a watcher was initially found (for 6-7 iterations, with a one-second gap between checks for watchers on the said image); after that there were no watchers, causing the second node to mount the device and report XFS superblock corruption.
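
For illustration, here is a rough sketch of that watcher-check-then-map sequence as a node plugin might perform it. This is a hypothetical helper, not the actual ceph-csi code, and the exact "rbd status" output parsing is an assumption:
----------------------
#!/usr/bin/env python3
# Hypothetical sketch of the watcher-check-then-map flow (not ceph-csi code).
import json
import subprocess
import time

def has_watchers(pool, image):
    # "rbd status" lists current watchers on an image; --format json is
    # assumed to return a "watchers" list (output details may vary by release).
    out = subprocess.run(
        ["rbd", "status", "{}/{}".format(pool, image), "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(json.loads(out).get("watchers", [])) > 0

def map_when_unwatched(pool, image, attempts=10, delay=1.0):
    for _ in range(attempts):
        if not has_watchers(pool, image):
            # No watcher seen, so assume nobody else has the image in use.
            # This is the weak point: a watch can drop temporarily even though
            # the image is still mapped and mounted on another node.
            return subprocess.run(
                ["rbd", "map", "{}/{}".format(pool, image)],
                check=True, capture_output=True, text=True,
            ).stdout.strip()  # e.g. "/dev/rbd2"
        time.sleep(delay)
    raise RuntimeError("image still has watchers; refusing to map")
----------------------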

Version-Release number of selected component (if applicable):

The version under test is actually upstream Ceph 14.2, as that is what we are using in Rook+Ceph-CSI tests with OCP 4.2.

How reproducible:

This happened twice during the same set of tests, which involve 3 nodes where rbd devices are mapped.

Steps to Reproduce:
1. Set up an OCP+OCS cluster, with Rook+Ceph and a storage class created for RBD (named csi-rbd)
2. Run the ocs-monkey tool (https://github.com/JohnStrunk/ocs-monkey) against this setup with the following CLI options:
  - ./workload_runner.py -l ./ -m ReadWriteOnce -n testws --oc /path/to/oc --ocs-namespace <rook-ceph namespace> -s csi-rbd
3. In about 2-10 minutes the workload runner will report failure and start gathering ocs and ocp must-gather logs

After this, inspecting the pods created by the tests may show them waiting for mounts that are stuck in umount on the other nodes where the pods were running prior to being scheduled to the current node.

Additional info:

From an rbd client perspective, the data collected was dmesg logs from the nodes where the devices were mounted (attached).

1) In the dmesg log dmesg-10-0-129-7.log, note the hung task timeouts around the following timestamp:
----------------------
[ 2212.346516] INFO: task hyperkube:1202 blocked for more than 120 seconds.
[ 2212.350337]       Not tainted 4.18.0-80.7.2.el8_0.x86_64 #1
[ 2212.353349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
----------------------
- The rbd device that hung was /dev/rbd8
- After leaving the machine overnight, the image was umounted at some later point in time (this was noted when attempting to gather more information the next day)
  - This message and its timestamp may be of relevance: [23907.729712] XFS (rbd8): Unmounting Filesystem

2) In the dmesg log dmesg-10-0-162-206.log, the events of interest are as follows:
----------------------
Aug 26 15:57:54.208979 ip-10-0-162-206 kernel: INFO: task hyperkube:1179 blocked for more than 120 seconds.
Aug 26 15:57:54.209129 ip-10-0-162-206 kernel:       Not tainted 4.18.0-80.7.2.el8_0.x86_64 #1
Aug 26 15:57:54.209173 ip-10-0-162-206 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
----------------------
- The rbd device that hung was /dev/rbd0
- Further below the above messages, note some more XFS errors reporting corruption:
----------------------
Aug 26 22:34:57.001029 ip-10-0-162-206 kernel: XFS (rbd0): Metadata corruption detected at xfs_dir3_data_reada_verify+0x25/0x70 [xfs], xfs_dir3_data_reada block 0x53c100 
Aug 26 22:34:57.001213 ip-10-0-162-206 kernel: XFS (rbd0): Unmount and run xfs_repair
Aug 26 22:34:57.001342 ip-10-0-162-206 kernel: XFS (rbd0): First 128 bytes of corrupted metadata buffer:
----------------------

3) In the dmesg log dmesg-10-0-159-243.log, the events of interest are as follows:
- The image that was still stuck in (2) above was mounted successfully as /dev/rbd2 on this node
- The above mount was done after checking for watchers on the image, where the first 6 calls to check for watchers returned a watcher on the image; then at some point there were no more watchers, and the Kubernetes ceph-CSI plugin mapped and mounted the device on the node
- After the mount, the dmesg logs of interest are:
---------------------
[23992.664392] XFS (rbd2): Metadata corruption detected at xfs_dir3_data_reada_verify+0x25/0x70 [xfs], xfs_dir3_data_reada block 0x3c2080 
[23992.676746] XFS (rbd2): Unmount and run xfs_repair
[23992.681815] XFS (rbd2): First 128 bytes of corrupted metadata buffer:
---------------------

Comment 1 RHEL Program Management 2019-08-27 17:02:42 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 3 Jason Dillaman 2019-08-27 17:12:05 UTC
I am seeing a lot of osdX down/up messages right before the failure. What's going on w/ the Ceph cluster?

Comment 4 Shyamsundar 2019-08-27 17:21:07 UTC
(In reply to Jason Dillaman from comment #3)
> I am seeing a lot of osdX down/up messages right before the failure. What's
> going on w/ the Ceph cluster?

I did read those lines and wondered what was happening as well. What I can tell is that the Rook-based Ceph cluster is running on an AWS OCP setup. I have had various problems with this 3-worker-node footprint setup, from the Kubernetes SDN (network) to other parts. I suspect the network is not "sane" enough for Ceph, but how can I tell?

What other information may help you in this regard? I have rook-ceph pod logs from this event (collected using ocs-must-gather); they can be found here: http://file.rdu.redhat.com/srangana/bz1745773.tar.gz if they help. The tarball also contains journalctl from the 2 worker nodes in addition to the other logs.

NOTE: None of the Ceph pods (OSDs) were restarted during the tests or reported as down

Comment 5 Jason Dillaman 2019-08-27 17:29:11 UTC
If you can get a dump of "/sys/kernel/debug/ceph/<fsid>.<clientid>/*" during this event, it would be helpful to see if some IOs are stuck waiting on an OSD.
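
For reference, a small sketch of how such a dump could be collected on a node (an assumed helper, not an existing tool; run as root with debugfs mounted, and the file names such as osdc/monc are whatever debugfs exposes):
----------------------
#!/usr/bin/env python3
# Assumed helper to snapshot /sys/kernel/debug/ceph/<fsid>.<clientid>/* so
# stuck in-flight requests (e.g. the osdc file) can be inspected later.
import pathlib
import time

def dump_ceph_debugfs(out_dir="ceph-debugfs-dump"):
    base = pathlib.Path("/sys/kernel/debug/ceph")
    dest = pathlib.Path(out_dir) / time.strftime("%Y%m%dT%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    for client_dir in base.iterdir():        # one dir per <fsid>.<clientid>
        for entry in client_dir.iterdir():
            try:
                data = entry.read_text()
            except OSError as err:           # some entries may not be readable
                data = "<unreadable: {}>\n".format(err)
            (dest / "{}.{}".format(client_dir.name, entry.name)).write_text(data)
    return dest

if __name__ == "__main__":
    print(dump_ceph_debugfs())
----------------------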

Comment 6 Ilya Dryomov 2019-08-28 10:24:58 UTC
What kind of workload does workload_runner.py generate?  Looks like there are about a dozen rbd devices on each worker node -- how many workload containers does ocs-monkey spawn, and what does each of them do?

An OSD can be marked down by the monitor for various reasons (heartbeats, etc.); in that case the OSD pod will not be restarted or reported as down.

At the time ocs-must-gather ran, the cluster was still unhappy:

HEALTH_WARN Degraded data redundancy: 1796/18015 objects degraded (9.969%), 100 pgs degraded, 4 pgs undersized; 3 pools have too many placement groups
[WRN] PG_DEGRADED: Degraded data redundancy: 1796/18015 objects degraded (9.969%), 100 pgs degraded, 4 pgs undersized
    pg 1.15 is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.2e is stuck undersized for 70m, current state active+recovery_wait+undersized+degraded+remapped, last acting [2,0]
    pg 1.2f is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.30 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.31 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.32 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.33 is stuck undersized for 70m, current state active+recovery_wait+undersized+degraded+remapped, last acting [2,0]
    pg 1.34 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.35 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.36 is stuck undersized for 70m, current state active+recovery_wait+undersized+degraded+remapped, last acting [0,2]
    pg 1.37 is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.38 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.39 is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.3a is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.3b is active+recovering+degraded, acting [0,2,1]
    pg 1.3c is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.3d is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.3e is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.3f is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.40 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.41 is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.42 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.43 is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.44 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.45 is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.46 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.47 is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.48 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.49 is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.4a is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.4b is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.4c is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.4d is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.4e is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.4f is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.50 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.51 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.52 is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.53 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.54 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.55 is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.56 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.57 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.58 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.59 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.5a is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.5b is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.5c is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.5d is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.5e is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.5f is active+recovery_wait+degraded, acting [1,0,2]
[WRN] POOL_TOO_MANY_PGS: 3 pools have too many placement groups
    Pool mycephfs-data0 has 100 placement groups, should have 4
    Pool replicapool has 100 placement groups, should have 4
    Pool mycephfs-metadata has 100 placement groups, should have 16

All of them are replicapool (i.e. rbd) PGs.

Each OSD reported hundreds of slow requests, pretty much every minute:

2019-08-26T17:31:21.397381293Z 2019-08-26 17:31:21.397316 I | 2019-08-26T17:31:21.395+0000 7ff6ddea6700 -1 osd.0 75 get_health_metrics reporting 216 slow ops, oldest is osd_op(client.24818.0:154908 1.8 1:10f7bdfe:::rbd_data.739a98c4e1d2.00000000000001c2:head [set-alloc-hint object_size 4194304 write_size 4194304,write 110592~69632] snapc 0=[] ondisk+write+known_if_redirected e75)

2019-08-26T17:30:55.050433341Z 2019-08-26 17:30:55.050372 I | 2019-08-26T17:30:55.049+0000 7fdb303f5700 -1 osd.1 75 get_health_metrics reporting 215 slow ops, oldest is osd_op(client.24818.0:155056 1.2d 1:b53576ea:::rbd_data.739a98c4e1d2.00000000000002de:head [set-alloc-hint object_size 4194304 write_size 4194304,write 69632~24576] snapc 0=[] ondisk+write+known_if_redirected e75)

2019-08-26T16:20:09.104164701Z 2019-08-26 16:20:09.104105 I | 2019-08-26T16:20:09.103+0000 7f0cb1c5d700 -1 osd.2 69 get_health_metrics reporting 181 slow ops, oldest is osd_op(client.24818.0:141383 1.30 1:0ff6c40f:::rbd_data.6bf94c9ef58.000000000000008f:head [set-alloc-hint object_size 4194304 write_size 4194304,write 4083712~4096] snapc 0=[] ondisk+write+known_if_redirected e69)

Hung workload tasks on the client side are a natural consequence...

Comment 7 Shyamsundar 2019-08-28 17:03:39 UTC
(In reply to Ilya Dryomov from comment #6)
> What kind of workload does workload_runner.py generate?  Looks like there is
> about a dozen rbd devices on each worker node -- how many workload
> containers does ocs-monkey spawn, what each of them does?


The workload that is executed is this: https://github.com/JohnStrunk/ocs-monkey/blob/master/osio-workload/workload.py

The above does a kernel tarball extraction every 2 minutes and an "rm -rf" of an extracted directory every 2 minutes. Each extraction goes into its own directory. This is done until the application pod is deleted.

At a given time there may be at most 30 app pods (across 3 nodes) executing the same workload as above.
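
For reference, a minimal sketch of that loop as described above (a simplification under my assumptions, with made-up paths; not the actual workload.py from ocs-monkey):
----------------------
#!/usr/bin/env python3
# Simplified sketch of the described workload loop (assumption; see
# workload.py in the ocs-monkey repository for the real script).
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

DATA_DIR = Path("/data")            # assumed: the RWO PV mount inside the pod
KERNEL_TARBALL = "/kernel.tar.gz"   # assumed path to a kernel source tarball
INTERVAL = 120                      # extract and delete every 2 minutes

def run_until_deleted():
    extracted = []
    while True:                      # loops until the application pod is deleted
        target = Path(tempfile.mkdtemp(dir=DATA_DIR))
        subprocess.run(["tar", "-xzf", KERNEL_TARBALL, "-C", str(target)], check=True)
        extracted.append(target)
        if len(extracted) > 1:       # "rm -rf" of a previously extracted dir
            shutil.rmtree(extracted.pop(0), ignore_errors=True)
        time.sleep(INTERVAL)

if __name__ == "__main__":
    run_until_deleted()
----------------------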

> Hung workload tasks on the client side is a natural consequence...

Agreed. An under-provisioned setup can have such hangs, and we possibly need to see how best to surface that problem (something outside the scope of this bug for the moment, IMO). Hence problem (1) in the description, where the image was not unmapped from the node, is a valid state to be in.

My concern is more with (2) and (3), where a hung device was mounted on another node because the watcher check came back with no watchers on the image. Is there anything missing here that we need to address?

These are the relevant watcher logs from the CSI node plugin that checks for watchers and then mounts the image:

Log file: ./ocs-monkey-82240754/ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-n7ddn/csi-rbdplugin/csi-rbdplugin/logs/current.log

2019-08-26T15:54:37.38006135Z I0826 15:54:37.379483   20545 utils.go:119] ID: 415 GRPC call: /csi.v1.Node/NodeStageVolume
2019-08-26T15:54:37.38740967Z I0826 15:54:37.379503   20545 utils.go:120] ID: 415 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:03:50.794222028Z I0826 16:03:50.793915   20545 rbd_util.go:488] ID: 415 setting disableInUseChecks on rbd volume to: false
2019-08-26T16:03:50.847343724Z I0826 16:03:50.847269   20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789,
 pool replicapool
2019-08-26T16:03:50.949929449Z I0826 16:03:50.949871   20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:50.949929449Z  watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971

2019-08-26T16:03:51.950168097Z I0826 16:03:51.950106   20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789,
 pool replicapool
2019-08-26T16:03:52.042062014Z I0826 16:03:52.042009   20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:52.042062014Z  watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971
 pool replicapool
2019-08-26T16:03:53.442307389Z I0826 16:03:53.442253   20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789,
 pool replicapool
2019-08-26T16:03:53.575071417Z I0826 16:03:53.575017   20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:53.575071417Z  watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971

2019-08-26T16:03:55.535308632Z I0826 16:03:55.535249   20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789,
 pool replicapool
2019-08-26T16:03:55.634365555Z I0826 16:03:55.634296   20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:55.634365555Z  watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971

2019-08-26T16:03:58.38780497Z I0826 16:03:58.387504   20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T16:03:58.722279981Z I0826 16:03:58.722201   20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:58.722279981Z  watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971

2019-08-26T16:04:02.608762443Z I0826 16:04:02.608465   20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T16:04:03.260864611Z W0826 16:04:03.260556   20545 rbd_util.go:173] ID: 415 rbd: no watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010
2019-08-26T16:04:03.260864611Z I0826 16:04:03.260590   20545 rbd_attach.go:202] ID: 415 rbd: map mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789

2019-08-26T16:20:37.107013625Z I0826 16:20:37.106503   20545 nodeserver.go:141] ID: 415 rbd image: 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010/replicapool was successfully mapped at /dev/rbd2
2019-08-26T16:20:38.639736721Z I0826 16:20:38.639407   20545 mount_linux.go:460] Attempting to mount disk: xfs /dev/rbd2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010
2019-08-26T16:20:38.639736721Z I0826 16:20:38.639440   20545 nsenter_mount.go:80] nsenter mount /dev/rbd2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 xfs [defaults]
2019-08-26T16:20:38.639736721Z I0826 16:20:38.639481   20545 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/systemd-run --description=Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 --scope -- /bin/mount -t xfs -o defaults /dev/rbd2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010]
---> Command hung!
/bin/mount -t xfs -o defaults /dev/rbd2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010

Comment 8 Jason Dillaman 2019-08-28 17:20:31 UTC
(In reply to Shyamsundar from comment #7)
> My concern is more (2) and (3), where a hung device was mounted on another
> node because the watchers came back with no watchers on the image. Is there
> anything that is missing here that we need to address?

How did the PVC get re-scheduled to another node when the first node was unresponsive? That seems like a k8s bug to me. If the unmount / unmap operation times out, k8s shouldn't just assume that the PV is safe to map / mount elsewhere. This gets back to that fencing issue Jira ticket.

Comment 9 Ilya Dryomov 2019-08-28 17:57:33 UTC
(In reply to Shyamsundar from comment #7)
> (In reply to Ilya Dryomov from comment #6)
> > What kind of workload does workload_runner.py generate?  Looks like there is
> > about a dozen rbd devices on each worker node -- how many workload
> > containers does ocs-monkey spawn, what each of them does?
> 
> 
> The workload that is executed is this:
> https://github.com/JohnStrunk/ocs-monkey/blob/master/osio-workload/workload.
> py
> 
> The above does a kernel tar ball extraction per 2 minutes and a "rm -rf" of
> an extracted dir every 2 minutes. Each of these extractions go into their
> own directories. This is done till the application pod is  deleted.
> 
> At a given time there maybe at max 30 app pods (across 3 nodes) that are
> executing the same workload as above.
> 
> > Hung workload tasks on the client side is a natural consequence...
> 
> Agreed. An under provisioned setup can have such hangs, and we possibly need
> to see how best to notify that problem (something outside the scope of this
> bug for the moment IMO). Hence the problem (1) in #description, where the
> image was not unmapped from the node, is a valid state to be in.
> 
> My concern is more (2) and (3), where a hung device was mounted on another
> node because the watchers came back with no watchers on the image. Is there
> anything that is missing here that we need to address?

Well, as you can see a watch is not a reliable indicator of whether the image is mapped (be it on the same node or on a different node).  The watch is lost if the OSD responsible for the header object goes down (not necessarily crashes, as I said earlier it can be (wrongly) marked down by the monitor, etc).  The watch is lost if the client doesn't check in in time, which can happen even if the client is alive and happily sending I/O, not just to other OSDs but also to the OSD responsible for the header object.  In the end the client will re-establish the watch, but it can take a while.  Fundamentally, a watch is just a couple of messages exchanged over a regular OSD session, with a little bit of additional state on the OSD side.

If this is not prevented at a higher level, you might want to look at the exclusive-lock feature and "rbd map -o exclusive". It adds proper blacklisting, but it's not without its issues. It is essentially hard fencing at the lowest level possible...
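
For illustration, a minimal sketch of mapping with that option. Assumptions: shelling out to the rbd CLI, exclusive-lock already enabled on the image; this is not the ceph-csi implementation, and the exact failure semantics depend on the kernel and Ceph versions:
----------------------
#!/usr/bin/env python3
# Illustration of a hard-fenced map using the exclusive-lock feature and
# "rbd map -o exclusive" (not the ceph-csi implementation).
import subprocess

def map_exclusive(pool, image):
    spec = "{}/{}".format(pool, image)
    # Assumption: the image has the exclusive-lock feature enabled, e.g. via
    # "rbd feature enable <pool>/<image> exclusive-lock" at creation time.
    # With "-o exclusive" the kernel client takes the exclusive lock at map
    # time, and the map is expected to fail if another client already holds it.
    result = subprocess.run(
        ["rbd", "map", "-o", "exclusive", spec],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()  # e.g. "/dev/rbd0"

# Example: dev = map_exclusive("replicapool", "csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010")
----------------------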

Comment 10 Shyamsundar 2019-08-28 18:33:47 UTC
Adding more context to the problem: I did not pay close attention to the timestamps of the events between (2) and (3), and in an effort to answer the observations in comment #8 and comment #9, I now have more data and a sequence of events to share.

The image was mounted to a different node (10-0-159-243) [call this node-2 in the discussion], when it was active on node 10-0-159-243 [call this node-1 in the discussion]. This is based on the CSI node plugin logs and time stamps in there.

Comment #7 details logs from node 10-0-159-243, which was the second node that was attempting to mount the image.
  - The time at which this event occurred was, 15:54:37 (until) 16:20:38
  - The event was a request to stage (map the device and mount it) the image on the node, which eventually got stuck at mount, with dmesg logs showing that the XFS header was corrupt

The logs that follow are from the node 10-0-159-243 where the unmap of the image was stuck.
  - The time when the image was requested to be unmapped is 16:21:24.072361, which is about a minute after the call to mount it on node-2

So the image was mounted and active on node-1 when kube decided to mount it on node-2, which checked for watchers and, as @Ilya points out, found the watcher missing at some point in time, mounted the device, and promptly corrupted the XFS header as there were now 2 nodes using the same device. This seems to be the root cause of the problem.

@Jason and @Illya I assume you agree on this.

node-1 logs (of interest to the image in question):
Log file: ocs-monkey-82240754/ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log
2019-08-26T15:43:09.241384822Z I0826 15:43:09.238373   18583 utils.go:120] ID: 120 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:43:14.366749924Z I0826 15:43:14.357786   18583 rbd_util.go:151] ID: 120 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T15:43:14.586560692Z W0826 15:43:14.586510   18583 rbd_util.go:173] ID: 120 rbd: no watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010
2019-08-26T15:43:14.783711884Z I0826 15:43:14.781427   18583 nodeserver.go:141] ID: 120 rbd image: 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010/replicapool was successfully mapped at /dev/rbd0
2019-08-26T15:43:15.115859874Z I0826 15:43:15.110500   18583 nodeserver.go:161] ID: 120 rbd: successfully mounted volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to stagingTargetPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010

2019-08-26T15:43:15.145657364Z I0826 15:43:15.141985   18583 utils.go:120] ID: 122 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/2872d7a2-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:43:15.190647894Z I0826 15:43:15.190492   18583 nodeserver.go:261] ID: 122 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/2872d7a2-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount

2019-08-26T15:46:51.917292763Z I0826 15:46:51.914865   18583 utils.go:120] ID: 251 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:46:51.932852045Z I0826 15:46:51.932516   18583 nodeserver.go:110] ID: 251 rbd: volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 is already mounted to /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010, skipping

2019-08-26T15:46:51.945942582Z I0826 15:46:51.943788   18583 utils.go:120] ID: 253 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/b8ccd6ac-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:46:51.988957899Z I0826 15:46:51.988160   18583 nodeserver.go:261] ID: 253 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/b8ccd6ac-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount

2019-08-26T16:21:23.966851502Z I0826 16:21:23.966027   18583 utils.go:120] ID: 490 GRPC request: {"target_path":"/var/lib/kubelet/pods/b8ccd6ac-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:23.968661157Z I0826 16:21:23.967243   18583 utils.go:120] ID: 491 GRPC request: {"target_path":"/var/lib/kubelet/pods/2872d7a2-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:23.99221676Z I0826 16:21:23.991284   18583 nodeserver.go:404] ID: 490 rbd: successfully unbound volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 from /var/lib/kubelet/pods/b8ccd6ac-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount
2019-08-26T16:21:23.992349012Z I0826 16:21:23.992312   18583 nodeserver.go:404] ID: 491 rbd: successfully unbound volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 from /var/lib/kubelet/pods/2872d7a2-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount

2019-08-26T16:21:24.073095064Z I0826 16:21:24.072361   18583 utils.go:120] ID: 493 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:24.07920296Z I0826 16:21:24.079178   18583 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010]

2019-08-26T16:21:39.587992773Z I0826 16:21:39.586899   18583 utils.go:120] ID: 495 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:39.956828169Z E0826 16:21:39.956609   18583 nodeserver.go:468] ID: 495 error unmapping volume (0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010): (rbd: unmap for spec (replicapool/csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010) failed (exit status 16): (rbd: sysfs write failed
2019-08-26T16:21:39.956828169Z E0826 16:21:39.956682   18583 utils.go:123] ID: 495 GRPC error: rpc error: code = Internal desc = rbd: unmap for spec (replicapool/csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010) failed (exit status 16): (rbd: sysfs write failed

2019-08-26T16:21:40.995074421Z I0826 16:21:40.994448   18583 utils.go:120] ID: 497 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:41.300554321Z E0826 16:21:41.300488   18583 nodeserver.go:468] ID: 497 error unmapping volume (0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010): (rbd: unmap for spec (replicapool/csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010) failed (exit status 16): (rbd: sysfs write failed
2019-08-26T16:21:41.300554321Z E0826 16:21:41.300534   18583 utils.go:123] ID: 497 GRPC error: rpc error: code = Internal desc = rbd: unmap for spec (replicapool/csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010) failed (exit status 16): (rbd: sysfs write failed

As an aside:
Kubernetes also asked to stage (map the image and mount it to a base path) the same image twice on the same node, which is also suspect overall, as there is no reason to repeat the same request.

Comment 11 Jason Dillaman 2019-08-28 19:03:58 UTC
(In reply to Shyamsundar from comment #10)
> The logs that follow are from the node 10-0-159-243 where the unmap of the
> image was stuck.
>   - The time when the image was requested to be unmapped is 16:21:24.072361,
> which is about a minute after the call to mount it on node-2

I see 15:42:03 as the first reference to stage that PVC on the original node ...

ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log:2019-08-26T15:42:03.828041114Z I0826 15:42:03.818792   18583 utils.go:120] ID: 78 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}

... followed shortly thereafter by the original mount request. So perhaps that is just implying that k8s timed out on the original mount request and then assumed that it was safe to stage and mount on another node (w/o calling unstage first)?

Aug 26 15:47:34 ip-10-0-162-206 hyperkube[1156]: I0826 15:47:34.888364    1156 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"testws", Name:"osio-worker-925889458-79df64c77f-gc7tn", UID:"a3050904-c818-11e9-9022-02f7b213f410", APIVersion:"v1", ResourceVersion:"27340", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.MountDevice failed for volume "pvc-81c6766a-c818-11e9-9022-02f7b213f410" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

> @Jason and @Illya I assume you agree on this.

Yes, agreed.

Comment 12 Shyamsundar 2019-08-28 19:23:49 UTC
(In reply to Jason Dillaman from comment #11)
> (In reply to Shyamsundar from comment #10)
> > The logs that follow are from the node 10-0-159-243 where the unmap of the
> > image was stuck.
> >   - The time when the image was requested to be unmapped is 16:21:24.072361,
> > which is about a minute after the call to mount it on node-2
> 
> I see 15:42:03 as the first reference to stage that PVC on the original node
> ...
> 
> ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-
> rbdplugin/csi-rbdplugin/logs/current.log:2019-08-26T15:42:03.828041114Z
> I0826 15:42:03.818792   18583 utils.go:120] ID: 78 GRPC request:
> {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/
> kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount",
> "volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":
> {"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool",
> "storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.
> rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-
> fcc0c5c1-c817-11e9-be8f-0a580a830010"}

- The above request was ack'd as a success here:
2019-08-26T15:42:07.685590966Z I0826 15:42:07.684802   18583 nodeserver.go:161] ID: 78 rbd: successfully mounted volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to stagingTargetPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010

- Later a publish happened as well:
2019-08-26T15:42:10.543570328Z I0826 15:42:10.428065   18583 utils.go:120] ID: 84 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/fcca66e2-c817-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}

2019-08-26T15:42:10.573657693Z I0826 15:42:10.573354   18583 nodeserver.go:261] ID: 84 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/fcca66e2-c817-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount

- Then an unpublish:
2019-08-26T15:42:10.932805093Z I0826 15:42:10.868051   18583 utils.go:120] ID: 85 GRPC request: {"target_path":"/var/lib/kubelet/pods/fcca66e2-c817-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}

2019-08-26T15:42:11.049305483Z I0826 15:42:11.030925   18583 nodeserver.go:404] ID: 85 rbd: successfully unbound volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 from /var/lib/kubelet/pods/fcca66e2-c817-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount

- and an unstage:
2019-08-26T15:42:11.176133639Z I0826 15:42:11.174985   18583 utils.go:120] ID: 87 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}

2019-08-26T15:42:11.756380886Z I0826 15:42:11.756328   18583 nodeserver.go:472] ID: 87 successfully unmounted volume (0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010)

Then the next stage occurred, for which I posted logs in comment #10.

Also, all the above (un)stage/(un)publish requests responded within 5 seconds, so timeout tolerations were met.

> 
> ... followed shortly thereafter by the original mount request. So perhaps
> that is just implying that k8s timed out on the original mount request and
> then assumed that it was safe to stage and mount on another node (w/o
> calling unstage first)?

You may be seeing this bug occurring here: https://bugzilla.redhat.com/show_bug.cgi?id=1744958

> 
> Aug 26 15:47:34 ip-10-0-162-206 hyperkube[1156]: I0826 15:47:34.888364   
> 1156 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"testws",
> Name:"osio-worker-925889458-79df64c77f-gc7tn",
> UID:"a3050904-c818-11e9-9022-02f7b213f410", APIVersion:"v1",
> ResourceVersion:"27340", FieldPath:""}): type: 'Warning' reason:
> 'FailedMount' MountVolume.MountDevice failed for volume
> "pvc-81c6766a-c818-11e9-9022-02f7b213f410" : rpc error: code =
> DeadlineExceeded desc = context deadline exceeded

From a timestamp perspective, the message from kubelet is quite a bit later than the first and second stage success responses: "15:47:34".

For bz #1744958 I never saw a stage request on another node; it kept failing on the same node, but that is possibly because the test case had failed by then and stopped moving (deleting and recreating) the pods around.

>> As an aside:
>> Kubernetes also asked to stage (map the image and mount it to a base path) the same image twice on the same node, which is also suspect overall, as there is no reason to repeat the same request.

It also possibly explains my seeing 2 stage requests on the same node.

> 
> > @Jason and @Illya I assume you agree on this.
> 
> Yes, agreed.

Ok, time to transfer this to OCP-Storage then, unless we want to iron out any other gaps.

Comment 13 Jason Dillaman 2019-08-28 19:30:47 UTC
(In reply to Shyamsundar from comment #12)
> Ok, time to transfer this to OCP-Storage then, unless we want to iron out
> any other gaps.

Certainly good to get the core experts to look at this even if it needs to bounce back to OCS.

Comment 14 Shyamsundar 2019-08-29 01:46:46 UTC
Correcting incorrect node names in the following comment

(In reply to Shyamsundar from comment #10)
> The image was mounted to a different node (10-0-159-243) [call this node-2
> in the discussion], when it was active on node 10-0-159-243 [call this
> node-1 in the discussion]. This is based on the CSI node plugin logs and
> time stamps in there.

The above should read:
The image was mounted to a different node (10-0-159-243) [call this node-2
in the discussion], when it was active on node 10-0-162-206 [call this
node-1 in the discussion]. This is based on the CSI node plugin logs and
time stamps in there.

Corrected node-1 address/name, which was duplicate of node-2 in the earlier comment.

> The logs that follow are from the node 10-0-159-243 where the unmap of the
> image was stuck.

The above should read:
The logs that follow are from the node 10-0-162-206 where the unmap of the
image was stuck.

Corrected node address/name from 10-0-159-243 to 10-0-162-206

Comment 15 Shyamsundar 2019-08-29 02:10:50 UTC
Moving this bug to OCP-Storage for problem analysis and inputs.

Disclaimers:
- At present I do not have PVC or Pod describe data for the problematic pods/pvcs
- What we traced in this bug was the PV and the rbd image, to understand why we faced corruption in the image (or why XFS reported corruption in the image)
- The setup has been reclaimed, hence currently no further data is available other than the already collected must-gather and ocs-must-gather data present at: http://file.rdu.redhat.com/srangana/bz1745773.tar.gz
- The logs are the same for the reported bugs:
  - https://bugzilla.redhat.com/show_bug.cgi?id=1745776
  - https://bugzilla.redhat.com/show_bug.cgi?id=1745773
(added Hemant and Fabio to the CC of this bug, as they are assigned the above bugs)

Problem description:

An RWO PV is staged on 2 nodes, causing the RWO image to lose filesystem integrity.

The reason ceph-rbd/CSI allowed the PV to be staged on the second node is the currently weaker scheme used for exclusive access (i.e., the watcher-based scheme in rbd). This was under the assumption that Kubernetes (and hence OCP) provides enough resilience for RWO PVs not to be staged on multiple nodes.

Problematic PV: pvc-fc9b1743-c817-11e9-9022-02f7b213f410

PV activity was found on nodes 10.0.159.243 and 10.0.162.206, as shown by the logs below:
----------------
$ grep -l pvc-fc9b1743-c817-11e9-9022-02f7b213f410 ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-?????/csi-rbdplugin/csi-rbdplugin/logs/current.log
ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-n7ddn/csi-rbdplugin/csi-rbdplugin/logs/current.log
ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log
----------------

ceph-csi rbd node plugin name and pod details
----------------
$ oc -n rook-ceph get pods -o wide | grep rbd
csi-rbdplugin-hdtdc                           3/3     Running     0          166m   10.0.129.7     ip-10-0-129-7.ec2.internal     <none>           <none>
csi-rbdplugin-n7ddn                           3/3     Running     0          166m   10.0.159.243   ip-10-0-159-243.ec2.internal   <none>           <none>
csi-rbdplugin-provisioner-0                   5/5     Running     0          166m   10.131.0.16    ip-10-0-129-7.ec2.internal     <none>           <none>
csi-rbdplugin-qgwql                           3/3     Running     0          166m   10.0.162.206   ip-10-0-162-206.ec2.internal   <none>           <none>

$ less ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-n7ddn/csi-rbdplugin/csi-rbdplugin/logs/current.log
2019-08-26T15:31:54.601747941Z I0826 15:31:54.600915   20545 utils.go:120] ID: 2 GRPC request: {}
2019-08-26T15:31:54.602450009Z I0826 15:31:54.601893   20545 nodeserver-default.go:58] Using default NodeGetInfo
2019-08-26T15:31:54.60307398Z I0826 15:31:54.601913   20545 utils.go:125] ID: 2 GRPC response: {"node_id":"ip-10-0-159-243.ec2.internal"}

$ less ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-hdtdc/csi-rbdplugin/csi-rbdplugin/logs/current.log
2019-08-26T15:31:42.293055977Z I0826 15:31:42.292787   30300 utils.go:119] ID: 2 GRPC call: /csi.v1.Node/NodeGetInfo
2019-08-26T15:31:42.293460446Z I0826 15:31:42.292808   30300 utils.go:120] ID: 2 GRPC request: {}
2019-08-26T15:31:42.293460446Z I0826 15:31:42.293452   30300 nodeserver-default.go:58] Using default NodeGetInfo
2019-08-26T15:31:42.29405322Z I0826 15:31:42.293466   30300 utils.go:125] ID: 2 GRPC response: {"node_id":"ip-10-0-129-7.ec2.internal"}

$ less ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log
2019-08-26T15:31:19.346043474Z I0826 15:31:19.346010   18583 utils.go:119] ID: 2 GRPC call: /csi.v1.Node/NodeGetInfo
2019-08-26T15:31:19.346411115Z I0826 15:31:19.346031   18583 utils.go:120] ID: 2 GRPC request: {}
2019-08-26T15:31:19.346411115Z I0826 15:31:19.346402   18583 nodeserver-default.go:58] Using default NodeGetInfo
2019-08-26T15:31:19.346903848Z I0826 15:31:19.346416   18583 utils.go:125] ID: 2 GRPC response: {"node_id":"ip-10-0-162-206.ec2.internal"}
----------------

The PV was initially staged and published on node 10.0.162.206 (please see comment #10 for ceph-csi logs that discuss the activities, also see comment #14 for some errata information in the original comment)

Subsequent to this, the PV was staged on node 10-0-159-243 (ceph-csi logs provided in comment #7), even before the first unstage request was made to node 10.0.162.206.

The above behavior, combined with slow OSD responses and the resulting watcher detection failure, caused the image to become corrupt.

Questions:
- This behavior of attempting to stage an RWO volume on another node, even before the first node has unstaged it, seems to be in error; is this correct?
- Should the CSI provider take additional steps to safeguard against multi-node access for RWO volumes? Or is there enough state management in Kubernetes to ensure that this will not happen? (IOW, should we actively ensure exclusive ownership and apply fencing in ceph-rbd as well?)

Comment 16 Humble Chirammal 2019-08-29 03:35:19 UTC
>An RWO PV is staged on 2 nodes, causing the RWO image to lose filesystem integrity.

Shyam, I have yet to look into the logs; however, I am trying to understand how "staging" on 2 different nodes causes a filesystem integrity issue. Staging brings the share to a global mount path, but AFAICT it **should not** cause any access to the FS, as it is not mounted or made available to an application (which is a pod in the OCP/Kube context). IOW, just a staged volume should not cause FS integrity issues, in my opinion.

The multi-attach volume error happens or is validated at the time of publish/attach, which is actually the next course of action after staging. I think that's why it was staged on the second node. I would let the OCP team confirm this, though.

Comment 17 Shyamsundar 2019-08-29 11:21:58 UTC
(In reply to Humble Chirammal from comment #16)
> >An RWO PV is staged on 2 nodes, causing the RWO image to lose filesystem integrity.
> 
> Shyam, I am yet to look into the logs, however I am trying to understand,
> how "staging" on 2 different nodes cause a filesystem integrity issue?
> staging brings the share to a global mount  path, but afaict, it **should
> not** cause any access to FS, as this is not mounted or not made available
> to an application ( which is a pod in ocp/kube context). IOW, just a staged
> volume should not cause FS integrity issues in my opinion.

Staging involves mapping the rbd image and then mounting it to the staging path, so a filesystem mount does occur, at which point the FS would access the data in the image. Whether XFS updates its superblock or performs other such activity at that point, I am not aware.

> 
> Multiattach volume error happens or validated at time of publish/attach,
> which is actually next course of action after staging. I think thats why it
> is staged on second node. I would let OCP team to confirm this though.

Comment 18 Shyamsundar 2019-08-29 11:28:02 UTC
(In reply to Humble Chirammal from comment #16)
> Multiattach volume error happens or validated at time of publish/attach,
> which is actually next course of action after staging. I think thats why it
> is staged on second node. I would let OCP team to confirm this though.

I have seen multi-attach checks prior to staging (from the logs), and irrespective of that, I would not depend on staging never touching the mount in any fashion for it to be assumed safe to call on multiple nodes in the RWO case. As you state, let us hear from the OCP team.

Comment 19 Humble Chirammal 2019-08-29 12:36:44 UTC
(In reply to Shyamsundar from comment #17)
> (In reply to Humble Chirammal from comment #16)
> > >An RWO PV is staged on 2 nodes, causing the RWO image to lose filesystem integrity.
> > 
> > Shyam, I am yet to look into the logs, however I am trying to understand,
> > how "staging" on 2 different nodes cause a filesystem integrity issue?
> > staging brings the share to a global mount  path, but afaict, it **should
> > not** cause any access to FS, as this is not mounted or not made available
> > to an application ( which is a pod in ocp/kube context). IOW, just a staged
> > volume should not cause FS integrity issues in my opinion.
> 
> Staging involves mapping the rbd image and then mounting it to the staging
> path, so there is filesystem mounting that occurs, at which point the FS
> would access the data in the image. Does XFS update its super block or other
> such activity at this point in time I am not aware.
> 
> > 

By 'staging brings the share to a global mount path' in my comment, I was referring to this 2-step operation (map + global path mount).
Again, my impression was that just mounting a volume like this should not cause any FS integrity issue, but I am not 100% sure.
From the bz description I got the feeling that this stage operation caused the FS integrity issue, so I was confused and was trying to confirm.
It looks like we are not sure whether 'this staging' process causes the FS integrity issue or not.
Maybe the XFS/RBD team can clarify this.

Comment 20 Yaniv Kaul 2019-08-29 13:02:49 UTC
(In reply to Jason Dillaman from comment #8)
> (In reply to Shyamsundar from comment #7)
> > My concern is more (2) and (3), where a hung device was mounted on another
> > node because the watchers came back with no watchers on the image. Is there
> > anything that is missing here that we need to address?
> 
> How did the PVC get re-scheduled to another node w/ the first node was
> unresponsive? That seems like a k8s bug to me. If the unmount / unmap
> operation times out, k8s shouldn't just assume that the PV is safe to map /
> mount elsewhere. This gets back to that fencing issue Jira ticket.

Even if it's a K8S bug, we should make the extra effort to protect against corruptions.
Not that locking is easy and straightforward, but it's better than corruption.

Comment 21 Jason Dillaman 2019-08-29 13:04:44 UTC
(In reply to Humble Chirammal from comment #19)
> By 'staging brings the share to a global mount  path,' in my comment I was
> referring this 2 stage operation ( map + global path mount). 
> Again my impression was that, just mounting a volume like this should not
> cause any FS integrity issue, but I am not 100% sure. 
> From the bz description I got a feeling that, this stage operation cause the
> FS integrity issue, so I was confused and was trying to confirm. 
> It looks like we are not sure that 'this staging' process cause the FS
> integrity issue or not. 
> May be XFS/RBD can team can clarify this.

Mounting an XFS/extX/other non-cluster file system concurrently on two different nodes should be expected to cause corruption and/or issues. Even assuming that XFS doesn't write to the underlying block device at initial mount time (which is not true, especially if it thinks that it needs to replay journal entries), it will still read and cache on-disk data structures. Let's pick the most maniacal scenario, where the XFS file system is being "mkfs.xfs"-ed on node 1 while node 2 attempts to mount it. You shouldn't expect node 2 to be in a happy state.

Comment 22 Humble Chirammal 2019-08-29 13:11:22 UTC
(In reply to Jason Dillaman from comment #8)
> (In reply to Shyamsundar from comment #7)
> > My concern is more (2) and (3), where a hung device was mounted on another
> > node because the watchers came back with no watchers on the image. Is there
> > anything that is missing here that we need to address?
> 
> How did the PVC get re-scheduled to another node w/ the first node was
> unresponsive? That seems like a k8s bug to me. If the unmount / unmap
> operation times out, k8s shouldn't just assume that the PV is safe to map /
> mount elsewhere. This gets back to that fencing issue Jira ticket.

Jason, as tested by QE, in this scenario the volume is not shared/mounted to a POD.
POD creation is unsuccessful with a "multi attach" error. But it looks like
the mapping to the NODE has happened, even though it is not shared to the POD/application.

Comment 23 Jason Dillaman 2019-08-29 13:18:01 UTC
(In reply to Humble Chirammal from comment #22)
> Jason, as tested by QE, in this scenario the volume is not shared/mounted to
> a POD.
> POD creation is unsuccessful saying "multi attach" error. But, it looks like 
> the mapping to the NODE has happened, eventhough its not shared to the
> POD/application.

Well that's unfortunate if QE didn't test that scenario since this ticket is the exact case that I mentioned and was worried about in the Jira ticket (not that manually creating two pods bound to two different nodes would result in k8s blocking the double-mount, but the path where k8s reschedules workloads to another node).

Comment 24 Humble Chirammal 2019-08-29 13:38:14 UTC
A few things to sort out here:
-------------------------------

Is it always the case that the staging/global path mount happens on 2 nodes for an RWO volume in scenarios like this? Or is this a race condition between the "unstage" request to the first node and the "stage" on the second node?

In any case, AFAICT, making an RWO volume available on 2 nodes at the same time is not expected from the Kube side, or it is a bug in Kube.

Is there any difference in FS integrity when we use 'ext4' in place of 'xfs'?
 
Additional info:

'fsck -a' is executed subsequently, but I think it just helps detect the integrity issue and does not really fix it.

https://github.com/kubernetes/kubernetes/blob/master/pkg/util/mount/mount_linux.go#L276

Comment 25 Humble Chirammal 2019-08-29 13:43:23 UTC
(In reply to Jason Dillaman from comment #23)
> (In reply to Humble Chirammal from comment #22)
> > Jason, as tested by QE, in this scenario the volume is not shared/mounted to
> > a POD.
> > POD creation is unsuccessful saying "multi attach" error. But, it looks like 
> > the mapping to the NODE has happened, eventhough its not shared to the
> > POD/application.
> 
> Well that's unfortunate if QE didn't test that scenario since this ticket is
> the exact case that I mentioned and was worried about in the Jira ticket
> (not that manually creating two pods bound to two different nodes would
> result in k8s blocking the double-mount, but the path where k8s reschedules
> workloads to another node).

I think more detailed testing is yet to be performed from the QE side with different scenarios like this.
But one interesting fact is that this is **not** just about an 'unresponsive node'; rather, something (here, the test suite's actions) caused a pod that uses an RWO volume to be (re)scheduled. Not sure what exactly caused the "reschedule" here, though.

This can occur frequently, or very frequently, in a setup where a pod with an RWO volume is respawned on different nodes for various reasons, and it should not cause data integrity issues.

Comment 27 Shyamsundar 2019-08-29 14:01:36 UTC
Based on the various comments on this bug since comment #15, I would like the OCP Storage team to help us understand why we ended up staging an RWO PV on 2 nodes. This seems to be a bug that needs some attention at the OCP layer as well, and that was the intent of moving the component to OCP-Storage.

Marking this as NEEDINFO against Brad, as I do not want the conversation to derail possible root-causing or help from the OCP side of affairs. @Brad, can you help us get the required attention on this? Thanks.

Comment 28 Humble Chirammal 2019-08-30 09:56:42 UTC
(In reply to Shyamsundar from comment #27)
> Based on the various comments on the bug since comment #15 I would like to
> understand if OCP Storage team would help us understand why we ended up
> staging a RWO PV on 2 nodes. This seems to be the bug that needs some
> attention at the OCP layer as well, and that was the intent of moving the
> component to OCP-Storage.
> 
> Marking this as NEEDINFO against Brad, as I do not want the conversation to
> derail possible root causing or help, from the OCP side of affairs. @Brad
> can you help us get required attention on this? Thanks.

While we wait, I would like to point out a recent analysis here:
https://bugzilla.redhat.com/show_bug.cgi?id=1745776#c4, which could be
the reason an RWO volume was 'staged' on more than one node at a time and caused
the FS data integrity issue.

