Created attachment 1608688 [details]
dmesg logs from 2 nodes that were used to map RBD devices

Description of problem:

Two instances of rbd images mapped to a set of kubernetes nodes were stuck during umount. Around the time the umount was issued, kernel hung task stacks were found in dmesg (dmesg from the nodes is attached).

Further, one of the images that was stuck umounting on a node was mounted on another node a while later, as there were no watchers on the image.

NOTE: For the above statement, a watcher was initially found (for 6-7 iterations, with a one-second gap between checks for watchers on the said image); after this there were no watchers, causing the second node to mount the device and report XFS superblock corruptions.

Version-Release number of selected component (if applicable):

The version under test is actually upstream 14.2, as that is what we are using in Rook+Ceph-CSI tests with OCP 4.2.

How reproducible:

Happened twice during the same set of tests, which involve 3 nodes where rbd devices are mapped.

Steps to Reproduce:
1. Set up an OCP+OCS cluster, with Rook+Ceph and a storage class created for RBD (named csi-rbd)
2. Run the ocs-monkey tool (https://github.com/JohnStrunk/ocs-monkey) against this setup with the following CLI options:
   ./workload_runner.py -l ./ -m ReadWriteOnce -n testws --oc /path/to/oc --ocs-namespace <rook-ceph namespace> -s csi-rbd
3. In about 2-10 minutes the workload runner will report failure and start gathering ocs and ocp must-gather logs

After this, inspecting the pods run by the tests may show symptoms of waiting for mounts, which are stuck in umount on other nodes where the pods were running prior to being scheduled to the current node.
Additional info:

From an rbd client perspective, the data collected was dmesg logs from the nodes where the devices were mounted (attached).

1) In dmesg-10-0-129-7.log, note the hung task timeouts around the following timestamps:
----------------------
[ 2212.346516] INFO: task hyperkube:1202 blocked for more than 120 seconds.
[ 2212.350337] Not tainted 4.18.0-80.7.2.el8_0.x86_64 #1
[ 2212.353349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
----------------------
- The rbd device that hung was /dev/rbd8
- After leaving the machine overnight, the image was umounted at some later point in time (this was noticed when attempting to gather more information the next day)
- This message and its timestamp may be of relevance:
[23907.729712] XFS (rbd8): Unmounting Filesystem

2) In dmesg-10-0-162-206.log, the events of interest are as follows:
----------------------
Aug 26 15:57:54.208979 ip-10-0-162-206 kernel: INFO: task hyperkube:1179 blocked for more than 120 seconds.
Aug 26 15:57:54.209129 ip-10-0-162-206 kernel: Not tainted 4.18.0-80.7.2.el8_0.x86_64 #1
Aug 26 15:57:54.209173 ip-10-0-162-206 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
----------------------
- The rbd device that hung was /dev/rbd0
- Further below the above messages, note some more XFS errors reporting a corruption:
----------------------
Aug 26 22:34:57.001029 ip-10-0-162-206 kernel: XFS (rbd0): Metadata corruption detected at xfs_dir3_data_reada_verify+0x25/0x70 [xfs], xfs_dir3_data_reada block 0x53c100
Aug 26 22:34:57.001213 ip-10-0-162-206 kernel: XFS (rbd0): Unmount and run xfs_repair
Aug 26 22:34:57.001342 ip-10-0-162-206 kernel: XFS (rbd0): First 128 bytes of corrupted metadata buffer:
----------------------

3) In dmesg-10-0-159-243.log, the events of interest are as follows:
- The image that was still stuck in (2) above was mounted successfully as /dev/rbd2 on this node
- The above mount was done after checking for watchers on the image; the first 6 calls to check for watchers returned a watcher on the image, then at some point there were no more watchers, and the kubernetes ceph-CSI plugins mounted the device on the node
- After the mount, the dmesg logs of interest are:
---------------------
[23992.664392] XFS (rbd2): Metadata corruption detected at xfs_dir3_data_reada_verify+0x25/0x70 [xfs], xfs_dir3_data_reada block 0x3c2080
[23992.676746] XFS (rbd2): Unmount and run xfs_repair
[23992.681815] XFS (rbd2): First 128 bytes of corrupted metadata buffer:
---------------------
I am seeing a lot of osdX down/up messages right before the failure. What's going on w/ the Ceph cluster?
(In reply to Jason Dillaman from comment #3)
> I am seeing a lot of osdX down/up messages right before the failure. What's
> going on w/ the Ceph cluster?

I did read those lines and wondered what was happening as well. What I can tell is that the Rook-based Ceph cluster is running on an AWS OCP setup. I have had various problems with this 3 worker node footprint, from the kubernetes SDN (network) to other parts. I suspect the network is not "sane" enough for Ceph, but how can I tell? What other information would help you in this regard?

I have rook-ceph pod logs from this event (collected using ocs-must-gather), which can be found here: http://file.rdu.redhat.com/srangana/bz1745773.tar.gz if they help. The tarball also contains journalctl from the 2 worker nodes in addition to the other logs.

NOTE: None of the Ceph pods (OSDs) were restarted during the tests or reported as down
If you can get a dump of "/sys/kernel/debug/ceph/<fsid>.<clientid>/*" during this event, it would be helpful to see if some IOs are stuck waiting on an OSD.
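For anyone reproducing this, a minimal collection sketch for that debugfs tree follows. The `/sys/kernel/debug/ceph/<fsid>.<clientid>/` layout is as described above (the `osdc` file lists in-flight OSD requests, which is where stuck IOs would show up); the function shape and output location are my own choices, not part of any existing tool:

```shell
# Sketch: copy every file under the kernel ceph debugfs tree into an
# output directory for later inspection. Requires root and a mounted
# debugfs on a real node; paths are parameterized so the sketch can be
# exercised against a fake tree.
collect_ceph_debug() {
    root="${1:-/sys/kernel/debug/ceph}"
    out="$2"
    mkdir -p "$out"
    for client in "$root"/*/; do
        [ -d "$client" ] || continue
        dest="$out/$(basename "$client")"
        mkdir -p "$dest"
        # osdc is the interesting file: old entries here are stuck IOs
        for f in "$client"*; do
            [ -f "$f" ] && cp "$f" "$dest/"
        done
    done
}

# Typical invocation on an affected node:
# collect_ceph_debug /sys/kernel/debug/ceph /tmp/ceph-debug-$(date +%s)
```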
What kind of workload does workload_runner.py generate? Looks like there is about a dozen rbd devices on each worker node -- how many workload containers does ocs-monkey spawn, what does each of them do?

The OSD can be marked down by the monitor for various reasons (heartbeats, etc); the OSD pod will not be restarted or reported as down.

At the time ocs-must-gather ran, the cluster was still unhappy:

HEALTH_WARN Degraded data redundancy: 1796/18015 objects degraded (9.969%), 100 pgs degraded, 4 pgs undersized; 3 pools have too many placement groups
[WRN] PG_DEGRADED: Degraded data redundancy: 1796/18015 objects degraded (9.969%), 100 pgs degraded, 4 pgs undersized
    pg 1.15 is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.2e is stuck undersized for 70m, current state active+recovery_wait+undersized+degraded+remapped, last acting [2,0]
    pg 1.2f is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.30 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.31 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.32 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.33 is stuck undersized for 70m, current state active+recovery_wait+undersized+degraded+remapped, last acting [2,0]
    pg 1.34 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.35 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.36 is stuck undersized for 70m, current state active+recovery_wait+undersized+degraded+remapped, last acting [0,2]
    pg 1.37 is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.38 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.39 is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.3a is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.3b is active+recovering+degraded, acting [0,2,1]
    pg 1.3c is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.3d is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.3e is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.3f is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.40 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.41 is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.42 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.43 is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.44 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.45 is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.46 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.47 is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.48 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.49 is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.4a is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.4b is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.4c is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.4d is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.4e is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.4f is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.50 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.51 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.52 is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.53 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.54 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.55 is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.56 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.57 is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.58 is active+recovery_wait+degraded, acting [2,1,0]
    pg 1.59 is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.5a is active+recovery_wait+degraded, acting [0,2,1]
    pg 1.5b is active+recovery_wait+degraded, acting [0,1,2]
    pg 1.5c is active+recovery_wait+degraded, acting [2,0,1]
    pg 1.5d is active+recovery_wait+degraded, acting [1,0,2]
    pg 1.5e is active+recovery_wait+degraded, acting [1,2,0]
    pg 1.5f is active+recovery_wait+degraded, acting [1,0,2]
[WRN] POOL_TOO_MANY_PGS: 3 pools have too many placement groups
    Pool mycephfs-data0 has 100 placement groups, should have 4
    Pool replicapool has 100 placement groups, should have 4
    Pool mycephfs-metadata has 100 placement groups, should have 16

All of them are replicapool (i.e. rbd) PGs. Each OSD reported hundreds of slow requests, pretty much every minute:

2019-08-26T17:31:21.397381293Z 2019-08-26 17:31:21.397316 I | 2019-08-26T17:31:21.395+0000 7ff6ddea6700 -1 osd.0 75 get_health_metrics reporting 216 slow ops, oldest is osd_op(client.24818.0:154908 1.8 1:10f7bdfe:::rbd_data.739a98c4e1d2.00000000000001c2:head [set-alloc-hint object_size 4194304 write_size 4194304,write 110592~69632] snapc 0=[] ondisk+write+known_if_redirected e75)
2019-08-26T17:30:55.050433341Z 2019-08-26 17:30:55.050372 I | 2019-08-26T17:30:55.049+0000 7fdb303f5700 -1 osd.1 75 get_health_metrics reporting 215 slow ops, oldest is osd_op(client.24818.0:155056 1.2d 1:b53576ea:::rbd_data.739a98c4e1d2.00000000000002de:head [set-alloc-hint object_size 4194304 write_size 4194304,write 69632~24576] snapc 0=[] ondisk+write+known_if_redirected e75)
2019-08-26T16:20:09.104164701Z 2019-08-26 16:20:09.104105 I | 2019-08-26T16:20:09.103+0000 7f0cb1c5d700 -1 osd.2 69 get_health_metrics reporting 181 slow ops, oldest is osd_op(client.24818.0:141383 1.30 1:0ff6c40f:::rbd_data.6bf94c9ef58.000000000000008f:head [set-alloc-hint object_size 4194304 write_size 4194304,write 4083712~4096] snapc 0=[] ondisk+write+known_if_redirected e69)

Hung workload tasks on the client side are a natural consequence...
(In reply to Ilya Dryomov from comment #6)
> What kind of workload does workload_runner.py generate? Looks like there is
> about a dozen rbd devices on each worker node -- how many workload
> containers does ocs-monkey spawn, what each of them does?

The workload that is executed is this: https://github.com/JohnStrunk/ocs-monkey/blob/master/osio-workload/workload.py

The above does a kernel tarball extraction every 2 minutes and an "rm -rf" of an extracted dir every 2 minutes. Each of these extractions goes into its own directory. This is done until the application pod is deleted.

At a given time there may be at most 30 app pods (across 3 nodes) executing the same workload as above.

> Hung workload tasks on the client side is a natural consequence...

Agreed. An under-provisioned setup can have such hangs, and we possibly need to see how best to surface that problem (something outside the scope of this bug for the moment, IMO). Hence problem (1) in the #description, where the image was not unmapped from the node, is a valid state to be in.

My concern is more with (2) and (3), where a hung device was mounted on another node because the watcher check came back with no watchers on the image. Is there anything missing here that we need to address?
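For reference, the per-pod behavior described above boils down to roughly the following sketch (the real logic lives in workload.py linked above; the tarball path, work directory, and the 120-second interval in the driver comment are placeholders):

```shell
# Sketch of one iteration of the described workload: extract a kernel
# tarball into a fresh directory, and rm -rf an older extraction. This
# is a metadata-heavy pattern (many small file creates and unlinks),
# which is what makes the RBD-backed XFS mounts so busy.
run_workload_iteration() {
    tarball="$1"; workdir="$2"; iter="$3"
    dest="$workdir/extract-$iter"
    mkdir -p "$dest"
    tar -xf "$tarball" -C "$dest"
    # Remove the extraction from two iterations ago, if it exists
    prev="$workdir/extract-$((iter - 2))"
    [ -d "$prev" ] && rm -rf "$prev"
    return 0
}

# The driver would call this every ~120s until the pod is deleted:
# i=0; while true; do run_workload_iteration linux.tar.gz /data $i; i=$((i+1)); sleep 120; done
```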
These are the relevant watcher logs from the CSI nodeplugin that attempts to check for watchers and then mount the image.

Log file: ./ocs-monkey-82240754/ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-n7ddn/csi-rbdplugin/csi-rbdplugin/logs/current.log

2019-08-26T15:54:37.38006135Z I0826 15:54:37.379483 20545 utils.go:119] ID: 415 GRPC call: /csi.v1.Node/NodeStageVolume
2019-08-26T15:54:37.38740967Z I0826 15:54:37.379503 20545 utils.go:120] ID: 415 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:03:50.794222028Z I0826 16:03:50.793915 20545 rbd_util.go:488] ID: 415 setting disableInUseChecks on rbd volume to: false
2019-08-26T16:03:50.847343724Z I0826 16:03:50.847269 20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T16:03:50.949929449Z I0826 16:03:50.949871 20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:50.949929449Z watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971
2019-08-26T16:03:51.950168097Z I0826 16:03:51.950106 20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T16:03:52.042062014Z I0826 16:03:52.042009 20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:52.042062014Z watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971 pool replicapool
2019-08-26T16:03:53.442307389Z I0826 16:03:53.442253 20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T16:03:53.575071417Z I0826 16:03:53.575017 20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:53.575071417Z watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971
2019-08-26T16:03:55.535308632Z I0826 16:03:55.535249 20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T16:03:55.634365555Z I0826 16:03:55.634296 20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:55.634365555Z watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971
2019-08-26T16:03:58.38780497Z I0826 16:03:58.387504 20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T16:03:58.722279981Z I0826 16:03:58.722201 20545 rbd_util.go:170] ID: 415 rbd: watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010: Watchers:
2019-08-26T16:03:58.722279981Z watcher=10.129.2.1:0/2877862217 client.14398 cookie=18446462598732840971
2019-08-26T16:04:02.608762443Z I0826 16:04:02.608465 20545 rbd_util.go:151] ID: 415 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T16:04:03.260864611Z W0826 16:04:03.260556 20545 rbd_util.go:173] ID: 415 rbd: no watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010
2019-08-26T16:04:03.260864611Z I0826 16:04:03.260590 20545 rbd_attach.go:202] ID: 415 rbd: map mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789
2019-08-26T16:20:37.107013625Z I0826 16:20:37.106503 20545 nodeserver.go:141] ID: 415 rbd image: 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010/replicapool was successfully mapped at /dev/rbd2
2019-08-26T16:20:38.639736721Z I0826 16:20:38.639407 20545 mount_linux.go:460] Attempting to mount disk: xfs /dev/rbd2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010
2019-08-26T16:20:38.639736721Z I0826 16:20:38.639440 20545 nsenter_mount.go:80] nsenter mount /dev/rbd2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 xfs [defaults]
2019-08-26T16:20:38.639736721Z I0826 16:20:38.639481 20545 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/systemd-run --description=Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 --scope -- /bin/mount -t xfs -o defaults /dev/rbd2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010]

---> Command hung! /bin/mount -t xfs -o defaults /dev/rbd2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010
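The gating logic implied by the log above amounts to a poll loop like the following sketch: keep asking "does anyone watch this image?" and proceed to map once the answer is no. Here the check command is injected as an argument so the sketch is readable and testable standalone; in the real plugin it shells out to `rbd status` and looks for `watcher=` lines. The important property is that a single "no watchers" sample is only a point-in-time observation, not proof that the image is unmapped everywhere:

```shell
# Sketch of a watcher-gated mapping decision. "$@" is a command that
# exits 0 iff watchers currently exist on the image. Returns 0 (meaning
# "caller may map the image") as soon as one poll sees no watchers --
# which is exactly the race this bug demonstrates, since a live client's
# watch can lapse transiently (OSD flap, missed keepalive).
wait_for_no_watchers() {
    attempts="$1"; shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if ! "$@"; then
            return 0   # observed no watchers: NOT a safe-to-map proof
        fi
        i=$((i + 1))
    done
    return 1           # image still watched after all attempts
}
```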
(In reply to Shyamsundar from comment #7)
> My concern is more (2) and (3), where a hung device was mounted on another
> node because the watchers came back with no watchers on the image. Is there
> anything that is missing here that we need to address?

How did the PVC get re-scheduled to another node when the first node was unresponsive? That seems like a k8s bug to me. If the unmount / unmap operation times out, k8s shouldn't just assume that the PV is safe to map / mount elsewhere. This gets back to that fencing issue Jira ticket.
(In reply to Shyamsundar from comment #7)
> (In reply to Ilya Dryomov from comment #6)
> > What kind of workload does workload_runner.py generate? Looks like there is
> > about a dozen rbd devices on each worker node -- how many workload
> > containers does ocs-monkey spawn, what each of them does?
>
> The workload that is executed is this:
> https://github.com/JohnStrunk/ocs-monkey/blob/master/osio-workload/workload.py
>
> The above does a kernel tar ball extraction per 2 minutes and a "rm -rf" of
> an extracted dir every 2 minutes. Each of these extractions go into their
> own directories. This is done till the application pod is deleted.
>
> At a given time there maybe at max 30 app pods (across 3 nodes) that are
> executing the same workload as above.
>
> > Hung workload tasks on the client side is a natural consequence...
>
> Agreed. An under provisioned setup can have such hangs, and we possibly need
> to see how best to notify that problem (something outside the scope of this
> bug for the moment IMO). Hence the problem (1) in #description, where the
> image was not unmapped from the node, is a valid state to be in.
>
> My concern is more (2) and (3), where a hung device was mounted on another
> node because the watchers came back with no watchers on the image. Is there
> anything that is missing here that we need to address?

Well, as you can see, a watch is not a reliable indicator of whether the image is mapped (be it on the same node or on a different node). The watch is lost if the OSD responsible for the header object goes down (not necessarily crashes; as I said earlier it can be (wrongly) marked down by the monitor, etc). The watch is lost if the client doesn't check in in time, which can happen even if the client is alive and happily sending I/O, not just to other OSDs but also to the OSD responsible for the header object. In the end the client will re-establish the watch, but it can take a while.
Fundamentally, a watch is just a couple of messages exchanged over a regular OSD session, with a little bit of additional state on the OSD side. If this is not prevented at a higher level, you might want to look at the exclusive-lock feature and "rbd map -o exclusive". It adds proper blacklisting, but it's not without its issues. It is essentially hard fencing at the lowest level possible...
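For illustration, a sketch of what the suggestion above could look like in practice. The `rbd feature enable ... exclusive-lock` and `rbd map -o exclusive` commands are real; the wrapper function and the `RBD_CMD` override (which lets the sketch be dry-run without a cluster) are illustrative additions, and the pool/image names in the usage comment are placeholders:

```shell
# Sketch: map an image with kernel-enforced exclusivity. With -o
# exclusive the kernel client acquires the exclusive lock at map time
# and holds it while mapped, so a concurrent map attempt from another
# node is fenced instead of racing, at the cost of the caveats Ilya
# mentions above.
map_exclusive() {
    pool="$1"; image="$2"
    rbd="${RBD_CMD:-rbd}"   # set RBD_CMD=echo to dry-run this sketch
    # exclusive-lock must be enabled on the image; ignore "already set"
    "$rbd" feature enable "$pool/$image" exclusive-lock 2>/dev/null || true
    "$rbd" map -o exclusive "$pool/$image"
}

# Typical invocation (names illustrative):
# map_exclusive replicapool csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010
```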
Adding more context to the problem, as I did not pay close attention to the timestamps of the events between (2) and (3); in an effort to answer the observations in comment #8 and comment #9, I now have more data and a sequence of events to share.

The image was mounted on a different node (10-0-159-243) [call this node-2 in the discussion] while it was still active on the other node, 10-0-162-206 [call this node-1 in the discussion]. This is based on the CSI node plugin logs and the timestamps in there.

Comment #7 details logs from node 10-0-159-243, which was the second node that was attempting to mount the image.
- The time at which this event occurred was 15:54:37 (until) 16:20:38
- The event was a request to stage (map the device and mount it) the image on the node, which eventually got stuck at mount, with dmesg logs reporting that the XFS header was corrupt

The logs that follow are from node-1, where the unmap of the image was stuck.
- The time when the image was requested to be unmapped is 16:21:24.072361, which is about a minute after the call to mount it on node-2

So the image was mounted and active on node-1 when kube decided to mount it on node-2, which checked for watchers and, as @Ilya points out, found a watcher missing at some point in time, mounted the device, and promptly corrupted the XFS header as there were 2 nodes using the same device.

This seems to be the root cause of the problem. @Jason and @Ilya, I assume you agree on this.
node-1 logs (of interest to the image in question):

Log file: ocs-monkey-82240754/ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log

2019-08-26T15:43:09.241384822Z I0826 15:43:09.238373 18583 utils.go:120] ID: 120 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:43:14.366749924Z I0826 15:43:14.357786 18583 rbd_util.go:151] ID: 120 rbd: status csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010 using mon 172.30.39.206:6789,172.30.247.177:6789,172.30.252.25:6789, pool replicapool
2019-08-26T15:43:14.586560692Z W0826 15:43:14.586510 18583 rbd_util.go:173] ID: 120 rbd: no watchers on csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010
2019-08-26T15:43:14.783711884Z I0826 15:43:14.781427 18583 nodeserver.go:141] ID: 120 rbd image: 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010/replicapool was successfully mapped at /dev/rbd0
2019-08-26T15:43:15.115859874Z I0826 15:43:15.110500 18583 nodeserver.go:161] ID: 120 rbd: successfully mounted volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to stagingTargetPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010
2019-08-26T15:43:15.145657364Z I0826 15:43:15.141985 18583 utils.go:120] ID: 122 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/2872d7a2-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:43:15.190647894Z I0826 15:43:15.190492 18583 nodeserver.go:261] ID: 122 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/2872d7a2-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount
2019-08-26T15:46:51.917292763Z I0826 15:46:51.914865 18583 utils.go:120] ID: 251 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:46:51.932852045Z I0826 15:46:51.932516 18583 nodeserver.go:110] ID: 251 rbd: volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 is already mounted to /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010, skipping
2019-08-26T15:46:51.945942582Z I0826 15:46:51.943788 18583 utils.go:120] ID: 253 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/b8ccd6ac-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:46:51.988957899Z I0826 15:46:51.988160 18583 nodeserver.go:261] ID: 253 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/b8ccd6ac-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount
2019-08-26T16:21:23.966851502Z I0826 16:21:23.966027 18583 utils.go:120] ID: 490 GRPC request: {"target_path":"/var/lib/kubelet/pods/b8ccd6ac-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:23.968661157Z I0826 16:21:23.967243 18583 utils.go:120] ID: 491 GRPC request: {"target_path":"/var/lib/kubelet/pods/2872d7a2-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:23.99221676Z I0826 16:21:23.991284 18583 nodeserver.go:404] ID: 490 rbd: successfully unbound volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 from /var/lib/kubelet/pods/b8ccd6ac-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount
2019-08-26T16:21:23.992349012Z I0826 16:21:23.992312 18583 nodeserver.go:404] ID: 491 rbd: successfully unbound volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 from /var/lib/kubelet/pods/2872d7a2-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount
2019-08-26T16:21:24.073095064Z I0826 16:21:24.072361 18583 utils.go:120] ID: 493 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:24.07920296Z I0826 16:21:24.079178 18583 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010]
2019-08-26T16:21:39.587992773Z I0826 16:21:39.586899 18583 utils.go:120] ID: 495 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:39.956828169Z E0826 16:21:39.956609 18583 nodeserver.go:468] ID: 495 error unmapping volume (0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010): (rbd: unmap for spec (replicapool/csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010) failed (exit status 16): (rbd: sysfs write failed
2019-08-26T16:21:39.956828169Z E0826 16:21:39.956682 18583 utils.go:123] ID: 495 GRPC error: rpc error: code = Internal desc = rbd: unmap for spec (replicapool/csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010) failed (exit status 16): (rbd: sysfs write failed
2019-08-26T16:21:40.995074421Z I0826 16:21:40.994448 18583 utils.go:120] ID: 497 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T16:21:41.300554321Z E0826 16:21:41.300488 18583 nodeserver.go:468] ID: 497 error unmapping volume (0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010): (rbd: unmap for spec (replicapool/csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010) failed (exit status 16): (rbd: sysfs write failed
2019-08-26T16:21:41.300554321Z E0826 16:21:41.300534 18583 utils.go:123] ID: 497 GRPC error: rpc error: code = Internal desc = rbd: unmap for spec (replicapool/csi-vol-fcc0c5c1-c817-11e9-be8f-0a580a830010) failed (exit status 16): (rbd: sysfs write failed

As an aside: Kubernetes also asked to stage (map the image and mount it to a base path) the same image twice on the same node, which is also suspect overall, as there is no reason to repeat the same request.
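For context on the failure above: exit status 16 from `rbd unmap` is errno EBUSY propagated from the sysfs write, i.e. the block device is still open (here, by the stuck umount). The plugin's repeated attempts visible in the log amount to a retry-on-EBUSY loop; a sketch of that logic follows. The `UNMAP_CMD`/`UNMAP_BACKOFF` indirection is illustrative only, so the sketch can be exercised without a cluster:

```shell
# Sketch: retry `rbd unmap` only while it fails with EBUSY (exit 16);
# any other failure is propagated immediately. Without forcing the
# unmap, this can spin indefinitely while a umount is hung, which is
# the situation in this bug.
unmap_with_retry() {
    dev="$1"; tries="$2"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ${UNMAP_CMD:-rbd unmap} "$dev"; then
            return 0
        else
            rc=$?                       # exit status of the unmap attempt
        fi
        [ "$rc" -eq 16 ] || return "$rc"  # only EBUSY is worth retrying
        i=$((i + 1))
        sleep "${UNMAP_BACKOFF:-1}"
    done
    return 16                            # device still busy
}
```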
(In reply to Shyamsundar from comment #10)
> The logs that follow are from the node 10-0-159-243 where the unmap of the
> image was stuck.
> - The time when the image was requested to be unmapped is 16:21:24.072361,
> which is about a minute after the call to mount it on node-2

I see 15:42:03 as the first reference to stage that PVC on the original node ...

ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log:2019-08-26T15:42:03.828041114Z I0826 15:42:03.818792 18583 utils.go:120] ID: 78 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}

... followed shortly thereafter by the original mount request. So perhaps that is just implying that k8s timed out on the original mount request and then assumed that it was safe to stage and mount on another node (w/o calling unstage first)?

Aug 26 15:47:34 ip-10-0-162-206 hyperkube[1156]: I0826 15:47:34.888364 1156 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"testws", Name:"osio-worker-925889458-79df64c77f-gc7tn", UID:"a3050904-c818-11e9-9022-02f7b213f410", APIVersion:"v1", ResourceVersion:"27340", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.MountDevice failed for volume "pvc-81c6766a-c818-11e9-9022-02f7b213f410" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

> @Jason and @Illya I assume you agree on this.

Yes, agreed.
(In reply to Jason Dillaman from comment #11)
> (In reply to Shyamsundar from comment #10)
> > The logs that follow are from the node 10-0-159-243 where the unmap of the
> > image was stuck.
> > - The time when the image was requested to be unmapped is 16:21:24.072361,
> > which is about a minute after the call to mount it on node-2
>
> I see 15:42:03 as the first reference to stage that PVC on the original node
> ...
>
> ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log:2019-08-26T15:42:03.828041114Z I0826 15:42:03.818792 18583 utils.go:120] ID: 78 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}

- The above request was ack'd as a success here:

2019-08-26T15:42:07.685590966Z I0826 15:42:07.684802 18583 nodeserver.go:161] ID: 78 rbd: successfully mounted volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to stagingTargetPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010

- Later a publish happened as well:

2019-08-26T15:42:10.543570328Z I0826 15:42:10.428065 18583 utils.go:120] ID: 84 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/fcca66e2-c817-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:42:10.573657693Z I0826 15:42:10.573354 18583 nodeserver.go:261] ID: 84 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/fcca66e2-c817-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount

- Then an unpublish:

2019-08-26T15:42:10.932805093Z I0826 15:42:10.868051 18583 utils.go:120] ID: 85 GRPC request: {"target_path":"/var/lib/kubelet/pods/fcca66e2-c817-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:42:11.049305483Z I0826 15:42:11.030925 18583 nodeserver.go:404] ID: 85 rbd: successfully unbound volume 0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010 from /var/lib/kubelet/pods/fcca66e2-c817-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/mount

- and an unstage:

2019-08-26T15:42:11.176133639Z I0826 15:42:11.174985 18583 utils.go:120] ID: 87 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010"}
2019-08-26T15:42:11.756380886Z I0826 15:42:11.756328 18583 nodeserver.go:472] ID: 87 successfully unmounted volume (0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fc9b1743-c817-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-fcc0c5c1-c817-11e9-be8f-0a580a830010)

Then the next stage occurred, for which I posted logs in comment #11.

Also, all of the above (un)stage/(un)publish calls responded in no more than 5 seconds, so the timeout tolerances were met.

> ... followed shortly thereafter by the original mount request. So perhaps
> that is just implying that k8s timed out on the original mount request and
> then assumed that it was safe to stage and mount on another node (w/o
> calling unstage first)?
You may be seeing this bug occurring here: https://bugzilla.redhat.com/show_bug.cgi?id=1744958

> Aug 26 15:47:34 ip-10-0-162-206 hyperkube[1156]: I0826 15:47:34.888364 1156 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"testws", Name:"osio-worker-925889458-79df64c77f-gc7tn", UID:"a3050904-c818-11e9-9022-02f7b213f410", APIVersion:"v1", ResourceVersion:"27340", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.MountDevice failed for volume "pvc-81c6766a-c818-11e9-9022-02f7b213f410" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

From a TS perspective, the message from kubelet is quite a bit later than the first and second stage success responses: "15:47:34".

For bz #1744958 I never saw a stage request on another node; it kept failing on the same node. But that is possibly because the test case had failed by then and stopped moving (deleting and recreating) the pods around.

>> As an aside:
>> Kubernetes also has asked for staging (map the image and mount it to a base path) the same image twice on the same node, which also is suspect overall, as there is no reason to repeat the same.

It also possibly explains my seeing 2 stage requests on the same node.

> > @Jason and @Illya I assume you agree on this.
>
> Yes, agreed.

Ok, time to transfer this to OCP-Storage then, unless we want to iron out any other gaps.
(In reply to Shyamsundar from comment #12)
> Ok, time to transfer this to OCP-Storage then, unless we want to iron out
> any other gaps.

Certainly good to get the core experts to look at this, even if it needs to bounce back to OCS.
Correcting incorrect node names in the following comment.

(In reply to Shyamsundar from comment #10)
> The image was mounted to a different node (10-0-159-243) [call this node-2
> in the discussion], when it was active on node 10-0-159-243 [call this
> node-1 in the discussion]. This is based on the CSI node plugin logs and
> time stamps in there.

The above should read: The image was mounted to a different node (10-0-159-243) [call this node-2 in the discussion], when it was active on node 10-0-162-206 [call this node-1 in the discussion]. This is based on the CSI node plugin logs and time stamps in there.

Corrected the node-1 address/name, which was a duplicate of node-2 in the earlier comment.

> The logs that follow are from the node 10-0-159-243 where the unmap of the
> image was stuck.

The above should read: The logs that follow are from the node 10-0-162-206 where the unmap of the image was stuck.

Corrected the node address/name from 10-0-159-243 to 10-0-162-206.
Moving this bug to OCP-Storage for problem analysis and inputs.

Disclaimers:
- At present I do not have PVC or Pod describe data for the problematic pods/PVCs
- What we traced in this bug was the PV and the rbd image, as to why we faced corruption in the image (or why XFS reported corruption in the image)
- The setup has been reclaimed, hence currently no further data is available other than the already collected must-gather and ocs-must-gather data present at: http://file.rdu.redhat.com/srangana/bz1745773.tar.gz
- The logs are the same for the reported bugs:
  - https://bugzilla.redhat.com/show_bug.cgi?id=1745776
  - https://bugzilla.redhat.com/show_bug.cgi?id=1745773
  (added Hemant and Fabio to the CC of this bug, as they are assigned the above bugs)

Problem description: An RWO PV is staged on 2 nodes, causing the RWO image to lose filesystem integrity.

The reason ceph-rbd/csi allowed the PV to be staged on the second node is the currently weaker scheme used for exclusive access (i.e. the watcher-based scheme in rbd). This was under the assumption that Kubernetes (and hence OCP) provides enough resilience for RWO PVs to not be staged on multiple nodes.
Problematic PV: pvc-fc9b1743-c817-11e9-9022-02f7b213f410

PV activities were found on nodes 10.0.159.243 and 10.0.162.206, as per the logs below:
----------------
$ grep -l pvc-fc9b1743-c817-11e9-9022-02f7b213f410 ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-?????/csi-rbdplugin/csi-rbdplugin/logs/current.log
ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-n7ddn/csi-rbdplugin/csi-rbdplugin/logs/current.log
ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log
----------------

ceph-csi rbd node plugin names and pod details:
----------------
$ oc -n rook-ceph get pods -o wide | grep rbd
csi-rbdplugin-hdtdc           3/3  Running  0  166m  10.0.129.7    ip-10-0-129-7.ec2.internal    <none>  <none>
csi-rbdplugin-n7ddn           3/3  Running  0  166m  10.0.159.243  ip-10-0-159-243.ec2.internal  <none>  <none>
csi-rbdplugin-provisioner-0   5/5  Running  0  166m  10.131.0.16   ip-10-0-129-7.ec2.internal    <none>  <none>
csi-rbdplugin-qgwql           3/3  Running  0  166m  10.0.162.206  ip-10-0-162-206.ec2.internal  <none>  <none>

$ less ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-n7ddn/csi-rbdplugin/csi-rbdplugin/logs/current.log
2019-08-26T15:31:54.601747941Z I0826 15:31:54.600915 20545 utils.go:120] ID: 2 GRPC request: {}
2019-08-26T15:31:54.602450009Z I0826 15:31:54.601893 20545 nodeserver-default.go:58] Using default NodeGetInfo
2019-08-26T15:31:54.60307398Z I0826 15:31:54.601913 20545 utils.go:125] ID: 2 GRPC response: {"node_id":"ip-10-0-159-243.ec2.internal"}

$ less ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-hdtdc/csi-rbdplugin/csi-rbdplugin/logs/current.log
2019-08-26T15:31:42.293055977Z I0826 15:31:42.292787 30300 utils.go:119] ID: 2 GRPC call: /csi.v1.Node/NodeGetInfo
2019-08-26T15:31:42.293460446Z I0826 15:31:42.292808 30300 utils.go:120] ID: 2 GRPC request: {}
2019-08-26T15:31:42.293460446Z I0826 15:31:42.293452 30300 nodeserver-default.go:58] Using default NodeGetInfo
2019-08-26T15:31:42.29405322Z I0826 15:31:42.293466 30300 utils.go:125] ID: 2 GRPC response: {"node_id":"ip-10-0-129-7.ec2.internal"}

$ less ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log
2019-08-26T15:31:19.346043474Z I0826 15:31:19.346010 18583 utils.go:119] ID: 2 GRPC call: /csi.v1.Node/NodeGetInfo
2019-08-26T15:31:19.346411115Z I0826 15:31:19.346031 18583 utils.go:120] ID: 2 GRPC request: {}
2019-08-26T15:31:19.346411115Z I0826 15:31:19.346402 18583 nodeserver-default.go:58] Using default NodeGetInfo
2019-08-26T15:31:19.346903848Z I0826 15:31:19.346416 18583 utils.go:125] ID: 2 GRPC response: {"node_id":"ip-10-0-162-206.ec2.internal"}
----------------

The PV was initially staged and published on node 10.0.162.206 (please see comment #10 for the ceph-csi logs that discuss the activities; also see comment #14 for some errata information in the original comment).

Subsequent to this, the PV was staged on node 10-0-159-243 (ceph-csi logs provided in comment #7), even before the first unstage request was made to node 10.0.162.206.

The above behavior, together with slow OSD responses and the like causing a watcher-detection failure, caused the image to become corrupt.

Questions:
- This behavior of attempting to stage an RWO volume on another node, even before the first node has unstaged it, seems erroneous; is this correct?
- Should the CSI provider take additional steps to safeguard against multi-node access for RWO volumes? Or is there enough state management in Kubernetes to ensure that this will not happen? (IOW, should we actively ensure exclusive ownership and apply fencing in ceph-rbd as well?)
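The watcher behavior described earlier (a watcher seen for 6-7 one-second polls, then gone) illustrates why the watcher-based exclusivity check is weak: it is a time-of-check/time-of-use race. A toy sketch follows; this is purely illustrative, not Ceph or ceph-csi code, and the function names, poll count, and timings are invented:

```python
import time

def watcher_present(poll: int) -> bool:
    # Stand-in for checking `rbd status pool/image`: the stuck node-1
    # client's watch lapses after ~7 polls (e.g. its session times out),
    # even though node-1's filesystem is still mounted and dirty.
    return poll < 7

def try_exclusive_map(max_polls: int = 10, gap_seconds: float = 0.0) -> bool:
    """Poll for watchers; treat 'no watchers' as proof the image is free."""
    for poll in range(max_polls):
        if not watcher_present(poll):
            # Absence of a watcher is only a point-in-time observation:
            # node-2 now maps the image while node-1's unmount is hung.
            return True
        time.sleep(gap_seconds)  # ~1s in the real check; 0 here

    return False

assert try_exclusive_map() is True  # node-2 "wins" -> double mount
```

The sketch shows that no amount of re-polling closes the window: the watch can lapse on the very next iteration, which is exactly what the fencing question above is about.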
> An RWO PV is staged on 2 nodes, causing the RWO image to lose filesystem integrity.

Shyam, I am yet to look into the logs, however I am trying to understand how "staging" on 2 different nodes causes a filesystem integrity issue. Staging brings the share to a global mount path, but afaict it **should not** cause any access to the FS, as it is not mounted or made available to an application (which is a pod in the OCP/Kube context). IOW, just a staged volume should not cause FS integrity issues, in my opinion.

A multi-attach volume error happens, or is validated, at the time of publish/attach, which is actually the next course of action after staging. I think that's why it is staged on the second node. I would let the OCP team confirm this though.
(In reply to Humble Chirammal from comment #16)
> > An RWO PV is staged on 2 nodes, causing the RWO image to lose filesystem integrity.
>
> Shyam, I am yet to look into the logs, however I am trying to understand,
> how "staging" on 2 different nodes cause a filesystem integrity issue?
> staging brings the share to a global mount path, but afaict, it **should
> not** cause any access to FS, as this is not mounted or not made available
> to an application ( which is a pod in ocp/kube context). IOW, just a staged
> volume should not cause FS integrity issues in my opinion.

Staging involves mapping the rbd image and then mounting it to the staging path, so there is filesystem mounting that occurs, at which point the FS would access the data in the image. Whether XFS updates its superblock or performs other such activity at this point in time, I am not aware.

> Multiattach volume error happens or validated at time of publish/attach,
> which is actually next course of action after staging. I think thats why it
> is staged on second node. I would let OCP team to confirm this though.
(In reply to Humble Chirammal from comment #16)
> Multiattach volume error happens or validated at time of publish/attach,
> which is actually next course of action after staging. I think thats why it
> is staged on second node. I would let OCP team to confirm this though.

I have seen multi-attach checks prior to staging (from the logs), and irrespective of that, I would not depend on staging never touching the mount in any fashion for it to be assumed safe to call on multiple nodes in an RWO case. As you state, let us hear from the OCP team.
(In reply to Shyamsundar from comment #17)
> (In reply to Humble Chirammal from comment #16)
> > > An RWO PV is staged on 2 nodes, causing the RWO image to lose filesystem integrity.
> >
> > Shyam, I am yet to look into the logs, however I am trying to understand,
> > how "staging" on 2 different nodes cause a filesystem integrity issue?
> > staging brings the share to a global mount path, but afaict, it **should
> > not** cause any access to FS, as this is not mounted or not made available
> > to an application ( which is a pod in ocp/kube context). IOW, just a staged
> > volume should not cause FS integrity issues in my opinion.
>
> Staging involves mapping the rbd image and then mounting it to the staging
> path, so there is filesystem mounting that occurs, at which point the FS
> would access the data in the image. Does XFS update its super block or other
> such activity at this point in time I am not aware.

By 'staging brings the share to a global mount path' in my comment, I was referring to this two-step operation (map + global path mount). Again, my impression was that just mounting a volume like this should not cause any FS integrity issue, but I am not 100% sure. From the bz description I got the feeling that this stage operation caused the FS integrity issue, so I was confused and was trying to confirm. It looks like we are not sure whether 'this staging' process causes the FS integrity issue or not. Maybe the XFS/RBD teams can clarify this.
(In reply to Jason Dillaman from comment #8)
> (In reply to Shyamsundar from comment #7)
> > My concern is more (2) and (3), where a hung device was mounted on another
> > node because the watchers came back with no watchers on the image. Is there
> > anything that is missing here that we need to address?
>
> How did the PVC get re-scheduled to another node w/ the first node was
> unresponsive? That seems like a k8s bug to me. If the unmount / unmap
> operation times out, k8s shouldn't just assume that the PV is safe to map /
> mount elsewhere. This gets back to that fencing issue Jira ticket.

Even if it's a K8S bug, we should make the extra effort to protect against corruption. Not that locking is easy and straightforward, but it's better than corruption.
(In reply to Humble Chirammal from comment #19)
> By 'staging brings the share to a global mount path,' in my comment I was
> referring this 2 stage operation ( map + global path mount).
> Again my impression was that, just mounting a volume like this should not
> cause any FS integrity issue, but I am not 100% sure.
> From the bz description I got a feeling that, this stage operation cause the
> FS integrity issue, so I was confused and was trying to confirm.
> It looks like we are not sure that 'this staging' process cause the FS
> integrity issue or not.
> May be XFS/RBD team can clarify this.

Mounting an XFS/extX/other non-cluster file system concurrently on two different nodes should be expected to cause corruption and/or other issues. Even assuming that XFS doesn't write to the underlying block device at initial mount time (which is not true, especially if it thinks it needs to replay journal entries), it will still read and cache data structures from disk. Let's pick the most maniacal scenario, where the XFS file system is being "mkfs.xfs"-ed on node 1 while node 2 attempts to mount it. You shouldn't expect node 2 to be in a happy state.
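To make the "read and cache data structures" hazard concrete, here is a deliberately tiny toy model (pure Python, no relation to actual XFS code): each mount keeps a private in-memory copy of on-disk metadata, so a later flush of the second mount's stale copy silently undoes the first mount's committed update.

```python
# Toy model (not XFS code): one shared image, two independent mounts,
# each with a private metadata cache -- the situation a non-cluster
# filesystem assumes can never happen.
disk = {"sb_generation": 1}

node1_cache = dict(disk)  # node-1 mounts first and caches metadata
node2_cache = dict(disk)  # node-2 mounts the same image concurrently

# node-1 commits a metadata update back to the image...
node1_cache["sb_generation"] = 2
disk.update(node1_cache)

# ...and node-2 later flushes its (now stale) cached view.
disk.update(node2_cache)

# node-1's committed update has been silently rolled back.
assert disk["sb_generation"] == 1
```

Real filesystems lose far more than a counter this way (journal replay against moved blocks, freed extents reused, etc.), which is why the double-stage is dangerous even if neither pod ever writes application data.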
(In reply to Jason Dillaman from comment #8)
> (In reply to Shyamsundar from comment #7)
> > My concern is more (2) and (3), where a hung device was mounted on another
> > node because the watchers came back with no watchers on the image. Is there
> > anything that is missing here that we need to address?
>
> How did the PVC get re-scheduled to another node w/ the first node was
> unresponsive? That seems like a k8s bug to me. If the unmount / unmap
> operation times out, k8s shouldn't just assume that the PV is safe to map /
> mount elsewhere. This gets back to that fencing issue Jira ticket.

Jason, as tested by QE, in this scenario the volume is not shared/mounted to a pod. Pod creation is unsuccessful with a "multi attach" error. But it looks like the mapping to the node has happened, even though it's not shared to the pod/application.
(In reply to Humble Chirammal from comment #22)
> Jason, as tested by QE, in this scenario the volume is not shared/mounted to
> a POD.
> POD creation is unsuccessful saying "multi attach" error. But, it looks like
> the mapping to the NODE has happened, eventhough its not shared to the
> POD/application.

Well, that's unfortunate if QE didn't test that scenario, since this ticket is the exact case that I mentioned and was worried about in the Jira ticket (not that manually creating two pods bound to two different nodes would result in k8s blocking the double-mount, but the path where k8s reschedules workloads to another node).
A few things to sort out here:
-------------------------------
Is it always the case that the staging/global path mount happens on 2 nodes for an RWO volume in scenarios like this, or is this a race condition between the "unstage" request to the first node and the "stage" on the second node? In any case, afaict, mapping an RWO volume on 2 nodes at the same time is either not expected from the Kube side, or it's a bug in Kube.

Is there any difference in FS integrity when we use 'ext4' in place of 'xfs'?

Additional info: 'fsck -a' is executed subsequently, but I think it just helps detect the integrity issue and not really fix it.
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/mount/mount_linux.go#L276
(In reply to Jason Dillaman from comment #23)
> (In reply to Humble Chirammal from comment #22)
> > Jason, as tested by QE, in this scenario the volume is not shared/mounted to
> > a POD.
> > POD creation is unsuccessful saying "multi attach" error. But, it looks like
> > the mapping to the NODE has happened, eventhough its not shared to the
> > POD/application.
>
> Well that's unfortunate if QE didn't test that scenario since this ticket is
> the exact case that I mentioned and was worried about in the Jira ticket
> (not that manually creating two pods bound to two different nodes would
> result in k8s blocking the double-mount, but the path where k8s reschedules
> workloads to another node).

I think more detailed testing is yet to be performed from the QE side, with different scenarios like this. But one interesting fact is that this is **not** just about an 'unresponsive node'; rather, something (here, the test suite's actions) caused a pod which uses an RWO volume to be (re)scheduled. Not sure what exactly caused the "reschedule" here, though. This can occur very frequently in a setup where a pod with an RWO volume is getting respawned on different nodes for various reasons, and it should not cause data integrity issues.
Based on the various comments on the bug since comment #15, I would like to understand if the OCP Storage team would help us understand why we ended up staging an RWO PV on 2 nodes. This seems to be a bug that needs some attention at the OCP layer as well, and that was the intent of moving the component to OCP-Storage.

Marking this as NEEDINFO against Brad, as I do not want the conversation to derail possible root-causing or help from the OCP side of affairs. @Brad, can you help us get the required attention on this? Thanks.
(In reply to Shyamsundar from comment #27)
> Based on the various comments on the bug since comment #15 I would like to
> understand if OCP Storage team would help us understand why we ended up
> staging a RWO PV on 2 nodes. This seems to be the bug that needs some
> attention at the OCP layer as well, and that was the intent of moving the
> component to OCP-Storage.
>
> Marking this as NEEDINFO against Brad, as I do not want the conversation to
> derail possible root causing or help, from the OCP side of affairs. @Brad
> can you help us get required attention on this? Thanks.

While we wait, I would like to point out a recent analysis here: https://bugzilla.redhat.com/show_bug.cgi?id=1745776#c4, which could be the reason for an RWO pod to be 'staged' on more than one node at a time, causing the FS data integrity issue.