+++ This bug was initially created as a clone of Bug #2247174 +++

Description of problem:

Details here: https://tracker.ceph.com/issues/63188

How reproducible: Very likely.

Steps to Reproduce:
1. Start with an RHCS 5 client and MDS.
2. Upgrade the (userspace) client to RHCS 7 (say, ceph-mgr).
3. Continue using the client.
4. Upgrade the MDS (one active MDS should suffice).
5. The userspace client (ceph-mgr in this case) should crash.

--- Additional comment from Venky Shankar on 2023-10-31 05:34:08 UTC ---

Start with RHCS 6.1 client/MDS.

--- Additional comment from Venky Shankar on 2023-10-31 05:43:33 UTC ---

https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/408

--- Additional comment from Venky Shankar on 2023-10-31 14:54:31 UTC ---

(In reply to Venky Shankar from comment #2)
> https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/408

Merged.

--- Additional comment from on 2023-10-31 19:45:27 UTC ---

Builds are ready for testing. We need a qa_ack+ in order to attach this BZ to the errata advisory and move to ON_QA.

--- Additional comment from errata-xmlrpc on 2023-11-01 04:52:19 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2023:118213-01
https://errata.devel.redhat.com/advisory/118213

--- Additional comment from errata-xmlrpc on 2023-11-01 04:52:26 UTC ---

This bug has been added to advisory RHBA-2023:118213 by Thomas Serlin (tserlin)

--- Additional comment from on 2023-11-02 11:12:55 UTC ---

QA Test Plan:
- Repeat the steps used to reproduce (see the command-level sketch further below):

1. Set up RHCS 5.3 with a CephFS configuration. Run IO.
2. Upgrade the Ceph client to the RHCS 7 build with the fix; continue IO.
3. Upgrade the Ceph nodes to the RHCS 7 build with the fix.
4. Verify Ceph is healthy, no crash is seen with the client ceph-mgr, and IO can continue.

--- Additional comment from Venky Shankar on 2023-11-03 05:25:20 UTC ---

(In reply to sumr from comment #7)
> QA Test Plan:
> - Repeat the steps used to reproduce,
>
> 1. Set up RHCS 5.3 with a CephFS configuration. Run IO.

Use the latest RHCS 6, please.

> 2. Upgrade the Ceph client to the RHCS 7 build with the fix; continue IO.
> 3. Upgrade the Ceph nodes to the RHCS 7 build with the fix.
> 4. Verify Ceph is healthy, no crash is seen with the client ceph-mgr, and IO
> can continue.
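To make the plan above concrete, here is a minimal, hedged command-level sketch of the client-first upgrade flow, assuming a cephadm-managed cluster. The package names, registry, image tag, and the use of dnf on the client node are illustrative assumptions, not the exact builds referenced in this BZ.

# On the client node: upgrade only the userspace client packages first
# (package names are the usual RHCS client packages; adjust to the build under test).
dnf update -y ceph-common ceph-fuse

# Keep IO running on the existing ceph-fuse mount while the cluster is upgraded.

# On the cluster: upgrade the cephadm-managed daemons (MON/MGR/OSD/MDS) to the
# target RHCS 7 build; the image reference below is a placeholder.
ceph orch upgrade start --image <registry>/rhceph/rhceph-7-rhel9:<tag>
ceph orch upgrade status        # poll until the upgrade completes

# Afterwards, confirm overall health and that no daemon or client crash was recorded.
ceph status
ceph crash ls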
--- Additional comment from on 2023-11-07 12:32:54 UTC ---

(In reply to Venky Shankar from comment #8)
> Use the latest RHCS 6, please.

Hi Venky,

Executed the test steps planned for fix verification as mentioned above, with an upgrade from RHCS 6 build 17.2.6-153 to RHCS 7 build 18.2.0-118.

Test Steps:
1. Set up the latest RHCS 6 with a CephFS configuration. Run IO.
2. Upgrade the Ceph client to the RHCS 7 build with the fix; continue IO.
3. Upgrade the Ceph nodes to the RHCS 7 build with the fix.
4. Verify Ceph is healthy, no crash is seen with the client ceph-mgr, and IO can continue.

Result summary:
1. Ceph is healthy after the upgrade, but not immediately: right after the upgrade Ceph was in HEALTH_WARN because the filesystem was degraded, but after a few minutes of recovery, Ceph and the MDS were healthy.
2. The existing ceph-fuse mount point became stale with the error "Cannot send after transport endpoint shutdown"; a remount to a new mount point was required to continue IO, and IO could then be continued on the new mount point.
No other error or crash was seen on the cluster or client side.

ASK: Please confirm whether the behaviour seen post-upgrade is acceptable: the MDS auto-recovers and the client gets blocklisted, but IO continues on the new mount.

Complete logs: http://magna002.ceph.redhat.com/ceph-qe-logs/suma/bz_verify/bz_2247174_verification.log

Snippet of the post-upgrade state:

Cluster view:

Ceph status immediately after the upgrade:

2023-11-07 05:26:34,479 (cephci.cephadm.test_cephadm_upgrade) [INFO] - cephci.ceph.ceph.py:725 -
  cluster:
    id:     43b73854-7d47-11ee-9931-fa163ef75022
    health: HEALTH_WARN
            1 filesystem is degraded
            1 filesystem is online with fewer MDS than max_mds
            Degraded data redundancy: 34/6081 objects degraded (0.559%), 3 pgs degraded

  services:
    mon: 3 daemons, quorum ceph-sumar-regression-9s46lr-node1-installer,ceph-sumar-regression-9s46lr-node3,ceph-sumar-regression-9s46lr-node2 (age 8m)
    mgr: ceph-sumar-regression-9s46lr-node1-installer.flqshy(active, since 8m), standbys: ceph-sumar-regression-9s46lr-node2.vaqqrv
    mds: 1/1 daemons up, 2 standby
    osd: 12 osds: 12 up (since 75s), 12 in (since 113m)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   4 pools, 135 pgs
    objects: 2.03k objects, 5.4 GiB
    usage:   25 GiB used, 155 GiB / 180 GiB avail
    pgs:     34/6081 objects degraded (0.559%)
             131 active+clean
             3   active+recovery_wait+degraded
             1   active+recovering

2023-11-07 05:26:56,808 (cephci.cephadm.test_cephadm_upgrade) [INFO] - cephci.ceph.ceph_admin.__init__.py:242 -
service_type: mds
service_id: cephfs
service_name: mds.cephfs
placement:
  label: mds
status:
  created: '2023-11-07T08:33:53.741699Z'
  last_refresh: '2023-11-07T10:25:48.333470Z'
  running: 3
  size: 3

[root@ceph-sumar-regression-9s46lr-node6 cephfs]# ceph status
  cluster:
    id:     43b73854-7d47-11ee-9931-fa163ef75022
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-sumar-regression-9s46lr-node1-installer,ceph-sumar-regression-9s46lr-node3,ceph-sumar-regression-9s46lr-node2 (age 28m)
    mgr: ceph-sumar-regression-9s46lr-node1-installer.flqshy(active, since 28m), standbys: ceph-sumar-regression-9s46lr-node2.vaqqrv
    mds: 2/2 daemons up, 1 standby
    osd: 12 osds: 12 up (since 21m), 12 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 49 pgs
    objects: 1.04k objects, 1.6 GiB
    usage:   13 GiB used, 167 GiB / 180 GiB avail
    pgs:     49 active+clean

Client view:

[root@ceph-sumar-regression-9s46lr-node6 cephfs]# ls
ls: cannot open directory '.': Cannot send after transport endpoint shutdown

[root@ceph-sumar-regression-9s46lr-node6 ~]# ceph-fuse -n client.ceph-sumar-regression-9s46lr-node6 --client_fs cephfs /mnt/cephfs_1
2023-11-07T06:46:39.327-0500 7f64e6e39480 -1 init, newargv = 0x5605dc8c5f60 newargc=15
ceph-fuse[42555]: starting ceph client
ceph-fuse[42555]: starting fuse
[root@ceph-sumar-regression-9s46lr-node6 ~]# cd /mnt/cephfs_1
[root@ceph-sumar-regression-9s46lr-node6 cephfs_1]# ls
fio_file_512M  smallfile_dir18  smallfile_dir30  smallfile_dir311  smallfile_dir323  smallfile_dir335  smallfile_dir347  smallfile_dir359  smallfile_dir46  smallfile_dir58  smallfile_dir7  smallfile_dir81  smallfile_dir93
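For the stale-mount behaviour reported above, a hedged sketch of how the blocklisting can be confirmed and the mount recovered. The old mount path (/mnt/cephfs) is an assumption; the client name and new mount point are taken from the client view above.

# On any node with the admin keyring: list blocklisted client addresses.
ceph osd blocklist ls

# On the client: the old ceph-fuse mount is stale
# ("Cannot send after transport endpoint shutdown"), so detach it.
umount -l /mnt/cephfs        # lazy unmount; fusermount -uz also works for FUSE mounts

# Remount (same as in the client view above) and resume IO on the new mount point.
ceph-fuse -n client.ceph-sumar-regression-9s46lr-node6 --client_fs cephfs /mnt/cephfs_1
ls /mnt/cephfs_1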
--- Additional comment from on 2023-11-08 05:47:22 UTC ---

(In reply to sumr from comment #9)
> [...]

I have added client-side logs for further debugging.

Logs: http://magna002.ceph.redhat.com/ceph-qe-logs/suma/bz_verify/system_logs/

Snippet:

2023-11-07T05:26:58.840-0500 7f3544ff9640 -1 client.24439 I was blocklisted at osd epoch 471

INFO: task ceph-fuse:42425 blocked for more than 1228 seconds.
[13394.110644]       Not tainted 5.14.0-284.30.1.el9_2.x86_64 #1
[13394.111298] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13394.112095] task:ceph-fuse       state:D stack:    0 pid:42425 ppid:     1 flags:0x00004006
[13394.112941] Call Trace:
[13394.113372]  <TASK>
[13394.113862]  __schedule+0x248/0x620
[13394.114402]  schedule+0x2d/0x60
[13394.114918]  request_wait_answer+0x131/0x220 [fuse]
[13394.115519]  ? cpuacct_percpu_seq_show+0x10/0x10
[13394.116108]  fuse_simple_request+0x19f/0x310 [fuse]
[13394.116808]  fuse_statfs+0xd8/0x140 [fuse]
[13394.117380]  statfs_by_dentry+0x64/0x90
[13394.117971]  user_statfs+0x57/0xc0
[13394.118461]  __do_sys_statfs+0x20/0x60
[13394.119003]  do_syscall_64+0x59/0x90
[13394.119537]  ? exc_page_fault+0x62/0x150
[13394.120154]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
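A sketch of how additional client-side logging could be captured for this kind of debugging. The chosen debug levels are assumptions, and applying them to the global client section (rather than per-client) is only one possible approach.

# Raise client-side debug levels so the ceph-fuse / libcephfs client logs more detail
# on its next mount (picked up from the cluster configuration database).
ceph config set client debug_client 20
ceph config set client debug_ms 1

# Kernel-side evidence of hung FUSE requests, as in the snippet above.
dmesg | grep -A5 "blocked for more than"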
--- Additional comment from Venky Shankar on 2023-11-09 05:56:48 UTC ---

(In reply to sumr from comment #9)
> ASK: Please confirm whether the behaviour seen post-upgrade is acceptable:
> the MDS auto-recovers and the client gets blocklisted, but IO continues on
> the new mount.

Do you see this behaviour in other upgrade tests?

[...]

> 2023-11-07T05:26:58.840-0500 7f3544ff9640 -1 client.24439 I was blocklisted
> at osd epoch 471
>
> INFO: task ceph-fuse:42425 blocked for more than 1228 seconds.
> [...]

Seems like the mount was unresponsive and that caused the MDS to blocklist the client. Do you see other (kernel) clients getting blocklisted? Was only this (ceph-fuse) client being used for IO during the upgrade?
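To answer the question about which clients were affected, one could cross-check MDS sessions against the OSD blocklist. A sketch, assuming the mds.<fsname>:<rank> tell addressing available on recent releases and that the session metadata distinguishes kernel from userspace clients.

# List the sessions known to the active MDS (rank 0 of filesystem "cephfs");
# on older releases, address the MDS by daemon name instead.
ceph tell mds.cephfs:0 session ls

# Each session's client_metadata usually indicates the client type
# (ceph-fuse/libcephfs report a ceph version, kernel clients a kernel version);
# this is an assumption about the typical output.

# Compare against the OSD blocklist to see which of those clients were evicted.
ceph osd blocklist ls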
--- Additional comment from on 2023-11-09 06:20:58 UTC ---

(In reply to Venky Shankar from comment #11)
> [...]

> Do you see this behaviour in other upgrade tests?
No, the existing upgrade regression tests run IO during the Ceph cluster upgrade, and Ceph status stays healthy with IO. In this case, the only new step was that the client was upgraded before the cluster upgrade, with IO running.

> Do you see other (kernel) clients getting blocklisted?
Only a ceph-fuse mount was covered; a kernel mount was not created. If you don't need this system for further debugging, I can rerun the same QA steps with a kernel mount too.
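For the suggested rerun with a kernel client alongside ceph-fuse, a sketch of a classic kernel CephFS mount. The monitor address, secret file path, and reuse of the same CephX user are placeholders/assumptions; older kernels use mds_namespace= instead of fs=.

# Kernel driver mount of the same filesystem on a second mount point.
mkdir -p /mnt/cephfs_kernel
mount -t ceph <mon_ip>:6789:/ /mnt/cephfs_kernel \
    -o name=ceph-sumar-regression-9s46lr-node6,secretfile=/etc/ceph/client.secret,fs=cephfs

# Run IO on both /mnt/cephfs_1 (fuse) and /mnt/cephfs_kernel (kernel) during the upgrade.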
--- Additional comment from Venky Shankar on 2023-11-09 06:30:21 UTC ---

(In reply to sumr from comment #12)
> Only a ceph-fuse mount was covered; a kernel mount was not created. If you
> don't need this system for further debugging, I can rerun the same QA steps
> with a kernel mount too.

Yes, please. Additionally, you could test using both RHCS 5/6 builds. See if this (blocklisted client) is consistently reproducible. Also, use both the user-space and the kernel driver in the test.
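When repeating the test from both RHCS 5 and RHCS 6 starting points with both drivers, the following gives a quick view of which releases the daemons and connected clients are actually running mid-upgrade (a sketch, assuming admin access).

# Release/version of every daemon class in the cluster.
ceph versions

# MDS state and which daemons are serving the filesystem during the upgrade.
ceph fs status cephfs

# Feature/release bits advertised by connected clients and daemons.
ceph features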
--- Additional comment from on 2023-11-09 12:13:32 UTC ---

(In reply to Venky Shankar from comment #13)
> Yes, please. Additionally, you could test using both RHCS 5/6 builds. See if
> this (blocklisted client) is consistently reproducible. Also, use both the
> user-space and the kernel driver in the test.
Repro is in progress. I tried once, but the repro attempt was not successful, i.e., the cluster was healthy after the upgrade. Retrying again; I will copy the logs to the magna server when it is accessible.
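A sketch of the checks that can be used to decide whether a reproduction attempt hit the issue; it simply spells out the "Ceph is healthy, no crash, no blocklisted client" criteria from the QA plan.

# After each upgrade attempt, check overall health and filesystem state.
ceph health detail
ceph fs status cephfs

# Any daemon or client crash collected by the crash module would show up here.
ceph crash ls

# Confirm whether any client was blocklisted during this run.
ceph osd blocklist ls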
--- Additional comment from on 2023-11-13 09:47:52 UTC ---

(In reply to sumr from comment #14)
> Repro is in progress. I tried once, but the repro attempt was not successful,
> i.e., the cluster was healthy after the upgrade.
I could not reproduce it in two attempts. Logs:

http://magna002.ceph.redhat.com/ceph-qe-logs/suma/bz_verify/bz_2247174_client_blocklist_repro1.log
http://magna002.ceph.redhat.com/ceph-qe-logs/suma/bz_verify/bz_2247174_client_blocklist_repro2.log

As the Ceph status is healthy and the client had no issues after the upgrade per the QA test plan, marking this BZ as VERIFIED.
backport PR: https://github.com/ceph/ceph/pull/54244
Doc text update. PTAL.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:7740