Bug 1980423

Summary: Creating a manila share fails and gets stuck in "creating" status
Product: Red Hat OpenStack Reporter: Tom Barron <tbarron>
Component: openstack-containers Assignee: OSP Team <rhos-maint>
Status: CLOSED ERRATA QA Contact: nlevinki <nlevinki>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 16.1 (Train) CC: alfrgarc, bhubbard, bkopilov, ccopello, dhill, gcharot, gfidente, gkadam, gouthamr, idryomov, jamsmith, jjoyce, lkuchlan, lmarsh, m.andre, mhackett, mmurray, pdonnell, pgrist, rhos-maint, spower, tbarron, vhariria, vimartin
Target Milestone: async Keywords: AutomationBlocker, Regression, TestOnly, Triaged
Target Release: 16.1 (Train on RHEL 8.2) Flags: vimartin: needinfo-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-manila-share-container-16.1.6-7 Doc Type: Bug Fix
Doc Text:
This update fixes a bug that caused the creation of manila shares to fail in native CephFS and CephFS through NFS-Ganesha environments. The Shared File Systems service (manila) communicates with CephFS through the `ceph_vol_client` library. With the previous manila-share container, using `stat` or `mkdir` operating system calls from the `ceph_vol_client` library raised an exception that was handled incorrectly. The exception caused the Shared File Systems service to disconnect from the back end, which caused the share creation process to fail. This update introduces a safe function for directory creation that corrects the problem.
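The erratum describes the fix only in prose. As an illustration of the "safe function for directory creation" idea (a hypothetical sketch using plain `os` calls, not the actual manila patch or the `ceph_vol_client` API), the key point is that an "already exists" failure from `mkdir` must be tolerated instead of propagating an exception that tears down the back-end connection:

```python
import errno
import os
import tempfile


def safe_makedirs(path, mode=0o755):
    """Create a directory tree without failing on "already exists".

    Sketch of the error-handling pattern: an EEXIST from mkdir is
    swallowed instead of raising an exception that would disconnect
    the back end; every other error is still re-raised so the caller
    can handle it normally.
    """
    try:
        os.makedirs(path, mode)
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise


base = tempfile.mkdtemp()
target = os.path.join(base, "volumes", "_nogroup", "demo-share")
safe_makedirs(target)
safe_makedirs(target)  # second call is a harmless no-op, not a crash
print(os.path.isdir(target))  # True
```

In the real fix the equivalent logic wraps the `stat`/`mkdir` calls made through `ceph_vol_client`; the sketch above only demonstrates the pattern with local filesystem calls.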
Story Points: ---
Clone Of: 1976693 Environment:
Last Closed: 2021-07-19 13:54:52 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1976693, 1978688    
Bug Blocks:    

Description Tom Barron 2021-07-08 15:55:22 UTC
+++ This bug was initially created as a clone of Bug #1976693 +++

Description of problem:
There is a failure to create a manila share in the ceph environments
(CephFS-Native and CephFS-NFS-Ganesha).

create-share requests get to the manila-share service and then the
child process handling the cephfs back end is killed for reasons not
yet known and then gets restarted and reinitialized.

Version-Release number of selected component (if applicable):
puppet-manila-15.5.0-2.20210601014536.9c6604a.el8ost.2.noarch
puppet-ceph-3.1.2-2.20210603181657.ffa80da.el8ost.1.noarch
ceph-ansible-4.0.57-1.el8cp.noarch

How reproducible:
100%

Steps to Reproduce:

(overcloud) [stack@undercloud-0 ~]$ manila type-list
+--------------------------------------+-----------+------------+------------+--------------------------------------+-------------------------------------+-------------+
| ID                                   | Name      | visibility | is_default | required_extra_specs                 | optional_extra_specs                | Description |
+--------------------------------------+-----------+------------+------------+--------------------------------------+-------------------------------------+-------------+
| 21a84950-d1d7-4bd4-9fd1-e410091e7479 | default   | public     | YES        | driver_handles_share_servers : False | snapshot_support : True             | None        |
+--------------------------------------+-----------+------------+------------+--------------------------------------+-------------------------------------+-------------+

(overcloud) [stack@undercloud-0 ~]$ manila create nfs 1
+---------------------------------------+--------------------------------------+
| Property                              | Value                                |
+---------------------------------------+--------------------------------------+
| id                                    | 4064a7e9-8827-45fe-a88c-9e1550a49cd4 |
| size                                  | 1                                    |
| availability_zone                     | None                                 |
| created_at                            | 2021-06-28T04:02:26.000000           |
| status                                | creating                             |
| name                                  | None                                 |
| description                           | None                                 |
| project_id                            | d0d3edbb67754c01a4b71f0b379ca120     |
| snapshot_id                           | None                                 |
| share_network_id                      | None                                 |
| share_proto                           | NFS                                  |
| metadata                              | {}                                   |
| share_type                            | 21a84950-d1d7-4bd4-9fd1-e410091e7479 |
| is_public                             | False                                |
| snapshot_support                      | True                                 |
| task_state                            | None                                 |
| share_type_name                       | default                              |
| access_rules_status                   | active                               |
| replication_type                      | None                                 |
| has_replicas                          | False                                |
| user_id                               | 345024b1d79c4a1780d88f796575097c     |
| create_share_from_snapshot_support    | False                                |
| revert_to_snapshot_support            | False                                |
| share_group_id                        | None                                 |
| source_share_group_snapshot_member_id | None                                 |
| mount_snapshot_support                | False                                |
| share_server_id                       | None                                 |
| host                                  |                                      |
+---------------------------------------+--------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ manila list
+--------------------------------------+------+------+-------------+----------+-----------+-----------------+-------------------------+-------------------+
| ID                                   | Name | Size | Share Proto | Status   | Is Public | Share Type Name | Host                    | Availability Zone |
+--------------------------------------+------+------+-------------+----------+-----------+-----------------+-------------------------+-------------------+
| 4064a7e9-8827-45fe-a88c-9e1550a49cd4 | None | 1    | NFS         | creating | False     | default         | hostgroup@cephfs#cephfs | nova              |
+--------------------------------------+------+------+-------------+----------+-----------+-----------------+-------------------------+-------------------+

Actual results:
Creating a manila share fails and gets stuck in "creating" status.

Expected results:
Creating a manila share should be successful.
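For automated verification of the expected result (as tempest or a QE script would do), the share-status check amounts to a poll loop that times out if the share never leaves "creating". A generic sketch, with a hypothetical helper name and an injected status callable rather than real `manila show` calls:

```python
import time


def wait_for_status(get_status, expected="available", error="error",
                    timeout=300, interval=5):
    """Poll get_status() until it returns `expected`.

    Raises RuntimeError if the share enters the error state, or
    TimeoutError if it is still stuck (for example in "creating",
    as in this bug) when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    status = None
    while time.monotonic() < deadline:
        status = get_status()
        if status == expected:
            return status
        if status == error:
            raise RuntimeError("share entered %r state" % status)
        time.sleep(interval)
    raise TimeoutError("share stuck in %r after %ss" % (status, timeout))


# Simulated back end: the share becomes available on the third poll.
states = iter(["creating", "creating", "available"])
print(wait_for_status(lambda: next(states), interval=0))  # available
```

With the bug present, `get_status` keeps returning "creating" and the helper raises TimeoutError instead of returning.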

Additional info:

2021-06-27 10:22:00.278 290 DEBUG manila.share.drivers.cephfs.driver
[req-d9e9bf29-7254-4ae8-8739-6256298b768d
345024b1d79c4a1780d88f796575097c d0d3edbb67754c01a4b71f0b379ca120 - -
-] create_share cephfs name=8dc0239c-1aef-4dc5-8647-f7bde9db4f66
size=1 share_group_id=None create_share
/usr/lib/python3.6/site-packages/manila/share/drivers/cephfs/driver.py:262
2021-06-27 10:22:00.288 290 INFO ceph_volume_client
[req-d9e9bf29-7254-4ae8-8739-6256298b768d
345024b1d79c4a1780d88f796575097c d0d3edbb67754c01a4b71f0b379ca120 - -
-] create_volume:
/volumes/_nogroup/8dc0239c-1aef-4dc5-8647-f7bde9db4f66
2021-06-27 10:22:01.017 7 INFO oslo_service.service
[req-54e69491-1385-4065-9f86-a77c606d65c6 - - - - -] Child 290 killed
by signal 6
2021-06-27 10:22:01.023 7 DEBUG oslo_service.service
[req-54e69491-1385-4065-9f86-a77c606d65c6 - - - - -] Started child 328
_start_child
/usr/lib/python3.6/site-packages/oslo_service/service.py:579

--- Additional comment from lkuchlan on 2021-06-28 04:23:34 UTC ---



--- Additional comment from Victoria Martinez de la Cruz on 2021-06-30 12:18:41 UTC ---

I don't see anything odd in the manila logs (apart from the fact that the backend becomes unreachable). Is there an environment I can look into, or can you share the Ceph logs? Thanks!

--- Additional comment from Yaniv Kaul on 2021-06-30 12:24:21 UTC ---

Is that a regression? Isn't marked as such, but I assume it worked in previous releases?

--- Additional comment from Tom Barron on 2021-06-30 14:51:35 UTC ---

(In reply to Yaniv Kaul from comment #3)
> Is that a regression? Isn't marked as such, but I assume it worked in
> previous releases?

Liron told me in email "I checked the earlier 16.2 phase3 run and we used ceph tag 4-44, here we used 4-55" so it used to work in 16.2 candidate code (and works in earlier versions).  Vida reported "Apparently the Ceph DFG job running the same pkg version had similar tempest failures."  I'm leaving the NEEDINFO in case QE has additional information but yeah, I think this is a regression and should be marked as such.

--- Additional comment from Giulio Fidente on 2021-06-30 15:49:44 UTC ---

In OSP/Ceph phase3 CI we have a run which worked:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-ceph-rhos-16.2_director-rhel-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph-nfs-tls-ganesha/29/

and a run which failed:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-ceph-rhos-16.2_director-rhel-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph-nfs-tls-ganesha/30/

the OSP composes these are based on seem to be different:

<gfidente> tag: 16.2_20210614.1 in 29
<gfidente> tag: 16.2_20210623.1 in 30

but the version of openstack-manila in both seems to be the same:

http://download-ipv4.eng.brq.redhat.com/rcm-guest/puddles/OpenStack/16.2-RHEL-8/RHOS-16.2-RHEL-8-20210614.n.1/compose/OpenStack/x86_64/os/Packages/openstack-manila-share-9.1.6-2.20210603154809.479a8a7.el8ost.1.noarch.rpm

vs

http://download-ipv4.eng.brq.redhat.com/rcm-guest/puddles/OpenStack/16.2-RHEL-8/RHOS-16.2-RHEL-8-20210623.n.1/compose/OpenStack/x86_64/os/Packages/openstack-manila-share-9.1.6-2.20210603154809.479a8a7.el8ost.1.noarch.rpm

I wonder if this isn't due to some update in Ceph itself or the ceph_volume_client library which manila-share uses.

--- Additional comment from Giulio Fidente on 2021-06-30 15:57:38 UTC ---

Similarly, in the unified phase3 jobs, run 72 is working and 73 isn't; it seems that the version of the Ceph container deployed in the two runs is different:

72 is 4-44 http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-all-unified-16.2_director-rhel-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph-nfs-ganesha-default/72/undercloud-0/home/stack/containers-prepare-parameter.yaml.gz

73 is 4-55 http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-all-unified-16.2_director-rhel-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph-nfs-ganesha-default/73/undercloud-0/home/stack/containers-prepare-parameter.yaml.gz

This is probably an issue due to a recent update/change in Ceph; still investigating.

--- Additional comment from Giulio Fidente on 2021-06-30 16:14:41 UTC ---

From manila-share container, I think we have "ceph-common.x86_64 2:14.2.11-147.el8cp" [1] working, while "ceph-common.x86_64 2:14.2.11-181.el8cp" is broken [2]

1. http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-all-unified-16.2_director-rhel-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph-nfs-ganesha-default/72/controller-1/var/log/extra/podman/containers/openstack-manila-share-podman-0/podman_info.log.gz
2. http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-all-unified-16.2_director-rhel-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph-nfs-ganesha-default/73/controller-0/var/log/extra/podman/containers/openstack-manila-share-podman-0/podman_info.log.gz

--- Additional comment from Victoria Martinez de la Cruz on 2021-06-30 16:16:38 UTC ---

In the manila-shr log we can see the following

2021-06-28 04:02:27.187 4535 DEBUG manila.share.drivers.cephfs.driver [req-ce457d1b-c2f7-4d66-adf0-fc47599e055c 345024b1d79c4a1780d88f796575097c d0d3edbb67754c01a4b71f0b379ca120 - - -] create_share cephfs name=e2ff7bef-7577-4dc2-99bf-39d74ed87027 size=1 share_group_id=None create_share /usr/lib/python3.6/site-packages/manila/share/drivers/cephfs/driver.py:262
2021-06-28 04:02:27.198 4535 INFO ceph_volume_client [req-ce457d1b-c2f7-4d66-adf0-fc47599e055c 345024b1d79c4a1780d88f796575097c d0d3edbb67754c01a4b71f0b379ca120 - - -] create_volume: /volumes/_nogroup/e2ff7bef-7577-4dc2-99bf-39d74ed87027
2021-06-28 04:02:28.033 7 INFO oslo_service.service [req-54e69491-1385-4065-9f86-a77c606d65c6 - - - - -] Child 4535 killed by signal 6

When issuing the create share command, Ceph responds with SIGABRT. The backend (Ceph) becomes unreachable for Manila.
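The repeated worker deaths can be confirmed by scanning the manila-share log for oslo_service's "killed by signal" messages. A quick sketch over the two such lines quoted in this report:

```python
import re

# Two "child killed" lines copied from the manila-share log above.
LOG = """\
2021-06-27 10:22:01.017 7 INFO oslo_service.service [req-54e69491-1385-4065-9f86-a77c606d65c6 - - - - -] Child 290 killed by signal 6
2021-06-28 04:02:28.033 7 INFO oslo_service.service [req-54e69491-1385-4065-9f86-a77c606d65c6 - - - - -] Child 4535 killed by signal 6
"""

# Extract a (pid, signal) pair for every worker death in the log.
pattern = re.compile(r"Child (\d+) killed by signal (\d+)")
print(pattern.findall(LOG))  # [('290', '6'), ('4535', '6')]
```

Signal 6 is SIGABRT, i.e. the child process aborted itself rather than being killed externally.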

--- Additional comment from Goutham Pacha Ravi on 2021-06-30 18:16:54 UTC ---

The Ceph MDS log has this:

Jun 30 16:30:20 controller-1 conmon[48382]: 2021-06-30 16:30:20.060 7fafc91e0700  1 mds.0.4 Evicting (and blacklisting) client session 41050 (172.17.3.91:0/2787039070)
Jun 30 16:30:20 controller-1 conmon[48382]: 2021-06-30 16:30:20.060 7fafc91e0700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 41050 (172.17.3.91:0/2787039070)
Jun 30 16:34:42 controller-1 conmon[48382]: 2021-06-30 16:34:42.295 7fafc91e0700  1 mds.0.4 Evicting (and blacklisting) client session 159325 (172.17.3.91:0/3108840306)
Jun 30 16:34:42 controller-1 conmon[48382]: 2021-06-30 16:34:42.295 7fafc91e0700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 159325 (172.17.3.91:0/3108840306)
Jun 30 16:34:42 controller-1 conmon[48382]: 2021-06-30 16:34:42.751 7fafc29d3700  1 mds.0.4 session 159325 was removed while we waited for blacklist
Jun 30 17:34:06 controller-1 conmon[48382]: 2021-06-30 17:34:06.981 7fafc91e0700  1 mds.0.4 Evicting (and blacklisting) client session 159493 (172.17.3.91:0/3220034429)
Jun 30 17:34:06 controller-1 conmon[48382]: 2021-06-30 17:34:06.981 7fafc91e0700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 159493 (172.17.3.91:0/3220034429)
Jun 30 17:47:26 controller-1 conmon[48382]: 2021-06-30 17:47:26.806 7fafc91e0700  1 mds.0.4 Evicting (and blacklisting) client session 161425 (172.17.3.91:0/2266084359)
Jun 30 17:47:26 controller-1 conmon[48382]: 2021-06-30 17:47:26.806 7fafc91e0700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 161425 (172.17.3.91:0/2266084359)

I wonder why the client's getting blacklisted.

--- Additional comment from Goutham Pacha Ravi on 2021-06-30 18:28:20 UTC ---

Client metadata per mds:

[root@controller-0 /]# ceph tell mds.0 client ls
2021-06-30 18:25:51.980 7fdd7a74ab80 10 client.0 ll_register_callbacks cb 0x56414183c830 invalidate_ino_cb 0 invalidate_dentry_cb 0 switch_interrupt_cb 0 remount_cb 0
2021-06-30 18:25:51.981 7fdd3ffff700 10 client.0 ms_handle_connect on v2:172.17.3.91:3300/0
2021-06-30 18:25:51.982 7fdd7a74ab80 10 client.163237 fetch_fsmap learned FSMap version 6
2021-06-30 18:25:51.983 7fdd7a74ab80 10 client.163237 fetch_fsmap finished waiting for FSMap version 6
2021-06-30 18:25:51.983 7fdd7a74ab80 10 client.163237 resolve_mds: resolved '0' to role '1:0'
2021-06-30 18:25:51.983 7fdd7a74ab80 20 client.163237 populate_metadata read hostname 'controller-0'
2021-06-30 18:25:51.983 7fdd7a74ab80  4 client.163237 mds_command: new command op to 4859 tid=0[{"prefix": "get_command_descriptions"}]
2021-06-30 18:25:51.985 7fdd3ffff700 10 client.163237 ms_handle_connect on v2:172.17.3.148:6800/2360339553
2021-06-30 18:25:51.985 7fdd3ffff700 10 client.163237 handle_command_reply: tid=0
2021-06-30 18:25:51.986 7fdd7a74ab80  1 client.163237 shutdown
2021-06-30 18:25:51.986 7fdd3ffff700  0 client.163237 ms_handle_reset on v2:172.17.3.148:6800/2360339553
2021-06-30 18:25:51.987 7fdd7a74ab80 20 client.163237 trim_cache size 0 max 16384
2021-06-30 18:25:52.002 7fdd7a74ab80 10 client.0 ll_register_callbacks cb 0x7fdd600d6be0 invalidate_ino_cb 0 invalidate_dentry_cb 0 switch_interrupt_cb 0 remount_cb 0
2021-06-30 18:25:52.003 7fdd457fa700 10 client.0 ms_handle_connect on v2:172.17.3.91:3300/0
2021-06-30 18:25:52.004 7fdd7a74ab80 10 client.163243 fetch_fsmap learned FSMap version 6
2021-06-30 18:25:52.005 7fdd7a74ab80 10 client.163243 fetch_fsmap finished waiting for FSMap version 6
2021-06-30 18:25:52.005 7fdd7a74ab80 10 client.163243 resolve_mds: resolved '0' to role '1:0'
2021-06-30 18:25:52.005 7fdd7a74ab80 20 client.163243 populate_metadata read hostname 'controller-0'
2021-06-30 18:25:52.005 7fdd7a74ab80  4 client.163243 mds_command: new command op to 4859 tid=0[{"prefix": "client ls"}]
2021-06-30 18:25:52.006 7fdd457fa700 10 client.163243 ms_handle_connect on v2:172.17.3.148:6800/2360339553
2021-06-30 18:25:52.007 7fdd457fa700 10 client.163243 handle_command_reply: tid=0
2021-06-30 18:25:52.007 7fdd7a74ab80  1 client.163243 shutdown
2021-06-30 18:25:52.007 7fdd457fa700  0 client.163243 ms_handle_reset on v2:172.17.3.148:6800/2360339553
2021-06-30 18:25:52.008 7fdd7a74ab80 20 client.163243 trim_cache size 0 max 16384
[
    {
        "id": 161905,
        "entity": {
            "name": {
                "type": "client",
                "num": 161905
            },
            "addr": {
                "type": "any",
                "addr": "172.17.3.91:0",
                "nonce": 2230540863
            }
        },
        "state": "open",
        "num_leases": 0,
        "num_caps": 4,
        "request_load_avg": 0,
        "uptime": 2305.1990099519999,
        "requests_in_flight": 0,
        "completed_requests": 0,
        "reconnecting": false,
        "recall_caps": {
            "value": 0,
            "halflife": 60
        },
        "release_caps": {
            "value": 0,
            "halflife": 60
        },
        "recall_caps_throttle": {
            "value": 0,
            "halflife": 1.5
        },
        "recall_caps_throttle2o": {
            "value": 0,
            "halflife": 0.5
        },
        "session_cache_liveness": {
            "value": 0.029218800216601667,
            "halflife": 300
        },
        "cap_acquisition": {
            "value": 0,
            "halflife": 10
        },
        "inst": "client.161905 172.17.3.91:0/2230540863",
        "completed_requests": [],
        "prealloc_inos": [],
        "used_inos": [],
        "client_metadata": {
            "features": "0x0000000000001fff",
            "ceph_sha1": "68fea1005601531fe60d2979c56ea63bc073c84f",
            "ceph_version": "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)",
            "entity_id": "manila",
            "hostname": "controller-0",
            "pid": "32617",
            "root": "/"
        }
    }
]

--- Additional comment from Patrick Donnelly on 2021-06-30 20:25:05 UTC ---

(In reply to Victoria Martinez de la Cruz from comment #8)
> In the manila-shr log we can see the following
> 
> 2021-06-28 04:02:27.187 4535 DEBUG manila.share.drivers.cephfs.driver
> [req-ce457d1b-c2f7-4d66-adf0-fc47599e055c 345024b1d79c4a1780d88f796575097c
> d0d3edbb67754c01a4b71f0b379ca120 - - -] create_share cephfs
> name=e2ff7bef-7577-4dc2-99bf-39d74ed87027 size=1 share_group_id=None
> create_share
> /usr/lib/python3.6/site-packages/manila/share/drivers/cephfs/driver.py:262
> 2021-06-28 04:02:27.198 4535 INFO ceph_volume_client
> [req-ce457d1b-c2f7-4d66-adf0-fc47599e055c 345024b1d79c4a1780d88f796575097c
> d0d3edbb67754c01a4b71f0b379ca120 - - -] create_volume:
> /volumes/_nogroup/e2ff7bef-7577-4dc2-99bf-39d74ed87027
> 2021-06-28 04:02:28.033 7 INFO oslo_service.service
> [req-54e69491-1385-4065-9f86-a77c606d65c6 - - - - -] Child 4535 killed by
> signal 6
> 
> When issuing the create share command, Ceph responds with SIGABRT. The
> backend (Ceph) becomes unreachable for Manila.

"backend" would be the process running the cephfs driver, yes? I think we need to see the logs from the client to get an idea what went wrong. That should be something like

/var/log/ceph/ceph-client.manila.log

The MDS blocklisting the client is not surprising, as any client which aborts without disconnecting from CephFS gracefully will be evicted eventually.

--- Additional comment from Tom Barron on 2021-06-30 21:10:19 UTC ---

(In reply to Patrick Donnelly from comment #11)
> (In reply to Victoria Martinez de la Cruz from comment #8)
> > In the manila-shr log we can see the following
> > 
> > 2021-06-28 04:02:27.187 4535 DEBUG manila.share.drivers.cephfs.driver
> > [req-ce457d1b-c2f7-4d66-adf0-fc47599e055c 345024b1d79c4a1780d88f796575097c
> > d0d3edbb67754c01a4b71f0b379ca120 - - -] create_share cephfs
> > name=e2ff7bef-7577-4dc2-99bf-39d74ed87027 size=1 share_group_id=None
> > create_share
> > /usr/lib/python3.6/site-packages/manila/share/drivers/cephfs/driver.py:262
> > 2021-06-28 04:02:27.198 4535 INFO ceph_volume_client
> > [req-ce457d1b-c2f7-4d66-adf0-fc47599e055c 345024b1d79c4a1780d88f796575097c
> > d0d3edbb67754c01a4b71f0b379ca120 - - -] create_volume:
> > /volumes/_nogroup/e2ff7bef-7577-4dc2-99bf-39d74ed87027
> > 2021-06-28 04:02:28.033 7 INFO oslo_service.service
> > [req-54e69491-1385-4065-9f86-a77c606d65c6 - - - - -] Child 4535 killed by
> > signal 6
> > 
> > When issuing the create share command, Ceph responds with SIGABRT. The
> > backend (Ceph) becomes unreachable for Manila.
> 
> "backend" would be the process running the cephfs driver, yes? I think we
> need to see the logs from the client to get an idea what went wrong. That
> should be something like

Yes.

> 
> /var/log/ceph/ceph-client.manila.log

We don't have logs under /var/log/ceph ... Anyone know a setting to enable these?

I'm not seeing anything in the journal either ...

> 
> The MDS blocklisting the client is not surprising as a any client which
> aborts without disconnecting from CephFS gracefully will be evicted
> eventually.

Ack.

Perhaps unrelated, but I see:

[root@controller-0 ~]# podman exec ceph-mon-controller-0  ceph -s
  cluster:
    id:     3162a706-252a-4593-8864-f8f72c2d6f14
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
 
  services:
    mon: 3 daemons, quorum controller-2,controller-0,controller-1 (age 3d)
    mgr: controller-2(active, since 3d), standbys: controller-0, controller-1
    mds: cephfs:1 {0=controller-1=up:active} 2 up:standby
    osd: 15 osds: 15 up (since 3d), 15 in (since 3d)
    rgw: 3 daemons active (controller-0.rgw0, controller-1.rgw0, controller-2.rgw0)
 
  data:
    pools:   10 pools, 320 pgs
    objects: 265 objects, 24 MiB
    usage:   15 GiB used, 165 GiB / 180 GiB avail
    pgs:     320 active+clean


Reading https://www.suse.com/support/kb/doc/?id=000019960 this HEALTH_WARN appears to be due to clients (in this case clients running in OSP containers) not having security patches that are already enabled in the OCS server code.

My thought is not so much that the CVE is directly relevant as that the health warning is symptomatic of OSP client code (in this case, ceph volume library used by manila-share container) not being up to date and in phase with the server side code.

Just speculating, you tell us.

--- Additional comment from Patrick Donnelly on 2021-06-30 21:39:59 UTC ---

(In reply to Tom Barron from comment #12)
> (In reply to Patrick Donnelly from comment #11)
> > (In reply to Victoria Martinez de la Cruz from comment #8)
> > > In the manila-shr log we can see the following
> > > 
> > > 2021-06-28 04:02:27.187 4535 DEBUG manila.share.drivers.cephfs.driver
> > > [req-ce457d1b-c2f7-4d66-adf0-fc47599e055c 345024b1d79c4a1780d88f796575097c
> > > d0d3edbb67754c01a4b71f0b379ca120 - - -] create_share cephfs
> > > name=e2ff7bef-7577-4dc2-99bf-39d74ed87027 size=1 share_group_id=None
> > > create_share
> > > /usr/lib/python3.6/site-packages/manila/share/drivers/cephfs/driver.py:262
> > > 2021-06-28 04:02:27.198 4535 INFO ceph_volume_client
> > > [req-ce457d1b-c2f7-4d66-adf0-fc47599e055c 345024b1d79c4a1780d88f796575097c
> > > d0d3edbb67754c01a4b71f0b379ca120 - - -] create_volume:
> > > /volumes/_nogroup/e2ff7bef-7577-4dc2-99bf-39d74ed87027
> > > 2021-06-28 04:02:28.033 7 INFO oslo_service.service
> > > [req-54e69491-1385-4065-9f86-a77c606d65c6 - - - - -] Child 4535 killed by
> > > signal 6
> > > 
> > > When issuing the create share command, Ceph responds with SIGABRT. The
> > > backend (Ceph) becomes unreachable for Manila.
> > 
> > "backend" would be the process running the cephfs driver, yes? I think we
> > need to see the logs from the client to get an idea what went wrong. That
> > should be something like
> 
> Yes.
> 
> > 
> > /var/log/ceph/ceph-client.manila.log
> 
> We don't have logs under /var/log/ceph ... Anyone know a setting to enable
> these?

You would need to modify the ceph.conf where the driver runs with:

> [client]
> log file = /var/log/ceph/$name.$pid.log
> debug client = 20
> debug ms = 1


> Perhaps unrealated but I see:
> 
> [root@controller-0 ~]# podman exec ceph-mon-controller-0  ceph -s
>   cluster:
>     id:     3162a706-252a-4593-8864-f8f72c2d6f14
>     health: HEALTH_WARN
>             mons are allowing insecure global_id reclaim
>  
>   services:
>     mon: 3 daemons, quorum controller-2,controller-0,controller-1 (age 3d)
>     mgr: controller-2(active, since 3d), standbys: controller-0, controller-1
>     mds: cephfs:1 {0=controller-1=up:active} 2 up:standby
>     osd: 15 osds: 15 up (since 3d), 15 in (since 3d)
>     rgw: 3 daemons active (controller-0.rgw0, controller-1.rgw0,
> controller-2.rgw0)
>  
>   data:
>     pools:   10 pools, 320 pgs
>     objects: 265 objects, 24 MiB
>     usage:   15 GiB used, 165 GiB / 180 GiB avail
>     pgs:     320 active+clean
> 
> 
> Reading https://www.suse.com/support/kb/doc/?id=000019960 this HEALTH_WARN
> appears to be due to clients (in this case clients running in OSP
> containers) not having security patches that are already enabled in the OCS
> server code.
> 
> My thought is not so much that the CVE is directly relevant as that the
> health warning is symptomatic of OSP client code (in this case, ceph volume
> library used by manila-share container) not being up to date and in phase
> with the server side code.
> 
> Just speculating, you tell us.

I think this is unrelated but +Ilya can tell us for sure.

--- Additional comment from Goutham Pacha Ravi on 2021-07-01 02:21:56 UTC ---


> > We don't have logs under /var/log/ceph ... Anyone know a setting to enable
> > these?
> 
> You would need to modify the ceph.conf where the driver runs with:
> 
> > [client]
> > log file = /var/log/ceph/$name.$pid.log
> > debug client = 20
> > debug ms = 1


Did this, and re-attempted a "create_volume" call:


2021-07-01 02:18:22.243 7f75de7fc700 20 client.178756 trim_cache size 3 max 16384
2021-07-01 02:18:23.243 7f75de7fc700 20 client.178756 trim_cache size 3 max 16384
2021-07-01 02:18:24.244 7f75de7fc700 20 client.178756 trim_cache size 3 max 16384
2021-07-01 02:18:25.244 7f75de7fc700 20 client.178756 trim_cache size 3 max 16384
2021-07-01 02:18:26.244 7f75de7fc700 20 client.178756 trim_cache size 3 max 16384
2021-07-01 02:18:27.244 7f75de7fc700 10 client.178756 renew_caps()
2021-07-01 02:18:27.244 7f75de7fc700 15 client.178756 renew_caps requesting from mds.0
2021-07-01 02:18:27.244 7f75de7fc700 10 client.178756 renew_caps mds.0
2021-07-01 02:18:27.244 7f75de7fc700  1 -- 172.17.3.91:0/1870589254 --> [v2:172.17.3.148:6800/2360339553,v1:172.17.3.148:6801/2360339553] -- client_session(request_renewcaps seq 12) v3 -- 0x7f75c8003fe0 con 0x555c3af526b0
2021-07-01 02:18:27.244 7f75de7fc700 20 client.178756 trim_cache size 3 max 16384
2021-07-01 02:18:27.245 7f75dcff9700  1 -- 172.17.3.91:0/1870589254 <== mds.0 v2:172.17.3.148:6800/2360339553 21 ==== client_session(renewcaps seq 12) v1 ==== 28+0+0 (crc 0 0 0) 0x7f75c8004200 con 0x555c3af526b0
2021-07-01 02:18:27.245 7f75dcff9700 10 client.178756 handle_client_session client_session(renewcaps seq 12) v1 from mds.0
2021-07-01 02:18:28.244 7f75de7fc700 20 client.178756 trim_cache size 3 max 16384
2021-07-01 02:18:29.176 7f7612b5ab80  3 client.178756 statx enter (relpath /volumes/_nogroup/4d6c4c55-143e-49cd-85ca-f42321116a46 want 2047)
2021-07-01 02:18:29.176 7f7612b5ab80 15 inode.get on 0x7f75d400b570 0x1.head now 4
2021-07-01 02:18:29.176 7f7612b5ab80 10 client.178756 path_walk #0x1/volumes/_nogroup/4d6c4c55-143e-49cd-85ca-f42321116a46
2021-07-01 02:18:29.176 7f7612b5ab80 10 client.178756  0 0x1.head(faked_ino=0 ref=4 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-27 09:46:26.287598 mtime=2021-06-30 16:27:27.749580 ctime=2021-06-30 16:27:27.749580 caps=pAsLsXsFs(0=pAsLsXsFs) has_dir_layout 0x7f75d400b570) volumes
2021-07-01 02:18:29.176 7f7612b5ab80 20 client.178756   (path is #0x1/volumes/_nogroup/4d6c4c55-143e-49cd-85ca-f42321116a46)
2021-07-01 02:18:29.176 7f7612b5ab80 20 client.178756 may_lookup 0x1.head(faked_ino=0 ref=4 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-27 09:46:26.287598 mtime=2021-06-30 16:27:27.749580 ctime=2021-06-30 16:27:27.749580 caps=pAsLsXsFs(0=pAsLsXsFs) has_dir_layout 0x7f75d400b570); UserPerm(uid: 0, gid: 0)
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 _getattr mask As issued=1
2021-07-01 02:18:29.177 7f7612b5ab80  3 client.178756 may_lookup 0x7f75d400b570 = 0
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 _lookup have dn volumes mds.-1 ttl 0.000000 seq 0
2021-07-01 02:18:29.177 7f7612b5ab80 15 inode.get on 0x7f75d400bba0 0x10000000000.head now 3
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 _lookup 0x1.head(faked_ino=0 ref=4 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-27 09:46:26.287598 mtime=2021-06-30 16:27:27.749580 ctime=2021-06-30 16:27:27.749580 caps=pAsLsXsFs(0=pAsLsXsFs) has_dir_layout 0x7f75d400b570) volumes = 0x10000000000.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=0.000000 mtime=2021-06-30 16:27:27.781292 ctime=2021-06-30 16:27:27.781292 caps=pAsLsXsFs(0=pAsLsXsFs) COMPLETE parents=0x1.head["volumes"] 0x7f75d400bba0)
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 put_inode on 0x1.head(faked_ino=0 ref=4 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-27 09:46:26.287598 mtime=2021-06-30 16:27:27.749580 ctime=2021-06-30 16:27:27.749580 caps=pAsLsXsFs(0=pAsLsXsFs) has_dir_layout 0x7f75d400b570)
2021-07-01 02:18:29.177 7f7612b5ab80 15 inode.put on 0x7f75d400b570 0x1.head now 3
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756  1 0x10000000000.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=0.000000 mtime=2021-06-30 16:27:27.781292 ctime=2021-06-30 16:27:27.781292 caps=pAsLsXsFs(0=pAsLsXsFs) COMPLETE parents=0x1.head["volumes"] 0x7f75d400bba0) _nogroup
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756   (path is #0x1/volumes/_nogroup/4d6c4c55-143e-49cd-85ca-f42321116a46)
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 may_lookup 0x10000000000.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=0.000000 mtime=2021-06-30 16:27:27.781292 ctime=2021-06-30 16:27:27.781292 caps=pAsLsXsFs(0=pAsLsXsFs) COMPLETE parents=0x1.head["volumes"] 0x7f75d400bba0); UserPerm(uid: 0, gid: 0)
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 _getattr mask As issued=1
2021-07-01 02:18:29.177 7f7612b5ab80  3 client.178756 may_lookup 0x7f75d400bba0 = 0
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 _lookup have dn _nogroup mds.-1 ttl 0.000000 seq 0
2021-07-01 02:18:29.177 7f7612b5ab80 15 inode.get on 0x7f75d400ce90 0x10000000001.head now 2
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 _lookup 0x10000000000.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=0.000000 mtime=2021-06-30 16:27:27.781292 ctime=2021-06-30 16:27:27.781292 caps=pAsLsXsFs(0=pAsLsXsFs) COMPLETE parents=0x1.head["volumes"] 0x7f75d400bba0) _nogroup = 0x10000000001.head(faked_ino=0 ref=2 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90)
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 put_inode on 0x10000000000.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=0.000000 mtime=2021-06-30 16:27:27.781292 ctime=2021-06-30 16:27:27.781292 caps=pAsLsXsFs(0=pAsLsXsFs) COMPLETE parents=0x1.head["volumes"] 0x7f75d400bba0)
2021-07-01 02:18:29.177 7f7612b5ab80 15 inode.put on 0x7f75d400bba0 0x10000000000.head now 2
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756  2 0x10000000001.head(faked_ino=0 ref=2 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90) 4d6c4c55-143e-49cd-85ca-f42321116a46
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756   (path is #0x1/volumes/_nogroup/4d6c4c55-143e-49cd-85ca-f42321116a46)
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 may_lookup 0x10000000001.head(faked_ino=0 ref=2 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90); UserPerm(uid: 0, gid: 0)
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 _getattr mask As issued=1
2021-07-01 02:18:29.177 7f7612b5ab80  3 client.178756 may_lookup 0x7f75d400ce90 = 0
2021-07-01 02:18:29.177 7f7612b5ab80 15 inode.get on 0x7f75d400ce90 0x10000000001.head now 3
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 _do_lookup on #0x10000000001/4d6c4c55-143e-49cd-85ca-f42321116a46
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 choose_target_mds starting with req->inode 0x10000000001.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90)
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 choose_target_mds inode dir hash is 2 on 4d6c4c55-143e-49cd-85ca-f42321116a46 => 1388017151
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 choose_target_mds 0x10000000001.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90) is_hash=1 hash=1388017151
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 choose_target_mds from caps on inode 0x10000000001.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90)
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 mds is 0
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 send_request rebuilding request 7 for mds.0
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 encode_cap_releases enter (req: 0x555c3b140b90, mds: 0)
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 send_request set sent_stamp to 2021-07-01 02:18:29.178094
2021-07-01 02:18:29.177 7f7612b5ab80 10 client.178756 send_request client_request(unknown.0:7 lookup #0x10000000001/4d6c4c55-143e-49cd-85ca-f42321116a46 2021-07-01 02:18:29.178062 caller_uid=0, caller_gid=0{}) v4 to mds.0
2021-07-01 02:18:29.177 7f7612b5ab80  1 -- 172.17.3.91:0/1870589254 --> [v2:172.17.3.148:6800/2360339553,v1:172.17.3.148:6801/2360339553] -- client_request(unknown.0:7 lookup #0x10000000001/4d6c4c55-143e-49cd-85ca-f42321116a46 2021-07-01 02:18:29.178062 caller_uid=0, caller_gid=0{}) v4 -- 0x555c3b118250 con 0x555c3af526b0
2021-07-01 02:18:29.177 7f7612b5ab80 20 client.178756 awaiting reply|forward|kick on 0x7ffea38f1e20
2021-07-01 02:18:29.178 7f75dcff9700  1 -- 172.17.3.91:0/1870589254 <== mds.0 v2:172.17.3.148:6800/2360339553 22 ==== client_reply(???:7 = -2 (2) No such file or directory) v1 ==== 458+0+0 (crc 0 0 0) 0x7f75e001f6f0 con 0x555c3af526b0
2021-07-01 02:18:29.178 7f75dcff9700 20 client.178756 handle_client_reply got a reply. Safe:1 tid 7
2021-07-01 02:18:29.178 7f75dcff9700 10 client.178756 insert_trace from 2021-07-01 02:18:29.178094 mds.0 is_target=0 is_dentry=1
2021-07-01 02:18:29.178 7f75dcff9700 10 client.178756  features 0xffffffffffffffff
2021-07-01 02:18:29.178 7f75dcff9700 10 client.178756 update_snap_trace len 48
2021-07-01 02:18:29.178 7f75dcff9700 20 client.178756 get_snap_realm 0x1 0x7f75d400b490 4 -> 5
2021-07-01 02:18:29.178 7f75dcff9700 10 client.178756 update_snap_trace snaprealm(0x1 nref=5 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=1=[]) seq 1 <= 1 and same parent, SKIPPING
2021-07-01 02:18:29.178 7f75dcff9700 10 client.178756  hrm  is_target=0 is_dentry=1
2021-07-01 02:18:29.178 7f75dcff9700 12 client.178756 add_update_inode had 0x10000000001.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90) caps pAsLsXsFs
2021-07-01 02:18:29.178 7f75dcff9700 20 client.178756  dir hash is 2
2021-07-01 02:18:29.178 7f75dcff9700 10 client.178756 add_update_cap issued pAsLsXsFs -> pAsLsXsFs from mds.0 on 0x10000000001.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90)
2021-07-01 02:18:29.178 7f75dcff9700 20 client.178756 got dirfrag map for 0x10000000001 frag * to mds -1
2021-07-01 02:18:29.178 7f75dcff9700 20 client.178756 put_snap_realm 0x1 0x7f75d400b490 5 -> 4
2021-07-01 02:18:29.178 7f75dcff9700 20 client.178756 handle_client_reply signalling caller 0x7ffea38f1e20
2021-07-01 02:18:29.178 7f75dcff9700 20 client.178756 handle_client_reply awaiting kickback on tid 7 0x7f75dcff8bd0
2021-07-01 02:18:29.178 7f7612b5ab80 20 client.178756 sendrecv kickback on tid 7 0x7f75dcff8bd0
2021-07-01 02:18:29.178 7f7612b5ab80 20 client.178756 lat 0.001414
2021-07-01 02:18:29.178 7f7612b5ab80 10 client.178756 _do_lookup res is -2
2021-07-01 02:18:29.178 7f7612b5ab80 10 client.178756 _lookup 0x10000000001.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90) 4d6c4c55-143e-49cd-85ca-f42321116a46 = -2
2021-07-01 02:18:29.178 7f7612b5ab80 10 client.178756 put_inode on 0x10000000001.head(faked_ino=0 ref=3 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-30 16:27:27.754401 mtime=2021-06-30 16:27:27.755305 ctime=2021-06-30 16:27:27.755305 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["_nogroup"] 0x7f75d400ce90)
2021-07-01 02:18:29.178 7f7612b5ab80 15 inode.put on 0x7f75d400ce90 0x10000000001.head now 2
2021-07-01 02:18:29.178 7f7612b5ab80  3 client.178756 statx enter (relpath  want 2047)



After this, the manila-share service got a SIGABRT

--- Additional comment from Patrick Donnelly on 2021-07-01 03:31:29 UTC ---

I would expect a core dump and stack trace from the log...

Can you give me login details to test this myself with reproduction steps?

--- Additional comment from lkuchlan on 2021-07-01 04:10:39 UTC ---

(In reply to Yaniv Kaul from comment #3)
> Is that a regression? Isn't marked as such, but I assume it worked in
> previous releases?

Right, in the previous phase 3 runs it has all been fine.

--- Additional comment from Ilya Dryomov on 2021-07-01 10:14:44 UTC ---

(In reply to Tom Barron from comment #12)
> Perhaps unrelated but I see:
> 
> [root@controller-0 ~]# podman exec ceph-mon-controller-0  ceph -s
>   cluster:
>     id:     3162a706-252a-4593-8864-f8f72c2d6f14
>     health: HEALTH_WARN
>             mons are allowing insecure global_id reclaim
>  
> Reading https://www.suse.com/support/kb/doc/?id=000019960 this HEALTH_WARN
> appears to be due to clients (in this case clients running in OSP
> containers) not having security patches that are already enabled in the OCS
> server code.

Not quite.  This warning indicates that the servers are configured to allow unfixed clients.  If you had any unfixed clients actually connected you would see an additional warning.

Should be completely unrelated to the aborts that you are experiencing.

--- Additional comment from Giulio Fidente on 2021-07-01 10:29:30 UTC ---

(In reply to Ilya Dryomov from comment #17)
> (In reply to Tom Barron from comment #12)
> > Perhaps unrelated but I see:
> > 
> > [root@controller-0 ~]# podman exec ceph-mon-controller-0  ceph -s
> >   cluster:
> >     id:     3162a706-252a-4593-8864-f8f72c2d6f14
> >     health: HEALTH_WARN
> >             mons are allowing insecure global_id reclaim
> >  
> > Reading https://www.suse.com/support/kb/doc/?id=000019960 this HEALTH_WARN
> > appears to be due to clients (in this case clients running in OSP
> > containers) not having security patches that are already enabled in the OCS
> > server code.
> 
> Not quite.  This warning indicates that the servers are configured to allow
> unfixed clients.  If you had any unfixed clients actually connected you
> would see an additional warning.
> 
> Should be completely unrelated to the aborts that you are experiencing.

Thanks Ilya; I guess this means every pre-existing deployment will see this warning after an update, and we can't clear it (or enforce the more secure model) if there are pre-existing, long-running clients, like qemu guests?

--- Additional comment from Ilya Dryomov on 2021-07-01 11:29:35 UTC ---

Yes, see https://docs.ceph.com/en/latest/security/CVE-2021-20288/ for details.

--- Additional comment from Paul Grist on 2021-07-01 15:42:41 UTC ---

Setting this to blocker for awareness; this may impact 16.x customers in the field as well. It appears to have been introduced with the latest ceph 4.2z2 release, but there is no root cause yet, so it is not clear where the potential fix will live.

We very much need to know how this escaped 4.2z2 RC build handoff and understand the retrospective on this one.

This will likely need to be async'd once we have a solution.

thanks,
Paul

cc: Mike Hackett, as we may need to alert the field if this reproduces in 16.1.x - investigating now

--- Additional comment from Tom Barron on 2021-07-01 19:30:26 UTC ---

(In reply to Paul Grist from comment #20)
> Setting this to blocker for awareness and this may impact 16.x customers in
> the field as well, it appears to be introduced with the latest ceph 4.2z2
> release, but no root cause yet, so not clear where the potential fix will
> live.
> 
> We very much need to know how this escaped 4.2z2 RC build handoff and
> understand the retrospective on this one.
> 
> This will likely need to be async'd once we have a solution.
> 
> thanks,
> Paul
> 
> cc: Mike Hackett as we may need to alert field if this reproduces in 16.1.x
> - investigating now

W.r.t. 16.1, the latest compose available still has Ceph 14.2.11-147 packages in the OSP containers even though we are picking up 14.2.11-181 ceph packages in the server containers.

This combination does *not* trigger the problem.

From a debugging standpoint, this turns the investigation to client-side changes in ceph (libcephfs, rados client, ceph_volume_client) rather than changes to the mds, mons, etc.

From a field standpoint, current 16.1 customers are *not* exposed to this issue, but we will of course need a fix before the next 16.1 Z stream or, failing that, avoid a client package bump beyond 14.2.11-147. And similarly for 16.2.

--- Additional comment from Goutham Pacha Ravi on 2021-07-01 19:32:31 UTC ---

An update on the root cause: Patrick Donnelly attempted to take manila and the CephFS driver out of the picture by using the ceph_volume_client library from ceph-common-14.2.11-181.el8cp.x86_64 directly:

# echo '/tmp/%e-%p.core' | sudo tee /proc/sys/kernel/core_pattern
# python3
>>> from ceph_volume_client import CephFSVolumeClient, VolumePath                                                                                                                             
>>> vc = CephFSVolumeClient("manila", "/etc/ceph/ceph.conf", "ceph")                                                                                                                          
>>> vc.connect()                                                                                                                                                                              
>>> vc.create_volume(VolumePath(None, "1a228c3b-1ae0-4142-bccc-53a99e2a899e"))                                                                                                                
/usr/include/c++/8/string_view:172: constexpr const _CharT& std::basic_string_view<_CharT, _Traits>::operator[](std::basic_string_view<_CharT, _Traits>::size_type) const [with _CharT = char; _Traits = std::char_traits<char>; std::basic_string_view<_CharT, _Traits>::size_type = long unsigned int]: Assertion '__pos < this->_M_len' failed.
Aborted (core dumped)

The core-dump file is attached.

--- Additional comment from Patrick Donnelly on 2021-07-02 02:55:02 UTC ---

Huge thanks to Brad Hubbard for helping me get an environment setup to debug the core file.

Here's the thread list:

> (gdb) info thread
>   Id   Target Id                        Frame 
> * 1    Thread 0x7f569e8c3b80 (LWP 6799) 0x00007f569dd72a74 in PyEval_EvalFrameEx (throwflag=0, 
>     f=Frame 0x558c14e4c4d8, for file /usr/lib/python3.6/site-packages/ceph_volume_client.py, line 650, in _mkdir_p (self=<CephFSVolumeClient(fs=<cephfs.LibCephFS at remote 0x7f569c78b588>, fs_name=None, connected=False, conf_path='/etc/ceph/ceph.conf', cluster_name='ceph', auth_id='manila', rados=<rados.Rados at remote 0x7f569a2e6a08>, own_rados=True, volume_prefix='/volumes', pool_ns_prefix='fsvolumens_', _id=5036324001222103531) at remote 0x7f569c781dd8>, path='/volumes/_nogroup/1a228c3b-1ae0-4142-bccc-53a99e2a899e', mode=493, parts=['', 'volumes', '_nogroup', '1a228c3b-1ae0-4142-bccc-53a99e2a899e'], i=1, subpath='')) at /usr/src/debug/python3-3.6.8-37.el8.x86_64/Python/ceval.c:4166
>   2    Thread 0x7f567affd700 (LWP 6811) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c14f4c910) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   3    Thread 0x7f56899f1700 (LWP 6802) 0x00007f569d0070f7 in epoll_wait (epfd=6, events=0x558c14fa7910, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>   4    Thread 0x7f567b7fe700 (LWP 6810) futex_abstimed_wait_cancelable (private=0, abstime=0x7f567b7fde80, expected=0, futex_word=0x558c14f548f4)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>   5    Thread 0x7f567bfff700 (LWP 6809) futex_abstimed_wait_cancelable (private=0, abstime=0x7f567bffeea0, expected=0, futex_word=0x558c14f24408)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>   6    Thread 0x7f568a1f2700 (LWP 6801) 0x00007f569d0070f7 in epoll_wait (epfd=3, events=0x558c14f7b9d0, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>   7    Thread 0x7f5670ff9700 (LWP 6826) futex_abstimed_wait_cancelable (private=0, abstime=0x7f5670ff8e70, expected=0, futex_word=0x558c150130a8)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>   8    Thread 0x7f5673fff700 (LWP 6816) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c14f0fbb8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   9    Thread 0x7f56677fe700 (LWP 6828) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c1500f918) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   10   Thread 0x7f5671ffb700 (LWP 6824) futex_abstimed_wait_cancelable (private=0, abstime=0x7f5671ffadf0, expected=0, futex_word=0x558c15013a9c)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>   11   Thread 0x7f567a7fc700 (LWP 6812) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c14f4cad8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   12   Thread 0x7f56717fa700 (LWP 6825) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c150142f0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   13   Thread 0x7f5679ffb700 (LWP 6813) futex_abstimed_wait_cancelable (private=0, abstime=0x7f5679ffadf0, expected=0, futex_word=0x558c14f0f1f0)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>   14   Thread 0x7f56737fe700 (LWP 6817) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c14f0fd90) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   15   Thread 0x7f5678ff9700 (LWP 6815) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c14f0f848) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   16   Thread 0x7f56667fc700 (LWP 6830) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c1500d4bc) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   17   Thread 0x7f5666ffd700 (LWP 6829) futex_abstimed_wait_cancelable (private=0, abstime=0x7f5666ffcdf0, expected=0, futex_word=0x558c1500d3a0)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>   18   Thread 0x7f5672ffd700 (LWP 6822) futex_abstimed_wait_cancelable (private=0, abstime=0x7f5672ffce80, expected=0, futex_word=0x558c1500e184)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>   19   Thread 0x7f56889ef700 (LWP 6808) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c14d0b5c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   20   Thread 0x7f56727fc700 (LWP 6823) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c15013158) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   21   Thread 0x7f56891f0700 (LWP 6803) 0x00007f569d0070f7 in epoll_wait (epfd=9, events=0x558c14fd3850, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>   22   Thread 0x7f56797fa700 (LWP 6814) futex_wait_cancelable (private=0, expected=0, futex_word=0x558c14f0f30c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
>   23   Thread 0x7f5667fff700 (LWP 6827) futex_wait_cancelable (private=0, expected=0, futex_word=0x7f5667ffec00) at ../sysdeps/unix/sysv/linux/futex-internal.h:88

and the backtrace:

> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f569cf2bdb5 in __GI_abort () at abort.c:79
> #2  0x00007f5698bef0f8 in std::__replacement_assert (__file=__file@entry=0x7f5698cd2900 "/usr/include/c++/8/string_view", __line=__line@entry=172, 
>     __function=__function@entry=0x7f5698cd8920 <std::basic_string_view<char, std::char_traits<char> >::operator[](unsigned long) const::__PRETTY_FUNCTION__> "constexpr const _CharT& std::basic_string_view<_CharT, _Traits>::operator[](std::basic_string_view<_CharT, _Traits>::size_type) const [with _CharT = char; _Traits = std::char_traits<char>; std::basic_"..., 
>     __condition=__condition@entry=0x7f5698cd28e8 "__pos < this->_M_len") at /usr/include/c++/8/x86_64-redhat-linux/bits/c++config.h:2397
> #3  0x00007f5698c8768c in std::basic_string_view<char, std::char_traits<char> >::operator[] (this=<optimized out>, this=<optimized out>, __pos=<optimized out>)
>     at /usr/src/debug/ceph-14.2.11-181.el8cp.x86_64/src/include/fs_types.h:17
> #4  filepath::set_path (s=..., this=<optimized out>) at /usr/src/debug/ceph-14.2.11-181.el8cp.x86_64/src/include/filepath.h:103
> #5  filepath::filepath (this=0x7ffee8a2a1e0, s=<optimized out>) at /usr/src/debug/ceph-14.2.11-181.el8cp.x86_64/src/include/filepath.h:95
> #6  0x00007f5698c37e1a in Client::statx (this=0x558c15013a30, relpath=0x7f569e7f4690 "", stx=0x7ffee8a2a290, perms=..., want=2047, flags=0)
>     at /usr/src/debug/ceph-14.2.11-181.el8cp.x86_64/src/client/Client.cc:7301
> #7  0x00007f5698f45518 in __pyx_pf_6cephfs_9LibCephFS_80stat () from /usr/lib64/python3.6/site-packages/cephfs.cpython-36m-x86_64-linux-gnu.so
> #8  0x00007f5698f472f0 in __pyx_pw_6cephfs_9LibCephFS_81stat () from /usr/lib64/python3.6/site-packages/cephfs.cpython-36m-x86_64-linux-gnu.so
> ...

So statx is getting called with an empty path, which causes the abort.
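
The same class of problem can be demonstrated without Ceph at all. The sketch below uses plain `os.stat` as a stand-in for the cephfs binding (an assumption for illustration only): an empty path is never a valid stat argument, but CPython fails with a catchable exception, whereas libcephfs's statx trips a C++ assertion and aborts the whole process:

```python
import errno
import os

# stat("") is invalid by definition; Linux returns ENOENT for an empty path.
# Plain Python surfaces this as a FileNotFoundError, whereas the cephfs
# binding ends up in Client::statx -> filepath::set_path, which indexes an
# empty string_view, asserts, and SIGABRTs.
try:
    os.stat("")
except FileNotFoundError as exc:
    print("empty path rejected with errno", exc.errno)  # errno.ENOENT == 2
```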

I cannot find any recent changes in Ceph that would explain how this new behavior was introduced. Up to this point, the code is fairly simple and doesn't depend much on Ceph/libcephfs. The problematic code is:

>     def _mkdir_p(self, path, mode=0o755):
>         try:
>             self.fs.stat(path)
>         except cephfs.ObjectNotFound:
>             pass
>         else:
>             return
> 
>         parts = path.split(os.path.sep)
> 
>         for i in range(1, len(parts) + 1):
>             subpath = os.path.join(*parts[0:i])
>             try:
>                 self.fs.stat(subpath)
>             except cephfs.ObjectNotFound:
>                 self.fs.mkdir(subpath, mode)

it is failing on the second stat. Manually testing ad-hoc:

>>>> import os
>>>> a="/hi/bye"
>>>> b=a.split(os.path.sep)
>>>> for i in range(1, len(b)+1):
>...     print(os.path.join(*b[0:i]))
>... 
>
>hi
>hi/bye

So it looks like this loop will always try to stat the empty string whenever the first character of the path *is* '/' (splitting an absolute path yields an empty leading component). But even in testing we seem to always use a volume prefix beginning with root, so I'm not sure how we haven't seen this exception/abort before.

I will need to think about this more...
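
For what it's worth, a defensive rewrite of the loop is straightforward. The sketch below is an illustration only, not the actual upstream fix (https://github.com/ceph/ceph/pull/42162); the injected `stat`/`mkdir` callables and `FileNotFoundError` stand in for `self.fs.stat`/`self.fs.mkdir` and `cephfs.ObjectNotFound` so the logic can run without a cephfs backend:

```python
import os

def mkdir_p_safe(stat, mkdir, path, mode=0o755):
    """Create path and all its ancestors, never stat'ing an empty subpath.

    stat/mkdir are injected so the sketch is testable without cephfs; in
    ceph_volume_client they would be self.fs.stat / self.fs.mkdir.
    """
    # Splitting "/volumes/_nogroup/x" yields ['', 'volumes', '_nogroup', 'x'];
    # drop the empty leading component instead of passing '' to stat().
    parts = [p for p in path.split(os.path.sep) if p]
    prefix = os.path.sep if path.startswith(os.path.sep) else ""
    for i in range(1, len(parts) + 1):
        subpath = prefix + os.path.join(*parts[0:i])
        try:
            stat(subpath)
        except FileNotFoundError:
            mkdir(subpath, mode)

# Ad-hoc check with dict-backed fakes in place of a real filesystem:
created = []
existing = set()

def fake_stat(p):
    if p not in existing:
        raise FileNotFoundError(p)

def fake_mkdir(p, mode):
    created.append(p)
    existing.add(p)

mkdir_p_safe(fake_stat, fake_mkdir, "/volumes/_nogroup/share-1")
print(created)  # ['/volumes', '/volumes/_nogroup', '/volumes/_nogroup/share-1']
```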

--- Additional comment from Patrick Donnelly on 2021-07-02 03:00:22 UTC ---

Goutham, can you share the client logs we produced in our testing? I think we should be able to see the full path from the first stat.

--- Additional comment from Goutham Pacha Ravi on 2021-07-02 04:40:52 UTC ---

The top part of this client log was generated when we were debugging on the system; the remainder may not be relevant because the client continued to be active, although no volume operations were initiated through manila.

--- Additional comment from Goutham Pacha Ravi on 2021-07-02 04:42:21 UTC ---

Client log when we began the manual debug; contains a lot of the statx calls before the SIGABRT.

--- Additional comment from Goutham Pacha Ravi on 2021-07-02 19:16:22 UTC ---

https://bugzilla.redhat.com/show_bug.cgi?id=1978688 has been updated with a fix: https://github.com/ceph/ceph/pull/42162

We could use this BZ as an OSP-side tracker; no code changes are necessary in manila or any other openstack component repository.

--- Additional comment from  on 2021-07-02 19:18:15 UTC ---

Executed manual share tests and selected tempest API and scenario tests on a ceph-fs-ganesha env running RHOS-16.2-RHEL-8-20210616.n.1 with the fix referenced in comment #27:

puppet-manila-15.5.0-2.20210601014536.9c6604a.el8ost.2.noarch
python3-manilaclient-1.29.0-2.20210528015043.1b2cafb.el8ost.1.noarch
puppet-ceph-3.1.2-2.20210603181657.ffa80da.el8ost.1.noarch
ceph-ansible-4.0.57-1.el8cp.noarch

All tests passed.

--- Additional comment from Tom Barron on 2021-07-02 19:58:01 UTC ---

(In reply to Goutham Pacha Ravi from comment #27)
> https://bugzilla.redhat.com/show_bug.cgi?id=1978688 has been updated with a
> fix: https://github.com/ceph/ceph/pull/42162
> 
> We could use this BZ as an OSP side tracker; no code changes are necessary
> in manila or any other openstack component repository

Yeah, this one should be test-only and verified after the RHCS fix is released. That enables tracking on the OSP side (and an OSP-side reference for blocker requests, etc.) as well as a checkpoint for any glitches in production image delivery and the OSP rebase on the RHCS side when it is released.

--- Additional comment from Giulio Fidente on 2021-07-06 15:57:03 UTC ---

The fix will be shipped in the RHCS 4.2z1-async1 update; OSP will have to respin the containers to pick up the newer version of the python-cephfs package in the containers consuming it (for example, manila-share).

Comment 3 Tom Barron 2021-07-08 16:21:23 UTC
Ericsson is hitting this with 16.1 in https://access.redhat.com/support/cases/#/case/02983628.

They are running with manila-share image 16.1_20210616.1, and I guess our tests on 16.1 were using an earlier image, 16.1_20210602. Apparently the newer image was rebased with the problematic ceph client packages.

Comment 20 errata-xmlrpc 2021-07-19 13:54:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.6 async containers bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2772