Bug 2283898 - [Host Maintenance][NVMe] - When we try to place the host containing the last available NVMe gateway into maintenance mode, it is allowed even without a warning that there may be I/O interruption
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 9.0
Assignee: Shweta Bhosale
QA Contact: Manasa
URL:
Whiteboard:
Depends On:
Blocks: 2267614 2298578 2298579
 
Reported: 2024-05-30 07:34 UTC by Manasa
Modified: 2025-04-15 08:27 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Release Note
Doc Text:
.No warning is emitted when moving the last active gateway into maintenance mode
Currently, there is no warning emitted when moving the last active gateway into maintenance mode. As a result, any hosts containing NVMe-oF gateways are placed into maintenance mode without warning, causing potential I/O interruptions. No workaround is currently available; to avoid this issue, verify that there are other active gateways before moving a gateway into maintenance mode.
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-9111 0 None None None 2024-05-30 07:38:16 UTC

Description Manasa 2024-05-30 07:34:20 UTC
Description of problem:
[Host Maintenance][NVMe] - When we try to place the host containing the last available NVMe gateway into maintenance mode, it is allowed even without a warning that there may be I/O interruption.

The last available NVMe gateway node should not be allowed to be placed into maintenance mode unless the user is first warned and required to pass the --force option.
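
Until such a guard exists, an operator-side pre-check can provide similar protection. The following is a minimal sketch, not part of cephadm; it assumes the plain-text output format of "ceph nvme-gw show <pool> '<group>'" shown in the steps below, and refuses to proceed if the given host runs the last AVAILABLE gateway:

#!/usr/bin/env python3
# Hypothetical operator-side pre-check (not part of cephadm): verify that at
# least one other NVMe-oF gateway would remain AVAILABLE before a gateway
# host is placed into maintenance. Parses the plain-text output of
# "ceph nvme-gw show <pool> '<group>'" as shown later in this report.
import json
import re
import subprocess
import sys

def list_gateways(pool, group=""):
    # "ceph nvme-gw show" prints several independent JSON objects; split them
    # apart and decode only the per-gateway blocks (the ones with "gw-id").
    out = subprocess.run(
        ["ceph", "nvme-gw", "show", pool, group],
        capture_output=True, text=True, check=True,
    ).stdout
    blocks = re.findall(r"\{.*?\}", out, flags=re.DOTALL)
    return [json.loads(b) for b in blocks if '"gw-id"' in b]

def safe_to_enter_maintenance(host, pool, group=""):
    # True if at least one AVAILABLE gateway remains on a different host.
    remaining = [
        gw for gw in list_gateways(pool, group)
        if gw.get("Availability") == "AVAILABLE" and host not in gw.get("gw-id", "")
    ]
    return len(remaining) > 0

if __name__ == "__main__":
    host = sys.argv[1]  # e.g. ceph-ibm-upgrade-hn699h-node10
    if not safe_to_enter_maintenance(host, "nvmeof"):
        sys.exit(host + " runs the last AVAILABLE NVMe-oF gateway; "
                        "entering maintenance would interrupt client I/O.")
    print("OK: other AVAILABLE gateways remain; " + host + " can enter maintenance.")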


Version-Release number of selected component (if applicable):
cp.stg.icr.io/cp/ibm-ceph/ceph-7-rhel9:7-56

How reproducible:
Always

Steps to Reproduce:
1. Create an RHCS 7.1 cluster with 4 NVMe gateway nodes.
2. Place 3 of them into maintenance mode and observe that the remaining 4th node becomes active and picks up all the I/O for all disks.

[root@ceph-ibm-upgrade-hn699h-node11 ~]# ceph orch host ls
HOST                                     ADDR          LABELS                    STATUS       
ceph-ibm-upgrade-hn699h-node1-installer  10.0.208.144  _admin,mgr,mon,installer               
ceph-ibm-upgrade-hn699h-node2            10.0.209.108  mgr,mon                                
ceph-ibm-upgrade-hn699h-node3            10.0.208.135  mon,osd                                
ceph-ibm-upgrade-hn699h-node4            10.0.208.32   osd,mds                                
ceph-ibm-upgrade-hn699h-node5            10.0.208.105  osd,mds                                
ceph-ibm-upgrade-hn699h-node6            10.0.211.178  osd                                    
ceph-ibm-upgrade-hn699h-node7            10.0.210.217  nvmeof-gw                 Maintenance  
ceph-ibm-upgrade-hn699h-node8            10.0.211.22   nvmeof-gw                 Maintenance  
ceph-ibm-upgrade-hn699h-node9            10.0.209.67   nvmeof-gw                 Maintenance  
ceph-ibm-upgrade-hn699h-node10           10.0.208.252  nvmeof-gw                              
10 hosts in cluster

[root@ceph-ibm-upgrade-hn699h-node11 ~]# ceph nvme-gw show nvmeof ''
{
    "epoch": 56,
    "pool": "nvmeof",
    "group": "",
    "num gws": 4,
    "Anagrp list": "[ 1 2 3 4 ]"
}
{
    "gw-id": "client.nvmeof.nvmeof.ceph-ibm-upgrade-hn699h-node10.tisxus",
    "anagrp-id": 1,
    "performed-full-startup": 1,
    "Availability": "AVAILABLE",
    "ana states": " 1: ACTIVE , 2: ACTIVE , 3: ACTIVE , 4: ACTIVE ,"
}
{
    "gw-id": "client.nvmeof.nvmeof.ceph-ibm-upgrade-hn699h-node7.iamilt",
    "anagrp-id": 2,
    "performed-full-startup": 0,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"
}
{
    "gw-id": "client.nvmeof.nvmeof.ceph-ibm-upgrade-hn699h-node8.tnjcij",
    "anagrp-id": 3,
    "performed-full-startup": 0,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"
}
{
    "gw-id": "client.nvmeof.nvmeof.ceph-ibm-upgrade-hn699h-node9.mqoxhu",
    "anagrp-id": 4,
    "performed-full-startup": 0,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"
}

3. Try placing the last available NVMe gateway node into maintenance mode; it is allowed without so much as a warning.

[root@ceph-ibm-upgrade-hn699h-node11 ~]# ceph orch host maintenance enter ceph-ibm-upgrade-hn699h-node10
Daemons for Ceph cluster 9dbd3814-1d7c-11ef-8e61-fa163e083545 stopped on host ceph-ibm-upgrade-hn699h-node10. Host ceph-ibm-upgrade-hn699h-node10 moved to maintenance mode

This then leads to I/O interruption for the client.

[root@ceph-ibm-upgrade-hn699h-node11 ~]# ceph orch host ls
HOST                                     ADDR          LABELS                    STATUS       
ceph-ibm-upgrade-hn699h-node1-installer  10.0.208.144  _admin,mgr,mon,installer               
ceph-ibm-upgrade-hn699h-node2            10.0.209.108  mgr,mon                                
ceph-ibm-upgrade-hn699h-node3            10.0.208.135  mon,osd                                
ceph-ibm-upgrade-hn699h-node4            10.0.208.32   osd,mds                                
ceph-ibm-upgrade-hn699h-node5            10.0.208.105  osd,mds                                
ceph-ibm-upgrade-hn699h-node6            10.0.211.178  osd                                    
ceph-ibm-upgrade-hn699h-node7            10.0.210.217  nvmeof-gw                 Maintenance  
ceph-ibm-upgrade-hn699h-node8            10.0.211.22   nvmeof-gw                 Maintenance  
ceph-ibm-upgrade-hn699h-node9            10.0.209.67   nvmeof-gw                 Maintenance  
ceph-ibm-upgrade-hn699h-node10           10.0.208.252  nvmeof-gw                 Maintenance  
10 hosts in cluster

[root@ceph-ibm-upgrade-hn699h-node11 ~]# ceph nvme-gw show nvmeof ''
{
    "epoch": 57,
    "pool": "nvmeof",
    "group": "",
    "num gws": 4,
    "Anagrp list": "[ 1 2 3 4 ]"
}
{
    "gw-id": "client.nvmeof.nvmeof.ceph-ibm-upgrade-hn699h-node10.tisxus",
    "anagrp-id": 1,
    "performed-full-startup": 0,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"
}
{
    "gw-id": "client.nvmeof.nvmeof.ceph-ibm-upgrade-hn699h-node7.iamilt",
    "anagrp-id": 2,
    "performed-full-startup": 0,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"
}
{
    "gw-id": "client.nvmeof.nvmeof.ceph-ibm-upgrade-hn699h-node8.tnjcij",
    "anagrp-id": 3,
    "performed-full-startup": 0,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"
}
{
    "gw-id": "client.nvmeof.nvmeof.ceph-ibm-upgrade-hn699h-node9.mqoxhu",
    "anagrp-id": 4,
    "performed-full-startup": 0,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"
}

Actual results:
All hosts containing NVMe gateways can be placed into maintenance mode without any warning that there may be I/O interruption.

Expected results:
The user should at least be given a warning and asked to pass the --force parameter if they wish to move the last remaining gateway into maintenance mode.
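
For illustration only, such a guard could mirror how other maintenance blockers already gate on --force. A minimal sketch follows; the names check_nvmeof_before_maintenance and NvmeofMaintenanceBlocked are hypothetical and do not correspond to existing cephadm code:

# Hypothetical orchestrator-side guard; class and function names are
# illustrative only and do not exist in cephadm today.
class NvmeofMaintenanceBlocked(Exception):
    """Maintenance would take down the last AVAILABLE NVMe-oF gateway."""

def check_nvmeof_before_maintenance(host, available_gateway_hosts, force=False):
    # available_gateway_hosts is assumed to be the set of hosts currently
    # running an AVAILABLE NVMe-oF gateway (e.g. derived from the output of
    # "ceph nvme-gw show"). Without --force, refuse to proceed when the host
    # being drained is the only one left in that set.
    remaining = set(available_gateway_hosts) - {host}
    if host in available_gateway_hosts and not remaining and not force:
        raise NvmeofMaintenanceBlocked(
            "ALERT: host %s runs the last AVAILABLE NVMe-oF gateway; client "
            "I/O will be interrupted. Re-run with --force to proceed." % host
        )

# Example: with only node10 still AVAILABLE (as in step 3 above), this raises
# unless force=True is passed.
# check_nvmeof_before_maintenance(
#     "ceph-ibm-upgrade-hn699h-node10",
#     {"ceph-ibm-upgrade-hn699h-node10"},
# )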

Additional info:

Comment 1 Aviv Caro 2024-05-30 08:02:31 UTC
I think this should be fixed for 7.1z1.

