Bug 1976128 - Unable to create max LUNs (255) per target and containers crash after some disks are added to the client
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 5.2
Assignee: Teoman ONAY
QA Contact: Preethi
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2102272
 
Reported: 2021-06-25 09:25 UTC by Gopi
Modified: 2022-08-09 17:36 UTC
CC List: 17 users

Fixed In Version: ceph-16.2.8-45.el8cp
Doc Type: Bug Fix
Doc Text:
.Container process number limit set to `max`
Previously, the process number limit, 2048, on the containers prevented new processes from being forked beyond the limit. With this release, the process number limit is set to `max`, which allows you to create as many LUNs as required per target. However, the number is still limited by the server resources.
Clone Of:
Environment:
Last Closed: 2022-08-09 17:35:53 UTC
Embargoed:




Links:
- Ceph Project Bug Tracker 52898 (last updated 2021-10-12 10:18:05 UTC)
- GitHub ceph/ceph pull 44579: "cephadm: Remove containers pids-limit" (open; last updated 2022-03-10 09:50:51 UTC)
- Red Hat Product Errata RHSA-2022:5997 (last updated 2022-08-09 17:36:25 UTC)

Description Gopi 2021-06-25 09:25:51 UTC
Description of problem:

Unable to create the maximum number of LUNs per target. The container crashes after some LUNs are created.

Version-Release number of selected component (if applicable):
ceph version 16.2.0-72.el8cp (1e802193e0b4084ffcdb2338dd09f08bbea54a1a) pacific (stable)

How reproducible:
100%

Steps to Reproduce:
1. Create the iSCSI gateways using the spec file below (see the example apply command after these steps).
[ceph: root@magna104 ~]# cat iscsi.yaml 
service_type: iscsi
service_id: iscsi
placement:
  hosts:
  - magna108
  - magna113
spec:
  pool: iscsi_pool
  trusted_ip_list: "ipv4,ipv6"
  api_user: admin
  api_password: admin
[ceph: root@magna104 ~]#

2. Start the iSCSI gateways using "gwcli".

3. Create the target and gateways:
/iscsi-targets> ls
o- iscsi-targets ................................................................................. [DiscoveryAuth: None, Targets: 1]
  o- iqn.2003-01.com.redhat.iscsi-gw:ceph-igw ............................................................ [Auth: None, Gateways: 2]
    o- disks ............................................................................................................ [Disks: 0]
    o- gateways .............................................................................................. [Up: 2/2, Portals: 2]
    | o- magna108 .............................................................................................. [10.8.128.108 (UP)]
    | o- magna113 .............................................................................................. [10.8.128.113 (UP)]
    o- host-groups .................................................................................................... [Groups : 0]
    o- hosts ......................................................................................... [Auth: ACL_ENABLED, Hosts: 0]
/iscsi-targets>
4. Create the client IQN.
5. Create images and add the disks to the client.
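
For reference, a minimal sketch of how a spec like the one in step 1 is typically applied with cephadm (assuming the same file name and a cephadm shell; the confirmation output is omitted since it varies by release):

[ceph: root@magna104 ~]# ceph orch apply -i iscsi.yaml
[ceph: root@magna104 ~]# ceph orch ls iscsi

ceph orch ls then lists the iscsi service and how many gateway daemons are running on magna108 and magna113.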


Actual results:
/iscsi-target...at:rh7-client> disk add iscsi_pool/image127
ok
/iscsi-target...at:rh7-client> disk add iscsi_pool/image128
Exception in thread Thread-11:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "/usr/lib/python3.6/site-packages/gwcli/gateway.py", line 646, in check_gateways
    check_thread.start()
  File "/usr/lib64/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
[root@magna108 ubuntu]# podman exec -it ff1f0ffc5f35 sh
Error: no container with name or ID ff1f0ffc5f35 found: no such container
[root@magna108 ubuntu]#
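
The "can't start new thread" failure is consistent with the container hitting its process-number (pids) limit, which is what this bug tracks. As a diagnostic (a hedged suggestion, not part of the original report; check a still-running gateway container, and note the cgroup path assumes cgroup v1 as used later in this bug):

[root@magna108 ubuntu]# podman ps --format '{{.ID}} {{.Names}}' | grep -i -e iscsi -e tcmu
[root@magna108 ubuntu]# podman exec <container-id> cat /sys/fs/cgroup/pids/pids.max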

Expected results:
It should be possible to create the maximum number of LUNs per target.

Comment 3 Gopi 2021-06-28 05:32:39 UTC
I tried the same scenario on a 4.2 bare-metal setup and it worked fine.

/iscsi-target...at:rh7-client> ls
o- iqn.1994-05.com.redhat:rh7-client ................................................. [LOGGED-IN, Auth: CHAP, Disks: 256(1074076M)]
  o- lun 0 ................................................. [rbd/rbd_disk(1.0T), Owner: dell-r730-068.dsal.lab.eng.rdu2.redhat.com]
  o- lun 1 ................................................... [rbd/disk_1(100M), Owner: dell-r730-065.dsal.lab.eng.rdu2.redhat.com]
  o- lun 2 ................................................... [rbd/disk_2(100M), Owner: dell-r730-068.dsal.lab.eng.rdu2.redhat.com]
  o- lun 3 ................................................... [rbd/disk_3(100M), Owner: dell-r730-065.dsal.lab.eng.rdu2.redhat.com]
  o- lun 4 ................................................... [rbd/disk_4(100M), Owner: dell-r730-068.dsal.lab.eng.rdu2.redhat.com]
  o- lun 5 ................................................... [rbd/disk_5(100M), Owner: dell-r730-065.dsal.lab.eng.rdu2.redhat.com]
.
.
.
o- lun 250 ............................................... [rbd/disk_250(100M), Owner: dell-r730-068.dsal.lab.eng.rdu2.redhat.com]
  o- lun 251 ............................................... [rbd/disk_251(100M), Owner: dell-r730-065.dsal.lab.eng.rdu2.redhat.com]
  o- lun 252 ............................................... [rbd/disk_252(100M), Owner: dell-r730-068.dsal.lab.eng.rdu2.redhat.com]
  o- lun 253 ............................................... [rbd/disk_253(100M), Owner: dell-r730-065.dsal.lab.eng.rdu2.redhat.com]
  o- lun 254 ............................................... [rbd/disk_254(100M), Owner: dell-r730-068.dsal.lab.eng.rdu2.redhat.com]
  o- lun 255 ............................................... [rbd/disk_255(100M), Owner: dell-r730-065.dsal.lab.eng.rdu2.redhat.com]
/iscsi-target...at:rh7-client> disk add rbd/disk_256 size=100m
Failed : Disk limit of 256 reached.
disk auto-define failed(8), try using the /disks create command
/iscsi-target...at:rh7-client>

/iscsi-target...at:rh7-client> goto gateways
/iscsi-target...6046/gateways> cd /disks
/disks> create rbd image=disk_1 size=100m
Failed : Disk limit of 256 reached.

Comment 6 Yaniv Kaul 2021-07-01 09:26:35 UTC
(In reply to Gopi from comment #3)
> I tried the same scenario on a 4.2 bare-metal setup and it worked fine.

So why isn't the bug marked as a regression?


Comment 7 Gopi 2021-07-01 12:30:06 UTC
Hi Yaniv,

Thanks for pointing that out. It looks like I missed adding the keyword. I will make a note of it.

Comment 19 Sebastian Wagner 2021-07-07 11:44:03 UTC
Ilya, could you verify that https://github.com/ceph/ceph/pull/42214 works for you?

Comment 20 Scott Ostapovicz 2021-07-07 14:08:09 UTC
This work is progressing but should not block 5.0, since these limits were not usually reached in practice. Moving to 5.1.

Comment 51 Preethi 2022-04-22 11:21:30 UTC
The issue still exists with the fix provided. We are unable to create the maximum number of LUNs per target (256) with the latest build; the containers crash after the 190th LUN is created.

Previously we could only reach the 127th LUN; with the fix we can reach only the 190th.


Health status before the crash, when the 190th LUN was added successfully:


[root@ceph-pnataraj-hkpq9z-node1-installer cephuser]# 
[root@ceph-pnataraj-hkpq9z-node1-installer cephuser]# 
[root@ceph-pnataraj-hkpq9z-node1-installer cephuser]# ceph status
  cluster:
    id:     7e821d06-c172-11ec-b5bf-fa163e42a69f
    health: HEALTH_WARN
            380 stray daemon(s) not managed by cephadm
 
  services:
    mon:         3 daemons, quorum ceph-pnataraj-hkpq9z-node1-installer,ceph-pnataraj-hkpq9z-node2,ceph-pnataraj-hkpq9z-node3 (age 3h)
    mgr:         ceph-pnataraj-hkpq9z-node1-installer.emcclr(active, since 3h), standbys: ceph-pnataraj-hkpq9z-node2.uvsywu
    osd:         10 osds: 10 up (since 3h), 10 in (since 3h)
    tcmu-runner: 380 portals active (2 hosts)
 
  data:
    pools:   4 pools, 97 pgs
    objects: 791 objects, 345 KiB
    usage:   446 MiB used, 200 GiB / 200 GiB avail
    pgs:     97 active+clean
 
  io:
    client:   1.7 KiB/s rd, 1 op/s rd, 0 op/s wr


Health status when the 191st LUN was added:
 
[root@ceph-pnataraj-hkpq9z-node1-installer cephuser]# ceph status
  cluster:
    id:     7e821d06-c172-11ec-b5bf-fa163e42a69f
    health: HEALTH_WARN
            380 stray daemon(s) not managed by cephadm
 
  services:
    mon:         3 daemons, quorum ceph-pnataraj-hkpq9z-node1-installer,ceph-pnataraj-hkpq9z-node2,ceph-pnataraj-hkpq9z-node3 (age 3h)
    mgr:         ceph-pnataraj-hkpq9z-node1-installer.emcclr(active, since 3h), standbys: ceph-pnataraj-hkpq9z-node2.uvsywu
    osd:         10 osds: 10 up (since 3h), 10 in (since 3h)
    tcmu-runner: 190 portals active (1 hosts)
 
  data:
    pools:   4 pools, 97 pgs
    objects: 791 objects, 345 KiB
    usage:   446 MiB used, 200 GiB / 200 GiB avail
    pgs:     97 active+clean
 
  io:
    client:   2.5 KiB/s rd, 2 op/s rd, 0 op/s wr

NOTE: The health status is OK, but the tcmu-runner container has crashed.
[root@ceph-pnataraj-hkpq9z-node4 cephuser]# podman ps -a
CONTAINER ID  IMAGE                                                                                                                         COMMAND               CREATED       STATUS           PORTS       NAMES
a32804441d49  registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87  -n osd.2 -f --set...  22 hours ago  Up 22 hours ago              ceph-7e821d06-c172-11ec-b5bf-fa163e42a69f-osd-2
0c498bce1657  registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87  -n osd.4 -f --set...  22 hours ago  Up 22 hours ago              ceph-7e821d06-c172-11ec-b5bf-fa163e42a69f-osd-4
d8710998f84c  registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87  -n osd.7 -f --set...  22 hours ago  Up 22 hours ago              ceph-7e821d06-c172-11ec-b5bf-fa163e42a69f-osd-7
53f8080a97a1  registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87                        22 hours ago  Up 22 hours ago              ceph-7e821d06-c172-11ec-b5bf-fa163e42a69f-iscsi-iscsipool-ceph-pnataraj-hkpq9z-node4-rbibzm
[root@ceph-pnataraj-hkpq9z-node4 cephuser]#
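
Since cephadm runs each daemon from a systemd unit named ceph-<fsid>@<daemon-name>.service, one way to look for the reason the tcmu-runner container disappeared (a hedged suggestion; the unit name below is assembled from the fsid and the daemon name visible in the container name above) is to check the journal on the gateway node:

[root@ceph-pnataraj-hkpq9z-node4 cephuser]# journalctl -u 'ceph-7e821d06-c172-11ec-b5bf-fa163e42a69f@iscsi.iscsipool.ceph-pnataraj-hkpq9z-node4.rbibzm' --since '1 hour ago'

The unit.run for this kind of daemon (shown in comment 58) starts both the rbd-target-api and the tcmu-runner containers from the same unit, so the tcmu-runner exit should be logged there as well.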

 
[root@ceph-pnataraj-hkpq9z-node1-installer cephuser]# ceph version
ceph version 16.2.7-107.el8cp (3106079e34bb001fa0999e9b975bd5e8a413f424) pacific (stable)
[root@ceph-pnataraj-hkpq9z-node1-installer cephuser]# 




Snippet of the disk add output:

/iscsi-target...at:rh7-client> disk add iscsipool/test201
ok
/iscsi-target...at:rh7-client> disk add iscsipool/test202
ok
/iscsi-target...at:rh7-client> disk add iscsipool/test203
ok
/iscsi-target...at:rh7-client> disk add iscsipool/test204
    

ok
/iscsi-target...at:rh7-client> 
/iscsi-target...at:rh7-client> 
/iscsi-target...at:rh7-client> disk add iscsipool/test205
ok
/iscsi-target...at:rh7-client> disk add iscsipool/test206
ok
/iscsi-target...at:rh7-client> disk add iscsipool/test207




[root@ceph-pnataraj-hkpq9z-node4 cephuser]# podman exec -it 045a9e5fd62f /bin/bash
Error: no container with name or ID "045a9e5fd62f" found: no such container
[root@ceph-pnataraj-hkpq9z-node4 cephuser]# ls
[root@ceph-pnataraj-hkpq9z-node4 cephuser]# 


Attached the tcmu-runner log and a snippet of the output:

http://pastebin.test.redhat.com/1046690

Cluster status:

HOST                                  ADDR          LABELS                    STATUS  
ceph-pnataraj-hkpq9z-node1-installer  10.0.209.110  _admin mon installer mgr          
ceph-pnataraj-hkpq9z-node2            10.0.208.171  mgr mon                           
ceph-pnataraj-hkpq9z-node3            10.0.210.54   osd mon                           
ceph-pnataraj-hkpq9z-node4            10.0.208.133  osd mds                           
ceph-pnataraj-hkpq9z-node5            10.0.208.218  osd mds

Comment 54 Preethi 2022-04-25 17:36:31 UTC
Yes, node 10.0.208.218 is down. I will reproduce the issue on a fresh cluster and share the system details.

Comment 55 Preethi 2022-04-26 09:55:29 UTC
I was able to reproduce the issue with the fix again:

1) Created images using the count option; 6 extra images were created under the target.
2) Deleted the extra images, leaving 256 images per target.
3) Started adding LUNs to the client; at the 201st LUN the error below was hit.

/iscsi-target...at:rh7-client> disk add iscsipool/test219
 
 
 
 
Exception in thread Thread-92:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 919, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 1185, in run
    self.function(*self.args, **self.kwargs)
  File "/usr/lib/python3.6/site-packages/gwcli/gateway.py", line 646, in check_gateways
    check_thread.start()
  File "/usr/lib64/python3.6/threading.py", line 849, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Output: it hangs at the 201st LUN addition and later we see the exception above; after checking podman ps -a again, the container was not found.


A snippet of ceph status and podman ps is copied below:

http://pastebin.test.redhat.com/1047363 -> image creation snippet

http://pastebin.test.redhat.com/1047393 -> output after the issue was hit

Cluster details:

ssh cephuser@<ip>
The password is cephuser.


ceph-pnataraj-luc1vd-node1-installer  10.0.210.237  _admin installer mgr mon    -> bootstrap/admin node
ceph-pnataraj-luc1vd-node2            10.0.209.164  mon mgr
ceph-pnataraj-luc1vd-node3            10.0.210.124  osd mon
ceph-pnataraj-luc1vd-node4            10.0.209.187  osd mds                     -> iSCSI gateway node 1
ceph-pnataraj-luc1vd-node5            10.0.209.227  osd mds                     -> iSCSI gateway node 2

Comment 58 Preethi 2022-04-26 12:56:44 UTC
Hi Adam,

I was able to get the unit.run from the new cluster. Please find the snippet below:


[root@ceph-tcmu-iscsi-fotdn1-node4 cephuser]# cd /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/
[root@ceph-tcmu-iscsi-fotdn1-node4 iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu]# ls
config  configfs  iscsi-gateway.cfg  keyring  unit.configured  unit.created  unit.image  unit.meta  unit.poststop  unit.run  unit.stop
[root@ceph-tcmu-iscsi-fotdn1-node4 iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu]# cat unit.run 
set -e
if ! grep -qs /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/configfs /proc/mounts; then mount -t configfs none /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/configfs; fi
# iscsi tcmu-runner container
! /bin/podman rm -f ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu-tcmu 2> /dev/null
! /bin/podman rm -f ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi-iscsipool-ceph-tcmu-iscsi-fotdn1-node4-aicryu-tcmu 2> /dev/null
! /bin/podman rm -f --storage ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi-iscsipool-ceph-tcmu-iscsi-fotdn1-node4-aicryu-tcmu 2> /dev/null
! /bin/podman rm -f --storage ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu-tcmu 2> /dev/null
/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/bin/tcmu-runner --privileged --group-add=disk --init --name ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi-iscsipool-ceph-tcmu-iscsi-fotdn1-node4-aicryu-tcmu --pids-limit=-1 -e CONTAINER_IMAGE=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87 -e NODE_NAME=ceph-tcmu-iscsi-fotdn1-node4 -e CEPH_USE_RANDOM_NONCE=1 -v /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/config:/etc/ceph/ceph.conf:z -v /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/keyring:/etc/ceph/keyring:z -v /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/iscsi-gateway.cfg:/etc/ceph/iscsi-gateway.cfg:z -v /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/configfs:/sys/kernel/config -v /var/log/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286:/var/log:z -v /dev:/dev --mount type=bind,source=/lib/modules,destination=/lib/modules,ro=true registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87 &
# iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu
! /bin/podman rm -f ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu 2> /dev/null
! /bin/podman rm -f ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi-iscsipool-ceph-tcmu-iscsi-fotdn1-node4-aicryu 2> /dev/null
! /bin/podman rm -f --storage ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi-iscsipool-ceph-tcmu-iscsi-fotdn1-node4-aicryu 2> /dev/null
! /bin/podman rm -f --storage ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu 2> /dev/null
/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/bin/rbd-target-api --privileged --group-add=disk --init --name ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi-iscsipool-ceph-tcmu-iscsi-fotdn1-node4-aicryu -d --log-driver journald --conmon-pidfile /run/ceph-82e7533c-c54d-11ec-b555-fa163e2c8286.ceph-tcmu-iscsi-fotdn1-node4.aicryu.service-pid --cidfile /run/ceph-82e7533c-c54d-11ec-b555-fa163e2c8286.ceph-tcmu-iscsi-fotdn1-node4.aicryu.service-cid --pids-limit=-1 --cgroups=split -e CONTAINER_IMAGE=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87 -e NODE_NAME=ceph-tcmu-iscsi-fotdn1-node4 -e CEPH_USE_RANDOM_NONCE=1 -v /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/config:/etc/ceph/ceph.conf:z -v /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/keyring:/etc/ceph/keyring:z -v /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/iscsi-gateway.cfg:/etc/ceph/iscsi-gateway.cfg:z -v /var/lib/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286/iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu/configfs:/sys/kernel/config -v /var/log/ceph/82e7533c-c54d-11ec-b555-fa163e2c8286:/var/log:z -v /dev:/dev --mount type=bind,source=/lib/modules,destination=/lib/modules,ro=true registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87
[root@ceph-tcmu-iscsi-fotdn1-node4 iscsi.iscsipool.ceph-tcmu-iscsi-fotdn1-node4.aicryu]#
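
Both podman run commands above pass --pids-limit=-1. A quick way to confirm what limit actually ended up on the running containers (a hedged sketch; it assumes podman inspect reports the limit under HostConfig.PidsLimit, and uses the tcmu container name created by this unit.run):

[root@ceph-tcmu-iscsi-fotdn1-node4 ~]# podman inspect --format '{{.HostConfig.PidsLimit}}' ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi-iscsipool-ceph-tcmu-iscsi-fotdn1-node4-aicryu-tcmu
[root@ceph-tcmu-iscsi-fotdn1-node4 ~]# podman exec ceph-82e7533c-c54d-11ec-b555-fa163e2c8286-iscsi-iscsipool-ceph-tcmu-iscsi-fotdn1-node4-aicryu-tcmu cat /sys/fs/cgroup/pids/pids.max

A value of 0 or -1 from podman inspect and "max" from pids.max indicates the limit is effectively removed; a finite number (for example 2048) means the old limit is still being applied to that container.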

Comment 59 Preethi 2022-04-26 12:58:37 UTC
Output from the cluster where the issue was seen:



[root@ceph-pnataraj-luc1vd-node4 cephuser]# cd /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/
[root@ceph-pnataraj-luc1vd-node4 iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga]# 
[root@ceph-pnataraj-luc1vd-node4 iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga]# 
[root@ceph-pnataraj-luc1vd-node4 iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga]# 
[root@ceph-pnataraj-luc1vd-node4 iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga]# 
[root@ceph-pnataraj-luc1vd-node4 iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga]# 
[root@ceph-pnataraj-luc1vd-node4 iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga]# cat unit.run 
set -e
if ! grep -qs /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/configfs /proc/mounts; then mount -t configfs none /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/configfs; fi
# iscsi tcmu-runner container
! /bin/podman rm -f ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga-tcmu 2> /dev/null
! /bin/podman rm -f ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi-iscsipool-ceph-pnataraj-luc1vd-node4-levnga-tcmu 2> /dev/null
! /bin/podman rm -f --storage ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi-iscsipool-ceph-pnataraj-luc1vd-node4-levnga-tcmu 2> /dev/null
! /bin/podman rm -f --storage ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga-tcmu 2> /dev/null
/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/bin/tcmu-runner --privileged --group-add=disk --init --name ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi-iscsipool-ceph-pnataraj-luc1vd-node4-levnga-tcmu --pids-limit=-1 -e CONTAINER_IMAGE=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87 -e NODE_NAME=ceph-pnataraj-luc1vd-node4 -e CEPH_USE_RANDOM_NONCE=1 -v /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/config:/etc/ceph/ceph.conf:z -v /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/keyring:/etc/ceph/keyring:z -v /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/iscsi-gateway.cfg:/etc/ceph/iscsi-gateway.cfg:z -v /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/configfs:/sys/kernel/config -v /var/log/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1:/var/log:z -v /dev:/dev --mount type=bind,source=/lib/modules,destination=/lib/modules,ro=true registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87 &
# iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga
! /bin/podman rm -f ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga 2> /dev/null
! /bin/podman rm -f ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi-iscsipool-ceph-pnataraj-luc1vd-node4-levnga 2> /dev/null
! /bin/podman rm -f --storage ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi-iscsipool-ceph-pnataraj-luc1vd-node4-levnga 2> /dev/null
! /bin/podman rm -f --storage ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga 2> /dev/null
/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/bin/rbd-target-api --privileged --group-add=disk --init --name ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1-iscsi-iscsipool-ceph-pnataraj-luc1vd-node4-levnga -d --log-driver journald --conmon-pidfile /run/ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1.ceph-pnataraj-luc1vd-node4.levnga.service-pid --cidfile /run/ceph-a1dc0a22-c4c0-11ec-87ba-fa163e719dd1.ceph-pnataraj-luc1vd-node4.levnga.service-cid --pids-limit=-1 --cgroups=split -e CONTAINER_IMAGE=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87 -e NODE_NAME=ceph-pnataraj-luc1vd-node4 -e CEPH_USE_RANDOM_NONCE=1 -v /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/config:/etc/ceph/ceph.conf:z -v /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/keyring:/etc/ceph/keyring:z -v /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/iscsi-gateway.cfg:/etc/ceph/iscsi-gateway.cfg:z -v /var/lib/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1/iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga/configfs:/sys/kernel/config -v /var/log/ceph/a1dc0a22-c4c0-11ec-87ba-fa163e719dd1:/var/log:z -v /dev:/dev --mount type=bind,source=/lib/modules,destination=/lib/modules,ro=true registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3ed44ebc6fd5d0d39f52b0c62c72d1147643282abfaa53095d103cec54055f87
[root@ceph-pnataraj-luc1vd-node4 iscsi.iscsipool.ceph-pnataraj-luc1vd-node4.levnga]#

Comment 68 Gopi 2022-05-18 04:50:43 UTC
Hi Teoman,

Preethi is working on it and she will update soon.

Thanks,
Gopi

Comment 69 Preethi 2022-05-18 04:51:36 UTC
We do not have the failed setup now. I will reproduce the issue on a fresh cluster and share it once it is done.

Comment 70 Preethi 2022-05-18 15:07:15 UTC
The issue is not reproduced with the latest 5.1z1 build; I was able to create 255 LUNs without any issues. Output below:

http://pastebin.test.redhat.com/1052677
http://pastebin.test.redhat.com/1052502

Comment 71 Preethi 2022-05-18 15:08:57 UTC
The difference between the failed QA setup and the working setup is VMs versus bare metal. The issue was last reproduced on a cluster built on VMs and was not reproduced on one built on bare metal.

Comment 82 Preethi 2022-06-06 12:38:22 UTC
The issue is still seen in the latest 5.2 build. As per the discussions, the LUN test is only applicable when pids.max is set for both the tcmu and iscsi containers.
ceph version 16.2.8-34.el8cp (da41b2a854c731f3062af5ca7b7aca470b0bec29) pacific (stable)

# podman exec -it db95bf5300f1 cat /sys/fs/cgroup/pids/pids.max
23419
# podman exec -it 758f919a4905 cat /sys/fs/cgroup/pids/pids.max
max
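
A small sketch for checking every iscsi/tcmu container on a gateway host in one pass (container names are taken from podman ps, so nothing here is cluster-specific; same cgroup v1 path as above):

# Print the pids limit currently enforced on each iscsi/tcmu container
for c in $(podman ps --format '{{.Names}}' | grep -Ei 'iscsi|tcmu'); do
    printf '%s: ' "$c"
    podman exec "$c" cat /sys/fs/cgroup/pids/pids.max
done

Both containers should report "max" for the fix to be fully in effect.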

Comment 88 Preethi 2022-06-16 09:38:20 UTC
The issue is not observed in the latest 5.2 build. I was able to create 256 LUNs; for verification, both containers below are set to max.
[root@plena001 ubuntu]# podman exec -it b8a0104f4fd9 cat /sys/fs/cgroup/pids/pids.max
max
[root@plena001 ubuntu]# podman exec -it f39915874caf cat /sys/fs/cgroup/pids/pids.max
max

http://pastebin.test.redhat.com/1058710
http://pastebin.test.redhat.com/1058759


[ceph: root@magna021 ceph]# ceph status
  cluster:
    id:     c8ce6d50-c0a1-11ec-a99b-002590fc2a2e
    health: HEALTH_OK
 
  services:
    mon:         5 daemons, quorum magna021,magna022,magna024,magna025,magna026 (age 7h)
    mgr:         magna022.icxgsh(active, since 7h), standbys: magna021.syfuos
    osd:         42 osds: 42 up (since 6h), 42 in (since 8w)
    rbd-mirror:  1 daemon active (1 hosts)
    tcmu-runner: 512 portals active (2 hosts)
 
  data:
    pools:   11 pools, 801 pgs
    objects: 891.65k objects, 3.4 TiB
    usage:   10 TiB used, 28 TiB / 38 TiB avail
    pgs:     801 active+clean
 
  io:
    client:   3.5 MiB/s rd, 44 KiB/s wr, 4.49k op/s rd, 83 op/s wr
 
[ceph: root@magna021 ceph]# ceph versions
{
    "mon": {
        "ceph version 16.2.8-46.el8cp (8300c1ab46e5a5b616a783a729b2248c623a8193) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.8-46.el8cp (8300c1ab46e5a5b616a783a729b2248c623a8193) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.8-46.el8cp (8300c1ab46e5a5b616a783a729b2248c623a8193) pacific (stable)": 42
    },
    "mds": {},
    "rbd-mirror": {
        "ceph version 16.2.8-46.el8cp (8300c1ab46e5a5b616a783a729b2248c623a8193) pacific (stable)": 1
    },
    "tcmu-runner": {
        "ceph version 16.2.8-46.el8cp (8300c1ab46e5a5b616a783a729b2248c623a8193) pacific (stable)": 512
    },
    "overall": {
        "ceph version 16.2.8-46.el8cp (8300c1ab46e5a5b616a783a729b2248c623a8193) pacific (stable)": 562
    }
}
[ceph: root@magna021 ceph]#

Comment 93 errata-xmlrpc 2022-08-09 17:35:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5997

