Bug 2232067
| Summary: | ceph backed image create stuck in waiting : [WRN] slow request osd_op | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | pkomarov |
| Component: | ceph | Assignee: | Giulio Fidente <gfidente> |
| Status: | CLOSED NOTABUG | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 16.2 (Train) | CC: | jdurgin, johfulto, lhh, mhicks, mschuppe |
| Target Milestone: | --- | Keywords: | AutomationBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-15 16:29:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
pkomarov
2023-08-15 07:05:55 UTC
Warnings on the controller for ceph:

Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 22:47:11.609687 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810945 osd.3 (osd.3) 2319458 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 02:47:11.692233 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810950 osd.3 (osd.3) 2319459 : cluster [WRN] slow request osd_op(client.15280.0:1 2.3 2.aa864923 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 06:31:31.262317 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810955 osd.3 (osd.3) 2319460 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 15:02:11.448902 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810960 osd.3 (osd.3) 2319461 : cluster [WRN] slow request osd_op(client.10126.0:1 3.11 3.d66454f1 (undecoded) ondisk+read+known_if_redirected e28) initiated 2023-08-14 23:01:03.609548 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810965 osd.3 (osd.3) 2319462 : cluster [WRN] slow request osd_op(client.10126.0:1 3.11 3.d66454f1 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 03:01:03.706748 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810970 osd.3 (osd.3) 2319463 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 06:32:11.768077 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810975 osd.3 (osd.3) 2319464 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 15:17:11.453000 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810980 osd.3 (osd.3) 2319465 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 23:02:11.613658 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810985 osd.3 (osd.3) 2319466 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 03:02:11.695999 currently delayed

sos reports and must-gathers are all at: http://file.tlv.redhat.com/~pkomarov/sos_reports_2232067/

The Ceph OSD nodes cannot talk to each other on the cluster_network (AKA storage management) [0].
Established connections exist only between each node's own OSDs, while connections to the peer node are stuck in SYN_SENT [1]. Note that the stuck connections originate from 172.22.0.x source addresses rather than the 172.19.x cluster-network VLANs, which suggests traffic to the peer subnet is leaving via the default route.
Because the Ceph OSD nodes cannot write data to each other, Ceph itself cannot write any data [2].
Their cluster-network IPs are on different, routed subnets (172.19.1.0/24 and 172.19.2.0/24). I was told that the same job worked in its 16.2_hci variant, where both computes are on the same network.
Please update the configuration of this job to ensure that these nodes can communicate with each other across the cluster network.
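For reference, the interim workaround (adding the routes manually, as described in the closing comment) would look roughly like the sketch below; the gateway placeholders are assumptions, since the actual router addresses are not recorded in this bug:

# On computehcileaf1-0 (172.19.1.10/24 on vlan41): route to leaf2's cluster subnet.
# <leaf1-gw> is the assumed router address on 172.19.1.0/24.
ip route add 172.19.2.0/24 via <leaf1-gw> dev vlan41

# On computehcileaf2-0 (172.19.2.10/24 on vlan42): the mirror route back to leaf1.
ip route add 172.19.1.0/24 via <leaf2-gw> dev vlan42

With routes like these in place, the SYN_SENT connections in [1] should complete, since the kernel would then source the traffic from the VLAN interfaces instead of the 172.22.0.x addresses.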
[0]
Node1
[root@computehcileaf1-0 ~]# ip a | grep 172.19
inet 172.19.1.10/24 brd 172.19.1.255 scope global vlan41
[root@computehcileaf1-0 ~]# time telnet 172.19.2.10 6800
Trying 172.19.2.10...
^C
real 0m6.667s
user 0m0.001s
sys 0m0.003s
[root@computehcileaf1-0 ~]#
Node2
[root@computehcileaf2-0 ~]# ip a | grep 172.19
inet 172.19.2.10/24 brd 172.19.2.255 scope global vlan42
[root@computehcileaf2-0 ~]# time telnet 172.19.1.10 6800
Trying 172.19.1.10...
^C
real 0m4.443s
user 0m0.001s
sys 0m0.002s
[root@computehcileaf2-0 ~]#
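A quick way to see why the telnet attempts above hang is to ask the kernel which route it would pick for the peer address; if no route to the peer's cluster subnet is configured, the lookup falls through to the default route. This is a suggested check, not output captured from the environment:

# On computehcileaf1-0: show the route the kernel selects for leaf2's cluster IP.
# If the reply names the default gateway and a 172.22.0.x source address
# instead of dev vlan41, the cluster-network route is missing.
ip route get 172.19.2.10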
[1]
[root@computehcileaf2-0 ~]# netstat -an | grep 680 | grep 19 | grep SYN
tcp 0 1 172.22.0.103:48214 172.19.1.10:6806 SYN_SENT
tcp 0 1 172.22.0.103:38794 172.19.1.10:6800 SYN_SENT
tcp 0 1 172.22.0.103:44868 172.19.1.10:6804 SYN_SENT
tcp 0 1 172.22.0.103:38788 172.19.1.10:6800 SYN_SENT
tcp 0 1 172.22.0.103:48230 172.19.1.10:6806 SYN_SENT
tcp 0 1 172.22.0.103:51194 172.19.1.10:6802 SYN_SENT
tcp 0 1 172.22.0.103:44866 172.19.1.10:6804 SYN_SENT
tcp 0 1 172.22.0.103:51174 172.19.1.10:6802 SYN_SENT
[root@computehcileaf2-0 ~]# netstat -an | grep 680 | grep 19 | grep ESTA
tcp 0 0 172.19.2.10:6800 172.19.2.10:56646 ESTABLISHED
tcp 0 0 172.19.2.10:6802 172.19.2.10:51284 ESTABLISHED
tcp 0 0 172.19.2.10:6806 172.19.2.10:39474 ESTABLISHED
tcp 0 0 172.19.2.10:39474 172.19.2.10:6806 ESTABLISHED
tcp 0 0 172.19.2.10:56646 172.19.2.10:6800 ESTABLISHED
tcp 0 0 172.19.2.10:51284 172.19.2.10:6802 ESTABLISHED
[root@computehcileaf2-0 ~]#
[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep ESTAB
tcp 0 0 172.19.1.10:60544 172.19.1.10:6800 ESTABLISHED
tcp 0 0 172.19.1.10:54374 172.19.1.10:6806 ESTABLISHED
tcp 0 0 172.19.1.10:51098 172.19.1.10:6802 ESTABLISHED
tcp 0 0 172.19.1.10:6800 172.19.1.10:60544 ESTABLISHED
tcp 0 0 172.19.1.10:6802 172.19.1.10:51098 ESTABLISHED
tcp 0 0 172.19.1.10:6806 172.19.1.10:54374 ESTABLISHED
[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep SYNC
[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep SYN
tcp 0 1 172.22.0.102:51016 172.19.2.10:6804 SYN_SENT
tcp 0 1 172.22.0.102:38066 172.19.2.10:6800 SYN_SENT
tcp 0 1 172.22.0.102:58650 172.19.2.10:6802 SYN_SENT
tcp 0 1 172.22.0.102:38070 172.19.2.10:6800 SYN_SENT
tcp 0 1 172.22.0.102:51012 172.19.2.10:6804 SYN_SENT
tcp 0 1 172.22.0.102:58654 172.19.2.10:6802 SYN_SENT
tcp 0 1 172.22.0.102:57534 172.19.2.10:6806 SYN_SENT
tcp 0 1 172.22.0.102:57530 172.19.2.10:6806 SYN_SENT
[root@computehcileaf1-0 ~]#
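To confirm on the wire that the SYNs leave through the wrong interface and never get answered, a capture along these lines could be run on either node (again a suggested check, not one captured here):

# Watch SYNs to/from the cluster subnets on all interfaces; seeing them
# leave with a 172.22.0.x source and no SYN-ACKs coming back matches the
# SYN_SENT sockets above.
tcpdump -ni any 'tcp[tcpflags] & tcp-syn != 0 and net 172.19.0.0/16'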
[2]
[root@ctlnovastdalone-0 /]# ceph osd pool create new_pool 32 32 replicated replicated_rule_osd
pool 'new_pool' created
[root@ctlnovastdalone-0 /]# time rbd create --size 1024 new_pool/foo
^C
real 0m13.843s
user 0m0.052s
sys 0m0.033s
[root@ctlnovastdalone-0 /]#
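The rbd create above had to be interrupted by hand; when reproducing, it may be more convenient to bound the hang and then inspect the blocked requests directly (suggested commands, not output from this run):

# Bound the hang instead of Ctrl-C'ing it:
timeout 30 rbd create --size 1024 new_pool/foo

# List the slow/blocked requests keeping the op from completing:
ceph health detail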
[3]
[root@computehcileaf1-0 ~]# grep network /etc/ceph/ceph.conf
cluster network = 172.19.0.0/24,172.19.0.0/24,172.19.1.0/24,172.19.2.0/24
public network = 172.18.0.0/24,172.18.0.0/24,172.18.1.0/24,172.18.2.0/24
[root@computehcileaf1-0 ~]#
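Two things stand out in [3]: the 172.19.0.0/24 and 172.18.0.0/24 entries are duplicated (which looks like a templating artifact and should be harmless), and nothing in ceph.conf itself provides routing between the leaf subnets, so listing multiple subnets is only valid if the underlying network actually routes between them. A deduplicated sketch of the same settings:

# Same subnets as above with the duplicated first entries removed; routing
# between the 172.19.x leaf subnets must still come from the network, not
# from this file.
cluster network = 172.19.0.0/24,172.19.1.0/24,172.19.2.0/24
public network = 172.18.0.0/24,172.18.1.0/24,172.18.2.0/24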
This was a configuration issue with the job. After Martin added the routes manually, ceph status improved. I then re-ran the ansible playbook, which triggered tempest, and saw that images were created [1]. The following PR should fix the job:

https://github.com/openstack-k8s-operators/osp-director-dev-tools/pull/508

[1]
sh-4.4$ openstack image list
+--------------------------------------+----------------------------------+--------+
| ID                                   | Name                             | Status |
+--------------------------------------+----------------------------------+--------+
| b3ff782b-2e8c-4df2-a071-1ccbd7da0062 | cirros-0.5.3-x86_64-disk.img     | active |
| 52bfd6ac-4e53-47f3-9e32-1fb123942424 | cirros-0.5.3-x86_64-disk.img_alt | active |
+--------------------------------------+----------------------------------+--------+
sh-4.4$