Description of problem:
openstack image create stuck in waiting

Version-Release number of selected component (if applicable):
OSP 16.2.4
OpenStack director operator: ver 13.0.17

How reproducible:
100%

Steps to Reproduce:
1. Rerun https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-ospk8s-osp-director-dev-tools-16.2_novacontrol_hci_tlse_subnet/
2. Tempest will fail; jump onto the node:
export KUBECONFIG=/home/rhos-ci/jenkins/workspace/`ls -tr ~rhos-ci/jenkins/workspace/|tail -1`/openstack-k8s/ostest-working/kubeconfig
pushd /home/rhos-ci/jenkins/workspace/`ls -tr ~rhos-ci/jenkins/workspace/|tail -1`/openstack-k8s/logs
oc exec -it -n openstack openstackclient bash
. /etc/bashrc
cd ~
3. openstack image create --disk-format raw --container-format bare --file ./test.img cirros-0.5.2-x86_64-disk.img

Actual results:
sh-4.4$ openstack image list
+--------------------------------------+------------------------------+--------+
| ID                                   | Name                         | Status |
+--------------------------------------+------------------------------+--------+
| 0659fd86-1731-4009-bb44-cd698d382b52 | cirros-0.5.2-x86_64-disk.img | queued |
+--------------------------------------+------------------------------+--------+

Additional info:
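A quick way to confirm where the upload is stuck (a sketch; the glance_api container name assumes standard TripleO naming on the controller):

openstack image show cirros-0.5.2-x86_64-disk.img -c status -f value   # stays 'queued' while the backend write never completes
# on ctlnovastdalone-0:
sudo podman logs --tail 100 glance_api                                  # glance-api side of the stuck upload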
Warnings on the controller for ceph:

Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 22:47:11.609687 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: :47:46.810945 osd.3 (osd.3) 2319458 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 02:47:11.692233 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810950 osd.3 (
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: osd.3) 2319459 : cluster [WRN] slow request osd_op(client.15280.0:1 2.3 2.aa864923 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 06:31:31.262317 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: 2023-08-15 06:47:46.810955 osd.3 (osd.3) 2319460 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 15:02:11.448902 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810960
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: osd.3 (osd.3) 2319461 : cluster [WRN] slow request osd_op(client.10126.0:1 3.11 3.d66454f1 (undecoded) ondisk+read+known_if_redirected e28) initiated 2023-08-14 23:01:03.609548 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810965 osd.3 (osd
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: .3) 2319462 : cluster [WRN] slow request osd_op(client.10126.0:1 3.11 3.d66454f1 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 03:01:03.706748 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: 2023-08-15 06:47:46.810970 osd.3 (osd.3) 2319463 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 06:32:11.768077 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810975
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: osd.3 (osd.3) 2319464 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 15:17:11.453000 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810980 osd.3
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: (osd.3) 2319465 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 23:02:11.613658 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810985 osd.3
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: (osd.3) 2319466 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 03:02:11.695999 currently delayed
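For reference, a minimal sketch of how these delayed requests can be inspected (assuming the ceph CLI and the osd.3 admin socket are reachable from the controller's ceph containers, as in a standard TripleO deployment):

ceph health detail                     # lists the OSDs currently reporting slow requests
ceph daemon osd.3 dump_ops_in_flight   # dumps the delayed client ops still queued on osd.3
ceph daemon osd.3 dump_historic_ops    # recently completed ops with per-stage timings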
sos reports and must-gathers are all at: http://file.tlv.redhat.com/~pkomarov/sos_reports_2232067/
The Ceph OSD nodes cannot talk to each other on the cluster_network (AKA storage management) [0]. Established connections are only to themselves, while connections to the peer nodes are stuck in SYN_SENT [1]. Because the Ceph OSD nodes cannot write data to each other, Ceph itself cannot write any data [2]. Their respective IPs are on different routed subnets [3]. I was told that the same scenario worked in the 16.2_hci job, where both computes are on the same network. Please update the configuration of this job so that these nodes can communicate with each other across the cluster network.

[0]
Node1
[root@computehcileaf1-0 ~]# ip a | grep 172.19
    inet 172.19.1.10/24 brd 172.19.1.255 scope global vlan41
[root@computehcileaf1-0 ~]# time telnet 172.19.2.10 6800
Trying 172.19.2.10...
^C
real    0m6.667s
user    0m0.001s
sys     0m0.003s
[root@computehcileaf1-0 ~]#

Node2
[root@computehcileaf2-0 ~]# ip a | grep 172.19
    inet 172.19.2.10/24 brd 172.19.2.255 scope global vlan42
[root@computehcileaf2-0 ~]# time telnet 172.19.1.10 6800
Trying 172.19.1.10...
^C
real    0m4.443s
user    0m0.001s
sys     0m0.002s
[root@computehcileaf2-0 ~]#

[1]
[root@computehcileaf2-0 ~]# netstat -an | grep 680 | grep 19 | grep SYN
tcp        0      1 172.22.0.103:48214      172.19.1.10:6806        SYN_SENT
tcp        0      1 172.22.0.103:38794      172.19.1.10:6800        SYN_SENT
tcp        0      1 172.22.0.103:44868      172.19.1.10:6804        SYN_SENT
tcp        0      1 172.22.0.103:38788      172.19.1.10:6800        SYN_SENT
tcp        0      1 172.22.0.103:48230      172.19.1.10:6806        SYN_SENT
tcp        0      1 172.22.0.103:51194      172.19.1.10:6802        SYN_SENT
tcp        0      1 172.22.0.103:44866      172.19.1.10:6804        SYN_SENT
tcp        0      1 172.22.0.103:51174      172.19.1.10:6802        SYN_SENT
[root@computehcileaf2-0 ~]# netstat -an | grep 680 | grep 19 | grep ESTA
tcp        0      0 172.19.2.10:6800        172.19.2.10:56646       ESTABLISHED
tcp        0      0 172.19.2.10:6802        172.19.2.10:51284       ESTABLISHED
tcp        0      0 172.19.2.10:6806        172.19.2.10:39474       ESTABLISHED
tcp        0      0 172.19.2.10:39474       172.19.2.10:6806        ESTABLISHED
tcp        0      0 172.19.2.10:56646       172.19.2.10:6800        ESTABLISHED
tcp        0      0 172.19.2.10:51284       172.19.2.10:6802        ESTABLISHED
[root@computehcileaf2-0 ~]#

[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep ESTAB
tcp        0      0 172.19.1.10:60544       172.19.1.10:6800        ESTABLISHED
tcp        0      0 172.19.1.10:54374       172.19.1.10:6806        ESTABLISHED
tcp        0      0 172.19.1.10:51098       172.19.1.10:6802        ESTABLISHED
tcp        0      0 172.19.1.10:6800        172.19.1.10:60544       ESTABLISHED
tcp        0      0 172.19.1.10:6802        172.19.1.10:51098       ESTABLISHED
tcp        0      0 172.19.1.10:6806        172.19.1.10:54374       ESTABLISHED
[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep SYNC
[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep SYN
tcp        0      1 172.22.0.102:51016      172.19.2.10:6804        SYN_SENT
tcp        0      1 172.22.0.102:38066      172.19.2.10:6800        SYN_SENT
tcp        0      1 172.22.0.102:58650      172.19.2.10:6802        SYN_SENT
tcp        0      1 172.22.0.102:38070      172.19.2.10:6800        SYN_SENT
tcp        0      1 172.22.0.102:51012      172.19.2.10:6804        SYN_SENT
tcp        0      1 172.22.0.102:58654      172.19.2.10:6802        SYN_SENT
tcp        0      1 172.22.0.102:57534      172.19.2.10:6806        SYN_SENT
tcp        0      1 172.22.0.102:57530      172.19.2.10:6806        SYN_SENT
[root@computehcileaf1-0 ~]#

[2]
[root@ctlnovastdalone-0 /]# ceph osd pool create new_pool 32 32 replicated replicated_rule_osd
pool 'new_pool' created
[root@ctlnovastdalone-0 /]# time rbd create --size 1024 new_pool/foo
^C
real    0m13.843s
user    0m0.052s
sys     0m0.033s
[root@ctlnovastdalone-0 /]#

[3]
[root@computehcileaf1-0 ~]# grep network /etc/ceph/ceph.conf
cluster network = 172.19.0.0/24,172.19.0.0/24,172.19.1.0/24,172.19.2.0/24
public network = 172.18.0.0/24,172.18.0.0/24,172.18.1.0/24,172.18.2.0/24
[root@computehcileaf1-0 ~]#
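For reference, a minimal sketch of the kind of route check/workaround that applies here (per the resolution below, routes were added manually); the gateway 172.19.1.1 is an assumed value for illustration, not taken from this job:

# from computehcileaf1-0
ip route get 172.19.2.10                               # shows which next hop/interface (if any) is used to reach the peer leaf
ip route add 172.19.2.0/24 via 172.19.1.1 dev vlan41   # assumed leaf gateway; the real value comes from the leaf1 subnet definition
timeout 5 telnet 172.19.2.10 6800                      # should now connect instead of hanging in SYN_SENT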
This was a configuration issue with the job. After Martin added the routes manually, the Ceph status improved. I then re-ran the ansible playbook, which triggered tempest, and saw that images were created [1].

The following PR should fix the job:
https://github.com/openstack-k8s-operators/osp-director-dev-tools/pull/508

[1]
sh-4.4$ openstack image list
+--------------------------------------+----------------------------------+--------+
| ID                                   | Name                             | Status |
+--------------------------------------+----------------------------------+--------+
| b3ff782b-2e8c-4df2-a071-1ccbd7da0062 | cirros-0.5.3-x86_64-disk.img     | active |
| 52bfd6ac-4e53-47f3-9e32-1fb123942424 | cirros-0.5.3-x86_64-disk.img_alt | active |
+--------------------------------------+----------------------------------+--------+
sh-4.4$
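As a quick post-fix verification (a sketch using only commands already shown above; the image check runs from the openstackclient pod, the ceph check from the controller as in [2]):

openstack image show cirros-0.5.3-x86_64-disk.img -c status -f value   # expected output: active
ceph -s                                                                 # should report no "slow request" warnings once the cluster-network routes are in place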