Bug 2232067
| Summary: | ceph backed image create stuck in waiting : [WRN] slow request osd_op | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | pkomarov |
| Component: | ceph | Assignee: | Giulio Fidente <gfidente> |
| Status: | CLOSED NOTABUG | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 16.2 (Train) | CC: | jdurgin, johfulto, lhh, mhicks, mschuppe |
| Target Milestone: | --- | Keywords: | AutomationBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-15 16:29:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
pkomarov
2023-08-15 07:05:55 UTC
Warnings on the controller for ceph:

Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 22:47:11.609687 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810945 osd.3 (osd.3) 2319458 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 02:47:11.692233 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810950 osd.3 (osd.3) 2319459 : cluster [WRN] slow request osd_op(client.15280.0:1 2.3 2.aa864923 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 06:31:31.262317 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810955 osd.3 (osd.3) 2319460 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 15:02:11.448902 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810960 osd.3 (osd.3) 2319461 : cluster [WRN] slow request osd_op(client.10126.0:1 3.11 3.d66454f1 (undecoded) ondisk+read+known_if_redirected e28) initiated 2023-08-14 23:01:03.609548 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810965 osd.3 (osd.3) 2319462 : cluster [WRN] slow request osd_op(client.10126.0:1 3.11 3.d66454f1 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 03:01:03.706748 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810970 osd.3 (osd.3) 2319463 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 06:32:11.768077 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810975 osd.3 (osd.3) 2319464 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 15:17:11.453000 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810980 osd.3 (osd.3) 2319465 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-14 23:02:11.613658 currently delayed
Aug 15 06:47:56 ctlnovastdalone-0 conmon[64370]: cluster 2023-08-15 06:47:46.810985 osd.3 (osd.3) 2319466 : cluster [WRN] slow request osd_op(client.4148.0:1 1.12 1.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e28) initiated 2023-08-15 03:02:11.695999 currently delayed

sos reports and must-gathers are all at: http://file.tlv.redhat.com/~pkomarov/sos_reports_2232067/

The Ceph OSD nodes cannot talk to each other on the cluster_network (AKA storage management) [0].
Established connections exist only between each node's own OSDs, while connections to the peer node are stuck in SYN_SENT [1]. Note that the stuck connections originate from 172.22.0.x source addresses rather than the 172.19.x cluster-network VLANs, which suggests traffic to the peer subnet is leaving via the default route.
Because the Ceph OSD nodes cannot write data to each other, Ceph itself cannot write any data [2].
Their cluster-network IPs are on different, routed subnets (172.19.1.0/24 and 172.19.2.0/24). I was told that the same job worked in its 16.2_hci variant, where both computes are on the same network.
Please update the configuration of this job to ensure that these nodes can communicate with each other across the cluster network.
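For reference, the interim workaround (adding the routes manually, as described in the closing comment) would look roughly like the sketch below; the gateway placeholders are assumptions, since the actual router addresses are not recorded in this bug:

# On computehcileaf1-0 (172.19.1.10/24 on vlan41): route to leaf2's cluster subnet.
# <leaf1-gw> is the assumed router address on 172.19.1.0/24.
ip route add 172.19.2.0/24 via <leaf1-gw> dev vlan41

# On computehcileaf2-0 (172.19.2.10/24 on vlan42): the mirror route back to leaf1.
ip route add 172.19.1.0/24 via <leaf2-gw> dev vlan42

With routes like these in place, the SYN_SENT connections in [1] should complete, since the kernel would then source the traffic from the VLAN interfaces instead of the 172.22.0.x addresses.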
[0]
Node1
[root@computehcileaf1-0 ~]# ip a | grep 172.19
inet 172.19.1.10/24 brd 172.19.1.255 scope global vlan41
[root@computehcileaf1-0 ~]# time telnet 172.19.2.10 6800
Trying 172.19.2.10...
^C
real 0m6.667s
user 0m0.001s
sys 0m0.003s
[root@computehcileaf1-0 ~]#
Node2
[root@computehcileaf2-0 ~]# ip a | grep 172.19
inet 172.19.2.10/24 brd 172.19.2.255 scope global vlan42
[root@computehcileaf2-0 ~]# time telnet 172.19.1.10 6800
Trying 172.19.1.10...
^C
real 0m4.443s
user 0m0.001s
sys 0m0.002s
[root@computehcileaf2-0 ~]#
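A quick way to see why the telnet attempts above hang is to ask the kernel which route it would pick for the peer address; if no route to the peer's cluster subnet is configured, the lookup falls through to the default route. This is a suggested check, not output captured from the environment:

# On computehcileaf1-0: show the route the kernel selects for leaf2's cluster IP.
# If the reply names the default gateway and a 172.22.0.x source address
# instead of dev vlan41, the cluster-network route is missing.
ip route get 172.19.2.10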
[1]
[root@computehcileaf2-0 ~]# netstat -an | grep 680 | grep 19 | grep SYN
tcp 0 1 172.22.0.103:48214 172.19.1.10:6806 SYN_SENT
tcp 0 1 172.22.0.103:38794 172.19.1.10:6800 SYN_SENT
tcp 0 1 172.22.0.103:44868 172.19.1.10:6804 SYN_SENT
tcp 0 1 172.22.0.103:38788 172.19.1.10:6800 SYN_SENT
tcp 0 1 172.22.0.103:48230 172.19.1.10:6806 SYN_SENT
tcp 0 1 172.22.0.103:51194 172.19.1.10:6802 SYN_SENT
tcp 0 1 172.22.0.103:44866 172.19.1.10:6804 SYN_SENT
tcp 0 1 172.22.0.103:51174 172.19.1.10:6802 SYN_SENT
[root@computehcileaf2-0 ~]# netstat -an | grep 680 | grep 19 | grep ESTA
tcp 0 0 172.19.2.10:6800 172.19.2.10:56646 ESTABLISHED
tcp 0 0 172.19.2.10:6802 172.19.2.10:51284 ESTABLISHED
tcp 0 0 172.19.2.10:6806 172.19.2.10:39474 ESTABLISHED
tcp 0 0 172.19.2.10:39474 172.19.2.10:6806 ESTABLISHED
tcp 0 0 172.19.2.10:56646 172.19.2.10:6800 ESTABLISHED
tcp 0 0 172.19.2.10:51284 172.19.2.10:6802 ESTABLISHED
[root@computehcileaf2-0 ~]#
[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep ESTAB
tcp 0 0 172.19.1.10:60544 172.19.1.10:6800 ESTABLISHED
tcp 0 0 172.19.1.10:54374 172.19.1.10:6806 ESTABLISHED
tcp 0 0 172.19.1.10:51098 172.19.1.10:6802 ESTABLISHED
tcp 0 0 172.19.1.10:6800 172.19.1.10:60544 ESTABLISHED
tcp 0 0 172.19.1.10:6802 172.19.1.10:51098 ESTABLISHED
tcp 0 0 172.19.1.10:6806 172.19.1.10:54374 ESTABLISHED
[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep SYNC
[root@computehcileaf1-0 ~]# netstat -an | grep 680 | grep 19 | grep SYN
tcp 0 1 172.22.0.102:51016 172.19.2.10:6804 SYN_SENT
tcp 0 1 172.22.0.102:38066 172.19.2.10:6800 SYN_SENT
tcp 0 1 172.22.0.102:58650 172.19.2.10:6802 SYN_SENT
tcp 0 1 172.22.0.102:38070 172.19.2.10:6800 SYN_SENT
tcp 0 1 172.22.0.102:51012 172.19.2.10:6804 SYN_SENT
tcp 0 1 172.22.0.102:58654 172.19.2.10:6802 SYN_SENT
tcp 0 1 172.22.0.102:57534 172.19.2.10:6806 SYN_SENT
tcp 0 1 172.22.0.102:57530 172.19.2.10:6806 SYN_SENT
[root@computehcileaf1-0 ~]#
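To confirm on the wire that the SYNs leave through the wrong interface and never get answered, a capture along these lines could be run on either node (again a suggested check, not one captured here):

# Watch SYNs to/from the cluster subnets on all interfaces; seeing them
# leave with a 172.22.0.x source and no SYN-ACKs coming back matches the
# SYN_SENT sockets above.
tcpdump -ni any 'tcp[tcpflags] & tcp-syn != 0 and net 172.19.0.0/16'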
[2]
[root@ctlnovastdalone-0 /]# ceph osd pool create new_pool 32 32 replicated replicated_rule_osd
pool 'new_pool' created
[root@ctlnovastdalone-0 /]# time rbd create --size 1024 new_pool/foo
^C
real 0m13.843s
user 0m0.052s
sys 0m0.033s
[root@ctlnovastdalone-0 /]#
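The rbd create above had to be interrupted by hand; when reproducing, it may be more convenient to bound the hang and then inspect the blocked requests directly (suggested commands, not output from this run):

# Bound the hang instead of Ctrl-C'ing it:
timeout 30 rbd create --size 1024 new_pool/foo

# List the slow/blocked requests keeping the op from completing:
ceph health detail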
[3]
[root@computehcileaf1-0 ~]# grep network /etc/ceph/ceph.conf
cluster network = 172.19.0.0/24,172.19.0.0/24,172.19.1.0/24,172.19.2.0/24
public network = 172.18.0.0/24,172.18.0.0/24,172.18.1.0/24,172.18.2.0/24
[root@computehcileaf1-0 ~]#
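Two things stand out in [3]: the 172.19.0.0/24 and 172.18.0.0/24 entries are duplicated (which looks like a templating artifact and should be harmless), and nothing in ceph.conf itself provides routing between the leaf subnets, so listing multiple subnets is only valid if the underlying network actually routes between them. A deduplicated sketch of the same settings:

# Same subnets as above with the duplicated first entries removed; routing
# between the 172.19.x leaf subnets must still come from the network, not
# from this file.
cluster network = 172.19.0.0/24,172.19.1.0/24,172.19.2.0/24
public network = 172.18.0.0/24,172.18.1.0/24,172.18.2.0/24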
This was a configuration issue with the job. After Martin added the routes manually, ceph status improved. I then re-ran the ansible playbook, which triggered tempest, and saw that images were created [1]. The following PR should fix the job:

https://github.com/openstack-k8s-operators/osp-director-dev-tools/pull/508

[1]
sh-4.4$ openstack image list
+--------------------------------------+----------------------------------+--------+
| ID                                   | Name                             | Status |
+--------------------------------------+----------------------------------+--------+
| b3ff782b-2e8c-4df2-a071-1ccbd7da0062 | cirros-0.5.3-x86_64-disk.img     | active |
| 52bfd6ac-4e53-47f3-9e32-1fb123942424 | cirros-0.5.3-x86_64-disk.img_alt | active |
+--------------------------------------+----------------------------------+--------+
sh-4.4$