Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2333295

Summary: Volume to image uploads are unstable after HF for bug #2296989 was applied
Product: Red Hat OpenStack
Reporter: Alex Stupnikov <astupnik>
Component: openstack-glance
Assignee: Cyril Roelandt <cyril>
Status: CLOSED MIGRATED
QA Contact: msava
Severity: high
Docs Contact: Andy Stillman <astillma>
Priority: unspecified
Version: 17.1 (Wallaby)
CC: abhijadh, akekane, athomas, cyril, eglynn, enothen, falim, gcharot, gkadam, jraju, mkatari, rdhasman, riramos, sapaul, udesale
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2025-01-22 16:13:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Target Upstream Version:
Embargoed:

Description Alex Stupnikov 2024-12-19 09:57:23 UTC
Description of problem:
We applied the hotfix (HF) for bug #2296989 in the customer's environment to unblock large volume-to-image uploads and overcome the limitations introduced by local conversion. The HF worked as expected (RBD images are no longer downloaded locally for conversion), but it looks like another issue, previously hidden by bug #2296989, now affects the customer's workflows.

The customer reproduced it by simultaneously creating multiple volumes from the same 50 GB image and then uploading the created volumes to Glance in parallel. The same pattern was reproduced consistently: 2-3 volumes were uploaded in ~20-30 minutes, while the remaining ones were stuck: the uploads continued but stayed slow, and the bandwidth freed up after the successful uploads was not used by the ongoing ones.

The RHOSP architecture makes network performance problems complex to troubleshoot: cinder-volume owns the client side of a TCP connection, and HAProxy terminates that client connection and proxies it to the Glance backend. We involved the network support group to figure out which part of this chain is not working as expected, and it looks like Glance is the source of the issue (will share data and follow-ups privately).

We are looking for help from Glance engineering with this problem: we need to debug the interactions with Ceph and the processing of inbound TCP connections, then draw consistent conclusions from that. This is impossible to do with standard Glance logs and our regular debugging methods.
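As a starting point for deeper Glance-side debugging, verbose logging can be enabled on the controller. The following is a sketch only: on director-based RHOSP 17.1 deployments glance-api runs in a container, the config path and service name below follow the usual TripleO layout but may differ per environment, and persistent changes should go through director overrides rather than direct edits.

```shell
# Sketch, assuming a standard TripleO-deployed RHOSP 17.1 controller.
# Enable debug logging in the container-bound glance-api config, then
# restart the containerized service so the change takes effect.
sudo crudini --set \
  /var/lib/config-data/puppet-generated/glance_api/etc/glance/glance-api.conf \
  DEFAULT debug True
sudo systemctl restart tripleo_glance_api
```

Note that direct edits like this are lost on the next overcloud deploy; they are only suitable for a temporary debugging session.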


Version-Release number of selected component (if applicable): RHOSP 17.1


How reproducible: simultaneously create multiple volumes from a single Glance image, then upload the volumes to Glance in parallel
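The reproduction steps above can be sketched with the standard OpenStack CLI. This is a sketch only: the image name, volume count, and size are illustrative, an authenticated CLI session is assumed, and real timings depend on the environment.

```shell
# Sketch: create several volumes from the same image in parallel,
# then upload all of them back to Glance in parallel.
IMAGE=source-image-50g   # illustrative name; the customer used a 50 GB image
COUNT=5                  # illustrative volume count

# Create volumes concurrently from the same Glance image
for i in $(seq 1 "$COUNT"); do
  openstack volume create --image "$IMAGE" --size 50 "repro-vol-$i" &
done
wait

# Upload all volumes back to Glance concurrently; per the bug, only a
# few uploads complete in ~20-30 minutes and the rest stall.
for i in $(seq 1 "$COUNT"); do
  openstack image create --volume "repro-vol-$i" "repro-img-$i" &
done
wait
```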


Actual results: a few volumes are uploaded successfully; the remaining ones upload very slowly and are essentially stuck


Expected results: all volume uploads complete at a comparable speed; bandwidth freed by completed uploads is used by the remaining ones


Additional info: will be provided privately

Comment 20 Manoj Katari 2025-01-09 05:22:44 UTC
To check from the Ceph end, enable debug logging and log_to_file before starting the operation:

#sudo cephadm shell
#ceph config set global log_to_file true
#ceph tell osd.* config set debug_osd 20/20
#ceph config set mgr mgr/cephadm/log_to_cluster_level debug
#ceph -W cephadm --watch-debug    (this command monitors ceph activity)


Along with the Ceph logs, capture the output of these commands (run from "sudo cephadm shell") in another terminal while the issue is being reproduced:
#ceph -v
#ceph -s
#ceph osd perf
#ceph tell osd.* bench
#ceph osd stat
#ceph osd tree
#rbd ls -l -p images
#rbd ls -l -p volumes
#ceph pg ls-by-pool images
#ceph pg ls-by-pool volumes