Description of problem:
This happens when the source volume is big enough (more than 500GB). With smaller volumes (~100GB) there is no problem. I will post the full error and the logs in a new private comment, since they contain customer-sensitive information.

Version-Release number of selected component (if applicable):
openstack-cinder-12.0.6-3.el7ost.noarch              Thu Jun 13 11:05:12 2019
openstack-nova-api-17.0.9-9.el7ost.noarch            Thu Jun 13 11:05:36 2019
openstack-nova-common-17.0.9-9.el7ost.noarch         Thu Jun 13 11:03:27 2019
openstack-nova-compute-17.0.9-9.el7ost.noarch        Thu Jun 13 11:05:07 2019
openstack-nova-conductor-17.0.9-9.el7ost.noarch      Thu Jun 13 11:05:36 2019
openstack-nova-console-17.0.9-9.el7ost.noarch        Thu Jun 13 11:05:36 2019
openstack-nova-migration-17.0.9-9.el7ost.noarch      Thu Jun 13 11:05:07 2019
openstack-nova-novncproxy-17.0.9-9.el7ost.noarch     Thu Jun 13 11:05:36 2019
openstack-nova-placement-api-17.0.9-9.el7ost.noarch  Thu Jun 13 11:05:35 2019
openstack-nova-scheduler-17.0.9-9.el7ost.noarch      Thu Jun 13 11:05:36 2019
puppet-cinder-12.4.1-4.el7ost.noarch                 Thu Jun 13 11:02:56 2019
puppet-nova-12.4.0-17.el7ost.noarch                  Thu Jun 13 11:02:56 2019
python2-cinderclient-3.5.0-1.el7ost.noarch           Tue Dec 11 17:34:09 2018
python2-novaclient-10.1.0-1.el7ost.noarch            Tue Dec 11 17:34:09 2018
python-cinder-12.0.6-3.el7ost.noarch                 Thu Jun 13 11:03:41 2019
python-nova-17.0.9-9.el7ost.noarch                   Thu Jun 13 11:03:27 2019

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This seems to be similar to this bug[1] that was opened for OSP9. Both show a similar error:
~~~
2019-10-21 14:25:28.790 76 ERROR cinder.volume.manager ProcessExecutionError: Unexpected error while running command.
2019-10-21 14:25:28.790 76 ERROR cinder.volume.manager Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=8 -- env LC_ALL=C qemu-img info /var/lib/cinder/mnt/[OMITTED]
2019-10-21 14:25:28.790 76 ERROR cinder.volume.manager Exit code: -9
~~~
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1402594
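For context on the trace above: oslo_concurrency reports "Exit code: -9", which follows Python's subprocess convention of reporting a child killed by a signal as a negative return code. -9 therefore means the qemu-img process received SIGKILL rather than exiting on its own, which is consistent with the prlimit wrapper's resource caps being hit. A minimal sketch of that return-code convention:

```python
import signal
import subprocess

# Python's subprocess module reports a child terminated by a signal as
# a negative return code, -<signum>. oslo_concurrency surfaces the same
# value in ProcessExecutionError, so "Exit code: -9" means the wrapped
# qemu-img process was killed with SIGKILL (signal 9) instead of
# exiting normally.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()
print(proc.returncode)  # -9
```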
Hi, I was checking the code for the customer's version[1], and it already has the workaround that increases cpu_time to 30[2]. However, the command that raises the issue has the cpu limit set to 8[3]. Is this an issue? It seems that it should reflect the cpu_time set in the code, but that is not the case.

[1]
~~~
$ grep -ir nova pollux-tds-controller-1/sos_commands/rpm/sh_-c_rpm_--nodigest_-qa_--qf_NAME_-_VERSION_-_RELEASE_._ARCH_INSTALLTIME_date_awk_-F_printf_-59s_s_n_1_2_sort_-V
openstack-nova-api-17.0.9-9.el7ost.noarch
openstack-nova-common-17.0.9-9.el7ost.noarch
openstack-nova-compute-17.0.9-9.el7ost.noarch
openstack-nova-conductor-17.0.9-9.el7ost.noarch
openstack-nova-console-17.0.9-9.el7ost.noarch
openstack-nova-migration-17.0.9-9.el7ost.noarch
openstack-nova-novncproxy-17.0.9-9.el7ost.noarch
openstack-nova-placement-api-17.0.9-9.el7ost.noarch
openstack-nova-scheduler-17.0.9-9.el7ost.noarch
~~~

[2]
~~~
QEMU_IMG_LIMITS = processutils.ProcessLimits(
    cpu_time=30,
    address_space=1 * units.Gi)
~~~

[3]
~~~
2019-10-21 14:25:28.790 76 ERROR cinder.volume.manager Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=8 -- env LC_ALL=C qemu-img info /var/lib/cinder/mnt/[OMITTED]
~~~
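To make the relationship between the code constant and the logged command line concrete, here is a hedged mock-up (not the actual oslo_concurrency implementation) of how ProcessLimits-style values map onto the prlimit wrapper flags seen in [3]: cpu_time becomes --cpu (seconds of CPU time) and address_space becomes --as (bytes of virtual memory). The --cpu=8 in the log therefore means the limits actually applied came from a constant with cpu_time=8, not the cpu_time=30 constant shown in [2].

```python
# Illustrative mock-up only; the real class lives in
# oslo_concurrency.processutils and is invoked via oslo_concurrency.prlimit.
class ProcessLimits(object):
    """Resource caps applied to a child process."""

    def __init__(self, cpu_time=None, address_space=None):
        self.cpu_time = cpu_time            # seconds of CPU time
        self.address_space = address_space  # bytes of virtual memory

    def prlimit_args(self):
        """Build flags in the shape the prlimit wrapper is invoked with."""
        args = []
        if self.address_space is not None:
            args.append("--as=%d" % self.address_space)
        if self.cpu_time is not None:
            args.append("--cpu=%d" % self.cpu_time)
        return args


Gi = 1024 ** 3
# Reproduces the flags from the failing command in the log.
limits = ProcessLimits(cpu_time=8, address_space=1 * Gi)
print(" ".join(limits.prlimit_args()))  # --as=1073741824 --cpu=8
```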
Are you looking at the code in nova or cinder? Both have QEMU_IMG_LIMITS that are applied for these calls.
I was checking this in the nova code; do we have it in both places? If so, where is the nova variable being used? I want to increase this value so the customer can try it. How should we proceed?
(In reply to Andre from comment #3)
> I was checking this on nova code, do we have it in both places? If so, where
> the nova variable is being used?
> I wanna try to increase this value so customer can try it, how should we
> proceed?

The error shown in the description and comment #1 came from the cinder volume manager, so the nova code is not relevant here. The nova code also already has higher limits than cinder does. Changing it in cinder and restarting cinder-volume should help. If this works, we can work toward getting this patch into the relevant branches: https://review.opendev.org/#/c/691901/
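The comparison above can be summarized as data (values taken from this bug's log and comments; treat them as a snapshot of these releases, not as current code):

```python
Gi = 1024 ** 3

# Snapshot of the qemu-img wrapper limits discussed in this bug:
# nova's QEMU_IMG_LIMITS (reference [2] in comment #1) allows 30s of
# CPU time, while the failing cinder command line showed --cpu=8.
nova_limits = {"cpu_time": 30, "address_space": 1 * Gi}
cinder_limits = {"cpu_time": 8, "address_space": 1 * Gi}

# The failing log line came from cinder.volume.manager, so cinder's
# lower cap is what killed qemu-img; editing nova's constant cannot
# help this code path.
print(cinder_limits["cpu_time"] < nova_limits["cpu_time"])  # True
```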
How should we proceed with testing? Since it requires a change in the code, I need an engineering patch, right?
Hi Eric, would you mind helping us with this topic, please? The customer has a production environment, so should we provide a hotfix or a test package to the customer instead of manual changes?
I am trying to get https://review.opendev.org/#/c/691901/ merged into upstream master, will provide a hotfix package once we at least get this merged there.
Waiting for a newer puddle. 2020-01-15.3 resulted in a pre-fixed-in version: openstack-cinder-12.0.8-3.el7ost < openstack-cinder-12.0.8-5.el7ost
Verified on:
openstack-cinder-12.0.10-2.el7ost.noarch

Using a K2 iSCSI-backed 500G volume:

cinder create 500 --name K2_500G

(overcloud) [stack@undercloud-0 ~]$ cinder show 7bace336-d2cc-4530-864e-e4f455e73eb1
+--------------------------------+------------------------------------------+
| Property                       | Value                                    |
+--------------------------------+------------------------------------------+
| attached_servers               | ['2ff82e12-e95a-45e5-ad28-4bd6d840e9e7'] |
| attachment_ids                 | ['def3d853-c19f-4ec6-82b8-92e09131873f'] |
| availability_zone              | nova                                     |
| bootable                       | false                                    |
| consistencygroup_id            | None                                     |
| created_at                     | 2020-02-09T15:52:44.000000               |
| description                    | None                                     |
| encrypted                      | False                                    |
| id                             | 7bace336-d2cc-4530-864e-e4f455e73eb1     |
| metadata                       | attached_mode : rw                       |
| migration_status               | None                                     |
| multiattach                    | False                                    |
| name                           | K2_500G                                  |
| os-vol-host-attr:host          | controller-0@k2iscsi#k2iscsi             |
| os-vol-mig-status-attr:migstat | None                                     |
| os-vol-mig-status-attr:name_id | None                                     |
| os-vol-tenant-attr:tenant_id   | 2cd01b0fe6c644a48cbfa6da5a03d25b         |
| replication_status             | None                                     |
| size                           | 500                                      |
| snapshot_id                    | None                                     |
| source_volid                   | None                                     |
| status                         | in-use                                   |
| updated_at                     | 2020-02-09T15:53:31.000000               |
| user_id                        | b8c1ccd7e02c4f22ad8929f1cb5fcaba         |
| volume_type                    | tripleo                                  |
+--------------------------------+------------------------------------------+

Attached it to an instance and filled it with random data:

nova volume-attach 2ff82e12-e95a-45e5-ad28-4bd6d840e9e7 7bace336-d2cc-4530-864e-e4f455e73eb1

# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda     253:0    0    1G  0 disk
|-vda1  253:1    0 1015M  0 part /
`-vda15 253:15   0    8M  0 part
vdb     253:16   0  500G  0 disk /root/kuku

# df -h
Filesystem                Size      Used Available Use% Mounted on
/dev                    240.1M         0    240.1M   0% /dev
/dev/vda1               978.9M     23.9M    914.2M   3% /
tmpfs                   244.2M         0    244.2M   0% /dev/shm
tmpfs                   244.2M     88.0K    244.1M   0% /run
/dev/vdb                492.0G    466.4G    652.5M 100% /root/kuku

Now let's detach the volume and clone it:

nova volume-detach 2ff82e12-e95a-45e5-ad28-4bd6d840e9e7 7bace336-d2cc-4530-864e-e4f455e73eb1

cinder create 501 --source-volid 7bace336-d2cc-4530-864e-e4f455e73eb1 --name 501G_ClonedVolume
+--------------------------------+--------------------------------------+
| Property                       | Value                                |
+--------------------------------+--------------------------------------+
| attachments                    | []                                   |
| availability_zone              | nova                                 |
| bootable                       | false                                |
| consistencygroup_id            | None                                 |
| created_at                     | 2020-02-10T05:36:30.000000           |
| description                    | None                                 |
| encrypted                      | False                                |
| id                             | b23cd5d0-2579-4ac4-a1dc-a04863319497 |
| metadata                       | {}                                   |
| migration_status               | None                                 |
| multiattach                    | False                                |
| name                           | 501G_ClonedVolume                    |
| os-vol-host-attr:host          | controller-0@k2iscsi#k2iscsi         |
| os-vol-mig-status-attr:migstat | None                                 |
| os-vol-mig-status-attr:name_id | None                                 |
| os-vol-tenant-attr:tenant_id   | 2cd01b0fe6c644a48cbfa6da5a03d25b     |
| replication_status             | None                                 |
| size                           | 501                                  |
| snapshot_id                    | None                                 |
| source_volid                   | 7bace336-d2cc-4530-864e-e4f455e73eb1 |
| status                         | creating                             |
| updated_at                     | 2020-02-10T05:36:31.000000           |
| user_id                        | b8c1ccd7e02c4f22ad8929f1cb5fcaba     |
| volume_type                    | tripleo                              |
+--------------------------------+--------------------------------------+

Wait a while for the clone operation to finish:

(overcloud) [stack@undercloud-0 ~]$ cinder list
+--------------------------------------+-----------+-------------------+------+-------------+----------+--------------------------------------+
| ID                                   | Status    | Name              | Size | Volume Type | Bootable | Attached to                          |
+--------------------------------------+-----------+-------------------+------+-------------+----------+--------------------------------------+
| 7bace336-d2cc-4530-864e-e4f455e73eb1 | available | K2_500G           | 500  | tripleo     | false    |                                      |
| b23cd5d0-2579-4ac4-a1dc-a04863319497 | in-use    | 501G_ClonedVolume | 501  | tripleo     | false    | 2ff82e12-e95a-45e5-ad28-4bd6d840e9e7 |
+--------------------------------------+-----------+-------------------+------+-------------+----------+--------------------------------------+
-> cloned volume

Attach the cloned volume to the instance and check we have the same data on both:

# nova volume-attach 2ff82e12-e95a-45e5-ad28-4bd6d840e9e7 b23cd5d0-2579-4ac4-a1dc-a04863319497

Looking inside, all data is there; we successfully cloned a 500G volume. Good to verify.
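The "same data on both" check above can also be made mechanical by comparing checksums of the source and cloned data. A small sketch, assuming hypothetical mount points for the two volumes (the paths are illustrative, not from the actual environment):

```python
import hashlib

def file_digest(path, chunk=1024 * 1024):
    """Stream a file in fixed-size chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Hypothetical paths: the same test file as seen through the source
# volume's mount point and through the clone's mount point.
# assert file_digest("/mnt/source/kuku") == file_digest("/mnt/clone/kuku")
```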
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0764