Bug 1764545

Summary: Problem cloning a volume larger than 500GB
Product: Red Hat OpenStack Reporter: Andre <afariasa>
Component: openstack-cinder Assignee: Eric Harney <eharney>
Status: CLOSED ERRATA QA Contact: Tzach Shefi <tshefi>
Severity: high Docs Contact: Chuck Copello <ccopello>
Priority: medium    
Version: 13.0 (Queens) CC: ealcaniz, eharney, mmethot, pmorey
Target Milestone: z11 Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-cinder-12.0.8-5.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-10 11:25:28 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 1796694    

Description Andre 2019-10-23 09:52:00 UTC
Description of problem:

This happens when the source volume is large enough (more than 500GB); with smaller volumes (~100GB) there is no problem.

I will post the full error and the logs in a private follow-up comment, since they contain customer-sensitive information.

Version-Release number of selected component (if applicable):

openstack-cinder-12.0.6-3.el7ost.noarch                     Thu Jun 13 11:05:12 2019
openstack-nova-api-17.0.9-9.el7ost.noarch                   Thu Jun 13 11:05:36 2019
openstack-nova-common-17.0.9-9.el7ost.noarch                Thu Jun 13 11:03:27 2019
openstack-nova-compute-17.0.9-9.el7ost.noarch               Thu Jun 13 11:05:07 2019
openstack-nova-conductor-17.0.9-9.el7ost.noarch             Thu Jun 13 11:05:36 2019
openstack-nova-console-17.0.9-9.el7ost.noarch               Thu Jun 13 11:05:36 2019
openstack-nova-migration-17.0.9-9.el7ost.noarch             Thu Jun 13 11:05:07 2019
openstack-nova-novncproxy-17.0.9-9.el7ost.noarch            Thu Jun 13 11:05:36 2019
openstack-nova-placement-api-17.0.9-9.el7ost.noarch         Thu Jun 13 11:05:35 2019
openstack-nova-scheduler-17.0.9-9.el7ost.noarch             Thu Jun 13 11:05:36 2019
puppet-cinder-12.4.1-4.el7ost.noarch                        Thu Jun 13 11:02:56 2019
puppet-nova-12.4.0-17.el7ost.noarch                         Thu Jun 13 11:02:56 2019
python2-cinderclient-3.5.0-1.el7ost.noarch                  Tue Dec 11 17:34:09 2018
python2-novaclient-10.1.0-1.el7ost.noarch                   Tue Dec 11 17:34:09 2018
python-cinder-12.0.6-3.el7ost.noarch                        Thu Jun 13 11:03:41 2019
python-nova-17.0.9-9.el7ost.noarch                          Thu Jun 13 11:03:27 2019



Additional info:
This seems to be similar to bug [1], which was opened for OSP9. They both show a similar error:
~~~
2019-10-21 14:25:28.790 76 ERROR cinder.volume.manager ProcessExecutionError: Unexpected error while running command.
2019-10-21 14:25:28.790 76 ERROR cinder.volume.manager Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=8 -- env LC_ALL=C qemu-img info /var/lib/cinder/mnt/[OMITTED]
2019-10-21 14:25:28.790 76 ERROR cinder.volume.manager Exit code: -9
~~~

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1402594

Comment 1 Andre 2019-10-29 15:30:03 UTC
Hi,

I was checking the code for the customer's version [1], and it already includes the workaround that increases cpu_time to 30 [2]. However, in the command that raises the error, the CPU limit is set to 8 [3]. Is this an issue? It seems the command should reflect the cpu_time set in the code, but that is not the case (see the sketch after the snippets below).


[1] ~~~
$ grep -ir nova pollux-tds-controller-1/sos_commands/rpm/sh_-c_rpm_--nodigest_-qa_--qf_NAME_-_VERSION_-_RELEASE_._ARCH_INSTALLTIME_date_awk_-F_printf_-59s_s_n_1_2_sort_-V 
openstack-nova-api-17.0.9-9.el7ost.noarch
openstack-nova-common-17.0.9-9.el7ost.noarch
openstack-nova-compute-17.0.9-9.el7ost.noarch
openstack-nova-conductor-17.0.9-9.el7ost.noarch
openstack-nova-console-17.0.9-9.el7ost.noarch
openstack-nova-migration-17.0.9-9.el7ost.noarch
openstack-nova-novncproxy-17.0.9-9.el7ost.noarch
openstack-nova-placement-api-17.0.9-9.el7ost.noarch
openstack-nova-scheduler-17.0.9-9.el7ost.noarch
~~~

[2] ~~~
QEMU_IMG_LIMITS = processutils.ProcessLimits(
    cpu_time=30,
    address_space=1 * units.Gi)
~~~

[3] ~~~
2019-10-21 14:25:28.790 76 ERROR cinder.volume.manager Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=8 -- env LC_ALL=C qemu-img info /var/lib/cinder/mnt/[OMITTED]
~~~
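
A minimal sketch of how these limits end up on the command line (assuming oslo.concurrency and oslo.utils are available; the values below are copied from the flags in the logged command, not from any particular source file). Whichever service builds the command supplies its own ProcessLimits object, so the prlimit flags mirror that service's QEMU_IMG_LIMITS rather than the value patched in Nova:

~~~
from oslo_concurrency import processutils
from oslo_utils import units

# ProcessLimits matching the flags in the failing command
# (--as=1073741824 --cpu=8): 1 GiB of address space, 8 CPU seconds.
limits = processutils.ProcessLimits(
    cpu_time=8,
    address_space=1 * units.Gi)

# prlimit_args() builds the arguments that oslo.concurrency prepends to the
# wrapped command, e.g. ['--as=1073741824', '--cpu=8'].
print(limits.prlimit_args())
~~~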

Comment 2 Eric Harney 2019-10-29 15:35:35 UTC
Are you looking at the code in nova or cinder?  Both have QEMU_IMG_LIMITS that are applied for these calls.

Comment 3 Andre 2019-10-31 12:59:21 UTC
I was checking this in the Nova code; do we have it in both places? If so, where is the Nova variable used?
I want to try increasing this value so the customer can test it. How should we proceed?

Comment 4 Eric Harney 2019-11-05 20:55:16 UTC
(In reply to Andre from comment #3)
> I was checking this in the Nova code; do we have it in both places? If so,
> where is the Nova variable used?
> I want to try increasing this value so the customer can test it. How should
> we proceed?

The error shown in the description and comment #1 came from the Cinder volume manager, so the Nova code is not relevant. Nova also already has higher limits than Cinder does.

Changing it in Cinder and restarting cinder-volume should help.

If this works, we can work toward getting this patch to relevant branches:
https://review.opendev.org/#/c/691901/
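
A hedged sketch of the kind of change being discussed, not the exact upstream patch: Cinder defines its own QEMU_IMG_LIMITS (in cinder/image/image_utils.py), and raising those values and restarting cinder-volume is the manual workaround. The numbers below are illustrative only:

~~~
from oslo_concurrency import processutils
from oslo_utils import units

# Illustrative values only; the limits actually shipped are whatever the
# upstream patch above settles on.
QEMU_IMG_LIMITS = processutils.ProcessLimits(
    cpu_time=30,                  # more CPU seconds for qemu-img info
    address_space=2 * units.Gi)   # more than the 1 GiB that led to the
                                  # exit code -9 kill
~~~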

Comment 5 Andre 2019-11-07 15:01:46 UTC
How should we proceed with testing? Since it requires a code change, I need an engineering patch, right?

Comment 6 Edu Alcaniz 2019-11-12 09:14:15 UTC
Hi Eric, would you mind helping us with this topic, please?

The customer has a production environment, so should we provide a hotfix or test package instead of manual changes?

Comment 7 Eric Harney 2019-11-12 15:50:46 UTC
I am trying to get https://review.opendev.org/#/c/691901/ merged into upstream master, and will provide a hotfix package once it is at least merged there.

Comment 12 Tzach Shefi 2020-02-05 17:57:52 UTC
Waiting for a newer puddle; puddle 2020-01-15.3 resulted in a version older than the Fixed In Version:

openstack-cinder-12.0.8-3.el7ost < openstack-cinder-12.0.8-5.el7ost

Comment 13 Tzach Shefi 2020-02-10 15:15:43 UTC
Verified on:
openstack-cinder-12.0.10-2.el7ost.noarch

Using a K2 iSCSI-backed 500G volume:
cinder create 500 --name K2_500G

(overcloud) [stack@undercloud-0 ~]$ cinder show 7bace336-d2cc-4530-864e-e4f455e73eb1
+--------------------------------+------------------------------------------+
| Property                       | Value                                    |
+--------------------------------+------------------------------------------+
| attached_servers               | ['2ff82e12-e95a-45e5-ad28-4bd6d840e9e7'] |
| attachment_ids                 | ['def3d853-c19f-4ec6-82b8-92e09131873f'] |
| availability_zone              | nova                                     |
| bootable                       | false                                    |
| consistencygroup_id            | None                                     |
| created_at                     | 2020-02-09T15:52:44.000000               |
| description                    | None                                     |
| encrypted                      | False                                    |
| id                             | 7bace336-d2cc-4530-864e-e4f455e73eb1     |
| metadata                       | attached_mode : rw                       |
| migration_status               | None                                     |
| multiattach                    | False                                    |
| name                           | K2_500G                                  |
| os-vol-host-attr:host          | controller-0@k2iscsi#k2iscsi             |
| os-vol-mig-status-attr:migstat | None                                     |
| os-vol-mig-status-attr:name_id | None                                     |
| os-vol-tenant-attr:tenant_id   | 2cd01b0fe6c644a48cbfa6da5a03d25b         |
| replication_status             | None                                     |
| size                           | 500                                      |
| snapshot_id                    | None                                     |
| source_volid                   | None                                     |
| status                         | in-use                                   |
| updated_at                     | 2020-02-09T15:53:31.000000               |
| user_id                        | b8c1ccd7e02c4f22ad8929f1cb5fcaba         |
| volume_type                    | tripleo                                  |
+--------------------------------+------------------------------------------+


Attached the volume to an instance and filled it with random data:
 nova volume-attach 2ff82e12-e95a-45e5-ad28-4bd6d840e9e7 7bace336-d2cc-4530-864e-e4f455e73eb1


# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda     253:0    0    1G  0 disk 
|-vda1  253:1    0 1015M  0 part /
`-vda15 253:15   0    8M  0 part 
vdb     253:16   0  500G  0 disk /root/kuku
# df -h
Filesystem                Size      Used Available Use% Mounted on
/dev                    240.1M         0    240.1M   0% /dev
/dev/vda1               978.9M     23.9M    914.2M   3% /
tmpfs                   244.2M         0    244.2M   0% /dev/shm
tmpfs                   244.2M     88.0K    244.1M   0% /run
/dev/vdb                492.0G    466.4G    652.5M 100% /root/kuku

Now let's detach the volume and clone it:
nova volume-detach 2ff82e12-e95a-45e5-ad28-4bd6d840e9e7 7bace336-d2cc-4530-864e-e4f455e73eb1

cinder create 501 --source-volid 7bace336-d2cc-4530-864e-e4f455e73eb1 --name 501G_ClonedVolume 
+--------------------------------+--------------------------------------+
| Property                       | Value                                |
+--------------------------------+--------------------------------------+
| attachments                    | []                                   |
| availability_zone              | nova                                 |
| bootable                       | false                                |
| consistencygroup_id            | None                                 |
| created_at                     | 2020-02-10T05:36:30.000000           |
| description                    | None                                 |
| encrypted                      | False                                |
| id                             | b23cd5d0-2579-4ac4-a1dc-a04863319497 |
| metadata                       | {}                                   |
| migration_status               | None                                 |
| multiattach                    | False                                |
| name                           | 501G_ClonedVolume                    |
| os-vol-host-attr:host          | controller-0@k2iscsi#k2iscsi         |
| os-vol-mig-status-attr:migstat | None                                 |
| os-vol-mig-status-attr:name_id | None                                 |
| os-vol-tenant-attr:tenant_id   | 2cd01b0fe6c644a48cbfa6da5a03d25b     |
| replication_status             | None                                 |
| size                           | 501                                  |
| snapshot_id                    | None                                 |
| source_volid                   | 7bace336-d2cc-4530-864e-e4f455e73eb1 |
| status                         | creating                             |
| updated_at                     | 2020-02-10T05:36:31.000000           |
| user_id                        | b8c1ccd7e02c4f22ad8929f1cb5fcaba     |
| volume_type                    | tripleo                              |
+--------------------------------+--------------------------------------+

Wait a while for the clone operation to finish.

(overcloud) [stack@undercloud-0 ~]$ cinder list
+--------------------------------------+-----------+-------------------+------+-------------+----------+--------------------------------------+
| ID                                   | Status    | Name              | Size | Volume Type | Bootable | Attached to                          |
+--------------------------------------+-----------+-------------------+------+-------------+----------+--------------------------------------+
| 7bace336-d2cc-4530-864e-e4f455e73eb1 | available | K2_500G           | 500  | tripleo     | false    |                                      |
| b23cd5d0-2579-4ac4-a1dc-a04863319497 | in-use    | 501G_ClonedVolume | 501  | tripleo     | false    | 2ff82e12-e95a-45e5-ad28-4bd6d840e9e7 |  -> cloned volume
+--------------------------------------+-----------+-------------------+------+-------------+----------+--------------------------------------+


Attach the volume to the instance and check that we have the same data on both:
# nova volume-attach 2ff82e12-e95a-45e5-ad28-4bd6d840e9e7 b23cd5d0-2579-4ac4-a1dc-a04863319497

Looking inside, all the data is there; the 500G volume was successfully cloned. Good to verify.
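
For reference, a minimal sketch of the same clone-and-wait step driven through python-cinderclient (assuming keystoneauth1 password auth from the usual OS_* environment variables; the source volume ID, name, and 600-second timeout are illustrative):

~~~
import os
import time

from cinderclient import client as cinder_client
from keystoneauth1 import loading
from keystoneauth1 import session

# Build a Keystone session from the standard overcloudrc-style variables.
loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url=os.environ['OS_AUTH_URL'],
    username=os.environ['OS_USERNAME'],
    password=os.environ['OS_PASSWORD'],
    project_name=os.environ['OS_PROJECT_NAME'],
    user_domain_name=os.environ.get('OS_USER_DOMAIN_NAME', 'Default'),
    project_domain_name=os.environ.get('OS_PROJECT_DOMAIN_NAME', 'Default'))
cinder = cinder_client.Client('3', session=session.Session(auth=auth))

# Clone: create a 501G volume from the 500G source, as in the CLI step above.
clone = cinder.volumes.create(
    501,
    source_volid='7bace336-d2cc-4530-864e-e4f455e73eb1',
    name='501G_ClonedVolume')

# Poll until the clone leaves the 'creating' state (or the timeout expires).
deadline = time.time() + 600
while clone.status == 'creating' and time.time() < deadline:
    time.sleep(10)
    clone = cinder.volumes.get(clone.id)

print(clone.id, clone.status)   # expect 'available' before attaching it
~~~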

Comment 15 errata-xmlrpc 2020-03-10 11:25:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0764