Previously, the targetcli utility removed storage objects under certain conditions. This happened when the volume hosting the storage objects was down and the user restored the target configuration with the "targetcli restoreconfig" command. With this update, the configuration is saved at block granularity and, as a result, the described problem no longer occurs.
Description - Prasanna Kumar Kalever
2017-12-12 05:36:00 UTC
1. Created one user:glfs storage object and one target with 3 portals.
2. Brought down the volume on which the storage object was hosted.
3. While the storage was down, initiated a 'targetcli restoreconfig'.
4. Since the volume was down, the storage object was not loaded because tcmu-runner -> handle_netlink -> add_device() failed. At this point 'targetcli ls' lists Storage Objects = 0, Targets = 1.
5. Tried to create a second user:glfs block using another volume (vol2). The subsequent 'targetcli / saveconfig' then updates the saveconfig.json file with StorageObjects = 1, Targets = 2, which overwrites the original storage objects.
I don't think this was the case earlier, i.e. storage objects disappearing from 'targetcli ls' upon add_device() failure.
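The overwrite in step 5 can be shown in miniature. This is an illustrative sketch, not targetcli's actual code: the dict layout only loosely mirrors saveconfig.json, and the names and IQNs are invented. It demonstrates that saving only the currently loaded objects silently drops the one that failed to load:

```python
import json
import os
import tempfile

# Stand-in for /etc/target/saveconfig.json after step 1:
# one user:glfs storage object and its target are on disk.
saved = {
    "storage_objects": [{"name": "block", "plugin": "user:glfs"}],
    "targets": [{"wwn": "iqn.2016-12.org.gluster-block:670f"}],
}

path = os.path.join(tempfile.mkdtemp(), "saveconfig.json")
with open(path, "w") as f:
    json.dump(saved, f)

# Step 4: restoreconfig runs while the volume is down; the storage
# object fails to load, so the live configuration holds zero of them
# (the target, however, is restored).
live = {"storage_objects": [], "targets": list(saved["targets"])}

# Step 5: a second block is created on vol2 and saveconfig runs,
# overwriting the file with only the live state.
live["storage_objects"].append({"name": "block2", "plugin": "user:glfs"})
live["targets"].append({"wwn": "iqn.2016-12.org.gluster-block:aaaa"})
with open(path, "w") as f:
    json.dump(live, f)

# The original storage object is now gone from the saved config.
with open(path) as f:
    on_disk = json.load(f)
names = [so["name"] for so in on_disk["storage_objects"]]
print(names)  # ['block2'] -- the original "block" entry has been lost
```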
I kept a watch on '/sys/kernel/config/target/core/user_0/' and executed 'targetcli restoreconfig' while the underlying backend storage was down. I see that the entries are getting created, but once 'targetcli restoreconfig' exits they are removed as soon as add_device() fails (maybe self.delete()?).
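That manual watch can be automated with a simple poller. A sketch follows; the demonstration below runs it against a scratch directory rather than the real configfs path (which only exists on a host with the target stack loaded), with a background thread standing in for restoreconfig creating and then tearing down the entry:

```python
import os
import tempfile
import threading
import time

def watch_configfs(path, timeout=5.0, interval=0.05):
    """Poll `path` and record when the directory appears and disappears."""
    events = []
    present = os.path.isdir(path)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        now = os.path.isdir(path)
        if now and not present:
            events.append("created")
        elif present and not now:
            events.append("removed")
        present = now
        time.sleep(interval)
    return events

# Scratch stand-in for /sys/kernel/config/target/core/user_0/
scratch = os.path.join(tempfile.mkdtemp(), "user_0")

def simulate_restoreconfig():
    time.sleep(0.2)
    os.mkdir(scratch)   # entries created during restoreconfig...
    time.sleep(0.3)
    os.rmdir(scratch)   # ...then torn down when add_device() fails

t = threading.Thread(target=simulate_restoreconfig)
t.start()
events = watch_configfs(scratch, timeout=1.5)
t.join()
print(events)  # ['created', 'removed']
```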
In steps, this is exactly what I observed for a single block:
[RHEL]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
[RHEL]# uname -a
Linux dhcp42-218.lab.eng.blr.redhat.com 3.10.0-693.2.1.el7.x86_64 #1 SMP Fri Aug 11 04:58:43 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[RHEL]# rpm -qa | grep -e rtslib -e targetcli -e tcmu-runner -e configshell
targetcli-2.1.fb46-1.el7.noarch
python-rtslib-2.1.fb63-2.el7.noarch
tcmu-runner-1.2.0-15.el7rhgs.x86_64
python-configshell-1.1.fb23-3.el7.noarch
Here is what I have done:
* Created a gluster volume named 'sample' on both the VMs
* Created a block with a storage object and target using targetcli
[RHEL]# targetcli ls /backstores/user:glfs
o- user:glfs .................................................................................... [Storage Objects: 1]
o- block .............. [sample.124.227/block-store/670f4641-176a-4b1a-bdcf-8036fdbf7f76 (1.0GiB) activated]
o- alua ......................................................................................... [ALUA Groups: 0]
[RHEL]# targetcli ls /iscsi
o- iscsi ................................................................................................ [Targets: 1]
o- iqn.2016-12.org.gluster-block:670f4641-176a-4b1a-bdcf-8036fdbf7f76 .................................... [TPGs: 1]
o- tpg1 ......................................................................... [gen-acls, tpg-auth, 1-way auth]
o- acls .............................................................................................. [ACLs: 0]
o- luns .............................................................................................. [LUNs: 1]
| o- lun0 .................................................................................. [user/block (None)]
o- portals ........................................................................................ [Portals: 1]
o- 192.168.124.227:3260 ................................................................................. [OK]
Now,
* Stop volume & reload the configuration
[RHEL]# gluster vol status
Volume sample is not started
[RHEL]# targetcli clearconfig confirm=True
All configuration cleared
[RHEL]# targetcli restoreconfig ~/saveconfig.json
Configuration restored, 2 recoverable errors:
Could not create StorageObject block: [Errno 2] No such file or directory, skipped
Could not find matching StorageObject for LUN 0, skipped
[RHEL]# targetcli ls /backstores/user:glfs
o- user:glfs .................................................................................... [Storage Objects: 0]
Meanwhile, I collected the tcmu-runner logs in debug mode:
[RHEL]# tcmu-runner -d
2017-12-06 18:50:03.401 2412 [DEBUG] main:816 : handler path: /usr/lib64/tcmu-runner
2017-12-06 18:50:03.403 2412 [DEBUG] load_our_module:524 : Module 'target_core_user' is already loaded
2017-12-06 18:50:03.404 2412 [DEBUG] main:829 : 1 runner handlers found
2017-12-06 18:50:03.409 2412 [DEBUG] dbus_bus_acquired:437 : bus org.kernel.TCMUService1 acquired
2017-12-06 18:50:03.409 2412 [DEBUG] dbus_name_acquired:453 : name org.kernel.TCMUService1 acquired
2017-12-06 18:50:22.745 2412 [DEBUG] handle_netlink:207 : cmd 1. Got header version 2. Supported 2.
2017-12-06 18:50:22.763 2412 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: Input/output error
2017-12-06 18:50:23.746 2412 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
2017-12-06 18:50:23.757 2412 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: Input/output error
2017-12-06 18:50:24.748 2412 [ERROR] tcmu_glfs_open:562 : tcmu_create_glfs_object failed
2017-12-06 18:50:24.748 2412 [ERROR] add_device:483 : handler open failed for uio0
A watch on /sys/kernel/config/target/core/user_0/ while restoreconfig is in progress:
[RHEL]# ls -R /sys/kernel/config/target/core/user_0/
/sys/kernel/config/target/core/user_0/:
block hba_info hba_mode
/sys/kernel/config/target/core/user_0/block:
alias alua alua_lu_gp attrib control enable info lba_map pr statistics udev_path wwn
/sys/kernel/config/target/core/user_0/block/alua:
default_tg_pt_gp
/sys/kernel/config/target/core/user_0/block/alua/default_tg_pt_gp:
alua_access_state alua_support_lba_dependent alua_write_metadata tg_pt_gp_id
alua_access_status alua_support_offline implicit_trans_secs trans_delay_msecs
alua_access_type alua_support_standby members
alua_support_active_nonoptimized alua_support_transitioning nonop_delay_msecs
alua_support_active_optimized alua_support_unavailable preferred
/sys/kernel/config/target/core/user_0/block/attrib:
alua_support dev_size hw_max_sectors hw_queue_depth pgr_support
cmd_time_out hw_block_size hw_pi_prot_type max_data_area_mb qfull_time_out
/sys/kernel/config/target/core/user_0/block/pr:
res_aptpl_active res_holder res_pr_generation res_pr_registered_i_pts res_type
res_aptpl_metadata res_pr_all_tgt_pts res_pr_holder_tg_port res_pr_type
/sys/kernel/config/target/core/user_0/block/statistics:
scsi_dev scsi_lu scsi_tgt_dev
/sys/kernel/config/target/core/user_0/block/statistics/scsi_dev:
indx inst ports role
/sys/kernel/config/target/core/user_0/block/statistics/scsi_lu:
creation_time dev_type hs_num_cmds inst lu_name prod resets state_bit vend
dev full_stat indx lun num_cmds read_mbytes rev status write_mbytes
/sys/kernel/config/target/core/user_0/block/statistics/scsi_tgt_dev:
indx inst non_access_lus num_lus resets status
/sys/kernel/config/target/core/user_0/block/wwn:
vpd_assoc_logical_unit vpd_assoc_scsi_target_device vpd_assoc_target_port vpd_protocol_identifier vpd_unit_serial
[RHEL]# cat /sys/kernel/config/target/core/user_0/block/info
Status: DEACTIVATED Max Queue Depth: 0 SectorSize: 0 HwMaxSectors: 128
Config: glfs/sample.124.227/block-store/670f4641-176a-4b1a-bdcf-8036fdbf7f76 Size: 1073741824 MaxDataAreaMB: 8
What is triggering the self.delete() in the rtslib code that leads to the deletion of the storage object from the target configuration on add_device() failure?
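The observed behavior is consistent with a restore path that creates the configfs entry first and rolls it back when device activation fails. Below is a hypothetical sketch of that pattern only; it is not rtslib's actual code, and the class and method names are invented for illustration:

```python
class StorageObjectStub:
    """Invented stand-in for an rtslib storage object; tracks its lifecycle."""

    def __init__(self, name, backend_up):
        self.name = name
        self.backend_up = backend_up
        self.exists = False

    def create(self):
        # Corresponds to the configfs entries appearing under
        # /sys/kernel/config/target/core/user_N/ during restoreconfig.
        self.exists = True

    def enable(self):
        # Corresponds to tcmu-runner's add_device(), which fails while
        # the gluster volume backing the object is down.
        if not self.backend_up:
            raise IOError("handler open failed")

    def delete(self):
        # The suspected self.delete(): rollback removes the half-created
        # object, so 'targetcli ls' later shows Storage Objects: 0.
        self.exists = False


def restore(so):
    """Create-then-enable, with rollback on failure."""
    so.create()
    try:
        so.enable()
    except IOError:
        so.delete()  # object vanishes from the configuration
        return "skipped"
    return "restored"


down = StorageObjectStub("block", backend_up=False)
result = restore(down)
print(result, down.exists)  # skipped False
```

Under this pattern the entries briefly appearing and then vanishing during restoreconfig would be exactly the create() followed by the rollback delete().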
Any update on this bug? Do you need further information from us?
Comment 3 - Maurizio Lombardi
2017-12-21 07:57:48 UTC
Hi,
No updates right now, I am going to look at it ASAP.
Comment 13 - Maurizio Lombardi
2018-02-27 13:41:58 UTC
Hi Atin,
The Release Candidate Fix Freeze is imminent and the patches have not been merged by upstream yet.
Do you think it would be acceptable if we move this bz to 7.6, set the z-stream flag, and eventually provide a hotfix to the affected customers?
Comment 14 - Maurizio Lombardi
2018-02-27 15:32:49 UTC
My manager also suggests that there is a possibility of a zero day errata for userspace, if PM agrees that this is a candidate.
Comment 15 - Maurizio Lombardi
2018-02-27 18:38:39 UTC
Andy Grover says that the patches are not ready for upstream;
if no one has further objections, I will proceed to remove the blocker flag.
Maurizio - a fix in the first z-stream would be great, and zero-day would be even better. We need to ensure this fix gets shipped before RHGS-3.4.0 ships; both of these options look feasible per the schedule.
Comment 18 - Maurizio Lombardi
2018-04-10 15:24:14 UTC
Tested with:
- kernel-3.10.0-915.el7
- targetcli-2.1.fb46-6.el7
- python-configshell-1.1.fb23-4.el7
- python-rtslib-2.1.fb63-12.el7
The issue is no longer reproducible; no regression found.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2018:3019