Bug 1524791

Summary: Storage objects get removed upon restoreconfig if the underlying backend is down
Product: Red Hat Enterprise Linux 7
Reporter: Prasanna Kumar Kalever <prasanna.kalever>
Component: python-rtslib
Assignee: Maurizio Lombardi <mlombard>
Status: CLOSED ERRATA
QA Contact: Martin Hoyer <mhoyer>
Severity: high
Docs Contact:
Priority: urgent
Version: 7.4
CC: amukherj, bkunal, cww, dsundqvi, glamb, lmiksik, loberman, mchristi, mhoyer, mlombard, olim, rcyriac, rhandlin, sabose, salmy
Target Milestone: rc
Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Previously, the targetcli utility removed storage objects under certain conditions. This happened when the volume hosting the storage objects was down and the user restored target configuration with the command "targetcli restoreconfig". With this update, configuration is now saved at a block granularity, and, as a result, the described problem no longer occurs.
Story Points: ---
Clone Of:
: 1565074 1566103
Environment:
Last Closed: 2018-10-30 07:44:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1537170, 1546181, 1554642, 1559239, 1565074, 1566103, 1592629    

Description Prasanna Kumar Kalever 2017-12-12 05:36:00 UTC
1. Created one user:glfs storage object and one target with three portals.
2. Brought down the volume hosting the storage object.
3. While the storage was down, ran 'targetcli restoreconfig'.
4. Because the volume was down, the storage object was not loaded: tcmu-runner -> handle_netlink -> add_device() failed. At this point 'targetcli ls' lists Storage Objects = 0, Targets = 1.
5. Created a second user:glfs block on another volume (vol2). The subsequent 'targetcli / saveconfig' then updates saveconfig.json with Storage Objects = 1, Targets = 2.

This overwrites the saved storage objects, so the entry for the first storage object is lost.
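The overwrite in the steps above can be illustrated with a minimal sketch. It assumes the saved file is JSON with a top-level "storage_objects" list whose entries carry "plugin" and "name" keys; the function names are hypothetical and this is not the rtslib fix, only an illustration of why saving the live state alone drops objects that failed to load, and how a merge would keep them.

```python
import json


def _keys(cfg):
    """Collect (plugin, name) pairs for every storage object in a config dict."""
    return {(so.get("plugin", ""), so.get("name", ""))
            for so in cfg.get("storage_objects", [])}


def missing_storage_objects(saved, live):
    """Return saved storage objects that are absent from the live config."""
    return sorted(_keys(saved) - _keys(live))


def merge_preserving_missing(saved, live):
    """Return the live config plus saved storage objects that are not loaded."""
    live_keys = _keys(live)
    kept = [so for so in saved.get("storage_objects", [])
            if (so.get("plugin", ""), so.get("name", "")) not in live_keys]
    merged = dict(live)
    merged["storage_objects"] = list(live.get("storage_objects", [])) + kept
    return merged


if __name__ == "__main__":
    # Hypothetical data: "block" failed to load, "block2" was created afterwards.
    saved = {"storage_objects": [{"plugin": "user", "name": "block"}]}
    live = {"storage_objects": [{"plugin": "user", "name": "block2"}]}
    print(missing_storage_objects(saved, live))
    print(json.dumps(merge_preserving_missing(saved, live), indent=2))
```

Writing only the live dictionary back to disk would leave "block" out of the file, which matches the behaviour reported here.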


I don't think this happened earlier, i.e. storage objects disappearing from 'targetcli ls' when add_device() fails.

I kept a watch on '/sys/kernel/config/target/core/user_0/' while running 'targetcli restoreconfig' with the underlying backend storage down. The entries do get created, but they are removed as soon as add_device() fails, and they are gone once 'targetcli restoreconfig' exits (maybe via self.delete()?).



Step by step, this is exactly what I observed for a single block:

[RHEL]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

[RHEL]# uname -a
Linux dhcp42-218.lab.eng.blr.redhat.com 3.10.0-693.2.1.el7.x86_64 #1 SMP Fri Aug 11 04:58:43 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

[RHEL]#  rpm -qa | grep -e rtslib -e targetcli -e tcmu-runner -e configshell
targetcli-2.1.fb46-1.el7.noarch
python-rtslib-2.1.fb63-2.el7.noarch
tcmu-runner-1.2.0-15.el7rhgs.x86_64
python-configshell-1.1.fb23-3.el7.noarch


Here is what I did:
* Created a gluster volume named 'sample' on both VMs
* Created a block with a storage object and a target using targetcli


[RHEL]# targetcli ls /backstores/user:glfs
o- user:glfs .................................................................................... [Storage Objects: 1]
  o- block .............. [sample.124.227/block-store/670f4641-176a-4b1a-bdcf-8036fdbf7f76 (1.0GiB) activated]
    o- alua ......................................................................................... [ALUA Groups: 0]
[RHEL]# targetcli ls /iscsi
o- iscsi ................................................................................................ [Targets: 1]
  o- iqn.2016-12.org.gluster-block:670f4641-176a-4b1a-bdcf-8036fdbf7f76 .................................... [TPGs: 1]
    o- tpg1 ......................................................................... [gen-acls, tpg-auth, 1-way auth]
      o- acls .............................................................................................. [ACLs: 0]
      o- luns .............................................................................................. [LUNs: 1]
      | o- lun0 .................................................................................. [user/block (None)]
      o- portals ........................................................................................ [Portals: 1]
        o- 192.168.124.227:3260 ................................................................................. [OK]


Next:
* Stopped the volume and reloaded the configuration

[RHEL]# gluster vol status
Volume sample is not started

[RHEL]# targetcli clearconfig confirm=True
All configuration cleared

[RHEL]# targetcli restoreconfig ~/saveconfig.json
Configuration restored, 2 recoverable errors:
Could not create StorageObject block: [Errno 2] No such file or directory, skipped
Could not find matching StorageObject for LUN 0, skipped

[RHEL]# targetcli ls /backstores/user:glfs
o- user:glfs .................................................................................... [Storage Objects: 0]
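
A script that runs restoreconfig could notice this situation before a later saveconfig overwrites the file. The sketch below parses only the "Could not create StorageObject ..., skipped" message format shown in the output above; the function name is hypothetical, and other message formats are not handled.

```python
import re

# Matches the recoverable-error line printed by restoreconfig in this report.
SKIPPED_SO = re.compile(
    r"^Could not create StorageObject (?P<name>\S+): (?P<reason>.*), skipped$"
)


def skipped_storage_objects(output):
    """Map each skipped storage object name to the reason it was skipped."""
    skipped = {}
    for line in output.splitlines():
        match = SKIPPED_SO.match(line.strip())
        if match:
            skipped[match.group("name")] = match.group("reason")
    return skipped


if __name__ == "__main__":
    sample = (
        "Configuration restored, 2 recoverable errors:\n"
        "Could not create StorageObject block: [Errno 2] No such file or directory, skipped\n"
        "Could not find matching StorageObject for LUN 0, skipped\n"
    )
    print(skipped_storage_objects(sample))
```

If the returned dictionary is non-empty, the saved configuration and the running configuration have diverged, and a saveconfig at that point would drop the skipped objects.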



Meanwhile, I collected the tcmu-runner logs in debug mode:


[RHEL]# tcmu-runner -d
2017-12-06 18:50:03.401 2412 [DEBUG] main:816 : handler path: /usr/lib64/tcmu-runner
2017-12-06 18:50:03.403 2412 [DEBUG] load_our_module:524 : Module 'target_core_user' is already loaded
2017-12-06 18:50:03.404 2412 [DEBUG] main:829 : 1 runner handlers found
2017-12-06 18:50:03.409 2412 [DEBUG] dbus_bus_acquired:437 : bus org.kernel.TCMUService1 acquired
2017-12-06 18:50:03.409 2412 [DEBUG] dbus_name_acquired:453 : name org.kernel.TCMUService1 acquired
2017-12-06 18:50:22.745 2412 [DEBUG] handle_netlink:207 : cmd 1. Got header version 2. Supported 2.
2017-12-06 18:50:22.763 2412 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: Input/output error
2017-12-06 18:50:23.746 2412 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
2017-12-06 18:50:23.757 2412 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: Input/output error
2017-12-06 18:50:24.748 2412 [ERROR] tcmu_glfs_open:562 : tcmu_create_glfs_object failed
2017-12-06 18:50:24.748 2412 [ERROR] add_device:483 : handler open failed for uio0



A watch on /sys/kernel/config/target/core/user_0/ while restoreconfig was in progress:


[RHEL]# ls -R /sys/kernel/config/target/core/user_0/
/sys/kernel/config/target/core/user_0/:
block  hba_info  hba_mode
/sys/kernel/config/target/core/user_0/block:
alias  alua  alua_lu_gp  attrib  control  enable  info  lba_map  pr  statistics  udev_path  wwn

/sys/kernel/config/target/core/user_0/block/alua:
default_tg_pt_gp

/sys/kernel/config/target/core/user_0/block/alua/default_tg_pt_gp:
alua_access_state                 alua_support_lba_dependent  alua_write_metadata  tg_pt_gp_id
alua_access_status                alua_support_offline        implicit_trans_secs  trans_delay_msecs
alua_access_type                  alua_support_standby        members
alua_support_active_nonoptimized  alua_support_transitioning  nonop_delay_msecs
alua_support_active_optimized     alua_support_unavailable    preferred

/sys/kernel/config/target/core/user_0/block/attrib:
alua_support  dev_size       hw_max_sectors   hw_queue_depth    pgr_support
cmd_time_out  hw_block_size  hw_pi_prot_type  max_data_area_mb  qfull_time_out

/sys/kernel/config/target/core/user_0/block/pr:
res_aptpl_active    res_holder          res_pr_generation      res_pr_registered_i_pts  res_type
res_aptpl_metadata  res_pr_all_tgt_pts  res_pr_holder_tg_port  res_pr_type

/sys/kernel/config/target/core/user_0/block/statistics:
scsi_dev  scsi_lu  scsi_tgt_dev

/sys/kernel/config/target/core/user_0/block/statistics/scsi_dev:
indx  inst  ports  role

/sys/kernel/config/target/core/user_0/block/statistics/scsi_lu:
creation_time  dev_type   hs_num_cmds  inst  lu_name   prod         resets  state_bit  vend
dev            full_stat  indx         lun   num_cmds  read_mbytes  rev     status     write_mbytes

/sys/kernel/config/target/core/user_0/block/statistics/scsi_tgt_dev:
indx  inst  non_access_lus  num_lus  resets  status

/sys/kernel/config/target/core/user_0/block/wwn:
vpd_assoc_logical_unit  vpd_assoc_scsi_target_device  vpd_assoc_target_port  vpd_protocol_identifier  vpd_unit_serial



[RHEL]# cat /sys/kernel/config/target/core/user_0/block/info
Status: DEACTIVATED  Max Queue Depth: 0  SectorSize: 0  HwMaxSectors: 128
        Config: glfs/sample.124.227/block-store/670f4641-176a-4b1a-bdcf-8036fdbf7f76 Size: 1073741824 MaxDataAreaMB: 8
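
The "Status:" field in this info file is what shows that the device never came up. A small helper can read it, sketched here under the assumption that a working device reports "ACTIVATED" in the same position; the function names are hypothetical.

```python
import re


def parse_status(info_text):
    """Return the value of the 'Status:' field, or None if it is absent."""
    match = re.search(r"Status:\s*(\S+)", info_text)
    return match.group(1) if match else None


def is_activated(info_text):
    """True only when the info text reports an ACTIVATED device (assumed value)."""
    return parse_status(info_text) == "ACTIVATED"


if __name__ == "__main__":
    sample = "Status: DEACTIVATED  Max Queue Depth: 0  SectorSize: 0  HwMaxSectors: 128\n"
    print(parse_status(sample), is_activated(sample))
```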



What is triggering self.delete() in the rtslib code? It leads to the storage object being deleted from the target configuration when add_device() fails.
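
The behaviour I suspect is a create-then-roll-back pattern. The sketch below is not rtslib code: the class, the exception, and the temporary directory standing in for configfs are all hypothetical, and it only shows how an entry can appear briefly and then vanish when the enable step fails.

```python
import os
import tempfile


class BackendDown(Exception):
    """Hypothetical error raised when the backing store cannot be opened."""


class FakeStorageObject:
    """Hypothetical stand-in for a storage object; not the rtslib class."""

    def __init__(self, root, name, enable):
        self.path = os.path.join(root, name)
        os.mkdir(self.path)      # entry becomes visible, like the configfs directory
        try:
            enable()             # stands in for the step that fails when the backend is down
        except Exception:
            self.delete()        # roll back: the entry disappears again
            raise

    def delete(self):
        os.rmdir(self.path)


def failing_enable():
    raise BackendDown("handler open failed")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        try:
            FakeStorageObject(root, "block", failing_enable)
        except BackendDown:
            pass
        # The directory was created and then removed by the rollback.
        print(os.listdir(root))
```

With a rollback like this, a watcher sees the directory tree while creation is in progress, but nothing remains for 'targetcli ls' to list afterwards.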

Comment 2 Sahina Bose 2017-12-20 06:58:59 UTC
Any update on this bug? Do you need further information from us?

Comment 3 Maurizio Lombardi 2017-12-21 07:57:48 UTC
Hi,

No updates right now, I am going to look at it ASAP.

Comment 13 Maurizio Lombardi 2018-02-27 13:41:58 UTC
Hi Atin,

The Release Candidate Fix Freeze is imminent and the patches have not been merged by upstream yet.

Do you think it would be acceptable if we move this bz to 7.6, set the z-stream flag and eventually provide a hotfix to the affected customers?

Comment 14 Maurizio Lombardi 2018-02-27 15:32:49 UTC
My manager also suggests that there is a possibility of a zero day errata for userspace, if PM agrees that this is a candidate.

Comment 15 Maurizio Lombardi 2018-02-27 18:38:39 UTC
Andy Grover says that the patches are not ready for upstream.
If no one has further objections, I will proceed to remove the blocker flag.

Comment 16 Atin Mukherjee 2018-02-27 23:57:45 UTC
Maurizio - a fix in the first z-stream would be great, and a zero day would be even better. We need to ensure this fix ships before RHGS-3.4.0 does; both of these options look feasible as per the schedule.

Comment 18 Maurizio Lombardi 2018-04-10 15:24:14 UTC
Package built: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15750470

Comment 22 Martin Hoyer 2018-07-10 07:51:13 UTC
Tested with: 
 - kernel-3.10.0-915.el7
 - targetcli-2.1.fb46-6.el7
 - python-configshell-1.1.fb23-4.el7
 - python-rtslib-2.1.fb63-12.el7
The issue is no longer reproducible; No regression found.

Comment 24 errata-xmlrpc 2018-10-30 07:44:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3019