Bug 1940548

Summary: Fibre channel not flushing data on disconnect
Product: Red Hat OpenStack
Reporter: Rajat Dhasmana <rdhasman>
Component: python-os-brick
Assignee: Rajat Dhasmana <rdhasman>
Status: CLOSED ERRATA
QA Contact: Tzach Shefi <tshefi>
Severity: high
Priority: high
Docs Contact:
Version: 16.1 (Train)
CC: alonare, apevec, astupnik, geguileo, jschluet, jvisser, lhh, pgrist, pmannidi, sputhenp, udesale
Target Milestone: z6
Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-os-brick-2.10.5-1.20201114041630.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-26 13:52:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1936314
Attachments: Nova compute log (flags: none)

Description Rajat Dhasmana 2021-03-18 15:34:54 UTC
The Fibre Channel connector is not flushing data on disconnect/detach, causing partial data to be written to the volume.
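
One way to spot the problem (and, later, the fix) is to look for the buffer flush call that os-brick should issue during detach. A minimal check, assuming DEBUG/privsep logging is enabled on the compute node and the usual containerized log location (the log path below is an assumption and may differ per deployment):

  # Assumed log path for a containerized compute node; adjust as needed.
  [root@compute-0 ~]# grep 'blockdev --flushbufs' /var/log/containers/nova/nova-compute.log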

Comment 9 Tzach Shefi 2021-05-13 07:28:47 UTC
Created attachment 1782639 [details]
Nova compute log

Verified on:
python3-os-brick-2.10.5-1.20201114041631.el8ost.noarch

On a 3PAR FC multipath system, I enabled privsep logs on the compute node.

On one of the compute nodes I intentionally switched back to
volume_use_multipath=false and restarted the nova compute container.
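
For reference, the multipath toggle lives under the [libvirt] section of nova.conf on the compute node. A minimal sketch of the change, assuming a containerized deployment (the config path and container name below are assumptions and may differ per deployment):

  # Hypothetical config path and container name; adjust to your deployment.
  [root@compute-1 ~]# crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf libvirt volume_use_multipath false
  [root@compute-1 ~]# podman restart nova_compute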

Then I booted an instance, instA, from a boot volume.
As expected, the instance booted up in single-path mode.

Then I reverted to volume_use_multipath=true and restarted the nova compute container,
after which I successfully hard rebooted the instance.
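
For reference, a hard reboot can be issued from the overcloud CLI like so (a minimal sketch, using the instance name from this test):

  (overcloud) [stack@seal41 ~]$ nova reboot --hard instA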

In the attached compute log we see the reboot begin:

  2021-05-09 14:12:42.276 7 INFO nova.compute.manager [req-b15f3264-7bc3-43ee-9519-f9fcfc5467fa b6d4233590524903bb53fb2c5d751b82 e61d0920577f4abb9254ee80a7fdb600 - default default] [instance: bf18e05e-234a-4dd9-9c99-e7465311f712] Rebooting instance

A little later os-brick starts disconnecting:

  2021-05-09 14:12:43.604 7 DEBUG os_brick.initiator.connectors.fibre_channel [req-b15f3264-7bc3-43ee-9519-f9fcfc5467fa b6d4233590524903bb53fb2c5d751b82 e61d0920577f4abb9254ee80a7fdb600 - default default] ==> disconnect_volume: ...

Then, after trying to find the multipath device (which it cannot find), it
finally gives up and just flushes the individual device /dev/sdb:

  2021-05-09 14:12:56.456 112 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): blockdev --flushbufs /dev/sdb execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:372
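
For reference, the same flush can be issued or re-checked by hand against the single-path device; a minimal sketch, assuming /dev/sdb is still the attached device on this node:

  [root@compute-1 ~]# blockdev --flushbufs /dev/sdb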


This confirms the bug is now resolved:
a compute node originally running in single-path mode was switched over to multipath mode, and the instance running on it survived a reboot without losing access to its boot volume.

Comment 10 Tzach Shefi 2021-05-13 07:55:55 UTC
Adding an extra bit of post-verification info:
I had also booted a second instance, instB, on the compute-1 node after re-enabling volume_use_multipath=true.

As expected, this instance booted up using multipath:
[root@compute-1 ~]# multipath -ll
360002ac0000000000000155300021f6b dm-0 3PARdata,VV
size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 16:0:1:1 sdc 8:32  active ready running
  |- 16:0:0:1 sdd 8:48  active ready running
  |- 16:0:2:1 sde 8:64  active ready running
  |- 7:0:1:1  sdf 8:80  active ready running
  |- 7:0:0:1  sdg 8:96  active ready running
  `- 7:0:2:1  sdh 8:112 active ready running

We don't see a DM device for instA, as its original connection info wasn't yet refreshed;
thus it remains in single-path mode.
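
A quick way to confirm this single-path state is to check that instA's volume is a plain SCSI disk with no dm-* (multipath) holder on top of it; a minimal sketch, assuming /dev/sdb is instA's boot volume on this node:

  # Expect TYPE "disk" with no dm-* child if the device is single-path.
  [root@compute-1 ~]# lsblk -o NAME,TYPE,SIZE /dev/sdb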

To refresh instA's connection info we must shelve and unshelve it with nova.
Doing so refreshes the connection info and moves instA to a multipath state.

(overcloud) [stack@seal41 ~]$ nova shelve instA
+--------------------------------------+-------+-------------------+------------+-------------+-----------------------+
| ID                                   | Name  | Status            | Task State | Power State | Networks              |
+--------------------------------------+-------+-------------------+------------+-------------+-----------------------+
| bf18e05e-234a-4dd9-9c99-e7465311f712 | instA | SHELVED_OFFLOADED | spawning   | Shutdown    | internal=192.168.0.16 |
| 1ad69879-450b-4b85-aed5-f44d7519805e | instB | ACTIVE            | -          | Running     | internal=192.168.0.20 |
+--------------------------------------+-------+-------------------+------------+-------------+-----------------------+

(overcloud) [stack@seal41 ~]$ nova unshelve instA
(overcloud) [stack@seal41 ~]$ nova list
+--------------------------------------+-------+--------+------------+-------------+-----------------------+
| ID                                   | Name  | Status | Task State | Power State | Networks              |
+--------------------------------------+-------+--------+------------+-------------+-----------------------+
| bf18e05e-234a-4dd9-9c99-e7465311f712 | instA | ACTIVE | -          | Running     | internal=192.168.0.16 |
| 1ad69879-450b-4b85-aed5-f44d7519805e | instB | ACTIVE | -          | Running     | internal=192.168.0.20 |
+--------------------------------------+-------+--------+------------+-------------+-----------------------+

instA is now also running with multipath; it just started up on (moved to) compute-0 due to the shelve operation.

instA:
[heat-admin@compute-0 ~]$ sudo -i
[root@compute-0 ~]# multipath -ll
May 13 07:38:31 | /etc/multipath.conf line 13, duplicate keyword: skip_kpartx
360002ac0000000000000155200021f6b dm-0 3PARdata,VV
size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 15:0:1:0 sdb 8:16 active ready running
  |- 15:0:2:0 sdc 8:32 active ready running
  |- 15:0:0:0 sdd 8:48 active ready running
  |- 16:0:0:0 sde 8:64 active ready running
  |- 16:0:2:0 sdf 8:80 active ready running
  `- 16:0:1:0 sdg 8:96 active ready running

instB remains on compute-1:
[root@compute-1 ~]# multipath -ll
360002ac0000000000000155300021f6b dm-0 3PARdata,VV
size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 16:0:1:1 sdc 8:32  active ready running
  |- 16:0:0:1 sdd 8:48  active ready running
  |- 16:0:2:1 sde 8:64  active ready running
  |- 7:0:1:1  sdf 8:80  active ready running
  |- 7:0:0:1  sdg 8:96  active ready running
  `- 7:0:2:1  sdh 8:112 active ready running
[root@compute-1 ~]# 

Thus, after switching compute-1 from single-path to multipath and shelving/unshelving instA,
both instances now use multipathing.

Comment 16 errata-xmlrpc 2021-05-26 13:52:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2097