Bug 1662449

Summary: Bad error handling when running lvm commands in vdsm
Product: Red Hat Enterprise Virtualization Manager
Reporter: Frank DeLorey <fdelorey>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA
QA Contact: Liran Rotenberg <lrotenbe>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.2.7
CC: fdelorey, gwatson, jcoscia, lsurette, nashok, nsoffer, rbarry, rdlugyhe, srevivo, stirabos, tnisan, vpagar, ycui
Target Milestone: ovirt-4.3.1
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
If a user with an invalid sudo configuration uses sudo to run commands, sudo appends a "last login" message to the command output. When this happens, VDSM fails to run lvm commands. Previously, the VDSM log did not contain helpful information about what caused those failures. The current release improves error handling in the VDSM code running lvm commands. Now, if VDSM fails, an error message clearly states that there was invalid output from the lvm commands, and shows the output added by sudo. Although this change does not fix the root cause, an invalid sudo configuration, it makes it easier to understand the issue.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-05-08 12:36:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1664147, 1666215, 1666603    
Bug Blocks:    

Description Frank DeLorey 2018-12-28 15:44:26 UTC
Description of problem:

The following steps were performed:

1. All VMs were shut down except for the hosted engine.
2. Hosts were put into global maintenance.
3. The hosted engine VM was shut down via #hosted-engine --vm-shutdown
4. yum update was performed on both hosts.
5. Attempted to start the hosted engine to apply patches, but #hosted-engine
--vm-start failed and the problem was discovered.

I've used this same process on 15 different RHV instances to patch and
upgrade from 4.0 -> 4.1 -> 4.2. This was the last RHV instance to perform
patching on. This instance is different from the others in the following
ways:
1. Different server model and CPU architecture (these two servers are AMD,
the rest are Intel).
2. This server has remote NFS storage; the others have local NFS storage.
3. This server connects to a remote storage server via RDMA over NFS over
an InfiniBand fabric. (Things can get a little complicated here, but the
remote storage server is reachable and the NFS mount point for HE can be
mounted manually.)


Additional Info:

I can see the errors start here:

2018-12-28 04:23:37,459-0500 INFO  (vm/de454dd5) [vdsm.api] START getVolumeSize(sdUUID='d355d3c0-c6eb-4a46-9dfc-c9a902ec3d46', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='c197ab9e-7384-478a-8775-5c52b883ce3a', volUUID='50446aee-a8a9-400b-b117-17b79b50a9fe', options=None) from=internal, task_id=abf04feb-11b4-4039-bbf3-a4840901eabb (api:46)
2018-12-28 04:23:37,632-0500 ERROR (vm/de454dd5) [storage.StorageDomainCache] Error while looking for domain `d355d3c0-c6eb-4a46-9dfc-c9a902ec3d46` (sdc:174)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in _findUnfetchedDomain
    return mod.findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1637, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1575, in findDomainPath
    vg = lvm.getVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 935, in getVG
    vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 581, in getVg
    vgs = self._reloadvgs(vgName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 389, in _reloadvgs
    pv_name = fields[pvNameIdx]
IndexError: list index out of range

Comment 2 Frank DeLorey 2018-12-28 15:45:33 UTC
Created attachment 1517228 [details]
Agent log

Comment 3 Frank DeLorey 2018-12-28 15:46:20 UTC
Created attachment 1517229 [details]
Broker log

Comment 4 Frank DeLorey 2018-12-28 15:47:25 UTC
Created attachment 1517230 [details]
vdsm log where issue starts

Comment 5 Frank DeLorey 2018-12-28 15:48:05 UTC
Created attachment 1517231 [details]
Latest vdsm log

Comment 6 Frank DeLorey 2018-12-28 16:41:48 UTC
Created attachment 1517245 [details]
rpms from host

Comment 7 Simone Tiraboschi 2018-12-28 16:48:52 UTC
ovirt-ha-broker continuously restarts since it's unable to connect to the storage server:

MainThread::WARNING::2018-12-28 07:02:47,618::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Couldn't  connect to VDSM within 60 seconds 
MainThread::INFO::2018-12-28 07:02:50,009::broker::45::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.2.18 started

We have to investigate what can cause it.

Comment 8 Simone Tiraboschi 2018-12-28 16:53:50 UTC
Frank, can you please attach the output of
  pvs
  vgs
  lvs

executed on that host?

Comment 13 Nir Soffer 2018-12-28 22:46:16 UTC
Frank, can you enable debug logs in vdsm, restart vdsm, and post a log with the failure
in debug mode?

To enable debug logging, edit /etc/vdsm/logger.conf and change the log level to DEBUG
for these loggers:
- root
- storage
- virt
- vds

This will show the actual pvs command that vdsm uses, and the response from lvm, which
may explain what is wrong with the lvm output.
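
A rough sketch of what the edited stanzas in /etc/vdsm/logger.conf could look like. The file uses Python's standard fileConfig (ini) format; the section names below are assumed to match the logger names above, and only the level lines change (keep the existing handlers/qualname/propagate keys as they are):

    # assumed section names; only "level" changes
    [logger_root]
    level=DEBUG

    [logger_storage]
    level=DEBUG

    [logger_virt]
    level=DEBUG

    [logger_vds]
    level=DEBUG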

Comment 14 Nir Soffer 2018-12-28 22:56:39 UTC
(In reply to Frank DeLorey from comment #0)
> 2018-12-28 04:23:37,632-0500 ERROR (vm/de454dd5)
> [storage.StorageDomainCache] Error while looking for domain
> `d355d3c0-c6eb-4a46-9dfc-c9a902ec3d46` (sdc:174)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in
> _findUnfetchedDomain
>     return mod.findDomain(sdUUID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line
> 1637, in findDomain
>     return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line
> 1575, in findDomainPath
>     vg = lvm.getVG(sdUUID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 935, in
> getVG
>     vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 581, in
> getVg
>     vgs = self._reloadvgs(vgName)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 389, in
> _reloadvgs
>     pv_name = fields[pvNameIdx]
> IndexError: list index out of range

This means that we got unexpected output from pvs, and storage initialization
failed, which causes vdsm to remain in a "not ready" state.
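
For illustration, a minimal sketch (not the actual vdsm parser) of why a stray line in the pvs output leads to the IndexError in the traceback above: the output is split on a field separator, and a line such as the sudo "Last login" banner has no separators, so a positional field lookup runs past the end of the list.

    # Minimal sketch, assuming a "|" separator and an illustrative field list;
    # not the real vdsm code.
    SEPARATOR = "|"
    PV_FIELDS = ["pv_uuid", "pv_name", "vg_name", "pv_size"]
    PV_NAME_IDX = PV_FIELDS.index("pv_name")

    def parse_pvs_output(out):
        for line in out.splitlines():
            fields = line.strip().split(SEPARATOR)
            # A "Last login: ..." banner contains no separator, so fields has
            # a single element and fields[PV_NAME_IDX] raises IndexError.
            print(fields[PV_NAME_IDX])

    demo = ("Last login: Fri Dec 28 21:41:33 EST 2018\n"
            "uuid-1|/dev/sdb|d355d3c0-c6eb-4a46-9dfc-c9a902ec3d46|50.00g")
    parse_pvs_output(demo)  # raises IndexError: list index out of range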

Comment 15 Nir Soffer 2018-12-29 00:13:25 UTC
Frank, the attached patch will log the offending line that causes the IndexError.

Can you get this patch applied on one of the hosts so we can get more info
on this error?

Comment 16 Gordon Watson 2018-12-29 01:48:47 UTC
Nir,

I've provided the customer with a patched lvm.py file. I've tested it here, i.e. restarted vdsm and activated the host.

Regards, GFW.

Comment 18 nijin ashok 2018-12-29 08:55:23 UTC
The issue is caused by a custom configuration in pam.d/auth-ac. With the Nir++ patch https://gerrit.ovirt.org/#/c/96505/, we are getting the traceback below.

====
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in _findUnfetchedDomain
    return mod.findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1637, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1575, in findDomainPath
    vg = lvm.getVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 953, in getVG
    vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 599, in getVg
    vgs = self._reloadvgs(vgName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 406, in _reloadvgs
    raise InvalidLine("pv_name", line)
InvalidLine: Missing field 'pv_name' in line 'Last login: Fri Dec 28 21:41:33 EST 2018'
===

So we are getting the unexpected value "Last login: Fri Dec 28 21:41:33 EST 2018", which is "lastlog" output. Checking further, I can see that there is a custom configuration in "/etc/pam.d/system-auth":

====
session     required      pam_lastlog.so showfailed
====

The latest sudo rpm adds a new configuration in /etc/pam.d/sudo (that is the reason the customer began to experience the issue after updating the server):

session    include      system-auth

So whenever a sudo command is executed, it also prints the "lastlog" line in the output, since the customer has the above configuration in system-auth. VDSM gets this unexpected line and fails while processing the vgs and pvs output (see the sketch below).
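
For illustration, a minimal sketch of the kind of guard the Nir++ patch adds (illustrative only, not the actual vdsm code): validate the field count before indexing and include the offending line in the error, so unexpected output like the sudo banner shows up clearly in the log instead of a bare IndexError.

    # Minimal sketch, assuming a "|" separator; names are illustrative.
    class InvalidOutputLine(Exception):
        def __init__(self, command, line):
            Exception.__init__(
                self, "Invalid %s command output line: %r" % (command, line))

    def parse_fields(command, line, expected_fields, separator="|"):
        fields = line.strip().split(separator)
        if len(fields) != expected_fields:
            # e.g. 'Last login: Fri Dec 28 21:41:33 EST 2018' injected by
            # sudo/pam_lastlog has a single field and is rejected here.
            raise InvalidOutputLine(command, line)
        return fields

    # parse_fields("pvs", "Last login: Fri Dec 28 21:41:33 EST 2018", 4)
    # raises: InvalidOutputLine: Invalid pvs command output line: 'Last login: ...'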

I reproduced it with the same configuration in my test environment:

====
# su vdsm -s /bin/sh
sh-4.2$ /usr/bin/sudo -n /sbin/lvm vgs
Last login: Sat Dec 29 13:56:06 IST 2018                                     ===> last log output
  VG                                   #PV #LV #SN Attr   VSize   VFree 
  ec909468-4a08-49fb-885b-f707792ce6f3   1  13   0 wz--n-  48.62g 31.25g
  rhel_lab-orac-11g-client               1   3   0 wz--n- <59.51g 64.00m

====

My "hsm/init" also fails with the same error after adding the "pam_lastlog" in system-auth with latest sudo package. I have asked the customer to remove the mentioned line from system-auth and restart the  services.

Thanks, Nir for the quick patch.

Comment 19 Nir Soffer 2018-12-29 10:08:01 UTC
Thanks nijin!

The fact that user configuration can change the output of any command
is crazy. There is no way to use sudo in a management program in this way.

I think we don't have anything to do here except improving the log, and
maybe filing a sudo bug. sudo should never change application output.

For 4.4 we can move the lvm command execution to supervdsm or another
process running as root.

Comment 20 Nir Soffer 2018-12-29 12:56:04 UTC
How to test the patch:

   echo "session     required      pam_lastlog.so showfailed" >> /etc/pam.d/system-auth
   systemctl restart vdsmd

After restarting vdsm it will fail with:

    InvalidOutputLine: Invalid pvs command ouptut line: 'Last login: Sat Dec 29 14:42:39 IST 2018'

Remove the added line from /etc/pam.d/system-auth to recover.

Comment 21 Sandro Bonazzola 2019-01-28 09:44:42 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 23 Liran Rotenberg 2019-03-06 09:28:32 UTC
Verified on:
vdsm-4.30.10-1.el7ev.x86_64

Steps:
1. On the host run: 
    a. echo "session     required      pam_lastlog.so showfailed" >> /etc/pam.d/system-auth
    b. systemctl restart vdsmd

2. Check vdsm log

Results:
VDSM restarted and showed the following error:

2019-03-06 11:22:03,917+0200 ERROR (hsm/init) [storage.HSM] FINISH thread <Thread(hsm/init, started daemon 140708589577984)> failed (concurrent:202)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 407, in storageRefresh
    lvm.bootstrap(skiplvs=blockSD.SPECIAL_LVS_V4)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 747, in bootstrap
    _lvminfo.bootstrap()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 368, in bootstrap
    self._reloadpvs()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 393, in _reloadpvs
    raise InvalidOutputLine("pvs", line)
InvalidOutputLine: Invalid pvs command ouptut line: 'Last login: Wed Mar  6 11:22:03 IST 2019'

Comment 26 errata-xmlrpc 2019-05-08 12:36:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077