Description of problem:

The following steps were performed:
1. All VMs were shut down except for the hosted engine.
2. Hosts were put into global maintenance.
3. The hosted engine VM was shut down via # hosted-engine --vm-shutdown
4. yum update was performed on both hosts.
5. Attempted to start the hosted engine to apply the patches, but # hosted-engine --vm-start failed and the problem was discovered.

I've used this same process on 15 different RHV instances to patch and upgrade from 4.0 -> 4.1 -> 4.2. This was the last RHV instance left to patch. This instance differs from the others in the following ways:
1. Different server model and CPU architecture (these two servers are AMD; the rest are Intel).
2. This server has remote NFS storage; the others have local NFS storage.
3. This server connects to a remote storage server via NFS over RDMA on an InfiniBand fabric. (Things can get a little complicated here, but the remote storage server is reachable and the NFS mount point for the hosted engine can be mounted manually.)

Additional info:

I can see the errors start here:

2018-12-28 04:23:37,459-0500 INFO (vm/de454dd5) [vdsm.api] START getVolumeSize(sdUUID='d355d3c0-c6eb-4a46-9dfc-c9a902ec3d46', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='c197ab9e-7384-478a-8775-5c52b883ce3a', volUUID='50446aee-a8a9-400b-b117-17b79b50a9fe', options=None) from=internal, task_id=abf04feb-11b4-4039-bbf3-a4840901eabb (api:46)

2018-12-28 04:23:37,632-0500 ERROR (vm/de454dd5) [storage.StorageDomainCache] Error while looking for domain `d355d3c0-c6eb-4a46-9dfc-c9a902ec3d46` (sdc:174)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in _findUnfetchedDomain
    return mod.findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1637, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1575, in findDomainPath
    vg = lvm.getVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 935, in getVG
    vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 581, in getVg
    vgs = self._reloadvgs(vgName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 389, in _reloadvgs
    pv_name = fields[pvNameIdx]
IndexError: list index out of range
Created attachment 1517228 [details] Agent log
Created attachment 1517229 [details] Broker log
Created attachment 1517230 [details] vdsm log where issue starts
Created attachment 1517231 [details] Latest vdsm log
Created attachment 1517245 [details] rpms from host
ovirt-ha-broker continuously restarts since it's unable to connect to the storage server:

MainThread::WARNING::2018-12-28 07:02:47,618::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Couldn't connect to VDSM within 60 seconds
MainThread::INFO::2018-12-28 07:02:50,009::broker::45::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.2.18 started

We have to investigate what can cause this.
Frank, can you please attach the output of pvs, vgs, and lvs executed on that host?
Frank, can you enable debug logs in vdsm, restart vdsm, and post the log with the failure in debug mode?

To enable debug logging, edit /etc/vdsm/logger.conf and change the log level to DEBUG for these loggers:
- root
- storage
- virt
- vds

This will show the actual pvs command that vdsm uses, and the response from lvm, which may explain what is wrong with the lvm output.
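For reference, a minimal sketch of what one of the affected logger.conf sections might look like after the change (section and handler names here are assumptions based on a typical vdsm logger.conf; only the level line needs to change, adjust to match the actual file):

====
[logger_storage]
level=DEBUG
handlers=logfile
qualname=storage
propagate=0
====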
(In reply to Frank DeLorey from comment #0)
> 2018-12-28 04:23:37,632-0500 ERROR (vm/de454dd5) [storage.StorageDomainCache] Error while looking for domain `d355d3c0-c6eb-4a46-9dfc-c9a902ec3d46` (sdc:174)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in _findUnfetchedDomain
>     return mod.findDomain(sdUUID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1637, in findDomain
>     return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1575, in findDomainPath
>     vg = lvm.getVG(sdUUID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 935, in getVG
>     vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 581, in getVg
>     vgs = self._reloadvgs(vgName)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 389, in _reloadvgs
>     pv_name = fields[pvNameIdx]
> IndexError: list index out of range

This means that we got unexpected output from pvs, so storage initialization failed, which causes vdsm to remain in the "not ready" state.
Frank, the attached patch will log the offending line that causes the IndexError. Can you get this patch applied on one of the hosts so we can get more info on this error?
Nir, I've provided the customer with a patched lvm.py file. I've tested it here, i.e. I restarted vdsm and activated the host. Regards, GFW.
The issue is caused by a custom configuration in pam.d/system-auth (auth-ac). With the Nir++ patch https://gerrit.ovirt.org/#/c/96505/ applied, we get the traceback below:

====
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in _findUnfetchedDomain
    return mod.findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1637, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1575, in findDomainPath
    vg = lvm.getVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 953, in getVG
    vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 599, in getVg
    vgs = self._reloadvgs(vgName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 406, in _reloadvgs
    raise InvalidLine("pv_name", line)
InvalidLine: Missing field 'pv_name' in line 'Last login: Fri Dec 28 21:41:33 EST 2018'
====

So we are getting the unexpected value "Last login: Fri Dec 28 21:41:33 EST 2018", which is "lastlog" output. Checking further, I can see that there is a custom configuration in "/etc/pam.d/system-auth":

====
session required pam_lastlog.so showfailed
====

The latest sudo rpm adds a new configuration line to /etc/pam.d/sudo (that's the reason the customer began to experience the issue after updating the server):

session include system-auth

So whenever a sudo command is executed, it also prints the "lastlog" line in its output, since the customer has the above configuration in system-auth. vdsm gets this unexpected value and fails while processing the vgs and pvs output.

With the same configuration in my test environment:

====
# su vdsm -s /bin/sh
sh-4.2$ /usr/bin/sudo -n /sbin/lvm vgs
Last login: Sat Dec 29 13:56:06 IST 2018   ===> lastlog output
  VG                                   #PV #LV #SN Attr   VSize   VFree
  ec909468-4a08-49fb-885b-f707792ce6f3   1  13   0 wz--n-  48.62g 31.25g
  rhel_lab-orac-11g-client               1   3   0 wz--n- <59.51g 64.00m
====

My "hsm/init" also fails with the same error after adding "pam_lastlog" to system-auth with the latest sudo package. I have asked the customer to remove the mentioned line from system-auth and restart the services.

Thanks, Nir, for the quick patch.
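For illustration, a minimal Python sketch of the kind of strict parsing the patch introduces (the field list and the exception name are simplified assumptions, not the actual vdsm code): pvs is run with a fixed separator and a fixed field list, so any line that does not split into the expected number of fields, such as the injected "Last login: ..." line, is rejected with a clear error instead of raising an IndexError deep inside field handling.

====
# Simplified sketch of strict pvs output parsing (not the actual vdsm code).
PV_FIELDS = ["pv_name", "pv_uuid", "vg_name"]  # assumed subset of the real field list
SEPARATOR = "|"

class InvalidOutputLine(Exception):
    def __init__(self, command, line):
        Exception.__init__(
            self, "Invalid %s command output line: %r" % (command, line))

def parse_pvs(output):
    # Parse output produced by e.g. "pvs --separator | -o pv_name,pv_uuid,vg_name".
    pvs = []
    for line in output.splitlines():
        fields = line.strip().split(SEPARATOR)
        if len(fields) != len(PV_FIELDS):
            # A "Last login: ..." line injected by pam_lastlog via sudo
            # does not split into the expected fields, so it is rejected
            # here instead of breaking field indexing later.
            raise InvalidOutputLine("pvs", line)
        pvs.append(dict(zip(PV_FIELDS, fields)))
    return pvs
====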
Thanks nijin! The fact that user configuration can change the output of any command is crazy. There is no way to safely use sudo in a management program in this way.

I think we don't have anything to do here except improving the log, and maybe filing a sudo bug; sudo should never change application output.

For 4.4 we can move the lvm command execution to supervdsm or another process running as root.
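As a rough sketch of that direction (illustrative only; the function name is an assumption, not the real supervdsm API): when the caller already runs as root, lvm can be executed directly, and since no sudo PAM session is opened, modules like pam_lastlog have no opportunity to inject lines into the output.

====
import os
import subprocess

def run_lvm(args):
    # Illustrative sketch, not the actual vdsm/supervdsm code.
    if os.geteuid() == 0:
        # Privileged process: run lvm directly. No sudo, no PAM session,
        # so pam_lastlog cannot pollute stdout.
        cmd = ["/sbin/lvm"] + args
    else:
        # Unprivileged path: sudo opens a PAM session whose modules may
        # write extra lines (e.g. "Last login: ...") into the output.
        cmd = ["/usr/bin/sudo", "-n", "/sbin/lvm"] + args
    return subprocess.check_output(cmd)
====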
How to test the patch:

1. echo "session required pam_lastlog.so showfailed" >> /etc/pam.d/system-auth
2. systemctl restart vdsmd

After restarting, vdsm will fail with:

InvalidOutputLine: Invalid pvs command ouptut line: 'Last login: Sat Dec 29 14:42:39 IST 2018'

Remove the added line from /etc/pam.d/system-auth to recover.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Verified on: vdsm-4.30.10-1.el7ev.x86_64

Steps:
1. On the host run:
   a. echo "session required pam_lastlog.so showfailed" >> /etc/pam.d/system-auth
   b. systemctl restart vdsmd
2. Check the vdsm log.

Results:
VDSM restarted and showed the following error:

2019-03-06 11:22:03,917+0200 ERROR (hsm/init) [storage.HSM] FINISH thread <Thread(hsm/init, started daemon 140708589577984)> failed (concurrent:202)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 407, in storageRefresh
    lvm.bootstrap(skiplvs=blockSD.SPECIAL_LVS_V4)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 747, in bootstrap
    _lvminfo.bootstrap()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 368, in bootstrap
    self._reloadpvs()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 393, in _reloadpvs
    raise InvalidOutputLine("pvs", line)
InvalidOutputLine: Invalid pvs command ouptut line: 'Last login: Wed Mar 6 11:22:03 IST 2019'
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1077