Bug 1570562

Summary: vdsm is dead after upgrade to vdsm-4.20.26-1.el7ev.x86_64
Product: [oVirt] vdsm
Reporter: Michael Burman <mburman>
Component: Core
Assignee: Martin Perina <mperina>
Status: CLOSED CURRENTRELEASE
QA Contact: Michael Burman <mburman>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.20.19
CC: bugs, dfediuck, mburman, mperina, pkliczew
Target Milestone: ovirt-4.2.5
Keywords: Regression
Target Release: ---
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: vdsm-4.20.35
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-31 15:31:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1597179
Bug Blocks:
Attachments:
  logs (flags: none)
  failedQA logs (flags: none)

Description Michael Burman 2018-04-23 09:02:06 UTC
Created attachment 1425592 [details]
logs

Description of problem:
vdsm is dead after upgrade to vdsm-4.20.26-1.el7ev.x86_64

It looks like we again have an issue with the vdsm upgrade to the latest build, vdsm-4.20.26-1.el7ev.x86_64:

Apr 23 11:43:23 red-vds2 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Apr 23 11:43:23 red-vds2 systemd: Stopping Auxiliary vdsm service for running helper functions as root...
Apr 23 11:43:23 red-vds2 systemd: Stopping Virtual Desktop Server Manager network restoration...
Apr 23 11:43:23 red-vds2 systemd: Stopping Virtualization daemon...
Apr 23 11:43:23 red-vds2 daemonAdapter: Traceback (most recent call last):
Apr 23 11:43:23 red-vds2 daemonAdapter: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
Apr 23 11:43:23 red-vds2 daemonAdapter: finalizer()
Apr 23 11:43:23 red-vds2 daemonAdapter: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
Apr 23 11:43:23 red-vds2 daemonAdapter: res = self._callback(*self._args, **self._kwargs)
Apr 23 11:43:23 red-vds2 daemonAdapter: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
Apr 23 11:43:23 red-vds2 systemd: Starting Virtualization daemon...
Apr 23 11:43:23 red-vds2 systemd: Started Auxiliary vdsm service for running helper functions as root.
Apr 23 11:43:23 red-vds2 systemd: Starting Auxiliary vdsm service for running helper functions as root...
Apr 23 11:43:24 red-vds2 systemd: Started Virtualization daemon.
Apr 23 11:43:24 red-vds2 systemd: Starting Virtual Desktop Server Manager network restoration...
Apr 23 11:43:24 red-vds2 systemd: Stopping Auxiliary vdsm service for running helper functions as root...
Apr 23 11:43:24 red-vds2 daemonAdapter: Traceback (most recent call last):
Apr 23 11:43:24 red-vds2 daemonAdapter: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
Apr 23 11:43:24 red-vds2 daemonAdapter: finalizer()
Apr 23 11:43:24 red-vds2 daemonAdapter: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
Apr 23 11:43:24 red-vds2 daemonAdapter: res = self._callback(*self._args, **self._kwargs)
Apr 23 11:43:24 red-vds2 daemonAdapter: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
Apr 23 11:43:25 red-vds2 systemd: Stopped Auxiliary vdsm service for running helper functions as root.

- I don't think it's the same issue as BZ 1557735

Version-Release number of selected component (if applicable):
vdsm-4.20.26-1.el7ev.x86_64

How reproducible:
Intermittent, roughly 60%-70% of upgrade attempts

Steps to Reproduce:
1. Upgrade vdsm to vdsm-4.20.26-1.el7ev.x86_64

Actual results:
vdsm is dead 

Expected results:
vdsm should run after upgrade
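
For reference, a minimal shell sketch of how the failed state can be captured right after such an upgrade (standard systemd/coreutils tooling; the unit names vdsmd/supervdsmd and the socket path are taken from the log above, the rest is assumed):

# Check whether vdsmd/supervdsmd survived the upgrade and whether the
# supervdsm socket mentioned in the traceback actually exists.
systemctl status vdsmd supervdsmd --no-pager
ls -l /var/run/vdsm/svdsm.sock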

Comment 1 Piotr Kliczewski 2018-04-25 10:27:34 UTC
I attempted to reproduce it by upgrading from 4.20.23-1 to 4.20.26-4.git6b485e4 and noticed a similar error while stopping:

Apr 25 06:13:21 localhost systemd: Stopped Virtual Desktop Server Manager.
Apr 25 06:13:21 localhost systemd: Stopping Auxiliary vdsm service for running helper functions as root...
Apr 25 06:13:21 localhost daemonAdapter: Traceback (most recent call last):
Apr 25 06:13:21 localhost daemonAdapter: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
Apr 25 06:13:21 localhost daemonAdapter: finalizer()
Apr 25 06:13:21 localhost daemonAdapter: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
Apr 25 06:13:21 localhost daemonAdapter: res = self._callback(*self._args, **self._kwargs)
Apr 25 06:13:21 localhost daemonAdapter: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'

but later I see:

Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running mkdirs
Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running configure_coredump
Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running configure_vdsm_logs
Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running wait_for_network
Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running run_init_hooks
Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running check_is_configured
Apr 25 06:13:24 localhost vdsmd_init_common.sh: abrt is already configured for vdsm
Apr 25 06:13:24 localhost vdsmd_init_common.sh: lvm is configured for vdsm
Apr 25 06:13:24 localhost vdsmd_init_common.sh: libvirt is already configured for vdsm
Apr 25 06:13:24 localhost vdsmd_init_common.sh: Current revision of multipath.conf detected, preserving
Apr 25 06:13:24 localhost vdsmd_init_common.sh: vdsm: Running validate_configuration
Apr 25 06:13:24 localhost vdsmd_init_common.sh: SUCCESS: ssl configured to true. No conflicts
Apr 25 06:13:24 localhost vdsmd_init_common.sh: vdsm: Running prepare_transient_repository
Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running syslog_available
Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running nwfilter
Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running dummybr
Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running tune_system
Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running test_space
Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running test_lo
Apr 25 06:13:25 localhost systemd: Started Virtual Desktop Server Manager.
Apr 25 06:13:25 localhost systemd: Started MOM instance configured for VDSM purposes.
Apr 25 06:13:25 localhost systemd: Starting MOM instance configured for VDSM purposes...

When I activated this host it was marked as 'UP'. Can you please retest on a different host and use 4.20.26-4 as the final vdsm version?

Comment 2 Piotr Kliczewski 2018-04-25 10:58:29 UTC
Additional observations:

In your case I see that vdsm was started, which triggered a supervdsm start. Supervdsm started, but for some reason it was restarted, and vdsm ended up being stopped.

I see that you are using CentOS 7.5 whereas I used 7.4. Based on the logs I do not see any issue with our code. Let's focus on OS differences.
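
To compare the two setups, a quick shell sketch (standard RHEL/CentOS tooling; the exact package list is an assumption) run on both hosts:

# OS release, kernel, and versions of the packages involved.
cat /etc/redhat-release
uname -r
rpm -q vdsm libvirt-daemon libvirt-client systemd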

Comment 3 Michael Burman 2018-04-25 11:41:40 UTC
(In reply to Piotr Kliczewski from comment #1)
> I attempted to reproduce it where I upgraded from 4.20.23-1 to
> 4.20.26-4.git6b485e4 and noticed similar error during stopping:
> 
> Apr 25 06:13:21 localhost systemd: Stopped Virtual Desktop Server Manager.
> Apr 25 06:13:21 localhost systemd: Stopping Auxiliary vdsm service for
> running helper functions as root...
> Apr 25 06:13:21 localhost daemonAdapter: Traceback (most recent call last):
> Apr 25 06:13:21 localhost daemonAdapter: File
> "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
> Apr 25 06:13:21 localhost daemonAdapter: finalizer()
> Apr 25 06:13:21 localhost daemonAdapter: File
> "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
> Apr 25 06:13:21 localhost daemonAdapter: res = self._callback(*self._args,
> **self._kwargs)
> Apr 25 06:13:21 localhost daemonAdapter: OSError: [Errno 2] No such file or
> directory: '/var/run/vdsm/svdsm.sock'
> 
> but later I see:
> 
> Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running mkdirs
> Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running
> configure_coredump
> Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running
> configure_vdsm_logs
> Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running
> wait_for_network
> Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running run_init_hooks
> Apr 25 06:13:23 localhost vdsmd_init_common.sh: vdsm: Running
> check_is_configured
> Apr 25 06:13:24 localhost vdsmd_init_common.sh: abrt is already configured
> for vdsm
> Apr 25 06:13:24 localhost vdsmd_init_common.sh: lvm is configured for vdsm
> Apr 25 06:13:24 localhost vdsmd_init_common.sh: libvirt is already
> configured for vdsm
> Apr 25 06:13:24 localhost vdsmd_init_common.sh: Current revision of
> multipath.conf detected, preserving
> Apr 25 06:13:24 localhost vdsmd_init_common.sh: vdsm: Running
> validate_configuration
> Apr 25 06:13:24 localhost vdsmd_init_common.sh: SUCCESS: ssl configured to
> true. No conflicts
> Apr 25 06:13:24 localhost vdsmd_init_common.sh: vdsm: Running
> prepare_transient_repository
> Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running
> syslog_available
> Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running nwfilter
> Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running dummybr
> Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running tune_system
> Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running test_space
> Apr 25 06:13:25 localhost vdsmd_init_common.sh: vdsm: Running test_lo
> Apr 25 06:13:25 localhost systemd: Started Virtual Desktop Server Manager.
> Apr 25 06:13:25 localhost systemd: Started MOM instance configured for VDSM
> purposes.
> Apr 25 06:13:25 localhost systemd: Starting MOM instance configured for VDSM
> purposes...
> 
> When I activated this host it was marked as 'UP'. Can you please try to
> retest it on different host and use 4.20.26-4 as final vdsm version.

Reproduced on 4 different hosts

Comment 4 Michael Burman 2018-04-25 11:43:39 UTC
(In reply to Piotr Kliczewski from comment #2)
> Additional observations:
> 
> In your case I see that vdsm was started which triggered supervdsm start.
> Supervdsm started but for some reason it was restarted and vdsm ended up
> being stopped.
> 
> I see that you are using centos 7.5 whereas I used 7.4. based on the logs I
> do not see any issue with our code. Let's focus on OS differences.

I use RHEL 7.5 with the latest kernel, 3.10.0-862.el7.x86_64

Comment 5 Piotr Kliczewski 2018-04-25 15:34:13 UTC
I tested the vdsm upgrade from 4.20.23-1 to 4.20.26-4 four times by running:
- install vdsm
- add host
- set maintenance to the host
- enable repo with newer version
- run yum update
- activate the host
- set maintenance to the host
- remove host
- remove vdsm from the host
- disable newer repo

I used a RHEL 7.5 VM (fresh install) with the 3.10.0-862.el7.x86_64 kernel. All four upgrades were successful. You tested an upgrade to 4.20.26-1 and I tested to 4.20.26-4. Please check whether you can reproduce it with the newer vdsm version.
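
A rough shell sketch of the host-side part of that flow (the engine-side steps such as add host, set maintenance, and activate are done from the UI/API; the repo id "ovirt-4.2-snapshot" is a placeholder):

# With the host in Maintenance in the engine:
yum install -y vdsm                              # install the older version
yum-config-manager --enable ovirt-4.2-snapshot   # enable the repo with the newer build
yum update -y vdsm                               # upgrade vdsm
# ... activate the host, verify it comes UP, then put it back into Maintenance ...
yum remove -y vdsm                               # remove vdsm from the host
yum-config-manager --disable ovirt-4.2-snapshot  # disable the newer repo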

Comment 6 Michael Burman 2018-04-26 08:58:27 UTC
(In reply to Piotr Kliczewski from comment #5)
> I tested vdsm upgrade from 4.20.23-1 to 4.20.26-4 four times by running:
> - install vdsm
> - add host
> - set maintenance to the host
> - enable repo with newer version
> - run yum update
> - activate the host
> - set maintenance to the host
> - remove host
> - remove vdsm from the host
> - disable newer repo
> 
> I used rhel75 vm (fresh install) with 3.10.0-862.el7.x86_64 kernel. All four
> upgrades were successful. You tested with upgrade to 4.20.26-1 and I tested
> to 4.20.26-4. Please check whether you will be able to reproduce with newer
> vdsm version.

Hi
What is this version? QE has only 4.20.26-1 available; is it a master build?
I can test with the next d/s build we get.

Comment 7 Piotr Kliczewski 2018-04-26 09:15:05 UTC
(In reply to Michael Burman from comment #6)
> Hi
> What is this version? qe has only 4.20.26-1 available, is it master build?
> I can test with the next d/s build we will get.

It is the 4.2 snapshot repo [1]. From my test env I have no access to the repo you are using. Please use a newer vdsm build and let me know the result.

[1] http://resources.ovirt.org/pub/ovirt-4.2-snapshot/rpm/el7/noarch/

Comment 8 Michael Burman 2018-04-26 09:23:17 UTC
(In reply to Piotr Kliczewski from comment #7)
> (In reply to Michael Burman from comment #6)
> > Hi
> > What is this version? qe has only 4.20.26-1 available, is it master build?
> > I can test with the next d/s build we will get.
> 
> It is 4.2 snapshot [1] repo. From my test env I have no access to the repo
> you are using. Please use newer vdsm build and let me know the result.
> 
> [1] http://resources.ovirt.org/pub/ovirt-4.2-snapshot/rpm/el7/noarch/

Piotr, this is an upstream vdsm; I don't think this is a good test. I can try it, but upgrading from d/s vdsm to u/s vdsm is not a good test.

Comment 9 Michael Burman 2018-04-26 09:35:27 UTC
(In reply to Michael Burman from comment #8)
> (In reply to Piotr Kliczewski from comment #7)
> > (In reply to Michael Burman from comment #6)
> > > Hi
> > > What is this version? qe has only 4.20.26-1 available, is it master build?
> > > I can test with the next d/s build we will get.
> > 
> > It is 4.2 snapshot [1] repo. From my test env I have no access to the repo
> > you are using. Please use newer vdsm build and let me know the result.
> > 
> > [1] http://resources.ovirt.org/pub/ovirt-4.2-snapshot/rpm/el7/noarch/
> 
> Piotr, this is an upstream vdsm, i don't this is a good test. I can try it,
> but testing d/s vdsm to u/s vdsm is not good test.

I get dependency issues with gluster packages; master vdsm requires higher gluster versions.

Comment 10 Martin Perina 2018-04-26 11:24:48 UTC
(In reply to Michael Burman from comment #8)
> (In reply to Piotr Kliczewski from comment #7)
> > (In reply to Michael Burman from comment #6)
> > > Hi
> > > What is this version? qe has only 4.20.26-1 available, is it master build?
> > > I can test with the next d/s build we will get.
> > 
> > It is 4.2 snapshot [1] repo. From my test env I have no access to the repo
> > you are using. Please use newer vdsm build and let me know the result.
> > 
> > [1] http://resources.ovirt.org/pub/ovirt-4.2-snapshot/rpm/el7/noarch/
> 
> Piotr, this is an upstream vdsm, i don't this is a good test. I can try it,
> but testing d/s vdsm to u/s vdsm is not good test.

Michael, you should receive vdsm-4.20.27-1.el7ev in today's 4.2.3 compose.

Comment 11 Michael Burman 2018-04-26 11:41:32 UTC
(In reply to Martin Perina from comment #10)
> (In reply to Michael Burman from comment #8)
> > (In reply to Piotr Kliczewski from comment #7)
> > > (In reply to Michael Burman from comment #6)
> > > > Hi
> > > > What is this version? qe has only 4.20.26-1 available, is it master build?
> > > > I can test with the next d/s build we will get.
> > > 
> > > It is 4.2 snapshot [1] repo. From my test env I have no access to the repo
> > > you are using. Please use newer vdsm build and let me know the result.
> > > 
> > > [1] http://resources.ovirt.org/pub/ovirt-4.2-snapshot/rpm/el7/noarch/
> > 
> > Piotr, this is an upstream vdsm, i don't this is a good test. I can try it,
> > but testing d/s vdsm to u/s vdsm is not good test.
> 
> Michael, you should receive either vdsm-4.20.27-1.el7ev in today's 4.2.3
> compose

Great, I will test it then with the new vdsm build for QE. Thanks Martin

Comment 12 Michael Burman 2018-04-26 12:38:19 UTC
Piotr, please note that the latest update we had, 4.20.26-1 (4.2.3-2), also included libvirt packages, which may be related to this issue.

Comment 13 Michael Burman 2018-04-26 13:08:32 UTC
The bug is easily reproduced when updating vdsm and libvirt at once, which was the case for QE on the latest build.
I just ran yum history undo on the latest update (vdsm + libvirt), updated again, and it happened.

libvirt-daemon-3.9.0-14.el7_5.3.x86_64
libvirt-client-3.9.0-14.el7_5.3.x86_64
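
For anyone reproducing this, the undo-and-redo step looks roughly like the following (the transaction ID 42 is hypothetical; yum history list shows the real one):

# Find the transaction that updated vdsm and libvirt together.
yum history list vdsm
# Roll it back, then apply the same combined update again to trigger the failure.
yum history undo 42
yum update vdsm libvirt-daemon libvirt-client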

Comment 14 Piotr Kliczewski 2018-04-26 13:16:41 UTC
(In reply to Michael Burman from comment #13)
> The bug easily reproduced if updating the vdsm and libvirt at once, which
> was the case for qe on the latest build. 
> I just did yum history undo to the latest update(vdsm + libvirt) and did
> update again and it happened.  
> 

What if you update only vdsm?

> libvirt-daemon-3.9.0-14.el7_5.3.x86_64
> libvirt-client-3.9.0-14.el7_5.3.x86_64

Comment 15 Piotr Kliczewski 2018-04-26 13:19:54 UTC
And what if you update only libvirt?

Comment 16 Martin Perina 2018-04-26 13:22:10 UTC
(In reply to Michael Burman from comment #13)
> The bug easily reproduced if updating the vdsm and libvirt at once, which
> was the case for qe on the latest build. 
> I just did yum history undo to the latest update(vdsm + libvirt) and did
> update again and it happened.  
> 
> libvirt-daemon-3.9.0-14.el7_5.3.x86_64
> libvirt-client-3.9.0-14.el7_5.3.x86_64

Michael, when you mentioned libvirt, are you sure that you performed the update while the host was in Maintenance?

Comment 17 Michael Burman 2018-04-26 13:44:50 UTC
If only vdsm or only libvirt is updated, it seems to be OK.
As for maintenance: it doesn't matter, in both cases vdsm is dead. I usually do it while the host is up, but I tested with it in maintenance as well. vdsm should stay alive in both cases.
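
In other words, the three cases tried here were roughly (a sketch; the actual package sets may differ):

# Update only vdsm - OK
yum update vdsm
# Update only libvirt - OK
yum update 'libvirt*'
# Update both in a single transaction - vdsm ends up dead
yum update vdsm 'libvirt*'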

Comment 18 Piotr Kliczewski 2018-04-26 14:21:56 UTC
Would it be possible to test [1]? It is a fix which was created for different issue but it could potentially solve this one as well.


[1] https://gerrit.ovirt.org/#/c/89446/

Comment 19 Martin Perina 2018-04-26 14:38:14 UTC
(In reply to Michael Burman from comment #17)
> If only vdsm or libvirt, seems to be ok.
> It doesn't matter, in both cases vdsm is dead. I usually do it when it's up,
> but tested with maintenance as well. vdsm should stay alive in both cases.

Well, from the customer point of view it does matter, because host upgrades performed while the host is not in maintenance are not supported [1]. But yes, if it is reproducible while the host is in Maintenance, then we need to resolve it.


[1] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/upgrade_guide/manually_updating_virtualization_hosts

Comment 20 Michael Burman 2018-04-26 14:51:25 UTC
(In reply to Piotr Kliczewski from comment #18)
> Would it be possible to test [1]? It is a fix which was created for
> different issue but it could potentially solve this one as well.
> 
> 
> [1] https://gerrit.ovirt.org/#/c/89446/

- Piotr, I need a d/s RPM to test this potential fix.

- Martin, yes, I know and understand that from the customer point of view it does matter and is not supported, but vdsm should be kept alive even when performing an update while the host is UP.
As part of our new effort in QE to test manual tier 4, such tests are important, and I always prefer to push the system to its edges rather than only to what is supported for customers.
I personally have multiple environments and I always perform updates in 3 ways:
1) While the host is in maintenance
2) While the host is UP (not on the SPM host)
3) While the host is UP and has at least 1 VM running (not on the SPM host)
This way I usually see bugs (or potential bugs) others don't see.
Anyhow, this specific bug was indeed reproduced on a host in maintenance.

Comment 21 Martin Perina 2018-04-27 08:26:10 UTC
Please retest with RHV 4.2.3-4, which contains vdsm-4.20.27-1.el7ev.x86_64.rpm

Comment 22 Michael Burman 2018-04-28 09:59:41 UTC
Same result.
Upgraded from vdsm-4.20.25-1.el7ev.x86_64 -> vdsm-4.20.27-1.el7ev.x86_64;
vdsm is dead with the same error:

Apr 28 12:43:50 red-vds4.qa.lab.tlv.redhat.com daemonAdapter[14863]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
Apr 28 12:43:50 red-vds4.qa.lab.tlv.redhat.com systemd[1]: Stopped Auxiliary vdsm service for running helper functions as root.

This reproduced when updating vdsm+libvirt in one shot:
libvirt-3.9.0-14.el7_5.2 -> 3.9.0-14.el7_5.3
vdsm-4.20.25-1.el7ev.x86_64 -> vdsm-4.20.27-1.el7ev.x86_64

This bug also reproduced when running yum history undo on both vdsm+libvirt;
vdsm is dead with the same error.

Comment 23 Michael Burman 2018-04-28 10:02:10 UTC
Created attachment 1428017 [details]
failedQA logs

Comment 24 Martin Perina 2018-05-01 21:23:02 UTC
After an offline discussion, removing the blocker flag and retargeting to 4.2.4.

Comment 25 Red Hat Bugzilla Rules Engine 2018-05-01 21:23:08 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 26 Piotr Kliczewski 2018-06-28 13:10:37 UTC
This seems to be a systemd bug and I would like to open one. Please provide systemd logs from when the issue occurs so we can give the systemd developers enough information to analyze it.
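
A minimal sketch of how those logs could be collected right after the failure (the unit list and time window are assumptions, adjust as needed):

# Journal for the involved units around the upgrade, plus systemd's own view of
# the unit state, saved to files that can be attached for the systemd developers.
journalctl -u vdsmd -u supervdsmd -u libvirtd --since "-1h" -o short-precise > vdsm-journal.txt
systemctl status vdsmd supervdsmd --no-pager -l > vdsm-unit-status.txt
systemd-analyze dump > systemd-state.txt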

Comment 31 Michael Burman 2018-07-23 15:14:30 UTC
Verified on vdsm-4.20.35-1.el7ev.x86_64.

Upgraded libvirt-3.9.0-14.el7_5.2 -> libvirt-3.9.0-14.el7_5.6.x86_64
Upgraded vdsm-4.20.25-1.el7ev -> vdsm-4.20.35-1.el7ev.x86_64
vdsm is alive after the upgrade.

Comment 32 Sandro Bonazzola 2018-07-31 15:31:33 UTC
This Bugzilla is included in the oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.