Bug 1511608 - host crash during vdsm-netupgrade leaves corrupted persisted networks
Summary: host crash during vdsm-netupgrade leaves corrupted persisted networks
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.2.2
Target Release: 4.20.18
Assignee: Edward Haas
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-11-09 17:05 UTC by Petr Matyáš
Modified: 2019-04-28 13:50 UTC (History)

Fixed In Version: vdsm v4.20.18
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-29 11:04:46 UTC
oVirt Team: Network
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
messages log (687.93 KB, text/plain)
2017-11-09 17:05 UTC, Petr Matyáš
/var/lib/vdsm content (186 bytes, application/x-gzip)
2017-11-13 08:08 UTC, Petr Matyáš


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 85050 | 0 | master | MERGED | net: Refactor atomic running config safe store | 2020-10-23 05:59:44 UTC
oVirt gerrit | 85051 | 0 | master | MERGED | net: Cleanup netconfpersistence from suffix '/' in constants | 2020-10-23 05:59:30 UTC
oVirt gerrit | 85052 | 0 | master | MERGED | net: Atomically persist the *config | 2020-10-23 05:59:30 UTC
oVirt gerrit | 86710 | 0 | master | MERGED | net: During atomic copytree, apply fsync on all files | 2020-10-23 05:59:30 UTC

Description Petr Matyáš 2017-11-09 17:05:06 UTC
Created attachment 1350033 [details]
messages log

Description of problem:
After crashing a host with 'echo c > /proc/sysrq-trigger' (which triggers kdump), VDSM fails to start with the following traceback:
Nov  9 18:28:51 aqua-vds7 vdsm-tool: Traceback (most recent call last):
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
Nov  9 18:28:51 aqua-vds7 vdsm-tool: return tool_command[cmd]["command"](*args)
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 45, in retore_nets_init
Nov  9 18:28:51 aqua-vds7 vdsm-tool: netrestore.init_nets()
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/netrestore.py", line 52, in init_nets
Nov  9 18:28:51 aqua-vds7 vdsm-tool: persistent_config = PersistentConfig()
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistence.py", line 211, in __init__
Nov  9 18:28:51 aqua-vds7 vdsm-tool: super(PersistentConfig, self).__init__(CONF_PERSIST_DIR)
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistence.py", line 121, in __init__
Nov  9 18:28:51 aqua-vds7 vdsm-tool: nets = self._getConfigs(self.networksPath)
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistence.py", line 172, in _getConfigs
Nov  9 18:28:51 aqua-vds7 vdsm-tool: networkEntities[fileName] = Config._getConfigDict(fullPath)
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistence.py", line 155, in _getConfigDict
Nov  9 18:28:51 aqua-vds7 vdsm-tool: return json.load(configurationFile)
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
Nov  9 18:28:51 aqua-vds7 vdsm-tool: **kw)
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
Nov  9 18:28:51 aqua-vds7 vdsm-tool: return _default_decoder.decode(s)
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
Nov  9 18:28:51 aqua-vds7 vdsm-tool: obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Nov  9 18:28:51 aqua-vds7 vdsm-tool: File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
Nov  9 18:28:51 aqua-vds7 vdsm-tool: raise ValueError("No JSON object could be decoded")
Nov  9 18:28:51 aqua-vds7 vdsm-tool: ValueError: No JSON object could be decoded

Version-Release number of selected component (if applicable):
vdsm-4.20.6-45.gitc8b15d5.el7.centos.x86_64

How reproducible:
always

Steps to Reproduce:
1. Set up power management with kdump on a host
2. run 'echo c > /proc/sysrq-trigger' on that host

Actual results:
After the reboot finishes, vdsm fails to start.

Expected results:
vdsm (and everything else) should start correctly

Additional info:

Comment 1 Dan Kenigsberg 2017-11-09 17:08:59 UTC
Please include the content of your /var/lib/vdsm directory

We don't have crash resistance, I'm afraid, but let us see what we can improve.

Comment 2 Red Hat Bugzilla Rules Engine 2017-11-09 17:09:04 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Petr Matyáš 2017-11-13 08:08:48 UTC
Created attachment 1351459 [details]
/var/lib/vdsm content

Comment 4 Edward Haas 2017-11-20 07:48:29 UTC
(In reply to Petr Matyáš from comment #3)
> Created attachment 1351459 [details]
> /var/lib/vdsm content

Unfortunately, this is unhelpful as the folders are empty.

Comment 5 Edward Haas 2017-11-20 08:02:31 UTC
VDSM, and VDSM networking specifically, is not crash resistant.
Some aspects of persisting data to files are performed atomically and some are not; therefore, the persisted configuration and running configuration files may get corrupted.
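
For illustration only, here is a minimal Python sketch of what writing a single JSON config file atomically could look like. It is not the VDSM code; `write_json_atomically` is a hypothetical helper name, and the sketch assumes a POSIX filesystem where renaming within one directory is atomic. The idea is that a reader sees either the complete old file or the complete new file, never a half-written one that fails in `json.load`.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    """Hypothetical helper: replace `path` with `data` serialized as JSON.

    The content is written to a temporary file in the same directory,
    flushed to disk, and only then renamed over the destination.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    # Sync the directory so the rename itself survives a crash.
    dir_fd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A plain `open(path, "w")` followed by `json.dump` truncates the file first, so a crash in between can leave an empty or partial file behind.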

Please suggest it as an RFE so we can consider it for 4.3

Comment 6 Dan Kenigsberg 2017-11-20 10:42:09 UTC
Petr, you have marked this as a Regression. Can you please say when it ever worked?

Comment 7 Petr Matyáš 2017-11-20 11:10:48 UTC
(In reply to Dan Kenigsberg from comment #6)
> Petr, you have marked this as a Regression. Can you please say when did it
> ever work?

This worked in 4.1, but I'm not sure about the specific version of the vdsm package; it should be working with the latest 4.1 though.

Comment 8 Dan Kenigsberg 2017-11-21 16:48:49 UTC
Can you try it, again, on 4.1?

Make sure that the state of the host (e.g. which networks are attached to it) is the same.

And most importantly, please include {super,}vdsm.log and engine.log from your environment.

Comment 9 Meni Yakove 2017-11-22 15:05:03 UTC
I just tried 'echo c > /proc/sysrq-trigger' on vdsm-4.20.7-37.gitfb0d1c3.el7.centos.x86_64, and after the host came up, vdsmd and supervdsmd were up as well.

Comment 10 Dan Kenigsberg 2017-11-22 15:45:25 UTC
Petr, we'd need to know what Vdsm has been doing while it was sysrq'ed. Also, please explain why this is an automation blocker. Do you have such destructive flows in automation? Do they fail 100%?

As this does not reproduce on my setups, in my opinion this should not block 4.2.0.

Comment 11 Petr Matyáš 2017-11-22 15:48:49 UTC
This is not a destructive scenario, this is a kdump scenario. It should work correctly and boot the host up after kdump is finished and start all services correctly.

I haven't gotten to testing this on 4.1, nor have I retested it on 4.2. Hopefully I'll have time for this tomorrow.

Comment 12 Petr Matyáš 2017-11-23 12:26:13 UTC
So testing on ovirt-engine-4.1.8-0.1.el7.noarch with a host having vdsm-4.19.38-1.el7ev.x86_64 this exact scenario works.

Now I'm going to reprovision the env to 4.2 and test again with the same host.

Comment 14 Petr Matyáš 2017-11-23 16:46:31 UTC
So after retest on ovirt-engine-4.2.0-0.5.master.el7.noarch with vdsm-4.20.7-1.el7ev.x86_64 this flow works correctly and vdsm is started after reboot.

Kdump flow is not reported as finished, but that is another bug.

Comment 15 Dan Kenigsberg 2017-11-29 09:31:55 UTC
We understand that the reported failure can occur if sysrq took place while the boot-time vdsm-netupgrade was running.

Comment 16 Edward Haas 2017-12-06 09:12:04 UTC
The network unified persistence files are now saved in an atomic manner using symlinks.
With the submitted change, both /var/lib/vdsm//netconf and /var/lib/vdsm/persistence/netconf are symlinks.
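
As a rough sketch of the symlink technique described above (not the merged VDSM patch), the following Python shows one way to publish a whole config directory atomically. `atomic_publish_dir` is a hypothetical name, and the sketch assumes the published path is either absent or already a symlink. Syncing the copied files to disk is left out here.

```python
import os
import shutil
import uuid


def atomic_publish_dir(src_dir, link_path):
    """Hypothetical helper: copy src_dir aside, then repoint link_path at it.

    Assumes link_path does not exist yet or is already a symlink.
    """
    link_path = os.path.abspath(link_path)
    new_dir = "%s.%s" % (link_path, uuid.uuid4().hex)
    shutil.copytree(src_dir, new_dir)

    old_dir = os.readlink(link_path) if os.path.islink(link_path) else None

    # Create the new symlink under a temporary name, then rename it over
    # the live one. The rename replaces the symlink itself in one step,
    # so readers always resolve to a complete directory.
    tmp_link = new_dir + ".link"
    os.symlink(new_dir, tmp_link)
    os.replace(tmp_link, link_path)

    # The previous copy is no longer referenced and can be dropped.
    if old_dir and old_dir != new_dir:
        shutil.rmtree(old_dir, ignore_errors=True)
    return new_dir
```

If the host crashes mid-copy, the symlink still points at the previous, complete directory; only a leftover partial copy needs cleaning up.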

Comment 17 Petr Matyáš 2017-12-11 15:32:04 UTC
The package version is not in the latest downstream build yet.

Comment 18 Gil Klein 2017-12-13 08:57:22 UTC
(In reply to Petr Matyáš from comment #17)
> The package version is not in latest downstream build yet.
Moving to ON_QA as we got ovirt-4.2.0-10 yesterday

Comment 19 Petr Matyáš 2017-12-13 12:16:57 UTC
Still not working on vdsm-4.20.9.1-1.el7ev.x86_64

Same traceback:
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: Traceback (most recent call last):
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/bin/vdsm-tool", line 219, in main
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: return tool_command[cmd]["command"](*args)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 45
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: netrestore.init_nets()
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib/python2.7/site-packages/vdsm/network/netrestore.py", l
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: persistent_config = PersistentConfig()
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistenc
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: super(PersistentConfig, self).__init__(CONF_PERSIST_DIR)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistenc
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: nets = self._getConfigs(self.networksPath)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistenc
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: networkEntities[fileName] = Config._getConfigDict(fullPath)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistenc
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: return json.load(configurationFile)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: **kw)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: return _default_decoder.decode(s)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: raise ValueError("No JSON object could be decoded")
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: ValueError: No JSON object could be decoded

Comment 20 Edward Haas 2017-12-24 09:41:08 UTC
(In reply to Petr Matyáš from comment #19)
> Still not working on vdsm-4.20.9.1-1.el7ev.x86_64
> 

That's because v4.20.9-9-gf10278730 is not included in that downstream version.

Comment 21 Edward Haas 2017-12-24 09:42:26 UTC
Please note that the BZ itself is an upstream one and targeted to 4.2.1.

Comment 23 Petr Matyáš 2018-01-18 09:46:54 UTC
Verified on vdsm-4.20.13-1.el7ev.x86_64

Comment 24 Petr Matyáš 2018-01-18 14:15:17 UTC
I checked the wrong host, the issue is still present in vdsm-4.20.13-1.el7ev.x86_64

Comment 25 Petr Matyáš 2018-01-18 14:17:29 UTC
Actually the fix is already included in that version.

Comment 26 Petr Matyáš 2018-01-22 14:16:04 UTC
Still seeing
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: Traceback (most recent call last):
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/bin/vdsm-tool", line 219, in main
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: return tool_command[cmd]["command"](*args)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 4
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: netrestore.init_nets()
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib/python2.7/site-packages/vdsm/network/netrestore.py",
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: persistent_config = PersistentConfig()
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersisten
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: super(PersistentConfig, self).__init__(CONF_PERSIST_DIR)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersisten
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: nets = self._getConfigs(self.networksPath)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersisten
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: networkEntities[fileName] = Config._getConfigDict(fullPath)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersisten
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: return json.load(configurationFile)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: **kw)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: return _default_decoder.decode(s)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: raise ValueError("No JSON object could be decoded")
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: ValueError: No JSON object could be decoded

Using vdsm-4.20.14-1.el7ev.x86_64

Comment 28 Dan Kenigsberg 2018-01-29 20:35:32 UTC
This time, we're syncing the changes to disk before moving to the new configuration.
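
A minimal sketch of that sync step, assuming Linux semantics where `fsync` is accepted on read-only file and directory descriptors. `fsync_tree` is a hypothetical name and not necessarily how the patch is written; the point is that every file in the freshly copied tree is flushed before the configuration is switched over to it.

```python
import os


def _fsync_path(path):
    # Open read-only just to obtain a descriptor that fsync can act on.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def fsync_tree(root):
    """Hypothetical helper: fsync every file and directory under root.

    Returns the number of regular entries (files) that were synced.
    """
    synced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            _fsync_path(os.path.join(dirpath, name))
            synced += 1
        # Sync the directory too, so its entries are durable as well.
        _fsync_path(dirpath)
    return synced
```

Without this step, a rename or symlink switch can reach the disk before the file contents do, which would leave the new configuration pointing at empty files after a crash.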

Comment 29 Petr Matyáš 2018-02-19 16:27:27 UTC
Verified on vdsm-4.20.18-1.el7ev.x86_64

Comment 30 Sandro Bonazzola 2018-03-29 11:04:46 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

