Created attachment 1350033 [details]
messages log

Description of problem:
After stopping a host with 'echo c > /proc/sysrq-trigger' (triggers kdump) VDSM can't be started with:

Nov 9 18:28:51 aqua-vds7 vdsm-tool: Traceback (most recent call last):
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/bin/vdsm-tool", line 219, in main
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     return tool_command[cmd]["command"](*args)
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 45, in retore_nets_init
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     netrestore.init_nets()
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib/python2.7/site-packages/vdsm/network/netrestore.py", line 52, in init_nets
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     persistent_config = PersistentConfig()
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistence.py", line 211, in __init__
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     super(PersistentConfig, self).__init__(CONF_PERSIST_DIR)
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistence.py", line 121, in __init__
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     nets = self._getConfigs(self.networksPath)
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistence.py", line 172, in _getConfigs
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     networkEntities[fileName] = Config._getConfigDict(fullPath)
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistence.py", line 155, in _getConfigDict
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     return json.load(configurationFile)
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     **kw)
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     return _default_decoder.decode(s)
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Nov 9 18:28:51 aqua-vds7 vdsm-tool:   File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
Nov 9 18:28:51 aqua-vds7 vdsm-tool:     raise ValueError("No JSON object could be decoded")
Nov 9 18:28:51 aqua-vds7 vdsm-tool: ValueError: No JSON object could be decoded

Version-Release number of selected component (if applicable):
vdsm-4.20.6-45.gitc8b15d5.el7.centos.x86_64

How reproducible:
always

Steps to Reproduce:
1. setup power management with kdump on a host
2. run 'echo c > /proc/sysrq-trigger' on that host

Actual results:
after reboot finishes vdsm fails to start

Expected results:
vdsm (and everything else) should start correctly

Additional info:
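For context, this ValueError is what Python 2's json module raises for an empty or truncated input, which is consistent with a network config file that was only partially written when the host crashed. A small illustration (hypothetical helper, not vdsm code; Python 3's json.JSONDecodeError is a ValueError subclass, so the same handling applies there):

```python
import json

def load_net_config(text):
    """Parse a persisted network config body; an interrupted write can
    leave a zero-length or truncated file, which json rejects with
    ValueError (on Python 2: "No JSON object could be decoded")."""
    try:
        return json.loads(text)
    except ValueError:
        return None  # treat a corrupted/empty file as "no config"

print(load_net_config(""))              # empty file body -> None
print(load_net_config('{"nets": {}}'))  # a complete snapshot parses fine
```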
Please include the content of your /var/lib/vdsm directory.

We don't have crash resistance, I'm afraid, but let us see what we can improve.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Created attachment 1351459 [details]
/var/lib/vdsm content
(In reply to Petr Matyáš from comment #3)
> Created attachment 1351459 [details]
> /var/lib/vdsm content

Unfortunately, this is unhelpful, as the folders are empty.
VDSM, and VDSM networking specifically, is not crash resistant. Some aspects of persisting data to files are performed atomically and some are not; therefore, configuration and running files may get corrupted. Please file this as an RFE so we can consider it for 4.3.
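For illustration, an atomic single-file write looks roughly like this (a minimal sketch with a hypothetical helper name, not vdsm's actual code): write to a temporary file in the same directory, then rename it over the target, so a crash leaves either the complete old file or the complete new one, never a truncated mix.

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Persist `data` as JSON without ever exposing a half-written file.
    rename() within one filesystem is atomic on POSIX, so readers see
    either the old content or the new content."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=dir_name, prefix=".netconf-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.rename(tmp, path)  # atomic replacement of the target
    except BaseException:
        os.unlink(tmp)        # clean up the partial temp file
        raise
```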
Petr, you have marked this as a Regression. Can you please say when it ever worked?
(In reply to Dan Kenigsberg from comment #6)
> Petr, you have marked this as a Regression. Can you please say when did it
> ever work?

This worked in 4.1, but I'm not sure about the specific version of the vdsm package; it should be working with the latest 4.1, though.
Can you try it again on 4.1? Make sure that the state of the host (e.g. which networks are attached to it) is the same. And most importantly, please include {super,}vdsm.log and engine.log from your environment.
I just tried 'echo c > /proc/sysrq-trigger' on vdsm-4.20.7-37.gitfb0d1c3.el7.centos.x86_64, and after the host was up, vdsmd and supervdsmd were up as well.
Petr, we'd need to know what Vdsm was doing while it was sysrq'ed. Also, please explain why this is an automation blocker. Do you have such destructive flows in automation? Do they fail 100% of the time? As this does not reproduce on my setups, in my opinion this should not block 4.2.0.
This is not a destructive scenario, this is a kdump scenario. It should work correctly: after kdump finishes, the host should boot up and start all services correctly. I didn't get to test this on 4.1, nor did I retest it on 4.2; I'll hopefully have time for this tomorrow.
So, testing on ovirt-engine-4.1.8-0.1.el7.noarch with a host running vdsm-4.19.38-1.el7ev.x86_64, this exact scenario works. Now I'm going to reprovision the env to 4.2 and test again with the same host.
So, after retesting on ovirt-engine-4.2.0-0.5.master.el7.noarch with vdsm-4.20.7-1.el7ev.x86_64, this flow works correctly and vdsm is started after reboot. The kdump flow is not reported as finished, but that is another bug.
We understand that the reported failure can occur if the sysrq took place while the boot-time vdsm-netupgrade was running.
The network unified persistence files are now saved in an atomic manner using symlinks. With the submitted change, both /var/lib/vdsm//netconf and /var/lib/vdsm/persistence/netconf are symlinks.
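The symlink scheme described above can be sketched roughly like this (hypothetical helper and paths; the actual vdsm patch may differ): each configuration snapshot lives in its own directory, and the netconf symlink is repointed to the new snapshot with an atomic rename, so it always resolves to a complete snapshot.

```python
import os

def switch_config_symlink(link_path, new_target):
    """Atomically repoint link_path at new_target. Creating the new
    pointer under a temporary name and renaming it over the existing
    symlink is atomic on POSIX, so readers see either the old snapshot
    or the new one, never a partially written state."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)        # discard a stale temp pointer
    os.symlink(new_target, tmp_link)
    os.rename(tmp_link, link_path)  # atomic switch to the new snapshot
```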
The package version is not in latest downstream build yet.
(In reply to Petr Matyáš from comment #17)
> The package version is not in latest downstream build yet.

Moving to ON_QA, as we got ovirt-4.2.0-10 yesterday.
Still not working on vdsm-4.20.9.1-1.el7ev.x86_64

Same traceback:

Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: Traceback (most recent call last):
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/bin/vdsm-tool", line 219, in main
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     return tool_command[cmd]["command"](*args)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 45
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     netrestore.init_nets()
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netrestore.py", l
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     persistent_config = PersistentConfig()
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistenc
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     super(PersistentConfig, self).__init__(CONF_PERSIST_DIR)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistenc
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     nets = self._getConfigs(self.networksPath)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistenc
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     networkEntities[fileName] = Config._getConfigDict(fullPath)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersistenc
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     return json.load(configurationFile)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     **kw)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     return _default_decoder.decode(s)
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:   File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]:     raise ValueError("No JSON object could be decoded")
Dec 13 14:15:40 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[6370]: ValueError: No JSON object could be decoded
(In reply to Petr Matyáš from comment #19)
> Still not working on vdsm-4.20.9.1-1.el7ev.x86_64

That's because v4.20.9-9-gf10278730 is not included in that downstream version.
Please note that the BZ itself is an upstream one and targeted to 4.2.1.
Verified on vdsm-4.20.13-1.el7ev.x86_64
I checked the wrong host, the issue is still present in vdsm-4.20.13-1.el7ev.x86_64
Actually the fix is already included in that version.
Still seeing:

Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: Traceback (most recent call last):
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/bin/vdsm-tool", line 219, in main
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     return tool_command[cmd]["command"](*args)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 4
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     netrestore.init_nets()
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netrestore.py",
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     persistent_config = PersistentConfig()
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersisten
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     super(PersistentConfig, self).__init__(CONF_PERSIST_DIR)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersisten
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     nets = self._getConfigs(self.networksPath)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersisten
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     networkEntities[fileName] = Config._getConfigDict(fullPath)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib/python2.7/site-packages/vdsm/network/netconfpersisten
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     return json.load(configurationFile)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     **kw)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     return _default_decoder.decode(s)
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:   File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]:     raise ValueError("No JSON object could be decoded")
Jan 22 16:14:53 aqua-vds7.qa.lab.tlv.redhat.com vdsm-tool[15535]: ValueError: No JSON object could be decoded

Using vdsm-4.20.14-1.el7ev.x86_64
This time, we're syncing the changes to disk before moving to the new configuration.
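A rough sketch of that sync step (hypothetical helper, not the actual patch): flush every file of the freshly written snapshot, and the directory entries themselves, to stable storage before the switch makes it current.

```python
import os

def fsync_tree(dirpath):
    """Force the file data and directory entries of a new config
    snapshot to disk. Without this, a crash right after the switch
    could expose files whose contents never reached stable storage."""
    for root, _dirs, files in os.walk(dirpath):
        for name in files:
            fd = os.open(os.path.join(root, name), os.O_RDONLY)
            try:
                os.fsync(fd)          # flush the file's data
            finally:
                os.close(fd)
        dfd = os.open(root, os.O_RDONLY)
        try:
            os.fsync(dfd)             # persist the directory entries too
        finally:
            os.close(dfd)
```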
Verified on vdsm-4.20.18-1.el7ev.x86_64
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.