Bug 1308962 - Adding an additional 3.5 host on a system that was initially deployed with HE from 3.4 fails with 'Setup validation': cannot marshal None unless allow_none is enabled
Summary: Adding an additional 3.5 host on a system that was initially deployed with HE...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.16.32
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Simone Tiraboschi
QA Contact: Aharon Canan
URL:
Whiteboard: integration
Depends On:
Blocks:
 
Reported: 2016-02-16 15:08 UTC by Vladimir Vasilev
Modified: 2019-12-16 05:27 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-01 16:56:17 UTC
oVirt Team: Integration
Embargoed:
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?



Description Vladimir Vasilev 2016-02-16 15:08:40 UTC
Description of problem:
I'm trying to deploy an additional node to a self-hosted engine and I'm getting this error:
[ ERROR ] Failed to execute stage 'Setup validation': cannot marshal None unless allow_none is enabled

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.2.6.1-1.el6ev.noarch
ovirt-host-deploy-1.3.2-1.el6ev.noarch
vdsm-4.16.32-1.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. On freshly installed RHEL6 install ovirt-hosted-engine-setup
2. Run hosted-engine --deploy
3. Enter the NFS share for RHEVM, scp the answer file from the 1st node
4. At the end, enter the password for admin@internal

Actual results:
Setup fails with:
[ ERROR ] Failed to execute stage 'Setup validation': cannot marshal None unless allow_none is enabled

Expected results:
Additional node is added to hosted-engine


Additional info:
In the generated ovirt-hosted-engine-setup-XXXXX.log file I see:
2016-02-13 14:15:59 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 270, in _validation
    imgVolUUID[1],
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1483, in __request
    allow_none=self.__allow_none)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1132, in dumps
    data = m.dumps(params)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 677, in dumps
    dump(v, write)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 699, in __dump
    f(self, value, write)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 703, in dump_nil
    raise TypeError, "cannot marshal None unless allow_none is enabled"
TypeError: cannot marshal None unless allow_none is enabled
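For reference, this TypeError comes straight from Python's XML-RPC marshaller: any None argument in a call is rejected unless the proxy was created with allow_none enabled. A minimal sketch reproducing it with the same xmlrpclib module (Python 2, as on these hosts; the argument values are just placeholders):

import xmlrpclib

# Hypothetical prepareImage arguments; the last one is None, as when a
# UUID is missing from the setup environment.
params = ('pool-uuid', 'domain-uuid', 'image-uuid', None)

try:
    xmlrpclib.dumps(params, methodname='prepareImage')
except TypeError as e:
    print(e)  # cannot marshal None unless allow_none is enabled

# With allow_none=1 the marshaller encodes None as <nil/> instead of failing.
xmlrpclib.dumps(params, methodname='prepareImage', allow_none=1)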

In vdsm.log I see:
Thread-37::DEBUG::2016-02-13 14:15:59,192::task::993::Storage.TaskManager.Task::(_decref) Task=`57da5167-6c6f-438a-b19c-af2a3b332d9d`::ref 1 aborting False
Thread-37::ERROR::2016-02-13 14:15:59,192::task::866::Storage.TaskManager.Task::(_setError) Task=`57da5167-6c6f-438a-b19c-af2a3b332d9d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 621, in spmStop
    pool.stopSpm()
  File "/usr/share/vdsm/storage/securable.py", line 75, in wrapper
    raise SecureError("Secured object is not in safe state")
SecureError: Secured object is not in safe state

As suggested by a colleague I re-generated /etc/vdsm/vdsm.id (it was missing) with:
dmidecode -s system-uuid > /etc/vdsm/vdsm.id
but it didn't help.

Comment 6 Sandro Bonazzola 2016-02-17 07:30:33 UTC
Looks like an issue on the vdsm side, related to the system UUID.

Comment 13 Yaniv Bronhaim 2016-02-22 09:24:50 UTC
Sandro, why do you think it's an issue with the host UUID?

MainProcess|Thread-14::DEBUG::2016-02-13 14:13:25,644::supervdsmServer::109::SuperVdsm.ServerCallback::(wrapper) return getHardwareInfo with {'systemProductName': 'IBM eServer BladeCenter HS21 -[7995G3U]-', 'systemUUID': 'c0e9acde-1782-b601-4c39-00215e233eb4', 'systemSerialNumber': '99K8813', 'systemManufacturer': 'IBM'}

dmidecode works fine. From the description I don't understand whether something is missing on the host and whether it's reproducible - we don't always use /etc/vdsm/vdsm.id, and if it doesn't exist it shouldn't block the add-host flow.

From the log it seems like something suspicious happened with the SPM:

Thread-37::ERROR::2016-02-13 14:15:59,192::task::866::Storage.TaskManager.Task::(_setError) Task=`57da5167-6c6f-438a-b19c-af2a3b332d9d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 621, in spmStop
    pool.stopSpm()
  File "/usr/share/vdsm/storage/securable.py", line 75, in wrapper
    raise SecureError("Secured object is not in safe state")
SecureError: Secured object is not in safe state


We need a reproduction and a fuller log, if possible.

Comment 16 Vladimir Vasilev 2016-02-23 09:23:12 UTC
After checking the logs and source code again, we (thanks Adam) found that validateStorageDomain always returns None.

We have this block of code:

# prepareImage to populate /var/run/vdsm/storage
for imgVolUUID in [
    [
        self.environment[ohostedcons.StorageEnv.IMG_UUID],
        self.environment[ohostedcons.StorageEnv.VOL_UUID]
    ],
    [
        self.environment[ohostedcons.StorageEnv.METADATA_IMAGE_UUID],
        self.environment[ohostedcons.StorageEnv.METADATA_VOLUME_UUID]
    ],
    [
        self.environment[ohostedcons.StorageEnv.LOCKSPACE_IMAGE_UUID],
        self.environment[ohostedcons.StorageEnv.LOCKSPACE_VOLUME_UUID]
    ],
]:
    self.cli.prepareImage(
        self.environment[
            ohostedcons.StorageEnv.SP_UUID
        ],
        self.environment[
            ohostedcons.StorageEnv.SD_UUID
        ],
        imgVolUUID[0],
        imgVolUUID[1],
    )

And at least one of those UUIDs (I cannot tell which) is None, which results in an invalid prepareImage call. From the vdsm log I can see that exactly one prepareImage call succeeded:

Thread-36::INFO::2016-02-22 08:47:13,471::logUtils::44::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='88a8314c-8174-4c88-a931-eb8d577300f1', spUUID='0a824c8e-3582-4925-b67f-c6a78a864aeb', imgUUID='6ca64c01-e758-4256-96ae-bfc5c4e8bfcb', leafUUID='0551c43d-ecb9-4ce9-b37c-9b1f4089533e')

This probably means that ohostedcons.StorageEnv.METADATA_IMAGE_UUID
and ohostedcons.StorageEnv.METADATA_VOLUME_UUID are not being handled
properly.
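To make the failure mode concrete, here is an illustrative guard (not the actual ovirt-hosted-engine-setup code): checking each image/volume UUID pair for None before calling prepareImage would turn the opaque marshalling error into a readable one. In this sketch `environment` stands in for the setup's environment dict and `cli` for the vdsm client used in the snippet above.

def prepare_images(cli, environment, sp_uuid_key, sd_uuid_key, uuid_pairs):
    # uuid_pairs: list of (image-UUID key, volume-UUID key) tuples, e.g. the
    # metadata and lockspace keys from ohostedcons.StorageEnv.
    for img_key, vol_key in uuid_pairs:
        img_uuid = environment[img_key]
        vol_uuid = environment[vol_key]
        if img_uuid is None or vol_uuid is None:
            # Fail early instead of letting xmlrpclib choke on a None argument.
            raise RuntimeError(
                'Missing %s or %s in the answer file' % (img_key, vol_key))
        cli.prepareImage(
            environment[sp_uuid_key],
            environment[sd_uuid_key],
            img_uuid,
            vol_uuid,
        )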

Comment 17 Simone Tiraboschi 2016-02-29 16:49:09 UTC
The issue was that some values were missing in the answer file it downloaded from the first host (robocop01.rhev.stage.mwc.hst.phx2.redhat.com):

2016-02-22 08:46:26 INFO otopi.plugins.ovirt_hosted_engine_setup.core.remote_answerfile remote_answerfile._fetch_answer_file:180 Answer file successfully downloaded
...
2016-02-22 08:46:26 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.remote_answerfile remote_answerfile._parse_answer_file:189 OVEHOSTED_STORAGE/lockspaceImageUUID=None
2016-02-22 08:46:26 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.remote_answerfile remote_answerfile._parse_answer_file:189 OVEHOSTED_STORAGE/iSCSILunId=None
2016-02-22 08:46:26 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.remote_answerfile remote_answerfile._parse_answer_file:189 OVEHOSTED_STORAGE/metadataImageUUID=None
2016-02-22 08:46:26 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.remote_answerfile remote_answerfile._parse_answer_file:189 OVEHOSTED_STORAGE/imgAlias=None

Vladimir, could you please check what you have in /etc/ovirt-hosted-engine/hosted-engine.conf on your first host and attach the hosted-engine-setup logs from that host here?

Comment 18 Oved Ourfali 2016-02-29 17:05:50 UTC
Based on that, moving back to integration.

Comment 21 Vladimir Vasilev 2016-02-29 17:11:52 UTC
hosted-engine.conf and answers.conf attached

I see they're empty in hosted-engine.conf:
metadata_volume_UUID=None
metadata_image_UUID=None
lockspace_volume_UUID=None
lockspace_image_UUID=None

And for some reason I don't have them set at all in production, only staging.

Do I need to put in correct values for them, or just remove them altogether?

Comment 22 Simone Tiraboschi 2016-02-29 17:17:48 UTC
Correct values for sure; you can use vdsClient to scan that storage domain and get those values. Feel free to ask for help.

Now the question is just why they are empty in that answerfile.

Can you please attach the logs from the latest execution of hosted-engine-setup on robocop01?

Comment 24 Vladimir Vasilev 2016-02-29 20:13:13 UTC
Setup logs from robocop01 attached.

I see that this system was set up at the beginning of 2015, and hosted-engine at that time didn't have these entries in the answer files or in hosted-engine.conf. I just compared with production and it's the same case.
Hypervisors that were installed a year ago don't have entries for the mentioned UUIDs in either conf file, and the new hypervisors that were added after updating hosted-engine have them set to None.

If I redeploy a brand new cluster I won't have this problem. It looks like this happens only when you copy an answer file generated by an older version of hosted-engine.

How do I find the correct UUIDs?
When I run:
vdsClient -s 0 getStorageDomainInfo
for the rhevm domain I'm getting only the uuid (and no vguuid or lockspace).
Or do I just need the uuid part from getStorageDomainInfo?

Comment 25 Simone Tiraboschi 2016-02-29 20:53:34 UTC
No, they were already there; the issue is just that ovirt-hosted-engine-setup didn't successfully complete on robocop01 when you deployed it on 2015-05-18:

2015-05-18 09:26:35 INFO otopi.plugins.ovirt_hosted_engine_setup.core.answerfile answerfile._save_answers:52 Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150518092635.conf'
2015-05-18 09:26:35 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/core/answerfile.py", line 128, in _save_answers_at_cleanup
    self._save_answers(name)
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/core/answerfile.py", line 73, in _save_answers
    if self.environment[ohostedcons.CoreEnv.NODE_SETUP]:
KeyError: 'OVEHOSTED_CORE/nodeSetup'
2015-05-18 09:26:35 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Clean up': 'OVEHOSTED_CORE/nodeSetup'

Adding an additional host when the first wasn't correctly deployed isn't a supported path.

You need to run getVolumesList to get the list of all the volumes and then getVolumeInfo on each of them until you identify all the UUIDs of the lockspace and metadata images and volumes.

Depending on where hosted-engine-setup halted when you deployed on robocop01, those volumes may be missing altogether.
Keeping the system after hosted-engine-setup failed wasn't a good idea.
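To follow that procedure, something along these lines can be used; it is only a sketch. getImagesList is an extra step not mentioned above (getVolumeInfo also needs an image UUID), and the exact vdsClient verbs, argument order and output format should be double-checked with vdsClient --help on your version (Python 2, as on these hosts):

import subprocess

SD_UUID = 'your-storage-domain-uuid'   # placeholder, from getStorageDomainInfo
SP_UUID = 'your-storage-pool-uuid'     # placeholder

def vds(*args):
    # Thin wrapper around the vdsClient CLI.
    proc = subprocess.Popen(['vdsClient', '-s', '0'] + list(args),
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out

# Dump getVolumeInfo for every volume in the domain so the lockspace and
# metadata image/volume UUIDs can be picked out by inspection.
for img_uuid in vds('getImagesList', SD_UUID).split():
    for vol_uuid in vds('getVolumesList', SD_UUID, SP_UUID, img_uuid).split():
        print('image %s / volume %s' % (img_uuid, vol_uuid))
        print(vds('getVolumeInfo', SD_UUID, SP_UUID, img_uuid, vol_uuid))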

Comment 26 Simone Tiraboschi 2016-03-01 09:08:10 UTC
Reopening after better understanding the issue.
Those values weren't present in the answer file at 3.4 time.
Upgrading existing hosts from 3.4 to 3.5 and from there to 3.6 is not an issue (excluding the case of 3.5 on el6 -> 3.5 on el7 -> 3.6 on el7, since there we ask the user to redeploy the host with 3.5 after reinstalling it with el7).

The issue arises only when adding an additional 3.5 host to a system that was initially deployed with 3.4 (so NFS only, since iSCSI was introduced only with 3.5).

Probably the easiest solution is to document how to manually gather those values and add them to the answer file.

Comment 27 Simone Tiraboschi 2016-03-01 16:56:17 UTC
Issue:
metadata and lockspace volumes were not in use on a 3.4 hosted-engine host, and the 3.4 -> 3.5 upgrade doesn't touch the existing answer file,
while deploying an additional 3.5 host (or redeploying an existing host with el7 instead of el6) requires a value != None (an empty string does the job!).

Workaround:
manually edit /etc/ovirt-hosted-engine/answers.conf on the host you are going to download the answer file from and add:

OVEHOSTED_STORAGE/metadataImageUUID=str:
OVEHOSTED_STORAGE/metadataVolumeUUID=str:
OVEHOSTED_STORAGE/lockspaceImageUUID=str:
OVEHOSTED_STORAGE/lockspaceVolumeUUID=str:

Please make sure there is no whitespace after 'str:'.

Deploy the additional 3.5 host as usual, making sure to use that answer file.
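A minimal sketch of that manual edit (run it on the host you will download the answer file from; it only appends the keys that are not already present, with nothing after 'str:'):

path = '/etc/ovirt-hosted-engine/answers.conf'
keys = [
    'OVEHOSTED_STORAGE/metadataImageUUID',
    'OVEHOSTED_STORAGE/metadataVolumeUUID',
    'OVEHOSTED_STORAGE/lockspaceImageUUID',
    'OVEHOSTED_STORAGE/lockspaceVolumeUUID',
]

with open(path) as f:
    existing = f.read()

with open(path, 'a') as f:
    for key in keys:
        if key not in existing:
            # 'str:' with an empty value and no trailing whitespace.
            f.write('%s=str:\n' % key)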

Comment 28 Nikolai Sednev 2016-03-02 10:09:05 UTC
(In reply to Simone Tiraboschi from comment #27)

Works well for me. I did as follows:
1) Reprovisioned one of my hosts to el7.2/3.5.
2) Added the 3.5 repositories.
3) yum install ovirt-hosted-engine-setup -y
4) yum update -y
5) vi /etc/ovirt-hosted-engine/answers.conf
6) Added:
OVEHOSTED_STORAGE/metadataImageUUID=str:
OVEHOSTED_STORAGE/metadataVolumeUUID=str:
OVEHOSTED_STORAGE/lockspaceImageUUID=str:
OVEHOSTED_STORAGE/lockspaceVolumeUUID=str:
7) Avoided any whitespace after 'str:'.
8) Saved the file.
9) Ran hosted-engine --deploy.
10) Redeployed the host, taking the answer file from another host.

Comment 29 Nikolai Sednev 2016-03-02 10:55:32 UTC
It worked the same way on Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160219.0.el7ev).

