Bug 1254185

Summary: Failed to deploy HE over RHEVH6.7 | [ ERROR ] Failed to execute stage 'Misc configuration': Failed to persist network configuration. Error code: "16" message: "Unexpected exception"
Product: Red Hat Enterprise Virtualization Manager
Reporter: Nikolai Sednev <nsednev>
Component: vdsm
Assignee: Yeela Kaplan <ykaplan>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 3.5.4
CC: bazulay, bkorren, bmcclain, danken, dkenigsb, dougsland, ecohen, fdeutsch, gklein, ibarkan, istein, lpeer, lsurette, nsednev, oourfali, pstehlik, sbonazzo, ycui, yeylon, ylavi
Target Milestone: ---
Target Release: 3.5.4
Keywords: Triaged, Unconfirmed, ZStream
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version: vdsm-4.16.26
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-09-06 17:09:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1250199
Attachments:
  alma04.log (flags: none)
  ovirt-hosted-engine-setup-20150817110321-g67var.log (flags: none)
  logs Aug 17 (flags: none)
  figure1 (flags: none)
  figure2 (flags: none)
  logs (flags: none)

Description Nikolai Sednev 2015-08-17 11:46:09 UTC
Description of problem:
Failed to deploy HE over RHEV-H 6.7 on NFS storage, with the error described below:

[ ERROR ] Failed to execute stage 'Misc configuration': Failed to persist network configuration. Error code: "16" message: "Unexpected exception"

Taken from /var/log/messages:

Aug 17 11:07:29 alma04 vdsm vds ERROR unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper#012  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 562, in setSafeNetworkConfig#012  File "/usr/share/vdsm/API.py", line 1541, in setSafeNetworkConfig#012  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__#012  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>#012  File "<string>", line 2, in setSafeNetworkConfig#012  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 722, in _callmethod#012  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 709, in _connect#012  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 143, in Client#012  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 270, in SocketClient#012  File "<string>", line 1, in connect#012error: [Errno 2] No such file or directory
Aug 17 11:07:31 alma04 ntpd[14338]: Listen normally on 6 rhevm fe80::a236:9fff:fe3b:167c UDP 123
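
The traceback ends inside vdsm's proxy to supervdsm: the setSafeNetworkConfig call is forwarded over a multiprocessing manager connection to the supervdsm UNIX socket, and "[Errno 2] No such file or directory" means the socket file was missing when the client tried to connect. A minimal sketch of the same failure (not VDSM code; the socket path is the one visible in the ps output later in this bug):

---------cut here---------
# Minimal sketch, not VDSM code: connecting to a UNIX socket path that
# does not exist fails exactly like the setSafeNetworkConfig call above.
from multiprocessing.connection import Client

SOCK = '/var/run/vdsm/svdsm.sock'   # supervdsm socket path seen in ps output

try:
    Client(SOCK)                    # family AF_UNIX is inferred from the path
except OSError as exc:              # socket.error on the host's Python 2.6
    print(exc)                      # [Errno 2] No such file or directory
---------cut here---------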

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150813.0.el6ev)


How reproducible:
100%

Steps to Reproduce:
1. Clean install RHEVH6.7 (20150813.0.el6ev)
2. Deploy HE over NFS share.
3. Get the error.

Actual results:
HE deployment failed

Expected results:
HE deployment should pass

Additional info:
logs from host attached

Comment 1 Nikolai Sednev 2015-08-17 11:46:59 UTC
Created attachment 1063780 [details]
alma04.log

Comment 2 Nikolai Sednev 2015-08-17 11:54:37 UTC
Created attachment 1063782 [details]
ovirt-hosted-engine-setup-20150817110321-g67var.log

Comment 3 Dan Kenigsberg 2015-08-17 12:17:25 UTC
We have supervdsm running twice, and writing twice into the same log file:

MainThread::DEBUG::2015-08-17 08:08:51,518::supervdsmServer::422::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2015-08-17 08:08:51,518::supervdsmServer::422::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm

MainThread::ERROR::2015-08-17 08:08:51,519::supervdsmServer::468::SuperVdsm.Server::(main) Could not start Super Vdsm
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 448, in main
    server = manager.get_server()
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 467, in get_server
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 136, in __init__
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 106, in __init__
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 227, in __init__
  File "<string>", line 1, in bind
error: [Errno 98] Address already in use
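
In other words, the second supervdsm instance fails to start because its manager's Listener tries to bind() the UNIX socket path that the first instance already owns. A minimal sketch of the same failure (not VDSM code; the socket path below is illustrative):

---------cut here---------
# Minimal sketch, not VDSM code: two listeners on one UNIX socket path.
from multiprocessing.connection import Listener

PATH = '/tmp/svdsm-demo.sock'       # illustrative path, not the real socket

first = Listener(PATH)              # first instance binds and wins
try:
    Listener(PATH)                  # second instance hits the same bind()
except OSError as exc:              # socket.error on the host's Python 2.6
    print(exc)                      # [Errno 98] Address already in use
finally:
    first.close()
---------cut here---------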

Comment 4 Douglas Schilling Landgraf 2015-08-18 00:26:57 UTC
Hi Dan,

(In reply to Dan Kenigsberg from comment #3)
> We have supervdsm running twice, and writing twice into the same log file
> 
> MainThread::DEBUG::2015-08-17
> 08:08:51,518::supervdsmServer::422::SuperVdsm.Server::(main) Making sure I'm
> root - SuperVdsm
> MainThread::DEBUG::2015-08-17
> 08:08:51,518::supervdsmServer::422::SuperVdsm.Server::(main) Making sure I'm
> root - SuperVdsm
> 
> MainThread::ERROR::2015-08-17
> 08:08:51,519::supervdsmServer::468::SuperVdsm.Server::(main) Could not start
> Super Vdsm
> Traceback (most recent call last):
>   File "/usr/share/vdsm/supervdsmServer", line 448, in main
>     server = manager.get_server()
>   File "/usr/lib64/python2.6/multiprocessing/managers.py", line 467, in
> get_server
>   File "/usr/lib64/python2.6/multiprocessing/managers.py", line 136, in
> __init__
>   File "/usr/lib64/python2.6/multiprocessing/connection.py", line 106, in
> __init__
>   File "/usr/lib64/python2.6/multiprocessing/connection.py", line 227, in
> __init__
>   File "<string>", line 1, in bind
> error: [Errno 98] Address already in use

I was able to reproduce this report just once. I noticed that during the install of RHEL 6.7 (the HE VM) for the oVirt Engine setup, the VNC application unexpectedly closed and the vdsm service was down. I have tried to reproduce this report again to provide full logs (I only reproduced it on that first attempt), and after several tries I am still not able to. What I did see were two supervdsm processes, one of them defunct, and the 'Address already in use' error in vdsm.log.

Comment 5 Douglas Schilling Landgraf 2015-08-18 00:28:05 UTC
Hi Nikolai, 

Are you able to easily reproduce your report? 

Thanks!

Comment 6 Douglas Schilling Landgraf 2015-08-18 00:37:57 UTC
Hi Dan,

As I said previously, I am still not able to reproduce the report, but here is some data that might help you.

After rhev-h installation and networking settings via TUI
=================================================================
[root@node ~]# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 7.1 (20150813.0.el7ev)

# rpm -qa | grep -i vdsm
vdsm-xmlrpc-4.16.24-2.el7ev.noarch
vdsm-python-4.16.24-2.el7ev.noarch
vdsm-cli-4.16.24-2.el7ev.noarch
vdsm-jsonrpc-4.16.24-2.el7ev.noarch
vdsm-hook-vhostmd-4.16.24-2.el7ev.noarch
vdsm-4.16.24-2.el7ev.x86_64
vdsm-reg-4.16.24-2.el7ev.noarch
vdsm-python-zombiereaper-4.16.24-2.el7ev.noarch
vdsm-yajsonrpc-4.16.24-2.el7ev.noarch
vdsm-hook-ethtool-options-4.16.24-2.el7ev.noarch
ovirt-node-plugin-vdsm-0.2.0-26.el7ev.noarch

# ps aux | grep -i vdsm
root     12245  0.1  0.5 621828 59220 ?        S<sl 22:35   0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock
root     16969  0.0  0.0  10632   932 pts/0    S+   22:47   0:00 grep --color=auto -i vdsm

# service vdsmd status
Redirecting to /bin/systemctl status  vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: inactive (dead) since Mon 2015-08-17 22:47:31 UTC; 31s ago
  Process: 16936 ExecStopPost=/usr/libexec/vdsm/vdsmd_init_common.sh --post-stop (code=exited, status=0/SUCCESS)
  Process: 16121 ExecStart=/usr/share/vdsm/daemonAdapter -0 /dev/null -1 /dev/null -2 /dev/null /usr/share/vdsm/vdsm (code=exited, status=0/SUCCESS)
  Process: 15033 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 16121 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/vdsmd.service

Aug 17 22:36:08 localhost python[16121]: DIGEST-MD5 parse_server_challenge()
Aug 17 22:36:08 localhost python[16121]: DIGEST-MD5 ask_user_info()
Aug 17 22:36:08 localhost python[16121]: DIGEST-MD5 client step 2
Aug 17 22:36:08 localhost python[16121]: DIGEST-MD5 ask_user_info()
Aug 17 22:36:08 localhost python[16121]: DIGEST-MD5 make_client_response()
Aug 17 22:36:08 localhost python[16121]: DIGEST-MD5 client step 3
Aug 17 22:47:23 node.localdomain systemd[1]: Stopping Virtual Desktop Server Manager...
Aug 17 22:47:30 node.localdomain python[16121]: DIGEST-MD5 client mech dispose
Aug 17 22:47:30 node.localdomain python[16121]: DIGEST-MD5 common mech dispose
Aug 17 22:47:30 node.localdomain vdsmd_init_common.sh[16936]: vdsm: Running run_final_hooks
Aug 17 22:47:31 node.localdomain systemd[1]: Stopped Virtual Desktop Server Manager.


Setting Hosted Engine via TUI
====================================
- Provided RHEL-6.7 iso for HE installer

In the middle of the initial questions (figure 1) for setting up HE, I see that vdsm has started and that a defunct supervdsm process has been created.

# /bin/systemctl status  -l vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: active (running) since Mon 2015-08-17 22:57:37 UTC; 3min 3s ago

# ps aux | grep -i vdsm
root     17331  1.0  0.5 626520 60564 ?        S<sl 22:57   0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock
vdsm     17514  1.8  0.3 1515680 45864 ?       S<sl 22:57   0:01 /usr/bin/python /usr/share/vdsm/vdsm
vdsm     17616  0.0  0.0 382596  1092 ?        S<l  22:57   0:00 /usr/libexec/ioprocess --read-pipe-fd 39 --write-pipe-fd 38 --max-threads 10 --max-queued-requests 10
qemu     17624  0.0  0.0      0     0 ?        Z<   22:57   0:00 [supervdsmServer] <defunct>
vdsm     17641  0.0  0.0 382596  3140 ?        S<l  22:57   0:00 /usr/libexec/ioprocess --read-pipe-fd 47 --write-pipe-fd 46 --max-threads 10 --max-queued-requests 10
root     17714  0.0  0.0  10632   932 pts/0    R+   22:58   0:00 grep --color=auto -i vdsm


Later (figure 2), after completing all the Hosted Engine settings, I see a new "Start vdsmd". When the hosted-engine setup is complete and asks to log in to the system via VNC, this is the status of vdsm and ps:

# ps aux | grep -i vdsm
root     17977  0.5  0.5 710656 61748 ?        S<sl 23:01   0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock
vdsm     18171  3.0  0.4 1550224 57120 ?       S<sl 23:02   0:04 /usr/bin/python /usr/share/vdsm/vdsm
qemu     18687  0.0  0.0      0     0 ?        Z<   23:02   0:00 [supervdsmServer] <defunct>
vdsm     18688  0.0  0.0 382596  3200 ?        S<l  23:02   0:00 /usr/libexec/ioprocess --read-pipe-fd 56 --write-pipe-fd 55 --max-threads 10 --max-queued-requests 10
qemu     19136  7.2  0.2 4724024 34852 ?       Sl   23:03   0:02 /usr/libexec/qemu-kvm -name HostedEngine -S -machine rhel6.5.0,accel=kvm,usb=off -cpu SandyBridge -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 9612e98d-6125-4f89-ac78-01661a6cd4ec -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.1-20150813.0.el7ev,serial=8103812E-150B-4725-B1E2-BA9C9B191C81_52:54:00:c6:56:b6,uuid=9612e98d-6125-4f89-ac78-01661a6cd4ec -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/HostedEngine.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2015-08-17T23:03:52,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-reboot -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/data/ovirt-hosted-engine-setup/rhel67.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -drive file=/var/run/vdsm/storage/353cf67b-514b-416a-acf6-cb75b6a3c739/cbc09241-a8db-4eda-b164-798a2906029e/6e625010-ee57-46df-a9f6-dc491c4d5d08,if=none,id=drive-virtio-disk0,format=raw,serial=cbc09241-a8db-4eda-b164-798a2906029e,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:16:3e:3a:be:7d,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/9612e98d-6125-4f89-ac78-01661a6cd4ec.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/9612e98d-6125-4f89-ac78-01661a6cd4ec.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -vnc 0:0,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -msg timestamp=on
root     19227  0.0  0.0  10632   932 pts/0    S+   23:04   0:00 grep --color=auto -i vdsm



# service vdsmd status -l
Redirecting to /bin/systemctl status  -l vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: active (running) since Mon 2015-08-17 23:02:08 UTC; 2min 9s ago
 Main PID: 18171 (vdsm)
   CGroup: /system.slice/vdsmd.service
           ├─18171 /usr/bin/python /usr/share/vdsm/vdsm
           └─18688 /usr/libexec/ioprocess --read-pipe-fd 56 --write-pipe-fd 55 --max-threads 10 --max-queued-requests 10

Aug 17 23:03:52 node.localdomain python[18171]: DIGEST-MD5 ask_user_info()
Aug 17 23:03:52 node.localdomain python[18171]: DIGEST-MD5 make_client_response()
Aug 17 23:03:52 node.localdomain python[18171]: DIGEST-MD5 client step 2
Aug 17 23:03:52 node.localdomain python[18171]: DIGEST-MD5 parse_server_challenge()
Aug 17 23:03:52 node.localdomain python[18171]: DIGEST-MD5 ask_user_info()
Aug 17 23:03:52 node.localdomain python[18171]: DIGEST-MD5 make_client_response()
Aug 17 23:03:52 node.localdomain python[18171]: DIGEST-MD5 client step 3
Aug 17 23:03:52 node.localdomain vdsm[18171]: vdsm vds ERROR unexpected error
                                              Traceback (most recent call last):
                                                File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper
                                                File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 383, in vmSetTicket
                                                File "/usr/share/vdsm/API.py", line 642, in setTicket
                                                File "/usr/share/vdsm/virt/vm.py", line 4847, in setTicket
                                              AttributeError: 'NoneType' object has no attribute 'XMLDesc'
Aug 17 23:03:53 node.localdomain vdsm[18171]: vdsm vds ERROR unexpected error
                                              Traceback (most recent call last):
                                                File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper
                                                File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 383, in vmSetTicket
                                                File "/usr/share/vdsm/API.py", line 642, in setTicket
                                                File "/usr/share/vdsm/virt/vm.py", line 4847, in setTicket
                                              AttributeError: 'NoneType' object has no attribute 'XMLDesc'
Aug 17 23:03:54 node.localdomain vdsm[18171]: vdsm vm.Vm WARNING vmId=`9612e98d-6125-4f89-ac78-01661a6cd4ec`::_readPauseCode unsupported by libvirt vm


I am attaching the logs, but keep in mind (and ignore) that after adding the HE to oVirt Engine 3.6 it failed because of the missing ovirt-vmconsole package. I will do new tests in RHEV-M 3.5 too.

Comment 7 Douglas Schilling Landgraf 2015-08-18 00:40:12 UTC
Hi Nikolai,

Just to double-check, could you please also provide the RHEV-M version you used during your report?

Thanks!

Comment 8 Douglas Schilling Landgraf 2015-08-18 00:42:50 UTC
Created attachment 1064103 [details]
logs Aug 17

Comment 9 Douglas Schilling Landgraf 2015-08-18 00:44:25 UTC
Created attachment 1064115 [details]
figure1

Comment 10 Douglas Schilling Landgraf 2015-08-18 00:44:46 UTC
Created attachment 1064116 [details]
figure2

Comment 11 Nikolai Sednev 2015-08-18 05:02:56 UTC
(In reply to Douglas Schilling Landgraf from comment #5)
> Hi Nikolai, 
> 
> Are you able to easily reproduce your report? 
> 
> Thanks!

I wasn't trying to reproduce the issue; it happened during a clean deployment and I filed a bug with all the logs.
The flow was as follows:

[root@alma04 ~]# hosted-engine --deploy
[ INFO  ] Stage: Initializing          
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup            
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]:                                                                              
          Configuration files: []                                                                                                         
          Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20150817110321-g67var.log                                
          Version: otopi-1.3.2 (otopi-1.3.2-1.el6ev)                                                                                      
          It has been detected that this program is executed through an SSH connection without using screen.                              
          Continuing with the installation may lead to broken installation if the network connection fails.                               
          It is highly recommended to abort the installation and run it inside a screen session using command "screen".                   
          Do you want to continue anyway? (Yes, No)[No]: yes                                                                              
[ INFO  ] Hardware supports virtualization                                                                                                
[ INFO  ] Stage: Environment packages setup                                                                                               
[ INFO  ] Stage: Programs detection                                                                                                       
[ INFO  ] Stage: Environment setup                                                                                                        
[ INFO  ] Generating libvirt-spice certificates                                                                                           
[ INFO  ] Stage: Environment customization                                                                                                
                                                                                                                                          
          --== STORAGE CONFIGURATION ==--                                                                                                 
                                                                                                                                          
          During customization use CTRL-D to abort.                                                                                       
          Please specify the storage you would like to use (iscsi, nfs3, nfs4)[nfs3]:                                                     
          Please specify the full shared storage connection path to use (example: host:/path): 10.35.160.108:/Virt/nsednev_HE_3_5_z_1     
[ INFO  ] Installing on first host                                                                                                        
          Please provide storage domain name. [hosted_storage]:                                                                           
          Local storage datacenter name is an internal name and currently will not be shown in engine's admin UI.                         
          Please enter local datacenter name [hosted_datacenter]:                                                                         
                                                                                                                                          
          --== SYSTEM CONFIGURATION ==--                                                                                                  
                                                                                                                                          
                                                                                                                                          
          --== NETWORK CONFIGURATION ==--                                                                                                 
                                                                                                                                          
          Please indicate a nic to set rhevm bridge on: (em1, em2, p1p1, p1p2) [em1]: p1p1                                                
          iptables was detected on your computer, do you wish setup to configure it? (Yes, No)[Yes]:                                      
          Please indicate a pingable gateway IP address [10.35.117.254]:                                                                  
                                                                                                                                          
          --== VM CONFIGURATION ==--                                                                                                      
                                                                                                                                          
          Please specify the device to boot the VM from (cdrom, disk, pxe) [cdrom]: pxe                                                   
          Please specify an alias for the Hosted Engine image [hosted_engine]:                                                            
          The following CPU types are supported by this host:                                                                             
                 - model_SandyBridge: Intel SandyBridge Family                                                                            
                 - model_Westmere: Intel Westmere Family                                                                                  
                 - model_Nehalem: Intel Nehalem Family                                                                                    
                 - model_Penryn: Intel Penryn Family                                                                                      
                 - model_Conroe: Intel Conroe Family                                                                                      
          Please specify the CPU type to be used by the VM [model_SandyBridge]:                                                           
          Please specify the number of virtual CPUs for the VM [Defaults to minimum requirement: 2]:                                      
          Please specify the disk size of the VM in GB [Defaults to minimum requirement: 25]:                                             
          You may specify a unicast MAC address for the VM or accept a randomly generated default [00:16:3e:39:e4:af]: 00:16:3E:7C:CC:CC  
          Please specify the memory size of the VM in MB [Defaults to minimum requirement: 4096]:                                         
          Please specify the console type you would like to use to connect to the VM (vnc, spice) [vnc]:                                  
                                                                                                                                          
          --== HOSTED ENGINE CONFIGURATION ==--                                                                                           
                                                                                                                                          
          Enter the name which will be used to identify this host inside the Administrator Portal [hosted_engine_1]:                      
          Enter 'admin@internal' user password that will be used for accessing the Administrator Portal:                                  
          Confirm 'admin@internal' user password:                                                                                         
          Please provide the FQDN for the engine you would like to use.                                                                   
          This needs to match the FQDN that you will use for the engine installation within the VM.                                       
          Note: This will be the FQDN of the VM you are now going to create,                                                              
          it should not point to the base host or to any other existing machine.                                                          
          Engine FQDN: nsednev-he-3.qa.lab.tlv.redhat.com                                                                                 
          Please provide the name of the SMTP server through which we will send notifications [localhost]:                                
          Please provide the TCP port number of the SMTP server [25]:                                                                     
          Please provide the email address from which notifications will be sent [root@localhost]:                                        
          Please provide a comma-separated list of email addresses which will get notifications [root@localhost]:                         
[ INFO  ] Stage: Setup validation                                                                                                         
                                                                                                                                          
          --== CONFIGURATION PREVIEW ==--                                                                                                 
                                                                                                                                          
          Bridge interface                   : p1p1                                                                                       
          Engine FQDN                        : nsednev-he-3.qa.lab.tlv.redhat.com                                                         
          Bridge name                        : rhevm                                                                                      
          SSH daemon port                    : 22                                                                                         
          Firewall manager                   : iptables                                                                                   
          Gateway address                    : 10.35.117.254                                                                              
          Host name for web application      : hosted_engine_1                                                                            
          Host ID                            : 1                                                                                          
          Image alias                        : hosted_engine                                                                              
          Image size GB                      : 25                                                                                         
          Storage connection                 : 10.35.160.108:/Virt/nsednev_HE_3_5_z_1                                                     
          Console type                       : vnc                                                                                        
          Memory size MB                     : 4096                                                                                       
          MAC address                        : 00:16:3E:7C:CC:CC                                                                          
          Boot type                          : pxe                                                                                        
          Number of CPUs                     : 2                                                                                          
          CPU Type                           : model_SandyBridge                                                                          
                                                                                                                                          
          Please confirm installation settings (Yes, No)[Yes]:                                                                            
[ INFO  ] Stage: Transaction setup                                                                                                        
[ INFO  ] Stage: Misc configuration                                                                                                       
[ INFO  ] Stage: Package installation                                                                                                     
[ INFO  ] Stage: Misc configuration                                                                                                       
[ INFO  ] Configuring libvirt                                                                                                             
[ INFO  ] Configuring VDSM                                                                                                                
[ INFO  ] Starting vdsmd                                                                                                                  
[ INFO  ] Waiting for VDSM hardware info                                                                                                  
[ INFO  ] Waiting for VDSM hardware info                                                                                                  
[ INFO  ] Configuring the management bridge                                                                                               
[ ERROR ] Failed to execute stage 'Misc configuration': Failed to persist network configuration. Error code: "16" message: "Unexpected exception"
[ INFO  ] Stage: Clean up                                                                                                                        
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150817110729.conf'                                        
[ INFO  ] Stage: Pre-termination                                                                                                                 
[ INFO  ] Stage: Termination                                                                                                                     
[root@alma04 ~]# cat /etc/redhat-release                                                                                                         
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150813.0.el6ev)

Comment 12 Nikolai Sednev 2015-08-18 05:06:35 UTC
(In reply to Douglas Schilling Landgraf from comment #7)
> Hi Nikolai,
> 
> Just for double check, could you please also provide the RHEV-M version you
> have used during your report? 
> 
> Thanks!

vt16.6.

Comment 13 Fabian Deutsch 2015-08-18 10:08:36 UTC
Please provide the vdsm and he-setup logs for comment 11.

Comment 14 Nikolai Sednev 2015-08-18 11:20:48 UTC
(In reply to Fabian Deutsch from comment #13)
> Please provide the vdsm and he-setup logs for comment 11.

All logs were already provided to the bug: within the tar file alma04.log you will find vdsm.log and others, while the he-setup log is provided separately as ovirt-hosted-engine-setup-20150817110321-g67var.log.
Please see the "Attachments" section of this bug report.

Comment 16 Sandro Bonazzola 2015-08-18 13:39:14 UTC
Thread-16::DEBUG::2015-08-17 11:07:23,878::BindingXMLRPC::1133::vds::(wrapper) client [127.0.0.1]::call setupNetworks with ({'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}, {}, {'connectivityCheck': False}) {}
Thread-16::DEBUG::2015-08-17 11:07:29,705::BindingXMLRPC::1140::vds::(wrapper) return setupNetworks with {'status': {'message': 'Done', 'code': 0}}
Detector thread::DEBUG::2015-08-17 11:07:29,722::protocoldetector::187::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 127.0.0.1:35476
Detector thread::DEBUG::2015-08-17 11:07:29,736::protocoldetector::201::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 127.0.0.1:35476
Detector thread::DEBUG::2015-08-17 11:07:29,736::protocoldetector::247::vds.MultiProtocolAcceptor::(_handle_connection_read) Detected protocol xml from 127.0.0.1:35476
Detector thread::DEBUG::2015-08-17 11:07:29,736::BindingXMLRPC::1173::XmlDetector::(handleSocket) xml over http detected from ('127.0.0.1', 35476)
Thread-17::DEBUG::2015-08-17 11:07:29,776::BindingXMLRPC::1133::vds::(wrapper) client [127.0.0.1]::call setSafeNetworkConfig with () {}
Thread-17::ERROR::2015-08-17 11:07:29,776::BindingXMLRPC::1152::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper
  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 562, in setSafeNetworkConfig
  File "/usr/share/vdsm/API.py", line 1541, in setSafeNetworkConfig
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setSafeNetworkConfig
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 722, in _callmethod
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 709, in _connect
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 143, in Client
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 270, in SocketClient
  File "<string>", line 1, in connect
error: [Errno 2] No such file or directory

Comment 17 Sandro Bonazzola 2015-08-18 13:41:27 UTC
supervdsm shows:
MainProcess|Thread-16::INFO::2015-08-17 11:07:29,479::netconfpersistence::76::root::(setNetwork) Adding network rhevm({'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True})
MainProcess|Thread-16::DEBUG::2015-08-17 11:07:29,479::utils::739::root::(execCmd) /sbin/ip route show to 0.0.0.0/0 table all (cwd None)
MainProcess|Thread-16::DEBUG::2015-08-17 11:07:29,484::utils::759::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-16::INFO::2015-08-17 11:07:29,492::netconfpersistence::166::root::(_clearDisk) Clearing /var/run/vdsm/netconf/nets/ and /var/run/vdsm/netconf/bonds/
MainProcess|Thread-16::DEBUG::2015-08-17 11:07:29,492::netconfpersistence::174::root::(_clearDisk) No existent config to clear.
MainProcess|Thread-16::INFO::2015-08-17 11:07:29,493::netconfpersistence::190::root::(save) Saved new config RunningConfig({'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}, {}) to /var/run/vdsm/netconf/nets/ and /var/run/vdsm/netconf/bonds/
MainProcess|Thread-16::DEBUG::2015-08-17 11:07:29,493::utils::739::root::(execCmd) /usr/libexec/vdsm/hooks/after_network_setup/30_ethtool_options (cwd None)
MainProcess|Thread-16::DEBUG::2015-08-17 11:07:29,704::utils::759::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-16::INFO::2015-08-17 11:07:29,704::hooks::100::root::(_runHooksDir) 
MainProcess|Thread-16::DEBUG::2015-08-17 11:07:29,705::supervdsmServer::109::SuperVdsm.ServerCallback::(wrapper) return setupNetworks with None

Comment 18 Sandro Bonazzola 2015-08-18 13:43:09 UTC
Looks like a protocol error between supervdsmd and vdsmd or some sort of race condition. Moving to vdsm / network vertical

Comment 19 Fabian Deutsch 2015-08-18 13:55:05 UTC
Ido, could this be related to your latest network changes?

Comment 20 Ido Barkan 2015-08-19 07:11:53 UTC
Hardly. The fact that there are multiple instances of supervdsm needs to be resolved first.

Comment 21 Barak Korren 2015-08-19 08:01:41 UTC
Should this issue be resolved by the following commit?
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=c6f3e6eba71c1166281f4ef7e9be1ad7000f4e77
Please properly assign and ack the bug.

Comment 23 Nikolai Sednev 2015-08-19 09:04:34 UTC
Please disregard comment #22, not related to this bug.

Comment 24 Dan Kenigsberg 2015-08-19 11:00:29 UTC
(In reply to Barak Korren from comment #21)
> should this issue be resolved by the following commit?
> https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;
> h=c6f3e6eba71c1166281f4ef7e9be1ad7000f4e77
> Please properly assign and ack the bug.

Not at all - the patch is unrelated to this bug, and this bug is still under deliberation.

Comment 25 Dan Kenigsberg 2015-08-19 11:28:08 UTC
When setSafeNetworkConfig is sent from vdsm to supervdsm, nothing shows in the logs for more than an hour:

MainProcess|Thread-16::DEBUG::2015-08-19 08:48:51,050::supervdsmServer::109::SuperVdsm.ServerCallback::(wrapper) return setupNetworks with None
MainThread::DEBUG::2015-08-19 10:04:15,180::supervdsmServer::462::SuperVdsm.Server::(main) Terminated normally

I don't know what caused supervdsm to die, or what kept it silent for so long.

Comment 26 Sandro Bonazzola 2015-08-19 12:26:20 UTC
(In reply to Nikolai Sednev from comment #23)
> Please disregard comment #22, not related to this bug.

So why did you move this bug to hosted-engine?

(In reply to Dan Kenigsberg from comment #25)

> I don't know what caused supervdsm to die, and what kept it silent for so
> long.

Nikolai, is this bug reproducible?

Comment 27 Nikolai Sednev 2015-08-19 12:40:33 UTC
It might have happened automatically; I didn't try to change anything.
Regarding reproduction, I tried only once and got this behavior.

Comment 28 Nikolai Sednev 2015-08-19 13:58:19 UTC
(In reply to Sandro Bonazzola from comment #26)
> (In reply to Nikolai Sednev from comment #23)
> > Please disregard comment #22, not related to this bug.
> 
> So why did you move this bug to hosted-engine?
> 
> (In reply to Dan Kenigsberg from comment #25)
> 
> > I don't know what caused supervdsm to die, and what kept it silent for so
> > long.
> 
> Nikolai, is this bug reproducible?

Sandro, it's reproducible without any problem:
[ ERROR ] Failed to execute stage 'Misc configuration': Failed to persist network configuration. Error code: "16" message: "Unexpected exception"

Comment 29 Dima Kuznetsov 2015-08-20 14:39:32 UTC
Hello, all.

I've looked around for the double-supervdsm issue and I have a pretty good idea why it happens.

First of all, we should note that in boot.log, vdsm fails to start with the following error:

---------cut here---------
Loading the softdog kernel module:                         [  OK  ]
Starting wdmd:                                             [  OK  ]
Starting ovirt-post:                                       [  OK  ]
Starting sanlock:                                          [  OK  ]
Starting virt-who:                                         [  OK  ]
Starting ovirt-cim:                                        [  OK  ]
Starting ovirt-firstboot:                                  [  OK  ]
Starting RHN Managed Xen Domains:                          [  OK  ]
supervdsm start                                            [  OK  ]
cannot start vdsm, operation is locked                     [FAILED]
---------cut here---------

This log suggests that vdsmd was started before sysvinit got to starting it, and indeed vdsmd is up when the system starts; since sysvinit did not start it, someone else did.

I've looked around, and apparently 'ovirt-post' starts vdsmd in the background, so by the time sysvinit gets to it the vdsmd start command fails due to a lock. The first start happens as part of the on-boot hook of ovirt-node:
---------cut here---------
[root@localhost libexec]# cat /usr/libexec/ovirt-node/hooks/on-boot/90-start-vdsm
#[...]
service vdsmd start &

exit 0
---------cut here---------

The second instance of supervdsm is spawned because:
 * when the ovirt-node hook starts vdsmd, vdsmd checks whether supervdsm is up (it's not) and starts it;
 * then sysvinit gets to starting supervdsm, and there is no lock (unlike the vdsm service), so with a little bad luck the service can be started twice if the timing is right.

I think we should evaluate whether it's ovirt-node's responsibility to start vdsmd; if it is, the ovirt-post stage should be sequenced after the vdsm/supervdsm services are started (a sketch of the missing start lock follows the listing below).

Current start order is: (alphabetical)
---------cut here---------
lrwxrwxrwx. 1 root root 20 2015-08-20 14:10 S98ovirt-post -> ../init.d/ovirt-post
lrwxrwxrwx. 1 root root 17 2015-08-20 14:10 S98sanlock -> ../init.d/sanlock
lrwxrwxrwx. 1 root root 18 2015-08-20 14:10 S98virt-who -> ../init.d/virt-who
lrwxrwxrwx. 1 root root 11 2015-08-14 03:13 S99local -> ../rc.local
lrwxrwxrwx. 1 root root 19 2015-08-20 14:10 S99ovirt-cim -> ../init.d/ovirt-cim
lrwxrwxrwx. 1 root root 25 2015-08-14 03:15 S99ovirt-firstboot -> ../init.d/ovirt-firstboot
lrwxrwxrwx. 1 root root 33 2015-08-14 03:14 S99rhn-virtualization-host -> ../init.d/rhn-virtualization-host
lrwxrwxrwx. 1 root root 20 2015-08-14 03:15 S99supervdsmd -> ../init.d/supervdsmd
lrwxrwxrwx. 1 root root 15 2015-08-14 03:15 S99vdsmd -> ../init.d/vdsmd
---------cut here---------
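
The "cannot start vdsm, operation is locked" line in boot.log shows why vdsmd itself survives this: its start path is protected by an exclusive lock, while supervdsmd's is not. A minimal sketch of that kind of start lock (an assumed illustration, not the actual initscript logic; the lock path is hypothetical):

---------cut here---------
# Minimal sketch, assumed illustration only: an exclusive, non-blocking
# flock makes a second concurrent "start" fail fast instead of racing the
# first one into a second daemon instance.
import fcntl
import sys

LOCK_PATH = '/tmp/vdsmd-start.lock'    # hypothetical lock file

def start_once(start_daemon):
    with open(LOCK_PATH, 'w') as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError):     # another "start" already holds the lock
            sys.exit('cannot start vdsm, operation is locked')
        start_daemon()                 # only one concurrent caller gets here
---------cut here---------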

Comment 30 Douglas Schilling Landgraf 2015-08-20 18:03:24 UTC
Hi Dima,

(In reply to Dima Kuznetsov from comment #29)
> Hello, all.
> 
> I've looked around for the double-supervdsm issue and I have a pretty good
> idea why it happens.
> 
> First of all, we should note that in boot.log, vdsm fails to start with the
> following error:
> 
> ---------cut here---------
> Loading the softdog kernel module:                         [  OK  ]
> Starting wdmd:                                             [  OK  ]
> Starting ovirt-post:                                       [  OK  ]
> Starting sanlock:                                          [  OK  ]
> Starting virt-who:                                         [  OK  ]
> Starting ovirt-cim:                                        [  OK  ]
> Starting ovirt-firstboot:                                  [  OK  ]
> Starting RHN Managed Xen Domains:                          [  OK  ]
> supervdsm start                                            [  OK  ]
> cannot start vdsm, operation is locked                     [FAILED]
> ---------cut here---------
> 
> This log suggest that vdsmd was started before sysvinit got to starting it,
> and it is true, vdsmd is up when the system starts, and since sysvinit did
> not start it, someone else did.
> 
> I've looked around and apparently 'ovirt-post' starts vdsmd in the
> background, so by the time sysvinit gets to it vdsmd start command fails due
> to a lock. The first start happens as part of on-boot hook of ovirt-node:
> ---------cut here---------
> [root@localhost libexec]# cat
> /usr/libexec/ovirt-node/hooks/on-boot/90-start-vdsm
> #[...]
> service vdsmd start &
> 
> exit 0
> ---------cut here---------
> 
> The second instance of supervdsm is spawned because
>  * when ovirt-node hook starts vdsmd, vdsmd checks if supervdsm is up (its
> not) and starts it.
>  * then sysvinit gets to starting supervdsm and there is no lock (unlike
> vdsm service) so with a little luck, service can be started twice if the
> timing is right.
> 
> I think we should evaluate if its ovirt-node's responsibility to start
> vdsmd, and if it is, ovirt-post stage should be sequenced after
> vdsm/supervdsm services are started.
> 
> Current start order is: (alphabetical)
> ---------cut here---------
> lrwxrwxrwx. 1 root root 20 2015-08-20 14:10 S98ovirt-post ->
> ../init.d/ovirt-post
> lrwxrwxrwx. 1 root root 17 2015-08-20 14:10 S98sanlock -> ../init.d/sanlock
> lrwxrwxrwx. 1 root root 18 2015-08-20 14:10 S98virt-who -> ../init.d/virt-who
> lrwxrwxrwx. 1 root root 11 2015-08-14 03:13 S99local -> ../rc.local
> lrwxrwxrwx. 1 root root 19 2015-08-20 14:10 S99ovirt-cim ->
> ../init.d/ovirt-cim
> lrwxrwxrwx. 1 root root 25 2015-08-14 03:15 S99ovirt-firstboot ->
> ../init.d/ovirt-firstboot
> lrwxrwxrwx. 1 root root 33 2015-08-14 03:14 S99rhn-virtualization-host ->
> ../init.d/rhn-virtualization-host
> lrwxrwxrwx. 1 root root 20 2015-08-14 03:15 S99supervdsmd ->
> ../init.d/supervdsmd
> lrwxrwxrwx. 1 root root 15 2015-08-14 03:15 S99vdsmd -> ../init.d/vdsmd
> ---------cut here---------

Thanks for your input. I could reproduce the problem just now, and I don't see the message "cannot start vdsm, operation is locked" in /var/log/boot.log from this fresh install, although your logic seems sound and this error might be triggered by different scenarios. I will do some tests/debugging now.

Comment 31 Douglas Schilling Landgraf 2015-08-21 03:48:56 UTC
(In reply to Douglas Schilling Landgraf from comment #30)
> 
> Thanks for your input, I could reproduce the problem right now and I don't
> see the message "cannot start vdsm, operation is locked" in
> /var/log/boot.log from this fresh install although your logic seems good and
> this error might be triggering from different scenarios. I will make some
> tests/debugging now.

Unfortunately, it was not exactly the reproducer; my storage was not able to communicate, so I got a different error.

@Nikolai, I still cannot see the error you shared below. I noticed you triggered HE manually via the console in comment #11 (on a fresh install of RHEV-H). Did you also see this error when triggering HE via the TUI?

"""
[ ERROR ] Failed to execute stage 'Misc configuration': Failed to persist network configuration. Error code: "16" message: "Unexpected exception"
[ INFO  ] Stage: Clean 
"""

Comment 32 Ying Cui 2015-08-21 05:05:29 UTC
Virt QE also helped try to reproduce the issue (the error in the bug description) through the Hosted Engine TUI; so far we haven't reproduced it.

Steps:
1. Clean installed RHEV-H 6.7 via TUI on a local-disk machine.
2. Set up the network with DHCP via the RHEV-H TUI, on em1.
   Note: no bond or VLAN network, because there is another 100% reproducible bug there: Bug 1254523.
3. Set the root password in the RHEV-H -> RHEV-M TUI.
4. Started hosted-engine setup from the Hosted Engine TUI using the rhevm-appliance OVA.
5. Used the default options for the HE setup process.
6. HE setup completed successfully; we did not encounter the issue from the bug description.

Not sure what exact action triggers this bug, so I am not sure of the exact reproducer here. Thanks.

Comment 33 Nikolai Sednev 2015-08-23 11:06:04 UTC
(In reply to Douglas Schilling Landgraf from comment #31)
> (In reply to Douglas Schilling Landgraf from comment #30)
> > 
> > Thanks for your input, I could reproduce the problem right now and I don't
> > see the message "cannot start vdsm, operation is locked" in
> > /var/log/boot.log from this fresh install although your logic seems good and
> > this error might be triggering from different scenarios. I will make some
> > tests/debugging now.
> 
> Unfortunately, was not exactly the reproducer, my storage was not able to
> communicate so I got a different error.
> 
> @Nikolai, I still cannot see the error you shared bellow. I noticed you
> triggered the HE manually via console in comment#11 (in a fresh install of
> rhev-h). Did you see this error triggering the HE via TUI too?
> 
> """
> [ ERROR ] Failed to execute stage 'Misc configuration': Failed to persist
> network configuration. Error code: "16" message: "Unexpected exception"
> [ INFO  ] Stage: Clean 
> """

I'm not using the TUI, only CLI based deployment.

Comment 34 Nikolai Sednev 2015-08-23 11:07:59 UTC
(In reply to Ying Cui from comment #32)
> Virt QE also helped to try to reproduce the issue through Hosted-Engine TUI
> in bug description error, so far we haven't reproduced it. 
> 
> Steps:
> 1. TUI clean installed rhevh 6.7 on local disk machine.
> 2. dhcp setup network via RHEVH TUI, em1.
>    Note: not bond and vlan network, because here has another bug 100%
> reproduce: Bug 1254523
> 3. set root password in RHEVH-> RHEVM TUI.
> 4. start hosted engine setup on Hosted Engine TUI using rhevm-appliance ova.
> 5. Use default options for HE setup process. 
> 6. HE can setup successful, not encounter such issue in bug description. 
> 
> Not sure what exact action will trigger this bug, so not sure the exact
> reproducible here. Thanks.

Please follow up the original reproduction steps.
Deploy using CLI.
Deploy using NFS.
Deploy using PXE.

Comment 35 Yaniv Lavi 2015-08-23 12:11:55 UTC
(In reply to Nikolai Sednev from comment #34)
> (In reply to Ying Cui from comment #32)
> > Virt QE also helped to try to reproduce the issue through Hosted-Engine TUI
> > in bug description error, so far we haven't reproduced it. 
> > 
> > Steps:
> > 1. TUI clean installed rhevh 6.7 on local disk machine.
> > 2. dhcp setup network via RHEVH TUI, em1.
> >    Note: not bond and vlan network, because here has another bug 100%
> > reproduce: Bug 1254523
> > 3. set root password in RHEVH-> RHEVM TUI.
> > 4. start hosted engine setup on Hosted Engine TUI using rhevm-appliance ova.
> > 5. Use default options for HE setup process. 
> > 6. HE can setup successful, not encounter such issue in bug description. 
> > 
> > Not sure what exact action will trigger this bug, so not sure the exact
> > reproducible here. Thanks.
> 
> Please follow up the original reproduction steps.
> Deploy using CLI.
> Deploy using NFS.
> Deploy using PXE.

The only supported path is via TUI. Please try to reproduce via TUI asap.

Comment 36 Nikolai Sednev 2015-08-23 12:52:33 UTC
The deployment via CLI passed OK on the first RHEV-H 6.7 with the fix https://gerrit.ovirt.org/#/c/45202/, thanks to Yeela's efforts.

Comment 37 Yaniv Lavi 2015-08-23 13:52:46 UTC
(In reply to Nikolai Sednev from comment #36)
> The deployment via CLI passed OK at the first RHEVH6.7 with the fix
> https://gerrit.ovirt.org/#/c/45202/ , thanks to Yeela's efforts.

Please recreate with TUI.

Comment 39 Nikolai Sednev 2015-08-24 05:57:34 UTC
Yaniv, why is CLI-based HE deployment not blocked on RHEV-H if the only way of deploying it is via the TUI?

Comment 40 Douglas Schilling Landgraf 2015-08-24 12:41:48 UTC
(In reply to Nikolai Sednev from comment #39)
> Yaniv, why CLI based HE-deployment not blocked in RHEVHs if the only way of
> deploying it is via TUI?

It's not supported for users to go to the shell and type hosted-engine --deploy. We created a flow via the TUI with all the adjustments needed to trigger hosted-engine --deploy.

Comment 41 Nikolai Sednev 2015-08-24 13:07:59 UTC
(In reply to Douglas Schilling Landgraf from comment #40)
> (In reply to Nikolai Sednev from comment #39)
> > Yaniv, why CLI based HE-deployment not blocked in RHEVHs if the only way of
> > deploying it is via TUI?
> 
> It's not supported users go to shell and type hosted-engine --deploy. We
> created a flow via TUI with all adjusts to trigger the hosted-engine
> --deploy.

I didn't see this error happening when the deployment was triggered by the TUI.

Comment 42 Yaniv Lavi 2015-08-24 13:15:55 UTC
(In reply to Nikolai Sednev from comment #41)
> (In reply to Douglas Schilling Landgraf from comment #40)
> > (In reply to Nikolai Sednev from comment #39)
> > > Yaniv, why CLI based HE-deployment not blocked in RHEVHs if the only way of
> > > deploying it is via TUI?
> > 
> > It's not supported users go to shell and type hosted-engine --deploy. We
> > created a flow via TUI with all adjusts to trigger the hosted-engine
> > --deploy.
> 
> I didn't saw this error happening if deployment triggered by the TUI.

Then this is not a real bug.

Comment 43 Nikolai Sednev 2015-08-25 14:12:49 UTC
(In reply to Yaniv Dary from comment #42)
> (In reply to Nikolai Sednev from comment #41)
> > (In reply to Douglas Schilling Landgraf from comment #40)
> > > (In reply to Nikolai Sednev from comment #39)
> > > > Yaniv, why CLI based HE-deployment not blocked in RHEVHs if the only way of
> > > > deploying it is via TUI?
> > > 
> > > It's not supported users go to shell and type hosted-engine --deploy. We
> > > created a flow via TUI with all adjusts to trigger the hosted-engine
> > > --deploy.
> > 
> > I didn't saw this error happening if deployment triggered by the TUI.
> 
> Then this is not a real bug.

I've added a second host via the TUI, got it added to the HE engine, and then saw it become active.
After a few seconds the host became non-operational and was reported by the engine as:
	
2015-Aug-25, 09:56
	
Host hosted_engine_2 moved to Non-Operational state because interfaces which are down are needed by required networks in the current cluster: 'p1p1 (rhevm)'.

I've checked that the host is up by logging in to it over SSH and continuously pinging it; it was up. I activated the host within the engine via the web UI; it became active for a few seconds and was then again set to non-operational by the engine with the same error, while the host was up and not even a single ping was lost.
After some time I reactivated the host within the engine again and it stayed up.

I've checked that I still have several running vdsmd and supervdsm processes on both hosts. It might be that when using the TUI we're not hitting the race, but it's there, and Yeela's patch will resolve the multiple vdsm instances issue.

Comment 44 Nikolai Sednev 2015-08-25 15:50:01 UTC
I retried adding the second host after another reprovision and got really weird results:
[ INFO  ] Still waiting for VDSM host to become operational...
          The host hosted_engine_2 is in non-operational state.
          Please try to activate it via the engine webadmin UI.
          Retry checking host status or ignore this and continue (Retry, Ignore)[Retry]?
          The host hosted_engine_2 is in non-operational state.
          Please try to activate it via the engine webadmin UI.
          Retry checking host status or ignore this and continue (Retry, Ignore)[Retry]?
          The host hosted_engine_2 is in non-operational state.
          Please try to activate it via the engine webadmin UI.
          Retry checking host status or ignore this and continue (Retry, Ignore)[Retry]?


I saw some errors in the ovirt-ha-broker log:

Thread-7::ERROR::2015-08-25 15:47:34,494::notifications::30::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 111] Connection refused
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 21, in send_email
  File "/usr/lib64/python2.6/smtplib.py", line 244, in __init__
  File "/usr/lib64/python2.6/smtplib.py", line 300, in connect
  File "/usr/lib64/python2.6/smtplib.py", line 278, in _get_socket
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
error: [Errno 111] Connection refused
Thread-67::INFO::2015-08-25 15:47:41,950::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-67::INFO::2015-08-25 15:47:41,955::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-68::INFO::2015-08-25 15:47:41,956::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-68::INFO::2015-08-25 15:47:41,957::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-69::INFO::2015-08-25 15:47:41,958::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-69::INFO::2015-08-25 15:47:41,959::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-10::INFO::2015-08-25 15:47:41,968::mem_free::53::mem_free.MemFree::(action) memFree: 31278
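
The send_email traceback above is just the broker's notifier opening an SMTP connection to the configured notification server (presumably localhost:25, the defaults accepted during deploy) while no mail daemon is listening there. A minimal sketch of the same failure (not broker code):

---------cut here---------
# Minimal sketch, not broker code: connecting to an SMTP port with nothing
# listening raises the same "[Errno 111] Connection refused".
import smtplib

try:
    smtplib.SMTP('localhost', 25)   # assumes no local mail daemon is running
except OSError as exc:              # socket.error on the host's Python 2.6
    print(exc)                      # [Errno 111] Connection refused
---------cut here---------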

Comment 45 Douglas Schilling Landgraf 2015-08-25 15:54:20 UTC
(In reply to Nikolai Sednev from comment #43)
> I've added a second host via TUI, got it added to the HE engine and then saw
> it came to active.
> After few seconds host became non-operational and was reported by the engine
> as:
> 	
> 2015-Aug-25, 09:56
> 	
> Host hosted_engine_2 moved to Non-Operational state because interfaces which
> are down are needed by required networks in the current cluster: 'p1p1
> (rhevm)'.
> 
> I verified that the host was up by logging in to it over ssh and pinging it
> continuously. I then activated the host within the engine via the web UI; it
> became active for a few seconds and was then set back to non-operational by
> the engine with the same error, even though the host stayed up and not a
> single ping was lost.

Ping against the IP or the FQDN? On both sides?

> After some time I reactivated the host within the engine again and it stayed
> UP.
> 
> I checked and I still have several vdsmd and supervdsm processes running on
> both hosts. It may be that the TUI flow does not hit the race, but the race
> is still there, and Yeela's patch should resolve the multiple-vdsm-instances
> issue.

Please provide the vdsm, ovirt-hosted-engine-ha, ovirt-hosted-engine-setup and engine logs (or even access to the machine). If this is a different issue, we should probably open a separate bug to avoid confusion.

Comment 46 Nikolai Sednev 2015-08-25 16:48:06 UTC
(In reply to Douglas Schilling Landgraf from comment #45)
> Ping against the IP or the FQDN? On both sides?
> 
> Please provide the vdsm, ovirt-hosted-engine-ha, ovirt-hosted-engine-setup
> and engine logs (or even access to the machine). If this is a different
> issue, we should probably open a separate bug to avoid confusion.


I pinged both the IP and the FQDN. Ping is ICMP: if the packet is sent and a reply is received, then both sides are getting the packets and replying.
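For completeness, the check amounted to roughly the sketch below, run from both sides; the FQDN and IP are placeholders for the actual hosts:

#!/usr/bin/env python
# Reachability sanity check: confirm the FQDN resolves and that ICMP gets
# through for both the address and the name. HOST_FQDN and HOST_IP are
# placeholders.
import socket
import subprocess

HOST_FQDN = "hosted_engine_2.example.com"   # placeholder FQDN
HOST_IP = "192.0.2.10"                      # placeholder address

try:
    resolved = socket.gethostbyname(HOST_FQDN)
    print("%s resolves to %s (expected %s)" % (HOST_FQDN, resolved, HOST_IP))
except socket.gaierror as err:
    print("cannot resolve %s: %s" % (HOST_FQDN, err))

for target in (HOST_IP, HOST_FQDN):
    # 3 ICMP echo requests, 2-second reply timeout each (iputils ping)
    rc = subprocess.call(["ping", "-c", "3", "-W", "2", target])
    print("ping %s -> %s" % (target, "ok" if rc == 0 else "FAILED (rc=%d)" % rc))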

Regarding access to the machines, please contact me via IRC for details.

Regarding more logs, I'll supply them shortly.

Comment 49 Nikolai Sednev 2015-08-26 07:00:51 UTC
A second attempt to add a clean RHEVH 6.7 host failed during its initial deployment; it did not even reach the point of adding the host to the engine.
Logs are attached (named "logs")

Comment 50 Nikolai Sednev 2015-08-26 07:01:28 UTC
Created attachment 1067163 [details]
logs

Comment 51 Oved Ourfali 2015-08-26 08:39:42 UTC
The fix is already merged, and it applies to 3.5 only.
Pavel, can you QE-ack?

Comment 53 Nikolai Sednev 2015-08-31 09:45:08 UTC
Works for me on these components:
vdsm-4.16.26-1.el6ev.x86_64
sanlock-2.8-2.el6_5.x86_64
libvirt-client-0.10.2-54.el6.x86_64
mom-0.4.1-5.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.479.el6.x86_64
ovirt-hosted-engine-setup-1.2.5.3-1.el6ev.noarch
ovirt-hosted-engine-ha-1.2.6-3.el6ev.noarch
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150828.0.el6ev)


Engine:
rhevm-3.5.4.2-1.3.el6ev.noarch
rhevm-setup-3.5.4.2-1.3.el6ev.noarch
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
Linux version 2.6.32-573.3.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Mon Aug 10 09:44:54 EDT 2015

I tested the deployment using both the CLI and the TUI; the original bug symptoms were not reproduced. There was also no race, and only one instance each of vdsmd and supervdsm was running, as desired:
 45652 vdsm       0 -20 2587m  54m  10m S  4.6  0.2   0:47.46 vdsm
 46262 vdsm      20   0  144m  14m 2580 S  0.7  0.0   0:03.63 ovirt-ha-agent
 46319 vdsm      20   0  724m  18m 2824 S  0.3  0.1   0:08.74 ovirt-ha-broker

#  ps aux | grep vdsm
root      45450  0.0  0.0  11304   820 ?        S<   06:05   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid                           
root      45452  0.1  0.1 774232 65200 ?        S<l  06:05   0:01 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid                                                                                                                          
vdsm      45649  0.0  0.0   9200   804 ?        S<   06:06   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid                                                                                         
vdsm      45652  4.0  0.1 2649108 55924 ?       S<l  06:06   0:35 /usr/bin/python /usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid                   
vdsm      45721  0.0  0.0 394568  1420 ?        S<l  06:06   0:00 /usr/libexec/ioprocess --read-pipe-fd 44 --write-pipe-fd 42 --max-threads 10 --max-queued-requests 10                                                                                                                                               
vdsm      45728  0.0  0.0 470348  3540 ?        S<l  06:06   0:00 /usr/libexec/ioprocess --read-pipe-fd 52 --write-pipe-fd 51 --max-threads 10 --max-queued-requests 10                                                                                                                                               
vdsm      46262  0.3  0.0 148152 14636 ?        S    06:07   0:02 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent                         
vdsm      46306  0.0  0.0 394568  3444 ?        S<l  06:07   0:00 /usr/libexec/ioprocess --read-pipe-fd 63 --write-pipe-fd 62 --max-threads 10 --max-queued-requests 10
vdsm      46319  0.7  0.0 742328 18420 ?        Sl   06:07   0:06 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker



I tested with NFSv3 storage and with two hosts.
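The single-instance check can also be scripted; here is a small sketch that cross-checks the pidfiles from the ps output above against /proc (paths as shown above, so adjust them if your layout differs):

#!/usr/bin/env python
# Cross-check the vdsm/supervdsm pidfiles against /proc to confirm one healthy
# instance of each. Pidfile paths match the ps output above.
import os

PIDFILES = {
    "vdsmd": "/var/run/vdsm/vdsmd.pid",
    "supervdsmd": "/var/run/vdsm/supervdsmd.pid",
}

for name, path in PIDFILES.items():
    try:
        pid = int(open(path).read().strip())
    except (IOError, ValueError) as err:
        print("%s: cannot read pidfile %s (%s)" % (name, path, err))
        continue
    alive = os.path.isdir("/proc/%d" % pid)
    print("%s: pid %d from %s is %s" % (name, pid, path,
                                        "running" if alive else "NOT running"))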

Comment 54 Eyal Edri 2015-09-06 17:09:35 UTC
RHEV 3.5.4 has been released; closing as current release.