Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 858350

Summary: RHEV 3.1 Beta1 - DC not coming back online after orderly shutdown
Product: Red Hat Enterprise Virtualization Manager
Reporter: Steve Reichard <sreichar>
Component: ovirt-engine
Assignee: Ayal Baron <abaron>
Status: CLOSED WONTFIX
QA Contact: Haim <hateya>
Severity: unspecified
Priority: unspecified
Version: 3.1.0
CC: abaron, amureini, bthurber, dyasny, fsimonce, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul
Target Milestone: ---
Target Release: 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2012-12-26 09:52:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Attachments:
  rhevm logs (flags: none)
  hyper1 logs (flags: none)
  hyper2 logs (flags: none)

Description Steve Reichard 2012-09-18 18:21:12 UTC
Created attachment 614090 [details]
rhevm logs

Description of problem:

-Summary:
 Because of a planned site outage, my RHEV 3.1 Beta1 environment was brought down. Upon power-up, one of the redundant FC controllers failed. This should have caused path failures, but the other paths would still have been operational. The DC does not come back online, and attempts to activate the storage get the following error:

Failed to activate Storage Domain brFC (Data Center Default) by admin@internal

-Details

Shutdown -
 + hypervisors were put into maintenance
 + hypervisors were shut down
 + rhevm was shut down
 + SC were shut down
 + equipment was powered off

Boot Up -
 + switches were powered on
 + storage was powered on
 + systems were powered on
 + rhevm was booted
 + hypervisors were booted


The RHEVM showed that the hosts were up and one was serving as the SPM.
Logging into this host, I could read the LVs.
Access to the storage was lost when this host was put into maintenance.

As further debug/characterization:
Hypervisors were brought in and out of maintenance
Hypervisors were rebooted while in maintenance
RHEVM was rebooted
Storage was shut down and repowered (still fails)
The failed controller was pulled
RHEV was rebooted

The DC still did not come online.

One host was put into maintenance.
A new DC was created.
A new cluster was created.
The host was edited to be in the new DC.
A new LUN was presented from the same storage array.
A new Data Domain was created.
This DC came online.

It appears to me that the problem is not with the storage array itself, since the new DC could come online; however, could the failed paths be an issue for the old DC?


I will be attaching logs from both the RHEVM and the hypervisors. {Is the log collector supposed to work? Did I do something wrong?

[root@br-rhevm ovirt-engine]# rhevm-log-collector
INFO: /tmp/logcollector does not exist.  It will be created.
Please provide the REST API password for the admin@internal RHEV-M user (CTRL+D to skip): 
Host list (datacenter=None, cluster=None, host=None):
Data Center          | Cluster              | Hostname/IP Address
Default              | Default              | 10.16.136.100
Secondary            | Secondary-clus1      | 10.16.136.101
[root@br-rhevm ovirt-engine]# ls /tmp/logcollector/
[root@br-rhevm ovirt-engine]# 
}










Version-Release number of selected component (if applicable):

 Red Hat Enterprise Virtualization Manager Version: '3.1.0-11.el6ev'
(c) 2012 Red Hat Inc. All rights reserved.


[root@br-rhevm ovirt-engine]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.3 (Santiago)
[root@br-rhevm ovirt-engine]# uname -a
Linux br-rhevm.cloud.lab.eng.bos.redhat.com 2.6.32-279.2.1.el6.x86_64 #1 SMP Thu Jul 5 21:08:58 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@br-rhevm ovirt-engine]# yum list installed | grep -e rhev -e virt -e vdsm
apache-commons-jxpath.noarch          1.3-5.1_redhat_1.ep6.el6         @rhel-x86_64-server-6-rhevm-3-beta
libvirt.x86_64                        0.9.10-21.el6_3.3                @rhel-x86_64-server-6
libvirt-client.x86_64                 0.9.10-21.el6_3.3                @rhel-x86_64-server-6
libvirt-python.x86_64                 0.9.10-21.el6_3.3                @rhel-x86_64-server-6
rhev-guest-tools-iso.noarch           3.1-4                            @rhel-x86_64-server-6-rhevm-3-beta
rhev-hypervisor6.noarch               6.3-20120813.0.rhev31.el6_3      @rhel-x86_64-server-6-rhevh-beta
rhevm.noarch                          3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-backend.noarch                  3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-config.noarch                   3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-dbscripts.noarch                3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-doc.noarch                      3.1.0-7.el6eng                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-genericapi.noarch               3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-image-uploader.noarch           3.1.0-1.el6ev                    @rhel-x86_64-server-6-rhevm-3-beta
rhevm-iso-uploader.noarch             3.1.0-2.el6ev                    @rhel-x86_64-server-6-rhevm-3-beta
rhevm-log-collector.noarch            3.1.0-2.el6ev                    @rhel-x86_64-server-6-rhevm-3-beta
rhevm-notification-service.noarch     3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-restapi.noarch                  3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-sdk.noarch                      3.1.0.5-1.el6ev                  @rhel-x86_64-server-6-rhevm-3-beta
rhevm-setup.noarch                    3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-tools-common.noarch             3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-userportal.noarch               3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhevm-webadmin-portal.noarch          3.1.0-11.el6ev                   @rhel-x86_64-server-6-rhevm-3-beta
rhn-virtualization-common.noarch      5.4.14-9.el6sat                  @rhn-tools-rhel-x86_64-server-6
rhn-virtualization-host.noarch        5.4.14-9.el6sat                  @rhn-tools-rhel-x86_64-server-6
vdsm-bootstrap.noarch                 4.9.6-28.0.el6_3                 @rhel-x86_64-server-6-rhevm-3-beta
virt-viewer.x86_64                    0.5.2-9.el6                      @rhel-x86_64-server-6
virt-what.x86_64                      1.11-1.1.el6                     @anaconda-RedHatEnterpriseLinux-201206132210.x86_64/6.3
[root@br-rhevm ovirt-engine]# 



[root@br-rhelhyp1 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.3 (Santiago)
[root@br-rhelhyp1 ~]# uname -a
Linux br-rhelhyp1.cloud.lab.eng.bos.redhat.com 2.6.32-279.2.1.el6.x86_64 #1 SMP Thu Jul 5 21:08:58 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@br-rhelhyp1 ~]# yum list installed | grep -e vdsm -e rhev -e virt
libvirt.x86_64                   0.9.10-21.el6_3.3         @rhel-x86_64-server-6
libvirt-client.x86_64            0.9.10-21.el6_3.3         @rhel-x86_64-server-6
libvirt-python.x86_64            0.9.10-21.el6_3.3         @rhel-x86_64-server-6
qemu-img-rhev.x86_64             2:0.12.1.2-2.298.el6_3    @rhel-x86_64-rhev-mgmt-agent-6-beta
qemu-kvm-rhev.x86_64             2:0.12.1.2-2.298.el6_3    @rhel-x86_64-rhev-mgmt-agent-6-beta
qemu-kvm-rhev-tools.x86_64       2:0.12.1.2-2.298.el6_3    @rhel-x86_64-rhev-mgmt-agent-6-beta
rhn-virtualization-common.noarch 5.4.14-9.el6sat           @rhn-tools-rhel-x86_64-server-6
rhn-virtualization-host.noarch   5.4.14-9.el6sat           @rhn-tools-rhel-x86_64-server-6
sanlock.x86_64                   2.3-3.el6_3               @rhel-x86_64-rhev-mgmt-agent-6-beta
sanlock-lib.x86_64               2.3-3.el6_3               @rhel-x86_64-rhev-mgmt-agent-6-beta
sanlock-python.x86_64            2.3-3.el6_3               @rhel-x86_64-rhev-mgmt-agent-6-beta
vdsm.x86_64                      4.9.6-27.0.el6_3          @rhel-x86_64-rhev-mgmt-agent-6-beta
vdsm-cli.noarch                  4.9.6-27.0.el6_3          @rhel-x86_64-rhev-mgmt-agent-6-beta
vdsm-python.x86_64               4.9.6-27.0.el6_3          @rhel-x86_64-rhev-mgmt-agent-6-beta
virt-viewer.x86_64               0.5.2-9.el6               @rhel-x86_64-server-6
virt-what.x86_64                 1.11-1.1.el6              @anaconda-RedHatEnterpriseLinux-201206132210.x86_64/6.3
[root@br-rhelhyp1 ~]# 




How reproducible:

Unknown

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Steve Reichard 2012-09-18 18:37:05 UTC
Created attachment 614098 [details]
hyper1 logs

Comment 2 Steve Reichard 2012-09-18 18:38:58 UTC
Created attachment 614100 [details]
hyper2 logs

Comment 3 Federico Simoncelli 2012-09-21 13:48:04 UTC
The issue is that storage_pool_format_type became/was NULL:

engine=> select id,name,storage_pool_format_type from storage_pool;
                  id                  |   name    | storage_pool_format_type 
--------------------------------------+-----------+--------------------------
 f7728d06-6d04-43d1-807b-e838b6ee3eb8 | Secondary | 3
 989ca2aa-e875-11e1-95f7-00215a5c2e44 | Default   | NULL

Logs:

2012-09-21 09:31:27,210 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (QuartzScheduler_Worker-71) [7f614b79] START, SpmStartVDSCommand(vdsId = 5f4fbe68-e89f-11e1-8b0a-00215a5c2e44, storagePoolId = 989ca2aa-e875-11e1-95f7-00215a5c2e44, prevId=-1, prevLVER=0, storagePoolFormatType=null, recoveryMode=Manual, SCSIFencing=false), log id: 268f760b
2012-09-21 09:31:27,210 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-71) [7f614b79] Failed in SpmStartVDS method, for vds: br-rhelhyp1; host: 10.16.136.100
2012-09-21 09:31:27,211 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-71) [7f614b79] Command SpmStartVDS execution failed. Exception: NullPointerException:

Relevant code SpmStartVDSCommand.java:91

 _result = getBroker().spmStart(getParameters().getStoragePoolId().toString(),
         getParameters().getPrevId(), getParameters().getPrevLVER(),
         getParameters().getRecoveryMode().getValue(),
         (new Boolean(getParameters().getSCSIFencing())).toString().toLowerCase(),
         Config.<Integer> GetValue(ConfigValues.MaxNumberOfHostsInStoragePool),
         getParameters().getStoragePoolFormatType().getValue());

Where getParameters().getStoragePoolFormatType() would have been null.
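
For illustration only, here is a minimal sketch of the kind of guard that would turn this into an explicit error instead of a NullPointerException deep inside spmStart(). It is written against the snippet above and is not the actual engine code; the StorageFormatType type name and the choice of exception are assumptions:

 // Hypothetical guard, sketched against the snippet above; not the real SpmStartVDSCommand code.
 StorageFormatType formatType = getParameters().getStoragePoolFormatType();
 if (formatType == null) {
     // Surface the bad DB value explicitly instead of failing with an NPE.
     throw new IllegalStateException("storage_pool_format_type is NULL for pool "
             + getParameters().getStoragePoolId() + "; refusing to start SPM");
 }
 _result = getBroker().spmStart(getParameters().getStoragePoolId().toString(),
         getParameters().getPrevId(), getParameters().getPrevLVER(),
         getParameters().getRecoveryMode().getValue(),
         (new Boolean(getParameters().getSCSIFencing())).toString().toLowerCase(),
         Config.<Integer> GetValue(ConfigValues.MaxNumberOfHostsInStoragePool),
         formatType.getValue());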

Changing the storage_pool_format_type to 3 solved the problem:

engine=> update storage_pool set storage_pool_format_type='3' where id='989ca2aa-e875-11e1-95f7-00215a5c2e44';
engine=> select id,name,storage_pool_format_type from storage_pool;
                  id                  |   name    | storage_pool_format_type 
--------------------------------------+-----------+--------------------------
 f7728d06-6d04-43d1-807b-e838b6ee3eb8 | Secondary | 3
 989ca2aa-e875-11e1-95f7-00215a5c2e44 | Default   | 3

The most important thing that should be investigated (besides the NULL value in the db) is why the host was reported as SPM after that SpmStartVDSCommand *failed* with a NullPointerException.

PS. VDSM was reporting the correct spm status with: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': 0}}
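
A purely illustrative sketch of the kind of cross-check that question implies (every name below is hypothetical, not the real ovirt-engine API): before recording a host as SPM, the engine could compare its own view with the SPM status VDSM actually reports:

 // Hypothetical consistency check; field and method names are illustrative only.
 boolean vdsmSaysFree = "Free".equals(spmStatus.getSpmStatus()) && spmStatus.getSpmId() == -1;
 if (spmStartFailed || vdsmSaysFree) {
     // Do not record this host as SPM; leaving the pool non-operational surfaces
     // the failed SpmStart instead of masking it with stale engine-side state.
     storagePool.setSpmVdsId(null);
 }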

Comment 4 Ayal Baron 2012-09-23 06:50:46 UTC
(In reply to comment #3)
> The issue is that storage_pool_format_type became/was NULL:
> [...]
> The most important thing that should be investigated (besides the NULL value
> in the db) is why the host was reported as SPM after that SpmStartVDSCommand
> *failed* with a NullPointerException.

The NULL value is another bug which is already being worked on:
Bug 851003 - StoragePool storage_pool_format_type is null after upgrade (lost in update).

Regarding the other issue, we still need to investigate.