Bug 1306845

Summary: rhel-osp-director: Update from 7.1 fails due to keystone errors.
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: openstack-tripleo-heat-templates
Assignee: Marios Andreou <mandreou>
Status: CLOSED NOTABUG
QA Contact: yeylon <yeylon>
Severity: urgent
Priority: high
Version: 7.0 (Kilo)
CC: augol, dbecker, jcoufal, mandreou, mburns, morazi, ohochman, rhel-osp-director-maint, sasha, srevivo
Target Milestone: y3
Target Release: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-02-15 10:13:11 UTC

Description Alexander Chuzhoy 2016-02-11 21:36:47 UTC
rhel-osp-director: Update from 7.1 fails due to keystone errors.
Environment:
openstack-tripleo-heat-templates-0.8.6-119.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch
openstack-puppet-modules-2015.1.8-49.el7ost.noarch


Steps to reproduce:
1. Deploy 7.1 GA with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1 --swift-storage-scale 1  --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server x.x.x.x --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml

2. Populate the setup with objects
3. Attempt to update to 7.3

Result:
IN_PROGRESS                                                                                                                                        
FAILED                                                                                                                                             
update finished with status FAILED    


[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+----------------------------------------------+                                                                                                                                        
| resource_name                                | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                              |                                                                                                                                        
+----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+----------------------------------------------+                                                                                                                                        
| ComputeNodesPostDeployment                   | 97a09040-4647-44b6-940c-df85ef81bbd1          | OS::TripleO::ComputePostDeployment                | UPDATE_FAILED   | 2016-02-11T20:04:10Z |                                              |                                                                                                                                        
| ComputePuppetDeployment                      | 1518a056-449f-41e7-bdbc-c4cd08d98051          | OS::Heat::StructuredDeployments                   | UPDATE_FAILED   | 2016-02-11T20:04:24Z | ComputeNodesPostDeployment                   |                                                                                                                                        
| 0                                            | 7f394c90-9edb-464b-915b-478d90972e3e          | OS::Heat::StructuredDeployment                    | UPDATE_FAILED   | 2016-02-11T20:04:38Z | ComputePuppetDeployment                      |                                                                                                                                        
| ObjectStorageNodesPostDeployment             | 7d9baf9f-7b39-4cfc-8c9f-105c923b43a3          | OS::TripleO::ObjectStoragePostDeployment          | UPDATE_FAILED   | 2016-02-11T20:04:52Z |                                              |                                                                                                                                        
| ControllerNodesPostDeployment                | e2f82d6c-db36-4491-8807-eee84b9df1dc          | OS::TripleO::ControllerPostDeployment             | UPDATE_FAILED   | 2016-02-11T20:05:04Z |                                              |                                                                                                                                        
| StorageDeployment_Step1                      | 5f43c559-2a1d-4e1f-b099-6dc8e6d0ae61          | OS::Heat::StructuredDeployments                   | UPDATE_FAILED   | 2016-02-11T20:05:05Z | ObjectStorageNodesPostDeployment             |                                                                                                                                        
| 0                                            | 7284d005-33bc-4fe2-a973-a579041c51ce          | OS::Heat::StructuredDeployment                    | UPDATE_FAILED   | 2016-02-11T20:05:06Z | StorageDeployment_Step1                      |                                                                                                                                        
| ControllerOvercloudServicesDeployment_Step6  | 6818b1d1-f41d-4a1a-9829-93174a95e11a          | OS::Heat::StructuredDeployments                   | UPDATE_FAILED   | 2016-02-11T20:15:07Z | ControllerNodesPostDeployment                |                                                                                                                                        
| 0                                            | 14198409-d23a-4616-9a5b-a8ae2ee1dd6c          | OS::Heat::StructuredDeployment                    | UPDATE_FAILED   | 2016-02-11T20:15:10Z | ControllerOvercloudServicesDeployment_Step6  |                                                                                                                                        
+----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+----------------------------------------------+    
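Filtering the failed rows out of a listing like this can be scripted. A minimal sketch, using two sample rows copied from the table above (on a live undercloud you would pipe the real heat resource-list output in instead of the sample variable):

```shell
# Sample rows copied from the resource listing above; on a real undercloud,
# replace this with: heat resource-list -n5 overcloud
sample='| ComputeNodesPostDeployment | 97a09040-4647-44b6-940c-df85ef81bbd1 | OS::TripleO::ComputePostDeployment | UPDATE_FAILED | 2016-02-11T20:04:10Z | |
| ControllerOvercloudServicesDeployment_Step6 | 6818b1d1-f41d-4a1a-9829-93174a95e11a | OS::Heat::StructuredDeployments | UPDATE_FAILED | 2016-02-11T20:15:07Z | ControllerNodesPostDeployment |'

# Keep only rows in a failed state and print the resource_name column,
# trimming the padding spaces around the field.
echo "$sample" | awk -F'|' '/UPDATE_FAILED/ {gsub(/^ +| +$/, "", $2); print $2}'
```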


Running [stack@instack ~]$ heat deployment-show 14198409-d23a-4616-9a5b-a8ae2ee1dd6c   shows the following errors:


Error: Could not prefetch keystone_tenant provider 'openstack': undefined method `collect' for nil:NilClass
Error: Could not prefetch keystone_role provider 'openstack': undefined method `collect' for nil:NilClass
Error: Could not prefetch keystone_user provider 'openstack': undefined method `collect' for nil:NilClass
Error: /Stage[main]/Keystone::Roles::Admin/Keystone_user_role[admin@admin]: Could not evaluate: undefined method `empty?' for nil:NilClass
Warning: /Stage[main]/Heat::Keystone::Domain/Exec[heat_domain_create]: Skipping because of failed dependencies
", "deploy_status_code": 6 }, "creation_time": "2016-02-11T17:52:46Z", "updated_time": "2016-02-11T20:19:23Z", "input_values": {}, "action": "UPDATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 6", "id": "14198409-d23a-4616-9a5b-a8ae2ee1dd6c" }



Expected result:
The update completes without the errors above.

Comment 3 Marios Andreou 2016-02-12 09:28:50 UTC
(In reply to Alexander Chuzhoy from comment #0)
> | ControllerOvercloudServicesDeployment_Step6  |
> 6818b1d1-f41d-4a1a-9829-93174a95e11a          |
> OS::Heat::StructuredDeployments                   | UPDATE_FAILED   |
> 2016-02-11T20:15:07Z | ControllerNodesPostDeployment                |       
> 
> | 0                                            |
> 14198409-d23a-4616-9a5b-a8ae2ee1dd6c          |
> OS::Heat::StructuredDeployment                    | UPDATE_FAILED   |
> 2016-02-11T20:15:10Z | ControllerOvercloudServicesDeployment_Step6  |       
> 
> +------------

^^^ at this point all the openstack-* services are managed by pacemaker and the constraints are all defined etc. Here we are doing the initialisation of the keystone admin role and heat domain:


1698 # SERVICES INIT                                                                
1699 # Needs to happen on all nodes (some parts do .conf file amendments)           
1700 # but it needs to happen on $pacemaker_master first (other parts make          
1701 # API calls and we don't want races, and the .conf file amendments             
1702 # often depend on the API calls having been made).                             
1703 if (hiera('step') >= 5 and $pacemaker_master) or hiera('step') >= 6 {          
1704                                                                                
1705   include ::keystone::roles::admin                                             
1706                                                                                
1707   # TO-DO: Remove this class as soon as Keystone v3 will be fully functional   
1708   include ::heat::keystone::domain                                             
1709   Class['::keystone::roles::admin'] -> Exec['heat_domain_create']              
1710                                                                                
1711 } # END SERVICES INIT (STEP 5 and 6)                              
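The gating in that Puppet block can be paraphrased as a quick shell sketch; step and pacemaker_master come from hiera on a real node, and are hard-coded here purely for illustration:

```shell
# Hedged sketch of the step gating above: the services-init block runs on the
# pacemaker master from step 5, and on every node from step 6. On a real node
# these values come from hiera; they are hard-coded here for illustration.
step=5
pacemaker_master=true

if { [ "$step" -ge 5 ] && [ "$pacemaker_master" = true ]; } || [ "$step" -ge 6 ]; then
  echo "services init: keystone admin role, then heat domain create"
fi
```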

Not sure why this is failing at the moment; just pointing this out as a first data point, and because this sounds familiar I think we may have hit it before.

Comment 5 Marios Andreou 2016-02-12 10:27:41 UTC
At the moment, we just know that Keystone initialisation couldn't happen (in particular, errors resolving the keystone_tenant/role/user providers), so the admin role couldn't be initialised as we ask it to in overcloud_controller_pacemaker.pp:

1705   include ::keystone::roles::admin                                    

There isn't enough info on the bug to know for sure why this is happening. It may for example be because rabbit isn't running, since we have a constraint like 

1049     pacemaker::constraint::base { 'rabbitmq-then-keystone-constraint':         

I mention this case in particular because I found a seemingly identical trace (at least, one matching the error text in this report) in my own logs:

9657:Jan 19 07:46:38 overcloud-controller-2.localdomain os-collect-config[5291]: ies and will be changed in a later release.\u001b[0m\n\u001b[1;31mError: Could not prefetch keystone_tenant provider 'openstack': undefined method `collect' for nil:NilClass\u001b[0m\n\u001b[1;31mError: Could not prefetch keystone_role provider 'openstack': undefined method `collect' for nil:NilClass\u001b[0m\n\u001b[1;31mError: Could not prefetch keystone_user provider 'openstack': undefined method `collect' for nil:NilClass\u001b[0m\n\u001b[1;31mError: /Stage[main]/Keystone::Roles::Admin/Keystone_user_role[admin@admin]: Could not evaluate: undefined method `empty?' for nil:NilClass\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Heat::Keystone::Domain/Exec[heat_domain_create]: Skipping because of failed dependencies\u001b[0m\n", "deploy_status_code": 6}


The root cause in that case was Keystone not running because rabbit wasn't. In any case, I can't really do much until folks come online who can either provide access to the env or fix up permissions on the sos reports for a closer look. Hopefully once we find it, it will be an easy fix... :/ #famous_last_words

thanks, marios

Comment 8 Marios Andreou 2016-02-12 14:41:53 UTC
Update: tl;dr, the update failed because yum did (couldn't reach mirrors). Not sure if the keystone error described by Sasha is just a symptom of the update stopping at that point. More info below.

I just spent ages staring at the logs from the various boxes. Assuming I've read things correctly, the update started with the overcloud-cephstorage-0 node, which completed, and then subsequently failed on overcloud-objectstorage-0. Best I can tell, it fails because it can't reach any yum repos:

Feb 11 20:07:36 overcloud-objectstorage-0.localdomain os-collect-config[4445]: [mNotice: /Stage[main]/Tripleo::Packages/Exec[package-upgrade]/returns: Error downloading packages:[0m
Feb 11 20:07:36 overcloud-objectstorage-0.localdomain os-collect-config[4445]: [mNotice: /Stage[main]/Tripleo::Packages/Exec[package-upgrade]/returns:   rabbitmq-server-3.3.5-15.el7ost.noarch: [Errno 256] No more mirrors to try.[0m
Feb 11 20:07:37 overcloud-objectstorage-0.localdomain os-collect-config[4445]: [1;31mError: yum -y update returned 1 instead of one of [0][0m
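A quick way to scan a node for this failure pattern; a sketch that greps a sample line taken from this report (on the node itself you would feed the os-collect-config journal or /var/log/messages in instead):

```shell
# Sample line adapted from the os-collect-config output above; on the node,
# replace the echo with e.g.: journalctl -u os-collect-config
logline='os-collect-config[4445]: rabbitmq-server-3.3.5-15.el7ost.noarch: [Errno 256] No more mirrors to try.'

# Flag the yum mirror failure if the pattern is present.
if echo "$logline" | grep -q 'No more mirrors to try'; then
  echo "yum mirror failure detected"
fi
```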

None of the controllers or the compute got updated, so I don't think it has anything to do with the update procedure per se (since the controllers are where it gets most interesting, with the removal of the given node from the cluster). There are various errors in the controller logs, including the one described by Sasha for the keystone heat domain, but given that yum_update fails, the rest of the puppet run doesn't happen either, so our setup is left in a half-updated, half-started state. In particular I see all pacemaker-managed services are started on controllers 1 and 2.

Just got access to the environment so will poke further. In particular I'd like to see if the yum update on the object-storage was just a transient network issue and if after that completes ok the rest of the update can continue. 

thanks, marios

Comment 9 Marios Andreou 2016-02-12 15:31:28 UTC
(In reply to marios from comment #8)
> Update: tl;dr, the update failed because yum did (couldn't reach mirrors).
> Not sure if the keystone error described by sasha is just a symptom of the
> update stopping at that point. More info below.

I just confirmed on the environment that all services can be started fine (so there is no config issue at the moment, afaics). Some services were down on controller-0 and all services were unmanaged because the cluster was set into maintenance mode. I unset that

pcs property unset maintenance-mode 

and, combined with an earlier pcs cluster stop followed after a while by pcs cluster start on controller 0, all pacemaker-managed services seem OK again, including keystone and heat:

[root@overcloud-controller-0 heat-admin]# pcs status | grep "keystone\|heat" -A 1
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
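The recovery sequence just described can be sketched as below; the DRY_RUN guard means the pcs calls are only printed, since they need a live pacemaker cluster (and the stop/start steps were run on controller 0, as described above):

```shell
# Hedged sketch of the recovery sequence from this comment. With DRY_RUN=1
# the pcs commands are printed rather than executed, since they require a
# live pacemaker cluster.
DRY_RUN=1
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run pcs property unset maintenance-mode   # services become managed again
run pcs cluster stop                      # on controller 0
run pcs cluster start                     # on controller 0, after a pause
run pcs status                            # verify the clone sets are Started
```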

I also got onto the objectstorage node (where yum failed) and I was able to yum install vim, so that seems to be ok (i.e. repos are reachable). 

@Sasha, now that the cluster is back in a reasonable state, do you think you could try the update again, so we can see whether that yum_update issue is transient and whether we can get further after/if it passes?

thanks

Comment 10 Marios Andreou 2016-02-15 10:13:11 UTC
(In reply to marios from comment #9)
> (In reply to marios from comment #8)
> > Update: tl;dr, the update failed because yum did (couldn't reach mirrors).
> > Not sure if the keystone error described by sasha is just a symptom of the
> > update stopping at that point. More info below.
> 
> I just confirmed on the environment that all services can be started fine
> (so there is no config issue at the moment, afaics). Some services were down


After more debugging with Sasha on the environment on Friday evening, and since this is definitely not a config issue, I'm putting it down to transient yum repo problems for now (we can re-open if you disagree, Sasha?).

We started another update, which successfully updated the object-storage node (which is what failed initially, and because of which this bug was filed), but I left before it completed. In any case, since this doesn't appear to be a config issue (pacemaker cluster running fine and all managed services OK on 3 controllers), I'm closing for now.

Comment 11 Amit Ugol 2018-05-02 10:49:20 UTC
closed, no need for needinfo.