Bug 1897691 - [RHOSP 13 to 16.1 Upgrades][InstanceHA] nova_compute container stuck in restarting state after executing nova_hybrid_state tasks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: All
Priority: urgent
Severity: urgent
Target Milestone: z3
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Lukas Bezdicka
QA Contact: Jose Luis Franco
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-13 18:53 UTC by MD Sufiyan
Modified: 2023-12-15 20:06 UTC (History)
CC List: 10 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170174.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 18:37:35 UTC
Target Upstream Version:
Embargoed:


Attachments
updated nova-compute-container-puppet.yaml (53.42 KB, text/plain)
2020-11-13 18:53 UTC, MD Sufiyan


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 762708 0 None MERGED [train-only][ffwd] Update InstanceHA script in hybrid mode 2021-02-16 16:02:15 UTC
Red Hat Product Errata RHEA-2020:5413 0 None None None 2020-12-15 18:37:58 UTC

Description MD Sufiyan 2020-11-13 18:53:17 UTC
Created attachment 1729211
updated nova-compute-container-puppet.yaml

Description of problem:

The nova_compute container gets stuck in a restart loop after the nova_hybrid_state tasks are executed during an upgrade from OSP13 to OSP16.1 in an Instance HA environment.

~~~
[root@msufiyan-novacomputeiha-0 ~]# docker logs nova_compute


++ cat /run_command                                                                                                 
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '                                                            
+ ARGS=                                                                                                             
+ sudo kolla_copy_cacerts                                                                                           
+ [[ ! -n '' ]]                                                                                                     
+ . kolla_extend_start                                                                                              
++ [[ ! -d /var/log/kolla/nova ]]                                                                                   
+++ stat -c %a /var/log/kolla/nova                                                                                  
++ [[ 2755 != \7\5\5 ]]                                                                                             
++ chmod 755 /var/log/kolla/nova                                                                                    
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '                                                 
++ . /usr/local/bin/kolla_nova_extend_start                                                                         
+++ [[ ! -d /var/lib/nova/instances ]]                                                                              
+ echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''                                  
+ exec /var/lib/nova/instanceha/check-run-nova-compute                                                              
Traceback (most recent call last):                                                                                  
  File "/var/lib/nova/instanceha/check-run-nova-compute", line 191, in <module>                                     
    connection = create_nova_connection(config.sections["placement"])                                               
  File "/var/lib/nova/instanceha/check-run-nova-compute", line 149, in create_nova_connection                       
    http_log_debug=options.has_key("verbose"),                                                                      
AttributeError: 'dict' object has no attribute 'has_key'                                                            
[root@msufiyan-novacomputeiha-0 ~]#
~~~

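- It seems the tasks that install the Instance HA script used to run nova_compute are missing from nova-compute-container-puppet.yaml. As a result, nova-compute still uses the old code [1] in file [2]. That code calls dict.has_key(), which no longer exists in Python 3, so the script fails with the AttributeError shown above and the container keeps restarting.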

~~~
            nova = client.Client(version,
                                 region_name=options["os_region_name"][0],
                                 session=keystone_session, auth=keystone_auth,
                                 http_log_debug=options.has_key("verbose"),
                                 endpoint_type=nova_endpoint_type)
~~~

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/extraconfig/tasks/instanceha/check-run-nova-compute#L149
[2] /var/lib/nova/instanceha/check-run-nova-compute

Workaround:

1) Add the tasks below, which make the new "check-run-nova-compute" available for bind mounting, to /usr/share/openstack-tripleo-heat-templates/deployment/nova/nova-compute-container-puppet.yaml

~~~
                # This code is a partial copy of the logic in the podman installation
                - name: is Instance HA enabled
                  set_fact:
                    instance_ha_enabled: {get_param: EnableInstanceHA}
                - name: install Instance HA script that runs nova-compute
                  when: instance_ha_enabled|bool
                  copy:
                    content: {get_file: ../../scripts/check-run-nova-compute}
                    dest: /var/lib/nova/instanceha/check-run-nova-compute
                    mode: 0755
~~~

2) Update the plan

~~~
openstack overcloud upgrade prepare ...
~~~

3) Re-run the nova_hybrid_state tasks to start the container with the new check-run-nova-compute

~~~
nohup openstack overcloud upgrade run --stack msufiyan --playbook upgrade_steps_playbook.yaml --tags nova_hybrid_state --limit all --yes &
~~~
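After re-running the tasks, it may help to confirm that the script installed on the compute node no longer contains the Python 2 call. The following is only a minimal sketch, not part of the shipped fix: the helper names find_has_key_calls and scan_file are hypothetical, and the default path is the one quoted in [2].

```python
import re
from pathlib import Path

# Matches attribute calls such as options.has_key("verbose")
HAS_KEY_CALL = re.compile(r"\.has_key\s*\(")


def find_has_key_calls(source_text):
    """Return (line_number, stripped_line) pairs for lines calling .has_key(...)."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if HAS_KEY_CALL.search(line)
    ]


def scan_file(path="/var/lib/nova/instanceha/check-run-nova-compute"):
    """Hypothetical helper: scan the installed script on the compute node."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return find_has_key_calls(text)
```

Calling scan_file() on the compute node and getting an empty list back would suggest that the updated script is in place; any returned line still uses has_key and would fail under Python 3.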

Additional note:

The updated check-run-nova-compute will contain:

~~~
        else:
            # OSP >= Ocata
            # ArgSpec(args=['version'], varargs='args', keywords='kwargs', defaults=None)
            nova = client.Client(version,
                                 region_name=region,
                                 session=keystone_session, auth=keystone_auth,
                                 http_log_debug="verbose" in options,  # replaces options.has_key("verbose")
                                 endpoint_type=nova_endpoint_type)
~~~
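The change to the http_log_debug line can be illustrated in isolation. This is a small sketch, not the script itself: the options dictionary and its values (including "regionOne") are made up, and its list-valued shape is only assumed from the options["os_region_name"][0] access in the old code.

```python
def is_verbose_py2_style(options):
    # Python 2 only: dict.has_key() was removed in Python 3
    return options.has_key("verbose")


def is_verbose(options):
    # Membership test works on both Python 2 and Python 3
    return "verbose" in options


# Hypothetical parsed options; values are lists as assumed from the old code
options = {"os_region_name": ["regionOne"], "verbose": ["true"]}

try:
    is_verbose_py2_style(options)
    old_style_error = None
except AttributeError as exc:
    # On Python 3 this is the same error that appears in the container log
    old_style_error = str(exc)

print(old_style_error)
print(is_verbose(options))
```

Under Python 3 the first call raises AttributeError, while the membership test returns a plain boolean that can be passed straight to http_log_debug.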

Version-Release number of selected component (if applicable):
OSP16.1 

How reproducible:
Every time we follow the upgrade framework [3] in an Instance HA based environment.

[3] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#upgrading-controller-nodes-with-director-deployed-ceph-storage_upgrading-overcloud-standard

Comment 15 errata-xmlrpc 2020-12-15 18:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5413

