Bug 1478004 - Cannot upgrade OSP11 to OSP12 due to freeze after executing of overcloud deploy command
Summary: Cannot upgrade OSP11 to OSP12 due to freeze after executing of overcloud depl...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: instack-undercloud
Version: 12.0 (Pike)
Hardware: x86_64
OS: Linux
high
urgent
Target Milestone: ga
: 12.0 (Pike)
Assignee: Lee Yarwood
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On: 1456562
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-08-03 12:46 UTC by Lee Yarwood
Modified: 2017-09-28 15:21 UTC (History)
22 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1456562
Environment:
Last Closed: 2017-09-28 15:21:41 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1708435 0 None None None 2017-08-03 12:46:15 UTC

Description Lee Yarwood 2017-08-03 12:46:16 UTC
+++ This bug was initially created as a clone of Bug #1456562 +++

Description of problem:
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy --libvirt-type kvm --ntp-server clock.redhat.com     --control-scale 1     --templates $THT     -e $THT/environments/low-memory-usage.yaml     -e $THT/environments/docker.yaml     -e $THT/environments/major-upgrade-composable-steps-docker.yaml     -e $THT/environments/containers-upgrade-repos.yaml     -e $THT/environments/docker-osp12.yaml
Started Mistral Workflow tripleo.validations.v1.check_pre_deployment_validations. Execution ID: 17ba8909-34a5-4738-a169-47eafa3c7b53
Waiting for messages on queue 'e6be1d56-3d22-45d2-9e5e-e991c1e786b4' with no timeout.

(undercloud) [stack@undercloud-0 ~]$ date
Mon May 29 10:58:10 EDT 2017
(undercloud) [stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
^[[A+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| id                                   | stack_name | stack_status    | creation_time        | updated_time | project                          |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| 61252786-3df4-4fa8-b91b-868a5af148ad | overcloud  | CREATE_COMPLETE | 2017-05-29T08:38:37Z | None         | 853b4c20691043b58ab0b5fa344cf23d |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
(undercloud) [stack@undercloud-0 ~]$ ps -aux |grep overcloud
stack     3471  0.1  0.3 359000 62840 pts/0    S+   10:49   0:00 /usr/bin/python2 /usr/bin/openstack overcloud deploy --libvirt-type kvm --ntp-server clock.redhat.com --control-scale 1 --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/containers-upgrade-repos.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-osp12.yaml
stack     3908  0.0  0.0 112664   972 pts/1    R+   10:58   0:00 grep --color=auto overcloud


Version-Release number of selected component (if applicable):
OSP11->OSP12

How reproducible:
alway

Steps to Reproduce:
1) Deploy OSP11 Undercloud & Overcloud
2) perform minor update of undercloud node
>sudo rhos-release 11 -r 7.4
>sudo systemctl stop 'openstack-*' 'neutron-*' httpd
>sudo yum update -y instack-undercloud openstack-puppet-modules openstack-tripleo-common python-tripleoclient
>openstack undercloud upgrade
>sudo reboot

3) upgrade osp11 undrcloud to osp12

>sudo rhos-release 12-director
>sudo systemctl stop openstack-*
>sudo systemctl stop neutron-*
>sudo systemctl stop httpd

Apply workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1456458

>sudo yum -y update instack-undercloud openstack-puppet-modules openstack-tripleo-common python-tripleoclient
>openstack undercloud upgrade

#Overcloud upgrade 
source stackrc

export IMAGE_TAG=`bash /usr/bin/puddle-version http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/12.0-RHEL-7/latest_containers`

 sudo wget -O /home/stack/overcloud_containers.yaml http://file.rdu.redhat.com/~ohochman/containers/overcloud_containers.yaml
    sudo chown stack:stack /home/stack/overcloud_containers.yaml
    sudo sed -i 's:<IMAGE_TAG>:'$IMAGE_TAG':g' /home/stack/overcloud_containers.yaml
    sudo sed -i 's/--insecure-registry 192.168.24.3:8787/--insecure-registry docker-registry.engineering.redhat.com/' /etc/sysconfig/docker
    sudo service docker restart
    source /home/stack/stackrc && openstack overcloud container image upload --verbose --config-file /home/stack/overcloud_containers.yaml 
    
sudo wget -O /usr/share/openstack-tripleo-heat-templates/environments/docker-osp12.yaml http://file.rdu.redhat.com/~ohochman/containers/docker-osp12.yaml
    sudo chown stack:stack /usr/share/openstack-tripleo-heat-templates/environments/docker-osp12.yaml

    sudo sed -i 's:<IMAGE_TAG>:'$IMAGE_TAG':g' /usr/share/openstack-tripleo-heat-templates/environments/docker-osp12.yaml


# use master repos
cat > ~/containers-upgrade-repos.yaml <<EOEF
parameter_defaults:
  UpgradeInitCommand: |
    set -e
    curl -LO http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm
    rpm -ivh rhos-release-latest.noarch.rpm || true
    rhos-release 12-director
    yum clean all
EOEF

export THT=/usr/share/openstack-tripleo-heat-templates

# migrate to containers
openstack overcloud deploy \
    --libvirt-type qemu \
    --ntp-server clock.redhat.com \
    --control-scale 1 \
    --templates $THT \
    -e $THT/environments/low-memory-usage.yaml \
    -e $THT/environments/docker.yaml \
    -e $THT/environments/major-upgrade-composable-steps-docker.yaml \
    -e ~/containers-upgrade-repos.yaml \
    -e ~/docker-osp12.yaml

Actual results:
overcloud upgrade stuck

Expected results:

overcloud successfully upgraded to next major version
Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-05-29 11:09:24 EDT ---

This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-05-29 11:14:07 EDT ---

This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

--- Additional comment from Dougal Matthews on 2017-06-07 08:54:05 EDT ---

In the Mistral executor logs I found the following error:

2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor Traceback (most recent call last):
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 112, in run_action
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor     result = action.run()
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 159, in run
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor     (self.__class__.__name__, self.client_method_name, e_str)
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor ActionException: ZaqarAction.queue_post failed: <class 'zaqarclient.transport.errors.InternalServerError'>: Error response from Zaqar. Code: 500. Text: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor <html><head>
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor <title>500 Internal Server Error</title>
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor </head><body>
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor <h1>Internal Server Error</h1>
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor <p>The server encountered an internal error or
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor misconfiguration and was unable to complete
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor your request.</p>
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor <p>Please contact the server administrator at 
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor  [no address given] to inform them of the time this error occurred,
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor  and the actions you performed just before this error.</p>
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor <p>More information about this error may be available
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor in the server error log.</p>
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor </body></html>
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor .
2017-06-07 08:31:42.569 1446 ERROR mistral.executors.default_executor 



It seems that Zaqar is returning an error every time Mistral attempts to access it.

--- Additional comment from Thomas Hervé on 2017-06-07 10:35:35 EDT ---

So from the apache logs:

[Wed Jun 07 09:56:16.069984 2017] [:error] [pid 1906] [remote 192.168.24.1:144] mod_wsgi (pid=1906): Target WSGI script '/var/www/cgi-bin/zaqar/zaqar-server' cannot be loaded as Python module.
[Wed Jun 07 09:56:16.070035 2017] [:error] [pid 1906] [remote 192.168.24.1:144] mod_wsgi (pid=1906): Exception occurred processing WSGI script '/var/www/cgi-bin/zaqar/zaqar-server'.
[Wed Jun 07 09:56:16.070054 2017] [:error] [pid 1906] [remote 192.168.24.1:144] Traceback (most recent call last):
[Wed Jun 07 09:56:16.070068 2017] [:error] [pid 1906] [remote 192.168.24.1:144]   File "/var/www/cgi-bin/zaqar/zaqar-server", line 41, in <module>
[Wed Jun 07 09:56:16.070091 2017] [:error] [pid 1906] [remote 192.168.24.1:144]     log.register_options(conf)
[Wed Jun 07 09:56:16.070098 2017] [:error] [pid 1906] [remote 192.168.24.1:144]   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 253, in register_options
[Wed Jun 07 09:56:16.070110 2017] [:error] [pid 1906] [remote 192.168.24.1:144]     conf.register_cli_opts(_options.common_cli_opts)
[Wed Jun 07 09:56:16.070115 2017] [:error] [pid 1906] [remote 192.168.24.1:144]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2318, in __inner
[Wed Jun 07 09:56:16.070124 2017] [:error] [pid 1906] [remote 192.168.24.1:144]     result = f(self, *args, **kwargs)
[Wed Jun 07 09:56:16.070129 2017] [:error] [pid 1906] [remote 192.168.24.1:144]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2540, in register_cli_opts
[Wed Jun 07 09:56:16.070136 2017] [:error] [pid 1906] [remote 192.168.24.1:144]     self.register_cli_opt(opt, group, clear_cache=False)
[Wed Jun 07 09:56:16.070140 2017] [:error] [pid 1906] [remote 192.168.24.1:144]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2322, in __inner
[Wed Jun 07 09:56:16.070147 2017] [:error] [pid 1906] [remote 192.168.24.1:144]     return f(self, *args, **kwargs)
[Wed Jun 07 09:56:16.070151 2017] [:error] [pid 1906] [remote 192.168.24.1:144]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2532, in register_cli_opt
[Wed Jun 07 09:56:16.070185 2017] [:error] [pid 1906] [remote 192.168.24.1:144]     raise ArgsAlreadyParsedError("cannot register CLI option")
[Wed Jun 07 09:56:16.070201 2017] [:error] [pid 1906] [remote 192.168.24.1:144] ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option


Restarting apache seems to fix it, but I don't know why.

--- Additional comment from Thomas Hervé on 2017-06-08 15:50:25 EDT ---

After the issue reappeared, it seemed that an additional step was required: move /var/log/zaqar/zaqar.log. I wasn't able to know why yet, but maybe permissions error?

--- Additional comment from Artem Hrechanychenko on 2017-06-19 11:50:30 EDT ---

w/a restart httpd service and delete /var/log/zaqar/zaqar.log

--- Additional comment from Dan Prince on 2017-06-19 11:53:40 EDT ---

Any chance this is related to https://review.openstack.org/#/c/472618

--- Additional comment from Marius Cornea on 2017-07-11 17:46:54 EDT ---

Looks like an SELinux issue:

 [root@undercloud-0 stack]# grep zaqar /var/log/audit/audit.log                                                                                                                                                                                
type=AVC msg=audit(1499809225.125:6424): avc:  denied  { open } for  pid=23130 comm="httpd" path="/var/log/zaqar/zaqar.log" dev="vda1" ino=12686192 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:var_log_t:s0 tclass=file
type=AVC msg=audit(1499809391.677:6514): avc:  denied  { open } for  pid=23563 comm="httpd" path="/var/log/zaqar/zaqar.log" dev="vda1" ino=12686192 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:var_log_t:s0 tclass=file


 [root@undercloud-0 stack]# ls -lZ /var/log/zaqar/zaqar.log 
-rw-r--r--. zaqar zaqar system_u:object_r:var_log_t:s0   /var/log/zaqar/zaqar.log

--- Additional comment from Marius Cornea on 2017-07-11 17:49:36 EDT ---

After removing the file and restarting httpd:

[root@undercloud-0 stack]# ls -lZ /var/log/zaqar/zaqar.log
-rw-r--r--. zaqar zaqar system_u:object_r:httpd_log_t:s0 /var/log/zaqar/zaqar.log

--- Additional comment from Mike Orazi on 2017-07-25 10:49:15 EDT ---

Lon,

Can you help confirm this is an selinux bug?

--- Additional comment from Marius Cornea on 2017-07-25 17:14 EDT ---

After undercloud upgrade:

[root@undercloud-0 stack]# rpm -qa | grep openstack-selinux
openstack-selinux-0.8.8-0.20170622195307.74ddc0e.el7ost.noarch

[root@undercloud-0 stack]#  ls -lZ /var/log/zaqar/zaqar.log
-rw-r--r--. zaqar zaqar system_u:object_r:var_log_t:s0   /var/log/zaqar/zaqar.log

Attaching /var/log/audit/audit.log after undercloud upgrade.

--- Additional comment from Marius Cornea on 2017-08-02 10:06:43 EDT ---



--- Additional comment from Lon Hohberger on 2017-08-02 11:18:52 EDT ---

We need to set /var/log/zaqar/zaqar.log to a different file context and then call 'restorecon' after deployment.

--- Additional comment from Lon Hohberger on 2017-08-02 11:55:03 EDT ---

https://github.com/redhat-openstack/openstack-selinux/commit/f7297a77903fcb8963c7796e49d357cb7cce2b28

This still needs a restorecon for it to operate correctly.

Comment 1 Lee Yarwood 2017-08-03 15:24:13 UTC
I created this after initial discussions with Lon suggested that we would need to run restorecon outside of the openstack-selinux package, however that may no longer be the case. I'll close this out if we don't need anything in instack etc.

Comment 2 Lee Yarwood 2017-09-28 15:21:41 UTC
AFAIK this was fixed elsewhere and thus this bug isn't required anymore.


Note You need to log in before you can comment on or make changes to this bug.