Bug 1494455

Summary: [osp12]Controller Node replacement failed on overcloud.AllNodesDeploySteps.ControllerDeployment_Step3.0:, UPDATE aborted
Product: Red Hat OpenStack
Reporter: Artem Hrechanychenko <ahrechan>
Component: openstack-tripleo-heat-templates
Assignee: Emilien Macchi <emacchi>
Status: CLOSED ERRATA
QA Contact: Artem Hrechanychenko <ahrechan>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 12.0 (Pike)
CC: ahrechan, akaris, bnemec, chjones, dbecker, dciabrin, dprince, emacchi, jcoufal, jjoyce, jslagle, m.andre, mburns, michele, morazi, ohochman, rhel-osp-director-maint, sasha
Target Milestone: rc
Keywords: TestBlocker, Triaged
Target Release: 12.0 (Pike)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-7.0.3-16.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-13 22:11:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1499217, 1501852, 1505909, 1514520
Bug Blocks:

Description Artem Hrechanychenko 2017-09-22 10:03:29 UTC
Description of problem:

I tried to replace controller-2 with controller-3 using https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes

overcloud.AllNodesDeploySteps.ControllerDeployment_Step3.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 63711c38-3e69-435e-8d2c-e4fa6f81b467
  status: UPDATE_FAILED

Full failures list output:
http://pastebin.test.redhat.com/518376

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.11 "sudo cat /etc/corosync/corosync.conf"
totem {
    version: 2
    cluster_name: tripleo_cluster
    transport: udpu
    token: 10000
}

nodelist {
    node {
        ring0_addr: controller-0
        nodeid: 1
    }

    node {
        ring0_addr: controller-1
        nodeid: 2
    }

    node {
        ring0_addr: controller-2
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

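The nodelist above is the telling part: the replacement node (controller-3) never appears in it. A self-contained way to list the members a corosync.conf declares — the here-doc below is a canned copy of the pasted config; on a real controller you would read /etc/corosync/corosync.conf instead:

```shell
# Print the cluster members declared in a corosync.conf nodelist.
# The here-doc mirrors the config pasted above; controller-3 is absent.
awk -F': ' '/ring0_addr/ {print $2}' <<'EOF'
nodelist {
    node {
        ring0_addr: controller-0
        nodeid: 1
    }
    node {
        ring0_addr: controller-1
        nodeid: 2
    }
    node {
        ring0_addr: controller-2
        nodeid: 3
    }
}
EOF
# prints controller-0, controller-1, controller-2 (one per line)
```
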
Version-Release number of selected component (if applicable):
OSP12

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OSP12 overcloud with 3 controllers + 1 compute + 1 spare Ironic node.
2. Follow https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes
3. Perform steps 5-8 from http://etherpad.corp.redhat.com/JtZ84Hp2nQ

Actual results:
The update fails at ControllerDeployment_Step3 with "UPDATE aborted".

Expected results:
The update fails on ControllerNodesPostDeployment, as described in the manual in step 9.4.3.

Additional info:

Comment 2 Artem Hrechanychenko 2017-09-22 10:04:05 UTC
openstack-keystone-12.0.1-0.20170907172639.6a67918.el7ost.noarch
python-openstackclient-lang-3.12.0-0.20170821150739.f67ebce.el7ost.noarch
openstack-neutron-common-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openstack-tripleo-common-containers-7.6.1-0.20170912115321.el7ost.noarch
openstack-ironic-inspector-6.0.1-0.20170824132804.0e72dcb.el7ost.noarch
python-openstackclient-3.12.0-0.20170821150739.f67ebce.el7ost.noarch
openstack-tripleo-common-7.6.1-0.20170912115321.el7ost.noarch
openstack-mistral-common-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-nova-api-16.0.1-0.20170908213719.el7ost.noarch
openstack-nova-conductor-16.0.1-0.20170908213719.el7ost.noarch
openstack-glance-15.0.0-0.20170830130905.9820166.el7ost.noarch
openstack-nova-compute-16.0.1-0.20170908213719.el7ost.noarch
puppet-openstacklib-11.3.1-0.20170825142820.18ee919.el7ost.noarch
openstack-heat-api-9.0.1-0.20170911115334.0c64134.el7ost.noarch
openstack-swift-object-2.15.2-0.20170824165102.c54c6b3.el7ost.noarch
openstack-swift-proxy-2.15.2-0.20170824165102.c54c6b3.el7ost.noarch
openstack-ironic-api-9.1.1-0.20170908114346.feb64c2.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-nova-common-16.0.1-0.20170908213719.el7ost.noarch
puppet-openstack_extras-11.3.1-0.20170906070209.b99c3a4.el7ost.noarch
openstack-tripleo-puppet-elements-7.0.0-0.20170910154847.2094778.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170913050524.0rc2.el7ost.noarch
openstack-swift-container-2.15.2-0.20170824165102.c54c6b3.el7ost.noarch
openstack-neutron-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openstack-neutron-openvswitch-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openstack-heat-engine-9.0.1-0.20170911115334.0c64134.el7ost.noarch
openstack-ironic-conductor-9.1.1-0.20170908114346.feb64c2.el7ost.noarch
openstack-tempest-17.0.0-0.20170901201711.ad75393.el7ost.noarch
openstack-mistral-executor-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-tripleo-validations-7.3.1-0.20170907082220.efe8a72.el7ost.noarch
openstack-selinux-0.8.9-0.1.el7ost.noarch
openstack-nova-placement-api-16.0.1-0.20170908213719.el7ost.noarch
openstack-swift-account-2.15.2-0.20170824165102.c54c6b3.el7ost.noarch
openstack-heat-common-9.0.1-0.20170911115334.0c64134.el7ost.noarch
python-openstacksdk-0.9.17-0.20170821143340.7946243.el7ost.noarch
openstack-tripleo-ui-7.4.1-0.20170911164240.16684db.el7ost.noarch
openstack-ironic-common-9.1.1-0.20170908114346.feb64c2.el7ost.noarch
openstack-puppet-modules-11.0.0-0.20170828113154.el7ost.noarch
openstack-mistral-api-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-nova-scheduler-16.0.1-0.20170908213719.el7ost.noarch
openstack-neutron-ml2-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openstack-heat-api-cfn-9.0.1-0.20170911115334.0c64134.el7ost.noarch
openstack-tripleo-image-elements-7.0.0-0.20170910153513.526772d.el7ost.noarch
openstack-zaqar-5.0.1-0.20170905222047.el7ost.noarch

Comment 3 Artem Hrechanychenko 2017-09-22 10:04:34 UTC
(undercloud) [stack@undercloud-0 ~]$ cat remove-controller.yaml 
parameters:
  ControllerRemovalPolicies:
    [{'resource_list': ['2']}]
  CorosyncSettleTries: 5

Comment 4 Artem Hrechanychenko 2017-09-22 10:05:14 UTC
(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| a954f4a6-57bf-4feb-b301-1675a24af625 | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.7  |
| 4c890019-5b13-4e10-b716-baa4a454bba4 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| 6844a375-44dd-4ad9-918a-3892b6f2d39e | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 8a7ae32c-b808-4357-86a8-db8f873aff4f | controller-3 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |


ping 192.168.24.8
PING 192.168.24.8 (192.168.24.8) 56(84) bytes of data.
64 bytes from 192.168.24.8: icmp_seq=1 ttl=64 time=0.430 ms

Comment 6 Omri Hochman 2017-09-25 14:14:53 UTC
Adding Test-Blocker/Blocker, as this blocks controller replacement in OSP12.

Comment 7 Ben Nemec 2017-09-25 14:18:48 UTC
I looked at http://pastebin.test.redhat.com/518376 and I'm not sure that's the root cause of the update failure.  It has a reason of "UPDATE aborted", which usually means something else failed and caused Heat to kill that resource too.  It's frequently seen on timeouts in my experience, although I believe there are other things that can cause it too.

Comment 8 Ben Nemec 2017-09-27 18:15:58 UTC
So to follow up on my previous comment, I think we need to know if there were any other failed resources or if the update as a whole timed out to cause the aborted resource status that I'm seeing.  In the logs provided there isn't actually a failure that I can see so it's impossible to say what went wrong.
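To dig for the real culprit behind an "UPDATE aborted", one approach is to filter the failed resources for those whose status reason is not an abort. A self-contained sketch — the two sample lines below are invented stand-ins, not from this bug's logs; on the undercloud the input would come from `openstack stack resource list -n 5 overcloud`:

```shell
# Drop resources Heat aborted as collateral damage; keep genuine failures.
# Sample lines are hypothetical stand-ins for real resource-list output.
sample='ControllerDeployment_Step3.0  UPDATE_FAILED  UPDATE aborted
ControllerDeployment_Step1.0  UPDATE_FAILED  Error: puppet apply failed'
printf '%s\n' "$sample" | awk '/UPDATE_FAILED/ && !/UPDATE aborted/'
# prints only the Step1 line, the genuine failure
```
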

Comment 9 Artem Hrechanychenko 2017-09-28 15:04:04 UTC
The update is stuck on the memcached container on the new node:

[heat-admin@controller-2 ~]$ sudo docker inspect memcached |grep command
                "config_data": "{\"start_order\": 1, \"image\": \"192.168.24.1:8787/rhosp12/openstack-memcached-docker:2017-09-27.3\", \"environment\": [\"TRIPLEO_CONFIG_HASH=010a1638903ff5c9a080e00eeb23cc0f\"], \"command\": [\"/bin/bash\", \"-c\", \"source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS\"], \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/sys/fs/selinux:/sys/fs/selinux\", \"/var/lib/config-data/memcached/etc/sysconfig/memcached:/etc/sysconfig/memcached:ro\"], \"net\": \"host\", \"privileged\": false, \"restart\": \"always\"}",

[heat-admin@controller-2 ~]$ sudo docker exec -it memcached /bin/bash
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ ps 
    PID TTY          TIME CMD
     72 ?        00:00:00 bash
     84 ?        00:00:00 ps
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
memcach+       1  0.0  0.0  11636  1376 ?        Ss   13:20   0:00 /bin/bash -c source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS
memcach+       7  0.0  0.0 629048  2384 ?        Sl   13:20   0:00 /usr/bin/memcached -p 11211 -u memcached -m 7941 -c 8192 -l 172.17.1.23 -U 11211 -t 8 >> /var/log/memcached.log 2>&1
memcach+      72  0.5  0.0  11640  1708 ?        Ss   14:57   0:00 /bin/bash
memcach+      91  0.0  0.0  47448  1664 ?        R+   14:57   0:00 ps -aux
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ cat /var/log/memcached.log 
cat: /var/log/memcached.log: No such file or directory
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found

[heat-admin@controller-3 ~]$ sudo docker ps
CONTAINER ID        IMAGE                                                               COMMAND                  CREATED             STATUS              PORTS               NAMES
55d2050d2664        192.168.24.1:8787/rhosp12/openstack-memcached-docker:2017-09-27.3   "/bin/bash -c 'source"   38 minutes ago      Up 38 minutes                           memcached
[heat-admin@controller-3 ~]$ sudo docker exec -it memcached /bin/bash
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ cat /var/log/memcached.log                                                                                                                                                                      
cat: /var/log/memcached.log: No such file or directory
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
memcach+       1  0.0  0.0  11636  1372 ?        Ss   14:23   0:00 /bin/bash -c source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS
memcach+       7  0.0  0.0 628024  3380 ?        Sl   14:23   0:00 /usr/bin/memcached -p 11211 -u memcached -m 7941 -c 8192 -l 172.17.1.16 -U 11211 -t 8 >> /var/log/memcached.log 2>&1
memcach+      44  0.3  0.0  11640  1700 ?        Ss   15:03   0:00 /bin/bash
memcach+      63  0.0  0.0  47448  1656 ?        R+   15:03   0:00 ps -aux
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$

Comment 10 Chris Jones 2017-10-02 15:46:16 UTC
Hey folks, this doesn't seem like a PIDONE bug to me - if it's getting stuck on memcached that is definitely outside our wheelhouse, but I also think it's too broad to say that PIDONE is automatically responsible for controller replacement issues.

We will definitely always help for issues where we can, but IMHO that would generally mean that the lower level clustering (ie Pacemaker & Friends) is involved.

Re-assigning to DFG:DF - please let me know if you disagree :)

Comment 11 Ben Nemec 2017-10-02 15:57:26 UTC
I've historically done a lot of work on controller replacement, so I don't have a problem with assigning this to DF.

That said, I've looked at this and my container-fu is not strong enough to figure out what's going on.  As Artem posted above, it seems to be stuck on a single memcached container with no logs that either of us could find to explain what was happening.  I'm not familiar enough with containers to guess at where it is stuck.

Comment 12 Ben Nemec 2017-10-05 17:08:41 UTC
I'm told that the lack of logs from memcached is actually normal, so that's probably a red herring.  That said, I've debugged this about as much as I can and haven't come up with anything.  I'm sending it to the containers DFG as it seems to be a regression in the controller node replacement procedure that is happening during container deployment and I think we need someone from that team to take a look.

Comment 13 Dan Prince 2017-10-10 20:02:22 UTC
(In reply to Chris Jones from comment #10)
> Hey folks, this doesn't seem like a PIDONE bug to me - if it's getting stuck
> on memcached that is definitely outside our wheelhouse, but I also think
> it's too broad to say that PIDONE is automatically responsible for
> controller replacement issues.
> 
> We will definitely always help for issues where we can, but IMHO that would
> generally mean that the lower level clustering (ie Pacemaker & Friends) is
> involved.
> 
> Re-assigning to DFG:DF - please let me know if you disagree :)

Chris: We do run memcached under pacemaker btw but I'll have a look and see what I can figure out.

Comment 14 Dan Prince 2017-10-10 20:04:06 UTC
Took a look at a dev environment this afternoon and noticed the following:


On controller-0 we have the pacemaker version of memcached running:

(In reply to Dan Prince from comment #13)
> (In reply to Chris Jones from comment #10)
> > Hey folks, this doesn't seem like a PIDONE bug to me - if it's getting stuck
> > on memcached that is definitely outside our wheelhouse, but I also think
> > it's too broad to say that PIDONE is automatically responsible for
> > controller replacement issues.
> > 
> > We will definitely always help for issues where we can, but IMHO that would
> > generally mean that the lower level clustering (ie Pacemaker & Friends) is
> > involved.
> > 
> > Re-assigning to DFG:DF - please let me know if you disagree :)
> 
> Chris: We do run memcached under pacemaker btw but I'll have a look and see
> what I can figure out.

Sorry, I was thinking of redis. Having a look anyways!

Comment 15 Dan Prince 2017-10-10 20:08:57 UTC
A couple of things I noticed this afternoon. On controller-0 (which is untouched by this controller replacement operation) we have the following 'pcmklatest' tagged containers:


[root@controller-0 ~]# docker ps | grep pcmklatest
bb0484adb579        192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest                   "/bin/bash /usr/local"   23 hours ago        Up 23 hours                                   haproxy-bundle-docker-0
a1fac4cc7caf        192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest                     "/bin/bash /usr/local"   23 hours ago        Up 23 hours                                   redis-bundle-docker-0
ebb60a21be59        192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest                   "/bin/bash /usr/local"   23 hours ago        Up 23 hours                                   galera-bundle-docker-0
3716733347d1        192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest                  "/bin/bash /usr/local"   23 hours ago        Up 23 hours (healthy)                         rabbitmq-bundle-docker-0


-------

On both controller-2 and controller-3, containers with these docker tags are missing entirely.

I'm wondering if the issue here is that a stack update was missing the docker-ha.yaml heat environment, or that the resource registry entries for the HA services in this environment were somehow overridden during the upgrade.

Comment 16 Michele Baldessari 2017-10-11 09:11:30 UTC
> On both controller-2 and controller-3 containers with these docker tags are
> missing entirely.
> 
> I'm wondering if the issue here is that a stack update was missing the
> docker-ha.yaml heat environment. Or that perhaps due to some other method
> this environment the resource registry entries for the HA services were
> overridden during the upgrade.

That could definitely be. There should not be a situation where those tags are missing (pacemaker would not be able to spawn those containers, as the resource definition of the bundles has the 'pcmklatest' tag in it). (On a side note, I'll be able to help more around this next week when I am back home.)

Comment 17 Damien Ciabrini 2017-10-11 15:51:26 UTC
We had a first round of investigation with Michele.

According to the controller replacement procedure [1], step 9.4.2 consists of redeploying with a special template, "remove-controller.yaml", which should stop early at ControllerDeployment_Step1 with an "error" status, in order to perform some additional steps to reconfigure pacemaker.

What we see from the heat-deployed ansible task that I'm attaching is that this initial puppet run correctly detects that a Pacemaker resource is in Error...

        ...
        "Error: /sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]", 
        "Error: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: change from notrun to 0 failed: /sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>
        ...

... but for some reason, either the puppet run doesn't finish in error, or the error is not reported back to the ansible task...

        "Notice: Applied catalog in 3710.85 seconds"
    ], 
    "failed": false, 
    "failed_when_result": false
}


... thus the controller deployment continues and fails in later steps, because pacemaker was never given a chance to be configured on that new node.

This is the ansible task that should error out (comments inlined):

    - name: Write the config_step hieradata
      copy: content="{{dict(step=step|int)|to_json}}" dest=/etc/puppet/hieradata/config_step.json force=true mode=0600
    - name: Run puppet host configuration for step {{step}}
      command: >-
        puppet apply
        --modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules
        --logdest syslog --logdest console --color=false
        /var/lib/tripleo-config/puppet_step_config.pp
      changed_when: false
      check_mode: no
      register: outputs
      failed_when: false
      no_log: true
      # The above never fails *but* it registers output and return code in the 'outputs' variable
    - debug: var=(outputs.stderr|default('')).split('\n')|union(outputs.stdout_lines|default([]))
      when: outputs is defined
      failed_when: outputs|failed
      # The line above is the one that should fail (but does not) 

We're still investigating if puppet itself is not returning error or if this ansible task masks the error.
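One way the error could be propagated — a sketch on my part, not necessarily the actual fix — is to run puppet with --detailed-exitcodes and fail on the registered return code, rather than on `outputs|failed` (which is always false here because the command task sets `failed_when: false`):

```yaml
    - name: Run puppet host configuration for step {{step}}
      command: >-
        puppet apply --detailed-exitcodes
        --modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules
        --logdest syslog --logdest console --color=false
        /var/lib/tripleo-config/puppet_step_config.pp
      changed_when: false
      check_mode: no
      register: outputs
      failed_when: false
      no_log: true
    - debug: var=(outputs.stderr|default('')).split('\n')|union(outputs.stdout_lines|default([]))
      when: outputs is defined
      # With --detailed-exitcodes, rc 0 and 2 mean success; 4 and 6 mean
      # at least one resource failed, so fail the play on those.
      failed_when: outputs.rc not in [0, 2]
```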

Comment 19 Damien Ciabrini 2017-11-08 10:32:42 UTC
I'm moving this bug to MODIFIED as it doesn't require any code change; it just needs bz#1499217 and bz#1501852 to be fixed.

Comment 21 Artem Hrechanychenko 2017-11-30 11:19:34 UTC
VERIFIED
openstack-tripleo-heat-templates-7.0.3-16.el7ost.noarch

Comment 24 errata-xmlrpc 2017-12-13 22:11:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462