Bug 1413508

Summary: Stack minor update hangs at last node
Product: Red Hat OpenStack
Reporter: Chen <cchen>
Component: openstack-tripleo-common
Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA
QA Contact: Omri Hochman <ohochman>
Severity: high
Docs Contact:
Priority: high
Version: 8.0 (Liberty)
CC: augol, cchen, dbecker, dmaley, knakai, lbezdick, mandreou, mburns, mcornea, morazi, peli, rhel-osp-director-maint, sathlang, slinaber
Target Milestone: async
Keywords: Reopened, Triaged
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-common-0.3.1-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-14 15:45:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1428845
Bug Blocks:
Attachments:
Description Flags
os-collect-config of pc-oscont03t
none
heat deployment-show for failed resources
none
heat resource-list, heat deployment-output-show and RedHatMultipath.sh
none
Error messages in the standard output
none
overcloud update stack after enabling -vv option showing "UPDATE paused until Hook pre-update is cleared"
none
-vv --debug enabled standard output but only two breakpoints reached before timing out
none

Description Chen 2017-01-16 09:01:23 UTC
Description of problem:

The stack minor update ran fine for N-1 nodes. But after the last node's breakpoint is removed, the update stays "IN_PROGRESS" until the keystone token expires after 4 hours.

IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
ERROR: Authentication failed: Authentication required

Version-Release number of selected component (if applicable):

openstack-tripleo-heat-templates-0.8.14-16.el7ost.noarch

How reproducible:

100% in customer's environment

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

All logs are in collab-shell.

Undercloud sosreport: /cases/01766275/sosreport-20170116-140800
Standard output in undercloud's screen when updating: /cases/01766275/x-text/putty-logs.txt
os-collect-config from all the nodes: /cases/01766275/0116_logs.zip

Comment 2 Chen 2017-01-16 10:41:13 UTC
Hi Lucas, 

Thank you very much for your reply.

I am asking for the output now, but the customer has already gone offline, so I may only be able to provide feedback tomorrow.

BTW, this time the update started on Jan 16th, so maybe we can skip the error on Jan 11th.

Yes, pc-osnova06t was a problematic node which made the whole update get stuck. We rebooted that node (restarting os-collect-config did not seem to work) and this time pc-osnova06t finished the update. But I didn't observe that NFS mount error.

Best Regards,
Chen

Comment 3 Chen 2017-01-16 10:50:26 UTC
Hi Lucas,

Please disregard my comments about ignoring the output of Jan 11th because you mentioned that the update could hang since then. Sorry for that.

Best Regards,
Chen

Comment 5 Chen 2017-01-17 02:14:08 UTC
Hi Lucas,

Here is the output from customer's environment. Thank you for your help in advance.

[root@pc-osadmin01t stack]# heat resource-list overcloud -n5 | grep -v COMPLETE
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                                                                                  | resource_status | updated_time        | stack_name                                                                                                                                        |
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ComputeNodesPostDeployment                    | df3cacac-0d06-4dda-98d2-f46c164ae70d          | OS::TripleO::ComputePostDeployment                                                                             | UPDATE_FAILED   | 2017-01-09T03:45:05 | overcloud                                                                                                                                         |
| 5                                             | 106cc260-d0af-43b1-beea-7314c12ee6da          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo-ComputePuppetDeployment-qdlqqhmip4cw                                                            |
| ComputePuppetDeployment                       | 6ce6ac54-cac1-487d-8a03-27afaa05738b          | OS::Heat::StructuredDeployments                                                                                | UPDATE_FAILED   | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo                                                                                                 |
| 0                                             | 9f3fe72b-f1cc-45f7-a6fd-eadc81ccb952          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                                                                               |
| 1                                             | 600b7d9f-ce96-43a2-af33-7bba5df6247a          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 3                                             | bda0ae4f-11fd-44a1-a246-2973c9d731ac          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 6                                             | 260e097b-326f-4651-9b39-5b8ac0dc9c84          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| ComputeAllNodesDeployment                     | 8c5548b8-bafd-4c01-ae4c-eb8a967c02ae          | OS::Heat::StructuredDeployments                                                                                | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud                                                                                                                                         |
| ControllerAllNodesDeployment                  | a6e41c06-0b25-4279-94aa-4334b54f6755          | OS::Heat::StructuredDeployments                                                                                | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud                                                                                                                                         |
| 0                                             | 0ebb69ae-ac7c-4d06-9995-26ad2f0d2338          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 1                                             | 7d1f8b6c-9974-46b9-b514-4acbee61babb          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                                                                               |
| 2                                             | 3e1aa2e1-bf93-4459-a0ca-c10b3ae6bd64          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                                                                               |
| 2                                             | 4310d569-07ef-4bef-8aca-cd7c550846b8          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 4                                             | 7c498c82-6bca-4871-9a4e-efc28a3f7c82          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 5                                             | bc590794-6255-449e-861a-cb5b708acb94          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 7                                             | 7a2a3a88-a7c3-47d0-9b8f-2bd0f9a915e9          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------------------------------------

Best Regards,
Chen

Comment 6 Lukas Bezdicka 2017-01-17 09:16:48 UTC
| 5                                             | 106cc260-d0af-43b1-beea-7314c12ee6da          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED   | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo-ComputePuppetDeployment-qdlqqhmip4cw                                                            |

So we have a puppet failure on compute 5, let's see.

Comment 7 Chen 2017-01-17 10:26:12 UTC
Hi Lucas,

If anything still needs to be collected please let me know. My understanding is that the issue is still being investigated and if it is wrong please let me know.

Thank you !

Best Regards,
Chen

Comment 10 Chen 2017-01-17 14:53:07 UTC
Hi Sofer, Marios,

Thank you for your help.

With your guide, I checked their templates again.

I found that, in the end, a script called NovaComputeNFS.sh is executed.

OS::TripleO::ComputeExtraConfigPre: OS-TripleO-ComputeExtraConfigPre.yaml -> NovaComputeNFS.yaml -> script/NovaComputeNFS.sh

This is what the script looks like:

<snip>

sudo mount -t nfs -o nosuid,noexec,nodev,rw,soft $NFS_mountpoint /var/lib/nova/instances
echo "Config automount when next boot ..."
echo "$NFS_mountpoint /var/lib/nova/instances nfs rw,soft,nodev,noexec,nosuid 0 0" | tee -a /etc/fstab

So in the end /etc/fstab will be updated by adding an entry and /var/lib/nova/instances will be mounted at boot time.

I collected pc-osnova06t's sosreport and I can see that entry in fstab and /var/lib/nova/instances has been mounted.

So I have a question here: 

OS::TripleO::ComputeExtraConfigPre should have been executed when the stack was deployed, so every compute node should contain that fstab entry. But why was only pc-osnova06t complaining that /var/lib/nova/instances would be mounted twice ?
I will double check with the customer whether ComputeExtraConfigPre was added after the deployment. And I will also have a look at the other compute nodes' fstab.

Additional logs:

pc-osnova06t's sosreport is in /cases/01766275/sosreport-20170111-155925/pc-osnova06t.localdomain

update command and templates which were used are in /cases/01766275/stack_home.tar.gz

Thanks again !

Best Regards,
Chen

Comment 11 Chen 2017-01-17 15:10:34 UTC
Hi All,

So it appears to me that we could remove the OS-TripleO-ComputeExtraConfigPre-enviroment.yaml from the update command to solve this problem, am I right ?

openstack overcloud update stack overcloud -i --templates \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/templates/FET-network-environment.yaml \
-e /home/stack/templates/node-extraconfig/FET-storage-enviroment.yaml \
-e /home/stack/templates/node-extraconfig/OS-TripleO-ControllerExtraConfigPre-environment.yaml \
-e /home/stack/templates/node-extraconfig/assent-environment.yaml \
-e /home/stack/templates/node-extraconfig/scheduler_hints_env.yaml \
-e /home/stack/templates/node-extraconfig/timezone.yaml \
-e /home/stack/templates/node-extraconfig/metadata.yaml 

I still can not explain why other nodes are not encountering this NFS mount problem...

Best Regards,
Chen

Comment 12 Sofer Athlan-Guyot 2017-01-18 15:31:44 UTC
Hi,

the NFS mount failure is not limited to one node; it can hit any node
during each run: pc-osnova04t, pc-osnova03t, pc-osnova05t, ... were all
affected, sometimes multiple nodes at once.  The details of the
failures are below[1].

I'm afraid that removing the template may not be enough, as it will
still be in the heat database.  You could try to make the script more
robust (if it's already mounted, don't do anything).  Or you could
disable it completely by changing the script and adding exit 0 at the
top of it.  Or you can use the OS::Heat::None association for
OS::TripleO::ComputeExtraConfigPre.

My choice here would be to devise a simple check in
templates/node-extraconfig/script/NovaComputeNFS.sh to skip the
mounting if it's already mounted, and you should be fine.

Something along the lines of

    if ! findmnt /var/lib/nova/instances | grep -i nfs ; then
       mount
    fi

Verify that it works; I don't have an NFS env ready to test this
command.
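
A slightly fuller sketch of that check, assuming the script runs with enough privileges to write /etc/fstab and that findmnt (util-linux) is available on the compute nodes. The helper names is_nfs_mounted, ensure_fstab_entry and mount_nfs_once are hypothetical and not part of the customer's script. It also guards the fstab append, since the "tee -a" in the original script would add a duplicate line on every run. Like the snippet above, this has not been tested against a real NFS environment.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative only.

# Succeeds if an NFS filesystem (nfs or nfs4) is mounted on directory $1.
is_nfs_mounted() {
    findmnt -n -o FSTYPE "$1" 2>/dev/null | grep -qi '^nfs'
}

# Appends an fstab line for mount point $3 only if none exists yet.
# Arguments: <fstab file> <NFS source> <mount point>
ensure_fstab_entry() {
    if ! awk -v target="$3" \
        '$1 !~ /^#/ && $2 == target { found = 1 } END { exit !found }' \
        "$1" 2>/dev/null; then
        echo "$2 $3 nfs rw,soft,nodev,noexec,nosuid 0 0" >> "$1"
    fi
}

# Mounts $1 on $2 unless NFS is already mounted there, then records the
# mount in the fstab file $3 (default /etc/fstab) exactly once.
mount_nfs_once() {
    if ! is_nfs_mounted "$2"; then
        mount -t nfs -o nosuid,noexec,nodev,rw,soft "$1" "$2"
    fi
    ensure_fstab_entry "${3:-/etc/fstab}" "$1" "$2"
}
```

In NovaComputeNFS.sh the existing mount and tee lines would then become a single call such as mount_nfs_once "$NFS_mountpoint" /var/lib/nova/instances. Running the script by hand twice on one compute node should exit 0 both times.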
        






[1] NFS Failures details:

143981:Dec 20 18:15:49 pc-osnova04t.localdomain os-collect-config[4547]: [2016-12-20 10:15:49,900] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
421110:Jan 11 17:52:55 pc-osnova04t.localdomain os-collect-config[6704]: [2017-01-11 17:52:55,621] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova03t_os-collect-config_01162017.txt
144069:Dec 20 18:44:11 pc-osnova03t.localdomain os-collect-config[4551]: [2016-12-20 10:44:11,969] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
420990:Jan 11 17:58:03 pc-osnova03t.localdomain os-collect-config[6225]: [2017-01-11 17:58:03,960] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova05t_os-collect-config_01162017.txt
143919:Dec 20 18:12:22 pc-osnova05t.localdomain os-collect-config[4609]: [2016-12-20 10:12:22,176] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
334404:Jan 11 17:50:10 pc-osnova05t.localdomain os-collect-config[5389]: [2017-01-11 17:50:10,231] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova08t_os-collect-config_01162017.txt
144126:Dec 20 18:40:38 pc-osnova08t.localdomain os-collect-config[2348]: [2016-12-20 10:40:38,853] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova01t_os-collect-config_01162017.txt
351908:Jan 11 17:58:09 pc-osnova01t.localdomain os-collect-config[7759]: [2017-01-11 17:58:09,452] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova06t_os-collect-config_01162017.txt
144065:Dec 21 01:39:01 pc-osnova06t.localdomain os-collect-config[4614]: [2016-12-20 17:39:01,632] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
333554:Jan 11 15:06:39 pc-osnova06t.localdomain os-collect-config[5031]: [2017-01-11 15:06:39,196] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
350160:Jan 16 11:06:40 pc-osnova06t.localdomain os-collect-config[3915]: [2017-01-16 11:06:40,530] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

Comment 13 Chen 2017-01-23 08:11:32 UTC
Created attachment 1243522 [details]
os-collect-config of pc-oscont03t

Comment 14 Chen 2017-01-23 08:15:19 UTC
Hi Sofer,

Today the customer tried the update again, but unexpectedly it failed at one of the controllers' breakpoints rather than at a compute's.

completed: [u'pc-osnova04t', u'pc-osnova02t']
on_breakpoint: [u'pc-oscont01t', u'pc-oscont02t', u'pc-osnova05t', u'pc-osnova03t', u'pc-osnova01t', u'pc-osnova08t', u'pc-osnova07t', u'pc-osnova06t', u'pc-oscont03t']
Breakpoint reached, continue? Regexp or Enter=proceed, no=cancel update, C-c=quit interactive mode: 
removing breakpoint on pc-oscont03t
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
FAILED

I checked the os-collect-config of pc-oscont03t but I didn't find any explicit errors. I've uploaded it for a double check. Also I asked them to collect heat resource-list -n5 overcloud and heat deployment-show for any failed deployments. If anything is still needed to troubleshoot please let me know. 

Best Regards,
Chen

Comment 15 Chen 2017-01-23 14:04:59 UTC
Hi Sofer,

I can see that some resources are FAILED this time.

{
  "status": "FAILED", 
  "server_id": "e025c72d-a726-4d6a-8836-37913cf5176f", 
  "config_id": "8728db86-ba19-4191-9972-7ce46f3846d1", 
  "output_values": {
    "deploy_stdout": "", 
    "deploy_stderr": "/var/lib/heat-config/heat-config-script/8728db86-ba19-4191-9972-7ce46f3846d1: line 13: syntax error: unexpected end of file\n", 
    "deploy_status_code": 2
  }, 
  "creation_time": "2016-12-08T04:53:46", 
  "updated_time": "2017-01-23T07:29:41", 
  "input_values": {}, 
  "action": "UPDATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 2", 
  "id": "4a6d1439-c175-4ad1-9249-2b8a06c6fa97"
}

All the FAILED resources have the same reason: "line 13: syntax error: unexpected end of file\n"

So it seems that pc-oscont03t is not the one to blame, right ? I think I need to have a look at line 13 of /var/lib/heat-config/heat-config-script/8728db86-ba19-4191-9972-7ce46f3846d1, am I right ? Is there anything else that could help to investigate the issue ?

Best Regards,
Chen

Comment 16 Chen 2017-01-23 14:09:52 UTC
Created attachment 1243607 [details]
heat deployment-show for failed resources

Comment 17 Chen 2017-01-24 08:08:40 UTC
Hi Sofer,

After checking line 13 of that script we found that it was missing a "fi". Sorry for that. The customer will do another test tomorrow.

Best Regards,
Chen

Comment 18 Sofer Athlan-Guyot 2017-01-24 14:41:53 UTC
Hi Chen,

OK, good to know. I hope it goes fine with the latest version of the script. You could just run the script on a compute node to validate it; that's basically what's going to happen when you run the update.

Regards,

Comment 19 Sofer Athlan-Guyot 2017-01-30 10:06:55 UTC
Hi Chen,

can I close this one, ie with the fixed script could you end this step of the upgrade ?

Comment 20 Chen 2017-02-03 01:19:04 UTC
Hi Sofer,

Sorry for the delay.

I just came back from Chinese New Year and so did my customer. They will do another test next Monday, so it would be better if we could keep this open until then.

Best Regards,
Chen

Comment 21 Chen 2017-02-08 14:40:01 UTC
Hi Sofer,

The customer tried to perform an update again yesterday (2/7, started at 15:36 and timed out at 19:36), but it never finished.

It seems that no resources are marked as FAILED and I don't see any ERRORs for any resource. I uploaded some output to collab-shell.usersys.redhat.com:/cases/01766275. I am asking for heat resource-list -n5 overcloud | grep -iv complete as well.

The strange thing is that every node seems to be stuck in an endless loop running the update. For example, on pc-osnova03t:

$ grep "Running /usr/libexec/os-refresh-config/configure.d/55-heat-config" pc-osnova03t_os-collect-config_02082017.txt | wc -l
8764

All the nodes' os-collect-config is also in collab-shell.usersys.redhat.com:/cases/01766275.

Please guide me with the next steps. Thank you !

Best Regards,
Chen

Comment 22 Sofer Athlan-Guyot 2017-02-09 11:02:34 UTC
Hi,

So from the os-collect-config we can see that the cluster is in a
strange shape:

       openstack-ceilometer-alarm-evaluator       (systemd:openstack-ceilometer-alarm-evaluator): Started pc-oscont03t (unmanaged)
         openstack-ceilometer-alarm-evaluator       (systemd:openstack-ceilometer-alarm-evaluator): Started pc-oscont02t (unmanaged)
     Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn] (unmanaged)
         openstack-heat-api-cfn     (systemd:openstack-heat-api-cfn):       Started pc-oscont01t (unmanaged)
         openstack-heat-api-cfn     (systemd:openstack-heat-api-cfn):       Started pc-oscont03t (unmanaged)
         openstack-heat-api-cfn     (systemd:openstack-heat-api-cfn):       Started pc-oscont02t (unmanaged)
     openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started pc-oscont01t (unmanaged)
     Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor] (unmanaged)
         openstack-nova-conductor   (systemd:openstack-nova-conductor):     Started pc-oscont01t (unmanaged)
         openstack-nova-conductor   (systemd:openstack-nova-conductor):     Started pc-oscont03t (unmanaged)
         openstack-nova-conductor   (systemd:openstack-nova-conductor):     Started pc-oscont02t (unmanaged)
     Clone Set: neutron-lbaas-agent-clone [neutron-lbaas-agent] (unmanaged)
         neutron-lbaas-agent        (systemd:neutron-lbaas-agent):  Started pc-oscont01t (unmanaged)
         neutron-lbaas-agent        (systemd:neutron-lbaas-agent):  Started pc-oscont03t (unmanaged)
         neutron-lbaas-agent        (systemd:neutron-lbaas-agent):  Started pc-oscont02t (unmanaged)
    
    PCSD Status:
      pc-oscont01t: Online
      pc-oscont02t: Online
      pc-oscont03t: Online


This shouldn't be "unmanaged".

the yum_upgrade.sh command is successful on every node but we can see
this on all nodes:

   "This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register."

Is that expected ?

So first, the cluster should be put back in managed mode, then check if
the subscription is correct.

Furthermore, looking at the "custom" scripts (RedHatMultipath for
instance), I would advise adding:

   set -eux

(or at least "set -ex")

to each and every one of them, to get more logs and clear-cut errors.
Test them afterwards, as this can trigger new errors, for instance if
undefined variables are used (the "u" option).

In RedHatMultipath this line is hard to read:

   SWIFT_DEVICE="$(sudo multipath -l | grep "$(sudo /lib/udev/scsi_id -u -g /dev/sdb)" | awk '{print $1}')"

It may work, but I would advise using

   SCSI_ID="$(sudo /lib/udev/scsi_id -u -g /dev/sdb)"
   SWIFT_DEVICE="$(sudo multipath -l | grep "${SCSI_ID}" | awk '{print $1}')"

And check:

   if [ -n "${SWIFT_DEVICE}" ]; then
       <the code>

   else
      echo "Could not find swift device"
      exit 1
   fi
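
Put together, the lookup and the check could look like the sketch below. The function name device_for_scsi_id is hypothetical; grep -F is used so that the ID is matched as a fixed string rather than as a regular expression. The helper reads the multipath listing on stdin, so it can be tried with sample text on a machine without multipath. It has not been run against the customer's real multipath output.

```shell
#!/bin/bash
set -eu

# Hypothetical helper: reads `multipath -l` style text on stdin and prints
# the first field of each line containing the identifier $1.
device_for_scsi_id() {
    # An empty identifier would match every line, so reject it.
    [ -n "$1" ] || return 1
    # "|| true" keeps a no-match (grep exit status 1) from being treated
    # as an error; the caller checks for empty output instead.
    { grep -F -- "$1" || true; } | awk '{print $1}'
}

# Intended use inside RedHatMultipath.sh (not executed here):
#   SCSI_ID="$(sudo /lib/udev/scsi_id -u -g /dev/sdb)"
#   SWIFT_DEVICE="$(sudo multipath -l | device_for_scsi_id "${SCSI_ID}")"
#   if [ -z "${SWIFT_DEVICE}" ]; then
#       echo "Could not find swift device" >&2
#       exit 1
#   fi
```

The explicit empty-output check matters because, without "set -o pipefail", the exit status of the pipeline is the one from awk, so "set -e" alone would not stop the script when grep finds nothing.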

Waiting for the output of the heat resources that are not COMPLETE.

Comment 24 Chen 2017-02-13 07:37:18 UTC
Hi Sofer,

For the maintenance mode issue, I've asked the customer to check.

They are using an internal repo server, so the errors regarding registration are expected.

They modified RedHatMultipath.sh following your advice and I am uploading it.

The customer has uploaded heat resource-list -n5 overcloud | grep -iv complete. I will upload it shortly as well.

Best Regards,
Chen

Comment 25 Chen 2017-02-13 07:41:41 UTC
Created attachment 1249774 [details]
heat resource-list, heat deployment-output-show and RedHatMultipath.sh

Comment 28 Chen 2017-02-20 09:34:44 UTC
Hi Sofer,

The customer updated their undercloud node but suffered a kernel panic after that. They booted the old kernel and the kernel panic went away. But now, when they run the update command, they get:

ERROR: no events found or resource UpdateDeployment.

I tried to search a lot but no luck. 

Best Regards,
Chen

Comment 29 Chen 2017-02-22 09:29:08 UTC
Hi Sofer,

The customer updated their undercloud node according to the following docs.

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/upgrading_red_hat_openstack_platform/sect-updating_the_environment#sect-Updating_Director_Packages

But after that they can not perform the update. I am attaching the error messages. I asked them to append a -vv and --debug to their update command. Tried to search a lot but didn't find any useful information to solve this. 

Best Regards,
Chen

Comment 30 Chen 2017-02-22 09:30:37 UTC
Created attachment 1256389 [details]
Error messages in the standard output

Comment 31 Chen 2017-02-22 09:34:10 UTC
One more thing I forgot to mention:

The customer found that iptables was not running after they rebooted the director, so they enabled iptables manually.

Best Regards,
Chen

Comment 35 Chen 2017-02-23 15:07:08 UTC
Hi Lukas,

Sorry in which file did you see that "Engine went down during resource UPDATE" message ?

Best Regards,
Chen

Comment 36 Lukas Bezdicka 2017-02-23 15:13:16 UTC
In the latest attachment: https://bugzilla.redhat.com/attachment.cgi?id=1256389

Comment 38 Thomas Hervé 2017-02-23 18:48:53 UTC
It looks like https://bugs.launchpad.net/tripleo-common/+bug/1537857 with the patch at https://review.openstack.org/#/c/272213/. I think it's in OSP 9, but not 8. If possible, the workaround can be applied locally.

Comment 39 Sofer Athlan-Guyot 2017-02-23 20:31:31 UTC
Hi Chen,

I've backported the patch to osp8 in the added review.

You can try it like this:

cat > fix.patch <<'EOF'
---

diff --git a/tripleo_common/stack_update.py b/tripleo_common/stack_update.py
index 36e5ef6..030cbd9 100644
--- a/tripleo_common/stack_update.py
+++ b/tripleo_common/stack_update.py
@@ -18,6 +18,8 @@
 import re
 import time
 
+import heatclient.exc
+
 LOG = logging.getLogger(__name__)
 
 
@@ -138,10 +140,13 @@
             stack_name, stack_id = next(
                 x['href'] for x in res.links if
                 x['rel'] == 'stack').rsplit('/', 2)[1:]
-            events = self.heatclient.events.list(
-                stack_id=stack_id,
-                resource_name=res.logical_resource_id,
-                sort_dir='asc')
+            try:
+                events = self.heatclient.events.list(
+                    stack_id=stack_id,
+                    resource_name=res.logical_resource_id,
+                    sort_dir='asc')
+            except heatclient.exc.HTTPNotFound:
+                events = []
             state = 'not_started'
             for ev in events:
                 # ignore events older than start of the last stack change
EOF


and then

sudo patch -d /usr/lib/python2.7/site-packages/ -p1 < fix.patch

on the undercloud.

Meanwhile I'm trying to get a package ready ASAP, in case you would rather wait a little.  Your call.

Regards,

Comment 41 Sofer Athlan-Guyot 2017-02-24 12:35:45 UTC
Hi Chen,

So the best option would be to install the new tripleo-common package 0.3.1-4.el7ost and restart the update.

Kudos to Mike and Thomas for the quick fix/package.

Comment 42 Chen 2017-02-24 14:45:14 UTC
Hi Sofer,

Huge thanks for your help so far.

The 0.3.1-4.el7ost is not available through the RH official repo currently and the newest version is 0.3.1-1.el7ost. I've asked the customer to apply your hotfix. After 0.3.1-4.el7ost is released I'll ask them to update to that package. Do you see any problems with this action plan ? BTW the customer hasn't updated us with the result after applying the hotfix.

Best Regards,
Chen

Comment 43 Sofer Athlan-Guyot 2017-02-24 16:52:43 UTC
Hi Chen,

(In reply to Chen from comment #42)
> Hi Sofer,
> 
> Huge thanks for your help so far.
> 
> The 0.3.1-4.el7ost is not available through the RH official repo currently
> and the newest version is 0.3.1-1.el7ost. I've asked the customer to apply
> your hotfix. After 0.3.1-4.el7ost is released I'll ask them to update to
> that package. Do you see any problems with this action plan ? 

Nothing that I'm aware of, no.

> BTW the
> customer hasn't updated us with the result after applying the hotfix.
> 
> Best Regards,
> Chen

Regards,

Comment 44 Chen 2017-03-01 11:51:30 UTC
Hi Sofer,

Thank you for your hotfix; the update command can run again.

The customer tried the update again today and found that the breakpoint was reached only once and the overcloud status was showing "IN_PROGRESS" all the time.

From the -vv output the customer noticed that the update fell into some kind of loop, and I found the following message.

"resource_status_reason": "UPDATE paused until Hook pre-update is cleared"

I am attaching the command output after enabling -vv option and would you please have a look ?

Best Regards,
Chen

Comment 45 Chen 2017-03-01 11:54:03 UTC
Created attachment 1258658 [details]
overcloud update stack after enabling -vv option showing "UPDATE paused until Hook pre-update is cleared"

Comment 47 Chen 2017-03-02 12:34:14 UTC
Hi Lukas, Sofer,

The update timed out after 4 hours, and only one breakpoint was reached. I hope we can get some help from the heat team to recover from this unhealthy state.

Best Regards,
Chen

Comment 55 Chen 2017-03-20 14:58:28 UTC
Hi Lukas,

It seems the updated heat packages didn't solve the hang issue.

Here is the output of heat resource-list overcloud -n5 | grep -v COMPLETE:


[stack@pc-osadmin01t ~]$ heat resource-list overcloud -n5 | grep -v COMPLETE
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------------------------------------------------------+--------------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                                                                                  | resource_status    | updated_time        | stack_name                                                                                                                                        |
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------------------------------------------------------+--------------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ComputeNodesPostDeployment                    | df3cacac-0d06-4dda-98d2-f46c164ae70d          | OS::TripleO::ComputePostDeployment                                                                             | UPDATE_FAILED      | 2017-01-09T03:45:05 | overcloud                                                                                                                                         |
| 5                                             | 106cc260-d0af-43b1-beea-7314c12ee6da          | OS::Heat::StructuredDeployment                                                                                 | UPDATE_FAILED      | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo-ComputePuppetDeployment-qdlqqhmip4cw                                                            |
| ComputePuppetDeployment                       | 6ce6ac54-cac1-487d-8a03-27afaa05738b          | OS::Heat::StructuredDeployments                                                                                | UPDATE_FAILED      | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo                                                                                                 |
| ComputeAllNodesDeployment                     | 8c5548b8-bafd-4c01-ae4c-eb8a967c02ae          | OS::Heat::StructuredDeployments                                                                                | UPDATE_FAILED      | 2017-02-13T07:40:14 | overcloud                                                                                                                                         |
| ControllerAllNodesDeployment                  | a6e41c06-0b25-4279-94aa-4334b54f6755          | OS::Heat::StructuredDeployments                                                                                | UPDATE_FAILED      | 2017-02-13T07:40:15 | overcloud                                                                                                                                         |
| 1                                             | a1f359a6-bde8-448a-9d78-c15285e59422          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:17 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                                                                               |
| 2                                             | d43b00fb-2be8-4174-a704-350627b7504a          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:17 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                                                                               |
| 0                                             | 10c439cd-20c7-4e8f-aa0f-8d4e80500947          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:18 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                                                                               |
| 1                                             | 00a9f138-8675-4d38-8fb2-6a1e928fceff          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:18 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 5                                             | b85a41e6-c662-45b0-8dcc-25e21ab1d87a          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:18 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 4                                             | b3a648d3-2d79-41f2-b60d-c058221b6f41          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:19 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 6                                             | a02a9448-feec-487a-ad27-b64c9a89de6d          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:19 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 7                                             | 087108b6-32b8-4130-b1a7-3987f4938978          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:19 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 0                                             | 827b515b-c657-4b77-85da-c397126d3e76          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:20 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 2                                             | 2734e08c-92a8-4ffb-b978-b3b4bdda82bf          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:20 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| 3                                             | 915cf3f0-1491-4753-8980-5fd1efc259bd          | OS::Heat::StructuredDeployment                                                                                 | CREATE_FAILED      | 2017-02-13T07:40:20 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                                                                                  |
| UpdateDeployment                              | b9d5dd88-8749-41ff-b178-1cf9d2706611          | OS::Heat::SoftwareDeployment                                                                                   | UPDATE_FAILED      | 2017-03-20T08:48:42 | overcloud-Compute-xxwiwibat6t7-0-ljujalc56xle                                                                                                     |
| UpdateDeployment                              | 2f4b33f0-8c34-4543-9a57-e991691ece7c          | OS::Heat::SoftwareDeployment                                                                                   | UPDATE_FAILED      | 2017-03-20T08:48:44 | overcloud-Controller-3tzdslpfpkkr-0-xahmcczouslg                                                                                                  |
| UpdateDeployment                              | a78c5724-5530-4b13-9143-434f34cbff1e          | OS::Heat::SoftwareDeployment                                                                                   | UPDATE_FAILED      | 2017-03-20T08:48:44 | overcloud-Compute-xxwiwibat6t7-6-iu2yypfkf5no                                                                                                     |
| UpdateDeployment                              | df1208d5-74a2-45e1-bc87-1448be83144e          | OS::Heat::SoftwareDeployment                                                                                   | UPDATE_FAILED      | 2017-03-20T08:48:47 | overcloud-Compute-xxwiwibat6t7-2-q33r5mndmljh                                                                                                     |
| UpdateDeployment                              | 67b0d474-19d0-4cb5-b5b2-780a0b5c794c          | OS::Heat::SoftwareDeployment                                                                                   | UPDATE_FAILED      | 2017-03-20T08:48:49 | overcloud-Controller-3tzdslpfpkkr-1-oxv4xatkah42                                                                                                  |
| UpdateDeployment                              | eb035152-a80c-4212-b77c-045b529d4a62          | OS::Heat::SoftwareDeployment                                                                                   | UPDATE_FAILED      | 2017-03-20T08:48:51 | overcloud-Compute-xxwiwibat6t7-4-t4bk53dxfvu3                                                                                                     |
| UpdateDeployment                              | fff1d305-61df-427f-84b5-29c97118f0ef          | OS::Heat::SoftwareDeployment                                                                                   | CREATE_FAILED      | 2017-03-20T08:48:53 | overcloud-Compute-xxwiwibat6t7-3-cjlits3mcwlq                                                                                                     |
| UpdateDeployment                              | 48ce8d66-66d8-4c98-b392-1a7e6c5186e3          | OS::Heat::SoftwareDeployment                                                                                   | UPDATE_FAILED      | 2017-03-20T08:48:57 | overcloud-Compute-xxwiwibat6t7-5-vztnt2dzscnr                                                                                                     |
| Compute                                       | 73d3d481-2692-4cfc-ad51-48b48df36c4a          | OS::Heat::ResourceGroup                                                                                        | UPDATE_FAILED      | 2017-03-20T09:26:45 | overcloud                                                                                                                                         |
| 5                                             | 3e108686-cb53-424e-9f59-e98cf565f780          | OS::TripleO::Compute                                                                                           | UPDATE_IN_PROGRESS | 2017-03-20T09:26:49 | overcloud-Compute-xxwiwibat6t7                                                                                                                    |
| Controller                                    | f05551d7-8d08-4a27-9362-895dde9cd3c6          | OS::Heat::ResourceGroup                                                                                        | UPDATE_FAILED      | 2017-03-20T09:26:49 | overcloud                                                                                                                                         |
| 6                                             | 7de5ac94-9ebd-44e5-952c-606be4d21c86          | OS::TripleO::Compute                                                                                           | UPDATE_IN_PROGRESS | 2017-03-20T09:26:51 | overcloud-Compute-xxwiwibat6t7                                                                                                                    |
| 0                                             | 43c63e46-e0db-4fbe-a073-43fdc7463581          | OS::TripleO::Controller                                                                                        | UPDATE_FAILED      | 2017-03-20T09:26:52 | overcloud-Controller-3tzdslpfpkkr                                                                                                                 |
| 1                                             | 00f8b2e0-9544-4b92-8759-c3edc357bcb8          | OS::TripleO::Controller                                                                                        | UPDATE_FAILED      | 2017-03-20T09:26:54 | overcloud-Controller-3tzdslpfpkkr                                                                                                                 |
| 4                                             | 2bc43591-5a87-4f18-bfcb-16c535f6107c          | OS::TripleO::Compute                                                                                           | UPDATE_IN_PROGRESS | 2017-03-20T09:26:54 | overcloud-Compute-xxwiwibat6t7                                                                                                                    |
| 0                                             | 1f4c2aea-927b-4f0a-a6cc-dd4d7fedc184          | OS::TripleO::Compute                                                                                           | UPDATE_IN_PROGRESS | 2017-03-20T09:26:56 | overcloud-Compute-xxwiwibat6t7                                                                                                                    |
| 2                                             | e71ec635-4384-4aef-9856-bc9e3f397534          | OS::TripleO::Controller                                                                                        | UPDATE_FAILED      | 2017-03-20T09:26:56 | overcloud-Controller-3tzdslpfpkkr                                                                                                                 |
| 2                                             | 563365b9-0f7e-410e-847c-b7d73e19fbd6          | OS::TripleO::Compute                                                                                           | UPDATE_IN_PROGRESS | 2017-03-20T09:26:58 | overcloud-Compute-xxwiwibat6t7                                                                                                                    |
| 3                                             | a2c84c85-78ce-4fea-8ada-92af7927b2a8          | OS::TripleO::Compute                                                                                           | UPDATE_IN_PROGRESS | 2017-03-20T09:26:58 | overcloud-Compute-xxwiwibat6t7                                                                                                                    |
| UpdateDeployment                              | 6359b193-392c-4976-9984-cd057571063f          | OS::Heat::SoftwareDeployment                                                                                   | CREATE_FAILED      | 2017-03-20T09:28:30 | overcloud-Controller-3tzdslpfpkkr-2-mihnp6ps3q7b                                                                                                  |
+-----------------------------------------------+-----------------------------------------------+----------------------------------------------------------------------------------------------------------------+--------------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+

I can still see "UPDATE paused until Hook pre-update is cleared", and I will upload the newest -vv --debug output. As bz 1414779 mentioned, the pre-update hook can be cleared in some way; would you please double-check?
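
A table this wide is hard to scan by eye, so a small Python sketch like the one below could tally the rows per resource_status. It assumes the pipe-delimited layout shown above; the function name status_counts is hypothetical, and the sample rows are abbreviated copies of three rows from the table.

```python
from collections import Counter


def status_counts(table_text):
    """Count rows per resource_status in `heat resource-list` table output."""
    counts = Counter()
    status_idx = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +-----+ borders and shell prompt lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if status_idx is None:
            # The first pipe-delimited row is the header.
            status_idx = cells.index("resource_status")
            continue
        counts[cells[status_idx]] += 1
    return counts


# Abbreviated rows taken from the table above.
sample = """
+---------------+----------------------+-------------------------+--------------------+---------------------+--------------------------------+
| resource_name | physical_resource_id | resource_type           | resource_status    | updated_time        | stack_name                     |
+---------------+----------------------+-------------------------+--------------------+---------------------+--------------------------------+
| Compute       | 73d3d481             | OS::Heat::ResourceGroup | UPDATE_FAILED      | 2017-03-20T09:26:45 | overcloud                      |
| 5             | 3e108686             | OS::TripleO::Compute    | UPDATE_IN_PROGRESS | 2017-03-20T09:26:49 | overcloud-Compute-xxwiwibat6t7 |
| Controller    | f05551d7             | OS::Heat::ResourceGroup | UPDATE_FAILED      | 2017-03-20T09:26:49 | overcloud                      |
+---------------+----------------------+-------------------------+--------------------+---------------------+--------------------------------+
"""

for status, count in sorted(status_counts(sample).items()):
    print(status, count)
```

Separating the UPDATE_IN_PROGRESS rows from the *_FAILED ones shows which nested stacks are still waiting rather than already failed.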

Best Regards,
Chen

Comment 56 Chen 2017-03-20 15:00:09 UTC
Created attachment 1264775 [details]
-vv --debug enabled standard output but only two breakpoints reached before timing out

Comment 57 Lukas Bezdicka 2017-03-20 16:34:34 UTC
Can I get the output of heat stack-show overcloud-Compute-xxwiwibat6t7?

Comment 58 Lukas Bezdicka 2017-03-20 17:04:50 UTC
I'd try restarting heat-engine to ensure that all IN_PROGRESS resources switch to FAILED, and after that I'd try re-running the update.
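
Before re-running the update, it may be worth confirming that nothing is still IN_PROGRESS after the restart. The Python sketch below polls a status source until no status ends in IN_PROGRESS. The fetch_statuses callable is a hypothetical stand-in for however the statuses are collected (for example, by parsing heat resource-list output); it is not part of any heat API.

```python
import time


def wait_until_settled(fetch_statuses, attempts=10, delay=0.0):
    """Poll until no status ends with IN_PROGRESS.

    fetch_statuses is a caller-supplied callable (hypothetical) that returns
    a list of resource status strings. Returns True once settled, or False
    if the attempts run out first.
    """
    for _ in range(attempts):
        statuses = fetch_statuses()
        if not any(status.endswith("IN_PROGRESS") for status in statuses):
            return True
        time.sleep(delay)
    return False


# Fake status source for illustration: settles on the second poll.
snapshots = iter([
    ["UPDATE_IN_PROGRESS", "UPDATE_FAILED"],
    ["UPDATE_FAILED", "UPDATE_FAILED"],
])

print(wait_until_settled(lambda: next(snapshots)))
```

In real use the delay would be several seconds, and a False result would suggest the restart did not move the resources out of IN_PROGRESS.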

Comment 60 Chen 2017-03-24 09:31:05 UTC
Hi Dave,

Sorry, should I give this openstack-tripleo-common package to the customer? Does it fix the update hang issue?

Best Regards,
Chen

Comment 61 Dave Maley 2017-03-28 15:26:23 UTC
(In reply to Chen from comment #60)

> Sorry should I give this openstack-tripleo-common package to the customer ?
> Does it fix the update hang issue ? 

Yes, please go ahead and provide it to your customer. It is believed that it will resolve the issue reported here; however, your customer should first validate it in a test environment before deploying it to production.

Comment 62 Sofer Athlan-Guyot 2017-05-30 16:34:13 UTC
Hi Chen,

Closing this one as the hotfix was delivered.  Don't hesitate to re-open it if needed.

Comment 63 Mike Burns 2017-06-14 13:23:01 UTC
A hotfix should not close a bug. It only closes when shipped to CDN. Hotfixes go directly to specific customers.

Comment 64 errata-xmlrpc 2017-06-14 15:45:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1465