Bug 1413508
| Field | Value |
|---|---|
| Summary | Stack minor update hangs at last node |
| Product | Red Hat OpenStack |
| Component | openstack-tripleo-common |
| Version | 8.0 (Liberty) |
| Target Release | 8.0 (Liberty) |
| Target Milestone | async |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Keywords | Reopened, Triaged |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Chen <cchen> |
| Assignee | Sofer Athlan-Guyot <sathlang> |
| QA Contact | Omri Hochman <ohochman> |
| CC | augol, cchen, dbecker, dmaley, knakai, lbezdick, mandreou, mburns, mcornea, morazi, peli, rhel-osp-director-maint, sathlang, slinaber |
| Fixed In Version | openstack-tripleo-common-0.3.1-4.el7ost |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2017-06-14 15:45:14 UTC |
| Bug Depends On | 1428845 |
Description (Chen, 2017-01-16 09:01:23 UTC)
Hi Lucas,

Thank you very much for your reply. I am asking for the output now, but the customer has already gone offline, so I may only be able to provide feedback tomorrow. BTW, this time the update started on Jan 16th, so maybe we can skip the error from Jan 11th.

Yes, pc-osnova06t was a problematic node which made the whole update get stuck. We rebooted that node (restarting os-collect-config did not seem to work) and this time pc-osnova06t finished the update. But I didn't observe that NFS mount error.

Best Regards,
Chen

Hi Lucas,

Please disregard my comment about ignoring the output of Jan 11th, since you mentioned that the update could have been hanging since then. Sorry for that.

Best Regards,
Chen

Hi Lucas,

Here is the output from the customer's environment. Thank you for your help in advance.

```
[root@pc-osadmin01t stack]# heat resource-list overcloud -n5 | grep -v COMPLETE
| resource_name                | physical_resource_id                 | resource_type                      | resource_status | updated_time        | stack_name                                                                             |
| ComputeNodesPostDeployment   | df3cacac-0d06-4dda-98d2-f46c164ae70d | OS::TripleO::ComputePostDeployment | UPDATE_FAILED   | 2017-01-09T03:45:05 | overcloud                                                                              |
| 5                            | 106cc260-d0af-43b1-beea-7314c12ee6da | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo-ComputePuppetDeployment-qdlqqhmip4cw |
| ComputePuppetDeployment      | 6ce6ac54-cac1-487d-8a03-27afaa05738b | OS::Heat::StructuredDeployments    | UPDATE_FAILED   | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo                                      |
| 0                            | 9f3fe72b-f1cc-45f7-a6fd-eadc81ccb952 | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                    |
| 1                            | 600b7d9f-ce96-43a2-af33-7bba5df6247a | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 3                            | bda0ae4f-11fd-44a1-a246-2973c9d731ac | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 6                            | 260e097b-326f-4651-9b39-5b8ac0dc9c84 | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| ComputeAllNodesDeployment    | 8c5548b8-bafd-4c01-ae4c-eb8a967c02ae | OS::Heat::StructuredDeployments    | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud                                                                              |
| ControllerAllNodesDeployment | a6e41c06-0b25-4279-94aa-4334b54f6755 | OS::Heat::StructuredDeployments    | UPDATE_FAILED   | 2017-01-16T03:43:05 | overcloud                                                                              |
| 0                            | 0ebb69ae-ac7c-4d06-9995-26ad2f0d2338 | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 1                            | 7d1f8b6c-9974-46b9-b514-4acbee61babb | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                    |
| 2                            | 3e1aa2e1-bf93-4459-a0ca-c10b3ae6bd64 | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                    |
| 2                            | 4310d569-07ef-4bef-8aca-cd7c550846b8 | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 4                            | 7c498c82-6bca-4871-9a4e-efc28a3f7c82 | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 5                            | bc590794-6255-449e-861a-cb5b708acb94 | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 7                            | 7a2a3a88-a7c3-47d0-9b8f-2bd0f9a915e9 | OS::Heat::StructuredDeployment     | UPDATE_FAILED   | 2017-01-16T03:43:06 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
```

Best Regards,
Chen

> | 5 | 106cc260-d0af-43b1-beea-7314c12ee6da | OS::Heat::StructuredDeployment | UPDATE_FAILED | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo-ComputePuppetDeployment-qdlqqhmip4cw |

So we have a puppet failure on compute 5; let's see.

Hi Lucas,

If anything still needs to be collected, please let me know. My understanding is that the issue is still being investigated; if that is wrong, please let me know. Thank you!

Best Regards,
Chen

Hi Sofer, Marios,

Thank you for your help. Following your guidance, I checked their templates again and found that in the end a script called NovaComputeNFS.sh is executed:

OS::TripleO::ComputeExtraConfigPre: OS-TripleO-ComputeExtraConfigPre.yaml -> NovaComputeNFS.yaml -> script/NovaComputeNFS.sh

This is what the script looks like:

```
<snip>
sudo mount -t nfs -o nosuid,noexec,nodev,rw,soft $NFS_mountpoint /var/lib/nova/instances
echo "Config automount when next boot ..."
echo "$NFS_mountpoint /var/lib/nova/instances nfs rw,soft,nodev,noexec,nosuid 0 0" | tee -a /etc/fstab
```

So in the end an entry is added to /etc/fstab and /var/lib/nova/instances is mounted at boot time. I collected pc-osnova06t's sosreport; I can see that entry in fstab, and /var/lib/nova/instances has been mounted.

So I have a question here: OS::TripleO::ComputeExtraConfigPre should have been executed when the stack was deployed, so every compute node should contain that fstab entry. Why was only pc-osnova06t complaining that /var/lib/nova/instances would be mounted twice? I will double-check with the customer whether ComputeExtraConfigPre was added after the deployment, and I will also look at the other compute nodes' fstab.

Additional logs:
pc-osnova06t's sosreport is in /cases/01766275/sosreport-20170111-155925/pc-osnova06t.localdomain
The update command and templates which were used are in /cases/01766275/stack_home.tar.gz

Thanks again!

Best Regards,
Chen

Hi All,

So it appears to me that we could remove OS-TripleO-ComputeExtraConfigPre-enviroment.yaml from the update command to solve this problem, am I right? The update command is:
```
openstack overcloud update stack overcloud -i --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/templates/FET-network-environment.yaml \
  -e /home/stack/templates/node-extraconfig/FET-storage-enviroment.yaml \
  -e /home/stack/templates/node-extraconfig/OS-TripleO-ControllerExtraConfigPre-environment.yaml \
  -e /home/stack/templates/node-extraconfig/assent-environment.yaml \
  -e /home/stack/templates/node-extraconfig/scheduler_hints_env.yaml \
  -e /home/stack/templates/node-extraconfig/timezone.yaml \
  -e /home/stack/templates/node-extraconfig/metadata.yaml
```

I still cannot explain why the other nodes are not encountering this NFS mount problem...

Best Regards,
Chen

Hi,

The NFS mounting failure is happening not on one node but on any node during each run: pc-osnova04t, pc-osnova03t, pc-osnova05t, ... were all affected, sometimes multiple nodes at once. The details of the failures are below [1].

I'm afraid that removing the template may not be enough, as it will still be in the heat database. You could try to make the script more robust (if it's already mounted, don't do anything). Or you could disable it completely by changing the script and adding exit 0 at the top of it. Or you can use the OS::Heat::None association for OS::TripleO::ComputeExtraConfigPre.

My choice here would be to devise a simple check in templates/node-extraconfig/script/NovaComputeNFS.sh to skip the mounting if it's already mounted, and you should be fine. Something along the lines of:

```
if ! findmnt /var/lib/nova/instances | grep -i nfs ; then
    mount
fi
```

Verify that it works; I don't have an NFS environment ready to test this command.

[1] NFS failure details:

```
143981:Dec 20 18:15:49 pc-osnova04t.localdomain os-collect-config[4547]: [2016-12-20 10:15:49,900] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
421110:Jan 11 17:52:55 pc-osnova04t.localdomain os-collect-config[6704]: [2017-01-11 17:52:55,621] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova03t_os-collect-config_01162017.txt
144069:Dec 20 18:44:11 pc-osnova03t.localdomain os-collect-config[4551]: [2016-12-20 10:44:11,969] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
420990:Jan 11 17:58:03 pc-osnova03t.localdomain os-collect-config[6225]: [2017-01-11 17:58:03,960] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova05t_os-collect-config_01162017.txt
143919:Dec 20 18:12:22 pc-osnova05t.localdomain os-collect-config[4609]: [2016-12-20 10:12:22,176] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
334404:Jan 11 17:50:10 pc-osnova05t.localdomain os-collect-config[5389]: [2017-01-11 17:50:10,231] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova08t_os-collect-config_01162017.txt
144126:Dec 20 18:40:38 pc-osnova08t.localdomain os-collect-config[2348]: [2016-12-20 10:40:38,853] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova01t_os-collect-config_01162017.txt
351908:Jan 11 17:58:09 pc-osnova01t.localdomain os-collect-config[7759]: [2017-01-11 17:58:09,452] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}

01162017logs/pc-osnova06t_os-collect-config_01162017.txt
144065:Dec 21 01:39:01 pc-osnova06t.localdomain os-collect-config[4614]: [2016-12-20 17:39:01,632] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
333554:Jan 11 15:06:39 pc-osnova06t.localdomain os-collect-config[5031]: [2017-01-11 15:06:39,196] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
350160:Jan 16 11:06:40 pc-osnova06t.localdomain os-collect-config[3915]: [2017-01-16 11:06:40,530] (heat-config) [INFO] {"deploy_stdout": "Mount NFS on this node ...\n", "deploy_stderr": "mount.nfs: /var/lib/nova/instances is busy or already mounted\n", "deploy_status_code": 32}
```

Created attachment 1243522 [details]
os-collect-config of pc-oscont03t
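Expanding the findmnt guard Sofer suggested above into a complete, idempotent version of the mount step. This is a sketch only: the value of `NFS_mountpoint` is a placeholder, and the mount options are copied from the original NovaComputeNFS.sh.

```bash
#!/bin/bash
set -eu

# Placeholder value; the real export path comes from the customer's template.
NFS_mountpoint="nfs-server.example.com:/export/nova"

# Mount only if /var/lib/nova/instances is not already an NFS mount, so
# re-running this deployment step cannot fail with "already mounted".
if ! findmnt -t nfs,nfs4 /var/lib/nova/instances >/dev/null 2>&1; then
    sudo mount -t nfs -o nosuid,noexec,nodev,rw,soft \
        "$NFS_mountpoint" /var/lib/nova/instances
fi

# Likewise, append the fstab entry only once.
if ! grep -q "/var/lib/nova/instances" /etc/fstab; then
    echo "$NFS_mountpoint /var/lib/nova/instances nfs rw,soft,nodev,noexec,nosuid 0 0" \
        | sudo tee -a /etc/fstab
fi
```

If the hook should instead be disabled entirely, the OS::Heat::None mapping Sofer mentions removes the whole ComputeExtraConfigPre step without editing the script.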
Hi Sofer,

Today the customer tried the update again, but unexpectedly it failed at one of the controllers' breakpoints rather than a compute's.

```
completed: [u'pc-osnova04t', u'pc-osnova02t']
on_breakpoint: [u'pc-oscont01t', u'pc-oscont02t', u'pc-osnova05t', u'pc-osnova03t', u'pc-osnova01t', u'pc-osnova08t', u'pc-osnova07t', u'pc-osnova06t', u'pc-oscont03t']
Breakpoint reached, continue? Regexp or Enter=proceed, no=cancel update, C-c=quit interactive mode:
removing breakpoint on pc-oscont03t
IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS
FAILED
```

I checked the os-collect-config of pc-oscont03t but I didn't find any explicit errors; I've uploaded it for a double check. I also asked them to collect `heat resource-list -n5 overcloud` and `heat deployment-show` for any failed deployments. If anything else is needed to troubleshoot, please let me know.

Best Regards,
Chen

Hi Sofer,

I can see that some resources are FAILED this time:

```
{
  "status": "FAILED",
  "server_id": "e025c72d-a726-4d6a-8836-37913cf5176f",
  "config_id": "8728db86-ba19-4191-9972-7ce46f3846d1",
  "output_values": {
    "deploy_stdout": "",
    "deploy_stderr": "/var/lib/heat-config/heat-config-script/8728db86-ba19-4191-9972-7ce46f3846d1: line 13: syntax error: unexpected end of file\n",
    "deploy_status_code": 2
  },
  "creation_time": "2016-12-08T04:53:46",
  "updated_time": "2017-01-23T07:29:41",
  "input_values": {},
  "action": "UPDATE",
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 2",
  "id": "4a6d1439-c175-4ad1-9249-2b8a06c6fa97"
}
```

All the FAILED resources report the same reason: "line 13: syntax error: unexpected end of file\n". So it seems that pc-oscont03t is not the one to blame, right? I think I need to look at line 13 of /var/lib/heat-config/heat-config-script/8728db86-ba19-4191-9972-7ce46f3846d1, am I right? Is there anything else that could help investigate the issue?

Best Regards,
Chen

Created attachment 1243607 [details]
heat deployment-show for failed resources
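As an aside, a truncated if/fi block produces exactly this "unexpected end of file" failure, and bash can detect it without running anything. A small, self-contained demonstration (the file name and contents are hypothetical):

```bash
# Create a script that is missing its closing "fi".
cat > /tmp/broken.sh <<'EOF'
#!/bin/bash
if [ -d /var/lib/nova/instances ]; then
    echo "instances dir present"
# oops: no "fi" before end of file
EOF

# "bash -n" parses without executing and reports the same kind of
# "syntax error: unexpected end of file" seen in the deployment output.
bash -n /tmp/broken.sh || echo "syntax check failed as expected"
```

Running such a syntax-only check on every custom script before a stack update would surface this class of failure much earlier.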
Hi Sofer,

After we checked line 13 of that script, we did indeed find that the script was missing a "fi". Sorry for that; the customer will do another test tomorrow.

Best Regards,
Chen

Hi Chen,

OK, good to know; I hope it goes fine with the latest version of the script. You could simply run the script on a compute node to validate it; that's basically what's going to happen when you run the update.

Regards,

Hi Chen,

Can I close this one, i.e. with the fixed script, could you finish this step of the upgrade?

Hi Sofer,

Sorry for the delay. I just came back from Chinese New Year, as did my customer. They will do another test next Monday, so it would be better if we could keep this open until then.

Best Regards,
Chen

Hi Sofer,

The customer tried to perform an update again yesterday (2/7, started at 15:36 and timed out at 19:36), but it never finished and timed out again. It seems that no resources are marked as FAILED and I don't see any ERRORs for any resource. I uploaded some output to collab-shell.usersys.redhat.com:/cases/01766275, and I am asking for `heat resource-list -n5 overcloud | grep -iv complete` as well.

The strange thing is that every node seems to fall into an endless loop running the update. For example, on pc-osnova03t:

```
$ grep "Running /usr/libexec/os-refresh-config/configure.d/55-heat-config" pc-osnova03t_os-collect-config_02082017.txt | wc -l
8764
```

All the nodes' os-collect-config output is also in collab-shell.usersys.redhat.com:/cases/01766275. Please guide me on the next steps. Thank you!

Best Regards,
Chen

Hi,

From the os-collect-config output we can see that the cluster is in a strange shape:

```
openstack-ceilometer-alarm-evaluator (systemd:openstack-ceilometer-alarm-evaluator): Started pc-oscont03t (unmanaged)
openstack-ceilometer-alarm-evaluator (systemd:openstack-ceilometer-alarm-evaluator): Started pc-oscont02t (unmanaged)
Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn] (unmanaged)
    openstack-heat-api-cfn (systemd:openstack-heat-api-cfn): Started pc-oscont01t (unmanaged)
    openstack-heat-api-cfn (systemd:openstack-heat-api-cfn): Started pc-oscont03t (unmanaged)
    openstack-heat-api-cfn (systemd:openstack-heat-api-cfn): Started pc-oscont02t (unmanaged)
openstack-cinder-volume (systemd:openstack-cinder-volume): Started pc-oscont01t (unmanaged)
Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor] (unmanaged)
    openstack-nova-conductor (systemd:openstack-nova-conductor): Started pc-oscont01t (unmanaged)
    openstack-nova-conductor (systemd:openstack-nova-conductor): Started pc-oscont03t (unmanaged)
    openstack-nova-conductor (systemd:openstack-nova-conductor): Started pc-oscont02t (unmanaged)
Clone Set: neutron-lbaas-agent-clone [neutron-lbaas-agent] (unmanaged)
    neutron-lbaas-agent (systemd:neutron-lbaas-agent): Started pc-oscont01t (unmanaged)
    neutron-lbaas-agent (systemd:neutron-lbaas-agent): Started pc-oscont03t (unmanaged)
    neutron-lbaas-agent (systemd:neutron-lbaas-agent): Started pc-oscont02t (unmanaged)

PCSD Status:
  pc-oscont01t: Online
  pc-oscont02t: Online
  pc-oscont03t: Online
```

These resources shouldn't be "unmanaged". The yum_upgrade.sh command is successful on every node, but we can see this on all nodes:

"This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register."

Is that expected? So first, the cluster should be put back in managed mode, then check that the subscription is correct.
Furthermore, looking at the "custom" scripts (RedHatMultipath, for instance), I would advise adding:

```
set -eux
```

(or at least `set -ex`) to each and every one of them, to get more logs and clear-cut errors. Test them first, as adding this can trigger new errors, for instance if undefined variables are used (the "u" option).

In RedHatMultipath this line is hard to read:

```
SWIFT_DEVICE="$(sudo multipath -l | grep "$(sudo /lib/udev/scsi_id -u -g /dev/sdb)" | awk '{print $1}')"
```

It may work, but I would advise using:

```
SCSI_ID="$(sudo /lib/udev/scsi_id -u -g /dev/sdb)"
SWIFT_DEVICE="$(sudo multipath -l | grep "${SCSI_ID}" | awk '{print $1}')"
```

And check:

```
if [ -n "${SWIFT_DEVICE}" ]; then
    <the code>
else
    echo "Could not find swift device"
    exit 1
fi
```

Waiting for the output of the heat resources that are not COMPLETE.

Hi Sofer,

Regarding the maintenance mode issue, I've asked the customer to check. They are using an internal repo server, so the errors regarding registration are expected. They modified RedHatMultipath.sh following your advice and I am uploading it. The customer has also uploaded `heat resource-list -n5 overcloud | grep -iv complete`; I will upload it shortly as well.

Best Regards,
Chen

Created attachment 1249774 [details]
heat resource-list, heat deployment-output-show and RedHatMultipath.sh
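Two sketches related to the advice above, both hedged. First, bringing an "unmanaged" cluster back under Pacemaker control; the right steps depend on why the resources became unmanaged (for example, maintenance mode left enabled), so verify the actual cluster state first:

```bash
# If the whole cluster was left in maintenance mode, clearing the
# property puts resources back under Pacemaker's management:
sudo pcs property set maintenance-mode=false

# If only individual resources were unmanaged, re-manage them by name,
# e.g. one of the clone sets from the status output above:
sudo pcs resource manage openstack-nova-conductor-clone

# Confirm nothing is reported as "(unmanaged)" any more:
sudo pcs status
```

Second, Sofer's RedHatMultipath suggestions assembled into one runnable sketch; the device-specific body is intentionally elided, and /dev/sdb is the device from the case:

```bash
#!/bin/bash
set -eux   # fail fast, error on unset variables, trace every command

# Resolve the multipath device backing /dev/sdb in two steps, so each
# intermediate value is visible in the trace output when something fails.
SCSI_ID="$(sudo /lib/udev/scsi_id -u -g /dev/sdb)"
SWIFT_DEVICE="$(sudo multipath -l | grep "${SCSI_ID}" | awk '{print $1}')"

if [ -n "${SWIFT_DEVICE}" ]; then
    echo "swift device: ${SWIFT_DEVICE}"
    # ... device-specific setup from the original script goes here ...
else
    echo "Could not find swift device" >&2
    exit 1
fi
```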
Hi Sofer,

The customer updated their undercloud node but suffered a kernel panic after that. They booted the old kernel and the kernel panic went away. But now, when they run the update command, they get:

ERROR: no events found for resource UpdateDeployment

I searched around a lot but had no luck.

Best Regards,
Chen

Hi Sofer,

The customer updated their undercloud node according to the following docs:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/upgrading_red_hat_openstack_platform/sect-updating_the_environment#sect-Updating_Director_Packages

But after that they cannot perform the update. I am attaching the error messages. I asked them to append -vv and --debug to their update command. I tried searching a lot but didn't find any useful information to solve this.

Best Regards,
Chen

Created attachment 1256389 [details]
Error messages in the standard output
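When the client reports "no events found", it can help to confirm what heat itself knows before patching anything. Two read-only checks, assuming the same Liberty-era heat CLI used elsewhere in this thread:

```bash
# List recent events for the overcloud stack; an empty or truncated
# event history is consistent with the "no events found" error path.
heat event-list overcloud

# Show which resources are not COMPLETE, recursing into nested stacks,
# as done earlier in this case:
heat resource-list -n5 overcloud | grep -iv complete
```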
One more thing I forgot to mention: the customer found that iptables was not running after they rebooted the director, so they enabled iptables manually.

Best Regards,
Chen

Hi Lukas,

Sorry, in which file did you see that "Engine went down during resource UPDATE" message?

Best Regards,
Chen

In the latest attachment: https://bugzilla.redhat.com/attachment.cgi?id=1256389

It looks like https://bugs.launchpad.net/tripleo-common/+bug/1537857 with the patch at https://review.openstack.org/#/c/272213/. I think it's in OSP 9, but not 8. If possible, the workaround can be applied locally.

Hi Chen,

I've backported the patch to OSP 8 in the added review. You can try it like this:

```
cat > fix.patch <<'EOF'
---
diff --git a/tripleo_common/stack_update.py b/tripleo_common/stack_update.py
index 36e5ef6..030cbd9 100644
--- a/tripleo_common/stack_update.py
+++ b/tripleo_common/stack_update.py
@@ -18,6 +18,8 @@
 import re
 import time
 
+import heatclient.exc
+
 LOG = logging.getLogger(__name__)
@@ -138,10 +140,13 @@
         stack_name, stack_id = next(
             x['href'] for x in res.links
             if x['rel'] == 'stack').rsplit('/', 2)[1:]
-        events = self.heatclient.events.list(
-            stack_id=stack_id,
-            resource_name=res.logical_resource_id,
-            sort_dir='asc')
+        try:
+            events = self.heatclient.events.list(
+                stack_id=stack_id,
+                resource_name=res.logical_resource_id,
+                sort_dir='asc')
+        except heatclient.exc.HTTPNotFound:
+            events = []
         state = 'not_started'
         for ev in events:
             # ignore events older than start of the last stack change
EOF
```

and then:

```
sudo patch -d /usr/lib/python2.7/site-packages/ -p1 < fix.patch
```

on the undercloud. Meanwhile I will try to get a package ready ASAP, so you may prefer to wait a little for it. Your call.

Regards,

Hi Chen,

So the best option would be to install the new tripleo-common package, 0.3.1-4.el7ost, and restart the update. Kudos to Mike and Thomas for the quick fix/package.

Hi Sofer,

Huge thanks for your help so far. 0.3.1-4.el7ost is not available through the RH official repo currently; the newest version there is 0.3.1-1.el7ost. I've asked the customer to apply your hotfix, and once 0.3.1-4.el7ost is released I'll ask them to update to that package. Do you see any problems with these action plans? BTW, the customer hasn't updated us with the result after applying the hotfix yet.

Best Regards,
Chen

Hi Chen,

(In reply to Chen from comment #42)
> Huge thanks for your help so far. 0.3.1-4.el7ost is not available through
> the RH official repo currently; the newest version there is 0.3.1-1.el7ost.
> I've asked the customer to apply your hotfix, and once 0.3.1-4.el7ost is
> released I'll ask them to update to that package. Do you see any problems
> with these action plans?

Nothing that I'm aware of, no.

Regards,

Hi Sofer,

Thank you for your hotfix; the update command can run again. The customer tried the update again today and found that the breakpoint was reached only once, and the overcloud status showed "IN_PROGRESS" all the time. From the -vv output the customer noticed the update fell into some kind of loop, and I found the following message:

resource_status_reason: "UPDATE paused until Hook pre-update is cleared"

I am attaching the command output with the -vv option enabled; would you please have a look?

Best Regards,
Chen

Created attachment 1258658 [details]
overcloud update stack after enabling -vv option showing "UPDATE paused until Hook pre-update is cleared"
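For completeness: that message means heat has set a pre-update hook on the resource and is waiting for it to be cleared. A hedged sketch of the usual manual way to clear such a hook; the stack and resource names are placeholders, and bz 1414779 has the case-specific guidance:

```bash
# Signal heat to drop the pre-update hook on one resource of one stack.
# Replace <stack> and <resource> with values from "heat resource-list";
# "unset_hook" is the signal payload heat expects for clearing hooks.
heat resource-signal <stack> <resource> -D '{"unset_hook": "pre-update"}'
```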
Hi Lukas, Sofer,

The update timed out after 4 hours and only one breakpoint was reached. I hope we can get some help from the heat team to recover from this unhealthy state.

Best Regards,
Chen

Hi Lukas,

It seems the updated heat packages didn't solve the hang issue. Here is `heat resource-list overcloud -n5 | grep -v COMPLETE`:

```
[stack@pc-osadmin01t ~]$ heat resource-list overcloud -n5 | grep -v COMPLETE
| resource_name                | physical_resource_id                 | resource_type                      | resource_status    | updated_time        | stack_name                                                                             |
| ComputeNodesPostDeployment   | df3cacac-0d06-4dda-98d2-f46c164ae70d | OS::TripleO::ComputePostDeployment | UPDATE_FAILED      | 2017-01-09T03:45:05 | overcloud                                                                              |
| 5                            | 106cc260-d0af-43b1-beea-7314c12ee6da | OS::Heat::StructuredDeployment     | UPDATE_FAILED      | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo-ComputePuppetDeployment-qdlqqhmip4cw |
| ComputePuppetDeployment      | 6ce6ac54-cac1-487d-8a03-27afaa05738b | OS::Heat::StructuredDeployments    | UPDATE_FAILED      | 2017-01-09T03:45:45 | overcloud-ComputeNodesPostDeployment-wqbaounbuoxo                                      |
| ComputeAllNodesDeployment    | 8c5548b8-bafd-4c01-ae4c-eb8a967c02ae | OS::Heat::StructuredDeployments    | UPDATE_FAILED      | 2017-02-13T07:40:14 | overcloud                                                                              |
| ControllerAllNodesDeployment | a6e41c06-0b25-4279-94aa-4334b54f6755 | OS::Heat::StructuredDeployments    | UPDATE_FAILED      | 2017-02-13T07:40:15 | overcloud                                                                              |
| 1                            | a1f359a6-bde8-448a-9d78-c15285e59422 | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:17 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                    |
| 2                            | d43b00fb-2be8-4174-a704-350627b7504a | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:17 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                    |
| 0                            | 10c439cd-20c7-4e8f-aa0f-8d4e80500947 | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:18 | overcloud-ControllerAllNodesDeployment-7obqyftxa5gh                                    |
| 1                            | 00a9f138-8675-4d38-8fb2-6a1e928fceff | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:18 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 5                            | b85a41e6-c662-45b0-8dcc-25e21ab1d87a | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:18 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 4                            | b3a648d3-2d79-41f2-b60d-c058221b6f41 | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:19 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 6                            | a02a9448-feec-487a-ad27-b64c9a89de6d | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:19 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 7                            | 087108b6-32b8-4130-b1a7-3987f4938978 | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:19 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 0                            | 827b515b-c657-4b77-85da-c397126d3e76 | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:20 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 2                            | 2734e08c-92a8-4ffb-b978-b3b4bdda82bf | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:20 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| 3                            | 915cf3f0-1491-4753-8980-5fd1efc259bd | OS::Heat::StructuredDeployment     | CREATE_FAILED      | 2017-02-13T07:40:20 | overcloud-ComputeAllNodesDeployment-vnylt3caxemv                                       |
| UpdateDeployment             | b9d5dd88-8749-41ff-b178-1cf9d2706611 | OS::Heat::SoftwareDeployment       | UPDATE_FAILED      | 2017-03-20T08:48:42 | overcloud-Compute-xxwiwibat6t7-0-ljujalc56xle                                          |
| UpdateDeployment             | 2f4b33f0-8c34-4543-9a57-e991691ece7c | OS::Heat::SoftwareDeployment       | UPDATE_FAILED      | 2017-03-20T08:48:44 | overcloud-Controller-3tzdslpfpkkr-0-xahmcczouslg                                       |
| UpdateDeployment             | a78c5724-5530-4b13-9143-434f34cbff1e | OS::Heat::SoftwareDeployment       | UPDATE_FAILED      | 2017-03-20T08:48:44 | overcloud-Compute-xxwiwibat6t7-6-iu2yypfkf5no                                          |
| UpdateDeployment             | df1208d5-74a2-45e1-bc87-1448be83144e | OS::Heat::SoftwareDeployment       | UPDATE_FAILED      | 2017-03-20T08:48:47 | overcloud-Compute-xxwiwibat6t7-2-q33r5mndmljh                                          |
| UpdateDeployment             | 67b0d474-19d0-4cb5-b5b2-780a0b5c794c | OS::Heat::SoftwareDeployment       | UPDATE_FAILED      | 2017-03-20T08:48:49 | overcloud-Controller-3tzdslpfpkkr-1-oxv4xatkah42                                       |
| UpdateDeployment             | eb035152-a80c-4212-b77c-045b529d4a62 | OS::Heat::SoftwareDeployment       | UPDATE_FAILED      | 2017-03-20T08:48:51 | overcloud-Compute-xxwiwibat6t7-4-t4bk53dxfvu3                                          |
| UpdateDeployment             | fff1d305-61df-427f-84b5-29c97118f0ef | OS::Heat::SoftwareDeployment       | CREATE_FAILED      | 2017-03-20T08:48:53 | overcloud-Compute-xxwiwibat6t7-3-cjlits3mcwlq                                          |
| UpdateDeployment             | 48ce8d66-66d8-4c98-b392-1a7e6c5186e3 | OS::Heat::SoftwareDeployment       | UPDATE_FAILED      | 2017-03-20T08:48:57 | overcloud-Compute-xxwiwibat6t7-5-vztnt2dzscnr                                          |
| Compute                      | 73d3d481-2692-4cfc-ad51-48b48df36c4a | OS::Heat::ResourceGroup            | UPDATE_FAILED      | 2017-03-20T09:26:45 | overcloud                                                                              |
| 5                            | 3e108686-cb53-424e-9f59-e98cf565f780 | OS::TripleO::Compute               | UPDATE_IN_PROGRESS | 2017-03-20T09:26:49 | overcloud-Compute-xxwiwibat6t7                                                         |
| Controller                   | f05551d7-8d08-4a27-9362-895dde9cd3c6 | OS::Heat::ResourceGroup            | UPDATE_FAILED      | 2017-03-20T09:26:49 | overcloud                                                                              |
| 6                            | 7de5ac94-9ebd-44e5-952c-606be4d21c86 | OS::TripleO::Compute               | UPDATE_IN_PROGRESS | 2017-03-20T09:26:51 | overcloud-Compute-xxwiwibat6t7                                                         |
| 0                            | 43c63e46-e0db-4fbe-a073-43fdc7463581 | OS::TripleO::Controller            | UPDATE_FAILED      | 2017-03-20T09:26:52 | overcloud-Controller-3tzdslpfpkkr                                                      |
| 1                            | 00f8b2e0-9544-4b92-8759-c3edc357bcb8 | OS::TripleO::Controller            | UPDATE_FAILED      | 2017-03-20T09:26:54 | overcloud-Controller-3tzdslpfpkkr                                                      |
| 4                            | 2bc43591-5a87-4f18-bfcb-16c535f6107c | OS::TripleO::Compute               | UPDATE_IN_PROGRESS | 2017-03-20T09:26:54 | overcloud-Compute-xxwiwibat6t7                                                         |
| 0                            | 1f4c2aea-927b-4f0a-a6cc-dd4d7fedc184 | OS::TripleO::Compute               | UPDATE_IN_PROGRESS | 2017-03-20T09:26:56 | overcloud-Compute-xxwiwibat6t7                                                         |
| 2                            | e71ec635-4384-4aef-9856-bc9e3f397534 | OS::TripleO::Controller            | UPDATE_FAILED      | 2017-03-20T09:26:56 | overcloud-Controller-3tzdslpfpkkr                                                      |
| 2                            | 563365b9-0f7e-410e-847c-b7d73e19fbd6 | OS::TripleO::Compute               | UPDATE_IN_PROGRESS | 2017-03-20T09:26:58 | overcloud-Compute-xxwiwibat6t7                                                         |
| 3                            | a2c84c85-78ce-4fea-8ada-92af7927b2a8 | OS::TripleO::Compute               | UPDATE_IN_PROGRESS | 2017-03-20T09:26:58 | overcloud-Compute-xxwiwibat6t7                                                         |
| UpdateDeployment             | 6359b193-392c-4976-9984-cd057571063f | OS::Heat::SoftwareDeployment       | CREATE_FAILED      | 2017-03-20T09:28:30 | overcloud-Controller-3tzdslpfpkkr-2-mihnp6ps3q7b                                       |
```

I can still see "UPDATE paused until Hook pre-update is cleared" and I will upload the newest -vv --debug output. But as bz 1414779 mentions, the pre-update hook can be cleared in some way; could you please double-check?

Best Regards,
Chen

Created attachment 1264775 [details]
-vv --debug enabled standard output, but only two breakpoints reached before timing out

Can I get `heat stack-show overcloud-Compute-xxwiwibat6t7`? I'd try restarting heat-engine to ensure that all IN_PROGRESS resources switch to FAILED, and after that I'd try re-running the update.

Hi Dave,

Sorry, should I give this openstack-tripleo-common package to the customer? Does it fix the update hang issue?

Best Regards,
Chen

(In reply to Chen from comment #60)
> Sorry, should I give this openstack-tripleo-common package to the customer?
> Does it fix the update hang issue?

Yes, please go ahead and provide it to your customer. It is believed that it will resolve the issue reported here; however, your customer should first validate it in a test environment before deploying it to production.

Hi Chen,

Closing this one as the hotfix was delivered. Don't hesitate to re-open it if needed.

A hotfix should not close a bug; it only closes when shipped to the CDN. Hotfixes go directly to specific customers.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1465
Can I get heat stack-show overcloud-Compute-xxwiwibat6t7 ? Id try restarting heat-engine to ensure that all IN_PROGRESS switch to FAILED and after that I'd try re-running the update. Hi Dave, Sorry should I give this openstack-tripleo-common package to the customer ? Does it fix the update hang issue ? Best Regards, Chen (In reply to Chen from comment #60) > Sorry should I give this openstack-tripleo-common package to the customer ? > Does it fix the update hang issue ? Yes, please go ahead and provide it to your customer. It is believed that it will resolve the issue reported here however your customer should first validate in a test environment before deploying it to production. Hi Chen, Closing this one as the hotfix was delivered. Don't hesitate to re-open it if needed. hotfix should not close a bug. it only closes when shipped to CDN. Hotfixes go directly to specific customers. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1465 |