Bug 1472477
Summary: | rhosp-director: overcloud upgrade OSP9->OSP10 fails during major-upgrade-pacemaker-converge step without particular error. | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
Component: | rhosp-director | Assignee: | Sofer Athlan-Guyot <sathlang> |
Status: | CLOSED NOTABUG | QA Contact: | Amit Ugol <augol> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 10.0 (Newton) | CC: | aschultz, cschwede, dbecker, mburns, morazi, rhel-osp-director-maint, sasha, sathlang, smerrow |
Target Milestone: | async | Keywords: | Triaged, ZStream |
Target Release: | 10.0 (Newton) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-07-19 17:31:21 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1335596, 1356451 |
Description
Alexander Chuzhoy
2017-07-18 20:45:02 UTC
After running "pcs property unset maintenance-mode", re-attempted the major-upgrade-pacemaker-converge step. The failure was different:

2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua.ExtraDeployments]: UPDATE_FAILED resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua]: UPDATE_FAILED resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7.ControllerExtraConfigPost]: UPDATE_FAILED resources.ControllerExtraConfigPost: resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7.ControllerSwiftRingUpdate]: UPDATE_FAILED UPDATE aborted
2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7]: UPDATE_FAILED resources.ControllerExtraConfigPost: resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:20Z [AllNodesDeploySteps]: UPDATE_FAILED resources.AllNodesDeploySteps: resources.ControllerExtraConfigPost: resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:20Z [overcloud]: UPDATE_FAILED resources.AllNodesDeploySteps: resources.ControllerExtraConfigPost: resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:20Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerSwiftRingUpdate-3gjcbffsbzf5.SwiftRingUpdate]: UPDATE_FAILED UPDATE aborted
2017-07-18 22:22:20Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerSwiftRingUpdate-3gjcbffsbzf5]: UPDATE_FAILED Operation cancelled

Stack overcloud UPDATE_FAILED
Heat Stack update failed.

[stack@director ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.1:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 4ce4197a-4b8a-4779-9d07-0ae8fc58363f
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |
  deploy_stderr: |
overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 5bd8c37e-c4bd-4920-8347-7925e2bf3e89
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |
  deploy_stderr: |
overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 2d12f697-5fcf-4b4d-8d6a-3f6e71e40487
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |
  deploy_stderr: |
overcloud.AllNodesDeploySteps.ControllerSwiftRingUpdate.SwiftRingUpdate.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 73eafe29-33dd-4f5b-81c1-f1f00e5d8a92
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
  deploy_stdout: |
    ...
    /etc/swift/backups/1500312557.container.builder
    /etc/swift/backups/1500312557.object.builder
    /etc/swift/backups/1500312558.account.builder
    /etc/swift/backups/1500312565.account.builder
    /etc/swift/backups/1500312565.account.ring.gz
    /etc/swift/backups/1500312566.container.builder
    /etc/swift/backups/1500312566.container.ring.gz
    /etc/swift/backups/1500312566.object.builder
    /etc/swift/backups/1500312566.object.ring.gz
    /var/lib/heat-config/heat-config-script
    (truncated, view all with --long)
  deploy_stderr: |
    tar: Removing leading `/' from member names

Hi Christian, I want to loop you in.

Hi Alexander!

I think this is not a Swift issue: the SwiftRingUpdate was aborted due to the earlier failures ("UPDATE aborted"). Looking at the sosreport from controller 0, I see a lot of other errors:

- Rabbit and MySQL seem to be either down or unreachable (network?)
- there are even quite a few segfaults in httpd?
- Swift account servers are not reachable either, which might be due to network errors?

I'm not sure about the non-Swift errors; maybe someone from the upgrade DFG can help with this? I think the reason for this is not Swift; there seems to be a more general problem.

Hi Sasha,

let's backtrack to the original error. I think overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.0 is a custom script that just fails. Return code 5 is unusual, as is the fact that it emits no output. To be sure, could we have the /var/lib/heat-config directory from controller0, for instance? I want to take a peek at that.

Running

/var/lib/heat-config/hooks/script < /var/lib/heat-config/deployed/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116.json

which is the failing command.
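The triage in the comments above (treat "UPDATE aborted" and "Operation cancelled" as consequences, then go back to the first real failure) can be sketched in Python. This is only an illustration, not a Heat tool: the regular expression is an assumption based on the event lines pasted in this bug, and the names EVENT_RE, CASCADE_REASONS and root_failures are hypothetical.

```python
import re

# Assumed event layout, taken from the lines pasted in this bug:
#   <date> <time>Z [<resource>]: <STATUS> <reason>
EVENT_RE = re.compile(
    r"^(?P<time>\S+ \S+)\s+\[(?P<resource>[^\]]+)\]:\s+"
    r"(?P<status>[A-Z_]+)\s+(?P<reason>.*)$"
)

# Reasons that only say "stopped because something else failed".
CASCADE_REASONS = ("UPDATE aborted", "Operation cancelled")


def root_failures(lines):
    """Return (resource, reason) for failed events that are not cascades."""
    found = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match or not match.group("status").endswith("_FAILED"):
            continue
        reason = match.group("reason").strip()
        if reason in CASCADE_REASONS:
            continue  # aborted or cancelled because of an earlier failure
        found.append((match.group("resource"), reason))
    return found


# Three of the event lines from this bug, copied as pasted.
SAMPLE = [
    "2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-"
    "ControllerExtraConfigPost-j346umbf36ua.ExtraDeployments]: UPDATE_FAILED "
    "resources.ExtraDeployments: Error: resources[2]: Deployment to server "
    "failed: deploy_status_code : Deployment exited with non-zero status code: 5",
    "2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7."
    "ControllerSwiftRingUpdate]: UPDATE_FAILED UPDATE aborted",
    "2017-07-18 22:22:20Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-"
    "ControllerSwiftRingUpdate-3gjcbffsbzf5]: UPDATE_FAILED Operation cancelled",
]

roots = root_failures(SAMPLE)
for resource, reason in roots:
    print(resource)
    print("  " + reason)
```

With the sample lines, only the ExtraDeployments event is kept; both SwiftRingUpdate events are dropped as cascades, which matches the conclusion that Swift is not the cause.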
[root@overcloud-controller-0 ~]# /var/lib/heat-config/hooks/script < /var/lib/heat-config/deployed/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116.json
[2017-07-19 14:28:03,337] (heat-config) [INFO] deploy_server_id=8c5f4401-2607-4356-8846-3641b70951ce
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_action=CREATE
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_stack_id=overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua-ExtraDeployments-t5sanensxohv/9409447e-c526-4534-965d-e45ae7ff2570
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_resource_name=0
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_signal_transport=CFN_SIGNAL
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_signal_id=http://192.168.120.101:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A99a592c416214dfc99c904197bca1709%3Astacks%2Fovercloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua-ExtraDeployments-t5sanensxohv%2F9409447e-c526-4534-965d-e45ae7ff2570%2Fresources%2F0?Timestamp=2017-07-18T20%3A17%3A15Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=6409c9cbf0a543c79b37bc4511055139&SignatureVersion=2&Signature=5Igr8y%2FS2%2FAcaqNtjhfFB9WBV9dxDTurEDG971JOsk8%3D
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_signal_verb=POST
[2017-07-19 14:28:03,339] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116
[2017-07-19 14:28:03,369] (heat-config) [INFO]
[2017-07-19 14:28:03,370] (heat-config) [DEBUG]
[2017-07-19 14:28:03,370] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116. [5]
{"deploy_stdout": "", "deploy_stderr": "", "deploy_status_code": 5}[root@overcloud-controller-0 ~]#
[root@overcloud-controller-0 ~]# cat /var/lib/heat-config/heat-config-script/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116
#!/bin/bash

# BEGIN workaround for BZ 1283721
if [[ $HOSTNAME =~ "cephstorage-0" ]]; then
  {
    echo "Checking Ceph pools pg_num..."
    sleep 10
    hiera ceph_pool_pgs {} | HOME="/root" python -c "
import re, subprocess, sys, time, yaml

def issue_cmd(cmd):
  print(cmd)
  for i in range(1, 10):
    try:
      return subprocess.check_output(cmd.split(), stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
      if 'EBUSY' in e.output:
        print ' Cluster is busy, retrying...'
        time.sleep(5)
      else:
        print '{}\nAborting due to fatal error.'.format(e.output)
        sys.exit(1)
  print 'Aborting due to excessive retries.'
  sys.exit(1)

def set_pg_num(pool, pg_num):
  out = issue_cmd('ceph osd pool get {} pg_num'.format(pool))
  print(out)
  m = re.search('(\S*)(pg_num: )(\d+)(\S*)', out)
  if not m:
    print 'Aborting due to error reading current pg_num.'
    sys.exit(1)
  pg_cur = int(m.group(3))
  if pg_cur == pg_num:
    print 'Pool \'{}\' pg_num already set to {}'.format(pool, pg_num)
  elif pg_cur > pg_num:
    print 'Cannot decrease pool \'{}\' pg_num from {} to {}'.format(pool, pg_cur, pg_num)
  else:
    print 'Increasing pool \'{}\' pg_num from {} to {}'.format(pool, pg_cur, pg_num)
    while pg_cur < pg_num:
      pg_cur *= 2
      if pg_cur > pg_num:
        pg_cur = pg_num
      issue_cmd('ceph osd pool set {} pg_num {}'.format(pool, pg_cur))
      time.sleep(10)
    issue_cmd('ceph osd pool set {} pgp_num {}'.format(pool, pg_num))

input = ' '.join(map(str.strip, sys.stdin.readlines()))
pool_pgs = yaml.load(input.replace('=>', ': '))
for pool in pool_pgs:
  set_pg_num(pool, pool_pgs[pool])
"
  } >> /root/post-deploy.txt 2>&1
fi
# END workaround for BZ 1283721

if [[ $HOSTNAME =~ "controller" ]]; then
  {
    echo "Restarting RGW..."
    chkconfig --add ceph-radosgw
    systemctl restart ceph-radosgw
  } >> /root/post-deploy.txt 2>&1
fi

Hi, so this is a custom script for radosgw support.
It looks like we already fixed this in https://bugzilla.redhat.com/show_bug.cgi?id=1404810#c38

Basically:

diff -ruN pilot/templates/post-deploy.yaml pilot/templates/post-deploy.yaml.working
--- pilot/templates/post-deploy.yaml 2017-07-17 09:51:30.000000000 -0500
+++ pilot/templates/post-deploy.yaml.working 2017-07-19 09:47:51.355773310 -0500
@@ -88,7 +88,8 @@
 {
   echo "Restarting RGW..."
   chkconfig --add ceph-radosgw
-  systemctl restart ceph-radosgw
+  sudo pkill radosgw
+  sudo systemctl restart ceph-radosgw
 } >> /root/post-deploy.txt 2>&1
 fi

and the converge step should work.

For reference, the env file is included in:

pilot/templates/dell-environment.yaml: OS::TripleO::NodeExtraConfigPost: ./post-deploy.yaml

Sasha, tell us how it goes.

Thanks,

After correcting post-deploy.yaml based on comment #7, successfully completed the major-upgrade-pacemaker-converge step.

Closing as not a bug.
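A note on why the failing deployment reported empty deploy_stdout and deploy_stderr: both blocks of the custom script send their output to /root/post-deploy.txt, so nothing reaches the hook, and the script's exit status is that of the last command in the block. The Python sketch below reproduces that behaviour under stated assumptions. It is not the heat-config hook; the LOG variable, the temporary log file, and the "(exit 5)" stand-in for the failing restart command are all hypothetical, and plain sh is used because the brace group and redirection are not bash-specific.

```python
import os
import subprocess
import tempfile

# Same shape as the controller block of the custom script. "(exit 5)" is a
# hypothetical stand-in for the failing restart command; LOG replaces
# /root/post-deploy.txt so the sketch can run anywhere.
SCRIPT = """
{
  echo "Restarting RGW..."
  (exit 5)
} >> "$LOG" 2>&1
"""


def run_like_hook(script):
    """Run a script and summarise it in the shape of the hook's JSON result."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "post-deploy.txt")
        proc = subprocess.run(
            ["sh", "-c", script],
            env={**os.environ, "LOG": log_path},
            capture_output=True,
            text=True,
        )
        with open(log_path) as handle:
            log_text = handle.read()
    return {
        "deploy_stdout": proc.stdout,
        "deploy_stderr": proc.stderr,
        "deploy_status_code": proc.returncode,
        "log_file": log_text,  # not part of the real result, shown for contrast
    }


result = run_like_hook(SCRIPT)
print(result)
```

In a case like this, the useful output is in the log file the script writes to (here /root/post-deploy.txt on the node), not in the Heat deployment result.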