Bug 1472477

Summary:	rhosp-director: overcloud upgrade OSP9->OSP10 fails during major-upgrade-pacemaker-converge step without particular error.
Product:	Red Hat OpenStack	Reporter:	Alexander Chuzhoy <sasha>
Component:	rhosp-director	Assignee:	Sofer Athlan-Guyot <sathlang>
Status:	CLOSED NOTABUG	QA Contact:	Amit Ugol <augol>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	10.0 (Newton)	CC:	aschultz, cschwede, dbecker, mburns, morazi, rhel-osp-director-maint, sasha, sathlang, smerrow
Target Milestone:	async	Keywords:	Triaged, ZStream
Target Release:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-07-19 17:31:21 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1335596, 1356451

Description Alexander Chuzhoy 2017-07-18 20:45:02 UTC

rhosp-director: overcloud upgrade OSP9->OSP10 fails during major-upgrade-pacemaker-converge step without particular error.

Environment:
openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
openstack-tripleo-heat-templates-5.2.0-21.el7ost.noarch
instack-undercloud-5.3.0-1.el7ost.noarch
openstack-puppet-modules-9.3.0-1.el7ost.noarch


Steps to reproduce:
Follow the upgrade procedure [1] and run the step that uses major-upgrade-pacemaker-converge.yaml.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/upgrading_red_hat_openstack_platform/#sect-Major-Upgrading_the_Overcloud



Result:
2017-07-18 19:38:05Z [overcloud.ControllerAllNodesDeployment]: UPDATE_COMPLETE  state changed
2017-07-18 19:38:05Z [overcloud.ControllerAllNodesValidationDeployment]: UPDATE_IN_PROGRESS  state changed
2017-07-18 19:38:06Z [overcloud.UpdateWorkflow]: CREATE_IN_PROGRESS  state changed
2017-07-18 19:38:07Z [overcloud.ControllerAllNodesValidationDeployment]: UPDATE_COMPLETE  state changed
2017-07-18 19:38:07Z [overcloud.UpdateWorkflow]: CREATE_COMPLETE  state changed
2017-07-18 19:38:07Z [overcloud.AllNodesDeploySteps]: CREATE_IN_PROGRESS  state changed
2017-07-18 20:19:06Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED  Error: resources.AllNodesDeploySteps.resources.ControllerExtraConfigPost.resources.ExtraDeployments.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 5
2017-07-18 20:19:06Z [overcloud]: UPDATE_FAILED  Error: resources.AllNodesDeploySteps.resources.ControllerExtraConfigPost.resources.ExtraDeployments.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 5



[stack@director ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.1:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: c219c63e-41a4-41aa-a819-2d284ccd0fca
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |

  deploy_stderr: |

overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: c8fe68f6-5f23-40af-a9fe-f18ef1feb9eb
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |

  deploy_stderr: |

overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: db320505-6d10-42d5-b583-5686def3b791
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |

  deploy_stderr: |







[stack@director ~]$ heat resource-list -n5 overcloud|grep -v COMPLETE
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead

+----------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                                | physical_resource_id                                                            | resource_type                                                                                                             | resource_status | updated_time         | stack_name                                                                                                                            |
+----------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| AllNodesDeploySteps                          | a7a498f5-303f-4b87-8a58-f44036994e2f                                            | OS::TripleO::PostDeploySteps                                                                                              | CREATE_FAILED   | 2017-07-18T19:38:07Z | overcloud                                                                                                                             |
| ControllerExtraConfigPost                    | c6ec8544-2c5d-47d9-9887-eee79940bd60                                            | OS::TripleO::NodeExtraConfigPost                                                                                          | CREATE_FAILED   | 2017-07-18T19:38:08Z | overcloud-AllNodesDeploySteps-2wxbo2x6d6d7                                                                                            |
| 0                                            | c8fe68f6-5f23-40af-a9fe-f18ef1feb9eb                                            | OS::Heat::SoftwareDeployment                                                                                              | CREATE_FAILED   | 2017-07-18T20:17:15Z | overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua-ExtraDeployments-t5sanensxohv                       |
| 1                                            | c219c63e-41a4-41aa-a819-2d284ccd0fca                                            | OS::Heat::SoftwareDeployment                                                                                              | CREATE_FAILED   | 2017-07-18T20:17:15Z | overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua-ExtraDeployments-t5sanensxohv                       |
| 2                                            | db320505-6d10-42d5-b583-5686def3b791                                            | OS::Heat::SoftwareDeployment                                                                                              | CREATE_FAILED   | 2017-07-18T20:17:15Z | overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua-ExtraDeployments-t5sanensxohv                       |
| ExtraDeployments                             | 9409447e-c526-4534-965d-e45ae7ff2570                                            | OS::Heat::SoftwareDeployments                                                                                             | CREATE_FAILED   | 2017-07-18T20:17:15Z | overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua                                                     |
+----------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+---------------------------------------------------------------------------------------------------------------------------------------+





[stack@director ~]$ for i in c8fe68f6-5f23-40af-a9fe-f18ef1feb9eb c219c63e-41a4-41aa-a819-2d284ccd0fca db320505-6d10-42d5-b583-5686def3b791; do echo $i; heat deployment-show $i; done
c8fe68f6-5f23-40af-a9fe-f18ef1feb9eb
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED", 
  "server_id": "8c5f4401-2607-4356-8846-3641b70951ce", 
  "config_id": "b44bb644-a5f4-4fcd-ad5d-5119a0e4d116", 
  "output_values": {
    "deploy_stdout": "", 
    "deploy_stderr": "", 
    "deploy_status_code": 5
  }, 
  "creation_time": "2017-07-18T20:17:17Z", 
  "updated_time": "2017-07-18T20:19:04Z", 
  "input_values": {}, 
  "action": "CREATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 5", 
  "id": "c8fe68f6-5f23-40af-a9fe-f18ef1feb9eb"
}
c219c63e-41a4-41aa-a819-2d284ccd0fca
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED", 
  "server_id": "9b8a7dd7-d4c0-43c9-a743-f6516cfa881a", 
  "config_id": "afc11f98-8509-498b-83a4-856e7cbd6a2d", 
  "output_values": {
    "deploy_stdout": "", 
    "deploy_stderr": "", 
    "deploy_status_code": 5
  }, 
  "creation_time": "2017-07-18T20:17:16Z", 
  "updated_time": "2017-07-18T20:18:18Z", 
  "input_values": {}, 
  "action": "CREATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 5", 
  "id": "c219c63e-41a4-41aa-a819-2d284ccd0fca"
}
db320505-6d10-42d5-b583-5686def3b791
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED", 
  "server_id": "268e8fe0-f688-481b-b24c-487f323db8c1", 
  "config_id": "0aa7e27b-4e28-4285-88f9-3f9fa8d31181", 
  "output_values": {
    "deploy_stdout": "", 
    "deploy_stderr": "", 
    "deploy_status_code": 5
  }, 
  "creation_time": "2017-07-18T20:17:17Z", 
  "updated_time": "2017-07-18T20:18:09Z", 
  "input_values": {}, 
  "action": "CREATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 5", 
  "id": "db320505-6d10-42d5-b583-5686def3b791"
}
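
The deprecation warnings above name "openstack software deployment show" as the replacement for "heat deployment-show". A minimal sketch of the same loop with that command follows; the helper name show_deployments is hypothetical and not part of any OpenStack tool.

```shell
#!/bin/bash
# Hypothetical helper (not part of any OpenStack tool): print each
# deployment ID, then show that deployment with the command that the
# deprecation warning recommends.
show_deployments() {
  local id
  for id in "$@"; do
    echo "$id"
    openstack software deployment show "$id"
  done
}
```

It could be called with the three failed IDs from the resource list, for example: show_deployments c8fe68f6-5f23-40af-a9fe-f18ef1feb9eb c219c63e-41a4-41aa-a819-2d284ccd0fca db320505-6d10-42d5-b583-5686def3b791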

[heat-admin@overcloud-controller-1 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Tue Jul 18 20:44:41 2017          Last change: Tue Jul 18 19:47:56 2017 by root via cibadmin on overcloud-controller-0

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

3 nodes and 19 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-192.168.140.121     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1 (unmanaged)
 ip-192.168.120.127     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2 (unmanaged)
 ip-192.168.120.126     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1 (unmanaged)
 Clone Set: haproxy-clone [haproxy] (unmanaged)
     haproxy    (systemd:haproxy):      Started overcloud-controller-2 (unmanaged)
     haproxy    (systemd:haproxy):      Started overcloud-controller-1 (unmanaged)
     haproxy    (systemd:haproxy):      Started overcloud-controller-0 (unmanaged)
 Master/Slave Set: galera-master [galera] (unmanaged)
     galera     (ocf::heartbeat:galera):        Master overcloud-controller-2 (unmanaged)
     galera     (ocf::heartbeat:galera):        Master overcloud-controller-1 (unmanaged)
     galera     (ocf::heartbeat:galera):        Master overcloud-controller-0 (unmanaged)
 ip-192.168.190.5       (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2 (unmanaged)
 ip-192.168.170.120     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1 (unmanaged)
 ip-192.168.140.120     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2 (unmanaged)
 Clone Set: rabbitmq-clone [rabbitmq] (unmanaged)
     rabbitmq   (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-2 (unmanaged)
     rabbitmq   (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-1 (unmanaged)
     rabbitmq   (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-0 (unmanaged)
 Master/Slave Set: redis-master [redis] (unmanaged)
     redis      (ocf::heartbeat:redis): Slave overcloud-controller-2 (unmanaged)
     redis      (ocf::heartbeat:redis): Master overcloud-controller-1 (unmanaged)
     redis      (ocf::heartbeat:redis): Slave overcloud-controller-0 (unmanaged)
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-1 (unmanaged)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[heat-admin@overcloud-controller-1 ~]$

Comment 2 Alexander Chuzhoy 2017-07-18 22:39:54 UTC

After running "pcs property unset maintenance-mode", I re-attempted the major-upgrade-pacemaker-converge step.


The failure was different:

2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua.ExtraDeployments]: UPDATE_FAILED  resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua]: UPDATE_FAILED  resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7.ControllerExtraConfigPost]: UPDATE_FAILED  resources.ControllerExtraConfigPost: resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7.ControllerSwiftRingUpdate]: UPDATE_FAILED  UPDATE aborted
2017-07-18 22:22:19Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7]: UPDATE_FAILED  resources.ControllerExtraConfigPost: resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:20Z [AllNodesDeploySteps]: UPDATE_FAILED  resources.AllNodesDeploySteps: resources.ControllerExtraConfigPost: resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:20Z [overcloud]: UPDATE_FAILED  resources.AllNodesDeploySteps: resources.ControllerExtraConfigPost: resources.ExtraDeployments: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
2017-07-18 22:22:20Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerSwiftRingUpdate-3gjcbffsbzf5.SwiftRingUpdate]: UPDATE_FAILED  UPDATE aborted
2017-07-18 22:22:20Z [overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerSwiftRingUpdate-3gjcbffsbzf5]: UPDATE_FAILED  Operation cancelled

 Stack overcloud UPDATE_FAILED 

Heat Stack update failed.




[stack@director ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.1:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 4ce4197a-4b8a-4779-9d07-0ae8fc58363f
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |

  deploy_stderr: |

overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 5bd8c37e-c4bd-4920-8347-7925e2bf3e89
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |

  deploy_stderr: |

overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 2d12f697-5fcf-4b4d-8d6a-3f6e71e40487
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 5
  deploy_stdout: |

  deploy_stderr: |

overcloud.AllNodesDeploySteps.ControllerSwiftRingUpdate.SwiftRingUpdate.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 73eafe29-33dd-4f5b-81c1-f1f00e5d8a92
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
  deploy_stdout: |
    ...
    /etc/swift/backups/1500312557.container.builder
    /etc/swift/backups/1500312557.object.builder
    /etc/swift/backups/1500312558.account.builder
    /etc/swift/backups/1500312565.account.builder
    /etc/swift/backups/1500312565.account.ring.gz
    /etc/swift/backups/1500312566.container.builder
    /etc/swift/backups/1500312566.container.ring.gz
    /etc/swift/backups/1500312566.object.builder
    /etc/swift/backups/1500312566.object.ring.gz
    /var/lib/heat-config/heat-config-script
    (truncated, view all with --long)
  deploy_stderr: |
    tar: Removing leading `/' from member names

Comment 3 Alexander Chuzhoy 2017-07-18 23:47:39 UTC

Hi Christian,
I want to loop you in.

Comment 4 Christian Schwede (cschwede) 2017-07-19 07:58:13 UTC

Hi Alexander! I think this is not a Swift issue: the SwiftRingUpdate was aborted because of the earlier failures ("UPDATE aborted"). Looking at the sosreport from controller 0, I see a lot of other errors:

- Rabbit and MySQL seem to be either down or unreachable (network?)
- there are even quite a few segfaults in httpd?
- Swift account servers are not reachable either, which might be due to network errors?

I'm not sure about the non-Swift errors. Maybe someone from the upgrade DFG can help with this? I don't think Swift is the cause; there seems to be a more general problem.

Comment 5 Sofer Athlan-Guyot 2017-07-19 10:24:55 UTC

Hi Sasha,

let's backtrack to the original error.

I think overcloud.AllNodesDeploySteps.ControllerExtraConfigPost.ExtraDeployments.0 is a custom script that just fails.  Return code 5 is unusual, and so is the fact that it emits no output.

To be sure, could we have the /var/lib/heat-config directory from controller0, for instance?

I want to take a peek at this:

   Running /var/lib/heat-config/hooks/script < /var/lib/heat-config/deployed/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116.json

which is the failing command.

Comment 6 Alexander Chuzhoy 2017-07-19 14:28:52 UTC

[root@overcloud-controller-0 ~]# /var/lib/heat-config/hooks/script < /var/lib/heat-config/deployed/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116.json
[2017-07-19 14:28:03,337] (heat-config) [INFO] deploy_server_id=8c5f4401-2607-4356-8846-3641b70951ce
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_action=CREATE
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_stack_id=overcloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua-ExtraDeployments-t5sanensxohv/9409447e-c526-4534-965d-e45ae7ff2570
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_resource_name=0
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_signal_transport=CFN_SIGNAL
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_signal_id=http://192.168.120.101:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A99a592c416214dfc99c904197bca1709%3Astacks%2Fovercloud-AllNodesDeploySteps-2wxbo2x6d6d7-ControllerExtraConfigPost-j346umbf36ua-ExtraDeployments-t5sanensxohv%2F9409447e-c526-4534-965d-e45ae7ff2570%2Fresources%2F0?Timestamp=2017-07-18T20%3A17%3A15Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=6409c9cbf0a543c79b37bc4511055139&SignatureVersion=2&Signature=5Igr8y%2FS2%2FAcaqNtjhfFB9WBV9dxDTurEDG971JOsk8%3D
[2017-07-19 14:28:03,338] (heat-config) [INFO] deploy_signal_verb=POST
[2017-07-19 14:28:03,339] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116
[2017-07-19 14:28:03,369] (heat-config) [INFO] 
[2017-07-19 14:28:03,370] (heat-config) [DEBUG] 
[2017-07-19 14:28:03,370] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116. [5]

{"deploy_stdout": "", "deploy_stderr": "", "deploy_status_code": 5}[root@overcloud-controller-0 ~]# 






[root@overcloud-controller-0 ~]# cat /var/lib/heat-config/heat-config-script/b44bb644-a5f4-4fcd-ad5d-5119a0e4d116
#!/bin/bash
# BEGIN workaround for BZ 1283721
if [[ $HOSTNAME =~ "cephstorage-0" ]]; then
{
  echo "Checking Ceph pools pg_num..."
  sleep 10
  hiera ceph_pool_pgs {} | HOME="/root" python -c "
import re, subprocess, sys, time, yaml
def issue_cmd(cmd):
    print(cmd)
    for i in range(1, 10):
        try:
            return subprocess.check_output(cmd.split(), stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError as e:
            if 'EBUSY' in e.output:
                print '  Cluster is busy, retrying...'
                time.sleep(5)
            else:
                print '{}\nAborting due to fatal error.'.format(e.output)
                sys.exit(1)

    print 'Aborting due to excessive retries.'
    sys.exit(1)

def set_pg_num(pool, pg_num):
    out = issue_cmd('ceph osd pool get {} pg_num'.format(pool))
    print(out)
    m = re.search('(\S*)(pg_num: )(\d+)(\S*)', out)
    if not m:
        print 'Aborting due to error reading current pg_num.'
        sys.exit(1)
    
    pg_cur = int(m.group(3))
    if pg_cur == pg_num:
        print 'Pool \'{}\' pg_num already set to {}'.format(pool, pg_num)
    elif pg_cur > pg_num:
        print 'Cannot decrease pool \'{}\' pg_num from {} to {}'.format(pool, pg_cur, pg_num)
    else:
        print 'Increasing pool \'{}\' pg_num from {} to {}'.format(pool, pg_cur, pg_num)
        while pg_cur < pg_num:
            pg_cur *= 2
            if pg_cur > pg_num:
                pg_cur = pg_num

            issue_cmd('ceph osd pool set {} pg_num {}'.format(pool, pg_cur))
            time.sleep(10)

        issue_cmd('ceph osd pool set {} pgp_num {}'.format(pool, pg_num))

input = ' '.join(map(str.strip, sys.stdin.readlines()))
pool_pgs = yaml.load(input.replace('=>', ': '))
for pool in pool_pgs:
    set_pg_num(pool, pool_pgs[pool])
"
} >> /root/post-deploy.txt 2>&1
fi
# END workaround for BZ 1283721
if [[ $HOSTNAME =~ "controller" ]]; then
{
  echo "Restarting RGW..."
  chkconfig --add ceph-radosgw
  systemctl restart ceph-radosgw
} >> /root/post-deploy.txt 2>&1
fi
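
One plausible reading of the empty deploy_stdout and deploy_stderr: the script above sends all output of each brace group to /root/post-deploy.txt, so nothing reaches the hook, while the exit status of the last command in the group still becomes the script's exit status. A minimal sketch of that behavior, assuming a stand-in command that exits with 5 and a temporary log file (both are illustrative, not taken from the deployment):

```shell
#!/bin/bash
# Temporary log file standing in for /root/post-deploy.txt.
log_file=$(mktemp)

run_step() {
  {
    echo "Restarting RGW..."
    (exit 5)   # stand-in for a last command that fails with status 5
  } >> "$log_file" 2>&1
}

# Capture whatever the step writes to stdout/stderr, as a caller would.
captured=$(run_step 2>&1)
status=$?

echo "captured output: '${captured}'"
echo "exit status: ${status}"
echo "log contents: $(cat "$log_file")"
```

Under these assumptions the caller sees only the exit status; the messages that explain the failure end up in the log file, which suggests looking at /root/post-deploy.txt on the controller.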

Comment 7 Sofer Athlan-Guyot 2017-07-19 14:58:10 UTC

Hi,

So this is a custom script for radosgw support.  It looks like we already fixed this in https://bugzilla.redhat.com/show_bug.cgi?id=1404810#c38

Basically:


diff -ruN  pilot/templates/post-deploy.yaml  pilot/templates/post-deploy.yaml.working 
--- pilot/templates/post-deploy.yaml    2017-07-17 09:51:30.000000000 -0500
+++ pilot/templates/post-deploy.yaml.working    2017-07-19 09:47:51.355773310 -0500
@@ -88,7 +88,8 @@
         {
           echo "Restarting RGW..."
           chkconfig --add ceph-radosgw
-          systemctl restart ceph-radosgw
+          sudo pkill radosgw
+          sudo systemctl restart ceph-radosgw
         } >> /root/post-deploy.txt 2>&1
         fi

and the converge step should work.

For reference, the env file is included in:

pilot/templates/dell-environment.yaml:    OS::TripleO::NodeExtraConfigPost: ./post-deploy.yaml

Sasha, tell us how it goes.

Thanks,

Comment 9 Alexander Chuzhoy 2017-07-19 17:31:21 UTC

After correcting post-deploy.yaml based on comment #7, the major-upgrade-pacemaker-converge step completed successfully.

Closing as not a bug.