Bug 1650184

Summary: [RFE] Do not execute client role tasks serially but in parallel
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Giulio Fidente <gfidente>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: high
Priority: high
Version: 3.1
CC: anharris, aschoen, augol, ceph-eng-bugs, ceph-qe-bugs, edonnell, gabrioux, gfidente, gmeno, hnallurv, johfulto, lbezdick, nthomas, sankarshan, tchandra, tserlin, vashastr, yprokule
Target Milestone: z1
Keywords: FutureFeature
Target Release: 3.2
Hardware: Unspecified
OS: Unspecified
Fixed In Version: RHEL: ceph-ansible-3.2.4-1.el7cp, Ubuntu: ceph-ansible_3.2.4-2redhat1
Doc Type: Enhancement
Doc Text:
Previously, the `rolling_update.yml` playbook ran the client role on one client node at a time. With this update, users can set the number of client nodes processed in one batch with the new variable `client_update_batch`, which makes upgrading client nodes much faster. If no value is passed, it defaults to the value of `ansible_forks`, which is `5` by default.
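For example, assuming a hypothetical extra-vars file named `client_batch.yml`, an operator could raise the client batch size like this:

```yaml
# client_batch.yml (hypothetical file name)
# Process up to 10 client nodes per batch during rolling_update.yml.
client_update_batch: 10
```

Pass it with `ansible-playbook rolling_update.yml -e @client_batch.yml`, or inline with `-e client_update_batch=10`; adjust the playbook path to match your ceph-ansible installation.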
Last Closed: 2019-01-31 10:36:36 UTC
Type: Bug
Bug Blocks: 1578730    

Description Giulio Fidente 2018-11-15 14:35:21 UTC
In ceph-ansible 3.1 the client role tasks are executed serially on each client node.

As a result, operations such as scale-up or upgrade take a long time, growing with the number of client nodes using the cluster. It should instead be possible to execute the client role tasks on all client nodes in parallel.

Comment 3 Lukas Bezdicka 2018-11-15 15:15:59 UTC
We should provide a set of new playbooks that do something like:
  serial: "{{ ((groups['<group>'] | length) * 0.2) | round(0,'ceil') | int }}"
and even run fully in parallel on clients, as it makes no sense to containerize and upgrade nodes one by one when you have hundreds of them.
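A minimal sketch of the batching proposed above, assuming the client inventory group is named `clients` (an assumption; the group name is configurable). The play upgrades roughly 20% of the client nodes per batch, rounded up, rather than one node at a time; the debug task is a placeholder for the real client tasks.

```yaml
# Illustrative play header only; the group name "clients" and the debug task
# are placeholders, not ceph-ansible's actual rolling_update.yml content.
- name: upgrade ceph client node
  hosts: clients
  # Batch size = ceil(20% of the client group), e.g. 12 clients -> batches of 3.
  serial: "{{ ((groups['clients'] | length) * 0.2) | round(0, 'ceil') | int }}"
  tasks:
    - name: show the hosts in the current batch
      debug:
        msg: "Current batch: {{ ansible_play_batch | join(', ') }}"
      run_once: true
```

Ansible's `serial` keyword also accepts a percentage string such as `"20%"`, which avoids the arithmetic, though its rounding may differ from the explicit ceiling used here. For the fully parallel case, `serial` can simply be omitted so the play targets all client nodes in a single batch, with concurrency still bounded by the forks setting.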

Comment 4 John Fulton 2018-12-20 15:40:01 UTC
- This is about the rolling update; "serial: 1" is set here: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L798-L803
- Can we override this value for the client role, e.g. by customizing the inventory?
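One way to make the batch size overridable, sketched under assumptions: the hard-coded `serial: 1` is replaced by a templated variable with a default. The variable name `client_update_batch` is borrowed from the Doc Text; the default of `1` (keeping the old behaviour) and the group name `clients` are illustrative choices, not the shipped fix.

```yaml
- name: upgrade ceph client node
  hosts: clients
  # Falls back to one node at a time unless client_update_batch is supplied.
  serial: "{{ client_update_batch | default(1) }}"
  tasks:
    - name: placeholder for the client upgrade tasks
      debug:
        msg: "Upgrading {{ inventory_hostname }}"
```

Because `serial` is evaluated when the play starts, before any individual host is in scope, an extra variable (`-e client_update_batch=10`) is the dependable way to override it; inventory host or group variables are likely not visible at that point, so customizing the inventory alone may not work.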

Comment 5 John Fulton 2019-01-02 14:18:05 UTC
As per a conversation with Seb: 

- the ceph-ansible team will remove "serial: 1" from rolling_update.yml playbook for the clients
- it will be put into 3.2

Comment 12 Giulio Fidente 2019-01-16 16:38:56 UTC
Looks like the fix introduced a new issue [1]

2019-01-16 17:32:34,485 p=28500 u=mistral |  ERROR! The field 'serial' has an invalid value, which includes an undefined variable. The error was: 'ansible_forks' is undefined

The error appears to have been in '/usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml': line 737, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


- name: upgrade ceph client node
  ^ here

exception type: <class 'ansible.errors.AnsibleUndefinedVariable'>
exception: 'ansible_forks' is undefined

1. https://paste.fedoraproject.org/paste/QV7A0FhAl4t4uQMjXUK3Tw

Comment 14 Giulio Fidente 2019-01-17 17:36:14 UTC
(In reply to Giulio Fidente from comment #12)

I think the problem is that ansible_forks was only added in Ansible 2.5, so it is undefined on older versions.

Lukas, do you know what version of Ansible was installed on the undercloud when the run failed?
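If the playbook has to work on Ansible releases older than 2.5, one possible guard, sketched here as an assumption rather than the fix that actually shipped, is to give `ansible_forks` its own fallback so `serial` never references an undefined variable. The fallback of `5` matches Ansible's default number of forks; the group name `clients` is a placeholder.

```yaml
- name: upgrade ceph client node
  hosts: clients
  # Prefer the operator's batch size, then ansible_forks (assumed to exist only
  # on Ansible 2.5+), then 5 if neither is defined.
  serial: "{{ client_update_batch | default(ansible_forks | default(5)) }}"
  tasks:
    - name: record the Ansible version used for this run
      debug:
        msg: "Ansible {{ ansible_version.full }}"
      run_once: true
```

The debug task also answers the version question directly in the run log, since `ansible_version.full` reports the Ansible version running on the controller.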

Comment 27 Guillaume Abrioux 2019-01-22 15:02:38 UTC
Hi Tejas,

yes, the default behaviour is to update clients in parallel.

Comment 30 errata-xmlrpc 2019-01-31 10:36:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0223