Description of problem:

During a parallel update of a composable deployment running IPv6, the update of database-1 failed during the configuration steps.

From undercloud-0:/home/stack/overcloud_update_run_Database.log:

2020-01-13 15:02:08 | TASK [Wait for puppet host configuration to finish] ...
2020-01-13 16:07:45 | FAILED - RETRYING: Wait for puppet host configuration to finish (4 retries left).
2020-01-13 16:07:45 | FAILED - RETRYING: Wait for puppet host configuration to finish (3 retries left).
2020-01-13 16:07:45 | FAILED - RETRYING: Wait for puppet host configuration to finish (2 retries left).
2020-01-13 16:07:45 | FAILED - RETRYING: Wait for puppet host configuration to finish (1 retries left).
2020-01-13 16:07:45 | fatal: [database-1]: FAILED! => {"ansible_job_id": "223897471669.270700", "attempts": 1200, "changed": false, "failed_when_result": true, "finished": 0, "started": 1}

In database-1:/var/log/messages:

Jan 13 16:01:52 database-1 puppet-user[270728]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 341/360
Jan 13 16:01:52 database-1 puppet-user[270728]: Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Jan 13 16:01:52 database-1 puppet-user[270728]: Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'

So "partition with quorum" is never reached; Ansible runs out of retries for the async check, kills the process, and exits with a failure. The last Ansible status check comes an hour later:

Jan 13 16:07:45 database-1 ansible-async_status[353652]: Invoked with jid=223897471669.270700 mode=status _async_dir=/root/.ansible_async

This makes sense, because the cluster was shut down "by sysadmin"; from database-1:/var/log/cluster/corosync.log:

Jan 13 15:00:25 [265749] database-1 corosync notice [CFG ] Node 5 was shut down by sysadmin
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Unloading all Corosync service engines.
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync configuration map access
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync configuration service
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync profile loading service
Jan 13 15:00:25 [265749] database-1 corosync notice [MAIN ] Corosync Cluster Engine exiting normally

This in turn is explained by the following in database-1:/var/log/messages:

Jan 13 15:00:23 database-1 pacemaker-controld[265783]: error: We didn't ask to be shut down, yet our DC is telling us to.
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: State transition S_NOT_DC -> S_STOPPING
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: Stopped 0 recurring operations at shutdown... waiting (3 remaining)
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: Disconnected from the executor
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: Disconnected from Corosync
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: Disconnected from the CIB manager
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: warning: Inhibiting respawn
Jan 13 15:00:23 database-1 pacemakerd[265776]: warning: Shutting cluster down because pacemaker-controld[265783] had fatal failure

Note that on database-0, in messages, database-1 becomes the "new attribute writer" (was messaging-2) and is then lost:

Jan 13 15:00:23 database-0 pacemaker-fenced[255777]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 pacemaker-based[255776]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 corosync[255752]: [QUORUM] Members[8]: 1 2 3 4 5 6 7 9
Jan 13 15:00:23 database-0 corosync[255752]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 13 15:00:23 database-0 pacemakerd[255774]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 pacemaker-controld[255781]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Recorded new attribute writer: database-1 (was messaging-2)
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Recorded new attribute writer: messaging-2 (was messaging-0)
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Node database-1 state is now lost
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Removing all database-1 attributes for peer loss
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Purged 1 peer with id=5 and/or uname=database-1 from the membership cache

Version-Release number of selected component (if applicable):
This is an update from RHOS_TRUNK-16.0-RHEL-8-20200110.n.3 to RHOS_TRUNK-16.0-RHEL-8-20200110.n.3.

How reproducible:
We hit a similar issue in job 38 (same puddle), but that time it was messaging-2 that failed. The job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-composable-ipv6-scale-up/39/

Note: a standard OSP16 deployment with IPv4 does not have this kind of issue.

Side note: curl is emitting warnings:

Jan 13 16:13:27 database-1 podman[358926]: curl: (3) IPv6
Jan 13 16:13:27 database-1 podman[358926]: 000 :0 0.000000 seconds
Jan 13 16:13:27 database-1 podman[358926]: numerical address used in URL without brackets
Jan 13 16:13:27 database-1 podman[358926]: Error: exit status 1

This is because the bind_address used for the check is stored in the hieradata without the enclosing []. It is probably unrelated, but worth a check.
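For reference, the bracket problem that curl complains about can be avoided by wrapping raw IPv6 literals before building the URL. A minimal sketch (the helper name and the address are hypothetical, not taken from the hieradata):

```shell
# Hypothetical helper: wrap a bare IPv6 literal in brackets so that
# curl can parse "host:port" correctly; IPv4 and hostnames pass through.
bracket_host() {
    case "$1" in
        *:*) printf '[%s]\n' "$1" ;;  # contains ':' -> IPv6 literal
        *)   printf '%s\n' "$1" ;;
    esac
}

# Example: building a health-check URL from a bind address.
url="http://$(bracket_host 'fd00:10::5'):3306/"
```

Without the brackets, curl cannot tell the colons of the address apart from the port separator, which is exactly the "numerical address used in URL without brackets" warning above.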
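Stepping back to the main failure: the wait-for-settle Exec seen in the log is essentially a bounded retry loop around `pcs status`. A sketch of that logic, with the status command and the limits passed in as parameters (the function name and defaults are assumptions, not puppet's actual implementation; the 360-try limit matches the "try 341/360" in the log):

```shell
# Sketch of a wait-for-settle style check: poll the cluster status until
# "partition with quorum" appears, or give up after a bounded number of tries.
wait_for_quorum() {
    check_cmd="$1"              # command printing cluster status, e.g. "/sbin/pcs status"
    max_tries="${2:-360}"       # 360 matches the "try 341/360" seen in the log
    interval="${3:-10}"         # seconds between tries (assumed value)
    try=0
    while [ "$try" -lt "$max_tries" ]; do
        if $check_cmd 2>/dev/null | grep -q 'partition with quorum'; then
            return 0            # quorum reached
        fi
        try=$((try + 1))
        sleep "$interval"
    done
    return 1                    # quorum never reached -> puppet (and then ansible) fail
}
```

In the failed run this loop could never succeed: corosync on database-1 had already been told to shut down, so `pcs status` never reports quorum and the caller times out.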
After a debugging session with Michele and Damien (thanks to them both), this looks like a bug in pacemaker triggered by the parallel update style used here. To confirm that, we used a less aggressive update path (one role after the other) and everything went fine [1].

We are now testing whether the parallel-update bug also reproduces *with* fencing enabled, as the pacemaker people are likely to ask about that. Once we have confirmation that the issue persists with fencing, we will raise the BZ against pacemaker so that we can melt some fat off that OSP16 bugzilla list.

[1] using this DNM patch: https://review.opendev.org/702648
We are no longer doing parallel role updates by default in the CI. When the pacemaker fix mentioned above is available in the puddle, we will reactivate it. Closing this one now.