Description of problem:

During a parallel update of a composable deployment running IPv6, the update of database-1 failed during the configuration steps.

From undercloud-0:/home/stack/overcloud_update_run_Database.log:

2020-01-13 15:02:08 | TASK [Wait for puppet host configuration to finish] ...
2020-01-13 16:07:45 | FAILED - RETRYING: Wait for puppet host configuration to finish (4 retries left).
2020-01-13 16:07:45 | FAILED - RETRYING: Wait for puppet host configuration to finish (3 retries left).
2020-01-13 16:07:45 | FAILED - RETRYING: Wait for puppet host configuration to finish (2 retries left).
2020-01-13 16:07:45 | FAILED - RETRYING: Wait for puppet host configuration to finish (1 retries left).
2020-01-13 16:07:45 | fatal: [database-1]: FAILED! => {"ansible_job_id": "223897471669.270700", "attempts": 1200, "changed": false, "failed_when_result": true, "finished": 0, "started": 1}

In database-1:/var/log/messages:

Jan 13 16:01:52 database-1 puppet-user[270728]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 341/360
Jan 13 16:01:52 database-1 puppet-user[270728]: Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Jan 13 16:01:52 database-1 puppet-user[270728]: Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'

So "partition with quorum" is never reached; Ansible runs out of retries for the async check, kills the process, and exits with a failure. The last Ansible status check comes an hour later:

Jan 13 16:07:45 database-1 ansible-async_status[353652]: Invoked with jid=223897471669.270700 mode=status _async_dir=/root/.ansible_async

This makes sense, because the cluster was shut down "by sysadmin"; from database-1:/var/log/cluster/corosync.log:

Jan 13 15:00:25 [265749] database-1 corosync notice [CFG ] Node 5 was shut down by sysadmin
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Unloading all Corosync service engines.
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync configuration map access
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync configuration service
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan 13 15:00:25 [265749] database-1 corosync info [QB ] withdrawing server sockets
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Jan 13 15:00:25 [265749] database-1 corosync notice [SERV ] Service engine unloaded: corosync profile loading service
Jan 13 15:00:25 [265749] database-1 corosync notice [MAIN ] Corosync Cluster Engine exiting normally

This in turn is explained by the following in database-1:/var/log/messages:

Jan 13 15:00:23 database-1 pacemaker-controld[265783]: error: We didn't ask to be shut down, yet our DC is telling us to.
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: State transition S_NOT_DC -> S_STOPPING
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: Stopped 0 recurring operations at shutdown... waiting (3 remaining)
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: Disconnected from the executor
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: Disconnected from Corosync
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: notice: Disconnected from the CIB manager
Jan 13 15:00:23 database-1 pacemaker-controld[265783]: warning: Inhibiting respawn
Jan 13 15:00:23 database-1 pacemakerd[265776]: warning: Shutting cluster down because pacemaker-controld[265783] had fatal failure

Note that on database-0, in messages, database-1 becomes the "new attribute writer" (was messaging-2) and is then lost:

Jan 13 15:00:23 database-0 pacemaker-fenced[255777]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 pacemaker-based[255776]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 corosync[255752]: [QUORUM] Members[8]: 1 2 3 4 5 6 7 9
Jan 13 15:00:23 database-0 corosync[255752]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 13 15:00:23 database-0 pacemakerd[255774]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 pacemaker-controld[255781]: notice: Node database-1 state is now member
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Recorded new attribute writer: database-1 (was messaging-2)
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Recorded new attribute writer: messaging-2 (was messaging-0)
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Node database-1 state is now lost
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Removing all database-1 attributes for peer loss
Jan 13 15:00:23 database-0 pacemaker-attrd[255779]: notice: Purged 1 peer with id=5 and/or uname=database-1 from the membership cache

Version-Release number of selected component (if applicable):
This is an update from RHOS_TRUNK-16.0-RHEL-8-20200110.n.3 to RHOS_TRUNK-16.0-RHEL-8-20200110.n.3.

How reproducible:
We hit a similar issue in job 38 (same puddle), but that time it was messaging-2 that failed. The job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-composable-ipv6-scale-up/39/

Note: a standard OSP16 deployment with IPv4 does not have this kind of issue.

Side note: curl is emitting warnings:

Jan 13 16:13:27 database-1 podman[358926]: curl: (3) IPv6
Jan 13 16:13:27 database-1 podman[358926]: 000 :0 0.000000 seconds
Jan 13 16:13:27 database-1 podman[358926]: numerical address used in URL without brackets
Jan 13 16:13:27 database-1 podman[358926]: Error: exit status 1

This is because the bind_address used for the check is stored in the hieradata without the enclosing []. It is probably unrelated, but worth a check.
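For reference, the bracket problem that curl complains about can be avoided by wrapping raw IPv6 literals before building the URL. A minimal sketch (the helper name and the address are hypothetical, not taken from the hieradata):

```shell
# Hypothetical helper: wrap a bare IPv6 literal in brackets so that
# curl can parse "host:port" correctly; IPv4 and hostnames pass through.
bracket_host() {
    case "$1" in
        *:*) printf '[%s]\n' "$1" ;;  # contains ':' -> IPv6 literal
        *)   printf '%s\n' "$1" ;;
    esac
}

# Example: building a health-check URL from a bind address.
url="http://$(bracket_host 'fd00:10::5'):3306/"
```

Without the brackets, curl cannot tell the colons of the address apart from the port separator, which is exactly the "numerical address used in URL without brackets" warning above.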
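Stepping back to the main failure: the wait-for-settle Exec seen in the log is essentially a bounded retry loop around `pcs status`. A sketch of that logic, with the status command and the limits passed in as parameters (the function name and defaults are assumptions, not puppet's actual implementation; the 360-try limit matches the "try 341/360" in the log):

```shell
# Sketch of a wait-for-settle style check: poll the cluster status until
# "partition with quorum" appears, or give up after a bounded number of tries.
wait_for_quorum() {
    check_cmd="$1"              # command printing cluster status, e.g. "/sbin/pcs status"
    max_tries="${2:-360}"       # 360 matches the "try 341/360" seen in the log
    interval="${3:-10}"         # seconds between tries (assumed value)
    try=0
    while [ "$try" -lt "$max_tries" ]; do
        if $check_cmd 2>/dev/null | grep -q 'partition with quorum'; then
            return 0            # quorum reached
        fi
        try=$((try + 1))
        sleep "$interval"
    done
    return 1                    # quorum never reached -> puppet (and then ansible) fail
}
```

In the failed run this loop could never succeed: corosync on database-1 had already been told to shut down, so `pcs status` never reports quorum and the caller times out.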
After a debugging session with Michele and Damien (thanks to them both), this looks like a bug in pacemaker triggered by the parallel update style used here. To confirm that, we used a less aggressive update path (one role after the other) and everything went fine [1].

We are now testing whether the parallel-update bug also reproduces *with* fencing enabled, as the pacemaker people are likely to ask about that. Once we have confirmation that the issue persists with fencing, we will raise the BZ against pacemaker so that we can melt some fat off that OSP16 bugzilla list.

[1] using this DNM patch: https://review.opendev.org/702648
We are no longer doing parallel role updates by default in the CI. When the pacemaker fix mentioned above is available in the puddle, we will reactivate it. Closing this one now.