Bug 1822125

Summary: Allow XML IDs of arbitrary length in Pacemaker CIB
Product: Red Hat Enterprise Linux 8
Component: pacemaker
Version: 8.3
Target Release: 8.8
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: high
Target Milestone: rc
Keywords: Triaged
Type: Bug
Reporter: michael.oldham
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: ccaulfie, cfeist, cluster-maint, jfriesse, kgaillot, kostos, msmazova, nwahl, sbradley, tojeline
Bug Blocks: 1824206
Fixed In Version: pacemaker-2.1.5-1.el8
Last Closed: 2023-05-16 08:35:22 UTC
Doc Type: Bug Fix

Doc Text:
Cause: Pacemaker read XML IDs into a fixed-length buffer.
Consequence: XML IDs longer than the buffer size could crash Pacemaker.
Fix: Pacemaker reads XML IDs into dynamically sized memory.
Result: XML IDs of arbitrary length can be used in the Pacemaker CIB.
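As an illustration of the doc text above, here is a minimal C sketch (hypothetical function and buffer names, not the actual Pacemaker source) of the difference between the two behaviours: copying an ID of arbitrary length into a fixed-size buffer can overrun it and corrupt memory, while sizing the allocation to the input cannot.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical "before" behaviour: the ID is copied into a fixed-size
     * stack buffer, so an ID longer than the buffer overruns it and can
     * corrupt adjacent memory (the crash described in this bug). */
    void read_id_fixed(const char *xml_id)
    {
        char buf[1024];                 /* fixed-length buffer */

        strcpy(buf, xml_id);            /* overflows for long IDs */
        printf("parsed id: %s\n", buf);
    }

    /* Hypothetical "after" behaviour: the copy is sized to the input, so
     * IDs of arbitrary length are handled safely. */
    void read_id_dynamic(const char *xml_id)
    {
        char *buf = strdup(xml_id);     /* dynamically sized memory */

        if (buf != NULL) {
            printf("parsed id: %s\n", buf);
            free(buf);
        }
    }

    int main(void)
    {
        /* A short ID is fine either way; a multi-kilobyte ID is only
         * safe with the dynamically sized version. */
        read_id_fixed("short_id");
        read_id_dynamic("pcs_rsc_colocation_set_example");
        return 0;
    }

The real fix lives in Pacemaker's CIB/XML handling (upstream commit a120f342, see the comments below); the sketch only shows the failure mode.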
Description
michael.oldham
2020-04-08 10:22:18 UTC
Created attachment 1677193 [details]
stack back trace
Created attachment 1677195 [details]
corosync.log excerpt
@michael: Thank you for the report. Would you mind also sharing the hot-fix?

Hi Jan,

Yes, sorry - the hot-fix was simply to delete the line at #571:

    cpg_inst->finalize = 1;

I then supplied a new version of libcpg.so (corosync 2.4.3). Do you need the actual library file I sent?

@michael: Thank you for the hot-fix description. No, the actual library file is not needed :) I just wanted to make sure that I understand what was happening. I've spent some time reading and trying to understand the backtrace, but sadly it hasn't helped much.

1. I tried a simple example where cpg_finalize is called in the dispatch callback - just to be sure, but I was pretty positive we solved that problem a long time ago. That doesn't reproduce the bug and behaves correctly because of refcounting.

2. I'm sure that deleting cpg_inst->finalize (no matter how useless that line is) is not really a solution for the problem. If you take a look at the backtrace, there are a few weird things:
- cpg_inst = 0x273d6469405b6665 - points to a weird address; other pointers are much shorter.
- cpg_inst_copy contains corrupted data. This one shouldn't really be corrupted. A few examples: finalize = 1769236852 - should be zero or one, it's never set to such a huge value; model = (CPG_MODEL_V1 | unknown: 23846) - should be just CPG_MODEL_V1; ... But the weirdest one is cpg_totem_confchg_fn = 0x273d6469405b6665, which is exactly equal to cpg_inst. AFAIK cpg_totem_confchg_fn is/was not used by pcmk in RHEL 7.7, so it should be NULL.

3. I checked the refcounting twice. It works as expected: really every function in libcpg gets the handle from handle_db and puts it back as expected, so refcounting works. Also, a get of a freed handle correctly does not work.

4. Studying the pcmk code hasn't helped much either. It is pretty normal for pcmk to call cpg_finalize, but it always logs the problem. There was no such log in the logfile.

All in all, I believe the problem is either somewhere in the pcmk for RHEL 7.7 where memory is overwritten (more specifically stack memory), or it may be some hw problem (corrupted memory, ...).

Ken, do you have any opinion? Does it ring a bell maybe?

Created attachment 1677488 [details]
Corosync log post hot fix
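As a side note on the refcounting described in point 3 above, here is a minimal sketch of a get/put handle pattern (hypothetical names and structures, not the actual corosync/libqb handle database code): each API call takes a reference on the handle before using it and puts it back afterwards, so a handle finalized while a dispatch callback is still running is only destroyed once the last reference is dropped, and a get on an already-freed handle fails cleanly.

    #include <errno.h>
    #include <stdlib.h>

    /* Hypothetical handle object -- not the real corosync/libqb structures. */
    struct handle {
        int refcount;   /* active users, including the creator's reference */
        int freed;      /* set by finalize; destruction waits for refcount 0 */
        void *instance; /* per-connection private data */
    };

    /* Take a reference; a get on a finalized handle fails ("bad handle"). */
    static int handle_get(struct handle *h)
    {
        if (h == NULL || h->freed) {
            return -EBADF;
        }
        h->refcount++;
        return 0;
    }

    /* Drop a reference; only the last user actually destroys the instance. */
    static void handle_put(struct handle *h)
    {
        if (--h->refcount == 0) {
            free(h->instance);
            free(h);
        }
    }

    /* Finalize marks the handle freed but defers destruction while any
     * other caller (e.g. a dispatch invoking callbacks) still holds a
     * reference. */
    static void handle_finalize(struct handle *h)
    {
        h->freed = 1;
        handle_put(h);                  /* drop the creator's reference */
    }

    int main(void)
    {
        struct handle *h = calloc(1, sizeof(*h));

        h->refcount = 1;                /* creator's reference */

        if (handle_get(h) == 0) {       /* e.g. start of a dispatch call */
            handle_finalize(h);         /* finalize during the callback... */
            handle_put(h);              /* ...instance is freed only here */
        }
        return 0;
    }

With a scheme like this in place, copying the callbacks out of the instance before dispatching is arguably no longer needed, which is the point Jan makes about the quoted comment further down.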
Great .. I had typed a lot in here and lost it when I added my attachment .. sigh!

I agree with your assessment - I did not uncover the cause of the memory corruption, I just removed some of the impact. However, given the comment

/*
 * Make copy of callbacks, message data, unlock instance, and call callback
 * A risk of this dispatch method is that the callback routines may
 * operate at the same time that cpgFinalize has been called.
 */

and the lack of any thread locking logic, I realise I do not yet understand the code (other than the line I removed being redundant).

I have added the corosync.log from after the hot fix; look for 'Bad Handle' and you can see that things are still far from happy, but they do appear to be functional, as the cluster creates and starts. Maybe this will help reproduce the problem.

Currently the customer cannot proceed, as we cannot support using the hot-fix. I did look at trying to investigate further, but there is little trace in this code and I suspect we would need thread-enabled trace anyway.

Fyi, we think this is also occurring in another of their clusters. We could get the old cluster to create if they drastically reduced the number of resources.

Created attachment 1677532 [details]
Debug core
@michael: Yup, bugzilla is not super intuitive.

That comment has been there since the very beginning and I'm not entirely convinced that it is really correct. Actually, I'm pretty sure the whole idea is incorrect (thanks to refcounting) and copying of the callbacks is not really needed. I've filed an upstream issue to take a second look at whether that copy is really needed and potentially remove it (https://github.com/corosync/corosync/issues/550), because it is no longer needed. No matter what, pcmk doesn't use multiple threads for cpg calls, so it wouldn't apply anyway.

The "Bad handle" error is a result of `cpg_inst_copy.finalize` being set to a "random" non-zero value, so it is actually not the problem I had in mind. Log messages like "Could not get local node id from the CPG API", "Could not obtain the CPG API connection", ... would be interesting and would mean `cpg_finalize` was called in the dispatch callback. But I don't think that was the case. I really believe the main problem is stack corruption in pcmk code resulting in broken cpg api data structures. Maybe you could try to generate a huge cib update to trigger the bug easily (at least the messages logged right before the "Bad Handle" log messages seem to be the result of quite a big cib change)?

The first unusual thing I noticed is an insanely long XML id in the CIB change just before the core dump:

Jan 30 07:38:39 [5568] sinlxsveshi01 cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_colocation score="INFINITY" id="pcs_rsc_colocation_set_sv.svtest.aa.partition.10_sv.svtest.aa.partition.11_sv.svtest.aa.partition.12_sv.svtest.aa.partition.13_sv.svtest.aa.partition.14_sv.svtest.aa.partition.15_sv.svtest.aa.partition.16_sv.svtest.aa.partition.17_sv.svtest.aa.partition.18_sv.svtest.aa.partition.19_sv.svtest.aa.partition.20_sv.svtest.aa.partition.21_sv.svtest.aa.partition.22_sv

The log message cuts off after a certain number of characters, so there's no telling how long it actually is. I did some tests, and I was able to reproduce the core dump with an id 5000 characters long (1024 was OK; I haven't checked what the exact threshold is). Hopefully I can get a fix in 7.9. The obvious workaround is to keep XML IDs under 1KB.

Thanks guys for the assessment. I had thought it was linked to the number of resources, which is why I asked the customer to drop the number (whereupon it worked). However, as you can see from the excerpt from our trace file below, we do not directly create this XML id. We perform a 'pcs constraint order' and a 'pcs constraint colocation' prior to the 'pcs resource update' which triggers the core dump. The number of resources listed (about 90) is required for a 7 node cluster supporting a number of data partitions per node (about 7). The only things I could think to do would be to reduce the length of the names or attempt to break these commands up, either of which may not work or may not support the required semantics. It would seem that the creation of the id is performed by pacemaker from the list of associated resources. The length of the id would be 2730 bytes, I think.

Is the reference to 7.9 for the Linux version? I'm concerned that the Severity has been altered to medium when I still can't see a viable workaround yet (although I will be looking for one now!)

Regards,
Michael

P.S. Jan, I am relieved that the comment is incorrect .. I couldn't find any sign of thread locking!

Trace excerpt:
-----@@@@----- pcs constraint order set sv.svtest.aa.sv.monitor-clone action=start set sv.svtest.aa.partition.10 sv.svtest.aa.partition.11 sv.svtest.aa.partition.12 sv.svtest.aa.partition.13 sv.svtest.aa.partition.14 sv.svtest.aa.partition.15 sv.svtest.aa.partition.16 sv.svtest.aa.partition.17 sv.svtest.aa.partition.18 sv.svtest.aa.partition.19 sv.svtest.aa.partition.20 sv.svtest.aa.partition.21 sv.svtest.aa.partition.22 sv.svtest.aa.partition.23 sv.svtest.aa.partition.24 sv.svtest.aa.partition.25 sv.svtest.aa.partition.26 sv.svtest.aa.partition.27 sv.svtest.aa.partition.28 sv.svtest.aa.partition.29 sv.svtest.aa.partition.30 sv.svtest.aa.partition.31 sv.svtest.aa.partition.32 sv.svtest.aa.partition.33 sv.svtest.aa.partition.34 sv.svtest.aa.partition.35 sv.svtest.aa.partition.36 sv.svtest.aa.partition.37 sv.svtest.aa.partition.38 sv.svtest.aa.partition.39 sv.svtest.aa.partition.40 sv.svtest.aa.partition.41 sv.svtest.aa.partition.42 sv.svtest.aa.partition.43 sv.svtest.aa.partition.44 sv.svtest.aa.partition.45 sv.svtest.aa.partition.46 sv.svtest.aa.partition.47 sv.svtest.aa.partition.48 sv.svtest.aa.partition.49 sv.svtest.aa.partition.50 sv.svtest.aa.partition.51 sv.svtest.aa.partition.52 sv.svtest.aa.partition.53 sv.svtest.aa.partition.54 sv.svtest.aa.partition.56 sv.svtest.aa.partition.backup.10 sv.svtest.aa.partition.backup.11 sv.svtest.aa.partition.backup.12 sv.svtest.aa.partition.backup.13 sv.svtest.aa.partition.backup.14 sv.svtest.aa.partition.backup.15 sv.svtest.aa.partition.backup.16 sv.svtest.aa.partition.backup.17 sv.svtest.aa.partition.backup.18 sv.svtest.aa.partition.backup.19 sv.svtest.aa.partition.backup.20 sv.svtest.aa.partition.backup.21 sv.svtest.aa.partition.backup.22 sv.svtest.aa.partition.backup.23 sv.svtest.aa.partition.backup.24 sv.svtest.aa.partition.backup.25 sv.svtest.aa.partition.backup.26 sv.svtest.aa.partition.backup.27 sv.svtest.aa.partition.backup.28 sv.svtest.aa.partition.backup.29 sv.svtest.aa.partition.backup.30 sv.svtest.aa.partition.backup.31 sv.svtest.aa.partition.backup.32 sv.svtest.aa.partition.backup.33 sv.svtest.aa.partition.backup.34 sv.svtest.aa.partition.backup.35 sv.svtest.aa.partition.backup.36 sv.svtest.aa.partition.backup.37 sv.svtest.aa.partition.backup.38 sv.svtest.aa.partition.backup.39 sv.svtest.aa.partition.backup.40 sv.svtest.aa.partition.backup.41 sv.svtest.aa.partition.backup.42 sv.svtest.aa.partition.backup.43 sv.svtest.aa.partition.backup.44 sv.svtest.aa.partition.backup.45 sv.svtest.aa.partition.backup.46 sv.svtest.aa.partition.backup.47 sv.svtest.aa.partition.backup.48 sv.svtest.aa.partition.backup.49 sv.svtest.aa.partition.backup.50 sv.svtest.aa.partition.backup.51 sv.svtest.aa.partition.backup.52 sv.svtest.aa.partition.backup.53 sv.svtest.aa.partition.backup.54 sv.svtest.aa.partition.backup.56 sv.svtest.aa.std sequential=false action=start require-all=true setoptions kind=Optional symmetrical=true => Return=0 2020-01-30 07:38:39.205 [18804:sinlxsveshi01] sv_ha_ctl(6850).Main(6724).zClusterCreate(3190).zClusterCreateActiveActive(3256).zCreateResource(3376).CM_ResourceCreate(353).CM_ResourceCreate(1995).zCM_Exec(4739) <Rc=0 Errno=0 2020-01-30 07:38:39.205 [18804:sinlxsveshi01] sv_ha_ctl(6850).Main(6724).zClusterCreate(3190).zClusterCreateActiveActive(3256).zCreateResource(3376).CM_ResourceCreate(353).CM_ResourceCreate(1995).zCM_Exec(4685) > noexec=0 always=0 <pcs constraint colocation set sv.svtest.aa.partition.10 sv.svtest.aa.partition.11 sv.svtest.aa.partition.12 sv.svtest.aa.partition.13 sv.svtest.aa.partition.14 
sv.svtest.aa.partition.15 sv.svtest.aa.partition.16 sv.svtest.aa.partition.17 sv.svtest.aa.partition.18 sv.svtest.aa.partition.19 sv.svtest.aa.partition.20 sv.svtest.aa.partition.21 sv.svtest.aa.partition.22 sv.svtest.aa.partition.23 sv.svtest.aa.partition.24 sv.svtest.aa.partition.25 sv.svtest.aa.partition.26 sv.svtest.aa.partition.27 sv.svtest.aa.partition.28 sv.svtest.aa.partition.29 sv.svtest.aa.partition.30 sv.svtest.aa.partition.31 sv.svtest.aa.partition.32 sv.svtest.aa.partition.33 sv.svtest.aa.partition.34 sv.svtest.aa.partition.35 sv.svtest.aa.partition.36 sv.svtest.aa.partition.37 sv.svtest.aa.partition.38 sv.svtest.aa.partition.39 sv.svtest.aa.partition.40 sv.svtest.aa.partition.41 sv.svtest.aa.partition.42 sv.svtest.aa.partition.43 sv.svtest.aa.partition.44 sv.svtest.aa.partition.45 sv.svtest.aa.partition.46 sv.svtest.aa.partition.47 sv.svtest.aa.partition.48 sv.svtest.aa.partition.49 sv.svtest.aa.partition.50 sv.svtest.aa.partition.51 sv.svtest.aa.partition.52 sv.svtest.aa.partition.53 sv.svtest.aa.partition.54 sv.svtest.aa.partition.56 sv.svtest.aa.partition.backup.10 sv.svtest.aa.partition.backup.11 sv.svtest.aa.partition.backup.12 sv.svtest.aa.partition.backup.13 sv.svtest.aa.partition.backup.14 sv.svtest.aa.partition.backup.15 sv.svtest.aa.partition.backup.16 sv.svtest.aa.partition.backup.17 sv.svtest.aa.partition.backup.18 sv.svtest.aa.partition.backup.19 sv.svtest.aa.partition.backup.20 sv.svtest.aa.partition.backup.21 sv.svtest.aa.partition.backup.22 sv.svtest.aa.partition.backup.23 sv.svtest.aa.partition.backup.24 sv.svtest.aa.partition.backup.25 sv.svtest.aa.partition.backup.26 sv.svtest.aa.partition.backup.27 sv.svtest.aa.partition.backup.28 sv.svtest.aa.partition.backup.29 sv.svtest.aa.partition.backup.30 sv.svtest.aa.partition.backup.31 sv.svtest.aa.partition.backup.32 sv.svtest.aa.partition.backup.33 sv.svtest.aa.partition.backup.34 sv.svtest.aa.partition.backup.35 sv.svtest.aa.partition.backup.36 sv.svtest.aa.partition.backup.37 sv.svtest.aa.partition.backup.38 sv.svtest.aa.partition.backup.39 sv.svtest.aa.partition.backup.40 sv.svtest.aa.partition.backup.41 sv.svtest.aa.partition.backup.42 sv.svtest.aa.partition.backup.43 sv.svtest.aa.partition.backup.44 sv.svtest.aa.partition.backup.45 sv.svtest.aa.partition.backup.46 sv.svtest.aa.partition.backup.47 sv.svtest.aa.partition.backup.48 sv.svtest.aa.partition.backup.49 sv.svtest.aa.partition.backup.50 sv.svtest.aa.partition.backup.51 sv.svtest.aa.partition.backup.52 sv.svtest.aa.partition.backup.53 sv.svtest.aa.partition.backup.54 sv.svtest.aa.partition.backup.56 sv.svtest.aa.std sequential=false set sv.svtest.aa.sv.monitor-clone setoptions score=INFINITY> Errno=0 2020-01-30 07:38:39.205 [18804:sinlxsveshi01] CM_Pacemaker sv_ha_ctl(6850).Main(6724).zClusterCreate(3190).zClusterCreateActiveActive(3256).zCreateResource(3376).CM_ResourceCreate(353).CM_ResourceCreate(1995).zCM_Exec(4683)> 2020-01-30 07:38:39.764 [18804:sinlxsveshi01] CM_Pacemaker sv_ha_ctl(6850).Main(6724).zClusterCreate(3190).zClusterCreateActiveActive(3256).zCreateResource(3376).CM_ResourceCreate(353).CM_ResourceCreate(1995).zCM_Exec(4683)< pcs constraint colocation set sv.svtest.aa.partition.10 sv.svtest.aa.partition.11 sv.svtest.aa.partition.12 sv.svtest.aa.partition.13 sv.svtest.aa.partition.14 sv.svtest.aa.partition.15 sv.svtest.aa.partition.16 sv.svtest.aa.partition.17 sv.svtest.aa.partition.18 sv.svtest.aa.partition.19 sv.svtest.aa.partition.20 sv.svtest.aa.partition.21 sv.svtest.aa.partition.22 sv.svtest.aa.partition.23 
sv.svtest.aa.partition.24 sv.svtest.aa.partition.25 sv.svtest.aa.partition.26 sv.svtest.aa.partition.27 sv.svtest.aa.partition.28 sv.svtest.aa.partition.29 sv.svtest.aa.partition.30 sv.svtest.aa.partition.31 sv.svtest.aa.partition.32 sv.svtest.aa.partition.33 sv.svtest.aa.partition.34 sv.svtest.aa.partition.35 sv.svtest.aa.partition.36 sv.svtest.aa.partition.37 sv.svtest.aa.partition.38 sv.svtest.aa.partition.39 sv.svtest.aa.partition.40 sv.svtest.aa.partition.41 sv.svtest.aa.partition.42 sv.svtest.aa.partition.43 sv.svtest.aa.partition.44 sv.svtest.aa.partition.45 sv.svtest.aa.partition.46 sv.svtest.aa.partition.47 sv.svtest.aa.partition.48 sv.svtest.aa.partition.49 sv.svtest.aa.partition.50 sv.svtest.aa.partition.51 sv.svtest.aa.partition.52 sv.svtest.aa.partition.53 sv.svtest.aa.partition.54 sv.svtest.aa.partition.56 sv.svtest.aa.partition.backup.10 sv.svtest.aa.partition.backup.11 sv.svtest.aa.partition.backup.12 sv.svtest.aa.partition.backup.13 sv.svtest.aa.partition.backup.14 sv.svtest.aa.partition.backup.15 sv.svtest.aa.partition.backup.16 sv.svtest.aa.partition.backup.17 sv.svtest.aa.partition.backup.18 sv.svtest.aa.partition.backup.19 sv.svtest.aa.partition.backup.20 sv.svtest.aa.partition.backup.21 sv.svtest.aa.partition.backup.22 sv.svtest.aa.partition.backup.23 sv.svtest.aa.partition.backup.24 sv.svtest.aa.partition.backup.25 sv.svtest.aa.partition.backup.26 sv.svtest.aa.partition.backup.27 sv.svtest.aa.partition.backup.28 sv.svtest.aa.partition.backup.29 sv.svtest.aa.partition.backup.30 sv.svtest.aa.partition.backup.31 sv.svtest.aa.partition.backup.32 sv.svtest.aa.partition.backup.33 sv.svtest.aa.partition.backup.34 sv.svtest.aa.partition.backup.35 sv.svtest.aa.partition.backup.36 sv.svtest.aa.partition.backup.37 sv.svtest.aa.partition.backup.38 sv.svtest.aa.partition.backup.39 sv.svtest.aa.partition.backup.40 sv.svtest.aa.partition.backup.41 sv.svtest.aa.partition.backup.42 sv.svtest.aa.partition.backup.43 sv.svtest.aa.partition.backup.44 sv.svtest.aa.partition.backup.45 sv.svtest.aa.partition.backup.46 sv.svtest.aa.partition.backup.47 sv.svtest.aa.partition.backup.48 sv.svtest.aa.partition.backup.49 sv.svtest.aa.partition.backup.50 sv.svtest.aa.partition.backup.51 sv.svtest.aa.partition.backup.52 sv.svtest.aa.partition.backup.53 sv.svtest.aa.partition.backup.54 sv.svtest.aa.partition.backup.56 sv.svtest.aa.std sequential=false set sv.svtest.aa.sv.monitor-clone setoptions score=INFINITY => Return=0 2020-01-30 07:38:39.765 [18804:sinlxsveshi01] sv_ha_ctl(6850).Main(6724).zClusterCreate(3190).zClusterCreateActiveActive(3256).zCreateResource(3376).CM_ResourceCreate(353).CM_ResourceCreate(1995).zCM_Exec(4739) <Rc=0 Errno=0 2020-01-30 07:38:39.765 [18804:sinlxsveshi01] sv_ha_ctl(6850).Main(6724).zClusterCreate(3190).zClusterCreateActiveActive(3256).zCreateResource(3376).CM_ResourceCreate(353).CM_ResourceCreate(1995).zCM_Exec(4685) > noexec=0 always=0 <pcs resource update sv.svtest.aa.sv.monitor sv_ha_cluster_managed=0 --wait=30 --force> Errno=0 2020-01-30 07:38:39.765 [18804:sinlxsveshi01] CM_Pacemaker sv_ha_ctl(6850).Main(6724).zClusterCreate(3190).zClusterCreateActiveActive(3256).zCreateResource(3376).CM_ResourceCreate(353).CM_ResourceCreate(1995).zCM_Exec(4683)> 2020-01-30 07:38:45.667 [18804:sinlxsveshi01] CM_Pacemaker sv_ha_ctl(6850).Main(6724).zClusterCreate(3190).zClusterCreateActiveActive(3256).zCreateResource(3376).CM_ResourceCreate(353).CM_ResourceCreate(1995).zCM_Exec(4683)< pcs resource update sv.svtest.aa.sv.monitor sv_ha_cluster_managed=0 --wait=30 --force => 
Return=1 Warning: Agent 'ocf:csg:sv' is not installed or does not provide valid metadata: Metadata query for ocf:csg:sv failed: Input/output error -----@@@@@@@-------

@michael: Yup, corosync APIs are really not thread safe (at least they are not declared as being so), even though there actually is thread locking - the refcounting also acts as a lock. And messages sent/received by QB IPC should be atomic, so in theory it should work fine... but I wouldn't try it anyway.

Hi Michael,

I've cloned this BZ for the pcs component, which is what creates the long ID. The number of resources is fine -- the cluster should be able to handle 100s or even 1000s of resources. By 7.9, I mean the fix is targeted to be released in RHEL 7.9 (7.8 was just released).

As a workaround, you could use pcs to create the configuration in a separate file, then edit that file to change the IDs before uploading it to the cluster. With this method, you first save the current configuration to a file:

    pcs cluster cib <filename>

Then you can run all the configuration commands you normally would, adding "-f <filename>" at the start (like "pcs -f <filename> constraint order ..."). At this point, you can manually edit the file to use short IDs. Finally, upload the configuration to the cluster with:

    pcs cluster cib-push <filename> --config

Alternatively, you could create the constraint XML yourself instead of with pcs, and use "pcs cluster edit" or "cibadmin" to add it to the cluster. You can look at your existing CIB XML with "pcs cluster cib" to see what the XML syntax for constraints is (not very user-friendly). With "pcs cluster edit", you see the raw XML of the CIB and can edit it manually. If you want to automate it rather than edit it manually, you can create the constraint XML in a file (starting with the constraints element and putting your new constraint as its child), then run "cibadmin --create --scope constraints -x <filename>".

Due to a fix in pcs to not create long IDs, and a workaround of shortening the IDs being available, handling such IDs in pacemaker is now a lower priority and will be considered for RHEL 8 only.

Fixed upstream by commit a120f342.

Verified as SanityOnly in pacemaker-2.1.5-4.el8.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2818