Bug 2148997
| Summary: | Problem with a move of the LVM-activate resource with partial_activation='True' | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Simon Foucek <sfoucek> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | ASSIGNED --- | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.7 | CC: | agk, cfeist, cluster-maint, fdinitto, teigland |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2151203 (view as bug list) | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2151203 | | |
Description
Simon Foucek
2022-11-28 13:13:38 UTC
If you add --full to debug-start you should be able to see exactly where it fails.

'debug-start --full raidvg' exits with 0 too; there are no error logs inside:

```
[root@virt-280 ~]# pcs resource debug-start --full raidvg
(unpack_rsc_op_failure) warning: Unexpected result (error: raidvg: failed to activate.) was recorded for start of raidvg on virt-281 at Nov 28 15:14:47 2022 | rc=1 id=raidvg_last_failure_0
(log_xmllib_err) error: XML Error: Entity: line 1: parser error : Start tag expected, '<' not found
(log_xmllib_err) error: XML Error: pvscan[280336] PV /dev/sda online.
(log_xmllib_err) error: XML Error: ^
(string2xml) warning: Parsing failed (domain=1, level=3, code=4): Start tag expected, '<' not found
...
```

Ok, I found this error message inside.
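For anyone retracing this, a minimal sketch of how that information is gathered, assuming the resource name raidvg used in this report; debug-start is the pcs wrapper around the crm_resource --force-start call that appears later in this thread:

```
# Check cluster status and recorded failures for the resource.
pcs status --full
pcs resource failcount show raidvg

# Re-run the start action locally with full tracing from the agent.
pcs resource debug-start --full raidvg

# Roughly the same call at the crm_resource level, with verbose output.
crm_resource -r raidvg --force-start -V
```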
(In reply to Simon Foucek from comment #3)
> [root@virt-280 ~]# pcs resource debug-start --full raidvg
> (unpack_rsc_op_failure) warning: Unexpected result (error: raidvg: failed to activate.) was recorded for start of raidvg on virt-281 at Nov 28 15:14:47 2022 | rc=1 id=raidvg_last_failure_0
> (log_xmllib_err) error: XML Error: Entity: line 1: parser error : Start tag expected, '<' not found
> (log_xmllib_err) error: XML Error: pvscan[280336] PV /dev/sda online.
> (log_xmllib_err) error: XML Error: ^
> (string2xml) warning: Parsing failed (domain=1, level=3, code=4): Start tag expected, '<' not found
> ...
> Ok, I found this error message inside.

I find it strange that if I put a specific PV into the offline state on both nodes, the resource has no problem running/starting on the current node. When I move it, it generates this error and does not start. If I then remove the constraint and call the start action, it starts fine on the current node again.

Can you try running it with pcs --debug? That should show you which commands pcs runs and its output.
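Something along these lines should follow that suggestion (a sketch; the exact invocation is an assumption, while the resource and node names are the ones from this report):

```
# --debug makes pcs print each external command it runs and that command's output.
pcs --debug resource debug-start --full raidvg

# The same global flag works for the move that triggers the failure.
pcs --debug resource move raidvg virt-280
```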
(In reply to Oyvind Albrigtsen from comment #5)
> Can you try running it with pcs --debug?
>
> That should show you which commands pcs runs and its output.

Here is the end of the output, with the errors:

```
Running: /usr/sbin/crm_resource -r raidvg --force-start
Return Value: 1
--Debug Output Start--
crm_resource: Error performing operation: Error occurred
Operation force-start for raidvg (ocf:heartbeat:LVM-activate) returned 1 (error: raidvg: failed to activate.)
pvscan[292413] PV /dev/vda2 online.
pvscan[292413] PV /dev/sdb ignore foreign VG.
pvscan[292413] PV /dev/sdc ignore foreign VG.
pvscan[292413] PV /dev/sdd ignore foreign VG.
pvscan[292413] PV /dev/se ignore foreign VG.
pvscan[292413] PV /dev/sdf ignore foreign VG.
  active
Cannot access VG raidvg with system ID virt-281 with local system ID virt-280.
Nov 28 16:14:42 INFO: Activating raidvg
WARNING: VG raidvg is missing PV dedOMV-6V4n-pw2e-mkoV-0pBN-CP3s-GzyDXx (last written to /dev/sda).
Cannot change VG raidvg while PVs are missing.
See vgreduce --removemissing and vgextend --restoremissing.
Cannot process volume group raidvg
Cannot access VG raidvg with system ID virt-281 with local system ID virt-280.
Nov 28 16:14:42 ERROR: PARTIAL MODE. Incomplete logical volumes will be processed.
Cannot access VG raidvg with system ID virt-281 with local system ID virt-280.
ocf-exit-reason:raidvg: failed to activate.
--Debug Output End--
```

It seems that the problem is denial of access because of the wrong system_id, but the resource should cover this by itself, I think. Also, partial_activation='True', so there shouldn't be a problem with missing PVs and the resource should move successfully.

Tried the same scenario with lvmlockd; it worked properly. The error is present only with the system_id option.

These issues are described here: https://bugzilla.redhat.com/show_bug.cgi?id=2066156#c2

There were two changes I mentioned we could make in that bugzilla:

1. reverting an LVM-activate commit related to partial activation that I believe is incorrect: https://github.com/ClusterLabs/resource-agents/commit/30b800156921c9f1524faef07185221a944358f7
2. adding a new vgchange option to enable modifying an incomplete VG so the system ID can be changed (the cause of the error reported above).

Neither change has been made. I implemented 2, but we have not pushed it out as a feature until there is a clear need for it (the cluster group could weigh in on that). A devel branch with the feature is https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=03653a0a570e94222eda49149e260cf1a8358bd1

The larger issue would be the testing required to support lvm raid with cluster failover. Bug 2066156 was resolved because the user switched to using md raid instead of lvm raid. This is what we would always recommend, even if the two issues above were resolved. Using lvmlockd is also a valid solution, but it involves added complexity, so in many cases system ID based failover will be preferred.

Thank you for your response! I have these questions:

1. Is there a reason to keep the partial_activation option if it doesn't work correctly? If I imagine a user situation with an N-node cluster where some physical volume fails, so all nodes see the VG/LV as partial, and then the node running my resource fails, the resource fails to move. This seems to be an essential and crucial feature of the HA cluster, and it's broken right now.
2. If we implement the second option from the devel branch, can you provide a little more detail about the issue with lvm raid and cluster failover testing that will occur?

(In reply to Simon Foucek from comment #10)
> 1. Is there a reason to keep the partial_activation option if it doesn't work correctly?

As explained in the other bug, I think it should be removed because it doesn't do what I think you want.

> If I imagine a user situation with an N-node cluster where some physical
> volume fails, so all nodes see the VG/LV as partial, and then the node
> running my resource fails, the resource fails to move. This seems to be an
> essential and crucial feature of the HA cluster, and it's broken right now.

It's clearly useful, and it works right now if you use md raid under the VG. That's what most users do, and that's what we recommend if you want to use software raid. It's possible that there's a good reason the user needs to use lvm raid instead of md raid, and in that case we'd be interested to hear more about that. In that case, we can look at adding the devel patch that would permit failing over VGs with missing PVs.

> 2. If we implement the second option from the devel branch, can you provide
> a little more detail about the issue with lvm raid and cluster failover
> testing that will occur?

QE will need to write and run tests for failover under those conditions, which I suspect they don't have.
Those scenarios get complicated when you consider all the combinations of: number of failed devices, which specific devices fail, which LVs are using the failed devices, and whether the raid level for those LVs can tolerate that number of failed devices. Some LVs in the VG may be able to tolerate the missing devices, and other LVs may not. All of these complications are avoided if you just use md raid under the VG, and that's why I suspect we would prefer to state that the only way we handle raid in an HA setup is via md.
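To make that recommendation concrete, here is a sketch of the layout being suggested: md raid under the VG with system_id failover. The device paths, md array name, and LV name are placeholders, and lvm.conf system-ID setup, fencing, and any filesystem resources are left out:

```
# Mirror the disks with md instead of lvm raid (placeholder devices).
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# Build the VG on top of the md device; a failed leg then degrades the md
# array, but the VG never has a missing PV from LVM's point of view.
pvcreate /dev/md0
vgcreate raidvg /dev/md0
lvcreate -l 100%FREE -n raidlv raidvg

# Manage activation with LVM-activate in system_id mode, the access mode
# discussed in this bug (resource name matches the report).
pcs resource create raidvg ocf:heartbeat:LVM-activate \
    vgname=raidvg vg_access_mode=system_id
```

With this layout a single failed disk leaves the VG complete, so the system-ID handover performed by LVM-activate on failover is not blocked by the "Cannot change VG raidvg while PVs are missing" error shown above.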