Bug 2096267 - HostedEngine .shard file size=0 in all nodes [NEEDINFO]
Summary: HostedEngine .shard file size=0 in all nodes
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: ---
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Gobinda Das
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-13 12:06 UTC by Corrado Zabeo
Modified: 2023-01-16 10:13 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-16 10:13:36 UTC
oVirt Team: Gluster
Embargoed:
godas: needinfo? (vharihar)


Attachments: none


Links
GitHub oVirt ovirt-hosted-engine-setup issue 73 (open): Bug 2096267 - HostedEngine .shard file size=0 in all nodes (last updated 2023-01-16 10:13:35 UTC)
Red Hat Issue Tracker RHV-46398 (last updated 2022-06-13 12:15:31 UTC)

Description Corrado Zabeo 2022-06-13 12:06:08 UTC
Description of problem:
Hi team,
my configuration is as follows: three Gluster replica servers containing the VM LVM storage and the HostedEngine VM (version 3.8.10), plus one server running all the VMs.
Due to a prolonged nighttime power failure that exhausted the UPS batteries, the system shut down. After it came back up, the logs showed two boots within three minutes, so I assume the power was cut several times.
The HostedEngine VM was paused once all services had fully restarted.
I checked the situation with "gluster volume heal engine info": all three connected nodes reported /.shard in split-brain (15 files affected), and all of those files had size 0 on node1.
I recovered 14 files by aligning their gfids from the replicas, but one file has size 0 on all nodes, so the split-brain remains active.
I would like to know how I can fix this and recreate that shard with the correct size.
Thanks in advance
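
For reference, a minimal sketch of the checks described above. The volume name "engine", the brick path /bricks/engine/brick, and the shard filename are taken from this report; adjust them for other deployments:

# On any node: list entries pending heal and confirmed split-brain entries
gluster volume heal engine info
gluster volume heal engine info split-brain

# On EACH node: compare size and xattrs of a suspect shard across the bricks
SHARD=/bricks/engine/brick/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7013
stat -c '%n: %s bytes' "$SHARD"
getfattr -d -m . -e hex "$SHARD"   # check trusted.gfid and the trusted.afr.* changelogs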

Version-Release number of selected component (if applicable): 3.8.10


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 RHEL Program Management 2022-06-16 11:09:14 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Gobinda Das 2022-06-16 13:59:30 UTC
@Vinayak Can you please help?

Comment 3 Corrado Zabeo 2022-06-30 10:58:32 UTC
(In reply to Corrado Zabeo from comment #0)
> [full problem description snipped; see comment #0]

Hi, sorry for not replying earlier.
I resolved the split-brain in the following way (a sketch of the commands follows these steps):
1 - I identified the zero-size shard files with "gluster volume heal engine info" and compared their extended attributes with "getfattr -d -m . -e hex", in my case under /bricks/engine/brick/.shard
2 - I deleted the zero-size shard files in the .shard folder and the corresponding hard links in the .glusterfs folder
3 - the shard files were automatically recreated by self-heal
4 - one last problem remained: shard 7013 had size 0 on all nodes, so I deleted that file and its links on all 3 nodes; they were recreated automatically and the split-brain disappeared
The HostedEngine operating system restarted correctly, so fortunately that shard really was empty.
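
A minimal sketch of the per-node cleanup described in steps 2 and 4, assuming the paths from this report and that the shard has already been confirmed safe to discard. The .glusterfs path shown is a placeholder: the first two directory levels come from the first two bytes of the shard's actual gfid, which step 1 prints:

BRICK=/bricks/engine/brick
SHARD="$BRICK/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7013"

# 1. Read the shard's gfid; e.g. 0xf00dbeef... maps to .glusterfs/f0/0d/f00dbeef-...
getfattr -n trusted.gfid -e hex "$SHARD"

# 2. Remove the zero-size shard and its gfid hard link (placeholder path,
#    substitute the real gfid printed in step 1)
rm -f "$SHARD"
rm -f "$BRICK/.glusterfs/f0/0d/f00dbeef-0000-0000-0000-000000000000"

# 3. Trigger self-heal so the file is recreated (from the good copies, or
#    empty on next access when all copies were removed, as happened here)
gluster volume heal engine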
However, I don't understand why this happened in the first place.
Below is the "gluster volume heal engine info" output.
Regards


[root@vmgluster01 zones]# gluster volume heal engine info
Brick 192.170.254.3:/bricks/engine/brick
/.shard - Is in split-brain

/__DIRECT_IO_TEST__ 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7013 
Status: Connected
Number of entries: 3

Brick 192.170.254.4:/bricks/engine/brick
/.shard - Is in split-brain

/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7015 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7016 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7017 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7018 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7019 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7020 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7021 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7022 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7024 
/__DIRECT_IO_TEST__ 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7013 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7023 
Status: Connected
Number of entries: 13

Brick 192.170.254.6:/bricks/engine/brick
/.shard - Is in split-brain

/__DIRECT_IO_TEST__ 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7013 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7015 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7016 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7017 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7018 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7019 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7020 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7021 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7022 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7023 
/.shard/6a48d9f7-8aaa-4763-84ef-98adee5781d9.7024 
Status: Connected
Number of entries: 13

Comment 5 Sandro Bonazzola 2023-01-16 10:13:36 UTC
Moved to GitHub: https://github.com/oVirt/ovirt-hosted-engine-setup/issues/73

