Description of problem:
-----------------------
Hosts in the RHHI setup have two network interfaces, one for VM traffic and the other for gluster traffic. Once the HE VM is up, a gluster logical network is created with the migration traffic and gluster roles. When this logical network is attached to the host network interface dedicated to gluster traffic, the HE VM goes into a paused state.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHV 4.2.7 nightly build
RHHI 2.0
RHGS 3.4.1 nightly build

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. As part of the RHHI deployment, complete the gluster deployment and the HE deployment.
2. Once the HE VM is up, create a logical network with the VM migration and gluster network roles.
3. Attach this newly created logical network to the network interface which is connected to the network dedicated to gluster traffic (10Gbps), as sketched below.

Actual results:
---------------
The HE VM goes into a paused state.

Expected results:
-----------------
The HE VM should not be paused while attaching the gluster network to the host.
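For reference, a minimal sketch of how step 3 could be driven through the oVirt Python SDK (ovirt-engine-sdk4). This is not our actual tooling: the engine URL, credentials, host name, NIC name and network name are placeholders, and the logical network is assumed to already carry the gluster and migration roles.

    # Minimal sketch: attach an existing logical network (assumed to already
    # have the gluster + migration roles) to the dedicated gluster NIC of a
    # host, using DHCP. All names, URLs and credentials are placeholders.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='password',
        ca_file='ca.pem',
    )

    hosts_service = connection.system_service().hosts_service()
    host = hosts_service.list(search='name=rhhi-host1')[0]
    host_service = hosts_service.host_service(host.id)

    # Attach the gluster logical network to the NIC reserved for gluster traffic.
    host_service.setup_networks(
        modified_network_attachments=[
            types.NetworkAttachment(
                network=types.Network(name='glusternw'),
                host_nic=types.HostNic(name='ens2f0'),
                ip_address_assignments=[
                    types.IpAddressAssignment(
                        assignment_method=types.BootProtocol.DHCP,
                    ),
                ],
            ),
        ],
    )

    # Persist the network configuration on the host.
    host_service.commit_net_config()
    connection.close()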
A similar issue was seen while running the HE deployment on a single node. While attaching the gluster network, the HE VM went into a paused state and the gluster IP seemed to vanish. Restarting the network brought the gluster IP back. I will attach the vdsm logs and sosreport.
Moving to the network team to take a look.
bipin, could you help me with the logs? I cannot see the setupNetworks command in supervdsmd.log. What is the name of your gluster network? On which interface do you place it, and with which IP? Why do you think it is a regression in networking? With which Engine/Vdsm version did it last work?
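In case it helps to narrow down the time window, a minimal sketch (assuming the default vdsm log path on the host) of how I would pull the setupNetworks calls out of the log:

    # Print the supervdsm log lines that mention the setupNetworks verb, so the
    # timestamp of the network change can be compared with the VM pause event.
    # The path below is the default vdsm location and may differ in a sosreport.
    LOG = '/var/log/vdsm/supervdsm.log'

    with open(LOG, errors='replace') as log:
        for line in log:
            if 'setupNetworks' in line:
                print(line.rstrip())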
Satheesaran, can you point me to the procedure to install RHHI 2.0? I can only find the doc for RHHI 1.1 at https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ . Is the logical network for gluster created before or after the master storage domain is created?
(In reply to Dominik Holler from comment #5)
> Satheesaran, can you point me to the procedure to install RHHI 2.0?
> I can only find the doc for RHHI 1.1 at
> https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ .
> Is the logical network for gluster created before or after the master
> storage domain is created?

The HE domain is the master storage domain with the RHV 4.2 HE installation. The logical network for gluster is set up after the installation (which sets up the storage domains and the additional hosts).
(In reply to Sahina Bose from comment #6)
> (In reply to Dominik Holler from comment #5)
> > Satheesaran, can you point me to the procedure to install RHHI 2.0?
> > I can only find the doc for RHHI 1.1 at
> > https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ .
> > Is the logical network for gluster created before or after the master
> > storage domain is created?
>
> The HE domain is the master storage domain with the RHV 4.2 HE installation.
> The logical network for gluster is set up after the installation (which sets
> up the storage domains and the additional hosts).

Does this mean it is expected that the IP address and the network interface which provide the storage of the hosted engine VM can be changed while the hosted engine keeps running?
Setting the needinfo back on Sas. I canceled it by mistake.
(In reply to Dominik Holler from comment #7)
> (In reply to Sahina Bose from comment #6)
> > (In reply to Dominik Holler from comment #5)
> > > Satheesaran, can you point me to the procedure to install RHHI 2.0?
> > > I can only find the doc for RHHI 1.1 at
> > > https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ .
> > > Is the logical network for gluster created before or after the master
> > > storage domain is created?
> >
> > The HE domain is the master storage domain with the RHV 4.2 HE installation.
> > The logical network for gluster is set up after the installation (which sets
> > up the storage domains and the additional hosts).
>
> Does this mean it is expected that the IP address and the network interface
> which provide the storage of the hosted engine VM can be changed while the
> hosted engine keeps running?

The HE is connected to a replica 3 gluster volume, so even if one of the networks goes down, it is expected to continue working.
(In reply to Sahina Bose from comment #10)
> (In reply to Dominik Holler from comment #7)
> > (In reply to Sahina Bose from comment #6)
> > > (In reply to Dominik Holler from comment #5)
> > > > Satheesaran, can you point me to the procedure to install RHHI 2.0?
> > > > I can only find the doc for RHHI 1.1 at
> > > > https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ .
> > > > Is the logical network for gluster created before or after the master
> > > > storage domain is created?
> > >
> > > The HE domain is the master storage domain with the RHV 4.2 HE installation.
> > > The logical network for gluster is set up after the installation (which sets
> > > up the storage domains and the additional hosts).
> >
> > Does this mean it is expected that the IP address and the network interface
> > which provide the storage of the hosted engine VM can be changed while the
> > hosted engine keeps running?
>
> The HE is connected to a replica 3 gluster volume, so even if one of the
> networks goes down, it is expected to continue working.

Let me clarify: this issue is seen when attaching the gluster logical network on the host where the HE VM is running, because the HE mount loses connection to all the bricks when the logical network is assigned. Once the network connection is regained, the mount should become functional again and the HE VM should resume. This is currently not the case.
bipin, could you help me to locate the setupNetworks command which triggers the issue in the attached logs?
(In reply to Dominik Holler from comment #12)
> bipin, could you help me to locate the setupNetworks command which triggers
> the issue in the attached logs?

Dominik, the network is set up as you demonstrated in the video. Also, with the latest rhvm-4.2.7.4-0.1.el7ev.noarch, I couldn't reproduce this.
I would not like to close this bug as not reproducible, because there might be a problem I would like to understand. Any help with reproducing the issue is welcome!
Closing due to inability to reproduce. Please reopen with full logs if this reproduces with RHV>=4.2.7.
I hit the same problem with RHV 4.2.8. I had a look at Dominik's logs and screen recording. The difference I see is that we use DHCP rather than static IPs for the setup.

@Dominik, what are all the logs that are needed, so that I can collect those logs and the setup for you? I would be happy to have a BlueJeans session to show the actual problem here.
(In reply to SATHEESARAN from comment #17)
> I hit the same problem with RHV 4.2.8. I had a look at Dominik's logs and
> screen recording. The difference I see is that we use DHCP rather than
> static IPs for the setup.
>
> @Dominik, what are all the logs that are needed, so that I can collect those
> logs and the setup for you?

If it does not hurt, would you be able to share the complete /var/log of the host and the VM?

> I would be happy to have a BlueJeans session to show the actual problem here.

Sounds good, please contact me via IRC, email, or create an appointment on the calendar.
Sas, can you provide the logs from the host and the VM as requested, since Dominik cannot reproduce this? If we do not have the info, let's close and reopen if you encounter it again and have all the requested data.
Closing due to inability to reproduce. Please reopen with full logs.
(In reply to Dominik Holler from comment #21)
> Closing due to inability to reproduce. Please reopen with full logs.

With RHV 4.3, I no longer see the HE going into a paused state when attaching the gluster network to the host running the HE VM.
This issue can be reproduced in the following configuration:
1. rhvh-4.3.0.8-0.20190610
2. glusterfs-server-3.12.2-47.2.el7rhgs.x86_64
3. ansible-2.8.1-1.el7ae.noarch
4. gluster-ansible-repositories-1.0-1.el7rhgs.noarch
   gluster-ansible-maintenance-1.0.1-1.el7rhgs.noarch
   gluster-ansible-infra-1.0.3-3.el7rhgs.noarch
   gluster-ansible-cluster-1.0-1.el7rhgs.noarch
   gluster-ansible-roles-1.0.4-4.el7rhgs.noarch
   gluster-ansible-features-1.0.4-5.el7rhgs.noarch

Attaching the sosreports location below:
"http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/"
From the engine mount logs:

[2019-06-14 09:13:58.385634] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-engine-client-0: server 10.70.36.79:49156 has not responded in the last 30 seconds, disconnecting.
[2019-06-14 09:13:58.386098] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-engine-client-2: server 10.70.36.81:49152 has not responded in the last 30 seconds, disconnecting.
[2019-06-14 09:13:58.386157] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-engine-client-1: server 10.70.36.80:49152 has not responded in the last 30 seconds, disconnecting.
[2019-06-14 09:13:58.386167] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-engine-client-0: disconnected from engine-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-06-14 09:13:58.386189] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2019-06-14 09:13:58.386245] W [MSGID: 108001] [afr-common.c:5341:afr_notify] 0-engine-replicate-0: Client-quorum is not met

It looks like the network connection is lost at the time of adding the network. Which host was running the HE VM when the gluster network was added?

Can you check the sosreport file again? I'm unable to see any files in there.
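A rough sketch of how these events could be pulled out of the mount log, so their timestamps can be compared with the time the network was attached on the host. The log file name below is only an example; the real one depends on the mount point in the sosreport.

    # Extract the disconnect and quorum messages from a gluster fuse mount log.
    # Adjust LOG to the actual engine mount log collected in the sosreport.
    LOG = '/var/log/glusterfs/rhev-data-center-mnt-glusterSD-engine.log'

    markers = (
        'rpc_clnt_ping_timer_expired',   # ping timeout (30s here), client disconnects
        'disconnected from',             # client_rpc_notify: brick connection lost
        'Client-quorum is not met',      # quorum lost, I/O blocked -> HE VM pauses
    )

    with open(LOG, errors='replace') as log:
        for line in log:
            if any(m in line for m in markers):
                print(line.rstrip())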
Is the gluster network a VM network (with bridges)?
(In reply to Dominik Holler from comment #25)
> Is the gluster network a VM network (with bridges)?

I don't think it is. Mugdha, can you confirm?
(In reply to Sahina Bose from comment #26)
> (In reply to Dominik Holler from comment #25)
> > Is the gluster network a VM network (with bridges)?
>
> I don't think it is. Mugdha, can you confirm?

According to the log files, the gluster network is without a bridge.
From my limited understanding, engine.log should mention ManageNetworkClustersCommand, UpdateNetworkOnClusterCommand, and PropagateNetworksToClusterHostsCommand at the moment the gluster networking role is assigned. But I cannot find this in http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/engine.log .

Mugdha, can you please check if the logfile is complete, and describe in your own words how the issue was triggered and how you extracted the logfiles?
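For reference, a small sketch of the completeness check I have in mind, assuming the engine.log from the link above has been downloaded locally:

    # Count how often the expected network commands show up in the collected
    # engine.log; zero hits for all of them suggests the log is incomplete or
    # that the role assignment was done through a different engine.
    expected = (
        'ManageNetworkClustersCommand',
        'UpdateNetworkOnClusterCommand',
        'PropagateNetworksToClusterHostsCommand',
    )

    counts = {name: 0 for name in expected}
    with open('engine.log', errors='replace') as log:
        for line in log:
            for name in expected:
                if name in line:
                    counts[name] += 1

    for name, count in counts.items():
        print(f'{name}: {count} occurrence(s)')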
(In reply to Sahina Bose from comment #26)
> (In reply to Dominik Holler from comment #25)
> > Is the gluster network a VM network (with bridges)?
>
> I don't think it is. Mugdha, can you confirm?

Yes, this gluster network is **not** a VM network.
(In reply to Dominik Holler from comment #28)
> From my limited understanding, engine.log should mention
> ManageNetworkClustersCommand, UpdateNetworkOnClusterCommand, and
> PropagateNetworksToClusterHostsCommand at the moment the gluster networking
> role is assigned. But I cannot find this in
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/engine.log .
>
> Mugdha, can you please check if the logfile is complete, and describe in
> your own words how the issue was triggered and how you extracted the
> logfiles?

I tried to reproduce this issue 4 more times with the same steps followed as mentioned in comment #23, but could not reproduce the issue.
Hello Tal, I doubt that networking is responsible for the issue. Can you please have a look at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/ to check whether the root cause might be storage related?
Sahina, it seems more Gluster related. Can someone from your team please have a look?
Gobinda, can you take a look at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/ to see if it's a gluster error?
We didn't get any other report since 4.2; this needs to be reproduced on 4.4. Is it still happening?
Per the Regression keyword, targeting 4.4.0.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Waiting a little bit longer for confirmation. Without a reproducer, this will get CLOSED.
Needinfo works if you need some info :)

Can we recheck this on 4.4?
(In reply to Lukas Svaty from comment #39)
> Needinfo works if you need some info :)
>
> Can we recheck this on 4.4?

Yes, I tested this behavior with RHV 4.4 (rhv-4.4.0-33) and this issue is not seen.