Description of problem:
-----------------------
Hosts in the RHHI setup have two network interfaces, one for VM traffic and the other for gluster traffic. Once the HE VM is up, a gluster logical network is created with the migration traffic and gluster roles. When this logical network is attached to the host network interface dedicated to gluster traffic, the HE VM goes into a paused state.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHV 4.2.7 nightly build
RHHI 2.0
RHGS 3.4.1 nightly build

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. As part of the RHHI deployment, complete the gluster deployment and the HE deployment.
2. Once the HE VM is up, create a logical network with the VM migration and gluster network roles.
3. Attach this newly created logical network to the network interface which is connected to the network dedicated to gluster traffic (10Gbps), as sketched below.

Actual results:
---------------
The HE VM goes into a paused state.

Expected results:
-----------------
The HE VM should not be paused while attaching the gluster network to the host.
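For reference, a minimal sketch of how step 3 could be driven through the oVirt Python SDK (ovirt-engine-sdk4). This is not our actual tooling: the engine URL, credentials, host name, NIC name and network name are placeholders, and the logical network is assumed to already carry the gluster and migration roles.

    # Minimal sketch: attach an existing logical network (assumed to already
    # have the gluster + migration roles) to the dedicated gluster NIC of a
    # host, using DHCP. All names, URLs and credentials are placeholders.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='password',
        ca_file='ca.pem',
    )

    hosts_service = connection.system_service().hosts_service()
    host = hosts_service.list(search='name=rhhi-host1')[0]
    host_service = hosts_service.host_service(host.id)

    # Attach the gluster logical network to the NIC reserved for gluster traffic.
    host_service.setup_networks(
        modified_network_attachments=[
            types.NetworkAttachment(
                network=types.Network(name='glusternw'),
                host_nic=types.HostNic(name='ens2f0'),
                ip_address_assignments=[
                    types.IpAddressAssignment(
                        assignment_method=types.BootProtocol.DHCP,
                    ),
                ],
            ),
        ],
    )

    # Persist the network configuration on the host.
    host_service.commit_net_config()
    connection.close()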
A similar issue was seen while running the HE deployment on a single node. While attaching the gluster network, the HE VM went into a paused state and the gluster IP seemed to vanish. Restarting the network brought the gluster IP back. I will attach the vdsm logs and sosreport.
Moving to the network team to take a look.
bipin, could you help me with the logs? I cannot see the setupNetworks command in supervdsmd.log. What is the name of your gluster network? On which interface do you place it, and with which IP? Why do you think it is a regression in networking? With which Engine/Vdsm version did it last work?
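In case it helps to narrow down the time window, a minimal sketch (assuming the default vdsm log path on the host) of how I would pull the setupNetworks calls out of the log:

    # Print the supervdsm log lines that mention the setupNetworks verb, so the
    # timestamp of the network change can be compared with the VM pause event.
    # The path below is the default vdsm location and may differ in a sosreport.
    LOG = '/var/log/vdsm/supervdsm.log'

    with open(LOG, errors='replace') as log:
        for line in log:
            if 'setupNetworks' in line:
                print(line.rstrip())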
Satheesaran, can you point me to the procedure to install RHHI 2.0? I can only find the doc for RHHI 1.1 at https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ . Is the logical network for gluster created before or after the master storage domain is created?
(In reply to Dominik Holler from comment #5)
> Satheesaran, can you point me to the procedure to install RHHI 2.0?
> I can only find the doc for RHHI 1.1 at
> https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ .
> Is the logical network for gluster created before or after the master
> storage domain is created?

The HE domain is the master storage domain with the RHV 4.2 HE installation. The logical network for gluster is set up after the installation (which sets up the storage domains and the additional hosts).
(In reply to Sahina Bose from comment #6)
> (In reply to Dominik Holler from comment #5)
> > Satheesaran, can you point me to the procedure to install RHHI 2.0?
> > I can only find the doc for RHHI 1.1 at
> > https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ .
> > Is the logical network for gluster created before or after the master
> > storage domain is created?
>
> The HE domain is the master storage domain with the RHV 4.2 HE installation.
> The logical network for gluster is set up after the installation (which sets
> up the storage domains and the additional hosts).

Does this mean it is expected that the IP address and the network interface which provide the storage of the hosted engine VM can be changed while the hosted engine keeps running?
Setting the needinfo back on Sas. I canceled it by mistake.
(In reply to Dominik Holler from comment #7)
> (In reply to Sahina Bose from comment #6)
> > (In reply to Dominik Holler from comment #5)
> > > Satheesaran, can you point me to the procedure to install RHHI 2.0?
> > > I can only find the doc for RHHI 1.1 at
> > > https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ .
> > > Is the logical network for gluster created before or after the master
> > > storage domain is created?
> >
> > The HE domain is the master storage domain with the RHV 4.2 HE installation.
> > The logical network for gluster is set up after the installation (which sets
> > up the storage domains and the additional hosts).
>
> Does this mean it is expected that the IP address and the network interface
> which provide the storage of the hosted engine VM can be changed while the
> hosted engine keeps running?

The HE is connected to a replica 3 gluster volume, so even if one of the networks goes down, it is expected to continue working.
(In reply to Sahina Bose from comment #10)
> (In reply to Dominik Holler from comment #7)
> > (In reply to Sahina Bose from comment #6)
> > > (In reply to Dominik Holler from comment #5)
> > > > Satheesaran, can you point me to the procedure to install RHHI 2.0?
> > > > I can only find the doc for RHHI 1.1 at
> > > > https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure/1.1/ .
> > > > Is the logical network for gluster created before or after the master
> > > > storage domain is created?
> > >
> > > The HE domain is the master storage domain with the RHV 4.2 HE installation.
> > > The logical network for gluster is set up after the installation (which sets
> > > up the storage domains and the additional hosts).
> >
> > Does this mean it is expected that the IP address and the network interface
> > which provide the storage of the hosted engine VM can be changed while the
> > hosted engine keeps running?
>
> The HE is connected to a replica 3 gluster volume, so even if one of the
> networks goes down, it is expected to continue working.

Let me clarify: this issue is seen when attaching the gluster logical network on the host where the HE VM is running, because the HE mount loses connection to all the bricks when the logical network is assigned. Once the network connection is regained, the mount should become functional again and the HE VM should resume. This is currently not the case.
bipin, could you help me to locate the setupNetworks command which triggers the issue in the attached logs?
(In reply to Dominik Holler from comment #12)
> bipin, could you help me to locate the setupNetworks command which triggers
> the issue in the attached logs?

Dominik, the network is set up as you demonstrated in the video. Also, with the latest rhvm-4.2.7.4-0.1.el7ev.noarch, I couldn't reproduce this.
I would not like to close this bug as not reproducible, because there might be a problem I would like to understand. Any help with reproducing the issue is welcome!
Closing due to inability to reproduce. Please reopen with full logs if this reproduces with RHV>=4.2.7.
I hit the same problem with RHV 4.2.8. I had a look at Dominik's logs and screen recording. The difference I see is that we use DHCP rather than static IPs for the setup.

@Dominik, what are all the logs that are needed, so that I can collect those logs and the setup for you? I would be happy to have a BlueJeans session to show the actual problem here.
(In reply to SATHEESARAN from comment #17)
> I hit the same problem with RHV 4.2.8. I had a look at Dominik's logs and
> screen recording. The difference I see is that we use DHCP rather than
> static IPs for the setup.
>
> @Dominik, what are all the logs that are needed, so that I can collect those
> logs and the setup for you?

If it does not hurt, would you be able to share the complete /var/log of the host and the VM?

> I would be happy to have a BlueJeans session to show the actual problem here.

Sounds good, please contact me via IRC, email, or create an appointment on the calendar.
Sas, can you provide the logs from the host and the VM as requested, since Dominik cannot reproduce this? If we do not have the info, let's close and reopen if you encounter it again and have all the requested data.
Closing due to inability to reproduce. Please reopen with full logs.
(In reply to Dominik Holler from comment #21)
> Closing due to inability to reproduce. Please reopen with full logs.

With RHV 4.3, I no longer see the HE going into a paused state when attaching the gluster network to the host running the HE VM.
This issue can be reproduced in the following configuration:
1. rhvh-4.3.0.8-0.20190610
2. glusterfs-server-3.12.2-47.2.el7rhgs.x86_64
3. ansible-2.8.1-1.el7ae.noarch
4. gluster-ansible-repositories-1.0-1.el7rhgs.noarch
   gluster-ansible-maintenance-1.0.1-1.el7rhgs.noarch
   gluster-ansible-infra-1.0.3-3.el7rhgs.noarch
   gluster-ansible-cluster-1.0-1.el7rhgs.noarch
   gluster-ansible-roles-1.0.4-4.el7rhgs.noarch
   gluster-ansible-features-1.0.4-5.el7rhgs.noarch

Attaching the sosreports location below:
"http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/"
From the engine mount logs:

[2019-06-14 09:13:58.385634] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-engine-client-0: server 10.70.36.79:49156 has not responded in the last 30 seconds, disconnecting.
[2019-06-14 09:13:58.386098] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-engine-client-2: server 10.70.36.81:49152 has not responded in the last 30 seconds, disconnecting.
[2019-06-14 09:13:58.386157] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-engine-client-1: server 10.70.36.80:49152 has not responded in the last 30 seconds, disconnecting.
[2019-06-14 09:13:58.386167] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-engine-client-0: disconnected from engine-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-06-14 09:13:58.386189] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2019-06-14 09:13:58.386245] W [MSGID: 108001] [afr-common.c:5341:afr_notify] 0-engine-replicate-0: Client-quorum is not met

It looks like the network connection is lost at the time of adding the network. Which host was running the HE VM when the gluster network was added?

Can you check the sosreport file again? I'm unable to see any files in there.
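A rough sketch of how these events could be pulled out of the mount log, so their timestamps can be compared with the time the network was attached on the host. The log file name below is only an example; the real one depends on the mount point in the sosreport.

    # Extract the disconnect and quorum messages from a gluster fuse mount log.
    # Adjust LOG to the actual engine mount log collected in the sosreport.
    LOG = '/var/log/glusterfs/rhev-data-center-mnt-glusterSD-engine.log'

    markers = (
        'rpc_clnt_ping_timer_expired',   # ping timeout (30s here), client disconnects
        'disconnected from',             # client_rpc_notify: brick connection lost
        'Client-quorum is not met',      # quorum lost, I/O blocked -> HE VM pauses
    )

    with open(LOG, errors='replace') as log:
        for line in log:
            if any(m in line for m in markers):
                print(line.rstrip())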
Is the gluster network a VM network (with bridges)?
(In reply to Dominik Holler from comment #25)
> Is the gluster network a VM network (with bridges)?

I don't think it is. Mugdha, can you confirm?
(In reply to Sahina Bose from comment #26)
> (In reply to Dominik Holler from comment #25)
> > Is the gluster network a VM network (with bridges)?
>
> I don't think it is. Mugdha, can you confirm?

According to the log files, the gluster network is without a bridge.
From my limited understanding, engine.log should mention ManageNetworkClustersCommand, UpdateNetworkOnClusterCommand, and PropagateNetworksToClusterHostsCommand at the moment the gluster networking role is assigned. But I cannot find this in http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/engine.log .

Mugdha, can you please check if the logfile is complete, and describe in your own words how the issue was triggered and how you extracted the logfiles?
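For reference, a small sketch of the completeness check I have in mind, assuming the engine.log from the link above has been downloaded locally:

    # Count how often the expected network commands show up in the collected
    # engine.log; zero hits for all of them suggests the log is incomplete or
    # that the role assignment was done through a different engine.
    expected = (
        'ManageNetworkClustersCommand',
        'UpdateNetworkOnClusterCommand',
        'PropagateNetworksToClusterHostsCommand',
    )

    counts = {name: 0 for name in expected}
    with open('engine.log', errors='replace') as log:
        for line in log:
            for name in expected:
                if name in line:
                    counts[name] += 1

    for name, count in counts.items():
        print(f'{name}: {count} occurrence(s)')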
(In reply to Sahina Bose from comment #26)
> (In reply to Dominik Holler from comment #25)
> > Is the gluster network a VM network (with bridges)?
>
> I don't think it is. Mugdha, can you confirm?

Yes, this gluster network is **not** a VM network.
(In reply to Dominik Holler from comment #28)
> From my limited understanding, engine.log should mention
> ManageNetworkClustersCommand, UpdateNetworkOnClusterCommand, and
> PropagateNetworksToClusterHostsCommand at the moment the gluster networking
> role is assigned. But I cannot find this in
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/engine.log .
>
> Mugdha, can you please check if the logfile is complete, and describe in
> your own words how the issue was triggered and how you extracted the
> logfiles?

I tried to reproduce this issue 4 more times with the same steps followed as mentioned in comment #23, but could not reproduce the issue.
Hello Tal, I doubt that networking is responsible for the issue. Can you please have a look at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/ to check whether the root cause might be storage related?
Sahina, it seems more Gluster related. Can someone from your team please have a look?
Gobinda, can you take a look at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/mugdha/reopenbuglogs/ to see if it's a gluster error?
We didn't get any other report since 4.2; this needs to be reproduced on 4.4. Is it still happening?
Per the Regression keyword, targeting 4.4.0.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Waiting a little bit longer for confirmation. Without a reproducer, this will get CLOSED.
Needinfo works if you need some info :)

Can we recheck this on 4.4?
(In reply to Lukas Svaty from comment #39)
> Needinfo works if you need some info :)
>
> Can we recheck this on 4.4?

Yes, I tested this behavior with RHV 4.4 (rhv-4.4.0-33) and this issue is not seen.