Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1103973

Summary: gluster bricks marked down in ovirt after vdsm restarted
Product: [Retired] oVirt Reporter: Alastair Neil <aneil2>
Component: vdsm Assignee: Sahina Bose <sabose>
Status: CLOSED CURRENTRELEASE QA Contact: Gil Klein <gklein>
Severity: high Docs Contact:
Priority: high    
Version: 3.4CC: aneil2, bazulay, bugs, dnarayan, gklein, iheim, kmayilsa, luf, mgoldboi, rbalakri, rnachimu, sabose, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: gluster
Fixed In Version: ovirt-3.5.0_rc4 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1152877 1152882 (view as bug list) Environment:
Last Closed: 2014-10-17 12:29:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Gluster RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1152877, 1152882    
Attachments:
Description Flags
log from ovirt hosted engine
none
log from vm1 ovirt host
none
log from vm2 ovirt host
none
vdsm log from vm1
none
vdsm log from vm1
none
vdsm log from vm1
none
vdsm log from vm1
none
vdsm log from vm2
none
vdsm log from vm2
none
vdsm log from vm2
none
vdsm log from vm2
none
vdsm log from vm2
none

Description Alastair Neil 2014-06-03 03:47:37 UTC
Description of problem:
Restarting vdsm on a gluster node causes the bricks in replicated volumes on the affected node to be marked as down, even after vdsm comes back up. Gluster itself reports that the bricks and volumes are fine. Only stopping and restarting the volume fixes the issue in the oVirt console.

Version-Release number of selected component (if applicable):
engine node:
ovirt-engine-cli-3.4.0.5-1.fc19.noarch
ovirt-engine-userportal-3.4.1-1.fc19.noarch
ovirt-engine-3.4.1-1.fc19.noarch
ovirt-engine-setup-plugin-ovirt-engine-3.4.1-1.fc19.noarch
ovirt-engine-setup-base-3.4.1-1.fc19.noarch
ovirt-release34-1.0.1-1.noarch
ovirt-engine-sdk-python-3.4.1.1-1.fc19.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-3.4.1-1.fc19.noarch
ovirt-engine-webadmin-portal-3.4.1-1.fc19.noarch
ovirt-log-collector-3.4.2-1.fc19.noarch
ovirt-host-deploy-java-1.2.1-1.fc19.noarch
ovirt-engine-websocket-proxy-3.4.1-1.fc19.noarch
ovirt-iso-uploader-3.4.1-1.fc19.noarch
ovirt-engine-restapi-3.4.1-1.fc19.noarch
ovirt-engine-tools-3.4.1-1.fc19.noarch
ovirt-engine-setup-plugin-websocket-proxy-3.4.1-1.fc19.noarch
ovirt-host-deploy-1.2.1-1.fc19.noarch
ovirt-engine-lib-3.4.1-1.fc19.noarch
ovirt-engine-setup-3.4.1-1.fc19.noarch
ovirt-engine-dbscripts-3.4.1-1.fc19.noarch
ovirt-image-uploader-3.4.1-1.fc19.noarch
libgovirt-0.1.0-1.fc19.x86_64
ovirt-engine-backend-3.4.1-1.fc19.noarch
glusterfs-3.5.0-3.fc19.x86_64
glusterfs-api-3.5.0-3.fc19.x86_64
glusterfs-fuse-3.5.0-3.fc19.x86_64
glusterfs-libs-3.5.0-3.fc19.x86_64

gluster node:
glusterfs-server-3.5.0-2.el6.x86_64
glusterfs-api-3.5.0-2.el6.x86_64
glusterfs-3.5.0-2.el6.x86_64
glusterfs-fuse-3.5.0-2.el6.x86_64
glusterfs-rdma-3.5.0-2.el6.x86_64
glusterfs-libs-3.5.0-2.el6.x86_64
glusterfs-cli-3.5.0-2.el6.x86_64
vdsm-4.14.8.1-0.el6.x86_64
vdsm-python-zombiereaper-4.14.8.1-0.el6.noarch
vdsm-gluster-4.14.8.1-0.el6.noarch
vdsm-python-4.14.8.1-0.el6.x86_64
vdsm-cli-4.14.8.1-0.el6.noarch
vdsm-xmlrpc-4.14.8.1-0.el6.noarch


How reproducible:
Always

Steps to Reproduce:
1. on gluster0, stop vdsm
2. in the oVirt console, the bricks become unavailable
3. on gluster0, start vdsm
4. in the oVirt console, the bricks remain unavailable

Actual results:
The oVirt admin console continues to show the bricks as down even though gluster is healthy and unaffected.

Expected results:
The bricks should show as down while vdsm is offline, and then be shown as up once vdsm comes back up.

Additional info:

I have a VM store volume to which I cannot add new VM disks because all its bricks are marked down. I cannot restart it because I have active VMs, and I should not have to restart it at all, since this is not a gluster issue.

Comment 1 Sahina Bose 2014-07-08 08:15:29 UTC
Darshan, can you check what the glusterVolumesList vdsm command returns? If it returns the brick status correctly, this may be an engine issue.

Comment 2 Darshan 2014-07-08 10:39:28 UTC
The glusterVolumeStatus command is returning the brick status correctly.
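(Editor's note: for readers unfamiliar with the structure involved, here is a minimal sketch of pulling the per-brick status out of the dict that `vdsClient -s localhost glusterVolumeStatus volumeName=<vol>` prints. The layout matches the output quoted later in this bug; the function name is illustrative, not vdsm code.)

```python
# Illustrative sketch (not vdsm code): extract per-brick status from the
# dict printed by `vdsClient ... glusterVolumeStatus`.
def brick_statuses(volume_status):
    return {b['brick']: b['status']
            for b in volume_status['volumeStatus']['bricks']}

# Abbreviated sample in the shape shown later in this bug:
sample = {
    'status': {'code': 0, 'message': 'Done'},
    'volumeStatus': {
        'name': 'storage',
        'bricks': [
            {'brick': 'vm1.lab.gluster:/gluster/vms/storage',
             'hostuuid': '0c779c52-a097-4101-9c85-c9636499ce82',
             'pid': '1894', 'port': '50153', 'status': 'ONLINE'},
            {'brick': 'vm2.lab.gluster:/gluster/vms/storage',
             'hostuuid': '21a5bd1e-78e3-4824-b299-f7a7c72b7d7a',
             'pid': '1758', 'port': '50159', 'status': 'ONLINE'},
        ],
    },
}

print(brick_statuses(sample))  # both bricks reported ONLINE, keyed by brick path
```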

Comment 3 Ludek Finstrle 2014-09-29 16:28:19 UTC
Is there any progress/workaround? I see exactly the same problem with oVirt 3.4.4 + gluster 3.5.2:

engine: ovirt-engine-3.4.4-1.el6.noarch
hosts: vdsm-4.14.17-0.el6.x86_64, vdsm-gluster-4.14.17-0.el6.noarch, glusterfs-server-3.5.2-1.el6.x86_64

Comment 4 Ludek Finstrle 2014-09-29 16:45:49 UTC
How is the hostname in glusterVolumeStatus gathered?
I have multiple NICs in the host machines (gluster listening on 0.0.0.0):
vm1.lab:
eth1: 192.168.254.1/30
ovirtmgmt (bridge on eth2): 192.168.254.129/27, 192.168.254.161/27
$ grep vm1 /etc/hosts
192.168.254.1    vm1.lab.host
192.168.254.129  vm1.lab.gluster

vm2.lab:
eth1: 192.168.254.5/30
ovirtmgmt (bridge on eth2): 192.168.254.130/27, 192.168.254.162/27
$ grep vm2 /etc/hosts
192.168.254.5    vm2.lab.host
192.168.254.130  vm2.lab.gluster

I'm not aware of any host IP or name changes related to this. However, I did modify /etc/hosts during the oVirt upgrade.

And glusterVolumeStatus output:

$ vdsClient -s localhost glusterVolumeStatus volumeName=storage
{'status': {'code': 0, 'message': 'Done'},
 'volumeStatus': {'bricks': [{'brick': 'vm1.lab.gluster:/gluster/vms/storage',
                              'hostuuid': '0c779c52-a097-4101-9c85-c9636499ce82',
                              'pid': '1894',
                              'port': '50153',
                              'status': 'ONLINE'},
                             {'brick': 'vm2.lab.gluster:/gluster/vms/storage',
                              'hostuuid': '21a5bd1e-78e3-4824-b299-f7a7c72b7d7a',
                              'pid': '1758',
                              'port': '50159',
                              'status': 'ONLINE'}],
                  'name': 'storage',
                  'nfs': [{'hostname': '192.168.254.5',
                           'hostuuid': '21a5bd1e-78e3-4824-b299-f7a7c72b7d7a',
                           'pid': '24817',
                           'port': '2049',
                           'status': 'ONLINE'},
                          {'hostname': '192.168.254.129',
                           'hostuuid': '0c779c52-a097-4101-9c85-c9636499ce82',
                           'pid': '16277',
                           'port': '2049',
                           'status': 'ONLINE'}],
                  'shd': [{'hostname': '192.168.254.5',
                           'hostuuid': '21a5bd1e-78e3-4824-b299-f7a7c72b7d7a',
                           'pid': '24824',
                           'status': 'ONLINE'},
                          {'hostname': '192.168.254.129',
                           'hostuuid': '0c779c52-a097-4101-9c85-c9636499ce82',
                           'pid': '16286',
                           'status': 'ONLINE'}]}}

Comment 5 Ludek Finstrle 2014-09-29 16:51:33 UTC
The previous output is from vm2.lab.gluster host.

Now here is the output from vm1.lab.gluster (it doesn't match:
the local IP is wrong, while the remote IP is OK):

$ vdsClient -s localhost glusterVolumeStatus volumeName=storage
{'status': {'code': 0, 'message': 'Done'},
 'volumeStatus': {'bricks': [{'brick': 'vm1.lab.gluster:/gluster/vms/storage',
                              'hostuuid': '0c779c52-a097-4101-9c85-c9636499ce82',
                              'pid': '1894',
                              'port': '50153',
                              'status': 'ONLINE'},
                             {'brick': 'vm2.lab.gluster:/gluster/vms/storage',
                              'hostuuid': '21a5bd1e-78e3-4824-b299-f7a7c72b7d7a',
                              'pid': '1758',
                              'port': '50159',
                              'status': 'ONLINE'}],
                  'name': 'storage',
                  'nfs': [{'hostname': '192.168.254.1',
                           'hostuuid': '0c779c52-a097-4101-9c85-c9636499ce82',
                           'pid': '16277',
                           'port': '2049',
                           'status': 'ONLINE'},
                          {'hostname': '192.168.254.130',
                           'hostuuid': '21a5bd1e-78e3-4824-b299-f7a7c72b7d7a',
                           'pid': '24817',
                           'port': '2049',
                           'status': 'ONLINE'}],
                  'shd': [{'hostname': '192.168.254.1',
                           'hostuuid': '0c779c52-a097-4101-9c85-c9636499ce82',
                           'pid': '16286',
                           'status': 'ONLINE'},
                          {'hostname': '192.168.254.130',
                           'hostuuid': '21a5bd1e-78e3-4824-b299-f7a7c72b7d7a',
                           'pid': '24824',
                           'status': 'ONLINE'}]}}
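(Editor's note: comparing the two runs above, the `hostname` values in the nfs/shd sections differ depending on which host answers, while `hostuuid` stays stable. A minimal illustrative sketch, using abbreviated data from the two `vdsClient` runs in this bug:)

```python
# Illustrative only: the nfs 'hostname' fields differ between the two
# hosts' outputs, but keying entries on the stable 'hostuuid' reconciles
# them. Data abbreviated from the two vdsClient runs in this bug.
VM1 = '0c779c52-a097-4101-9c85-c9636499ce82'
VM2 = '21a5bd1e-78e3-4824-b299-f7a7c72b7d7a'

vm2_view = [{'hostname': '192.168.254.5',   'hostuuid': VM2},
            {'hostname': '192.168.254.129', 'hostuuid': VM1}]
vm1_view = [{'hostname': '192.168.254.1',   'hostuuid': VM1},
            {'hostname': '192.168.254.130', 'hostuuid': VM2}]

def by_uuid(entries):
    return {e['hostuuid']: e['hostname'] for e in entries}

# The hostnames disagree per host, but the set of host uuids is identical:
assert by_uuid(vm1_view) != by_uuid(vm2_view)
assert by_uuid(vm1_view).keys() == by_uuid(vm2_view).keys()
```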

Comment 6 Sahina Bose 2014-09-30 05:48:38 UTC
The host names returned by gluster volume status are mapped to the engine's hosts using the hostuuid field.

For the hosts you have in your engine (vm1.lab.gluster and vm2.lab.gluster), could you tell me the host uuid of each?

You can do this by running the query:
psql engine postgres -c "select vds_name, gluster_server_uuid  from vds_static, gluster_server where vds_id= server_id;"

Also, could you attach the engine.log to the bug?

Comment 7 Ludek Finstrle 2014-09-30 06:27:28 UTC
Created attachment 942592 [details]
log from ovirt hosted engine

Comment 8 Ludek Finstrle 2014-09-30 06:28:11 UTC
Created attachment 942593 [details]
log from vm1 ovirt host

Comment 9 Ludek Finstrle 2014-09-30 06:28:51 UTC
Created attachment 942594 [details]
log from vm2 ovirt host

Comment 10 Ludek Finstrle 2014-09-30 06:44:33 UTC
I attached the requested log (and also the vdsm logs from the nodes). I'm running a two-node oVirt setup with a hosted engine. The hosts are also gluster nodes. All gluster volumes are replicated with two bricks.

They're whole-day logs. I don't remember exactly when I started the upgrade, but I definitely upgraded in this order:
1) ovirt-engine
2) vm1.lab
3) vm2.lab

Gluster worked perfectly the whole time. What I describe below is just the status shown in the oVirt admin console.

The status of the gluster volumes in oVirt was OK in the morning. After that (I don't remember whether after the 1st or 2nd step, but I think after the 2nd), the gluster bricks on vm1 went down (red triangle). Then I also upgraded vm2.lab. All gluster bricks on vm1 were red, and on vm2 there were question marks (no red or green triangle, just a black question mark). Lastly, I tried stop & start on the isos volume from the admin console, and it went into green status.

In the meantime, I tried restarting the whole environment (engine, hosts) without impact on the running VMs. I also tried stop & start on isos and it went green, but after some other step it went back to the vm1-red / vm2-question-mark state.

I'm trying to run the psql query, but I have never logged into the oVirt internal postgres instance (so I don't know the credentials, or where to get them, and I don't have identd installed). It'll take me some time. Maybe I'll change ident to trust auth :)

Comment 11 Ludek Finstrle 2014-09-30 06:49:39 UTC
# sudo -u postgres psql engine postgres -c "select vds_name, gluster_server_uuid  from vds_static, gluster_server where vds_id= server_id;"
   vds_name      |         gluster_server_uuid          
-----------------+--------------------------------------
 vm1.lab.gluster | 0c779c52-a097-4101-9c85-c9636499ce82
 vm2.lab.gluster | 21a5bd1e-78e3-4824-b299-f7a7c72b7d7a
(2 rows)

Comment 12 Ludek Finstrle 2014-09-30 07:35:04 UTC
I see how to reproduce the question mark state:
1) put the node under maintenance (not sure if needed)
2) stop the vdsmd service on that node
3) Refresh capabilities from the web admin console for that node while vdsmd is down

I see how to reproduce the down (red triangle) state:
1) put the node under maintenance (not sure if needed)
2) stop the glusterd service on that node
3) Refresh capabilities from the web admin console for that node while glusterd is down

Comment 13 Sahina Bose 2014-09-30 09:51:48 UTC
Hi!

Thanks for the detailed analysis on the bug.

This is designed as per bug - https://bugzilla.redhat.com/show_bug.cgi?id=1021441#c4

1) In the first case - when vdsmd is down, there is no communication possible between the engine and node. This could be due to many reasons - host powered down or vdsmd service not running. So the brick status is temporarily moved to Unknown (?) - the brick will be moved back to UP state, during the next refresh cycle when gluster volume status returns the brick as online.

2) In the second case - when glusterd is down, the bricks are marked Down (red) since gluster volume status will no longer list these bricks.


If the UNKNOWN (?) state is misleading, we could change our refresh logic - to always compare results from gluster volume status output. If the brick is not listed in the output, then mark this as DOWN.
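(Editor's note: the refresh-logic change proposed above can be sketched as follows; this is illustrative Python, not engine code.)

```python
# Illustrative sketch of the proposed rule: any brick the engine knows
# about, but which `gluster volume status` no longer lists as online,
# is marked DOWN instead of being left in UNKNOWN.
def refresh_brick_states(known_bricks, volume_status):
    online = {b['brick'] for b in volume_status['volumeStatus']['bricks']
              if b['status'] == 'ONLINE'}
    return {brick: ('UP' if brick in online else 'DOWN')
            for brick in known_bricks}

# vm2:/b1 is absent from the status output, so it would be marked DOWN:
status = {'volumeStatus': {'bricks': [{'brick': 'vm1:/b1', 'status': 'ONLINE'}]}}
print(refresh_brick_states(['vm1:/b1', 'vm2:/b1'], status))
```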

Please let me know if the UNKNOWN state was the issue. (As the second case seems to be expected behaviour)

Comment 14 Ludek Finstrle 2014-09-30 12:17:10 UTC
States are ok and I understand it. I have no problem with it.

There is no problem with the bricks going into those states. The problem is that they never return from them.
They have been in that status for a day (with vdsmd and glusterd up), and Refresh capabilities doesn't help.

The gluster status refresh is weird. Right now, after a series of tests, I'm in a situation where vm1 is down (the host is up, but all services, including wdmd, sanlock, gluster and vdsm, are down), and has been for more than 2 hours, yet the oVirt console shows the brick on vm1 as up and the brick on vm2 as down (the opposite of reality).

Comment 15 Ludek Finstrle 2014-09-30 12:21:06 UTC
I see how to reproduce the question mark state:
1) put the node under maintenance (not sure if needed)
2) stop the vdsmd service on that node
3) Refresh capabilities from the web admin console for that node while vdsmd is down
4) start the vdsmd service on that node
5) the brick remains in the unknown state; even Refresh capabilities doesn't help (the only possible transition is to the down state)

I see how to reproduce the down (red triangle) state:
1) put the node under maintenance (not sure if needed)
2) stop the glusterd service on that node
3) Refresh capabilities from the web admin console for that node while glusterd is down
4) start the glusterd service on that node
5) the brick remains in the down state; even Refresh capabilities doesn't help

The problem arrived yesterday and I still see this strange state.
The only way out of it is to stop & start the gluster volume from the oVirt console.

Comment 16 Alastair Neil 2014-09-30 14:13:23 UTC
(In reply to Sahina Bose from comment #13)
> Hi!
> 
> Thanks for the detailed analysis on the bug.
> 
> This is designed as per bug -
> https://bugzilla.redhat.com/show_bug.cgi?id=1021441#c4
> 
> 1) In the first case - when vdsmd is down, there is no communication
> possible between the engine and node. This could be due to many reasons -
> host powered down or vdsmd service not running. So the brick status is
> temporarily moved to Unknown (?) - the brick will be moved back to UP state,
> during the next refresh cycle when gluster volume status returns the brick
> as online.

Per my original bug report, agreed, this is the expected behaviour, but it does not happen in my cluster. Is this resolved in 3.5? Until this additional report came in, I had not seen any action on this bug, nor even a confirmation that there was a problem, or a request for information.

> 
> 2) In the second case - when glusterd is down, the bricks are marked Down
> (red) since gluster volume status will no longer list these bricks.
> 
> 
> If the UNKNOWN (?) state is misleading, we could change our refresh logic -
> to always compare results from gluster volume status output. If the brick is
> not listed in the output, then mark this as DOWN.
> 
> Please let me know if the UNKNOWN state was the issue. (As the second case
> seems to be expected behaviour)

Comment 17 Sahina Bose 2014-10-01 05:36:10 UTC
(In reply to Alastair Neil from comment #16)
> (In reply to Sahina Bose from comment #13)
> > Hi!
> > 
> > Thanks for the detailed analysis on the bug.
> > 
> > This is designed as per bug -
> > https://bugzilla.redhat.com/show_bug.cgi?id=1021441#c4
> > 
> > 1) In the first case - when vdsmd is down, there is no communication
> > possible between the engine and node. This could be due to many reasons -
> > host powered down or vdsmd service not running. So the brick status is
> > temporarily moved to Unknown (?) - the brick will be moved back to UP state,
> > during the next refresh cycle when gluster volume status returns the brick
> > as online.
> 
> per my original bug report, agreed this is the expected behaviour but this
> does not happen in my cluster, is this resolved in 3.5? Until this
> additional report came in I had not seen any action on this bug, or even a
> confirmation that there was a problem, or a request for information.


This fix was introduced in http://gerrit.ovirt.org/#/c/21444/ and should be available since ovirt-3.4.0

We had tried to reproduce the issue, but were unable to. We may be missing a scenario here.

If this happens again, could you provide the engine log and vdsm log?

> 
> > 
> > 2) In the second case - when glusterd is down, the bricks are marked Down
> > (red) since gluster volume status will no longer list these bricks.
> > 
> > 
> > If the UNKNOWN (?) state is misleading, we could change our refresh logic -
> > to always compare results from gluster volume status output. If the brick is
> > not listed in the output, then mark this as DOWN.
> > 
> > Please let me know if the UNKNOWN state was the issue. (As the second case
> > seems to be expected behaviour)

Comment 18 Sahina Bose 2014-10-01 05:38:33 UTC
(In reply to Ludek Finstrle from comment #15)
> I see how to reproduce the question mark state:
> 1) put the node under maintenance (not sure if needed)
> 2) stop vdsmd service on that node
> 3) Refresh capabilities from web admin console for that node while vdsmd is
> down
> 4) start vdsmd service on that node
> 5) brick remain in unknown state even Refresh capabilities doesn't help (the
> only possible transition is to down state)
> 
> I see how to reproduce the down (red triangle) state:
> 1) put the node under maintenance (not sure if needed)
> 2) stop glusterd service on that node
> 3) Refresh capabilities from web admin console for that node while glusterd
> is down
> 4) start glusterd service on that node
> 5) brick remain in down state even Refresh capabilities doesn't help
> 
> The problem arrived yesterday and I still see the strange state.
> The only way from this is to stop & start gluster volume from ovirt console.

Refresh capabilities only executes getVdsCaps on the node. It does not execute the gluster commands, and hence would not change the brick state to green. I did not notice any errors regarding the gluster commands in your logs. Let me dig into this further now that there are reproducible steps.

Comment 19 Sahina Bose 2014-10-01 08:57:44 UTC
As the engine and vdsm logs were not from the same time period, there was not much I could infer from them. No exceptions related to volumeStatus were found in either.

One possibility is that the brick status does not get updated because the host names differ: the one that gluster returns versus the one the engine is aware of. The referenced patch addresses this issue.
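(Editor's note: a minimal illustration of the failure mode described above. If bricks were matched by host name rather than by hostuuid, a brick reported under an alternate interface's address would never have its status updated. The values below are taken from earlier comments in this bug.)

```python
# Illustrative only: name-based matching fails when gluster reports an
# alternate interface's IP, while uuid-based matching still succeeds.
engine_host   = {'name': 'vm1.lab.gluster',
                 'uuid': '0c779c52-a097-4101-9c85-c9636499ce82'}
gluster_entry = {'hostname': '192.168.254.1',   # alternate NIC's address
                 'hostuuid': '0c779c52-a097-4101-9c85-c9636499ce82'}

matched_by_name = engine_host['name'] == gluster_entry['hostname']
matched_by_uuid = engine_host['uuid'] == gluster_entry['hostuuid']
print(matched_by_name, matched_by_uuid)  # False True
```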

Comment 20 Ludek Finstrle 2014-10-01 12:35:54 UTC
Oops, I'm sorry, I didn't notice that log rotation happened while I was downloading the logs.
Are you interested in the vdsm logs from 9/29/2014?

Will this patch be included in 3.5.0 release or 3.5.1?

Comment 21 Sahina Bose 2014-10-01 13:52:15 UTC
(In reply to Ludek Finstrle from comment #20)
> Ops, I'm sorry I didn't notice that log rotation proceeded while I
> downloaded logs.
> Are you interested in vdsm logs from 9/29/2014?

If you do have them, yes. And this is the time during which the brick status was red, correct?

> 
> Will this patch be included in 3.5.0 release or 3.5.1?

It should be in 3.5.0.

Comment 22 Ludek Finstrle 2014-10-02 13:34:52 UTC
Created attachment 943369 [details]
vdsm log from vm1

Comment 23 Ludek Finstrle 2014-10-02 13:35:09 UTC
Created attachment 943370 [details]
vdsm log from vm1

Comment 24 Ludek Finstrle 2014-10-02 13:35:37 UTC
Created attachment 943371 [details]
vdsm log from vm1

Comment 25 Ludek Finstrle 2014-10-02 13:36:00 UTC
Created attachment 943372 [details]
vdsm log from vm1

Comment 26 Ludek Finstrle 2014-10-02 13:36:29 UTC
Created attachment 943373 [details]
vdsm log from vm2

Comment 27 Ludek Finstrle 2014-10-02 13:36:53 UTC
Created attachment 943374 [details]
vdsm log from vm2

Comment 28 Ludek Finstrle 2014-10-02 13:37:17 UTC
Created attachment 943375 [details]
vdsm log from vm2

Comment 29 Ludek Finstrle 2014-10-02 13:37:39 UTC
Created attachment 943376 [details]
vdsm log from vm2

Comment 30 Ludek Finstrle 2014-10-02 13:38:01 UTC
Created attachment 943377 [details]
vdsm log from vm2

Comment 31 Ludek Finstrle 2014-10-02 13:40:33 UTC
I uploaded vdsm logs from vm1 and vm2, hopefully from the right time.
The previously attached vdsm logs (from 9/30) are from the time when the status of the gluster volumes was displayed wrong (red for all except isos, which I had stopped & started).

Comment 32 Sandro Bonazzola 2014-10-17 12:29:31 UTC
oVirt 3.5 has been released and should include the fix for this issue.

Comment 33 Alastair Neil 2014-10-20 21:17:33 UTC
I can confirm that this issue is resolved for me with version 3.5 GA.