Bug 1534658
Summary: | [downstream clone - 4.1.9] Host installation made all other hosts timeout - add logging to make heartbeat issues more visible | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ
Component: | vdsm-jsonrpc-java | Assignee: | Piotr Kliczewski <pkliczew>
Status: | CLOSED ERRATA | QA Contact: | Petr Matyáš <pmatyas>
Severity: | medium | Docs Contact: |
Priority: | high | |
Version: | 4.0.7 | CC: | bazulay, gveitmic, lsurette, mgoldboi, mperina, pkliczew, pstehlik, rbalakri, Rhev-m-bugs, srevivo, ykaul
Target Milestone: | ovirt-4.1.9 | Keywords: | ZStream
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | vdsm-jsonrpc-java-1.3.16-1.el7ev | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | 1502391 | Environment: |
Last Closed: | 2018-01-24 14:42:48 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1502391 | |
Bug Blocks: | | |
Description

rhev-integ, 2018-01-15 17:00:36 UTC
There were some stuck ProcessOvfUpdateForStorageDomain commands, but these have been there for quite some time and don't look related. (Originally by Germano Veit Michel)

Piotr, could you please take a look? (Originally by Martin Perina)

It looks like an issue we had before, which was already fixed. To make sure that it is really the same issue, please provide the vdsm logs and server.log from the host being installed. (Originally by Piotr Kliczewski)

(In reply to Piotr Kliczewski from comment #4)

Hi Piotr, I checked server.log previously and it was clean, which is why I did not upload it. Attaching now. VDSM logs will follow soon (keeping needinfo on me). Could you please point to the bug/patch that you think already fixed it? (Originally by Germano Veit Michel)

From the vdsm logs I see very strange vdsm behavior:

1. Vdsm was started at 2017-10-12 22:13:35,669+0200
2. The engine connected at 2017-10-11 22:13:36,037+0200
3. The engine got disconnected at 2017-10-12 12:38:33,063+0200
4. Vdsm was restarted at 2017-10-12 12:49:45,843+0200 and became fully functional

Please note that between 1 and 3 vdsm never received a CONNECT frame, and there is no sign of mom (localhost) connecting to vdsm. From the engine perspective I see that the engine attempted to call the ping method for some time but failed, which resulted in a fencing flow at 2017-10-11 22:15:01,784. The host was switched to maintenance at 2017-10-11 22:16:25,393.
There was balancing info:

2017-10-11 22:16:27,144 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.EvenGuestDistributionBalancePolicyUnit] (DefaultQuartzScheduler93) [] There is no host with less than 33 running guests
2017-10-11 22:16:27,144 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.CpuAndMemoryBalancingPolicyUnit] (DefaultQuartzScheduler93) [] All candidate hosts have been filtered, can't balance the cluster 'Tele2Cloud3' based on the CPU usage, will try memory based approach

I see that all the other hosts started to fail one by one at 2017-10-11 22:16:36,057, ending at 2017-10-11 22:18:27,617. The reason for all of the failures was no response within the higher-level timeout. Please provide the vdsm log for one of the vdsms which started to fail after the failed installation. (Originally by Piotr Kliczewski)

There is nothing special in the log. I can see that the last request from the engine arrived at 2017-10-11 22:13:28,083+0200 and that the connection was later closed by the engine at 2017-10-11 22:16:38,160+0200. I haven't seen similar behavior before.

Here is what we know:

1. The host was installed and the handshake was successful, but no CONNECT frame was sent.
2. After the host install issue, the other hosts slowly started to time out according to the timeout defined in the db.

Checking the engine log a bit further, I see an enormous amount of ovf stream uploads, and when the engine was about to connect to vdsm the log entries stopped appearing. Here are the last entries:

2017-10-11 22:13:38,339 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler79) [43b80e86] Command 'ProcessOvfUpdateForStorageDomain' (id: '7c2f78e9-c18b-4610-9f57-35f34a38dd27') waiting on child command id: '95057e8d-4906-4d8c-99c8-f7c25db0fcc0' type:'UploadStream' to complete
2017-10-11 22:14:16,403 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler89) [7fce1c8b] Command 'ProcessOvfUpdateForStorageDomain' (id: '144d5adf-5777-41b6-8a33-73c59d783a47') waiting on child command id: '84800851-e7a3-4638-98e3-9016c493c0f4' type:'UploadStream' to complete

Do we know anything about these massive upload operations? They seem to occupy many threads. (Originally by Piotr Kliczewski)

(In reply to Piotr Kliczewski from comment #10)
> I haven't seen similar behavior before.

Yes. I too only saw the engine not talking to any hosts, no clues anywhere else.

> Do we know anything about these massive upload operations?

We don't. The engine was waiting for all these commands to complete for a very long time. Our oldest logs still show these commands waiting, so the information on why this is still there was rotated out of the logs. It's been like that for at least 1 month. After an engine restart the OVF updates were cleaned up.

I understand we both see an issue, but there are no clues where it's coming from. As an engine restart fixed it, it probably indicates something wrong in the engine itself, right? Do you have any ideas on what could trigger this, so we can try to reproduce it? If not, I'm afraid we don't have enough data to tackle this bug. (Originally by Germano Veit Michel)

(In reply to Germano Veit Michel from comment #11)

I agree, there is an engine-side issue here, but I am not able to say what is causing it. I haven't seen anything similar before. Now the only thing we can do is try to reproduce it. Do you have any suggestions how to do it? (Originally by Piotr Kliczewski)

(In reply to Piotr Kliczewski from comment #12)

No idea. Except for those ovf streams (which we don't know how they got into that state) I don't see anything exceptional. Do you think those could be related? How would they block Host<->Engine communication? I assume it's not easy to reproduce, otherwise it would be happening all the time. (Originally by Germano Veit Michel)

I am not sure whether the ovf streams are affecting the communication. For some reason the responses never arrived, and for the installed host we saw that the CONNECT frame was not sent. The ovf stream upload uses an http client and a different physical connection. I do not see any clues in the logs as to what could be going on. Maybe the cause is the same for both. I am not able to come up with steps to reproduce, so I suggest closing it with insufficient data and reopening when we see it one more time. What do you think? (Originally by Piotr Kliczewski)

Any log prints we can add? (Originally by Oved Ourfali)

Based on the logs provided I am not sure what happened.
For sure, logs in debug would be more helpful, but I am not able to specify any place which would be helpful. (Originally by Piotr Kliczewski)

Moran, let's close this bug for now with INSUFFICIENT_DATA, and let's reopen when we get reproducing steps or debug logs. (Originally by Martin Perina)

(In reply to Piotr Kliczewski from comment #14)

Agreed. It would be nice to have some better logging of heartbeat issues though, as these happen quite frequently (most of the time due to network issues) and we are usually in the dark about them. Not sure what could be done. (Originally by Germano Veit Michel)

I am happy to provide more logging around heartbeat issues. Please suggest information that would be helpful and I will make sure it is in. (Originally by Piotr Kliczewski)

(In reply to Piotr Kliczewski from comment #19)

Hi Piotr, I don't know the heartbeat mechanism well enough to comment on this. But if we could have some logging that helps us determine whether the host (or vdsm) stopped the heartbeats, maybe it could help. For example, warnings like: heartbeat not sent within N secs, heartbeat not received in the last N secs, heartbeat thread blocked. Not sure if this makes sense, and I know some of these may be impossible to code, as they are not just logging. (Originally by Germano Veit Michel)

OK, I will add such a log to the code. (Originally by Piotr Kliczewski)

Hi Piotr, the patch is on the 4.1 tree, but there is no Z-Stream clone of this BZ. Is this really going out with 4.1.9? We have a re-occurrence and it would be nice to confirm it; it's a good chance to catch the problem. Also, is it safe to run 4.1.8 with vdsm-jsonrpc-java-1.3.16-1.el7ev? Thank you!
(Originally by Germano Veit Michel)

(In reply to Germano Veit Michel from comment #26)
> The patch is on the 4.1 tree, but there is no Z-Stream clone of this BZ. Is this really going out with 4.1.9?

As I understand it, this is the z-stream bug for 4.1, so there is no need to clone. I think Martin can clarify if I am wrong.

> Also, is it safe to run 4.1.8 with vdsm-jsonrpc-java-1.3.16-1.el7ev?

Yes. There are no API changes, so it is safe. (Originally by Piotr Kliczewski)

Verified on vdsm-jsonrpc-java-1.3.16-1.el7ev.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0131
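The heartbeat visibility discussed in the thread (warn when a heartbeat has not been sent or received within the configured interval) can be sketched roughly as below. This is a minimal, hypothetical illustration of the idea, not the actual vdsm-jsonrpc-java patch shipped in 1.3.16; the `HeartbeatMonitor` class, its method names, and the timeout values are all invented for the example.

```java
// Hypothetical sketch: track when heartbeats were last sent/received and
// produce a warning message once either interval exceeds the timeout.
// Not the real vdsm-jsonrpc-java code; names and thresholds are invented.
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private volatile long lastSentMillis;
    private volatile long lastReceivedMillis;

    public HeartbeatMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastSentMillis = nowMillis;
        this.lastReceivedMillis = nowMillis;
    }

    public void onHeartbeatSent(long nowMillis) {
        lastSentMillis = nowMillis;
    }

    public void onHeartbeatReceived(long nowMillis) {
        lastReceivedMillis = nowMillis;
    }

    // Returns a warning describing the exceeded interval, or null when healthy.
    // A receive timeout is reported first, since it usually means the peer
    // (or the network) is the problem rather than the local sender thread.
    public String check(long nowMillis) {
        long sinceReceived = nowMillis - lastReceivedMillis;
        long sinceSent = nowMillis - lastSentMillis;
        if (sinceReceived > timeoutMillis) {
            return "WARN: no heartbeat received for " + sinceReceived
                    + " ms (timeout " + timeoutMillis + " ms)";
        }
        if (sinceSent > timeoutMillis) {
            return "WARN: no heartbeat sent for " + sinceSent
                    + " ms (timeout " + timeoutMillis + " ms)";
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulated clock in milliseconds; a real caller would poll check()
        // periodically with System.currentTimeMillis().
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000, 0);
        monitor.onHeartbeatReceived(2_000);
        System.out.println(monitor.check(5_000));   // within timeout
        System.out.println(monitor.check(20_000));  // 18 s since last receive
    }
}
```

In a real reactor this check would run on a scheduler alongside the existing heartbeat send/receive handlers, so that a silent peer (such as the host in this bug that completed the TLS handshake but never sent a CONNECT frame) shows up in the log instead of only surfacing later as a higher-level timeout.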