
hanging LVMoISCSISR processes during snapshot operations #285

Closed
loopway opened this issue Oct 2, 2019 · 15 comments
@loopway

loopway commented Oct 2, 2019

We use a simple shell script to create snapshots of our VMs and export them as VM image backups (basically xe vm-snapshot, xe vm-export, xe vm-uninstall). Since migrating to XCP-ng 8 we observe hanging LVMoISCSISR processes and lots of defunct (zombie) copies of the same.
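The loop described above might look roughly like this. This is a hypothetical reconstruction, not the reporter's actual script: the `backup_vm` helper, the `EXPORT_DIR` path and the exact `xe` arguments are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the snapshot/export/uninstall loop described
# above. XE and EXPORT_DIR are assumptions; set XE=echo to dry-run the
# commands instead of executing them against a real pool.
XE="${XE:-xe}"
EXPORT_DIR="${EXPORT_DIR:-/mnt/backup}"

backup_vm() {
    uuid="$1"
    # snapshot the running VM; xe prints the new snapshot's uuid
    snap=$($XE vm-snapshot uuid="$uuid" new-name-label="backup-$uuid")
    # export the snapshot to an .xva archive
    $XE vm-export vm="$snap" filename="$EXPORT_DIR/$uuid.xva"
    # remove the snapshot again so it does not pile up in the SR
    $XE vm-uninstall uuid="$snap" force=true
}
```

Looping over `xe vm-list power-state=running --minimal` and calling `backup_vm` once per UUID would match the one-at-a-time behaviour described later in the thread.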

root     12076 11867  0 16:05 ?        00:00:00 /usr/bin/python /opt/xensource/sm/LVMoISCSISR <methodCall><methodName>vdi_delete</methodName><params><param><value><struct><member><name>host_ref</name><value>OpaqueRef:e954eb4a-e2bc-484c-86f0-7ce98bec882f</value></member><member><name>command</name><value>vdi_delete</value></member><member><name>args</name><value><array><data/></array></value></member><member><name>device_config</name><value><struct><member><name>SRmaster</name><value>true</value></member><member><name>multihomelist</name><value>10.100.0.2:3260,10.5.0.40:3260</value></member><member><name>SCSIid</name><value>36005076300808514700000000000000c</value></member><member><name>targetIQN</name><value>iqn.1986-03.com.ibm:2145.idhv3700.node1</value></member><member><name>target</name><value>10.100.0.2</value></member><member><name>port</name><value>3260</value></member></struct></value></member><member><name>session_ref</name><value>OpaqueRef:aa7492c4-98a0-4868-b9a2-27d637a3fb5f</value></member><member><name>sr_ref</name><value>OpaqueRef:ebe0ec41-e1b6-4551-ba20-f1c79e3bb1c1</value></member><member><name>sr_uuid</name><value>b7e96994-3254-ea40-3a54-b0528d391dfa</value></member><member><name>vdi_ref</name><value>OpaqueRef:ff799859-1000-477a-a5b5-4851e4ca5813</value></member><member><name>vdi_location</name><value>0c44ac53-f357-43b9-bb5f-78de17a1cfab</value></member><member><name>vdi_uuid</name><value>0c44ac53-f357-43b9-bb5f-78de17a1cfab</value></member><member><name>subtask_of</name><value>DummyRef:|2b5276ea-3f02-413f-b3de-c71eb6f93d08|VDI.destroy</value></member><member><name>vdi_on_boot</name><value>persist</value></member><member><name>vdi_allow_caching</name><value>false</value></member></struct></value></param></params></methodCall>
# ps -ef| grep defunct | tail -5
root     32610 11867  0 06:22 ?        00:00:00 [LVMoISCSISR] <defunct>
root     32617 11867  0 00:50 ?        00:00:00 [LVMoISCSISR] <defunct>
root     32673 11867  0 09:34 ?        00:00:00 [LVMoISCSISR] <defunct>
root     32728 11867  0 00:22 ?        00:00:00 [LVMoISCSISR] <defunct>
root     32749 11867  0 05:57 ?        00:00:00 [LVMoISCSISR] <defunct>
# ps -ef| grep defunct | wc -l
1276
  • the backup script finishes without errors
  • we suspect a performance degradation on the xen node running the backups
  • after killing the hanging processes, all the defuncts are gone
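The triage described in the bullets above can be sketched with a few `ps` one-liners (hedged: the process name comes from the report above, and the `kill` step is deliberately left commented out):

```shell
# Count defunct (zombie) children by process state "Z"; awk exits 0
# even when nothing matches, so this is safe to run anywhere.
ps -eo stat,comm | awk '$1 ~ /^Z/ && $2 == "LVMoISCSISR"' | wc -l

# Show the still-running (hanging) LVMoISCSISR workers with their age.
# The [L] trick keeps grep from matching itself; "|| true" because grep
# exits non-zero when there are no matches.
ps -eo pid,etime,args | grep '[L]VMoISCSISR' || true

# Once a hanging worker is killed, its defunct children get reaped:
# kill <pid>          # left commented out on purpose
```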
@olivierlambert
Member

olivierlambert commented Oct 2, 2019

Are you snapshotting all the VMs at once? This is something we have avoided in Xen Orchestra for a while now, because all LVM-based SRs (local LVM, HBA, LVMoiSCSI) are prone to race conditions. So XO does snapshots 2 by 2.
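The "2 by 2" idea, i.e. capping snapshot concurrency at two, can be sketched with `xargs -P` (the VM names and the `echo` placeholder standing in for the real `xe vm-snapshot` call are illustrative assumptions):

```shell
# xargs -P 2 keeps at most two snapshot jobs running at once; each new
# job starts as soon as a slot frees up. 'echo' is a stand-in for the
# real snapshot command (an assumption for illustration).
printf '%s\n' vm-a vm-b vm-c vm-d |
    xargs -n 1 -P 2 sh -c 'echo "snapshotting $0"'
```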

I would suggest using XO, or improving your script to avoid this. Another alternative is to use a file-based SR, which is far more stable under those conditions.

@loopway
Author

loopway commented Oct 2, 2019

We loop over all running VMs and only do one at a time, so hopefully that avoids any race conditions.

The script has been in use since XenServer 6.x, but we have only been noticing this behaviour since we upgraded from XenServer 7.3 to XCP-ng 8. Could this be a bug?

@olivierlambert
Member

We aren't aware of a behavior change. It might be that your iSCSI configuration was reset by the upgrade, e.g. timeouts or maximum outstanding requests. Double-check those values there and adjust them.

We strongly suggest switching to NFS if you can.
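The iSCSI knobs worth double-checking live in `/etc/iscsi/iscsid.conf` on the host. A hedged sketch of the settings commonly involved (the values shown are illustrative open-iscsi defaults, not recommendations; what matters is comparing them against the pre-upgrade host):

```
# /etc/iscsi/iscsid.conf (excerpt) -- illustrative values only
# How long the kernel queues I/O while a dead connection is re-established
node.session.timeo.replacement_timeout = 120
# NOP-Out ping interval/timeout used to detect a stuck target
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
# Maximum outstanding commands / queue depth per session
node.session.cmds_max = 128
node.session.queue_depth = 32
```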

@loopway
Author

loopway commented Oct 7, 2019

Changing from iSCSI to NFS does not seem like a solution to the problem, no offence :) Our storage subsystem does not support it anyway.

I found a similar issue here:
NAUbackup/VmBackup#91

I still suspect it to be a bug.

@olivierlambert
Member

olivierlambert commented Oct 7, 2019

Switching to NFS is a solution to get rid of the mess that is how SMAPIv1 is written and how it deals with iSCSI. We have almost zero support tickets in XCP-ng Support related to NFS, unlike the steady flow of people having problems on LVMoiSCSI. They probably introduced an even worse bug, but since this shitty code is full of race conditions everywhere, it might be anything.

If @BogdanRudas wants to share the fix from Citrix with us, we could probably apply the patch on your XCP-ng install and see the result.

@BogdanRudas

Hi!

There is XS71ECU2019 for the zombies; however, coalescing is still bad. I guess every one of those zombies is caused by a failed coalescing attempt.

https://support.citrix.com/article/CTX262019

@nagilum99

XS71ECU2019 is already on the bugtracker - there are at least two of us for whom the patch isn't helping!
https://bugs.xenserver.org/browse/XSO-966

@olivierlambert
Member

Watching the ticket, let's see how it goes.

@BogdanRudas

BogdanRudas commented Oct 7, 2019

@nagilum99 I had some hosts die because zombies exhausted the PID limit in dom0 (32k), so the patch is still better than nothing. However, I still have to vm-pause certain VMs to let coalescing do its job on an underloaded SSD RAID. I'm fairly sure things were still fine at the level of XS71ECU2007.

@nagilum99

nagilum99 commented Oct 7, 2019

It did not help at all over here, so for us this patch is pretty useless. We constantly have one vhd-util process running and dying after 5 minutes, 24/7, always at 100 %. We have been waiting a while now for support to call back; creating the ticket at night was probably only a medium-good idea.

@BogdanRudas

BogdanRudas commented Oct 8, 2019

@nagilum99 I have a call with a Citrix support rep regarding the coalescing issues and will have another case # for that. As I understand it, the current implementation can't handle a moderate write workload during coalescing. So for our current 7.6 installations (both Citrix and XCP-ng) I'm going to stay on the current version unless some improvement comes from either side.

@olivierlambert
Member

Can you confirm this issue is only on 8.0 and not on 7.6?

@stormi
Member

stormi commented Nov 22, 2019

See also #298

We released an update for 8.0 which solves most of the issue, though the last element of the leaf chain remains. It's based on this patch, which is the same patch that was released by Citrix in the latest 7.1 LTS hotfix.

@loopway
Author

loopway commented Dec 26, 2019

After installing the latest XCP-ng 8.0 RPMs I can confirm we have no more LVMoISCSISR zombies.

@stormi
Member

stormi commented Apr 20, 2020

Closing as per last comment.

@stormi stormi closed this as completed Apr 20, 2020