
hanging LVMoISCSISR processes during snapshot operations #285

Closed
loopway opened this issue Oct 2, 2019 · 15 comments
@loopway

loopway commented Oct 2, 2019

We use a simple shell script to create snapshots of our VMs and export them as VM image backups (basically xe vm-snapshot, xe vm-export, xe vm-uninstall). Since migrating to XCP-ng 8 we observe hanging LVMoISCSISR processes and lots of defunct (zombie) copies of the same.
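The loop described above might look roughly like this. This is a hypothetical reconstruction, not the reporter's actual script: the `backup_vm` helper, the `EXPORT_DIR` path and the exact `xe` arguments are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the snapshot/export/uninstall loop described
# above. XE and EXPORT_DIR are assumptions; set XE=echo to dry-run the
# commands instead of executing them against a real pool.
XE="${XE:-xe}"
EXPORT_DIR="${EXPORT_DIR:-/mnt/backup}"

backup_vm() {
    uuid="$1"
    # snapshot the running VM; xe prints the new snapshot's uuid
    snap=$($XE vm-snapshot uuid="$uuid" new-name-label="backup-$uuid")
    # export the snapshot to an .xva archive
    $XE vm-export vm="$snap" filename="$EXPORT_DIR/$uuid.xva"
    # remove the snapshot again so it does not pile up in the SR
    $XE vm-uninstall uuid="$snap" force=true
}
```

Looping over `xe vm-list power-state=running --minimal` and calling `backup_vm` once per UUID would match the one-at-a-time behaviour described later in the thread.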

root     12076 11867  0 16:05 ?        00:00:00 /usr/bin/python /opt/xensource/sm/LVMoISCSISR <methodCall><methodName>vdi_delete</methodName><params><param><value><struct><member><name>host_ref</name><value>OpaqueRef:e954eb4a-e2bc-484c-86f0-7ce98bec882f</value></member><member><name>command</name><value>vdi_delete</value></member><member><name>args</name><value><array><data/></array></value></member><member><name>device_config</name><value><struct><member><name>SRmaster</name><value>true</value></member><member><name>multihomelist</name><value>10.100.0.2:3260,10.5.0.40:3260</value></member><member><name>SCSIid</name><value>36005076300808514700000000000000c</value></member><member><name>targetIQN</name><value>iqn.1986-03.com.ibm:2145.idhv3700.node1</value></member><member><name>target</name><value>10.100.0.2</value></member><member><name>port</name><value>3260</value></member></struct></value></member><member><name>session_ref</name><value>OpaqueRef:aa7492c4-98a0-4868-b9a2-27d637a3fb5f</value></member><member><name>sr_ref</name><value>OpaqueRef:ebe0ec41-e1b6-4551-ba20-f1c79e3bb1c1</value></member><member><name>sr_uuid</name><value>b7e96994-3254-ea40-3a54-b0528d391dfa</value></member><member><name>vdi_ref</name><value>OpaqueRef:ff799859-1000-477a-a5b5-4851e4ca5813</value></member><member><name>vdi_location</name><value>0c44ac53-f357-43b9-bb5f-78de17a1cfab</value></member><member><name>vdi_uuid</name><value>0c44ac53-f357-43b9-bb5f-78de17a1cfab</value></member><member><name>subtask_of</name><value>DummyRef:|2b5276ea-3f02-413f-b3de-c71eb6f93d08|VDI.destroy</value></member><member><name>vdi_on_boot</name><value>persist</value></member><member><name>vdi_allow_caching</name><value>false</value></member></struct></value></param></params></methodCall>
# ps -ef| grep defunct | tail -5
root     32610 11867  0 06:22 ?        00:00:00 [LVMoISCSISR] <defunct>
root     32617 11867  0 00:50 ?        00:00:00 [LVMoISCSISR] <defunct>
root     32673 11867  0 09:34 ?        00:00:00 [LVMoISCSISR] <defunct>
root     32728 11867  0 00:22 ?        00:00:00 [LVMoISCSISR] <defunct>
root     32749 11867  0 05:57 ?        00:00:00 [LVMoISCSISR] <defunct>
# ps -ef| grep defunct | wc -l
1276
  • the backup script finishes without errors
  • we suspect a performance degradation on the xen node running the backups
  • after killing the hanging processes, all the defuncts are gone
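The triage described in the bullets above can be sketched with a few `ps` one-liners (hedged: the process name comes from the report above, and the `kill` step is deliberately left commented out):

```shell
# Count defunct (zombie) children by process state "Z"; awk exits 0
# even when nothing matches, so this is safe to run anywhere.
ps -eo stat,comm | awk '$1 ~ /^Z/ && $2 == "LVMoISCSISR"' | wc -l

# Show the still-running (hanging) LVMoISCSISR workers with their age.
# The [L] trick keeps grep from matching itself; "|| true" because grep
# exits non-zero when there are no matches.
ps -eo pid,etime,args | grep '[L]VMoISCSISR' || true

# Once a hanging worker is killed, its defunct children get reaped:
# kill <pid>          # left commented out on purpose
```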
@olivierlambert
Member

olivierlambert commented Oct 2, 2019

Are you snapshotting all the VMs at once? This is something we have avoided in Xen Orchestra for a while now, because all LVM-based SRs (local LVM, HBA, LVMoiSCSI) are prone to race conditions. So XO does snapshots 2 by 2.
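The "2 by 2" idea, i.e. capping snapshot concurrency at two, can be sketched with `xargs -P` (the VM names and the `echo` placeholder standing in for the real `xe vm-snapshot` call are illustrative assumptions):

```shell
# xargs -P 2 keeps at most two snapshot jobs running at once; each new
# job starts as soon as a slot frees up. 'echo' is a stand-in for the
# real snapshot command (an assumption for illustration).
printf '%s\n' vm-a vm-b vm-c vm-d |
    xargs -n 1 -P 2 sh -c 'echo "snapshotting $0"'
```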

I would suggest using XO, or improving your script to avoid this. Another alternative is to use a file-based SR, which is far more stable under those conditions.

@loopway
Author

loopway commented Oct 2, 2019

We loop over all running VMs and only do one at a time, so hopefully that avoids any race conditions.

The script has been in use since XenServer 6.x, but we have only been noticing this behaviour since we upgraded from XenServer 7.3 to XCP-ng 8. Could this be a bug?

@olivierlambert
Member

We aren't aware of a behavior change. It might be that your iSCSI configuration was reset by the upgrade, e.g. timeouts or maximum outstanding requests. Double-check those values there and adjust them.

We strongly suggest switching to NFS if you can.
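The iSCSI knobs worth double-checking live in `/etc/iscsi/iscsid.conf` on the host. A hedged sketch of the settings commonly involved (the values shown are illustrative open-iscsi defaults, not recommendations; what matters is comparing them against the pre-upgrade host):

```
# /etc/iscsi/iscsid.conf (excerpt) -- illustrative values only
# How long the kernel queues I/O while a dead connection is re-established
node.session.timeo.replacement_timeout = 120
# NOP-Out ping interval/timeout used to detect a stuck target
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
# Maximum outstanding commands / queue depth per session
node.session.cmds_max = 128
node.session.queue_depth = 32
```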

@loopway
Author

loopway commented Oct 7, 2019

Changing from iSCSI to NFS does not seem like a solution to the problem, no offence :) Our storage subsystem does not support it anyway.

I found a similar issue here:
NAUbackup/VmBackup#91

I still suspect it to be a bug.

@olivierlambert
Member

olivierlambert commented Oct 7, 2019

Switching to NFS is a solution to get rid of the mess that is how SMAPIv1 is written and how it deals with iSCSI. We have almost zero support tickets in XCP-ng Support related to NFS, unlike the steady flow of people having problems on LVMoiSCSI. They probably introduced an even worse bug, but since this shitty code is full of race conditions everywhere, it might be anything.

If @BogdanRudas wants to share the fix from Citrix with us, we could probably apply the patch on your XCP-ng install and see the result.

@BogdanRudas

Hi!

There is XS71ECU2019 for the zombies; however, coalescing is still bad. I guess every one of those zombies is caused by a failed coalescing attempt.

https://support.citrix.com/article/CTX262019

@nagilum99

XS71ECU2019 is already on the bugtracker - there are at least two of us for whom the patch isn't helping!
https://bugs.xenserver.org/browse/XSO-966

@olivierlambert
Member

Watching the ticket, let's see how it goes.

@BogdanRudas

BogdanRudas commented Oct 7, 2019

@nagilum99 I had some hosts die because zombies exhausted the PID limit in dom0 (32k), so the patch is still better than nothing. However, I still have to vm-pause certain VMs to let coalescing do its job on an underloaded SSD RAID. I'm fairly sure things were still fine at the level of XS71ECU2007.

@nagilum99

nagilum99 commented Oct 7, 2019

It did not help at all over here, so for us this patch is pretty useless. We constantly have one vhd-util process running and dying after 5 minutes, 24/7, always at 100 %. We have been waiting a while now for support to call back; creating the ticket at night was probably only a medium-good idea.

@BogdanRudas

BogdanRudas commented Oct 8, 2019

@nagilum99 I have a call with a Citrix support rep regarding the coalescing issues and will have another case # for that. As I understand it, the current implementation can't handle a moderate write workload during coalescing. So for our current 7.6 installations (both Citrix and XCP-ng) I'm going to stay on the current version unless some improvement comes from either side.

@olivierlambert
Member

Can you confirm this issue is only on 8.0 and not on 7.6?

@stormi
Member

stormi commented Nov 22, 2019

See also #298

We released an update for 8.0 which solves most of the issue, though the last element of the leaf chain remains. It's based on this patch, which is the same patch that was released by Citrix in the latest 7.1 LTS hotfix.

@loopway
Author

loopway commented Dec 26, 2019

After installing the latest XCP-ng 8.0 RPMs I can confirm we have no more LVMoISCSISR zombies.

@stormi
Member

stormi commented Apr 20, 2020

Closing as per last comment.

@stormi stormi closed this as completed Apr 20, 2020