Attach VM root volumes as disk devices #14532

Open · wants to merge 19 commits into base: main from jira-2160/attach-vm-root-as-disk

Conversation


@MggMuggins MggMuggins commented Nov 26, 2024

Includes commits from #14491

This PR enables attaching a virtual machine's root storage volume to another virtual machine via a disk device:

# vm2 yaml
...
devices:
  v1-root:
    type: disk
    pool: default
    source: virtual-machine/vm1

This has some constraints because simultaneous access to storage volumes with content-type block is unsafe:

  • vm1's root volume can be attached to exactly one other instance if vm1 has security.protection.start: true
  • vm1's root volume can be attached to any number of other instances if the storage volume virtual-machine/vm1 has security.shared: true

security.protection.start is recommended for interactive use, e.g. when a user temporarily needs to access a bricked machine's root volume to fix it or recover data. security.shared can be used if more than one running instance must have access to the block volume.
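For reference, the interactive case with the CLI should look roughly like this (the instance, pool and device names are placeholders; the source=virtual-machine/<name> form is the syntax this PR adds):

# Stop the broken VM and keep it from starting while its root volume is borrowed
lxc stop vm1
lxc config set vm1 security.protection.start=true

# Attach vm1's root volume to vm2 as an extra disk; hotplugging onto a running vm2
# also avoids the boot-time UUID/LABEL collisions noted in the TODO below
lxc config device add vm2 v1-root disk pool=default source=virtual-machine/vm1

# Alternatively, allow concurrent attachment instead of protecting vm1:
# lxc storage volume set default virtual-machine/vm1 security.shared=true

# When done, detach and restore vm1
lxc config device remove vm2 v1-root
lxc config unset vm1 security.protection.start
lxc start vm1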

TODO

  • Docs
  • lxd-ci tests to validate attachments to running VMs (done)
  • Boot-order bug (see comment below); the current workaround is to hotplug disk devices to avoid UUID/LABEL collisions at boot time.

@github-actions github-actions bot added the Documentation (Documentation needs updating) label Nov 26, 2024

MggMuggins commented Nov 26, 2024

Currently, when booting a VM with two root disk devices attached, the kernel chooses which partitions to mount at /boot/efi, /boot, and / seemingly at random. I've seen it pick the correct disk for all three partitions, the incorrect disk for all three partitions, and any combination of the three:

# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   10G  0 disk
├─sda1    8:1    0    9G  0 part /
├─sda14   8:14   0    4M  0 part
├─sda15   8:15   0  106M  0 part
└─sda16 259:0    0  913M  0 part
sdb       8:16   0   10G  0 disk
├─sdb1    8:17   0    9G  0 part
├─sdb14   8:30   0    4M  0 part
├─sdb15   8:31   0  106M  0 part /boot/efi
└─sdb16 259:1    0  913M  0 part /boot

# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   10G  0 disk
├─sda1    8:1    0    9G  0 part /
├─sda14   8:14   0    4M  0 part
├─sda15   8:15   0  106M  0 part
└─sda16 259:0    0  913M  0 part /boot
sdb       8:16   0   10G  0 disk
├─sdb1    8:17   0  2.1G  0 part
├─sdb14   8:30   0    4M  0 part
└─sdb15   8:31   0  106M  0 part /boot/efi

# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   10G  0 disk
├─sda1    8:1    0    9G  0 part
├─sda14   8:14   0    4M  0 part
├─sda15   8:15   0  106M  0 part
└─sda16 259:1    0  913M  0 part
sdb       8:16   0   10G  0 disk
├─sdb1    8:17   0    9G  0 part /
├─sdb14   8:30   0    4M  0 part
├─sdb15   8:31   0  106M  0 part /boot/efi
└─sdb16 259:0    0  913M  0 part /boot

# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   10G  0 disk
├─sda1    8:1    0    9G  0 part
├─sda14   8:14   0    4M  0 part
├─sda15   8:15   0  106M  0 part /boot/efi
└─sda16 259:1    0  913M  0 part
sdb       8:16   0   10G  0 disk
├─sdb1    8:17   0    9G  0 part /
├─sdb14   8:30   0    4M  0 part
├─sdb15   8:31   0  106M  0 part
└─sdb16 259:0    0  913M  0 part /boot

Still investigating the cause.

EDIT: The above scenarios all occurred with ubuntu:noble
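In case it helps with reproducing this: the extra lsblk columns below make it easier to tell the two backing disks apart despite the identical partition layouts (whether the virtual disks expose distinct serials is an assumption on my part):

lsblk -o NAME,SIZE,SERIAL,UUID,PARTUUID,MOUNTPOINTS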

@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch from 1a573ed to 679879e on November 26, 2024 20:38

MggMuggins commented Nov 26, 2024

A little more info on the boot-order bug; this is worse than I thought. When two VMs are launched from the same cloud image, the UUIDs for the filesystems and partitions are all the same:

root@vm1:~# efibootmgr -v
BootCurrent: 0001
Timeout: 0 seconds
BootOrder: 0007,0001,0003,0004,0005,0006,0000,0002
Boot0000* UiApp	FvVol(7cb8bdc9-f8eb-4f34-aaea-3ee4af6516a1)/FvFile(462caa21-7614-4503-836e-8ab6f4662331)
Boot0001* UEFI QEMU QEMU HARDDISK 	PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/SCSI(0,1)N.....YM....R,Y.
Boot0002* EFI Internal Shell	FvVol(7cb8bdc9-f8eb-4f34-aaea-3ee4af6516a1)/FvFile(7c04a583-9e3e-4f1c-ad65-e05268d0b4d1)
Boot0003* UEFI PXEv4 (MAC:00163EF03AE4)	PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163ef03ae4,1)/IPv4(0.0.0.00.0.0.0,0,0)N.....YM....R,Y.
Boot0004* UEFI PXEv6 (MAC:00163EF03AE4)	PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163ef03ae4,1)/IPv6([::]:<->[::]:,0,0)N.....YM....R,Y.
Boot0005* UEFI HTTPv4 (MAC:00163EF03AE4)	PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163ef03ae4,1)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()N.....YM....R,Y.
Boot0006* UEFI HTTPv6 (MAC:00163EF03AE4)	PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163ef03ae4,1)/IPv6([::]:<->[::]:,0,0)/Uri()N.....YM....R,Y.
Boot0007* ubuntu	HD(15,GPT,6714cd0e-2211-4b5f-8daa-341fcbae2865,0x2800,0x35000)/File(\EFI\ubuntu\shimx64.efi)
root@vm1:~# blkid
/dev/sda15: LABEL_FATBOOT="UEFI" LABEL="UEFI" UUID="F1D8-37B4" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="6714cd0e-2211-4b5f-8daa-341fcbae2865"
/dev/sda1: LABEL="cloudimg-rootfs" UUID="fec1c9ae-0df3-419c-80dd-f3035049b845" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="40c8f90e-558a-4475-b818-ec6e9e5d02fb"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop0: TYPE="squashfs"
/dev/sda14: PARTUUID="69646288-a3aa-469f-96fc-0782de904d84"
root@vm2:~# efibootmgr -v
BootCurrent: 0001
Timeout: 0 seconds
BootOrder: 0007,0001,0003,0004,0005,0006,0000,0002
Boot0000* UiApp	FvVol(7cb8bdc9-f8eb-4f34-aaea-3ee4af6516a1)/FvFile(462caa21-7614-4503-836e-8ab6f4662331)
Boot0001* UEFI QEMU QEMU HARDDISK 	PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/SCSI(0,1)N.....YM....R,Y.
Boot0002* EFI Internal Shell	FvVol(7cb8bdc9-f8eb-4f34-aaea-3ee4af6516a1)/FvFile(7c04a583-9e3e-4f1c-ad65-e05268d0b4d1)
Boot0003* UEFI PXEv4 (MAC:00163EA0318E)	PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163ea0318e,1)/IPv4(0.0.0.00.0.0.0,0,0)N.....YM....R,Y.
Boot0004* UEFI PXEv6 (MAC:00163EA0318E)	PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163ea0318e,1)/IPv6([::]:<->[::]:,0,0)N.....YM....R,Y.
Boot0005* UEFI HTTPv4 (MAC:00163EA0318E)	PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163ea0318e,1)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()N.....YM....R,Y.
Boot0006* UEFI HTTPv6 (MAC:00163EA0318E)	PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163ea0318e,1)/IPv6([::]:<->[::]:,0,0)/Uri()N.....YM....R,Y.
Boot0007* ubuntu	HD(15,GPT,6714cd0e-2211-4b5f-8daa-341fcbae2865,0x2800,0x35000)/File(\EFI\ubuntu\shimx64.efi)
root@vm2:~# blkid
/dev/sda15: LABEL_FATBOOT="UEFI" LABEL="UEFI" UUID="F1D8-37B4" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="6714cd0e-2211-4b5f-8daa-341fcbae2865"
/dev/sda1: LABEL="cloudimg-rootfs" UUID="fec1c9ae-0df3-419c-80dd-f3035049b845" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="40c8f90e-558a-4475-b818-ec6e9e5d02fb"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop0: TYPE="squashfs"
/dev/sda14: PARTUUID="69646288-a3aa-469f-96fc-0782de904d84"

The root partition's PARTUUID is passed via the kernel cmdline on Jammy:

# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-1067-kvm root=PARTUUID=40c8f90e-558a-4475-b818-ec6e9e5d02fb ro console=tty1 console=ttyS0

But ubuntu:noble images use the filesystem label instead of the UUID:

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.0-49-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0

TL;DR: as implemented here, VM root volume attachment is only safe between machines deployed from different cloud images, with Jammy or older as the recovery machine. I'm working on figuring out how (or if) the public clouds get around this.

@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch from 679879e to 0c3707d on November 27, 2024 22:01

MggMuggins commented Nov 28, 2024

I've done some more reading:

  • OpenStack VMs behave exactly the same as LXD: two VMs created from the same image have the same partition/filesystem UUIDs. OpenStack doesn't support removing boot disks from instances at all; there is a spec and a partial implementation that was abandoned in 2018.
  • I haven't looked very hard at GCP, but the docs are pretty thin on attaching boot disks to other instances; this implies that an instance can only have one boot disk at a time. It's unclear whether "boot disk" is an immutable property of disk devices or something else; I have little to no experience with GCP.
  • The workaround documented in the AWS docs is likely to work for the root device (rough sketch below). It looks to me like the EFI boot entries in the instance's NVRAM use the UUID of /boot/efi, so changing labels will only force the kernel to select the correct root partition. I'm not sure this is very viable, since one of the primary use cases for this feature is "the kernel is broken and the system won't boot" and the kernel executable is stored in /boot/efi.
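A rough sketch of that relabel idea, run from inside the recovery VM against the attached copy (the device names and new labels are assumptions, this doesn't touch the EFI boot entries, and the labels/UUIDs would presumably need restoring before vm1 boots from this volume again):

# assuming vm1's root volume shows up as /dev/sdb in the recovery VM
tune2fs -U random /dev/sdb1           # new filesystem UUID for the ext4 root
e2label /dev/sdb1 recovery-rootfs     # avoid the LABEL=cloudimg-rootfs collision on noble
fatlabel /dev/sdb15 RECOVERY          # relabel the vfat ESP
sgdisk --partition-guid=1:R /dev/sdb  # randomize the root partition's PARTUUID (jammy's root=PARTUUID=...)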

I need to do some more reading on how we control the UEFI boot entries in LXD. More Friday.

@MggMuggins

In OpenStack, the procedure for getting access to another VM's root volume is to create an image from the VM, create a volume from the image, and then attach the volume to another VM:

os server image create --name vm1_snap0 vm1
os volume create --image vm1_snap0 vm1_snap0_image --size 20
os server add volume vm2 vm1_snap0_image --device /dev/vdb

This does suffer from the same duplicate UUID problem as described above. A VM can then be created from the modified volume (see below).
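For completeness, booting a new server from the modified volume would be something along these lines (the flavor and network names here are placeholders):

os server create --volume vm1_snap0_image --flavor m1.small --network net0 vm1-recovered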

@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch 3 times, most recently from 5ec4d3b to d3143c7 on December 2, 2024 22:21
@github-actions github-actions bot added the API (Changes to the REST API) label Dec 2, 2024
@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch from d3143c7 to 094830b on December 2, 2024 22:46
@MggMuggins MggMuggins marked this pull request as ready for review December 2, 2024 23:51
@tomponline tomponline marked this pull request as draft December 3, 2024 09:14
@tomponline

Converting to draft until the TODO items are marked as completed.

@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch from 094830b to 729f788 on December 4, 2024 03:32
@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch from 729f788 to 9c9ac1f on December 5, 2024 00:09
@MggMuggins MggMuggins marked this pull request as ready for review December 5, 2024 00:28
@MggMuggins MggMuggins requested a review from tomponline December 5, 2024 00:29
@tomponline

Please can you rebase?

lxc/storage_volume.go (outdated, resolved)
@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch from 9c9ac1f to 2f64287 on December 9, 2024 22:00
@MggMuggins

Rebase is complete, @tomponline.

@@ -122,7 +122,7 @@ Storage volumes can be of the following types:
: LXD automatically creates one of these storage volumes when you launch an instance.
It is used as the root disk for the instance, and it is destroyed when the instance is deleted.

- This storage volume is created in the storage pool that is specified in the profile used when launching the instance (or the default profile, if no profile is specified).
+ This storage volume is created in the storage pool that is specified when launching the instance (or in the default profile, if no pool or profile is specified).
Suggested change
This storage volume is created in the storage pool that is specified when launching the instance (or in the default profile, if no pool or profile is specified).
This storage volume is created in the storage pool that is specified when launching the instance. This pool can be specified explicitly or by specifying a profile. If no pool or profile is specified, LXD uses the storage pool of the default profile.

@minaelee minaelee Dec 10, 2024

I find this suggested version to be easier to understand because it explains that the storage pool can be specified explicitly or via the profile. The current version only implies (in "if no pool or profile is specified") that a profile can be used to specify the pool.

Without this edit, it's also difficult to parse which clause the parenthesized content modifies--that is, what does the part following "or" replace? I know, only after reading it a couple of times, that it's meant to replace "when launching the instance". However, at first glance, it's natural to parse it as "This storage volume is created in the default profile, if no pool or profile is specified" due to the symmetry of "in the storage pool" and "in the default profile", and that doesn't make sense.

@MggMuggins MggMuggins Dec 12, 2024

I totally agree re: clarity. Having looked at this again today, I'm not sure this is the right spot in the docs to nail down the semantics of which pool is selected for a root disk device at all; the section above is probably more appropriate. I've pared this down some (necessarily removing some nuance). Let me know what you think.

I'm wondering if the relationship between the root disk device and the container/virtual-machine volume type is a bit of a docs gap; figuring out the relationship between those two took me some time. Probably out of scope for this PR.

@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch from 2f64287 to 0229c19 on December 12, 2024 00:23
Signed-off-by: Wesley Hershberger <[email protected]>
These keys can only be live-updated on containers

Signed-off-by: Wesley Hershberger <[email protected]>
Signed-off-by: Wesley Hershberger <[email protected]>
virtual-machine/container volumes in the default project do not include
the project name.

Signed-off-by: Wesley Hershberger <[email protected]>
The change to the source property makes `vol1` and `custom/vol1` semantically
identical even though they are not syntactically identical. It's not
correct to simply compare the strings anymore.

This is the only instance of this comparison in the lxc and client packages.

We also don't need to check the volume type as an incorrect volume type
will just give a "No device found for this storage volume" error.

Signed-off-by: Wesley Hershberger <[email protected]>
…tart

This allows a VM root disk to be attached to another instance without
setting `security.shared`.

If we only allow VM roots to be attached when security.shared is set on
the volume, it makes it possible to forget to unset security.shared when
the volume is detached. Forgetting to unset security.protection.start is
harder :)

Signed-off-by: Wesley Hershberger <[email protected]>
I'm having a hard time coming up with a scenario where this would be
desirable.

Signed-off-by: Wesley Hershberger <[email protected]>
We can no longer short-circuit here because a VM's root disk might be
attached to another instance.

I fixed this proactively for containers as well, but it does incur
a performance penalty.

Signed-off-by: Wesley Hershberger <[email protected]>
...from a virtual machine when the VM's root disk device is attached to
another instance.

This works when the key is set on a profile or instance, since it checks
the expanded config.

Signed-off-by: Wesley Hershberger <[email protected]>
Will allow us to check when updating `virtual-machine` volumes

Signed-off-by: Wesley Hershberger <[email protected]>
If a virtual-machine volume is attached to more than one instance, don't
allow removing security.shared.

Signed-off-by: Wesley Hershberger <[email protected]>
Signed-off-by: Wesley Hershberger <[email protected]>
@MggMuggins MggMuggins force-pushed the jira-2160/attach-vm-root-as-disk branch from 0229c19 to a4ebc1e on December 12, 2024 00:37
@hamistao hamistao left a comment

I really like your idea of using security.protection.start, good one!

Sorry for so many comments, hope they aren't too much of a pain to address.

#14593 has been merged so you can rebase.

Lastly, I was wondering if we should now check for the volume type here and here.

@@ -206,7 +206,6 @@ func DiskVolumeSourceParse(source string) (volType drivers.VolumeType, dbVolType
case cluster.StoragePoolVolumeTypeNameContainer:
err = errors.New("Using container storage volumes is not supported")
case cluster.StoragePoolVolumeTypeNameVM:
Suggested change
case cluster.StoragePoolVolumeTypeNameVM:
case cluster.StoragePoolVolumeTypeNameVM, cluster.StoragePoolVolumeTypeNameCustom:

This can be simplified as such, while removing the case cluster.StoragePoolVolumeTypeNameCustom: a few lines below.

@@ -971,6 +971,40 @@ func InstanceContentType(inst instance.Instance) drivers.ContentType {
return contentType
}

// volumeIsUsedByDevice
Is this comment incomplete?

Comment on lines +2539 to +2544
## `instance_root_volume_attachment`

Adds support for instance root volumes to be attached to other instances as disk
If I understood the scope of this PR correctly, this should say virtual machines instead of instances, as container volumes won't be available for attaching.

Comment on lines +986 to +992
rootVolumeType := cluster.StoragePoolVolumeTypeNameContainer
if inst.Type == instancetype.VM {
rootVolumeType = cluster.StoragePoolVolumeTypeNameVM
}

if inst.Name == vol.Name && rootVolumeType == vol.Type {
return true, nil
You could use something like

volumeType, err := InstanceTypeToVolumeType(inst.Type)
if err != nil {
    return false, err
}

volumeDBType, _ := VolumeTypeToDBType(volumeType)

and compare using cluster.StoragePoolVolumeTypeNames[volumeDBType]

@@ -257,7 +259,7 @@ func (c *cmdStorageVolumeAttach) run(cmd *cobra.Command, args []string) error {
device := map[string]string{
"type": "disk",
"pool": resource.name,
"source": volName,
"source": args[1],
What if the user provides just the volume name without the type?
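For context, both of these forms are accepted by the CLI today, so args[1] may or may not carry the type prefix (the pool, volume, instance and path names are placeholders):

lxc storage volume attach default custom/vol1 c1 /mnt/vol1
lxc storage volume attach default vol1 c1 /mnt/vol1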

}

apiInst.ExpandedConfig = instancetype.ExpandInstanceConfig(d.state.GlobalConfig.Dump(), apiInst.Config, inst.Profiles)

I think we could put a comment here explaining why security.protection.start makes it okay to attach the root volume to another instance.

Comment on lines +2923 to +2929
if shared.IsFalseOrEmpty(changedConfig["security.shared"]) && volDBType == cluster.StoragePoolVolumeTypeVM {
err = allowRemoveSecurityShared(b.state, inst.Project().Name, &curVol.StorageVolume)
if err != nil {
return err
}
}

For VM volume sharing, we should also consider the usage of security.protection.start here, right?
Say a VM volume is attached to more than one instance but one of them has security.protection.start enabled; then we should allow unsetting security.shared, isn't that correct?

Comment on lines -89 to -98
// Handle instance volumes.
if vol.Type == cluster.StoragePoolVolumeTypeNameContainer || vol.Type == cluster.StoragePoolVolumeTypeNameVM {
volName, snapName, isSnap := api.GetParentAndSnapshotName(vol.Name)
if isSnap {
return []string{api.NewURL().Path(version.APIVersion, "instances", volName, "snapshots", snapName).Project(vol.Project).String()}, nil
}

return []string{api.NewURL().Path(version.APIVersion, "instances", volName).Project(vol.Project).String()}, nil
}

For containers, doesn't it make sense to return early?

You are claiming it doesn't affect performance, but how can that be if we are running VolumeUsedByInstanceDevices and VolumeUsedByProfileDevices unnecessarily for containers?

I may have missed something here; if so, sorry about this.

Comment on lines +133 to +138
// If vol is the instance's root volume and it is defined in a profile,
// it won't be added to the list by VolumeUsedByInstanceDevices.
instancePath := api.NewURL().Path(version.APIVersion, "instances", volName).Project(vol.Project).String()
if !slices.Contains(volumeUsedBy, instancePath) {
volumeUsedBy = append(volumeUsedBy, instancePath)
}
If the volume is being used in a profile device as an additional drive, instances with this device won't be added to volumeUsedBy, right?

This is probably expected anyway; I'm asking more for my own understanding.

return err
}
}

// Load storage volume from database.
dbVol, err := VolumeDBGet(b, inst.Project().Name, inst.Name(), volType)
Not related to your changes, but are we getting the same volume from the database again?

On line 2895 we have curVol, err := VolumeDBGet(b, inst.Project().Name, inst.Name(), volType)
