[Spike] Can crc use krunkit? #4233

Closed
cfergeau opened this issue Jun 14, 2024 · 18 comments
Assignees
Labels
kind/epic Large chunk of work os/macos

Comments

@cfergeau
Contributor

krunkit is a drop-in replacement for vfkit from a cmdline argument point of view. podman-machine can make use of it, see https://docs.google.com/document/d/1IZCWAY5zMHqd0YlbnpGtCe7HNeWKQNHi8RuhujAJmg0/edit for some details.

Since it has additional features compared to vfkit, it would be interesting to know if crc can make use of it.

In order to test krunkit + crc, a few steps that come to mind:

  • krunkit will need to be installed from brew and functional, and symlinked to ~/.crc/bin
  • checkVfkitInstalled in preflight_checks_darwin.go needs to be skipped or adjusted as it contains a vfkit version check which likely won't work with krunkit (different version number). There must be a crc config set skip-xxxx option to avoid this code
  • maybe the NewVfkitCache code and related methods in cache_darwin.go will need to be changed (but I think this code won't be run during testing).
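For anyone reproducing the first step, here is a minimal sketch of the symlinking. It is only an illustration: `link_krunkit` and its parameters are hypothetical helpers, not part of crc, and the name the link should carry inside `~/.crc/bin` is an assumption (the list above does not say).

```python
import os
from pathlib import Path


def link_krunkit(krunkit_path: Path, crc_bin_dir: Path, link_name: str = "krunkit") -> Path:
    """Create (or replace) a symlink to the krunkit binary inside crc's bin directory.

    link_name is an assumption: the issue only says krunkit should be
    "symlinked to ~/.crc/bin", not under which name.
    """
    if not krunkit_path.exists():
        raise FileNotFoundError(f"krunkit binary not found at {krunkit_path}")
    crc_bin_dir.mkdir(parents=True, exist_ok=True)
    link = crc_bin_dir / link_name
    # Remove a stale link (even a dangling one) or file so the call is repeatable.
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(krunkit_path)
    return link
```

A call such as `link_krunkit(Path("/opt/homebrew/bin/krunkit"), Path.home() / ".crc" / "bin")` would cover a Homebrew install on Apple Silicon; pass `link_name="vfkit"` if crc should pick it up in place of vfkit without code changes.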
@cfergeau cfergeau added kind/bug Something isn't working status/need triage os/macos kind/epic Large chunk of work and removed kind/bug Something isn't working status/need triage labels Jun 14, 2024
@cfergeau
Contributor Author

cfergeau commented Jun 17, 2024

One initial issue is containers/krunkit#8: krunkit is currently only available on Apple Silicon machines, not on Intel-based Macs.

@vyasgun vyasgun moved this to Work In Progress in Project planning: crc Jul 17, 2024
@vyasgun
Contributor

vyasgun commented Aug 6, 2024

krunkit does not accept certain arguments such as --kernel and --kernel-cmdline which are currently being used by crc to start a vfkit machine. These arguments can be removed if the boot mode is changed to UEFI (The issue: #4180). Addressing this first.

@praveenkumar
Member

@vyasgun but has it at least been tried without those options?

@vyasgun vyasgun changed the title Can crc use krunkit? [Spike] Can crc use krunkit? Aug 7, 2024
@vyasgun
Contributor

vyasgun commented Aug 9, 2024

@praveenkumar I appreciate the initiative to create the PR for using UEFI with vfkit. Running the VM without those options could only have been tried with those code changes. Another flag that needs to be removed for krunkit VM is --timesync.
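As a rough illustration of what dropping these flags means for the command line, the sketch below filters a vfkit-style argument list. It is not crc code: `strip_unsupported` is a hypothetical helper, and it assumes each of the three flags named in this thread takes exactly one value.

```python
from typing import Iterable, List

# Flags this thread reports krunkit rejecting. Assumption: each takes exactly
# one value, either inline ("--flag=value") or as the next argv element.
UNSUPPORTED_FLAGS = frozenset({"--kernel", "--kernel-cmdline", "--timesync"})


def strip_unsupported(args: Iterable[str], unsupported=UNSUPPORTED_FLAGS) -> List[str]:
    """Return args without the unsupported flags and their values."""
    kept: List[str] = []
    skip_value = False
    for arg in args:
        if skip_value:
            # This element is the value of a flag we just dropped.
            skip_value = False
            continue
        name, sep, _ = arg.partition("=")
        if name in unsupported:
            # "--flag=value" carries its value inline; "--flag value" does not.
            skip_value = not sep
            continue
        kept.append(arg)
    return kept
```

Everything else (`--cpus`, `--memory`, `--device ...`) passes through untouched, which matches the idea of krunkit being a drop-in replacement apart from these options.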

I have tried using the new options with krunkit and there has been progress. The VM process is running but there is some issue with the virtio-net device.

If I am correct, according to the code, the device is only being added to vfkit when system mode networking is used:

Can you confirm this? And its relevance to the networking modes?
(Please note, I have added this by changing my personal fork of the codebase here: https://github.com/vyasgun/crc/tree/spike/uefi but I have some questions/clarifications I need)

Apologies if the question is too naive but there's not much documentation to follow :)

@praveenkumar
Member

If I am correct, according to the code, the device is only being added to vfkit when system mode networking is used:

you can remove virtio-net option because we are not allowing system-mode networking for mac and it is not even tested.

Another flag that needs to be removed for krunkit VM is --timesync.

This needs some more digging to provide a better answer, but for the time being (for the PoC), if something works without it, that is progress (also check how podman-machine handles time sync).

With all those changes are you able to run the VM with krunkit and provision cluster (microshift/openshift)? If yes, does it have advantage over vfkit (in terms of performance)?

@gbraad
Contributor

gbraad commented Aug 9, 2024

timesync was due to a problem with the sleep/idle state of the VM. It might need some more investigation in general to determine if this time skew still happens. In conclusion: leave this out for now; it will need a new issue.

@vyasgun
Contributor

vyasgun commented Aug 9, 2024

This needs some more digging to provide a better answer, but for the time being (for the PoC), if something works without it, that is progress (also check how podman-machine handles time sync).

podman-machine does not use --timesync with either vfkit or krunkit. A little more digging is required. However, virtio-net is being used, and a slightly more detailed explanation of its relevance to our use case would help me.

Yes, I can get the krunkit process running.
The command being used:

podmanqe@dev-platform-mac4 ~ % /opt/homebrew/bin/krunkit --cpus 2 --memory 4096 --bootloader efi,variable-store=/Users/podmanqe/.crc/machines/crc/efistore.nvram,create --device virtio-fs,sharedDir=/Users/podmanqe,mountTag=dir0 --device virtio-rng --device virtio-blk,path=/Users/podmanqe/.crc/machines/crc/crc.img --device virtio-vsock,port=1024,socketURL=/Users/podmanqe/.crc/tap.sock,listen --restful-uri tcp://localhost:8080

podmanqe@dev-platform-mac4 ~ % curl 127.0.0.1:8080  --output -
{"state": "VirtualMachineStateRunning"}%
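The same check can be scripted. The sketch below is a hedged example, not part of crc: it assumes the endpoint started with `--restful-uri tcp://localhost:8080` answers at the same URL the curl call above used, with the same JSON shape. The function names are made up for illustration.

```python
import json
import urllib.request


def parse_vm_state(body: str) -> str:
    """Extract the "state" field from the JSON document the endpoint returns."""
    return json.loads(body)["state"]


def is_running(state: str) -> bool:
    # State name copied from the curl output above.
    return state == "VirtualMachineStateRunning"


def fetch_vm_state(url: str = "http://127.0.0.1:8080", timeout: float = 2.0) -> str:
    """Query the RESTful endpoint; the URL mirrors the curl call and is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vm_state(resp.read().decode("utf-8"))
```

With the VM started as above, `is_running(fetch_vm_state())` should be true; parsing is kept separate from the network call so it can be exercised without a running VM.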

you can remove virtio-net option because we are not allowing system-mode networking for mac and it is not even tested.

krunkit goes to the api login page to manually enter the password. I just want to be sure if not using virtio-net might be affecting this.


@praveenkumar
Member

krunkit goes to the api login page to manually enter the password. I just want to be sure if not using virtio-net might be affecting this.

This is when you are trying to run it directly using the CLI command. Does it work when you change the crc code base and use the krunkit binary instead of vfkit? I think with the CLI it is expected since no ssh key is passed.

@vyasgun
Contributor

vyasgun commented Aug 12, 2024

This is when you are trying to run it directly using the CLI command. Does it work when you change the crc code base and use the krunkit binary instead of vfkit? I think with the CLI it is expected since no ssh key is passed.

No, it doesn't seamlessly run through CRC code base as of now which is why I am trying to figure out the required options. Apart from this, the machine is in the running state, as mentioned in my previous comment. The missing piece is probably either the ssh settings or the ignition config. podman-machine logs in directly (it is using podman-machine-default-ignition.sock), as a certain set of commands is executed during its startup.

Can you still point me to the use of virtio-net and why is it only used for system mode networking? It will be helpful for me. Thanks :)

@praveenkumar
Member

Can you still point me to the use of virtio-net and why is it only used for system mode networking?

Before migrating to vfkit we used hyperkit ( https://github.com/moby/hyperkit ) as the driver, and it used virtio-net. That didn't give us a way to handle VPN connections effectively, so we went with https://github.com/containers/gvisor-tap-vsock (user-mode networking). We supported both for a while, but gradually made user-mode networking the default and obsoleted virtio-net; we are not even testing it any more.

More info around virtio-net : https://www.redhat.com/en/blog/introduction-virtio-networking-and-vhost-net

@praveenkumar
Member

No, it doesn't seamlessly run through CRC code base as of now which is why I am trying to figure out the required options.

To me, this machine is booted and the sshd service should be running. I am now more interested in what error you get if you just rename krunkit to vfkit and try crc start --log-level debug.

@vyasgun
Contributor

vyasgun commented Aug 12, 2024

I was able to bring up the crc VM using the following changes: vyasgun@6eafcf6 (Please note it's just a POC with some hardcoded values for testing purposes)

Verifying it's using krunkit:

podmanqe@dev-platform-mac4 ~ % crc config view
- consent-telemetry                     : no
- cpus                                  : 4
- memory                                : 16384
- preset                                : microshift
- skip-check-vfkit-installed            : true

podmanqe@dev-platform-mac4 crc % crcssh
Warning: Permanently added '[127.0.0.1]:2222' (ED25519) to the list of known hosts.
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Boot Status is GREEN - Health Check SUCCESS
[core@api ~]$ ls /dev/dri
by-path  card0  renderD128

I also ran an InstructLab pod on CRC with the following spec and made it run some prompts by using an interactive terminal ( kubectl exec -ti mistral-pod -- bash ). The prompts are working but the responses are very slow compared to podman-machine using krunkit.

podmanqe@dev-platform-mac4 gunjan % cat mistral-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mistral-pod
spec:
  containers:
  - image: quay.io/slopezpa/fedora-vgpu-llama
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300; done;" ]
    name: mistral-pod
    volumeMounts:
    - mountPath: /dev/dri
      name: dev-dri
    - mountPath: /models
      name: downloads
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: dev-dri
    hostPath:
      path: /dev/dri
  - name: downloads
    hostPath:
      path: /Users/podmanqe/Downloads

However, crc status doesn't show the VM as running even though it can be ssh'd into, so the changes needed for everything to work in sync still need to be looked into.

podmanqe@dev-platform-mac4 ~ % crc status
CRC VM:                  Stopped
MicroShift:              Stopped (v4.16.4)
RAM Usage:               0B of 0B
Disk Usage:              0B of 0B (Inside the CRC VM)
Persistent Volume Usage: 0B of 0B (Allocated)
Cache Usage:             67.12GB
Cache Directory:         /Users/podmanqe/.crc/cache

Conclusion:

According to the spike, CRC can use krunkit. The next steps depend on whether we want to simply replace vfkit with krunkit in our code or support it alongside vfkit. The code changes seem straightforward.

@cfergeau
Contributor Author

podman-machine does not use --timesync with either vfkit or krunkit. A little more digging is required.

They are using https://chrony-project.org/doc/4.5/chrony.conf.html#makestep instead:
https://github.com/containers/podman-machine-os/blob/main/podman-image-daily/50-podman-makestep.conf

@cfergeau
Contributor Author

I also ran an InstructLab pod on CRC with the following spec

Did you use the same yaml with podman-machine for comparison? For a start, you could ssh into the crc krunkit VM, and run an AI workload by directly using podman ...

@vyasgun
Contributor

vyasgun commented Aug 28, 2024

@cfergeau Yes, it's the same yaml. I tried running the llama.cpp code in the following ways and here are the results (For reference: ggml-org/llama.cpp#1323 (comment) has the following list which describes the parameters):

  • load time: loading model file
  • sample time: generating tokens from the prompt/file choosing the next likely token.
  • prompt eval time: how long it took to process the prompt/file by LLaMa before generating new text.
  • eval time: how long it took to generate the output (until [end of text] or the user set limit).
  • total: all together

Running a podman pod directly on the system:

llama_print_timings:        load time =    4430.66 ms
llama_print_timings:      sample time =      16.25 ms /   259 runs   (    0.06 ms per token, 15937.48 tokens per second)
llama_print_timings: prompt eval time =    1631.53 ms /     5 tokens (  326.31 ms per token,     3.06 tokens per second)
llama_print_timings:        eval time =   12403.26 ms /   258 runs   (   48.07 ms per token,    20.80 tokens per second)
llama_print_timings:       total time =   14076.18 ms /   263 tokens

Running a podman pod after ssh-ing into crc VM:

llama_print_timings:        load time =    3422.64 ms
llama_print_timings:      sample time =      50.78 ms /   649 runs   (    0.08 ms per token, 12781.38 tokens per second)
llama_print_timings: prompt eval time =    1780.76 ms /     5 tokens (  356.15 ms per token,     2.81 tokens per second)
llama_print_timings:        eval time =   38451.63 ms /   648 runs   (   59.34 ms per token,    16.85 tokens per second)
llama_print_timings:       total time =   40348.54 ms /   653 tokens

Running a kubernetes pod on crc (takes much longer):

llama_print_timings:        load time =   45553.22 ms
llama_print_timings:      sample time =      43.01 ms /   563 runs   (    0.08 ms per token, 13089.37 tokens per second)
llama_print_timings: prompt eval time =   44973.51 ms /     9 tokens ( 4997.06 ms per token,     0.20 tokens per second)
llama_print_timings:        eval time = 4552762.02 ms /   562 runs   ( 8101.00 ms per token,     0.12 tokens per second)
llama_print_timings:       total time = 4602622.83 ms /   571 tokens
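To put the three runs on a common footing, the timing lines can be parsed and compared per token rather than by total time, since the runs generated different numbers of tokens. The sketch below is only an illustration: the regular expression is written against the lines pasted above, and `Timing` and `parse_timing` are hypothetical names. On these numbers, the Kubernetes run is roughly 136 times slower per generated token than the podman run inside the crc VM (8101.00 ms vs 59.34 ms).

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as:
#   llama_print_timings:  eval time = 12403.26 ms / 258 runs ( ... )
# The "/ N runs|tokens" part is optional because the "load time" line lacks it.
LINE = re.compile(
    r"^\w*_print_timings:\s*(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


class Timing(NamedTuple):
    name: str
    total_ms: float
    count: Optional[int]

    @property
    def ms_per_item(self) -> Optional[float]:
        """Average milliseconds per run/token, or None when no count was printed."""
        return self.total_ms / self.count if self.count else None


def parse_timing(line: str) -> Optional[Timing]:
    """Parse one timing line; return None for lines that are not timing lines."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    count = int(m["count"]) if m["count"] else None
    return Timing(m["name"], float(m["ms"]), count)
```

Dividing the `ms_per_item` of two "eval" lines gives the slowdown factor between runs.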

@cfergeau
Contributor Author

Running a kubernetes pod on crc (takes much longer):

Could it be picking up an amd64 image instead of an arm64? This would explain the problems.
You could try to get a shell inside the pod to try to understand what's happening, or try to compare commandlines in the VM to see if there are obvious differences

@vyasgun
Contributor

vyasgun commented Aug 28, 2024

@cfergeau The image is arm64 (I checked inside the VM)

[core@api ~]$ sudo crictl inspecti quay.io/slopezpa/fedora-vgpu-llama | jq -r '.info.imageSpec.architecture'
arm64

And also inside the mistral-pod, the binary being run is built for arm64:

gvyas@Gunjans-MacBook-Pro specs % kubectl logs -f mistral-pod
Log start
main: build = 2238 (56d03d92)
main: built with cc (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6) for aarch64-redhat-linux

Update

Running the pod as privileged was required for accessing the GPU. Now it takes roughly the same amount of time.

llama_print_timings:        load time =    4772.51 ms
llama_print_timings:      sample time =      58.51 ms /   669 runs   (    0.09 ms per token, 11433.36 tokens per second)
llama_print_timings: prompt eval time =    1780.24 ms /     5 tokens (  356.05 ms per token,     2.81 tokens per second)
llama_print_timings:        eval time =   40126.60 ms /   668 runs   (   60.07 ms per token,    16.65 tokens per second)
llama_print_timings:       total time =   42043.88 ms /   673 tokens
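The exact change to the pod spec is not shown in this thread. One plausible way to run it as privileged, offered here purely as an assumption, is to set the container-level `securityContext.privileged` field. The sketch below applies that to a trimmed-down copy of the earlier mistral-pod spec and prints it as JSON, which kubectl also accepts; `make_privileged` is a hypothetical helper.

```python
import copy
import json


def make_privileged(pod: dict) -> dict:
    """Return a copy of a Pod manifest with every container marked privileged."""
    patched = copy.deepcopy(pod)
    for container in patched["spec"]["containers"]:
        # Assumption: container-level securityContext is where the flag was set.
        container.setdefault("securityContext", {})["privileged"] = True
    return patched


# Trimmed-down version of the mistral-pod spec from earlier in the thread.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "mistral-pod"},
    "spec": {
        "containers": [{
            "name": "mistral-pod",
            "image": "quay.io/slopezpa/fedora-vgpu-llama",
            "volumeMounts": [{"mountPath": "/dev/dri", "name": "dev-dri"}],
        }],
        "volumes": [{"name": "dev-dri", "hostPath": {"path": "/dev/dri"}}],
    },
}

print(json.dumps(make_privileged(pod), indent=2))
```

The original dictionary is left untouched, so the privileged and unprivileged variants can be compared side by side.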

@vyasgun
Contributor

vyasgun commented Aug 29, 2024

The next steps will be documented in: #4341

@vyasgun vyasgun closed this as completed Aug 29, 2024
@github-project-automation github-project-automation bot moved this from Work In Progress to Done in Project planning: crc Aug 29, 2024