[Bug] tpu-info can not access metrics #6

Open
steveepreston opened this issue Oct 19, 2024 · 4 comments

Comments

@steveepreston

Minimal working code is here. The code follows the GoogleCloudPlatform example.

The code ran to completion on a TPU VM v3-8, but calling !tpu-info at the end shows:

TPU Chips                                    
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━┓
┃ Chip        ┃ Type        ┃ Devices ┃ PID ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━┩
│ /dev/accel0 │ TPU v3 chip │ 2       │ 13  │
│ /dev/accel1 │ TPU v3 chip │ 2       │ 13  │
│ /dev/accel2 │ TPU v3 chip │ 2       │ 13  │
│ /dev/accel3 │ TPU v3 chip │ 2       │ 13  │
└─────────────┴─────────────┴─────────┴─────┘
Libtpu metrics unavailable. Is there a framework using the TPU? See https://github.com/google/cloud-accelerator-diagnostics/tree/main/tpu_info for more information
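
The message asks whether a framework is using the TPU. One way to narrow this down is to check, from the same VM and while the workload is still running, whether anything accepts connections on the port where libtpu serves its runtime metrics. The sketch below assumes that endpoint is localhost:8431, which is my understanding of what tpu-info queries; treat the port as an assumption and confirm it against the tpu_info README for your version. `port_is_open` is a hypothetical helper name, not part of tpu-info.

```python
import socket

# Assumed address of the libtpu metrics service; verify for your setup.
METRICS_HOST = "localhost"
METRICS_PORT = 8431


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    if port_is_open(METRICS_HOST, METRICS_PORT):
        print("Something is listening on the assumed metrics port.")
    else:
        print("Nothing is listening on the assumed metrics port.")
```

If nothing is listening while the framework is demonstrably running on the TPU, that would point at the metrics service not being started rather than at tpu-info itself. An open port only shows that some process is bound there, not that it is libtpu.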
@steveepreston
Author

It shows the same error with the repo's own example:

import torch
import torch_xla
t = torch.randn((300, 300), device=torch_xla.device())

!tpu-info
Libtpu metrics unavailable.
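
Outside a notebook the `!tpu-info` shortcut is not available, so the same check can be scripted from the process that still holds the tensor. This is a minimal sketch: `run_tpu_info` and `metrics_unavailable` are hypothetical helper names, the CLI is invoked with no arguments exactly as above, and both stdout and stderr are captured because I have not verified which stream the message is written to.

```python
import shutil
import subprocess

# Marker text copied from the output quoted in this issue.
UNAVAILABLE_MARKER = "Libtpu metrics unavailable"


def metrics_unavailable(output: str) -> bool:
    """Return True if the tpu-info output contains the 'unavailable' message."""
    return UNAVAILABLE_MARKER in output


def run_tpu_info() -> str:
    """Run the tpu-info CLI with no arguments and return stdout plus stderr."""
    if shutil.which("tpu-info") is None:
        raise FileNotFoundError("tpu-info is not on PATH")
    result = subprocess.run(
        ["tpu-info"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

Calling `metrics_unavailable(run_tpu_info())` right after creating the tensor, in the same Python process, keeps the framework alive for the duration of the check.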

@jcole75

jcole75 commented Oct 29, 2024

I see the same issue on different TPUs (v3 or v4).

@steveepreston
Author

@will-cromar @SurbhiJainUSC @bvandermoon Please take a look. Thanks.

@peregilk

peregilk commented Nov 5, 2024

I am seeing the same issue, using JAX on a v4-8 created with --runtime-version tpu-ubuntu2204-base.

I see the comment that "Releases from before 2024 may not be compatible." However, it is not clear to me which releases that refers to, or what needs to be updated.
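
Since the note does not say which component it means, a reasonable first step is to record the installed versions of the packages involved so they can be compared against their release dates. This is only a sketch: the distribution names in `CANDIDATES` are assumptions about what might be installed on the VM, and `installed_versions` is a hypothetical helper. It reports versions only and does not decide compatibility.

```python
from importlib import metadata

# Distribution names are assumptions; adjust to what is installed on your VM.
CANDIDATES = (
    "tpu-info",
    "libtpu",
    "libtpu-nightly",
    "jax",
    "jaxlib",
    "torch",
    "torch_xla",
)


def installed_versions(names=CANDIDATES) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output alongside the tpu-info output would make it easier for maintainers to tell whether an outdated release is involved.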
