Delta indices simplification #791

Open · wants to merge 12 commits into main
Conversation

IliaLarchenko
Contributor

What this does

This PR removes unnecessary back-and-forth conversion between delta_indices and delta_timestamps.

  1. In the model configuration file, we define delta_indices for actions and observations. For example, let's look at observations in Diffusion:
@property
def observation_delta_indices(self) -> list:
    return list(range(1 - self.n_obs_steps, 1))
  2. Then in datasets/factory.py we transform them into delta timestamps by dividing them by fps:
delta_timestamps[key] = [i / ds_meta.fps for i in cfg.observation_delta_indices]
  3. Eventually, in lerobot_dataset.py we use get_delta_indices from dataset/utils.py to transform them back by multiplying them by fps:
def get_delta_indices(delta_timestamps: dict[str, list[float]], fps: int) -> dict[str, list[int]]:
    delta_indices = {}
    for key, delta_ts in delta_timestamps.items():
        delta_indices[key] = [round(d * fps) for d in delta_ts]

    return delta_indices

In the end, we use delta_indices, which we set up at the very beginning.

Basically, we define delta_indices -> transform them to delta_timestamps -> transform them back to delta_indices.

All of this happens before we query data from the dataset. When the actual query happens we generate query_indices from delta_indices and also query_timestamps from delta_indices and use them for the actual data query.

I think the delta_indices -> delta_timestamps -> delta_indices transformation is completely unnecessary, because we define delta_indices in the policy and use delta_indices inside the dataset.
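The round trip can be reproduced in a few lines (a minimal sketch; the fps and n_obs_steps values are illustrative, not taken from any particular config):

```python
# Minimal sketch of the round trip described above (illustrative values).
fps = 10
n_obs_steps = 2

# 1. The policy config defines delta indices, Diffusion-style:
delta_indices = list(range(1 - n_obs_steps, 1))         # [-1, 0]

# 2. datasets/factory.py converts them to timestamps:
delta_timestamps = [i / fps for i in delta_indices]     # [-0.1, 0.0]

# 3. lerobot_dataset.py converts them straight back:
recovered = [round(t * fps) for t in delta_timestamps]  # [-1, 0]

assert recovered == delta_indices  # the two conversions cancel out
```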

I removed all unnecessary transformations, all tests related to functions that were doing these transformations, and fixed examples.

Removing these steps doesn't have much impact on performance, but it lets us clean up the code and opens up other opportunities: since we directly use the delta_indices defined in the policy, we can use something more flexible than a simple list.

How it was tested

This PR doesn't break anything that uses train.py; all policies work fine without any changes. All configs already use delta_indices, so nothing needs to change there.
However, it can break custom training pipelines, because LeRobotDataset now expects a delta_indices parameter instead of delta_timestamps. This is very easy to fix, as you can always get delta_indices by multiplying delta_timestamps by fps.
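For a custom pipeline, that migration is essentially a one-liner (a sketch; the helper name and the example dict are hypothetical, not part of the codebase):

```python
# Hypothetical migration helper for a custom pipeline: turn an existing
# delta_timestamps dict into the delta_indices dict the new interface expects.
def timestamps_to_indices(delta_timestamps: dict[str, list[float]], fps: int) -> dict[str, list[int]]:
    return {key: [round(t * fps) for t in ts] for key, ts in delta_timestamps.items()}

delta_timestamps = {"observation.state": [-0.1, 0.0], "action": [0.0, 0.1, 0.2]}
delta_indices = timestamps_to_indices(delta_timestamps, fps=10)
# {'observation.state': [-1, 0], 'action': [0, 1, 2]}
```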

How to checkout & try? (for the reviewer)

You can try to train any policy, e.g.:

python lerobot/scripts/train.py \
    --output_dir=outputs/train/diffusion_pusht \
    --policy.type=diffusion \
    --dataset.repo_id=lerobot/pusht \
    --seed=100000 \
    --env.type=pusht \
    --batch_size=64 \
    --steps=200000 \
    --eval_freq=25000 \
    --save_freq=25000 \
    --wandb.enable=true

Collaborator

@aliberts left a comment


The best code is no code.
This looks good, thanks! Left a few minor comments.

@Cadene I know you had some opinion about keeping the spec in delta_timestamps rather than delta_indices. I think the simplification here is worth it; besides, right now we don't really have any use case or workflow where delta_timestamps is necessary over delta_indices, AFAIK.

On a side note, I'm curious about how we should handle unsynced features like cameras with different fps, or variable refresh rate sensors. This is not directly an issue with using indices over timestamps but related nonetheless. Happy to get your opinion about this @IliaLarchenko @Cadene

@IliaLarchenko
Contributor Author

We still have a main dataset-level fps and timestamps for each index, and we check that all of them are valid and within tolerance bounds:

# Check timestamps
timestamps = torch.stack(self.hf_dataset["timestamp"]).numpy()
episode_indices = torch.stack(self.hf_dataset["episode_index"]).numpy()
ep_data_index_np = {k: t.numpy() for k, t in self.episode_data_index.items()}
check_timestamps_sync(timestamps, episode_indices, ep_data_index_np, self.fps, self.tolerance_s)

We do the same check when we save an episode:

check_timestamps_sync(
    episode_buffer["timestamp"],
    episode_buffer["episode_index"],
    ep_data_index_np,
    self.fps,
    self.tolerance_s,
)

Then for actually querying video frames we use timestamps:

if len(self.meta.video_keys) > 0:
    current_ts = item["timestamp"].item()
    query_timestamps = self._get_query_timestamps(current_ts, query_indices)
    video_frames = self._query_videos(query_timestamps, ep_idx)
    item = {**video_frames, **item}

And when we actually decode video we check it again:

is_within_tol = min_ < tolerance_s
assert is_within_tol.all(), (
    f"One or several query timestamps unexpectedly violate the tolerance ({min_[~is_within_tol]} > {tolerance_s=})."
    "It means that the closest frame that can be loaded from the video is too far away in time."
    "This might be due to synchronization issues with timestamps during data collection."
    "To be safe, we advise to ignore this item during training."
    f"\nqueried timestamps: {query_ts}"
    f"\nloaded timestamps: {loaded_ts}"
    f"\nvideo: {video_path}"
    f"\nbackend: {backend}"
)

@Cadene
Collaborator

Cadene commented Mar 2, 2025

Thanks for your work. Indeed our code can be improved and simplified.

However, I didn't understand your argument for changing the interface from timestamps to indices. I agree it simplifies things downstream, but to me it comes at the expense of expressivity. Maybe the solution is to simplify the code downstream instead of upstream; for instance, the policy should define timestamps, not indices.

Here is my argument for keeping timestamps as input: they are more expressive than indices. In this example, if we were to use indices, even with the comment, we would have a hard time understanding what -75, -50, -25, etc. mean. Instead, we would have to map these indices back to timestamps, because timestamps are more meaningful to us.

    # loads 6 state vectors: 1.5 s, 1 s, 0.5 s, 200 ms, and 100 ms before, plus the current frame
    "observation.state": [-1.5, -1, -0.5, -0.20, -0.10, 0],  # <-- timestamps
    "observation.state": [-75, -50, -25, -10, -5, 0],  # <-- indices (at 50 fps)

It's also important to reason in timestamps because, at 10 fps, an index of -10 corresponds to -1 second, but at 50 fps it corresponds to -200 ms, which is not at all the same temporal context for your model. By forcing the dataset interface to be in timestamps, it's easier to avoid this mistake.
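The fps dependence can be made concrete in a couple of lines (a minimal sketch; index_to_seconds is a hypothetical helper, and the numbers are illustrative):

```python
# Illustrative: the same index means very different times at different fps.
def index_to_seconds(index: int, fps: int) -> float:
    return index / fps

print(index_to_seconds(-10, fps=10))  # -1.0 -> one full second of history
print(index_to_seconds(-10, fps=50))  # -0.2 -> only 200 ms of history
```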

I think there are some issues in our API, but the best way to make the right decisions is to make progress on other features, such as training on multiple datasets with various fps. That will automatically raise some issues regarding timestamps and indices.

As of now, I would prefer to not make this change yet.

What do you think?

@IliaLarchenko
Contributor Author

I don’t have a strong preference for indices over timestamps.

Currently, timestamps are only used in examples with a manual training loop. In practice, if you train a model with any policy, you specify indices, and inside the dataset, you use indices to query items.

Regardless of the approach, you need to account for FPS when specifying either delta_indices or delta_timestamps. Timestamps are more intuitive in your example, but multiplying them by FPS makes them just as clear:

"observation.state": [dt * fps for dt in [-1.5, -1, -0.5, -0.20, -0.10, 0]]

For observations, timestamps feel more natural, while for actions, indices make more sense. For example, if you want to predict the next 2 seconds of actions, you have to account for FPS anyway:

  • Using timestamps:
    [i / fps for i in range(fps * 2)]
  • Using indices:
    range(fps * 2)

So ultimately, they’re not that different, but my approach skips a couple of intermediate steps.
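The equivalence of the two spellings can be checked directly (a sketch with an illustrative fps):

```python
fps = 10
horizon_s = 2  # predict the next 2 seconds of actions

as_timestamps = [i / fps for i in range(fps * horizon_s)]  # timestamp spelling
as_indices = list(range(fps * horizon_s))                  # index spelling

# Converting the timestamp form back with fps recovers the index form exactly.
assert [round(t * fps) for t in as_timestamps] == as_indices
```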

For multi-dataset scenarios, this gets even trickier. If one dataset has 10 FPS and another has 50 FPS, it's unclear what the best way to handle them is.

My initial goal was to remove unnecessary conversions and allow policies to define more flexible delta_indices as an iterable. This way, you could use something like the iterator below to sample different indices dynamically:

import random

class RandomizedDeltaIndices:
    def __iter__(self):
        return iter([
            random.randint(-25, -15),
            random.randint(-15, -5),
            *range(-4, 1),
        ])

However, this approach can be a bit messy. An alternative is passing a delta_indices post-processing function, like in this commit: fb6e9e7. But defining a function as a dataset parameter isn't ideal either. I don't have a perfect solution yet.
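One possible shape for that post-processing alternative (a hypothetical sketch, not the API from the linked commit; SketchDataset and its parameters are invented for illustration):

```python
import random

# Hypothetical sketch (names invented): the dataset keeps a static delta_indices
# spec and applies an optional callable each time indices are needed.
def randomize_history(indices):
    # Jitter the two oldest observation offsets; keep the recent window fixed.
    return [random.randint(-25, -15), random.randint(-15, -5), *range(-4, 1)]

class SketchDataset:
    def __init__(self, delta_indices, postprocess=None):
        self.delta_indices = delta_indices
        self.postprocess = postprocess

    def indices_for_item(self):
        if self.postprocess is not None:
            return self.postprocess(self.delta_indices)
        return self.delta_indices

ds = SketchDataset(delta_indices=[-2, -1, 0], postprocess=randomize_history)
print(ds.indices_for_item())  # a freshly randomized index list on every call
```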

@Cadene
Collaborator

Cadene commented Mar 2, 2025

A thought: for delta timestamps given as input to a dataset that don't match its fps, we could do some interpolation for state and action, and return the closest frames for the vision modalities.

Thus, in the multi-dataset case, we can keep the same delta timestamps for both.

Then for training your DOT policy with some random delta timestamps, we can provide a "data augmentation" callable function to the dataset.

In any case, I think the delta timestamps interface to LeRobotDataset is to be preferred over delta indices.
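That interpolation idea could look roughly like this (a hypothetical sketch; interp_state and nearest_frame are invented helpers, and the 10 fps data is illustrative):

```python
import bisect

# Hypothetical sketch: serve delta timestamps that don't match the dataset fps
# by linearly interpolating state/action and snapping vision to nearest frames.
def interp_state(query_t, frame_ts, values):
    """Linear interpolation of a recorded scalar signal at an off-grid time."""
    j = bisect.bisect_right(frame_ts, query_t)
    j = min(max(j, 1), len(frame_ts) - 1)
    t0, t1 = frame_ts[j - 1], frame_ts[j]
    v0, v1 = values[j - 1], values[j]
    return v0 + (v1 - v0) * (query_t - t0) / (t1 - t0)

def nearest_frame(query_t, frame_ts):
    """Index of the closest recorded frame, for vision modalities."""
    return min(range(len(frame_ts)), key=lambda i: abs(frame_ts[i] - query_t))

fps = 10
frame_ts = [i / fps for i in range(10)]      # a 10 fps dataset: 0.0, 0.1, ...
states = [t * t for t in frame_ts]           # stand-in for recorded states

print(interp_state(0.25, frame_ts, states))  # ~0.065, halfway between 0.2**2 and 0.3**2
print(nearest_frame(0.27, frame_ts))         # 3 (closest frame is at t=0.3)
```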

@IliaLarchenko
Contributor Author

Got it, so I can close this PR. Later, when the DOT policy is integrated, I can try to come up with a nice solution for delta_indices augmentations.
