Flexible IO proposal #1161

Draft
wants to merge 3 commits into base: main

theomonnom
Member

No description provided.

changeset-bot bot commented Dec 2, 2024

⚠️ No Changeset found

Latest commit: 678208b

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

theomonnom changed the title from "Create proposal.md" to "Flexible IO proposal" on Dec 2, 2024
proposal.md Outdated
Comment on lines 22 to 40
class PipelineIO(ABC):

    def before_stt_node(self, source: AsyncIterator[rtc.AudioFrame]) -> AsyncIterator[rtc.AudioFrame]:
        return source

    def after_stt_node(self, source: AsyncIterator[SpeechEvent]) -> AsyncIterator[SpeechEvent]:
        return source

    def before_llm_node(self, chat_ctx: ChatContext) -> AsyncIterator[ChatChunk] | None:
        return None

    def after_llm_node(self, source: AsyncIterator[ChatChunk]) -> AsyncIterator[ChatChunk]:
        return source

    def before_tts_node(self, source: AsyncIterator[str]) -> AsyncIterator[rtc.AudioFrame] | None:
        return source

    def after_tts_node(self, source: AsyncIterator[rtc.AudioFrame]) -> AsyncIterator[rtc.AudioFrame]:
        return source

Not my wheelhouse, but is this in any way more convenient or idiomatic than if each pipeline stage managed its own pre/post transform callbacks? E.g.:

def passthrough_audio(source: AsyncIterator[rtc.AudioFrame]) -> AsyncIterator[rtc.AudioFrame]:
    return source

def filter_swearwords(source: AsyncIterator[SpeechEvent]) -> AsyncIterator[SpeechEvent]:
    return source

agent = PipelineAgent(
    stt=STT(pre=passthrough_audio, post=filter_swearwords),
    # ...
)
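
A minimal runnable sketch of what a non-trivial `post` callback like `filter_swearwords` could look like. The `SpeechEvent` dataclass, the `BLOCKLIST` set, and the `fake_stt` source are hypothetical stand-ins so the example runs on its own; they are not the library's types.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class SpeechEvent:
    # Hypothetical stand-in for the real SpeechEvent type; it only carries text.
    text: str


BLOCKLIST = {"darn"}  # hypothetical word list, for illustration only


async def filter_swearwords(source: AsyncIterator[SpeechEvent]) -> AsyncIterator[SpeechEvent]:
    # Mask blocklisted words and pass every event downstream.
    async for event in source:
        words = ["***" if w.lower() in BLOCKLIST else w for w in event.text.split()]
        yield SpeechEvent(text=" ".join(words))


async def fake_stt() -> AsyncIterator[SpeechEvent]:
    # Stand-in for the STT stage's output stream.
    for text in ["well darn", "hello there"]:
        yield SpeechEvent(text=text)


async def collect() -> list:
    return [event.text async for event in filter_swearwords(fake_stt())]


filtered = asyncio.run(collect())
```

Because the transform is an async generator, it can be wrapped around any upstream stream regardless of whether it is registered per stage or on a shared hooks class.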

Member Author

theomonnom commented Dec 3, 2024

The scope is different. Ideally, we can add new parameters like speech_id to each step, such as before_tts_node, ...
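
A rough sketch of that idea, under stated assumptions: the `speech_id` keyword argument, the `LoggingHooks` class, and the `AsyncIterator[str]` return type are all hypothetical and are not taken from proposal.md. It only illustrates how a hooks method could receive extra per-step context from the pipeline.

```python
import asyncio
from typing import AsyncIterator


class LoggingHooks:
    # Hypothetical hooks object; `speech_id` is an assumed extra parameter.
    def __init__(self) -> None:
        self.seen = []  # (speech_id, text) pairs observed before TTS

    def before_tts_node(self, source: AsyncIterator[str], *, speech_id: str) -> AsyncIterator[str]:
        async def tagged() -> AsyncIterator[str]:
            # Record each text chunk together with the speech it belongs to.
            async for text in source:
                self.seen.append((speech_id, text))
                yield text

        return tagged()


async def llm_text() -> AsyncIterator[str]:
    # Stand-in for text chunks coming out of the LLM step.
    for chunk in ["Hello", " world"]:
        yield chunk


async def run():
    hooks = LoggingHooks()
    chunks = [t async for t in hooks.before_tts_node(llm_text(), speech_id="speech-1")]
    return "".join(chunks), hooks.seen


spoken, seen = asyncio.run(run())
```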

class TextOutput(Protocol):
    async def write(self, text: str) -> None: ...

    def flush(self) -> None: ...
Contributor

how should frontend applications reason about "turns" with these two output types? is that what "flush" means? UI will likely want to render each complete "message" in a chat bubble, for instance. maybe having a unique id somewhere could help?
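
One possible reading, sketched below as an assumption rather than anything the proposal states: `flush()` marks the end of a message, and the output assigns each message a unique id so a UI can group chunks into one bubble. The `ChatBubbleOutput` class and its attributes are hypothetical.

```python
import asyncio
import uuid
from typing import Optional


class ChatBubbleOutput:
    # Hypothetical TextOutput implementation; flush() is ASSUMED to end a message.
    def __init__(self) -> None:
        self.messages = {}    # message id -> text accumulated so far
        self.completed = []   # ids of messages that have been flushed
        self._current_id: Optional[str] = None

    async def write(self, text: str) -> None:
        # Start a new message (with a fresh id) on the first write after a flush.
        if self._current_id is None:
            self._current_id = uuid.uuid4().hex
            self.messages[self._current_id] = ""
        self.messages[self._current_id] += text

    def flush(self) -> None:
        # Close the current message so the next write opens a new bubble.
        if self._current_id is not None:
            self.completed.append(self._current_id)
            self._current_id = None


async def demo() -> ChatBubbleOutput:
    out = ChatBubbleOutput()
    await out.write("Hi")
    await out.write(" there")
    out.flush()
    await out.write("Second turn")
    out.flush()
    return out


output = asyncio.run(demo())
bubbles = [output.messages[message_id] for message_id in output.completed]
```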

STT -> LLM -> TTS

```python
AudioInput = AsyncIterator[rtc.AudioFrame | rtc.AudioFrameEvent]
Contributor

not sure of the scope of what you're working on, but where would text input/image/file "chat" input fit in?

    def clear_queue(self) -> None: ...


class TextOutput(Protocol):
Contributor

In addition to text and audio output that are essentially "talking" or "chat", many applications "output" other structured things, either by returning images or through function calls (e.g. JSON output). do we have any thoughts about whether it would make sense to provide an affordance for that in PipelineAgent?

        return source

    def before_tts_node(self, source: AsyncIterator[str] | str) -> AsyncIterator[rtc.AudioFrame] | None:
        return source
Contributor

shouldn't this method return AsyncIterator[str] | str?


class PipelineIO(ABC):

    def before_stt_node(self, source: AsyncIterator[rtc.AudioFrame]) -> AsyncIterator[rtc.AudioFrame]:
Contributor

shouldn't these methods all be async?
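
For context on this question, a small sketch of the difference in plain Python. A regular `def` that returns an `AsyncIterator` can be consumed directly with `async for`, while an `async def` without `yield` returns a coroutine that has to be awaited first. The names `frames`, `sync_hook`, and `async_hook` are hypothetical, and integers stand in for audio frames.

```python
import asyncio
from typing import AsyncIterator


async def frames() -> AsyncIterator[int]:
    # Stand-in for a stream of audio frames; ints keep the sketch self-contained.
    for i in range(3):
        yield i


def sync_hook(source: AsyncIterator[int]) -> AsyncIterator[int]:
    # Plain `def`: hands back the iterator immediately, like the proposal's defaults.
    return source


async def async_hook(source: AsyncIterator[int]) -> AsyncIterator[int]:
    # `async def` without `yield`: calling it gives a coroutine that must be awaited.
    return source


async def main():
    from_sync = [f async for f in sync_hook(frames())]
    stream = await async_hook(frames())  # extra await needed before iterating
    from_async = [f async for f in stream]
    return from_sync, from_async


sync_result, async_result = asyncio.run(main())
```

Both forms yield the same frames; the choice mainly affects whether a hook can do awaited setup work before it returns its stream.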

        return source

    def before_llm_node(self, chat_ctx: ChatContext) -> AsyncIterator[ChatChunk] | None:
        return None
Contributor

why is this one different than the others? it feels odd that it doesn't have the same return type as input type, and the default implementation returns None, which implies it's actually very semantically different than the other methods, which are all open-ended hooks to transform data in the pipeline or add logging or other side effects.
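
One interpretation of that signature, which the thread does not confirm, is that `before_llm_node` acts as an optional override: returning `None` falls through to the default LLM call, while returning a stream replaces it. The sketch below illustrates only that assumed semantics; `default_llm`, `CannedReplyHooks`, and `run_llm_step` are hypothetical, and plain strings stand in for `ChatContext` and `ChatChunk`.

```python
import asyncio
from typing import AsyncIterator, List, Optional


async def default_llm(chat_ctx: List[str]) -> AsyncIterator[str]:
    # Stand-in for the pipeline's normal LLM call.
    yield f"echo: {chat_ctx[-1]}"


class CannedReplyHooks:
    def before_llm_node(self, chat_ctx: List[str]) -> Optional[AsyncIterator[str]]:
        # ASSUMED semantics: a returned stream replaces the LLM, None means "no override".
        if chat_ctx[-1] == "ping":
            async def canned() -> AsyncIterator[str]:
                yield "pong"

            return canned()
        return None


async def run_llm_step(hooks: CannedReplyHooks, chat_ctx: List[str]) -> List[str]:
    override = hooks.before_llm_node(chat_ctx)
    stream = override if override is not None else default_llm(chat_ctx)
    return [chunk async for chunk in stream]


overridden = asyncio.run(run_llm_step(CannedReplyHooks(), ["ping"]))
defaulted = asyncio.run(run_llm_step(CannedReplyHooks(), ["hi"]))
```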

    def flush(self) -> None: ...


class PipelineIO(ABC):
Contributor

this name feels a little odd, given that in addition to PipelineIO we also have PipelineOutput (and maybe some forthcoming Input protocol too), but the "IO" in PipelineIO is not related to the pipeline's input nor output itself... might be better as PipelineHooks

3 participants