Mirascope v2.0 Roadmap #896

Open · 8 tasks
willbakst opened this issue Mar 8, 2025 · 5 comments

willbakst commented Mar 8, 2025

Why v2.0 major version bump?

There are some items on my laundry list of TODOs for Mirascope that are breaking changes, so I've been ignoring them / pushing them off until the time felt right to do a big push and release a new major version (which is a lot of work).

Namely:

  • Clean up core in favor of llm.call as the default
  • Big focus on performance including all non-API call benchmarks
    • Mirascope should (ideally) not really exist when it comes to the clock.
    • That may be true today. We should know for sure (and make it so).
  • Shift the package more towards only having to learn Mirascope
    • Essentially this means the "Learn" section should only ever import from Mirascope
    • Anything provider-specific should only be necessary for truly provider-specific features
  • Clean up class / function naming to remove Base from anything where users are not subclassing
    • Users are using BaseMessageParam but not subclassing it, so MessageParam should suffice
    • Same is true for BaseDynamicConfig etc.
  • Make pydantic an optional dependency
    • While Pydantic is great for many things, it introduces a lot of unnecessary overhead
      • (namely in places where we use it internally and don't have to)
    • This breaks the above point about "only Mirascope" since Response Models require Pydantic (but don't have to)
    • For users who want Pydantic, they can and should be able to use it.

Why push for llm as the default?

Originally we implemented Mirascope such that it would be easy to switch providers, but the interface was not truly agnostic: you would need to dynamically apply provider-specific call decorators (higher-order functions) to @prompt_template-decorated functions. This is not great.

We are making strides towards a truly provider-agnostic interface with llm.call and llm.override etc. but still rely on the original provider-specific calls followed by conversion when requested or on construction. This results in a bunch of unnecessary compute time spent on all of the provider-specific class creation rather than just always converting everything to a common type.

If we're hoping to build the standardized base interface for building with LLMs, then everything must be provider-agnostic by default.

Accessing provider-specific features should be possible with minimal changes to existing code and only necessary when Mirascope does not natively support such a feature (e.g. OpenAI releases a new feature that only they support).

To me, this means that for almost all use-cases, llm.call and other llm module methods (e.g. override) should be sufficient (and in fact be the right solution). We can then implement new/different interfaces that work with llm or are separate for things that are provider-specific.

For example, right now if a user wants to use a custom client (e.g. to access Vertex through the google module), they need to learn about the genai package. Instead, we could do something like llm.client(provider="google") that overloads based on provider and provides any additional arguments they may need (such as vertexai = True). When using the llm.override or llm.context methods, we would set client to None if the provider is different and no client is provided (which would use the default client internally for the given provider).
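
A rough sketch of what that could look like (llm.client and the vertexai argument are hypothetical here; this is speculative API design, not current behavior):

from mirascope import llm

# Hypothetical: construct a provider-specific client without importing the
# provider SDK directly. The provider="google" overload could expose extra
# options such as vertexai=True for routing through Vertex AI.
client = llm.client(provider="google", vertexai=True)

@llm.call(provider="google", model="gemini-2.0-flash", client=client)
def recommend_book(genre: str) -> str:
    return f"Recommend a {genre} book"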

There are only a few places where there are currently provider-specific features that we support that don't really fit into the provider-agnostic interfaces as currently designed, namely custom messages, strict structured outputs, and Anthropic prompt caching. For things like images/audio/video, I think our current approach is good (namely support a provider-agnostic interface but raise errors for providers that don't support it).

The purpose of custom messages is to enable accessing newly released provider-specific features while still being able to take advantage of the rest of the Mirascope eco-system. For example, when OpenAI released GPT-4-Vision, users could write prompts with images using custom messages but still take advantage of e.g. Response Models. I still think it's really important to support this, but I think users would still want provider-agnostic support downstream as mentioned. We could do something like allow provider-specific config return types (e.g. OpenAIDynamicConfig) that allow provider-specific messages, and then if we detect an llm.context with a different provider we raise a runtime user error saying that the function is provider-specific and cannot use a different provider. This would then ensure that all of the other override features (e.g. json_mode or stream etc.) are still available even in the provider-specific case.
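
As a rough sketch of that idea (proposed behavior with llm.call, not how things work today):

from mirascope import llm
from mirascope.core import openai

@llm.call(provider="openai", model="gpt-4o", json_mode=True)
def describe_image(url: str) -> openai.OpenAIDynamicConfig:
    # Raw OpenAI-format messages make this function provider-specific...
    return {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "image_url", "image_url": {"url": url}}],
            }
        ]
    }

# ...but other overrides (json_mode, stream, etc.) would still work. Using
# llm.context with a different provider would raise a runtime user error.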

For strict structured outputs, I think we could solve this by implementing an additional structured or parse decorator that implements provider-agnostic support for strict structured outputs (and only allows it for providers that support it, such as OpenAI, Gemini, potentially Outlines in the future, etc.). Another option would be to differentiate between ResponseModel and StrictResponseModel and only allow certain providers to accept StrictResponseModel in the type hint overloads. This requires some of the stuff around making Pydantic optional, so see below for more details.

For Anthropic prompt caching, I think we can just keep what we have (i.e. CacheControlPart) as a provider-agnostic way of implementing cache controls, and then we can raise a runtime user error if it's used with a non-Anthropic model. I feel this is the right path since there are other providers (such as Bedrock and Vertex) that also technically support this when using Claude models on their platforms, so it's not truly Anthropic-only (it's Claude-only).

I haven't yet figured everything out here, and there's a lot to figure out, but this is the general direction I want to take the library.

Performance matters

LLM API calls are slow. Mirascope should not make them any slower than necessary. As part of this major version push, we should strive for the best performance we can. Constructing a call should be as fast as possible. It should only happen once. Data should only be validated when absolutely necessary. Data should be restructured / formatted only when necessary. Classes should be created only when necessary. Why should we create an OpenAICallResponse through the openai.call decorator under the hood that ultimately becomes a CallResponse instance when using the llm.call decorator? We should just start from CallResponse.

In fact, why is CallResponse a Pydantic model at all? What are we validating? Information that's already been validated by the provider-specific models. We should be using something like attrs instead to provide the same interface without the overhead. We could then provide additional support for e.g. serialization through cattrs, which would also make it much easier for users to implement their own custom serialization logic on top of CallResponse.
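
As a minimal illustration of that direction (not the actual CallResponse definition, just the shape of an attrs class with cattrs serialization):

import attrs
import cattrs

@attrs.define(frozen=True)
class CallResponse:
    content: str
    model: str
    finish_reason: str | None = None

converter = cattrs.Converter()

response = CallResponse(content="Hello!", model="gpt-4o-mini", finish_reason="stop")
print(converter.unstructure(response))  # plain dict, no validation pass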

Why the shift toward "only having to learn Mirascope"?

I think it's extremely important that users who are learning Mirascope only need to learn Mirascope unless otherwise absolutely necessary. The fewer concepts needed to get started -- the less a user has to learn to find Mirascope valuable -- the better.

For example, why should a user who wants to use a custom client have to learn what client to import from a provider-specific package? Users should not need to learn about provider-specific SDK types unless accessing truly provider-specific features that require them. A good example of this would be using provider-specific call parameters (e.g. OpenAICallParams) rather than CommonCallParams because a certain call parameter is only supported by a certain provider (e.g. Google's safety configuration options). Here, it's necessary that a user learn this, and they likely already have, since they most likely discovered the feature by reading the provider-specific documentation.

And that's just for LLM providers. What about a user who wants to structure their outputs? Why should they have to learn about Pydantic if they've never heard about it before and just want to use a dataclass? Sure, if a user knows about Pydantic and wants to take advantage of certain validation or serialization features for Response Models, by all means that should be supported and possible. But it should be optional and not the default requirement that you learn Pydantic. We should opt for the default to use the Python everyone already knows and loves. Everything else should be opt-in.

What's wrong with the current naming conventions?

There's nothing inherently wrong with them. But I care a lot about semantics. Things should be immediately clear just from their naming. For example, the Tool class in the llm module is used only for the actual structured tool output. Users should use BaseTool for defining tools (where the Base prefix makes sense, since it's a parent class users are expected to subclass).

In this vein, it makes sense to remove Base from all things that we generally don't expect or recommend subclassing. For example, we could have BaseDynamicConfig internally for supporting provider-specific configs like OpenAIDynamicConfig, but a user using the llm.call decorator with dynamic configuration should just use llm.DynamicConfig. Same is true for e.g. CommonCallParams -> CallParams.

While not necessarily a huge deal for people, naming matters to me and I think it's worth the additional thought.

Why make Pydantic an optional dependency?

I think a better first question to ask is whether or not Pydantic is necessary. Don't get me wrong. Pydantic is great. The library has done a tremendous amount of good for Python and especially LLM-powered Python.

But it's a lot of overhead. As mentioned earlier, we shouldn't be using Pydantic for validation when we don't need validation. That's just additional compute spent for no reason.

Everything we receive from the LLM provider APIs has already been validated. We don't need to validate it again. For things like model_dump and other serialization, we can implement more native support through libraries such as attrs and cattrs without the additional overhead. This also means that users can more easily implement their own custom serialization without the overhead.

If, for some reason, users really want things like CallResponse to support Pydantic, we can just implement something like PydanticCallResponse that can be easily constructed from a CallResponse instance, but again I would only want to implement this if it's actually useful / desired.

Response Models are a different story. For most cases, Pydantic is overkill. If we're just validating the types, attrs and cattrs are sufficient, and we can easily push those under-the-hood through something like an llm.response_model decorator that converts an object into a ResponseModel[OriginalClass] type. This would also support something like StrictResponseModel[OriginalClass] through e.g. llm.response_model(strict=True) as mentioned before.
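
A sketch of what that could look like (llm.response_model does not exist today, and Book is just an illustrative class):

from dataclasses import dataclass

from mirascope import llm

@llm.response_model  # or llm.response_model(strict=True) for StrictResponseModel[Book]
@dataclass
class Book:
    title: str
    author: str

@llm.call(provider="openai", model="gpt-4o-mini", response_model=Book)
def extract_book(text: str) -> str:
    return f"Extract the book from: {text}"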

This is in line with the principle of only having to learn Mirascope. If we support llm.response_model, then Response Models become a Mirascope-specific thing (and not a Pydantic thing).

We would then of course add opt-in support for using Pydantic with Response Models so that users could take advantage of their additional validation features. For example, we could add a mirascope[pydantic] extra and then just allow response_model to accept a Pydantic BaseModel type definition. In the Learn section, we could put this at the very end with a link to Pydantic for those who want to learn more, but the core Response Model features would be Mirascope-specific from the user's perspective.

Roadmap

All-in-all I'm excited about this direction. There are a few items we should complete first as part of v1.x that are not breaking changes. Implementing them as part of v1.x will also give us the opportunity to see if we like the interfaces, and if there are breaking changes we want to make we can do so as part of the v2.0 push once we've identified them.

I think the roadmap for this work breaks down as follows:

Remaining v1.x Implementation Goals

  • Add support for Gemini Structured Outputs #472 (in prep for strict=True response models)
  • Migrate Pre-Made mirascope.tools -> MCP Community #904: I'm thinking that instead of pre-made tools we should implement them as MCP servers. This would make the tools usable even if not using Mirascope, which I think is important. We might even want to make this a separate library (e.g. mirascope-mcp). I will mark this item as done once I have had a chance to create a new issue around this idea (which I will then add to the v2.0 goals).
  • Add support for Gemini's Multimodal Live API #741 is important to me because I think Realtime APIs are super cool. Since both OpenAI and Gemini support this, I think it's time we finalize our own standardized Python Realtime API and push it into production stable with support for all providers that currently support realtime. Implementing this as part of v1.x is important also since it gives us the freedom to implement breaking changes as part of v2.0 if necessary.
  • Add documentation for how to use tools and response_model together #756 and Update documentation to use llm.call as the default everywhere that makes sense #811 will provide nearly all of the documentation changes we will want for v2.0 but are not blocked by it. We can leave any additional documentation updates (such as renaming updates) to the v2.0 goals, which I've included as an item below.
  • Continue an output if response runs out of output tokens #804 is an extremely common use-case, but I am uncertain about the design. This is another case where I would love to implement and iterate on this as part of v1.x in case we identify breaking changes we want.
  • Formally support Anthropic Claude 3.7 Thinking (without type errors) #872 should be implemented keeping in mind that nearly all providers are now releasing thinking models in some shape or form. Even though OpenAI doesn't currently release thinking traces (unless that's changed?), they likely will in the future given other competitors are as well. We should be designing the Mirascope standardized interface for interacting with thinking models.
  • llm.context: ContextManager for overriding llm call parameters #884 is another key step toward v2.0 but is not blocked by it. There are still a lot of design decisions to make here and things to figure out. For example, it does not currently seem possible to use llm.context and ctx.apply to properly update type hint overrides because we do not have access to the original return type (which we need if no structural overrides are applied). We could probably do something like allow sending the original function as an optional argument to llm.context such that we can properly type it, and if the user calls ctx.apply on their function but doesn't provide it to llm.context then the return type will be Unknown or something (see the sketch after this list).
  • Only Support Open Telemetry And Deprecate Existing Integrations #889 is important for longer term maintenance of the library. I want to push all external integration dependencies into their respective libraries and only keep Mirascope-specific stuff as part of mirascope. If users are using e.g. logfire with Mirascope, they should be importing the instrumentation from logfire and not mirascope. If something is wrong with Mirascope, then we can make that update to Mirascope and all other things will benefit from the change/fix. If something is wrong with e.g. logfire, then we can make that update to Logfire and all other things will benefit from the change/fix.
  • Add support for Mistral OCR #893 and Add support for Gemini Document Understanding #895 are worth adding since right now we only support the :document tag and parts for Anthropic. It's worth making sure this can be provider-agnostic like other features (e.g. images).
  • Automatically generate internal type hint overloads #913 is extremely important for the provider-agnostic -> model-agnostic switch.
  • Native support for image outputs #918 is worthwhile even though only Gemini seems to support this right now. More and more providers/models will support image outputs natively, so we should support this in a model-agnostic way.
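
Regarding the llm.context typing question above, here is a very rough sketch of the "pass the original function" idea (entirely speculative API design; the fn argument name is just a placeholder):

from mirascope import llm

@llm.call(provider="openai", model="gpt-4o-mini")
def answer(question: str) -> str:
    return f"Answer this question: {question}"

# Passing the original function lets llm.context recover the return type so
# that ctx.apply(answer) stays properly typed when no structural overrides
# (e.g. response_model or stream) are applied; omitting it would leave the
# return type Unknown.
with llm.context(provider="anthropic", model="claude-3-5-sonnet-latest", fn=answer) as ctx:
    response = ctx.apply(answer)("What is the capital of France?")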

Once we complete these items, we will freeze the v1.x minor version while we start work on v2.0. Of course, we will continue to implement bug fixes as necessary, which we will then merge over into the v2 development branch.

v2.0 Implementation Goals

As things progress and become more clear, I will likely convert the below roadmap into sub-issues that are individually fleshed out and implemented. I think this will provide necessary clarity around the individual components we implement (and also make reviewing the work easier).

  • Restructure everything around the llm module as the default, updating core to provide utilities rather than provider-specific call decorators. I imagine this would be somewhat similar to our costs module where instead of having provider-specific modules we have utility specific modules that accept a provider argument and then route to the correct provider-specific utility (such as message conversion). This will of course require import suppression etc.
  • Update our naming convention as discussed (i.e. removing Base etc.)
  • Implement a performance benchmarking suite that we can run on v1.x and then run on v2.0 to ensure we're actually optimizing performance. It's important that we build the benchmark around the performance metrics that really matter.
  • Further update/refine documentation around all changes and "only having to learn Mirascope" (a lot of which will have already been done as part of Update documentation to use llm.call as the default everywhere that makes sense #811). Push all things provider-specific into its own Provider-Specific Usage page or something.
  • Completely replace our own internal usage of Pydantic with attrs and cattrs.
  • Make Pydantic optional for Response Models where the llm.response_model decorator is the default.
  • Update examples (e.g. Evals, Agents) to replace Pydantic with standard Python. No reason to subclass BaseModel when implementing an eval or agent -- just use llm.response_model or implement an __init__.
  • Remove gemini and vertex providers in favor of the google provider only.

I will further flesh out the above v2.0 roadmap into sub-issues once things become more clear around specifics of implementation details and plan.

Feedback

Any and all feedback, comments, questions, etc. are welcomed with open arms!

I will compile everything here as it makes sense and update the roadmap accordingly.

Compiled Notes

  • Mirascope v2.0 should be "model-agnostic" (rather than "provider-agnostic"). This distinction is important. The provider is an implementation detail (e.g. the same exact model can be hosted on various providers). From the user's perspective, it's really the model's capabilities that matter, and we need to make sure we properly handle and type things on that (e.g. enable passing in reasoning options if the model supports reasoning).
  • Auto-generating type hint overload signatures so that we can manage proper type hints across the explosion of overloads necessary to handle a per-model overload hierarchy. We should start by implementing this in v1.x.

Final Notes

Let's make Mirascope the standard interface for building with LLMs! I'm really excited about the library and direction, so I hope everyone else is too.

willbakst self-assigned this Mar 8, 2025
@teamdandelion commented

I like this roadmap, and its clear focus on making Mirascope the standard interface for building on top of LLMs. Especially since intense competition and fast-follow dynamics are pushing foundational models towards commoditization, there's very clear value in building standards one layer up. It positions users of Mirascope as seamless beneficiaries of that competition and commoditization.

If I'm understanding correctly, there's a key shift in focus between 1.x and 2.0: 1.x was still "provider-centric", but focused on providing consistent patterns for working with those providers in a way that facilitates portability, whereas in 2.0 we're writing a generic interface for LLMs in general, and the providers are an implementation detail that should mostly live under the hood.

Building on that (and on some offline conversations), I propose that Mirascope 2.0 reorient from being "provider agnostic" to "model agnostic", treating the model (and its capabilities) as the core abstraction, and the provider as an implementation detail. This makes it natural to do things like mix and match models from different providers. Thus, you don't care whether something is an "openai model" or an "anthropic model": you care whether it implements the specific Mirascope interfaces you depend on, like llm.Reasoning, llm.Image, etc. And, within the equivalence class of models that implement your needed interfaces, you care about the cost/performance tradeoff they offer.

At an API level, we would stop having pairs of provider: Provider, model: str arguments, and instead have a single model argument. In most cases I think we can use a string to identify the model along with the provider being used, e.g. anthropic:claude-3-5-sonnet-latest. Under the hood, I expect we'll have a model class that is registered and looked up by those names. Note that this is pretty similar to how PydanticAI does things.

The key benefits of focusing on models not providers are realized if we can encode model capabilities into the type signature. That way you can swap between models confidently, knowing the type system will warn you if your new model choice is missing capabilities that your app needs. It would enable users to very precisely manage their location on the cross model cost/performance tradeoff spectrum, since they could potentially shop across providers for the cheapest image model, the cheapest audio model, etc. (And on the Lilypad / managed generation side, we could automatically do that for them, while ensuring that their performance on the automated evals stays high.)
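
A toy illustration of capability-based typing (none of these types exist in Mirascope today; this is just to show the idea):

from typing import Protocol

class SupportsReasoning(Protocol):
    def call_with_reasoning(self, prompt: str, effort: str) -> str: ...

class SupportsImageOutput(Protocol):
    def generate_image(self, prompt: str) -> bytes: ...

def plan_trip(model: SupportsReasoning, request: str) -> str:
    # Swapping in a model that lacks reasoning support becomes a type-check
    # failure rather than a runtime surprise.
    return model.call_with_reasoning(request, effort="high")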

Tracking per-model type signatures may be painful on the type declaration side (a la lines 47-3484 of llm/_override.py). However, I think it's addressable with code generation, and worthwhile if we can manage it. Since the general 2.0 roadmap seems oriented around "improve 1.X in preparation for landing 2.0", I think we could start by auto-generating type signatures as needed for the providers in 1.X, and see how we feel about it.

Beyond that, just reflecting on what I see (and like) in the roadmap you've laid out:

  • Making all unnecessary dependencies optional—both on a codepath level (e.g. not requiring Pydantic) but also on a learning and conceptual level (not needing to think about providers). But, keeping the option to take on dependencies when you actually need them, and are okay with the (code or conceptual) complexity they add.
  • Cleaning up naming and centering the llm module as the core abstraction.
  • Optimizing performance, starting by benchmarking 1.x and then using that to inform the design of 2.0.

@willbakst commented

I'm glad this is well received! I will compile these notes into the main description and add an issue to the list of v1.x remaining items to work on auto-generating type signatures.

This is certainly non-trivial work that I believe is worthwhile as you've mentioned.

willbakst commented Mar 14, 2025

After some additional thought, I believe we should view the goal for Mirascope v2.0 as building a full agent platform:

  • The standardized interface for building with LLMs, mirascope
  • The lilypad platform (pun intended) for versioning, tracing, evaluation, and optimization

I imagine moving lilypad inside of mirascope so that everything can be more tightly integrated. For example:

from dataclasses import dataclass

from mirascope import lilypad, llm

lilypad.configure()

@dataclass
class Deps:
    user_name: str

def escalate(ctx: llm.AgentContext[Deps]) -> str: ...

@llm.agent(
    deps_type=Deps,
    model="google:gemini-2.0-flash",
    tools=[escalate],
)
def support_bot(ctx: llm.AgentContext[Deps]) -> str:
    return f"You are a customer support bot helping {ctx.deps.user_name}"

deps = Deps(user_name="William")
response = support_bot("I'm unable to access my account", deps=deps) 
print(response.content)

Here, lilypad.configure would automatically version and trace the support_bot agent.

We could also then just use Mirascope types directly for managed generations:

from mirascope import lilypad, llm

@lilypad.generation(managed=True)
def recommend_book(genre: str) -> llm.CallResponse: ...

response = recommend_book("fantasy")
print(response.content)

This way we don't have to build Lilypad wrappers around Mirascope stuff and can instead just use Mirascope directly. We could also then likely implement things like Managed Agents that enable building Agents in a fully no-code way.

Of course, lilypad would still work with non-Mirascope calls through the generation decorator and other existing features.

I also imagine that uv add "mirascope[lilypad]" would install the Lilypad CLI such that you could run e.g.

lilypad deploy --agent support_bot.py

and it would deploy the bot to the platform.

For integrations, we would support e.g. MCP. Here, I imagine using MCP Community (or any other MCP server) with the Mirascope MCP Client sse_client context manager for easily integrating an agent with pre-built tools. Similarly, we could run e.g.

lilypad deploy --mcp duckduckgo

where this would deploy the MCP server to the Lilypad platform. We could also support deploying the server directly from the platform rather than the CLI if that makes sense.

It may also make sense to push MCP Community inside of the mirascope.mcp module and take on the burden of maintaining a bunch of pre-built servers (in place of the mirascope.tools library). This could enable tighter integrations. For example:

from mirascope import llm, mcp

@llm.agent(
    model="google:gemini-2.0-flash",
    tools=[mcp.DuckDuckGo],
)
def bot() -> str:
    return "You are a web search agent."

response = bot("Any recent news about LLMs?")
print(response.content)

This structure (imo) makes everything cleaner and much clearer. Furthermore, the tight coupling would open the door for some really interesting functionality, e.g.

  • lilypad deploy --fsm state_machine.py that deploys any state machine (not just agents)
  • support_bot.compile could compile the agent into an optimal Finite State Machine for cost reduction
  • lilypad optimize support_bot.py that automatically optimizes things (e.g. prompt, cost, latency, etc.)

Curious to hear what people think about this.

@teamdandelion commented

I imagine moving lilypad inside of mirascope so that everything can be more tightly integrated.

When I first read this, I thought the plan was to migrate the full lilypad codebase into mirascope/mirascope, which confused me. Having discussed it offline with @willbakst, I understand the plan a bit better, which I'll summarize here:

First, we keep all the Mirascope 2.0 goals from the top of the issue:

  • Reorganize around provider-agnostic llm calls by default
  • Benchmark and optimize performance
  • Minimize dependency footprint (e.g. remove Pydantic)
  • "Just learn Mirascope" — keep provider-specific details under the hood

Then, we build on that minimized footprint by integrating the Mirascope-specific Lilypad SDK directly into mirascope/mirascope. Meaning:

  • Users who want to use Lilypad with Mirascope can import the relevant APIs as from mirascope import lilypad
  • Those APIs will be really convenient for Mirascope usage since the endpoints and types are co-developed with Mirascope
  • So as to avoid dependency bloat for non-Lilypad Mirascope users, enabling Lilypad requires pip install mirascope[lilypad]
  • Those who want to use Lilypad without Mirascope can depend on the mirascope/lilypad repository directly

Under the hood, this implies a restructure for Lilypad too—which refactors it to no longer depend on Mirascope itself, but instead to provide the underlying APIs needed for the Mirascope-Lilypad SDK that lives in mirascope/mirascope. As an added benefit, this should make it easier / cleaner to build alternative SDKs for Lilypad that connect to other languages or frameworks (e.g. Lilypad for TypeScript).

This has the effect of more clearly positioning Mirascope-the-library as the top of the funnel into (optional, but very well supported) Lilypad usage, and folding Lilypad into the Mirascope brand, rather than it appearing to users as kind of a separate thing. So Lilypad is a key part of "the Mirascope platform" essentially.

@willbakst commented

This hits the nail on the head and provides necessary additional clarity. Thank you!

I imagine we'll do something like:

  • Mirascope/lilypad repository where the application lives
  • Mirascope/lilypad-python-sdk auto-generated SDK for using the application's API in Python (likely using e.g. Stainless or Fern so we can get additional language support for free).
  • lilypad module in Mirascope/mirascope that implements additional language-specific interfaces and functionality such as automatic versioning, type-safe managed generations, etc.

The last bullet here is key. The Lilypad API would not be able to provide versioning beyond accepting the code for versioning. The actual closure computation is language-specific, so even if using provider SDKs directly (e.g. OpenAI), I think it makes sense for the lilypad.generation interface to live inside of mirascope. As we work to support additional languages in the future, we will follow a similar structure.

Also, tightly integrating e.g. automatic versioning with the llm module means we need it to live inside of mirascope. Similarly, the API would provide endpoints for pulling Managed Generation information (prompt, call params, etc.), but mirascope would implement the type-safe interface for Python (where lilypad.generation(managed=True) would directly use Mirascope types e.g. CallResponse rather than thin wrappers).
