Support filter script push down for calcite #3812

Open: 5 commits to merge into base branch main

Conversation

qianheng-aws
Collaborator

Description

Implement filter script push down for Calcite by registering a new script language for Java code strings. This new script language can only be invoked by the Node client since it requires an engine_type option, which is not allowed in the REST client.

Related Issues

Resolves #3379

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Heng Qian <[email protected]>
@penghuo
Collaborator

penghuo commented Jun 23, 2025

This new script language can only be invoked by the Node client since it requires an engine_type option, which is not allowed in the REST client.

This is a side project that runs the plugin as a standalone CLI and leverages DSL queries to interact with the OpenSearch cluster.
Is there any workaround, can we encode the engine_type with script_code?

@LantaoJin
Member

This new script language can only be invoked by the Node client since it requires an engine_type option, which is not allowed in the REST client.

This is a side project that runs the plugin as a standalone CLI and leverages DSL queries to interact with the OpenSearch cluster.

Is there any workaround, can we encode the engine_type with script_code?

Yes, we can check the "calcite" related package information in script code to determine the engine type, rather than encode additional information.
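A minimal sketch of that idea, assuming the generated script text always references classes under `org.apache.calcite`. The class `EngineTypeSniffer`, the `EngineType` enum, and the sample script strings are hypothetical names for illustration, not code from this PR:

```java
/** Hypothetical helper: infers the engine type from the script source itself. */
final class EngineTypeSniffer {
    enum EngineType { CALCITE, V2 }

    // Assumption: code generated for the Calcite engine always mentions this package prefix.
    private static final String CALCITE_PACKAGE = "org.apache.calcite.";

    /** Returns CALCITE when the source mentions a Calcite package, otherwise V2. */
    static EngineType detect(String scriptSource) {
        if (scriptSource != null && scriptSource.contains(CALCITE_PACKAGE)) {
            return EngineType.CALCITE;
        }
        return EngineType.V2;
    }

    public static void main(String[] args) {
        String generated = "return org.apache.calcite.runtime.SqlFunctions.gt(a, b);";
        System.out.println(detect(generated));
        System.out.println(detect("doc['age'].value > 30"));
    }
}
```

A substring check like this only routes the script to an engine. It does not restrict what a caller is allowed to submit.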

@qianheng-aws
Collaborator Author

This new script language can only be invoked by the Node client since it requires an engine_type option, which is not allowed in the REST client.

This is a side project that runs the plugin as a standalone CLI and leverages DSL queries to interact with the OpenSearch cluster. Is there any workaround, can we encode the engine_type with script_code?

Adding an option that is incompatible with the REST client is also a security consideration. Otherwise, users could use the REST client to submit Java code scripts without limitation.

Comment on lines 313 to 333
public static Function1<DataContext, Object[]> compile(String code, Object reason) {
    try {
        ClassBodyEvaluator cbe = new ClassBodyEvaluator();
        cbe.setClassName("Reducer");
        cbe.setExtendedClass(Utilities.class);
        cbe.setImplementedInterfaces(new Class[] {Function1.class, Serializable.class});
        cbe.setParentClassLoader(RexExecutable.class.getClassLoader());
        cbe.cook(new Scanner((String) null, new StringReader(code)));
        Class c = cbe.getClazz();
        Constructor<Function1<DataContext, Object[]>> constructor = c.getConstructor();
        return (Function1) constructor.newInstance();
    } catch (IOException
        | InstantiationException
        | IllegalAccessException
        | InvocationTargetException
        | NoSuchMethodException
        | CompileException var5) {
        Exception e = var5;
        throw new RuntimeException("While compiling " + reason, e);
    }
}
Contributor

Minor: since we already have the code string, could we construct a RexExecutable and call getFunction() instead of copying this method?

@penghuo
Collaborator

penghuo commented Jun 24, 2025

Adding an option that is incompatible with the REST client is also a security consideration. Otherwise, users could use the REST client to submit Java code scripts without limitation.

The current implementation depends on Script.java in Core rejecting any options. If that stops being true in the future, users can run ANY Java code.

if (options.size() > 1 || options.size() == 1 && options.get(CONTENT_TYPE_OPTION) == null) {
    options.remove(CONTENT_TYPE_OPTION);

    throw new IllegalArgumentException("illegal compiler options [" + options + "] specified");
}
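To see which option maps pass this check, the condition can be exercised in isolation. The sketch below copies the quoted condition into a hypothetical `ScriptOptionsCheck` class. The value `"content_type"` for `CONTENT_TYPE_OPTION` and the `engine_type` sample entry are assumptions made for illustration:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical harness around the quoted check; not the OpenSearch Script class itself. */
final class ScriptOptionsCheck {
    // Assumption: the constant's value is guessed here for illustration only.
    static final String CONTENT_TYPE_OPTION = "content_type";

    /** Same condition as the quoted snippet, wrapped so it can run standalone. */
    static void validate(Map<String, String> options) {
        if (options.size() > 1 || options.size() == 1 && options.get(CONTENT_TYPE_OPTION) == null) {
            options.remove(CONTENT_TYPE_OPTION);
            throw new IllegalArgumentException("illegal compiler options [" + options + "] specified");
        }
    }

    /** Returns true when the options pass the check, false when they are rejected. */
    static boolean accepted(Map<String, String> options) {
        try {
            validate(new HashMap<>(options)); // copy, because validate mutates its argument
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepted(Map.of()));                                   // no options
        System.out.println(accepted(Map.of("content_type", "application/json"))); // content type only
        System.out.println(accepted(Map.of("engine_type", "calcite")));           // any other option
    }
}
```

Only an empty map or a map holding just the content type option passes. Any other option, including a hypothetical `engine_type`, is rejected, which is the behavior the PR currently relies on.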

Are there any workarounds we can consider? Instead of using raw Java code, can we leverage LINQ4j Enumerable or V2 Expression to limit user input to only allowlisted expressions and reduce the risk?

Comment on lines +971 to +976
// Compile code when creating to detect exception as early as possible
JavaTypeFactoryImpl typeFactory =
    new JavaTypeFactoryImpl(rexBuilder.getTypeFactory().getTypeSystem());
RexToLixTranslator.InputGetter getter =
    new ScriptInputGetter(typeFactory, rowType, fieldTypes);
this.code = CalciteScriptEngine.translate(rexBuilder, List.of(rexNode), getter, rowType);
Collaborator

@dai-chen dai-chen Jun 26, 2025

I'm wondering whether we could delay the compilation to Java until compile() in ScriptEngine. That way we would serialize and deserialize the RexCall instead, similar to V2 expression pushdown. Just some thoughts.

Collaborator Author

RexNode is unserializable. We've thought of a similar method, as mentioned in option 2 here: #3379 (comment).

@songkant-aws is working on option 2, but it does not seem feasible.

Contributor

@songkant-aws songkant-aws Jun 26, 2025

Using Kryo serialization for classes without a zero-argument constructor requires explicitly registering a serializer for each such class. That means we may need to spend more effort on writing separate custom Kryo serializers for the Calcite Expression and Enumerable classes.

If we go this way, I'm not yet confident that all kinds of expressions can be serialized correctly.

Collaborator

@dai-chen dai-chen Jun 26, 2025

RexNode is unserializable.

You mean it's not serializable by JDK or Kryo, right? Would another serializer work, or even JSON?

Do we know how many classes related to RexCall we need to serialize? If there aren't many, and if we can ignore fields not needed by code-gen, could we write a custom serializer?
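As a rough illustration of a custom serializer that writes only the fields code generation would need, here is a sketch using plain `java.io` streams. `FieldRef` is a hypothetical stand-in with no zero-argument constructor and a derived field. It is not a Calcite class, and real RexNode types carry far more state:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical hand-written codec; names are illustrative only. */
final class FieldRefCodec {
    /** Stand-in node: no zero-argument constructor and not java.io.Serializable. */
    static final class FieldRef {
        final int index;
        final String typeName;
        final String cachedDigest; // derived in the constructor, so it is never written

        FieldRef(int index, String typeName) {
            this.index = index;
            this.typeName = typeName;
            this.cachedDigest = "$" + index + ":" + typeName;
        }
    }

    /** Writes only the two fields needed to rebuild the node. */
    static byte[] write(FieldRef ref) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(ref.index);
            out.writeUTF(ref.typeName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the node through its normal constructor, which recomputes the derived field. */
    static FieldRef read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new FieldRef(in.readInt(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FieldRef copy = read(write(new FieldRef(2, "INTEGER")));
        System.out.println(copy.index + " " + copy.typeName + " " + copy.cachedDigest);
    }
}
```

The effort grows with every node type and with every object those nodes reference, so this only shows the shape of the approach for one simple case.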

Collaborator

@dai-chen dai-chen Jun 26, 2025

@qianheng-aws I checked the option 2 you posted. I feel it's different?

What I was thinking is:

  • Coordinator: RexNode -> serialized code
  • Worker: Code -> RexNode -> Linq4j expression -> Interpret or compile to Java

In this way, the script engine can be restricted to allowlisted RexNode types.
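The flow above could look roughly like the following toy sketch. `Node` is a hypothetical stand-in for RexNode, the prefix-text wire format is invented for illustration, and the worker interprets the tree directly instead of going through Linq4j. The point is only that decoding fails for any operator name outside the allowlist:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Sketch only: Node is a hypothetical stand-in for RexNode, not a Calcite class. */
final class PushdownRoundTrip {
    static final class Node {
        final String op;        // operator name, "$n" input reference, or integer literal
        final List<Node> args;  // empty for leaves
        Node(String op, Node... args) { this.op = op; this.args = List.of(args); }
    }

    // Worker-side allowlist: only these binary operators can be decoded and evaluated.
    static final Map<String, BiFunction<Object, Object, Object>> ALLOWED = Map.of(
        ">", (a, b) -> (Integer) a > (Integer) b,
        "=", (a, b) -> a.equals(b),
        "AND", (a, b) -> (Boolean) a && (Boolean) b);

    /** Coordinator: tree -> prefix text. */
    static String encode(Node n) {
        if (n.args.isEmpty()) return n.op;
        StringBuilder sb = new StringBuilder("(").append(n.op);
        for (Node a : n.args) sb.append(' ').append(encode(a));
        return sb.append(')').toString();
    }

    /** Worker: text -> tree, rejecting anything outside the allowlist. */
    static Node decode(String text) {
        List<String> tokens = new ArrayList<>(List.of(
            text.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+")));
        Node n = parse(tokens);
        if (!tokens.isEmpty()) throw new IllegalArgumentException("trailing input");
        return n;
    }

    private static Node parse(List<String> tokens) {
        String t = tokens.remove(0);
        if (!t.equals("(")) {
            if (!t.matches("\\$\\d+|-?\\d+")) throw new IllegalArgumentException("bad leaf: " + t);
            return new Node(t);
        }
        String op = tokens.remove(0);
        if (!ALLOWED.containsKey(op)) throw new IllegalArgumentException("operator not allowlisted: " + op);
        Node left = parse(tokens);
        Node right = parse(tokens);
        if (!tokens.remove(0).equals(")")) throw new IllegalArgumentException("expected )");
        return new Node(op, left, right);
    }

    /** Worker: interpret the decoded tree against one row of field values. */
    static Object eval(Node n, Object[] row) {
        if (n.args.isEmpty()) {
            return n.op.startsWith("$") ? row[Integer.parseInt(n.op.substring(1))] : Integer.valueOf(n.op);
        }
        return ALLOWED.get(n.op).apply(eval(n.args.get(0), row), eval(n.args.get(1), row));
    }

    public static void main(String[] args) {
        Node filter = new Node("AND",
            new Node(">", new Node("$0"), new Node("30")),
            new Node("=", new Node("$1"), new Node("1")));
        String wire = encode(filter);
        System.out.println(wire);
        System.out.println(eval(decode(wire), new Object[] {35, 1}));
    }
}
```

Because operators travel by name and are resolved from a worker-side table, an unknown name such as `EXEC` fails in `decode` before anything runs.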

Collaborator

@songkant-aws Not sure if you've explored the idea above. If so, could you share your PoC branch, if any? I can do more testing on my side. Thanks!

Contributor

@dai-chen For convenience, I pulled Heng's code locally, added my Kryo serialization prototype on top of it, and pushed the result to my personal branch: https://github.com/songkant-aws/sql/tree/kryo-serialization

Basically, the logic is similar. I just feel that adding serialization logic for third-party classes requires too much effort.

Collaborator Author

Do we know how many classes related to RexCall we need to serialize?

We may not need many RexNode types: mainly RexCall, RexLiteral, RexInputRef, RexLambda, RexLambdaRef, and so on. I think the biggest blocker is their inner fields. RexCall, for example, holds a SqlOperator, and there are plenty of those implemented by Calcite or by us.

Collaborator

@songkant-aws @qianheng-aws Got it. I'll have a look and also see whether there are other options on my side. Thanks!

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Calcite Engine Framework: Pushdown scripts
6 participants