Snowflake write: escape backslash in CSV files #5551

Open
wants to merge 1 commit into
base: main

turb
Contributor

@turb turb commented Jan 28, 2025

I'm unsure this is the right way to do it; I can't find how to properly configure escape characters with kantan.

codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 61.28%. Comparing base (d5d20ad) to head (40c7611).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...scala/com/spotify/scio/snowflake/SnowflakeIO.scala | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5551      +/-   ##
==========================================
- Coverage   61.29%   61.28%   -0.01%     
==========================================
  Files         314      314              
  Lines       11250    11250              
  Branches      793      776      -17     
==========================================
- Hits         6896     6895       -1     
- Misses       4354     4355       +1     

@RustedBones
Contributor

If the data contains \, wouldn't it be better to set ESCAPE_UNENCLOSED_FIELD=NONE, as explained here?

@turb
Contributor Author

turb commented Jan 28, 2025

> If the data contains \, wouldn't it be better to set ESCAPE_UNENCLOSED_FIELD=NONE, as explained here?

I don't think that would work with data that may contain any combination of comma (,), backslash (\), and single quote ('), the last of which is escaped as \'.

We found this because our data contains a string like something \ (in Scala syntax, "something \\") that is encoded in CSV as 'something \'. The Snowflake parser then interprets the closing quote as escaped.

With the fix, it is now encoded as 'something \\' ("something \\\\" in Scala syntax).
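
To make the before/after concrete, here is a minimal sketch in plain Scala. It does not use kantan or Scio, and `joinQuoted` and `escapeBackslashes` are hypothetical helper names chosen for illustration; this is not the actual change in SnowflakeIO.scala, only the same idea in isolation.

```scala
object BackslashEscapeSketch {
  // Hypothetical helper: wraps each field in single quotes and joins with commas.
  def joinQuoted(fields: Seq[String]): String =
    fields.mkString("'", "','", "'")

  // Hypothetical helper: doubles every backslash (literal replacement, no regex),
  // so a trailing backslash can no longer escape the closing quote.
  def escapeBackslashes(field: String): String =
    field.replace("\\", "\\\\")

  def main(args: Array[String]): Unit = {
    val raw = "something \\" // the text: something \

    // Without escaping, the line ends in \' which a backslash-aware
    // parser can read as an escaped quote.
    println(joinQuoted(Seq(raw)))

    // With escaping, the line ends in \\' : an escaped backslash, then the closing quote.
    println(joinQuoted(Seq(escapeBackslashes(raw))))
  }
}
```

The first line printed is the problematic encoding and the second is the encoding after the fix.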

kantan may be able to handle this (with kantan.csv.CsvConfiguration), but seemingly only at the kantan.csv level, not in kantan.codecs.

@turb
Contributor Author

turb commented Jan 30, 2025

FYI I also opened a PR in Apache Beam: apache/beam#33803

I think it's difficult to have something really clean, since Beam does only half of the CSV serialization, with a naive fields.mkString("'", "','", "'"). Providing the other half (an array of serialized fields) can therefore only be a hack around that...
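
As an illustration of that split, here is a rough sketch of the two halves in plain Scala. The names `escapeField` and `toFields` are hypothetical, and the join is only a model of the `mkString` call quoted above, not Beam's actual code. It assumes single-quoted fields with backslash as the escape character, as discussed earlier in this thread.

```scala
object FieldSerializationSketch {
  // Hypothetical user-side half: escape a single field.
  // Backslashes are doubled first; doing the quote first would double
  // the backslash that was just added in front of the quote.
  def escapeField(field: String): String =
    field.replace("\\", "\\\\").replace("'", "\\'")

  // Hypothetical user-side half: a record becomes an array of serialized fields.
  def toFields(record: Seq[String]): Array[String] =
    record.map(escapeField).toArray

  // Model of the other half: the naive join that adds the quotes and commas.
  def joinQuoted(fields: Array[String]): String =
    fields.mkString("'", "','", "'")

  def main(args: Array[String]): Unit = {
    val record = Seq("it's", "a,b", "something \\")
    // Prints: 'it\'s','a,b','something \\'
    println(joinQuoted(toFields(record)))
  }
}
```

Since the join adds quotes without looking at the field contents, all escaping has to happen in the first half, which is why the result can only be a workaround.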
