Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIREFLY-1615: Improve file type detection #1670

Merged
merged 2 commits into from
Dec 17, 2024
Merged

Conversation

loitly
Copy link
Contributor

@loitly loitly commented Nov 18, 2024

Ticket: https://jira.ipac.caltech.edu/browse/FIREFLY-1615

  • upgrade duckdb to latest version 1.1.3
  • install magic extension for file type detection
  • move TableUtil.Format to FormatUtil.Format
  • added detectFormat to FormatUtil.
  • deprecate TableUtil.guessFormat; use FileUtil.detectFormat instead
  • add FormatUtilTest

Although many files were modified, most changes involved moving Format from TableUtil to FormatUtil. This should not pose significant issues.

From ticket:

The logic is to first use the file's magic number to determine the format. If that fails, a small sample of the file content is analyzed to detect known file types. Finally, trial and error is applied to selected formats based on the file's MIME type.  This logic does not rely on the file extension. 

Test: https://fireflydev.ipac.caltech.edu/firefly-1615-file-type-detection/firefly/
Upload various file types and verify that Firefly functions as expected.

Also, ticket https://jira.ipac.caltech.edu/browse/FIREFLY-1628
My guess is that the server crashes unexpectedly which causes unreproducible errors.
This PR upgrade DuckDB to the latest version which seems to address some of the crashes.

Test: https://firefly-1615-file-type-detection.irsakudev.ipac.caltech.edu/applications/Spitzer/SHA/
Compare to: https://irsatest.ipac.caltech.edu/applications/Spitzer/SHA/

Use SHA IRS Enhanced Products for testing.
Position = ngc 1333
select IRS Enhanced Products
-> Search
Switch to IRS Enhanced Products and repeatedly select different rows. Each row fetches a new table, creating a new database. For each order, a new temporary table is also created.
Compare to this, irsatest was crashing after several tries. You can tell through error messages or by the application becoming unresponsive.

@loitly loitly added this to the 2024.4 milestone Nov 18, 2024
@loitly loitly requested a review from robyww November 18, 2024 17:27
@loitly loitly self-assigned this Nov 18, 2024
Copy link
Contributor

@robyww robyww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine. My big concern, and I might not be understanding the code, is that you are relying on the file extension for type detection. I don't think we did that before.

Comment on lines 117 to 123
public static String getExtension(String s) {
String ext = "";
int i = s.lastIndexOf('.');
if (i > 0 && i < s.length() - 1) {
ext = s.substring(i + 1).toLowerCase();
}
return ext;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to reformat all of FileUtil.java?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the file was poorly formatted, and I had intended to clean up FileUtil. However, based on your suggestion to move my logic elsewhere, I’ll go ahead and revert the entire file.

* @throws IOException If an I/O error occurs while accessing the file.
*/
@Nonnull
public static Format detectFormat(String fname) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this detectFormat function. It probably needs to need in its own file. It make FileUtil.java have a lot of external dependencies. Before it was only dependent on java.

String mime = getMimeType(fname);
DuckDbAdapter.MimeDesc mimeDesc = null;
Format format = mapToFormat(mime, null);
if (format == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that if it can get the format from the file extension then it does not look inside the file like it use to. I think that is dangerous. I think our own server code somethings save stuff with the wrong extension. Because code that was intended for a specific purpose got generalized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tika first checks the magic number and then the file extension. DuckDB's extension, similar to the file command, relies solely on the magic number but can reliably identify Parquet files.

The logic here is to use these two tools to determine the file format. If they fail, we fall back to trial and error based on the resolved MIME type. For example, if the file is identified as XML, we can attempt to process it using a VOTable reader. I’ve implemented handling for common cases and expect to add more as needed.

This approach works well in most cases. However, it fails when a file lacks a magic number but has a misleading file extension—for instance, a .fits file that is actually an IPAC table.

Do we need to handle this edge case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach works well in most cases. However, it fails when a file lacks a magic number but has a misleading file extension—for instance, a .fits file that is actually an IPAC table.

This is exactly my concern. I think there are cases that we download a file with the fits extension even it is an ipac table. I don't remember right now where. I know many of our files have a .ul extension.

Before you do anything let me look into this case and see if I can find it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course. If we need to handle misleading file extensions, I can explore the best approach. The .ul extension is not an issue since it's not a recognized file type.

void run() throws Exception;
}

public static void tryIt(TryWithEx func) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function name is confusing. I studied it and understand now but tryIt() seems to vague for a general string function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove. I plan to introduce more functional helpers gradually in a separate util class.

Comment on lines +33 to +34
// database check
DbAdapter.getAdapter().execQuery("select 1", null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to add.

@loitly loitly force-pushed the FIREFLY-1615-file-type-detection branch 2 times, most recently from 40e71a9 to 7a7784f Compare December 13, 2024 14:53
Copy link
Contributor

@robyww robyww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. There are a few questions I have that are listed below.

@@ -17,12 +17,12 @@
public class DuckDbUDF {
public static String deg2pix = importDeg2Pix();
public static final String decimate_key = """
CREATE FUNCTION decimate_key(xVal, yVal, xMin, yMin, nX, nY, xUnit, yUnit) AS
CREATE OR REPLACE FUNCTION decimate_key(xVal, yVal, xMin, yMin, nX, nY, xUnit, yUnit) AS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to the stability issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This occurs only in edge cases. I included it to ensure the create function never fails and always contains the latest version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest version of DuckDB appears to fix some of the unexpected crashing issue.

Comment on lines +165 to +167
Rows: %,15d Peak Rows: %,15d
Databases: %,10d Peak Databases: %,10d Total Database: %,10d

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad you added this.

Comment on lines 167 to 172
public static boolean isVoTable(File inFile) {
try {
DataGroup[] tbl = voToDataGroups(inFile.getAbsolutePath(), true, 0);
return tbl != null && tbl.length > 0;
} catch (Exception e) { return false; }
}
Copy link
Contributor

@robyww robyww Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function appears to never be called. My bigger question is that is seems to go through code the retrieves the file? That is unexpected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s intended to read only the header, similar to IpacTableReader. However, after reviewing our guesser, I’m confident it does a good enough job detecting VOTable. As a result, this is no longer needed, and I’ll remove it.


static FileInfo ingestTable(DbAdapter dbAdapter, DataGroup table, boolean searchForSpectrum) throws DataAccessException {
if (searchForSpectrum) SpectrumMetaInspector.searchForSpectrum(table,true);
Copy link
Contributor

@robyww robyww Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the DataGroup only partially read here? Is the meta data all in place or searchForSpectrum with not work.

My bigger question - are there cases where searchForSpectrum was being call before but is not longer being called because of the streaming ingestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I’m aware. DataGroup is intended to contain the full table header without the data.

public static Format detect(File inFile) throws IOException {

Format format = null;
DuckDbAdapter.MimeDesc mimeDesc = DuckDbAdapter.getMimeType(inFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this line actually looks inside the file? The function name is a little confusing since it implies it is just looking at the extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line uses DuckDB's magic extension to determine the file's MIME type, similar to the Unix file command. So yes, it will read the magic number(if exists) along with additional data as needed.

@loitly loitly force-pushed the FIREFLY-1615-file-type-detection branch from 700aed6 to 66314c8 Compare December 17, 2024 00:10
@loitly loitly merged commit 58d27ac into dev Dec 17, 2024
@loitly loitly deleted the FIREFLY-1615-file-type-detection branch December 17, 2024 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants