FIREFLY-1615: Improve file type detection #1670

loitly · 2024-11-18T17:27:44Z

Ticket: https://jira.ipac.caltech.edu/browse/FIREFLY-1615

upgrade duckdb to latest version 1.1.3
install magic extension for file type detection
move TableUtil.Format to FormatUtil.Format
added detectFormat to FormatUtil.
deprecate TableUtil.guessFormat; use FileUtil.detectFormat instead
add FormatUtilTest

Although many files were modified, most changes involved moving Format from TableUtil to FormatUtil. This should not pose significant issues.

From ticket:

The logic is to first use the file's magic number to determine the format. If that fails, a small sample of the file content is analyzed to detect known file types. Finally, trial and error is applied to selected formats based on the file's MIME type. This logic does not rely on the file extension.

Test: https://fireflydev.ipac.caltech.edu/firefly-1615-file-type-detection/firefly/
Upload various file types and verify that Firefly functions as expected.

Also, ticket https://jira.ipac.caltech.edu/browse/FIREFLY-1628
My guess is that the server crashes unexpectedly which causes unreproducible errors.
This PR upgrade DuckDB to the latest version which seems to address some of the crashes.

Test: https://firefly-1615-file-type-detection.irsakudev.ipac.caltech.edu/applications/Spitzer/SHA/
Compare to: https://irsatest.ipac.caltech.edu/applications/Spitzer/SHA/

Use SHA IRS Enhanced Products for testing.
Position = ngc 1333
select IRS Enhanced Products
-> Search
Switch to IRS Enhanced Products and repeatedly select different rows. Each row fetches a new table, creating a new database. For each order, a new temporary table is also created.
Compare to this, irsatest was crashing after several tries. You can tell through error messages or by the application becoming unresponsive.

robyww

The code looks fine. My big concern, and I might not be understanding the code, is that you are relying on the file extension for type detection. I don't think we did that before.

robyww · 2024-12-10T18:39:28Z

src/firefly/java/edu/caltech/ipac/util/FileUtil.java

+    public static String getExtension(String s) {
+        String ext = "";
+        int i = s.lastIndexOf('.');
+        if (i > 0 && i < s.length() - 1) {
+            ext = s.substring(i + 1).toLowerCase();
+        }
+        return ext;


Did you mean to reformat all of FileUtil.java?

Yes, the file was poorly formatted, and I had intended to clean up FileUtil. However, based on your suggestion to move my logic elsewhere, I’ll go ahead and revert the entire file.

robyww · 2024-12-10T18:42:39Z

src/firefly/java/edu/caltech/ipac/util/FileUtil.java

+     * @throws IOException If an I/O error occurs while accessing the file.
+     */
+    @Nonnull
+    public static Format detectFormat(String fname) throws IOException {


I am not sure this detectFormat function. It probably needs to need in its own file. It make FileUtil.java have a lot of external dependencies. Before it was only dependent on java.

robyww · 2024-12-10T18:45:32Z

src/firefly/java/edu/caltech/ipac/util/FileUtil.java

+        String mime = getMimeType(fname);
+        DuckDbAdapter.MimeDesc mimeDesc = null;
+        Format format = mapToFormat(mime, null);
+        if (format == null) {


It appears that if it can get the format from the file extension then it does not look inside the file like it use to. I think that is dangerous. I think our own server code somethings save stuff with the wrong extension. Because code that was intended for a specific purpose got generalized.

Tika first checks the magic number and then the file extension. DuckDB's extension, similar to the file command, relies solely on the magic number but can reliably identify Parquet files.

The logic here is to use these two tools to determine the file format. If they fail, we fall back to trial and error based on the resolved MIME type. For example, if the file is identified as XML, we can attempt to process it using a VOTable reader. I’ve implemented handling for common cases and expect to add more as needed.

This approach works well in most cases. However, it fails when a file lacks a magic number but has a misleading file extension—for instance, a .fits file that is actually an IPAC table.

Do we need to handle this edge case?

This approach works well in most cases. However, it fails when a file lacks a magic number but has a misleading file extension—for instance, a .fits file that is actually an IPAC table.

This is exactly my concern. I think there are cases that we download a file with the fits extension even it is an ipac table. I don't remember right now where. I know many of our files have a .ul extension.

Before you do anything let me look into this case and see if I can find it.

Of course. If we need to handle misleading file extensions, I can explore the best approach. The .ul extension is not an issue since it's not a recognized file type.

robyww · 2024-12-10T18:49:24Z

src/firefly/java/edu/caltech/ipac/util/StringUtils.java

+        void run() throws Exception;
+    }
+
+    public static void tryIt(TryWithEx func) {


This function name is confusing. I studied it and understand now but tryIt() seems to vague for a general string function.

I will remove. I plan to introduce more functional helpers gradually in a separate util class.

robyww · 2024-12-10T18:50:48Z

src/firefly/java/edu/caltech/ipac/firefly/server/servlets/HealthCheck.java

+            // database check
+            DbAdapter.getAdapter().execQuery("select 1", null);


This makes sense to add.

robyww

This looks good. There are a few questions I have that are listed below.

robyww · 2024-12-13T19:08:16Z

src/firefly/java/edu/caltech/ipac/firefly/server/db/DuckDbUDF.java

@@ -17,12 +17,12 @@
 public class DuckDbUDF {
    public static String deg2pix = importDeg2Pix();
    public static final String decimate_key = """
-        CREATE FUNCTION decimate_key(xVal, yVal, xMin, yMin, nX, nY, xUnit, yUnit) AS
+        CREATE OR REPLACE FUNCTION decimate_key(xVal, yVal, xMin, yMin, nX, nY, xUnit, yUnit) AS


Is this related to the stability issue?

No. This occurs only in edge cases. I included it to ensure the create function never fails and always contains the latest version.

The latest version of DuckDB appears to fix some of the unexpected crashing issue.

robyww · 2024-12-13T19:08:47Z

src/firefly/java/edu/caltech/ipac/firefly/server/servlets/ServerStatus.java

+            Rows: %,15d       Peak Rows: %,15d 
+            Databases: %,10d       Peak Databases: %,10d      Total Database: %,10d
+


Glad you added this.

robyww · 2024-12-13T19:17:25Z

src/firefly/java/edu/caltech/ipac/table/io/VoTableReader.java

+    public static boolean isVoTable(File inFile) {
+        try {
+            DataGroup[] tbl = voToDataGroups(inFile.getAbsolutePath(), true, 0);
+            return tbl != null && tbl.length > 0;
+        } catch (Exception e) { return false; }
+    }


This function appears to never be called. My bigger question is that is seems to go through code the retrieves the file? That is unexpected.

It’s intended to read only the header, similar to IpacTableReader. However, after reviewing our guesser, I’m confident it does a good enough job detecting VOTable. As a result, this is no longer needed, and I’ll remove it.

robyww · 2024-12-13T19:24:51Z

src/firefly/java/edu/caltech/ipac/firefly/server/db/DbDataIngestor.java


+    static FileInfo ingestTable(DbAdapter dbAdapter, DataGroup table, boolean searchForSpectrum) throws DataAccessException {
+        if (searchForSpectrum) SpectrumMetaInspector.searchForSpectrum(table,true);


Is the DataGroup only partially read here? Is the meta data all in place or searchForSpectrum with not work.

My bigger question - are there cases where searchForSpectrum was being call before but is not longer being called because of the streaming ingestion?

Yes, I’m aware. DataGroup is intended to contain the full table header without the data.

robyww · 2024-12-13T19:32:29Z

src/firefly/java/edu/caltech/ipac/util/FormatUtil.java

+    public static Format detect(File inFile) throws IOException {
+
+        Format format = null;
+        DuckDbAdapter.MimeDesc mimeDesc = DuckDbAdapter.getMimeType(inFile);


So this line actually looks inside the file? The function name is a little confusing since it implies it is just looking at the extension.

This line uses DuckDB's magic extension to determine the file's MIME type, similar to the Unix file command. So yes, it will read the magic number(if exists) along with additional data as needed.

- update DuckDB to v1.1.3 - add DuckDB magic extension

loitly added the Infrastructure label Nov 18, 2024

loitly added this to the 2024.4 milestone Nov 18, 2024

loitly requested a review from robyww November 18, 2024 17:27

loitly self-assigned this Nov 18, 2024

robyww reviewed Dec 10, 2024

View reviewed changes

loitly force-pushed the FIREFLY-1615-file-type-detection branch 2 times, most recently from 40e71a9 to 7a7784f Compare December 13, 2024 14:53

robyww approved these changes Dec 13, 2024

View reviewed changes

loitly added 2 commits December 16, 2024 14:10

FIREFLY-1615: Improve file type detection

287609f

- update DuckDB to v1.1.3 - add DuckDB magic extension

adjustments made to address reviewer concern

66314c8

loitly force-pushed the FIREFLY-1615-file-type-detection branch from 700aed6 to 66314c8 Compare December 17, 2024 00:10

loitly merged commit 58d27ac into dev Dec 17, 2024

loitly deleted the FIREFLY-1615-file-type-detection branch December 17, 2024 00:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIREFLY-1615: Improve file type detection #1670

FIREFLY-1615: Improve file type detection #1670

loitly commented Nov 18, 2024 •

edited

Loading

robyww left a comment

robyww Dec 10, 2024

loitly Dec 10, 2024

robyww Dec 10, 2024

robyww Dec 10, 2024

loitly Dec 10, 2024

robyww Dec 10, 2024

loitly Dec 10, 2024

robyww Dec 10, 2024

loitly Dec 10, 2024

robyww Dec 10, 2024

robyww left a comment

robyww Dec 13, 2024

loitly Dec 16, 2024

loitly Dec 16, 2024

robyww Dec 13, 2024

robyww Dec 13, 2024 •

edited

Loading

loitly Dec 16, 2024

robyww Dec 13, 2024 •

edited

Loading

loitly Dec 16, 2024

robyww Dec 13, 2024

loitly Dec 16, 2024

		// database check
		DbAdapter.getAdapter().execQuery("select 1", null);

		Rows: %,15d Peak Rows: %,15d
		Databases: %,10d Peak Databases: %,10d Total Database: %,10d


		static FileInfo ingestTable(DbAdapter dbAdapter, DataGroup table, boolean searchForSpectrum) throws DataAccessException {
		if (searchForSpectrum) SpectrumMetaInspector.searchForSpectrum(table,true);

FIREFLY-1615: Improve file type detection #1670

FIREFLY-1615: Improve file type detection #1670

Conversation

loitly commented Nov 18, 2024 • edited Loading

robyww left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

robyww left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

robyww Dec 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

robyww Dec 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

loitly commented Nov 18, 2024 •

edited

Loading

robyww Dec 13, 2024 •

edited

Loading

robyww Dec 13, 2024 •

edited

Loading