-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIREFLY-1615: Improve file type detection #1670
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks fine. My big concern, and I might not be understanding the code, is that you are relying on the file extension for type detection. I don't think we did that before.
public static String getExtension(String s) { | ||
String ext = ""; | ||
int i = s.lastIndexOf('.'); | ||
if (i > 0 && i < s.length() - 1) { | ||
ext = s.substring(i + 1).toLowerCase(); | ||
} | ||
return ext; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to reformat all of FileUtil.java
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the file was poorly formatted, and I had intended to clean up FileUtil. However, based on your suggestion to move my logic elsewhere, I’ll go ahead and revert the entire file.
* @throws IOException If an I/O error occurs while accessing the file. | ||
*/ | ||
@Nonnull | ||
public static Format detectFormat(String fname) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this detectFormat
function. It probably needs to need in its own file. It make FileUtil.java have a lot of external dependencies. Before it was only dependent on java.
String mime = getMimeType(fname); | ||
DuckDbAdapter.MimeDesc mimeDesc = null; | ||
Format format = mapToFormat(mime, null); | ||
if (format == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that if it can get the format from the file extension then it does not look inside the file like it use to. I think that is dangerous. I think our own server code somethings save stuff with the wrong extension. Because code that was intended for a specific purpose got generalized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tika first checks the magic number and then the file extension. DuckDB's extension, similar to the file command, relies solely on the magic number but can reliably identify Parquet files.
The logic here is to use these two tools to determine the file format. If they fail, we fall back to trial and error based on the resolved MIME type. For example, if the file is identified as XML, we can attempt to process it using a VOTable reader. I’ve implemented handling for common cases and expect to add more as needed.
This approach works well in most cases. However, it fails when a file lacks a magic number but has a misleading file extension—for instance, a .fits file that is actually an IPAC table.
Do we need to handle this edge case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach works well in most cases. However, it fails when a file lacks a magic number but has a misleading file extension—for instance, a .fits file that is actually an IPAC table.
This is exactly my concern. I think there are cases that we download a file with the fits extension even it is an ipac table. I don't remember right now where. I know many of our files have a .ul
extension.
Before you do anything let me look into this case and see if I can find it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course. If we need to handle misleading file extensions, I can explore the best approach. The .ul
extension is not an issue since it's not a recognized file type.
void run() throws Exception; | ||
} | ||
|
||
public static void tryIt(TryWithEx func) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function name is confusing. I studied it and understand now but tryIt()
seems to vague for a general string function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove. I plan to introduce more functional helpers gradually in a separate util class.
// database check | ||
DbAdapter.getAdapter().execQuery("select 1", null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to add.
40e71a9
to
7a7784f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. There are a few questions I have that are listed below.
@@ -17,12 +17,12 @@ | |||
public class DuckDbUDF { | |||
public static String deg2pix = importDeg2Pix(); | |||
public static final String decimate_key = """ | |||
CREATE FUNCTION decimate_key(xVal, yVal, xMin, yMin, nX, nY, xUnit, yUnit) AS | |||
CREATE OR REPLACE FUNCTION decimate_key(xVal, yVal, xMin, yMin, nX, nY, xUnit, yUnit) AS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to the stability issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This occurs only in edge cases. I included it to ensure the create function never fails and always contains the latest version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest version of DuckDB appears to fix some of the unexpected crashing issue.
Rows: %,15d Peak Rows: %,15d | ||
Databases: %,10d Peak Databases: %,10d Total Database: %,10d | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad you added this.
public static boolean isVoTable(File inFile) { | ||
try { | ||
DataGroup[] tbl = voToDataGroups(inFile.getAbsolutePath(), true, 0); | ||
return tbl != null && tbl.length > 0; | ||
} catch (Exception e) { return false; } | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function appears to never be called. My bigger question is that is seems to go through code the retrieves the file? That is unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s intended to read only the header, similar to IpacTableReader. However, after reviewing our guesser, I’m confident it does a good enough job detecting VOTable. As a result, this is no longer needed, and I’ll remove it.
|
||
static FileInfo ingestTable(DbAdapter dbAdapter, DataGroup table, boolean searchForSpectrum) throws DataAccessException { | ||
if (searchForSpectrum) SpectrumMetaInspector.searchForSpectrum(table,true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the DataGroup only partially read here? Is the meta data all in place or searchForSpectrum with not work.
My bigger question - are there cases where searchForSpectrum was being call before but is not longer being called because of the streaming ingestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I’m aware. DataGroup is intended to contain the full table header without the data.
public static Format detect(File inFile) throws IOException { | ||
|
||
Format format = null; | ||
DuckDbAdapter.MimeDesc mimeDesc = DuckDbAdapter.getMimeType(inFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this line actually looks inside the file? The function name is a little confusing since it implies it is just looking at the extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line uses DuckDB's magic extension to determine the file's MIME type, similar to the Unix file
command. So yes, it will read the magic number(if exists) along with additional data as needed.
- update DuckDB to v1.1.3 - add DuckDB magic extension
700aed6
to
66314c8
Compare
Ticket: https://jira.ipac.caltech.edu/browse/FIREFLY-1615
Although many files were modified, most changes involved moving Format from TableUtil to FormatUtil. This should not pose significant issues.
From ticket:
Test: https://fireflydev.ipac.caltech.edu/firefly-1615-file-type-detection/firefly/
Upload various file types and verify that Firefly functions as expected.
Also, ticket https://jira.ipac.caltech.edu/browse/FIREFLY-1628
My guess is that the server crashes unexpectedly which causes unreproducible errors.
This PR upgrade DuckDB to the latest version which seems to address some of the crashes.
Test: https://firefly-1615-file-type-detection.irsakudev.ipac.caltech.edu/applications/Spitzer/SHA/
Compare to: https://irsatest.ipac.caltech.edu/applications/Spitzer/SHA/
Use SHA IRS Enhanced Products for testing.
Position = ngc 1333
select
IRS Enhanced Products
-> Search
Switch to
IRS Enhanced Products
and repeatedly select different rows. Each row fetches a new table, creating a new database. For each order, a new temporary table is also created.Compare to this, irsatest was crashing after several tries. You can tell through error messages or by the application becoming unresponsive.