-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sketch of basic input io interface. #5
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,23 +1,2 @@ | ||
# Compiled class file | ||
*.class | ||
|
||
# Log file | ||
*.log | ||
|
||
# BlueJ files | ||
*.ctxt | ||
|
||
# Mobile Tools for Java (J2ME) | ||
.mtj.tmp/ | ||
|
||
# Package Files # | ||
*.jar | ||
*.war | ||
*.nar | ||
*.ear | ||
*.zip | ||
*.tar.gz | ||
*.rar | ||
|
||
# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml | ||
hs_err_pid* | ||
# Mill output directory | ||
out/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
import mill._, scalalib._ | ||
|
||
object core extends JavaModule { | ||
|
||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
package org.samtools.htsjdk.core.io; | ||
|
||
import java.nio.ByteBuffer; | ||
|
||
/** | ||
* Interface for classes that produce binary data (i.e. bytes) that can be decoded | ||
* into other datatypes. | ||
* | ||
* Methods should throw exceptions if: | ||
* - The underlying data source/stream/channel is closed | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. or if you tried to seek / jump to a negative position or one that's > length/ eof. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not specific to this PR: Because it is the first code in the repo, I would suggest that we also put some rules about javadocs for the public API, to render properly as HTML to include the javadocs somewhere (e.g., lists using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed. What are your thoughts on using something like https://github.com/Abnaxos/markdown-doclet so that we can write JavaDoc using MarkDown instead of HTML tags? And yes, once we settle on the actual interface I 100% want to write highly detailed and well formatted javadoc here to try and set the bar for how our public classes should be documented. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like markdown, but most of the java folks that I know use HTML javadoc style; switching it to a custom doclet might produce some undocumented code or errors in rendering. Anyway, I am fine with both, as far as we enforce some style... |
||
* - There are insufficient bytes available (after blocking) to read a single item of the datatype requested | ||
* | ||
* NOTES: | ||
* - Almost all the methods here could have default implementations that read via one of the readBytes() methods. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not having a package protected abstract class with the default implementation? Do we really want this defaults to be part of the public API? If they change, then they mean a major bump! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could. I just figure that for all the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we change the contract, the version is bump; but what if the implementation behaves badly with other methods, but the contract stays the same? Anyway, I am fine with the simple implementations using |
||
* - Should think through how to deal with unsigned types | ||
* - Need to think how this would work with BlockCompressed data | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My idea for a small project to make seekable sources ( |
||
*/ | ||
public interface ByteSource extends AutoCloseable { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks a lot like the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Funny that :) I looked at |
||
/** Returns a description of the underlying source of the bytes, e.g. a path, stdin, a URL etc. */ | ||
String getDescription(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes please |
||
|
||
// Methods related to buffering/repositioning/skipping etc. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This two block separation suggest that this could be split into two interfaces... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My suggestion would be this:
All of which would be done long before even an alpha release. I just find it easier to keep it in one interface while we work things out. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Even in that case, when implementations start to get more complicated it will be difficult to uncouple the two of them. But I don't know what is the timeframe for splitting them that you are suggesting. |
||
/** Returns true if the source has a known finite length, false otherwise. */ | ||
boolean hasLength(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd prefer not to use special values when possible. I'm happy to change the name. How would you feel about There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I vote for use |
||
|
||
/** Returns the length if {@link #hasLength()} is true, otherwise throws an exception. */ | ||
default long length() { throw new UnsupportedOperationException(); } | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this throws by default, the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, I meant |
||
|
||
/** Returns the offset of the next byte that will be read by any operation. */ | ||
long position(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this going to be a virtual file pointer in the case of a compressed file? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I won't recommend that: a compressed file should also be able to seek/getPosition from the disk-representation (I had some struggles about it with the bgzip implementation in the previous htsjdk) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was thinking that yes, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have two concerns about handling position in bgzip (and other compressed formats) with virtual file pointers with the same interface:
The concrete use case that I have in mind is the bgzip FASTA reference, which I implemented in the lates HTSJDK release. It requires a .gzi index file to convert to the virtual file pointer the .fai coordinates. Nevertheless, an implementation to seek to a non-virtual file pointer is possible and it might support non-gzi-indexed files, and at the same time handle as a normal stream if interest to seek to a concrete file position is required. Separating both concepts is something benefitial in my opinion. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Methods for seeking to a virtual file-pointer in bgzip and/or uncompressed-data pointer in other compressed format can be supported with a different method where the value passed is interpreted in an implementation-dependent way. That could be another interface to represent data that is transformed (compressed, block-compressed, maybe encrypted...) |
||
|
||
/** Returns true if the source can skip bytes without reading them. */ | ||
boolean canNavigateForward(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm happy to name them whatever. Perhaps it could even be a single There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any situation where the navigation forward won't be allowed? Skipping bytes could be always done by reading them... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What I was trying to get across is that I think clients may wish to know whether they can skip forward. E.g. if I'm reading a 100GB BAM from S3 or GCS and can't skip, maybe I'll find another way to get the unmapped reads at the end of the file. I think this highlights though that we need to hash out really clear names and semantics for the functions that tell you whether you can re-position/skip in general and/or within some buffered region. |
||
|
||
/** Returns true if the source can always be repositioned behind the current position. Note that a buffered | ||
* source may allow limited backwards navigation within the buffered data, while returning false here. */ | ||
boolean canNavigateBackwards(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I might call this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure what the behavior should be for buffered readers that have some amount of look behind. We could use mark/reset but that's always kind of awkward. We could have some method like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, I find There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of returning false here for limited backwards navigation, we can return number of bytes that can be "safely" navigate backwards:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And related with this method, we can also include a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did think about this @magicDGS I think the problem is that this doesn't work great for cases where you're at There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True. Maybe for random access data it should return the maximum long value and for buffered data the buffer size. This would behave correctly in your case of format detection: even in position 0, the answer is true with that API definition. |
||
|
||
/** Position the source at numBytes after the current position. May be implemented by skipping if the | ||
* underlying source is capable of skipping, or by reading and discarding numBytes. | ||
* | ||
* @param numBytes the number of bytes desired to be skipped. | ||
* @return either numBytes or a smaller number if EOF is hit before numBytes is read | ||
*/ | ||
int navigateForward(final long numBytes); // TODO: implement default method that reads | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we might want to split the navigation methods out into a separate interface? It might make it clearer to document which operations require seekable inputs and which ones don't. |
||
|
||
/** Position the source at numBytes before the current position. | ||
* | ||
* @param numBytes the number of bytes desired to be skipped. | ||
* @return either numBytes or a smaller number if numBytes > position | ||
*/ | ||
default int navigateBackward(final long numBytes) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe something to match |
||
throw new UnsupportedOperationException(); | ||
} | ||
|
||
/** Positions the source such that the next byte read will be at offset `position`. */ | ||
default int navigateToPosition(final long position) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we get rid of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that's fine by me. |
||
throw new UnsupportedOperationException(); | ||
} | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should have an EOF() method of some sort as a standard way to tell if we're at the end of file/stream. Otherwise people will implement it inconsistently themselves. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure. |
||
// Methods for actually reading stuff | ||
|
||
/** Attempt to fill the array with bytes and returns the number of bytes read. */ | ||
int readBytes(final byte[] buffer); | ||
|
||
/** Attempt to fill the regions of the array with bytes and returns the number of bytes read. */ | ||
int readBytes(final byte[] buffer, final int offset, final int length); | ||
|
||
/** Attempt to read numBytes bytes into the buffer starting at the buffer's current position. */ | ||
int readBytes(final ByteBuffer buffer, final int numBytes); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Channel's implementation of read doesn't take a number of bytes, just the buffer, and it just tries to fill it. Would it make more sense to match that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, probably. Though we'll likely be copying from an internal buffer into this buffer (if we're doing any buffering internally) so we could provide the option to limit the bytes read. OTOH maybe we should just take this one out? Probably if someone wants to do this, there's an argument to be made that a different |
||
|
||
/** Reads a single byte. */ | ||
byte readByte(); | ||
|
||
/** Reads a boolean. */ | ||
boolean readBoolean(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some of this methods looks like |
||
|
||
/** Reads a little endian signed short. */ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we also support big-endian reading in case it is required in the future? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See above comment to @lbergelson. |
||
short readShort(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we care about big endian data at all? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm gonna say no. I think the only reason this is even a question these days is because Java (thank Sun!) is big endian. But all x86 compatible chips are little endian, and so any format that has non-JVM implementations will want to externalize as little endian. I think it'll be less confusing overall to have a single implementation, at least until we have a use case for big endian. |
||
|
||
/** Reads a little endian signed int. */ | ||
int readInt(); | ||
|
||
/** Reads a little endian signed long. */ | ||
long readLong(); | ||
|
||
/** Reads a little endian unsigned byte. */ | ||
short readUnsignedByte(); | ||
|
||
/** Reads a little endian unsigned short. */ | ||
int readUnsignedShort(); | ||
|
||
/** Reads a little endian unsigned int. */ | ||
long readUnsignedInt(); | ||
|
||
/** Reads a little endian float. */ | ||
float readFloat(); | ||
|
||
/** Reads a little endian double. */ | ||
double readDouble(); | ||
|
||
/** Reads an ascii string, with one-byte per character, of the given length. */ | ||
String readAsciiString(final int length); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we also want a more generic readString(int, Charset) for dealing with UTF-8 files and others like it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was thinking that in this interface we'd have the necessary methods for reading ascii strings ini binary files as we have in BAM (and I presume BCF, CRAM etc.). But that we'd have a separate |
||
|
||
/** Reads a null-terminated string, with one byte per character, of unknown length. */ | ||
String readNullTerminatedAsciiString(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should this take a max length for safety? |
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that this was unintentional..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since all the generated stuff lives in
out
it was intentionalThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even the crash logs are placed in
out
for Mill? That's neat if so!