Squashed commit of the following:
commit 56a2269762b7ccd8986790aa9f0d235172ff3161
Author: Whitney O'Meara <[email protected]>
Date:   Wed Jun 26 20:40:33 2024 +0000

    updated type-utils version

commit ab0098dbe86f2d15bf2a18c1b74dc22623183c30
Author: Whitney O'Meara <[email protected]>
Date:   Wed Jun 26 18:51:44 2024 +0000

    Reapply "Feature/serialization minimap (#23)"

    This reverts commit 8255413.

commit 1a772f6480d1edf6fa383e2e6f4dc6138209b88e
Merge: 8255413 ebca7ce
Author: Whitney O'Meara <[email protected]>
Date:   Wed Jun 26 18:16:19 2024 +0000

    Merge remote-tracking branch 'origin/main' into feature/mapService

commit ebca7ce
Author: Moon Moon <[email protected]>
Date:   Wed Jun 26 13:31:05 2024 -0400

    Preventing npe from empty data or ingest type string (#38)

commit 8255413
Author: Whitney O'Meara <[email protected]>
Date:   Tue Jun 25 13:32:06 2024 +0000

    Revert "Feature/serialization minimap (#23)"

    This reverts commit 5b62bf6.

commit 7760b83
Author: Moriarty <[email protected]>
Date:   Tue Jun 25 09:45:29 2024 +0000

    [maven-release-plugin] prepare for next development iteration

commit 6a34dce
Author: Moriarty <[email protected]>
Date:   Tue Jun 25 09:45:27 2024 +0000

    [maven-release-plugin] prepare release 4.0.2

commit f589568
Merge: add487b 5b62bf6
Author: Whitney O'Meara <[email protected]>
Date:   Mon Jun 24 22:08:51 2024 +0000

    Merge remote-tracking branch 'origin/main' into feature/mapService

commit 5b62bf6
Author: Moon Moon <[email protected]>
Date:   Mon Jun 24 13:42:14 2024 -0400

    Feature/serialization minimap (#23)

    * Adding ability to parse string with minimap

    * Creating mini-map during serialization

    * Changing to TreeSet for ordering purposes

    * WIP forming new mini-map string

    * Ensuring ordered types during serialization

    * Removing duplicate unit test

    * Adding ability to parse string with minimap

    * Creating mini-map during serialization

    * Changing to TreeSet for ordering purposes

    * WIP forming new mini-map string

    * Ensuring ordered types during serialization

    * Removing duplicate unit test

    * Moving hard coded strings

    * Formatting

    * Removing old method calls

    * Removing unnecessary exception throwing

    * Formatting

    * Updating to remove HashSet to preserve ordering

    * Updating unit tests

    * Updating unit tests again

    * Updating unit tests again again

    * Formatting

    * Removing old methods

    * Adding in fieldName creation

    * Updates based on testing

    * Returning immutable map

    * Fixing concatenated dataTypes

    ---------

    Co-authored-by: Ivan Bella <[email protected]>

commit 6f2d4d4
Author: Moriarty <[email protected]>
Date:   Mon Jun 24 12:15:21 2024 -0400

    Support field cardinality across a date range (#37)

    * Add seeking filter for the F column to support getting field cardinality across a date range

    * guard against empty ranges

    * move log messages from debug to trace

commit 44482bd
Author: Moriarty <[email protected]>
Date:   Tue Jun 18 10:57:37 2024 -0400

    Cleanup code, logging formats for AllFieldMetadataHelper. Wrap scanner in try-with-resources blocks (#36)

commit f288080
Author: Moriarty <[email protected]>
Date:   Fri Jun 14 09:30:16 2024 -0400

    MetadataHelper address try-with-resources warnings (#35)

    * Wrap scanners in try-with-resources

    * Additional instances of try-with-resources

commit c33f5ea
Author: Moriarty <[email protected]>
Date:   Thu Jun 6 11:54:16 2024 +0000

    [maven-release-plugin] prepare for next development iteration

commit e86baa2
Author: Moriarty <[email protected]>
Date:   Thu Jun 6 11:54:14 2024 +0000

    [maven-release-plugin] prepare release 4.0.1

commit cb733db
Author: Moriarty <[email protected]>
Date:   Wed Jun 5 08:07:04 2024 -0400

    Add table test for the MetadataHelper, update docs, general code cleanup (#34)

commit add487b
Merge: bada436 2948b55
Author: Whitney O'Meara <[email protected]>
Date:   Thu May 23 04:18:11 2024 +0000

    Merge remote-tracking branch 'origin/main' into feature/mapService

commit 2948b55
Author: Whitney O'Meara <[email protected]>
Date:   Mon May 20 17:49:29 2024 +0000

    [maven-release-plugin] prepare for next development iteration

commit a30194b
Author: Whitney O'Meara <[email protected]>
Date:   Mon May 20 17:49:27 2024 +0000

    [maven-release-plugin] prepare release 4.0.0

commit 870ccfd
Author: Whitney O'Meara <[email protected]>
Date:   Mon May 20 17:48:56 2024 +0000

    updated to tagged release

commit 570d8a9
Author: Whitney O'Meara <[email protected]>
Date:   Mon May 20 12:32:27 2024 -0400

    Feature/query microservices (#33)

    * bumped release version

    * bumped versions for some modules

    * Updated with latest changes from main/integration

    * Updated package names for commons.lang3 classes due to type-utils fix
  • Loading branch information
jwomeara committed Jun 26, 2024
1 parent bada436 commit 00b192c
Showing 9 changed files with 2,638 additions and 622 deletions.
14 changes: 10 additions & 4 deletions pom.xml
@@ -4,11 +4,11 @@
    <parent>
        <groupId>gov.nsa.datawave.microservice</groupId>
        <artifactId>datawave-microservice-parent</artifactId>
-        <version>3.0.5-SNAPSHOT</version>
+        <version>4.0.1-SNAPSHOT</version>
        <relativePath>../../../microservices/microservice-parent/pom.xml</relativePath>
    </parent>
    <artifactId>metadata-utils</artifactId>
-    <version>3.0.3-SNAPSHOT</version>
+    <version>4.0.3-SNAPSHOT</version>
    <url>https://code.nsa.gov/datawave-metadata-utils</url>
    <licenses>
        <license>
@@ -24,12 +24,12 @@
    </scm>
    <properties>
        <spotbugs.excludes.file>${project.basedir}/src/main/spotbugs/excludes.xml</spotbugs.excludes.file>
-        <version.accumulo-utils>3.0.1</version.accumulo-utils>
+        <version.accumulo-utils>4.0.0</version.accumulo-utils>
        <version.caffeine>2.8.0</version.caffeine>
        <version.easymock>4.0.2</version.easymock>
        <version.kryo>2.20</version.kryo>
        <version.powermock>2.0.2</version.powermock>
-        <version.type-utils>2.0.3-SNAPSHOT</version.type-utils>
+        <version.type-utils>3.0.2-SNAPSHOT</version.type-utils>
    </properties>
    <dependencyManagement>
        <dependencies>
@@ -119,6 +119,12 @@
            <groupId>gov.nsa.datawave.microservice</groupId>
            <artifactId>accumulo-utils</artifactId>
        </dependency>
+        <dependency>
+            <groupId>gov.nsa.datawave.microservice</groupId>
+            <artifactId>common-utils</artifactId>
+            <version>2.0.0</version>
+            <scope>compile</scope>
+        </dependency>
        <dependency>
            <groupId>gov.nsa.datawave.microservice</groupId>
            <artifactId>type-utils</artifactId>
173 changes: 173 additions & 0 deletions src/main/java/datawave/iterators/MetadataFColumnSeekingFilter.java
@@ -0,0 +1,173 @@
package datawave.iterators;

import java.io.IOException;
import java.util.Map;
import java.util.TreeSet;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.PartialKey;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.OptionDescriber;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.user.SeekingFilter;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.Text;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.base.Splitter;

import datawave.data.ColumnFamilyConstants;

/**
* A {@link SeekingFilter} that operates on the metadata table's {@link ColumnFamilyConstants#COLF_F} column.
* <p>
* This filter solves the problem of calculating field cardinality for a small date range on a system that contains many days' worth of data, i.e., a system
* where it is not practical to simply filter every key by date and/or datatype.
* <p>
* Given that the F column qualifier is simply the datatype and date concatenated with a null byte, it is easy to calculate a seek range that limits the time
* spent iterating across useless keys.
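* <p>
* For illustration only (this layout is an assumption about the typical DataWave metadata table, not something this class enforces): an F column entry has
* {@code row = field name, family = f, qualifier = datatype\0yyyyMMdd}, with the value holding that day's count, so seeking to
* {@code datatype\0startDate} jumps directly to the first key that could fall within the configured range.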
*/
public class MetadataFColumnSeekingFilter extends SeekingFilter implements OptionDescriber {

    private static final Logger log = LoggerFactory.getLogger(MetadataFColumnSeekingFilter.class);

    public static final String DATATYPES_OPT = "datatypes";
    public static final String START_DATE = "start.date";
    public static final String END_DATE = "end.date";

    private TreeSet<String> datatypes;
    private String startDate;
    private String endDate;

    @Override
    public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException {
        if (!validateOptions(options)) {
            throw new IllegalArgumentException("Iterator not configured with correct options");
        }

        String opt = options.get(DATATYPES_OPT);
        if (StringUtils.isBlank(opt)) {
            datatypes = new TreeSet<>();
        } else {
            datatypes = new TreeSet<>(Splitter.on(',').splitToList(opt));
        }

        startDate = options.get(START_DATE);
        endDate = options.get(END_DATE);

        super.init(source, options, env);
    }

    @Override
    public IteratorOptions describeOptions() {
        IteratorOptions opts = new IteratorOptions(getClass().getName(), "Filter keys by datatype and date range", null, null);
        opts.addNamedOption(DATATYPES_OPT, "The set of datatypes used as a filter");
        opts.addNamedOption(START_DATE, "The start date, used for seeking");
        opts.addNamedOption(END_DATE, "The end date, used for seeking");
        return opts;
    }

    @Override
    public boolean validateOptions(Map<String,String> options) {
        return options.containsKey(DATATYPES_OPT) && options.containsKey(START_DATE) && options.containsKey(END_DATE);
    }

    /**
     * A key is filtered if one of the following three conditions is met. Otherwise, the source will call next.
     * <ol>
     * <li>datatype miss</li>
     * <li>key date is before the start date</li>
     * <li>key date is after the end date</li>
     * </ol>
     *
     * @param k
     *            a key
     * @param v
     *            a value
     * @return a {@link FilterResult}
     */
    @Override
    public FilterResult filter(Key k, Value v) {
        if (log.isTraceEnabled()) {
            log.trace("filter key: {}", k.toStringNoTime());
        }
        String cq = k.getColumnQualifier().toString();
        int index = cq.indexOf('\u0000');
        String datatype = cq.substring(0, index);
        if (!datatypes.isEmpty() && !datatypes.contains(datatype)) {
            return new FilterResult(false, AdvanceResult.USE_HINT);
        }

        String date = cq.substring(index + 1);
        if (date.compareTo(startDate) < 0) {
            return new FilterResult(false, AdvanceResult.USE_HINT);
        }

        if (date.compareTo(endDate) > 0) {
            return new FilterResult(false, AdvanceResult.USE_HINT);
        }

        return new FilterResult(true, AdvanceResult.NEXT);
    }

    @Override
    public Key getNextKeyHint(Key k, Value v) {
        if (log.isTraceEnabled()) {
            log.trace("get next hint for key: {}", k.toStringNoTime());
        }

        Key hint;
        String cq = k.getColumnQualifier().toString();
        int index = cq.indexOf('\u0000');
        String datatype = cq.substring(0, index);

        if (!datatypes.isEmpty() && !datatypes.contains(datatype)) {
            hint = getSeekToNextDatatypeKey(k, datatype);
        } else {
            String date = cq.substring(index + 1);
            if (date.compareTo(startDate) < 0) {
                hint = getSeekToStartDateKey(k, datatype);
            } else if (date.compareTo(endDate) > 0) {
                hint = getDatatypeRolloverKey(k, datatype);
            } else {
                hint = k.followingKey(PartialKey.ROW_COLFAM_COLQUAL);
            }
        }

        log.trace("hint: {}", hint);
        return hint;
    }

    private Key getSeekToNextDatatypeKey(Key key, String datatype) {
        if (datatypes.isEmpty()) {
            // no datatypes provided, so we must instead produce a 'rollover' start key
            return getDatatypeRolloverKey(key, datatype);
        }

        // otherwise datatypes were provided
        String nextDatatype = datatypes.higher(datatype);
        if (nextDatatype != null) {
            log.trace("seek to next datatype");
            Text nextColumnQualifier = new Text(nextDatatype + '\u0000' + startDate);
            return new Key(key.getRow(), key.getColumnFamily(), nextColumnQualifier);
        } else {
            log.trace("seek to next ROW_COLFAM");
            // out of datatypes, we're done. This partial range will trigger a "beyond source" condition
            return key.followingKey(PartialKey.ROW_COLFAM);
        }
    }

    private Key getDatatypeRolloverKey(Key key, String datatype) {
        log.trace("seek to rollover datatype");
        Text cq = new Text(datatype + '\u0000' + '\uffff');
        return new Key(key.getRow(), key.getColumnFamily(), cq);
    }

    private Key getSeekToStartDateKey(Key k, String datatype) {
        log.trace("seek to start date");
        Text cq = new Text(datatype + '\u0000' + startDate);
        return new Key(k.getRow(), k.getColumnFamily(), cq);
    }
}
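
For context, here is a minimal sketch of how this filter might be attached to a metadata table scanner. The table name ("DatawaveMetadata"), iterator priority, datatypes, and yyyyMMdd date strings are illustrative assumptions and are not part of this change; only the option names and the filter class come from the code above.

// Illustrative usage sketch; names and values below are assumptions, not part of this commit.
import java.util.Map;

import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

import datawave.iterators.MetadataFColumnSeekingFilter;

public class MetadataFSeekingFilterExample {

    /** Prints the F column entries for the requested datatypes and date range. */
    public static void scanFieldCounts(AccumuloClient client) throws Exception {
        try (Scanner scanner = client.createScanner("DatawaveMetadata", Authorizations.EMPTY)) {
            // Limit the scan to the F column family so the filter only sees datatype\0date qualifiers
            scanner.fetchColumnFamily(new Text("f"));

            // Attach the seeking filter with a datatype set and an inclusive date range
            IteratorSetting setting = new IteratorSetting(50, "metadataFSeek", MetadataFColumnSeekingFilter.class);
            setting.addOption(MetadataFColumnSeekingFilter.DATATYPES_OPT, "csv,json");
            setting.addOption(MetadataFColumnSeekingFilter.START_DATE, "20240601");
            setting.addOption(MetadataFColumnSeekingFilter.END_DATE, "20240630");
            scanner.addScanIterator(setting);

            for (Map.Entry<Key,Value> entry : scanner) {
                System.out.println(entry.getKey().toStringNoTime() + " -> " + entry.getValue());
            }
        }
    }
}

Because the filter seeks rather than stepping key by key, the scan cost tracks the requested datatypes and date range instead of the total number of days stored in the table.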
