- Submitter (Affiliation): Ben Strong (Earth Genome), Hutch (Tom) Ingold (Earth Genome)
- Data Provider (Legal Entity): Earth Genome (501(c)(3) Nonprofit)
- Homepage: http://earthindex.ai/
Earth Index is a platform for tile-level geospatial search and classification of satellite imagery. Earth Index works by “pre-indexing” the planet for search using earth observation foundation models.
We've been using embeddings for human-in-the-loop tile-level search (approximate nearest neighbors) and classification (linear or other lightweight models). Some downstream applications include monitoring targets like:
- Poultry CAFOs
- Artisanal gold mining
- Infrastructure development (roads, dams, etc.)
- Landfills and waste sites
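
As a rough illustration of tile-level search over embeddings, the sketch below ranks tiles by cosine similarity to a query embedding. It is an exact brute-force search, not the approximate nearest-neighbor index mentioned above, and the tile ids and 3-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_tiles(query, tiles, k=2):
    """Return the k tile ids most similar to the query (exact search)."""
    scored = [(cosine_similarity(query, emb), tile_id)
              for tile_id, emb in tiles.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [tile_id for _, tile_id in scored[:k]]


# Hypothetical tile ids with toy 3-dimensional embeddings.
tiles = {
    "tile-a": [1.0, 0.0, 0.0],
    "tile-b": [0.9, 0.1, 0.0],
    "tile-c": [0.0, 1.0, 0.0],
}
result = nearest_tiles([1.0, 0.0, 0.0], tiles, k=2)
print(result)
```

At planetary scale a brute-force scan like this would presumably be replaced by an approximate index; the ranking logic stays the same.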
- URL: [tbd, on source.coop]
- Documentation: [tbd]
- Projection: EPSG:4326
- License: CC-BY
- GeoParquet validation: `gpq validate embeddings.parquet`
- `emb` sample metadata and Parquet metadata validation: `python3 validate.py`
Field Name | Type | Description |
---|---|---|
geometry | geometry | Geometry of tile used to generate embeddings |
id | string | Unique identifier for tile |
embedding | array | The vector embedding |
The files originate from MGRS 100 km x 100 km imagery tiles, e.g. `21MNT`, and the parquet files inherit that tile identifier in their names.
That being said, using a filename as metadata is an easy way to lose context, so we make sure that all relevant metadata from the filename also gets embedded in the file itself.
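
To make the tile identifier concrete, here is a minimal sketch that splits an MGRS 100 km tile id such as `21MNT` into its grid zone number, latitude band, and 100 km square letters. The function name and the returned keys are hypothetical and do not describe how the files themselves store this information.

```python
import re

# Zone number, latitude band, then the 100 km square column and row letters.
# The letters I and O are not used in MGRS.
MGRS_100KM = re.compile(r"^(\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-V])$")


def parse_mgrs_tile(tile_id):
    """Split a 100 km MGRS tile id into zone, band, and square letters."""
    match = MGRS_100KM.match(tile_id)
    if match is None:
        raise ValueError(f"not a 100 km MGRS tile id: {tile_id!r}")
    zone, band, column, row = match.groups()
    zone_number = int(zone)
    if not 1 <= zone_number <= 60:
        raise ValueError(f"zone out of range in {tile_id!r}")
    return {"zone": zone_number, "band": band, "square": column + row}


parsed = parse_mgrs_tile("21MNT")
print(parsed)
```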
The `embedding.parquet` file is also a valid GeoParquet file, with the expected metadata stored under the `geo` key.
Additionally, we have added an `emb` key for the embeddings metadata, also in JSON format.
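
A minimal sketch of decoding the `emb` entry, assuming the file-level key-value metadata is available as a mapping of bytes to bytes (for example, what `pyarrow.parquet.read_schema(path).metadata` returns). The helper name and the sample values are illustrative only; to keep the example self-contained it uses a stand-in mapping instead of reading a real file.

```python
import json


def decode_emb_metadata(key_value_metadata):
    """Decode the JSON document stored under the `emb` metadata key."""
    raw = key_value_metadata.get(b"emb")
    if raw is None:
        raise KeyError("no `emb` key in the file metadata")
    return json.loads(raw.decode("utf-8"))


# Stand-in for the metadata mapping of a real file; values are placeholders.
example_metadata = {
    b"geo": b"{}",  # GeoParquet metadata would live here
    b"emb": json.dumps(
        {"version": "0.0.1", "embedding": {"dim": 768}}
    ).encode("utf-8"),
}

emb = decode_emb_metadata(example_metadata)
print(emb["version"], emb["embedding"]["dim"])
```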
The metadata adheres to the provided JSON schema. For interoperability with STAC, this schema references some STAC schema elements, specifically provider, datetime, and licensing.
Field Name | Type | Description |
---|---|---|
version | string | fixed at 0.0.1 (required) |
model | object<string, object> | see Model metadata |
providers | object<string, object> | provider metadata -- see provider |
licensing | object<string, object> | licensing metadata -- see licensing |
datetime | object<string, object> | datetime metadata -- see datetime |
source_datasets | object<string, object> | datasets used to generate embeddings - see Dataset metadata |
embedding | object<string, object> | Embeddings metadata |
Field Name | Type | Description |
---|---|---|
id | string | id of model (required) |
source | string | URI for model (required) |
version | string | version of model |
family | string | model family |
name | string | Name of model |
description | string | human readable description of model |
config | string | Configuration of model; includes information needed to generate embeddings (e.g. what layers were extracted) |
Field Name | Type | Description |
---|---|---|
id | string | Dataset id (required) |
name | string | Dataset used to generate the embeddings |
description | string | Human-readable description of the dataset |
source | string | URI for dataset that was used by the embeddings model (required) |
Field Name | Type | Description |
---|---|---|
dim | int | Embeddings size |
quantization | string | Description of quantization scheme, if any |
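
The tables above can be read as a small set of checks. The sketch below is not a substitute for the provided JSON schema: it builds an example `emb` document with placeholder values and verifies only the fields marked as required for the top level and the model (`version`, `model.id`, `model.source`). The function name, ids, and URL are hypothetical.

```python
def check_emb_metadata(emb):
    """Return a list of problems found in an `emb` metadata document."""
    problems = []
    if emb.get("version") != "0.0.1":
        problems.append("version must be 0.0.1")
    model = emb.get("model", {})
    for field in ("id", "source"):
        if field not in model:
            problems.append(f"model.{field} is required")
    return problems


# Placeholder values for illustration only.
complete = {
    "version": "0.0.1",
    "model": {"id": "example-model", "source": "https://example.com/model"},
    "embedding": {"dim": 768, "quantization": "none"},
}
incomplete = {"version": "0.0.1", "model": {"id": "example-model"}}

print(check_emb_metadata(complete))
print(check_emb_metadata(incomplete))
```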
- We also duplicate most of this metadata in STAC and have reused STAC metadata definitions to ease interoperability between parquet and STAC.
- We've opted to put the metadata within the `emb` key, in the same style as GeoParquet. But we're not committed to this and are interested in other opinions.