Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(index): deduplicate map values #61

Open
cmdoret opened this issue Mar 19, 2025 · 4 comments
Open

perf(index): deduplicate map values #61

cmdoret opened this issue Mar 19, 2025 · 4 comments

Comments

@cmdoret
Copy link
Member

cmdoret commented Mar 19, 2025

The index currently looks like this:

{
  "types": [
    "<http://schema.org/Organization>",
    "<http://schema.org/Person>",
    "<http://schema.org/SoftwareSourceCode>"
  ],
  "map": {
    "8836142820109335346": [
      0,
      0
    ],
    "16856853018305323212": [
      2
    ],
    "15184171154146416828": [
      0
    ],
    "8501891617301323111": [
      1
    ],
    "8343770519458026299": [
      1,
      1
    ],
    "16751148418374530336": [
      1
    ],
    "5818193110665307485": [
      0,
      0,
      0,
      0,
      0
    ]
  }
}

In each value of map, we should store only unique values. For performance reasons, we may want to keep using SmallVec instead of HashSet (on the heap), but then we should add a check on append.

The desired structure of above example would be:

{
  "types": [
    "<http://schema.org/Organization>",
    "<http://schema.org/Person>",
    "<http://schema.org/SoftwareSourceCode>"
  ],
  "map": {
    "8836142820109335346": [
      0
    ],
    "16856853018305323212": [
      2
    ],
    "15184171154146416828": [
      0
    ],
    "8501891617301323111": [
      1
    ],
    "8343770519458026299": [
      1
    ],
    "16751148418374530336": [
      1
    ],
    "5818193110665307485": [
      0
    ]
  }
}
@cmdoret
Copy link
Member Author

cmdoret commented Mar 19, 2025

alternative: use bitflags to keep fast lookup and stack-allocation while allowing for multi-typed instances.

Note

Different example than above to illustrate instances with multiple types:

with vecs:

{
  "types": [
    "<http://example.org/Employee>",
    "<http://example.org/Organization>",
    "<http://example.org/Person>",
    "<http://example.org/Researcher>"
  ],
  "map": {
    "8836142820109335346": [
      0,
      2,
      3
    ],
    "16856853018305323212": [
      1
    ],
    "15184171154146416828": [
      0,
      2
    ],
    "8501891617301323111": [
      0,
      3
    ]
  }
}

With bitflags:
4 types -> max value is $2^4-1$

{
  "types": [
    "<http://example.org/Employee>",
    "<http://example.org/Organization>",
    "<http://example.org/Person>",
    "<http://example.org/Researcher>"
  ],
  "map": {
    "8836142820109335346": 12,  # == 0b1101
    "16856853018305323212": 1,  # == 0b0001
    "15184171154146416828": 5,  # == 0b0101
    "8501891617301323111": 8    # == 0b1001
  }
}

Significantly more compact than the current format...

@cmdoret
Copy link
Member Author

cmdoret commented Mar 19, 2025

Using bitflags would limit us to 64 types, and using bitvecs/bitarrays would be too much overhead. let's stick to smallVec

@gabyx
Copy link
Contributor

gabyx commented Mar 20, 2025

arent there infinite bitflags, which compact multiple u64 types?

@cmdoret
Copy link
Member Author

cmdoret commented Mar 20, 2025

Yeah i think that's what bitvec/bitarrays are (e.g. https://github.com/ferrilab/bitvec)

But it seems like more overhead than just using smallvecs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants