Migrate source code archive metadata from Python-only to C API #48

Open
janheinrichmerker opened this issue Mar 10, 2025 · 3 comments · May be fixed by #67
Labels
enhancement New feature or request

Comments

@janheinrichmerker
Contributor

In the discussion of #46, we had the idea to migrate the feature of a "code archive" (i.e., a ZIP of the Git repo, if in a Git context) from the Python API to the generic C API.
I believe this should be possible without new dependencies, as we already link libgit2.

A potential workflow could look like:

  1. get the repository head reference,
  2. get the reference's target ID,
  3. look up the corresponding tree,
  4. iterate over that tree's files, and
  5. add each file of the tree to a ZIP archive.
@TheMrSheldon
Member

The archival would basically be the same as the hashing here:

static std::string hashAllFiles(git_repository* repo) {
    Chocobo1::SHA1 hash;
    git_status_list* list;
    git_status_options opts = GIT_STATUS_OPTIONS_INIT;
    opts.flags = GIT_STATUS_OPT_INCLUDE_UNTRACKED | GIT_STATUS_OPT_INCLUDE_UNMODIFIED;
    git_status_list_new(&list, repo, &opts);
    auto changes = git_status_list_entrycount(list);
    std::filesystem::path root = git_repository_workdir(repo);
    for (size_t i = 0; i < changes; ++i) {
        auto entry = git_status_byindex(list, i);
        std::ifstream is(root / entry->index_to_workdir->new_file.path, std::ios::binary);
        if (!is) {
            tirex::log::error("gitstats", "Error opening file: {}", entry->index_to_workdir->new_file.path);
            continue;
        }
        for (char buffer[8192]; is; is.read(buffer, sizeof(buffer)))
            hash.addData(buffer, is.gcount());
    }
    git_status_list_free(list);
    return hash.finalize().toString();
}

but an archive is created instead of a hash. As such, the tree walk may not even be necessary. It would probably work as well, but it would look less pretty since git_tree_walk takes the payload as its last argument.
Though it may be possible to use std::bind with placeholders, or to pass this as the payload and point the callback at a wrapper around operator(), similar to how
template <typename T>
static int wrap(const char* name, git_oid* oid, void* payload) {
    auto& fn = *static_cast<T*>(payload);
    return fn(name, oid);
}

is used for the tags. This should also be C++17 compatible, though std::bind is only constexpr from C++20 onward. But I would say that this would be a refactoring of the tag code, and we should avoid the tree walk unless we notice that the status list is too inefficient.
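
For illustration, here is a minimal sketch of how such a payload wrapper lets a capturing lambda drive a C-style callback API. It deliberately does not call libgit2: mock_walk, walk_cb, trampoline, walk_with, and collect_paths are hypothetical names made up for this example, and mock_walk only imitates the shape of a walk function that takes the callback and an opaque payload as its last arguments.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a C-style walk API (NOT part of libgit2).
// Like git_tree_walk, it takes a plain function pointer and an opaque payload last.
using walk_cb = int (*)(const char* root, const char* name, void* payload);

static int mock_walk(const std::vector<std::string>& names, walk_cb cb, void* payload) {
    for (const auto& name : names) {
        int rc = cb("", name.c_str(), payload);
        if (rc != 0)
            return rc;  // a non-zero return value stops the walk early
    }
    return 0;
}

// Trampoline: recovers the callable from the payload pointer and forwards to it.
template <typename T>
static int trampoline(const char* root, const char* name, void* payload) {
    auto& fn = *static_cast<T*>(payload);
    return fn(root, name);
}

// Convenience wrapper so that call sites can pass a (capturing) lambda directly.
template <typename Fn>
static int walk_with(const std::vector<std::string>& names, Fn fn) {
    return mock_walk(names, &trampoline<Fn>, &fn);
}

// Example use: collect every visited path into a vector.
static std::vector<std::string> collect_paths(const std::vector<std::string>& names) {
    std::vector<std::string> out;
    walk_with(names, [&](const char* root, const char* name) {
        out.push_back(std::string(root) + name);
        return 0;
    });
    return out;
}
```

With a wrapper like this, the awkward payload argument stays in one place and the archiving logic could live in an ordinary lambda, which may make the tree walk less unattractive if the status list ever turns out to be too slow.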

Regarding the number of dependencies: we additionally need a library for creating zip archives and for compression. This would otherwise be simple to write ourselves but I would advise against it.

How should the native API be modified to instruct it to create the archive and how would the archive be passed back to the caller?

@TheMrSheldon added the "more info requested" label Mar 23, 2025
@janheinrichmerker
Contributor Author

As the matter came up again in #65, I would like to pick up the discussion again:
Right now, the Python client does some heavy lifting to put all non-gitignored contents into a ZIP archive, in order to later reference the Python script inside that archive.
My understanding is that most of that non-language-specific archiving code could be moved to the core C library so that, by default, we have the following new measures:

  • GIT_REPOSITORY_DIR_PATH: The "nearest" path that is a Git repository when navigating up the working dir's parents (could be the working dir itself).
  • GIT_REPOSITORY_ARCHIVE_FILE_PATH: A .zip archive of that Git repository's contents, excluding files ignored by any .gitignore.

The latter can then be used to determine relative paths inside the Git repository, such as the PYTHON_SCRIPT_FILE_PATH_IN_CODE_ARCHIVE and similar measures.
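
To make the two path definitions concrete, here is a rough sketch of how the "nearest" repository directory and a path relative to it could be computed. It uses only std::filesystem and treats any directory containing a .git entry as a repository, which is a simplifying assumption; libgit2's git_repository_discover would likely be the more robust choice in the C library. The function names findRepositoryDir and pathInArchive are hypothetical.

```cpp
#include <filesystem>
#include <optional>

namespace fs = std::filesystem;

// Walks up from `start` (inclusive) and returns the first directory that
// contains a `.git` entry, or std::nullopt if the filesystem root is reached.
// Assumption: a `.git` file or directory marks a repository root.
static std::optional<fs::path> findRepositoryDir(fs::path start) {
    start = fs::weakly_canonical(start);
    for (fs::path dir = start;; dir = dir.parent_path()) {
        if (fs::exists(dir / ".git"))
            return dir;
        if (dir == dir.parent_path())
            return std::nullopt;  // reached the filesystem root without a match
    }
}

// Returns `file` relative to the repository directory. Under the assumption
// that the archive mirrors the repository layout, this is also the path of
// the file inside the archive.
static fs::path pathInArchive(const fs::path& repoDir, const fs::path& file) {
    return fs::weakly_canonical(file).lexically_relative(fs::weakly_canonical(repoDir));
}
```

Under these assumptions, a script at /path/to/repo/src/script.py in a repository at /path/to/repo would be referenced as src/script.py inside the archive.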
I also think we should then clarify the definitions of the related measures here:

Measure.PYTHON_SCRIPT_FILE_PATH: MeasureInfo(
    description="Path to the Python script file.",
    data_type=ResultType.STRING,
    example=dumps("/path/to/script.py"),
),
Measure.PYTHON_NOTEBOOK_FILE_PATH: MeasureInfo(
    description="Path to the Jupyter notebook file.",
    data_type=ResultType.STRING,
    example=dumps("/path/to/notebook.ipynb"),
),
Measure.PYTHON_CODE_ARCHIVE_PATH: MeasureInfo(
    description="The archive that contains a snapshot of the code.",
    data_type=ResultType.STRING,
    example=dumps("/path/to/code.zip"),
),
Measure.PYTHON_SCRIPT_FILE_PATH_IN_CODE_ARCHIVE: MeasureInfo(
    description="The script that was executed in the code archive.",
    data_type=ResultType.STRING,
    example=dumps("script.py"),
),
Measure.PYTHON_NOTEBOOK_FILE_PATH_IN_CODE_ARCHIVE: MeasureInfo(
    description="The notebook that was executed in the code archive.",
    data_type=ResultType.STRING,
    example=dumps("notebook.ipynb"),
),

Ideally, each definition should describe exactly what the expected output would be, for example:

  • Is a path relative or absolute (relative to what reference path)?
  • What exactly does the archive contain? Which compression is used? (probably .zip)

@TheMrSheldon linked a pull request that will close this issue Mar 26, 2025
@TheMrSheldon removed the "more info requested" label Mar 26, 2025
@TheMrSheldon
Member

I began work on this here: #67.

And also the obligatory comment that ZIP is an archive format, not a compression method, and its contents do not need to be compressed :D
