Make generated padded integers useful for joining #170

Open
wants to merge 3 commits into
base: main

Conversation

Aastedet
Collaborator

By setting the random seed to length, generated integers of the same length are made identical. This allows the synthetic data to be used for joining by id variables like pnr, cpr, recnum and dw_ek_kontakt.
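A minimal sketch of the idea, assuming a simplified stand-in for the generator (the real body of `create_padded_integer()` is not shown in this PR, so `create_padded_integer_sketch()` below is hypothetical). It only illustrates that seeding with `length` makes two calls with the same `n` and `length` return identical ids:

```r
# Hypothetical stand-in; NOT the package's actual implementation.
create_padded_integer_sketch <- function(n, length) {
  # Seeding with `length` means every call with the same arguments
  # restarts the random number stream at the same point.
  set.seed(length)
  # Draw n * length digits and glue each row into one padded id string.
  digits <- matrix(sample(0:9, n * length, replace = TRUE), nrow = n)
  apply(digits, 1, paste0, collapse = "")
}

# Two "registers" generated independently now share the same ids.
ids_a <- create_padded_integer_sketch(5, 10)
ids_b <- create_padded_integer_sketch(5, 10)
identical(ids_a, ids_b)
```

Note that in this sketch any two id variables of the same length and `n` come out identical, which is what makes the join work but also means they are no longer independent of each other.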

…ning by pnr, cpr, recnum and dw_ek_kontakt.
@Aastedet Aastedet changed the title Patch make generated padded integers useful for joining Make generated padded integers useful for joining Dec 19, 2024
Member

@lwjohnst86 lwjohnst86 left a comment

Ah! @signekb was asking about and figuring out the same issue with seeds. I think setting the seed in the targets pipeline might be better for overall reproducibility, so maybe remove the code?

@signekb
Contributor

signekb commented Dec 19, 2024

> Ah! @signekb was asking about and figuring out the same issue with seeds. I think setting the seed in the targets pipeline might be better for overall reproducibility, so maybe remove the code?

I'm on it in #167 💪

@Aastedet
Collaborator Author

I converted this back to a draft. I'll pick it up once #167 is merged and see what is still needed.

@Aastedet
Collaborator Author

It seems this PR is still needed after #167. To verify, check that the same values of e.g. recnum occur in both lpr_adm and lpr_diag:

register_data$lpr_diag$recnum[register_data$lpr_diag$recnum %in% register_data$lpr_adm$recnum]

This should return a non-empty vector. I'll update this draft with the suggestions from @lwjohnst86 and convert it to a PR.
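The overlap check can be tried on toy data. The sketch below assumes a small made-up `register_data` list (the values are hypothetical, not output from the package) and shows both the `%in%` check and what a join on `recnum` would keep:

```r
# Toy stand-in for the simulated registers; values are made up.
register_data <- list(
  lpr_adm = data.frame(recnum = c("0001", "0002", "0003")),
  lpr_diag = data.frame(recnum = c("0002", "0003", "0004"))
)

# Values of recnum in lpr_diag that also occur in lpr_adm.
shared <- register_data$lpr_diag$recnum[
  register_data$lpr_diag$recnum %in% register_data$lpr_adm$recnum
]

# An inner join on recnum keeps only the shared rows.
joined <- merge(register_data$lpr_adm, register_data$lpr_diag, by = "recnum")
```

If `shared` is empty, the join returns zero rows, which is the situation this PR is trying to avoid.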

@Aastedet Aastedet marked this pull request as ready for review December 20, 2024 08:53
@@ -151,6 +153,7 @@ create_fake_date <- function(n, from = "1977-01-01", to = lubridate::today()) {
 #' @examples
 #' create_padded_integer(5, 10)
 create_padded_integer <- function(n, length) {
+  set.seed(length)
Contributor

@signekb signekb Dec 20, 2024

Why length? Couldn't the seed be 123? :)

Collaborator Author

It could be; that should also work.

@@ -151,6 +153,7 @@ create_fake_date <- function(n, from = "1977-01-01", to = lubridate::today()) {
 #' @examples
 #' create_padded_integer(5, 10)
 create_padded_integer <- function(n, length) {
+  set.seed(length)
Contributor

Is it an issue to set a seed inside a function?
Is it an issue to set a seed here "locally" and also globally in the pipeline?
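One thing worth noting for these questions: `set.seed()` inside a function overwrites the global RNG state of the session, so every random draw made after the call is affected too. A rough sketch of the difference, using two hypothetical helpers (`seeded_draw()` and `scoped_draw()` are not part of the package); the second restores the caller's RNG state on exit using base R only:

```r
# Hypothetical helper: seeds globally, like the change in this PR.
seeded_draw <- function(n, seed) {
  set.seed(seed)
  sample.int(100, n)
}

# Hypothetical alternative: same draws, but the caller's RNG state
# is saved first and put back when the function exits.
scoped_draw <- function(n, seed) {
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (has_seed) old_seed <- get(".Random.seed", envir = globalenv())
  on.exit({
    if (has_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(list = ".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  sample.int(100, n)
}

set.seed(1)
baseline <- runif(1)

set.seed(1)
invisible(scoped_draw(3, 42))
after_scoped <- runif(1)   # pipeline stream is untouched

set.seed(1)
invisible(seeded_draw(3, 42))
after_seeded <- runif(1)   # pipeline stream now continues from seed 42
```

Both helpers return the same ids for the same seed; they differ only in whether a seed set globally in the pipeline survives the call.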

@signekb
Contributor

signekb commented Dec 20, 2024

Just writing some points down from our talk, @Aastedet:

The issue addressed in the PR is that, for us to be able to join the simulated data from the different data sources in a meaningful way, we need an overlap in the variables we join by. This includes variables like recnum and pnr.

So, a suggested solution for this is to set a seed in the functions that create these random padded numbers, so the seed will be reset every time the function is run. This way, we ensure that the same padded integers will be created across data sources.

@lwjohnst86 What are your thoughts on this? :)

EDIT: This is actually already addressed in a TODO on L388 in simulate.R:

TODO: Need a function to reuse recnum and dw_ek_kontakt in LPR data
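One possible shape for such a function, purely as a sketch: generate the ids once for one register and sample from that pool for the other, instead of relying on seeds. `reuse_ids()` and the toy ids below are hypothetical and not taken from simulate.R:

```r
# Hypothetical helper: draw n ids from an existing pool so that every
# generated value is guaranteed to exist in the source register.
reuse_ids <- function(ids, n) {
  sample(ids, size = n, replace = TRUE)
}

# Toy pool standing in for recnum values already created for lpr_adm.
adm_recnum <- sprintf("%08d", 1:5)

# Each admission can have several diagnoses, so ids repeat in lpr_diag.
diag_recnum <- reuse_ids(adm_recnum, 12)
```

With this approach the overlap does not depend on where the seed is set, so it would be compatible with seeding only in the targets pipeline.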

Projects
Status: In Review
3 participants