Make generated padded integers useful for joining #170
…ning by pnr, cpr, recnum and dw_ek_kontakt.
Ah! @signekb was asking about and figuring out the same issue with seeds. I think setting the seed in the targets pipeline might be better for overall reproducibility, so maybe remove the code?
I converted this back to a draft. I'll pick it up once #167 is merged and see what is still needed.
Seems like this PR is still needed after #167. Verify that the same values of e.g.
@@ -151,6 +153,7 @@ create_fake_date <- function(n, from = "1977-01-01", to = lubridate::today()) {
#' @examples
#' create_padded_integer(5, 10)
create_padded_integer <- function(n, length) {
  set.seed(length)
Why length? Couldn't the seed be 123? :)
It could be; that should also work.
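To make the exchange above concrete, here is a minimal sketch showing that any fixed seed gives repeatable output, whether it is `length` or a constant like 123. Note that `padded_integer_sketch()` and its `seed` argument are hypothetical stand-ins for illustration; the actual body of `create_padded_integer()` is not shown in this thread.

```r
# Hypothetical stand-in for create_padded_integer(); the real body may differ.
padded_integer_sketch <- function(n, length, seed = length) {
  set.seed(seed) # reset the RNG on every call so the output is repeatable
  # Draw n * length random digits and collapse each row into one string.
  digits <- matrix(sample(0:9, n * length, replace = TRUE), nrow = n)
  apply(digits, 1, paste0, collapse = "")
}

# Seeding with `length`: two calls with the same arguments agree.
a <- padded_integer_sketch(5, 10)
b <- padded_integer_sketch(5, 10)
identical(a, b) # TRUE

# Seeding with a constant such as 123 is equally repeatable.
c123 <- padded_integer_sketch(5, 10, seed = 123)
d123 <- padded_integer_sketch(5, 10, seed = 123)
identical(c123, d123) # TRUE
```

The practical difference is only which calls share a random stream: with `length` as the seed, calls with different lengths start from different seeds, while a constant seed starts every call from the same state.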
@@ -151,6 +153,7 @@ create_fake_date <- function(n, from = "1977-01-01", to = lubridate::today()) {
#' @examples
#' create_padded_integer(5, 10)
create_padded_integer <- function(n, length) {
  set.seed(length)
Is it an issue to set a seed inside a function?
Is it an issue to set a seed here "locally" and also globally in the pipeline?
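One way to look at these two questions: calling `set.seed()` inside a function overwrites the global RNG state, so random draws made later in the pipeline follow the function's seed rather than the pipeline's. The sketch below is one possible way to avoid that by saving and restoring `.Random.seed` in base R; `local_seed_sketch()` is a hypothetical helper, not something in this PR (the withr package offers `with_seed()` for the same purpose).

```r
# Hypothetical helper: evaluate `code` under a temporary seed, then restore
# whatever global RNG state existed before the call.
local_seed_sketch <- function(seed, code) {
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old <- if (has_seed) get(".Random.seed", envir = globalenv()) else NULL
  on.exit({
    if (is.null(old)) {
      rm(".Random.seed", envir = globalenv())
    } else {
      assign(".Random.seed", old, envir = globalenv())
    }
  })
  set.seed(seed)
  force(code) # `code` is lazily evaluated here, i.e. after set.seed()
}

# The pipeline-level seed is unaffected by the local one.
set.seed(42)
x <- runif(1)

set.seed(42)
ids <- local_seed_sketch(10, sample(0:9, 3, replace = TRUE))
y <- runif(1)

identical(x, y) # TRUE: the draw after the helper matches the undisturbed run
```

With a plain `set.seed()` inside the function instead, `y` would be determined by the function's seed, which is the interaction the second question is asking about.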
Just writing some points down from our talk, @Aastedet: The issue addressed in the PR is that, for us to be able to join the simulated data from the different data sources in a meaningful way, we need an overlap in the variables we join by. This includes ID variables like the ones listed in the PR description (pnr, cpr, recnum and dw_ek_kontakt). So, a suggested solution is to set a seed in the functions that create these random padded numbers, so that the seed is reset every time the function is run. This way, we ensure that the same padded integers are created across data sources. @lwjohnst86 What are your thoughts on this? :) EDIT: This is actually already addressed in a
By setting the random seed to `length`, generated integers of the same length are made identical. This allows the synthetic data to be used for joining by id variables like pnr, cpr, recnum and dw_ek_kontakt.
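As a rough illustration of the description above, the sketch below builds two simulated data sources whose `pnr` columns come from the same seeded generator and then joins them. The function `padded_integer_sketch()`, the data frame names and the `value_a`/`value_b` columns are assumptions made up for this example; only `pnr` and the `set.seed(length)` idea come from the PR.

```r
# Hypothetical stand-in for create_padded_integer(); the real body may differ.
padded_integer_sketch <- function(n, length) {
  set.seed(length) # same length => same seed => same generated values
  digits <- matrix(sample(0:9, n * length, replace = TRUE), nrow = n)
  apply(digits, 1, paste0, collapse = "")
}

# Two simulated data sources, generated independently of each other.
register_a <- data.frame(pnr = padded_integer_sketch(5, 10), value_a = seq_len(5))
register_b <- data.frame(pnr = padded_integer_sketch(5, 10), value_b = letters[1:5])

# Because both pnr columns are identical, every row finds a match.
joined <- merge(register_a, register_b, by = "pnr")
all(register_a$pnr %in% register_b$pnr) # TRUE
```

One consequence of this design worth keeping in mind: any two id variables generated with the same `n` and `length` will hold identical values, even when they represent different identifiers.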