-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge two DataFrames only to missing values #2243
Comments
I do not think you can do any better than joining and then combining the results. In particular in general (if if were to become a standard library function) you would have to define what should happen if either of the data frames has a duplicate entry in I general I believe that there are so many possible options that it is hard to define an API for such a function, simply an appropriate join should be used. Note then that the last operation you mention can be written as |
Stata has a good API for this via
I think Bogumil's idea for this is elegant, and is good for 1.0. But I think following Stata on this could solve the coordination problem for what a good option is. My proposal would be
This would perform the coalesce automatically. A quick google search shows that |
What does Stata do when:
|
Interestingly, this will add both observations
You can specify whether or not the matches need to be unique. The default behavior in this case is also to add new rows. See below.
I will have to think more about the mapping of Stata's |
OK - let us try to find out a consistent design requirements (like we did (and I will be commenting from a "defensive position" 😄) In particular I think that this operation is most intuitive if also allowed to be done "in place" (but we need API for this). |
Thanks. I will try and work on a proposal that is similar to stata but matches our current infrastructure. This is yet another opportunity to combine the best of stata and dplyr. |
I guess I am not that late to the party. Here would be the approach from SQL
|
Some kind of functionality like this, in general an "update" would be really nice to add. We just need to find a general design that is flexible and future proof (i.e. to have a swiss army knife function like e.g. As proposed above maybe we can add to
|
This works as long as you're not supporting various options on the update itself. Adding kwargs that only modify one other kwarg seems like a bad idea and cries out for splitting. This is not a feature I've ever needed to use, so I don't have strong opinions on it, but the fact that it's found in stata and SQL makes a compelling case to add it as a separate function that could have its own options.
Not sure if you've ever owned a Swiss army knife, but the downside of being able to do everything is that you don't do any of them super well. Not sure how much to torture the metaphor, but as a kid I always wanted the big fat ones with tons of different tools, because they seemed so cool, but my (Swiss) grandfather's with 2 blades and a bottle opener was almost always a better choice for actually doing anything with. |
As discussed on Slack - we have a tension here. We can either leave joins "as is" and add a new method doing updates of existing columns and working in place (something like Adding a new function is probably easier to understand. On the other hand it would do almost the same as |
I think another function might be best. Since I was also going to bring up that not only does Stata have an So adding that would risk bloating the function even further. I would propose |
What would be the exact difference between them? I was thinking of an API of the kind:
|
One only replaces if |
I would prefer the same function and just have some nice way to specify the operation you want to do, I guess there are 4 natural operations:
|
Actually, I've just thought of a problem with this approach. Lets say you have df1 and df2 where df2 has many new columns you are interested When you do
you get
This kind of scenario is an argument for an |
I do not think it would be a problem, as adding demographic columns, would be a type 4 of the operation (just add columns). So in a sense |
In Stata terms this is the I'm not sure I totally agree with the approach above because generally the number of columns I want to be added and not overridden is an order of magnitude larger than the columns I do want overridden. I would inevitably do An additional problem: We have 5 different |
Yeah, as usual - given the Stata experience - is it possible to put some proposal to start with? |
The SQL syntax I find it easier (example of in-place),
|
@Nosferican this is mostly what I have imagined ideally but without metaprogramming it is hard to come up with a concise syntax, i.e. it could be something like:
where So your example would read:
This would be almost fully flexible (it would not allow full column replacement, but it is easy enough to almost achieve it using Still I am not fully happy with this (nor any other syntax I could come up with). But maybe we will find the |
OK. I have thought about it and here are some conclusions. Joins in DataFrames.jl work in three steps (I omit implementation details):
What we discuss here is essentially replacing TABLE MERGING step by a custom function. This means (I give na example on
can be seen as calling
where And actually we could expose this Now if we went this way it is also natural how The only problem is that |
I also found this feature would be much helpful in our work. the the problems here are how to handle content conflicts. current implementation will throw error or if but for some common use cases, for example, to for flexibility, a relatively low-level API @bkamins mentioned may be needed, but to the most users, I think |
I guess If it's not rude to ask: May I know what does Thanks for your tremendous contribution to DataFrames, @bkamins ! |
Yes, it was a typo; fixed
It gives you a vector stored as a column
This operation, mostly, has nothing to do with DataFrames.jl. It is a standard Julia Base broadcasting syntax as explained here in the Julia Manual, and |
Thank you so much @bkamins for this quick response! Even though Thank you for pointing me to the documentation for the doc syntax, |
OK. I think we are ready to add it in DataFrames.jl 1.5 release. I was thinking about the signature and came up with (in-place version):
(also a copying The The meaning of the arguments is:
Now the things to discuss are:
CC @nalimilan |
@jariji - this is another important PR where API design is a key challenge. |
There's an operation I perform a lot. I'm adding here because I think it's related. Maybe this is worth considering now that the API is under development. Alternatively, it'd be nice to consider this functionality. Basically, the type operation is as follows gdf = groupby(df[condition,:], groups)
result = combine(gdf, col => func => pre_alloc)
leftjoin!(df, result, on=groups) A MWE dff = DataFrame(industry = ["beer", "beer", "wine", "wine"],
is_public =[true,false, true,false],
firm_revenue = collect(10:10:40))
4×3 DataFrame
Row │ industry is_public firm_revenue
│ String Bool Int64
─────┼───────────────────────────────────
1 │ beer true 10
2 │ beer false 20
3 │ wine true 30
4 │ wine false 40
gdf = groupby(dff[dff.is_public .== true,:], :industry)
result = combine(gdf, :firm_revenue => sum => :revenue_of_all_public_firms)
2×2 DataFrame
Row │ industry revenue_of_all_public_firms
│ String Int64
─────┼───────────────────────────────────────
1 │ beer 10
2 │ wine 30
and then i "update" the original dataframe
I guess Thanks!!! |
Currently you can almost do what you want efficiently with:
almost - because, as you can see, we will have Another approach, but it is less general is:
|
Yes, that was my point. I use I was thinking along the lines of using:
These operations are quite common I think. Basically, you put all industry information in columns to compute each firm's result relative to industry measures (e.g., a market share, although computing market shares can obviously be done in a simpler way than all this). Maybe Thanks!!! |
@alfaromartino I still don't understand, why would you not do |
Because @alfaromartino wants to perform some operation on a filtered data frame and the approach like:
is not wanted (this is my understanding) |
Ah I see. This would give like |
or - in other words, |
Yes. Hopefully in DataFramesMeta.jl we can make this solution a bit better with
|
To add to the ongoing discussion here, I have a non-breaking patch that replaces
It also marks the use of N.B. This approach is potentially extensible to other actions to combine duplicate columns e.g. :overwrite or :ignore. PR here #3366 |
Suppose I have an dataframe called
df_missing
:and I also have another dataset, called
df_completion
:So my suggestion is: couldn't there be a more straightforward way to replace the missing values using the common id of the two dataframes, thus creating
df_full
?The two current ways that seem to be the best ones are:
julia> df_missing[in(df_completion.id).(df_missing.id), :val] = df_completion.val
or
besides, of course, creating a loop. Maybe there could be a method for
merging
two DataFrames in this way? I think it's a very common problem.The text was updated successfully, but these errors were encountered: