There are two aggregated tables with some concided columns:
Where
I want to "restore" (disaggregate) them somehow to get new table
Such that
Of course there exists a lot of solves for that's system. So I make a guess that results are most independent.
Independence is not possible in general situation, so i will minimize N sums
Or we can just minimize
The first you should install another packages: numpy, scipy, pandas and cvxpy.
pip install numpy
pip install scipy
pip install pandas
pip install cvxpy
Then move rat
folder to a needed directory.
import numpy as np
import pandas as pd
There are two tables: employment with columns District, Agriculture, Industry, Production, Service;
df_employment = pd.read_csv('tables/employment.csv')
df_employment
District | Agriculture | Industry | Production | Service | |
---|---|---|---|---|---|
0 | East Forests | 2063 | 3644 | 504 | 5562 |
1 | North Mountains | 1258 | 3807 | 862 | 11540 |
and environment with columns District, Urban, Suburban, Rural.
df_environment = pd.read_csv('tables/environment.csv')
df_environment
District | Urban | Suburban | Rural | |
---|---|---|---|---|
0 | East Forests | 3866 | 1510 | 6397 |
1 | North Mountains | 3438 | 5779 | 8250 |
So there is column District in both tables.
We want to "restore" that by making them more independence. That we can make with function restore_table
from ratrestore
module
from rat.ratrestore import restore_table
df_restored = restore_table(df_employment, df_environment, name_a='Employment', name_b='Environment', name_res='Count')
df_restored.head(6)
District | Employment | Environment | Count | |
---|---|---|---|---|
0 | East Forests | Agriculture | Rural | 1120.955661 |
1 | East Forests | Agriculture | Suburban | 264.599508 |
2 | East Forests | Agriculture | Urban | 677.444831 |
3 | East Forests | Industry | Rural | 1980.010872 |
4 | East Forests | Industry | Suburban | 467.377899 |
5 | East Forests | Industry | Urban | 1196.611229 |
If we want to "restore" that by making minimal square sums, we can change parameter obj_type
from default 'dependences'
to 'squares'
:
Warning: That can be noncorrect.
df_restored = restore_table(df_employment, df_environment, name_a='Employment', name_b='Environment', name_res='Count', obj_type='squares')
df_restored.head(6)
District | Employment | Environment | Count | |
---|---|---|---|---|
0 | East Forests | Agriculture | Rural | 1268.107791 |
1 | East Forests | Agriculture | Suburban | 244.740815 |
2 | East Forests | Agriculture | Urban | 611.218573 |
3 | East Forests | Industry | Rural | 2161.590402 |
4 | East Forests | Industry | Suburban | 129.786239 |
5 | East Forests | Industry | Urban | 1504.701185 |
Suppose we have DataFrame df_family
df_family = pd.read_csv('tables/family.csv')
df_family
District | Sex | Single | Marriged | Widower | |
---|---|---|---|---|---|
0 | East Forests | Female | 2545 | 2248 | 314 |
1 | East Forests | Male | 1702 | 2920 | 2044 |
2 | North Mountains | Female | 3059 | 3352 | 617 |
3 | North Mountains | Male | 3207 | 4578 | 2654 |
But we want to make one column Family Status from columns Single, Marriged and Widower.
We can use function roll_weak
from package ratedit
from rat.ratedit import roll_weak
df_family_weak = roll_weak(df_family, ['Single', 'Marriged', 'Widower'], value_name='Family Status', res_name='Persons')
df_family_weak
District | Sex | Family Status | Persons | |
---|---|---|---|---|
0 | East Forests | Female | Single | 2545 |
1 | East Forests | Female | Marriged | 2248 |
2 | East Forests | Female | Widower | 314 |
3 | East Forests | Male | Single | 1702 |
4 | East Forests | Male | Marriged | 2920 |
5 | East Forests | Male | Widower | 2044 |
6 | North Mountains | Female | Single | 3059 |
7 | North Mountains | Female | Marriged | 3352 |
8 | North Mountains | Female | Widower | 617 |
9 | North Mountains | Male | Single | 3207 |
10 | North Mountains | Male | Marriged | 4578 |
11 | North Mountains | Male | Widower | 2654 |
Another situation: we want to get df)family with unique values in column District by making columns from values of column Sex.
We can use function roll_strong
from package ratedit
to get two tables
from rat.ratedit import roll_strong
df_family_strong, cols_family_strong = roll_strong(df_family, ['District'], ['Sex'], ['Single', 'Marriged', 'Widower'], value_name='Family status')
The first is interesting table with renamed columns:
df_family_strong
District | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | |
---|---|---|---|---|---|---|---|
0 | East Forests | 2545 | 1702 | 2248 | 2920 | 314 | 2044 |
1 | North Mountains | 3059 | 3207 | 3352 | 4578 | 617 | 2654 |
And the second is explanation of new column names:
cols_family_strong
Sex | Family status | |
---|---|---|
col_0 | Female | Single |
col_1 | Female | Marriged |
col_2 | Female | Widower |
col_3 | Male | Single |
col_4 | Male | Marriged |
col_5 | Male | Widower |
You can change index_prefix
parameter from default 'col_'
to any you like.
If we want to restore many tables consistently (employment, enviroment and family), we should edit them to same names columns and different value columns.
In our case that is just change df_family
to df_family_strong
. So we have three tables (df_employment
, df_enviroment
and df_family_strong
) with name-column District.
To restore them consistently, we can use restore_alot
function from package ratrestore
from rat.ratrestore import restore_alot
df_alot = restore_alot([df_employment, df_environment, df_family_strong], name_cols=['District'],
tab_names=['Employment', 'Environment', 'Family'], name_res='Count')
df_alot.head(5)
District | Employment | Environment | Family | Count | |
---|---|---|---|---|---|
0 | East Forests | Agriculture | Rural | col_0 | 242.319898 |
1 | East Forests | Agriculture | Rural | col_1 | 162.054407 |
2 | East Forests | Agriculture | Rural | col_2 | 214.041309 |
3 | East Forests | Agriculture | Rural | col_3 | 278.025187 |
4 | East Forests | Agriculture | Rural | col_4 | 29.897229 |
But Family column is not beautiful. So let change it to values from cols_family_strong
, by function add_indexed_cols
from package ratedit
:
from rat.ratedit import add_indexed_cols
df_alot = add_indexed_cols(df_alot, 'Family', cols_family_strong)
df_alot.head(5)
District | Employment | Environment | Sex | Family status | Count | |
---|---|---|---|---|---|---|
0 | East Forests | Agriculture | Rural | Female | Single | 242.319898 |
1 | East Forests | Agriculture | Rural | Female | Marriged | 162.054407 |
2 | East Forests | Agriculture | Rural | Female | Widower | 214.041309 |
3 | East Forests | Agriculture | Rural | Male | Single | 278.025187 |
4 | East Forests | Agriculture | Rural | Male | Marriged | 29.897229 |