
Time #37

Merged (50 commits, Feb 6, 2024)

Commits
d96c39c
First mild code refactors towards time series handling
EgorKraevTransferwise Nov 6, 2023
05bfd57
Factor heuristic preselection of candidates into a separate class. Al…
EgorKraevTransferwise Nov 6, 2023
f285f7f
More small steps towards time series. All tests pass.
EgorKraevTransferwise Nov 7, 2023
fd2b06d
More small steps towards time series. All tests pass.
EgorKraevTransferwise Nov 10, 2023
67b57c7
Make heuristic pre-selection incremental. All tests pass.
EgorKraevTransferwise Nov 10, 2023
2d911fc
First rough cut of time series; its test fails (commented out), all o…
EgorKraevTransferwise Nov 10, 2023
316c61e
Time series regression runs without exceptions, but detected segments…
EgorKraevTransferwise Nov 10, 2023
22c4563
First working-ish version of time series-aware Wise Pizza.
EgorKraevTransferwise Dec 1, 2023
a9fc54f
Update __init__.py
AlxdrPolyakov Dec 6, 2023
5772636
Merge branch 'main' of https://github.com/transferwise/wise-pizza int…
EgorKraevTransferwise Dec 15, 2023
0661e3d
Post-merge commit
EgorKraevTransferwise Dec 15, 2023
27146b5
First OK-ish cut of TS graph, not great so far
EgorKraevTransferwise Dec 18, 2023
c5a33db
Merge branch 'time' of https://github.com/transferwise/wise-pizza int…
EgorKraevTransferwise Dec 18, 2023
249184e
Graph starts to look better, basis still very primitive
EgorKraevTransferwise Dec 22, 2023
2f8de41
A pretty decent cut with the OMP solver
EgorKraevTransferwise Dec 22, 2023
a19a599
An OK cut of plotting for TS; plotting for active customers still TBD
EgorKraevTransferwise Jan 5, 2024
22074e2
Much nicer plotting
EgorKraevTransferwise Jan 8, 2024
f60db71
Simple fixes to address memory issues for TS; not enough:)
EgorKraevTransferwise Jan 8, 2024
b709785
Refactor plotting prior to doing joint fit for TS weights and totals.
EgorKraevTransferwise Jan 8, 2024
602cacb
more refactor
EgorKraevTransferwise Jan 8, 2024
d755e6a
Nice joint plotting of weights, avgs and totals
EgorKraevTransferwise Jan 9, 2024
5cc9abd
Legend fix
EgorKraevTransferwise Jan 9, 2024
7c69690
A failed attempt to use fitted weights for volume forecasting
EgorKraevTransferwise Jan 9, 2024
6fc6512
Refactor to limit memory consumption for TS
EgorKraevTransferwise Jan 9, 2024
1eec9fd
Optimized memory consumption for higher depths
EgorKraevTransferwise Jan 9, 2024
a3112de
Fixes to pass tests
EgorKraevTransferwise Jan 11, 2024
5ba96d0
Comment out numba usage, didn't seem to help
EgorKraevTransferwise Jan 12, 2024
f2acc61
Refactor TS plotting in preparation for log fits
EgorKraevTransferwise Jan 16, 2024
f6f1f46
Kinda working cut of fitting weights in rescaled space.
EgorKraevTransferwise Jan 17, 2024
5ff4b55
Factor transforms into separate class, segment contributions still br…
EgorKraevTransferwise Jan 17, 2024
fcd84b9
Fitting in transformed space appears to actually work
EgorKraevTransferwise Jan 17, 2024
e300846
A shot at log space fitting for totals too, broken so far
EgorKraevTransferwise Jan 18, 2024
5ace523
Combine plots for segments that are identical except for time profile
EgorKraevTransferwise Jan 19, 2024
37a81fe
Consistent fitting in log space almost works, only green segment line…
EgorKraevTransferwise Jan 22, 2024
65fdec8
Move segment impact to right y axis
EgorKraevTransferwise Jan 22, 2024
43b06b0
Pre-forecasting snapshot; fitting both weights and avgs works, fittin…
EgorKraevTransferwise Jan 29, 2024
f645911
Add a fit_sizes parameter, fitting of growth data works both with it …
EgorKraevTransferwise Jan 29, 2024
a014ac0
Fix index type of extended basis
EgorKraevTransferwise Jan 29, 2024
e1290fe
predict() runs without exceptions
EgorKraevTransferwise Jan 29, 2024
31281af
Minor refactors on the way to plotting predictions. Fits both with an…
EgorKraevTransferwise Jan 29, 2024
e1135d7
Snapshot of prediction plotting. Plot displays but is garbage.
EgorKraevTransferwise Jan 30, 2024
bed7748
Plotting approaches reasonable, only a bit of a jump in the transition
EgorKraevTransferwise Jan 30, 2024
287d0d7
Pretty decent cut of forecasting w/o weights
EgorKraevTransferwise Jan 30, 2024
46955fb
First cut of blending log transform with forecasting. Plots still ver…
EgorKraevTransferwise Jan 30, 2024
625c6a6
Tidy up the plotting data interface, fix nan scaling bug. Predict wit…
EgorKraevTransferwise Jan 31, 2024
38742e6
Added weights TS to avg-only TS fit. Usage of external weights still …
EgorKraevTransferwise Jan 31, 2024
38a7c33
External weights seem to work OK now, plots pretty.
EgorKraevTransferwise Jan 31, 2024
c271b58
Forecasting with given future weights definitely works
EgorKraevTransferwise Feb 1, 2024
70b0b97
added plot_is_static, colors, timeseries notebook
Feb 2, 2024
e2cc8c2
added scale for impact line
Feb 2, 2024
145,515 changes: 145,515 additions & 0 deletions data/synth_time_data.csv

Large diffs are not rendered by default.


262 changes: 262 additions & 0 deletions notebooks/Finding interesting segments in time series.ipynb

Large diffs are not rendered by default.

103 changes: 83 additions & 20 deletions notebooks/Finding interesting segments.ipynb

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion requirements.txt
@@ -9,4 +9,5 @@ scipy>=1.8.0
tqdm
cloudpickle
pivottablejs
streamlit==1.28.0
streamlit==1.28.0
nbformat>=4.2.0
51 changes: 50 additions & 1 deletion tests/test_fit.py
@@ -6,14 +6,17 @@
import pandas as pd
import pytest

from wise_pizza.data_sources.synthetic import synthetic_data
from wise_pizza.data_sources.synthetic import synthetic_data, synthetic_ts_data
from wise_pizza.explain import (
explain_changes_in_average,
explain_changes_in_totals,
explain_levels,
explain_timeseries,
)
from wise_pizza.segment_data import SegmentData
from wise_pizza.solver import solve_lasso, solve_lp
from wise_pizza.time import create_time_basis
from wise_pizza.plotting_time import plot_time

np.random.seed(42)

@@ -137,6 +140,10 @@ def test_categorical():
def test_synthetic_template(nan_percent: float):
all_data = synthetic_data(init_len=1000)
data = all_data.data

data.loc[(data["dim0"] == 0) & (data["dim1"] == 1), "totals"] += 100
data.loc[(data["dim1"] == 0) & (data["dim2"] == 1), "totals"] += 300

if nan_percent > 0:
data = values_to_nan(data, nan_percent)
sf = explain_levels(
@@ -160,6 +167,48 @@ def test_synthetic_template(nan_percent: float):
print("yay!")


@pytest.mark.parametrize("nan_percent", [0.0, 1.0])
def test_synthetic_ts_template(nan_percent: float):
all_data = synthetic_ts_data(init_len=10000)

# Add some big trends to the data
# TODO: insert trend break patterns too
months = np.array(sorted(all_data.data[all_data.time_col].unique()))
basis = create_time_basis(months, baseline_dims=1)
joined = pd.merge(all_data.data, basis, left_on="TIME", right_index=True)
df = joined.drop(columns=basis.columns)

loc1 = (df["dim0"] == 0) & (df["dim1"] == 1)
loc2 = (df["dim1"] == 0) & (df["dim2"] == 1)

df.loc[loc1, "totals"] += 100 * joined.loc[loc1, "Slope"]
df.loc[loc2, "totals"] += 300 * joined.loc[loc2, "Slope"]

if nan_percent > 0:
df = values_to_nan(df, nan_percent)
sf = explain_timeseries(
df,
dims=all_data.dimensions,
total_name=all_data.segment_total,
time_name=all_data.time_col,
size_name=all_data.segment_size,
max_depth=2,
min_segments=5,
verbose=True,
)
print("***")
for s in sf.segments:
print(s)

plot_time(sf)

assert abs(sf.segments[0]["coef"] - 300) < 2
assert abs(sf.segments[1]["coef"] - 100) < 2

# sf.plot()
print("yay!")


@pytest.mark.parametrize(
"how, solver, plot_is_static, function, nan_percent, size_one_percent",
deltas_test_cases,
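The new `test_synthetic_ts_template` injects linear trends into two specific segments before calling `explain_timeseries`, so the fitted coefficients have known targets (300 and 100). The trend-injection pattern can be sketched without the wise-pizza library itself; the sketch below uses a hand-built one-column `Slope` basis as a stand-in for `create_time_basis` (a minimal illustration under that assumption, not the PR's implementation; column names `dim0`, `dim1`, `TIME`, `totals` follow the test).

```python
import numpy as np
import pandas as pd

# A tiny panel: four dimension combinations, observed over 12 months
months = pd.date_range("2023-01-01", periods=12, freq="MS")
rows = [
    {"dim0": d0, "dim1": d1, "TIME": m, "totals": 10.0}
    for d0 in (0, 1)
    for d1 in (0, 1)
    for m in months
]
df = pd.DataFrame(rows)

# A minimal time basis: one linear "Slope" column indexed by month,
# standing in for the PR's create_time_basis output
basis = pd.DataFrame({"Slope": np.arange(len(months), dtype=float)}, index=months)

# Join the basis on the time column and add a scaled slope to one
# segment's totals, mirroring the test's trend injection
joined = df.merge(basis, left_on="TIME", right_index=True)
loc = (joined["dim0"] == 0) & (joined["dim1"] == 1)
joined.loc[loc, "totals"] += 100 * joined.loc[loc, "Slope"]

# The targeted segment now grows by exactly 100 per month; others stay flat
seg = joined[loc].sort_values("TIME")["totals"].to_numpy()
print(np.allclose(np.diff(seg), 100.0))  # → True
```

A solver that regresses segment totals on such a basis should then recover the injected coefficient, which is what the test's `abs(sf.segments[0]["coef"] - 300) < 2` assertions check.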
1 change: 1 addition & 0 deletions wise_pizza/__init__.py
@@ -2,4 +2,5 @@
explain_levels,
explain_changes_in_totals,
explain_changes_in_average,
explain_timeseries,
)
26 changes: 21 additions & 5 deletions wise_pizza/data_sources/synthetic.py
@@ -8,7 +8,7 @@
np.random.seed(42)


def synthetic_data(num_dims: int = 5, dim_values: int = 5, init_len=10000):
def synthetic_data(num_dims: int = 5, dim_values: int = 5, init_len=10000) -> SegmentData:
np.random.seed(42)
cols = {}
for dim in range(num_dims):
@@ -17,9 +17,25 @@ def synthetic_data(num_dims: int = 5, dim_values: int = 5, init_len=10000):
cols["totals"] = np.random.lognormal(0, 1, size=init_len)
dims = [k for k in cols.keys() if "dim" in k]

df = pd.DataFrame(cols).groupby(dims, as_index=False).sum()
# deduplicate dimension values
df = pd.DataFrame(cols).groupby(dims, as_index=False).sum().reset_index(drop=True)
return SegmentData(data=df, dimensions=dims, segment_total="totals")

df.loc[(df["dim0"] == 0) & (df["dim1"] == 1), "totals"] += 100
df.loc[(df["dim1"] == 0) & (df["dim2"] == 1), "totals"] += 300

return SegmentData(data=df, dimensions=dims, segment_total="totals")
def synthetic_ts_data(num_dims: int = 5, dim_values: int = 5, init_len=10000, ts_len: int = 12):
pre_data = synthetic_data(num_dims, dim_values, int(init_len/ts_len))
small_df = pre_data.data
dfs = []
months = np.array(pd.date_range(start="2023-01-01", periods=ts_len, freq="MS"))

for m in months:
this_df = small_df.copy()
this_df["TIME"] = m
this_df["totals"] = np.random.lognormal(0, 1, size=len(this_df))
dfs.append(this_df)

df = pd.concat(dfs)
pre_data.time_col = "TIME"

pre_data.data = df.sort_values(pre_data.dimensions + [pre_data.time_col]).reset_index(drop=True)
return pre_data
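The new `synthetic_ts_data` builds a time series by replicating one deduplicated cross-section of dimension combinations once per month, drawing fresh totals for each copy. The replication pattern in isolation looks like this (a self-contained sketch with a hard-coded cross-section instead of the PR's `synthetic_data` call):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A small deduplicated cross-section of dimension combinations,
# standing in for the grouped output of synthetic_data
small_df = pd.DataFrame({"dim0": [0, 0, 1, 1], "dim1": [0, 1, 0, 1]})

# Replicate the cross-section once per month with fresh lognormal totals,
# the same loop structure synthetic_ts_data uses in this PR
ts_len = 12
months = pd.date_range(start="2023-01-01", periods=ts_len, freq="MS")
dfs = []
for m in months:
    this_df = small_df.copy()
    this_df["TIME"] = m
    this_df["totals"] = rng.lognormal(0, 1, size=len(this_df))
    dfs.append(this_df)

# Stack and sort by dimensions then time, as the PR does before returning
df = pd.concat(dfs).sort_values(["dim0", "dim1", "TIME"]).reset_index(drop=True)

# Every dimension combination appears exactly once per month
print(df.groupby(["dim0", "dim1"]).size().eq(ts_len).all())  # → True
```

Sorting by dimensions and then time gives each segment a contiguous, chronologically ordered block of rows, which is convenient for the downstream time-basis join and plotting.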