Error in verify_new_data() #39

Open
Sanaxen opened this issue Nov 18, 2022 · 6 comments
Comments

@Sanaxen

Sanaxen commented Nov 18, 2022

Error in `verify_new_data()`:
! `new_data` includes obs that we can't generate predictions.

What is the cause of this error?
It may be a coincidence, but the error also occurs with other datasets, apparently whenever the data is monthly.

For example, the well-known `AirlinePassengers.csv` dataset also triggers this error.

@cregouby
Collaborator

cregouby commented Nov 18, 2022

Hello @Sanaxen,

You can see the cause in this code:

tft/R/predict.R

Lines 148 to 160 in 226d21f

possible_dates <- future_data(past_data, horizon = object$config$horizon,
                              input_types = input_types)
not_allowed <- new_data %>%
  dplyr::anti_join(
    possible_dates,
    by = c(tsibble::index_var(possible_dates), tsibble::key_vars(possible_dates))
  )
if (nrow(not_allowed) > 0) {
  cli::cli_abort(c(
    "{.var new_data} includes obs that we can't generate predictions.",
    "x" = "Found {.var {nrow(not_allowed)}} observations."
  ))
}

Or, in plain English, some observations in your new_data have the same index and key as observations in your past_data.

This smells like data leakage, which you want to avoid at all costs.
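For reference, here is a minimal sketch of what the `anti_join` step in the quoted snippet computes. The two tibbles are hypothetical stand-ins invented for illustration (they are not produced by tft's `future_data()`), and the column names `date` and `key` are assumed from the data discussed in this thread.

```r
library(dplyr)

# Hypothetical stand-in for the dates the model is able to predict:
# 12 consecutive months for a single key.
possible_dates <- tibble(
  date = seq(as.Date("1960-01-01"), by = "month", length.out = 12),
  key  = "0"
)

# Hypothetical new_data: two rows inside that range, one row outside it.
new_data <- tibble(
  date = as.Date(c("1960-01-01", "1960-02-01", "1961-01-01")),
  key  = "0"
)

# anti_join() keeps the rows of new_data that have NO match in possible_dates.
not_allowed <- anti_join(new_data, possible_dates, by = c("date", "key"))

# The quoted check aborts whenever this count is greater than zero.
nrow(not_allowed)
```

In this sketch only the 1961-01-01 row is flagged, so the error fires for any `new_data` row whose (index, key) pair is missing from `possible_dates`, whatever the reason it is missing.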

Hope it helps!

@Sanaxen
Author

Sanaxen commented Nov 19, 2022

Hi, @cregouby,

The data is divided into train, valid, and test:

> print(train,n=100)
# A tibble: 84 x 4
# Groups:   key [1]
   date       target key   self_adding_date_time_since_bg
   <date>      <dbl> <fct>                          <dbl>
 1 1949-01-01    112 0                                 0 
 2 1949-02-01    118 0                               268.
 3 1949-03-01    132 0                               510.
 4 1949-04-01    129 0                               778.
 5 1949-05-01    121 0                              1037.
 6 1949-06-01    135 0                              1305.
 7 1949-07-01    148 0                              1564.
 8 1949-08-01    148 0                              1832.
 9 1949-09-01    136 0                              2100.
10 1949-10-01    119 0                              2359.
11 1949-11-01    104 0                              2627.
12 1949-12-01    118 0                              2886.
13 1950-01-01    115 0                              3154.
14 1950-02-01    126 0                              3421.
15 1950-03-01    141 0                              3663.
16 1950-04-01    135 0                              3931.
17 1950-05-01    125 0                              4190.
18 1950-06-01    149 0                              4458.
19 1950-07-01    170 0                              4717.
20 1950-08-01    170 0                              4985.
21 1950-09-01    158 0                              5253.
22 1950-10-01    133 0                              5512.
23 1950-11-01    114 0                              5780.
24 1950-12-01    140 0                              6039.
25 1951-01-01    145 0                              6307.
26 1951-02-01    150 0                              6575.
27 1951-03-01    178 0                              6817.
28 1951-04-01    163 0                              7085.
29 1951-05-01    172 0                              7344 
30 1951-06-01    178 0                              7612.
31 1951-07-01    199 0                              7871.
32 1951-08-01    199 0                              8139.
33 1951-09-01    184 0                              8407.
34 1951-10-01    162 0                              8666.
35 1951-11-01    146 0                              8934.
36 1951-12-01    166 0                              9193.
37 1952-01-01    171 0                              9461.
38 1952-02-01    180 0                              9729.
39 1952-03-01    193 0                              9979.
40 1952-04-01    181 0                             10247.
41 1952-05-01    183 0                             10506.
42 1952-06-01    218 0                             10774.
43 1952-07-01    230 0                             11033.
44 1952-08-01    242 0                             11301.
45 1952-09-01    209 0                             11569.
46 1952-10-01    191 0                             11828.
47 1952-11-01    172 0                             12096 
48 1952-12-01    194 0                             12355.
49 1953-01-01    196 0                             12623.
50 1953-02-01    196 0                             12891.
51 1953-03-01    236 0                             13133.
52 1953-04-01    235 0                             13401.
53 1953-05-01    229 0                             13660.
54 1953-06-01    243 0                             13928.
55 1953-07-01    264 0                             14187.
56 1953-08-01    272 0                             14455.
57 1953-09-01    237 0                             14723.
58 1953-10-01    211 0                             14982.
59 1953-11-01    180 0                             15250.
60 1953-12-01    201 0                             15509.
61 1954-01-01    204 0                             15777.
62 1954-02-01    188 0                             16044.
63 1954-03-01    235 0                             16286.
64 1954-04-01    227 0                             16554.
65 1954-05-01    234 0                             16813.
66 1954-06-01    264 0                             17081.
67 1954-07-01    302 0                             17340.
68 1954-08-01    293 0                             17608.
69 1954-09-01    259 0                             17876.
70 1954-10-01    229 0                             18135.
71 1954-11-01    203 0                             18403.
72 1954-12-01    229 0                             18662.
73 1955-01-01    242 0                             18930.
74 1955-02-01    233 0                             19198.
75 1955-03-01    267 0                             19440 
76 1955-04-01    269 0                             19708.
77 1955-05-01    270 0                             19967.
78 1955-06-01    315 0                             20235.
79 1955-07-01    364 0                             20494.
80 1955-08-01    347 0                             20762.
81 1955-09-01    312 0                             21030.
82 1955-10-01    274 0                             21289.
83 1955-11-01    237 0                             21557.
84 1955-12-01    278 0                             21816 
> print(valid,n=100)
# A tibble: 48 x 4
# Groups:   key [1]
   date       target key   self_adding_date_time_since_bg
   <date>      <dbl> <fct>                          <dbl>
 1 1956-01-01    284 0                             22084.
 2 1956-02-01    277 0                             22352.
 3 1956-03-01    317 0                             22602.
 4 1956-04-01    313 0                             22870.
 5 1956-05-01    318 0                             23129.
 6 1956-06-01    374 0                             23397.
 7 1956-07-01    413 0                             23656.
 8 1956-08-01    405 0                             23924.
 9 1956-09-01    355 0                             24192 
10 1956-10-01    306 0                             24451.
11 1956-11-01    271 0                             24719.
12 1956-12-01    306 0                             24978.
13 1957-01-01    315 0                             25246.
14 1957-02-01    301 0                             25514.
15 1957-03-01    356 0                             25756.
16 1957-04-01    348 0                             26024.
17 1957-05-01    355 0                             26283.
18 1957-06-01    422 0                             26551.
19 1957-07-01    465 0                             26810.
20 1957-08-01    467 0                             27078.
21 1957-09-01    404 0                             27346.
22 1957-10-01    347 0                             27605.
23 1957-11-01    305 0                             27873.
24 1957-12-01    336 0                             28132.
25 1958-01-01    340 0                             28400.
26 1958-02-01    318 0                             28668.
27 1958-03-01    362 0                             28909.
28 1958-04-01    348 0                             29177.
29 1958-05-01    363 0                             29436.
30 1958-06-01    435 0                             29704.
31 1958-07-01    491 0                             29964.
32 1958-08-01    505 0                             30231.
33 1958-09-01    404 0                             30499.
34 1958-10-01    359 0                             30758.
35 1958-11-01    310 0                             31026.
36 1958-12-01    337 0                             31285.
37 1959-01-01    360 0                             31553.
38 1959-02-01    342 0                             31821.
39 1959-03-01    406 0                             32063.
40 1959-04-01    396 0                             32331.
41 1959-05-01    420 0                             32590.
42 1959-06-01    472 0                             32858.
43 1959-07-01    548 0                             33117.
44 1959-08-01    559 0                             33385.
45 1959-09-01    463 0                             33653.
46 1959-10-01    407 0                             33912 
47 1959-11-01    362 0                             34180.
48 1959-12-01    405 0                             34439.
> print(test,n=100)
# A tibble: 12 x 4
# Groups:   key [1]
   date       target key   self_adding_date_time_since_bg
   <date>      <dbl> <fct>                          <dbl>
 1 1960-01-01    417 0                             34707.
 2 1960-02-01    391 0                             34975.
 3 1960-03-01    419 0                             35225.
 4 1960-04-01    461 0                             35493.
 5 1960-05-01    472 0                             35752.
 6 1960-06-01    535 0                             36020.
 7 1960-07-01    622 0                             36279.
 8 1960-08-01    606 0                             36547.
 9 1960-09-01    508 0                             36815.
10 1960-10-01    461 0                             37074.
11 1960-11-01    390 0                             37342.
12 1960-12-01    432 0                             37601.
> spec
A <prepared_tft_dataset_spec> with:

v lookback = 48 and horizon = 12.
v The number of possible slices is 73

-- Covariates: 
v `index`: date
v `keys`: key
v `static`: 
v `known`: self_adding_date_time_since_bg
v `unknown`: 
i Variables that are not specified in other types are considered `unknown`.

past_data = bind_rows(train, valid)
new_data = test
But I don't think past_data and new_data share any (index, key) pair.

> pred <- predict(object = fitted, new_data = test, past_data = bind_rows(train, valid))
Error in `verify_new_data()`:
! `new_data` includes obs that we can't generate predictions.
x Found `12` observations.
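One way to double-check the claim above, as a rough sketch in base R: rebuild only the `date` and `key` columns of the three printed tables, then test for shared (index, key) pairs and for dates outside the 12 months that follow `past_data`. The data frames here are reconstructed from the printout, and the "next 12 months" rule is an assumption about what a horizon of 12 should allow, not a copy of tft's `future_data()`.

```r
# Rebuild the (date, key) columns of the printed train/valid/test tables.
dates <- seq(as.Date("1949-01-01"), by = "month", length.out = 144)
train <- data.frame(date = dates[1:84],    key = "0")
valid <- data.frame(date = dates[85:132],  key = "0")
test  <- data.frame(date = dates[133:144], key = "0")

past_data <- rbind(train, valid)

# Rows present in both test and past_data, matched on index and key.
overlap <- merge(test, past_data, by = c("date", "key"))

# Assumed rule: with horizon = 12, the predictable dates are the
# 12 months right after the last date in past_data.
expected <- seq(max(past_data$date), by = "month", length.out = 13)[-1]
outside  <- test$date[!(test$date %in% expected)]

nrow(overlap)    # number of leaked (index, key) pairs
length(outside)  # number of test dates beyond the assumed horizon
```

Both counts are zero for this split, so under these assumptions the test rows neither overlap `past_data` nor extend past the horizon.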

@Ujjwal4CULS

I get the same error even though the dates in my test data are later than those in my train data.

@dfalbel
Member

dfalbel commented Apr 12, 2023

This is a bug. Would be nice if you could paste a reproducible example so it's quicker for us to debug.

@Sanaxen Sanaxen closed this as completed Apr 12, 2023
@Sanaxen Sanaxen reopened this Apr 12, 2023
@Sanaxen
Author

Sanaxen commented Apr 12, 2023

https://github.com/jbrownlee/Datasets/blob/master/airline-passengers.csv
The data is very simple:
train: 84 rows
valid: 48 rows
The remaining 12 rows are used for prediction.
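A sketch of that split that needs no download, assuming R's built-in `AirPassengers` series holds the same 144 monthly values as the linked CSV (its first and last values match the tables printed earlier in this thread). The column names `date`, `target`, and `key` follow those tables; the `self_adding_date_time_since_bg` column and the tft model fit are left out, so this only prepares the data and does not by itself trigger the error.

```r
# Built-in monthly airline passenger series, 1949-01 to 1960-12.
passengers <- data.frame(
  date   = seq(as.Date("1949-01-01"), by = "month",
               length.out = length(AirPassengers)),
  target = as.numeric(AirPassengers),
  key    = factor("0")
)

# Split by row position: 84 train, 48 valid, last 12 for prediction.
train <- passengers[1:84, ]
valid <- passengers[85:132, ]
test  <- passengers[133:144, ]

range(test$date)
```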

@dfalbel
Member

dfalbel commented Apr 12, 2023

Thanks
