Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add slidingWindowSum #99

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

treeowl
Copy link
Contributor

@treeowl treeowl commented Jun 5, 2020

slidingWindow typically helps write only functions with
complexity around O(n*k), where n is the number of elements
in the stream and k is the size of the window. In many cases,
this can be reduced to O(n) by looking not at the window itself
but instead the sum of that window in some Semigroup. This can
be used, for example, to implement moving averages such as
arithmetic, geometric, or harmonic means.

@treeowl
Copy link
Contributor Author

treeowl commented Jun 5, 2020

A couple concerns:

My biggest concern is that for a number of purposes, the size of the window is quite important. If the stream proves to be smaller than the window, we need to know that. For example, if we're taking a moving average, and the stream comes up very short, then we'd naively end up with a total garbage result. Should we have variants that add a return value indicating something about this?

Another concern: it seems rather useful to use this with newtypes from Data.Semigroup. Mapping over the stream once to apply the newtype and again to remove it seems potentially pricy. Do we want a version that maps over its input and its output?

@treeowl
Copy link
Contributor Author

treeowl commented Jun 5, 2020

Oh, one more thing: the name kind of sucks. Should it be slidingWindowSconcat or something?

@treeowl treeowl requested review from chessai and Gabriella439 June 5, 2020 19:45
@treeowl treeowl force-pushed the sliding-window-sum branch 2 times, most recently from 84d7e83 to 8ec3763 Compare June 5, 2020 19:50
@treeowl
Copy link
Contributor Author

treeowl commented Jun 5, 2020

Yet another question: do we want a By version that lets the user provide the semigroup operation directly?

src/Data/AnnotatedQueue.hs Outdated Show resolved Hide resolved
@treeowl
Copy link
Contributor Author

treeowl commented Jun 5, 2020

The CI failure is some GHC HEAD nonsense.

@treeowl
Copy link
Contributor Author

treeowl commented Jun 6, 2020

@andrewthad I know you've stepped away from this package, but I'd very much appreciate your thoughts on this one.

@treeowl treeowl force-pushed the sliding-window-sum branch 2 times, most recently from a3c5a40 to 0bffbf9 Compare June 8, 2020 00:04
src/Data/AnnotatedQueue.hs Outdated Show resolved Hide resolved
treeowl added 2 commits June 8, 2020 00:31
`slidingWindow` typically helps write only functions with
complexity around `O(n*k)`, where `n` is the number of elements
in the stream and `k` is the size of the window. In many cases,
this can be reduced to `O(n)` by looking not at the window itself
but instead the sum of that window in some `Semigroup`. This can
be used, for example, to implement moving averages such as
arithmetic, geometric, or harmonic means.
The queues seem more useful than the sliding window function itself.
@treeowl treeowl force-pushed the sliding-window-sum branch from 0bffbf9 to 6e8a5b1 Compare June 8, 2020 04:32
@danidiaz
Copy link
Contributor

My biggest concern is that for a number of purposes, the size of the window is quite important. If the stream proves to be smaller than the window, we need to know that. For example, if we're taking a moving average, and the stream comes up very short, then we'd naively end up with a total garbage result. Should we have variants that add a return value indicating something about this?

Instead of a return value, perhaps members of the original stream up to the first full window could be re-yielded before starting with the windows themselves. Something like:

slidingWindow :: Monad m => Int -> Stream (Of a) m b -> Stream (Of a) m (Stream (Of (Seq a)) m b)

For cases in which there aren't enough elements to form a window, the isolated elements would be re-yielded but the stream with the window results would be empty.

Also, this way we don't need to keep whole windows in memory in order to report "incomplete" windows.

@andrewthad
Copy link
Contributor

Pardon my ignorance in this area, but here are the things that confused me. This internal data structure looks a lot like a finger tree, but it differs from Patterson's finger tree:

  • The left digit and right digit have different size restrictions from one another.
  • The original finger trees have digits max out at size 4. This implementation uses 2 and 1.
  • The tree itself has an extra data constructor.

But I guess that's not what it's actually based on. The documentation says that this is "an implementation of Okasaki's implicit queues holding elements of some semigroup". It would be kind of nice to have a link to a reference implementation of Okasaki's implicit queues in Haskell so that someone could look at it and see how the data structure differs, but I don't know if there is a good library where this has already been done.

After figuring out what was going on with the data structure, I think this seems fine though, and it's the ideal "purely functional" solution to the problem. It's a little sad that the Semimeasurable abstraction is not exposed to the user in any way, but I think it would only be appropriate to do that if Semimeasurable were provided by a different library. I guess there's the Measured class from the fingertree library, but it's really not what you want here. It would be cool if there were a way to factor those out and unify them, but that's a complication for another way.

@treeowl
Copy link
Contributor Author

treeowl commented Jun 11, 2020

@andrewthad, I really just mashed up Okasaki's structure with Hinze and Paterson's and checked that it was still okay. Yes, the structures are quite closely related, with Okasaki's simpler and less flexible. Which is faster for the purpose? Dunno. It does rather feel like it belongs in Some Other Library. semimeasure is really a convenient hack, and I wouldn't want to expose it as is. Better to expose an idea like one of these for general use:

semimeasure' :: a -> Maybe s
semimeasure'' :: r -> (s -> r) -> a -> r

I just went with the bare minimum I needed here.

@danidiaz, I really don't know enough about what people need to feel like I can evaluate whether your idea is the right approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants