This report shows the latest benchmarks for the performance of the `large-records` library.
In most of these benchmarks one of the main things we measure is the size of the `ghc` core. While this is not in and of itself a useful measure, it is a useful proxy for what we are interested in (compilation time and memory usage), and we can measure it much more accurately. When reporting core size, we report the size of the core after desugaring (`ds-preopt`), after the very simple optimiser (`ds`), and after the simplifier (`simpl`).
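For reference, when we do want to inspect these dumps (only when measuring core size, not compilation time), they can be requested per module; a minimal sketch, with a hypothetical module name:

```haskell
-- Ask ghc to write the three core dumps discussed above to .dump-* files
-- next to the object files, rather than to stdout.
{-# OPTIONS_GHC -ddump-to-file
                -ddump-ds-preopt
                -ddump-ds
                -ddump-simpl #-}

module Example where
```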
Note on the use of color: in the plots we intentionally do not use different colors to differentiate between these phases, using color instead to emphasize the difference between different approaches. When a particular phase requires a separate explanation, we will use a separate plot. We will use red (and occasionally blue) for approaches that do not work, and green for the approach that we have adopted.
Compilation time in these benchmarks is as reported by `-ddump-timings`, compiling with `-O0` (the primary goal here is to speed up the development cycle). We measure this without also using any of the core dump options (`-ddump-ds-preopt`, `-ddump-ds`, or `-ddump-simpl`). This is important, because not only do these flags affect compilation time, they may do so non-linearly if part of the tree is never actually forced.

Runtime benchmarks do use `-O1`, since here we are interested in how the library would behave in a production build.
All measurements are done with `ghc` 8.8.4, as this is the version of `ghc` used by the client for whom this work was done (Juspay). The only exception is the `NoFieldSelectors` experiment; see below.
This is one of the main benchmarks of the library, and is described in more detail in Avoiding quadratic core code size with large records. It measures the cost of compiling a module containing

- Record of a certain size
- Stock-derived `Eq` and `Show` instances
- TH-derived `generics-sop` `Generic` instance
- `HasField` instance derived by the `RecordDotPreprocessor` plugin
- `ToJSON` instance in terms of `gtoJSON`

(this is the "Before"), and compares that to a module containing the same ingredients, but with

- Record representation generated by `large-records`
- `Eq`, `Show`, `Generic` (for `large-generics` generics) and `HasField` instances all generated by `large-records`
- `ToJSON` instance in terms of the `large-generics` equivalent of `gtoJSON`

(this is "After").
This benchmark is mostly a sanity check: it repeats the "After" benchmark, but using a higher-kinded record (that is, a record with a functor parameter, as is common for example in `beam` applications).
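As a reminder, a higher-kinded record is simply a record in which every field is wrapped in a functor parameter; a small sketch (record and field names hypothetical):

```haskell
-- Higher-kinded record: every field is wrapped in the functor parameter f.
-- Instantiating f to Identity gives an ordinary record; instantiating it to
-- Maybe gives a record of optional fields, and so on.
data Person f = Person
  { personName :: f String
  , personAge  :: f Int
  }
```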
`HasNormalForm` is an ingredient in how we deal with generic transforms. We described this only very briefly in the blog post (section [Transforms][blogpost1-transforms]), but discussed it in a bit more detail in the Haskell eXchange presentation about the `large-records` library, Avoiding Quadratic Blow-up During Compilation (section "Avoiding evidence", starts around 41:35).
Although the size of the core before the very simple optimiser runs is still quadratic (and huge!):
the very simple optimiser cleans this up to exactly nothing (this is what I describe in the talk):
Provided that the full tree before the very simple optimiser is never fully forced (and normally it isn't), compilation time is still fine:
Although the focus of these benchmarks is compilation time performance, when comparing with `superrecord` we will also look at runtime performance. It is expected here that `superrecord` will outperform `large-anon`; that is after all the trade-off that we make. However, runtime performance should still be somewhat reasonable.
In order to keep benchmarking time in check, we only show the core size after the simplifier (`-ddump-simpl`); the output of `-ddump-ds-preopt` and `-ddump-ds` is similar, so showing all three does not provide further insights, and showing only one significantly reduces benchmarking time. The output of the simplifier is of course also what continues on through the compilation pipeline.
There is no explicit support for generics in `superrecord`, so we instead use translation from and to JSON as a substitute; the way that these are defined in `superrecord` is somewhat reminiscent of how generic functions are defined in `GHC.Generics`. The `ToJSON` benchmark would then be an example of a generic consumer, whereas the `ParseJSON` benchmark would be an example of a generic producer.
This uses the source plugin, but the plugin offers merely syntactic sugar: it does not use any unsafe features from the library, and you could instead just write these records as repeated inserts (see the sketch below). The library does offer experimental integration with `typelet`, which confirms that the only source of non-linear core size is indeed the lack of type sharing in the type arguments to `insert`. However, as the benchmarks show, this does not actually result in an improvement in compilation time (at least, not for this benchmark): compilation time is nicely linear without it.
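To illustrate what "repeated inserts" means, here is a sketch of record construction without the syntactic sugar. This assumes the `Data.Record.Anon.Simple` interface of the released `large-anon` package, with `empty`, `insert`, and `OverloadedLabels` for field names; the field names themselves are made up:

```haskell
{-# LANGUAGE OverloadedLabels #-}

import Data.Record.Anon.Simple (empty, insert)  -- assumed API names

-- Each call to insert is applied at a type listing all fields inserted so far;
-- without type-level sharing these repeated type arguments are the source of
-- the non-linear core size mentioned above.
example =
    insert #a (1 :: Int)
  $ insert #b True
  $ insert #c "hello"
  $ empty
```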
Comparison with `superrecord`

In `superrecord` there are two APIs for constructing records: a safe one and an unsafe one, which requires a conservative size estimate up-front (specified as an argument to `unsafeRNil`) and modifies the record in place, thus requiring very careful use.
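For reference, a sketch of the two construction styles, assuming the `SuperRecord` module's `rnil`, `(&)`, `(:=)`, `unsafeRNil` and `unsafeRCons` (field names are made up):

```haskell
{-# LANGUAGE DataKinds        #-}
{-# LANGUAGE OverloadedLabels #-}
{-# LANGUAGE TypeOperators    #-}

import SuperRecord  -- assumed module and function names

-- Safe construction: purely functional, but results in enormous core.
safeExample = #a := (1 :: Int) & #b := True & rnil

-- Unsafe construction: the argument to unsafeRNil is a conservative size
-- estimate, and the record is subsequently updated in place, so this style
-- must be used with great care.
unsafeExample =
    unsafeRCons (#a := (1 :: Int))
  $ unsafeRCons (#b := True)
  $ unsafeRNil 2
```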
The safe API results in such enormous core size and corresponding compilation time that it dwarfs everything else in comparison, so we show it in a graph by itself; we only measured core size up to records of size 30 and compilation time up to size 50:
The unsafe API is somewhat more reasonable:
This extracts half the record fields into a non-record datatype.
Here is where we see the win from `applyDiff` (we paid for it during record construction).
This overwrites half the record fields.
In this section we report on a number of experiments that test various approaches used in the `large-records` library, independent of the details of the full library. Indeed, these experiments do not use the `large-records` library at all.
This experiment isolates just the cost of having a record with many fields. This cost is due to the record accessors, as discussed in section Warmup: terms of the first blog post.
Note that this cost is not prevented by the use of `NoFieldSelectors`: even when using this language pragma, field selectors are still created; they just won't pollute the namespace (they are used for `HasField` instances).
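To make this concrete: for a record such as the (hypothetical) one below, `ghc` generates one selector per field, and each selector pattern matches on the full constructor, so the combined size of the selectors grows quadratically with the number of fields:

```haskell
-- ghc generates, for example,
--   field1 (MkR x _ _) = x
-- and similarly for every other field: n selectors of size O(n) each.
data R = MkR
  { field1 :: Int
  , field2 :: Int
  , field3 :: Int
  }
```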
This experiment checks that `NoFieldSelectors` indeed makes it possible to define record pattern synonyms without incurring a quadratic cost. Since `NoFieldSelectors` was introduced only in `ghc` 9.2.1, this is the only experiment run with that version of `ghc` (and we currently make no use of this feature).
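A sketch of the kind of record pattern synonym involved (names hypothetical); without `NoFieldSelectors`, the record syntax in the pattern synonym would give rise to field selectors, and hence to the same quadratic cost as an ordinary record:

```haskell
{-# LANGUAGE NoFieldSelectors #-}  -- requires ghc >= 9.2.1
{-# LANGUAGE PatternSynonyms  #-}

-- The underlying datatype is positional ...
data R = MkR Int Bool

-- ... but the pattern synonym exposes it as a record. With NoFieldSelectors
-- the implied selectors no longer pollute the namespace.
pattern R :: Int -> Bool -> R
pattern R { field1, field2 } = MkR field1 field2
```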
This is a simple experiment that measures the cost of having many superclasses. This cost turns out to be quadratic due to the record accessors for the type class dictionaries, as explained in section Type class dictionaries of the first blog post.
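A small sketch of the shape of class involved (class names hypothetical); the dictionary for `D` is a record with one field per superclass, and each of the corresponding selectors matches on the full dictionary:

```haskell
class C1 a
class C2 a
class C3 a

-- For n superclasses we get n dictionary selectors of size O(n) each, which is
-- where the quadratic core size comes from.
class (C1 a, C2 a, C3 a) => D a
```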
These accessors are introduced by the simplifier, so we don't see this cost until `simpl`:
This experiment is about the cost of the lack of type sharing, as illustrated by chains of calls to the `Applicative` operator `<*>`. We discuss this in section More subtle: types of the first blog post.
Note: type-level sharing is discussed in more detail in the third blog post on this topic, Type-level sharing in Haskell, now; see also the corresponding section on type-level sharing below.
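To illustrate, a small sketch of such a chain (the benchmark uses much longer chains, and the names here are made up); every `<*>` in the chain is applied at types that spell out the remainder of the chain in full, so without sharing the total core size is quadratic in the length of the chain:

```haskell
-- The type arguments to the successive (<*>) applications repeat the growing
-- function types (e.g. Bool -> Char -> (Int, Bool, Char)), which is the lack
-- of type sharing measured here.
example :: Maybe (Int, Bool, Char)
example = (,,) <$> Just 1 <*> Just True <*> Just 'x'
```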
The induction experiment compares the cost of instance induction over lists with the cost of instance induction over trees generated from lists, with or without a phantom argument. This is one of the main topics discussed in the second blog post, Induction without core-size blow-up.
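As a reminder of what instance induction over a type-level list looks like, a minimal sketch (class name hypothetical); the evidence for an n-element list spells out all n suffixes of the list as type arguments, which is where the quadratic core size for list induction comes from:

```haskell
{-# LANGUAGE ConstraintKinds       #-}
{-# LANGUAGE DataKinds             #-}
{-# LANGUAGE FlexibleContexts      #-}
{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE KindSignatures        #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeOperators         #-}
{-# LANGUAGE UndecidableInstances  #-}

import Data.Kind (Constraint, Type)

-- Induction over a type-level list: the instance for (x ': xs) refers back to
-- the instance for xs, so the evidence for a list of length n repeats every
-- one of its suffixes as a type argument.
class ListAll (c :: Type -> Constraint) (xs :: [Type])
instance ListAll c '[]
instance (c x, ListAll c xs) => ListAll c (x ': xs)
```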
This experiment compares a standard `generics-sop` `Generic` instance for `HList` with a `large-generics` instance using tree induction and phantom contexts, as explored in the induction experiment (both instances hand-written). This verifies that the techniques we explored can in fact be used to construct `Generic` instances that do not result in quadratic core.
The problem with constraint families was described in the second blog post, Induction without core-size blow-up, a.k.a. Large records: anonymous edition, section Constraint Families.
The difference here comes from introducing an intermediate type class to ensure shallow evaluation. As mentioned in the blog post, this is only visible before the very simple optimiser runs:
After the optimiser the difference between the two approaches disappears:
(This experiment assumes a phantom role assigned to `EmptyClass`, as discussed in the "Induction" experiment.)
As long as we can depend on the very simple optimiser, the difference is not terribly important:
We nonetheless prefer the shallow approach, in order to limit dependency on the VSO as much as possible.
This experiment assesses the feasibility of pre-evaluating types, as described in Postscript: Pre-evaluating type families. For reference we have also included the measurements for the regular induction case (without pre-evaluation, using phantom contexts). It's clear from these graphs that the difference is negligible, and so we do not use pre-evaluation at all. It's mildly interesting that this is a case where phantom contexts are actually (slightly) worse, however.
Type-level sharing is discussed in detail in Type-level sharing in Haskell, now.
We have two benchmarks here: one for the construction of `HList`s and one for applicative chains; both are discussed in more detail in the blog post.

Unfortunately, although these benchmarks show that the `typelet` package can indeed help to construct `ghc` core of linear size, they do not show an improvement in compilation time, at least for these examples.
For the `HList` benchmarks we have a baseline (no sharing), a version using `letAs`, and a version using `letAs'`, the CPS form of `letAs`.
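For reference, a sketch of the baseline (no sharing) case, with an `HList` defined in the usual way (these are not necessarily the exact definitions used in the benchmark); each application of `(:*)` carries the types of the entire remaining list as type arguments, so the core for an n-element list is quadratic in n, which is exactly what `letAs` is designed to avoid:

```haskell
{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE GADTs          #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeOperators  #-}

import Data.Kind (Type)

infixr 5 :*

data HList :: [Type] -> Type where
  Nil  :: HList '[]
  (:*) :: x -> HList xs -> HList (x ': xs)

-- Baseline, no sharing: each (:*) is applied at type arguments repeating the
-- types of everything to its right.
example :: HList '[Int, Bool, Char]
example = 1 :* True :* 'x' :* Nil
```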
For the applicative chains benchmarks we only have a baseline (no sharing) and one using `let`.