Tree iteration and table (#30)
* Fix `show()`

* test drive Table

* Move arrayapi.jl

* Add subset of branch interface

* README and precompile

* Settle with TypedTable for now
Moelf authored Jul 9, 2021
1 parent 808cc63 commit 10388a9
Showing 10 changed files with 124 additions and 153 deletions.
11 changes: 7 additions & 4 deletions Project.toml
@@ -1,36 +1,39 @@
name = "UnROOT"
uuid = "3cd96dde-e98d-4713-81e9-a4a1b0235ce9"
authors = ["Tamas Gal", "Jerry Ling"]
version = "0.2.1"
version = "0.2.2"

[deps]
CodecLz4 = "5ba52731-8f18-5e0d-9241-30f10d1ec561"
CodecXz = "ba30903b-d9e8-5048-a5ec-d1f5b0d4b47b"
CodecZlib = "944b1d66-785c-5afd-91f1-9de20f533193"
CodecZstd = "6b39b394-51ab-5f42-8807-6242bab2b4c2"
LRUCache = "8ac3fa9e-de4c-5943-b1dc-09c6b5f20637"
MD5 = "6ac74813-4b46-53a4-afec-0b5dc9d7885c"
Memoization = "6fafb56a-5788-4b4e-91ca-c0cea6611c73"
Mixers = "2a8e4939-dab8-5edc-8f64-72a8776f13de"
Parameters = "d96e819e-fc66-5662-9728-84c9c7592b0a"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
TypedTables = "9d95f2ec-7b3d-5a63-8d20-e2491e220bb9"

[compat]
CodecLz4 = "^0.3.0, ^0.4.0"
CodecXz = "^0.6.0, ^0.7.0"
CodecZlib = "^0.6.0, ^0.7.0"
CodecZstd = "^0.6.0, ^0.7.0"
LRUCache = "^1.3.0"
MD5 = "^0.2.1"
Memoization = "^0.1.10"
Mixers = "^0.1.0"
Parameters = "^0.12.0"
StaticArrays = "^0.12.0, ^1"
Tables = "^1.0.0"
TypedTables = "^1.0.0"
julia = "1"

[extras]
MD5 = "6ac74813-4b46-53a4-afec-0b5dc9d7885c"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
ThreadsX = "ac1d9e8a-700a-412c-b207-f0111f4b6c0d"

[targets]
test = ["Test", "ThreadsX"]
test = ["Test", "ThreadsX", "MD5"]
125 changes: 68 additions & 57 deletions README.md
@@ -29,37 +29,55 @@ We support reading all scalar branch and jagged branch of "basic" types, provide
indexing interface (thus iteration too) with basket-cache. As
a metric, UnROOT can read all branches of CMS NanoAOD.

The most easy way to access data is through `LazyBranch` which will be constructed
when you index a `ROOTFile` with `"treename/branchname"`. It acts just like an array --
you can index it, iterate through it, `map` over it etc:

``` julia
using UnROOT
## Quick Start
The easiest way to access data is through `LazyTree`, which for now returns a `TypedTables.Table`:
```julia
julia> using UnROOT

julia> t = ROOTFile("test/samples/NanoAODv5_sample.root");

julia> mytree = LazyTree(t, "Events", ["nMuon", "Electron_dxy"])
 Table with 2 columns and 1000 rows:
      nMuon  Electron_dxy
    ┌─────────────────────────────
 1  │ 0      Float32[0.000370502]
 2  │ 2      Float32[-0.00981903]
 3  │ 0      Float32[]
```

You can iterate through a `LazyTree`:
```julia
julia> for event in mytree
@show event.Electron_dxy
break
end
event.Electron_dxy = Float32[0.00037050247]
```

Only one basket per branch is cached, so you don't have to worry about running out of RAM.
At the same time, `event` inside the for-loop is not materialized, so if the main loop applies a
stringent cut, disk I/O can be reduced significantly.
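
To illustrate why an unmaterialized `event` saves reads, here is a minimal sketch in plain Julia. It is not UnROOT's implementation: `CountingColumn` and `LazyEvent` are hypothetical stand-ins that only count how often each column is actually accessed when a cut rejects most rows.

```julia
# Hypothetical column wrapper that counts how many entries were actually read.
mutable struct CountingColumn{T}
    data::Vector{T}
    nreads::Int
end
CountingColumn(data::Vector) = CountingColumn(data, 0)

function Base.getindex(c::CountingColumn, i::Integer)
    c.nreads += 1
    return c.data[i]
end

# Hypothetical lazy event: it stores only the row number, and a column
# is read at the moment one of its fields is accessed.
struct LazyEvent{C<:NamedTuple}
    columns::C
    row::Int
end

function Base.getproperty(e::LazyEvent, name::Symbol)
    cols = getfield(e, :columns)
    return getproperty(cols, name)[getfield(e, :row)]
end

nMuon = CountingColumn([0, 2, 0, 1])
dxy = CountingColumn([[0.1f0], [-0.2f0], Float32[], [0.3f0]])
cols = (nMuon = nMuon, Electron_dxy = dxy)

selected = Vector{Float32}[]
for row in 1:4
    event = LazyEvent(cols, row)
    event.nMuon > 0 || continue          # the cut: only nMuon is read here
    push!(selected, event.Electron_dxy)  # read only for events passing the cut
end
```

In this toy setup `nMuon` is read for all four rows, while `Electron_dxy` is read only for the two rows that pass the cut.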

If you only care about a few branches, you can directly use `LazyBranch` (they make up the columns of the `Table`), which is constructed
when you index a `ROOTFile` with `"treename/branchname"`. It acts just like an array:
you can index it, iterate through it, and `map` over it efficiently. You can even dump the entire branch with `collect()`.
``` julia
julia> LB = t["Events/Electron_dxy"]
LazyBranch{Vector{Float32}, UnROOT.Nooffsetjagg}:
File: ./test/samples/NanoAODv5_sample.root
Branch: Electron_dxy
Description: dxy (with sign) wrt first PV, in cm
NumEntry: 1000
Entry Type: Vector{Float32}

# while this pattern, `t["tree"]["branch"]`, will give you the branch object itself

# this pattern, `t["tree"]["branch"]`, will give you the branch object itself
julia> t["Events"]["Electron_dxy"]
UnROOT.TBranch_13
cursor: UnROOT.Cursor
fName: String "Electron_dxy"
...

julia> for i = 5:8
julia> for i = 5:7
@show LB[i]
end
LB[i] = Float32[]
LB[i] = Float32[-0.0012559891]
LB[i] = Float32[0.06121826, 0.00064229965]
LB[i] = Float32[0.005870819, 0.00054883957, -0.00617218]

# or a range
julia> LB[5:8]
@@ -69,14 +87,6 @@ julia> LB[5:8]
[0.06121826, 0.00064229965]
[0.005870819, 0.00054883957, -0.00617218]

# a jagged branch
julia> collect(LB)
1000-element Vector{Vector{Float32}}:
[0.00037050247]
[-0.009819031]
[]
...

# reading a branch is also thread-safe, although it may not be much faster, depending on disk I/O and cache
julia> using ThreadsX

@@ -89,6 +99,7 @@ julia> all(
true
```


If you have a custom C++ struct inside your branch, reading raw data is also possible
using the `UnROOT.array(f::ROOTFile, path; raw=true)` method. The output can
then be reinterpreted using a custom type with the method
@@ -114,8 +125,41 @@ julia> UnROOT.splitup(data, offsets, UnROOT.KM3NETDAQHit)
[UnROOT.KM3NETDAQHit(1073742790, 0x00, 9, 0x60)......
```
## Main challenges
- ROOT data is generally stored as big endian and is a
self-descriptive format, i.e. so-called streamers are stored in the files
which describe the actual structure of the data in the corresponding branches.
These streamers are read during runtime and need to be used to generate
Julia structs and `unpack` methods on the fly.
- Performance is very important for a low level I/O library.
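
As a small illustration of the big-endian point above, the following sketch decodes a 32-bit integer from raw bytes with `ntoh` from Base (which the module also imports). The byte values are made up for the example and do not come from a real ROOT file.

```julia
# Four bytes as they would appear in a big-endian stream:
# the value 1000 (0x000003e8) stored as a 32-bit unsigned integer.
bytes = UInt8[0x00, 0x00, 0x03, 0xe8]

io = IOBuffer(bytes)
raw = read(io, UInt32)   # interpreted in the host's byte order
value = ntoh(raw)        # convert from network (big-endian) order to host order
```

On a little-endian machine `raw` holds the byte-swapped number, and `ntoh` restores it; on a big-endian machine `ntoh` is a no-op, so `value` is 1000 either way.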
## Low hanging fruits
Pick one ;)
- [x] Parsing the file header
- [x] Read the `TKey`s of the top level dictionary
- [x] Reading the available trees
- [x] Reading the available streamers
- [x] Reading a simple dataset with primitive streamers
- [x] Reading of raw basket bytes for debugging
- [ ] Automatically generate streamer logic
- [ ] Prettier `show` for `Lazy*`s
- [ ] Clean up `Cursor` use
- [x] Reading `TNtuple` #27
## Acknowledgements
#3 Behind the scene
Special thanks to Jim Pivarski ([@jpivarski](https://github.com/jpivarski))
from the [Scikit-HEP](https://github.com/scikit-hep) project, who is the
main author of [uproot](https://github.com/scikit-hep/uproot), a native
Python library to read and write ROOT files, which was and is a great source
of inspiration and information for reverse engineering the ROOT binary
structures.
## Behind the scene
<details><summary>Some additional debug output: </summary>
<p>
@@ -198,36 +242,3 @@ Compressed datastream of 1317 bytes at 6180 (TKey 't1' (TTree))
```
</p>
</details>
## Main challenges
- ROOT data is generally stored as big endian and is a
self-descriptive format, i.e. so-called streamers are stored in the files
which describe the actual structure of the data in the corresponding branches.
These streamers are read during runtime and need to be used to generate
Julia structs and `unpack` methods on the fly.
- Performance is very important for a low level I/O library.
## Low hanging fruits
Pick one ;)
- [x] Parsing the file header
- [x] Read the `TKey`s of the top level dictionary
- [x] Reading the available trees
- [x] Reading the available streamers
- [x] Reading a simple dataset with primitive streamers
- [x] Reading of raw basket bytes for debugging
- [ ] Automatically generate streamer logic
- [ ] Clean up `Cursor` use
- [x] Reading `TNtuple` #27
## Acknowledgements
Special thanks to Jim Pivarski ([@jpivarski](https://github.com/jpivarski))
from the [Scikit-HEP](https://github.com/scikit-hep) project, who is the
main author of [uproot](https://github.com/scikit-hep/uproot), a native
Python library to read and write ROOT files, which was and is a great source
of inspiration and information for reverse engineering the ROOT binary
structures.
14 changes: 5 additions & 9 deletions src/UnROOT.jl
@@ -1,16 +1,14 @@
module UnROOT

export ROOTFile, LazyBranch
export ROOTFile, LazyBranch, LazyTree

import Base: keys, get, getindex, show, length, iterate, position, ntoh, lock, unlock
using Base.Threads: SpinLock
using Memoization, LRUCache
ntoh(b::Bool) = b

using CodecZlib, CodecLz4, CodecXz, CodecZstd
using Mixers
using Parameters
using StaticArrays
using CodecZlib, CodecLz4, CodecXz, CodecZstd, StaticArrays
using Mixers, Parameters, Memoization, LRUCache
import Tables, TypedTables

@static if VERSION < v"1.1"
fieldtypes(T::Type) = [fieldtype(T, f) for f in fieldnames(T)]
@@ -27,10 +25,8 @@ include("utils.jl")
include("streamers.jl")
include("bootstrap.jl")
include("root.jl")
include("arrayapi.jl")
# include("itr.jl")
include("iteration.jl")
include("custom.jl")
include("precompile.jl")


end # module
36 changes: 27 additions & 9 deletions src/arrayapi.jl → src/iteration.jl
@@ -53,9 +53,25 @@ function basketarray(f::ROOTFile, branch, ithbasket)
interped_data(rawdata, rawoffsets, branch, jagt, T)
end

function LazyTree(f::ROOTFile, s::AbstractString, branches)
tree = f[s]
tree isa TTree || error("$s is not a tree name.")
d = Dict{Symbol, LazyBranch}()
for (i,b) in enumerate(branches)
d[Symbol(b)] = f["$s/$b"]
end
if length(branches) > 30
@warn "Your tree is pretty wide $(length(branches)), this will take compiler a moment."
end
TypedTables.Table(d)
end

function LazyTree(f::ROOTFile, s::AbstractString)
LazyTree(f, s, keys(f[s]))
end

# function barrier to make getting individual index faster
# TODO upstream some types into parametric types for Branch/BranchElement
#
"""
LazyBranch(f::ROOTFile, branch)
@@ -100,21 +116,23 @@ Base.length(ba::LazyBranch) = ba.L
Base.firstindex(ba::LazyBranch) = 1
Base.lastindex(ba::LazyBranch) = ba.L
Base.eltype(ba::LazyBranch{T,J}) where {T,J} = T
function Base.show(io::IO, ba::LazyBranch)
summary(io, ba)

Base.show(io::IO,m::MIME"text/plain", lb::LazyBranch) = Base.show(io, lb)
function Base.show(io::IO, lb::LazyBranch)
summary(io, lb)
println(":")
println(" File: $(ba.f.filename)")
println(" Branch: $(ba.b.fName)")
println(" Description: $(ba.b.fTitle)")
println(" NumEntry: $(ba.L)")
print(" Entry Type: $(eltype(ba))")
println(" File: $(lb.f.filename)")
println(" Branch: $(lb.b.fName)")
println(" Description: $(lb.b.fTitle)")
println(" NumEntry: $(lb.L)")
print(" Entry Type: $(eltype(lb))")
end

function Base.getindex(ba::LazyBranch{T, J}, idx::Integer) where {T, J}
# I hate 1-based indexing
seek_idx = findfirst(x -> x>(idx-1), ba.fEntry) - 1 #support 1.0 syntax
localidx = idx - ba.fEntry[seek_idx]
if seek_idx != ba.buffer_seek # update buffer
if seek_idx != ba.buffer_seek # update buffer if index in a new basket
ba.buffer = basketarray(ba.f, ba.b, seek_idx)
ba.buffer_seek = seek_idx
end
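
The index arithmetic in `getindex` above can be tried on its own. This is a minimal sketch, assuming `fEntry` holds the 0-based number of the first entry of each basket followed by the total entry count; the values here are invented, and `locate` and `locate_sorted` are hypothetical helper names, not part of the package.

```julia
# Assumed layout: three baskets holding entries 1:3, 4:6 and 7:8.
fEntry = [0, 3, 6, 8]

# Same arithmetic as in the hunk: find the basket, then the position inside it.
function locate(fEntry, idx)
    seek_idx = findfirst(x -> x > idx - 1, fEntry) - 1
    localidx = idx - fEntry[seek_idx]
    return seek_idx, localidx
end

# Possible alternative for sorted `fEntry`: a binary search instead of a linear scan.
function locate_sorted(fEntry, idx)
    seek_idx = searchsortedlast(fEntry, idx - 1)
    return seek_idx, idx - fEntry[seek_idx]
end

locate(fEntry, 1)  # (1, 1): first entry of the first basket
locate(fEntry, 4)  # (2, 1): first entry of the second basket
locate(fEntry, 8)  # (3, 2): second entry of the third basket
```

Both helpers agree as long as `fEntry` is sorted without repeated values; whether the binary search pays off depends on how many baskets a branch has.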
File renamed without changes.
42 changes: 0 additions & 42 deletions src/precompile.jl

This file was deleted.

5 changes: 5 additions & 0 deletions src/root.jl
@@ -92,12 +92,14 @@ function Base.getindex(f::ROOTFile, s::AbstractString)
try # if we can't construct LazyBranch, just give up (maybe due to custom class)
return LazyBranch(f, S)
catch
@warn "Can't automatically create LazyBranch for branch $s. Returning a branch object"
end
end
S
end

@memoize LRU(maxsize = 2000) function _getindex(f::ROOTFile, s)
# function _getindex(f::ROOTFile, s)
if '/' ∈ s
@debug "Splitting path '$s' and getting items recursively"
paths = split(s, '/')
@@ -167,6 +169,7 @@ function interped_data(rawdata, rawoffsets, branch, ::Type{J}, ::Type{T}) where

end

# function interp_jaggT(branch, leaf)
@memoize LRU(;maxsize=10^3) function interp_jaggT(branch, leaf)
if hasproperty(branch, :fClassName)
classname = branch.fClassName # the C++ class name, such as "vector<int>"
@@ -176,6 +179,8 @@ end
elname = endswith(elname, "_t") ? lowercase(chop(elname; tail=2)) : elname # Double_t -> double
try
elname == "bool" && return Bool #Cbool doesn't exist
elname == "unsigned int" && return UInt32 #Cunsigned doesn't exist
elname == "unsigned char" && return Char #Cunsigned doesn't exist
getfield(Base, Symbol(:C, elname))
catch
error("Cannot convert element of $elname to a native Julia type")
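
The name-to-type lookup in this hunk can be exercised in isolation. The sketch below is a simplified stand-in rather than the package's function: `ctype_to_julia` is a hypothetical name, and it checks `isdefined` up front instead of using `try`/`catch`. It maps `unsigned int` through Base's `Cuint` alias, which is `UInt32`.

```julia
# Hypothetical helper: map a C/ROOT element type name to a Julia type
# by looking up the matching `C*` alias in Base.
function ctype_to_julia(elname::AbstractString)
    # "Double_t" -> "double", "Int_t" -> "int"
    elname = endswith(elname, "_t") ? lowercase(chop(elname; tail=2)) : elname
    elname == "bool" && return Bool            # there is no `Cbool` in Base
    elname == "unsigned int" && return Cuint   # the space rules out a direct lookup
    sym = Symbol(:C, elname)                   # e.g. :Cdouble
    isdefined(Base, sym) || error("Cannot convert element of $elname to a native Julia type")
    return getfield(Base, sym)
end

ctype_to_julia("Double_t")      # Float64
ctype_to_julia("Int_t")         # Int32
ctype_to_julia("unsigned int")  # UInt32
```

Base also defines `Cuchar` (an alias for `UInt8`), which may be worth considering for `unsigned char` if a byte-sized integer is wanted rather than `Char`.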

2 comments on commit 10388a9


@Moelf Moelf commented on 10388a9 Jul 9, 2021


@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/40598

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.2.2 -m "<description of version>" 10388a95808665d67492ee594e6c0569a7fb3953
git push origin v0.2.2
