Skip to content

Commit

Permalink
Merge pull request #52 from zuoxunwu/Xunwu_Spring2021_BuBc_release
Browse files Browse the repository at this point in the history
Merge BuBc code from Xunwu for release for publication
  • Loading branch information
zuoxunwu authored Aug 3, 2023
2 parents c3ff9ae + b0e8cd5 commit 803daef
Show file tree
Hide file tree
Showing 30 changed files with 4,799 additions and 3 deletions.
54 changes: 54 additions & 0 deletions case-studies/flavour/BuBc2TauNu/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
## Combined search for Bu and Bc to tau + nu decays

### Overview

This analysis is a spin-off of the Bc2TauNu search. \
It inherits the basic steps from Bc2TauNu but aims to consider both Bc2TauNu and Bu2TauNu as separate signals. \
This analysis will also be used as a benchmark to help optimize the generic [FCC analysis workflow](https://github.com/HEP-FCC/FCCAnalyses).

### Contact

Clement Helsens <[email protected]> \
Xunwu Zuo <[email protected]>


### Scripts

The config files with common variables
- `user_config.py`: paths and common variables.
- `decay_mode_xs.py`: definitions of the branching ratios and cross-sections for the exclusive B-hadron and C-hadron modes considered as background.

The following scripts are used to perform BDT trainings:
- `process_sig_bkg_samples_for_xgb.py`: select a fraction of first stage training ntuples (ROOT) and convert to pickle files (pandas).
- `train_xgb.py`: train first stage binary BDT. Takes pickle files from previous step as inputs.
- `train_xgb_stage2.py`: train second stage multi-class BDT. Takes second stage training ntuples as inputs.

The following scripts are used to make plots for MVA performance and validations against overtraining:
- `plot_bdt_vars_stage1.py`: plot all input variables of the first stage BDT. Takes first stage training and testing ntuples as inputs.
- `plot_bdt_vars_stage2.py`: plot all input variables of the second stage BDT. Takes second stage training and testing ntuples as inputs.
- `plot_xgb.py`: make performance plots for first stage BDT. Takes first stage pickles as inputs.
- `plot_xgb_stage2.py`: make performance plots for second stage BDT. Takes second stage training ntuples as inputs.
- `overtrain_test_stage1.py`: validate first stage BDT against overtraining. Takes first stage training and testing ntuples as inputs.
- `overtrain_test_stage2.py`: validate second stage BDT against overtraining. Takes second stage training and testing ntuples as inputs.

The statistical analysis is performed with the following steps:
- `bkg_eff_decide_baseline_sel.py`: Scan different MVA cuts and check the changes in (inclusive) background efficiency and variable shapes. The variable chosen for the final fit needs to be independent from BDT cuts.
- `bkg_eff_compare_incl_and_excl.py`: After baseline selection, compare efficiencies for further cuts between inclusive and exclusive samples, to make sure the exclusive samples give good estimates of background efficiency.
- `bkg_eff_make_tight_sel.py`: Evaluate efficiencies in exclusive samples after tight selection.
- `bkg_eff_MVA_spline.py`: In each category, fit event efficiencies with regard to each of MVA1 and MVA2 (Bu or Bc) cut scans. Output splines.
- `final_sel_pretrim_signal.py`: Process signal samples to skim events. Just to save time of loading samples in the final analysis. No need to pretrim backgrounds as they are quick to load.
- `final_sel_estimate_purity.py`: Read splines and estimate signal and background yields. Scan MVA cuts for optimal signal purity, output to yields.json
- `fit_simultaneous_template.py`: Generate pseudo-data and perform combined fit in two categories. If more than 1 pseudo-data, output to fit_results.json
- `fit_multidim_simultaneous_template.py`: (DEPRECATED) It is a modification of `fit_simultaneous_template.py` to fit multi-dimensional distributions.
- `check_sig_shape_variation.py`: Split signal samples into a few subsets and compare the signal shape to get an estimate of signal shape uncertainty.
- `fit_test_sig_shape_uncert.py`: Duplicate of `fit_simultaneous_template.py`, with additional setup for signal shape uncertainty.
- `toy_study_stats.py`: It takes the fit results from the previous fit step and extract the total uncertainty on the signal strength.
- `toy_plot_summary.py`: It reads the results from the previous toy stats step and make a summary plot for several toy groups.

One small file for a custom defined function:
- `double_sided_gaussian.py`: double sided gaussian function in zfit, used in `toy_study_stats.py`. Would be better to replace it with something more centralized.


This is an additional script to study the decay chains in background events after MVA selections, in order to understand what processes need to be generated for exclusive background samples.
- `survey_exclusive_modes.py`: It takes a special sample set (produced by `analysis_extra_bkg_composition.py` in FCCAnalysis) as inputs. It loops through event with high MVA scores and rebuild decay chains that are selected as the signal-like decay. Output is a table in a text file.

306 changes: 306 additions & 0 deletions case-studies/flavour/BuBc2TauNu/bkg_eff_MVA_spline.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,306 @@
import sys, os, argparse
import json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import joblib
from collections import OrderedDict
import uproot
import pickle
from decay_mode_xs import modes as bkg_modes
from decay_mode_xs import prod, b_hadrons, c_hadrons
from scipy import interpolate
import time

#Local code
from userConfig import loc, train_vars, train_vars_vtx, Ediff_cut, NBin_MVA_fit, MVA_cuts, FCC_label
import plotting
import utils as ut

from matplotlib import rc
rc('font',**{'family':'serif','serif':['Roman']})
rc('text', usetex=True)



def run(cat, doPlot):

#BDT variables to optimise, and the hemisphere energy difference which we cut on > 10 GeV
var_list = ["EVT_MVA1Bis", "EVT_MVA2_bc", "EVT_MVA2_bu", "EVT_MVA2_bkg", "EVT_ThrustEmax_E", "EVT_ThrustEmin_E"]
path = loc.ANALYSIS

EVT_MVA2 = 'EVT_MVA2_bc'
if cat == 'bu':
EVT_MVA2 = 'EVT_MVA2_bu'

Cut_truth = 'CUT_CandTruth==0 and CUT_CandTruth2==0'
Cut_sel = f'{Cut_truth} and CUT_CandRho==1 and CUT_CandVtxThrustEmin==1 and EVT_CandMass < 1.8 and EVT_ThrustDiff_E > {Ediff_cut}'
cut = f"EVT_MVA1Bis > {MVA_cuts['tight']['MVA1']} and {EVT_MVA2} > {MVA_cuts['tight']['MVA2_sig']} and 1 - EVT_MVA2_bkg > {MVA_cuts['tight']['MVA2_bkg']} and {Cut_sel}"
# cut = f"EVT_MVA1Bis > {MVA_cuts['tight']['MVA1']} and {EVT_MVA2} > {MVA_cuts['tight']['MVA2']} and {Cut_sel}"

modes = OrderedDict()
modes['bb'] = OrderedDict()
modes['cc'] = OrderedDict()
bb_prefix = "p8_ee_Zbb_ecm91_EvtGen"
cc_prefix = "p8_ee_Zcc_ecm91_EvtGen"
#Background decays
for b in b_hadrons:
for d in bkg_modes[b]:
modes['bb'][f"{b}_{d}"] = [f"{bb_prefix}_{b}2{d}", prod[b], bkg_modes[b][d]]

for c in c_hadrons:
for d in bkg_modes[c]:
modes['cc'][f"{c}_{d}"] = [f"{cc_prefix}_{c}2{d}", prod[c], bkg_modes[c][d]]

tree = {}
N_gen = {}
df = {}
for m in modes["bb"]:
N_gen[m] = uproot.open(f"{path}/{modes['bb'][m][0]}.root")["eventsProcessed"].value
tree[m] = uproot.open(f"{path}/{modes['bb'][m][0]}.root")["events"]
df[m] = tree[m].arrays(library="pd", how="zip", filter_name=["EVT_*", "CUT_*"])
df[m] = df[m].query(cut)
df[m] = df[m][var_list]
df[m]["log_EVT_MVA1"] = -np.log(1. - df[m]["EVT_MVA1Bis"])
df[m]["log_EVT_MVA2_bkg"] = -np.log(df[m]["EVT_MVA2_bkg"])
df[m]["log_EVT_MVA2_sig"] = -np.log(1. - df[m][EVT_MVA2])
print (f"{m} : {len(df[m])}")
for m in modes["cc"]:
N_gen[m] = uproot.open(f"{path}/{modes['cc'][m][0]}.root")["eventsProcessed"].value
tree[m] = uproot.open(f"{path}/{modes['cc'][m][0]}.root")["events"]
df[m] = tree[m].arrays(library="pd", how="zip", filter_name=["EVT_*", "CUT_*"])
df[m] = df[m].query(cut)
df[m] = df[m][var_list]
df[m]["log_EVT_MVA1"] = -np.log(1. - df[m]["EVT_MVA1Bis"])
df[m]["log_EVT_MVA2_bkg"] = -np.log(df[m]["EVT_MVA2_bkg"])
df[m]["log_EVT_MVA2_sig"] = -np.log(1. - df[m][EVT_MVA2])
print (f"{m} : {len(df[m])}")


if doPlot:
tree_bb = uproot.open(f"{path}/p8_ee_Zbb_ecm91.root")["events"]
df['bb_inc'] = tree_bb.arrays(library="pd", how="zip", filter_name=["EVT_*", "CUT_*"])
df['bb_inc'] = df['bb_inc'].query(cut)
df['bb_inc'] = df['bb_inc'][var_list]
df['bb_inc']["log_EVT_MVA1"] = -np.log(1. - df['bb_inc']["EVT_MVA1Bis"])
df['bb_inc']["log_EVT_MVA2_bkg"] = -np.log(df['bb_inc']["EVT_MVA2_bkg"])
df['bb_inc']["log_EVT_MVA2_sig"] = -np.log(1. - df['bb_inc'][EVT_MVA2])
print (f"bb : {len(df['bb_inc'])}")

tree_cc = uproot.open(f"{path}/p8_ee_Zcc_ecm91.root")["events"]
df['cc_inc'] = tree_cc.arrays(library="pd", how="zip", filter_name=["EVT_*", "CUT_*"])
df['cc_inc'] = df['cc_inc'].query(cut)
df['cc_inc'] = df['cc_inc'][var_list]
df['cc_inc']["log_EVT_MVA1"] = -np.log(1. - df['cc_inc']["EVT_MVA1Bis"])
df['cc_inc']["log_EVT_MVA2_bkg"] = -np.log(df['cc_inc']["EVT_MVA2_bkg"])
df['cc_inc']["log_EVT_MVA2_sig"] = -np.log(1. - df['cc_inc'][EVT_MVA2])
print (f"cc : {len(df['cc_inc'])}")


# Make spline for MVA1
print ('\ngetting MVA1 spline\n-------------------')
time_start = time.time()
xmin = MVA_cuts["spline"][f'MVA1_in_{cat}']['xmin']
xmax = MVA_cuts["spline"][f'MVA1_in_{cat}']['xmax']

base = {}
count = {}
weight = {}
eff_scan = {}
spline = {}

for m in df:
base[m] = len(df[m])
for flav in ['bb', 'cc']:
base[flav] = 0
eff_scan[flav] = 0
weight[flav] = 0
bin_edges = 0

for m in modes[flav]:
count[m], bin_edges = np.histogram(df[m]["log_EVT_MVA1"], NBin_MVA_fit['MVA1'], range=(xmin, xmax))
count[m] = np.flip(count[m])
count[m] = count[m].cumsum()
count[m] = np.flip(count[m])
weight[m] = np.sqrt(count[m])
base[flav] += base[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
eff_scan[flav] += count[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
weight[flav] += weight[m]
eff_scan[flav] = eff_scan[flav] / base[flav]
cut_bound = bin_edges[:-1]
spline[flav] = interpolate.splrep(cut_bound, eff_scan[flav]) #do not apply weight w=weight[flav]
#weighted splines are heavily skewed from actual eff values
#

if doPlot:
eff_inc, bin_edges = np.histogram(df[f'{flav}_inc']["log_EVT_MVA1"], NBin_MVA_fit['MVA1'], range=(xmin, xmax))
eff_inc = np.flip(eff_inc)
eff_inc = eff_inc.cumsum()
eff_inc = np.flip(eff_inc)
eff_inc = eff_inc / len(df[f'{flav}_inc'])

eff_spl = interpolate.splev(cut_bound, spline[flav])
fig, ax = plt.subplots(figsize=(12,8))
plt.plot(cut_bound, eff_inc, color='red', label='Inclusive sample',linewidth=3)
plt.plot(cut_bound, eff_scan[flav], color='blue', label='Exclusive samples',linewidth=3)
plt.plot(cut_bound, eff_spl, color='black', label='Spline interpolation', linestyle='dashed')
ax.tick_params(axis='both', which='major', labelsize=20)
ax.set_title( FCC_label, loc='right', fontsize=20)
plt.xlim(xmin,xmax)
plt.xlabel('log(1-BDT1)',fontsize=30)
plt.ylabel("Efficiency",fontsize=30)
plt.yscale('log')
ymin,ymax = plt.ylim()
plt.ylim(1e-6,1)
plt.legend(fontsize=18, loc="upper right")
plt.grid(alpha=0.4,which="both")
plt.tight_layout()
fig.savefig(f"{loc.PLOTS}/cross_check_spline_MVA1_in_{cat}_{flav}.pdf")


print (eff_scan[flav])
print (weight[flav])
with open(f'{loc.PKL}/spline/{cat}_MVA1_scan_{flav}_spline.pkl', 'wb') as f:
pickle.dump(spline[flav], f)
time_stop = time.time()
print (f'Time for forming MVA1 spline: {time_stop - time_start}')

# Make grid spline for MVA2
print ('\ngetting MVA2 spline\n-------------------')
time_start = time.time()
sig_min = MVA_cuts["spline"][f'MVA2_sig_in_{cat}']['xmin']
sig_max = MVA_cuts["spline"][f'MVA2_sig_in_{cat}']['xmax']
bkg_min = MVA_cuts["spline"][f'MVA2_bkg_in_{cat}']['xmin']
bkg_max = MVA_cuts["spline"][f'MVA2_bkg_in_{cat}']['xmax']

for flav in ['bb', 'cc']:
base[flav] = 0
eff_scan[flav] = 0
weight[flav] = 0
xedges, yedges = 0, 0

for m in modes[flav]:
count[m], xedges, yedges = np.histogram2d(df[m]["log_EVT_MVA2_sig"], df[m]["log_EVT_MVA2_bkg"], bins=[NBin_MVA_fit['MVA2_sig'], NBin_MVA_fit['MVA2_bkg']], range=[[sig_min, sig_max],[bkg_min, bkg_max]])
count[m] = np.flip(count[m])
count[m] = count[m].cumsum(axis=0)
count[m] = count[m].cumsum(axis=1)
count[m] = np.flip(count[m])

weight[m] = np.sqrt(count[m])
base[flav] += base[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
eff_scan[flav] += count[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
weight[flav] += weight[m]
eff_scan[flav] = eff_scan[flav] / base[flav]
cut_bound_x = xedges[:-1]
cut_bound_y = yedges[:-1]
spline[flav] = interpolate.RectBivariateSpline(cut_bound_x, cut_bound_y, eff_scan[flav]) # RectBivariateSpline does not take weight

if doPlot:
eff_inc, xedges, yedges = np.histogram2d(df[f'{flav}_inc']["log_EVT_MVA2_sig"], df[f'{flav}_inc']["log_EVT_MVA2_bkg"], bins=[NBin_MVA_fit['MVA2_sig'], NBin_MVA_fit['MVA2_bkg']], range=[[sig_min, sig_max],[bkg_min, bkg_max]])
eff_inc = np.flip(eff_inc)
eff_inc = eff_inc.cumsum(axis=0)
eff_inc = eff_inc.cumsum(axis=1)
eff_inc = np.flip(eff_inc)
eff_inc = eff_inc / len(df[f'{flav}_inc'])

eff_spl = []
for cut_val in cut_bound_x:
eff_spl.append( spline[flav].ev(cut_val, bkg_min) )

fig, ax = plt.subplots(figsize=(12,8))
plt.plot(cut_bound_x, eff_inc[:,0], color='red', label='Inclusive sample', linewidth=3)
plt.plot(cut_bound_x, eff_scan[flav][:,0], color='blue', label='Exclusive samples', linewidth=3)
plt.plot(cut_bound_x, eff_spl, color='black', label='Spline interpolation', linestyle='dashed')
ax.tick_params(axis='both', which='major', labelsize=20)
ax.set_title( FCC_label, loc='right', fontsize=20)
plt.xlim(sig_min,sig_max)
plt.xlabel(f'log(1-BDT2 {cat})',fontsize=30)
plt.ylabel("Efficiency",fontsize=30)
plt.yscale('log')
ymin,ymax = plt.ylim()
plt.ylim(1e-6,1)
plt.legend(fontsize=18, loc="upper right")
plt.grid(alpha=0.4,which="both")
plt.tight_layout()
fig.savefig(f"{loc.PLOTS}/cross_check_spline_MVA2_sig_in_{cat}_{flav}.pdf")

eff_spl = []
for cut_val in cut_bound_y:
eff_spl.append( spline[flav].ev(sig_min, cut_val) )
fig, ax = plt.subplots(figsize=(12,8))
plt.plot(cut_bound_y, eff_inc[0,:], color='red', label='Inclusive sample', linewidth=3)
plt.plot(cut_bound_y, eff_scan[flav][0,:], color='blue', label='Exclusive samples', linewidth=3)
plt.plot(cut_bound_y, eff_spl, color='black', label='Spline interpolation', linestyle='dashed')
ax.tick_params(axis='both', which='major', labelsize=20)
ax.set_title( FCC_label, loc='right', fontsize=20)
plt.xlim(bkg_min,bkg_max)
plt.xlabel(f'log(BDT2 bkg)',fontsize=30)
plt.ylabel("Efficiency",fontsize=30)
plt.yscale('log')
ymin,ymax = plt.ylim()
plt.ylim(1e-6,1)
plt.legend(fontsize=18, loc="upper right")
plt.grid(alpha=0.4,which="both")
plt.tight_layout()
fig.savefig(f"{loc.PLOTS}/cross_check_spline_MVA2_bkg_in_{cat}_{flav}.pdf")


print (eff_scan[flav])
print (weight[flav])
with open(f'{loc.PKL}/spline/{cat}_MVA2_2d_scan_{flav}_spline.pkl', 'wb') as f:
pickle.dump(spline[flav], f)
time_stop = time.time()
print (f'Time for forming MVA2 spline: {time_stop - time_start}')







# Make 1D spline for MVA2_bkg, as a cross check
# print ('\ngetting MVA2 spline\n-------------------')
# time_start = time.time()
# bkg_min = MVA_cuts["spline"][f'MVA2_bkg_in_{cat}']['xmin']
# bkg_max = MVA_cuts["spline"][f'MVA2_bkg_in_{cat}']['xmax']
#
# for flav in ['bb', 'cc']:
# base[flav] = 0
# eff_scan[flav] = 0
# weight[flav] = 0
# bin_edges = 0
#
# for m in modes[flav]:
# count[m], bin_edges = np.histogram(df[m]["log_EVT_MVA2_bkg"], NBin_MVA_fit['MVA1'], range=(bkg_min, bkg_max))
# count[m] = np.flip(count[m])
# count[m] = count[m].cumsum()
# count[m] = np.flip(count[m])
# weight[m] = np.sqrt(count[m])
# base[flav] += base[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
# eff_scan[flav] += count[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
# weight[flav] += weight[m]
# eff_scan[flav] = eff_scan[flav] / base[flav]
# cut_bound = bin_edges[:-1]
# spline[flav] = interpolate.splrep(cut_bound, eff_scan[flav], w=weight[flav])
#
# print (eff_scan[flav])
# print (weight[flav])
# with open(f'{loc.PKL}/spline/{cat}_MVA2_1d_bkg_scan_{flav}_spline.pkl', 'wb') as f:
# pickle.dump(spline[flav], f)
# time_stop = time.time()
# print (f'Time for forming MVA2 spline: {time_stop - time_start}')




def main():
parser = argparse.ArgumentParser(description='Estimate optimal cuts and associated yields')
parser.add_argument("--cat", choices=['bu','bc'],required=False,default='bc')
parser.add_argument("--doPlot", choices=[True, False],required=False,default=True)
args = parser.parse_args()

run(args.cat, args.doPlot)

if __name__ == '__main__':
main()

Loading

0 comments on commit 803daef

Please sign in to comment.