Dataset from the ATLAS Higgs Boson Machine Learning Challenge 2014

Name: Dataset from the ATLAS Higgs Boson Machine Learning Challenge 2014
Creator: ATLAS collaboration
Published: 2014

Cite as: ATLAS collaboration (2014). Dataset from the ATLAS Higgs Boson Machine Learning Challenge 2014. CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.ZBP2.M5T8

Dataset Derived Datascience ATLAS CERN-LHC

Description

The dataset has been built from official ATLAS full-detector simulation, with "Higgs to tautau" events mixed with different backgrounds. The simulator has two parts. In the first, random proton-proton collisions are simulated based on the knowledge that we have accumulated on particle physics. It reproduces the random microscopic explosions resulting from the proton-proton collisions. In the second part, the resulting particles are tracked through a virtual model of the detector. The process yields simulated events with properties that mimic the statistical properties of the real events with additional information on what has happened during the collision, before particles are measured in the detector.

The signal sample contains events in which Higgs bosons (with a fixed mass of 125 GeV) were produced. The background sample was generated by other known processes that can produce events with at least one electron or muon and a hadronic tau, mimicking the signal. For the sake of simplicity, only three background processes were retained for the Challenge. The first comes from the decay of the Z boson (with a mass of 91.2 GeV) into two taus. This decay produces events with a topology very similar to that produced by the decay of a Higgs boson. The second set contains events with a pair of top quarks, which can have a lepton and a hadronic tau among their decay products. The third set involves the decay of the W boson, where one electron or muon and a hadronic tau can appear simultaneously only through imperfections of the particle identification procedure.

Due to the complexity of the simulation process, each simulated event has a weight that is proportional to the conditional density divided by the instrumental density used by the simulator (an importance-sampling flavour). The weights are normalised for integrated luminosity such that, in any region, the sum of the weights of the events falling in the region is an unbiased estimate of the expected number of events falling in the same region during a given fixed time interval. In our case, the weights correspond to the quantity of real data taken during the year 2012. The weights are an artifact of the way the simulation works, so they are not part of the input to the classifier. For the Challenge, weights were provided in the training set so that the AMS could be properly evaluated. Weights were not provided in the qualifying set, since the weight distributions of the signal and background sets are very different and would give away the label immediately. However, in the opendata.cern.ch dataset, weights and labels are provided for the complete dataset.
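The role of the weights can be illustrated with a minimal sketch: the expected number of events in a region is estimated by summing the weights of the simulated events that fall in it. The function name `expected_events`, the (features, weight) layout and the toy values are assumptions made for illustration; they are not taken from the dataset or its documentation.

```python
def expected_events(events, in_region):
    """Sum the weights of the events that fall in a region.

    events: iterable of (features, weight) pairs (hypothetical layout).
    in_region: predicate on the features that selects the region.
    """
    return sum(weight for features, weight in events if in_region(features))


# Toy events with made-up values, not taken from the dataset.
toy_events = [
    ({"DER_mass_vis": 80.0}, 0.5),
    ({"DER_mass_vis": 120.0}, 1.5),
    ({"DER_mass_vis": 95.0}, 2.0),
]

# Region: visible mass below 100 (an arbitrary cut chosen for illustration).
n_expected = expected_events(toy_events, lambda f: f["DER_mass_vis"] < 100.0)
```

Here the first and third toy events pass the cut, so the estimate is the sum of their two weights.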

The evaluation metric is the approximate median significance (AMS):

\[ \text{AMS} = \sqrt{2\left((s+b+b_r) \log \left(1 + \frac{s}{b + b_r}\right)-s\right)}\]

where

$s, b$: unnormalised true positive and false positive rates, respectively,
$b_r =10$ is the constant regularisation term,
$\log$ is the natural log.

More precisely, let $(y_1, \ldots, y_n) \in \{\text{b},\text{s}\}^n$ be the vector of true test labels, let $(\hat{y}_1, \ldots, \hat{y}_n) \in \{\text{b},\text{s}\}^n$ be the vector of predicted (submitted) test labels, and let $(w_1, \ldots, w_n) \in {\mathbb{R}^+}^n$ be the vector of weights. Then

\[ s = \sum_{i=1}^n w_i\mathbb{1}\{y_i = \text{s}\} \mathbb{1}\{\hat{y}_i = \text{s}\} \]

and

\[ b = \sum_{i=1}^n w_i\mathbb{1}\{y_i = \text{b}\} \mathbb{1}\{\hat{y}_i = \text{s}\}, \]

where the indicator function $\mathbb{1}\{A\}$ is 1 if its argument $A$ is true and 0 otherwise.

For more information on the statistical model and the derivation of the metric, see the documentation.
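The definitions of $s$, $b$ and the AMS above can be sketched in a few lines of Python. This is an illustrative implementation under the stated formulas, not the official Challenge scoring code; the function name `ams` and the parameter name `b_reg` (standing for $b_r$) are assumptions.

```python
import math


def ams(y_true, y_pred, weights, b_reg=10.0):
    """Approximate median significance for labels in {"s", "b"}.

    s sums the weights of true signal events predicted as signal;
    b sums the weights of true background events predicted as signal.
    """
    s = 0.0
    b = 0.0
    for truth, predicted, weight in zip(y_true, y_pred, weights):
        if predicted != "s":
            continue  # only events selected as signal contribute
        if truth == "s":
            s += weight
        elif truth == "b":
            b += weight
    radicand = 2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(radicand, 0.0))


# Toy labels and weights, made up for illustration.
score = ams(["s", "b", "s", "b"], ["s", "s", "b", "b"], [10.0, 10.0, 5.0, 1.0])
```

In this toy call $s = 10$ and $b = 10$, so the score equals $\sqrt{2(30 \log 1.5 - 10)}$. A submission that selects no event as signal gets a score of 0.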

Dataset characteristics

818238 events. 1 file. 186.5 MiB in total.

Dataset semantics

Variable	Description
EventId	A unique integer identifier of the event.
DER_mass_MMC	The estimated mass $m_{H}$ of the Higgs boson candidate, obtained through a probabilistic phase space integration.
DER_mass_transverse_met_lep	The transverse mass between the missing transverse energy and the lepton.
DER_mass_vis	The invariant mass of the hadronic tau and the lepton.
DER_pt_h	The modulus of the vector sum of the transverse momentum of the hadronic tau, the lepton and the missing transverse energy vector.
DER_deltaeta_jet_jet	The absolute value of the pseudorapidity separation between the two jets (undefined if PRI_jet_num $\leq$ 1).
DER_mass_jet_jet	The invariant mass of the two jets (undefined if PRI_jet_num $\leq$ 1).
DER_prodeta_jet_jet	The product of the pseudorapidities of the two jets (undefined if PRI_jet_num $\leq$ 1).
DER_deltar_tau_lep	The R separation between the hadronic tau and the lepton.
DER_pt_tot	The modulus of the vector sum of the missing transverse momenta and the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num $\geq$ 1) and the subleading jet (if PRI_jet_num = 2) (but not of any additional jets).
DER_sum_pt	The sum of the moduli of the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num $\geq$ 1) and the subleading jet (if PRI_jet_num = 2) and the other jets (if PRI_jet_num = 3).
DER_pt_ratio_lep_tau	The ratio of the transverse momenta of the lepton and the hadronic tau.
DER_met_phi_centrality	The centrality of the azimuthal angle of the missing transverse energy vector w.r.t. the hadronic tau and the lepton.
DER_lep_eta_centrality	The centrality of the pseudorapidity of the lepton w.r.t. the two jets (undefined if PRI_jet_num $\leq$ 1).
PRI_tau_pt	The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the hadronic tau.
PRI_tau_eta	The pseudorapidity $\eta$ of the hadronic tau.
PRI_tau_phi	The azimuth angle $\phi$ of the hadronic tau.
PRI_lep_pt	The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the lepton (electron or muon).
PRI_lep_eta	The pseudorapidity $\eta$ of the lepton.
PRI_lep_phi	The azimuth angle $\phi$ of the lepton.
PRI_met	The missing transverse energy $\overrightarrow{E}^{miss}_{T}$.
PRI_met_phi	The azimuth angle $\phi$ of the missing transverse energy.
PRI_met_sumet	The total transverse energy in the detector.
PRI_jet_num	The number of jets (integer with value of 0, 1, 2 or 3; possible larger values have been capped at 3).
PRI_jet_leading_pt	The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the leading jet, that is the jet with largest transverse momentum (undefined if PRI_jet_num = 0).
PRI_jet_leading_eta	The pseudorapidity $\eta$ of the leading jet (undefined if PRI_jet_num = 0).
PRI_jet_leading_phi	The azimuth angle $\phi$ of the leading jet (undefined if PRI_jet_num = 0).
PRI_jet_subleading_pt	The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the subleading jet, that is, the jet with second largest transverse momentum (undefined if PRI_jet_num $\leq$ 1).
PRI_jet_subleading_eta	The pseudorapidity $\eta$ of the subleading jet (undefined if PRI_jet_num $\leq$ 1).
PRI_jet_subleading_phi	The azimuth angle $\phi$ of the subleading jet (undefined if PRI_jet_num $\leq$ 1).
PRI_jet_all_pt	The scalar sum of the transverse momentum of all the jets of the events.
Weight	The event weight $w_{i}$
Label	The event label (string) $y_{i}$ $\in$ $\{s,b\}$ (s for signal, b for background).
KaggleSet	String specifying to which Kaggle set the event belongs: "t": training, "b": public leaderboard, "v": private leaderboard, "u": unused.
KaggleWeight	Weight normalised within each Kaggle dataset.
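As a rough sketch of how the KaggleSet column can be used, the snippet below groups events by the Kaggle set they belong to. It assumes the file is a CSV with a header row that uses the variable names listed above; the in-memory sample, its values and the helper name `split_by_kaggle_set` are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows with a subset of the columns; not real dataset values.
SAMPLE_CSV = """EventId,PRI_jet_num,Weight,Label,KaggleSet,KaggleWeight
1,2,0.002,s,t,0.007
2,0,1.100,b,t,3.600
3,1,0.900,b,b,2.900
4,3,0.003,s,v,0.010
5,0,1.300,b,u,4.200
"""

# Meaning of the KaggleSet codes, as given in the variable table.
KAGGLE_SETS = {
    "t": "training",
    "b": "public leaderboard",
    "v": "private leaderboard",
    "u": "unused",
}


def split_by_kaggle_set(csv_file):
    """Group the rows of an open CSV file by their KaggleSet value."""
    subsets = defaultdict(list)
    for row in csv.DictReader(csv_file):
        subsets[KAGGLE_SETS[row["KaggleSet"]]].append(row)
    return dict(subsets)


subsets = split_by_kaggle_set(io.StringIO(SAMPLE_CSV))
# Total KaggleWeight of the toy training events.
training_weight = sum(float(r["KaggleWeight"]) for r in subsets["training"])
```

To run the same grouping on the real file, one would pass an open file handle instead of the `io.StringIO` wrapper, provided the CSV assumption above holds.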

External links

Go to the Higgs Boson Machine Learning Challenge on Kaggle

http://higgsml.lal.in2p3.fr

How were these data selected?

The events were selected from simulated events passing the single electron or single muon trigger. Each event has an identified electron or muon and an identified hadronic tau, and should not have a b-tagged jet.

How were these data validated?

Repeating the ATLAS "Higgs to tautau" analysis (as documented in the reference document, see documentation) on the dataset approximately reproduces the event yields quoted for signal and background. The event yields cannot be reproduced exactly because data-driven corrections have not been applied (see documentation for more details).

How can you use these data?

This dataset is an extended version of the data provided for the Higgs Boson Machine Learning Challenge on Kaggle. For more information, go to the Higgs Machine Learning documentation.

Disclaimer

These open data are released under the Creative Commons Zero v1.0 Universal license.

Neither the experiment(s) (ATLAS) nor CERN endorse any works, scientific or otherwise, produced using these data.

This release has a unique DOI that you are requested to cite in any applications or publications.