Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: One-pass column optimization #491

Draft
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

martindurant
Copy link
Collaborator

@martindurant martindurant commented Mar 28, 2024

To-do:

  • make sure all output nodes do a touch-and-commit
  • do not get caught by caching or mutation
  • figure out what text manipulation is really required
  • reinstate necessary_columns based on optimize_columns
  • eliminate multi-loops for speed if an issue (likely!)

@martindurant martindurant changed the title ENH: One-pass column optimization feat: One-pass column optimization Mar 28, 2024
else:
paths.add(form_key_to_path[form_key])

# Select the most appropriate column for each wildcard
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant is the plan to restore e.g. wildcard matching in this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that happens in this PR, this one's just about auto-selecting required columns. We might need more work for the case where the shape of something is required but not the data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only surface-glancing, but I noted that this logic has been removed and didn't see anywhere that it was recovered. Might have missed something!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this should live after checking shape_touched_in, more complex than I had hoped.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK< I have some shonky code in optimize.optimize_columns to deal with shapes. I am probably still missing some touch-buffers on the outputs, but most things are in line.

It's all pretty ugly and in the latest commit I seem to have introduced some typetracer->numpy coersion.

Comment on lines 123 to 131
def _buf_to_col(s):
return (
s[2:]
.replace(".content", "")
.replace("-offsets", "")
.replace("-data", "")
.replace("-index", "")
.replace("-mask", "")
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have time right now to look at the full PR here, but is this assumption about buffer name mappings enforced anywhere? Right now the design is such that input sources can use a non-canonical buffer naming scheme.

For coffea I think we made this possible so that Coffea could directly pass in their own buffer names with semantic meaning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have access to the call to ak.from_buffers here? That function takes a callable that constructs the buffer names from the form_keys and these roles.

The roles are fixed words like this, and Coffea only has the power to change the form_key, not the role. This is almost the complete set of such names; only -tags (for UnionArray) is missing.

@agoose77
Copy link
Collaborator

agoose77 commented Apr 23, 2024

@jpivarski I started writing a reply here, but I think I should just aim for explicit verbosity, and hope we end up on the same page.

@martindurant this might be useful too, not sure.

When writing the current column optimisation API for dask-awkward, I had the following intentions:

  • be generic enough that coffea doesn't have to write super weird code
  • make few assumptions about optimisation, let the IO func do the work

Crucially, the optimisation strategies differ between Parquet, ROOT, and remapped-ROOT:

  • Parquet cannot read the metadata arrays (so we have a pick-a-child strategy)
  • ROOT cannot read the metadata arrays (so we have a read-the-whole-branch strategy)
  • Remapped-ROOT can read individual buffers (as a single branch may correspond to an offsets buffer

To handle all of this complexity, I made column optimisation in Dask Awkward a thin contract; the strategy for optimisation remains an implementation-specific feature. We implement an optimisation strategy for "column" (pick-one-child) formats, but not for the ROOT/Remapped-ROOT cases.
The contract imposed upon IO functions wishing to implement a strategy is:

  • IOFunc.prepare_for_projection() returns:
    • array - an Awkward Array that has buffers already named
    • report - a TypeTracerReport
    • state - additional state required later by project()
  • IOFunc.project(report, state) returns:
    • layer - the projected layer

As such, Dask Awkward doesn't need to know about buffer / form keys in its optimisation machinery. The logic in utils is only for convenience; anyone calling those routines should ensure that they use "semantic" buffer names.

I think where possible we should keep this "black-box" approach, and have the IO-func perform the strategy for optimisation.

I'll attend the dask-awkward meeting this Friday in case I can be of any more help! (And try to find some more cycles on this).

@martindurant
Copy link
Collaborator Author

I think you are effectively saying we should move the text manipulation to the IOFunction superclass so it can be overridden - which I am fine with. I don't otherwise see the consequence of what you are saying and why you need extra methods. As far as I could tell, the old technique was al about making layers that could go through the "no consequence" run of the graph at optimization time. From my point of view, all IO layers should implement project(), but with the option that it does nothing. (Or, alternatively, a meaningful project() could be what defines an IO Layer, since this is the only thing we need them for).

Your engagement is well appreciated! I hope you can help push this through, as it has turned out much harder than I had initially thought, and the implicit contracts in the various protocols are exactly what confused me the most. Now, I believe the only issue with the current tests is some cached state: either the explicit caches, or a missing commit().

@agoose77
Copy link
Collaborator

as it has turned out much harder than I had initially thought

I have a lot of sympathy for this, it's a rabbit hole.

the implicit contracts in the various protocols are exactly what confused me the most

Yes, in writing my reply I came to realise that the uproot code is really adding to the complexity of thinking about this, even though it's effectively separate. Having to consider coffea-via-uproot on awkward-dask is hard.

I hope you can help push this through

Will be glad too.

@agoose77
Copy link
Collaborator

I'm also mindful that we want to be clear about what "column" means; I think dask-awkward should/can only reason about buffers, and it's the IO-func's job to figure out whether that means "columns" (i.e. coffea reports the TTree branches for necessary_columns, whereas Parquet reports the attribute-path to the leaf).

@agoose77
Copy link
Collaborator

@martindurant also, I'd like to apologise for how long it took to get that detailed reply onto paper; my personal life has had quite a lot going on recently (new house among them!) that means I've had much less spare time than anticipated.

@martindurant
Copy link
Collaborator Author

OK, so agreed: we can maintain the buffer names while optimising, and only normalise to the IO-specific column names within the IOFunc's project() method, so that different backends can handle that differently. Also I can get rid of the "@." part, which I copied from the previous optimisation code and I don't think it does anything.

@jpivarski , this is a place where "."-containing field names might be a problem:

  • when the optimisation says that you need the offsets for some record field, we look for any column beginning with that schema path. So if there is a field "field" and another "field.name" which is NOT a sub-field of the first, we might erroneously select it.

@martindurant
Copy link
Collaborator Author

we can maintain the buffer names while optimising, and only normalise to the IO-specific column names within the IOFunc's project() method

done

@agoose77
Copy link
Collaborator

I'll hopefully take a look at this before our meeting tomorrow :)

@martindurant
Copy link
Collaborator Author

Ping @agoose77 , do you think you have time to look at this?

@agoose77
Copy link
Collaborator

@martindurant one question that came to me a while ago (only just actioning now) is - why do we store the report on the layer? Can we not just retrieve it from the meta? We don't expose an ak. function to do that yet, but it's basically:

assert ak.backend(array) == "typetracer"
form, length, container = ak.to_buffers(array)
reports = {b.report for b in container.values() if b is not None}

@martindurant
Copy link
Collaborator Author

If we only need to store the meta on the layer, fine. But we cannot access the meta of the collection (dak.Array) at optimize time, since dask only passes the combined graph of all the collections that were passed to compute().

@martindurant
Copy link
Collaborator Author

@lgray , this PR is finally pretty close. Could you do some speed-tests/profiles on moderate workloads to see if a) they work as expected and b) we really achieve what we expect? It should be, that the total wall time is decently improved, but more importantly, that the compute() call is decidedly faster, since the optimize phase is mostly spread through the graph building steps (each of which will be slightly slower).

@lgray
Copy link
Collaborator

lgray commented May 22, 2024

I'll probably need this test to function:

FAILED tests/test_inspect.py::test_basic_root_works - AttributeError: '_UprootRead' object has no attribute 'form'

To be able to run any of my tests.

@martindurant
Copy link
Collaborator Author

Oh, right - fair enough. I'll look at what patch root needs. In this codebase, of course we only ensure parquet and json.

@martindurant
Copy link
Collaborator Author

tests/test_inspect.py::test_basic_root_works is fixed with scikit-hep/uproot5#1225

The two remaining failures are:

  • getting an exception report is causing touching of columns (execution reports should drop all column touching reports?)
  • a behaviour/attribute gets lost, presumably it is set on a meta that gets overwritten

Copy link
Collaborator

@agoose77 agoose77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an initial-thoughts review. I'll propose some concrete changes where I can in the next few days. The main things to tackle in my view are:

  1. Re-implement necessary_columns in terms of the IO function
  1. Understand how coffea & uproot need to change to match this new dask-awkward-driven buffer mapping

There's probably more to think about, but as I say this is just an initial thought. I recognise that the coupling between coffea and uproot is confusing, so I'll be sure to start with that.

@@ -518,6 +523,7 @@ def f(self, other):
meta = op(self._meta, other._meta)
else:
meta = op(self._meta, other)
[_.commit(name) for _ in self.report]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's lift this into a utility function e.g.

def commit_to_reports(name, reports):
    for report in reports:
        report.commit(name)

@@ -398,6 +399,10 @@ def name(self) -> str:
def key(self) -> Key:
return (self._name, 0)

@property
def report(self):
return getattr(self._meta, "_report", set())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we lift _report from the meta to the Layer? I know that it's more state to keep in sync, but I don't love the act of tagging non-reserved attributes onto an Awkward Array, and this will better separate the stateful part of typetracer from the stateless part.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpivarski I wonder whether we should add a new merge function to TypeTracerReport. Due to the uniqueness of layer IDs, and the existing association of reports with layouts, we can replace the need to carry a set of reports with a single report via

merge(A, B) -> A

I haven't yet thought about the memory implication for effectively having "larger" reports, but this might be a nicer UX.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These Awkward Arrays are not accessible to users ("_meta" has an underscore); they're internal, bookkeeping objects. There's a similar use of temporary attributes in conversions from Arrow:

https://github.com/scikit-hep/awkward/blob/727ff9ca4759aed52e0b4f21d3daa66724582c61/src/awkward/_connect/pyarrow/conversions.py#L69-L71

and

https://github.com/scikit-hep/awkward/blob/727ff9ca4759aed52e0b4f21d3daa66724582c61/src/awkward/operations/ak_from_arrow.py#L75-L79

We're using the convenience of Python objects as dicts to handle a bookkeeping problem that would get more complicated if we were to try to do it in a statically typable way (although we could also make these "backdoors" into nullable, private attributes that are unused in almost all other circumstances).

An attempt to do this a bit more by-the-book with Indexes unfortunately made too much noise. The Index class has an arbitrary metadata that can be used to stash whatever we need:

https://github.com/scikit-hep/awkward/blob/727ff9ca4759aed52e0b4f21d3daa66724582c61/src/awkward/index.py#L43-L52

although it was only ever used for one thing: associating a multidimensional integer array slice with an Index, which is one-dimensional.

https://github.com/scikit-hep/awkward/blob/727ff9ca4759aed52e0b4f21d3daa66724582c61/src/awkward/_slicing.py#L143-L145

That's pretty obscure, but Index now has this mechanism everywhere, and it must make people wonder why it's there.

A hacky solution, like attaching temporary metadata to a Python object as though it were a dict, can prevent an implementation detail about a corner case from getting "noisy," alerting all parts of the code about its existence. If the hacky solution is implemented incorrectly and the hack spreads, then we'd be wishing we had that noise to help understand what's going wrong, but if it's implemented correctly and no one finds out what happens in Vegas, making it formal would obfuscate the normal case with an outlier.

So I'd be in favor of leaving in the _report attribute handling, especially if something ensures that it never leaves the graph-optimization phase. The example above with a __pyarrow_original attribute (which includes its context in its name, by the way) has an explicit, recursive procedure to ensure that no layouts escape ak.from_arrow with that attribute intact. I don't know if there's a similar guarantee for _report.

Comment on lines 633 to 638
array_meta, report = typetracer_with_report(
form_with_unique_keys(io_func.form, "@"),
highlevel=True,
behavior=io_func.behavior,
buffer_key=render_buffer_key,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the big change:

Control over buffer names now solely rests with dask-awkward, rather than the provider of the IO function e.g. uproot.

This means that dask-awkward generates buffer names and gives them to the IO-function, meaning that the IO-functions now need to understand how to parse these keys instead of dask-awkward asking the IO-functions to parse them. This is all semantics; this is a trivial operation and we can define public functions in dask-awkward to do this.

Defining our column-optimisation interface at the Form level still ensures unique names, but it now requires that the IO-function understands how to map from the form to the buffers it works.1 A nice perk of this change is that because dask-awkward now controls buffer names, we can ensure that they are globally unique (i.e. between different input layers) e.g. by embedding the layer ID. This might prove useful.

Footnotes

  1. To understand the consequence of this, I need to more intimately remind myself how coffea currently remaps keys (where keys are VM instructions) to that which it presents to uproot.

def _optimize_columns(dsk, all_layers):
for k, lay in dsk.copy().items():
if not isinstance(lay, AwkwardInputLayer) or not hasattr(
lay.io_func, "_column_report"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just like with _report, I'd like to formalise this as part of the API. This would become a part of the column-projection interface.

Comment on lines 173 to 177
for c in second.copy():
if "." in c:
parent = c.rsplit(".", 1)[0]
second.discard(parent)
out[k] = sorted(second)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic assumes that buffer names are . delimited and are hierarchical, i.e. foo.bar is below foo. This isn't true for form remapping, which can move arbitrary buffers around.

That is to say, the real operation we want to implement is necessary_buffers which happens to be necessary_columns for the Parquet/JSON schemes.

@agoose77
Copy link
Collaborator

agoose77 commented May 27, 2024

I don't seem to be able to push to the repo, so here's a patch for the reports helper.1

Footnotes

  1. I know you can do this in some contexts, but there's a nuance (perhaps with auth) that I've not yet figured out.
    0001-refactor-add-commit_to_reports.patch.zip

@lgray
Copy link
Collaborator

lgray commented May 28, 2024

First attempt at this on a realistic analysis yields massive amounts of over-touching.

new (dak==this pr, uproot==martin's branch):

modify_mc_events: {'from-uproot-88a311d73c782076572d613c0a4c4122': ['BTagsDeepCSV', 'BTagsDeepCSVJECdown', 'BTagsDeepCSVJECup', 'BTagsDeepCSVJERdown', 'BTagsDeepCSVJERup', 'BadChargedCandidateFilter', 'BadPFMuonDzFilter', 'BadPFMuonFilter', 'CSCTightHaloFilter', 'CaloMET', 'CaloMETPhi', 'CrossSection', 'DeltaPhi1', 'DeltaPhi1JECdown', 'DeltaPhi1JECup', 'DeltaPhi1JERdown', 'DeltaPhi1JERup', 'DeltaPhi1_AK8', 'DeltaPhi2', 'DeltaPhi2JECdown', 'DeltaPhi2JECup', 'DeltaPhi2JERdown', 'DeltaPhi2JERup', 'DeltaPhi2_AK8', 'DeltaPhi3', 'DeltaPhi3JECdown', 'DeltaPhi3JECup', 'DeltaPhi3JERdown', 'DeltaPhi3JERup', 'DeltaPhi4', 'DeltaPhi4JECdown', 'DeltaPhi4JECup', 'DeltaPhi4JERdown', 'DeltaPhi4JERup', 'DeltaPhiMin_AK8', 'EcalDeadCellBoundaryEnergyFilter', 'EcalDeadCellTriggerPrimitiveFilter', 'Electrons', 'EvtNum', 'GenElectrons', 'GenHT', 'GenJets', 'GenJetsAK15', 'GenJetsAK8', 'GenMET', 'GenMETPhi', 'GenMHT', 'GenMHTPhi', 'GenMT2_AK8', 'GenMuons', 'GenParticles', 'GenTaus', 'GenVertices', 'HBHEIsoNoiseFilter', 'HBHENoiseFilter', 'HT', 'HT5', 'HT5JECdown', 'HT5JECup', 'HT5JERdown', 'HT5JERup', 'HTJECdown', 'HTJECup', 'HTJERdown', 'HTJERup', 'JetID', 'JetIDAK15', 'JetIDAK8', 'JetIDAK8JECdown', 'JetIDAK8JECup', 'JetIDAK8JERdown', 'JetIDAK8JERup', 'JetIDJECdown', 'JetIDJECup', 'JetIDJERdown', 'JetIDJERup', 'Jets', 'JetsAK15', 'JetsAK8', 'JetsAK8JECdown', 'JetsAK8JECup', 'JetsAK8JERdown', 'JetsAK8JERup', 'JetsConstituents', 'JetsJECdown', 'JetsJECup', 'JetsJERdown', 'JetsJERup', 'LumiBlockNum', 'MET', 'METDown', 'METPhi', 'METPhiDown', 'METPhiUp', 'METSignificance', 'METUp', 'MHT', 'MHTJECdown', 'MHTJECup', 'MHTJERdown', 'MHTJERup', 'MHTPhi', 'MHTPhiJECdown', 'MHTPhiJECup', 'MHTPhiJERdown', 'MHTPhiJERup', 'MJJ_AK8', 'MT_AK8', 'Mmc_AK8', 'Muons.pt', 'NElectrons', 'NJets', 'NJetsISR', 'NJetsISRJECdown', 'NJetsISRJECup', 'NJetsISRJERdown', 'NJetsISRJERup', 'NJetsJECdown', 'NJetsJECup', 'NJetsJERdown', 'NJetsJERup', 'NMuons', 'NVtx', 'NonPrefiringProb', 'NonPrefiringProbDown', 'NonPrefiringProbECAL', 'NonPrefiringProbECALDown', 'NonPrefiringProbECALUp', 'NonPrefiringProbMuon', 'NonPrefiringProbMuonDown', 'NonPrefiringProbMuonUp', 'NonPrefiringProbUp', 'NumEvents', 'NumInteractions', 'PDFweights', 'PFCaloMETRatio', 'PSweights', 'Photons', 'PrimaryVertexFilter', 'PrimaryVertices', 'RunNum', 'ScaleWeights', 'SignalParameters', 'TAPElectronTracks', 'TAPMuonTracks', 'TAPPionTracks', 'Tracks', 'TriggerPass', 'TriggerPrescales', 'TriggerVersion', 'TrueNumInteractions', 'Weight', 'ecalBadCalibFilter', 'eeBadScFilter', 'fixedGridRhoFastjetAll', 'globalSuperTightHalo2016Filter', 'globalTightHalo2016Filter', 'hasGenPromptPhoton', 'hfNoisyHitsFilter', 'isoElectronTracks', 'isoMuonTracks', 'isoPionTracks', 'madHT', 'nAllVertices', 'puSysDown', 'puSysUp', 'puWeight']}
after modify_mc_events: {'from-uproot-88a311d73c782076572d613c0a4c4122': ['BTagsDeepCSV', 'BTagsDeepCSVJECdown', 'BTagsDeepCSVJECup', 'BTagsDeepCSVJERdown', 'BTagsDeepCSVJERup', 'BadChargedCandidateFilter', 'BadPFMuonDzFilter', 'BadPFMuonFilter', 'CSCTightHaloFilter', 'CaloMET', 'CaloMETPhi', 'CrossSection', 'DeltaPhi1', 'DeltaPhi1JECdown', 'DeltaPhi1JECup', 'DeltaPhi1JERdown', 'DeltaPhi1JERup', 'DeltaPhi1_AK8', 'DeltaPhi2', 'DeltaPhi2JECdown', 'DeltaPhi2JECup', 'DeltaPhi2JERdown', 'DeltaPhi2JERup', 'DeltaPhi2_AK8', 'DeltaPhi3', 'DeltaPhi3JECdown', 'DeltaPhi3JECup', 'DeltaPhi3JERdown', 'DeltaPhi3JERup', 'DeltaPhi4', 'DeltaPhi4JECdown', 'DeltaPhi4JECup', 'DeltaPhi4JERdown', 'DeltaPhi4JERup', 'DeltaPhiMin_AK8', 'EcalDeadCellBoundaryEnergyFilter', 'EcalDeadCellTriggerPrimitiveFilter', 'Electrons', 'EvtNum', 'GenElectrons', 'GenHT', 'GenJets', 'GenJetsAK15', 'GenJetsAK8', 'GenMET', 'GenMETPhi', 'GenMHT', 'GenMHTPhi', 'GenMT2_AK8', 'GenMuons', 'GenParticles', 'GenTaus', 'GenVertices', 'HBHEIsoNoiseFilter', 'HBHENoiseFilter', 'HT', 'HT5', 'HT5JECdown', 'HT5JECup', 'HT5JERdown', 'HT5JERup', 'HTJECdown', 'HTJECup', 'HTJERdown', 'HTJERup', 'JetID', 'JetIDAK15', 'JetIDAK8', 'JetIDAK8JECdown', 'JetIDAK8JECup', 'JetIDAK8JERdown', 'JetIDAK8JERup', 'JetIDJECdown', 'JetIDJECup', 'JetIDJERdown', 'JetIDJERup', 'Jets.HTMask', 'Jets.ID', 'Jets.LeptonMask', 'Jets.MHTMask', 'Jets.axismajor', 'Jets.axisminor', 'Jets.bDiscriminatorCSV', 'Jets.bJetTagDeepCSVBvsAll', 'Jets.bJetTagDeepCSVprobb', 'Jets.bJetTagDeepCSVprobbb', 'Jets.bJetTagDeepCSVprobc', 'Jets.bJetTagDeepCSVprobudsg', 'Jets.bJetTagDeepFlavourprobb', 'Jets.bJetTagDeepFlavourprobbb', 'Jets.bJetTagDeepFlavourprobc', 'Jets.bJetTagDeepFlavourprobg', 'Jets.bJetTagDeepFlavourproblepb', 'Jets.bJetTagDeepFlavourprobuds', 'Jets.chargedEmEnergyFraction', 'Jets.chargedHadronEnergyFraction', 'Jets.chargedHadronMultiplicity', 'Jets.chargedMultiplicity', 'Jets.electronEnergyFraction', 'Jets.electronMultiplicity', 'Jets.energy', 'Jets.eta', 'Jets.hadronFlavor', 'Jets.hfEMEnergyFraction', 'Jets.hfHadronEnergyFraction', 'Jets.jecFactor', 'Jets.jecUnc', 'Jets.jerFactor', 'Jets.jerFactorDown', 'Jets.jerFactorUp', 'Jets.multiplicity', 'Jets.muonEnergyFraction', 'Jets.muonMultiplicity', 'Jets.neutralEmEnergyFraction', 'Jets.neutralHadronEnergyFraction', 'Jets.neutralHadronMultiplicity', 'Jets.neutralMultiplicity', 'Jets.origIndex', 'Jets.partonFlavor', 'Jets.phi', 'Jets.photonEnergyFraction', 'Jets.photonMultiplicity', 'Jets.pileupId', 'Jets.pt', 'Jets.ptD', 'Jets.qgLikelihood', 'JetsAK15', 'JetsAK8', 'JetsAK8JECdown', 'JetsAK8JECup', 'JetsAK8JERdown', 'JetsAK8JERup', 'JetsConstituents', 'JetsJECdown', 'JetsJECup', 'JetsJERdown', 'JetsJERup', 'LumiBlockNum', 'MET', 'METDown', 'METPhi', 'METPhiDown', 'METPhiUp', 'METSignificance', 'METUp', 'MHT', 'MHTJECdown', 'MHTJECup', 'MHTJERdown', 'MHTJERup', 'MHTPhi', 'MHTPhiJECdown', 'MHTPhiJECup', 'MHTPhiJERdown', 'MHTPhiJERup', 'MJJ_AK8', 'MT_AK8', 'Mmc_AK8', 'Muons.pt', 'NElectrons', 'NJets', 'NJetsISR', 'NJetsISRJECdown', 'NJetsISRJECup', 'NJetsISRJERdown', 'NJetsISRJERup', 'NJetsJECdown', 'NJetsJECup', 'NJetsJERdown', 'NJetsJERup', 'NMuons', 'NVtx', 'NonPrefiringProb', 'NonPrefiringProbDown', 'NonPrefiringProbECAL', 'NonPrefiringProbECALDown', 'NonPrefiringProbECALUp', 'NonPrefiringProbMuon', 'NonPrefiringProbMuonDown', 'NonPrefiringProbMuonUp', 'NonPrefiringProbUp', 'NumEvents', 'NumInteractions', 'PDFweights', 'PFCaloMETRatio', 'PSweights', 'Photons', 'PrimaryVertexFilter', 'PrimaryVertices', 'RunNum', 'ScaleWeights', 'SignalParameters', 'TAPElectronTracks', 'TAPMuonTracks', 'TAPPionTracks', 'Tracks.IP2DPV0', 'Tracks.IP2DSigPV0', 'Tracks.IP3DPV0', 'Tracks.IP3DSigPV0', 'Tracks.IPSign', 'Tracks.charge', 'Tracks.dxyErrorPV0', 'Tracks.dxyPV0', 'Tracks.dzAssociatedPV', 'Tracks.dzErrorPV0', 'Tracks.dzPV0', 'Tracks.etaError', 'Tracks.firstHit', 'Tracks.foundHits', 'Tracks.fromPV0', 'Tracks.hitPattern', 'Tracks.hitPatternCounts', 'Tracks.lostHits', 'Tracks.matchedToPFCandidate', 'Tracks.normalizedChi2', 'Tracks.numberOfHits', 'Tracks.numberOfPixelHits', 'Tracks.pdgId', 'Tracks.pfEnergy', 'Tracks.phiError', 'Tracks.ptError', 'Tracks.pvAssociationQuality', 'Tracks.qoverpError', 'Tracks.quality', 'Tracks.referencePoint.x', 'Tracks.referencePoint.y', 'Tracks.referencePoint.z', 'Tracks.vertexIdx', 'Tracks.x', 'Tracks.y', 'Tracks.z', 'TriggerPass', 'TriggerPrescales', 'TriggerVersion', 'TrueNumInteractions', 'Weight', 'ecalBadCalibFilter', 'eeBadScFilter', 'fixedGridRhoFastjetAll', 'globalSuperTightHalo2016Filter', 'globalTightHalo2016Filter', 'hasGenPromptPhoton', 'hfNoisyHitsFilter', 'isoElectronTracks', 'isoMuonTracks', 'isoPionTracks', 'madHT', 'nAllVertices', 'puSysDown', 'puSysUp', 'puWeight']}

old (this has been confirmed to be both correct and minimal, dak==5.3.0, uproot=5.3.7):

modify_mc_events: {'from-uproot-3162b4c9f5707da5eac9bc4135170e04': frozenset({'Weight', 'puWeight', 'Muons/Muons.fCoordinates.fPt'})}
after modify_mc_events: {'from-uproot-3162b4c9f5707da5eac9bc4135170e04': frozenset({'Weight', 'puWeight', 'HT', 'Jets/Jets.fCoordinates.fPt', 'Muons/Muons.fCoordinates.fPt', 'Jets/Jets.fCoordinates.fE', 'Jets_jerFactorUp', 'Tracks/Tracks.fCoordinates.fX', 'Jets_jerFactor', 'Jets_jerFactorDown', 'Jets_jecUnc'})}

This is after a few transformations on input data to add some useful columns and such.

However, it did work and certain portions were noticeably rather faster! Some slower.

@martindurant
Copy link
Collaborator Author

martindurant commented Jul 30, 2024

Only three failures:

  • assigning attributes
  • IO report
  • root (I have a fix for uproot in a branch, but not one that also accounts for nanoevents remapping)

@LGrey - moved from IO reporting layers returning a tuple to returning
a record (or dict) for my own sanity.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants