feat: One-pass column optimization #491

martindurant · 2024-03-28T20:00:28Z

To-do:

make sure all output nodes do a touch-and-commit
do not get caught by caching or mutation
~~figure out what text manipulation is really required~~
~~reinstate necessary_columns based on optimize_columns~~
eliminate multi-loops for speed if an issue (likely!)

agoose77 · 2024-04-12T15:10:10Z

src/dask_awkward/lib/io/columnar.py

-            else:
-                paths.add(form_key_to_path[form_key])
-
-        # Select the most appropriate column for each wildcard


@martindurant is the plan to restore e.g. wildcard matching in this PR?

I don't think that happens in this PR, this one's just about auto-selecting required columns. We might need more work for the case where the shape of something is required but not the data.

I'm only surface-glancing, but I noted that this logic has been removed and didn't see anywhere that it was recovered. Might have missed something!

You are right, this should live after checking shape_touched_in, more complex than I had hoped.

OK< I have some shonky code in optimize.optimize_columns to deal with shapes. I am probably still missing some touch-buffers on the outputs, but most things are in line.

It's all pretty ugly and in the latest commit I seem to have introduced some typetracer->numpy coersion.

agoose77 · 2024-04-23T11:07:23Z

src/dask_awkward/lib/optimize.py

+def _buf_to_col(s):
+    return (
+        s[2:]
+        .replace(".content", "")
+        .replace("-offsets", "")
+        .replace("-data", "")
+        .replace("-index", "")
+        .replace("-mask", "")
+    )


I don't have time right now to look at the full PR here, but is this assumption about buffer name mappings enforced anywhere? Right now the design is such that input sources can use a non-canonical buffer naming scheme.

For coffea I think we made this possible so that Coffea could directly pass in their own buffer names with semantic meaning.

Do we have access to the call to ak.from_buffers here? That function takes a callable that constructs the buffer names from the form_keys and these roles.

The roles are fixed words like this, and Coffea only has the power to change the form_key, not the role. This is almost the complete set of such names; only -tags (for UnionArray) is missing.

agoose77 · 2024-04-23T12:36:07Z

@jpivarski I started writing a reply here, but I think I should just aim for explicit verbosity, and hope we end up on the same page.

@martindurant this might be useful too, not sure.

When writing the current column optimisation API for dask-awkward, I had the following intentions:

be generic enough that coffea doesn't have to write super weird code
make few assumptions about optimisation, let the IO func do the work

Crucially, the optimisation strategies differ between Parquet, ROOT, and remapped-ROOT:

Parquet cannot read the metadata arrays (so we have a pick-a-child strategy)
ROOT cannot read the metadata arrays (so we have a read-the-whole-branch strategy)
Remapped-ROOT can read individual buffers (as a single branch may correspond to an offsets buffer

To handle all of this complexity, I made column optimisation in Dask Awkward a thin contract; the strategy for optimisation remains an implementation-specific feature. We implement an optimisation strategy for "column" (pick-one-child) formats, but not for the ROOT/Remapped-ROOT cases.
The contract imposed upon IO functions wishing to implement a strategy is:

IOFunc.prepare_for_projection() returns:
- array - an Awkward Array that has buffers already named
- report - a TypeTracerReport
- state - additional state required later by project()
IOFunc.project(report, state) returns:
- layer - the projected layer

As such, Dask Awkward doesn't need to know about buffer / form keys in its optimisation machinery. The logic in utils is only for convenience; anyone calling those routines should ensure that they use "semantic" buffer names.

I think where possible we should keep this "black-box" approach, and have the IO-func perform the strategy for optimisation.

I'll attend the dask-awkward meeting this Friday in case I can be of any more help! (And try to find some more cycles on this).

martindurant · 2024-04-23T13:15:47Z

I think you are effectively saying we should move the text manipulation to the IOFunction superclass so it can be overridden - which I am fine with. I don't otherwise see the consequence of what you are saying and why you need extra methods. As far as I could tell, the old technique was al about making layers that could go through the "no consequence" run of the graph at optimization time. From my point of view, all IO layers should implement project(), but with the option that it does nothing. (Or, alternatively, a meaningful project() could be what defines an IO Layer, since this is the only thing we need them for).

Your engagement is well appreciated! I hope you can help push this through, as it has turned out much harder than I had initially thought, and the implicit contracts in the various protocols are exactly what confused me the most. Now, I believe the only issue with the current tests is some cached state: either the explicit caches, or a missing commit().

agoose77 · 2024-04-23T14:04:45Z

as it has turned out much harder than I had initially thought

I have a lot of sympathy for this, it's a rabbit hole.

the implicit contracts in the various protocols are exactly what confused me the most

Yes, in writing my reply I came to realise that the uproot code is really adding to the complexity of thinking about this, even though it's effectively separate. Having to consider coffea-via-uproot on awkward-dask is hard.

I hope you can help push this through

Will be glad too.

agoose77 · 2024-04-23T14:06:05Z

I'm also mindful that we want to be clear about what "column" means; I think dask-awkward should/can only reason about buffers, and it's the IO-func's job to figure out whether that means "columns" (i.e. coffea reports the TTree branches for necessary_columns, whereas Parquet reports the attribute-path to the leaf).

agoose77 · 2024-04-23T14:07:05Z

@martindurant also, I'd like to apologise for how long it took to get that detailed reply onto paper; my personal life has had quite a lot going on recently (new house among them!) that means I've had much less spare time than anticipated.

martindurant · 2024-04-24T19:07:26Z

OK, so agreed: we can maintain the buffer names while optimising, and only normalise to the IO-specific column names within the IOFunc's project() method, so that different backends can handle that differently. Also I can get rid of the "@." part, which I copied from the previous optimisation code and I don't think it does anything.

@jpivarski , this is a place where "."-containing field names might be a problem:

when the optimisation says that you need the offsets for some record field, we look for any column beginning with that schema path. So if there is a field "field" and another "field.name" which is NOT a sub-field of the first, we might erroneously select it.

martindurant · 2024-04-25T15:23:52Z

we can maintain the buffer names while optimising, and only normalise to the IO-specific column names within the IOFunc's project() method

done

agoose77 · 2024-04-25T15:57:03Z

I'll hopefully take a look at this before our meeting tomorrow :)

martindurant · 2024-05-14T13:33:26Z

Ping @agoose77 , do you think you have time to look at this?

agoose77 · 2024-05-21T20:50:42Z

@martindurant one question that came to me a while ago (only just actioning now) is - why do we store the report on the layer? Can we not just retrieve it from the meta? We don't expose an ak. function to do that yet, but it's basically:

assert ak.backend(array) == "typetracer"
form, length, container = ak.to_buffers(array)
reports = {b.report for b in container.values() if b is not None}

martindurant · 2024-05-22T00:58:50Z

If we only need to store the meta on the layer, fine. But we cannot access the meta of the collection (dak.Array) at optimize time, since dask only passes the combined graph of all the collections that were passed to compute().

martindurant · 2024-05-22T14:30:27Z

@lgray , this PR is finally pretty close. Could you do some speed-tests/profiles on moderate workloads to see if a) they work as expected and b) we really achieve what we expect? It should be, that the total wall time is decently improved, but more importantly, that the compute() call is decidedly faster, since the optimize phase is mostly spread through the graph building steps (each of which will be slightly slower).

lgray · 2024-05-22T14:34:04Z

I'll probably need this test to function:

FAILED tests/test_inspect.py::test_basic_root_works - AttributeError: '_UprootRead' object has no attribute 'form'

To be able to run any of my tests.

martindurant · 2024-05-22T14:36:40Z

Oh, right - fair enough. I'll look at what patch root needs. In this codebase, of course we only ensure parquet and json.

…lled (except the uproot test itself)

martindurant · 2024-05-23T18:34:56Z

tests/test_inspect.py::test_basic_root_works is fixed with scikit-hep/uproot5#1225

The two remaining failures are:

getting an exception report is causing touching of columns (execution reports should drop all column touching reports?)
a behaviour/attribute gets lost, presumably it is set on a meta that gets overwritten

agoose77

This is an initial-thoughts review. I'll propose some concrete changes where I can in the next few days. The main things to tackle in my view are:

Re-implement necessary_columns in terms of the IO function

"columns" in my view (and hence the existing implementation) represents the IO-function's definition of a column, i.e. it's different between uproot, and Parquet/JSON.
We should ask the IO-function to translate buffer names to column names e.g. https://github.com/scikit-hep/uproot5/blob/734700ef1f822338b03a7573df484909b317b2c2/src/uproot/_dask.py#L1077

Understand how coffea & uproot need to change to match this new dask-awkward-driven buffer mapping

There's probably more to think about, but as I say this is just an initial thought. I recognise that the coupling between coffea and uproot is confusing, so I'll be sure to start with that.

agoose77 · 2024-05-26T18:11:51Z

src/dask_awkward/lib/core.py

@@ -518,6 +523,7 @@ def f(self, other):
                meta = op(self._meta, other._meta)
            else:
                meta = op(self._meta, other)
+            [_.commit(name) for _ in self.report]


Let's lift this into a utility function e.g.

def commit_to_reports(name, reports): for report in reports: report.commit(name)

agoose77 · 2024-05-26T18:13:02Z

src/dask_awkward/lib/core.py

@@ -398,6 +399,10 @@ def name(self) -> str:
    def key(self) -> Key:
        return (self._name, 0)

+    @property
+    def report(self):
+        return getattr(self._meta, "_report", set())


Can we lift _report from the meta to the Layer? I know that it's more state to keep in sync, but I don't love the act of tagging non-reserved attributes onto an Awkward Array, and this will better separate the stateful part of typetracer from the stateless part.

@jpivarski I wonder whether we should add a new merge function to TypeTracerReport. Due to the uniqueness of layer IDs, and the existing association of reports with layouts, we can replace the need to carry a set of reports with a single report via

merge(A, B) -> A

I haven't yet thought about the memory implication for effectively having "larger" reports, but this might be a nicer UX.

These Awkward Arrays are not accessible to users ("_meta" has an underscore); they're internal, bookkeeping objects. There's a similar use of temporary attributes in conversions from Arrow:

https://github.com/scikit-hep/awkward/blob/727ff9ca4759aed52e0b4f21d3daa66724582c61/src/awkward/_connect/pyarrow/conversions.py#L69-L71

and

https://github.com/scikit-hep/awkward/blob/727ff9ca4759aed52e0b4f21d3daa66724582c61/src/awkward/operations/ak_from_arrow.py#L75-L79

We're using the convenience of Python objects as dicts to handle a bookkeeping problem that would get more complicated if we were to try to do it in a statically typable way (although we could also make these "backdoors" into nullable, private attributes that are unused in almost all other circumstances).

An attempt to do this a bit more by-the-book with Indexes unfortunately made too much noise. The Index class has an arbitrary metadata that can be used to stash whatever we need:

https://github.com/scikit-hep/awkward/blob/727ff9ca4759aed52e0b4f21d3daa66724582c61/src/awkward/index.py#L43-L52

although it was only ever used for one thing: associating a multidimensional integer array slice with an Index, which is one-dimensional.

https://github.com/scikit-hep/awkward/blob/727ff9ca4759aed52e0b4f21d3daa66724582c61/src/awkward/_slicing.py#L143-L145

That's pretty obscure, but Index now has this mechanism everywhere, and it must make people wonder why it's there.

A hacky solution, like attaching temporary metadata to a Python object as though it were a dict, can prevent an implementation detail about a corner case from getting "noisy," alerting all parts of the code about its existence. If the hacky solution is implemented incorrectly and the hack spreads, then we'd be wishing we had that noise to help understand what's going wrong, but if it's implemented correctly and no one finds out what happens in Vegas, making it formal would obfuscate the normal case with an outlier.

So I'd be in favor of leaving in the _report attribute handling, especially if something ensures that it never leaves the graph-optimization phase. The example above with a __pyarrow_original attribute (which includes its context in its name, by the way) has an explicit, recursive procedure to ensure that no layouts escape ak.from_arrow with that attribute intact. I don't know if there's a similar guarantee for _report.

agoose77 · 2024-05-26T18:17:03Z

src/dask_awkward/lib/io/io.py

+        array_meta, report = typetracer_with_report(
+            form_with_unique_keys(io_func.form, "@"),
+            highlevel=True,
+            behavior=io_func.behavior,
+            buffer_key=render_buffer_key,
+        )


This is the big change:

Control over buffer names now solely rests with dask-awkward, rather than the provider of the IO function e.g. uproot.

This means that dask-awkward generates buffer names and gives them to the IO-function, meaning that the IO-functions now need to understand how to parse these keys instead of dask-awkward asking the IO-functions to parse them. This is all semantics; this is a trivial operation and we can define public functions in dask-awkward to do this.

Defining our column-optimisation interface at the Form level still ensures unique names, but it now requires that the IO-function understands how to map from the form to the buffers it works.¹ A nice perk of this change is that because dask-awkward now controls buffer names, we can ensure that they are globally unique (i.e. between different input layers) e.g. by embedding the layer ID. This might prove useful.

Footnotes

To understand the consequence of this, I need to more intimately remind myself how coffea currently remaps keys (where keys are VM instructions) to that which it presents to uproot. ↩

agoose77 · 2024-05-26T18:18:05Z

src/dask_awkward/lib/optimize.py

+def _optimize_columns(dsk, all_layers):
+    for k, lay in dsk.copy().items():
+        if not isinstance(lay, AwkwardInputLayer) or not hasattr(
+            lay.io_func, "_column_report"


Just like with _report, I'd like to formalise this as part of the API. This would become a part of the column-projection interface.

agoose77 · 2024-05-26T18:24:37Z

src/dask_awkward/lib/optimize.py

+        for c in second.copy():
+            if "." in c:
+                parent = c.rsplit(".", 1)[0]
+                second.discard(parent)
+        out[k] = sorted(second)


This logic assumes that buffer names are . delimited and are hierarchical, i.e. foo.bar is below foo. This isn't true for form remapping, which can move arbitrary buffers around.

That is to say, the real operation we want to implement is necessary_buffers which happens to be necessary_columns for the Parquet/JSON schemes.

agoose77 · 2024-05-27T15:02:23Z

I don't seem to be able to push to the repo, so here's a patch for the reports helper.¹

I know you can do this in some contexts, but there's a nuance (perhaps with auth) that I've not yet figured out.
0001-refactor-add-commit_to_reports.patch.zip ↩

lgray · 2024-05-28T14:46:33Z

First attempt at this on a realistic analysis yields massive amounts of over-touching.

new (dak==this pr, uproot==martin's branch):

modify_mc_events: {'from-uproot-88a311d73c782076572d613c0a4c4122': ['BTagsDeepCSV', 'BTagsDeepCSVJECdown', 'BTagsDeepCSVJECup', 'BTagsDeepCSVJERdown', 'BTagsDeepCSVJERup', 'BadChargedCandidateFilter', 'BadPFMuonDzFilter', 'BadPFMuonFilter', 'CSCTightHaloFilter', 'CaloMET', 'CaloMETPhi', 'CrossSection', 'DeltaPhi1', 'DeltaPhi1JECdown', 'DeltaPhi1JECup', 'DeltaPhi1JERdown', 'DeltaPhi1JERup', 'DeltaPhi1_AK8', 'DeltaPhi2', 'DeltaPhi2JECdown', 'DeltaPhi2JECup', 'DeltaPhi2JERdown', 'DeltaPhi2JERup', 'DeltaPhi2_AK8', 'DeltaPhi3', 'DeltaPhi3JECdown', 'DeltaPhi3JECup', 'DeltaPhi3JERdown', 'DeltaPhi3JERup', 'DeltaPhi4', 'DeltaPhi4JECdown', 'DeltaPhi4JECup', 'DeltaPhi4JERdown', 'DeltaPhi4JERup', 'DeltaPhiMin_AK8', 'EcalDeadCellBoundaryEnergyFilter', 'EcalDeadCellTriggerPrimitiveFilter', 'Electrons', 'EvtNum', 'GenElectrons', 'GenHT', 'GenJets', 'GenJetsAK15', 'GenJetsAK8', 'GenMET', 'GenMETPhi', 'GenMHT', 'GenMHTPhi', 'GenMT2_AK8', 'GenMuons', 'GenParticles', 'GenTaus', 'GenVertices', 'HBHEIsoNoiseFilter', 'HBHENoiseFilter', 'HT', 'HT5', 'HT5JECdown', 'HT5JECup', 'HT5JERdown', 'HT5JERup', 'HTJECdown', 'HTJECup', 'HTJERdown', 'HTJERup', 'JetID', 'JetIDAK15', 'JetIDAK8', 'JetIDAK8JECdown', 'JetIDAK8JECup', 'JetIDAK8JERdown', 'JetIDAK8JERup', 'JetIDJECdown', 'JetIDJECup', 'JetIDJERdown', 'JetIDJERup', 'Jets', 'JetsAK15', 'JetsAK8', 'JetsAK8JECdown', 'JetsAK8JECup', 'JetsAK8JERdown', 'JetsAK8JERup', 'JetsConstituents', 'JetsJECdown', 'JetsJECup', 'JetsJERdown', 'JetsJERup', 'LumiBlockNum', 'MET', 'METDown', 'METPhi', 'METPhiDown', 'METPhiUp', 'METSignificance', 'METUp', 'MHT', 'MHTJECdown', 'MHTJECup', 'MHTJERdown', 'MHTJERup', 'MHTPhi', 'MHTPhiJECdown', 'MHTPhiJECup', 'MHTPhiJERdown', 'MHTPhiJERup', 'MJJ_AK8', 'MT_AK8', 'Mmc_AK8', 'Muons.pt', 'NElectrons', 'NJets', 'NJetsISR', 'NJetsISRJECdown', 'NJetsISRJECup', 'NJetsISRJERdown', 'NJetsISRJERup', 'NJetsJECdown', 'NJetsJECup', 'NJetsJERdown', 'NJetsJERup', 'NMuons', 'NVtx', 'NonPrefiringProb', 'NonPrefiringProbDown', 'NonPrefiringProbECAL', 'NonPrefiringProbECALDown', 'NonPrefiringProbECALUp', 'NonPrefiringProbMuon', 'NonPrefiringProbMuonDown', 'NonPrefiringProbMuonUp', 'NonPrefiringProbUp', 'NumEvents', 'NumInteractions', 'PDFweights', 'PFCaloMETRatio', 'PSweights', 'Photons', 'PrimaryVertexFilter', 'PrimaryVertices', 'RunNum', 'ScaleWeights', 'SignalParameters', 'TAPElectronTracks', 'TAPMuonTracks', 'TAPPionTracks', 'Tracks', 'TriggerPass', 'TriggerPrescales', 'TriggerVersion', 'TrueNumInteractions', 'Weight', 'ecalBadCalibFilter', 'eeBadScFilter', 'fixedGridRhoFastjetAll', 'globalSuperTightHalo2016Filter', 'globalTightHalo2016Filter', 'hasGenPromptPhoton', 'hfNoisyHitsFilter', 'isoElectronTracks', 'isoMuonTracks', 'isoPionTracks', 'madHT', 'nAllVertices', 'puSysDown', 'puSysUp', 'puWeight']}
after modify_mc_events: {'from-uproot-88a311d73c782076572d613c0a4c4122': ['BTagsDeepCSV', 'BTagsDeepCSVJECdown', 'BTagsDeepCSVJECup', 'BTagsDeepCSVJERdown', 'BTagsDeepCSVJERup', 'BadChargedCandidateFilter', 'BadPFMuonDzFilter', 'BadPFMuonFilter', 'CSCTightHaloFilter', 'CaloMET', 'CaloMETPhi', 'CrossSection', 'DeltaPhi1', 'DeltaPhi1JECdown', 'DeltaPhi1JECup', 'DeltaPhi1JERdown', 'DeltaPhi1JERup', 'DeltaPhi1_AK8', 'DeltaPhi2', 'DeltaPhi2JECdown', 'DeltaPhi2JECup', 'DeltaPhi2JERdown', 'DeltaPhi2JERup', 'DeltaPhi2_AK8', 'DeltaPhi3', 'DeltaPhi3JECdown', 'DeltaPhi3JECup', 'DeltaPhi3JERdown', 'DeltaPhi3JERup', 'DeltaPhi4', 'DeltaPhi4JECdown', 'DeltaPhi4JECup', 'DeltaPhi4JERdown', 'DeltaPhi4JERup', 'DeltaPhiMin_AK8', 'EcalDeadCellBoundaryEnergyFilter', 'EcalDeadCellTriggerPrimitiveFilter', 'Electrons', 'EvtNum', 'GenElectrons', 'GenHT', 'GenJets', 'GenJetsAK15', 'GenJetsAK8', 'GenMET', 'GenMETPhi', 'GenMHT', 'GenMHTPhi', 'GenMT2_AK8', 'GenMuons', 'GenParticles', 'GenTaus', 'GenVertices', 'HBHEIsoNoiseFilter', 'HBHENoiseFilter', 'HT', 'HT5', 'HT5JECdown', 'HT5JECup', 'HT5JERdown', 'HT5JERup', 'HTJECdown', 'HTJECup', 'HTJERdown', 'HTJERup', 'JetID', 'JetIDAK15', 'JetIDAK8', 'JetIDAK8JECdown', 'JetIDAK8JECup', 'JetIDAK8JERdown', 'JetIDAK8JERup', 'JetIDJECdown', 'JetIDJECup', 'JetIDJERdown', 'JetIDJERup', 'Jets.HTMask', 'Jets.ID', 'Jets.LeptonMask', 'Jets.MHTMask', 'Jets.axismajor', 'Jets.axisminor', 'Jets.bDiscriminatorCSV', 'Jets.bJetTagDeepCSVBvsAll', 'Jets.bJetTagDeepCSVprobb', 'Jets.bJetTagDeepCSVprobbb', 'Jets.bJetTagDeepCSVprobc', 'Jets.bJetTagDeepCSVprobudsg', 'Jets.bJetTagDeepFlavourprobb', 'Jets.bJetTagDeepFlavourprobbb', 'Jets.bJetTagDeepFlavourprobc', 'Jets.bJetTagDeepFlavourprobg', 'Jets.bJetTagDeepFlavourproblepb', 'Jets.bJetTagDeepFlavourprobuds', 'Jets.chargedEmEnergyFraction', 'Jets.chargedHadronEnergyFraction', 'Jets.chargedHadronMultiplicity', 'Jets.chargedMultiplicity', 'Jets.electronEnergyFraction', 'Jets.electronMultiplicity', 'Jets.energy', 'Jets.eta', 'Jets.hadronFlavor', 'Jets.hfEMEnergyFraction', 'Jets.hfHadronEnergyFraction', 'Jets.jecFactor', 'Jets.jecUnc', 'Jets.jerFactor', 'Jets.jerFactorDown', 'Jets.jerFactorUp', 'Jets.multiplicity', 'Jets.muonEnergyFraction', 'Jets.muonMultiplicity', 'Jets.neutralEmEnergyFraction', 'Jets.neutralHadronEnergyFraction', 'Jets.neutralHadronMultiplicity', 'Jets.neutralMultiplicity', 'Jets.origIndex', 'Jets.partonFlavor', 'Jets.phi', 'Jets.photonEnergyFraction', 'Jets.photonMultiplicity', 'Jets.pileupId', 'Jets.pt', 'Jets.ptD', 'Jets.qgLikelihood', 'JetsAK15', 'JetsAK8', 'JetsAK8JECdown', 'JetsAK8JECup', 'JetsAK8JERdown', 'JetsAK8JERup', 'JetsConstituents', 'JetsJECdown', 'JetsJECup', 'JetsJERdown', 'JetsJERup', 'LumiBlockNum', 'MET', 'METDown', 'METPhi', 'METPhiDown', 'METPhiUp', 'METSignificance', 'METUp', 'MHT', 'MHTJECdown', 'MHTJECup', 'MHTJERdown', 'MHTJERup', 'MHTPhi', 'MHTPhiJECdown', 'MHTPhiJECup', 'MHTPhiJERdown', 'MHTPhiJERup', 'MJJ_AK8', 'MT_AK8', 'Mmc_AK8', 'Muons.pt', 'NElectrons', 'NJets', 'NJetsISR', 'NJetsISRJECdown', 'NJetsISRJECup', 'NJetsISRJERdown', 'NJetsISRJERup', 'NJetsJECdown', 'NJetsJECup', 'NJetsJERdown', 'NJetsJERup', 'NMuons', 'NVtx', 'NonPrefiringProb', 'NonPrefiringProbDown', 'NonPrefiringProbECAL', 'NonPrefiringProbECALDown', 'NonPrefiringProbECALUp', 'NonPrefiringProbMuon', 'NonPrefiringProbMuonDown', 'NonPrefiringProbMuonUp', 'NonPrefiringProbUp', 'NumEvents', 'NumInteractions', 'PDFweights', 'PFCaloMETRatio', 'PSweights', 'Photons', 'PrimaryVertexFilter', 'PrimaryVertices', 'RunNum', 'ScaleWeights', 'SignalParameters', 'TAPElectronTracks', 'TAPMuonTracks', 'TAPPionTracks', 'Tracks.IP2DPV0', 'Tracks.IP2DSigPV0', 'Tracks.IP3DPV0', 'Tracks.IP3DSigPV0', 'Tracks.IPSign', 'Tracks.charge', 'Tracks.dxyErrorPV0', 'Tracks.dxyPV0', 'Tracks.dzAssociatedPV', 'Tracks.dzErrorPV0', 'Tracks.dzPV0', 'Tracks.etaError', 'Tracks.firstHit', 'Tracks.foundHits', 'Tracks.fromPV0', 'Tracks.hitPattern', 'Tracks.hitPatternCounts', 'Tracks.lostHits', 'Tracks.matchedToPFCandidate', 'Tracks.normalizedChi2', 'Tracks.numberOfHits', 'Tracks.numberOfPixelHits', 'Tracks.pdgId', 'Tracks.pfEnergy', 'Tracks.phiError', 'Tracks.ptError', 'Tracks.pvAssociationQuality', 'Tracks.qoverpError', 'Tracks.quality', 'Tracks.referencePoint.x', 'Tracks.referencePoint.y', 'Tracks.referencePoint.z', 'Tracks.vertexIdx', 'Tracks.x', 'Tracks.y', 'Tracks.z', 'TriggerPass', 'TriggerPrescales', 'TriggerVersion', 'TrueNumInteractions', 'Weight', 'ecalBadCalibFilter', 'eeBadScFilter', 'fixedGridRhoFastjetAll', 'globalSuperTightHalo2016Filter', 'globalTightHalo2016Filter', 'hasGenPromptPhoton', 'hfNoisyHitsFilter', 'isoElectronTracks', 'isoMuonTracks', 'isoPionTracks', 'madHT', 'nAllVertices', 'puSysDown', 'puSysUp', 'puWeight']}

old (this has been confirmed to be both correct and minimal, dak==5.3.0, uproot=5.3.7):

modify_mc_events: {'from-uproot-3162b4c9f5707da5eac9bc4135170e04': frozenset({'Weight', 'puWeight', 'Muons/Muons.fCoordinates.fPt'})}
after modify_mc_events: {'from-uproot-3162b4c9f5707da5eac9bc4135170e04': frozenset({'Weight', 'puWeight', 'HT', 'Jets/Jets.fCoordinates.fPt', 'Muons/Muons.fCoordinates.fPt', 'Jets/Jets.fCoordinates.fE', 'Jets_jerFactorUp', 'Tracks/Tracks.fCoordinates.fX', 'Jets_jerFactor', 'Jets_jerFactorDown', 'Jets_jecUnc'})}

This is after a few transformations on input data to add some useful columns and such.

However, it did work and certain portions were noticeably rather faster! Some slower.

martindurant · 2024-07-30T20:39:15Z

Only three failures:

~~assigning attributes~~
IO report
root (I have a fix for uproot in a branch, but not one that also accounts for nanoevents remapping)

@LGrey

@LGrey - moved from IO reporting layers returning a tuple to returning a record (or dict) for my own sanity.

martindurant added 2 commits March 28, 2024 13:15

start

b18b3af

Remove unused

3cdc778

martindurant changed the title ~~ENH: One-pass column optimization~~ feat: One-pass column optimization Mar 28, 2024

martindurant added 5 commits April 4, 2024 09:44

Pass reports around

c57693c

remember to commit

0c367df

first working (for parquet)

9c654b1

Merge branch 'main' into one-pass

e6c5ce3

stop

f56f77a

agoose77 reviewed Apr 12, 2024

View reviewed changes

martindurant added 3 commits April 15, 2024 14:32

most pass

75b4416

fix most

a5c319d

probably better

243fbc1

martindurant mentioned this pull request Apr 19, 2024

dask-awkward array loaded by dak.from_parquet and field-sliced is not populated (has PlaceholderArrays) #501

Open

martindurant added 2 commits April 22, 2024 15:25

Reinstate necessary_columns (needs doc)

6eebf87

Merge branch 'main' into one-pass

8bbe409

agoose77 reviewed Apr 23, 2024

View reviewed changes

pass buffer names around, not columns

43b2e43

martindurant added 2 commits May 13, 2024 14:39

Clear cache between tests

e5828fb

Merge branch 'main' into one-pass

df0cf2f

Another one squashed

b593f87

martindurant added 2 commits May 22, 2024 11:10

Squash errors that only show when uproot.dask and hist.dask are insta…

e410b61

…lled (except the uproot test itself)

fix uproot

cd537e2

martindurant mentioned this pull request May 23, 2024

Changes for dask-awkward one-pass optimize scikit-hep/uproot5#1225

Draft

agoose77 reviewed May 26, 2024

View reviewed changes

fix report

baf6f46

martindurant added 2 commits June 6, 2024 12:52

if meta fails

e29e929

rev

f61fdd7

martindurant force-pushed the one-pass branch from 63ddd47 to f61fdd7 Compare July 23, 2024 20:09

martindurant added 5 commits July 24, 2024 09:38

concat enforce condition

2c3abd0

temp

04abbc8

squached some

d876f00

add note

8e1d507

Merge branch 'main' into one-pass

d1922ab

martindurant added 3 commits July 31, 2024 09:57

Fix concat form comparison

fc9589b

one more squashed

961dd0c

fix IO report

c8b254b

@LGrey - moved from IO reporting layers returning a tuple to returning a record (or dict) for my own sanity.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: One-pass column optimization #491

feat: One-pass column optimization #491

martindurant commented Mar 28, 2024 •

edited

Loading

agoose77 Apr 12, 2024

martindurant Apr 12, 2024

agoose77 Apr 12, 2024

martindurant Apr 12, 2024

martindurant Apr 15, 2024

agoose77 Apr 23, 2024

jpivarski Apr 23, 2024

agoose77 commented Apr 23, 2024 •

edited

Loading

martindurant commented Apr 23, 2024

agoose77 commented Apr 23, 2024

agoose77 commented Apr 23, 2024

agoose77 commented Apr 23, 2024

martindurant commented Apr 24, 2024

martindurant commented Apr 25, 2024

agoose77 commented Apr 25, 2024

martindurant commented May 14, 2024

agoose77 commented May 21, 2024

martindurant commented May 22, 2024

martindurant commented May 22, 2024

lgray commented May 22, 2024

martindurant commented May 22, 2024

martindurant commented May 23, 2024

agoose77 left a comment

agoose77 May 26, 2024

agoose77 May 26, 2024

agoose77 May 27, 2024

jpivarski May 28, 2024

agoose77 May 26, 2024

agoose77 May 26, 2024

agoose77 May 26, 2024

agoose77 commented May 27, 2024 •

edited

Loading

lgray commented May 28, 2024 •

edited

Loading

martindurant commented Jul 30, 2024 •

edited

Loading

feat: One-pass column optimization #491

Are you sure you want to change the base?

feat: One-pass column optimization #491

Conversation

martindurant commented Mar 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agoose77 commented Apr 23, 2024 • edited Loading

martindurant commented Apr 23, 2024

agoose77 commented Apr 23, 2024

agoose77 commented Apr 23, 2024

agoose77 commented Apr 23, 2024

martindurant commented Apr 24, 2024

martindurant commented Apr 25, 2024

agoose77 commented Apr 25, 2024

martindurant commented May 14, 2024

agoose77 commented May 21, 2024

martindurant commented May 22, 2024

martindurant commented May 22, 2024

lgray commented May 22, 2024

martindurant commented May 22, 2024

martindurant commented May 23, 2024

agoose77 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Footnotes

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agoose77 commented May 27, 2024 • edited Loading

Footnotes

lgray commented May 28, 2024 • edited Loading

martindurant commented Jul 30, 2024 • edited Loading

martindurant commented Mar 28, 2024 •

edited

Loading

agoose77 commented Apr 23, 2024 •

edited

Loading

agoose77 commented May 27, 2024 •

edited

Loading

lgray commented May 28, 2024 •

edited

Loading

martindurant commented Jul 30, 2024 •

edited

Loading