Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Referencing of unbound nodes may be incorrect. #94

Open
dwarring opened this issue Feb 13, 2023 · 17 comments
Open

Referencing of unbound nodes may be incorrect. #94

dwarring opened this issue Feb 13, 2023 · 17 comments

Comments

@dwarring
Copy link
Contributor

I'm suspecting unbound nodes can have the owner document destroyed, e.g.

my LibXML::Document $doc .= new;
my LibXML::Element $elem = $doc.createElement: 'root';
$doc = Nil;
# sometime later
$elem.doc.root = $elem;

The problem (I think), is that there's nothing currently keeping the document alive, if it's not explicitly referenced.

dwarring added a commit that referenced this issue Feb 13, 2023
@vrurg
Copy link
Contributor

vrurg commented Feb 14, 2023

How interesting the coincidence is! The last couple of days I'm thinking about possible ways of not reincarnating a node object on every occasion. Though it may not sound totally identical to this case, but it is still related.

Consider the following. I'm trying to develop a framework of dealing with ALTO standard. Partial top-level layout for it is like:

<alto>
  <Styles>
    <TextStyle>...</TextStyle>
  </Styles>
  <Layout>...</Layout>
</alto>

Elements within <Layout> may reference a <TextStyle> via STYLEREFS attribute. If I need the style what I do is $elem.getOwner.root and find the actual <Styles> collection from there. As a result each $elem actually gets its own copy of <TextStyle> element instead of sharing a common one among them.

Having the $.doc attribute is likely to help a lot here. Since I create own mapping of elements into local classes, <alto> is represented with an alto class which has $.styles attribute with an instance of styles class representing <Styles>. Hopefully, it would now be possible to just do $.doc.root.styles and get the same pre-fetched instances for <TextStyle> nodes.

But this solution is partial because the other side of the coin would be findnodes and other methods of the kind. Since I used Config's class-from method, I checked out what raw objects does it get. Turns out every time a new raw instance is produced for the same original XML node. Consequently, it gets a new boxing every time and I don't see a way to get around this. My expectation would be that the raw instance be the same, but not sure if this is possible. My guess would be that it depends on libxml behavior.

Anyway, solving this would be very beneficial to the memory footprint of any LibXML-based code. And very likely to it performance too as less object construction and less memory fragmentation would be involved.

What do you think?

@dwarring
Copy link
Contributor Author

I've backed out adding a $.doc attribute. It's impactful and difficult to get watertight. As well as adding everywhere including findnode results. The ownerDocument can change. It doesn't do enough. Even the $.doc.root method will return a new instance each time.

The problem is the document struct can be updated via the DOM. nodes can even be moved between documents,. Still considering this.

@vrurg
Copy link
Contributor

vrurg commented Feb 15, 2023

What happens to unique-key value when changes happen? Does it change too?

@dwarring
Copy link
Contributor Author

No that doesn't change. and is globally unique Its simply the hex string of the raw node's address.

@dwarring
Copy link
Contributor Author

At least while a reference is being kept the the object unique-key will remain unique. There's some chance that the address could be reused for another node if the object is freed.

@vrurg
Copy link
Contributor

vrurg commented Feb 16, 2023

It is apparent that DOM mutation makes things problematic, to say the least. But for a r/o case there is a chance. My all-time wish was to have a box outcome be somehow bound to the underlying node structure so that any subsequent box would result in the same object. If the node is removed from the DOM the object would start reporting Nil for its parent and owner. There would be some edge cases, but the unique key would be preserved until the object is gone as it would hold the reference to the raw object.

There is another case where something like XML::Class is used. By pure accident, I partially implemented almost the same approach while was working on Spreadsheet::XLSX improvements. Just gave it a different name, but whatever. Since it is about de-/serializing DOM is not getting modified and everything is almost OK until the point where findnodes is involved because it would be much faster than traversing an object tree, but more problematic when it comes to mapping into a serialized instance. unique-key can help as a global map could be held somewhere in a hash. But the best place for that "somewhere" would be the document object.

Just to be clear, I'm not asking for a solution. Simply sharing my thoughts, so that they might result in something helpful later. Unfortunately, the second case is currently out of question anyway because pulling the XMLRepresentation part out of Spreadsheet::XLSX and refactoring it into LibXML::Class requires time which I don't have and barely would be having in a foreseeable future. :(

@vrurg
Copy link
Contributor

vrurg commented Feb 25, 2023

Working with ALTO turned out to be very slow. Being curious, I decided to intercept bless and count the number of allocations made. An outcome of a single analysis of a PDF file follows.

The file size is ~48K. Don't know how many actual tags does it contain, but deriving from the count of unique keys – 930-950 (some are likely never pulled in by the code).

A sample section of the file looks like this:

                <TextBlock ID="p1_b8" HPOS="157.889" VPOS="77.1160" HEIGHT="8.3520" WIDTH="216.612">
                    <TextLine WIDTH="216.612" HEIGHT="8.3520" ID="p1_t10" HPOS="157.889" VPOS="77.1160">
                        <String ID="p1_w18" CONTENT="(Please" HPOS="157.889" VPOS="77.1160" WIDTH="30.5010"
                                HEIGHT="8.3520" STYLEREFS="font4"/>
                        <SP WIDTH="2.5151" VPOS="77.1160" HPOS="188.390"/>
                        <String ID="p1_w19" CONTENT="reference" HPOS="190.905" VPOS="77.1160" WIDTH="38.0070"
                                HEIGHT="8.3520" STYLEREFS="font4"/>
                        <SP WIDTH="2.5150" VPOS="77.1160" HPOS="228.912"/>
                        <String ID="p1_w20" CONTENT="#1234567" HPOS="231.427" VPOS="77.1160" WIDTH="45.0360"
                                HEIGHT="8.3520" STYLEREFS="font1"/>
                        <SP WIDTH="2.5128" VPOS="77.1160" HPOS="276.463"/>
                        <String ID="p1_w21" CONTENT="when" HPOS="278.976" VPOS="77.1160" WIDTH="21.5100" HEIGHT="8.3520"
                                STYLEREFS="font4"/>
                        <SP WIDTH="2.5061" VPOS="77.1160" HPOS="300.486"/>
                        <String ID="p1_w22" CONTENT="making" HPOS="302.992" VPOS="77.1160" WIDTH="29.0070"
                                HEIGHT="8.3520" STYLEREFS="font4"/>
                        <SP WIDTH="2.5062" VPOS="77.1160" HPOS="331.999"/>
                        <String ID="p1_w23" CONTENT="payment.)" HPOS="334.506" VPOS="77.1160" WIDTH="39.9960"
                                HEIGHT="8.3520" STYLEREFS="font4"/>
                    </TextLine>
                </TextBlock>

Considering that many allocations are followed by a couple of new ones since values from XML attributes are pulled into my classes – no wonder it all takes ages to complete a single run.

My next step is to try back boxing by some a cache of some kind. Shouldn't be a problem since all is done is R/O mode with few hand-made elements which are not planted back into the original DOM.

------ STATS ------
  TOTAL: 21813
  BY TAG:
    Page:  1
    alto:  1
    String:  13647
    Illustration:  7
    TextStyle:  520
    Styles:  1
    SP:  2584
    TextLine:  5052
  BY ID:
    0x7fd1d6394b40:  16
    0x7fd1d633a070:  230
    0x7fd1d6382100:  60
    0x7fd1d6399370:  55
    0x7fd1d62f7320:  11
    0x7fd1d63628f0:  82
    0x7fd1d62e8360:  5
    0x7fd1d6387b10:  68
    0x7fd1d6384860:  60
    0x7fd1d6340bd0:  14
    0x7fd1d6311240:  78
    0x7fd1d6356ed0:  15
    0x7fd1d6341000:  74
    0x7fd1d6366660:  14
    0x7fd1d636d5f0:  82
    0x7fd1d62c9460:  52
    0x7fd1d632fd10:  23
    0x7fd1d632b1f0:  28
    0x7fd1d633feb0:  14
    0x7fd1d63241d0:  31
    0x7fd1d6372da0:  24
    0x7fd1d6314290:  66
    0x7fd1d6368730:  82
    0x7fd1d62cb5c0:  52
    0x7fd1d6378a10:  12
    0x7fd1d6335be0:  24
    0x7fd1d62ee230:  11
    0x7fd1d6336b60:  22
    0x7fd1d636f030:  82
    0x7fd1d62e9970:  6
    0x7fd1d636ec00:  16
    0x7fd1d62fbdb0:  42
    0x7fd1d639c3c0:  29
    0x7fd1d635a9e0:  41
    0x7fd1d62f0990:  35
    0x7fd1d630af40:  29
    0x7fd1d63442b0:  16
    0x7fd1d6309b90:  14
    0x7fd1d6316eb0:  96
    0x7fd1d634cee0:  17
    0x7fd1d63514b0:  79
    0x7fd1d639d0e0:  29
    0x7fd1d62e10e0:  24
    0x7fd1d63191e0:  25
    0x7fd1d638d950:  108
    0x7fd1d63238e0:  70
    0x7fd1d62c9360:  1
    0x7fd1d63750d0:  14
    0x7fd1d637ec80:  60
    0x7fd1d62cae00:  52
    0x7fd1d6342a40:  74
    0x7fd1d63215b0:  91
    0x7fd1d634fa70:  81
    0x7fd1d62e5740:  5
    0x7fd1d62c9d50:  52
    0x7fd1d62b1440:  20
    0x7fd1d6355490:  17
    0x7fd1d63188f0:  96
    0x7fd1d6388400:  18
    0x7fd1d6336270:  47
    0x7fd1d62cbd80:  52
    0x7fd1d637f9a0:  60
    0x7fd1d636c8d0:  82
    0x7fd1d6351080:  17
    0x7fd1d6370640:  14
    0x7fd1d6365d70:  80
    0x7fd1d62ff230:  42
    0x7fd1d630bc60:  29
    0x7fd1d634bf60:  229
    0x7fd1d6386500:  12
    0x7fd1d639de00:  29
    0x7fd1d62b0d80:  35
    0x7fd1d6303200:  29
    0x7fd1d636f920:  16
    0x7fd1d6303af0:  14
    0x7fd1d6319870:  48
    0x7fd1d639adb0:  55
    0x7fd1d636ae90:  82
    0x7fd1d632e4a0:  48
    0x7fd1d62fa7a0:  14
    0x7fd1d63341a0:  31
    0x7fd1d6364c20:  16
    0x7fd1d63068e0:  36
    0x7fd1d63923e0:  18
    0x7fd1d639f410:  75
    0x7fd1d6321ea0:  29
    0x7fd1d62f2cc0:  10
    0x7fd1d6337ae0:  22
    0x7fd1d63125f0:  68
    0x7fd1d6389e40:  18
    0x7fd1d62b2190:  20
    0x7fd1d6369020:  16
    0x7fd1d62b37a0:  40
    0x7fd1d63a6c10:  1
    0x7fd1d62f16b0:  35
    0x7fd1d636fd50:  80
    0x7fd1d638c5a0:  18
    0x7fd1d6360eb0:  82
    0x7fd1d6368110:  262
    0x7fd1d6305530:  74
    0x7fd1d631ded0:  137
    0x7fd1d6374a40:  7
    0x7fd1d638ef60:  18
    0x7fd1d6365940:  14
    0x7fd1d6352ac0:  15
    0x7fd1d6378120:  66
    0x7fd1d6302b70:  52
    0x7fd1d6313570:  66
    0x7fd1d63064b0:  14
    0x7fd1d63558c0:  60
    0x7fd1d636c4a0:  16
    0x7fd1d63a7500:  1
    0x7fd1d6390dd0:  108
    0x7fd1d63813e0:  60
    0x7fd1d63439c0:  70
    0x7fd1d637dd00:  66
    0x7fd1d62e73e0:  12
    0x7fd1d6395860:  16
    0x7fd1d635cd10:  16
    0x7fd1d639a980:  17
    0x7fd1d63322a0:  24
    0x7fd1d630df90:  70
    0x7fd1d630a8b0:  52
    0x7fd1d630e880:  31
    0x7fd1d635b700:  41
    0x7fd1d638e240:  18
    0x7fd1d6392810:  108
    0x7fd1d62b2a80:  10
    0x7fd1d6324ef0:  79
    0x7fd1d6370ae0:  80
    0x7fd1d62f8d60:  11
    0x7fd1d63565e0:  58
    0x7fd1d6398220:  17
    0x7fd1d62f8470:  30
    0x7fd1d63617a0:  16
    0x7fd1d63a0130:  74
    0x7fd1d6379730:  12
    0x7fd1d6308c10:  30
    0x7fd1d636e310:  82
    0x7fd1d638f390:  108
    0x7fd1d62af800:  52
    0x7fd1d62f6a30:  30
    0x7fd1d63402e0:  74
    0x7fd1d630fc30:  70
    0x7fd1d6309fc0:  22
    0x7fd1d6384430:  12
    0x7fd1d63222d0:  91
    0x7fd1d63177a0:  32
    0x7fd1d62f1fa0:  10
    0x7fd1d6380290:  12
    0x7fd1d6369450:  82
    0x7fd1d63631e0:  16
    0x7fd1d6302280:  21
    0x7fd1d637a880:  66
    0x7fd1d62f4d90:  6
    0x7fd1d62ee660:  24
    0x7fd1d639fd00:  34
    0x7fd1d6357300:  58
    0x7fd1d63806c0:  60
    0x7fd1d62f8040:  11
    0x7fd1d631a160:  81
    0x7fd1d6338a60:  22
    0x7fd1d6341d20:  74
    0x7fd1d63184c0:  32
    0x7fd1d639ccb0:  15
    0x7fd1d633db80:  76
    0x7fd1d63a2030:  46
    0x7fd1d637b170:  12
    0x7fd1d62e47c0:  24
    0x7fd1d63909a0:  18
    0x7fd1d638a270:  68
    0x7fd1d630c980:  29
    0x7fd1d6376050:  168
    0x7fd1d633ce60:  76
    0x7fd1d637a450:  12
    0x7fd1d631c490:  137
    0x7fd1d6327220:  92
    0x7fd1d6325e70:  31
    0x7fd1d6349e90:  14
    0x7fd1d62b2eb0:  20
    0x7fd1d62e2d80:  24
    0x7fd1d6381cd0:  12
    0x7fd1d635e0c0:  260
    0x7fd1d62ae7c0:  52
    0x7fd1d633ca30:  16
    0x7fd1d6366a90:  80
    0x7fd1d631be00:  227
    0x7fd1d6307ef0:  14
    0x7fd1d6325580:  70
    0x7fd1d6373690:  14
    0x7fd1d631d1b0:  137
    0x7fd1d62b3e30:  24
    0x7fd1d6395c90:  106
    0x7fd1d634ed50:  81
    0x7fd1d635d140:  41
    0x7fd1d639a090:  55
    0x7fd1d6364330:  82
    0x7fd1d63a4850:  1
    0x7fd1d635c420:  41
    0x7fd1d62f30f0:  35
    0x7fd1d62fe0e0:  11
    0x7fd1d6367380:  14
    0x7fd1d6300ed0:  27
    0x7fd1d6372710:  30
    0x7fd1d634d310:  81
    0x7fd1d6360190:  82
    0x7fd1d632bf10:  22
    0x7fd1d63345d0:  93
    0x7fd1d6327b10:  32
    0x7fd1d62f23d0:  35
    0x7fd1d636b780:  16
    0x7fd1d6350790:  81
    0x7fd1d63352f0:  93
    0x7fd1d6385e70:  6
    0x7fd1d633a700:  76
    0x7fd1d62eaf80:  5
    0x7fd1d62fb720:  91
    0x7fd1d63521d0:  79
    0x7fd1d63a2fb0:  46
    0x7fd1d62b0090:  1
    0x7fd1d6379b60:  66
    0x7fd1d6344fd0:  16
    0x7fd1d633b420:  76
    0x7fd1d6315f30:  66
    0x7fd1d6315b00:  28
    0x7fd1d62e3670:  11
    0x7fd1d630d900:  79
    0x7fd1d62f9190:  30
    0x7fd1d6363610:  82
    0x7fd1d62efc70:  35
    0x7fd1d6361bd0:  82
    0x7fd1d63713d0:  14
    0x7fd1d637be90:  12
    0x7fd1d63390f0:  44
    0x7fd1d63a7df0:  1
    0x7fd1d63371f0:  44
    0x7fd1d6389550:  68
    0x7fd1d6394f70:  106
    0x7fd1d6369d40:  16
    0x7fd1d635f470:  82
    0x7fd1d631b510:  72
    0x7fd1d6354b90:  60
    0x7fd1d62fc6a0:  11
    0x7fd1d6348880:  68
    0x7fd1d62f7750:  30
    0x7fd1d632ce90:  25
    0x7fd1d6387480:  176
    0x7fd1d6383b40:  60
    0x7fd1d62e3aa0:  24
    0x7fd1d63969b0:  106
    0x7fd1d6393e20:  18
    0x7fd1d62f3e10:  35
    0x7fd1d62ffb20:  11
    0x7fd1d631e7c0:  29
    0x7fd1d631daa0:  29
    0x7fd1d62fd7f0:  42
    0x7fd1d634a2c0:  68
    0x7fd1d6346e40:  70
    0x7fd1d635f040:  16
    0x7fd1d6305bc0:  36
    0x7fd1d63624c0:  16
    0x7fd1d63972a0:  118
    0x7fd1d62e1e00:  24
    0x7fd1d62e19d0:  11
    0x7fd1d6394250:  106
    0x7fd1d638c9d0:  66
    0x7fd1d632c5a0:  44
    0x7fd1d6320890:  91
    0x7fd1d638fc80:  18
    0x7fd1d636dee0:  16
    0x7fd1d631ebf0:  137
    0x7fd1d6351da0:  15
    0x7fd1d6367820:  80
    0x7fd1d63a10b0:  49
    0x7fd1d632de10:  25
    0x7fd1d633e8a0:  76
    0x7fd1d6365050:  82
    0x7fd1d6393100:  18
    0x7fd1d6321180:  29
    0x7fd1d635e750:  82
    0x7fd1d6382e20:  60
    0x7fd1d62b1d60:  10
    0x7fd1d6323250:  79
    0x7fd1d62fa110:  7
    0x7fd1d6348450:  16
    0x7fd1d632a4d0:  28
    0x7fd1d62ca640:  52
    0x7fd1d63537e0:  15
    0x7fd1d6342610:  14
    0x7fd1d6359630:  15
    0x7fd1d6358d40:  58
    0x7fd1d63071d0:  14
    0x7fd1d637c2c0:  66
    0x7fd1d62cc670:  52
    0x7fd1d62eb610:  10
    0x7fd1d637f570:  12
    0x7fd1d62ccea0:  1
    0x7fd1d635fd60:  16
    0x7fd1d6329550:  122
    0x7fd1d634abb0:  14
    0x7fd1d6353c10:  79
    0x7fd1d62f0560:  10
    0x7fd1d62fcad0:  42
    0x7fd1d62f1280:  10
    0x7fd1d6360a80:  16
    0x7fd1d6385150:  12
    0x7fd1d6398650:  55
    0x7fd1d6398f40:  17
    0x7fd1d639ed80:  85
    0x7fd1d63121c0:  32
    0x7fd1d638e670:  108
    0x7fd1d630ecb0:  70
    0x7fd1d63a3f30:  46
    0x7fd1d6326b90:  134
    0x7fd1d63303a0:  45
    0x7fd1d634e920:  17
    0x7fd1d633aff0:  16
    0x7fd1d636aa60:  16
    0x7fd1d6345400:  70
    0x7fd1d6347b60:  70
    0x7fd1d6307600:  36
    0x7fd1d6389120:  18
    0x7fd1d6388830:  68
    0x7fd1d6380fb0:  12
    0x7fd1d62f5420:  12
    0x7fd1d6346120:  70
    0x7fd1d6320200:  125
    0x7fd1d6383710:  12
    0x7fd1d6343330:  210
    0x7fd1d63561b0:  17
    0x7fd1d631b0e0:  32
    0x7fd1d6332930:  47
    0x7fd1d6314b80:  72
    0x7fd1d6371790:  80
    0x7fd1d6377400:  66
    0x7fd1d6329be0:  88
    0x7fd1d638ab60:  18
    0x7fd1d6303f20:  29
    0x7fd1d630c550:  14
    0x7fd1d63446e0:  70
    0x7fd1d6350360:  17
    0x7fd1d635bff0:  16
    0x7fd1d6377cf0:  12
    0x7fd1d62e6d50:  6
    0x7fd1d6310950:  70
    0x7fd1d6308320:  36
    0x7fd1d6304c40:  29
    0x7fd1d633e470:  16
    0x7fd1d62fff50:  42
    0x7fd1d63900b0:  108
    0x7fd1d639bd30:  54
    0x7fd1d6391af0:  108
    0x7fd1d6354500:  147
    0x7fd1d630b830:  14
    0x7fd1d6316820:  137
    0x7fd1d6315210:  66
    0x7fd1d62ed510:  11
    0x7fd1d633d750:  16
    0x7fd1d62ef5e0:  80
    0x7fd1d6331320:  45
    0x7fd1d63495a0:  68
    0x7fd1d637d8d0:  12
    0x7fd1d632ed90:  25
    0x7fd1d631cd80:  29
    0x7fd1d6310520:  31
    0x7fd1d632a900:  88
    0x7fd1d62e26f0:  40
    0x7fd1d63a19a0:  23
    0x7fd1d639d9d0:  15
    0x7fd1d631f910:  137
    0x7fd1d635a350:  84
    0x7fd1d62e5dd0:  10
    0x7fd1d6397930:  55
    0x7fd1d638af90:  68
    0x7fd1d6312ee0:  72
    0x7fd1d62f39e0:  10
    0x7fd1d6399c60:  17
    0x7fd1d6347730:  16
    0x7fd1d6327f40:  92
    0x7fd1d637e5f0:  150
    0x7fd1d6345cf0:  16
    0x7fd1d62fd3c0:  11
    0x7fd1d62ec590:  40
    0x7fd1d6357bf0:  15
    0x7fd1d63092a0:  22
    0x7fd1d631f4e0:  29
    0x7fd1d637b5a0:  66
    0x7fd1d6334ec0:  31
    0x7fd1d63a6320:  1
    0x7fd1d638bcb0:  68
    0x7fd1d6358020:  58
    0x7fd1d636bbb0:  82
    0x7fd1d62ed940:  24
    0x7fd1d63a0a20:  25
    0x7fd1d63a5a30:  1
    0x7fd1d63262a0:  70
    0x7fd1d6301e50:  13
    0x7fd1d6338170:  44
    0x7fd1d6301560:  21
    0x7fd1d62fe510:  42
    0x7fd1d6376fd0:  12
    0x7fd1d634c5f0:  81
    0x7fd1d6330c90:  23
    0x7fd1d6378e40:  66
    0x7fd1d635b2d0:  16
    0x7fd1d63a38a0:  23
    0x7fd1d6317bd0:  96
    0x7fd1d637cfe0:  66
    0x7fd1d638d2c0:  304
    0x7fd1d63829f0:  12
    0x7fd1d6373ac0:  24
    0x7fd1d6333220:  134
    0x7fd1d633f190:  16
    0x7fd1d62ecc20:  24
    0x7fd1d63118d0:  70
    0x7fd1d62f63a0:  57
    0x7fd1d6358910:  15
    0x7fd1d633f5c0:  74
    0x7fd1d6359a60:  58
    0x7fd1d6363f00:  16
    0x7fd1d63916c0:  18
    0x7fd1d62fee00:  11
    0x7fd1d63418f0:  14
    0x7fd1d62aef10:  52
    0x7fd1d62ea000:  12
    0x7fd1d633c140:  76
    0x7fd1d63766e0:  66
    0x7fd1d6352ef0:  79
    0x7fd1d6346a10:  16
    0x7fd1d63a2920:  23
    0x7fd1d636d1c0:  16
    0x7fd1d6349170:  14
    0x7fd1d634afe0:  68
    0x7fd1d6328c60:  90
    0x7fd1d632f420:  48
    0x7fd1d6313e60:  28
    0x7fd1d637cbb0:  12
    0x7fd1d630f5a0:  79
    0x7fd1d6328830:  32
    0x7fd1d632d520:  48
    0x7fd1d6304810:  14
    0x7fd1d638b880:  18
    0x7fd1d634dc00:  17
    0x7fd1d6385580:  60
    0x7fd1d633bd10:  16
    0x7fd1d62e4390:  11
    0x7fd1d634e030:  81
    0x7fd1d632b620:  88
    0x7fd1d6393530:  108
    0x7fd1d636a170:  82
    0x7fd1d63338b0:  93
    0x7fd1d634f640:  17
    0x7fd1d6324600:  70
    0x7fd1d62e0cb0:  11
    0x7fd1d63a5140:  1
    0x7fd1d62e89f0:  10
    0x7fd1d631a7f0:  72
    0x7fd1d6396580:  16
------ END STATS ------

@dwarring
Copy link
Contributor Author

That ties in with what I've seen when benchmarking. LibXML is faster than XML for parsing and Xpath queries.

But XML is fast for repeated DOM access, because LibXML is recreating the objects everytime. Caching and a read-only mode sounds like a good idea.

@vrurg
Copy link
Contributor

vrurg commented Feb 26, 2023

I'm trying to implement caching locally with, well, mixed results so far. But thinking over it for a while I came upon an idea. Not sure if it is feasible to implement it, but it better be spoken out.

What is the biggest problem of unique-key? Being bound to memory allocations it is susceptible to re-allocations. So, same node may get different unique-key if it accidentally gets moved in memory for whatever reason. This, apparently, would break any existing mapping from the node into a Raku object. What would help in this situation if there is some kind of unique ID which is immutable, belongs to a single node, and never gets re-used. Just a sequential number would work well enough.

I just suppose that it couldn't be done without changing the underlying C struct which is, presumably, a property of libxml itself.

@dwarring
Copy link
Contributor Author

There's already a _private field on libxml2 holds a struct owned by the Raku bindings.

Yes, this could hold a sequential id that's incremented globally say 64 bit unsigned integer. and could be fetched by the bindings This would leads to a better implementation of unique-id. I don't think it's that hard to implement.

dwarring referenced this issue Feb 26, 2023
This shouldn't collide, except for a very, very long running process.

Also provide uid method to get the raw value as a 64 bit unsigned integer.
@dwarring
Copy link
Contributor Author

See ea5c163 which reimplements unique-key as a global (64 bit) counter. Also added a uid to get the value as a 64 bit integer.

@vrurg
Copy link
Contributor

vrurg commented Feb 26, 2023

I'm currently investigating a caching weirdness where object counter installed via bless produces the results, I published earlier. Yet, while overriding box and caching over raw node unique-key I see no cache hits! Anyway, just sharing my emotions. :)

A quick glance into the commit tells me that C code is not thread safe, counter increment isn't atomic. Either a mutex is needed or libuv atomic interface, unless there is something in the modern standard library as my memories of C are frozen somewhen back in early 2000th.

@dwarring
Copy link
Contributor Author

The increment is in code protected by a xmlMutexLock, which is held to prevent two threads taking a lock at the same time.

Just realised another problem with that last commit: the xml_ref struct can change over time for a given node.

This can happen if Raku references to a node are destroyed, then the xml6_ref struct is destroyed and _private is set back to NULL. A later fetch will recreate the xml6_ref struct with a new uid.

I might revert the last commit and do any further work on a branch.

@vrurg
Copy link
Contributor

vrurg commented Feb 26, 2023

I don't plan to rely on this new change anyway.

In the meantime, I hope you can help me to understand better the relations between box and new. Further debugging revealed that bless is called more times than box does. It means most of the stats above are result of new calls. Considering repeating unique keys (i.e. where the count is 2+), new are getting invoked with the same raw nodes. But I was confident that the method is used to create new nodes from scratch, not for wrapping natives. Apparently, it’s a wrong assumption.

Can you help me to clear this out?

The key problem is that I cannot return a cached object from bless because .Reference is invoked by box and for that reason, to protect reference count from corruption, cache checks are to be done earlier. I’m currently on my iPad and cannot check out the sources, but guessing that same applies for new. Therefore it would be better to know where exactly I need to plug into to make sure no new memory leaks are introduced.

dwarring added a commit that referenced this issue Feb 26, 2023
This reverts commit ea5c163.

Some issues with the uid not remaining unique. see discussion #94
@dwarring
Copy link
Contributor Author

I'm not sure, except I'd also expect bless to be called via box rather than directly for LibXML::Node objects.

For example, I'd assume the call to .Reference could be deferred and placed in a TWEAK submethod.

@vrurg
Copy link
Contributor

vrurg commented Feb 26, 2023

For example, I'd assume the call to .Reference could be deferred and placed in a TWEAK submethod.

I like the idea. In this case whatever would go wrong during object creation it wouldn't result in memory leaks due to wrong reference count.

There is one more reason why implementing the cache via bless is not a great idea. Before the method gets called the upstream chain of calls have some work already done, most of which is useless waste of CPU cycles because it is only needed for creating a new object.

BTW, unfortunately, whatever I currently do I do in a quick and hacky way applicable to my local project only because I need it, like, yesterday. Hopefully, I'd be able to share whatever useful bits I learn along the way. For example, I just have realized that most calls to bless are coming through LibXML::Item.box calling LibXML::Node.box. By "most" I mean more than a half of them. Here is where my cache ineffectiveness comes from. But this shouldn't be a problem on your side, if you get there.

vrurg added a commit to vrurg/LibXML-raku that referenced this issue Feb 27, 2023
Any config supplied by upstream was previously ignored and class mapping
was expected to be done by `LibXML::Node`. This resulted in limited
abilities of user code in controlling how boxing is done. See libxml-raku#94.

Note that `LibXML::Node` still does its own class mapping because this
is the point of final decision and in many cases it is invoked directly.
dwarring pushed a commit that referenced this issue Feb 27, 2023
Any config supplied by upstream was previously ignored and class mapping
was expected to be done by `LibXML::Node`. This resulted in limited
abilities of user code in controlling how boxing is done. See #94.

Note that `LibXML::Node` still does its own class mapping because this
is the point of final decision and in many cases it is invoked directly.
@vrurg
Copy link
Contributor

vrurg commented Feb 27, 2023

With PR #97 the total number of allocations fell down from 22k to 10k for my code. Next problem I just've encountered are iterators. Hopefully I can quickly make them do about the same because, for example, childNodes creates an iterator over LibXML::Node which makes total sense but results in LibXML::Node.box called directly later in Node::List itertor pull-one. The iterator does have access to the config, so adding a class-map should not be a problem.

dwarring pushed a commit that referenced this issue Feb 27, 2023
Move the call to $.raw.set-flags to a TWEAK submethod as it needs to be
called after $.raw.Reference has been called.
Any config supplied by upstream was previously ignored and class mapping
was expected to be done by `LibXML::Node`. This resulted in limited
abilities of user code in controlling how boxing is done. See #94.

Note that `LibXML::Node` still does its own class mapping because this
is the point of final decision and in many cases it is invoked directly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants