-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Consider using a discriminating "tracking variable" or "discriminating function" for untagged unions in testing/debug builds. #1907
Replies: 2 comments · 5 replies
-
(I'm converting this issue to a discussion because at present it doesn't seem like there's any action we can take to resolve it.) I was thinking along somewhat similar lines in #139, an early proposal for untagged unions. Under that proposal, you can't type-pun through a union, so there would always be a definite answer to "which union member (if any) is currently active?", although that answer isn't available to Carbon code. In principle a debug or sanitizer build could track that state and use it to diagnose errors. However, it's hard to see how to do that efficiently when a union member is accessed through a pointer, for roughly the same reasons that the "discriminating function" approach isn't a complete solution (see the "Safety" section of the proposal doc).
I think it will be simpler and safer to keep the language-internal discriminator completely separate from any discriminator mechanism that's accessible to Carbon code. See p0157 for details on how we plan to support sum types with user-defined layout and discrimination -- I think that should extend reasonably well to wrapping sum types defined in C++. |
Beta Was this translation helpful? Give feedback.
All reactions
-
One can come up with sanitizing solutions if one requires the allocator to be type-aware. E.g. the project can put a cap on max-alloc size for testing and put a map (bit-map, byte-map etc) at the beginning the alloc-block (e.g. cap alloc at 96MB and make sure that the blocks are aligned to 128MB boundaries, then you get the beginning of the block with bit-and masking of the address). Static options might be impractical, the only one I can think of is to only allow taking the address of unions, or to create a typing-scheme where all types have a "twin type" that is tagged as a union member, but that has wide ranging type system and build implications… Still, C-unions kind of implies some kind of informal dependent-typing so making it formal and more strongly typed will drive the design towards supporting a dependent type scheme? You might end up moving in that direction anyway if you are going to extend the type system later with borrow checking, effects-typing, deadlock-prevention etc. So you might want to think about more advanced type system options at this stage instead of postponing it to after V1.0, otherwise you might end up doing the job two times… (The most interesting solution might be to throw out unions and come up with some kind of more generic typestate-analysis scheme, but that might be a too extensive language-design task for Carbon, and it would also be quite different from C++)
I am not sure if I understand how this will work out in real code? I hope I am mistaken, but my impression is that this might lead to verbosity/inefficiency? I thing I would prefer an optional "discriminator function" as it is simple and could be limited to c-style asserts. E.g. you assume that the selected union-type is active and the compiler asserts that it is so (by the "discriminator function") in debug builds. This is basically C++ semantics. By and large, it isn't difficult to use C-unions correctly, but it is difficult to pinpoint what is causing buggy behaviour when the root cause is incorrect usage of C-unions. So I think the runtime debugging aspect is the most critical one, paying a runtime release-build cost for new code is problematic as you choose untagged unions because you want better space/time performance than tagged ones. I also think the suggested solution for a storage-buffer will lead to over-estimation of size. Example: 5 bytes array and 4-byte-aligned float. Maybe it is better to not make the union a type, but a template that is instantiated within a record, that way you can squeeze the size requirements by having the template result in different union-member layouts based on the alignment position in the concrete record. If you take the typestate approach then the union doesn't have to be a type at all, just a collection of record-fields that have different configuration of inactive/active for each type-state. Specify legal state-type transitions and let the compiler figure out the optimal layout. |
Beta Was this translation helpful? Give feedback.
All reactions
-
I think that only solves half the problem, though. If you're dereferencing a pointer, and you want to diagnose if it points somewhere inside an inactive member of a union, you need to know which union member is currently active at that memory location, and which union member was active when the pointer was formed, so that you can verify that the two are equal. The approach you describe gives us the first piece of information, but I don't see how it can give us the second. For that we'd need some way to associate provenance information with pointer values. And yeah, I can imagine ways to track pointer provenance statically, but it seems like that would add a fearsome amount of complexity to the type system. Maybe if we wind up needing dependent typing of pointers for other reasons, we could "piggyback" on that for this purpose, but I don't think making unions safer is a sufficiently compelling use case on its own. I don't know much about dependent typing, but the people who do tell me that it's not a step to take lightly.
I'm not aware of any such issues; can you be more specific?
Are there any cases where the discriminator function approach would catch bugs that the sanitizer approach would miss? I'm having trouble coming up with any. Conversely, there are definitely cases that a sanitizer would catch, but a discriminator function can't:
So unless I'm missing something, it doesn't seem like discriminator functions will help much with debugging. The other thing I'd say here is that a discriminated (aka tagged) union type is fundamentally an untagged union plus a way of determining which union member is active -- in other words, an untagged union plus a discriminator function. So for the use cases that permit a discriminator function, I'd rather focus on enabling those users to define a discriminated union API for their data. That way they get support for things like pattern matching, and they can add debug assertions in the accessors if they choose, or define an API that's statically safe, which makes debug assertions unnecessary. That's one of the overarching goals of p0157.
I don't follow. Why would size be overestimated in that case?
I agree that untagged unions probably shouldn't be types. However, I don't think "let the compiler figure out the optimal layout" will work, because there are many possible definitions of "optimal". Instead, I think we should think of unions as part of a suite of tools for precisely controlling the layout of fields in a class. That's one of the key ideas behind #139. |
Beta Was this translation helpful? Give feedback.
All reactions
-
You could view the bitmap like a bloom-filter: "is this pointer worth a more expensive hash-table lookup"? The assumption being that union members are much less frequently pointed to than other objects. (this is only for debugging). Or you could use a byte as an index, so you can only have a limited amount of types allocated in the same memory block (or some other scheme). Index 0 would mean that it is not a union member.
Yes, I agree that it isn't worth it on its own, but I think you will end up with some limited version of it. I would be surprised if not C++ with meta-programming could be considered as having some kind of dependent typing (by accident). ATS has dependent types backed up with a logic language and it might be worth having a look at it as it is not too far away from system level programming (even if this is too far to go for Carbon). But I think one can do interesting things without going this far.
I just don't think I fully understand what is required when using For verbosity and clarity; a C union member that points to Object can be accessed as Granted, my understanding of what For me it would be better to be able to do:
and have that turned into something like
Yes, you miss the case where you use SIMD/BLIT/ASM/memset optimizations and bypass the typesystem completely when you initialize a union? You also miss the case where the union is provided by another language, the operating system or some other external resource?
If you can make it in such a way that you can "paint" it over memory that is controlled by an external resource, that means that the discriminator-function can return a different result after every FFI call.
Assume address-offset 2 and onwards are available. If you use
I understand that Carbon intends to follow C/C++ with fixed in-order layout. I think this is a mistake.
So I would prefer a layout where you can say that two fields are accessed together and should be on the same cache line, and for low-level scenarios I think it would be interesting if one could lock fields to a specific byte-offset (e.g. for next-pointers that makes it possible to have objects of different types being linked into multiple linked lists). There are also inheritance scenarios where you want to allow subclasses to have small fields fit into free areas of the superclass. I can see scenarios where you would get better packing by allowing union-members to be placed in different areas of the record. Also think of the scenario where you have multiple unions in the same record, but can prove that only certain field combinations are active at the same time? Overall, I think the type-state approach with compiler determined layout (limited by user provided constraints) is more attractive, but it is also more advanced… so I understand that there is a trade-off (as with dependent typing). However, it also clear that Carbon will need a more advanced type system than C++ if it is to be seen as a worthy successor. |
Beta Was this translation helpful? Give feedback.
All reactions
-
But what's the key for that hash table lookup? It can't be just the address, because active and inactive union members can have the same address.
It sounds like you're describing some kind of pointer tagging, where the pointer representation includes additional information beyond the address. Is that right? That can work, and it seems like the best option here, but it's tricky. If we don't want to change the size of the pointer type (which could wreak all kinds of havoc), we only have 64 bits to work with. Most of those will be taken up by the address itself, and it seems likely that other sanitizer checks will want most or all of the rest. That might mean that this needs to be a separate build mode from the other sanitizers, which we've been trying to avoid.
First of all, I wouldn't worry too much about the details of
No, an object is not allowed to span multiple
I'd like to see if we can disallow "bypassing the typesystem" in that way (while still supporting the underlying use cases), because they tend to lead to code that's confusing to users for the same reason it's confusing to the sanitizer. For example, perhaps we can provide something like an operation that says "create an object of this type at this location, using the current state of that memory as the object representation", which is a no-op in a release build, but can add safety checks in a debug or hardened build. That way object lifetimes are always explicitly represented in the Carbon code, which will be clearer for users as well as more amenable to sanitizing and hardening.
Perhaps, but that doesn't seem specific to unions. If I have a Carbon class type that wraps memory controlled by an external resource, the object could be mutated by every FFI call.
How can you do this in only 8 bytes using individual fields?
That seems like an argument against compiler-controlled layout, because the compiler typically doesn't have access to profiling information. I know that profile-guided optimization is a thing, but I don't think Carbon's performance story can rely on it. It requires significant changes to the development workflow, and it relies on instrumentation and feedback channels that may not always exist.
True, but we can address that by providing a mechanism for per-platform code customization. We're going to need that anyway, for many other purposes besides layout. In fact, I suspect even if layout is compiler-optimized, we'd still need per-platform layout customizations, because I may want to optimize for different things on different platforms (e.g. on mobile I probably care a lot more about power-efficiency).
I'm generally a fan of capturing intent in code, but in this case I'm doubtful that it's feasible to capture intent in a way that's sufficiently granular and flexible to meet users' performance needs. Are there examples of this kind of intent-driven layout optimization being successfully adopted? I definitely see the appeal and the advantages of this approach, but I'm doubtful that it's a good place for us to spend our "innovation budget", especially at this early stage. |
Beta Was this translation helpful? Give feedback.
All reactions
-
I see that there are some issues for certain unions, and for some specific configurations it becomes very expensive to have sanitizers, but I think they can work for the most common cases. So off-the-top-of-my-head I see two options: 1) you try to find the "type state" for the whole union, or 2) you track the "type-state" of individual union members only. In scenario 1) you will have issues with union members having different addresses… The simple solution is to create a hash entry for every plausible address in the union at allocation time. This is ok for the way I tend to use unions, but will be excessive for things like string-buffers unless you require string-buffer-views to use fat pointers (so you always have a pointer to the beginning of the string buffer). In scenario 2) you have issues with union members having the same address, but in the latter case you also have type-information that can be enumerated (unless you allow This all breaks down if two union members have the same super-class and the union field is accessed through Overall, I guess it is difficult to figure out what can and cannot work until other semantics (such as having internal pointers in buffer-objects) has been finalized. On the other hand, it is better to have sanitizers for some commonly constructed unions even if it is possible to construct unions where the sanitizer strategy does not work. After all, sanitizers are not part of the type system, so it is not a disaster if you cannot get the benefit for all unions.
That wasn't what I was describing, but I agree that this is needed in the general case as described above. What I tried to convey with the "byte-map" was a scheme for figuring out that the pointed-to-object is taking part in a union and the means to find the union's type-info. I don't know how well it would work in practice…
I understand. My personal opinion is that it is worth it if you have serious problems identifying the source of a bug and this is the only way to get it to work… (not for testing as such, but for identifying the source for what seems to be spurious random behaviour during debugging).
Then I don't see how you can paint this typing mechanism over C data structures?
But a discriminator-function would still work, I am not sure you can find a general solution to sanitizing untagged unions, so maybe you have to combine several different techniques on a case-by-case basis?
Yes, using a different implementation for debugging is a good first option… But maybe that also prevents the cause for the bug to be triggered. For instance if the key problem is a concurrency issue or if the program is using a coprocessor that use DMA of some kind and there is some kind of syncing issue.
The point I tried to make was that a discriminator function might work out ok in this case where other options won't work.
Case 1: byte 0…1:filled, byte 2…6: five_byte_type_alignment_1 Case 2: byte 0…1: filled, byte 4…7: float_alignment_4
You can also use heuristics; granted, that does require a more advanced compilation setup than C++.
I haven't done a search for it. There aren't really all that many system level programming language efforts and it is unlikely to be an area looked into by research languages as this is more of a pragmatic issue than a theoretical one. It does make sense to me, though, that interfaces in system level programming could provide byte-offsets as an option. I personally find single-linked-lists to be the easiest way to create a lock-free mailbox and using byte-offsets is a simple way of getting the same object into multiple lock-free mailboxes. Is this "dirty", yes, but low level system programming tends to have some of that in some critical limited spots… |
Beta Was this translation helpful? Give feedback.
All reactions
-
Just a couple of perspectives on tagged and untagged unions: We can view a tagged union as a "allocator-optimization" of a class hierarchy. The unpopulated union is the superclass, and the union-members are the attributes of subclasses. The tag is the class-type. For untagged union we can take a similar "allocator-optimization" perspective. We can often replace a record with a union-of-N-fields with N records. If we interpret this case with the same modelling mindset as when we are applying TypeScript types to JSON data then we can view this as safe reinterpret-casting a memory-pointer. If you add type-state analysis you get some interesting and more flexible alternative takes on records-with-unions. (The concept of unions might be considered somewhat superflous semantically, so it might be interesting to think about to what extent it has to be a type, and to what extent it can just be viewed as an "optimization problem". I sense that this is the angle the |
Beta Was this translation helpful? Give feedback.
All reactions
This discussion was converted from issue #1896 on August 03, 2022 22:40.
-
One problem with untagged unions is correct triggering of destructors and preventing type-punning (writing as one type, reading as another). It would make testing and debugging easier if there was a mechanism tracking construction, access and destruction of a type in a union.
This "implicit" state might be tracked by a variable outside the program proper. A ghost variable is an "imagined" variable that is used for reasoning about a program (e.g. verification), not for actual computations of the program. The "tracking variable" is similar except it might have a physical representation, yet, it cannot affect the program and thus is only used for tracking the correctness of the program during testing or debugging. This tracking should not be part of the program, and is as such only a testing or debugging feature. We can think of it as a feature of the machine executing the program rather than as being part of the program.
How this "tracking variable" is realized (or if it is at all needed) should be up to the compiler, if it can find free space for it in a struct that cannot be overwritten, or if it can put it at an offset, then that is an option. Another option is to just use a hash-table.
A more flexible solution is to add a "discriminating function" that takes a context-parameter which indicates the type that is active in the union. Usually the context would be the record the union is embedded in, as this record often contains an enumeration value or some other state-carrier that indicates the type that is allowed to be used in the union.
Complication: Finding the indicator could also require a lookup in a container object or some other structure, which is more complicated in the case where you have no back-pointer. In that case it would require either passing a hidden parameter to the container or that the record containing the union has a back-pointer at a negative offset, or some other mechanism. Which could be tricky.
As such, a "discriminating function" cannot be a general solution, but it has more potential for becoming part of the computation than the "tracking variable" approach. In some cases it can be added to existing C++ codebases without changing the record or touching the existing code that use the record. Which could be a welcome feature for Carbon users who want to interact with C++ libraries without modifying them.
Beta Was this translation helpful? Give feedback.
All reactions