Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: clarify thread-safety #66

Open
jods4 opened this issue Sep 25, 2020 · 6 comments
Open

Docs: clarify thread-safety #66

jods4 opened this issue Sep 25, 2020 · 6 comments

Comments

@jods4
Copy link

jods4 commented Sep 25, 2020

This library is great, thanks! ❤️

Could you please clarify proper usage if we're going to parse many files with the same format?

As we don't want to generate dynamic code for each file, I was wondering if it is possible to create static mappers/selectors for the file format we need and then to use GetReader on possibly concurrent threads.

I guess the question is: are Mappers and Selectors thread-safe?

If not I would like to know if there is a way to re-use the dynamically generated code (esp. after the changes in 4.8) without putting the parsing in a critical section.

@jehugaleahsa
Copy link
Owner

When I was working on this for 4.8, at first tried to do some sort of equals/hashing to detect if something changed to the type mapper that'd require me to generate new code to reduce the amount of code generation going on. I was concerned that generating an assembly was a slow operation.

After benchmarking and profiling a bit, I found it's not that costly. Prior to 4.8 I was also creating generated code each time you create a reader/writer; the only thing that's different is that I create a new dynamic assembly each time now. Unless your files are small, this shouldn't be noticeably slower than it was before.

In your situation, probably the best thing to do is stick your type mapper somewhere you can access it across threads, then just have each thread call GetReader or GetWriter. Don't quote me on it but I'm pretty sure calling these methods results in a Mapper getting created, which has its own code generator instance, so each thread would generate it's own assembly. If you try this out and something awful happens, just let me know. My paranoia is that having the same assembly name might cause issues.

@jods4
Copy link
Author

jods4 commented Sep 26, 2020

If you try this out and something awful happens, just let me know.

The issue here is that finding concurrency issues is quite finicky and this is in an environment where I can't afford failures.

It would be nice being able to generate a Mapper per file format in a static variable and then get readers out of it on multiple threads.

What I understand from your answer is that thread-safety of mappers is not guaranteed, so I'll stick to one mapper per thread.

Feel free to close this issue or keep it open if you like the suggestion to guarantee thread safety of mappers.
I think it would be nice, esp. for usage on servers, and shouldn't be hard. As long as you don't use static state during parsing, the generated code should be thread safe.

@jehugaleahsa
Copy link
Owner

Yeah, I similarly have no magical way to tell if there'd be unexpected threading issues with this code. All I can say for certain is that there's no longer any shared state (static values) among type mappers/readers/writers, as each reader/writer instance will get its own code generator now. The only thread-safety guarantee prior to 4.8.0 was that the shared dynamic assembly was created in a static constructor (essentially) which guarantees that only one assembly would be created and shared among instances. That static constructor nonsense isn't even necessary anymore and any other threading concerns would be the same as code prior to 4.8.

If you look here: https://github.com/jehugaleahsa/FlatFiles/blob/master/FlatFiles/TypeMapping/SeparatedValueTypeMapper.cs#L1484 you can see that every new "read" operation gets its own "mapper", which gets its own "code generator". The only thing I see that's suspicious is this line here: https://github.com/jehugaleahsa/FlatFiles/blob/master/FlatFiles/TypeMapping/Mapper.cs#L49 where I am caching the reader. This style of caching is not thread-safe except I am pretty sure (not 100%) that I always call this code in single-threaded context. I am not sure I even need this caching mechanism anymore but VS's debugging is not working at the moment in order to see if those if-statements ever even get hit. Nonetheless, that code shouldn't even be callable across threads, so I think its a moot point.

My only fear is related to how .NET handles two or more dynamic assemblies have the same name. It doesn't seem to cause any issues in a single-threaded context so I doubt it will cause issues in a multi-threaded context. I never need to look up the assembly and simply use the generated code in it directly, so the name is sort of irrelevant.

I did a quick look to see where all I am using static variables and the only remaining places I use statics are to declare readonly constant values. The other thought I have is creating a type mapper is a relatively inexpensive operation as it just generates a schema and passes it to a reader/writer. Everything happens when creating the reader/writer (i.e., code generation).

@jods4
Copy link
Author

jods4 commented Sep 27, 2020

I dug a little more into the code and have some observations to share.

  1. The readers themselves seem thread-safe, with some restrictions.
    The only shared state is mapping/schema, so if user doesn't mutate it concurrently it's fine. Ideally the API would separate cleanly mapping configuration and usage, but hindsight is always 10/10. It would be a weird usage anyway and definitely not supported concurrently.
    Cached stuff is not coded in a thread-safe way (it's trivial to fix), I spotted cachedColumns in schema and cachedReader/Writer in mapper. Turns out that in 4.8 every reader gets its own schema and mapper, so it doesn't matter if they're thread-safe or not -- they're useless.

  2. Following up on the 4.8 "every reader gets its mapper + codegen", even though you benchmarked it to be fast enough to create, it seems like unrequired memory/gc pressure if you parse many files. After reading the code I would change two things:

2.a. Have you considered using Lightweight Code Generation (LCG), aka DynamicMethod? You go through a lot of hassle with dynamically creating assemblies, assigning public keys and unique names, when you could just create a method?
I have seen you create classes, but it looks like they contain a single method and are intended to capture state. DynamicMethod can capture context as well.

2.b. Codegen for a typed mapper is 100% derived from MemberLookup, so it makes sense to move it there rather than inside Mapper. There it could easily be cached and reused if you Read multiple times from the same FixedLengthTypeMapper.
There was the weird usage in 1. where you could mutate your config after calling GetReader. This is easily supported by setting the cache to null in MemberLookup::GetOrAddMember and MemberLookup::AddIgnored.

@jehugaleahsa jehugaleahsa reopened this Sep 27, 2020
@jehugaleahsa
Copy link
Owner

  1. Brings up a couple good points. I believe that the Mapper class is very short-lived and GetReader or GetWriter gets called and then the results gets passed to a TypedReader/TypedWriter and then the Mapper gets discarded. I will double-check this - things get complicated when dealing with nested schemas and complex mappers.

  2. I went down the DynamicMethod route and, as you mentioned, there's some backing field nonsense that made it seem impossible before. If you had a link to an example where context is captured, I'd gladly take a look. At some point I would like to take a deeper look at code generation as a whole: if you read the code carefully, you'll see when I build deeply nested object graphs, I am creating a reader/writer method for each object down the path. I should just be creating a single, long code gen'd method that does the getting/setting for every object in the graph top to bottom. Unfortunately, this didn't occur to me at the time when I was working on nested initialization. This could be an excuse to take another dive into it.

Your comment about mutating your config after starting reading got me thinking. I ran into this problem before with people modifying the "schema options" objects a long time ago. It seems like I will have the same problem with people modifying columns and schema after-the-fact too. In the past I solved this by cloning the objects - I probably need to do something similar here. Then I should wrap my schema object and the returned column definitions in immutable wrappers with read-only interfaces. This would solve the issue related to the way I create the property mapping objects right now: I am just storing a StringColumn or whatever as a backing field and mutating it. This approach allows the same sort of mutation-after-the-fact to occur.

@jods4
Copy link
Author

jods4 commented Sep 27, 2020

If you had a link to an example where context is captured, I'd gladly take a look.

If you use the overload CreateDelegate(Type,Object) the second parameter is a capture: it can be whatever you want and will be bound to the 1st argument of your dynamic method (so the delegate type has to have one less argument than the IL code).
There is an example further down in the linked doc where they "fake" an instance method by capturing a target instance.

I should just be creating a single, long code gen'd method that does the getting/setting for every object in the graph top to bottom.

I saw that and I was surprised. I was expecting a single dynamic method that would straight read a whole line instead of a loop over the metadata.

This could be an excuse to take another dive into it.

If you rewrite the whole thing (which is not required to use DynamicMethod, as it's based on IL as well), maybe you should consider creating the method with Expression rather than IL, as it might be easier to write and maintain. It's still ugly, mind you, but far less than raw IL code.
If you go this way, I'd suggest you take a dependency on FastExpressionCompiler which provides you more lightweight, drop-in Expression classes, faster compilation time and faster execution than the BCL implementation. It's used by other libraries that emit code to avoid Reflection.

(Going off track here: C# source generators seem even better for this use case but that's the future and not really usable by most devs right now).

It seems like I will have the same problem with people modifying columns and schema after-the-fact too. In the past I solved this by cloning the objects - I probably need to do something similar here.

My thoughts on this: it's a weird use case to mutate your file format once defined.
The easiest cop-out is to not support it (maybe even keep a flag that a reader has been produce and throw if a mutation is attempted).
If you want to support it, instead of throwing you could set a cached reader to null, so that a new one will be produced when calling GetReader again and state that concurrent usage of reader/writer and mutation of config is not supported. Seems reasonable to me.

If you really want to support crazy use cases, I'd say the best design is a builder pattern. Maybe you can do this without breaking the public API.
Users can mutate a builder (actually the mapper) by doing mapper.Property(x => Name) etc. but when you call GetReader you get a readonly reader that is totally disconnected from the mapper (so snapshot any data you'd need).
Further mutations of the mapper don't affect the previously created reader, and you can create new ones, etc.
If you want to go fancy maybe the snapshot could be based on immutable structures, although I'm really not sure it would be beneficial (perf and memory-wise) for simple cases like this.
There might be even less stuff to snapshot if you fully generate a row reader dynamic method, that would embed rather than reference most information (names, types, fields, etc.)

A final unrelated thought I had while reading the code: parsing flat files may benefit from using Span<char>. The code creates string copies of every field found in the file, which can be a lot of memory allocations, esp. on big files.
In theory, many fields could be parsed straight from the input line, through a Span: e.g. numbers, dates, plain strings.
Strings would only need to be allocated for inputs that need to be modified (e.g. from OnParsing callbacks).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants