Add a flag to the Index Block that indicates whether a data block encodes text #375
Comments
For various reasons, the main implementation of GSD is a pure C library, and C lacks any dedicated string type. GSD also lacked logged quantities in the initial design, so the specification did not anticipate the need. When we later added logged quantities, we kept the C representation for strings that GSD previously used only internally. Unfortunately, the HDF5 writer does not support strings at all, due to the same limitations as pandas.
I would guess that Ovito interprets any data chunk with the type
What use case do you have for string output that fits more naturally in a GSD file than a plain text file written by write.Table?
If there is a strong need, we could bump the specification minor revision and add a new string data type:
Lines 21 to 52 in d31663d
I would not use flags - a reader could not reasonably interpret e.g. Nx3 float32 data with the "string" flag set. We could either keep the null-terminated representation or restrict string data chunks to be
I'm learning a lot - I did not know that C lacks a dedicated string type.
The use case that got me interested in this was that I wanted my function, which initializes a Hoomd simulation, to add a logger that would record the row action that originally called the function. The reason I want this is so I can track which timesteps correspond to which actions in a GSD file that is appended to by multiple row actions. I know a potentially easier approach would be to just make a new GSD file with every action, but I still wanted to approach it this way so that I have fewer GSD files in each job directory. Keeping this data in a GSD file, rather than a file written by
What you say about types over flags makes sense - I'm just not familiar enough with C to know the details of implementation.
If we decide not to add this functionality in GSD, I think it might be helpful to raise a warning on the Python side of Hoomd if it detects that a loggable callback can return a string. This warning could notify the user that string output will have to be decoded on file reading.
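The warning suggested above could look roughly like the following. This is a minimal sketch, not Hoomd code: the helper name `check_loggable_value` and the wording of the message are hypothetical, and where such a check would hook into Hoomd's logger is left open.

```python
import warnings


def check_loggable_value(name, value):
    """Warn if a logged value is a string; return the value unchanged.

    Hypothetical helper: it is not part of Hoomd or GSD.
    """
    if isinstance(value, str):
        warnings.warn(
            f"Loggable '{name}' returned a string. It will be stored as an "
            "array of integers and must be decoded when the file is read.",
            UserWarning,
            stacklevel=2,
        )
    return value
```

Returning the value unchanged means the check could wrap an existing call site without altering what gets written.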
Small update: I just learned that
Again, Ovito handles the data correctly and opens the GSD file without issue.
Summary
Currently in GSD, log data is encoded in such a way that a reader cannot determine whether it should be decoded as numerical or textual information. This can be seen in the MWE described below, in which a GSD file is written with Hoomd, but upon reading with GSD's own built-in functions, one of the logged fields seems (at first glance) to be decoded incorrectly. This behavior is actually described in the docs, but that description is difficult to find, and the behavior can feel very unexpected to newcomers. I suggest that we add a flag to the index block to indicate whether a block contains numerical or textual information, and then modify GSD's built-in reading functions to behave accordingly.
Description
Here is an MWE of a simple simulation that writes trajectories and a message "hello" to an output file called test.gsd:
When test.gsd is opened in Ovito, the message field is displayed properly:
But when test.gsd is read with GSD's built-in functions, message instead contains arrays of integers:
These integers are the decimal Unicode code points of the letters 'h', 'e', 'l', 'l', 'o', followed by a null terminator. Indeed, you can decode them this way:
After discussing this offline with @tcmoore3, I learned that this behavior is not only expected, but also documented - still, this behavior seems strange to me. I understand why data is stored the way it is in GSD, as blobs are much more efficient for read/write tasks than plain text, but the Python API is intended to be approachable, allowing users to, for example, create a pandas dataframe from a GSD log in one line. If we want to make it easy for users to interact with GSD files through Python, I think we should make it so they don't have to convert decimal Unicode code points to text on their own. Just as an example, the one-liner I linked above will break if the log contains varying textual information, because pandas requires uniform dimensions in each column.
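The manual decoding step described above can be sketched as follows. This is an illustration only, not the snippet from the original post: the helper name `decode_text_chunk` is hypothetical, and it assumes the logged field is a flat sequence of integers holding UTF-8 bytes that end with a null (0) value.

```python
def decode_text_chunk(values):
    """Decode a flat sequence of byte-valued integers into a str.

    Assumes UTF-8 text terminated by a null (0) value. Negative values
    (as read from a signed 8-bit array) are mapped back to 0..255.
    """
    raw = bytes(int(v) & 0xFF for v in values)
    # Keep only what precedes the first null terminator, if any.
    return raw.split(b"\x00", 1)[0].decode("utf-8")


# The integers for 'h', 'e', 'l', 'l', 'o' followed by the terminator.
print(decode_text_chunk([104, 101, 108, 108, 111, 0]))  # hello
```

Because the helper only iterates over its argument, it should accept a plain list or a one-dimensional NumPy array alike.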
Proposed solution
I am not familiar enough with GSD's specification and C and Python APIs to give detailed suggestions for a solution, but I do have some ideas. The file layer specification indicates that the Index Block contains an unused(?) line item called flags, which is "reserved for future use". I suggest that we consider adding a flag to indicate that a data block contains textual information.
If we implemented this flag, we would also need to modify the behavior of gsd.hoomd.open() and gsd.hoomd.read_log() to check for the flag and then automatically decode the decimal Unicode code points. We would also need to add some logic in Hoomd to handle the new behavior. Again, I am not familiar enough with Hoomd's internals to suggest a full solution, but this could involve modifying the behavior of hoomd.write.GSD to set the flag when a logger passes it string values.
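To make the proposed read-side behavior concrete, here is a rough sketch of what "check for the flag, then decode" might amount to. Everything in it is an assumption: no such flag exists in GSD today, so the set `text_keys` stands in for whatever the reader would learn from the index block, and `decode_flagged_entries` and the key names are hypothetical.

```python
def decode_flagged_entries(log, text_keys):
    """Return a copy of `log` with text-flagged entries decoded to str.

    `log` maps names to flat sequences of integers. `text_keys` is a
    stand-in for the proposed (not yet existing) per-chunk text flag.
    """
    decoded = {}
    for key, values in log.items():
        if key in text_keys:
            raw = bytes(int(v) & 0xFF for v in values)
            # Drop the null terminator and anything after it.
            decoded[key] = raw.split(b"\x00", 1)[0].decode("utf-8")
        else:
            # Numerical data passes through untouched.
            decoded[key] = values
    return decoded


# Hypothetical log contents: one text entry and one numerical entry.
log = {"log/message": [104, 101, 108, 108, 111, 0], "log/value": [1.5]}
print(decode_flagged_entries(log, text_keys={"log/message"}))
```

Numerical entries are returned as they were, so existing users of the log would see a change only for chunks marked as text.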