Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get OA dates as double instead of converted to DateTime/TimeSpan? #545

Open
jbrockerville opened this issue Jul 1, 2021 · 15 comments
Open

Comments

@jbrockerville
Copy link

jbrockerville commented Jul 1, 2021

Hi,

Incredibly useful library! However, like some other users, I have a date-time issue. The ones I looked at, and mine, seem to stem from the fact that the worksheet implementations detect and convert OA dates to DateTime or TimeSpan objects. For example, the XlsxWorksheet. From there, the problems seem like issues with system region formatting and Excel formatting and their interplay. This was my issue too until I figured out that the library was doing a conversion before the system region formatting came into play. I worked through it by understanding it. Cool. But I have found instances where I simply do not want any conversion whatsoever. I want to take Excel's and the system's formatting completely out of the equation and work with the raw double. Unfortunately, there does not seem to be an option to simply not do this and return the raw OA date double value, or at least none that I could find. Can you do that? Just return the OA date double? If not, could you please add the option to do so?

Thanks!
-jb

@andersnm
Copy link
Collaborator

andersnm commented Jul 1, 2021

Hi,

This is something I've wanted myself for the ExcelNumberFormat project in order to properly support those invalid dates that Excel supports (f.ex January 0 1900 and February 29 1900). Although to be useful, the reader must also expose the "Is1904" flag how to interpret the double.

However, in your case, is there any reason you cannot just use DateTime.ToOADate() to get a double?

@jbrockerville
Copy link
Author

jbrockerville commented Jul 1, 2021

Yes, the Is1904 flag would be important to expose as well.

I have some reasons for wanting the raw OA date.

I have some tools built around this library and everything is converted to strings with the overridden and convenient ToString method, but that gets me the Excel formatted (or the system region formatted) date-time string. On the surface that is no big deal, but it is a pain to deal with a system dependent formatted date-time string like that when your scenarios are any region. And then throw in some non-standard date-time string formatting into that mix and it becomes nightmarish. Suffice it to say, I am doing something odd things and I want to control exactly what comes out of a OA date cell regardless of Excel or system region formatting.

Ok, so, admittedly the above situation does not preclude me from specifically detecting a DateTime object and calling ToOADate, but there are bugs in doing that I believe. I did some simple tests. I parsed and called ToOADate on three Excel cells containing OA dates, a complete date-time, e.g. 1808.04445601852, a pure date, e.g. 1808, and a pure time, e.g. 0.04445601852. The complete date-time is fine. But the pure date DateTime object now has the default time and screws up the ToString conversion by printing 12:00:00 AM (depending on system region) and the pure time DateTime object gets a 1 added to the OA date somewhere and prints a "default" date like "1899-12-31". These are not desirable outcomes.

Also, according to something a friend found, it appears that certain OA dates converted to TimeSpan messes up the decimal part from netcoreapp2.2 to netcoreapp3.1. 2.2's value matched the OA date value, but 3.1 did not. Or put another way, 2.2 TimeSpan.ToString outputs "13:14:15" (which is correct), but 3.1 outputs "13:14:14.9999999". Also, not a desirable outcome.

So, hopefully, I have adequately described the reasons why simply retrieving the raw OA date and avoiding "filtering" it through a DateTime or TimeSpan object would be useful.

@andersnm
Copy link
Collaborator

andersnm commented Jul 1, 2021

I still don't quite get it. It sounds like all your concerns are related to the default formatting provided by the .NET runtime. You would see the same behavior whether you get a raw double from ExcelDataReader or from ToOADate().

@andersnm
Copy link
Collaborator

andersnm commented Jul 1, 2021

Have you looked at this section: https://github.com/ExcelDataReader/ExcelDataReader#formatting

If you want to print a value as date-only, or time-only, you cannot rely on the raw values in the file. Neither Excel nor .NET knows how to print those without a number format. Thus you cannot use ToString(). The linked code snippet shows how to get a string from a value using the number format.

@jbrockerville
Copy link
Author

As per my previous post, the main takeaway should have been is that "filtering" the raw OA date double through a DateTime or TimeSpan (along with the netcore inconsistencies I mentioned) can change the value. That alone is reason enough to provide the raw OA date double.

If you want to print a value as date-only, or time-only, you cannot rely on the raw values in the file.

Sure I can. And do correct me if my simple assumption here is wrong somehow. Inspect the double. Whole number part only? Pure-date. Fractional part only with leading 0? Pure-time. Both? Complete date-time.

The following data, code, and output, illustrates my point:

data.xlsx:

(I do not seem to be able to attach things so here are the three raw OA date doubles in the file)

44369, 0.609560185185185, 44369.6095601852

Code:

using ExcelDataReader;
using System;
using System.IO;
using System.Text;

namespace ExcelDataReaderTest
{
    static class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello, ExcelDataReaderTest!");

            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

            var filePath = "..\\..\\..\\data\\data.xlsx";
            var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
            IExcelDataReader reader = ExcelReaderFactory.CreateReader(stream);
            Console.WriteLine("Column | FieldType");
            do
            {
                while (reader.Read())
                {
                    for(var i = 0; i < reader.FieldCount; ++i)
                    {
                        var fieldType = reader.GetFieldType(i);
                        Console.WriteLine($"{i} {fieldType}");

                        if (fieldType.Name == "DateTime")
                        {
                            var dt = reader.GetDateTime(i);
                            var oa = dt.ToOADate();
                            Console.WriteLine($"  {dt}");
                            Console.WriteLine($"  {oa}");
                        }
                    }
                }
            } while (reader.NextResult());
            reader.Dispose();
            stream.Dispose();
        }
    }
}

Ouput:

Hello, ExcelDataReaderTest!
Column | FieldType
0 System.DateTime
  2021-06-22 12:00:00 AM
  44369
1 System.DateTime
  1899-12-31 2:37:46 PM
  1.6095601851851853
2 System.DateTime
  2021-06-22 2:37:46 PM
  44369.609560185185

Do you see the issue? After "filtering" through the DateTime object, I can no longer tell that 0 is a pure-date and 1 is a pure-time because their supposed "missing" parts have been filled in by unwanted default values. If the interface provided access to the raw OA date double, then a user could tell them apart by simply inspecting the double.

@andersnm
Copy link
Collaborator

andersnm commented Jul 2, 2021

Inspect the double. Whole number part only? Pure-date. Fractional part only with leading 0? Pure-time. Both? Complete date-time.

It is the number format that decides how values are printed in Excel. You can apply identical logic to a DateTime object: Time is midnight? Pure-date. Date-part is 1899-12-31? Pure-time. Both? Yes. Neither is correct, but gives identical results and kind of works.

Regarding your netcore inconsistencies, here's some info about changes to formatting in .NET Core 3.0:
https://devblogs.microsoft.com/dotnet/floating-point-parsing-and-formatting-improvements-in-net-core-3-0/

As far as I can tell, having the OA date might simplify things for you, but it's not a blocker.

@jbrockerville
Copy link
Author

I am sorry, but "kind of works" is not really good enough. Sure, looking for "1899-12-31" would be ok, but any recent date at exactly midnight, while improbable, is still very valid and therefore is inappropriate for doing a pure-date check.

Thank you for the .NET Core 3.0 format changes info. Explains things, but I still have to deal with the aftermath. And dealing with that would be easier with access to the OA date.

Look. You have made a great library here and it has made my life easier. Props. But I respectfully disagree that this is not blocking. It is for me. I am sure most users will use and appreciate the convenient internal conversion from OA date to DateTime you have done here, but I have found an edge case for which extracting the raw cell data is preferable. And, to be honest, respectfully, regardless of my particular reasons, an Excel reader that does not actually give the user access to the raw cell data seems like a gap in functionality. What is the harm in simply providing both the conveniently converted object and the raw object and letting the user decide what to use?

@andersnm
Copy link
Collaborator

andersnm commented Jul 2, 2021

I am sorry, but "kind of works" is not really good enough. Sure, looking for "1899-12-31" would be ok, but any recent date at exactly midnight, while improbable, is still very valid and therefore is inappropriate for doing a pure-date check.

There is no difference between a DateTime at midnight and a whole number. Technically they are both just numbers that require additional information through a number format to tell if its date-only, time-only or both. That's why I'm saying both approaches "kind of works" - but neither is correct. You must use the number format if correctness is a requirement. A raw OA date suffers exactly the same issue as a DateTime.

If you want to contribute a PR for an option that returns OA dates from ExcelDataReader then that's ok. But it won't help to determine if a value is date-only, time-only or both in a 100% way.

@appel1
Copy link
Collaborator

appel1 commented Jul 2, 2021

I don't think any of us is against providing this as a feature somehow.

But just to make sure I understand. You want to differentiate between date and time by how it is stored as an OADate and not by how it is formatted to look in Excel? So if the OADate is 40303.467361111114 formatted using [$-F400]h:mm:ss\ AM/PM so that it is displayed in Excel as 11:13:00 you want to handle that as the full datetime and not a time?

@jbrockerville
Copy link
Author

@andersnm

Technically they are both just numbers that require additional information through a number format to tell if its date-only, time-only or both.

Are you sure about that? I feel like the example I provided proves that you can indeed tell the difference with the OA date only, no additional number formats are required.

@appel1
Copy link
Collaborator

appel1 commented Jul 2, 2021

Some time in the future I'd love to expose this as the new DateOnly and TimeOnly structs that will be introduced in .NET 6 instead. Perhaps something to take into consideration if/when we decide how to expose the raw OADates doubles.

@jbrockerville
Copy link
Author

@appel1
Something like that yeah. In my example, the OA dates are 44369, 0.609560185185185, 44369.6095601852, a pure-date, a pure-time, and complete date-time. I made them using the cell formatters provided by Excel and those are the OA date doubles that Excel created. You see them when you set the display format to General. Assuming OA dates are stored like that, you can easily differentiate between complete date-time, pure-date, and pure-time using the double, but once converted into a DateTime that distinction has been lost.

Those DateOnly and TimeOnly structs sound like they would be wonderfully suited for this situation. However, I know I'll have constraints and will not be able to go .NET 6. So, as a workaround for all versions, simply providing the OA date double would be beneficial.

Heck, while I would advocate for providing access to the double regardless, a DateTime companion enum indicating such would be amazing. That way, the user gets the conveniently converted OA date as a DateTime object, but also the information lost in the conversion.

@andersnm
Copy link
Collaborator

andersnm commented Jul 2, 2021

Here's an example of the same values repeated, but with different number formats for date-only, time-only or both in each row:

image

If we were to implement OA dates in ExcelDataReader, then all rows would return identical doubles in columns 2-4. You couldn't tell the "OA date" doubles apart from the "General" doubles without examining the number format. The underlying values are still "just doubles" with no intrinsic information about presentation. That's all in the number format.

@jbrockerville
Copy link
Author

Oh, I misunderstood your original post about that. Good point.

Another value in a DateTime companion enum?

enum DateTimeType
{
    None, // NotADateTime
    DateAndTime,
    DateOnly,
    TimeOnly
}

@andersnm
Copy link
Collaborator

andersnm commented Jul 3, 2021

@jbrockerville It is important to understand how number formats work when your requirement is to produce output similar to Excel. Feels like we're missing the full picture and only discussing details out of incorrect assumptions.

You can derive a DateTimeType companium enum like that from the number format in your application's presentation logic​ (however requires new properties for IsDateOnly/IsTimeOnly in the ExcelNumberFormat library). Such an enum does not belong in ExcelDataReader: it already returns the number format as the true "companion enum" what values should look like.

Let me just stress that the GetFormattedValue() example here is supposed be a replacement for ToString() that correctly handles issues like date-only, time-only, both, netcore inconsistencies, and near-infinite variations like "HH:MM:SS", "HH:MM", "H:MM", 24h/12h, etc. It's not clear if your requirements go beyond this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants