Initializing Dataframe from vector of named tuples: missing values #3370

Open
lukas-weber opened this issue Aug 15, 2023 · 5 comments

@lukas-weber

When I initialize my dataframe using

df = DataFrame([(a=1,b=2), (a=3,b=4), (a=1,)])

I would expect to get

a b
1 2
3 4
1 missing

Instead I get an error

ERROR: type NamedTuple has no field b
Stacktrace:
  [1] getproperty
    @ ./Base.jl:37 [inlined]
  [2] getcolumn
    @ ~/.julia/packages/Tables/AcRIE/src/Tables.jl:102 [inlined]

Is there a simple way to get the former behavior? Should it be what DataFrames does by default instead of throwing an error?

@bkamins
Member

bkamins commented Aug 15, 2023

> Is there a simple way to get the former behavior?

julia> DataFrame(Tables.dictrowtable([(a=1,b=2), (a=3,b=4), (a=1,)]))
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1        2
   2 │     3        4
   3 │     1  missing

> Should it be what DataFrames does by default instead of throwing an error?

The constructor is strict on purpose. Tables.dictrowtable was designed to handle such cases.
Alternatively, you could use:

julia> reduce((df, x) -> push!(df, x; cols=:union), [(a=1,b=2), (a=3,b=4), (a=1,)], init=DataFrame())
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1        2
   2 │     3        4
   3 │     1  missing

which is faster but more verbose.


Having said that, it may indeed make sense to allow what you ask for in the constructor, so I am keeping the issue open.

@bkamins bkamins added this to the 1.7 milestone Aug 15, 2023
@lukas-weber
Author

Thanks!

@bkamins
Member

bkamins commented Aug 15, 2023

@quinnj, @nalimilan - what do you think? Now that I think about it, maybe a lighter wrapper than Tables.dictrowtable could be introduced in Tables.jl that would produce missing when a row has no value for a column? The issue is that Tables.dictrowtable materializes Dict entries for each row, whereas a similar "lazy wrapper" would not materialize anything and would only inject missing where needed when getcolumn is called.
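
A minimal sketch of the "inject missing on access" idea, using only Base Julia. The names `unionnames` and `lazycolumn` are hypothetical and are not part of Tables.jl; a real wrapper would presumably hook into `Tables.getcolumn` instead, and this sketch makes no claim about how that would be implemented.

```julia
# Hypothetical helper: column names in order of first appearance across all rows.
function unionnames(rows)
    colnames = Symbol[]
    for row in rows, name in keys(row)
        name in colnames || push!(colnames, name)
    end
    return colnames
end

# Hypothetical helper: lazily yield one column, substituting `missing`
# wherever a row lacks the requested field. No per-row Dict is built.
lazycolumn(rows, name::Symbol) = (get(row, name, missing) for row in rows)

rows = [(a=1, b=2), (a=3, b=4), (a=1,)]
colnames = unionnames(rows)

# Materialize each column only when it is actually needed.
cols = NamedTuple{Tuple(colnames)}(Tuple(collect(lazycolumn(rows, n)) for n in colnames))
```

The result is a named tuple of vectors, which is a column table, so it could be passed on as `DataFrame(cols)`.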

@nalimilan
Member

Tables.dictrowtable is indeed hard to discover. Maybe we could support cols=:union like in vcat?

@bkamins
Member

bkamins commented Aug 16, 2023

> Maybe we could support cols=:union like in vcat?

This is what I was also considering. The problem is that this is not an issue at the level of DataFrames.jl. We delegate construction of columns to Tables.jl, so Tables.jl would need to have this kind of mechanism first (it is Tables.jl that throws the error, not DataFrames.jl).
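
For illustration, the stack trace in the original report ends in `getproperty` in Base, reached through `Tables.getcolumn`. The sketch below uses only Base Julia to reproduce that failure on a single row and contrasts it with `get`, which accepts a default. The exact exception type may differ between Julia versions, so the code only checks that some exception was raised.

```julia
row = (a=1,)   # a row that lacks field `b`

# Property access on an absent field throws, as in the reported stack trace.
err = try
    getproperty(row, :b)
    nothing
catch e
    e
end
println(err isa Exception)

# `get` with a default does not throw; it returns the default instead.
fallback = get(row, :b, missing)
println(fallback)
```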
