Rows Duplicated When Reading Python Dictionary #16172

Open
manh4wk opened this issue Apr 23, 2024 · 0 comments

H2O version, Operating System and Environment
h2o version 3.46.0.1, Windows 10.0.19045, Python v3.10.14.

I've run into this in several versions of h2o on various versions of Windows and Python. I wasn't able to recreate it in an Ubuntu environment.

Actual and Expected Behavior
When reading a Python dictionary into an h2o dataframe, some rows are duplicated, so the h2o dataframe ends up with more records than the original dictionary. The expected behavior is for the h2o dataframe to have the same number of records as the original dictionary.

Here is an example with random data that reproduces the issue, although I unfortunately first ran across it with real data. This example should result in a dataframe with 2,364,350 records, but I end up with 2,364,353, three more than expected:

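import h2o
import numpy as np

h2o.init()

# Build a dictionary of 6 columns, each holding 2,364,350 random floats.
d_test = dict()
np.random.seed(17)  # fixed seed so the data is reproducible
for i in range(6):
    d_test[f'my_column_number_{i}'] = list(np.random.rand(2364350))

# Load the dictionary into an H2OFrame.
hf_test = h2o.H2OFrame(d_test, destination_frame='my_test')

# Expected (2364350, 6); on the affected Windows setup this shows 2364353 rows.
hf_test.shape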
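
The mismatch can also be checked without a screenshot. The following is a minimal sketch, not part of the original report: `expected_rows` and `extra_rows` are hypothetical helper names, and the frame's row count is passed in as a plain integer so the helpers do not depend on h2o itself.

```python
def expected_rows(columns):
    """Return the common length of a dict of equal-length column lists."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have differing lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


def extra_rows(columns, frame_rows):
    """Return how many rows the frame has beyond what the dictionary holds."""
    return frame_rows - expected_rows(columns)


# Small stand-in for d_test (hypothetical data, for illustration only).
sample = {"a": [0.1, 0.2, 0.3], "b": [0.4, 0.5, 0.6]}
print(extra_rows(sample, 3))  # a frame with 3 rows has no extra rows
```

With the reproduction above, `extra_rows(d_test, hf_test.shape[0])` should be 0; given the row counts reported here, it would be 3 on the affected setup.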

Here is a screenshot comparing the row count from the h2o dataframe with a pandas dataframe reading the same data:
[Screenshot: row count of the h2o dataframe next to the row count of the pandas dataframe]
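
To see which rows were duplicated rather than only how many, the rows of the loaded frame can be counted. This is a rough sketch under assumptions: `duplicated_rows` is a hypothetical helper, and it takes plain row tuples so it can be tried without an h2o cluster.

```python
from collections import Counter


def duplicated_rows(rows):
    """Return a dict mapping each row that occurs more than once to its count."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


# Small stand-in for the frame's rows (hypothetical data, for illustration only).
rows = [(0.1, 0.4), (0.2, 0.5), (0.2, 0.5), (0.3, 0.6)]
print(duplicated_rows(rows))
```

If pandas is installed, the rows could be obtained with `hf_test.as_data_frame().itertuples(index=False, name=None)`. Since each column holds independent random floats, an exact repeat across all six columns is very unlikely to occur by chance, so any row reported here is probably one of the duplicates.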
