Rows Duplicated When Reading Python Dictionary #16172

manh4wk · 2024-04-23T03:07:39Z

H2O version, Operating System and Environment
h2o version 3.46.0.1, Windows 10.0.19045, Python v3.10.14.

I've run into this in several versions of h2o on various versions of Windows and Python. I wasn't able to recreate it in an Ubuntu environment.

Actual and Expected Behavior
When reading a Python dictionary into an h2o dataframe, some rows are duplicated, so the h2o dataframe ends up with more records than the original dictionary. The expected behavior is for the h2o dataframe to have the same number of records as the original dictionary.

Here is an example with random data that recreates the issue, though I unfortunately first ran across it with real data. This example should result in a dataframe with 2,364,350 records, but I end up with 2,364,353:

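import h2o
import numpy as np

# Start (or connect to) a local H2O cluster.
h2o.init()

# Build six columns of 2,364,350 random floats each, seeded for reproducibility.
d_test = dict()
np.random.seed(17)
for i in range(6):
    d_test[f'my_column_number_{i}'] = list(np.random.rand(2364350))
hf_test = h2o.H2OFrame(d_test, destination_frame='my_test')

# Expected (2364350, 6); on Windows I get 2364353 rows.
hf_test.shape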

Here is a screenshot comparing the row count from the h2o dataframe with a pandas dataframe reading the same data:
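For anyone who wants to check the same thing without the screenshot, here is a minimal sketch of the comparison using only the standard library. The helper names `row_count` and `duplicated_rows` are hypothetical and not part of h2o, and the `source` and `loaded` values are a tiny made-up stand-in for the real data. The idea is that independent random floats should essentially never produce two identical rows, so any repeated row in the loaded frame points at duplication during import.

```python
from collections import Counter


def row_count(columns):
    """Number of rows in a dict of equal-length column lists."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have differing lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


def duplicated_rows(rows):
    """Return {row: occurrences} for every row that appears more than once."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


# Small stand-in for the source dictionary (illustrative values only).
source = {"a": [0.1, 0.2, 0.3], "b": [1.0, 2.0, 3.0]}

# Hypothetical rows read back from the loaded frame, with one row repeated.
loaded = [(0.1, 1.0), (0.2, 2.0), (0.2, 2.0), (0.3, 3.0)]

print(len(loaded) - row_count(source))  # surplus rows: 1
print(duplicated_rows(loaded))          # {(0.2, 2.0): 2}
```

To run this against the real frame, the rows would need to be pulled back into Python first. I believe `hf_test.as_data_frame(use_pandas=False)` returns a list of rows with the header as the first entry, but treat that as an assumption and check it against your h2o version.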

manh4wk added the bug label Apr 23, 2024

wendycwong assigned krasinski Apr 26, 2024
