How do I generate Arch features of new datasets from GLRM predict function #160

Open
tsengj opened this issue Jun 25, 2022 · 4 comments

@tsengj

tsengj commented Jun 25, 2022

I raised the same question here:

https://stackoverflow.com/questions/72753783/how-do-i-generate-the-archetypes-of-new-dataset-from-the-glrm-predict-function

I have used these sites as references, and although they have been helpful, I'm unable to generate the reduced dimensions for new datasets via the GLRM predict function.

I work in the sparklyr environment with H2O. I'm keen to use the GLRM function to reduce dimensions before clustering. I am able to access the archetypes (Arch) from the model, but I would like to generate the Arch features for new datasets using the GLRM predict function.

Appreciate your help.

Here is the training of the GLRM model on the training dataset

glrm_model <-
  h2o.glrm(
    training_frame = train,
    cols = glrm_cols,
    loss = "Absolute",
    model_id = "rank2",
    seed = 1234,
    k = 5,
    transform = "STANDARDIZE",
    loss_by_col_idx = losses$index,
    loss_by_col = losses$loss
  )
# Decompose training frame into XY
X <- h2o.getFrame(glrm_model@model$representation_name) #as h2o frame

The archetypes (Arch features) from the training dataset:

X
        Arch1      Arch2       Arch3      Arch4      Arch5
1  0.10141381 0.10958071  0.26773514 0.11584502 0.02865024
2  0.11471676 0.06489475  0.01407714 0.24536782 0.10223535
3  0.08848878 0.26742082  0.04915022 0.11693702 0.03530641
4 -0.03062604 0.29793032 -0.07003814 0.01927975 0.52451867
5  0.09497268 0.12915324  0.21392107 0.08574152 0.03750636
6  0.05857743 0.18863508  0.14570623 0.08695144 0.03448957

But when I use the trained GLRM model on a new dataset to regenerate these Arch features,
I get the full set of reconstructed columns instead of the Arch features shown in X above.

I'm using these Arch as features for clustering purposes.

# Generate predictions on a validation set (if necessary):
glrm_pred <- h2o.predict(glrm_model, newdata = test)
glrm_pred
  reconstr_price reconstr_bedrooms reconstr_bathrooms reconstr_sqft_living reconstr_sqft_lot reconstr_floors reconstr_waterfront reconstr_view reconstr_condition reconstr_grade reconstr_sqft_above reconstr_sqft_basement reconstr_yr_built reconstr_yr_renovated
1     -0.8562455       -1.03334892         -1.9903167           -1.3950774        -0.2025564      -1.6537486                   0             4                  5             13         -1.20187061             -0.6584413       -1.25146116            -0.3042907
2     -0.7940549       -0.29723926         -0.7863867           -0.4364751        -0.1666500      -0.8527297                   0             4                  5             13         -0.13831432             -0.6545514        0.54821146            -0.3622902
3     -0.7499614       -0.18296317          0.1970824           -0.3989486        -0.1532677       0.4914559                   0             4                  5             13         -0.09100889             -0.6614534        1.38779632            -0.1844416
4     -1.0941432        0.08954988          0.7872987           -0.2087964        -0.1599888       0.8254916                   0             4                  5             13          0.11973488             -0.6623575        2.70176558            -0.2363486
5      0.3727360        0.82848389          0.4965246            1.1134378        -0.9013011      -1.3388791                   0             4                  5             13          0.08427185              2.1354440       -0.07213625            -1.2213866
6     -0.4042458       -0.59876839         -0.9685556           -0.7093578        -0.1745297      -0.5061798                   0             4                  5             13         -0.43503836             -0.6628391       -0.55165408            -0.2207544
  reconstr_lat reconstr_long reconstr_sqft_living15 reconstr_sqft_lot15
1  -0.07307503    -0.4258959             -1.0132778          -0.1964471
2  -0.52124543     0.7283153              0.1242903          -0.1295341
3  -0.56113519     0.6011221             -0.1616215          -0.1624136
4  -0.99759090     1.3032420              0.1556193          -0.1569607
5   0.70028433    -0.6436112              1.1400189          -0.9272790
6  -0.02222403    -0.2257382             -0.4859787          -0.1817499

[6416 rows x 18 columns] 

thank you

@wendycwong

James: Thank you for bringing this issue to me. @us8945 has also raised a good question about how we score a new dataset using a trained GLRM model. Let me answer his question first here:

Given a training dataset, the purpose of GLRM is to extract a set of basis vectors that span the subspace from which the training dataset is derived. The GLRM model therefore generates a set of archetypes (equivalent to the concept of basis vectors). Each row vector in the training dataset can then be written as a linear combination of the archetypes: yi = x1*archetype1 + x2*archetype2 + x3*archetype3 + ... . Note that x1, x2, x3 are the coefficients that are returned when we call predict on the training dataset for each row. During training, we derive the archetypes and the coefficients together in an alternating fashion.

Now, given a new dataset derived from the same subspace as the training dataset, the job of the predict function is to find the set of coefficients for each data row using the archetypes derived earlier. Here we already know the archetypes and only need to find the coefficients. This is achieved by setting the initial coefficients to random values and then using simple gradient descent to minimize the objective function and obtain the correct coefficients.
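
To make the idea concrete, here is a minimal sketch in base R of fitting coefficients by gradient descent while the archetypes stay fixed. It is not H2O's implementation: the matrices Y, X_true and A_new and the helper fit_coefficients are made up for illustration, and the sketch assumes a quadratic loss with no regularization (the model in this issue uses Absolute loss and a STANDARDIZE transform, so real results would differ).

```r
set.seed(1234)

# Hypothetical archetype matrix Y: k = 2 archetypes (rows), p = 4 columns.
Y <- rbind(c(1, 0.5, 0,  0.2),
           c(0, 1.0, 1, -0.3))

# Hypothetical "new" rows, built as exact combinations of the archetypes
# so that the recovered coefficients can be checked against X_true.
X_true <- rbind(c( 0.1, 0.3),
                c(-0.2, 0.5),
                c( 0.4, 0.0))
A_new <- X_true %*% Y

# Gradient descent on ||A - X Y||^2 with Y held fixed (quadratic loss only).
fit_coefficients <- function(A, Y, steps = 500) {
  # Step size 1/L, where L = 2 * largest eigenvalue of Y Y^T.
  lr <- 1 / (2 * max(eigen(Y %*% t(Y), symmetric = TRUE)$values))
  # Random starting coefficients, one row per data row, one column per archetype.
  X <- matrix(rnorm(nrow(A) * nrow(Y)), nrow = nrow(A), ncol = nrow(Y))
  for (i in seq_len(steps)) {
    residual <- X %*% Y - A
    X <- X - lr * (2 * (residual %*% t(Y)))
  }
  X
}

X_new <- fit_coefficients(A_new, Y)
colnames(X_new) <- paste0("Arch", seq_len(nrow(Y)))
print(round(X_new, 4))
```

Because A_new was constructed from the archetypes, X_new should match X_true up to numerical tolerance. With a non-quadratic loss, the gradient step would need to be replaced by the gradient (or subgradient) of that loss.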

@tsengj
Author

tsengj commented Jun 28, 2022

Thanks Wendy, but unfortunately this is beyond my depth.
Your response here made more sense, where you wrote:

"GLRM, you decompose a matrix A = XY and you perform clustering on X. For a new dataset, ANew, you need to get your new XNew. To do this, you perform XNew = ANew * inverse(Y)"

How do I implement this ANew * inverse(Y) in R so that I can cluster on the features from XNew?

Apologies, I'm quite a novice in this space.
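
One possible reading of ANew * inverse(Y), sketched in base R: since Y (archetypes by columns) is generally not square, "inverse" is taken here to mean the Moore-Penrose pseudo-inverse, which is an assumption on my part. The matrices below are made-up stand-ins, not output from H2O. In a real session Y would presumably come from the trained model's archetypes, ANew would need the same transform as the training data, and this only corresponds to a quadratic loss without regularization, so treat it as an approximation for a model trained with Absolute loss.

```r
# Hypothetical archetype matrix Y: k = 2 archetypes (rows), p = 4 columns.
Y <- rbind(c(1, 0.5, 0,  0.2),
           c(0, 1.0, 1, -0.3))

# Hypothetical new data, built from known coefficients for checking.
X_true <- rbind(c( 0.1, 0.3),
                c(-0.2, 0.5),
                c( 0.4, 0.0))
A_new <- X_true %*% Y

# Right pseudo-inverse of Y (p x k); valid when Y has full row rank.
Y_pinv <- t(Y) %*% solve(Y %*% t(Y))

# Least-squares coefficients for the new rows: X_new = A_new %*% pinv(Y).
X_new <- A_new %*% Y_pinv
colnames(X_new) <- paste0("Arch", seq_len(nrow(Y)))
print(round(X_new, 4))
```

The resulting X_new has one row per new observation and one column per archetype, which is the shape needed as input features for a clustering step.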

@wendycwong

James:

I know what you are looking for: the X for a new dataset. Luckily Uri (@us8945) has brought up the issue to me. I will write a new function so that you can get the new X.

The predict function returns Anew to you, but you are looking for the new X.

Will get this done for you.

Thank you for bringing this to our attention.

W

@wendycwong

Here is the JIRA: https://h2oai.atlassian.net/browse/PUBDEV-8750
