
Streaming features in BQML #53

Open
vrolok opened this issue May 18, 2023 · 5 comments

vrolok commented May 18, 2023

I tried this use case in my own environment. I'm not an AI engineer, but I'd like to understand the Vertex AI framework and services. My question is about streaming data features processed by Dataflow and ingested into the feature store.

I have values in the feature store for 2 entities:

[Screenshot: feature values for the two entities in Feature Store]

But in the BQ training table, the streaming feature values are all 0s.

[Screenshot: BQ training table with streaming feature values all 0]

Explainable AI shows these top contributing features:
[Screenshot: Explainable AI top feature attributions]

Does this mean that the streaming features were not used? How does BQML use the feature store during training in this example, and how does it pull data from it?

Thanks!


vrolok commented May 18, 2023

Just to add to the previous message: I guess the batch_serve_to_bq method combines the two. You have this code in the notebook:

ff_feature_store.batch_serve_to_bq(
    bq_destination_output_uri="bq://xxx.tx.train_table_20230513",
    serving_feature_ids=SERVING_FEATURE_IDS,
    read_instances_uri="bq://xxx.tx.ground_truth_20230513",
    pass_through_fields=["tx_amount", "tx_fraud"],
)

print(f"Feature values from feature store outputted to: {TRAIN_TABLE_URI}.")
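(For context, SERVING_FEATURE_IDS in the notebook maps each entity type to the feature IDs to serve. I haven't copied the exact list, so the shape below is an assumption inferred from the training-table columns:)

# Hypothetical shape of SERVING_FEATURE_IDS -- the entity types and
# feature names are assumptions inferred from the training-table
# columns, not copied from the notebook.
SERVING_FEATURE_IDS = {
    "customer": [
        "customer_id_nb_tx_60min_window",
        "customer_id_avg_amount_60min_window",
        # ...remaining customer features
    ],
    "terminal": [
        "terminal_id_nb_tx_60min_window",
        "terminal_id_risk_1day_window",
        # ...remaining terminal features
    ],
}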

But I still have 0s in the table after it.

| column | row 1 | row 2 | row 3 |
|---|---|---|---|
| tx_amount | 40.050000000 | 22.880000000 | 5.780000000 |
| tx_fraud | 0 | 0 | 0 |
| timestamp | 2023-05-13 19:47:09+00:00 | 2023-05-13 19:38:23+00:00 | 2023-05-13 21:30:51+00:00 |
| entity_type_customer | 0157007812465732 | 4066333781802858 | 9547043372742072 |
| customer_id_avg_amount_14day_window | 43.592381 | 21.637674 | 10.287895 |
| customer_id_nb_tx_7day_window | 14 | 25 | 23 |
| customer_id_avg_amount_15min_window | 0.0 | 0.0 | 0.0 |
| customer_id_nb_tx_30min_window | 0 | 0 | 0 |
| customer_id_avg_amount_1day_window | 46.887500 | 22.880000 | 5.780000 |
| customer_id_avg_amount_60min_window | 0.0 | 0.0 | 0.0 |
| customer_id_nb_tx_15min_window | 0 | 0 | 0 |
| customer_id_nb_tx_14day_window | 21 | 43 | 38 |
| customer_id_avg_amount_30min_window | 0.0 | 0.0 | 0.0 |
| customer_id_nb_tx_1day_window | 4 | 1 | 1 |
| customer_id_nb_tx_60min_window | 0 | 0 | 0 |
| customer_id_avg_amount_7day_window | 43.855000 | 22.679600 | 10.492609 |
| entity_type_terminal | 17910084 | 17910084 | 17910084 |
| terminal_id_risk_14day_window | 0.0 | 0.0 | 0.0 |
| terminal_id_avg_amount_15min_window | 0.0 | 0.0 | 0.0 |
| terminal_id_nb_tx_7day_window | 236 | 236 | 236 |
| terminal_id_nb_tx_30min_window | 0 | 0 | 0 |
| terminal_id_avg_amount_60min_window | 0.0 | 0.0 | 0.0 |
| terminal_id_risk_1day_window | 0.0 | 0.0 | 0.0 |
| terminal_id_nb_tx_15min_window | 0 | 0 | 0 |
| terminal_id_avg_amount_30min_window | 0.0 | 0.0 | 0.0 |
| terminal_id_nb_tx_14day_window | 256 | 256 | 257 |
| terminal_id_nb_tx_60min_window | 0 | 0 | 0 |
| terminal_id_nb_tx_1day_window | 34 | 35 | 31 |
| terminal_id_risk_7day_window | 0.0 | 0.0 | 0.0 |

polong-lin (Member) commented:

Yep exactly -- you use batch_serve_to_bq() to bring the data from Feature Store back to BigQuery before training using BigQuery ML.

The reason you're most likely seeing the streaming features as 0 is that in feature_engineering_batch.ipynb, you start by creating placeholder values of 0 for all the streaming features, and those placeholders will likely make up the majority of your training data anyway. It's only through Dataflow that the streaming features get properly computed, and they really only start being computed from feature_engineering_streaming.ipynb onward. However, since there's only approximately 1 incoming tx per second from Pub/Sub, the number of streaming feature values computed probably isn't very high by the time you actually export the data for training, at least compared to the rest of the training data (which still has the placeholder value of 0).

I realize it's a bit confusing this way, but think of it as an illustrative example of how this might begin to work in the long term: the Dataflow pipeline will, over time, populate the streaming feature values.
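If you want to sanity-check that ratio, a rough query like the one below shows how much of the training data still carries the 0 placeholders (just a sketch; the table name comes from your earlier message, so adjust as needed):

# Sketch: count rows with a non-zero streaming feature vs. the total.
# Table name taken from the earlier message in this thread.
sql_ratio = """
SELECT
  COUNTIF(customer_id_nb_tx_60min_window > 0) AS rows_with_streaming_values,
  COUNT(*) AS total_rows
FROM
  `tx.train_table_20230513`
"""
run_bq_query(sql_ratio)

If rows_with_streaming_values is a tiny fraction of total_rows, the model was effectively trained on the placeholders.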

Does this help?

vrolok commented May 19, 2023

Thanks for your response. I had a feeling that this could be the case. I've let the Dataflow pipeline run for some time, and now it shows this:

[Screenshot: Dataflow pipeline metrics after running for some time]

Does this mean I should now have data from the day the pipeline started up through the previous day? So, if I go back and create new terminal & customer tables (tx.terminal/customer_20230518), re-ingest feature values into the feature store with customer_entity_type.ingest_from_bq... & terminal_entity_type.ingest_from_bq..., re-create the ground-truth table for the new day, and run batch_serve_to_bq, should the resulting tx.train_table_20230518 have all the feature values?
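In code, I'm picturing roughly the following (just a sketch; the feature_time column name, the feature ID lists, and the entity_id_field values are my assumptions, not copied from the notebook):

# Re-ingest the recomputed feature values for both entity types.
# The feature_time column, feature ID lists, and entity_id_field
# values below are assumptions.
customer_entity_type.ingest_from_bq(
    feature_ids=CUSTOMER_FEATURE_IDS,
    feature_time="feature_ts",
    bq_source_uri="bq://xxx.tx.customer_20230518",
    entity_id_field="customer_id",
)
terminal_entity_type.ingest_from_bq(
    feature_ids=TERMINAL_FEATURE_IDS,
    feature_time="feature_ts",
    bq_source_uri="bq://xxx.tx.terminal_20230518",
    entity_id_field="terminal_id",
)

# Then re-run the batch serve against the new ground-truth table.
ff_feature_store.batch_serve_to_bq(
    bq_destination_output_uri="bq://xxx.tx.train_table_20230518",
    serving_feature_ids=SERVING_FEATURE_IDS,
    read_instances_uri="bq://xxx.tx.ground_truth_20230518",
    pass_through_fields=["tx_amount", "tx_fraud"],
)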

polong-lin (Member) commented:

Yes indeed, you can try that.

Once extracted, before trying to train again, you can also filter in BQ for non-zero values (e.g., on one of the streaming columns) just to make sure you're getting values in the data you've retrieved from Feature Store:

For example:

sql_inspect = f"""
SELECT
  *
FROM
  `tx.{BQ_TABLE_NAME}`
WHERE
  customer_id_nb_tx_60min_window > 0  # just to check whether one of these streaming columns is non-zero
LIMIT 10
"""
run_bq_query(sql_inspect)

If this gives you results, then you might need to scroll to the right to visually confirm the streaming feature values, and then you're good to go.

If it still gives you empty results, could you report back here too? That might be an indication of a bug (perhaps in the notebook code) if that's the case.
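One more sanity check you could try (a sketch, assuming the entity-type objects from the notebook): read a single entity directly from the online store and compare it with the corresponding row in the exported table. Note that read() serves from the online store, while batch_serve_to_bq exports from offline storage as of the timestamps in your read-instances table, so the two can legitimately differ:

# Sketch: fetch the latest online values for one entity. The entity ID
# is just the first customer from the table posted above.
df_online = customer_entity_type.read(
    entity_ids=["0157007812465732"],
    feature_ids=["customer_id_nb_tx_60min_window"],
)
print(df_online)

If the online value is non-zero but the exported row is still 0, the timestamps in your read-instances table may simply predate the streamed feature values.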

vrolok commented May 19, 2023

I think there is a bug or a permissions issue.

Here is the data for customer & terminal in the feature store:
[Screenshot: customer & terminal feature values in Feature Store]

But the table after batch_serve_to_bq doesn't have streaming data:
[Screenshot: training table after batch_serve_to_bq, streaming columns still 0]

The compute SA has these permissions:
[Screenshot: IAM roles granted to the compute service account]
