
118 distilbert api #132

Merged: 33 commits, Aug 10, 2023

Commits
8556d2b
started basic framework for dockerizing sentiment endpoint
yiwen-h Jul 3, 2023
243c69b
working docker container - cant use Alpine
yiwen-h Jul 3, 2023
3ad8e6b
got docker container to mount data folder, accept filename as argument
yiwen-h Jul 4, 2023
64fe780
json input file now deleted if NOT run locally
yiwen-h Jul 14, 2023
86414f2
Predictions now outputted as json file in data_out folder
yiwen-h Jul 14, 2023
b726388
added label to dockerfile
yiwen-h Jul 18, 2023
6ab83a3
added most tests for docker_run
yiwen-h Jul 18, 2023
da605dc
added larger json file - about 8000 comments
yiwen-h Jul 19, 2023
e853659
wrote get_y_score function
yiwen-h Aug 4, 2023
9f135e6
prediction dfs now include probabilities as well
yiwen-h Aug 4, 2023
439dece
prediction dfs now include probabilities for sklearn multilabel
yiwen-h Aug 4, 2023
78632fc
added macro roc auc score to model summary
yiwen-h Aug 4, 2023
db180e0
write_model_preds now uses probabilities from predict_multilabel df o…
yiwen-h Aug 4, 2023
6eb5259
added model_performance.additional_analysis which calculates confusio…
yiwen-h Aug 4, 2023
42b5ec6
confusion matrix info and roc_auc_score now in model analysis
yiwen-h Aug 7, 2023
368f9d8
Replaced macro roc_aoc score with average_precision_score in perf ana…
yiwen-h Aug 9, 2023
bfb6dca
renamed "support" column to be more userfriendly
yiwen-h Aug 9, 2023
40a3cfd
fixed ruff complaining about == instead of isinstance in tests
yiwen-h Aug 9, 2023
f1256ed
fixed ruff complaining about == instead of isinstance in test_factory…
yiwen-h Aug 9, 2023
301c8b7
some broken dependencies causing test to fail, trying to fix pyprojec…
yiwen-h Aug 9, 2023
c815088
Merge pull request #131 from CDU-data-science-team/126_ROC
yiwen-h Aug 9, 2023
18bf54d
started basic framework for dockerizing sentiment endpoint
yiwen-h Jul 3, 2023
d4f0831
working docker container - cant use Alpine
yiwen-h Jul 3, 2023
784c306
got docker container to mount data folder, accept filename as argument
yiwen-h Jul 4, 2023
98c1846
json input file now deleted if NOT run locally
yiwen-h Jul 14, 2023
72c5a24
Predictions now outputted as json file in data_out folder
yiwen-h Jul 14, 2023
1a30581
added label to dockerfile
yiwen-h Jul 18, 2023
dfe0907
added most tests for docker_run
yiwen-h Jul 18, 2023
f3549b0
added larger json file - about 8000 comments
yiwen-h Jul 19, 2023
12a0f79
added cache removal to dockerfile in bid to reduce size
yiwen-h Jul 19, 2023
eb94f9a
fewer layers, slim-debian, for smaller size
yiwen-h Jul 19, 2023
866154c
updated dockerfile to reduce size
yiwen-h Aug 9, 2023
ca2e483
mocking load_model in test_predict_sentiment
yiwen-h Aug 9, 2023
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -9,6 +9,7 @@ repos:
args: [ "--maxkb=750000" ]
- id: end-of-file-fixer
name: Check for a blank line at the end of scripts (auto-fixes)
exclude: 'json'
- id: trailing-whitespace
name: Check for trailing whitespaces (auto-fixes)
- repo: https://github.com/pycqa/isort
14 changes: 14 additions & 0 deletions Dockerfile
@@ -0,0 +1,14 @@
FROM python:3.10.12-slim-bookworm
VOLUME /data

COPY docker-requirements.txt requirements.txt
RUN pip install --upgrade pip setuptools \
&& pip install -r requirements.txt \
&& rm -rf /root/.cache

COPY api/bert_sentiment bert_sentiment
COPY --chmod=755 docker_run.py docker_run.py

LABEL org.opencontainers.image.source=https://github.com/cdu-data-science-team/pxtextmining

ENTRYPOINT ["python3", "docker_run.py"]
4 changes: 4 additions & 0 deletions docker-requirements.txt
@@ -0,0 +1,4 @@
pandas==1.5.3 ; python_version >= "3.8" and python_version < "3.11"
scikit-learn==1.0.2 ; python_version >= "3.8" and python_version < "3.11"
tensorflow==2.12.0 ; python_version >= "3.8" and python_version < "3.11"
pxtextmining==0.5.4
22 changes: 22 additions & 0 deletions docker_data/data_in/file_01.json
@@ -0,0 +1,22 @@
[
{
"comment_id": "1",
"comment_text": "Nurse was great.",
"question_type": "what_good"
},
{
"comment_id": "2",
"comment_text": "The ward was freezing.",
"question_type": "could_improve"
},
{
"comment_id": "3",
"comment_text": "",
"question_type": "nonspecific"
},
{
"comment_id": "4",
"comment_text": "Thank you so much",
"question_type": "nonspecific"
}
]
1 change: 1 addition & 0 deletions docker_data/data_in/file_02.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docker_data/data_out/file_01.json
@@ -0,0 +1 @@
[{"comment_id": "1", "sentiment": 2.0}, {"comment_id": "2", "sentiment": 4.0}, {"comment_id": "3", "sentiment": "Labelling not possible"}, {"comment_id": "4", "sentiment": 1.0}]
105 changes: 105 additions & 0 deletions docker_run.py
@@ -0,0 +1,105 @@
import argparse
import json
import os

import pandas as pd
from tensorflow.keras.saving import load_model

from pxtextmining.factories.factory_predict_unlabelled_text import (
predict_sentiment_bert,
)


def load_sentiment_model():
model_path = "bert_sentiment"
if not os.path.exists(model_path):
model_path = os.path.join("api", model_path)
loaded_model = load_model(model_path)
return loaded_model


def get_sentiment_predictions(
text_to_predict, loaded_model, preprocess_text, additional_features
):
predictions = predict_sentiment_bert(
text_to_predict,
loaded_model,
preprocess_text=preprocess_text,
additional_features=additional_features,
)
return predictions


def predict_sentiment(items):
"""Accepts comment ids, comment text and question type as JSON in a POST request. Makes predictions using trained Tensorflow Keras model.

Args:
items (List[ItemIn]): JSON list of dictionaries with the following compulsory keys:
- `comment_id` (str)
- `comment_text` (str)
- `question_type` (str)
The 'question_type' must be one of three values: 'nonspecific', 'what_good', and 'could_improve'.
For example, `[{'comment_id': '1', 'comment_text': 'Thank you', 'question_type': 'what_good'},
{'comment_id': '2', 'comment_text': 'Food was cold', 'question_type': 'could_improve'}]`

Returns:
        (list[dict]): List of dictionaries, each with keys `comment_id` and `sentiment`. Comments that could not be labelled have `sentiment` set to "Labelling not possible".
"""

# Process received data
loaded_model = load_sentiment_model()
df = pd.DataFrame([i for i in items], dtype=str)
df_newindex = df.set_index("comment_id")
if df_newindex.index.duplicated().sum() != 0:
raise ValueError("comment_id must all be unique values")
df_newindex.index.rename("Comment ID", inplace=True)
text_to_predict = df_newindex[["comment_text", "question_type"]]
text_to_predict = text_to_predict.rename(
columns={"comment_text": "FFT answer", "question_type": "FFT_q_standardised"}
)
# Make predictions
preds_df = get_sentiment_predictions(
text_to_predict, loaded_model, preprocess_text=False, additional_features=True
)
# Join predicted labels with received data
preds_df["comment_id"] = preds_df.index.astype(str)
merged = pd.merge(df, preds_df, how="left", on="comment_id")
merged["sentiment"] = merged["sentiment"].fillna("Labelling not possible")
return_dict = merged[["comment_id", "sentiment"]].to_dict(orient="records")
return return_dict


def parse_args():
"""Parse command line arguments"""
parser = argparse.ArgumentParser()
parser.add_argument(
"json_file",
nargs=1,
help="Name of the json file",
)
parser.add_argument(
"--local-storage",
"-l",
action="store_true",
help="Use local storage (instead of Azure)",
)
args = parser.parse_args()

return args


def main():
args = parse_args()
json_file = os.path.join("data", "data_in", args.json_file[0])
with open(json_file, "r") as jf:
json_in = json.load(jf)
if not args.local_storage:
os.remove(json_file)
json_out = predict_sentiment(json_in)
out_path = os.path.join("data", "data_out", args.json_file[0])
with open(out_path, "w+") as jf:
json.dump(json_out, jf)


if __name__ == "__main__":
main()
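
A minimal usage sketch for `predict_sentiment` (assuming the `bert_sentiment` model directory and pxtextmining are available locally); inside the container, `main()` reads the same structure from `data/data_in/<file>.json` and writes the result to `data/data_out/<file>.json`. The input and expected output below are taken from `docker_data/data_in/file_01.json` and `docker_data/data_out/file_01.json`; actual sentiment values depend on the trained model:

from docker_run import predict_sentiment

items = [
    {"comment_id": "1", "comment_text": "Nurse was great.", "question_type": "what_good"},
    {"comment_id": "3", "comment_text": "", "question_type": "nonspecific"},
]
preds = predict_sentiment(items)
# e.g. [{"comment_id": "1", "sentiment": 2.0},
#       {"comment_id": "3", "sentiment": "Labelling not possible"}]
print(preds)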
2 changes: 1 addition & 1 deletion poetry.lock

Some generated files are not rendered by default.

68 changes: 64 additions & 4 deletions pxtextmining/factories/factory_model_performance.py
@@ -132,7 +132,6 @@ def get_multilabel_metrics(
enhance_with_rules=enhance_with_rules,
already_encoded=already_encoded,
)
y_pred = np.array(y_pred_df)[:, :-1].astype("int64")
elif model_type == "sklearn":
y_pred_df = predict_multilabel_sklearn(
x_test,
@@ -143,17 +142,28 @@
enhance_with_probs=True,
enhance_with_rules=enhance_with_rules,
)
y_pred = np.array(y_pred_df)[:, :-1].astype("int64")
else:
raise ValueError(
'Please select valid model_type. Options are "bert" or "sklearn"'
)
y_pred = np.array(y_pred_df[labels]).astype("int64")
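    # y_pred_df carries one-hot label columns plus the "labels" and
    # 'Probability of ...' columns, so the label columns are selected by name.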
# Calculate various metrics
model_metrics["exact_accuracy"] = metrics.accuracy_score(y_test, y_pred)
model_metrics["hamming_loss"] = metrics.hamming_loss(y_test, y_pred)
model_metrics["macro_jaccard_score"] = metrics.jaccard_score(
y_test, y_pred, average="macro"
)
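    # y_probs holds the 'Probability of "<label>"' columns, one per target
    # label, i.e. an (n_samples, n_labels) array of scores for roc_auc_score
    # and label_ranking_average_precision_score.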
y_probs = y_pred_df.filter(like="Probability", axis=1)
model_metrics["macro_roc_auc"] = metrics.roc_auc_score(
y_test, y_probs, multi_class="ovr"
)
model_metrics[
"Label ranking average precision"
] = metrics.label_ranking_average_precision_score(
y_test,
y_probs,
)
# Model summary
if model_type in ("bert", "tf"):
stringlist = []
model.summary(print_fn=lambda x: stringlist.append(x))
@@ -218,14 +228,64 @@ def parse_metrics_file(metrics_file, labels):
"precision": [],
"recall": [],
"f1_score": [],
"support": [],
"support (label count in test data)": [],
}
for each in lines:
splitted = each.split(" ")
metrics_dict["label"].append(splitted[0].strip())
metrics_dict["precision"].append(splitted[1].strip())
metrics_dict["recall"].append(splitted[2].strip())
metrics_dict["f1_score"].append(splitted[3].strip())
metrics_dict["support"].append(splitted[4].strip())
metrics_dict["support (label count in test data)"].append(splitted[4].strip())
metrics_df = pd.DataFrame.from_dict(metrics_dict)
return metrics_df


def get_y_score(probs):
"""Converts probabilities into format (n_samples, n_classes) so they can be passed into sklearn roc_auc_score function

Args:
probs (np.ndarray): Probability estimates outputted by model

Returns:
np.ndarray: Probability estimates in format (n_samples, n_classes)
"""
if probs.ndim == 3:
score = np.transpose([pred[:, 1] for pred in probs])
elif probs.ndim == 2:
score = probs
return score
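# Illustrative example: sklearn multilabel models can return predict_proba as a
# list of per-label (n_samples, 2) arrays, i.e. shape (n_labels, n_samples, 2).
# get_y_score keeps the positive-class column for each label and transposes:
#
#   probs = np.array([[[0.9, 0.1], [0.2, 0.8]],
#                     [[0.4, 0.6], [0.7, 0.3]]])  # 2 labels, 2 samples
#   get_y_score(probs)
#   # array([[0.1, 0.6],
#   #        [0.8, 0.3]])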


def additional_analysis(preds_df, y_true, labels):
"""For given predictions, returns dataframe containing: macro one-vs-one ROC AUC score, number of True Positives, True Negatives, False Positives, and False Negatives.

Args:
preds_df (pd.DataFrame): Dataframe containing predicted labels in one-hot encoded format
y_true (np.array): One-hot encoded real Y values
labels (List): List of the target labels

Returns:
        pd.DataFrame: Dataframe with one row per label, containing the number of True Negatives, False Negatives, True Positives, and False Positives for that label, plus its average precision score.
"""
# include threshold?? (later)
y_score = np.array(preds_df.filter(like="Probability", axis=1))
cm = metrics.multilabel_confusion_matrix(y_true, np.array(preds_df[labels]))
cm_dict = {}
average_precision = {}
for i, label in enumerate(labels):
cm_meaning = {}
tn, fp = cm[i][0]
fn, tp = cm[i][1]
cm_meaning["True Negative"] = tn
cm_meaning["False Negative"] = fn
cm_meaning["True Positive"] = tp
cm_meaning["False Positive"] = fp
cm_dict[label] = cm_meaning
average_precision[label] = metrics.average_precision_score(
y_true[:, i], y_score[:, i]
)
df = pd.DataFrame.from_dict(cm_dict, orient="index")
average_precision = pd.Series(average_precision)
df["average_precision_score"] = average_precision
return df
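
A brief usage sketch for `additional_analysis`; the label names here are purely illustrative, and a real `preds_df` would come from `predict_multilabel_sklearn` or `predict_multilabel_bert`, which append the probability columns:

import numpy as np
import pandas as pd

from pxtextmining.factories.factory_model_performance import additional_analysis

labels = ["Access", "Staff"]  # illustrative label names
preds_df = pd.DataFrame(
    {
        "Access": [1, 0, 1],
        "Staff": [0, 1, 1],
        'Probability of "Access"': [0.9, 0.2, 0.7],
        'Probability of "Staff"': [0.1, 0.8, 0.6],
    }
)
y_true = np.array([[1, 0], [0, 1], [1, 1]])
analysis_df = additional_analysis(preds_df, y_true, labels)
# One row per label: True Negative, False Negative, True Positive and False
# Positive counts, plus an average_precision_score column.
print(analysis_df)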
8 changes: 8 additions & 0 deletions pxtextmining/factories/factory_predict_unlabelled_text.py
@@ -84,6 +84,11 @@ def predict_multilabel_sklearn(
predictions[row][label_index] = 1
preds_df = pd.DataFrame(predictions, index=processed_text.index, columns=labels)
preds_df["labels"] = preds_df.apply(get_labels, args=(labels,), axis=1)
# add probs to df
if pred_probs.ndim == 3:
pred_probs = np.transpose([pred[:, 1] for pred in pred_probs])
label_list = ['Probability of "' + label + '"' for label in labels]
preds_df[label_list] = pred_probs
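    # e.g. with labels = ["Access", "Staff"] (illustrative), label_list is
    # ['Probability of "Access"', 'Probability of "Staff"'], adding one
    # probability column per label alongside the one-hot predictions.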
return preds_df


@@ -142,6 +147,9 @@ def predict_multilabel_bert(
predictions = y_binary
preds_df = pd.DataFrame(predictions, index=processed_text.index, columns=labels)
preds_df["labels"] = preds_df.apply(get_labels, args=(labels,), axis=1)
# add probs to df
label_list = ['Probability of "' + label + '"' for label in labels]
preds_df[label_list] = y_probs
return preds_df

