-
Notifications
You must be signed in to change notification settings - Fork 984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Dagster Data pipeline #798
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round of feedback, focusing on prediction part.
"pandas" | ||
) | ||
.copy_local_file(local_path="/tmp/tabby_model_id", remote_path="/tmp/tabby_model_id") | ||
.run_function(download_model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use https://modal.com/docs/reference/modal.Image#env to pass MODEL_ID
as environment variable, and in download_model
you can access model id as os.environ.get("MODEL_ID")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.env("MODEL_ID", os.environ.get("MODEL_ID"))
) | ||
.dockerfile_commands("ENTRYPOINT []") | ||
.pip_install( | ||
"git+https://github.com/TabbyML/tabby.git#egg=tabby-python-client&subdirectory=experimental/eval/tabby-python-client", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
embed this directory inside tabby-data-pipeline, use https://modal.com/docs/reference/modal.Image#copy_local_dir to copy and run pip install
|
||
my_env = os.environ.copy() | ||
my_env["TABBY_DISABLE_USAGE_COLLECTION"] = "1" | ||
MODEL_ID = os.popen("cat /tmp/tabby_model_id").read().strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
from tabby_python_client.api.v1 import health | ||
|
||
resp = await health.asyncio(client=self.client) | ||
return resp.to_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just return resp?
return resp.to_dict() | ||
|
||
@method() | ||
async def complete(self, language, crossfile_context, index, row): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type annotation for every argument. For row
, define a named tuple: https://docs.python.org/3/library/typing.html#typing.NamedTuple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to list all column name of the dataframe in the namedtuple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For columns used for input / output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if there's a dynamic column? For example, "prediction" didn't exist when you first run the file and it will be added to the file. Next time, we need to pass it through the "row"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then it can be a optional typed field:
https://docs.python.org/3/library/typing.html?highlight=optional#typing.Optional
|
||
df = pd.DataFrame(objs) | ||
|
||
outputs = await asyncio.gather(*[model.complete.remote.aio(language, crossfile_context, index, row) for index, row in df.iterrows()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it still necessary to run this in chunks?
|
||
|
||
@stub.local_entrypoint() | ||
async def main(language, file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type hints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put prediction code under modal
directory.
LAUNCH_FLAGS = ["serve", "--model", MODEL_ID, "--port", "8000", "--device", "cuda"] | ||
self.launcher = subprocess.Popen(["/opt/tabby/bin/tabby"] + LAUNCH_FLAGS, env=my_env) | ||
self.client = Client("http://127.0.0.1:8000", timeout=240) | ||
|
||
# Poll until webserver at 127.0.0.1:8000 accepts connections before running inputs. | ||
def webserver_ready(): | ||
try: | ||
socket.create_connection(("127.0.0.1", 8000), timeout=1).close() | ||
socket.create_connection(("127.0.0.1", 8000), timeout=30).close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
socket.create_connection(("127.0.0.1", 8000), timeout=30).close() | |
socket.create_connection(("127.0.0.1", 8000), timeout=1).close() |
Line 91 already contains retry logic, no need to increase timeout
LAUNCH_FLAGS = ["serve", "--model", MODEL_ID, "--port", "8000", "--device", "cuda"] | ||
self.launcher = subprocess.Popen(["/opt/tabby/bin/tabby"] + LAUNCH_FLAGS, env=my_env) | ||
self.client = Client("http://127.0.0.1:8000", timeout=240) | ||
|
||
# Poll until webserver at 127.0.0.1:8000 accepts connections before running inputs. | ||
def webserver_ready(): | ||
try: | ||
socket.create_connection(("127.0.0.1", 8000), timeout=1).close() | ||
socket.create_connection(("127.0.0.1", 8000), timeout=30).close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
socket.create_connection(("127.0.0.1", 8000), timeout=30).close() | |
socket.create_connection(("127.0.0.1", 8000), timeout=1).close() |
Line 91 already contains retry logic, no need to increase timeout
@@ -111,15 +115,10 @@ async def complete(self, language, crossfile_context, index, row): | |||
from tabby_python_client import errors | |||
import pandas as pd | |||
|
|||
if 'prediction' in row and not pd.isnull(row['prediction']): | |||
# if prediction exists, just skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if prediction exists, you can simply don't call complete
function
|
||
for file in ['line_completion.jsonl', 'line_completion_rg1_bm25.jsonl', 'line_completion_oracle_bm25.jsonl']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please extract function for this, e.g read_pandas_frame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this meant? extract function for the "for" loop? or do you mean extract function for reading all the three files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either interface is fine, it's just good to split the main function here into smaller chunks for better readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This directory should be in side .gitignore
|
||
context.add_output_metadata(metadata={"model_id": MetadataValue.md(model_id)}) | ||
|
||
files = 'line_completion.jsonl, line_completion_rg1_bm25.jsonl, line_completion_oracle_bm25.jsonl' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where's these files? Should they being added as assets?
|
||
|
||
model = model_id.split("/")[-1] | ||
for file in ["line_completion.jsonl", "line_completion_rg1_bm25.jsonl", "line_completion_oracle_bm25.jsonl"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't each file itself being an asset?
Int, | ||
file_relative_path | ||
) | ||
from . import analyze, create_csv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's no need to extract utility functions. You could just define the asset as python function and orgnize them in individual python files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are individual python files, but I have to import them and call them in the assets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they don't have to be. They can be defined as asset
directly.
python/tabby-eval/.gitignore
Outdated
@@ -0,0 +1,2 @@ | |||
tmp* | |||
tabby_data_pipeline.egg-info |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you still need to remove the directory from this commit
python/tabby/trainer.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't remove this file
python/tabby-eval/log.txt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in git ignore as well
There are 5 assets: