
[bug] Triton BLS scripting fails with model not ready #3299

Open
david-waterworth opened this issue Sep 1, 2023 · 14 comments

david-waterworth commented Sep 1, 2023

Checklist

  • [Y] I've prepended issue tag with type of change: [bug]
  • (If applicable) I've attached the script to reproduce the bug
  • [Y] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • [Y] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description:

I'm using Triton BLS (Business Logic Scripting) to call another model from a Python model. My problem occurs when I use the official NVIDIA Triton examples, specifically the Sync BLS Model example.

This example consists of three models, "bls_sync", "add_sub" and "pytorch", all using the Python backend; the default model is "bls_sync". When "bls_sync" is invoked it forwards the request to one of the other two models using the following code:

      # Get the two data inputs that are forwarded below
      in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
      in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1")

      # Get Model Name
      model_name = pb_utils.get_input_tensor_by_name(request, "MODEL_NAME")

      # Model Name string
      model_name_string = model_name.as_numpy()[0]

      # Create inference request object
      infer_request = pb_utils.InferenceRequest(
          model_name=model_name_string,
          requested_output_names=["OUTPUT0", "OUTPUT1"],
          inputs=[in_0, in_1],
      )

      # Perform synchronous blocking inference request
      infer_response = infer_request.exec()
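
For context, the example then checks the response for errors and unpacks the output tensors roughly as follows (paraphrased from the NVIDIA example, so treat the exact lines as approximate):

      # Surface any downstream failure as a TritonModelException
      if infer_response.has_error():
          raise pb_utils.TritonModelException(infer_response.error().message())

      # Unpack the outputs produced by the downstream model
      output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0")
      output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1")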

This works fine using the NVIDIA Triton image (v23.02), but I cannot get it to work on SageMaker; the infer_request.exec() call is what fails.

DLC image/dockerfile:

I'm using 355873309152.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-tritonserver:23.02-py3

Current behavior:

I've tried two alternatives. The first is to define the default model, i.e.

container = {
    "Image": image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "bls_sync"},
    #"Mode": "MultiModel",
}

And invoke

response = client.invoke_endpoint(
    EndpointName="triton-bls-sync-endpoint",
    ContentType=f"application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}",
    Body=request_body,
    #TargetModel="bls_sync"
)
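
For completeness, request_body and header_length above are generated along the lines of the AWS Triton sample notebooks. The sketch below is illustrative only: the tensor names, shapes and dtypes are assumptions based on the BLS example, and _get_inference_request is a private tritonclient helper whose module path and signature vary between releases.

import numpy as np
import tritonclient.http as httpclient

# Inputs expected by bls_sync: two data tensors plus the MODEL_NAME routing string
# (shapes/dtypes are assumptions; match them to the example's config.pbtxt)
inputs = [
    httpclient.InferInput("INPUT0", [16], "FP32"),
    httpclient.InferInput("INPUT1", [16], "FP32"),
    httpclient.InferInput("MODEL_NAME", [1], "BYTES"),
]
inputs[0].set_data_from_numpy(np.random.rand(16).astype(np.float32), binary_data=True)
inputs[1].set_data_from_numpy(np.random.rand(16).astype(np.float32), binary_data=True)
inputs[2].set_data_from_numpy(np.array(["add_sub"], dtype=np.object_), binary_data=True)

outputs = [
    httpclient.InferRequestedOutput("OUTPUT0", binary_data=True),
    httpclient.InferRequestedOutput("OUTPUT1", binary_data=True),
]

# Private helper used by the AWS sample notebooks to build the binary+json payload;
# in some tritonclient versions it is exposed as httpclient._get_inference_request instead
request_body, header_length = httpclient._utils._get_inference_request(
    inputs=inputs,
    outputs=outputs,
    request_id="",
    sequence_id=0,
    sequence_start=False,
    sequence_end=False,
    priority=0,
    timeout=None,
)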

With that setup, the infer_request.exec() call from bls_sync to add_sub fails:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{"error":"Failed to process the request(s) for model instance 'bls_sync_0', message: TritonModelException: Failed for execute the inference request. Model 'add_sub' is not ready.\n\nAt:\n  /opt/ml/model/bls_sync/1/model.py(112): execute\n"}"

The alternative I've tried is to set "MultiModel" and pass TargetModel="bls_sync", but this also errors:

ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Failed to download model data(bucket: ***.com, key: models/triton-bls-sync/models.tar.gzbls_sync). Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the model.

Neither seems to behave in the same manner as the NVIDIA Triton container. Triton ensemble models work fine, but BLS scripting doesn't, nor does it appear possible to invoke different models from a Triton model repository deployed to SageMaker (the MultiModel option appears to download a separate repository rather than invoke a model in the existing one?).

Expected behavior:

An inference request between models in the same Triton container should work in the same manner as it does using the official NVidia Triton image.

Additional context:

@david-waterworth
Author

I notice that a load-model API has been added to triton_python_backend_utils (but only after the most recent release):

https://github.com/triton-inference-server/python_backend/blob/6f369ef10312ad3db0aef73df5b1138b8467cf14/src/pb_stub.cc#L413

From what I've read, an ensemble automatically loads the models it uses (since Triton can infer them from the config), but that's not necessarily the case when you dynamically invoke another model via BLS. So I suspect that, for whatever reason, the 'add_sub' model hasn't been loaded at the point when I invoke bls_sync.

Unfortunately, it looks like I may have to wait until 23.09 is released by NVIDIA and then added to SageMaker if I want to explicitly load the dependent models from the BLS model (i.e. in the initialize method).
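
Once that API is available, I'd expect the dependent model could be loaded from the initialize method along these lines (a sketch based on the pb_stub.cc source linked above; the Python-side names and signatures are assumptions until a release documents them):

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Hypothetical use of the newer BLS model-management API
        # (not present in 23.02); names and arguments may differ in the release
        if not pb_utils.is_model_ready(model_name="add_sub"):
            pb_utils.load_model(model_name="add_sub")

    def execute(self, requests):
        ...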

@david-waterworth
Author

david-waterworth commented Sep 2, 2023

I found a solution based on https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-triton/business_logic_scripting/stable_diffusion/model_repository/pipeline/1/model.py

There's a note in the accompanying notebook that confirms my hypothesis above: you need to explicitly load any model that is invoked via BLS.

[Screenshot of the note from the accompanying notebook]

This is a bit of a workaround; I would suggest adding an explicit SAGEMAKER_TRITON_ADDITIONAL_ARGS variable so the extra arguments are passed explicitly rather than "injected" via a log flag (the hack is sketched below), but it's probably not needed once Triton releases the container with the ability to load/unload models from BLS and AWS integrates it.
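
Concretely, the hack amounts to appending the extra --load-model flags to one of the log-level variables, which presumably get pasted verbatim onto the tritonserver command line. A sketch for the repository in this issue (untested as written; model names follow the example above):

container = {
    "Image": image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {
        "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "bls_sync",
        # Workaround: ride the extra --load-model flags in on the log setting
        "SAGEMAKER_TRITON_LOG_INFO": "false --load-model=add_sub --load-model=pytorch",
    },
}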

@nskool
Contributor

nskool commented Sep 6, 2023

@david-waterworth

Right, to load the BLS model explicitly along with the other models it refers to, they'd all have to be loaded together using the --load-model argument. The above hack is for previous versions of SM-Triton.

SM Triton v23.06 onwards enables SAGEMAKER_TRITON_ADDITIONAL_ARGS for specifying additional arguments to the Triton startup command.

@david-waterworth
Author

Thanks. Migrating to v23.06+ is something I intend to do, but I haven't had time to create and test a py310 conda-pack environment, which has held me back.

@david-waterworth
Author

david-waterworth commented Sep 11, 2023

@nskool I'm having issues getting this to work with v23.06.

Using the following to define the model works fine with 23.02:

container = {
    "Image": image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {
        "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "pipeline",
        "SAGEMAKER_TRITON_LOG_INFO": "false --load-model=encoder",
    },
}

response = sagemaker_client.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=container, VpcConfig=vpc_config
)

But when I tried 23.06 with the setup below, it only loaded the pipeline model; the log doesn't mention the encoder model (which uses the ONNX backend) at all.

container = {
    "Image": image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {
        "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "pipeline",
        "SAGEMAKER_TRITON_LOG_VERBOSE": "true",
        "SAGEMAKER_TRITON_LOG_INFO": "true",
        "SAGEMAKER_TRITON_LOG_WARNING": "true",
        "SAGEMAKER_TRITON_LOG_ERROR": "true",
        "SAGEMAKER_TRITON_ADDITIONAL_ARGS": "--load-model=encoder",
    },
}

If I start Triton locally using the serve script it works fine in SageMaker mode, but v23.06 doesn't appear to be using the environment variables passed to the model?

@david-waterworth
Author

david-waterworth commented Sep 12, 2023

Also, when I use the same Environment hack on both 23.02 and 23.06, it works, i.e.:

container = {
    "Image": image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {
        "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "pipeline",
        "SAGEMAKER_TRITON_LOG_INFO": "false --load-model=encoder",
    },
}

response = sagemaker_client.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=container, VpcConfig=vpc_config
)

It's only when I move the --load-model=encoder to SAGEMAKER_TRITON_ADDITIONAL_ARGS that it fails to load the encoder model at all. But if I omit SAGEMAKER_TRITON_DEFAULT_MODEL_NAME it correctly reports that there is more than one model.

@nskool
Contributor

nskool commented Sep 21, 2023

@david-waterworth I will look into this, perhaps some ordering issue with the additional_args vs. log_info workaround. Please continue using the workaround. I will update the thread once I have more info.

@jadhosn

jadhosn commented Mar 8, 2024

@david-waterworth I will look into this, perhaps some ordering issue with the additional_args vs. log_info workaround. Please continue using the workaround. I will update the thread once I have more info.

@nskool any updates on the thread above?

@FrikadelleHelle

So the above hack with --load-model in the "SAGEMAKER_TRITON_LOG_INFO" is the only supported way now?

@david-waterworth
Author

@FrikadelleHelle it's the only way I've got it to work! I've updated Triton containers a couple of times and this doesn't appear to have been addressed (@nskool can you please confirm?).

@FrikadelleHelle

FrikadelleHelle commented Apr 16, 2024

Thanks for commenting on this stale issue @david-waterworth!
I'm actually also wondering whether you've gotten it to work with the container running in MultiModel mode; I see your original code has that commented out.

For me it seems I can deploy one BLS version in MME mode, but when I call other models it seems SageMaker has forgotten the --load-model args and it doesn't work (which makes sense, because that's not its intended use).
Interested to hear if you got that to work.

@david-waterworth
Author

@FrikadelleHelle no, I didn't manage to get it to work, I suspect because it doesn't fit the general SageMaker interface. The other thing that's frustrated me is that Triton has a more compact binary request/response format, but I'm using SageMaker async endpoints, and as far as I can tell there's no way to use it properly because you don't have access to the Triton server request/response headers (invoking the async endpoint doesn't let you pass additional headers to be included in the Triton request).

I feel that to get the best out of Triton you probably need to host it yourself (i.e. on an EC2 instance).

@jadhosn

jadhosn commented Apr 17, 2024

@FrikadelleHelle no, I didn't manage to get it to work, I suspect because it doesn't fit the general SageMaker interface. The other thing that's frustrated me is that Triton has a more compact binary request/response format, but I'm using SageMaker async endpoints, and as far as I can tell there's no way to use it properly because you don't have access to the Triton server request/response headers (invoking the async endpoint doesn't let you pass additional headers to be included in the Triton request).

I feel that to get the best out of Triton you probably need to host it yourself (i.e. on an EC2 instance).

Would that be solved if you were to set up a custom reverse proxy? (See the bring-your-own-container instructions for SageMaker, where they show how to set up a custom nginx.conf.)

In theory, that would declaw SageMaker's way of interacting with Triton running on a SageMaker endpoint and simply pass your request through as-is to be handled by your custom reverse proxy.

@david-waterworth
Author

@jadhosn the issue I have is specific to batch and async inference. In both of these configurations the actual model endpoint (i.e. Triton) is invoked by AWS, so you cannot pass headers: when you call invoke_endpoint_async it puts the request on a queue and returns immediately. Some internal process then consumes requests from the queue and forwards them to Triton, and in order to use binary mode it's at this point that the Triton headers would need to be added. But invoke_endpoint_async doesn't seem to provide any way of forwarding headers to the (eventual) model server request.
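
To illustrate the constraint, the async invocation API only exposes a handful of fields (sketch with hypothetical endpoint and S3 names; the payload itself has to already be in S3):

import boto3

runtime = boto3.client("sagemaker-runtime")

# Only ContentType / Accept / CustomAttributes are available as request metadata;
# there is no parameter for attaching arbitrary headers to the eventual Triton request
response = runtime.invoke_endpoint_async(
    EndpointName="triton-bls-sync-endpoint",              # hypothetical endpoint name
    InputLocation="s3://my-bucket/requests/payload.bin",  # hypothetical S3 object
    ContentType="application/octet-stream",
)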
