External contributions will be welcome soon, and they are greatly appreciated! Every little bit helps, and credit will always be given.
- Filing Issues
- Cloning the Repository
- Code Contributions
- Architectural Guidelines
- Single Responsibility Principle (SRP)
- Interface Segregation Principle (ISP)
- Dependency Inversion Principle (DIP)
- Physical Design Structure Mirroring Logical Design Structure
- Levelization
- Acyclic Dependencies Principle (ADP)
- Package Cohesion Principles
- Encapsulate What Varies
- Favor Composition Over Inheritance
- Clean Separation of Concerns (SoC)
- Principle of Least Knowledge (Law of Demeter)
- Document Assumptions and Decisions
- Continuous Integration and Testing
- Licensing
- Attribution
- Bug Reports, Feature Requests, and Documentation Issues: Please file an issue with a detailed description of the problem, feature request, or documentation issue. The NV-Ingest team will review and triage these issues, scheduling them for a future release.
DATASET_ROOT=[path to your dataset root]
MODULE_NAME=[]
MORPHEUS_ROOT=[path to your Morpheus root]
NV_INGEST_ROOT=[path to your NV-Ingest root]
git clone https://github.com/nv-morpheus/Morpheus.git $MORPHEUS_ROOT
git clone https://github.com/NVIDIA/nv-ingest.git $NV_INGEST_ROOT
cd $NV_INGEST_ROOT
Ensure all submodules are checked out:
git submodule update --init --recursive
- Finding an Issue: Start with issues labeled good first issue.
- Claim an Issue: Comment on the issue you wish to work on.
- Implement Your Solution: Dive into the code! Update or add unit tests as necessary.
- Submit Your Pull Request: Create a pull request once your code is ready.
- Code Review: Wait for the review by other developers and make necessary updates.
- Merge: Once approved, an NV-Ingest developer will merge your pull request.
For those familiar with the codebase, please check the project boards for issues. Look for unassigned issues and follow the steps starting from Claim an Issue.
- NV-Ingest Foundation: Built on top of NVIDIA Morpheus.
- Pipeline Structure: Designed around a pipeline that processes individual jobs within an asynchronous execution graph. Each job is processed by a series of stages or task handlers.
- Job Composition: Jobs consist of a data payload, metadata, and task specifications that determine the processing steps applied to the data.
- Job Submission:
  - A job is submitted as a JSON specification and converted into a ControlMessage, with the payload consisting of a cuDF dataframe.
  - For example:

           document_type  source_id  uuid  metadata
        0  pdf            somefile   1234  { ... }

  - The metadata column contents correspond to the schema-enforced metadata format of returned data.
- Pipeline Processing:
  - The ControlMessage is passed through the pipeline, where each stage processes the data and metadata as needed.
  - Subsequent stages may add, transform, or filter data as needed, with all resulting artifacts stored in the ControlMessage's payload.
  - For example, after processing, the payload may look like:

           document_type  source_id  uuid       metadata
        0  text           somefile   abcd-1234  {'content': "The quick brown fox jumped...", ...}
        1  image          somefile   efgh-5678  {'content': "base64 encoded image", ...}
        2  image          somefile   xyza-5618  {'content': "base64 encoded image", ...}
        3  image          somefile   zxya-5628  {'content': "base64 encoded image", ...}
        4  status         somefile   kvq9-5600  {'content': "", 'status': "filtered", ...}

  - A single job can result in multiple artifacts, each with its own metadata element definition.
- Job Completion:
  - Upon reaching the end of the pipeline, the ControlMessage is converted into a JobResult object and pushed to the ephemeral output queue for client retrieval.
  - JobResult objects consist of a dictionary containing:
    - data: A list of metadata artifacts produced by the job.
    - status: The job status as success or failure.
    - description: A human-readable description of the job status.
    - trace: A list of timing traces generated during the job's processing.
    - annotations: A list of task annotations generated during the job's processing.
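To make the shape of a job result concrete, here is a minimal sketch in plain Python. It is not the NV-Ingest implementation: `build_job_result` is a hypothetical helper, ordinary dicts stand in for payload rows, and the rule "success unless flagged as failed" is an assumption. Only the five dictionary keys come from the description above.

```python
from typing import Any, Dict, List, Optional


def build_job_result(
    rows: List[Dict[str, Any]],
    failed: bool = False,
    description: str = "",
    trace: Optional[List[Dict[str, Any]]] = None,
    annotations: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect payload rows into a JobResult-like dict."""
    return {
        # One metadata artifact per payload row.
        "data": [row["metadata"] for row in rows],
        "status": "failure" if failed else "success",
        "description": description,
        "trace": trace or [],
        "annotations": annotations or [],
    }


# Rows modeled on the example payload (values are illustrative only).
payload_rows = [
    {"document_type": "text", "source_id": "somefile", "uuid": "abcd-1234",
     "metadata": {"content": "The quick brown fox jumped..."}},
    {"document_type": "image", "source_id": "somefile", "uuid": "efgh-5678",
     "metadata": {"content": "base64 encoded image"}},
]

job_result = build_job_result(payload_rows, description="Job completed")
```

A client consuming such a result would typically check `status` first and only then iterate over `data`.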
- Dependencies are managed via Conda and Pip.
- Dependencies are stored in .yml files:
  - Service dependencies: 'conda/environments/nv_ingest_environment.yml'
  - Client dependencies: 'conda/environments/nv_ingest_client_environment.yml'
- To update dependencies:
  - Create a clean environment using the relevant .yml file.
  - Update the dependencies using Conda or Pip and validate the changes.
  - Update the .yml file by exporting the updated environment.
  - For example:

        conda env export --name nv_ingest_runtime --no-builds > conda/environments/nv_ingest_environment.yml
        conda env export --name nv_ingest_client --no-builds > conda/environments/nv_ingest_client_environment.yml
In NV-Ingest, decorators are used to enhance the functionality of functions by adding additional processing logic. These decorators help ensure consistency, traceability, and robust error handling across the pipeline. Below, we introduce some common decorators used in NV-Ingest, explain their usage, and provide examples.
The traceable decorator adds entry and exit trace timestamps to a ControlMessage's metadata. This helps in monitoring and debugging by recording the time taken for function execution.
Usage:
- To track function execution time with default trace names:

      @traceable()
      def process_message(message):
          pass

- To use a custom trace name:

      @traceable(trace_name="CustomTraceName")
      def process_message(message):
          pass
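To illustrate what a tracing decorator of this kind does, the following is a minimal sketch and not the NV-Ingest implementation. `traceable_sketch` is a hypothetical name, a plain dict stands in for the ControlMessage, and the `trace::entry::` / `trace::exit::` key format is an assumption made for this example.

```python
import functools
import time


def traceable_sketch(trace_name=None):
    """Hypothetical stand-in for `traceable`; messages are plain dicts here."""

    def decorator(func):
        # Fall back to the function name when no trace name is given.
        name = trace_name or func.__name__

        @functools.wraps(func)
        def wrapper(message, *args, **kwargs):
            metadata = message.setdefault("metadata", {})
            metadata[f"trace::entry::{name}"] = time.time_ns()
            result = func(message, *args, **kwargs)
            # The exit timestamp is written to the incoming message.
            metadata[f"trace::exit::{name}"] = time.time_ns()
            return result

        return wrapper

    return decorator


@traceable_sketch(trace_name="CustomTraceName")
def process_message(message):
    return message


traced = process_message({"metadata": {}})
```

The difference between the two recorded timestamps gives the time spent inside the wrapped function.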
The nv_ingest_node_failure_context_manager decorator wraps a function with failure-handling logic to manage potential failures involving ControlMessages. It ensures that failures are managed consistently, optionally raising exceptions or annotating the ControlMessage.
Usage:
- To handle failures with default settings:

      @nv_ingest_node_failure_context_manager(annotation_id="example_task")
      def process_message(message):
          pass

- To handle failures and allow empty payloads:

      @nv_ingest_node_failure_context_manager(annotation_id="example_task", payload_can_be_empty=True)
      def process_message(message):
          pass
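The general pattern behind a failure-handling decorator can be sketched as follows. This is an illustration under stated assumptions, not the NV-Ingest code: `failure_context_sketch` and its `raise_on_failure` parameter are hypothetical, a dict stands in for the ControlMessage, and the `failed` and `annotations` keys are invented for the example.

```python
import functools


def failure_context_sketch(annotation_id, payload_can_be_empty=False, raise_on_failure=False):
    """Hypothetical stand-in for the failure-handling decorator (dict messages)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(message, *args, **kwargs):
            try:
                if not payload_can_be_empty and not message.get("payload"):
                    raise ValueError("payload is empty")
                return func(message, *args, **kwargs)
            except Exception as exc:
                if raise_on_failure:
                    raise
                # Annotate the message instead of propagating the error.
                message["failed"] = True
                message.setdefault("annotations", []).append(
                    {"annotation_id": annotation_id, "error": str(exc)}
                )
                return message

        return wrapper

    return decorator


@failure_context_sketch(annotation_id="example_task")
def process_message(message):
    raise RuntimeError("boom")


failed_message = process_message({"payload": ["row"]})
```

Annotating rather than raising lets a failed message continue to the end of the pipeline, where the failure can be reported back to the client.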
The filter_by_task decorator checks if the ControlMessage contains any of the specified tasks. Each task can be a string of the task name or a tuple of the task name and task properties. If the message does not contain any listed task and/or task properties, the message is returned directly without calling the wrapped function, unless a forwarding function is provided.
Usage:
- To filter messages based on tasks:

      @filter_by_task(["task1", "task2"])
      def process_message(message):
          pass

- To filter messages based on tasks with specific properties:

      @filter_by_task([("task", {"prop": "value"})])
      def process_message(message):
          pass

- To forward messages to another function. This is necessary when the decorated function does not return the message directly, but instead forwards it to another function. In this case, the forwarding function should be provided as an argument to the decorator.

      @filter_by_task(["task1", "task2"], forward_func=other_function)
      def process_message(message):
          pass
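The matching rule described above (name only, or name plus properties, with an optional forwarding function) can be sketched as follows. This is a simplified illustration, not the NV-Ingest implementation: `filter_by_task_sketch` is a hypothetical name, and the message is assumed to be a dict with a `tasks` list of `{"name", "properties"}` entries.

```python
import functools


def filter_by_task_sketch(required_tasks, forward_func=None):
    """Hypothetical stand-in for `filter_by_task`; messages are plain dicts."""

    def matches(message, task):
        # A task is either a name or a (name, properties) tuple.
        name, props = task if isinstance(task, tuple) else (task, {})
        for candidate in message.get("tasks", []):
            if candidate["name"] != name:
                continue
            candidate_props = candidate.get("properties", {})
            if all(candidate_props.get(key) == value for key, value in props.items()):
                return True
        return False

    def decorator(func):
        @functools.wraps(func)
        def wrapper(message, *args, **kwargs):
            if any(matches(message, task) for task in required_tasks):
                return func(message, *args, **kwargs)
            # No listed task present: skip the wrapped function.
            return forward_func(message) if forward_func else message

        return wrapper

    return decorator


@filter_by_task_sketch(["task1", ("task2", {"prop": "value"})])
def process_message(message):
    message["processed"] = True
    return message


hit = process_message({"tasks": [{"name": "task2", "properties": {"prop": "value"}}]})
miss = process_message({"tasks": [{"name": "task2", "properties": {"prop": "other"}}]})
```

Here `hit` is processed because both the task name and its property match, while `miss` is returned untouched.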
The cm_skip_processing_if_failed decorator skips the processing of a ControlMessage if it has already failed. This ensures that no further processing is attempted on a failed message, maintaining the integrity of the pipeline.
Usage:
- To skip processing if the message has failed:

      @cm_skip_processing_if_failed
      def process_message(message):
          pass
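A skip-on-failure decorator is the simplest of these patterns. The sketch below is an illustration only: `skip_if_failed_sketch` is a hypothetical name, the message is a plain dict, and the `failed` key used to mark a failed message is an assumption.

```python
import functools


def skip_if_failed_sketch(func):
    """Hypothetical stand-in for `cm_skip_processing_if_failed` (dict messages)."""

    @functools.wraps(func)
    def wrapper(message, *args, **kwargs):
        if message.get("failed", False):
            # Pass failed messages through untouched.
            return message
        return func(message, *args, **kwargs)

    return wrapper


@skip_if_failed_sketch
def process_message(message):
    message["visited"] = True
    return message


ok = process_message({"failed": False})
skipped = process_message({"failed": True})
```

Because the failed message is returned unchanged, downstream stages still receive it and can report the original failure.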
TODO(Devin): Add details about adding a new stage or module once we have router node functionality in place.
Writing unit tests is essential for maintaining code quality and ensuring that changes do not introduce new bugs. In this project, we use pytest for running tests and adopt blackbox testing principles. Below are some common practices for writing unit tests, which are located in the [repo_root]/tests directory.
- Test Structure: Each test module should test a specific module or functionality within the codebase. The test module should be named test_<module_name>.py and reside on a mirrored physical path to its corresponding test target to be easily discoverable by pytest.
  - Example: nv_ingest/some_path/another_path/my_module.py should have a corresponding test file: tests/some_path/another_path/test_my_module.py.
- Test Functions: Each test function should focus on a single aspect of the functionality. Use descriptive names that clearly indicate what is being tested. For example, test_function_returns_correct_value or test_function_handles_invalid_input.
- Setup and Teardown: Use pytest fixtures to manage setup and teardown operations for your tests. Fixtures help in creating a consistent and reusable setup environment.
- Assertions: Use assertions to validate the behavior of the code. Ensure that the tests cover both expected outcomes and edge cases.
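The mirrored-path naming convention can be expressed as a small helper. This is a sketch for illustration; `expected_test_path` is a hypothetical function and not part of the repository, and the default `nv_ingest` and `tests` roots are taken from the example path above.

```python
from pathlib import PurePosixPath


def expected_test_path(module_path: str, package_root: str = "nv_ingest", tests_root: str = "tests") -> str:
    """Map a source module path to the mirrored test module path."""
    # Strip the package root, keep the sub-path, and prefix the file name with test_.
    relative = PurePosixPath(module_path).relative_to(package_root)
    return str(PurePosixPath(tests_root) / relative.parent / f"test_{relative.name}")


mirrored = expected_test_path("nv_ingest/some_path/another_path/my_module.py")
```

A check like this could be used in review tooling to flag new modules that lack a correspondingly placed test file.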
When writing tests that depend on external services (e.g., databases, APIs), it is important to mock these dependencies to ensure that tests are reliable, fast, and do not depend on external factors.
- Mocking Libraries: Use libraries like unittest.mock to create mocks for external services. The pytest-mock plugin can also be used to integrate mocking capabilities directly with pytest.
- Mock Objects: Create mock objects to simulate the behavior of external services. Use these mocks to test how your code interacts with these services without making actual network calls or database transactions.
- Patching: Use patch to replace real objects in your code with mocks. This can be done at the function, method, or object level. Ensure that patches are applied in the correct scope to avoid side effects.
Here is an example of how to structure a test module in the [repo_root]/tests directory:

    import pytest
    from unittest.mock import patch

    # Assuming the module to test is located at [repo_root]/module.py
    from module import function_to_test


    @pytest.fixture
    def mock_external_service():
        # Replace module.ExternalService with a mock for the duration of each test
        with patch('module.ExternalService') as mock_service:
            yield mock_service


    def test_function_returns_correct_value(mock_external_service):
        # Arrange
        mock_external_service.return_value.some_method.return_value = 'expected_value'

        # Act
        result = function_to_test()

        # Assert
        assert result == 'expected_value'


    def test_function_handles_invalid_input(mock_external_service):
        # Arrange
        invalid_input = None  # placeholder for a value the function should reject
        mock_external_service.return_value.some_method.side_effect = ValueError("Invalid input")

        # Act and Assert
        with pytest.raises(ValueError, match="Invalid input"):
            function_to_test(invalid_input)
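The point about patch scope can be demonstrated with a self-contained script that needs only the standard library. The `ExternalService` class and `function_under_test` below are hypothetical stand-ins defined inline so the example runs on its own.

```python
from unittest.mock import patch


class ExternalService:
    """Hypothetical service whose real call must never run in unit tests."""

    def fetch(self):
        raise RuntimeError("network access is not allowed in unit tests")


def function_under_test():
    return ExternalService().fetch().upper()


# The patch is active only inside the `with` block.
with patch.object(ExternalService, "fetch", return_value="payload") as mock_fetch:
    patched_result = function_under_test()
    mock_fetch.assert_called_once_with()

# Outside the block the original method is restored.
try:
    function_under_test()
    restored = False
except RuntimeError:
    restored = True
```

Keeping the patch inside a `with` block (or a fixture that yields from one) guarantees the real object is restored even if the test fails, which avoids leaking mocks into other tests.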
- Submodules are used to manage third-party libraries and dependencies.
- Submodules should be created in the third_party directory.
- Ensure that the submodule is updated to the latest commit before making changes.
- Model Integration: NV-Ingest is designed to be scalable and flexible, so running models directly in the pipeline is discouraged.
- Model Export: Models should be exported to a format compatible with Triton Inference Server or TensorRT.
  - Model acquisition and conversion should be documented in triton_models/README.md, including the model name, version, pbtxt file, Triton model files, etc., along with an example of how to query the model in Triton.
  - Models should be externally hosted and downloaded during the pipeline execution, or added via LFS.
  - Any additional code, configuration files, or scripts required to run the model should be included in the triton_models/[MODEL_NAME] directory.
- Self-Contained Dependencies: No assumptions should be made regarding other models or libraries being available in the pipeline. All dependencies should be self-contained.
- Base Triton Container: Directions for the creation of the base Triton container are listed in the triton_models/README.md file. If a new model requires additional base dependencies, please update the Dockerfile in the triton_models directory.
To ensure the quality and maintainability of the NV-Ingest codebase, the following architectural guidelines should be followed:
- Single Responsibility Principle (SRP): Ensure that each module, class, or function has only one reason to change.
- Interface Segregation Principle (ISP): Avoid forcing clients to depend on interfaces they do not use.
- Dependency Inversion Principle (DIP): High-level modules should not depend on low-level modules; both should depend on abstractions.
- Physical Design Structure Mirroring Logical Design Structure: The physical layout of the codebase should reflect its logical structure.
- Levelization: Organize code into levels where higher-level components depend on lower-level components but not vice versa.
- Acyclic Dependencies Principle (ADP): Ensure the dependency graph of packages/modules has no cycles.
- Package Cohesion Principles:
  - Package classes that change together.
  - Package classes that are used together.
- Encapsulate What Varies: Identify aspects of the application that vary and separate them from what stays the same.
- Favor Composition Over Inheritance: Utilize object composition over class inheritance for behavior reuse where possible.
- Clean Separation of Concerns (SoC): Divide the application into distinct features with minimal overlap in functionality.
- Principle of Least Knowledge (Law of Demeter): Objects should assume as little as possible about the structure or properties of anything else, including their subcomponents.
- Document Assumptions and Decisions: Assumptions made and reasons behind architectural and design decisions should be clearly documented.
- Continuous Integration and Testing: Integrate code frequently into a shared repository and ensure comprehensive testing is an integral part of the development cycle.
Contributors are encouraged to follow these guidelines to ensure contributions are in line with the project's architectural consistency and maintainability.
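As one illustration of how dependency inversion and composition over inheritance can look in practice, consider the sketch below. All class names are hypothetical and unrelated to the NV-Ingest codebase; the example only shows a high-level component depending on an abstraction and being composed with a low-level implementation.

```python
from typing import List, Protocol


class TextExtractor(Protocol):
    """Abstraction that the high-level code depends on."""

    def extract(self, raw: bytes) -> str: ...


class Utf8Extractor:
    """Low-level detail; it satisfies the protocol without inheriting from it."""

    def extract(self, raw: bytes) -> str:
        return raw.decode("utf-8")


class DocumentProcessor:
    """High-level component composed with an extractor rather than subclassing one."""

    def __init__(self, extractor: TextExtractor) -> None:
        self._extractor = extractor

    def word_counts(self, documents: List[bytes]) -> List[int]:
        return [len(self._extractor.extract(doc).split()) for doc in documents]


processor = DocumentProcessor(Utf8Extractor())
counts = processor.word_counts([b"the quick brown fox", b"jumped"])
```

Swapping in a different extractor requires no change to `DocumentProcessor`, and a test can pass a trivial fake extractor instead of patching internals.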
NV-Ingest is licensed under the NVIDIA Proprietary Software License -- ensure that any contributions are compatible.
The following should be included in the header of any new files:
SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Portions adopted from