
Move tensor_set/scale/cast/copy to gtest #3416

Merged: 13 commits merged into develop from C7/simple_tensorops_gtest on Dec 5, 2024

Conversation

CAHEK7 (Contributor) commented Dec 1, 2024

This PR moves tensor_set/scale/cast/copy to the gtest framework.
It introduces fast and flexible CPU verification routines for unary/binary/ternary operations (@Vsevolod1983 please pay attention to this; it's very useful for the ternary tensor_ops conversion).
Those routines are up to 4x faster for unary operations, 3x faster for binary, and almost 2x faster for ternary operations (probably more; the gains grow with tensor size, and I only checked the sizes used in the test suites). A sketch of the general idea follows below.

It introduces more size_t-friendly configs for #3393 (@Vsevolod1983 please take a look).

It also fixes a few obvious bugs in the Cast operations.
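For context, a minimal sketch of the idea behind such verification routines (the name apply_elementwise and the flat-buffer traversal are hypothetical; the PR's actual routines, such as operate_over_subtensor quoted further down, also honor tensor strides and offsets): one generic driver serves unary, binary, and ternary operations.

#include <cstddef>
#include <vector>

// Hypothetical sketch: apply `op` elementwise over flat buffers.
// One traversal loop is shared by unary/binary/ternary operations;
// the operation itself is a caller-supplied lambda.
template <typename Op, typename Dst, typename... Srcs>
void apply_elementwise(Op op, std::vector<Dst>& dst, const std::vector<Srcs>&... srcs)
{
    for(std::size_t i = 0; i < dst.size(); ++i)
        op(dst[i], srcs[i]...);
}

// Usage:
//   apply_elementwise([](auto& d, auto a) { d = 2 * a; }, out, in0);              // unary
//   apply_elementwise([](auto& d, auto a, auto b) { d = a + b; }, out, in0, in1); // binary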

CAHEK7 force-pushed the C7/simple_tensorops_gtest branch from a4ffd88 to 4af51c7 (December 2, 2024 13:41)
CAHEK7 force-pushed the C7/simple_tensorops_gtest branch from 4af51c7 to a66714d (December 2, 2024 14:03)
CAHEK7 force-pushed the C7/simple_tensorops_gtest branch from 44aa747 to 585ee05 (December 2, 2024 15:33)
CAHEK7 (Contributor, Author) commented Dec 4, 2024

This PR blocks #3424

@@ -2135,8 +2135,10 @@ void CastTensor(const Handle& handle,
MIOPEN_THROW(miopenStatusBadParm, "Tensor dimension sizes unsupported.");
}

auto miopen_alpha = *(static_cast<const float*>(alpha));
Contributor:

Is it possible for alpha to be nullptr here? Or did I miss the check?

BrianHarrisonAMD (Contributor) commented Dec 4, 2024

It looks like it's set everywhere it's used, but it's also surprising that we pass it as a pointer.
It seems like it could be a float in this case.

CAHEK7 (Contributor, Author) commented Dec 4, 2024

I just moved it earlier; previously it was even worse: it did CopyTensor for any alpha, and since I extended the coverage, I found that case and fixed it.
You are right, but I have no time for another review round) I can probably make a subsequent PR for that nullptr check (and the I8 Copy case); see the sketch below.

> It seems like it could be a float in this case.

For all types except double it must be float, and for double it must be double, but MIOpen barely supports double anywhere... Anyway, it can be done later.

Btw, that's why I asked to add an explicit double check to all the isApplicable methods here: https://github.com/ROCm/MIOpen/pull/3346/files#diff-1af35db437d55c5d90c9186b7db9736706d96549c0f20ea3156113db8fdc16e4
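A hedged sketch of what that follow-up nullptr check could look like (not part of this PR; throwing miopenStatusBadParm here is an assumption, chosen to match the throw a few lines above):

// Guard before dereferencing: reject a null alpha explicitly
// instead of crashing on the cast below.
if(alpha == nullptr)
{
    MIOPEN_THROW(miopenStatusBadParm, "alpha must not be null in CastTensor.");
}
auto miopen_alpha = *(static_cast<const float*>(alpha));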

@@ -35,7 +35,7 @@ struct conv2d_bias_driver : public conv_bias_driver<T>
tensor_elem_gen_checkboard_sign{}(is...);
};

-    this->add(this->output, "output", this->get_tensor(get_inputs, gen_value));
+    this->add(this->output, "output", this->get_tensor(get_inputs<int>, gen_value));
Contributor:

Does this change relate to the purpose of this PR?

CAHEK7 (Contributor, Author):

Yep, without the explicit <int> it won't compile: the compiler can't infer the template argument when the template name is passed as a callable, even though there is a default one.
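For illustration, a minimal standalone example of the rule in play (the names add_case and this get_inputs are hypothetical stand-ins): a function template's name cannot be passed to a deduced parameter, and the default template argument does not rescue deduction.

#include <vector>

template <typename T = int>
std::vector<T> get_inputs() { return {}; }

template <typename F>
void add_case(F f) { f(); }

int main()
{
    // add_case(get_inputs);   // error: cannot deduce F from a template name,
    //                         // even though T defaults to int
    add_case(get_inputs<int>); // OK: names one concrete function
}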

Comment on lines +126 to +150
if(clamp)
{
    // Clamped path: scale by alpha, then clamp to the destination
    // type's maximum so the narrowing cast cannot overflow.
    operate_over_subtensor(
        [alpha, clampVal = static_cast<float>(std::numeric_limits<DstType>::max())](
            auto& dst, auto src) {
            dst = std::min(static_cast<float>(src) * alpha, clampVal);
        },
        dstSuperCpu,
        srcSuperCpu,
        dstDesc,
        srcDesc,
        dstOffset,
        srcOffset);
}
else
{
    // Unclamped path: plain scale-and-convert.
    operate_over_subtensor(
        [alpha](auto& dst, auto src) { dst = static_cast<float>(src) * alpha; },
        dstSuperCpu,
        srcSuperCpu,
        dstDesc,
        srcDesc,
        dstOffset,
        srcOffset);
}
Contributor:

Looks like we gained coverage here.
I think the old tests only checked the clamp case.

Nice!

Comment on lines -2149 to +2153
-    std::string network_config = "cast " + std::to_string(dstDesc_flat.GetType());
+    // TODO: make proper network config
+    std::string network_config = "cast " + std::to_string(srcDesc_flat.GetType()) +
+                                 std::to_string(dstDesc_flat.GetType());
BrianHarrisonAMD (Contributor) commented Dec 4, 2024

Was this wrong before?

Looks like it used to have a caching key based on the dst type only, but uses the src type as part of the compilation params.

CAHEK7 (Contributor, Author) commented Dec 4, 2024

Before, it did not use the src type in the network config, so it failed if you called an int -> float conversion and then a float -> int conversion (nobody ever tested mixed calls, and we had the same problem in other algorithms). In both cases it ran the int -> float kernel, because that one had been added first. Now it calls the appropriate kernels.

CAHEK7 (Contributor, Author):

Btw, that's one of the "gtest single binary" problems: we may get stuck with the first added kernel.

CAHEK7 (Contributor, Author):

The proper fix is to include all the compile-time and performance-critical parameters in the network config. Then, in the first case, it will choose a properly compiled kernel, and in the second, the most performant one (if there are any). I haven't analyzed it deeply; I just fixed the part that the wider coverage uncovered. The sketch below illustrates the failure mode.

Comment on lines +291 to +293
X_INSTANTIATE_COPY(FP32, float);
X_INSTANTIATE_COPY(FP16, float16);
X_INSTANTIATE_COPY(BFP16, bfloat16);
BrianHarrisonAMD (Contributor) commented Dec 4, 2024

Why are there no I8 tests for Copy?

CAHEK7 (Contributor, Author):

It was forgotten.
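Presumably the follow-up would add one more instantiation line next to the ones above; assuming int8_t is the host type this macro pairs with I8, something like:

X_INSTANTIATE_COPY(I8, int8_t);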

BrianHarrisonAMD (Contributor) left a review comment

Looks good to me.
I think this expands the coverage in our CI, and my comments are not blocking.

BrianHarrisonAMD merged commit a325bdb into develop on Dec 5, 2024
30 of 144 checks passed
BrianHarrisonAMD deleted the C7/simple_tensorops_gtest branch on December 5, 2024 14:57