Fast and generic implementation using OpenMP and CUDA #45
shikishima-TasakiLab wants to merge 14 commits into d-li14:main
Conversation
@shikishima-TasakiLab My compilation fails with the following error:

```
src/pytorch_wrapper.cpp:12:65:   required from here
/usr/include/c++/8/bits/move.h:87:21: error: static assertion failed: template argument substituting _Tp is an lvalue reference type
     static_assert(!std::is_lvalue_reference<_Tp>::value, "template argument"
                   ^~~~
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
```
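For context, the failing assertion lives in libstdc++'s std::forward and fires when an rvalue is forwarded with an lvalue-reference template argument. A minimal standalone reproduction of that class of error (assuming GCC 8's headers; this is not the actual pytorch_wrapper.cpp code):

```cpp
#include <utility>

int main() {
    int x = 0;
    // std::move(x) is an rvalue, so the rvalue overload of std::forward is
    // selected; but the explicit template argument int& is an lvalue
    // reference type, which trips the static_assert in bits/move.h with
    // exactly the message reported above.
    int& y = std::forward<int&>(std::move(x));
    (void)y;
}
```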
@d-li14 Are you running the build command in an environment where PyTorch is installed? That is, python3 setup.py build or python3 setup.py install.
@shikishima-TasakiLab Yes, I am running
@d-li14 I was able to build it in the following Docker environment.
Hi, thank you very much for implementing this; it seems to work very well in full-precision mode. However, I get some issues with numerical stability when using automatic mixed precision training (the loss goes to NaN within a few steps). I am guessing that the CUDA implementation expects a full-precision input, but AMP gives it half precision. As a quick workaround I made a patch to _involution2d so I could at least use the rest of my network with mixed precision while using this:

```python
def _involution2d(
    input: torch.Tensor,
    weight: torch.Tensor,
    kernel_size: Union[int, Tuple[int, int]] = 7,
    stride: Union[int, Tuple[int, int]] = 1,
    padding: Union[int, Tuple[int, int]] = 0,
    dilation: Union[int, Tuple[int, int]] = 1,
    groups: int = 1,
    bias: torch.Tensor = None,
) -> torch.Tensor:
    kernel_size_ = _pair(kernel_size)
    stride_ = _pair(stride)
    padding_ = _pair(padding)
    dilation_ = _pair(dilation)

    # Workaround: the CUDA kernel expects float32, so cast half-precision
    # inputs up before calling the native op.
    if input.dtype == torch.half:
        input = input.float()

    output: torch.Tensor = ops.involution.involution2d(
        input, weight, kernel_size_, stride_, padding_, dilation_, groups
    )

    if bias is not None:
        output += bias.view(1, -1, 1, 1)

    return output
```
```cpp
at::Tensor involution2d_autocast(
    const torch::autograd::Variable& input,
    const torch::autograd::Variable& weight,
    const std::vector<int64_t>& kernel_size,
    const std::vector<int64_t>& stride,
    const std::vector<int64_t>& padding,
    const std::vector<int64_t>& dilation,
    const int64_t groups
) {
    c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
    auto exec_type = at::autocast::promote_type(at::kFloat, input, weight);
    return involution2d_autograd(
        at::autocast::cached_cast(exec_type, input),
        at::autocast::cached_cast(exec_type, weight),
        kernel_size, stride, padding, dilation, groups
    );
}
```
@csvance
I fixed the CUDA implementation so that its input is cast to full precision using Autocast.
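For reference, a wrapper like involution2d_autocast is typically registered to the Autocast dispatch key so that calls made under torch.cuda.amp.autocast are routed through it. A minimal sketch of that common pattern (the operator namespace and schema name used here are assumptions, not necessarily the PR's exact registration code):

```cpp
#include <torch/library.h>

// Hypothetical registration sketch: when the Autocast dispatch key is active,
// dispatch involution::involution2d to the involution2d_autocast wrapper above.
TORCH_LIBRARY_IMPL(involution, Autocast, m) {
    m.impl("involution2d", involution2d_autocast);
}
```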
```diff
- #define CUDA_MAX_THREADS 512u
+ #define CUDA_MAX_THREADS 1024u
```
@d-li14
In your CuPy implementation, the maximum number of CUDA threads is set to 1024. However, when I experimented with my CuPy reimplementation, it did not work with 1024, so I set it to 512.
My CUDA implementation does work with 1024; I had set it to 512 for an experiment and forgot to change it back.
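For context, a cap like CUDA_MAX_THREADS usually bounds the block size when the kernel launch configuration is computed. A minimal sketch of that pattern (not the PR's actual launch code; the kernel name below is hypothetical):

```cpp
#include <algorithm>

#define CUDA_MAX_THREADS 1024u

// Ceiling division: number of blocks of size `threads` needed to cover `n` elements.
inline unsigned int num_blocks(unsigned int n, unsigned int threads) {
    return (n + threads - 1) / threads;
}

// Typical launch configuration:
//   const unsigned int threads = std::min(CUDA_MAX_THREADS, num_elements);
//   const unsigned int blocks  = num_blocks(num_elements, threads);
//   involution2d_forward_kernel<<<blocks, threads>>>(...);  // hypothetical kernel name
```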
@shikishima-TasakiLab
Thanks, but I have tried changing the maximum number of CUDA threads, and the result still seems similar.
```python
            'nvcc': ['-O3'] + GENERATE_CODES,
        }
    )
)
```
@d-li14
I changed the NVCC arguments so that the code is compiled with optimizations for each target GPU architecture.
Hopefully this will increase the speed.
@shikishima-TasakiLab
Thanks for your efforts. However, the new code still does not give the expected speedup on my side.
close #44