RuntimeError: CUDA error: operation not supported when calling cusparseCreate(handle)
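Problem description:
I ran into a lot of trouble setting up the environment to reproduce LightGCN. This post records the most infuriating pitfalls, as a source of ideas the next time an environment setup goes wrong:
Problem 1
Installing exactly what the official requirements specify produced many package incompatibilities.
The official requirements are (https://github.com/gusye1234/LightGCN-PyTorch/blob/master/requirements.txt):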
torch==1.4.0
pandas==0.24.2
scipy==1.3.0
numpy==1.22.0
tensorboardX==1.8
scikit-learn==0.23.2
tqdm==4.48.2
Installing numpy:
ERROR: No matching distribution found for numpy==1.22.0
Installing pandas:
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Installing scipy:
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Problem 2
Once Problem 1 was solved, CUDA kept raising errors:
RuntimeError: CUDA error: operation not supported when calling cusparseCreate(handle)
RuntimeError: CUDA error: kernel launch failure when calling `cusparseXcoo2csr(handle, coorowind, i_nnz, i_m, csrrowptr, CUSPARSE_INDEX_BASE_ZERO)`
RuntimeError: CUDA error: no kernel image is available for execution on the device
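Root cause analysis:
Problem 1: the packages do not match the Python version, and they do not match each other either.
For example: numpy 1.22.x requires Python 3.8, but Python 3.8 does not work with scipy 1.3.0 … (scipy 1.3.0 fits Python 3.5 or 3.6 best).
Problem 2: the RTX architecture itself is the problem. It is neither a version mismatch nor a wrong installation method (some blog posts say that switching from a pip install to a conda install makes the error go away; that did not work for me).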
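The conflict described above can be checked mechanically. The following is a minimal sketch, not a general resolver: the `SUPPORTED` table and both function names are made up for illustration, and the version ranges simply restate the claims in this post (the scipy upper bound in particular is an assumption), so verify them against each package's release notes.

```python
# Python 3 minor versions accepted by each pinned package, as (lowest, highest).
# None means "no upper bound considered here".
# ASSUMPTION: these ranges restate the claims in this post; they are
# illustrative and should be checked against each package's release notes.
SUPPORTED = {
    "numpy==1.22.0": (8, None),  # needs Python 3.8 or later
    "scipy==1.3.0": (5, 7),      # assumed range: Python 3.5 to 3.7
}


def accepts(bounds, minor):
    """Return True if Python 3.<minor> falls inside the (lowest, highest) bounds."""
    lowest, highest = bounds
    return minor >= lowest and (highest is None or minor <= highest)


def common_python_minors(ranges, candidates=range(5, 13)):
    """Return the Python 3 minor versions that every package accepts."""
    return [m for m in candidates if all(accepts(b, m) for b in ranges.values())]


print(common_python_minors(SUPPORTED))  # prints []
```

An empty result means that no single Python version satisfies both pins, which is why the pinned requirements file cannot be installed as written.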
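Solution:
Problem 1:
I found another version of the requirements (https://github.com/clhchtcjj/LightGCN-PyTorch/blob/master/requirements.txt); together with a Python 3.6 environment, it works.
Note that any PyTorch 1.x version is fine here. With conda, PyTorch 1.4\1.5 could not find the cudatoolkit 0.5.0\0.6.0 … versions on the Tsinghua mirror, so I installed 1.8 with pip, which works.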
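When mixing a requirements file with manually chosen versions like this, it can help to see which pins the environment actually deviates from. Below is a rough sketch: `parse_pins`, `find_mismatches` and `installed_versions` are hypothetical helper names, only `name==version` lines are handled, and `importlib.metadata` needs Python 3.8 or later, so in a Python 3.6 environment the `installed` dictionary would have to be filled some other way.

```python
def parse_pins(requirements_text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed):
    """Return (name, wanted, found) for each pin that is missing or differs."""
    return [
        (name, wanted, installed.get(name))
        for name, wanted in sorted(pins.items())
        if installed.get(name) != wanted
    ]


def installed_versions(names):
    """Look up installed versions (requires Python 3.8+ for importlib.metadata)."""
    from importlib import metadata

    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed: reported later with found=None
    return found


if __name__ == "__main__":
    pins = parse_pins("torch==1.4.0\nscipy==1.3.0\nnumpy==1.22.0\n")
    for name, wanted, found in find_mismatches(pins, installed_versions(pins)):
        print(f"{name}: pinned {wanted}, installed {found}")
```

A mismatch is not necessarily a problem (PyTorch 1.8 in place of the pinned 1.4.0 worked here); the list only shows where the environment differs from the file.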
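Problem 2:
First check the versions, then check the installation method (if you installed with pip, try conda and see whether that succeeds). If that still fails, do not use a server with the RTX architecture.
I found this only after many fruitless attempts (half a day gone, and I was furious): running on my local GPU succeeded… So I went back and studied the errors more carefully: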
Traceback (most recent call last):
File "main.py", line 45, in <module>
Procedure.Test(dataset, Recmodel, epoch, w, world.config['multicore'])
File "/root/autodl-fs/LightGCN-PyTorch-master/code/Procedure.py", line 106, in Test
rating = Recmodel.getUsersRating(batch_users_gpu)
File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 175, in getUsersRating
all_users, all_items = self.computer()
File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 146, in computer
all_emb = torch.cat([users_emb, items_emb])
RuntimeError: CUDA error: no kernel image is available for execution on the device
and
Traceback (most recent call last):
File "main.py", line 45, in <module>
Procedure.Test(dataset, Recmodel, epoch, w, world.config['multicore'])
File "/root/autodl-fs/LightGCN-PyTorch-master/code/Procedure.py", line 106, in Test
rating = Recmodel.getUsersRating(batch_users_gpu)
File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 175, in getUsersRating
all_users, all_items = self.computer()
File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 166, in computer
all_emb = torch.sparse.mm(g_droped, all_emb)
File "/root/miniconda3/envs/gcn/lib/python3.6/site-packages/torch/sparse/__init__.py", line 68, in mm
return torch._sparse_mm(mat1, mat2)
RuntimeError: CUDA error: kernel launch failure when calling cusparseXcoo2csr(handle, coorowind, i_nnz, i_m, csrrowptr, CUSPARSE_INDEX_BASE_ZERO)
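With incompatibility ruled out: the errors typically occur during matrix operations, and real-time monitoring showed that the GPU was in fact working. So I concluded that the kernel launch fails during computation because this CUDA code is not compiled for this GPU architecture.
So I switched architectures: I checked the GPU in my local machine and replaced the RTX server with a GTX one, after which everything ran successfully.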
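One way to test the "not compiled for this GPU architecture" explanation before renting another server is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below is an illustration, not something I ran on the failing server: `can_run` is a hypothetical helper, and `torch.cuda.get_arch_list()` is assumed to be present, which holds for recent PyTorch releases but possibly not for older ones.

```python
def can_run(capability, arch_list):
    """Return True if any compiled target in arch_list can run on the device.

    capability: (major, minor), e.g. from torch.cuda.get_device_capability().
    arch_list: strings such as "sm_75" or "compute_75",
               e.g. from torch.cuda.get_arch_list().
    """
    major, minor = capability
    device = major * 10 + minor
    for arch in arch_list:
        kind, _, number = arch.partition("_")
        if not number.isdigit():
            continue  # skip entries this sketch does not understand
        target = int(number)
        # Binary code runs on devices with the same major version
        # and an equal or higher minor version.
        if kind == "sm" and target // 10 == major and target <= device:
            return True
        # PTX can be JIT-compiled for devices of equal or higher capability.
        if kind == "compute" and target <= device:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("device capability:", cap, "compiled targets:", archs)
        print("build can run on this GPU:", can_run(cap, archs))
```

If this reports False, the "no kernel image is available for execution on the device" error is expected, and the options are a PyTorch build that targets the GPU or, as I did, a different GPU.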
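Postscript: who can understand my breakdown… I switched servers several times and installed the environment over and over…
Fortunately it ended well. I hope this never happens again (damn it, autodl, give me back my 10 yuan, waaaah).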