RuntimeError: CUDA error: operation not supported when calling cusparseCreate(handle)

233486412 · Published 2024-08-18 02:54:03

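Problem description:

While reproducing LightGCN I ran into a lot of trouble setting up the environment. I am recording the most infuriating pitfalls here, so they can offer some ideas the next time an environment setup goes wrong:

Problem 1
Installing exactly what the official configuration specifies produces many package mismatch errors.
The official configuration is as follows (https://github.com/gusye1234/LightGCN-PyTorch/blob/master/requirements.txt):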

torch==1.4.0
pandas==0.24.2
scipy==1.3.0
numpy==1.22.0
tensorboardX==1.8
scikit-learn==0.23.2
tqdm==4.48.2

Installing numpy:

ERROR: No matching distribution found for numpy==1.22.0

Installing pandas:

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Installing scipy:

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Problem 2
Once Problem 1 was solved, CUDA kept raising errors:

RuntimeError: CUDA error: operation not supported when calling cusparseCreate(handle)

RuntimeError: CUDA error: kernel launch failure when calling `cusparseXcoo2csr(handle, coorowind, i_nnz, i_m, csrrowptr, CUSPARSE_INDEX_BASE_ZERO)`

RuntimeError: CUDA error: no kernel image is available for execution on the device

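Root cause analysis:

Problem 1: the pinned packages do not match the Python version, and they do not match each other either.

For example, numpy 1.22.x requires Python 3.8 or newer, but Python 3.8 does not work with scipy 1.3.0 … (scipy 1.3.0 fits Python 3.5 or 3.6 best).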
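
The conflict can be made concrete by intersecting the Python versions each pin accepts. The sketch below is only an illustration: the version sets simply encode the two statements in this post and are assumptions, not data read from PyPI, and `common_pythons` is a hypothetical helper name.

```python
# Illustrative compatibility table. The version sets are assumptions taken
# from the statements in this post, not values read from package metadata.
SUPPORTED_PYTHONS = {
    "numpy==1.22.0": {(3, 8), (3, 9), (3, 10)},
    "scipy==1.3.0": {(3, 5), (3, 6)},
}


def common_pythons(table):
    """Return the sorted (major, minor) Python versions accepted by every pin."""
    versions = None
    for supported in table.values():
        versions = set(supported) if versions is None else versions & supported
    return sorted(versions or [])


# An empty result means no single interpreter satisfies all pins at once.
print(common_pythons(SUPPORTED_PYTHONS))
```

With these two pins the intersection is empty, which is why no choice of Python version makes the official file install cleanly.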

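Problem 2: the RTX architecture itself was the problem. It was neither a version mismatch nor a wrong installation method (some blogs say that switching from a pip install to a conda install makes the error go away; that did not work for me).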

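Solution:

Problem 1:
I found a different version of the requirements file (https://github.com/clhchtcjj/LightGCN-PyTorch/blob/master/requirements.txt); together with a Python 3.6 environment it works.

Note that any pytorch 1.x version is fine here. When installing pytorch 1.4/1.5 with conda, the Tsinghua mirror could not find the cudatoolkit 0.5.0/0.6.0 … versions, so I installed 1.8 with pip myself, and it works.

Problem 2:
First check the versions, then check the installation method (if you installed with pip, try conda and see whether that succeeds). If it still fails, do not use a server with an RTX-architecture GPU.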
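
One way to make the version check concrete is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The following is a minimal sketch, not a definitive check: `build_supports_gpu` is a hypothetical helper, its rule is a simplification of CUDA's compatibility rules, and the example architecture list and capabilities are made-up values, since this post does not record the exact GPU models. `torch.cuda.get_arch_list()` may be missing in older PyTorch releases, so the real query is guarded.

```python
def build_supports_gpu(capability, arch_list):
    """Return True if a build with `arch_list` should have kernels for the GPU.

    capability: (major, minor) compute capability, e.g. (8, 6).
    arch_list:  strings such as "sm_70" (binary code) or "compute_70" (PTX).
    """
    gpu = capability[0] * 10 + capability[1]
    for arch in arch_list:
        kind, _, number = arch.partition("_")
        if not number.isdigit():
            continue  # ignore entries this simplified rule does not understand
        target = int(number)
        if kind == "sm" and target // 10 == capability[0] and target <= gpu:
            return True  # binary code runs on the same major version, same or newer minor
        if kind == "compute" and target <= gpu:
            return True  # PTX can be JIT-compiled for newer GPUs
    return False


# Hypothetical example: an older build without kernels for newer cards.
old_build = ["sm_37", "sm_50", "sm_60", "sm_70"]
print(build_supports_gpu((8, 6), old_build))  # a newer card with capability 8.6
print(build_supports_gpu((6, 1), old_build))  # an older card with capability 6.1

try:
    import torch
except ImportError:  # the helper above still works without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available() and hasattr(torch.cuda, "get_arch_list"):
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print(cap, archs, build_supports_gpu(cap, archs))
```

If the check comes back negative for the machine at hand, the "no kernel image is available" error is the expected outcome, and changing the GPU or the PyTorch build is more promising than reinstalling the same packages.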

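I found this out after many fruitless attempts (half a day wasted, and I was furious): running the code on my local GPU succeeded… So I went back and studied the error messages carefully: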

Traceback (most recent call last):
  File "main.py", line 45, in <module>
    Procedure.Test(dataset, Recmodel, epoch, w, world.config['multicore'])
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/Procedure.py", line 106, in Test
    rating = Recmodel.getUsersRating(batch_users_gpu)
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 175, in getUsersRating
    all_users, all_items = self.computer()
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 146, in computer
    all_emb = torch.cat([users_emb, items_emb])
RuntimeError: CUDA error: no kernel image is available for execution on the device

and

Traceback (most recent call last):
  File "main.py", line 45, in <module>
    Procedure.Test(dataset, Recmodel, epoch, w, world.config['multicore'])
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/Procedure.py", line 106, in Test
    rating = Recmodel.getUsersRating(batch_users_gpu)
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 175, in getUsersRating
    all_users, all_items = self.computer()
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 166, in computer
    all_emb = torch.sparse.mm(g_droped, all_emb)
  File "/root/miniconda3/envs/gcn/lib/python3.6/site-packages/torch/sparse/__init__.py", line 68, in mm
    return torch._sparse_mm(mat1, mat2)
RuntimeError: CUDA error: kernel launch failure when calling cusparseXcoo2csr(handle, coorowind, i_nnz, i_m, csrrowptr, CUSPARSE_INDEX_BASE_ZERO)

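With package incompatibility ruled out, the errors consistently occurred in matrix operations, and the live monitor showed that the GPU was in fact working. So I concluded that the kernel launch was failing at compute time: the CUDA code in this build is not compiled for this GPU architecture.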
So I changed architectures: after checking my local GPU, I swapped the RTX server for a GTX one, and everything ran successfully.
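
To vet a server before installing the full environment, a tiny smoke test can exercise the same two operations that fail in the tracebacks above (`torch.cat` and `torch.sparse.mm`). This is a sketch under assumptions: `smoke_test` is a hypothetical helper, the tensors are toy data rather than the LightGCN graph, and the function falls back to the CPU when no CUDA device is visible.

```python
try:
    import torch
except ImportError:  # keep the script usable on machines without PyTorch
    torch = None


def smoke_test():
    """Run the two operations from the tracebacks on tiny inputs.

    Returns a short status string instead of raising, so it can serve as a
    quick yes/no check on a freshly rented machine.
    """
    if torch is None:
        return "skipped: torch is not installed"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    try:
        users_emb = torch.ones(2, 4, device=device)
        items_emb = torch.ones(3, 4, device=device)
        all_emb = torch.cat([users_emb, items_emb])  # same call as model.py line 146

        # A 5x5 sparse identity matrix stands in for the graph (toy data).
        idx = torch.arange(5, device=device)
        graph = torch.sparse_coo_tensor(
            torch.stack([idx, idx]), torch.ones(5, device=device), (5, 5)
        )
        out = torch.sparse.mm(graph, all_emb)  # same call as model.py line 166
        if device.type == "cuda":
            torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    except RuntimeError as err:
        return f"failed on {device}: {err}"
    return f"ok on {device}: output shape {tuple(out.shape)}"


print(smoke_test())
```

Running this right after installing PyTorch takes a few seconds and fails with the same kind of `RuntimeError` if the build has no usable kernels for the GPU.
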
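Postscript: just venting about my breakdown… I switched servers several times and set up the environment over and over…

Fortunately it ended well. I hope this never happens again (damn it, autodl, give me back my 10 yuan, waaaah).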