Problem description:

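Reproducing LightGCN turned up a lot of problems during environment setup. I'm recording the few most infuriating pitfalls here so there are some pointers the next time environment setup goes wrong: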

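Problem 1
Following the official configuration leads to many package incompatibility errors.
The official configuration is as follows (https://github.com/gusye1234/LightGCN-PyTorch/blob/master/requirements.txt)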

torch==1.4.0
pandas==0.24.2
scipy==1.3.0
numpy==1.22.0
tensorboardX==1.8
scikit-learn==0.23.2
tqdm==4.48.2

Installing numpy:

ERROR: No matching distribution found for numpy==1.22.0

Installing pandas:

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Installing scipy:

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Problem 2
Once Problem 1 was solved, CUDA kept throwing errors

RuntimeError: CUDA error: operation not supported when calling cusparseCreate(handle)
RuntimeError: CUDA error: kernel launch failure when calling `cusparseXcoo2csr(handle, coorowind, i_nnz, i_m, csrrowptr, CUSPARSE_INDEX_BASE_ZERO)`
RuntimeError: CUDA error: no kernel image is available for execution on the device

Root cause analysis:

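Problem 1: the packages do not match the Python version, and they do not match each other either.

For example, numpy 1.22.x requires Python 3.8 or later, but Python 3.8 is not compatible with scipy 1.3.0 … (scipy 1.3.0 works best with Python 3.5 or 3.6)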
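
The conflict can be illustrated with a small sketch: given the Python ranges quoted above, no single interpreter version satisfies both pins. The names `SUPPORTED`, `is_compatible`, and `common_pythons` are hypothetical, and the ranges are copied from this post rather than checked against PyPI metadata.

```python
# Supported Python ranges as quoted in this post (assumed, not checked against PyPI).
# Each value is (lowest supported version, highest supported version or None).
SUPPORTED = {
    "numpy==1.22.0": ((3, 8), None),
    "scipy==1.3.0": ((3, 5), (3, 6)),
}


def is_compatible(py, supported_range):
    """Return True if the (major, minor) tuple `py` lies inside the range."""
    lowest, highest = supported_range
    return py >= lowest and (highest is None or py <= highest)


def common_pythons(candidates, supported=SUPPORTED):
    """Return the candidate Python versions that satisfy every pinned package."""
    return [
        py for py in candidates
        if all(is_compatible(py, rng) for rng in supported.values())
    ]


candidates = [(3, 5), (3, 6), (3, 7), (3, 8), (3, 9)]
print(common_pythons(candidates))  # empty list: no version satisfies both pins
```

An empty result means at least one pin has to be relaxed before any Python version can work.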

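Problem 2: the RTX architecture itself was the problem; it was neither a version mismatch nor a wrong installation method (some blog posts say that switching from a pip install to a conda install makes the error go away, but that did not work for me).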


Solution:

Problem 1:
I found a different version of the requirements file (https://github.com/clhchtcjj/LightGCN-PyTorch/blob/master/requirements.txt), which works with a Python 3.6 environment.

Note that any PyTorch 1.x version is fine here. With conda, PyTorch 1.4/1.5 could not find the cudatoolkit 0.5.0/0.6.0 … versions on the Tsinghua mirror, so I installed 1.8 with pip, which works.
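
To see how far an environment has drifted from a set of pins, a minimal sketch like the following can compare `name==version` lines with what is actually installed. The helper names (`parse_pins`, `installed_version`, `compare`) are hypothetical, and the sample text is the official list quoted earlier in this post, not the contents of the alternative requirements file.

```python
# Sample pins: the official list quoted earlier in this post.
REQUIREMENTS = """
torch==1.4.0
pandas==0.24.2
scipy==1.3.0
numpy==1.22.0
tensorboardX==1.8
scikit-learn==0.23.2
tqdm==4.48.2
"""


def parse_pins(text):
    """Collect 'name==version' lines into a dict, skipping blanks, comments, other specifiers."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
    except ImportError:
        import pkg_resources  # setuptools fallback for Python 3.6/3.7
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def compare(pins):
    """Pair every pinned version with the version found in the current environment."""
    return [(name, wanted, installed_version(name)) for name, wanted in pins.items()]


for name, wanted, found in compare(parse_pins(REQUIREMENTS)):
    status = "ok" if found == wanted else "differs"
    print(f"{name}: pinned {wanted}, installed {found} ({status})")
```

A "differs" line is not necessarily a problem (as noted above, PyTorch 1.8 worked in place of the pinned version); the report only shows where the environment departs from the file.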

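Problem 2:
First check the versions, then check the installation method (if you installed with pip, try conda and see whether that succeeds). If that still fails, do not use a server with an RTX-architecture GPU.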

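I found this out after many fruitless attempts (half a day gone, and I was furious): I ran the code on my local GPU and it worked… So I took another careful look at the errors: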

Traceback (most recent call last):
  File "main.py", line 45, in <module>
    Procedure.Test(dataset, Recmodel, epoch, w, world.config['multicore'])
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/Procedure.py", line 106, in Test
    rating = Recmodel.getUsersRating(batch_users_gpu)
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 175, in getUsersRating
    all_users, all_items = self.computer()
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 146, in computer
    all_emb = torch.cat([users_emb, items_emb])
RuntimeError: CUDA error: no kernel image is available for execution on the device

and

Traceback (most recent call last):
  File "main.py", line 45, in <module>
    Procedure.Test(dataset, Recmodel, epoch, w, world.config['multicore'])
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/Procedure.py", line 106, in Test
    rating = Recmodel.getUsersRating(batch_users_gpu)
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 175, in getUsersRating
    all_users, all_items = self.computer()
  File "/root/autodl-fs/LightGCN-PyTorch-master/code/model.py", line 166, in computer
    all_emb = torch.sparse.mm(g_droped, all_emb)
  File "/root/miniconda3/envs/gcn/lib/python3.6/site-packages/torch/sparse/__init__.py", line 68, in mm
    return torch._sparse_mm(mat1, mat2)
RuntimeError: CUDA error: kernel launch failure when calling cusparseXcoo2csr(handle, coorowind, i_nnz, i_m, csrrowptr, CUSPARSE_INDEX_BASE_ZERO)

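Having ruled out incompatibility, I noticed that the errors usually occur during matrix operations, and the real-time monitor showed that the GPU was doing work. So I suspected that the kernel launch was failing at compute time: the CUDA code had not been compiled for this GPU architecture.
So I changed the architecture. I checked the GPU in my local machine and switched from RTX to GTX, after which it ran successfully.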
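
One way to test this suspicion directly is to compare the GPU's compute capability with the architectures the installed PyTorch binary was compiled for. The sketch below is a simplified check under stated assumptions: `parse_arch` and `is_arch_supported` are hypothetical helpers, the matching rule is a simplification (an `sm_XY` binary is assumed to run on devices with the same major version and an equal or higher minor version, and a `compute_XY` PTX entry on any equal or newer device), and `torch.cuda.get_arch_list()` may be missing from older PyTorch builds, so its use is guarded.

```python
import re


def parse_arch(name):
    """Split an entry such as 'sm_75' or 'compute_80' into (kind, (major, minor))."""
    match = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", name)
    if match is None:
        return None
    kind, major, minor = match.groups()
    return kind, (int(major), int(minor))


def is_arch_supported(capability, arch_list):
    """Simplified check: does any compiled architecture cover this device capability?"""
    capability = tuple(capability)
    for name in arch_list:
        parsed = parse_arch(name)
        if parsed is None:
            continue  # ignore entries this sketch does not understand
        kind, (major, minor) = parsed
        if kind == "sm" and major == capability[0] and minor <= capability[1]:
            return True  # binary kernel built for the same major architecture
        if kind == "compute" and (major, minor) <= capability:
            return True  # PTX can be JIT-compiled for newer devices
    return False


try:
    import torch
except ImportError:
    torch = None  # the helpers above still work without PyTorch

if torch is not None and torch.cuda.is_available():
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    capability = torch.cuda.get_device_capability(0)
    print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))
    print("device capability:", capability)
    if get_arch_list is not None:
        arch_list = get_arch_list()
        print("compiled for:", arch_list)
        print("kernel image available:", is_arch_supported(capability, arch_list))
```

If the last line prints `False`, the "no kernel image is available" error is expected on that machine, and either a PyTorch build that lists the GPU's architecture or a different GPU is needed.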
Postscript: venting about my breakdown… I switched servers several times and set up the environment over and over…
Fortunately it turned out well in the end. I hope this never happens again (damn it, AutoDL, give me back my 10 yuan, waaah).
