Problem #
Running the RecIS deepctr example fails with the following error:
RecIS load lib /home1/workspace/xxx/.conda/envs/recis/lib/python3.10/site-packages/recis/lib/recis.so
[2025-12-09 16:19:20,462] [INFO] [recis.features.fused_op_impl] FusedOpFactory register Bucketize -> FusedBoundaryOP
[2025-12-09 16:19:20,463] [INFO] [recis.features.fused_op_impl] FusedOpFactory register Mod -> FusedModOP
[2025-12-09 16:19:20,463] [INFO] [recis.features.fused_op_impl] FusedOpFactory register Hash -> FusedHashOP
[2025-12-09 16:19:20,463] [INFO] [recis.features.fused_op_impl] FusedOpFactory register SequenceTruncate -> FusedCutoffOP
[2025-12-09 16:19:20,463] [INFO] [recis.features.fused_op_impl] FusedOpFactory register IDMultiHash -> FusedMultiHashOP
[1] 824655 segmentation fault (core dumped) python deepctr.py
Reinstalling #
First, reinstall RecIS and run the example again. Installation guide: https://alibaba.github.io/RecIS/installation_en.html
After installing, run the installation verification code from the guide:
python -c "import recis; print('RecIS installed successfully!')"
This command ran without problems, but the next snippet failed.
import torch
import recis
# Check versions
print(f"PyTorch version: {torch.__version__}")
# Check GPU support
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
# Simple functionality test
from recis.nn.modules.embedding import DynamicEmbedding, EmbeddingOption
emb_opt = EmbeddingOption(embedding_dim=16)
emb = DynamicEmbedding(emb_opt)
print("RecIS core modules loaded successfully!")
The output is:
RecIS load lib /home1/workspace/xxx/.conda/envs/recis/lib/python3.10/site-packages/recis/lib/recis.so
PyTorch version: 2.6.0+cu124
CUDA available: True
CUDA version: 12.4
GPU count: 8
Traceback (most recent call last):
  File "/home1/workspace/xxx/code/RecIS/examples/installation_verfication.py", line 17, in <module>
    emb = DynamicEmbedding(emb_opt)
  File "/home1/workspace/xxx/.conda/envs/recis/lib/python3.10/site-packages/recis/nn/modules/embedding.py", line 362, in __init__
    self._pg = dist.distributed_c10d._get_default_group()
  File "/home1/workspace/xxx/.conda/envs/recis/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1302, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
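DynamicEmbedding looks up the default process group in its constructor, so adding a call to init_process_group before creating it fixes this.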
import torch
import recis
import torch.distributed as dist
dist.init_process_group(rank=0, world_size=1, init_method="tcp://127.0.0.1:29500", backend="gloo")
# Check versions
print(f"PyTorch version: {torch.__version__}")
# Check GPU support
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
# Simple functionality test
from recis.nn.modules.embedding import DynamicEmbedding, EmbeddingOption
emb_opt = EmbeddingOption(embedding_dim=16)
emb = DynamicEmbedding(emb_opt)
print("RecIS core modules loaded successfully!")
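This version runs without errors, which suggests the RecIS installation itself is probably fine.
Running the official deepctr.py again still produces the original error.
Locating the problem #
My own orc_example.py and simple_example.py run fine, while the official deepctr.py, deepfm.py, and basic_usage.py all crash. I compared what they import: copying the imports from basic_usage.py into simple_example.py reproduces the same segmentation fault. After narrowing it down, the following line alone is enough to trigger the crash: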
from recis.io.orc_dataset import OrcDataset
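Solving the problem #
Hypothesis #
The catch is that orc_example.py also uses OrcDataset, so why does it not crash? Perhaps some of these libraries perform initialization at import time, and one of them has to be imported before OrcDataset?
Verification #
I commented out everything in orc_example.py that comes before the OrcDataset import, and sure enough the segmentation fault appeared.
Fix #
I then commented out the import statements one at a time until the segmentation fault appeared. This confirmed that importing pyarrow before importing OrcDataset is enough to avoid the crash.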
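Commenting out imports by hand works, but the same bisection can be scripted. The following is a minimal sketch, not part of RecIS: `probe` is a hypothetical helper that runs a snippet in a fresh interpreter and reports whether the child exited normally or was killed by a signal. It assumes a POSIX system, where `subprocess` reports death by signal as a negative return code. The `-X faulthandler` option is a standard CPython flag that makes the child dump its Python traceback to stderr when it crashes.

```python
import signal
import subprocess
import sys


def probe(snippet: str, timeout: float = 60.0) -> str:
    """Run `snippet` in a fresh interpreter and classify how it exited."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from that signal.
        return signal.Signals(-proc.returncode).name
    return f"exit {proc.returncode}"


if __name__ == "__main__":
    # Import lines taken from this post; each one runs in its own process,
    # so a crash in one candidate cannot take down the probing script.
    candidates = [
        "from recis.io.orc_dataset import OrcDataset",
        "import pyarrow; from recis.io.orc_dataset import OrcDataset",
    ]
    for snippet in candidates:
        print(f"{probe(snippet):>8}  {snippet}")
```

Running every candidate in a separate process matters here because a segmentation fault cannot be caught with try/except inside the crashing interpreter.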
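Cause #
OrcDataset needs to read ORC files. That capability comes from column_io, which depends on Arrow. I have no root access on the server, so I cannot install Arrow system-wide; my Arrow libraries come from the pyarrow package installed via conda. I therefore have to import pyarrow before using OrcDataset so that the Arrow shared libraries get loaded; otherwise column_io cannot find Arrow.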
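To make the required import order explicit, something like the sketch below could sit at the top of a script. `import_in_order` is a hypothetical helper written for this post, not a RecIS or pyarrow API; the only names taken from the environment above are the module paths `pyarrow` and `recis.io.orc_dataset`. It stops at the first module that cannot be imported, so the ORC reader is never imported when pyarrow is missing.

```python
import importlib


def import_in_order(module_names):
    """Import modules one by one and stop at the first that is unavailable.

    Returns the list of modules that were imported successfully.
    """
    loaded = []
    for name in module_names:
        try:
            loaded.append(importlib.import_module(name))
        except ImportError as exc:
            print(f"skipping the rest: cannot import {name!r} ({exc})")
            break
    return loaded


if __name__ == "__main__":
    # pyarrow goes first so that its Arrow shared libraries are already
    # loaded when the RecIS ORC reader pulls in column_io.
    import_in_order(["pyarrow", "recis.io.orc_dataset"])
```

In a short script, two plain import statements in the same order do the same job; the helper only adds the early stop.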
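Further verification: I generated a core dump and ran gdb python core_dump_file to inspect the call stack. It does contain addresses that cannot be resolved, which are presumably Arrow functions.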
(gdb) bt
...
#400 0x00007ffcb58d5e68 in ?? ()
#401 0x000000000000001c in ?? ()
Listing the loaded shared libraries confirms that no Arrow library was loaded:
(gdb) info share
...
0x00007f5d5bad8030 0x00007f5d5bad8f60 Yes /home1/workspace/xxx/.conda/envs/recis/lib/python3.10/lib-dynload/_bisect.cpython-310-x86_64-linux-gnu.so
0x00007f5d5b47e030 0x00007f5d5b47f5ba Yes /home1/workspace/xxx/.conda/envs/recis/lib/python3.10/lib-dynload/_random.cpython-310-x86_64-linux-gnu.so
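The same check can be done without gdb, from inside a running Python process. This is a rough, Linux-only sketch that reads /proc/self/maps; the helper names (`parse_maps`, `loaded_shared_objects`, `is_loaded`) are made up for illustration, and the assumption that the Arrow library file name contains "libarrow" may not hold for every pyarrow build.

```python
from pathlib import Path


def parse_maps(maps_text):
    """Return the set of shared-object paths listed in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)  # address perms offset dev inode [path]
        if len(fields) == 6 and ".so" in fields[5]:
            paths.add(fields[5].strip())
    return paths


def loaded_shared_objects():
    """Shared objects mapped into the current process (Linux only)."""
    return parse_maps(Path("/proc/self/maps").read_text())


def is_loaded(name_fragment):
    """True if any mapped shared object's file name contains the fragment."""
    return any(name_fragment in Path(p).name for p in loaded_shared_objects())


if __name__ == "__main__":
    print("arrow loaded before import:", is_loaded("libarrow"))
    try:
        import pyarrow  # noqa: F401
    except ImportError:
        print("pyarrow is not installed in this environment")
    else:
        print("arrow loaded after import: ", is_loaded("libarrow"))
```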
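Takeaways #
Libraries written purely in Python rarely have this kind of problem. column_io, which RecIS depends on, is written in C++ and needs its dependent libraries set up correctly. Next time I hit a segmentation fault, it is worth checking whether a shared library failed to load.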