RecIS deepctr Segmentation Fault

The official RecIS deepctr example crashes with a segmentation fault

Problem #

Running the RecIS deepctr example produces the following error:

RecIS load lib /home1/workspace/xxx/.conda/envs/recis/lib/python3.10/site-packages/recis/lib/recis.so
[2025-12-09 16:19:20,462] [INFO] [recis.features.fused_op_impl] FusedOpFactory register Bucketize -> FusedBoundaryOP
[2025-12-09 16:19:20,463] [INFO] [recis.features.fused_op_impl] FusedOpFactory register Mod -> FusedModOP
[2025-12-09 16:19:20,463] [INFO] [recis.features.fused_op_impl] FusedOpFactory register Hash -> FusedHashOP
[2025-12-09 16:19:20,463] [INFO] [recis.features.fused_op_impl] FusedOpFactory register SequenceTruncate -> FusedCutoffOP
[2025-12-09 16:19:20,463] [INFO] [recis.features.fused_op_impl] FusedOpFactory register IDMultiHash -> FusedMultiHashOP
[1]    824655 segmentation fault (core dumped)  python deepctr.py

Reinstalling #

First, reinstall RecIS and run the example again. Installation guide: https://alibaba.github.io/RecIS/installation_en.html

Once the installation finishes, run the installation verification code from the guide:

python -c "import recis; print('RecIS installed successfully!')"

This command ran without problems, but the next snippet failed.

import torch
import recis

# Check versions
print(f"PyTorch version: {torch.__version__}")

# Check GPU support
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")

# Simple functionality test
from recis.nn.modules.embedding import DynamicEmbedding, EmbeddingOption

emb_opt = EmbeddingOption(embedding_dim=16)
emb = DynamicEmbedding(emb_opt)
print("RecIS core modules loaded successfully!")

Output:

RecIS load lib /home1/workspace/xxx/.conda/envs/recis/lib/python3.10/site-packages/recis/lib/recis.so
PyTorch version: 2.6.0+cu124
CUDA available: True
CUDA version: 12.4
GPU count: 8
Traceback (most recent call last):
  File "/home1/workspace/xxx/code/RecIS/examples/installation_verfication.py", line 17, in <module>
    emb = DynamicEmbedding(emb_opt)
  File "/home1/workspace/xxx/.conda/envs/recis/lib/python3.10/site-packages/recis/nn/modules/embedding.py", line 362, in __init__
    self._pg = dist.distributed_c10d._get_default_group()
  File "/home1/workspace/xxx/.conda/envs/recis/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1302, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

Adding a call to init_process_group before creating the embedding fixes this.

import torch
import recis
import torch.distributed as dist

dist.init_process_group(rank=0, world_size=1, init_method="tcp://127.0.0.1:29500", backend="gloo")

# Check versions
print(f"PyTorch version: {torch.__version__}")

# Check GPU support
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")

# Simple functionality test
from recis.nn.modules.embedding import DynamicEmbedding, EmbeddingOption

emb_opt = EmbeddingOption(embedding_dim=16)
emb = DynamicEmbedding(emb_opt)
print("RecIS core modules loaded successfully!")

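The code above no longer raises an error, so the RecIS installation itself is presumably fine?

Running the official deepctr.py again still produces the original error.

Locating the problem #

My own scripts, orc_example.py and simple_example.py, run without problems. The official deepctr.py, deepfm.py, and basic_usage.py all crash. I compared the libraries they import: copying the imports from basic_usage.py into simple_example.py reproduces the same segmentation fault. After narrowing it down, the crash occurs whenever the following line is present: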

from recis.io.orc_dataset import OrcDataset
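The narrowing-down described above can also be automated. Below is a minimal sketch, assuming a POSIX system where a child process killed by a signal reports a negative return code. The helper names `import_crashes` and `first_crashing_prefix` are hypothetical and not part of RecIS. Each candidate list of import statements runs in a fresh interpreter, so a crash in one attempt does not take down the script doing the search.

```python
import signal
import subprocess
import sys


def import_crashes(statements):
    """Run the given statements in a fresh interpreter.

    Returns the name of the signal that killed the child (e.g. "SIGSEGV"),
    or None if the child was not killed by a signal. An ordinary failure
    such as ImportError is NOT counted as a crash.
    """
    code = "\n".join(statements)
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if proc.returncode < 0:  # POSIX: negative return code means "killed by signal"
        return signal.Signals(-proc.returncode).name
    return None


def first_crashing_prefix(statements):
    """Return the shortest prefix of `statements` that crashes, or None."""
    for end in range(1, len(statements) + 1):
        if import_crashes(statements[:end]) is not None:
            return statements[:end]
    return None
```

Passing the import lines of a crashing script, in order, to `first_crashing_prefix` returns the statements up to and including the first one that kills the interpreter. Reordering the list and calling it again is a cheap way to test whether import order matters.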

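Solving the problem #

Hypothesis #

The open question: orc_example.py also uses OrcDataset, so why doesn't it crash? Perhaps some libraries perform initialization when they are imported, and one of them has to be imported before OrcDataset?

Verification #

After commenting out everything in orc_example.py that precedes the OrcDataset import, the segmentation fault did appear.

Fix #

I then commented out the import statements one at a time until I found the one whose removal triggers the segmentation fault. This confirmed that importing pyarrow before importing OrcDataset is enough to solve the problem.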
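In a script, the fix is simply `import pyarrow` placed above `from recis.io.orc_dataset import OrcDataset`. To make the ordering requirement explicit in one place, it can also be wrapped in a small helper. This is only a sketch: `import_in_order` and `load_orc_dataset` are hypothetical names, and the only facts taken from the investigation above are the two module names and the order between them.

```python
import importlib


def import_in_order(module_names):
    """Import each module in the given order and return the last one imported."""
    if not module_names:
        raise ValueError("module_names must not be empty")
    module = None
    for name in module_names:
        module = importlib.import_module(name)
    return module


def load_orc_dataset():
    """Import pyarrow first, then return RecIS's OrcDataset class.

    Needs pyarrow and RecIS installed; nothing is imported until this is called.
    """
    return import_in_order(["pyarrow", "recis.io.orc_dataset"]).OrcDataset
```

Calling `OrcDataset = load_orc_dataset()` at the top of a script then replaces the direct import, and the dependency on pyarrow being loaded first is documented in code rather than implied by the order of an import block.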

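Cause #

OrcDataset needs to be able to read ORC files. That capability comes from column_io, which in turn depends on Arrow. I have no root access on the server, so I cannot install Arrow system-wide; my Arrow libraries come from the pyarrow package installed through conda. As a result, I have to import pyarrow before using OrcDataset so that the Arrow shared libraries get loaded; otherwise column_io cannot find Arrow.

Further verification: I generated a core dump and ran gdb python core_dump_file to inspect the call stack. It does contain addresses that cannot be resolved, which are presumably Arrow functions: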

(gdb) bt
...
#400 0x00007ffcb58d5e68 in ?? ()
#401 0x000000000000001c in ?? ()
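As a lighter-weight complement to the gdb session, Python's standard faulthandler module can print the Python-level traceback at the moment of the crash, which shows which Python line (for example, which import) was executing. The sketch below is not something the original investigation used; `run_with_faulthandler` is a hypothetical helper that starts a child interpreter with the `-X faulthandler` option and captures what it writes to stderr.

```python
import subprocess
import sys


def run_with_faulthandler(args):
    """Run `python -X faulthandler <args...>` and return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *args],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    return proc.returncode, proc.stderr
```

For the case in this post the call would be `run_with_faulthandler(["deepctr.py"])`. On a segmentation fault the captured stderr starts with "Fatal Python error: Segmentation fault" followed by the Python stack of the crashing thread. It does not resolve native frames, so gdb is still needed for the C++ side.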

Listing the loaded shared libraries confirms that no Arrow library was loaded:

(gdb) info share
...
0x00007f5d5bad8030  0x00007f5d5bad8f60  Yes         /home1/workspace/xxx/.conda/envs/recis/lib/python3.10/lib-dynload/_bisect.cpython-310-x86_64-linux-gnu.so
0x00007f5d5b47e030  0x00007f5d5b47f5ba  Yes         /home1/workspace/xxx/.conda/envs/recis/lib/python3.10/lib-dynload/_random.cpython-310-x86_64-linux-gnu.so
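The same check can be made from inside a live Python process, without a core dump. The following is a rough sketch assuming Linux, where `/proc/self/maps` lists every file mapped into the current process. The function names are hypothetical, and the substring `"arrow"` is an assumption about how the Arrow shared objects are named on a given system.

```python
import os


def parse_mapped_files(maps_text, needle):
    """Return sorted unique file paths from /proc/<pid>/maps text whose
    file name contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if needle in os.path.basename(path):
                paths.add(path)
    return sorted(paths)


def loaded_libraries(needle, maps_path="/proc/self/maps"):
    """List files mapped into the current process whose name contains `needle`.

    Returns an empty list when the maps file is unavailable (e.g. not Linux).
    """
    try:
        with open(maps_path) as f:
            return parse_mapped_files(f.read(), needle)
    except OSError:
        return []
```

Calling `loaded_libraries("arrow")` before and after `import pyarrow` should show whether the import is what brings the Arrow shared objects into the process: an empty list beforehand and one or more paths afterwards would be consistent with the explanation above.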

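Reflection #

Libraries written purely in Python rarely run into this kind of problem. column_io, which RecIS depends on, is written in C++, so its dependent libraries have to be set up correctly. The next time I hit a segmentation fault, one thing to check is whether a shared library failed to load.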