
DETR result alignment experiment log #288

Open · 7 of 8 tasks
HiHippie opened this issue May 19, 2022 · 25 comments
Assignees: HiHippie

@HiHippie (Contributor) commented May 19, 2022

Eager global model parallelism

Parameter alignment reference: https://github.com/facebookresearch/detr

Troubleshooting TODO list:

@HiHippie (Contributor, author) commented May 19, 2022

Reproduction results for detr (https://github.com/facebookresearch/detr):

| name | backbone | box AP |
| --- | --- | --- |
| DETR | ResNet50 | 42.0 |
| libai DETR | ResNet50 | 25.9 |
| libai DETR, after fixing the resnet50 weight-loading bug | ResNet50 | 29.7 |
| libai DETR, after fixing the MultiHeadAttention implementation | ResNet50 | 42.0 |

DETR

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.420
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.624
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.442
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.205
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.458
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.611
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.574
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.312
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.629
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.805 

libai DETR

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.259
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.487
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.242
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.083
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.247
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.464
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.248
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.380
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.412
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.163
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.421
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.681

libai DETR, after fixing the resnet50 bug

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.297
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.283
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.107
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.294
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.418
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.453
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.200
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.474
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713

libai DETR after the fixes: with the torch weights loaded, the results match the original paper

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.420
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.624
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.442
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.205
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.458
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.611
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.574
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.312
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.629
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.805

@rentainhe @Ldpe2G Inference is now aligned. No oneflow or libai bugs were hit; the fixes were all about implementation details.

To load torch weights correctly, my attention implementation borrows heavily from torch.nn.MultiheadAttention, which feels like a departure from libai's style; I'll clean it up this week.
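For reference, the weight layout that torch.nn.MultiheadAttention stores (a single packed `in_proj_weight` sliced as q, then k, then v) is the detail an independent implementation most easily gets wrong. A minimal numpy sketch of that layout, for intuition only, not the libai code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha(x, in_proj_w, in_proj_b, out_proj_w, out_proj_b, num_heads):
    """Single-batch multi-head self-attention with torch-style packed weights.

    x: [seq, dim]; in_proj_w: [3*dim, dim], packed as q, k, v slices
    (the layout torch.nn.MultiheadAttention uses for in_proj_weight).
    """
    seq, dim = x.shape
    head_dim = dim // num_heads
    qkv = x @ in_proj_w.T + in_proj_b            # [seq, 3*dim]
    q, k, v = np.split(qkv, 3, axis=-1)          # slice order: q, then k, then v
    split = lambda t: t.reshape(seq, num_heads, head_dim).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)       # each [heads, seq, head_dim]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    out = (attn @ v).transpose(1, 0, 2).reshape(seq, dim)
    return out @ out_proj_w.T + out_proj_b
```

Loading a torch checkpoint into an implementation that stores separate q/k/v matrices requires splitting this packed weight in exactly the q, k, v order above.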

@HiHippie HiHippie self-assigned this May 19, 2022
@Ldpe2G (Collaborator) commented May 19, 2022

Which backbone are these results for? Could you list them in a table, like https://github.com/facebookresearch/detr#model-zoo?

@HiHippie (Contributor, author):

> Which backbone are these results for? Could you list them in a table, like https://github.com/facebookresearch/detr#model-zoo?

OK

@rentainhe (Contributor):

Are these inference results?

@HiHippie (Contributor, author):

> Are these inference results?

Yes. Today I found that my multi-head attention implementation differs from torch.nn.MultiheadAttention (which the detr source code uses); that is probably where the problem is, and I'm revising the code now.

@rentainhe (Contributor):

> Yes. Today I found that my multi-head attention implementation differs from torch.nn.MultiheadAttention (which the detr source code uses); that is probably where the problem is, and I'm revising the code now.

OK~

@HiHippie (Contributor, author) commented Jun 2, 2022

Investigating an error raised by loss.backward() for certain input shapes: "F20220602 14:17:25.050042 15603 shape.cpp:187] Check failed: !broadcast_axis_vec.empty()".

The problem is traced to a oneflow bug in min/max, hit by projects/DETR/utils/box_ops.py:

import oneflow as flow

def generalized_box_iou(boxes1, boxes2):
    """
    Generalized IoU from https://giou.stanford.edu/

    The boxes should be in [x0, y0, x1, y1] format.

    Returns an [N, M] pairwise matrix, where N = len(boxes1)
    and M = len(boxes2).
    """
    # degenerate boxes give inf / nan results, so do an early check
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
    assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
    iou, union = box_iou(boxes1, boxes2)  # box_iou is defined alongside in box_ops.py
    # enclosing box: min of top-left corners, max of bottom-right corners
    lt = flow.min(boxes1[:, None, :2], boxes2[:, :2])
    rb = flow.max(boxes1[:, None, 2:], boxes2[:, 2:])

    wh = (rb - lt).clamp(min=0)  # [N, M, 2]
    area = wh[:, :, 0] * wh[:, :, 1]

    return iou - (area - union) / area
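As a framework-independent sanity check for the function above, the same pairwise GIoU can be sketched in numpy (this version recomputes the IoU inline instead of calling a separate box_iou, and assumes well-formed boxes as the asserts above do):

```python
import numpy as np

def giou(boxes1, boxes2):
    # boxes in [x0, y0, x1, y1]; returns an [N, M] pairwise GIoU matrix
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = np.maximum(boxes1[:, None, :2], boxes2[:, :2])   # intersection top-left
    rb = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:])   # intersection bottom-right
    wh = np.clip(rb - lt, 0, None)
    inter = wh[:, :, 0] * wh[:, :, 1]
    union = area1[:, None] + area2 - inter
    iou = inter / union
    # smallest enclosing box: min of top-lefts, max of bottom-rights
    enc_lt = np.minimum(boxes1[:, None, :2], boxes2[:, :2])
    enc_rb = np.maximum(boxes1[:, None, 2:], boxes2[:, 2:])
    enc = np.clip(enc_rb - enc_lt, 0, None)
    enc_area = enc[:, :, 0] * enc[:, :, 1]
    return iou - (enc_area - union) / enc_area
```

Identical boxes give GIoU 1.0; two unit-overlap-free boxes touching at a corner give -0.5, which is a quick way to spot sign or broadcasting mistakes in a port.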

Minimal reproduction, using flow.max as the example (flow.min behaves the same).

Versions:

>>> flow.__version__
'0.8.0.dev20220606+cu112'
>>> torch.__version__
'1.11.0+cu113'
# import torch
import oneflow as torch

class Net(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 10)

    def forward(self, x, z):
        x = self.linear(x)
        '''
        My scenario: the input shapes are dynamic -- sometimes x and z have
        the same shape, sometimes not. Following the torch code, a dimension
        is inserted so the two can always be compared via broadcasting.
        The oneflow problem: if the shapes are x->[1, d], z->[1, d] there is
        a bug, but any first dimension other than 1 works fine.
        The three cases below exercise this.
        '''
        # when x and z have different shapes, x needs the extra dim for max/min
        h = torch.max(x[:, None, :], z)
        # h = torch.min(x[:, None, :], z)

        return h.mean()


net = Net()

# case: different shapes
# x = torch.randn(15, 10)
# z = torch.randn(5, 10)

# case: same shapes
# x = torch.randn(15, 10)
# z = torch.randn(15, 10)

# ! the case that triggers the oneflow bug:
# same shapes, but [1, d]
x = torch.randn(1, 10)
z = torch.randn(1, 10)

optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

y = torch.ones([1])

criterion = torch.nn.MSELoss()

output = net(x, z)

optimizer.zero_grad()
loss = criterion(output, y)

# ! the bug is raised here, during backward
loss.backward()

optimizer.step()

With input:

x = torch.randn(1, 10)
z = torch.randn(1, 10)

the error is:

F20220607 10:56:17.827364 39752 shape.cpp:184] Check failed: !broadcast_axis_vec.empty() 
*** Check failure stack trace: ***
    @     0x7f4f74f2ff9a  google::LogMessage::Fail()
    @     0x7f4f74f30282  google::LogMessage::SendToLog()
    @     0x7f4f74f2fb07  google::LogMessage::Flush()
    @     0x7f4f74f32679  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f4f6be5ac9d  oneflow::Shape::Axes4BroadcastTo()
    @     0x7f4f6bc1c4ef  oneflow::one::BroadcastMinMax::Apply()
    @     0x7f4f6bc1d5d1  oneflow::one::OpExprGradFunction<>::ApplyIf()
    @     0x7f4f6d61c609  _ZNSt17_Function_handlerIFN7oneflow5MaybeIvvEERKNS0_3one11TensorTupleEPS4_bEZNKS3_19AutogradInterpreter5ApplyERKNS3_6OpExprES6_S7_RKNS3_19OpExprInterpContextEEUlS6_S7_bE0_E9_M_invokeERKSt9_Any_dataS6_OS7_Ob
    @     0x7f4f6bbd5407  oneflow::one::FunctionNode::Apply()
    @     0x7f4f6bbd9158  oneflow::one::GraphTask::Apply()
    @     0x7f4f6bbd9fb8  oneflow::one::GraphAutogradEngine::RunBackwardAndSaveGrads4LeafTensor()
    @     0x7f4f6bbd3ef5  oneflow::one::AutogradEngine::RunBackwardAndSaveGrads4LeafTensorIf()
    @     0x7f50283c48e9  oneflow::autograd::Backward()
    @     0x7f50283bc21f  (unknown)
    @     0x7f50285ddc79  (unknown)
    @     0x55ade7f25348  PyCFunction_Call
    @     0x55ade7f14dbc  _PyObject_MakeTpCall.localalias.6
    @     0x55ade7f9c545  _PyEval_EvalFrameDefault
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7f6b0a3  _PyFunction_Vectorcall.localalias.352
    @     0x55ade7ed4a61  _PyEval_EvalFrameDefault.cold.2825
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7f6b0a3  _PyFunction_Vectorcall.localalias.352
    @     0x55ade7ed4a40  _PyEval_EvalFrameDefault.cold.2825
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7fff543  PyEval_EvalCode
    @     0x55ade7fff5e4  run_eval_code_obj
    @     0x55ade8025854  run_mod
    @     0x55ade7ee6390  pyrun_file
    @     0x55ade7ee90d2  PyRun_SimpleFileExFlags.localalias.16
    @     0x55ade7ee9bf0  Py_RunMain.cold.2953
    @     0x55ade8028a09  Py_BytesMain
Aborted

The same code runs without any error under torch.

@clackhan
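For reference, the broadcasting pattern the repro relies on, shown with numpy (this only illustrates the intended semantics, independently of the oneflow backward bug):

```python
import numpy as np

# x[:, None, :] gives x the shape [N, 1, d]; broadcasting against z of shape
# [M, d] yields [N, M, d], so elementwise max/min works even when N != M.
x = np.random.randn(15, 10)
z = np.random.randn(5, 10)
h = np.maximum(x[:, None, :], z)
assert h.shape == (15, 5, 10)

# the shapes that trip oneflow's backward: both inputs [1, d]
x1 = np.random.randn(1, 10)
z1 = np.random.randn(1, 10)
h1 = np.maximum(x1[:, None, :], z1)   # [1, 1, 10]; the forward pass is fine everywhere
assert h1.shape == (1, 1, 10)
```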

@HiHippie (Contributor, author):

In libai/utils/distributed.py:

def convert_to_distributed_default_setting(module):
    """
    Helper function to convert all eager local tensor in :attr:`nn.Module` in the model to
    global tensor with data parallelism as default.
    """
    for param in module.parameters():
        if not param.is_global:
            module.to_global(
                sbp=get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
                placement=get_layer_placement(0),
            )
            return

This function's job is to move the model to_global during build_model.

But if the model has tensors registered via register_buffer, module.parameters() does not include them, so the buffers never get moved to_global.

Should this be implemented with state_dict() instead:

def convert_to_distributed_default_setting(module):
    """
    Helper function to convert all eager local tensor in :attr:`nn.Module` in the model to
    global tensor with data parallelism as default.
    """
    for _, v in module.state_dict().items():
        if not v.is_global:
            module.to_global(
                sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
                placement=dist.get_layer_placement(0),
            )
            return

so that buffer tensors also get moved to_global.

@rentainhe @CPFLAME could you take a look at whether this change is worth making~
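The parameters-vs-buffers distinction can be illustrated with a plain-Python mock of the Module API (schematic only; the real oneflow/libai classes differ):

```python
# A registered buffer appears in state_dict() but is NOT yielded by
# parameters() -- which is why iterating parameters() can miss local buffers.
class MiniModule:
    def __init__(self):
        self._parameters = {"weight": [1.0, 2.0]}        # learnable
        self._buffers = {"running_mean": [0.0, 0.0]}     # registered buffer

    def parameters(self):
        return list(self._parameters.values())

    def state_dict(self):
        d = dict(self._parameters)
        d.update(self._buffers)   # state_dict covers both kinds of tensors
        return d

m = MiniModule()
assert len(m.parameters()) == 1            # buffer is invisible here
assert "running_mean" in m.state_dict()    # but visible here
```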

@CPFLAME (Contributor) commented Jun 13, 2022

I think the change should be OK. After making it, run the other cases, e.g. bash dev/model_test.sh, and check that none of the other models break.

@HiHippie (Contributor, author):

OK, I'll give it a try.

@HiHippie (Contributor, author) commented Jul 13, 2022

global eager ddp

With 4-GPU data parallelism the OOM error below appears almost immediately; with 2 GPUs it shows up a bit later.

F20220713 07:57:17.976464 1348305 virtual_machine_engine.cpp:332] 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/virtual_machine_engine.cpp", line 332, in DispatchInstruction
    ret
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/op_call_instruction_type.cpp", line 49, in Prepare
    AllocateOutputBlobsMemory(operand, device_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/op_call_instruction_type.cpp", line 103, in AllocateOutputBlobsMemory
    blob_object->TryAllocateBlobBodyMemory(device_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/eager_blob_object.cpp", line 66, in TryAllocateBlobBodyMemory
    allocator->Allocate(&dptr, required_body_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
    AllocateBlockToExtendTotalMem(aligned_size)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
    backend_->Allocate(&mem_ptr, final_allocate_bytes)
out of memory
Error Type: oneflow.ErrorProto.runtime_error
*** Check failure stack trace: ***

I monitored GPU-0 memory usage with pynvml:

import pynvml

pynvml.nvmlInit()
NUM_EXPAND = 1024 * 1024  # bytes -> MiB
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(meminfo.used / NUM_EXPAND)
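To make the per-iteration trend easier to eyeball, the probe can be wrapped in a small sampler (a sketch; the probe callable stands in for the pynvml query above):

```python
class MemoryMonitor:
    """Collect one memory sample per training step from a probe callable."""

    def __init__(self, probe):
        self.probe = probe        # e.g. a pynvml query returning MiB used
        self.samples = []

    def step(self):
        self.samples.append(self.probe())

    def is_growing(self, window=3):
        # crude leak heuristic: the last `window` samples strictly increase
        s = self.samples[-window:]
        return len(s) == window and all(a < b for a, b in zip(s, s[1:]))
```

Calling monitor.step() once per training iteration and checking is_growing() gives a quick signal of monotonic growth without plotting.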

Memory over iterations with 4 GPUs:
(image)

Memory over iterations with 2 GPUs:
(image)

Memory over iterations for vae:
(image)

Not yet sure where the problem is.

@CPFLAME (Contributor) commented Jul 13, 2022

Could some variables not be getting released in time?

@HiHippie (Contributor, author):

> Could some variables not be getting released in time?

I'll look into it.

@HiHippie (Contributor, author):

The problem above is located: when computing hidden_state + position_embedding, if the two have inconsistent sbp (hidden_state is split(0) while position_embedding is broadcast), it causes the OOM. If both are split(0), everything is fine.

I'll write up a detailed minimal reproduction tomorrow.

This may be a latent bug?

@HiHippie (Contributor, author):

Recording a problem left over from earlier.

First, in the following code the transformer has two outputs, and the second one is unused:

        hs, _ = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])

Inside the transformer, the logic is:

        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, query_pos=query_embed)

        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
        # return hs.transpose(1, 2)

As you can see, the second output is the encoder output, memory.

The problem: if the transformer returns memory.permute(1, 2, 0).view(bs, c, h, w), an error is raised; if it does not return it, training runs normally.

Tracing showed the .view op is the cause.
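For intuition, the shape bookkeeping that .view must satisfy can be sketched in numpy (a sketch of the pattern only; numpy's reshape stands in for oneflow's view here):

```python
import numpy as np

# memory leaves the encoder as [h*w, bs, c]; permute(1, 2, 0) turns it into
# [bs, c, h*w], and view then needs exactly bs*c*h*w elements. With DETR's
# dynamic padding, h and w can differ per batch/rank, which matches the
# elem_cnt mismatch the CopyDataContentKernel CHECK reports.
bs, c, h, w = 2, 8, 7, 6
memory = np.random.randn(h * w, bs, c)
out = memory.transpose(1, 2, 0).reshape(bs, c, h, w)   # element counts match: ok

mismatch = False
try:
    memory.transpose(1, 2, 0).reshape(bs, c, h, w + 1)  # wrong element count
except ValueError:
    mismatch = True   # numpy rejects it eagerly; the oneflow kernel hits a CHECK
```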

The full error message:

F20220726 03:19:20.448858 993233 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (860160 vs. 903168) 
F20220726 03:19:20.448522 993266 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (946176 vs. 903168) 
F20220726 03:19:20.448385 993267 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (860160 vs. 903168) 
*** Check failure stack trace: ***
*** Check failure stack trace: ***
*** Check failure stack trace: ***
F20220726 03:19:20.448640 993270 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (946176 vs. 903168) 
*** Check failure stack trace: ***
    @     0x7f81e7480efa  google::LogMessage::Fail()
    @     0x7fa1499a7efa  google::LogMessage::Fail()
    @     0x7f3474521efa  google::LogMessage::Fail()
    @     0x7fd168531efa  google::LogMessage::Fail()
    @     0x7f81e74811e2  google::LogMessage::SendToLog()
    @     0x7fa1499a81e2  google::LogMessage::SendToLog()
    @     0x7f34745221e2  google::LogMessage::SendToLog()
    @     0x7fd1685321e2  google::LogMessage::SendToLog()
    @     0x7f3474521a67  google::LogMessage::Flush()
    @     0x7fa1499a7a67  google::LogMessage::Flush()
    @     0x7f81e7480a67  google::LogMessage::Flush()
    @     0x7fd168531a67  google::LogMessage::Flush()
    @     0x7fa1499aa5d9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f34745245d9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fd1685345d9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f81e74835d9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa1432b6020  oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
    @     0x7fd161e40020  oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
    @     0x7f346de30020  oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
    @     0x7f81e0d8f020  oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
    @     0x7fd1630594ba  oneflow::one::StatefulOpKernel::Compute()
    @     0x7f346f0494ba  oneflow::one::StatefulOpKernel::Compute()
    @     0x7fa1444cf4ba  oneflow::one::StatefulOpKernel::Compute()
    @     0x7f81e1fa84ba  oneflow::one::StatefulOpKernel::Compute()
    @     0x7fd15e437e1a  oneflow::vm::OpCallInstructionType::Compute()
    @     0x7f346a427e1a  oneflow::vm::OpCallInstructionType::Compute()
    @     0x7fa13f8ade1a  oneflow::vm::OpCallInstructionType::Compute()
    @     0x7f81dd386e1a  oneflow::vm::OpCallInstructionType::Compute()
    @     0x7fd161b4ad10  oneflow::vm::FuseInstructionPolicy::Compute()
    @     0x7f346db3ad10  oneflow::vm::FuseInstructionPolicy::Compute()
    @     0x7fa142fc0d10  oneflow::vm::FuseInstructionPolicy::Compute()
    @     0x7f81e0a99d10  oneflow::vm::FuseInstructionPolicy::Compute()
    @     0x7fd161b2a6a1  oneflow::vm::EpStreamType::Run()
    @     0x7f346db1a6a1  oneflow::vm::EpStreamType::Run()
    @     0x7fa142fa06a1  oneflow::vm::EpStreamType::Run()
    @     0x7f81e0a796a1  oneflow::vm::EpStreamType::Run()
    @     0x7fd161b3104f  oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @     0x7f346db2104f  oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @     0x7fa142fa704f  oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @     0x7fd161b333c0  oneflow::(anonymous namespace)::WorkerLoop()
    @     0x7f346db233c0  oneflow::(anonymous namespace)::WorkerLoop()
    @     0x7fa142fa93c0  oneflow::(anonymous namespace)::WorkerLoop()
    @     0x7f81e0a8004f  oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @     0x7fd161b3351d  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
    @     0x7f346db2351d  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
    @     0x7fa142fa951d  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
    @     0x7f81e0a823c0  oneflow::(anonymous namespace)::WorkerLoop()
    @     0x7f81e0a8251d  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
    @     0x7fd16854693f  execute_native_thread_routine
    @     0x7fd238398609  start_thread
    @     0x7f347453693f  execute_native_thread_routine
    @     0x7fa1499bc93f  execute_native_thread_routine
    @     0x7fd2382bd163  clone
    @     0x7f3544388609  start_thread
    @     0x7f81e749593f  execute_native_thread_routine
    @     0x7fa21980e609  start_thread
    @     0x7f35442ad163  clone
    @     0x7fa219733163  clone
    @     0x7f82b72e7609  start_thread
    @     0x7f82b720c163  clone
Killing subprocess 992337
Killing subprocess 992338
Killing subprocess 992339
Killing subprocess 992340
Traceback (most recent call last):
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 231, in <module>
    main()
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 219, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 188, in sigkill_handler
    returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/dataset/czq_home/anaconda3/envs/libai/bin/python3', '-u', 'projects/DETR/train_net.py', '--config-file', 'projects/DETR/configs/detr_training.py']' died with <Signals.SIGABRT: 6>.

@HiHippie (Contributor, author):

Recording a bug still to be reproduced/triaged.

During training we hit RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false; the cause has not been located yet.
Asking guo ran, the answer was that "the system's handling of is_dynamic is not very complete; many ops assume the static case".
DETR involves a lot of padding and dynamically sized tensors, and uses many ops such as reshape and permute, which may be the underlying reason.

I'll update here once it is reproduced/tracked down.

File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 472, in train
    super().train(self.start_iter, self.max_iter)
  File "/dataset/czq_home/projects/libai/libai/engine/trainer.py", line 146, in train
    self.run_step()
  File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 476, in run_step
    self._trainer.run_step(self.get_batch, self.cfg.train.input_placement_device)
  File "/dataset/czq_home/projects/libai/projects/DETR/trainer/detr_trainer.py", line 55, in run_step
    data = next(self._data_loader_iter)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 1129, in _next_data
    return self._process_data(data)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 1175, in _process_data
    data.reraise()
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/_utils.py", line 55, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/fetch.py", line 65, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/fetch.py", line 65, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "projects/DETR/datasets/detection.py", line 116, in __getitem__
    img, target = self.prepare(img, target)
  File "projects/DETR/datasets/detection.py", line 78, in __call__
    boxes = boxes[keep]
RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/array_functor.cpp", line 1993, in operator()
    PrepareSliceIndices(index, *(x->shape()), &slice_indices, &tensor_indices, &expand_dims, &target_dims)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/tensor_index.cpp", line 281, in PrepareSliceIndices
    ExpandMaskIndex(tensor)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/tensor_index.cpp", line 80, in ExpandMaskIndex
    functional::Reshape(item, {size})
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 140, in Dispatch<oneflow::one::Tensor>
    Dispatch<TensorTuple>(op_expr, inputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 131, in Dispatch<oneflow::one::TensorTuple>
    Dispatch(op_expr, inputs, outputs.get(), ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 111, in Apply
    internal_->Apply(op_expr, *inputs_ptr, outputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 83, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... Data_YouAreNotAllowedToCallThisFuncOutsideThisFile(); }()
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 83, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/local_tensor_infer_cache.cpp", line 198, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/local_tensor_infer_cache.cpp", line 177, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_expr.cpp", line 530, in InferPhysicalTensorDesc
    tensor_desc_infer_fn_(&infer_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/user/ops/reshape_op.cpp", line 41, in InferLogicalTensorDesc
    
Error Type: oneflow.ErrorProto.check_failed_error
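The failing line in the data pipeline, boxes = boxes[keep], is boolean-mask indexing, whose output length depends on the mask contents; that data-dependent output size is exactly the "dynamic" shape the check complains about. A numpy sketch of the pattern:

```python
import numpy as np

# Boolean-mask indexing: the result's first dimension equals the number of
# True entries, so its shape cannot be inferred from the input shapes alone.
boxes = np.array([[0, 0, 2, 2],
                  [1, 1, 1, 1],    # degenerate box, filtered out
                  [0, 0, 3, 3]], dtype=np.float32)
keep = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
filtered = boxes[keep]
assert filtered.shape == (2, 4)   # two valid boxes survive
```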

@yuanms2 commented Jul 28, 2022

@Ldpe2G @BBuf Take a look; this also seems to be an operator-level issue with dynamic-shape handling. yolo will likely hit it too.

@BBuf (Contributor) commented Jul 28, 2022

> @Ldpe2G @BBuf Take a look; this also seems to be an operator-level issue with dynamic-shape handling. yolo will likely hit it too.

It would help to distill a minimal reproduction here; the error stack alone is messy and hard to localize from.

@HiHippie (Contributor, author):

> It would help to distill a minimal reproduction here; the error stack alone is messy and hard to localize from.

OK, I'm already investigating; it just isn't clear yet. I'll post reproduction code once I have it.

@Ldpe2G (Collaborator) commented Jul 28, 2022

> OK, I'm already investigating; it just isn't clear yet. I'll post reproduction code once I have it.

The error is raised during data loading; is some data augmentation being applied there?

@yuanms2 commented Jul 28, 2022

张晓雨: eager should be fine; the Graph-side handling was previously pushed forward by 啸宇 and 慈杰, who have more detailed notes.

许啸宇: currently working on inplace for graph. Did some initial research, which touches dynamic shape inference, register planning, and dynamic memory allocation.

Related issue: https://github.com/Oneflow-Inc/OneTeam/issues/1076

@HiHippie (Contributor, author) commented Jul 28, 2022

> 张晓雨: eager should be fine; the Graph-side handling was previously pushed forward by 啸宇 and 慈杰, who have more detailed notes.
>
> 许啸宇: currently working on inplace for graph. Did some initial research, which touches dynamic shape inference, register planning, and dynamic memory allocation.
>
> Related issue Oneflow-Inc/OneTeam#1076

Got it, thank you. My side runs in eager mode, so it's more likely a problem in my own implementation; I'm trying to build a reproduction.

@HiHippie (Contributor, author) commented Aug 1, 2022

> Recording a bug still to be reproduced/triaged.
>
> During training we hit RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false (full trace in the earlier comment).

This problem disappeared after updating oneflow; I trained for a number of iterations and it did not reappear.

@HiHippie (Contributor, author) commented Aug 1, 2022

Following https://github.com/Oneflow-Inc/OneTeam/issues/779, a log of aligning the model loss:

- Check that the network structure in model.py is aligned
- Make sure the dataloader's shuffle is turned off
- Make sure the network's dropout is turned off
- Make sure the lr_scheduler and optimizer are the same
- For double insurance, set every dropout_prob argument to 0 and put the model in .eval() mode, so that during training the dropout, bn, and similar ops are all fixed and contain no randomness
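The determinism items above boil down to something like the following sketch (the commented-out calls are stand-ins for the real framework setup, not verified API usage):

```python
import random

def make_deterministic(seed=42):
    """Hedged sketch of the randomness-removal checklist above."""
    random.seed(seed)
    # np.random.seed(seed)           # numpy RNG used by augmentations (assumed)
    # flow.manual_seed(seed)         # framework RNG (name assumed)
    # model.eval()                   # freeze dropout / bn statistics
    # cfg.model.dropout_prob = 0.0   # per the checklist, belt and braces

make_deterministic(0)
a = [random.random() for _ in range(3)]
make_deterministic(0)
b = [random.random() for _ in range(3)]
assert a == b   # identical streams: shuffling etc. becomes reproducible
```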

no_aux_loss, AdamW, single GPU
Pretrained weights loaded:
(image)

aux_loss, AdamW, single GPU
Pretrained weights loaded:
(image)

aux_loss, AdamW, 4 GPUs
Pretrained weights loaded.
Because the torch version's DistributedSampler samples in a different order from libai's, training used a single sample for now; the loss curve follows.
(image)

@xiezipeng-ML (Contributor) commented Aug 9, 2022

> Following Oneflow-Inc/OneTeam#779, a log of aligning the model loss:
> ...
> no_aux_loss, AdamW, single GPU (image)
>
> @CPFLAME @xiezipeng-ML Does this degree of loss alignment count as OK? Looking for a sanity check~ The loaded weights are from a converged model, so a downward trend may not really be visible.

The loss curve looks basically fine.
You could also train from initialization weights and check that the loss actually descends.
