Executor coredumps and/or panics happen when running SQL like TPC-DS Q95/TPCH Q17 #1957

@BrytonLee

Describe the bug
Executor coredumps and/or panics happen when running SQL like TPC-DS Q95/TPCH Q17. The following are a few of the error messages:

Panics at SendError

thread 'auron-native-stage-15-part-1-tid-119' panicked at native-engine/auron/src/lib.rs:58:64:
called `Result::unwrap()` on an `Err` value: SendError { .. }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
26/01/23 10:17:13 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[auron native task 1.0 in stage 15.0 (TID 119),5,main]
java.lang.RuntimeException: called `Result::unwrap()` on an `Err` value: SendError { .. }
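
For readers unfamiliar with this failure mode, here is a minimal sketch of how calling unwrap() on a channel send produces a SendError panic once the receiving side has gone away. It uses std::sync::mpsc as a stand-in, because the issue does not show which channel type Auron uses at lib.rs:58; the helper name try_send_batch is hypothetical and is not part of Auron.

```rust
use std::sync::mpsc;

/// Hypothetical helper (not Auron code): sends `value` and returns `false`
/// instead of panicking when the receiver has already been dropped.
fn try_send_batch(tx: &mpsc::Sender<u32>, value: u32) -> bool {
    match tx.send(value) {
        Ok(()) => true,
        // The unsent value is handed back inside SendError; here it is discarded.
        Err(mpsc::SendError(_unsent)) => false,
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();

    // While the receiver is alive, sending succeeds.
    assert!(try_send_batch(&tx, 1));

    // Once the receiver is dropped (for example because the consuming task
    // was cancelled or failed), every later send returns Err(SendError).
    drop(rx);
    assert!(tx.send(2).is_err());

    // Calling `tx.send(3).unwrap()` here would panic in the same way as the
    // log above; matching on the result avoids the panic.
    assert!(!try_send_batch(&tx, 3));
}
```

Whether the send error should be ignored, logged, or propagated depends on why the receiver disappeared, which this issue has not yet established.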

Backtrace when panic

26/01/22 07:30:48 INFO Executor: Running task 19.1 in stage 114.0 (TID 1069)
   0: __rustc::rust_begin_unwind
             at /rustc/50aa04180709189a03dde5fd1c05751b2625ed37/library/std/src/panicking.rs:697:5
   1: core::panicking::panic_fmt
             at /rustc/50aa04180709189a03dde5fd1c05751b2625ed37/library/core/src/panicking.rs:75:14
   2: core::result::unwrap_failed
             at /rustc/50aa04180709189a03dde5fd1c05751b2625ed37/library/core/src/result.rs:1732:5
   3: auron::handle_unwinded_scope
   4: auron::rt::NativeExecutionRuntime::start::{{closure}}
   5: tokio::runtime::task::core::Core<T,S>::poll
   6: tokio::runtime::task::raw::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run
   9: tokio::runtime::context::scoped::Scoped<T>::set
  10: tokio::runtime::context::runtime::enter_runtime
  11: tokio::runtime::scheduler::multi_thread::worker::run
  12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  13: tokio::runtime::task::core::Core<T,S>::poll
  14: tokio::runtime::task::harness::Harness<T,S>::poll
  15: tokio::runtime::blocking::pool::Inner::run
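
The frame auron::handle_unwinded_scope, together with the java.lang.RuntimeException carrying the Rust panic message, suggests that panics are caught at the native boundary and converted into an error. The following is only a sketch of that general technique using std::panic::catch_unwind; the function run_catching is hypothetical and is not taken from the Auron source. It assumes the default panic=unwind setting.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical wrapper (not Auron code): runs `f` and turns a panic into
/// an `Err` holding the panic message, so that a caller could rethrow it
/// as an exception on the other side of an FFI boundary.
fn run_catching<T>(f: impl FnOnce() -> T) -> Result<T, String> {
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|payload| {
        // Panic payloads are usually either &str or String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}

fn main() {
    // A closure that completes normally passes its value through.
    assert_eq!(run_catching(|| 1 + 1), Ok(2));

    // A closure that unwraps an Err panics; the message is captured.
    let failed: Result<(), String> = run_catching(|| {
        let r: Result<(), &str> = Err("receiver dropped");
        r.unwrap();
    });
    assert!(failed.unwrap_err().contains("receiver dropped"));
}
```

Note that catch_unwind only helps with unwinding panics; it cannot intercept a SIGSEGV such as the one in the coredump below.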

Coredumps

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007bb0b4aaa575, pid=1588146, tid=1877402
#
# JRE version: OpenJDK Runtime Environment (17.0.16+8) (build 17.0.16+8-Ubuntu-0ubuntu124.04.1)
# Java VM: OpenJDK 64-Bit Server VM (17.0.16+8-Ubuntu-0ubuntu124.04.1, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# [2504.856s][info   ][gc,start       ] GC(692) Pause Young (Prepare Mixed) (G1 Evacuation Pause)
[2504.856s][info   ][gc,task        ] GC(692) Using 43 workers of 43 for evacuation
[2504.861s][info   ][gc,phases      ] GC(692)   Pre Evacuate Collection Set: 0.2ms
[2504.861s][info   ][gc,phases      ] GC(692)   Merge Heap Roots: 0.2ms
[2504.861s][info   ][gc,phases      ] GC(692)   Evacuate Collection Set: 3.7ms
[2504.861s][info   ][gc,phases      ] GC(692)   Post Evacuate Collection Set: 1.1ms
[2504.861s][info   ][gc,phases      ] GC(692)   Other: 0.2ms
[2504.861s][info   ][gc,heap        ] GC(692) Eden regions: 270->0(18)
[2504.861s][info   ][gc,heap        ] GC(692) Survivor regions: 4->4(35)
[2504.861s][info   ][gc,heap        ] GC(692) Old regions: 69->69
[2504.861s][info   ][gc,heap        ] GC(692) Archive regions: 2->2
[2504.861s][info   ][gc,heap        ] GC(692) Humongous regions: 40->40
[2504.861s][info   ][gc,metaspace   ] GC(692) Metaspace: 106036K(107648K)->106036K(107648K) NonClass: 93465K(94336K)->93465K(94336K) Class: 12571K(13312K)->12571K(13312K)
[2504.861s][info   ][gc             ] GC(692) Pause Young (Prepare Mixed) (G1 Evacuation Pause) 758M->218M(894M) 5.411ms
[2504.861s][info   ][gc,cpu         ] GC(692) User=0.08s Sys=0.01s Real=0.00s
C  [libauron-4547940331120690501.tmp+0x16aa575][thread 1877388 also had an error]
  datafusion_ext_commons::arrow::eq_comparator::EqComparator::eq::hffa5a7c62813e2e3+0x35
#
# Core dump will be written. Default location: /var/coredumps/core.%e.1588146.%t
#
# An error report file with more information is saved as:
# /tmp/hadoop-saying/nm-local-dir/usercache/saying/appcache/application_1765502793146_0014/container_1765502793146_0014_01_000004/hs_err_pid1588146.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-17
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

To Reproduce
The bug reproduces with high probability when running TPC-DS Q95 or TPCH Q17.

Additional context

  • A coredump with SIGILL is not due to a cross-platform compatibility issue: Rust implements panic with ud2 (an undefined instruction) to terminate the program.
  • We are working on this issue; please contact us if you'd like to help. Thanks.
