Description
Describe the bug
Executors crash with core dumps and/or panics when running SQL queries such as TPC-DS Q75 or TPC-H Q17. A few of the error messages follow:
Panics at SendError
thread 'auron-native-stage-15-part-1-tid-119' panicked at native-engine/auron/src/lib.rs:58:64:
called `Result::unwrap()` on an `Err` value: SendError { .. }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
26/01/23 10:17:13 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[auron native task 1.0 in stage 15.0 (TID 119),5,main]
java.lang.RuntimeException: called `Result::unwrap()` on an `Err` value: SendError { .. }
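For context, a `SendError` from a channel `send` generally means the receiving end has already been dropped, so calling `unwrap()` on the result turns a closed channel into a panic. The following is a minimal sketch of that failure mode and of a non-panicking alternative. It assumes a `std::sync::mpsc` channel purely for illustration; the channel type actually used at `native-engine/auron/src/lib.rs:58` is not shown in this report, and `try_send` is a hypothetical helper, not part of Auron.

```rust
use std::sync::mpsc;

/// Hypothetical helper: forwards a value and reports a closed channel
/// as an error value instead of panicking via `unwrap()`.
fn try_send(tx: &mpsc::Sender<i32>, value: i32) -> Result<(), String> {
    tx.send(value)
        // `SendError` hands back the undelivered value in field 0.
        .map_err(|e| format!("receiver dropped, value {} not delivered", e.0))
}

fn main() {
    let (tx, rx) = mpsc::channel::<i32>();

    // While the receiver is alive, the send succeeds.
    assert!(try_send(&tx, 1).is_ok());
    assert_eq!(rx.recv().unwrap(), 1);

    // Once the receiver is dropped, every send fails with `SendError`.
    // Calling `tx.send(2).unwrap()` here would panic.
    drop(rx);
    match try_send(&tx, 2) {
        Ok(()) => println!("sent"),
        Err(msg) => println!("send failed: {msg}"),
    }
}
```

Whether returning an error is the right fix here depends on why the receiver goes away first, which this report does not establish.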
Backtrace on panic
26/01/22 07:30:48 INFO Executor: Running task 19.1 in stage 114.0 (TID 1069)
0: __rustc::rust_begin_unwind
at /rustc/50aa04180709189a03dde5fd1c05751b2625ed37/library/std/src/panicking.rs:697:5
1: core::panicking::panic_fmt
at /rustc/50aa04180709189a03dde5fd1c05751b2625ed37/library/core/src/panicking.rs:75:14
2: core::result::unwrap_failed
at /rustc/50aa04180709189a03dde5fd1c05751b2625ed37/library/core/src/result.rs:1732:5
3: auron::handle_unwinded_scope
4: auron::rt::NativeExecutionRuntime::start::{{closure}}
5: tokio::runtime::task::core::Core<T,S>::poll
6: tokio::runtime::task::raw::poll
7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
8: tokio::runtime::scheduler::multi_thread::worker::Context::run
9: tokio::runtime::context::scoped::Scoped<T>::set
10: tokio::runtime::context::runtime::enter_runtime
11: tokio::runtime::scheduler::multi_thread::worker::run
12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
13: tokio::runtime::task::core::Core<T,S>::poll
14: tokio::runtime::task::harness::Harness<T,S>::poll
15: tokio::runtime::blocking::pool::Inner::run
Coredumps
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007bb0b4aaa575, pid=1588146, tid=1877402
#
# JRE version: OpenJDK Runtime Environment (17.0.16+8) (build 17.0.16+8-Ubuntu-0ubuntu124.04.1)
# Java VM: OpenJDK 64-Bit Server VM (17.0.16+8-Ubuntu-0ubuntu124.04.1, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# [2504.856s][info ][gc,start ] GC(692) Pause Young (Prepare Mixed) (G1 Evacuation Pause)
[2504.856s][info ][gc,task ] GC(692) Using 43 workers of 43 for evacuation
[2504.861s][info ][gc,phases ] GC(692) Pre Evacuate Collection Set: 0.2ms
[2504.861s][info ][gc,phases ] GC(692) Merge Heap Roots: 0.2ms
[2504.861s][info ][gc,phases ] GC(692) Evacuate Collection Set: 3.7ms
[2504.861s][info ][gc,phases ] GC(692) Post Evacuate Collection Set: 1.1ms
[2504.861s][info ][gc,phases ] GC(692) Other: 0.2ms
[2504.861s][info ][gc,heap ] GC(692) Eden regions: 270->0(18)
[2504.861s][info ][gc,heap ] GC(692) Survivor regions: 4->4(35)
[2504.861s][info ][gc,heap ] GC(692) Old regions: 69->69
[2504.861s][info ][gc,heap ] GC(692) Archive regions: 2->2
[2504.861s][info ][gc,heap ] GC(692) Humongous regions: 40->40
[2504.861s][info ][gc,metaspace ] GC(692) Metaspace: 106036K(107648K)->106036K(107648K) NonClass: 93465K(94336K)->93465K(94336K) Class: 12571K(13312K)->12571K(13312K)
[2504.861s][info ][gc ] GC(692) Pause Young (Prepare Mixed) (G1 Evacuation Pause) 758M->218M(894M) 5.411ms
[2504.861s][info ][gc,cpu ] GC(692) User=0.08s Sys=0.01s Real=0.00s
C [libauron-4547940331120690501.tmp+0x16aa575][thread 1877388 also had an error]
datafusion_ext_commons::arrow::eq_comparator::EqComparator::eq::hffa5a7c62813e2e3+0x35
#
# Core dump will be written. Default location: /var/coredumps/core.%e.1588146.%t
#
# An error report file with more information is saved as:
# /tmp/hadoop-saying/nm-local-dir/usercache/saying/appcache/application_1765502793146_0014/container_1765502793146_0014_01_000004/hs_err_pid1588146.log
#
# If you would like to submit a bug report, please visit:
# https://bugs.launchpad.net/ubuntu/+source/openjdk-17
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
To Reproduce
This bug can be reproduced with high probability by running TPC-DS Q95 or TPC-H Q17.
Additional context
- The SIGILL core dump is not caused by a cross-platform compatibility issue: Rust implements panic with `ud2` (an undefined instruction) to terminate the program.
- We are working on this issue. Please contact us if you'd like to help. Thanks.