@ktf
Member

@ktf ktf commented Dec 10, 2025

This moves the forwarding to the earliest possible moment, i.e. when
we are about to insert the messages into a slot. This is the earliest point
at which we can guarantee that messages will be seen only once.


Stack created with Sapling. Best reviewed with ReviewStack.

ktf added 2 commits December 10, 2025 09:54
Use a single helper function to improve readability.
If one (header, payload, ...) tuple in a MessageSet was to be copied,
all the subsequent ones would have been copied.

If one (header, payload, ...) tuple got redirected to more than one destination,
all the subsequent ones would have been redirected there.
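
The two commit messages above describe a per-tuple decision that leaked into every later tuple. The sketch below illustrates that class of bug and its fix. It is not the DPL code: `Tuple`, `needsCopySticky` and `needsCopyPerTuple` are hypothetical names, and the "more than one destination" rule is an assumption made only for the example.

```cpp
#include <vector>

// Hypothetical stand-in for one (header, payload, ...) tuple; not the DPL type.
struct Tuple {
  int destinations; // number of downstream consumers that want this tuple
};

// Buggy variant: the flag lives outside the loop and is never reset,
// so once one tuple needs a copy every later tuple is copied as well.
std::vector<bool> needsCopySticky(const std::vector<Tuple>& tuples)
{
  std::vector<bool> result;
  bool copy = false;
  for (const auto& t : tuples) {
    if (t.destinations > 1) {
      copy = true;
    }
    result.push_back(copy);
  }
  return result;
}

// Fixed variant: the decision is recomputed for each tuple.
std::vector<bool> needsCopyPerTuple(const std::vector<Tuple>& tuples)
{
  std::vector<bool> result;
  for (const auto& t : tuples) {
    const bool copy = t.destinations > 1;
    result.push_back(copy);
  }
  return result;
}
```

With three tuples that have 1, 2 and 1 destinations, the sticky variant marks the last two for copying, while the per-tuple variant marks only the middle one.
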
@ktf ktf requested a review from a team as a code owner December 10, 2025 20:07
@github-actions
Contributor

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

@ktf
Member Author

ktf commented Dec 10, 2025

@shahor02 this works in my synthetic tests (stage/bin/o2-testworkflows-early-forwarding -s --severity detail --early-forward-policy=always). In the end I refactored the code to find the earliest spot where messages are guaranteed to be seen only once, and I moved the early forwarding there.

@davidrohr @shahor02 I have noticed that the early forwarding is disabled by default. Is this expected?

@ktf
Member Author

ktf commented Dec 10, 2025

@jgrosseo @nicolaspoffley I expect this to improve parallelism on hyperloop as well.

@shahor02
Collaborator

@ktf I did not expect the EF to be disabled. When I was debugging the slow turnover of Polaris jobs, I thought the forwarding was done at the beginning of the run method. Wasn't this the intended behaviour of the EF?

@ktf
Member Author

ktf commented Dec 10, 2025

@shahor02 I need to have a better look. Maybe it's just my small reproducer that is wrong.

I also see there are some issues with some of the tests. I will debug further tomorrow morning.

@alibuild
Collaborator

alibuild commented Dec 11, 2025

Error while checking build/O2/fullCI_slc9 for 13018ab at 2025-12-11 02:12:

No log files found

Full log here.

ktf added 4 commits December 11, 2025 10:35
This is most likely faster, and it will allow us to move
the early forwarding to an earlier stage where the data is not
yet in a MessageSet.
Add splitPayloadIndex / splitPayloadParts to the default printout
This moves the forwarding to the earliest possible moment, i.e. when
we are about to insert the messages into a slot. This is the earliest point
at which we can guarantee that messages will be seen only once.
@ktf
Member Author

ktf commented Dec 11, 2025

OK, fixed the off-by-one issue with multiparts.
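
The thread does not say where the off-by-one was. As a general illustration only, the sketch below shows one common place such an error appears when walking a flat multipart sequence laid out as header, payloads, header, payloads. `Part`, `payloadCount` and `headerPositions` are hypothetical names and do not correspond to the DPL or FairMQ types.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified part of a multipart message.
struct Part {
  bool isHeader;
  std::size_t payloadCount; // meaningful only when isHeader is true
};

// Return the index of every header in a flat
// [header, payload..., header, payload...] sequence.
std::vector<std::size_t> headerPositions(const std::vector<Part>& parts)
{
  std::vector<std::size_t> positions;
  std::size_t i = 0;
  while (i < parts.size()) {
    positions.push_back(i);
    // Step over the header itself plus its payloads.
    // Advancing by payloadCount alone would be off by one
    // and would land on the last payload instead of the next header.
    i += 1 + parts[i].payloadCount;
  }
  return positions;
}
```

For a header with two payloads followed by a header with one payload, the headers are found at indices 0 and 3.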

@alibuild
Collaborator

alibuild commented Dec 11, 2025

Error while checking build/O2/fullCI_slc9 for f6dfcce at 2025-12-25 18:40:

## sw/BUILD/O2-full-system-test-latest/log
command /sw/slc9_x86-64/O2/14910-slc9_x86-64-local4/prodtests/full-system-test/dpl-workflow.sh had nonzero exit code 128
[ERROR] Workflow crashed - PID 8337 (EMCALRawToCellConverterSpec) did not exit correctly however it's not clear why. Exit code forced to 128.
[ERROR] Workflow crashed - PID 8948 (MID-QcTaskMIDTracks-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8854 (TRD-PHTrackMatch-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8917 (GLO-MTCITSTPC-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8988 (TRD-Tracklets-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8964 (TPC-Clusters-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8939 (MID-QcTaskMIDClust-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8905 (FT0-DigitQcTaskFT0-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8847 (MFT-MFTAsyncTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8922 (ITS-ITSClusterTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8844 (GLO-MUONTracks-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 9001 (internal-dpl-injected-dummy-sink) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8868 (TRD-RawData-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8878 (EMC-CellTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8978 (TRD-Digits-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8898 (EMC-RawTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8841 (CPV-PhysicsOnEPNs-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8873 (TRD-Tracking-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8852 (TPC-PID-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8971 (TPC-Tracks-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8935 (MFT-MFTClusterTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8954 (PHS-ClusterTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8930 (MCH-QcTaskMCHDigits-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8945 (MID-QcTaskMIDDigits-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8929 (ITS-ITSTrackTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8959 (TOF-TaskDigits-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8899 (FDD-DigitQcTaskFDD-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8907 (FV0-DigitQcTaskFV0-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8990 (ZDC-QcZDCRecTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8918 (GLO-Vertexing-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8958 (TOF-MatchingTOFwTRD-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8700 (qc-task-PHS-ClusterTask) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8662 (qc-task-ITS-ITSTrackTask) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8738 (qc-task-TRD-Digits) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8688 (qc-task-MID-QcTaskMIDDigits) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8718 (qc-task-TOF-TaskDigits) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8670 (qc-task-MCH-QcTaskMCHDigits) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8830 (qc-task-TRD-Tracklets) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8643 (qc-task-GLO-Vertexing) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8726 (qc-task-TPC-Clusters) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8693 (qc-task-MID-QcTaskMIDTracks) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8713 (qc-task-TOF-MatchingTOFwTRD) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8676 (qc-task-MFT-MFTClusterTask) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8728 (qc-task-TPC-Tracks) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8639 (qc-task-GLO-MTCITSTPC) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8685 (qc-task-MID-QcTaskMIDClust) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8625 (qc-task-FDD-DigitQcTaskFDD) was killed abnormally with Killed and exited code was set to 137.
[0 more errors; see full log]

Full log here.
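
The 137 exit codes in the log above follow the usual shell convention of 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" text. The 128 on the first error line is different: the log states it was forced by the test script, so it does not encode a signal. A minimal sketch of that decoding, with `signalFromExitCode` as a hypothetical helper name:

```cpp
#include <csignal>
#include <optional>

// Shell convention: a process terminated by signal N is reported
// with exit code 128 + N.
// Returns the signal number, or std::nullopt when the code does not
// encode a signal under this convention.
std::optional<int> signalFromExitCode(int exitCode)
{
  if (exitCode > 128 && exitCode < 256) {
    return exitCode - 128;
  }
  return std::nullopt;
}
```

Under this convention a segmentation fault (SIGSEGV, signal 11 on Linux) would be reported as 139.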

@davidrohr
Collaborator

> @shahor02 this works in my synthetic tests (stage/bin/o2-testworkflows-early-forwarding -s --severity detail --early-forward-policy=always) . In the end I refactored the code to find the earliest spot where messages are guaranteed to be seen only once and I moved the early forward there.
>
> @davidrohr @shahor02 I have noticed that the early forwarding is disabled by default. Is this expected?

For online and offline reco we enable it here: https://github.com/davidrohr/O2DPG/blob/a5af1be2a96bbe3b2eeb2cf13d41c4afd1b81e4a/DATA/common/getCommonArgs.sh#L12

@shahor02
Collaborator

@ktf this seems to be a genuine crash:

[8369:EMCALRawToCellConverterSpec]: [14:43:54][INFO] Correctly handshaken websocket connection.
[8369:EMCALRawToCellConverterSpec]: [14:43:59][WARN] Timed out sending after 1s. Downstream backpressure detected on from_EMCALRawToCellConverterSpec_to_Dispatcher[0].
[8369:EMCALRawToCellConverterSpec]: [14:44:02][INFO] Downstream backpressure on from_EMCALRawToCellConverterSpec_to_Dispatcher[0] recovered.
[8369:EMCALRawToCellConverterSpec]: *** Program crashed (Segmentation fault)
[8369:EMCALRawToCellConverterSpec]: Backtrace by DPL:
[8369:EMCALRawToCellConverterSpec]: Executable is /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/bin/o2-emcal-reco-workflow
[8369:EMCALRawToCellConverterSpec]:     /lib64/libc.so.6:     ?? ??:0
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: fair::mq::shmem::Message::Copy(fair::mq::Message const&)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataProcessingHelpers::routeForwardedMessages(o2::framework::FairMQDeviceProxy&, std::span<std::unique_ptr<fair::mq::Message, std::default_delete<fair::mq::Message> >, 18446744073709551615ul>&, std::vector<fair::mq::Parts, std::allocator<fair::mq::Parts> >&, bool, bool)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so:     ?? ??:0
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so:     ?? ??:0
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataRelayer::relay(void const*, std::unique_ptr<fair::mq::Message, std::default_delete<fair::mq::Message> >*, o2::framework::DataRelayer::InputInfo const&, unsigned long, unsigned long, std::function<void (o2::framework::ServiceRegistryRef&, std::span<std::unique_ptr<fair::mq::Message, std::default_delete<fair::mq::Message> >, 18446744073709551615ul>&)>, std::function<void (o2::framework::TimesliceSlot, std::vector<o2::framework::MessageSet, std::allocator<o2::framework::MessageSet> >&, o2::framework::TimesliceIndex::OldestOutputInfo)>)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so:     ?? ??:0
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::handleData(o2::framework::ServiceRegistryRef, o2::framework::InputChannelInfo&)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::doPrepare(o2::framework::ServiceRegistryRef)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::run_callback(uv_work_s*)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::Run()
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: fair::mq::Device::RunWrapper()
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: boost::detail::function::void_function_obj_invoker1<std::function<void (fair::mq::State)>, void, fair::mq::State>::invoke(boost::detail::function::function_buffer&, fair::mq::State)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator()(fair::mq::State)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: fair::mq::fsm::Machine_::ProcessWork()
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: fair::mq::StateMachine::ProcessWork()
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: fair::mq::DeviceRunner::Run()
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: doChild(int, char**, o2::framework::ServiceRegistry&, o2::framework::RunningWorkflowInfo const&, o2::framework::RunningDeviceRef, o2::framework::DriverConfig const&, o2::framework::ProcessingPolicies, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uv_loop_s*)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: runStateMachine(std::vector<o2::framework::DataProcessorSpec, std::allocator<o2::framework::DataProcessorSpec> > const&, WorkflowInfo const&, std::vector<o2::framework::DataProcessorInfo, std::allocator<o2::framework::DataProcessorInfo> > const&, o2::framework::CommandInfo const&, o2::framework::DriverControl&, o2::framework::DriverInfo&, o2::framework::DriverConfig&, std::vector<o2::framework::DeviceMetricsInfo, std::allocator<o2::framework::DeviceMetricsInfo> >&, std::vector<o2::framework::ConfigParamSpec, std::allocator<o2::framework::ConfigParamSpec> > const&, boost::program_options::variables_map&, std::vector<o2::framework::ServiceSpec, std::allocator<o2::framework::ServiceSpec> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: doMain(int, char**, std::vector<o2::framework::DataProcessorSpec, std::allocator<o2::framework::DataProcessorSpec> > const&, std::vector<o2::framework::ChannelConfigurationPolicy, std::allocator<o2::framework::ChannelConfigurationPolicy> > const&, std::vector<o2::framework::CompletionPolicy, std::allocator<o2::framework::CompletionPolicy> > const&, std::vector<o2::framework::DispatchPolicy, std::allocator<o2::framework::DispatchPolicy> > const&, std::vector<o2::framework::ResourcePolicy, std::allocator<o2::framework::ResourcePolicy> > const&, std::vector<o2::framework::CallbacksPolicy, std::allocator<o2::framework::CallbacksPolicy> > const&, std::vector<o2::framework::SendingPolicy, std::allocator<o2::framework::SendingPolicy> > const&, std::vector<o2::framework::ConfigParamSpec, std::allocator<o2::framework::ConfigParamSpec> > const&, std::vector<o2::framework::ConfigParamSpec, std::allocator<o2::framework::ConfigParamSpec> > const&, o2::framework::ConfigContext&)
[8369:EMCALRawToCellConverterSpec]:     o2-emcal-reco-workflow() [0x407811]:     std::vector<o2::framework::ChannelConfigurationPolicy, std::allocator<o2::framework::ChannelConfigurationPolicy> >::~vector() at stl_vector.h:735
[8369:EMCALRawToCellConverterSpec]:     /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: callMain(int, char**, int (*)(int, char**))
[8369:EMCALRawToCellConverterSpec]:     o2-emcal-reco-workflow() [0x404c59]:     main at runDataProcessing.h:220
[8369:EMCALRawToCellConverterSpec]:     /lib64/libc.so.6:     ?? ??:0
[8369:EMCALRawToCellConverterSpec]:     /lib64/libc.so.6:     ?? ??:0
[8369:EMCALRawToCellConverterSpec]:     o2-emcal-reco-workflow() [0x404cf5]:     _start at ??:?
[8369:EMCALRawToCellConverterSpec]: Backtrace complete.

@ktf
Member Author

ktf commented Dec 11, 2025

@shahor02 indeed. I am investigating.

@ktf
Member Author

ktf commented Dec 11, 2025

I suspect it's an issue with the backpressure. I will try to reproduce it.
