Description
After running SigNoz for approximately 12 hours under a moderate metrics ingestion load (Kubernetes cluster monitoring), the signoz_metrics database, and specifically the time_series_v4_1week table, accumulates over 3,000 active parts, and the count keeps growing. Despite:
Increasing ClickHouse's part limit (max_parts_in_total)
Manually triggering merges via OPTIMIZE TABLE ... FINAL
…the part count does not decrease, leading to:
Rising disk I/O pressure
Slower metric queries
Risk of hitting system limits (e.g., too many open files)
This suggests that either automatic merging is ineffective, or the ingestion pattern creates too many small parts to be merged efficiently.
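To distinguish between those two causes, it may help to check whether background merges are running at all and how small the parts actually are. A diagnostic sketch (database and table names taken from this report; run against the ClickHouse cluster):

```sql
-- Are any merges currently in flight for the affected table?
SELECT database, table, elapsed, progress, num_parts, result_part_name
FROM system.merges
WHERE database = 'signoz_metrics' AND table = 'time_series_v4_1week';

-- Part-size distribution: many tiny parts indicate small, frequent inserts
SELECT
    partition,
    count() AS parts,
    formatReadableSize(avg(bytes_on_disk)) AS avg_part_size
FROM system.parts
WHERE database = 'signoz_metrics'
  AND table = 'time_series_v4_1week'
  AND active
GROUP BY partition
ORDER BY parts DESC;
```

If system.merges is consistently empty while the part count grows, merges are not being scheduled; if merges are running but the average part size stays in the tens of KiB, inserts are simply outpacing them.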
How to reproduce
Deploy SigNoz v2.x (Helm chart) on a Kubernetes cluster with default settings.
Enable metrics collection from a ~100-node Kubernetes cluster (via Prometheus + kube-state-metrics).
Let the system run for 12+ hours under steady load (~50k–100k samples/sec).
Query ClickHouse system tables:
SELECT table, count() AS parts
FROM system.parts
WHERE database = 'signoz_metrics' AND active = 1
GROUP BY table
ORDER BY parts DESC;
→ Observe that time_series_v4_1week has >3,000 active parts.
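A further check that can confirm merges falling behind inserts is the ratio of part-creation to merge events (this assumes system.part_log is enabled on the cluster, which it is not by default):

```sql
-- Compare insert vs. merge activity over the last hour; if NewPart events
-- far outnumber MergeParts events, part creation is outpacing merging
SELECT event_type, count() AS events
FROM system.part_log
WHERE database = 'signoz_metrics'
  AND table = 'time_series_v4_1week'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY event_type;
```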
Error log:
{"date_time":"1765261347.566535","thread_name":"TCPServerConnection ([#228])","thread_id":"1062","level":"Error","query_id":"","logger_name":"TCPHandler","message":"Code: 252. DB::Exception: Too many parts (3001 with average size of 37.52 KiB) in table 'signoz_metrics.time_series_v4_1week (b551387e-903a-4682-9588-702f998fc386)'. Merges are processing significantly slower than inserts: while pushing to view signoz_metrics.time_series_v4_1week_mv (c80068d3-13ca-4315-8328-803ad28cd320): while pushing to view signoz_metrics.time_series_v4_1day_mv (95c52533-4529-4119-94c8-580fbc64c7c8): while pushing to view signoz_metrics.time_series_v4_6hrs_mv (93a5f98b-638e-47d2-bc92-5e75910f2354). (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):\n\n0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000f87489b\n1. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x0000000009d9940c\n2. DB::Exception::Exception<unsigned long&, ReadableSize, String>(int, FormatStringHelperImpl<std::type_identity<unsigned long&>::type, std::type_identity::type, std::type_identity::type>, unsigned long&, ReadableSize&&, String&&) @ 0x00000000148fe9bc\n3. DB::MergeTreeData::delayInsertOrThrowIfNeeded(Poco::Event*, std::shared_ptr<DB::Context const> const&, bool) const @ 0x00000000148fe0f1\n4. DB::runStep(std::function<void ()>, DB::ThreadStatus*, std::atomic) @ 0x00000000152e381f\n5. DB::ExceptionKeepingTransform::work() @ 0x00000000152e2fd0\n6. DB::ExecutionThreadContext::executeTask() @ 0x00000000150551e9\n7. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic) @ 0x0000000015048c98\n8. DB::PipelineExecutor::executeStep(std::atomic) @ 0x0000000015048072\n9. DB::PushingPipelineExecutor::start() @ 0x000000001505da5d\n10. DB::TCPHandler::processInsertQuery(DB::QueryState&) @ 0x0000000014fa6790\n11. DB::TCPHandler::runImpl() @ 0x0000000014f97608\n12. DB::TCPHandler::run() @ 0x0000000014fb6239\n13. 
Poco::Net::TCPServerConnection::start() @ 0x00000000186d9707\n14. Poco::Net::TCPServerDispatcher::run() @ 0x00000000186d9b59\n15. Poco::PooledThread::run() @ 0x00000000186a4e3b\n16. Poco::ThreadImpl::runnableEntry(void) @ 0x00000000186a331d\n17. ? @ 0x00007fe65fc74ac3\n18. ? @ 0x00007fe65fd06850\n","source_file":"src/Server/TCPHandler.cpp; auto DB::TCPHandler::runImpl()::(anonymous class)::operator()() const","source_line":"477"
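The exception fires at 3,001 parts, which matches ClickHouse's default parts_to_throw_insert threshold of 3,000 in recent versions. As a temporary mitigation, the per-table thresholds can be raised to let merges catch up; the values below are illustrative assumptions, and this does not fix the underlying small-insert pattern:

```sql
-- Temporary mitigation sketch: raise the delay/throw thresholds on the
-- affected table (values are illustrative and need tuning per cluster)
ALTER TABLE signoz_metrics.time_series_v4_1week
MODIFY SETTING parts_to_delay_insert = 3000, parts_to_throw_insert = 5000;
```

A more durable fix would target the insert side (larger, less frequent batches from the exporter, or ClickHouse async_insert) so that new parts are created at a rate merges can sustain.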
Environment:
SigNoz backend: 4 nodes × 16 vCPU / 32 GB RAM
ClickHouse cluster: 3 nodes (1 shard, 1 replica), each on dedicated VMs
Storage: Local SSDs
