The aim of this AIOps project under the LFEdge umbrella is to help deliver AI-supported operations for heterogeneous and complex distributed environments such as the edge.
There are three workstreams at the highest level:
The motto inside this folder is "Anything AIOps Goes". This is the home for your wildest ideas to try out, code, test and report back on. Ideally everything is built on open-source technology, or at least on free-to-use and openly available software or models. Examples include evaluating different algorithms or models for anomaly detection, or trying different agentic architectures, such as agents for logs, metrics and traces versus agents for each system component or layer of the technology stack.
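As one concrete example of the kind of experiment that fits here, the sketch below compares a simple z-score check against an Isolation Forest on a synthetic CPU metric with an injected spike. The library choice (scikit-learn), the thresholds and the data are illustrative assumptions, not project decisions.

```python
# Minimal sketch: comparing two anomaly detectors on a synthetic CPU metric.
# scikit-learn and the thresholds below are assumptions for illustration only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
cpu = rng.normal(40, 5, 1_000)      # baseline CPU utilisation (%)
cpu[500:510] += 45                  # injected anomaly window

# Detector 1: simple global z-score
z = np.abs((cpu - cpu.mean()) / cpu.std())
zscore_flags = z > 3

# Detector 2: Isolation Forest over the raw values
iso = IsolationForest(contamination=0.01, random_state=0)
iso_flags = iso.fit_predict(cpu.reshape(-1, 1)) == -1

print("z-score flagged:", int(zscore_flags.sum()), "points")
print("isolation forest flagged:", int(iso_flags.sum()), "points")
```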
The test harness is where we assemble different components of AIOps solutions to test and benchmark against each other. This can range from sub-component test harnesses (e.g. for anomaly detection) to a full end-to-end AIOps solution cycle covering data/signal collection, signal correlation, anomaly detection, root cause analysis, and remediation.
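One possible way to keep harness components swappable is to define a small interface per stage of that cycle and benchmark implementations behind it. The sketch below is only an illustration of that idea; none of the class or function names are an agreed project interface.

```python
# Sketch of pluggable test-harness stages mirroring the end-to-end cycle above.
# All names are illustrative and not an agreed project interface.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Signal:
    source: str                                # e.g. "otel-metrics", "syslog"
    payload: dict = field(default_factory=dict)


class Detector(Protocol):
    def detect(self, signals: list[Signal]) -> list[Signal]:
        """Return the subset of signals considered anomalous."""
        ...


class RootCauseAnalyzer(Protocol):
    def analyze(self, anomalies: list[Signal]) -> str:
        """Return a human-readable root-cause hypothesis."""
        ...


def run_cycle(signals: list[Signal], detector: Detector, rca: RootCauseAnalyzer) -> str:
    """One benchmark iteration: detection followed by root cause analysis."""
    anomalies = detector.detect(signals)
    return rca.analyze(anomalies) if anomalies else "no anomaly detected"
```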
The demo is where contributors assemble working demos and products to showcase.
The 'Recommend' arrow has several human-in-the-loop (HIL) remediation maturity levels:
- Level 1: Assisted Questioning (lowest level of maturity)
- Level 2: Mitigation Solution Generation (checklist of tasks)
- Level 3: Command Recommendation (low-level command recommendations according to the checklist)
- Level 4: Script Generation (entire script of all commands)
before we move into the Autonomous Operations realm with
- Level 5: Automatic Execution (no HIL)
(source https://arxiv.org/pdf/2507.12472)
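To make the gating concrete, the sketch below encodes the levels above as an enum and shows how anything below Level 5 could require operator approval before a remediation runs. The gating function and its example call are illustrative assumptions, not project code.

```python
# The maturity levels above as an enum, plus an illustrative gate:
# everything below Level 5 requires a human in the loop before execution.
from enum import IntEnum


class MaturityLevel(IntEnum):
    ASSISTED_QUESTIONING = 1
    MITIGATION_SOLUTION_GENERATION = 2
    COMMAND_RECOMMENDATION = 3
    SCRIPT_GENERATION = 4
    AUTOMATIC_EXECUTION = 5


def execute_remediation(level: MaturityLevel, script: str, human_approved: bool = False) -> bool:
    """Run a remediation script only if the maturity level or a human allows it."""
    if level < MaturityLevel.AUTOMATIC_EXECUTION and not human_approved:
        print(f"{level.name}: waiting for operator approval before running the remediation.")
        return False
    print(f"Executing remediation ({len(script)} characters of script).")
    return True


# Example: a Level 3 recommendation is held until an operator approves it.
execute_remediation(MaturityLevel.COMMAND_RECOMMENDATION, "systemctl restart otel-collector")
```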
- Manage Complexity – Distributed and heterogeneous environments can be complex and hard to manage. Hence we utilize AI to help secure, fault-find and manage those environments.
- Adapt – Due to the vast differences distributed environments have with respect to connectivity, workloads and general resource constraints, AIOps trains and fine-tunes the AI models 'on the job', learning and adapting to changing environments.
- Scale – AIOps enables growing distributed environments to remain manageable and secure at a scale that manual operations cannot match.
Other factors are cost reduction, improved system performance, and decreased incident resolution time.
- Infra & Runtime (Kind | MicroShift | Fedora)
- Setup of secure communication infrastructure
- Deployment, config & localisation
- Inference | Model selection | Model training | Testing | Rollout
- Workflows
- AI Model development | training | testing | validation | comparison
- Data engineering/science | Data pipelines/workflows | Data cleansing | Analytics | Data collection | Storage (Iceberg | MinIO) | Data entity models | Data query
- Platform | Model | MLOps
- Encrypted data transfer | storage
- Workload identity (SPIFFE | SPIRE | CitadelAI)
- Geofencing | Non-repudiation | Verifiability
- Model inference engine
- Authentication | Authorization
- Predictive Models | LLM | RAG (system documentation, data entity information; a minimal retrieval sketch follows after this list)
- Alerting
- Slack | Chatbot | Agents | Automation
- Signal collection | OTEL
- Model performance (accuracy) | Energy consumption | Optimized inference
- Logging | Tracing | Metrics
- Documentation | Website
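As referenced in the RAG item above, the sketch below shows only the retrieval step over a handful of toy system-documentation snippets and assembles a grounded prompt. The `embed` function is a stand-in for whichever embedding model the project eventually selects, and the snippets are made up for the example.

```python
# Minimal RAG retrieval sketch over system-documentation snippets.
# `embed` is a placeholder for a real embedding model; here it is a toy
# bag-of-characters vector purely so the example runs end to end.
import numpy as np

# Toy documentation snippets for illustration only.
DOCS = [
    "Edge nodes export metrics to the OTEL collector on port 4317.",
    "Workload identities are issued via SPIFFE/SPIRE.",
    "Cleansed metric history is kept in Iceberg tables on MinIO.",
]


def embed(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)


def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k snippets most similar to the question (cosine similarity)."""
    q = embed(question)
    scores = [float(np.dot(q, embed(d))) for d in DOCS]
    top = np.argsort(scores)[::-1][:k]
    return [DOCS[i] for i in top]


question = "Where is the metric history stored?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```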
- Andreas Spanner (Red Hat) (Lead)
- Ramki Krishnan (Intel) Component Lead - Security
- Jey Paulraj (Red Hat) Component Lead - Demo
- Clyde Tedrick (Zaga)
- Giri (Zaga)
- Prathiba (Zaga) Component Lead - Data
- Stas Kozlov (Red Hat)
