From 6c9aa474478325ca5a0a0621cb187526a70e260a Mon Sep 17 00:00:00 2001
From: Simon Pichugin
Date: Thu, 15 Feb 2024 17:46:04 -0800
Subject: [PATCH 1/5] Add Replication Monitoring With Ansible Design

---
 .../ansible-replication-monitoring-design.md | 164 ++++++++++++++++++
 docs/389ds/design/design.md                  |   1 +
 2 files changed, 165 insertions(+)
 create mode 100644 docs/389ds/design/ansible-replication-monitoring-design.md

diff --git a/docs/389ds/design/ansible-replication-monitoring-design.md b/docs/389ds/design/ansible-replication-monitoring-design.md
new file mode 100644
index 0000000..badc643
--- /dev/null
+++ b/docs/389ds/design/ansible-replication-monitoring-design.md
@@ -0,0 +1,164 @@
+---
+title: "Replication Monitoring With Ansible"
+---
+
+# Replication Monitoring With Ansible Design
+
+{% include toc.md %}
+
+## Document Version
+
+0.1
+
+## Revision History
+
+| Version | Date | Description of Change |
+|---------|------------|-----------------------|
+| 0.1 | 02-15-2024 | First MVP version |
+
+## Introduction
+
+This document outlines the design and implementation of an Ansible-based solution for monitoring replication lag within a 389 Directory Server (DS) topology. It aims to automate the setup, data collection, analysis, and reporting of replication performance metrics.
+
+## System Overview
+
+The system is designed around Ansible's automation capabilities, utilizing roles, playbooks, and Molecule for testing. It targets environments with multiple 389 DS instances, gathers the CSN, time, and event time from the specified access logs in the specified topology, and generates plot data in CSV and PNG formats, which are then placed on the controller node.
+
+## Design Considerations
+
+### Component Overview
+
+Key components include:
+
+- **Ansible Inventory (`inventory/inventory.yml`):** Specifies hosts and variables for the staging and production environments.
+- **Roles (`Replication-Monitoring`):** Encapsulates tasks for data gathering, analysis, and reporting.
+- **Playbooks (`monitor-replication.yml`, `cleanup-environment.yml`, etc.):** Provide role examples with different combinations of parameters and states.
+
+## System Architecture
+
+### Requirements
+
+- Ansible, Python 3, and the matplotlib library are installed on the control node.
+- The 389 DS instances are accessible over the network, and the Ansible controller can connect to them securely.
+- The specified log directories and their log files are readable.
+- Docker is available for Molecule testing.
+
+### Parameters
+
+Parameter | Choices/Defaults | Comments
+-------- | ----------- | --------
+replication_monitoring_lag_threshold | Default: 10 | Value to determine the threshold for replication monitoring. A line will be drawn in the result plot using this value.
+replication_monitoring_result_dir | Default: "/tmp" | Directory path where the results of replication monitoring will be stored.
+replication_monitoring_log_dir | | Directory path on the host where log files for replication monitoring are stored.
+replication_monitoring_tmp_path | Default: "/tmp" | Directory for temporary files required for the replication monitoring output generation.
+
+### Inventory Example
+
+```yaml
+all:
+  children:
+    production:
+      vars:
+        replication_monitoring_lag_threshold: 20
+        replication_monitoring_result_dir: '/tmp'
+      hosts:
+        ds389_instance_1:
+          ansible_host: 192.168.2.101
+          ansible_user: root
+          replication_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier1'
+        ds389_instance_2:
+          ansible_host: 192.168.2.102
+          ansible_user: root
+          replication_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier2'
+
+    staging:
+      vars:
+        replication_monitoring_lag_threshold: 20
+        replication_monitoring_result_dir: '/tmp'
+      hosts:
+        ds389_instance_1:
+          ansible_host: 192.168.3.101
+          ansible_user: root
+          replication_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier1'
+        ds389_instance_2:
+          ansible_host: 192.168.3.102
+          ansible_user: root
+          replication_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier2'
+```
+
+You need to configure SSH authentication to avoid plaintext usage.
+
+### Playbooks Examples
+
+A simple playbook example that gathers data from 389 DS servers and generates a report:
+
+```yaml
+- name: Create Replication Monitoring CSV and PNG graph
+  hosts: staging
+
+  vars:
+    replication_monitoring_lag_threshold: 20
+    replication_monitoring_result_dir: '/tmp'
+
+  roles:
+    - role: Replication-Monitoring
+      state: present
+```
+
+Example playbook to create a Replication Monitoring report and clean up all temporary data afterwards:
+
+```yaml
+- name: Create Replication Monitoring CSV and PNG graph
+  hosts: staging
+
+  vars:
+    replication_monitoring_lag_threshold: 20
+    replication_monitoring_result_dir: '/tmp'
+    replication_monitoring_cleanup: yes
+
+  roles:
+    - role: Replication-Monitoring
+      state: present
+```
+
+Example playbook to clean up all temporary data:
+
+```yaml
+- name: Create Replication Monitoring CSV and PNG graph
+  hosts: staging
+
+  roles:
+    - role: Replication-Monitoring
+      state: absent
+```
+
+Example playbook to clean up all temporary data and the results from the results_dir:
+
+```yaml
+- name: Create Replication Monitoring CSV and PNG graph
+  hosts: staging
+
+  vars:
+    replication_monitoring_result_dir: '/tmp'
+
+  roles:
+    - role: Replication-Monitoring
+      state: absent
+```
+
+## Molecule Testing
+
+The project is configured with Ansible Molecule for testing using Docker. To run tests:
+
+1. Make sure you have Docker installed and configured.
+2. Navigate to the root of the project.
+3. Execute Molecule tests:
+   ```molecule test```
+
+The tests simulate a multi-instance DS environment, validating the role syntax, execution, and that output is present and not empty.
+
+
+Authors
+=======
+
+Simon Pichugin (@droideck)
\ No newline at end of file
diff --git a/docs/389ds/design/design.md b/docs/389ds/design/design.md
index 48f3e0f..f4687af 100644
--- a/docs/389ds/design/design.md
+++ b/docs/389ds/design/design.md
@@ -36,6 +36,7 @@ If you are adding a new design document, use the [template](design-template.html
 ## Ansible
 
 - [Ansible DS](ansible-ds.html)
+- [Replication Monitoring With Ansible](ansible-replication-monitoring-design.html)
 
 ## 389 Directory Server 3.1
 

From a5f92c6491cf6b839035b5307c1703409035f549 Mon Sep 17 00:00:00 2001
From: Simon Pichugin
Date: Fri, 1 Mar 2024 18:03:25 -0800
Subject: [PATCH 2/5] Apply review comments

---
 .../ansible-replication-monitoring-design.md | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/docs/389ds/design/ansible-replication-monitoring-design.md b/docs/389ds/design/ansible-replication-monitoring-design.md
index badc643..318e354 100644
--- a/docs/389ds/design/ansible-replication-monitoring-design.md
+++ b/docs/389ds/design/ansible-replication-monitoring-design.md
@@ -14,7 +14,7 @@ title: "Replication Monitoring With Ansible"
 
 | Version | Date | Description of Change |
 |---------|------------|-----------------------|
-| 0.1 | 02-15-2024 | First MVP version |
+| 0.1 | 03-01-2024 | First MVP version |
 
 ## Introduction
 
@@ -32,7 +32,7 @@ Key components include:
 
 - **Ansible Inventory (`inventory/inventory.yml`):** Specifies hosts and variables for the staging and production environments.
 - **Roles (`Replication-Monitoring`):** Encapsulates tasks for data gathering, analysis, and reporting.
-- **Playbooks (`monitor-replication.yml`, `cleanup-environment.yml`, etc.):** Provide role examples with different combinations of parameters and states.
+- **Playbooks (`monitor-replication.yml`, `cleanup-environment.yml`, etc.):** Provide role examples with different combinations of parameters.
 
 ## System Architecture
 
@@ -102,7 +102,6 @@ A simple playbook example that gathers data from 389 DS servers and generates a
 
   roles:
     - role: Replication-Monitoring
-      state: present
 ```
 
 Example playbook to create a Replication Monitoring report and clean up all temporary data afterwards:
@@ -118,7 +117,6 @@ Example playbook to create a Replication Monitoring report and clean up all
 
   roles:
     - role: Replication-Monitoring
-      state: present
 ```
 
 Example playbook to clean up all temporary data:
@@ -129,7 +127,6 @@ Example playbook to clean up all temporary data:
 
   roles:
     - role: Replication-Monitoring
-      state: absent
 ```
 
 Example playbook to clean up all temporary data and the results from the results_dir:
@@ -143,7 +140,6 @@ Example playbook to clean up all temporary data and the results from the results
 
   roles:
     - role: Replication-Monitoring
-      state: absent
 ```
 
 ## Molecule Testing
@@ -157,6 +153,15 @@ The project is configured with Ansible Molecule for testing using Docker. To run
 
 The tests simulate a multi-instance DS environment, validating the role syntax, execution, and that output is present and not empty.
 
+## Future Improvements
+
+The final goal for the Ansible Role will be to use the data to provide feedback to tools like Event Driven Ansible (EDA) or Insights Remediations.
+For that, the result should provide facts that can later be used for managing the systems and/or notifying the administrator about the possible performance issue.
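As an illustration of what such facts could look like, a task using `ansible.builtin.set_stats` would make the values visible to EDA rulebooks and follow-up plays. This is only a sketch: the fact names and the `max_lag` variable are hypothetical and not part of the current role.

```yaml
# Illustrative only: publish the computed maximum lag as custom stats so that
# EDA rulebooks or later plays can react to it. "max_lag" and the stat names
# below are hypothetical, not part of the current role.
- name: Publish replication lag facts for downstream automation
  ansible.builtin.set_stats:
    data:
      replication_monitoring_max_lag: "{{ max_lag | default(0) }}"
      replication_monitoring_threshold_exceeded: "{{ (max_lag | default(0) | int) > (replication_monitoring_lag_threshold | int) }}"
```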
+
+Other features for consideration:
+- Modules instead of Python scripts: Move the existing scripts to Ansible Modules for the role to use.
+- Security Enhancements: Implement secure handling of sensitive inventory data using mechanisms like Ansible Vault and provide an option and documentation to configure SSH communication.
+- Version Control: Manage different 389 DS versions to accommodate variations in log formats.
 
 Authors
 =======

From 6ce890de48906e1026d124f56c2900a0b566adac Mon Sep 17 00:00:00 2001
From: Simon Pichugin
Date: Sun, 10 Mar 2024 16:20:46 -0700
Subject: [PATCH 3/5] Refactor the design as more things changed and more needs
 to be clarified

---
 .../ansible-replication-monitoring-design.md | 197 +++++++++---------
 1 file changed, 102 insertions(+), 95 deletions(-)

diff --git a/docs/389ds/design/ansible-replication-monitoring-design.md b/docs/389ds/design/ansible-replication-monitoring-design.md
index 318e354..8f3fbd9 100644
--- a/docs/389ds/design/ansible-replication-monitoring-design.md
+++ b/docs/389ds/design/ansible-replication-monitoring-design.md
@@ -14,154 +14,161 @@ title: "Replication Monitoring With Ansible"
 
 | Version | Date | Description of Change |
 |---------|------------|-----------------------|
-| 0.1 | 03-01-2024 | First MVP version |
+| 0.1 | 03-11-2024 | First MVP version |
 
 ## Introduction
 
-This document outlines the design and implementation of an Ansible-based solution for monitoring replication lag within a 389 Directory Server (DS) topology. It aims to automate the setup, data collection, analysis, and reporting of replication performance metrics.
+The ds389_repl_monitoring role is designed to monitor replication lag in 389 Directory Server instances. It gathers replication data from access log files, analyzes the data to identify replication lags, and generates visual representations of the replication lag over time.
 
-## System Overview
+## Design Considerations
 
-The system is designed around Ansible's automation capabilities, utilizing roles, playbooks, and Molecule for testing. It targets environments with multiple 389 DS instances, gathers the CSN, time, and event time from the specified access logs in the specified topology, and generates plot data in CSV and PNG formats, which are then placed on the controller node.
+- The role should be able to handle multiple 389 Directory Server instances.
+- It should provide flexibility in specifying the log directory and result directory paths.
+- The role should allow filtering the replication data based on various criteria, such as fully replicated changes, not replicated changes, lag time, and execution time.
+- It should generate both CSV and PNG files for easy analysis and visualization of replication lag data.
+- The role should be idempotent and handle cases where the replication lag files already exist.
 
-## Design Considerations
+## System Architecture
 
-### Component Overview
+### Role Walkthrough
 
-Key components include:
+The ds389_repl_monitoring role consists of the following main task files:
 
-- **Ansible Inventory (`inventory/inventory.yml`):** Specifies hosts and variables for the staging and production environments.
-- **Roles (`Replication-Monitoring`):** Encapsulates tasks for data gathering, analysis, and reporting.
-- **Playbooks (`monitor-replication.yml`, `cleanup-environment.yml`, etc.):** Provide role examples with different combinations of parameters.
+1. setup.yml: Performs initial setup tasks such as ensuring connectivity to the hosts, installing necessary packages on the Ansible controller, and creating the log directory.
 
-## System Architecture
+2. gather_data.yml: Finds all access log files in the specified directory on each 389 Directory Server instance, analyzes the logs using the ds389_log_parser module to extract replication data, and merges the data from all instances using the ds389_merge_logs module.
+
+3. log_replication_lag.yml: Generates CSV and PNG files visualizing the replication lag data using the ds389_logs_plot module. The files are saved in a directory named with the current date and hour.
+
+4. cleanup.yml: Removes temporary files created during the monitoring process on both the remote hosts and the Ansible controller.
+
+### Custom Modules
+
+The ds389_repl_monitoring role utilizes three custom Ansible modules:
+
+1. ds389_log_parser: Parses 389 Directory Server access logs and calculates replication lags.
+   - logfiles: List of paths to 389ds access log files (required).
+   - anonymous: Replace log file names with generic identifiers (default: false).
+   - output_file: Path to the output file where the results will be written (required).
 
-### Requirements
+2. ds389_logs_plot: Plots 389 Directory Server log data from a JSON file.
+   - input: Path to the input JSON file containing the log data (required).
+   - csv_output_path: Path where the CSV file should be generated (required).
+   - png_output_path: Path where the plot image should be saved.
+   - only_fully_replicated: Filter to show only changes replicated on all replicas (default: false).
+   - only_not_replicated: Filter to show only changes not replicated on all replicas (default: false).
+   - lag_time_lowest: Filter to show only changes with lag time greater than or equal to the specified value.
+   - etime_lowest: Filter to show only changes with execution time (etime) greater than or equal to the specified value.
+   - utc_offset: UTC offset in seconds for timezone adjustment.
+   - repl_lag_threshold: Replication monitoring threshold value. A horizontal line will be drawn in the plot to represent this threshold.
+
+3. ds389_merge_logs: Merges multiple JSON log files into a single file.
+   - files: A list of paths to the JSON files to be merged (required).
+   - output: The path to the output file where the merged JSON will be saved (required).
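Taken together, the three modules form a parse-merge-plot pipeline. The sketch below shows how the role might chain them; the module names and parameters are the documented ones, while the `access_log_files` and `analysis_files` variables and the delegation details are illustrative assumptions.

```yaml
# Illustrative sketch of the pipeline; the real role splits this across
# setup.yml, gather_data.yml, log_replication_lag.yml, and cleanup.yml.
- name: Parse access logs on each server
  ds389_log_parser:
    logfiles: "{{ access_log_files }}"   # hypothetical list of log paths
    output_file: "/tmp/{{ inventory_hostname }}_analysis_output.json"

- name: Merge per-server results on the controller
  ds389_merge_logs:
    files: "{{ analysis_files }}"        # hypothetical list of per-host files
    output: /tmp/merged_output.json
  delegate_to: localhost
  run_once: true

- name: Plot the merged replication lag data
  ds389_logs_plot:
    input: /tmp/merged_output.json
    csv_output_path: /tmp/repl_lag.csv
    png_output_path: /tmp/repl_lag.png
    repl_lag_threshold: 10
  delegate_to: localhost
  run_once: true
```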
-
-- Ansible, Python 3, and the matplotlib library are installed on the control node.
-- The 389 DS instances are accessible over the network, and the Ansible controller can connect to them securely.
-- The specified log directories and their log files are readable.
-- Docker is available for Molecule testing.
 
 ### Parameters
 
-Parameter | Choices/Defaults | Comments
--------- | ----------- | --------
-replication_monitoring_lag_threshold | Default: 10 | Value to determine the threshold for replication monitoring. A line will be drawn in the result plot using this value.
-replication_monitoring_result_dir | Default: "/tmp" | Directory path where the results of replication monitoring will be stored.
-replication_monitoring_log_dir | | Directory path on the host where log files for replication monitoring are stored.
-replication_monitoring_tmp_path | Default: "/tmp" | Directory for temporary files required for the replication monitoring output generation.
+The role accepts the following parameters:
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| ds389_repl_monitoring_lag_threshold | 10 | Threshold for replication lag monitoring (in seconds). A line will be drawn in the plot to indicate the threshold value. |
+| ds389_repl_monitoring_result_dir | '/tmp' | Directory to store replication monitoring results. The generated CSV and PNG files will be saved in this directory. |
+| ds389_repl_monitoring_only_fully_replicated | false | Filter to show only changes replicated on all replicas. If set to true, only changes that have been replicated to all replicas will be considered. |
+| ds389_repl_monitoring_only_not_replicated | false | Filter to show only changes not replicated on all replicas. If set to true, only changes that have not been replicated to all replicas will be considered. |
+| ds389_repl_monitoring_lag_time_lowest | 0 | Filter to show only changes with lag time greater than or equal to the specified value (in seconds). Changes with a lag time lower than this value will be excluded from the monitoring results. |
+| ds389_repl_monitoring_etime_lowest | 0 | Filter to show only changes with execution time (etime) greater than or equal to the specified value (in seconds). Changes with an execution time lower than this value will be excluded from the monitoring results. |
+| ds389_repl_monitoring_utc_offset | 0 | UTC offset in seconds for timezone adjustment. This value will be used to adjust the log timestamps to the desired timezone. |
+| ds389_repl_monitoring_tmp_path | "/tmp" | Temporary directory path for storing intermediate files. This directory will be used to store temporary files generated during the monitoring process. |
+| ds389_repl_monitoring_tmp_analysis_output_file_path | "{{ ds389_repl_monitoring_tmp_path }}/{{ inventory_hostname }}_analysis_output.json" | Path to the temporary analysis output file for each host. This file will contain the parsed replication data for each individual host. |
+| ds389_repl_monitoring_tmp_merged_output_file_path | "{{ ds389_repl_monitoring_tmp_path }}/merged_output.json" | Path to the temporary merged output file. This file will contain the merged replication data from all hosts. |
-
-### Inventory Example
+
+## Inventory Example
 
 ```yaml
 all:
   children:
     production:
       vars:
-        replication_monitoring_lag_threshold: 20
-        replication_monitoring_result_dir: '/tmp'
+        ds389_repl_monitoring_lag_threshold: 20
+        ds389_repl_monitoring_result_dir: '/var/log/ds389_repl_monitoring'
       hosts:
         ds389_instance_1:
           ansible_host: 192.168.2.101
-          ansible_user: root
-          replication_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier1'
-        ds389_instance_2:
+          ds389_repl_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier1'
+        ds389_instance_2:
           ansible_host: 192.168.2.102
-          ansible_user: root
-          replication_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier2'
-
-    staging:
-      vars:
-        replication_monitoring_lag_threshold: 20
-        replication_monitoring_result_dir: '/tmp'
-      hosts:
-        ds389_instance_1:
-          ansible_host: 192.168.3.101
-          ansible_user: root
-          replication_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier1'
-        ds389_instance_2:
-          ansible_host: 192.168.3.102
-          ansible_user: root
-          replication_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier2'
+          ds389_repl_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier2'
 ```
 
-You need to configure SSH authentication to avoid plaintext usage.
+## Playbook Examples
 
-### Playbooks Examples
+These examples demonstrate how the ds389_repl_monitoring role can be customized using different variable settings to suit specific monitoring requirements. The role can be applied to different host groups, and the variables can be adjusted to filter the monitoring results based on various criteria such as fully replicated changes, minimum lag time, timezone offset, and minimum etime.
-A simple playbook example that gathers data from 389 DS servers and generates a report:
+### Example 1: Monitoring with custom lag threshold and result directory
 
 ```yaml
-- name: Create Replication Monitoring CSV and PNG graph
-  hosts: staging
-
-  vars:
-    replication_monitoring_lag_threshold: 20
-    replication_monitoring_result_dir: '/tmp'
-
+- name: Monitor 389ds Replication with custom settings
+  hosts: ds389_replicas
   roles:
-    - role: Replication-Monitoring
+    - role: ds389_repl_monitoring
+      vars:
+        ds389_repl_monitoring_lag_threshold: 30
+        ds389_repl_monitoring_result_dir: '/var/log/ds389_monitoring'
 ```
 
-Example playbook to create a Replication Monitoring report and clean up all temporary data afterwards:
-
-```yaml
-- name: Create Replication Monitoring CSV and PNG graph
-  hosts: staging
-
-  vars:
-    replication_monitoring_lag_threshold: 20
-    replication_monitoring_result_dir: '/tmp'
-    replication_monitoring_cleanup: yes
-
-  roles:
-    - role: Replication-Monitoring
-```
+In this example, the role is applied to the `ds389_replicas` host group. The `ds389_repl_monitoring_lag_threshold` is set to 30 seconds, meaning that a threshold line at that value will be drawn across the PNG graph. The `ds389_repl_monitoring_result_dir` is set to `/var/log/ds389_monitoring`, specifying the directory where the CSV and PNG files will be stored.
-Example playbook to clean up all temporary data:
+### Example 2: Monitoring with filters for fully replicated and minimum lag time
 
 ```yaml
-- name: Create Replication Monitoring CSV and PNG graph
-  hosts: staging
-
+- name: Monitor 389ds Replication with filters
+  hosts: ds389_servers
   roles:
-    - role: Replication-Monitoring
+    - role: ds389_repl_monitoring
+      vars:
+        ds389_repl_monitoring_only_fully_replicated: true
+        ds389_repl_monitoring_lag_time_lowest: 5
 ```
 
-Example playbook to clean up all temporary data and the results from the results_dir:
-
-```yaml
-- name: Create Replication Monitoring CSV and PNG graph
-  hosts: staging
+This playbook applies the role to the `ds389_servers` host group. The `ds389_repl_monitoring_only_fully_replicated` variable is set to `true`, which means that only changes that have been fully replicated across all replicas will be considered. The `ds389_repl_monitoring_lag_time_lowest` is set to 5 seconds, so only changes with a lag time greater than or equal to 5 seconds will be included in the monitoring results. The results will be put in the `/tmp` directory, which is the default for `ds389_repl_monitoring_result_dir`.
 
+### Example 3: Monitoring with timezone offset and minimum etime
 
-  vars:
-    replication_monitoring_result_dir: '/tmp'
+```yaml
+- name: Monitor 389ds Replication with timezone and etime filters
+  hosts: directory_servers
   roles:
-    - role: Replication-Monitoring
+    - role: ds389_repl_monitoring
+      vars:
+        ds389_repl_monitoring_utc_offset: -21600
+        ds389_repl_monitoring_etime_lowest: 1.5
 ```
 
-## Molecule Testing
+In this example, the role is used to monitor the hosts in the `directory_servers` group. The `ds389_repl_monitoring_utc_offset` is set to -21600 seconds, which adjusts the log timestamps by -6 hours to match the desired timezone. The `ds389_repl_monitoring_etime_lowest` variable is set to 1.5 seconds, meaning that only changes with an etime greater than or equal to 1.5 seconds will be included in the monitoring output. The results will be put in the `/tmp` directory, which is the default for `ds389_repl_monitoring_result_dir`.
 
 ## Molecule Testing
 
-The project is configured with Ansible Molecule for testing using Docker. To run tests:
+The role includes a Molecule configuration for testing with Docker containers simulating 389ds replicas. The test sequence:
 
-1. Make sure you have Docker installed and configured.
-2. Navigate to the root of the project.
-3. Execute Molecule tests:
-   ```molecule test```
+1. Builds multiple containers
+2. Copies mock access log files into each container
+3. Runs the role against the containers
+4. Verifies the role's functionality by:
+   - Checking CSV and PNG files are generated correctly
+   - Validating the content of the generated files
+   - Ensuring proper packages are installed
+   - Checking permissions on key directories
 
-The tests simulate a multi-instance DS environment, validating the role syntax, execution, and that output is present and not empty.
 
 ## Future Improvements
 
-The final goal for the Ansible Role will be to use the data to provide feedback to tools like Event Driven Ansible (EDA) or Insights Remediations.
-For that, the result should provide facts that can later be used for managing the systems and/or notifying the administrator about the possible performance issue.
+- Support for additional log formats and directory server versions.
+- Support for sending metrics to monitoring systems
+- Notifications on critical replication lag events
+- Dashboard visualization of replication status
 
-Other features for consideration:
-- Modules instead of Python scripts: Move the existing scripts to Ansible Modules for the role to use.
-- Security Enhancements: Implement secure handling of sensitive inventory data using mechanisms like Ansible Vault and provide an option and documentation to configure SSH communication.
-- Version Control: Manage different 389 DS versions to accommodate variations in log formats.
 
 Authors
 =======

From 98f4a8a00b565a790765f007d074481ef5a7e741 Mon Sep 17 00:00:00 2001
From: Simon Pichugin
Date: Tue, 15 Oct 2024 19:28:31 -0700
Subject: [PATCH 4/5] Add Replication Lag Report Design

---
 docs/389ds/design/design.md                  |   1 +
 .../design/replication-lag-report-design.md  | 113 ++++++++++++++++++
 2 files changed, 114 insertions(+)
 create mode 100644 docs/389ds/design/replication-lag-report-design.md

diff --git a/docs/389ds/design/design.md b/docs/389ds/design/design.md
index f4687af..bd67ae2 100644
--- a/docs/389ds/design/design.md
+++ b/docs/389ds/design/design.md
@@ -41,6 +41,7 @@ If you are adding a new design document, use the [template](design-template.html
 ## 389 Directory Server 3.1
 
 - [Session Tracking Control client - replication](session-identifier-clients.html)
+- [Replication Lag Report Design](replication-lag-report-design.html)
 
 ## 389 Directory Server 3.0
 
diff --git a/docs/389ds/design/replication-lag-report-design.md b/docs/389ds/design/replication-lag-report-design.md
new file mode 100644
index 0000000..a0c4609
--- /dev/null
+++ b/docs/389ds/design/replication-lag-report-design.md
@@ -0,0 +1,113 @@
+---
+title: "Replication Lag Report Design"
+---
+
+# Replication Lag Report Design
+
+{% include toc.md %}
+
+## Document Version
+
+0.1
+
+## Revision History
+
+| Version | Date | Description of Change |
+|---------|------------|-----------------------|
+| 0.1 | 10-15-2024 | First version |
+
+## Executive Summary
+
+The `ReplicationLagReport` class will consolidate the functionality of the existing `LogParser`, `ReplLag`, and `LagInfo` classes into a single, efficient, and reusable component. This design will allow for easy integration with both Ansible modules and standalone CLI tools, providing a comprehensive solution for analyzing and visualizing replication lag in the 389 Directory Server.
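To make the intended call sequence concrete, here is a minimal runnable sketch of the consolidated class. Only the class and method names follow this design; the internals are illustrative toys (events are injected through the config instead of parsed from real logs, the global lag per CSN is computed as last appearance minus first appearance, and the report method returns the CSV text instead of writing files).

```python
# Minimal sketch of the consolidated class. Only the class and method names
# come from this design; the internals are illustrative, not the real parser.
from typing import Dict, List


class ReplicationLagReport:
    def __init__(self, config: Dict):
        self.config = config
        self.events: List[Dict] = []       # parsed log events
        self.lags: Dict[str, float] = {}   # CSN -> global lag in seconds

    def parse_logs(self) -> None:
        # The real implementation would stream and parse access logs listed in
        # config["input_files"]; here we accept pre-parsed events for brevity.
        self.events = self.config.get("events", [])

    def process_data(self) -> None:
        # Global lag per CSN: last appearance minus first appearance.
        by_csn: Dict[str, List[float]] = {}
        for ev in self.events:
            by_csn.setdefault(ev["csn"], []).append(ev["timestamp"])
        self.lags = {csn: max(ts) - min(ts) for csn, ts in by_csn.items()}

    def generate_report(self, report_type: str) -> str:
        if report_type != "csv":
            raise ValueError("only the csv sketch is implemented here")
        lines = ["csn,global_lag"]
        lines += [f"{csn},{lag:.1f}" for csn, lag in sorted(self.lags.items())]
        return "\n".join(lines)


report = ReplicationLagReport({"events": [
    {"csn": "5f3a...0000", "timestamp": 100.0},  # first appearance (supplier)
    {"csn": "5f3a...0000", "timestamp": 103.5},  # last appearance (consumer)
]})
report.parse_logs()
report.process_data()
print(report.generate_report("csv"))  # prints the two-line CSV report
```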
+
+## Architecture Overview
+
+The `ReplicationLagReport` class will be the central component, handling log parsing, data processing, and report generation. It will encapsulate all necessary functionality without relying on additional helper classes or modules.
+
+## Component Details
+
+### ReplicationLagReport (Main Class)
+
+- **Responsibilities**:
+  - Log file parsing and data extraction
+  - Data processing and analysis
+  - Report generation (CSV, PNG, HTML)
+- **Key Methods**:
+  - `__init__(self, config: Dict)`
+  - `parse_logs(self)`
+  - `process_data(self)`
+  - `generate_report(self, report_type: str)`
+
+## Data Flow
+
+1. `ReplicationLagReport` is initialized with configuration parameters.
+2. `parse_logs()` reads and processes input log files.
+3. `process_data()` analyzes the collected data.
+4. `generate_report()` creates the requested output (CSV, PNG, or HTML).
+
+## API Definitions
+
+### ReplicationLagReport
+
+- `__init__(self, config: Dict)`
+  - **Parameters**:
+    - `input_files`: `List[str]`
+    - `filters`: `Dict`
+    - `timezone`: `str`
+- `parse_logs(self) -> None`
+- `process_data(self) -> None`
+- `generate_report(self, report_type: str) -> None`
+
+## Database Changes
+
+No database changes are required for this implementation.
+
+## Performance Considerations
+
+- Implement lazy loading for log files to reduce memory usage.
+- Use generators for processing large log files.
+
+## Security Measures
+
+- Implement input validation for all user-provided data.
+- Sanitize data before generating reports to prevent XSS attacks in HTML output.
+
+## Challenges and Mitigations
+
+- **Challenge**: Processing large log files
+  **Mitigation**: Implement streaming processing and use generators.
+- **Challenge**: Maintaining compatibility with existing systems
+  **Mitigation**: Design the API to be easily adaptable for Ansible modules and CLI tools.
+- **Challenge**: Ensuring accuracy of time-based calculations across different timezones
+  **Mitigation**: Implement robust timezone handling using the `datetime` library.
+
+## Implementation Roadmap
+
+### Phase 1: Port Existing Code to lib389
+
+1. Port existing Python code to use the `lib389` library.
+2. Implement tests for `lib389` code.
+
+### Phase 2: Develop Command-Line Interface (CLI) in dsconf Tool
+
+3. Design and develop the CLI.
+4. Implement tests for CLI features.
+5. Enhance dsconf CLI to consume logs and generate the report.
+5a. Add support for `.dsrc` files.
+
+### Phase 3: Develop Web User Interface (WebUI) in Replication Monitoring Tab
+
+6. Develop WebUI using CLI code.
+7. Add special reports in WebUI using Cockpit functionality.
+8. Implement tests for WebUI features (?)
+
+### Phase 4: Finalization and Deployment
+
+9. Documentation.
+10. Feedback and iteration.
+
+
+Authors
+=======
+
+Simon Pichugin (@droideck)
\ No newline at end of file

From 166a20aa84dc6d4249fa4f236260c9282549917f Mon Sep 17 00:00:00 2001
From: Simon Pichugin
Date: Wed, 22 Jan 2025 19:46:26 -0800
Subject: [PATCH 5/5] Rework and add WebUI and CLI design

---
 .../design/replication-lag-report-design.md | 316 ++++++++++++++----
 1 file changed, 249 insertions(+), 67 deletions(-)

diff --git a/docs/389ds/design/replication-lag-report-design.md b/docs/389ds/design/replication-lag-report-design.md
index a0c4609..62abc3a 100644
--- a/docs/389ds/design/replication-lag-report-design.md
+++ b/docs/389ds/design/replication-lag-report-design.md
@@ -1,113 +1,295 @@
 ---
-title: "Replication Lag Report Design"
+title: "Replication Log Analyzer Tool"
 ---
 
-# Replication Lag Report Design
-
-{% include toc.md %}
+# Directory Server Replication Lag Analyzer Tool
 
 ## Document Version
 
-0.1
+1.0
 
 ## Revision History
 
 | Version | Date | Description of Change |
 |---------|------------|-----------------------|
-| 0.1 | 10-15-2024 | First version |
+| 1.0 | 2025-10-26 | Initial design document |
## Executive Summary -The `ReplicationLagReport` class will consolidate the functionality of the existing `LogParser`, `ReplLag`, and `LagInfo` classes into a single, efficient, and reusable component. This design will allow for easy integration with both Ansible modules and standalone CLI tools, providing a comprehensive solution for analyzing and visualizing replication lag in the 389 Directory Server. +The Directory Server Replication Lag Analyzer Tool is designed to analyze replication performance in 389 Directory Server deployments. It processes access logs from multiple directory servers, calculates replication lag times, and generates comprehensive reports in various formats (Charts, CSV, and only for Fedora - HTML, PNG). The system is available both as a command-line tool and through an web-based interface in the 389 DS Cockpit WebUI. + +The tool focuses on two key metrics: +1. Global Replication Lag: Time difference between the earliest and latest appearance of a CSN across all servers +2. Hop-by-Hop Replication Lag: Time delays between individual server pairs in the replication topology + ## Architecture Overview -The `ReplicationLagReport` class will be the central component, handling log parsing, data processing, and report generation. It will encapsulate all necessary functionality without relying on additional helper classes or modules. +The system consists of three main components: +1. `DSLogParser`: Parses directory server access logs +2. `ReplicationLogAnalyzer`: Coordinates log analysis and report generation +3. `VisualizationHelper`: Handles data visualization and report formatting + +## Replication Lag Calculation Technical Details + +### Global Replication Lag +- For each CSN (Change Sequence Number): + 1. Track timestamp of first appearance across all servers + 2. Track timestamp of last appearance across all servers + 3. Global lag = latest_timestamp - earliest_timestamp + +### Hop Replication Lag +- For each CSN: + 1. 
Sort server appearances by timestamp
+  2. For consecutive server pairs (supplier → consumer):
+     - Hop lag = consumer_timestamp - supplier_timestamp
+  3. Track individual hop lags to identify bottlenecks
+
+### Input Parameters
+1. Log Directories:
+   List of paths to server log directories. Each directory represents one server in the topology.
+
+2. Filtering Parameters:
+   - `suffixes`: List of DN suffixes to analyze
+   - `time_range`: Optional start/end datetime range
+   - `lag_time_lowest`: Minimum lag threshold
+   - `etime_lowest`: Minimum operation execution time
+   - `repl_lag_threshold`: Alert threshold for lag times
+
+3. Analysis Options:
+   - `anonymous`: Hide server names in reports
+   - `only_fully_replicated`: Show only changes reaching all servers
+   - `only_not_replicated`: Show only incomplete replication
+   - `utc_offset`: UTC offset for timezone handling
+
+### Output Parameters
+1. Reports:
+   - CSV: Detailed event log with global and hop lags
+   - HTML: Interactive visualization with Plotly
+   - PNG: Static visualization with matplotlib
+   - JSON: Summary statistics and analysis
+
+2.
Metrics:
+   - Global lag statistics (min/max/avg)
+   - Hop lag statistics (min/max/avg)
+   - Per-suffix update counts
+   - Total updates processed
+   - Server participation statistics
 
 ## Component Details
 
-### ReplicationLagReport (Main Class)
-
-- **Responsibilities**:
-  - Log file parsing and data extraction
-  - Data processing and analysis
-  - Report generation (CSV, PNG, HTML)
-- **Key Methods**:
-  - `__init__(self, config: Dict)`
-  - `parse_logs(self)`
-  - `process_data(self)`
-  - `generate_report(self, report_type: str)`
+### DSLogParser
+- Purpose: Efficient log file parsing
+- Key Features:
+  - Batch processing for memory efficiency
+  - Timezone-aware timestamp handling
+  - Regular expression-based log parsing
+
+### ReplicationLogAnalyzer
+- Purpose: Analysis coordination and report generation
+- Key Features:
+  - Multi-server log correlation
+  - Flexible filtering options
+  - Multiple report format support
+
+### VisualizationHelper
+- Purpose: Data visualization
+- Key Features:
+  - Interactive Plotly charts
+  - Static matplotlib exports
+  - Consistent color schemes
 
 ## Data Flow
 
-1. `ReplicationLagReport` is initialized with configuration parameters.
-2. `parse_logs()` reads and processes input log files.
-3. `process_data()` analyzes the collected data.
-4. `generate_report()` creates the requested output (CSV, PNG, or HTML).
+1. Log Collection:
+   ```
+   Server Logs → DSLogParser → Parsed Events
+   ```
 
-## API Definitions
+2. Analysis:
+   ```
+   Parsed Events → ReplicationLogAnalyzer → Lag Calculations
+   ```
 
-### ReplicationLagReport
+3. Reporting:
+   ```
+   Lag Calculations → VisualizationHelper → Reports (CSV/HTML/PNG)
+   ```
 
-- `__init__(self, config: Dict)`
-  - **Parameters**:
-    - `input_files`: `List[str]`
-    - `filters`: `Dict`
-    - `timezone`: `str`
-- `parse_logs(self) -> None`
-- `process_data(self) -> None`
-- `generate_report(self, report_type: str) -> None`
+## Challenges and Mitigations
 
-## Database Changes
+1.
Large Log Files:
+   - Challenge: Memory consumption
+   - Mitigation: Batch processing, generators
 
-No database changes are required for this implementation.
+2. Time Zone Handling:
+   - Challenge: Accurate timestamp comparison
+   - Mitigation: Consistent UTC conversion
 
-## Performance Considerations
+3. Visualization Performance:
+   - Challenge: Large datasets
+   - Mitigation: Data sampling, efficient plotting
 
-- Implement lazy loading for log files to reduce memory usage.
-- Use generators for processing large log files.
+## Web User Interface (WebUI)
 
-## Security Measures
+The Replication Log Analyzer is accessible via **Monitor** → **Log Analyser** in the 389 DS Cockpit WebUI. The interface provides a form-based configuration system with real-time validation and integrated file browsing capabilities.
 
-- Implement input validation for all user-provided data.
-- Sanitize data before generating reports to prevent XSS attacks in HTML output.
+### Interface Structure
 
-## Challenges and Mitigations
+The UI is organized into card-based sections with an expandable help section explaining the analysis process. Form validation occurs in real time, with error highlighting and helper text for invalid inputs.
+
+The tool starts with an expandable "About Replication Log Analysis" section that provides a clear overview of the analysis process. More than plain documentation, it is an interactive guide that walks you through the five essential steps: selecting server log directories, specifying suffixes, adjusting filters, choosing report formats, and generating the report.
+
+### Log Directory Selection
+
+**File Browser Integration**: The modal dialog for directory selection opens to `/var/log/dirsrv` by default. It supports navigation via path input or folder browsing, with checkbox-based multi-selection.
+
+**Directory Management**: Selected directories are displayed in a DataList component with folder icons and remove buttons.
The interface validates directory accessibility before allowing selection.
+
+### Suffix Configuration
+
+**Input Field**: Text input with real-time DN validation using the `valid_dn()` function. Invalid DNs trigger immediate error display.
+
+**Chip Display**: Selected suffixes appear as removable PatternFly chips. The interface pre-populates the field with the replicated suffixes found in the server configuration.
+
+### Configuration Options
+
+**Display Options**:
+- Server anonymization toggle (replaces hostnames with generic identifiers)
+- Replication filter: all entries, fully replicated only, or failed replication only
+
+**Time Range Controls**:
+- DatePicker and TimePicker components for start/end times
+- UTC offset field with increment/decrement buttons (30-minute intervals)
+- Linked controls prevent invalid date ranges
+
+**Threshold Configuration**:
+- NumberInput components for lag time, etime, and replication lag thresholds
+- Increment/decrement controls with validation for positive numbers
+
+### Report Format Selection
+
+**Format Options**:
+- JSON: Interactive charts (always available)
+- CSV: Data export (always available)
+- HTML/PNG: Requires the `python3-lib389-repl-reports` package
+
+**Package Detection**: The interface checks for the required package on mount and disables unavailable formats with explanatory tooltips.
+
+### Output Configuration
+
+**Directory Selection**: Defaults to `/tmp`, with optional custom directory selection via the file browser. Individual reports are written to subdirectories of the selected directory.
+
+**Report Naming**: Optional custom report names; defaults to timestamp-based naming.
+
+### Report Generation
+
+**Process Flow**:
+1. Form validation before submission
+2. Background command execution via Cockpit spawn
+3. Loading state with progress indicators
+4.
JSON response parsing for report file locations
+
+**Command Construction**: Builds the `dsconf replication lag-report` command with all configured parameters, including log directories, suffixes, time ranges, and output formats.
+
+### Report Viewing Modal
+
+**Tabbed Interface**: Modal dialog with tabs adapting to the available report formats:
+
+- **Summary Tab**: Statistics display using PatternFly cards and description lists
+- **Charts Tab**: Interactive PatternFly charts for JSON data visualization
+- **PNG Tab**: Static image display (when available)
+- **CSV Tab**: Data preview with download options
+- **Files Tab**: Complete file listing with download links (the standalone HTML report lives here)
+
+### Existing Report Management
+
+**Report Discovery**: The "Choose Existing Report" button opens a modal that scans the configured directory for existing reports. Reports are identified by file contents and naming patterns.
+
+**Report Table**: Displays report metadata with format availability indicators (checkmarks/X marks) and "View Report" actions that open the same viewing modal used for new reports.
+
+## Command Line Interface
+
+The replication lag analyzer is also available as a CLI tool through dsconf:
+
+```
+dsconf INSTANCE replication lag-report [options]
+```
+
+### Required Parameters
+
+**--log-dirs**: List of log directories to analyze. Each directory represents one server in the replication topology.
+```
+--log-dirs /var/log/dirsrv/slapd-supplier1 /var/log/dirsrv/slapd-consumer1
+```
+
+**--suffixes**: List of suffixes (naming contexts) to analyze.
+```
+--suffixes "dc=example,dc=com" "dc=test,dc=com"
+```
+
+**--output-dir**: Directory where analysis reports will be written.
+```
+--output-dir /tmp/repl_analysis
+```
+
+### Output Options
+
+**--output-format**: Specify one or more output formats. Options: html, json, png, csv. Default: html.
+```
+--output-format json csv png
+```
+
+**--json**: Output results as JSON for programmatic use or UI integration.
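The global-lag and hop-lag definitions given earlier in this document reduce to simple timestamp arithmetic over the per-CSN appearance events. A minimal Python sketch of the two calculations (the event tuples and function names here are illustrative, not the actual `lib389` API):

```python
from datetime import datetime, timezone

# Hypothetical parsed events for one CSN: (csn, server, timestamp),
# roughly what a log parser might yield. Data below is illustrative only.
events = [
    ("csn1", "supplier1", datetime(2025, 1, 10, 12, 0, 0, tzinfo=timezone.utc)),
    ("csn1", "consumer1", datetime(2025, 1, 10, 12, 0, 2, tzinfo=timezone.utc)),
    ("csn1", "consumer2", datetime(2025, 1, 10, 12, 0, 5, tzinfo=timezone.utc)),
]

def global_lag(appearances):
    """Global lag: latest minus earliest appearance of a CSN across all servers."""
    times = [t for _, _, t in appearances]
    return (max(times) - min(times)).total_seconds()

def hop_lags(appearances):
    """Hop lags: deltas between consecutive appearances, sorted by timestamp."""
    ordered = sorted(appearances, key=lambda e: e[2])
    return [
        (a[1], b[1], (b[2] - a[2]).total_seconds())
        for a, b in zip(ordered, ordered[1:])
    ]

print(global_lag(events))  # 5.0
print(hop_lags(events))    # [('supplier1', 'consumer1', 2.0), ('consumer1', 'consumer2', 3.0)]
```

Using timezone-aware datetimes (here normalized to UTC) keeps the subtraction valid across servers logging in different timezones, which is the mitigation the Challenges section calls for.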
+
+### Filtering Options
 
-- **Challenge**: Processing large log files
-  **Mitigation**: Implement streaming processing and use generators.
-- **Challenge**: Maintaining compatibility with existing systems
-  **Mitigation**: Design the API to be easily adaptable for Ansible modules and CLI tools.
-- **Challenge**: Ensuring accuracy of time-based calculations across different timezones
-  **Mitigation**: Implement robust timezone handling using the `datetime` library.
+**Replication Status Filters** (mutually exclusive):
+- **--only-fully-replicated**: Show only entries that successfully replicated to all servers
+- **--only-not-replicated**: Show only entries that failed to replicate to all servers
 
-## Implementation Roadmap
+**Threshold Filters**:
+- **--lag-time-lowest SECONDS**: Filter entries with lag time above this threshold
+- **--etime-lowest SECONDS**: Filter entries with execution time above this threshold
+- **--repl-lag-threshold SECONDS**: Lag threshold for highlighting in reports
 
-### Phase 1: Port Existing Code to lib389
+### Time Range Options
 
-1. Port existing Python code to use the `lib389` library.
-2. Implement tests for `lib389` code.
+**--start-time**: Start time for analysis in YYYY-MM-DD HH:MM:SS format. Default: 1970-01-01 00:00:00
 
-### Phase 2: Develop Command-Line Interface (CLI) in dsconf Tool
+**--end-time**: End time for analysis in YYYY-MM-DD HH:MM:SS format. Default: 9999-12-31 23:59:59
 
-3. Design and develop the CLI.
-4. Implement tests for CLI features.
-5. Enhance dsconf CLI to consume logs and generate the report.
-5a. Add support for `.dsrc` files.
+### Additional Options
 
-### Phase 3: Develop Web User Interface (WebUI) in Replication Monitoring Tab
+**--utc-offset**: UTC offset in ±HHMM format for timezone handling (e.g., -0400, +0530)
 
-6. Develop WebUI using CLI code.
-7. Add special reports in WebUI using Cockpit functionality.
-8. Implement tests for WebUI features (?)
+**--anonymous**: Anonymize server names in reports (replaces them with generic identifiers)
 
-### Phase 4: Finalization and Deployment
+### Usage Examples
 
-9. Documentation.
-10. Feedback and iteration.
+Basic analysis:
+```
+dsconf supplier1 replication lag-report \
+    --log-dirs /var/log/dirsrv/slapd-supplier1 /var/log/dirsrv/slapd-consumer1 \
+    --suffixes "dc=example,dc=com" \
+    --output-dir /tmp/repl_report
+```
+Advanced analysis with filtering:
+```
+dsconf supplier1 replication lag-report \
+    --log-dirs /var/log/dirsrv/slapd-supplier1 /var/log/dirsrv/slapd-consumer1 \
+    --suffixes "dc=example,dc=com" \
+    --output-dir /tmp/repl_report \
+    --output-format json csv png \
+    --lag-time-lowest 1.0 \
+    --repl-lag-threshold 5.0 \
+    --only-fully-replicated \
+    --start-time "2025-01-01 00:00:00" \
+    --end-time "2025-01-31 23:59:59" \
+    --utc-offset "-0500"
+```
 
-Authors
-=======
+## Authors
 
-Simon Pichugin (@droideck)
\ No newline at end of file
+Simon Pichugin (@droideck)