feat(api): add ASN-aggregated IOC statistics . CLOSES #458 #718

drona-gyawali · 2026-01-17T16:37:55Z

Description

This change introduces a new authenticated API endpoint that aggregates IOC data by ASN. The endpoint groups all matching IOCs under their respective ASNs and computes summary statistics, including IOC count, total attack count, total interaction count, total login attempts, expected IOC count (derived from recurrence probability), and expected interactions. It also returns the set of unique honeypots associated with each ASN. The implementation reuses the same filtering and authentication logic as the Advanced Feeds API to avoid code duplication, while intentionally returning a JSON-only response tailored for aggregated data use cases.

Reuses existing Advanced Feeds query building logic to avoid duplication
Aggregation logic is isolated in a utility function
Floating point values are rounded to 4 decimals for stable output
Includes test coverage following existing feeds test patterns
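
For readers who want a concrete picture of the Python-level aggregation described above, here is a minimal sketch. It is not the PR's code: the function name `aggregate_by_asn`, the plain-dict row shape, and the sample values are assumptions made for illustration, whereas the real implementation works on IOC model instances.

```python
from collections import defaultdict


def aggregate_by_asn(iocs):
    """Group IOC-like dicts by ASN and sum their counters (illustrative only)."""
    buckets = defaultdict(
        lambda: {
            "ioc_count": 0,
            "total_attack_count": 0,
            "total_interaction_count": 0,
            "total_login_attempts": 0,
            "expected_ioc_count": 0.0,
            "expected_interactions": 0.0,
            "honeypots": set(),
        }
    )
    for ioc in iocs:
        if ioc["asn"] is None:
            continue  # an IOC without an ASN cannot be assigned to a group
        bucket = buckets[ioc["asn"]]
        bucket["ioc_count"] += 1
        bucket["total_attack_count"] += ioc["attack_count"]
        bucket["total_interaction_count"] += ioc["interaction_count"]
        bucket["total_login_attempts"] += ioc["login_attempts"]
        # Summing per-IOC recurrence probabilities gives the expected IOC count.
        bucket["expected_ioc_count"] += ioc["recurrence_probability"]
        bucket["expected_interactions"] += ioc["expected_interactions"]
        bucket["honeypots"].update(ioc["honeypots"])

    rows = []
    for asn, bucket in buckets.items():
        row = {"asn": asn, **bucket}
        # Round floats to 4 decimals and sort honeypots so the output is stable.
        row["expected_ioc_count"] = round(row["expected_ioc_count"], 4)
        row["expected_interactions"] = round(row["expected_interactions"], 4)
        row["honeypots"] = sorted(row["honeypots"])
        rows.append(row)
    return rows


# Made-up sample data: two IOCs in one ASN, one IOC without an ASN.
sample = [
    {"asn": 13335, "attack_count": 3, "interaction_count": 10, "login_attempts": 2,
     "recurrence_probability": 0.25, "expected_interactions": 1.5,
     "honeypots": ["Cowrie"]},
    {"asn": 13335, "attack_count": 1, "interaction_count": 4, "login_attempts": 0,
     "recurrence_probability": 0.5, "expected_interactions": 0.75,
     "honeypots": ["Cowrie", "Log4pot"]},
    {"asn": None, "attack_count": 9, "interaction_count": 9, "login_attempts": 9,
     "recurrence_probability": 0.5, "expected_interactions": 1.0,
     "honeypots": ["Cowrie"]},
]
stats = aggregate_by_asn(sample)
```

The review below discusses why this kind of in-memory loop may not scale to instances with millions of IOCs.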

Related issues

closes #458

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue).
New feature (non-breaking change which adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).

Checklist

I have read and understood the rules about how to Contribute to this project.
The pull request is for the branch develop.
I have added documentation of the new features.
Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
If the GUI has been modified:
- I have a provided a screenshot of the result in the PR.
- I have created new frontend tests for the new component or updated existing ones.

Important Rules

If you miss to compile the Checklist properly, your PR won't be reviewed by the maintainers.
If your changes decrease the overall tests coverage (you will know after the Codecov CI job is done), you should add the required tests to fix the problem
Everytime you make changes to the PR and you think the work is done, you should explicitly ask for a review. After being reviewed and received a "change request", you should explicitly ask for a review again once you have made the requested changes.

regulartim

Nice, thank you!

A few general things:

1. In my mind the aggregation would be done on a database level. Your approach also works, of course, but I am a little bit concerned regarding performance on instances with millions of IoCs.
2. Ordering is kind of tricky here. I tested your code and I was not able to order by `ioc_count`. But intuitively I would expect that to work. I think we should support ordering by any field that is present in the output, if sensible (ordering by `honeypots` is not).
3. The IOC model has more fields that have not been included here yet. Do you think you could also include `first_seen` and `last_seen`? Or do you think that this is not useful?

api/views/utils.py

api/views/feeds.py

drona-gyawali · 2026-01-18T13:50:00Z

Hi @regulartim, thanks a lot for the review and the feedback!

I wanted to explain a bit why I initially went with Python-level aggregation and where I’m currently unsure, so I’d really appreciate your guidance.

My first thought was also to aggregate at the database level using values("asn") + annotate(...). However, I ran into a couple of structural issues with the current setup of get_queryset, which is shared by other APIs:

get_queryset applies slicing ([:feed_size]) after ordering. Once a queryset is sliced, Django does not allow further aggregation, which makes DB-level grouping impossible unless we disable slicing using a flag (doSlice=True). However, I was hesitant to change a function that is used by other APIs without your permission.

Error I received using this method:

If we instead slice before aggregating, the aggregation only sees the first feed_size raw IOCs. This produces incorrect totals (ioc_count, attack_count, etc.) because IOCs outside the slice are ignored. For example, if ASN 13335 has 10,000 IOCs, slicing at 5000 will undercount.
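
The undercounting can be reproduced without Django. The sketch below uses a plain Python list as a stand-in for the queryset. The data, the `feed_size` value, and the helper `count_per_asn` are made up for illustration.

```python
from collections import Counter

# Hypothetical data: 10 IOCs in ASN 13335 and 2 IOCs in ASN 64500.
iocs = [{"asn": 13335}] * 10 + [{"asn": 64500}] * 2
feed_size = 5


def count_per_asn(rows):
    """Count IOCs per ASN, the in-memory equivalent of a GROUP BY with COUNT."""
    return Counter(row["asn"] for row in rows)


sliced_first = count_per_asn(iocs[:feed_size])  # aggregate after slicing
full = count_per_asn(iocs)  # aggregate over every row
```

With the slice applied first, ASN 13335 is reported with 5 IOCs instead of 10, and ASN 64500 disappears from the result entirely.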

When aggregating by ASN and joining against general_honeypot, each IOC can appear multiple times at the SQL level (one row per honeypot). This leads to inflated sums (e.g., attack_count, interaction_count).

For example, a single IOC linked to two honeypots is counted twice in SUM(attack_count). I guess this is the usual duplicated-rows issue with many-to-many joins.
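
The row duplication is easy to see in raw SQL. The following sketch uses Python's built-in `sqlite3` module with a simplified, hypothetical schema (the table and column names are not the project's real Django-generated tables) to show how a many-to-many join inflates `COUNT` and `SUM`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ioc (id INTEGER PRIMARY KEY, asn INTEGER, attack_count INTEGER);
    CREATE TABLE honeypot (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ioc_honeypot (ioc_id INTEGER, honeypot_id INTEGER);
    -- One IOC with attack_count = 7, linked to two honeypots.
    INSERT INTO ioc VALUES (1, 13335, 7);
    INSERT INTO honeypot VALUES (1, 'Cowrie'), (2, 'Log4pot');
    INSERT INTO ioc_honeypot VALUES (1, 1), (1, 2);
    """
)

# Aggregating across the join: the IOC row appears once per linked honeypot.
joined = conn.execute(
    """
    SELECT i.asn, COUNT(i.id), SUM(i.attack_count)
    FROM ioc i
    JOIN ioc_honeypot ih ON ih.ioc_id = i.id
    JOIN honeypot h ON h.id = ih.honeypot_id
    GROUP BY i.asn
    """
).fetchone()

# Aggregating the IOC table alone gives the true totals.
plain = conn.execute(
    "SELECT asn, COUNT(id), SUM(attack_count) FROM ioc GROUP BY asn"
).fetchone()
```

Here `joined` reports two IOCs and an attack count of 14, while `plain` reports one IOC and an attack count of 7.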

I think that get_queryset needs a big refactor here.

I’d love to hear what you think is the best tradeoff here, especially given the shared nature of get_queryset. I’m happy to adapt the implementation based on your recommendation.

regulartim · 2026-01-19T06:27:23Z

Hey @drona-gyawali ! Thanks for your detailed explanation. I think the best approach would be to aggregate on the DB level. If this makes it necessary to refactor or even split up get_queryset, that's fine. I recognize that it is too inflexible with the slicing and the way it queries the DB. If that's not possible, it is also possible to write a separate function, exclusively for this API.

> When aggregating by ASN and joining against general_honeypot, each IOC can appear multiple times at the sql level (one row per honeypot). This leads to inflated sums (e.g., attack_count, interaction_count).

This is kind of strange. If an aggregation function is used on the general_honeypot field, this should not happen. Did you use ArrayAgg?

drona-gyawali · 2026-01-20T17:05:50Z

Extremely sorry for the late reply! I used ArrayAgg(distinct=True) for honeypots and distinct=True on all other fields to avoid inflated sums, since things were failing without it. The iocs_qs comes from get_queryset. Raw code for reference:

from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Count, Sum


def asn_aggregated_queryset(iocs_qs):
    return (
        iocs_qs
        .exclude(asn__isnull=True)
        .values("asn")
        .annotate(
            ioc_count=Count("id", distinct=True),
            total_attack_count=Sum("attack_count", distinct=True),
            total_interaction_count=Sum("interaction_count", distinct=True),
            total_login_attempts=Sum("login_attempts", distinct=True),
            expected_ioc_count=Sum("recurrence_probability", distinct=True),
            expected_interactions=Sum("expected_interactions", distinct=True),
            honeypots=ArrayAgg("general_honeypot__name", distinct=True),
        )
    )

I’ll push the new version soon.

regulartim · 2026-01-21T07:56:26Z

> Extremely sorry for the late reply!

Don't worry, we all have other stuff to do! :)

> I’ll push the new version soon.

Cool, looking forward to that!

drona-gyawali · 2026-01-22T17:02:01Z

Hi @regulartim, in this new version I implemented DB-level aggregation for the ASN feed. While building this, I had to introduce a few things:

New serializer (ASNFeedsOrderingSerializer) – I inherited from FeedsRequestSerializer because the base serializer already provides default handling for parameters like max_age, feed_type, and attack_type. The main reason for creating the new serializer was that the base serializer’s ordering validation is strict and only allows model fields. Since aggregation introduces annotated/non-model fields (like ioc_count, total_attack_count), I needed to add custom validation here.

resolve_aggregation_ordering utility – This ensures that our aggregation endpoint defaults to ordering by -ioc_count instead of -last_seen, bypassing the default injection from feed_params. I tried to make this dynamic so that any future aggregation API can leverage the same pattern without reinventing the wheel.

The overall goal was to make the developer experience better and avoid complexity when building future aggregation endpoints. Anyone adding a new aggregation API only needs to inherit and customize the ordering/validation, without rewriting everything.

I hope this aligns with your expectations. If you feel any part of this design needs changes, I’m always open to feedback and happy to adjust.

regulartim

Good progress! 👍

regulartim · 2026-01-23T10:00:28Z

api/views/utils.py

+    # aggregated endpoints should operate on the full queryset
+    # to compute sums, counts, and other metrics correctly.
+    if not is_aggregated:
+        iocs = iocs.order_by(feed_params.ordering)
+        iocs = iocs[: int(feed_params.feed_size)]
+


regulartim · 2026-01-23T10:13:19Z

api/views/utils.py

+            ioc_count=Count("id", distinct=True),
+            total_attack_count=Sum("attack_count", distinct=True),
+            total_interaction_count=Sum("interaction_count", distinct=True),
+            total_login_attempts=Sum("login_attempts", distinct=True),
+            expected_ioc_count=Sum("recurrence_probability", distinct=True),
+            expected_interactions=Sum("expected_interactions", distinct=True),


Can you please explain why you are using distinct=True here? For the Count it does not really do anything, because id is unique and for Sum I can't find any documentation of what the distinct argument actually does.

drona-gyawali · 2026-01-24

Thank you for pointing that out. To be frank, I had used distinct as a quick workaround, but the root cause of the error was row duplication in the SQL query. This happened because of the many-to-many join from .filter(general_honeypot__active=True) combined with ArrayAgg on the honeypot field in get_queryset.

This duplication caused Count("id") and other aggregated fields to be inflated (e.g., 4 instead of 2), which led to the test failures. At first, I thought it was related to ordering or feed_size in get_queryset, and even considered a possible test case issue, but the real problem was the join itself.

Using distinct=True on Count was only a workaround: it masks the issue but doesn’t fix it properly, and it can break Sum when values repeat. Sorry, I only realized this after you pointed it out.
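
To make the "break Sum when values repeat" point concrete: in SQL, `SUM(DISTINCT x)` adds each distinct value only once. The sketch below shows this with Python's built-in `sqlite3` module. It assumes that Django's `Sum(..., distinct=True)` renders to `SUM(DISTINCT ...)`, and the table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ioc (id INTEGER PRIMARY KEY, asn INTEGER, attack_count INTEGER)"
)
# Three different IOCs in the same ASN; two of them share attack_count = 5.
conn.executemany(
    "INSERT INTO ioc VALUES (?, ?, ?)",
    [(1, 13335, 5), (2, 13335, 5), (3, 13335, 10)],
)

plain_sum, distinct_sum = conn.execute(
    "SELECT SUM(attack_count), SUM(DISTINCT attack_count) FROM ioc WHERE asn = 13335"
).fetchone()
```

The true total is 20, but the distinct variant yields 15 because the second `5` is dropped, even though it belongs to a different IOC.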

I think the clean solution is to separate the numeric aggregation from the M2M honeypot aggregation in asn_aggregated_queryset, plus a small refactor in get_queryset. That way, counts and sums stay accurate without duplication tricks.

The changes in get_queryset would roughly reflect this approach.

iocs = (
    IOC.objects.filter(**query_dict)
    .exclude(ip_reputation__in=feed_params.exclude_reputation)
    .annotate(value=F("name"))
    .distinct()
)

# aggregated endpoints should operate on the full queryset
# to compute sums, counts, and other metrics correctly.
if not is_aggregated:
    iocs = iocs.filter(general_honeypot__active=True)
    iocs = iocs.annotate(honeypots=ArrayAgg("general_honeypot__name"))
    iocs = iocs.order_by(feed_params.ordering)
    iocs = iocs[: int(feed_params.feed_size)]

What do you think?

regulartim · 2026-01-23T10:17:16Z

api/serializers.py

+
+        if field_name not in self.ALLOWED_ORDERING_FIELDS:
+            raise serializers.ValidationError(
+                {f"Invalid ordering field for ASN aggregated feed: '{field_name}'. Allowed fields: {', '.join(sorted(self.ALLOWED_ORDERING_FIELDS))}"}


This is a set literal but should be a string, right?
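
A minimal illustration of the reviewer's point, independent of the serializer: wrapping an f-string in braces builds a one-element `set`, not a `str`. The message text and variable names here are shortened stand-ins, not the PR's code.

```python
field_name = "bogus"

# Braces around a single expression create a set containing that expression.
as_set = {f"Invalid ordering field: '{field_name}'"}

# Without the braces the value is the plain string that was probably intended.
as_str = f"Invalid ordering field: '{field_name}'"
```

Dropping the outer braces in the diff above would therefore pass a string to `serializers.ValidationError`.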

regulartim · 2026-01-23T10:23:29Z

api/views/utils.py

+def resolve_aggregation_ordering(ordering, *, default, fallback_fields=None):
+    """
+    Resolve effective ordering for aggregated endpoints.
+
+    Args
+        ordering (str or None): The user-provided ordering string from query params.
+        default (str): The default ordering to use if `ordering` is None or in fallback_fields.
+        fallback_fields (set[str], optional): A set of orderings that are allowed in other
+            contexts but should be overridden here. Defaults to None.
+
+    Returns
+        str: A safe ordering string to use directly in the aggregation query.
+    """
+    fallback_fields = fallback_fields or set()
+
+    if not ordering or ordering in fallback_fields:
+        return default
+
+    return ordering


This is a little overengineered in my opinion. I think instead of this, we can just return -ioc_count in the ASNFeedsOrderingSerializer if the validation fails. Then, if a user requests a supported ordering, it just works and if not, the results are ordered by -ioc_count. Or am I missing something?

drona-gyawali · 2026-01-24

The default ordering (-ioc_count) cannot be enforced in the serializer because by the time the aggregator runs, feed_params.ordering is already populated (default -last_seen) from FeedRequestParams. So the serializer never sees it as missing.

I acknowledge that using resolve_aggregation_ordering was over-engineered, and I agree that we can simplify it by adding a small override directly in asn_aggregated_queryset, like this:

if not ordering or ordering.strip() in {"", "-last_seen"}:
    ordering = "-ioc_count"

This keeps things simple while ensuring the default ordering works correctly. If you’re okay with it, I can push these changes but if you have a better approach, I’d love to apply that instead.
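
As a standalone sketch of the proposed override (the function name `effective_ordering` and the constant are hypothetical, introduced only so the behavior can be exercised outside of `asn_aggregated_queryset`):

```python
DEFAULT_ORDERING = "-ioc_count"


def effective_ordering(ordering):
    """Replace a missing, blank, or injected-default ordering with -ioc_count."""
    if not ordering or ordering.strip() in {"", "-last_seen"}:
        return DEFAULT_ORDERING
    return ordering.strip()


missing = effective_ordering(None)
injected_default = effective_ordering("-last_seen")
explicit = effective_ordering("total_attack_count")
```

One consequence worth noting: an explicit `-last_seen` request is indistinguishable from the injected default, so it is replaced as well. That would matter if `last_seen` is later added to the aggregated output, as suggested earlier in the review.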

regulartim · 2026-01-23T10:26:45Z

api/views/utils.py

+    resolved_ordering = resolve_aggregation_ordering(
+        ordering=feed_params.ordering,
+        default="-ioc_count",
+        fallback_fields={"-last_seen"},
+    )
+
+    direction = "-" if resolved_ordering.startswith("-") else ""
+    field = resolved_ordering.lstrip("-").strip()
+
+    aggregated = aggregated.order_by(f"{direction}{field}")


Also very complicated. Can we also rely on the Serializer here?

feat(api): add ASN-aggregated IOC statistics

1c94e8a

regulartim requested changes Jan 18, 2026

drona-gyawali added 2 commits January 22, 2026 22:07

refactor: db level aggregation

1445edf

refactor: missing args

9ff1268

drona-gyawali marked this pull request as draft January 22, 2026 16:37

drona-gyawali added 2 commits January 22, 2026 22:33

Merge branch 'develop' into feat/asn_api

04be835

resolve linter issue

c65af2d

drona-gyawali marked this pull request as ready for review January 22, 2026 17:02

drona-gyawali requested a review from regulartim January 22, 2026 17:03

regulartim requested changes Jan 23, 2026
