-
-
Notifications
You must be signed in to change notification settings - Fork 62
feat(api): add ASN-aggregated IOC statistics . CLOSES #458 #718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from all commits
1c94e8a
1445edf
9ff1268
04be835
c65af2d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,7 +8,7 @@ | |
|
|
||
| from django.conf import settings | ||
| from django.contrib.postgres.aggregates import ArrayAgg | ||
| from django.db.models import F | ||
| from django.db.models import Count, F, Max, Min, Q, Sum | ||
| from django.http import HttpResponse, HttpResponseBadRequest, StreamingHttpResponse | ||
| from rest_framework import status | ||
| from rest_framework.response import Response | ||
|
|
@@ -121,14 +121,23 @@ def get_valid_feed_types() -> frozenset[str]: | |
| return frozenset(feed_types) | ||
|
|
||
|
|
||
| def get_queryset(request, feed_params, valid_feed_types): | ||
| def get_queryset(request, feed_params, valid_feed_types, is_aggregated=False, serializer_class=FeedsRequestSerializer): | ||
| """ | ||
| Build a queryset to filter IOC data based on the request parameters. | ||
|
|
||
| Args: | ||
| request: The incoming request object. | ||
| feed_params: A FeedRequestParams instance. | ||
| valid_feed_types (frozenset): The set of all valid feed types. | ||
| is_aggregated (bool, optional): | ||
| - If True, disables slicing (`feed_size`) and model-level ordering. | ||
| - Ensures full dataset is available for aggregation or specialized computation. | ||
| - Default: False. | ||
| serializer_class (class, optional): | ||
| - Serializer class used to validate request parameters. | ||
| - Allows injecting a custom serializer to enforce rules for specific feed types | ||
| (e.g., to restrict ordering fields or validation for specialized feeds). | ||
| - Default: `FeedsRequestSerializer`. | ||
|
|
||
| Returns: | ||
| QuerySet: The filtered queryset of IOC data. | ||
|
|
@@ -139,7 +148,7 @@ def get_queryset(request, feed_params, valid_feed_types): | |
| f"Age: {feed_params.max_age}, format: {feed_params.format}" | ||
| ) | ||
|
|
||
| serializer = FeedsRequestSerializer( | ||
| serializer = serializer_class( | ||
| data=vars(feed_params), | ||
| context={"valid_feed_types": valid_feed_types}, | ||
| ) | ||
|
|
@@ -168,9 +177,14 @@ def get_queryset(request, feed_params, valid_feed_types): | |
| .annotate(value=F("name")) | ||
| .annotate(honeypots=ArrayAgg("general_honeypot__name")) | ||
| .distinct() | ||
| .order_by(feed_params.ordering)[: int(feed_params.feed_size)] | ||
| ) | ||
|
|
||
| # aggregated endpoints should operate on the full queryset | ||
| # to compute sums, counts, and other metrics correctly. | ||
| if not is_aggregated: | ||
| iocs = iocs.order_by(feed_params.ordering) | ||
| iocs = iocs[: int(feed_params.feed_size)] | ||
|
|
||
|
Comment on lines
+182
to
+187
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice! |
||
| # save request source for statistics | ||
| source_ip = str(request.META["REMOTE_ADDR"]) | ||
| request_source = Statistics(source=source_ip) | ||
|
|
@@ -317,3 +331,74 @@ def is_sha256hash(string: str) -> bool: | |
| bool: True if the string is a valid SHA-256 hash, False otherwise | ||
| """ | ||
| return bool(re.fullmatch(r"^[A-Fa-f0-9]{64}$", string)) | ||
|
|
||
|
|
||
| def resolve_aggregation_ordering(ordering, *, default, fallback_fields=None): | ||
| """ | ||
| Resolve effective ordering for aggregated endpoints. | ||
|
|
||
| Args | ||
| ordering (str or None): The user-provided ordering string from query params. | ||
| default (str): The default ordering to use if `ordering` is None or in fallback_fields. | ||
| fallback_fields (set[str], optional): A set of orderings that are allowed in other | ||
| contexts but should be overridden here. Defaults to None. | ||
|
|
||
| Returns | ||
| str: A safe ordering string to use directly in the aggregation query. | ||
| """ | ||
| fallback_fields = fallback_fields or set() | ||
|
|
||
| if not ordering or ordering in fallback_fields: | ||
| return default | ||
|
|
||
| return ordering | ||
|
Comment on lines
+336
to
+354
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a little overengineered in my opinion. I think instead of this, we can just return
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The default ordering (-ioc_count) cannot be enforced in the serializer because by the time the aggregator runs, feed_params.ordering is already populated (default -last_seen) from FeedRequestParams. So the serializer never sees it as missing. I acknowledge that using resolve_aggregation_ordering was over-engineered, and I agree that we can simplify it by adding a small override directly in if not ordering or ordering.strip() in {"", "-last_seen"}:
ordering = "-ioc_count"This keeps things simple while ensuring the default ordering works correctly. If you’re okay with it, I can push these changes but if you have a better approach, I’d love to apply that instead. |
||
|
|
||
|
|
||
| def asn_aggregated_queryset(iocs_qs, request, feed_params): | ||
| """ | ||
| Perform DB-level aggregation grouped by ASN. | ||
|
|
||
| Args | ||
| iocs_qs (QuerySet): Filtered IOC queryset from get_queryset; | ||
| request (Request): The API request object; | ||
| feed_params (FeedRequestParams): Validated parameter object | ||
|
|
||
| Returns: A values-grouped queryset with annotated metrics and honeypot arrays. | ||
| """ | ||
| # optional asn params for single asn filter | ||
| asn_filter = request.query_params.get("asn") | ||
| if asn_filter: | ||
| iocs_qs = iocs_qs.filter(asn=asn_filter) | ||
|
|
||
| aggregated = ( | ||
| iocs_qs.exclude(asn__isnull=True) | ||
| .values("asn") | ||
| .annotate( | ||
| ioc_count=Count("id", distinct=True), | ||
| total_attack_count=Sum("attack_count", distinct=True), | ||
| total_interaction_count=Sum("interaction_count", distinct=True), | ||
| total_login_attempts=Sum("login_attempts", distinct=True), | ||
| expected_ioc_count=Sum("recurrence_probability", distinct=True), | ||
| expected_interactions=Sum("expected_interactions", distinct=True), | ||
|
Comment on lines
+377
to
+382
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you please explain why you are using
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for pointing that out. To be frank, I had used distinct as a quick workaround, but the root cause of the error was row duplication in the SQL query. This happened because of the Many-to-Many join from .filter(general_honeypot__active=True) combined with ArrayAgg on the honeypot field in get_queryset. This duplication caused Count("id") and other aggregated fields to be inflated (e.g., 4 instead of 2), which led to the test failures. At first, I thought it was related to ordering or feed_size in get_queryset, and even considered a possible test case issue, but the real problem was the join itself.
Using distinct=True on Count was only a workaround it masks the issue but doesn’t fix it properly and can break Sum when values repeat and sorry , I realized this issue later after you pointed out. The clean solution i think to separate numeric aggregation from the M2M honeypot aggregation in asn_aggreated_queryset and little refactor in get_queryset. That way, counts and sums stay accurate without duplication tricks. The changes in get_queryset would roughly reflect this approach. what do you think? |
||
| honeypots=ArrayAgg( | ||
| "general_honeypot__name", | ||
| filter=Q(general_honeypot__name__isnull=False), | ||
| distinct=True, | ||
| ), | ||
| first_seen=Min("first_seen"), | ||
| last_seen=Max("last_seen"), | ||
| ) | ||
| ) | ||
|
|
||
| resolved_ordering = resolve_aggregation_ordering( | ||
| ordering=feed_params.ordering, | ||
| default="-ioc_count", | ||
| fallback_fields={"-last_seen"}, | ||
| ) | ||
|
|
||
| direction = "-" if resolved_ordering.startswith("-") else "" | ||
| field = resolved_ordering.lstrip("-").strip() | ||
|
|
||
| aggregated = aggregated.order_by(f"{direction}{field}") | ||
|
Comment on lines
+393
to
+402
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also very complicated. Can we also rely on the Serializer here? |
||
|
|
||
| return aggregated | ||

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a set literal but should be a string, right?