@alvlkov
Contributor

@alvlkov alvlkov commented Dec 13, 2024

What type of PR is this?

This adds a new managed script to delete a pod from an OpenShift reserved namespace.

What this PR does / Why we need it?

This will help fix errors related to OpenShift reserved namespaces, particularly when a pod restart is required.

Which Jira/Github issue(s) does this PR fix?

OSD_20528

Special notes for your reviewer

Pre-checks (if applicable)

  • Validated the changes in a ROSA stage cluster
  • Included documentation changes with PR

@openshift-ci
Contributor

openshift-ci bot commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alvlkov
Once this PR has been reviewed and has the lgtm label, please assign wanghaoran1988 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Mar 14, 2025
@alvlkov
Contributor Author

alvlkov commented Mar 18, 2025

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Mar 18, 2025
Contributor

@iamkirkbater iamkirkbater left a comment

Requested change relates to the name of the script. The additional check for a replicaset would be more of a nice-to-have, but we can also add that after this is merged so that we can start using this sooner rather than later.

@@ -0,0 +1,21 @@
# Delete Openshift Pod Script
Contributor

Can we simplify this to just delete-pod instead of delete-os-pod? From a UX perspective, the closer the script name is to the actual oc command, the easier it will be to remember.



main(){
delete_pod
Contributor

Would it be a huge lift to validate that a pod is owned by a replicaset before proceeding? We might also need a "force" flag/parameter to bypass the check. It would be a nice protection for the rare case where a pod in an openshift namespace isn't managed: by default we make sure the pod will come back, with the option to bypass the check if we need to.
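
A minimal sketch of what such a check could look like, assuming the owner kinds are read from the pod's metadata.ownerReferences. The function names may_delete_pod and get_owner_kinds are hypothetical and are not taken from the script under review; the "true"/"false" force argument is likewise an assumed convention.

```shell
# Sketch only: may_delete_pod and get_owner_kinds are hypothetical names,
# not functions from the script under review.

# Decide whether deletion may proceed.
#   $1: space-separated owner kinds of the pod (empty if the pod has no owner)
#   $2: "true" to bypass the ownership check (assumed force semantics)
# Returns 0 when deletion is allowed, 1 otherwise.
may_delete_pod() {
    owner_kinds="$1"
    force="${2:-false}"

    # The force parameter bypasses the ownership check entirely.
    if [ "$force" = "true" ]; then
        return 0
    fi

    # Allow deletion only if one of the owners is a ReplicaSet.
    for kind in $owner_kinds; do
        if [ "$kind" = "ReplicaSet" ]; then
            return 0
        fi
    done
    return 1
}

# Assumed way to read the owner kinds; requires a logged-in oc client.
#   $1: pod name, $2: namespace
get_owner_kinds() {
    oc get pod "$1" -n "$2" -o jsonpath='{.metadata.ownerReferences[*].kind}'
}
```

A caller could combine the two as `may_delete_pod "$(get_owner_kinds "$POD" "$NAMESPACE")" "$FORCE" || exit 1`, where POD and FORCE are placeholder variable names. Note that this sketch treats pods owned by a DaemonSet or StatefulSet as unmanaged, since the comment above only asks about replicasets.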

author: Alex Volkov
allowedGroups:
- CEE
- SREP
Contributor

Suggested change
-  - SREP
+  - SREP
+  - MCSTierTwo

Contributor Author

added the suggestions, thanks @iamkirkbater

@feichashao
Contributor

Thanks @iamkirkbater for the review!

I would suggest we add the safeguard in this PR to validate that a pod is backed by a replicaset; otherwise the delete operation can be too broad.
The protection can be "we are not making the situation worse":

  • If we are deleting a non-healthy pod, go ahead and delete it.
  • If we are deleting a healthy pod,
    • If it is the only healthy pod in the replicaset, stop, raise a ticket and review it.
    • If there's another healthy pod besides the one we are going to delete, it is ok to delete.
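
The rule above can be sketched as a small decision helper. This is only an illustration: decide_delete is a hypothetical name, and how the health of the target pod and the count of other healthy pods are obtained is left to the caller.

```shell
# Hypothetical helper for the "we are not making the situation worse" rule.
#   $1: "true" if the pod to delete is healthy, anything else if it is not
#   $2: number of OTHER healthy pods in the same replicaset
# Prints "delete" when deletion is acceptable, "stop" when it needs review.
decide_delete() {
    target_healthy="$1"
    other_healthy="$2"

    if [ "$target_healthy" != "true" ]; then
        # Deleting a non-healthy pod cannot make things worse.
        echo "delete"
    elif [ "$other_healthy" -ge 1 ]; then
        # Another healthy pod remains after the deletion.
        echo "delete"
    else
        # The only healthy pod in the replicaset: stop and review.
        echo "stop"
    fi
}
```

For example, `decide_delete true 0` prints `stop`, while `decide_delete false 0` and `decide_delete true 2` both print `delete`.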

Another nice-to-have is to use a list of allowed namespaces instead of openshift-*. This sounds like toil, but it gives us an opportunity to review whether we want to allow deletion when a new namespace appears. (This can be a follow-up PR.)
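
To illustrate the difference between the two approaches, here is a rough sketch. It assumes the existing check is a prefix match on openshift-*, and the allowlist entries are examples only, not a reviewed list; both function names are hypothetical.

```shell
# Illustrative allowlist; the entries are examples, not a reviewed list.
ALLOWED_NAMESPACES="openshift-monitoring openshift-ingress openshift-dns"

# Explicit allowlist: a new namespace is rejected until someone adds it.
#   $1: namespace to check
is_namespace_allowed() {
    for ns in $ALLOWED_NAMESPACES; do
        if [ "$ns" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Prefix match (assumed current behaviour): any openshift-* namespace passes.
#   $1: namespace to check
matches_openshift_prefix() {
    case "$1" in
        openshift-*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With these definitions, a namespace such as openshift-virtualization passes the prefix match but is rejected by the allowlist until it is explicitly added.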

@iamkirkbater
Contributor

@feichashao - a few questions:

  1. Can you expand on what you mean by "non-healthy" pod? If we're asking for this in this PR I'd like to be explicit about what we are looking for. For example, do we just mean that a "healthy" pod is one in a "Running" state, and a non-healthy one is in "Error", "Completed", "Pending", etc.?
  2. What specifically do you mean by raise a ticket? Do you mean a JIRA here? Or would exiting with an error (if there's no FORCE parameter set) work here?
  3. For the list of allowed namespaces, one thing I'd like to keep in mind is that CEE/MCS have a wider scope of what they support than SREP does. While SREP may limit itself to specific managed namespaces, CEE/MCS will be supporting additional things like openshift-virtualization, etc. So limiting them to managed namespaces may not be as efficient as we think.

@feichashao
Contributor

Can you expand on what you mean by "non-healthy" pod? If we're asking for this in this PR I'd like to be explicit about what we are looking for. For example, do we just mean that a "healthy" pod is one in a "Running" state, and a non-healthy one is in "Error", "Completed", "Pending", etc.?

I would say healthy = a pod with all containers in the running state. Everything else is non-healthy, e.g. pending, crashloopbackoff, or a pod in the running state where not all containers are running, shown like:

kube-apiserver-ip-10-119-135-4.ec2.internal           4/5     Running 

(I mocked this)
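
As a rough sketch of that definition, the STATUS and READY columns from output like the line above could be checked as follows. The helper name is_pod_healthy is hypothetical, and using the READY column (ready containers) as a stand-in for "all containers running" is an approximation assumed here.

```shell
# Hypothetical helper: approximates "all containers running" from the
# STATUS and READY columns of `oc get pods` output.
#   $1: pod status, e.g. "Running", "Pending", "CrashLoopBackOff"
#   $2: ready column in the form "<ready>/<total>", e.g. "4/5"
# Returns 0 if the pod is considered healthy, non-zero otherwise.
is_pod_healthy() {
    status="$1"
    ready="$2"
    ready_count="${ready%/*}"   # part before the slash
    total_count="${ready#*/}"   # part after the slash

    # Anything other than Running is non-healthy.
    [ "$status" = "Running" ] || return 1
    # Guard against an empty or malformed total.
    [ "$total_count" -gt 0 ] || return 1
    # Healthy only when every container is ready.
    [ "$ready_count" -eq "$total_count" ]
}
```

With this sketch, `is_pod_healthy Running 4/5` fails for the mocked pod above, while `is_pod_healthy Running 5/5` succeeds.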

@alvlkov
Contributor Author

alvlkov commented Apr 7, 2025

Added replicaset check and --force flag.

  • Successfully deleted a pod owned by a replicaset, regardless of the --force flag
  • Couldn't delete a pod not owned by a replicaset without the --force flag
  • Successfully deleted a pod not owned by a replicaset with the --force flag

@alvlkov alvlkov requested a review from iamkirkbater April 10, 2025 21:19
@alvlkov
Contributor Author

alvlkov commented Jul 2, 2025

/retest

Comment on lines +10 to +17
clusterRoleRules:
- apiGroups:
  - ""
  resources:
  - "pods"
  verbs:
  - "delete"
  - "get"
Member

This is valid for all namespaces. There is no limitation to OpenShift's reserved namespaces, as mentioned above.

This permission extends beyond the scope even SRE-P has.

Contributor Author

The above limitation applies to the NAMESPACE parameter, to avoid deleting Openshift related pods. AFAIK I can't scope namespaces within clusterRoleRules. Please elaborate on the suggestion.

Co-authored-by: typeid <github@typeid.org>
@typeid
Member

typeid commented Jul 22, 2025

/retest

@openshift-ci
Contributor

openshift-ci bot commented Jul 22, 2025

@alvlkov: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@typeid
Member

typeid commented Jul 28, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Jul 28, 2025
@typeid
Member

typeid commented Jul 28, 2025

Code LGTM. Pending approval from compliance: https://issues.redhat.com/browse/HCMSEC-611

@typeid
Member

typeid commented Jul 28, 2025

/hold for compliance approve

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jul 28, 2025
@typeid
Member

typeid commented Sep 18, 2025

/unhold

Merging this as we have not received any feedback from compliance. This does not provide read access to customer data so I'm okay just stamping this off.

@openshift-ci openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 18, 2025
@typeid
Member

typeid commented Sep 18, 2025

/retest

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Dec 19, 2025

Labels

lgtm: Indicates that a PR is ready to be merged.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

5 participants