
Conversation

@Ladsgroup
Member

Web nodes have a tendency to go zombie, and as a result they get restarted regularly by uwsgi, but --kill-on-term doesn't close the redis connections, meaning they stay hanging and pile up when requests go up a bit.

Bug: T263910

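For context, the change under discussion closes the redis connections from inside the cache's lookup()/store() methods. Below is a minimal sketch of that idea, assuming a redis-py backed score cache; the class and method names are illustrative, not the actual ores code.

```python
import json

import redis


class RedisScoreCache:
    """Illustrative stand-in for ores' score_cache.Redis class."""

    def __init__(self, client):
        # The redis-py client owns its ConnectionPool internally; this class
        # only receives the client, it does not create the pool itself.
        self.client = client

    @classmethod
    def from_parameters(cls, host, port):
        return cls(redis.StrictRedis(host=host, port=port))

    def lookup(self, key):
        try:
            value = self.client.get(key)
            return json.loads(value) if value is not None else None
        finally:
            # The workaround: drop all pooled sockets after every operation so
            # a worker killed by uwsgi doesn't leave connections hanging.
            self.client.connection_pool.disconnect()

    def store(self, key, score):
        try:
            self.client.set(key, json.dumps(score))
        finally:
            self.client.connection_pool.disconnect()
```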
@adamwight
Contributor

This seems overly coupled to the current way the class is used. We could merge this as an emergency workaround, let me know if you prefer to do that of course.

The Redis(ScoreCache) class doesn't own its redis connection object; we should either fix how dependency injection is happening or have the caller close the connection.

An example use case that will fail after this patch: looking up two scores from the cache.

@akosiaris

I'll more or less echo @adamwight here.

This feels weird to be honest. Closing the connection inside the lookup() or store() methods while the opening happens on instantiation of the class (via the from_parameters() classmethod) leaves me wondering how easily we can end up with a score_cache.Redis object that will just not be able to reconnect again. Granted, it's a ConnectionPool that handles this internally in the redis object, but it's not clear to me yet under which circumstances and how quickly it reinitiates the connection.

As a side note, has the idea of setting max_connections in the redis.StrictRedis call been evaluated? Perhaps we can achieve the same effect without relying on side effects.

@Ladsgroup
Member Author

This seems overly coupled to the current way the class is used. We could merge this as an emergency workaround, let me know if you prefer to do that of course.
Yeah, I have been looking around and trying different routes, but this seems the most plausible way to fix the issue. I'm not happy about it either, but a proper fix requires an architecture rework of ores (like you said, for example proper DI).

The Redis(ScoreCache) class doesn't own its redis connection object; we should either fix how dependency injection is happening or have the caller close the connection.

An example use case that will fail after this patch: looking up two scores from the cache.

Well, that's not the case. The command execution path checks for a connection and, if there is none, establishes one: https://github.com/andymccurdy/redis-py/blob/master/redis/client.py#L932 so it'll be slower, but it wouldn't fail.
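A quick way to see that behaviour (a hedged illustration against a local redis instance, not ores code):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379)
r.set("ores:example", "1")

# Tear down every pooled socket, the same thing the patch does after
# lookup()/store().
r.connection_pool.disconnect()

# The next command asks the pool for a connection, finds none, and creates a
# fresh one, so the call succeeds; the only cost is the reconnect overhead.
print(r.get("ores:example"))  # b'1'
```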

I'll more or less echo @adamwight here.

This feels weird to be honest. Closing the connection inside the lookup() or store() methods while the opening happens on instantiation of the class (via the from_parameters() classmethod) leaves me wondering how easily we can end up with a score_cache.Redis object that will just not be able to reconnect again. Granted, it's a ConnectionPool that handles this internally in the redis object, but it's not clear to me yet under which circumstances and how quickly it reinitiates the connection.

Yup, my best solution would be to have a watchdog/heartbeat or something similar (plus having a proper transaction/rollback mechanism in place), but that's not possible without reworking half of ores. It'll be slower to re-initiate the connection, but I don't think it would be by much; we recently made a lot of improvements in ores performance by putting most of the requests it makes behind envoy (we might be able to do this here too, hitting the appservers has been behind envoy for two months now).

As a side note, has the idea of setting max_connections in the redis.StrictRedis call been evaluated? Perhaps we can achieve the same effect without relying on side effects.

I already made a patch for this a month ago: https://gerrit.wikimedia.org/r/633578 but given that ores spawns 60 different uwsgi processes on one node, each will have its own pool of connections, and as a result I don't think it'll achieve the same effect (we can try that, though).
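For reference, the max_connections idea would look roughly like the sketch below. This is only an illustration, not the gerrit patch, and the cap is per process, so with 60 workers per node the totals still multiply.

```python
import redis

# BlockingConnectionPool makes callers wait for a free connection instead of
# opening new ones once the cap is reached.
pool = redis.BlockingConnectionPool(
    host="localhost",
    port=6379,
    max_connections=2,  # per-process cap; 60 uwsgi workers can still hold 120 sockets
    timeout=5,          # seconds to wait for a free connection before raising
)
client = redis.StrictRedis(connection_pool=pool)
```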

@akosiaris

This seems overly coupled to the current way the class is used. We could merge this as an emergency workaround, let me know if you prefer to do that of course.
Yeah, I have been looking around and trying different routes, but this seems the most plausible way to fix the issue. I'm not happy about it either, but a proper fix requires an architecture rework of ores (like you said, for example proper DI).
The Redis(ScoreCache) class doesn't own its redis connection object; we should either fix how dependency injection is happening or have the caller close the connection.
An example use case that will fail after this patch: looking up two scores from the cache.

Well, that's not the case. The command execution path checks for a connection and, if there is none, establishes one: https://github.com/andymccurdy/redis-py/blob/master/redis/client.py#L932 so it'll be slower, but it wouldn't fail.

I'll more or less echo @adamwight here.
This feels weird to be honest. Closing the connection inside the lookup() or store() methods while the opening happens on instantiation of the class (via the from_parameters() classmethod) leaves me wondering how easily we can end up with a score_cache.Redis object that will just not be able to reconnect again. Granted, it's a ConnectionPool that handles this internally in the redis object, but it's not clear to me yet under which circumstances and how quickly it reinitiates the connection.

Yup, my best solution would be to have a watchdog/heartbeat or something similar (plus having a proper transaction/rollback mechanism in place), but that's not possible without reworking half of ores. It'll be slower to re-initiate the connection, but I don't think it would be by much; we recently made a lot of improvements in ores performance by putting most of the requests it makes behind envoy (we might be able to do this here too, hitting the appservers has been behind envoy for two months now).

It's true that this isn't a particularly expensive connection so we might be ok with this. That being said, I still have the feeling that lowering the number of uwsgi workers and limiting them to a small number of connections (I am wondering tbh if anything >1 even makes sense due to the way ores uwsgi works) is a better approach than forcibly closing the connection at this level.

As far as the envoy part goes, yes it does support redis in a pretty specific way (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_protocols/redis) but from my reading it looks like it can work quite fine for the cache. We could go down that way if the other approaches prove ineffective.

As a side note, has the idea of setting max_connections in the redis.StrictRedis call been evaluated? Perhaps we can achieve the same effect without relying on side effects.

I already made a patch for this a month ago: https://gerrit.wikimedia.org/r/633578 but given that ores spawns 60 different uwsgi processes on one node, each will have its own pool of connections, and as a result I don't think it'll achieve the same effect (we can try that, though).

Yeah, I've left a number of comments on that one too. I think we should try to go down that path and also try to cut down on the number of workers (there are too many anyway).

@Ladsgroup
Member Author

I definitely agree we should reduce the number of uwsgi workers and also reduce the max number of connections, but if that doesn't fix the issue, let's try this method after that. My reasoning is that I'm not 100% sure uwsgi processes going zombie is all random. For example, look at this: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1604474614257&to=1604551820888

[Grafana dashboard screenshot]

It seems the third-party requests to wikidata are followed by an increase in the number of zombie processes. Requests to wikidata trigger lots of errors if the edit is not done in the main ns (meaning the content is not parsable json). ores catches the error and responds properly: https://ores.wikimedia.org/v3/scores/wikidatawiki/17 but I can't shake off the feeling that the process goes zombie after that.
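To make that error path concrete: for a non-main-namespace Wikidata revision, the page content is wikitext rather than the JSON the feature extraction expects, so it fails roughly as in this hypothetical sketch (not the actual ores/revscoring code):

```python
import json


def parse_item_content(raw_content):
    # Wikidata items in the main namespace are JSON blobs; talk pages and
    # other namespaces hold plain wikitext and fail to parse.
    try:
        return json.loads(raw_content)
    except json.JSONDecodeError as error:
        # ores catches errors like this and returns a structured error in the
        # score response instead of crashing; the open question above is
        # whether the worker is left in a bad state afterwards.
        return {"error": {"type": "JSONDecodeError", "message": str(error)}}


print(parse_item_content("== Not JSON: a talk page section =="))
```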

@Ladsgroup
Member Author

The reason I'm going on feeling here is that ores observability is 13 out of 10. We should work on that later, but I have no time for that.

@akosiaris

akosiaris commented Nov 9, 2020

I definitely agree we should reduce the number of uwsgi workers and also reduce the max number of connections, but if that doesn't fix the issue, let's try this method after that.

Merged the lowering of uwsgi workers already. Thanks!

My reasoning is that I'm not 100% sure uwsgi processes going zombie is all random. For example, look at this: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1604474614257&to=1604551820888

[Grafana dashboard screenshot]

It seems the third-party requests to wikidata are followed by an increase in the number of zombie processes. Requests to wikidata trigger lots of errors if the edit is not done in the main ns (meaning the content is not parsable json). ores catches the error and responds properly: https://ores.wikimedia.org/v3/scores/wikidatawiki/17 but I can't shake off the feeling that the process goes zombie after that.

Heh, that's an interesting observation. I am wondering whether it's indeed just correlation and not causation. Per your comment about ores catching it, it shouldn't be, but I guess we can only know for sure if we get a zombie uwsgi process and gdb it. Not sure how much that would pay off though.
