
Conversation

@Ladsgroup
Member

Web nodes have a tendency to go zombie, and as a result they get restarted regularly by uwsgi, but --kill-on-term doesn't close the redis connections, meaning they stay hanging and pile up when requests go up a bit.

Bug: T263910

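For context, the change under discussion closes the redis connections from inside the cache's lookup()/store() methods. Below is a minimal sketch of that idea, assuming a redis-py backed score cache; the class and method names are illustrative, not the actual ores code.

```python
import json

import redis


class RedisScoreCache:
    """Illustrative stand-in for ores' score_cache.Redis class."""

    def __init__(self, client):
        # The redis-py client owns its ConnectionPool internally; this class
        # only receives the client, it does not create the pool itself.
        self.client = client

    @classmethod
    def from_parameters(cls, host, port):
        return cls(redis.StrictRedis(host=host, port=port))

    def lookup(self, key):
        try:
            value = self.client.get(key)
            return json.loads(value) if value is not None else None
        finally:
            # The workaround: drop all pooled sockets after every operation so
            # a worker killed by uwsgi doesn't leave connections hanging.
            self.client.connection_pool.disconnect()

    def store(self, key, score):
        try:
            self.client.set(key, json.dumps(score))
        finally:
            self.client.connection_pool.disconnect()
```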
@adamwight
Contributor

This seems overly coupled to the current way the class is used. We could merge this as an emergency workaround, let me know if you prefer to do that of course.

The Redis(ScoreCache) class doesn't own its redis connection object; we should either fix how dependency injection is happening or have the caller close the connection.

An example use case that will fail after this patch: looking up two scores from the cache.

@akosiaris

I'll more or less echo @adamwight here.

This feels weird to be honest. Closing the connection inside the lookup() or store() methods while the opening happens on instantiation of the class (via the from_parameters() classmethod) leaves me wondering how easily we can end up with a score_cache.Redis object that will just not be able to reconnect again. Granted, it's a ConnectionPool that handles this internally in the redis object, but it's not clear to me yet under which circumstances and how quickly it reinitiates the connection.

As a side note, has the idea of setting max_connections in the redis.StrictRedis call been evaluated? Perhaps we can achieve the same effect without relying on side effects.

@Ladsgroup
Member Author

This seems overly coupled to the current way the class is used. We could merge this as an emergency workaround, let me know if you prefer to do that of course.
Yeah, I have been looking around and trying different routes, but this seems the most plausible way to fix the issue. I'm not happy about it either, but a proper fix requires an architecture rework of ores (like you said, for example proper DI).

The Redis(ScoreCache) class doesn't own its redis connection object; we should either fix how dependency injection is happening or have the caller close the connection.

An example use case that will fail after this patch: looking up two scores from the cache.

Well, that's not the case. The command execution path checks for a connection and, if there is none, establishes one: https://github.com/andymccurdy/redis-py/blob/master/redis/client.py#L932 so it'll be slower, but it wouldn't fail.
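A quick way to see that behaviour (a hedged illustration against a local redis instance, not ores code):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379)
r.set("ores:example", "1")

# Tear down every pooled socket, the same thing the patch does after
# lookup()/store().
r.connection_pool.disconnect()

# The next command asks the pool for a connection, finds none, and creates a
# fresh one, so the call succeeds; the only cost is the reconnect overhead.
print(r.get("ores:example"))  # b'1'
```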

I'll more or less echo @adamwight here.

This feels weird to be honest. Closing the connection inside the lookup() or store() methods while the opening happens on instantiation of the class (via the from_parameters() classmethod) leaves me wondering how easily we can end up with a score_cache.Redis object that will just not be able to reconnect again. Granted, it's a ConnectionPool that handles this internally in the redis object, but it's not clear to me yet under which circumstances and how quickly it reinitiates the connection.

Yup, my best solution would be to have a watchdog/heartbeat or something similar (plus having a proper transaction/rollback mechanism in place), but that's not possible without reworking half of ores. It'll be slower to re-initiate the connection, but I don't think it would be by much; we recently made a lot of improvements in ores performance by putting most of the requests it makes behind envoy (we might be able to do this here too, hitting the appservers has been behind envoy for two months now).

As a side note, has the idea of setting max_connections in the redis.StrictRedis call been evaluated? Perhaps we can achieve the same effect without relying on side effects.

I already made a patch for this a month ago: https://gerrit.wikimedia.org/r/633578 but given that ores spawns 60 different uwsgi processes on one node, each will have its own pool of connections, and as a result I don't think it'll achieve the same effect (we can try that, though).
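For reference, the max_connections idea would look roughly like the sketch below. This is only an illustration, not the gerrit patch, and the cap is per process, so with 60 workers per node the totals still multiply.

```python
import redis

# BlockingConnectionPool makes callers wait for a free connection instead of
# opening new ones once the cap is reached.
pool = redis.BlockingConnectionPool(
    host="localhost",
    port=6379,
    max_connections=2,  # per-process cap; 60 uwsgi workers can still hold 120 sockets
    timeout=5,          # seconds to wait for a free connection before raising
)
client = redis.StrictRedis(connection_pool=pool)
```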

@akosiaris

This seems overly coupled to the current way the class is used. We could merge this as an emergency workaround, let me know if you prefer to do that of course.
Yeah, I have been looking around and trying different routes, but this seems the most plausible way to fix the issue. I'm not happy about it either, but a proper fix requires an architecture rework of ores (like you said, for example proper DI).
The Redis(ScoreCache) class doesn't own its redis connection object; we should either fix how dependency injection is happening or have the caller close the connection.
An example use case that will fail after this patch: looking up two scores from the cache.

Well, that's not the case. The command execution path checks for a connection and, if there is none, establishes one: https://github.com/andymccurdy/redis-py/blob/master/redis/client.py#L932 so it'll be slower, but it wouldn't fail.

I'll more or less echo @adamwight here.
This feels weird to be honest. Closing the connection inside the lookup() or store() methods while the opening happens on instantiation of the class (via the from_parameters() classmethod) leaves me wondering how easily we can end up with a score_cache.Redis object that will just not be able to reconnect again. Granted, it's a ConnectionPool that handles this internally in the redis object, but it's not clear to me yet under which circumstances and how quickly it reinitiates the connection.

Yup, my best solution would be to have a watchdog/heartbeat or something similar (plus having a proper transaction/rollback mechanism in place), but that's not possible without reworking half of ores. It'll be slower to re-initiate the connection, but I don't think it would be by much; we recently made a lot of improvements in ores performance by putting most of the requests it makes behind envoy (we might be able to do this here too, hitting the appservers has been behind envoy for two months now).

It's true that this isn't a particularly expensive connection so we might be ok with this. That being said, I still have the feeling that lowering the number of uwsgi workers and limiting them to a small number of connections (I am wondering tbh if anything >1 even makes sense due to the way ores uwsgi works) is a better approach than forcibly closing the connection at this level.

As far as the envoy part goes, yes it does support redis in a pretty specific way (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_protocols/redis) but from my reading it looks like it can work quite fine for the cache. We could go down that way if the other approaches prove ineffective.

As a side note, has the idea of setting max_connections in the redis.StrictRedis call been evaluated? Perhaps we can achieve the same effect without relying on side effects.

I already made a patch for this a month ago: https://gerrit.wikimedia.org/r/633578 but given that ores spawns 60 different uwsgi processes on one node, each will have its own pool of connections, and as a result I don't think it'll achieve the same effect (we can try that, though).

Yeah, I've left a number of comments on that one too. I think we should try to go down that path and also try to cut down on the number of workers (there are too many anyway).

@Ladsgroup
Member Author

I definitely agree we should reduce the number of uwsgi workers and also reduce the max number of connections, but if that doesn't fix the issue, let's try this method after that. My reasoning is that I'm not 100% sure uwsgi processes going zombie is all random. For example, look at this: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1604474614257&to=1604551820888

[Grafana dashboard screenshot]

It seems the third-party requests to wikidata are followed by an increase in the number of zombie processes. Requests to wikidata trigger lots of errors if the edit is not done in the main ns (meaning the content is not parsable json). ores catches the error and responds properly: https://ores.wikimedia.org/v3/scores/wikidatawiki/17 but I can't shake off the feeling that the process goes zombie after that.
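To make that error path concrete: for a non-main-namespace Wikidata revision, the page content is wikitext rather than the JSON the feature extraction expects, so it fails roughly as in this hypothetical sketch (not the actual ores/revscoring code):

```python
import json


def parse_item_content(raw_content):
    # Wikidata items in the main namespace are JSON blobs; talk pages and
    # other namespaces hold plain wikitext and fail to parse.
    try:
        return json.loads(raw_content)
    except json.JSONDecodeError as error:
        # ores catches errors like this and returns a structured error in the
        # score response instead of crashing; the open question above is
        # whether the worker is left in a bad state afterwards.
        return {"error": {"type": "JSONDecodeError", "message": str(error)}}


print(parse_item_content("== Not JSON: a talk page section =="))
```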

@Ladsgroup
Member Author

The reason I'm going on feeling here is that ores observability is 13 out of 10. We should work on that later, but I have no time for that.

@akosiaris

akosiaris commented Nov 9, 2020

I definitely agree we should reduce the number of uwsgi workers and also reduce the max number of connections, but if that doesn't fix the issue, let's try this method after that.

Merged the lowering of uwsgi workers already. Thanks!

My reasoning is that I'm not 100% sure uwsgi processes going zombie is all random. For example, look at this: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1604474614257&to=1604551820888

[Grafana dashboard screenshot]

It seems the third-party requests to wikidata are followed by an increase in the number of zombie processes. Requests to wikidata trigger lots of errors if the edit is not done in the main ns (meaning the content is not parsable json). ores catches the error and responds properly: https://ores.wikimedia.org/v3/scores/wikidatawiki/17 but I can't shake off the feeling that the process goes zombie after that.

Heh, that's an interesting observation. I am wondering whether it's indeed just correlation and not causation. Per your comment about ores catching it, it shouldn't be, but I guess we can only know for sure if we get a zombie uwsgi process and gdb it. Not sure how much that would pay off though.
