This repository was archived by the owner on Feb 27, 2020. It is now read-only.

Conversation

graemerobertson commented Feb 13, 2018

Some notes...

  • I took etcd and etcdctl binaries from etcd-v3.2.16-linux-amd64.tar.gz at https://github.com/coreos/etcd/releases/tag/v3.2.16
  • I just copied etcd-dump-logs from the 3.1.7 directory because (a) I don't know how to compile Go (not a particularly compelling argument) and (b) there have been no new commits at https://github.com/coreos/etcd/tree/v3.2.16/tools/etcd-dump-logs between the v3.2.16 and v3.1.7 tags (which I think is more compelling)
  • The etcdwrapper is a direct copy (with version number updated) from the v3.1.7 version, which is obviously a bit lazy
  • I've successfully run the FV tests (and the UTs, although I'm not sure they actually use this)
  • I've got a pipeline on the go, and if that succeeds, I'm going to do a scale-in and scale-out upgrade too
    • I've caught and fixed two issues so far, hoping for a clean run this time!
    • It passed!

graemerobertson changed the title from "[Reviewer ???] Upgrade etcd to V3.2.16" to "[Reviewer EM] Upgrade etcd to V3.2.16" on Feb 15, 2018
eleanor-merry (Contributor) left a comment

LGTM - one question inline


 # Run the real etcdctl.
-/usr/share/clearwater/clearwater-etcd/$etcd_version/etcdctl -C $target_ip:4000 "$@"
+/usr/share/clearwater/clearwater-etcd/$etcd_version/etcdctl -C http://$target_ip:4000 "$@"

Why does this need to change? Does it work with all versions? I assume you've tested the 3.x ones - I wonder if we should just remove the 2.x one entirely.

graemerobertson (Author) replied:

I'm not 100% sure, but I assume it's related to the following section from the upgrade documentation:

Change in --listen-peer-urls and --listen-client-urls
3.2 now rejects domain names for --listen-peer-urls and --listen-client-urls (3.1 only prints out warnings), since a domain name is invalid for network interface binding. Make sure that those URLs are properly formatted as scheme://IP:port.
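
To make the quoted rule concrete, here's a minimal sketch of the flag format it describes (the flag is shown in isolation, and both the IP address and the hostname are just placeholder examples), in the same spirit as the http:// prefix added to the etcdctl endpoint in the diff above:

# Accepted by 3.2: an explicit scheme plus an IP address and port.
etcd --listen-client-urls http://10.0.0.1:4000

# 3.1 only printed a warning for this; 3.2 rejects it, because a domain
# name can't be bound to a network interface (the hostname is made up).
etcd --listen-client-urls http://etcd-1.example.local:4000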

I've checked it works against 3.1.7, but I couldn't be bothered to check with 2.2.5 (partly because I was pretty sure it would do and partly because I wasn't sure why I cared). Any idea what the last CC version we shipped with 2.2.5 was? V10?

graemerobertson (Author) replied:

As discussed, I'm going to remove support for 2.2.5.

graemerobertson (Author) commented Feb 18, 2018

Whilst doing some live testing of this change, I noticed that etcd proxies in the cluster would run for around 5 minutes before consuming all of their available file handles and being restarted by monit. This was happening reliably and repeatedly.
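
(As an aside, a quick way to watch the proxy’s file-descriptor usage while reproducing this, assuming the proxy runs as a single process named etcd on the node:)

sudo ls /proc/$(pidof etcd)/fd | wc -l    # open file descriptor count; this climbs until monit restarts the proxy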

I’ve collected some diags from a single etcd proxy. The etcd proxy’s IP address is 10.230.11.141, the etcd masters’ IP addresses are 10.230.11.136, 10.230.11.137 and 10.230.11.138.

  • I unmonitored and stopped clearwater-etcd, and cleared the etcd log file. I then ran sudo tcpdump -i any tcp port 4000 -w etcdv3.2.16.cap and started clearwater-etcd. I gathered ~3 minutes of packet capture and logs (clearwater-etcdv3.2.16.log). These are attached.
  • I then repeated this on the same node, but this time with etcd_version set to 3.1.7 in local_config. The corresponding etcdv3.1.7.cap and clearwater-etcdv3.1.7.log are also attached. It’s worth noting that the etcd proxy was quite happy running v3.1.7. (The capture steps are sketched as commands just after this list.)
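
For completeness, those capture steps as a rough command sketch (the monit service name and log path below are assumptions and may differ on a real node; the tcpdump command is the one quoted above):

sudo monit unmonitor clearwater_etcd                              # assumed monit service name
sudo service clearwater-etcd stop
sudo truncate -s 0 /var/log/clearwater-etcd/clearwater-etcd.log   # assumed log path
sudo tcpdump -i any tcp port 4000 -w etcdv3.2.16.cap &            # capture in the background
sudo service clearwater-etcd start
# ...wait ~3 minutes, then stop tcpdump and collect the .cap file and the log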

Here’s what I can see happening from these diags…

  • If we filter the packet captures by http.request.uri.path contains "mmf_rules_json" (a command-line equivalent is sketched just after this list), so that we’re just looking at watches on the MMF rules JSON config file…
    • Every 5 seconds we send two GETs for this key. One is on the loopback interface and the other is over the wire to an etcd master. These all happen on new TCP connections (after the initial ones, which are special). This is all true for both etcd versions.
      • At a guess, the request over the loopback interface is from clearwater-config-manager to the local etcd proxy, which in turn triggers the etcd proxy to ask an etcd master?
    • The crucial difference is on the GET to the etcd master. If we follow the TCP stream for one of these GETs…
      • On v3.2.16, the TCP connection never gets closed for any of these GETs, and in fact there are TCP Keep-Alives every 30 seconds.
      • On v3.1.7, the etcd proxy sends a FIN-ACK after 5 seconds. I don’t completely understand the flow here, but the crucial thing is that the TCP connection is torn down.
    • (This is not specific to the MMF rules JSON config file, I just used that as an example)
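
For reference, the filtering and stream-following described above can also be done from the command line with tshark (assuming tshark is available; the stream index 0 below is just a placeholder for whichever stream you want to follow):

# Show only the requests that watch the MMF rules JSON key.
tshark -r etcdv3.2.16.cap -Y 'http.request.uri.path contains "mmf_rules_json"'

# Follow a single TCP stream to see whether the proxy ever tears the connection down.
tshark -r etcdv3.2.16.cap -q -z follow,tcp,ascii,0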

So, in conclusion…

  • On both versions, the etcd proxy creates a new TCP connection to the etcd master every time it tries to read a key, which is every 5 seconds.
  • On etcd v3.2.16, these connections stay open forever (or possibly just until the key changes, but whatever)
  • Running sudo netstat -tnp | grep 4000 | grep ESTABLISHED | wc -l in both scenarios confirms this: on v3.1.7 the number of ESTABLISHED connections remains pretty constant (at around 45); on v3.2.16 it grows and grows until we run out of file handles. (A small loop for watching this over time is sketched just after this list.)
  • This all looks kind of similar to http://www.projectclearwater.org/adventures-in-debugging-etcd-http-pipelining-and-file-descriptor-leaks/, except in that case I think we would have been leaking file handles on the etcd master side? I also can’t see any reason why we would have regressed that fix; I suspect this is a similar class of problem, rather than being the same exact problem.
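
A minimal way to watch that connection count over time, built on the same netstat pipeline as above:

# Print the number of ESTABLISHED connections on port 4000 every 30 seconds;
# on v3.1.7 this stays roughly flat (around 45), on v3.2.16 it keeps climbing.
while true; do
  echo "$(date +%T) $(sudo netstat -tnp | grep 4000 | grep ESTABLISHED | wc -l)"
  sleep 30
done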

There’s one other thing I’m confused about. In both etcd log files I’ve attached, there are lots of logs that look like the following…

2018-02-18 11:08:03.521458 I | proxy/httpproxy: client 10.230.11.141:57284 closed request prematurely

This log corresponds to 5 seconds after sending the GET to the etcd master, and on etcd v3.1.7 it happens just after sending the FIN-ACK. Do we just spam this log out on all our systems? I don’t remember ever seeing it before? Is it because I have debug logging turned on?!

FTR, I’ve tried using v3.2.1 too, but this exhibits the same problem.

clearwater-etcdv3.2.16.zip
