This repository was archived by the owner on Feb 27, 2020. It is now read-only.

Conversation

graemerobertson commented Feb 13, 2018

Some notes...

  • I took etcd and etcdctl binaries from etcd-v3.2.16-linux-amd64.tar.gz at https://github.com/coreos/etcd/releases/tag/v3.2.16
  • I just copied etcd-dump-logs from the 3.1.7 directory because (a) I don't know how to compile Go (not a particularly compelling argument) and (b) there have been no new commits at https://github.com/coreos/etcd/tree/v3.2.16/tools/etcd-dump-logs between the v3.2.16 and v3.1.7 tags (which I think is more compelling)
  • The etcdwrapper is a direct copy (with version number updated) from the v3.1.7 version, which is obviously a bit lazy
  • I've successfully run the FV tests (and the UTs, although I'm not sure they actually use this)
  • I've got a pipeline on the go, and if that succeeds, I'm going to do a scale-in and scale-out upgrade too
    • I've caught and fixed two issues so far, hoping for a clean run this time!
    • It passed!

graemerobertson changed the title from "[Reviewer ???] Upgrade etcd to V3.2.16" to "[Reviewer EM] Upgrade etcd to V3.2.16" on Feb 15, 2018
eleanor-merry (Contributor) left a comment

LGTM - one question inline


 # Run the real etcdctl.
-/usr/share/clearwater/clearwater-etcd/$etcd_version/etcdctl -C $target_ip:4000 "$@"
+/usr/share/clearwater/clearwater-etcd/$etcd_version/etcdctl -C http://$target_ip:4000 "$@"

Why does this need to change? Does it work with all versions? I assume you've tested the 3.x ones - I wonder if we should just remove the 2.x one entirely.

graemerobertson (Author) replied:

I'm not 100% sure, but I assume it's related to the following section from the upgrade documentation:

Change in --listen-peer-urls and --listen-client-urls
3.2 now rejects domain names for --listen-peer-urls and --listen-client-urls (3.1 only prints out warnings), since a domain name is invalid for network interface binding. Make sure that those URLs are properly formatted as scheme://IP:port.
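
To make the quoted rule concrete, here's a minimal sketch of the flag format it describes (the flag is shown in isolation, and both the IP address and the hostname are just placeholder examples), in the same spirit as the http:// prefix added to the etcdctl endpoint in the diff above:

# Accepted by 3.2: an explicit scheme plus an IP address and port.
etcd --listen-client-urls http://10.0.0.1:4000

# 3.1 only printed a warning for this; 3.2 rejects it, because a domain
# name can't be bound to a network interface (the hostname is made up).
etcd --listen-client-urls http://etcd-1.example.local:4000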

I've checked it works against 3.1.7, but I couldn't be bothered to check with 2.2.5 (partly because I was pretty sure it would do and partly because I wasn't sure why I cared). Any idea what the last CC version we shipped with 2.2.5 was? V10?

graemerobertson (Author) replied:

As discussed, I'm going to remove support for 2.2.5.

graemerobertson (Author) commented Feb 18, 2018

Whilst doing some live testing of this change, I noticed that etcd proxies in the cluster would run for around 5 minutes before consuming all of their available file handles and being restarted by monit. This was happening reliably and repeatedly.
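
(As an aside, a quick way to watch the proxy’s file-descriptor usage while reproducing this, assuming the proxy runs as a single process named etcd on the node:)

sudo ls /proc/$(pidof etcd)/fd | wc -l    # open file descriptor count; this climbs until monit restarts the proxy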

I’ve collected some diags from a single etcd proxy. The etcd proxy’s IP address is 10.230.11.141, the etcd masters’ IP addresses are 10.230.11.136, 10.230.11.137 and 10.230.11.138.

  • I unmonitored and stopped clearwater-etcd, and cleared the etcd log file. I then ran sudo tcpdump -i any tcp port 4000 -w etcdv3.2.16.cap and started clearwater-etcd. I gathered ~3 minutes of packet capture and logs (clearwater-etcdv3.2.16.log). These are attached.
  • I then repeated this on the same node, but this time with etcd_version set to 3.1.7 in local_config. The corresponding etcdv3.1.7.cap and clearwater-etcdv3.1.7.log are also attached. It’s worth noting that the etcd proxy was quite happy running v3.1.7. (The capture steps are sketched as commands just after this list.)
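
For completeness, those capture steps as a rough command sketch (the monit service name and log path below are assumptions and may differ on a real node; the tcpdump command is the one quoted above):

sudo monit unmonitor clearwater_etcd                              # assumed monit service name
sudo service clearwater-etcd stop
sudo truncate -s 0 /var/log/clearwater-etcd/clearwater-etcd.log   # assumed log path
sudo tcpdump -i any tcp port 4000 -w etcdv3.2.16.cap &            # capture in the background
sudo service clearwater-etcd start
# ...wait ~3 minutes, then stop tcpdump and collect the .cap file and the log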

Here’s what I can see happening from these diags…

  • If we filter the packet captures by http.request.uri.path contains "mmf_rules_json" (a command-line equivalent is sketched just after this list), so that we’re just looking at watches on the MMF rules JSON config file…
    • Every 5 seconds we send two GETs for this key. One is on the loopback interface and the other is over the wire to an etcd master. These all happen on new TCP connections (after the initial ones, which are special). This is all true for both etcd versions.
      • At a guess, the request over the loopback interface is from clearwater-config-manager to the local etcd proxy, which in turn triggers the etcd proxy to ask an etcd master?
    • The crucial difference is on the GET to the etcd master. If we follow the TCP stream for one of these GETs…
      • On v3.2.16, the TCP connection never gets closed for any of these GETs, and in fact there are TCP Keep-Alives every 30 seconds.
      • On v3.1.7, the etcd proxy sends a FIN-ACK after 5 seconds. I don’t completely understand the flow here, but the crucial thing is that the TCP connection is torn down.
    • (This is not specific to the MMF rules JSON config file, I just used that as an example)
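
For reference, the filtering and stream-following described above can also be done from the command line with tshark (assuming tshark is available; the stream index 0 below is just a placeholder for whichever stream you want to follow):

# Show only the requests that watch the MMF rules JSON key.
tshark -r etcdv3.2.16.cap -Y 'http.request.uri.path contains "mmf_rules_json"'

# Follow a single TCP stream to see whether the proxy ever tears the connection down.
tshark -r etcdv3.2.16.cap -q -z follow,tcp,ascii,0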

So, in conclusion…

  • On both versions, the etcd proxy creates a new TCP connection to the etcd master every time it tries to read a key, which is every 5 seconds.
  • On etcd v3.2.16, these connections stay open forever (or possibly just until the key changes, but whatever)
  • Running sudo netstat -tnp | grep 4000 | grep ESTABLISHED | wc -l in both scenarios confirms this: on v3.1.7 the number of ESTABLISHED connections remains pretty constant (at around 45); on v3.2.16 it grows and grows until we run out of file handles. (A small loop for watching this over time is sketched just after this list.)
  • This all looks kind of similar to http://www.projectclearwater.org/adventures-in-debugging-etcd-http-pipelining-and-file-descriptor-leaks/, except in that case I think we would have been leaking file handles on the etcd master side? I also can’t see any reason why we would have regressed that fix; I suspect this is a similar class of problem, rather than being the same exact problem.
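
A minimal way to watch that connection count over time, built on the same netstat pipeline as above:

# Print the number of ESTABLISHED connections on port 4000 every 30 seconds;
# on v3.1.7 this stays roughly flat (around 45), on v3.2.16 it keeps climbing.
while true; do
  echo "$(date +%T) $(sudo netstat -tnp | grep 4000 | grep ESTABLISHED | wc -l)"
  sleep 30
done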

There’s one other thing I’m confused about. In both etcd log files I’ve attached, there are lots of logs that look like the following…

2018-02-18 11:08:03.521458 I | proxy/httpproxy: client 10.230.11.141:57284 closed request prematurely

This log corresponds to 5 seconds after sending the GET to the etcd master, and on etcd v3.1.7 it happens just after sending the FIN-ACK. Do we just spam this log out on all our systems? I don’t remember ever seeing it before? Is it because I have debug logging turned on?!

FTR, I’ve tried using v3.2.1 too, but this exhibits the same problem.

clearwater-etcdv3.2.16.zip
