SPDX provides machine-readable versions of licenses, e.g. https://github.com/spdx/license-list-XML/blob/master/src/MIT.xml. Matching against these would likely be better than the n-gram matching done currently, at the expense of a little more complexity. We should figure out whether that's worth it by comparing the accuracy of both approaches across a representative sample.
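
To make the comparison concrete, here is a rough sketch of what such an experiment could look like. Everything in it is an assumption for illustration: `TOY_XML` is a made-up, heavily simplified stand-in for an SPDX file (real files declare an XML namespace and use more element types, which this ignores), `ngram_match` is a generic word-trigram Dice matcher and not necessarily what we do today, and the four samples are invented, so the numbers it prints say nothing about real-world accuracy.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an SPDX license XML file (not real SPDX data).
TOY_XML = """<license licenseId="Toy-1.0">
  <text>
    <optional>Toy License</optional>
    <p>Permission is granted to <alt name="verb" match="use|copy">use</alt> this software.</p>
  </text>
</license>"""

# Plain-text reference used by the n-gram matcher (assumed, for illustration).
REFERENCE = "Toy License Permission is granted to use this software."


def words(text):
    """Normalize text to lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", (text or "").lower())


def word_pattern(text):
    """Regex fragment matching the words of `text`, each followed by one space."""
    return "".join(re.escape(w) + " " for w in words(text))


def build_pattern(elem):
    """Recursively turn a template element into a regex over normalized words."""
    out = word_pattern(elem.text)
    for child in elem:
        if child.tag == "optional":
            out += "(?:" + build_pattern(child) + ")?"
        elif child.tag == "alt":
            # Replaceable text: accept anything the `match` attribute allows.
            out += "(?:" + child.get("match") + ") "
        else:
            out += build_pattern(child)
        out += word_pattern(child.tail)
    return out


TEMPLATE = re.compile(build_pattern(ET.fromstring(TOY_XML).find("text")))


def template_match(candidate):
    """True if the normalized candidate matches the whole template."""
    return TEMPLATE.fullmatch(" ".join(words(candidate)) + " ") is not None


def trigrams(text):
    w = words(text)
    return {tuple(w[i:i + 3]) for i in range(len(w) - 2)}


def dice(a, b):
    """Dice coefficient over word trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb)) if (ta or tb) else 0.0


def ngram_match(candidate, threshold=0.8):
    """Generic n-gram matcher; the threshold is an arbitrary assumption."""
    return dice(candidate, REFERENCE) >= threshold


def accuracy(matcher, samples):
    """Fraction of (text, expected) samples the matcher classifies correctly."""
    return sum(matcher(text) == expected for text, expected in samples) / len(samples)


# Invented samples: (license text, is it really Toy-1.0?).
SAMPLES = [
    ("Toy License\nPermission is granted to use this software.", True),
    ("Permission is granted to copy this software.", True),
    ("Permission is granted to sell this software.", False),
    ("Permission is not granted to use this software.", False),
]

if __name__ == "__main__":
    for name, matcher in [("n-gram", ngram_match), ("template", template_match)]:
        print(name, accuracy(matcher, SAMPLES))
```

For the real evaluation, `SAMPLES` would be replaced by a labeled set of license files from actual repositories, and both matchers by the current implementation and a proper SPDX template matcher.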