fix data harvesting and fuzzy matching #24

rfdougherty · 2025-11-11T23:50:43Z

Fixed the data harvesting scripts to allow drug data update. Also refactored the fuzzy-matching feature to use fuzzyset2 and resolved several bugs in the code.

Updated Python version requirement to allow newer versions.

sart1991 · 2025-11-14T16:47:09Z

src/drug_named_entity_recognition/drugs_finder.py

                        match_data["match_similarity"] = similarity
                        match_data["match_variant"] = fuzzy_matched_variant
                        match_data["matching_string"] = cand
+                        lookup_name = match_data.get("name", m)
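For readers without the full diff, here is a minimal sketch of how the match metadata in this hunk might be populated. It is an illustration only, not the code in this PR: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzyset2, and the function name `best_fuzzy_match`, the `variants_to_names` mapping, and the `threshold` value are all hypothetical. Only the three `match_data` keys and the `.get("name", ...)` fallback come from the hunk above.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(cand, variants_to_names, threshold=0.8):
    """Return match metadata for the closest known variant, or None.

    Hypothetical helper: difflib stands in for fuzzyset2 here, and the
    threshold is an arbitrary illustrative value.
    """
    cand_norm = cand.lower()
    best_variant, best_score = None, 0.0
    for variant in variants_to_names:
        # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, cand_norm, variant.lower()).ratio()
        if score > best_score:
            best_variant, best_score = variant, score

    if best_variant is None or best_score < threshold:
        return None

    # Same keys as in the reviewed hunk.
    match_data = {"name": variants_to_names[best_variant]}
    match_data["match_similarity"] = best_score
    match_data["match_variant"] = best_variant
    match_data["matching_string"] = cand
    return match_data


# Toy lookup table (assumed data, not taken from the project's dictionary).
variants = {"paracetamol": "Acetaminophen", "ibuprofen": "Ibuprofen"}
result = best_fuzzy_match("paracetomol", variants)

# Mirrors the added line: fall back to the raw candidate if "name" is absent.
lookup_name = result.get("name", "paracetomol") if result else "paracetomol"
```

The fallback in the last line matters when a record has no canonical name: the lookup then proceeds with the matched string itself instead of raising a `KeyError`.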


woodthom2 · 2025-12-18T12:10:07Z

Hi @rfdougherty! Thanks so much for this pull request and I really appreciate the time you have put into it and your willingness to contribute. Please forgive my late reply.

I just have one quick request. A lot of files have changed (17), which is the majority of the files in the project, so it's a bit hard for me to review. I can see at a glance that some things have been removed, such as the call to curl if the user is on Windows. I am not sure if this is intentional or part of the PR.

Would it be possible please to split it up into atomic PRs - if you are fixing multiple issues can you send them as separate PRs, ideally each one modifying only one or two files, and also remove things from the PR that don't need to be in there? Then I can review more easily. If not, I will take the time to review and try to merge as soon as I get some time, perhaps I will merge the files individually.

I would like to get it merged as the changes look really valuable, especially if you have improved the data ingestion!

We could always connect on a quick video call to go through the changes if that works? I'm free in the week beginning 29 December.

rfdougherty · 2025-12-23T21:33:21Z

Hi Thomas, Thanks for the response! I did put more into the PR than I had intended, as it included some changes I made that were specific to my use-case. I noticed this after submitting and didn't know if the repo was being maintained so hadn't bothered to fix it. I'll redo the PR with just the generally useful changes and will break it apart if necessary. It may take me a week or so to get time to do this. cheers, bob

woodthom2 · 2025-12-24T10:24:16Z

Thanks Bob. Yes, I'd appreciate it. It's still being maintained and used (I think we have a few thousand users, judging by the PyPI stats), so anything you can submit would be useful. If possible, atomic PRs that change one or two files each are easiest for me to review. But no rush at all!

woodthom2 · 2025-12-24T10:58:29Z

So instead of deleting files from the existing PR, it might be easier to make a new PR and copy just the necessary files into it one by one. I can merge a small one- or two-file PR very quickly.

rfdougherty · 2025-12-26T06:29:51Z

Agreed, I'll create a new PR!

rfdougherty · 2026-01-03T20:44:35Z

I'm splitting this into two PRs. The first is ready for review: #25. I'll close this PR now.

rfdougherty added 9 commits November 10, 2025 13:07

Update Drugbank and MeSH

3f1a9e3

Fix and simplify update code, update drug db

db0e82b

fix fuzzy matching code

3559b08

Add fork note

480ac9e

clarify api

a072739

fix fuzzy match errors

143635a

Relax Python version requirement in pyproject.toml

36a5b84

Updated Python version requirement to allow newer versions.

Clean up readme, remove hard-coded excludes

1de49a9

fix logo

332c158

sart1991 approved these changes Nov 14, 2025

clean up manifest

da80bc3

rfdougherty closed this Jan 3, 2026
