Open
Description
I have a script (reduced to a minimal reproducible case below) where selecting "table tr" fails to find the first <tr> tag. The example below contains only one <tr> tag, but in other cases there are 9 <tr> tags and only the last 8 are returned. If I comment out the <script> element in the HTML, the selection returns the correct <tr> tags.
(This is from a web scraper I've been playing with to pull information from Goodreads.)
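Since commenting out the <script> makes the selection work, one possible stopgap is to strip script blocks from the markup before parsing. This is only a sketch of a workaround, not a fix for the underlying problem: `strip_scripts` is a hypothetical helper (not part of htmlparser), and it assumes lowercase, non-nested <script> tags like the one in the test file.

```lua
-- Hypothetical helper (not part of htmlparser): remove <script>...</script>
-- blocks from the markup before handing it to the parser.
local function strip_scripts(html)
  -- ".-" is Lua's lazy match, so each block ends at the nearest "</script>".
  -- Assumption: tag names are lowercase, as in the test file.
  local stripped = html:gsub("<script.->.-</script>", "")
  return stripped
end

-- Small inline sample so the snippet runs without gr.html.
local sample = [[<html><body>
<script type="text/javascript">var x = 'a' + 'b';</script>
<table><tr><td>row</td></tr></table>
</body></html>]]

local cleaned = strip_scripts(sample)
print(cleaned)
```

In the test script this would mean calling htmlparser.parse(strip_scripts(html)) instead of htmlparser.parse(html).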
Test script:
local parser = {}
local htmlparser = require("htmlparser")

function parser.book_link(html, title, author)
  local tree = htmlparser.parse(html)
  local books = tree:select("table tr")
  for _, book in ipairs(books) do
    local book_title = book:select("a.bookTitle")
    if book_title[1].nodes[1]:getcontent():match("^" .. title) then
      local aut = book:select("a.authorName")
      if aut[1].nodes[1]:getcontent():match(author) then
        -- Strip the query string; "%?" matches a literal "?" in a Lua pattern.
        return "https://www.goodreads.com" .. book_title[1].attributes["href"]:gsub("%?.*", "")
      end
    end
  end
  return nil
end

local file = assert(io.open("gr.html", "r"))
local search_html = file:read("*a")
file:close()
local title = "Waste Tide"
local author = "Chen Qiufan"
local book_link = parser.book_link(search_html, title, author)
-- tostring() keeps the failure message printable when book_link is nil.
assert("https://www.goodreads.com/book/show/39863294-waste-tide" == book_link, "Incorrect book link: " .. tostring(book_link))
Test file:
<html><body>
<script type="text/javascript" charset="utf-8">
function refreshGroupBox(group_id, book_id) {
new Ajax.Updater('addGroupBooks' + book_id + '', '/group/add_book_box', {asynchronous:true, evalScripts:true, onSuccess:function(request){refreshGroupBoxComplete(request, book_id);}, parameters:'id=' + group_id + '&book_id=' + book_id + '&refresh=true' + '&authenticity_token=' + encodeURIComponent('g0GG+Rcqg7zUv1eOBiN/m0Gxr1TlkcUeCyRfv9ZM7OGYokz03bxSNPCIOn1o7esOziTbneeb1ztimZdGK0srsg==')})
}
</script>
<table><tr><td>
<a class="bookTitle" href="/book/show/39863294-waste-tide?from_search=true&from_srp=true&qid=WILgnaZ5jh&rank=1">
<span itemprop='name'>Waste Tide</span>
</a>
<a class="authorName"><span itemprop="name">Chen Qiufan</span></a>,
</td></tr></table>
</body></html>