First, a great package and thank you for making it available!
I am looking for a way to match a link to its bookmark. Is that possible in principle? Below is an example of what I want to achieve.
I defined an annotation rule for the id attribute (#id). It returns the location and the text of the bookmark, but not the actual id value, so it is not possible to know which bookmark has been matched.
Also, when the div element doesn't have any text or sub-elements, the annotation doesn't return anything.
Would I have to re-define the div tag handler (even though I don't know which tags may contain bookmarks)?
Also, it seems that I cannot define any custom attribute handlers, other than the three already defined?
Many thanks!
from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
from inscriptis import ParserConfig
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis import get_annotated_text
doc = r"""
<html><body>
<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>
<div id="idd1"></div>
<div id="idd2">target with text</div>
</body></html>
"""
annotation_rules = {"a": ["link"], "#id": ["target"]}
css = CSS_PROFILES['relaxed'].copy()
inscriptis_parser_config = ParserConfig(display_links=True, annotation_rules=annotation_rules, css=css)
html_tree = fromstring(doc)
parser = Inscriptis(html_tree, config=inscriptis_parser_config)
txt = parser.get_text()
ant = parser.get_annotations()
labels = [(a.start, a.end, a.metadata) for a in ant]
for ii, (start, end, label) in enumerate(labels):
    print(f"{ii} {label} {start} {txt[start:end]}")
The output is:
0 link 3 Part 1](#idd1)
1 link 21 Part 2](#idd2)
2 target 36 target with text
In this example, I am looking for the id of the last div element, as well as the id, location and text of the third div element.
(Note also that the text of the link doesn't include the opening [.)
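To make the goal concrete, here is a minimal sketch of the link-to-bookmark pairing I have in mind, done outside of inscriptis with only the standard library's html.parser. The class name AnchorCollector and its attributes are hypothetical names of mine, not part of inscriptis. It recovers the ids, including the one on the empty div, but it does not give me the positions in the converted text, which is what I would need from the annotations.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collect in-page links and elements carrying an id."""

    def __init__(self):
        super().__init__()
        self.links = []    # fragment ids referenced by <a href="#...">, in document order
        self.targets = {}  # id value -> name of the tag that carries it

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag == "a" and href and href.startswith("#"):
            self.links.append(href[1:])
        if attrs.get("id"):
            self.targets[attrs["id"]] = tag


doc = """
<html><body>
<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>
<div id="idd1"></div>
<div id="idd2">target with text</div>
</body></html>
"""

collector = AnchorCollector()
collector.feed(doc)

# Pair each link fragment with the tag holding the matching id (None if missing).
matches = [(frag, collector.targets.get(frag)) for frag in collector.links]
print(matches)  # [('idd1', 'div'), ('idd2', 'div')]
```

This finds the bookmark regardless of which tag carries the id, so no tag handler has to be redefined, but it is a separate pass over the HTML rather than something the annotation rules provide.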