Match links to bookmarks #87

@nslpls

First, a great package and thank you for making it available!

I am looking for a way to match a link to its bookmark, i.e. the element whose id the link's href points to. Is that possible in principle? Below is an example of what I want to achieve.

I defined an annotation rule for the id attribute. It returns the location and the text of the bookmark, but not the actual id value, so it is not possible to tell which bookmark has been identified.
Also, when the div element has no text or sub-elements, no annotation is returned at all.

Would I have to redefine the div tag handler (even though I don't know in advance which tags may contain bookmarks)?
Also, it seems that I cannot define any custom attribute handlers beyond the three already defined?

Many thanks!

from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
from inscriptis import ParserConfig
from inscriptis.css_profiles import CSS_PROFILES

doc = r"""
<html><body>

<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>

<div id="idd1"></div>
<div id="idd2">target with text</div>

</body></html>
"""

annotation_rules = {"a": ["link"], "#id": ["target"]}
css = CSS_PROFILES['relaxed'].copy()
inscriptis_parser_config = ParserConfig(display_links=True, annotation_rules=annotation_rules, css=css)

html_tree = fromstring(doc)
parser = Inscriptis(html_tree, config=inscriptis_parser_config)
txt = parser.get_text()
annotations = parser.get_annotations()
labels = [(a.start, a.end, a.metadata) for a in annotations]

# Print index, label, start offset and the annotated text span.
for ii, (start, end, label) in enumerate(labels):
    print(f"{ii} {label} {start} {txt[start:end]}")

The output is:

   0 link              3     Part 1](#idd1)
   1 link             21     Part 2](#idd2)
   2 target           36   target with text

In this example, I am looking for the id of the last div element (idd2), as well as the id, location and text of the third div element (idd1, which is empty and currently produces no annotation).
(Note also that the annotated text of each link doesn't include the opening [.)
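
To make the desired pairing concrete, here is a minimal sketch of the result I am after, written with only the standard library's html.parser and independent of inscriptis. The names BookmarkCollector and match_links_to_bookmarks are hypothetical helpers made up for this illustration, not part of inscriptis or lxml. It only recovers ids and text, not the position of the bookmark in the rendered text, which is the part I would still need from the annotations.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BookmarkCollector(HTMLParser):
    """Hypothetical helper: collect in-page links and elements that carry an id."""

    def __init__(self):
        super().__init__()
        self.links = []     # (fragment, link text) for every <a href="#...">
        self.targets = {}   # id -> text inside the element ("" if it is empty)
        self._stack = []    # open elements as (tag, attrs, text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") is not None:
            # Register the id right away so empty or unclosed elements are kept.
            self.targets[attrs["id"]] = ""
        if tag not in VOID_TAGS:
            self._stack.append((tag, attrs, []))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, _, parts in self._stack:
            parts.append(data)

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _, _ in self._stack):
            return  # stray closing tag, ignore it
        while self._stack:
            open_tag, attrs, parts = self._stack.pop()
            self._record(open_tag, attrs, "".join(parts).strip())
            if open_tag == tag:
                break

    def _record(self, tag, attrs, text):
        if attrs.get("id") is not None:
            self.targets[attrs["id"]] = text
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("#"):
            self.links.append((href[1:], text))


def match_links_to_bookmarks(html):
    """Return (bookmark id, link text, bookmark text or None) for each in-page link."""
    collector = BookmarkCollector()
    collector.feed(html)
    collector.close()
    return [(frag, text, collector.targets.get(frag))
            for frag, text in collector.links]


sample = """
<html><body>
<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>
<div id="idd1"></div>
<div id="idd2">target with text</div>
</body></html>
"""

for bookmark_id, link_text, target_text in match_links_to_bookmarks(sample):
    print(bookmark_id, repr(link_text), repr(target_text))
```

For the example document this pairs "Part 1" with the empty bookmark idd1 and "Part 2" with idd2 ("target with text"). A link whose fragment has no matching id gets None as its third item.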
