∞ HTML Tag¶
Documentation¶
- Class name:
LLMHTMLTagReader
- Category:
SALT/Language Toolkit/Readers
- Output node:
False
The LLMHTMLTagReader node reads an HTML file and converts the elements that match a chosen tag into llama_index documents. It wraps the llama_index HTMLTagReader, which parses the file with BeautifulSoup. Optional parameters select the tag, skip elements that have no id attribute, and pass extra information to the reader.
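As a rough illustration of the tag-focused extraction described above, the sketch below collects the text of every matching element and optionally drops those without an id. It deliberately uses Python's standard html.parser instead of BeautifulSoup so that it runs without third-party packages. TagCollector, extract_tags, and the shape of the result dicts are hypothetical names chosen for this example, not the node's or llama_index's API, and the real reader may differ in details such as whitespace handling.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: gather the text of every element with a given tag."""

    def __init__(self, tag="section", ignore_no_id=False):
        super().__init__()
        self.tag = tag
        self.ignore_no_id = ignore_no_id
        self.results = []  # finished elements: {"tag", "tag_id", "text"}
        self._open = []    # stack of matching elements that are still open

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._open.append({"tag": tag, "tag_id": dict(attrs).get("id"), "text": []})

    def handle_data(self, data):
        # Text belongs to every matching element that is currently open.
        for element in self._open:
            element["text"].append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._open:
            element = self._open.pop()
            if self.ignore_no_id and not element["tag_id"]:
                return  # mirror the ignore_no_id option: skip elements without an id
            # Join the text chunks and collapse runs of whitespace.
            element["text"] = " ".join(" ".join(element["text"]).split())
            self.results.append(element)


def extract_tags(html, tag="section", ignore_no_id=False):
    collector = TagCollector(tag, ignore_no_id)
    collector.feed(html)
    collector.close()
    return collector.results


sample = """
<section id="intro"><h1>Intro</h1><p>Hello there.</p></section>
<section><p>No id here.</p></section>
<div id="other">Ignored tag</div>
"""
all_sections = extract_tags(sample)
with_ids = extract_tags(sample, ignore_no_id=True)
```

With the default settings both section elements are collected; with ignore_no_id enabled only the one that carries an id remains, and the div is never considered because it does not match the tag.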
Input types¶
Required¶
path
- The path of the HTML file to read. The node raises FileNotFoundError if no file exists at this path.
- Comfy dtype:
STRING
- Python dtype:
str
Optional¶
tag
- The HTML tag whose elements are extracted from the file. Defaults to section.
- Comfy dtype:
STRING
- Python dtype:
str
ignore_no_id
- When True, elements that have no id attribute are skipped. Defaults to False.
- Comfy dtype:
COMBO[BOOLEAN]
- Python dtype:
bool
extra_info
- A JSON-formatted string (default {}) carrying additional information. It is parsed with read_extra_info and passed to the reader's load_data call alongside the file path.
- Comfy dtype:
STRING
- Python dtype:
str
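The extra_info input arrives as text but is handed to the reader in parsed form. The sketch below shows one plausible way such a string could be turned into a dict and layered onto per-document metadata. parse_extra_info and merge_metadata are hypothetical stand-ins: the actual read_extra_info helper is not shown in this page, and the assumption that the values end up in document metadata follows the usual llama_index reader convention rather than anything stated here.

```python
import json


def parse_extra_info(raw="{}"):
    """Hypothetical stand-in for read_extra_info: JSON string -> dict."""
    raw = raw.strip() or "{}"  # treat an empty field like the default "{}"
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("extra_info must be a JSON object")
    return parsed


def merge_metadata(base, extra_info):
    """Return a new dict with the caller-supplied keys layered over the base keys."""
    merged = dict(base)
    merged.update(extra_info)
    return merged


extra = parse_extra_info('{"source": "handbook", "version": 2}')
metadata = merge_metadata({"tag": "section", "tag_id": "intro"}, extra)
```

In the node's text field this corresponds to entering a JSON object such as {"source": "handbook", "version": 2}; a value that is not a JSON object is rejected in this sketch.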
Output types¶
documents
- The documents produced from the parsed HTML content, returned as a one-element tuple that wraps the result of load_data.
- Comfy dtype:
DOCUMENT
- Python dtype:
tuple
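Because the Python dtype is a tuple, a caller unpacks the single output slot before working with the documents. The sketch below imitates that shape with a stub. FakeDocument and execute_stub are hypothetical names, and the assumption that the wrapped value is a list of documents is based on how llama_index readers typically return data, not on this page.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a llama_index Document."""
    text: str
    metadata: dict = field(default_factory=dict)


def execute_stub():
    # Same return shape as the node: a one-element tuple wrapping the data.
    data = [FakeDocument("Intro text", {"tag": "section", "tag_id": "intro"})]
    return (data,)


# Unpack the single "documents" output slot.
(documents,) = execute_stub()
first_text = documents[0].text
```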
Usage tips¶
- Infra type:
CPU
- Common nodes: unknown
Source code¶
class LLMHTMLTagReader(HTMLTagReader):
    """
    @NOTE: Reads HTML tags into a llama_index Document
    @Source: https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/html/base.py
    @Documentation: https://docs.llamaindex.ai/en/latest/api_reference/readers/file/#llama_index.readers.file.HTMLTagReader
    @Imports: from bs4 import BeautifulSoup
    """
    def __init__(self):
        super().__init__()

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "path": ("STRING", {"default": ""}),
            },
            "optional": {
                "tag": ("STRING", {"default":"section"}),
                "ignore_no_id": ([False, True],),
                "extra_info": ("STRING", {"multiline": True, "dynamicPrompts": False, "default": "{}"}),
            }
        }

    RETURN_TYPES = ("DOCUMENT", )
    RETURN_NAMES = ("documents",)

    FUNCTION = "execute"
    CATEGORY = f"{MENU_NAME}/{SUB_MENU_NAME}/Readers"

    def execute(self, path:str, tag:str="section", ignore_no_id:bool=False, extra_info:str="{}"):
        get_full_path(1, path)
        if not os.path.exists(path):
            raise FileNotFoundError(f"No file available at: {path}")
        path = Path(path)

        self._tag = tag
        self._ignore_no_id = ignore_no_id

        extra_info = read_extra_info(extra_info)
        data = self.load_data(path, extra_info)
        return (data, )
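The path handling at the start of execute can be tried in isolation. The sketch below reproduces only the existence check and the conversion to a Path object against a temporary file. resolve_html_path is a hypothetical name, and the get_full_path call from the listing is left out because its behavior is not shown in this page.

```python
import os
import tempfile
from pathlib import Path


def resolve_html_path(path):
    """Hypothetical helper mirroring the check in execute: fail early, then wrap in Path."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"No file available at: {path}")
    return Path(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    # An existing file resolves to a Path object.
    html_file = os.path.join(tmp_dir, "page.html")
    with open(html_file, "w", encoding="utf-8") as handle:
        handle.write("<section id='a'>Hi</section>")
    resolved = resolve_html_path(html_file)

    # A missing file raises the same error message as the node.
    missing = os.path.join(tmp_dir, "missing.html")
    try:
        resolve_html_path(missing)
        missing_error = None
    except FileNotFoundError as error:
        missing_error = str(error)
```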