∞ Jina.AI Scrape Website¶
Documentation¶
- Class name:
LLMJinaScrape
- Category:
SALT/Language Toolkit/Readers
- Output node:
False
The LLMJinaScrape node is designed to fetch and process web content from a specified URL using the Jina AI service. It handles HTTP requests, manages potential errors, and formats the response into a structured document, facilitating the integration of web data into further processing pipelines.
Input types¶
Required¶
url
- The URL to be scraped. This parameter is crucial for defining the target web page from which the content is to be fetched, directly influencing the node's output.
- Comfy dtype:
STRING
- Python dtype:
str
Output types¶
result
- Comfy dtype:
STRING
- The raw text content retrieved from the specified URL, serving as a direct representation of the web page's content.
- Python dtype:
str
- Comfy dtype:
documents
- Comfy dtype:
DOCUMENT
- A structured document containing the scraped web content along with additional metadata, such as the source and Jina AI URLs, enhancing data context and utility.
- Python dtype:
list
- Comfy dtype:
Usage tips¶
- Infra type:
CPU
- Common nodes: unknown
Source code¶
class LLMJinaScrape:
@classmethod
def INPUT_TYPES(cls):
return {
"required": {
"url": ("STRING", {}),
}
}
RETURN_TYPES = ("STRING", "DOCUMENT")
RETURN_NAMES = ("result", "documents")
FUNCTION = "scrape"
CATEGORY = f"{MENU_NAME}/{SUB_MENU_NAME}/Readers"
def scrape(self, url):
jina_url = f"https://r.jina.ai/{url}"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response_text = "Empty response"
try:
response = requests.get(jina_url, headers=headers)
response.raise_for_status()
response_text = response.text
except requests.exceptions.HTTPError as http_err:
logger.error(f"HTTP error occurred: {http_err}")
except requests.exceptions.ConnectionError as conn_err:
logger.error(f"Connection error occurred: {conn_err}")
except requests.exceptions.Timeout as timeout_err:
logger.error(f"Timeout error occurred: {timeout_err}")
except requests.exceptions.RequestException as req_err:
logger.error(f"An error occurred: {req_err}")
except Exception as e:
logger.error(f"An unexpected error occurred: {str(e)}")
document = [Document(text=response_text, extra_info={"jina_url": jina_url, "source_url": url})]
return (response_text, document)