∞ Dataset Search (1-Dimensional)

Documentation

  • Class name: LLMParquetDatasetSearcher
  • Category: SALT/Language Toolkit/Tools/Dataset
  • Output node: False

This node searches a dataset loaded from a Parquet, text, JSON, YAML, CSV, or Excel file. It filters entries by search term, excluded terms, and length, can optionally apply relevancy scoring, and returns the matches as text, as a list, and as documents.

Input types

Required

  • file_type
    • Specifies the type of file to be searched, such as parquet, text, json, yaml, csv, or excel, determining the method of data extraction and processing.
    • Comfy dtype: COMBO[STRING]
    • Python dtype: str
  • path_or_url
    • The location of the file to be searched, either as a local file path or a URL, providing access to the dataset for the search operation.
    • Comfy dtype: STRING
    • Python dtype: str

Optional

  • search_term
    • The query or keywords to search for within the dataset, guiding the search and filtering process.
    • Comfy dtype: STRING
    • Python dtype: str
  • exclude_terms
    • Terms to be excluded from the search results, allowing for more refined and relevant outcomes.
    • Comfy dtype: STRING
    • Python dtype: str
  • columns
    • Specific columns within the dataset to search, enabling targeted searches and improving efficiency.
    • Comfy dtype: STRING
    • Python dtype: str
  • case_sensitive
    • Determines whether the search should be case sensitive, affecting the matching process.
    • Comfy dtype: BOOLEAN
    • Python dtype: bool
  • max_results
    • The maximum number of search results to return, controlling the scope of the search output.
    • Comfy dtype: INT
    • Python dtype: int
  • term_relevancy_threshold
    • A threshold for relevancy scoring, filtering results based on their relevance to the search term.
    • Comfy dtype: FLOAT
    • Python dtype: float
  • use_relevancy
    • Indicates whether relevancy scoring should be applied to the search results, enhancing result quality.
    • Comfy dtype: BOOLEAN
    • Python dtype: bool
  • min_length
    • The minimum length of the search results, filtering out results that do not meet this criterion.
    • Comfy dtype: INT
    • Python dtype: int
  • max_length
    • The maximum length of the search results, ensuring that results are within a specified size range.
    • Comfy dtype: INT
    • Python dtype: int
  • max_dynamic_retries
    • The number of times the search should be retried with dynamic adjustments in case of no results, improving the chances of finding relevant data.
    • Comfy dtype: INT
    • Python dtype: int
  • clean_content
    • Specifies whether the content should be cleaned or pre-processed before searching, affecting the accuracy of the results.
    • Comfy dtype: BOOLEAN
    • Python dtype: bool
  • excel_sheet_position
    • For excel files, the position of the sheet to search (minimum 0, default 0), so a single sheet can be targeted in a multi-sheet workbook.
    • Comfy dtype: INT
    • Python dtype: int
  • recache
    • Determines whether the data should be recached rather than reused from an earlier load. The source code passes this flag to every loader except the parquet one.
    • Comfy dtype: BOOLEAN
    • Python dtype: bool
  • condense_documents
    • When enabled (the default), all search results are joined into a single document; when disabled, each result becomes its own document.
    • Comfy dtype: BOOLEAN
    • Python dtype: bool
  • seed
    • A seed value for random operations within the search, for reproducible results. The source code caps the value passed to the search at 99999999.
    • Comfy dtype: INT
    • Python dtype: int
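
The term and length inputs above can be pictured with a small filter over plain strings. This is only a sketch under stated assumptions, not the node's actual search: `split_terms` and `filter_rows` are hypothetical helpers, matching is simple substring containment, and length is counted in characters, none of which the documentation confirms.

```python
from typing import List


def split_terms(raw: str) -> List[str]:
    """Split a comma-separated string into trimmed, non-empty terms."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def filter_rows(rows, search_term="", exclude_terms="", case_sensitive=False,
                min_length=0, max_length=128, max_results=10):
    """Return up to max_results rows that pass the term and length filters.

    Assumption: substring matching and character-based length limits.
    """
    def norm(text):
        return text if case_sensitive else text.lower()

    needle = norm(search_term)
    excluded = [norm(term) for term in split_terms(exclude_terms)]
    kept = []
    for row in rows:
        haystack = norm(row)
        if needle and needle not in haystack:
            continue  # does not contain the search term
        if any(term in haystack for term in excluded):
            continue  # contains an excluded term
        if not (min_length <= len(row) <= max_length):
            continue  # outside the allowed length range
        kept.append(row)
        if len(kept) >= max_results:
            break
    return kept


rows = ["A red fox", "a blue fox", "Red car, parked", "fox"]
print(filter_rows(rows, search_term="fox", exclude_terms="blue, car", min_length=4))
# prints ['A red fox']
```

With `case_sensitive=True`, a search for `RED` in the same rows would return nothing, since no row contains that exact casing.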

Output types

  • results
    • The matching entries as text. The node marks this output as a list, so it emits one string per result.
    • Comfy dtype: STRING
    • Python dtype: str
  • results_list
    • The same matching entries as a single Python list.
    • Comfy dtype: LIST
    • Python dtype: list
  • documents
    • Document objects built from the results: one combined document when condense_documents is enabled, otherwise one per result. The extra_info metadata is left empty.
    • Comfy dtype: DOCUMENT
    • Python dtype: list
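
The relationship between the three outputs can be sketched as follows. The `Document` dataclass here is a hypothetical stand-in for the toolkit's own document type, and `build_outputs` is an illustrative helper; the shape of each result row (a dict whose first value is the matched text) is taken from the source code further down.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for the toolkit's Document type."""
    text: str
    extra_info: Dict[str, Any] = field(default_factory=dict)


def build_outputs(results: List[Dict[str, Any]], condense_documents: bool = True):
    """Build (results, results_list, documents) from search result rows."""
    # Take the first value of every result row.
    results_list = [next(iter(row.values())) for row in results]
    if condense_documents:
        # One document holding every result, separated by blank lines.
        text = "Prompts:\n\n" + "".join(str(item) + "\n\n" for item in results_list)
        documents = [Document(text=text)]
    else:
        # One document per result.
        documents = [Document(text=str(item)) for item in results_list]
    return results_list, results_list, documents


rows = [{"prompt": "a red fox"}, {"prompt": "a blue fox"}]
_, _, condensed = build_outputs(rows, condense_documents=True)
_, _, separate = build_outputs(rows, condense_documents=False)
print(len(condensed), len(separate))  # prints 1 2
```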

Usage tips

  • Infra type: CPU
  • Common nodes: unknown
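
As a rough illustration of how the inputs fit together, the entry below uses the API-format workflow layout of ComfyUI (a node id mapped to `class_type` and `inputs`). It assumes the node is registered under its class name, `LLMParquetDatasetSearcher`; the node id `"1"` and the path `data/prompts.parquet` are hypothetical.

```python
import json

# Hypothetical node id and file path; the input names come from the node's inputs.
workflow = {
    "1": {
        "class_type": "LLMParquetDatasetSearcher",  # assumed registration name
        "inputs": {
            "file_type": "parquet",
            "path_or_url": "data/prompts.parquet",
            "search_term": "fox",
            "exclude_terms": "blue, car",
            "columns": "*",
            "case_sensitive": False,
            "max_results": 10,
            "use_relevancy": False,
            "condense_documents": True,
            "seed": 0,
        },
    }
}

print(json.dumps(workflow, indent=2))
```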

Source code

class LLMParquetDatasetSearcher:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "file_type": (["parquet", "text", "json", "yaml", "csv", "excel"],),
                "path_or_url": ("STRING", {"placeholder": "Path to file or URL"}),
            },
            "optional": {
                "search_term": ("STRING", {"placeholder": "Enter search term"}),
                "exclude_terms": ("STRING", {"placeholder": "Terms to exclude, comma-separated"}),
                "columns": ("STRING", {"default": "*"}),
                "case_sensitive": ("BOOLEAN", {"default": False}),
                "max_results": ("INT", {"default": 10, "min": 1}),
                "term_relevancy_threshold": ("FLOAT", {"min": 0.0, "max": 1.0, "default": 0.25, "step": 0.01}),
                "use_relevancy": ("BOOLEAN", {"default": False}),
                #"num_threads": ("INT", {"default": 2}),
                "min_length": ("INT", {"min": 0, "max": 1023, "default": 0}),
                "max_length": ("INT", {"min": 3, "max": 1024, "default": 128}),
                "max_dynamic_retries": ("INT", {"default": 3}),
                "clean_content": ("BOOLEAN", {"default": False}),
                "excel_sheet_position": ("INT", {"min": 0, "default": 0}),
                "recache": ("BOOLEAN", {"default": False}),
                "condense_documents": ("BOOLEAN", {"default": True}),
                "seed": ("INT", {"default": 0, "min": 0, "max": 0xffffffffffffffff}),
            }
        }

    RETURN_TYPES = ("STRING", "LIST", "DOCUMENT")
    RETURN_NAMES = ("results", "results_list", "documents")
    OUTPUT_IS_LIST = (True, False, False)

    FUNCTION = "search_dataset"
    CATEGORY = f"{MENU_NAME}/{SUB_MENU_NAME}/Tools/Dataset"

    def search_dataset(self, path_or_url, file_type, search_term="", exclude_terms="", columns="*", case_sensitive=False, max_results=10,
                       term_relevancy_threshold=None, use_relevancy=False, num_threads=2, min_length=0, max_length=-1, max_dynamic_retries=0,
                       clean_content=False, seed=0, excel_sheet_position=0, condense_documents=True, recache=False):

        # Validate path or download file and return path
        path = resolve_path(path_or_url)

        reader = ParquetReader1D()
        if file_type == "parquet":
            reader.from_parquet(path)
        elif file_type == "text":
            reader.from_text(path, recache=recache)
        elif file_type == "json":
            reader.from_json(path, recache=recache)
        elif file_type == "yaml":
            reader.from_yaml(path, recache=recache)
        elif file_type == "csv":
            reader.from_csv(path, recache=recache)
        elif file_type == "excel":
            reader.from_excel(path, sheet_name=excel_sheet_position, recache=recache)

        results = reader.search(
            search_term=search_term,
            exclude_terms=exclude_terms,
            columns=[col.strip() for col in columns.split(',') if col.strip()] if columns else ["*"],
            max_results=max_results,
            case_sensitive=case_sensitive,
            term_relevancy_score=term_relevancy_threshold,
            num_threads=num_threads,
            min_length=min_length,
            max_length=max_length,
            max_dynamic_retries=max_dynamic_retries,
            parse_content=clean_content,
            seed=min(seed, 99999999),
            use_relevancy=use_relevancy
        )

        from pprint import pprint
        pprint(results, indent=4)

        # Collect the first column value of each result row.
        results_list = []
        results_text = "Prompts:\n\n"
        documents = []
        for result in results:
            value = list(result.values())[0]
            results_list.append(value)
            if not condense_documents:
                # Emit one Document per result.
                documents.append(Document(text=value, extra_info={}))
            else:
                # Accumulate every result into one text body.
                results_text += str(value) + "\n\n"
        if condense_documents:
            documents = [Document(text=results_text, extra_info={})]

        return (results_list, results_list, documents,)
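
The loader selection in `search_dataset` falls through silently when `file_type` is not one of the six known values. One possible variation, shown here only as a sketch, raises an error in that case. `FakeReader` is a hypothetical stand-in that records calls, since `ParquetReader1D` is not available outside the toolkit, and `load_into` is an illustrative helper rather than part of the node.

```python
class FakeReader:
    """Hypothetical stand-in for ParquetReader1D that records loader calls."""

    def __init__(self):
        self.calls = []

    def _record(self, name, path, **kwargs):
        self.calls.append((name, path, kwargs))

    def from_parquet(self, path):
        self._record("from_parquet", path)

    def from_csv(self, path, recache=False):
        self._record("from_csv", path, recache=recache)

    def from_excel(self, path, sheet_name=0, recache=False):
        self._record("from_excel", path, sheet_name=sheet_name, recache=recache)


def load_into(reader, file_type, path, recache=False, excel_sheet_position=0):
    """Call the loader matching file_type, rejecting unknown types."""
    if file_type == "parquet":
        # As in the original code, the parquet loader takes no recache flag.
        reader.from_parquet(path)
    elif file_type == "excel":
        reader.from_excel(path, sheet_name=excel_sheet_position, recache=recache)
    elif file_type in ("text", "json", "yaml", "csv"):
        # These loaders share one call shape: from_<type>(path, recache=...).
        getattr(reader, f"from_{file_type}")(path, recache=recache)
    else:
        raise ValueError(f"Unsupported file_type: {file_type!r}")


reader = FakeReader()
load_into(reader, "csv", "data.csv", recache=True)
load_into(reader, "excel", "book.xlsx", excel_sheet_position=1)
print(reader.calls)
```

Raising on an unknown type surfaces a bad `file_type` at load time instead of searching an empty reader.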