docs/guides/uploading-files.mdx at main · devflowinc/docs

title	description	icon
Uploading Files to Trieve	Learn how to upload your files to Trieve	files

Overview

You must specifically use the `base64url` encoding for the `base64_file` field.

You can upload files to Trieve and have them chunked automatically, using either a vision LLM or Apache Tika. When you upload a file, the resulting chunks are automatically placed in a group so they stay linked to each other. This is done through our upload file route.

Uploading a File to Trieve

Trieve supports various file types (e.g., HTML, DOCX, PDF). The file is uploaded to an S3 bucket associated with your dataset. You can use Apache Tika to convert these files to HTML or use a vision LLM to convert them to Markdown. To understand the difference between the two, refer to the section on Apache Tika vs vision LLM chunking.

Important Parameters

base64_file: The file contents, encoded as base64url. To allow users to pass metadata with their file uploads, we require this specific encoding: convert + to -, / to _, and remove any trailing = padding.
file_name: The name of the file being uploaded, including the extension. This will become the name of the resulting group.
group_tracking_id: This field allows you to assign an arbitrary ID to the group, aiding in coordination with your database system. You can search for this group using this ID.
link, tag_set, and time_stamp: These fields are indexed to enable fast filtering of groups based on these attributes.
target_splits_per_chunk: An optional field that specifies the number of splits you want per chunk. If not specified, the default of 20 is used.
metadata: This field allows you to include any arbitrary metadata in the form of a JSON object with the group.
pdf2md_options: This allows you to use a vision LLM to convert the files to Markdown.
- use_pdf2md_ocr: If true, the file will be converted to Markdown using a vision LLM. You can test pdf2md performance at pdf2md.trieve.ai.

For the best filtering performance, we recommend using the `link`, `tag_set`, and `time_stamp` fields, as there are dedicated indexes for these. The metadata field has an index built for match queries but is not optimized for range queries.
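The `base64url` conversion described for `base64_file` can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper names `to_base64url` and `encode_file` are hypothetical and not part of any Trieve SDK.

```python
import base64
from pathlib import Path


def to_base64url(data: bytes) -> str:
    """Encode bytes as base64url: '+' becomes '-', '/' becomes '_', no '=' padding."""
    # urlsafe_b64encode already swaps '+' and '/' for '-' and '_';
    # the trailing '=' padding still has to be stripped by hand.
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def encode_file(path: str) -> str:
    """Read a file from disk and return its base64url-encoded contents."""
    return to_base64url(Path(path).read_bytes())


if __name__ == "__main__":
    # These two bytes encode to "+/8=" in standard base64.
    print(to_base64url(b"\xfb\xff"))  # -> -_8
```

The string returned by `encode_file("example.pdf")` is what would go in the `base64_file` field of the request payload.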

Example Upload File Request

Whenever you make a request to the Trieve API, you need to include the `TR-Dataset` header with your dataset ID and the `Authorization` header with your API key.

```bash cURL
# Replace <api-key> with your API key and <tr-dataset> with your dataset ID.
# Set "use_pdf2md_ocr" to true if you want a vision LLM to convert the file to Markdown.
curl --request POST \
  --url https://api.trieve.ai/api/file \
  --header 'Authorization: <api-key>' \
  --header 'Content-Type: application/json' \
  --header 'TR-Dataset: <tr-dataset>' \
  --data '{
    "base64_file": "<base64_encoded_file>",
    "file_name": "example.pdf",
    "link": "https://example.com",
    "tag_set": [
      "tag1",
      "tag2"
    ],
    "time_stamp": "2025-02-09T22:15:51",
    "target_splits_per_chunk": 20,
    "metadata": {
      "key1": "value1",
      "key2": "value2"
    },
    "pdf2md_options": {
      "use_pdf2md_ocr": false
    }
  }'
```

import trieve_py_client
from trieve_py_client.models.upload_file_req_payload import UploadFileReqPayload
from trieve_py_client.models.upload_file_response_body import UploadFileResponseBody
from trieve_py_client.rest import ApiException
from pprint import pprint

configuration = trieve_py_client.Configuration(
    host = "https://api.trieve.ai"
)

configuration.api_key['ApiKey'] = "<Your API Key>" # Replace with your API key
configuration.api_key_prefix['ApiKey'] = 'Bearer'

with trieve_py_client.ApiClient(configuration) as api_client:
    api_instance = trieve_py_client.FileApi(api_client)
    tr_dataset = "<Your Dataset ID>" # Replace with your dataset ID

    upload_file_req_payload = trieve_py_client.UploadFileReqPayload(
        base64_file="<base64_encoded_file>", # base64url-encoded file contents
        file_name="example.pdf",
        link="https://example.com",
        tag_set=["tag1", "tag2"],
        time_stamp="2025-02-09T22:15:51",
        target_splits_per_chunk=20,
        metadata={
            "key1": "value1",
            "key2": "value2"
        },
        pdf2md_options={
            "use_pdf2md_ocr": False # Set to True if you want Vision LLM to convert the file to Markdown
        }
    )

    try:
        api_response = api_instance.upload_file_handler(tr_dataset, upload_file_req_payload)
        print("Uploading file response: \n")
        pprint(api_response)
    except Exception as e:
        print(f"Exception when uploading file: {e}\n")

import requests

url = "https://api.trieve.ai/api/file"

headers = {
    "TR-Dataset": "<tr-dataset>",  # Replace with your dataset ID
    "Authorization": "<api-key>",  # Replace with your API key
    "Content-Type": "application/json"
}

payload = {
    "base64_file": "<base64_encoded_file>", # Upload base64 encoded file
    "file_name": "example.pdf",
    "link": "https://example.com",
    "tag_set": ["tag1", "tag2"],
    "time_stamp": "2025-02-09T22:15:51",
    "target_splits_per_chunk": 20,
    "metadata": {
        "key1": "value1",
        "key2": "value2"
    },
    "pdf2md_options": {
        "use_pdf2md_ocr": False  # Set to True if you want vision LLM to convert the file to Markdown
    }
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.text)

Example Response

```json 200
{
  "file_metadata": {
    "created_at": "2025-02-09 22:30:00.000",
    "dataset_id": "********-****-****-****-************",
    "file_name": "example.pdf",
    "id": "********-****-****-****-************",
    "link": "https://example.com",
    "metadata": {
      "key1": "value1",
      "key2": "value2"
    },
    "size": 1000,
    "tag_set": "tag1,tag2",
    "time_stamp": "2025-02-09 22:30:00.000",
    "updated_at": "2025-02-09 22:30:00.000"
  }
}
```

{
  "message": "Bad Request"
}
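Note that `tag_set` comes back as a single comma-separated string in the example response, even though it is sent as a list. The sketch below shows one way to handle both response shapes shown above. It is an illustration only: the helper `parse_upload_response` and the extra `tag_list` key are hypothetical, and it assumes error bodies carry a `message` field as in the example.

```python
import json

# Trimmed copy of the 200 response shown above.
SAMPLE_BODY = """
{
  "file_metadata": {
    "file_name": "example.pdf",
    "link": "https://example.com",
    "metadata": {"key1": "value1", "key2": "value2"},
    "size": 1000,
    "tag_set": "tag1,tag2"
  }
}
"""


def parse_upload_response(status_code: int, body: str) -> dict:
    """Return the file metadata from a successful upload, or raise on failure."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        data = {}
    if status_code != 200:
        # Assumption: error bodies look like {"message": "..."}.
        message = data.get("message", body) if isinstance(data, dict) else body
        raise RuntimeError(f"Upload failed with status {status_code}: {message}")
    meta = dict(data["file_metadata"])
    # Split the comma-separated tag string into a list for convenience.
    tags = meta.get("tag_set") or ""
    meta["tag_list"] = [tag for tag in tags.split(",") if tag]
    return meta


if __name__ == "__main__":
    meta = parse_upload_response(200, SAMPLE_BODY)
    print(meta["file_name"], meta["tag_list"])
```

With the `requests` example above, this could be called as `parse_upload_response(response.status_code, response.text)`.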

File chunking with vision LLM

When uploading a file to Trieve, you can use vision LLMs through our pdf2md service for intelligent document parsing to easily convert PDF content to LLM-ready, structured Markdown. You can test pdf2md performance at pdf2md.trieve.ai.

This better preserves document context, readability, and structure, and is especially useful when working with documents containing complex layouts (tables, lists, code blocks, etc.).

After the uploaded file is converted into structured Markdown, chunks are created based on the semantic structure of the document, which gives each chunk better semantic coherence.

Apache Tika vs Vision LLM chunking

Apache Tika is a more traditional approach that extracts raw text from a document and converts it into HTML. The chunks are subsequently created by splitting at fixed lengths or pre-defined delimiters.

Recommended for:

Simple, unstructured documents (e.g. plain text files)
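To make delimiter-based splitting concrete, here is a rough sketch that splits text at sentence boundaries and then groups a fixed number of splits into each chunk. It is not Trieve's implementation. Treating a "split" as a sentence, the regex used to find boundaries, and the helper names are all assumptions made for illustration. The default of 20 mirrors the documented default for `target_splits_per_chunk`.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '!' or '?' followed by whitespace (a simplification)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_splits(text: str, splits_per_chunk: int = 20) -> list[str]:
    """Group consecutive splits into chunks of at most `splits_per_chunk` splits."""
    if splits_per_chunk < 1:
        raise ValueError("splits_per_chunk must be at least 1")
    splits = split_sentences(text)
    return [
        " ".join(splits[i : i + splits_per_chunk])
        for i in range(0, len(splits), splits_per_chunk)
    ]


if __name__ == "__main__":
    for chunk in chunk_by_splits("One. Two! Three? Four.", splits_per_chunk=2):
        print(chunk)
```

Because the boundaries depend only on punctuation and counts, a table or list can end up split across two chunks, which is the limitation that structure-aware chunking is meant to address.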

Vision LLM chunking creates chunks based on the semantic structure of the document, allowing for sections (e.g. headers, lists, tables, etc) to stay grouped.

Recommended for:

Complex document structures (e.g. technical documentation and reports)
Documents where context must be preserved

Heading Based Chunking

Heading based chunking lets you split content based on the headings and body content of the document.

Each chunk will contain a heading and its corresponding content.
All related content (paragraphs, lists, tables) under the same heading is grouped together.

For example, if you upload the following HTML:

<h1>Introduction</h1>
<p>Hey, this is Trieve!</p>
<h2>Background</h2>
<p>Trieve brings AI search to the modern world.</p>

The response will be:

{
  "chunks": [
    {
      "headings": ["Introduction"],
      "body": "<h1>Introduction</h1><p>Hey, this is Trieve!</p>"
    },
    {
      "headings": ["Background"],
      "body": "<h2>Background</h2><p>Trieve brings AI search to the modern world.</p>"
    }
  ]
}
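The behavior in this example can be approximated with a small regex-based sketch. It is an illustration of the idea, not the code Trieve runs: the function name `chunk_by_headings` is hypothetical, nested heading levels are not tracked, and any content before the first heading is dropped.

```python
import re
from html import unescape

# Matches <h1>...</h1> through <h6>...</h6>, including attributes on the tag.
HEADING_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def chunk_by_headings(html: str) -> list[dict]:
    """Split HTML into chunks, each starting at a heading and running to the next one."""
    matches = list(HEADING_RE.finditer(html))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        # Collapse whitespace between tags so the body is compact.
        body = re.sub(r">\s+<", "><", html[match.start():end]).strip()
        # Strip any inline tags from the heading to get its plain text.
        heading = unescape(re.sub(r"<[^>]+>", "", match.group(2))).strip()
        chunks.append({"headings": [heading], "body": body})
    return chunks


if __name__ == "__main__":
    sample = (
        "<h1>Introduction</h1>\n"
        "<p>Hey, this is Trieve!</p>\n"
        "<h2>Background</h2>\n"
        "<p>Trieve brings AI search to the modern world.</p>"
    )
    for chunk in chunk_by_headings(sample):
        print(chunk)
```

Run on the HTML above, the sketch yields two chunks whose `headings` and `body` values match the example response.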

Grouping with File Upload

When you upload a file to Trieve, all chunks created from the file are grouped together. The `file_name` field sets the name of the resulting group. Once uploaded, the document can be queried using the `file_name` field, allowing you to retrieve and perform operations on all chunks created from the file.
