title | description | icon |
---|---|---|
Uploading Files to Trieve |
Learn how to upload your files to Trieve |
files |
We provide the ability for users to upload their files to Trieve and use our automatic large language vision model or Apache Tika chunking. When uploading a file to Trieve, we automatically group the chunks together to link them. This is done through our upload file route.
Trieve supports various file types (e.g., HTML, DOCX, PDF). The file is uploaded to a S3 bucket associated with your dataset. As a user, you can use Apache Tika to convert these files to HTML or use a vision LLM to convert the files to markdown. To understand the difference between the two, refer to the section on Apache Tika vs vision model chunking.
base64_file
: To allow users to pass metadata with their file uploads, we require you to specifically use thebase64url
encoding. Convert+
to-
,/
to_
, and remove the ending=
if present.file_name
: The name of the file being uploaded, including the extension. This will become the name of the resulting group.group_tracking_id
: This field allows you to assign an arbitrary ID to the group, aiding in coordination with your database system. You can search for this group using this ID.link
,tag_set
, andtime_stamp
: These fields are indexed to enable fast filtering of groups based on these attributes.target_splits_per_chunk
: This is an optional field to specify number of splits you want per chunk. If not specified, the default 20 is used.metadata
: This field allows you to include any arbitrary metadata in the form of a JSON object with the group.pdf2md_options
: This allows you to use vision LLM to convert the files to markdown.use_pdf2md_ocr
: If true, the file will be converted to markdown using vision LLM. You can testpdf2md
performance at pdf2md.trieve.ai.
import trieve_py_client
from trieve_py_client.models.upload_file_req_payload import UploadFileReqPayload
from trieve_py_client.models.upload_file_response_body import UploadFileResponseBody
from trieve_py_client.rest import ApiException
from pprint import pprint
configuration = trieve_py_client.Configuration(
host = "https://api.trieve.ai"
)
configuration.api_key['ApiKey'] = "<Your API Key>" # Replace with your API key
configuration.api_key_prefix['ApiKey'] = 'Bearer'
with trieve_py_client.ApiClient(configuration) as api_client:
api_instance = trieve_py_client.FileApi(api_client)
tr_dataset = <Your Dataset ID> # Replace with your dataset ID
upload_file_req_payload = trieve_py_client.UploadFileReqPayload(
base64_file=<base64_encoded_file>, # Upload base64 encoded file
file_name="example.pdf",
link="https://example.com",
tag_set=["tag1", "tag2"],
time_stamp="2025-02-09T22:15:51",
target_splits_per_chunk=20,
metadata={
"key1": "value1",
"key2": "value2"
},
pdf2md_options={
"use_pdf2md_ocr": False # Set to True if you want Vision LLM to convert the file to Markdown
}
)
try:
api_response = api_instance.upload_file_handler(tr_dataset, upload_file_req_payload)
print("Uploading file response: \n")
pprint(api_response)
except Exception as e:
print(f"Exception when uploading file: {e}\n")
import requests
url = "https://api.trieve.ai/api/file"
headers = {
"TR-Dataset": "<tr-dataset>", # Replace with your dataset ID
"Authorization": "<api-key>", # Replace with your API key
"Content-Type": "application/json"
}
payload = {
"base64_file": "<base64_encoded_file>", # Upload base64 encoded file
"file_name": "example.pdf",
"link": "https://example.com",
"tag_set": ["tag1", "tag2"],
"time_stamp": "2025-02-09T22:15:51",
"target_splits_per_chunk": 20,
"metadata": {
"key1": "value1",
"key2": "value2"
},
"pdf2md_options": {
"use_pdf2md_ocr": False # Set to True if you want vision LLM to convert the file to Markdown
}
}
response = requests.request("POST", url, json=payload, headers=headers)
print(response.text)
{
"message": "Bad Request"
}
When uploading a file to Trieve, you can use vision LLMs through our pdf2md service for intelligent document parsing to easily convert PDF content to LLM-ready, structured Markdown. You can test pdf2md
performance at pdf2md.trieve.ai.
This allows for better preserving document context, readability, and structure and is especially useful when working with documents containing complex layouts (tables, lists, code blocks, etc).
After the uploaded file is converted into structured Markdown, chunks are created based on the semantic structure of the document allowing for better semantic coherence.
Apache Tika is a more traditional approach that extracts raw text from a document and converts it into HTML. The chunks are subsequently created by splitting at fixed lengths or pre-defined delimeters.
Recommended for:
- Simple, unstructured documents (e.g. plain text files)
Vision LLM chunking creates chunks based on the semantic structure of the document, allowing for sections (e.g. headers, lists, tables, etc) to stay grouped.
Recommended for:
- Complex document structures (e.g. technical documentation and reports)
- Documents where context must be preserved
Heading based chunking allows you to use split contents based on the headings and body content of the document.
- Each chunk will contain a heading and its corresponding content.
- All related content (paragraphs, lists, tables) under the same heading are grouped together.
For example, if you upload the following HTML:
<h1>Introduction</h1>
<p>Hey, this is Trieve!</p>
<h2>Background</h2>
<p>Trieve brings AI search to the modern world.</p>
The response will be:
{
"chunks": [
{
"headings": ["Introduction"],
"body": "<h1>Introduction</h1><p>Hey, this is Trieve!</p>"
},
{
"headings": ["Background"],
"body": "<h2>Background</h2><p>Trieve brings AI search to the modern world.</p>"
}
]
}
When uploading a file to Trieve, all chunks created from the file will be grouped together.
The file_name
field is used to specify the name of the resulting groups. Once uploaded, documents can be queried using the file_name
field, allowing you to retrieve and perform operations on all chunks created from the file.