> **Note:** This is a Hugging Face dataset. Learn how to load datasets from the Hub in the Hugging Face integration docs.
# FiftyOne Embeddings Dataset
This dataset combines the FiftyOne Q&A and function calling datasets with pre-computed embeddings for fast similarity search.
## Dataset Information
- **Total samples:** 28,118
- **Q&A samples:** 14,069
- **Function samples:** 14,049
- **Embedding model:** `text-embedding-3-large`
- **Embedding dimension:** 3,072
## Schema
- `query`: The original question/query text
- `response`: The unified response content (answer text for Q&A samples, or function-call text for function samples)
- `content_type`: Either `'qa_response'` or `'function_call'`
- `embedding`: Pre-computed embedding vector (3,072 dimensions), computed from the `query` field
- `dataset_type`: Either `'qa'` or `'function'`
- `source_dataset`: Original dataset name
- `embedding_model`: Model used to compute the embeddings
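Because `response` holds either answer text or a function call depending on `content_type`, downstream code typically branches on that field. A minimal sketch of such a dispatcher (the sample record below is illustrative, not drawn from the dataset):

```python
def route_record(record):
    """Dispatch a record on its content_type field from the schema above."""
    if record["content_type"] == "qa_response":
        return ("answer", record["response"])
    elif record["content_type"] == "function_call":
        return ("call", record["response"])
    raise ValueError(f"unexpected content_type: {record['content_type']}")

# Hypothetical record mirroring the schema (fields not in the schema omitted)
sample = {
    "query": "How do I filter detections by confidence?",
    "response": "Use dataset.filter_labels(...)",
    "content_type": "qa_response",
}
```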
## Usage
```python
from datasets import load_dataset
import numpy as np
from scipy.spatial.distance import cosine

# Load the dataset with pre-computed embeddings
dataset = load_dataset("Voxel51/fiftyone-embeddings-combined", split="train")

# Extract embeddings and queries for similarity search
embeddings = np.array([item['embedding'] for item in dataset])
queries = [item['query'] for item in dataset]

def find_similar(query_embedding, top_k=5):
    # Cosine similarity = 1 - cosine distance
    similarities = [1 - cosine(query_embedding, emb) for emb in embeddings]
    # Indices of the top_k most similar items, best first
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    results = []
    for i in top_indices:
        item = dataset[int(i)]  # cast: datasets indexing expects a plain int
        results.append({
            'query': item['query'],
            'response': item['response'],  # Unified response field
            'type': item['content_type'],
            'similarity': similarities[i]
        })
    return results
```
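In practice you would embed the incoming query with `text-embedding-3-large` before calling `find_similar`. As a self-contained sanity check of the cosine-similarity ranking itself, here is the same logic run on tiny synthetic vectors (no network access or API key needed; the 3-d vectors and example queries are stand-ins for the real 3,072-d embeddings):

```python
import numpy as np
from scipy.spatial.distance import cosine

# Synthetic stand-in "index" of three embedding vectors
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
queries = [
    "how do I load a dataset?",
    "how do I export labels?",
    "how do I filter samples?",
]

def find_similar(query_embedding, top_k=2):
    # Same ranking logic as above, returning (query, similarity) pairs
    similarities = [1 - cosine(query_embedding, emb) for emb in embeddings]
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [(queries[i], similarities[i]) for i in top_indices]

# A query vector closest to the first index entry
results = find_similar(np.array([0.9, 0.1, 0.0]))
# → results[0][0] is "how do I load a dataset?"
```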