Query
Deep Lake provides powerful query capabilities through its Tensor Query Language (TQL), with special focus on vector similarity search, text search, and operations on multidimensional arrays.
Query Functions
deeplake.query
query(query: str, token: str | None = None) -> DatasetView
Executes TQL queries optimized for ML data filtering and search.
TQL is a SQL-like query language designed for ML datasets, supporting:
- Vector similarity search
- Text semantic search
- Complex data filtering
- Joining across datasets
- Efficient sorting and pagination
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | TQL query string supporting: vector similarity (COSINE_SIMILARITY, EUCLIDEAN_DISTANCE); text search (BM25_SIMILARITY, CONTAINS); filtering (WHERE clauses); sorting (ORDER BY); joins (JOIN across datasets) | required |
| token | str \| None | Optional Activeloop authentication token | None |
Returns:
| Name | Type | Description |
|---|---|---|
| DatasetView | DatasetView | Query results that can be: used directly in ML training; further filtered with additional queries; converted to PyTorch/TensorFlow dataloaders; materialized into a new dataset |
Examples:
Vector similarity search:
# Find similar embeddings
similar = deeplake.query('''
SELECT * FROM "mem://embeddings"
ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
LIMIT 100
''')
# Use results in training
dataloader = similar.pytorch()
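In practice the query vector usually comes from a model rather than being typed inline. The following is a minimal sketch of rendering a Python sequence as the `ARRAY[...]` literal used above and assembling the query string. The helper names `to_tql_array` and `similarity_query` are hypothetical and not part of Deep Lake; only the resulting string is passed to `deeplake.query`.

```python
import math
from typing import Sequence


def to_tql_array(vector: Sequence[float]) -> str:
    """Render numbers as an ARRAY[...] literal (hypothetical helper)."""
    values = [float(v) for v in vector]
    if not values:
        raise ValueError("query vector must not be empty")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("query vector contains NaN or infinity")
    return "ARRAY[" + ", ".join(repr(v) for v in values) + "]"


def similarity_query(path: str, column: str, vector: Sequence[float], limit: int = 100) -> str:
    """Build a cosine-similarity query string shaped like the example above."""
    return (
        f'SELECT * FROM "{path}" '
        f"ORDER BY COSINE_SIMILARITY({column}, {to_tql_array(vector)}) DESC "
        f"LIMIT {int(limit)}"
    )


tql = similarity_query("mem://embeddings", "vector", [0.1, 0.2, 0.3])
# The string can then be executed as in the example above:
# similar = deeplake.query(tql)
```

Building the string in one place keeps the vector formatting consistent and rejects empty or non-finite vectors before the query is sent.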
Text semantic search:
# Search documents using BM25
relevant = deeplake.query('''
SELECT * FROM "mem://documents"
ORDER BY BM25_SIMILARITY(text, 'machine learning') DESC
LIMIT 10
''')
Complex filtering:
# Filter training data
train = deeplake.query('''
SELECT * FROM "mem://dataset"
WHERE "split" = 'train'
AND confidence > 0.9
AND label IN ('cat', 'dog')
''')
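When the label set is only known at run time, the `IN (...)` list has to be generated. A minimal sketch follows; `in_clause` is a hypothetical helper, not a Deep Lake function. Because this page does not describe how TQL escapes quotes inside string literals, the sketch rejects values containing a single quote instead of guessing at an escaping rule.

```python
from typing import Iterable


def in_clause(column: str, values: Iterable[str]) -> str:
    """Build `column IN ('a', 'b')` from Python strings (hypothetical helper)."""
    items = list(values)
    if not items:
        raise ValueError("at least one value is required")
    for item in items:
        # Quote escaping in TQL is not documented here, so refuse rather than guess.
        if "'" in item:
            raise ValueError(f"value contains a single quote: {item!r}")
    quoted = ", ".join(f"'{item}'" for item in items)
    return f"{column} IN ({quoted})"


where = in_clause("label", ["cat", "dog"])
tql = f'SELECT * FROM "mem://dataset" WHERE {where} AND confidence > 0.9'
# train = deeplake.query(tql)
```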
Joins for feature engineering:
deeplake.query_async
Asynchronously executes TQL queries optimized for ML data filtering and search.
Non-blocking version of query(): the call returns a Future immediately, so other work can proceed while a long-running query executes.
It supports the same TQL features as query(), including vector similarity search, text search,
filtering, and joins.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | TQL query string supporting: vector similarity (COSINE_SIMILARITY, EUCLIDEAN_DISTANCE); text search (BM25_SIMILARITY, CONTAINS); filtering (WHERE clauses); sorting (ORDER BY); joins (JOIN across datasets) | required |
| token | str \| None | Optional Activeloop authentication token | None |
Returns:
| Name | Type | Description |
|---|---|---|
| Future | Future | Resolves to a DatasetView that can be: used directly in ML training; further filtered with additional queries; converted to PyTorch/TensorFlow dataloaders; materialized into a new dataset |
Examples:
Basic async query:
# Run query asynchronously
future = deeplake.query_async('''
SELECT * FROM "mem://embeddings"
ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
''')
# Do other work while query runs
prepare_training()
# Get results when needed
results = future.result()
With async/await:
import asyncio

async def search_similar():
    results = await deeplake.query_async('''
        SELECT * FROM "mem://images"
        ORDER BY COSINE_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3]) DESC
        LIMIT 100
    ''')
    return results

async def main():
    similar = await search_similar()

# Start the event loop and run main() to completion
asyncio.run(main())
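When several independent searches are needed, they can be started together and awaited as a group. A minimal sketch follows, assuming the object returned by `deeplake.query_async` can be awaited as in the example above. `run_all` and its `query_async` parameter are hypothetical names; the caller passes the function in, so the helper itself does not depend on Deep Lake.

```python
import asyncio
from typing import Any, Callable, List, Sequence


async def run_all(query_async: Callable[[str], Any], queries: Sequence[str]) -> List[Any]:
    """Start every query, then wait for all of them (hypothetical helper).

    Results come back in the same order as `queries`.
    """
    async def _one(tql: str) -> Any:
        # Wrapping in a coroutine lets asyncio.gather handle any awaitable.
        return await query_async(tql)

    return await asyncio.gather(*(_one(tql) for tql in queries))
```

A call would look like `asyncio.run(run_all(deeplake.query_async, [tql_a, tql_b]))`, where `tql_a` and `tql_b` are query strings.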
Non-blocking check:
Vector Search
Search by vector similarity:
# Cosine similarity search
results = ds.query("""
SELECT *
FROM "s3://bucket/embeddings"
ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, ...]) DESC
LIMIT 100
""")
Text Search
Text search using BM25 or keyword matching:
# Relevance-ranked text search using BM25
results = ds.query("""
SELECT *
FROM "s3://bucket/documents"
ORDER BY BM25_SIMILARITY(text, 'search query') DESC
LIMIT 10
""")
# Keyword search using CONTAINS
results = ds.query("""
SELECT *
FROM "s3://bucket/metadata"
WHERE CONTAINS(keywords, 'specific term')
""")
Array Operations
Operate on multidimensional arrays:
# Select specific array dimensions
results = ds.query("""
SELECT features[:, 0:10]
FROM "s3://bucket/features"
""")
# Filter by array values
results = ds.query("""
SELECT *
FROM "s3://bucket/features"
WHERE features[0] > 0.5
""")
# Aggregate array operations
results = ds.query("""
SELECT AVG(features, axis=0)
FROM "s3://bucket/features"
""")
Joining Datasets
Join data across datasets, including datasets stored in different clouds:
# Join datasets from different storage
results = ds.query("""
SELECT i.image, i.embedding, m.labels, m.metadata
FROM "s3://bucket1/images" AS i
JOIN "s3://bucket2/metadata" AS m
ON i.id = m.image_id
WHERE m.verified = true
""")
# Complex join with filtering
results = ds.query("""
SELECT
i.image,
e.embedding,
l.label
FROM "s3://bucket1/images" AS i
JOIN "gcs://bucket2/embeddings" AS e ON i.id = e.image_id
JOIN "azure://container/labels" AS l ON i.id = l.image_id
WHERE l.confidence > 0.9
ORDER BY COSINE_SIMILARITY(e.embedding, ARRAY[...]) DESC
LIMIT 100
""")
Filtering
Filter data using WHERE clauses:
# Simple filters
results = ds.query("""
SELECT *
FROM "s3://bucket/dataset"
WHERE label = 'cat'
AND confidence > 0.9
""")
# Combine with vector search
results = ds.query("""
SELECT *
FROM "s3://bucket/dataset"
WHERE label IN ('cat', 'dog')
ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[...]) DESC
LIMIT 100
""")
Query Results
Process query results:
# Iterate through results
for item in results:
    image = item["images"]
    label = item["label"]
# Direct column access (recommended for performance)
images = results["images"][:]
labels = results["labels"][:]
Async Queries
Execute queries asynchronously:
# Run query asynchronously
future = ds.query_async("""
SELECT *
FROM "s3://bucket/dataset"
ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[...]) DESC
""")
# Block until results are ready
results = future.result()
# Or check completion without blocking
if future.is_completed():
    results = future.result()
else:
    print("Query still running")
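The completion check can be wrapped in a small polling loop when other work needs to run between checks. A minimal sketch follows, relying only on the `is_completed()` and `result()` methods shown above. `wait_for_query`, `poll_interval`, `timeout`, and `on_wait` are hypothetical names introduced for illustration.

```python
import time
from typing import Any, Callable, Optional


def wait_for_query(
    future: Any,
    poll_interval: float = 0.5,
    timeout: Optional[float] = None,
    on_wait: Optional[Callable[[], None]] = None,
) -> Any:
    """Poll a query future until it completes (hypothetical helper).

    `future` is any object exposing is_completed() and result().
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not future.is_completed():
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("query did not complete in time")
        if on_wait is not None:
            on_wait()  # e.g. update a progress indicator
        time.sleep(poll_interval)
    return future.result()
```

For example, `results = wait_for_query(future, poll_interval=1.0, timeout=60)` gives up after a minute instead of blocking indefinitely.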
Querying Views
Chain queries on views: