TQL Syntax
Overview
Deep Lake offers a performant SQL-based query engine called "TQL" (Tensor Query Language) optimized for machine learning and AI workloads. TQL combines familiar SQL syntax with powerful tensor operations, enabling efficient querying of embeddings, images, and other multi-modal data.
Basic Usage
Dataset Queries
TQL can be used directly on a dataset or across multiple datasets:
# Query on a single dataset (no FROM needed)
ds = deeplake.open("al://org_name/dataset_name")
result = ds.query("SELECT * WHERE id > 10")
# Query across datasets (requires FROM)
result = deeplake.query('SELECT * FROM "al://my_org/dataset_name" WHERE id > 10')
Query Syntax
String Values
String literals must use single quotes:
Special Characters
Column or dataset names with special characters need double quotes:
SELECT * WHERE contains("column-name", 'text_value')
SELECT * FROM "al://my_org/dataset" WHERE id > 10
Tip
When writing queries in Python, remember to properly escape quotes:
Vector Operations
Similarity Search
TQL provides multiple methods for vector similarity search:
-- Cosine similarity (higher is more similar)
SELECT *
ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, ...]) DESC
LIMIT 100
-- L2 norm/Euclidean distance (lower is more similar)
SELECT *
ORDER BY L2_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100
-- L1 norm/Manhattan distance
SELECT *
ORDER BY L1_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100
-- L∞ norm/Chebyshev distance
SELECT *
ORDER BY LINF_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100
Text Search
Semantic Search with BM25
-- Find semantically similar text
SELECT *
ORDER BY BM25_SIMILARITY(text_column, 'search query text') DESC
LIMIT 10
Keyword Search
Advanced Features
Cross-Cloud Dataset Joins
TQL enables joining datasets across different cloud storage providers:
-- Join datasets from different storage providers
SELECT
i.image,
i.embedding,
m.labels,
m.metadata
FROM "s3://bucket1/images" AS i
JOIN "gcs://bucket2/metadata" AS m
ON i.id = m.image_id
WHERE m.verified = true
ORDER BY COSINE_SIMILARITY(i.embedding, ARRAY[...]) DESC
Virtual Columns
Create computed columns on the fly:
-- Compute similarity scores
SELECT *,
COSINE_SIMILARITY(embedding, ARRAY[...]) as similarity_score
FROM dataset
ORDER BY similarity_score DESC
-- Complex computations
SELECT *,
column_1 + column_3 as sum,
any(boxes[:,0]) < 0 as box_beyond_image
WHERE label = 'person'
Logical Operations
-- Combining conditions
SELECT *
WHERE (contains(text, 'machine learning')
AND confidence > 0.9)
OR label IN ('cat', 'dog')
-- Array operations
SELECT *
WHERE any(logical_and(
bounding_boxes[:,3] > 0.5,
confidence > 0.8
))
Data Sampling
-- Weighted random sampling
SELECT *
SAMPLE BY MAX_WEIGHT(
high_confidence: 0.7,
medium_confidence: 0.2,
low_confidence: 0.1
) LIMIT 1000
-- Sampling with replacement
SELECT *
SAMPLE BY MAX_WEIGHT(
positive_samples: 0.5,
negative_samples: 0.5
) replace True LIMIT 2000
Grouping and Sequences
-- Group frames into videos
SELECT *
GROUP BY video_id, camera_id
-- Split videos into frames
SELECT *
UNGROUP BY split
## Built-in Functions
### Array Operations
- `SHAPE(array)`: Returns array dimensions
```sql
SELECT * WHERE SHAPE(embedding)[0] = 768
```
- `DATA(column, index)`: Access specific array elements
```sql
SELECT * ORDER BY L2_NORM(embedding - data(embedding, 10))
```
### Row Information
- `ROW_NUMBER()`: Returns zero-based row offset
```sql
SELECT *, ROW_NUMBER() WHERE ROW_NUMBER() < 100
```
### Array Logic
- `ANY()`, `ALL()`: Array-wise logical operations
```sql
SELECT * WHERE any(confidence > 0.9)
SELECT * WHERE all(scores > 0.5)
```
- `ALL_STRICT()`: Stricter version of ALL
```sql
-- Returns false for empty arrays
SELECT * WHERE all_strict(values > 0)
```
## Custom Functions
TQL supports registering custom Python functions:
```python
# Define and register custom function
def next_number(a):
return a + 1
deeplake.tql.register_function(next_number)
# Use in query
results = ds.query("SELECT * WHERE next_number(column_name) > 10")
Custom functions must: - Accept numpy arrays as input - Return numpy arrays as output - Be registered before use in queries