Column Classes
Deep Lake provides two column classes for different access levels:
| Class | Description |
|---|---|
| Column | Full read-write access to column data |
| ColumnView | Read-only access to column data |
Column Class
deeplake.Column
Bases: ColumnView
Provides read-write access to a column in a dataset. Column extends ColumnView with methods for modifying data, making it suitable for dataset creation and updates in ML workflows.
The Column class allows you to:
- Read data using integer indices, slices, or lists of indices, and write data using integer indices or slices
- Modify data asynchronously for better performance
- Access and modify column metadata
- Handle various data types common in ML: images, embeddings, labels, etc.
Examples:
Update training labels:
# Update single label
ds["labels"][0] = 1
# Update batch of labels
ds["labels"][0:32] = new_labels
# Async update for better performance
future = ds["labels"].set_async(slice(0, 32), new_labels)
future.wait()
Store image embeddings:
# Generate and store embeddings
embeddings = model.encode(images)
ds["embeddings"][0:len(embeddings)] = embeddings
Manage column metadata:
# Store preprocessing parameters
ds["images"].metadata["mean"] = [0.485, 0.456, 0.406]
ds["images"].metadata["std"] = [0.229, 0.224, 0.225]
__getitem__
Retrieve data from the column at the specified index or range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| slice \| list \| tuple | Can be an int (single item index), a slice (range of indices, e.g., 0:10), or a list/tuple (multiple specific indices). | required |
Returns:
| Type | Description |
|---|---|
| Any | The data at the specified index/indices. Type depends on the column's data type. |
Examples:
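A minimal sketch of reading a column in fixed-size batches with the slice form of `__getitem__`. `iter_batches` is a hypothetical helper, not part of Deep Lake. It relies only on `column[start:stop]`, so a plain Python list stands in for `ds["labels"]` and the snippet runs without a dataset.

```python
def iter_batches(column, total, batch_size):
    """Yield consecutive batches by slicing the column.

    `column` is anything that supports column[start:stop], the slice form
    of __getitem__ described above. `total` is the number of rows to read
    and is passed in explicitly rather than queried from the column.
    """
    for start in range(0, total, batch_size):
        stop = min(start + batch_size, total)
        yield column[start:stop]


# A plain list stands in for ds["labels"] (assumption for illustration).
labels = [0, 1, 1, 0, 2, 1, 0]
batches = list(iter_batches(labels, total=len(labels), batch_size=3))
print(batches)
```

The last batch is shorter than `batch_size` when `total` is not a multiple of it, so downstream code should not assume a fixed batch length.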
__setitem__
Set data in the column at the specified index or range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| slice | Can be an int (single item index) or a slice (range of indices, e.g., 0:10). | required |
| value | Any | The data to store. Must match the column's data type. | required |
Examples:
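A minimal sketch of a read-modify-write pass that uses both index forms accepted by `__setitem__`. `remap_labels` is a hypothetical helper, and a plain list stands in for `ds["labels"]`. It assumes the column yields plain hashable label values, which may not hold for every column type.

```python
def remap_labels(column, start, stop, mapping):
    """Rewrite labels in column[start:stop] according to `mapping`.

    Labels that do not appear in `mapping` are written back unchanged.
    """
    current = column[start:stop]
    column[start:stop] = [mapping.get(label, label) for label in current]


# A plain list stands in for ds["labels"] (assumption for illustration).
labels = [0, 1, 2, 1, 0, 2]
remap_labels(labels, 0, 4, {2: 1})  # slice assignment: merge class 2 into class 1
labels[5] = 1                       # int assignment: fix a single row
print(labels)
```

The slice assignment writes back exactly as many values as it read, so the number of rows in the column does not change.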
get_async
Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| slice \| list \| tuple | Can be an int (single item index), a slice (range of indices), or a list/tuple (multiple specific indices). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Future | Future | A Future object that resolves to the requested data. |
Examples:
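A sketch of the fan-out pattern that `get_async` enables: request every batch first, then collect the results in order. `StandInColumn` and `load_batches_async` are hypothetical names. The stand-in mimics only `get_async` with a thread pool and is not how Deep Lake implements it.

```python
from concurrent.futures import ThreadPoolExecutor


class StandInColumn:
    """Hypothetical stand-in that mimics only get_async; not Deep Lake code."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pool = ThreadPoolExecutor(max_workers=4)

    def get_async(self, index):
        # concurrent.futures.Future exposes .result(), matching how the
        # Future returned by get_async is used elsewhere in this page.
        return self._pool.submit(lambda: self._rows[index])


def load_batches_async(column, total, batch_size):
    """Issue all requests up front, then wait for the results in order."""
    futures = [
        column.get_async(slice(start, min(start + batch_size, total)))
        for start in range(0, total, batch_size)
    ]
    return [future.result() for future in futures]


column = StandInColumn(range(10))
batches = load_batches_async(column, total=10, batch_size=4)
print(batches)
```

With a real column the same helper would be called as `load_batches_async(ds["images"], total, batch_size)`. How many requests can usefully be in flight at once depends on the storage backend and is not covered here.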
set_async
Asynchronously set data in the column. Useful for large updates or when modifying multiple items in ML pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| slice | Can be an int (single item index) or a slice (range of indices). | required |
| value | Any | The data to store. Must match the column's data type. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FutureVoid | FutureVoid | A FutureVoid that completes when the update is finished. |
Examples:
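A sketch of writing a long list of values in chunks with `set_async`, then waiting on each returned handle. `StandInColumn`, `StandInVoidFuture`, and `write_batches_async` are hypothetical. The stand-ins mimic only `set_async` and `wait()`. Whether a real column accepts several outstanding writes at once is an assumption to verify before using this pattern.

```python
from concurrent.futures import ThreadPoolExecutor


class StandInVoidFuture:
    """Hypothetical stand-in for FutureVoid: exposes only wait()."""

    def __init__(self, future):
        self._future = future

    def wait(self):
        self._future.result()


class StandInColumn:
    """Hypothetical stand-in that mimics only set_async; not Deep Lake code."""

    def __init__(self, rows):
        self.rows = list(rows)
        # A single worker keeps the stand-in's writes strictly ordered.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _assign(self, index, value):
        self.rows[index] = value

    def set_async(self, index, value):
        return StandInVoidFuture(self._pool.submit(self._assign, index, value))


def write_batches_async(column, values, batch_size):
    """Queue one set_async call per chunk, then wait for all of them."""
    pending = []
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        pending.append(column.set_async(slice(start, start + len(chunk)), chunk))
    for future in pending:
        future.wait()


column = StandInColumn([0] * 6)
write_batches_async(column, [1, 2, 3, 4, 5, 6], batch_size=4)
print(column.rows)
```

Each chunk targets a distinct, non-overlapping slice, so the order in which the writes complete does not affect the final contents.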
ColumnView Class
deeplake.ColumnView
Provides read-only access to a column in a dataset. ColumnView is designed for efficient data access in ML workflows, supporting both synchronous and asynchronous operations.
The ColumnView class allows you to:
- Access column data using integer indices, slices, or lists of indices
- Retrieve data asynchronously for better performance in ML pipelines
- Access column metadata and properties
- Get information about linked data if the column contains references
Examples:
Load image data from a column for training:
# Access a single image
image = ds["images"][0]
# Load a batch of images
batch = ds["images"][0:32]
# Async load for better performance
images_future = ds["images"].get_async(slice(0, 32))
images = images_future.result()
Access embeddings for similarity search:
# Get all embeddings
embeddings = ds["embeddings"][:]
# Get specific embeddings by indices
selected = ds["embeddings"][[1, 5, 10]]
Check column properties:
# Get column name
name = ds["images"].name
# Access metadata
if "mean" in ds["images"].metadata.keys():
    mean = ds["images"].metadata["mean"]
__getitem__
Retrieve data from the column at the specified index or range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| slice \| list \| tuple | Can be an int (single item index), a slice (range of indices, e.g., 0:10), or a list/tuple (multiple specific indices). | required |
Returns:
| Type | Description |
|---|---|
| Any | The data at the specified index/indices. Type depends on the column's data type. |
Examples:
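A sketch of the list form of `__getitem__`, used here to fetch a random sample of rows in one call. `StandInView` and `sample_rows` are hypothetical. The stand-in only mirrors the three index forms listed in the parameter table and is not Deep Lake code.

```python
import random


class StandInView:
    """Hypothetical read-only stand-in accepting int, slice, list, or tuple."""

    def __init__(self, rows):
        self._rows = tuple(rows)

    def __getitem__(self, index):
        if isinstance(index, (list, tuple)):
            return [self._rows[i] for i in index]  # multiple specific indices
        if isinstance(index, slice):
            return list(self._rows[index])         # range of indices
        return self._rows[index]                   # single item


def sample_rows(view, total, k, seed):
    """Pick k distinct row positions and fetch them with one list index."""
    positions = sorted(random.Random(seed).sample(range(total), k))
    return positions, view[positions]


view = StandInView(f"img_{i}" for i in range(100))
positions, rows = sample_rows(view, total=100, k=5, seed=0)
print(positions, rows)
```

Fixing the seed makes the sample reproducible, which helps when the same validation subset needs to be read in several runs.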
get_async
Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| slice \| list \| tuple | Can be an int (single item index), a slice (range of indices), or a list/tuple (multiple specific indices). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Future | Future | A Future object that resolves to the requested data. |
Examples:
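A sketch of overlapping loading with processing: the request for the next batch is issued before the current batch is handed to the caller. `StandInView` and `prefetching_batches` are hypothetical. The stand-in mimics only `get_async` and says nothing about how Deep Lake schedules its I/O.

```python
from concurrent.futures import ThreadPoolExecutor


class StandInView:
    """Hypothetical stand-in that mimics only get_async; not Deep Lake code."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pool = ThreadPoolExecutor(max_workers=1)

    def get_async(self, index):
        return self._pool.submit(lambda: self._rows[index])


def prefetching_batches(view, total, batch_size):
    """Yield batches while the following batch is already being fetched."""
    pending = None
    for start in range(0, total, batch_size):
        request = view.get_async(slice(start, min(start + batch_size, total)))
        if pending is not None:
            yield pending.result()  # hand over batch n while n+1 is loading
        pending = request
    if pending is not None:
        yield pending.result()      # final batch has nothing left to prefetch


view = StandInView(range(7))
sums = [sum(batch) for batch in prefetching_batches(view, total=7, batch_size=3)]
print(sums)
```

Only one request is ever ahead of the consumer, so memory use stays bounded at roughly two batches regardless of the column's size.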
metadata
property
metadata: ReadOnlyMetadata
Access the column's metadata, such as stored statistics, preprocessing parameters, or other information about the column data. On a ColumnView the metadata is read-only.
Returns:
| Name | Type | Description |
|---|---|---|
| ReadOnlyMetadata | ReadOnlyMetadata | A ReadOnlyMetadata object for reading metadata. |
Examples:
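A sketch of consuming preprocessing parameters read from metadata. `normalize_pixel` is a hypothetical helper. It uses only `keys()` and key lookup, the two operations this page shows on metadata, so a plain dict stands in for `ds["images"].metadata`.

```python
def normalize_pixel(pixel, metadata):
    """Apply (value - mean) / std per channel when both keys are present.

    If either key is missing, the pixel is returned unchanged.
    """
    if "mean" not in metadata.keys() or "std" not in metadata.keys():
        return list(pixel)
    mean, std = metadata["mean"], metadata["std"]
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]


# A dict stands in for ds["images"].metadata (assumption for illustration).
metadata = {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225]}
normalized = normalize_pixel([0.485, 0.680, 0.181], metadata)
print([round(value, 3) for value in normalized])
```

Checking for the keys first mirrors the guarded lookup shown in the class-level example and keeps the helper usable on columns that carry no statistics.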
Class Comparison
Column
- Provides read-write access
- Can modify data
- Can update metadata
- Available in Dataset
# Get mutable column
ds = deeplake.open("s3://bucket/dataset")
column = ds["images"]
# Read data
image = column[0]
batch = column[0:100]
# Write data
column[0] = new_image
column[0:100] = new_batch
# Async operations
future = column.set_async(0, new_image)
future.wait()
ColumnView
- Read-only access
- Cannot modify data
- Can read metadata
- Available in ReadOnlyDataset and DatasetView
# Get read-only column
ds = deeplake.open_read_only("s3://bucket/dataset")
column = ds["images"]
# Read data
image = column[0]
batch = column[0:100]
# Async read
future = column.get_async(slice(0, 100))
batch = future.result()
Examples
Data Access
# Direct indexing
single_item = column[0]
batch = column[0:100]
selected = column[[1, 5, 10]]
# Async data access
future = column.get_async(slice(0, 1000))
data = future.result()