Schemas

Deep Lake provides pre-built schema templates for common data structures.

Schema Templates

deeplake.schemas.SchemaTemplate

A template that can be used for creating a new dataset with deeplake.create.

This class allows you to define and customize the schema for your dataset.

Parameters:

    schema (dict[str, DataType | str | Type], required):
        A dictionary where the key is the column name and the value is the data type.

Methods:

    add(name: str, dtype: DataType | str | Type) -> SchemaTemplate
        Adds a new column to the template.

    remove(name: str) -> SchemaTemplate
        Removes a column from the template.

    rename(old_name: str, new_name: str) -> SchemaTemplate
        Renames a column in the template.

Examples:

Create a new schema template, modify it, and create a dataset with the schema:

import deeplake
from deeplake import types

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.add("author", types.Text())
schema.remove("text")
schema.rename("embedding", "text_embedding")
ds = deeplake.create("tmp://", schema=schema)
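
Because add, remove, and rename each return the template itself, the same edits can also be written as a single chain; an equivalent sketch using the types imported above:

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
}).add("author", types.Text()).remove("text").rename("embedding", "text_embedding")
ds = deeplake.create("tmp://", schema=schema)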

__init__

__init__(schema: dict[str, DataType | str | Type]) -> None

Constructs a new SchemaTemplate from the given dict.

add

add(
    name: str, dtype: DataType | str | Type
) -> SchemaTemplate

Adds a new column to the template.

Parameters:

    name (str, required):
        The column name.
    dtype (DataType | str | Type, required):
        The column data type.

Returns:

    SchemaTemplate: The updated schema template.

Examples:

Add a new column to the schema:

schema = deeplake.schemas.SchemaTemplate({})
schema.add("author", types.Text())

remove

remove(name: str) -> SchemaTemplate

Removes a column from the template.

Parameters:

    name (str, required):
        The column name.

Returns:

    SchemaTemplate: The updated schema template.

Examples:

Remove a column from the schema:

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.remove("text")

rename

rename(old_name: str, new_name: str) -> SchemaTemplate

Renames a column in the template.

Parameters:

    old_name (str, required):
        The existing column name.
    new_name (str, required):
        The new column name.

Returns:

    SchemaTemplate: The updated schema template.

Examples:

Rename a column in the schema:

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.rename("embedding", "text_embedding")

Text Embeddings Schema

deeplake.schemas.TextEmbeddings

TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> SchemaTemplate

A schema for storing embedded text from documents.

This schema includes the following fields:

    - id (uint64): Unique identifier for each entry.
    - chunk_index (uint16): Position of the text chunk within the document.
    - document_id (uint64): Unique identifier for the document the embedding came from.
    - date_created (uint64): Timestamp when the document was read.
    - text_chunk (text): The text of the chunk.
    - embedding (dtype=float32, size=embedding_size): The embedding of the text.

Parameters:

    embedding_size (int, required):
        Size of the embeddings.
    quantize (bool, optional):
        If True, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. Defaults to False.

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768))

Customize the schema before creating the dataset:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768)
    .rename("embedding", "text_embed")
    .add("author", types.Text()))

Add a new field to the schema:

schema = deeplake.schemas.TextEmbeddings(768)
schema.add("language", types.Text())
ds = deeplake.create("tmp://", schema=schema)

# Create dataset with text embeddings schema
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.TextEmbeddings(768))

# Customize before creation
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.TextEmbeddings(768)
        .rename("embedding", "text_embedding")
        .add("source", deeplake.types.Text()))

# Add field to existing schema
schema = deeplake.schemas.TextEmbeddings(768)
schema.add("language", deeplake.types.Text())
ds = deeplake.create("s3://bucket/dataset", schema=schema)
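
The columns created by this template match the field list above, so new rows can be appended as a dictionary of column values. A minimal sketch, assuming ds.append accepts such a dictionary and using placeholder values (the zero vector stands in for a real 768-dimensional embedding):

import time
import numpy as np
import deeplake

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768))
ds.append({
    "id": [1],
    "chunk_index": [0],
    "document_id": [42],
    "date_created": [int(time.time())],
    "text_chunk": ["First chunk of the source document."],
    "embedding": [np.zeros(768, dtype=np.float32)],  # placeholder embedding vector
})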

COCO Images Schema

deeplake.schemas.COCOImages

COCOImages(
    embedding_size: int,
    quantize: bool = False,
    objects: bool = True,
    keypoints: bool = False,
    stuffs: bool = False,
) -> SchemaTemplate

A schema for storing COCO-based image data.

This schema includes the following fields:

    - id (uint64): Unique identifier for each entry.
    - image (jpg image): The image data.
    - url (text): URL of the image.
    - year (uint8): Year the image was captured.
    - version (text): Version of the dataset.
    - description (text): Description of the image.
    - contributor (text): Contributor of the image.
    - date_created (uint64): Timestamp when the image was created.
    - date_captured (uint64): Timestamp when the image was captured.
    - embedding (embedding): Embedding of the image.
    - license (text): License information.
    - is_crowd (bool): Whether the image contains a crowd.

If objects is True, the following fields are added:

    - objects_bbox (bounding box): Bounding boxes for objects.
    - objects_classes (segment mask): Segment masks for objects.

If keypoints is True, the following fields are added:

    - keypoints_bbox (bounding box): Bounding boxes for keypoints.
    - keypoints_classes (segment mask): Segment masks for keypoints.
    - keypoints (2-dimensional array of uint32): Keypoints data.
    - keypoints_skeleton (2-dimensional array of uint16): Skeleton data for keypoints.

If stuffs is True, the following fields are added:

    - stuffs_bbox (bounding box): Bounding boxes for stuffs.
    - stuffs_classes (segment mask): Segment masks for stuffs.

Parameters:

    embedding_size (int, required):
        Size of the embeddings.
    quantize (bool, optional):
        If True, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. Defaults to False.
    objects (bool, optional):
        Whether to include object-related fields. Defaults to True.
    keypoints (bool, optional):
        Whether to include keypoint-related fields. Defaults to False.
    stuffs (bool, optional):
        Whether to include stuff-related fields. Defaults to False.

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768))

Customize the schema before creating the dataset:

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768, objects=True, keypoints=True)
    .rename("embedding", "image_embed")
    .add("author", types.Text()))

Add a new field to the schema:

schema = deeplake.schemas.COCOImages(768)
schema.add("location", types.Text())
ds = deeplake.create("tmp://", schema=schema)

# Basic COCO dataset
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(768))

# With keypoints and object detection
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(
        embedding_size=768,
        keypoints=True,
        objects=True
    ))

# Customize schema
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(768)
        .rename("image", "raw_image")
        .add("camera_id", deeplake.types.Text()))
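
Columns from the field list above that are not needed can also be removed from the template before the dataset is created; the column choice here is purely illustrative:

# Drop metadata columns that are not needed
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(768)
        .remove("url")
        .remove("contributor"))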

Custom Schema Template

Build a custom schema template from scratch, modify it, and use it to create a dataset:

# Define a custom schema
schema = deeplake.schemas.SchemaTemplate({
    "id": deeplake.types.UInt64(),
    "image": deeplake.types.Image(),
    "embedding": deeplake.types.Embedding(512),
    "metadata": deeplake.types.Dict()
})

# Modify the template before the dataset is created
schema.add("timestamp", deeplake.types.UInt64())
schema.remove("metadata")
schema.rename("embedding", "image_embedding")

# Create the dataset with the customized schema
ds = deeplake.create("s3://bucket/dataset", schema=schema)
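
Rows appended to this dataset must use the customized column names. A minimal sketch, assuming ds.append accepts a dictionary of column values and that an image column accepts a numpy uint8 array; both the image and the embedding below are placeholders:

import numpy as np

ds.append({
    "id": [1],
    "image": [np.zeros((64, 64, 3), dtype=np.uint8)],      # placeholder image
    "image_embedding": [np.zeros(512, dtype=np.float32)],  # placeholder embedding
    "timestamp": [1700000000],
})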