Schemas
Deep Lake provides pre-built schema templates for common data structures.
Schema Templates
deeplake.schemas.SchemaTemplate
A template that can be used for creating a new dataset with deeplake.create.
This class allows you to define and customize the schema for your dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `dict[str, DataType \| str \| Type]` | A dictionary where the key is the column name and the value is the data type. | required |
Methods:

| Name | Description |
|---|---|
| `add(name: str, dtype: DataType \| str \| Type) -> SchemaTemplate` | Adds a new column to the template. |
| `remove(name: str) -> SchemaTemplate` | Removes a column from the template. |
| `rename(old_name: str, new_name: str) -> SchemaTemplate` | Renames a column in the template. |
Examples:
Create a new schema template, modify it, and create a dataset with the schema:
```python
import deeplake
from deeplake import types

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.add("author", types.Text())
schema.remove("text")
schema.rename("embedding", "text_embedding")
ds = deeplake.create("tmp://", schema=schema)
```
__init__
Constructs a new SchemaTemplate from the given dict.
add
```python
add(name: str, dtype: DataType | str | Type) -> SchemaTemplate
```
Adds a new column to the template.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The column name. | required |
| `dtype` | `DataType \| str \| Type` | The column data type. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `SchemaTemplate` | `SchemaTemplate` | The updated schema template. |
Examples:
Add a new column to the schema:
remove
```python
remove(name: str) -> SchemaTemplate
```
Removes a column from the template.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The column name. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `SchemaTemplate` | `SchemaTemplate` | The updated schema template. |
Examples:
Remove a column from the schema:
rename
```python
rename(old_name: str, new_name: str) -> SchemaTemplate
```
Renames a column in the template.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `old_name` | `str` | Existing column name. | required |
| `new_name` | `str` | New column name. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `SchemaTemplate` | `SchemaTemplate` | The updated schema template. |
Examples:
Rename a column in the schema:
Text Embeddings Schema
deeplake.schemas.TextEmbeddings
```python
TextEmbeddings(embedding_size: int, quantize: bool = False) -> SchemaTemplate
```
A schema for storing embedded text from documents.
This schema includes the following fields:

- id (uint64): Unique identifier for each entry.
- chunk_index (uint16): Position of the text chunk within the document.
- document_id (uint64): Unique identifier for the document the embedding came from.
- date_created (uint64): Timestamp when the document was read.
- text_chunk (text): The text of the chunk.
- embedding (dtype=float32, size=embedding_size): The embedding of the text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embedding_size` | `int` | Size of the embeddings. | required |
| `quantize` | `bool` | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. | `False` |
Examples:

Create a dataset with the standard schema:

```python
import deeplake

ds = deeplake.create("s3://bucket/dataset",
                     schema=deeplake.schemas.TextEmbeddings(768))
```

Customize the schema before creating the dataset:

```python
ds = deeplake.create("s3://bucket/dataset",
                     schema=deeplake.schemas.TextEmbeddings(768)
                     .rename("embedding", "text_embedding")
                     .add("source", deeplake.types.Text()))
```

Add a new field to the schema:

```python
schema = deeplake.schemas.TextEmbeddings(768)
schema.add("language", deeplake.types.Text())
ds = deeplake.create("s3://bucket/dataset", schema=schema)
```
COCO Images Schema
deeplake.schemas.COCOImages
```python
COCOImages(
    embedding_size: int,
    quantize: bool = False,
    objects: bool = True,
    keypoints: bool = False,
    stuffs: bool = False,
) -> SchemaTemplate
```
A schema for storing COCO-based image data.
This schema includes the following fields:

- id (uint64): Unique identifier for each entry.
- image (jpg image): The image data.
- url (text): URL of the image.
- year (uint8): Year the image was captured.
- version (text): Version of the dataset.
- description (text): Description of the image.
- contributor (text): Contributor of the image.
- date_created (uint64): Timestamp when the image was created.
- date_captured (uint64): Timestamp when the image was captured.
- embedding (embedding): Embedding of the image.
- license (text): License information.
- is_crowd (bool): Whether the image contains a crowd.
If objects is true, the following fields are added:
- objects_bbox (bounding box): Bounding boxes for objects.
- objects_classes (segment mask): Segment masks for objects.
If keypoints is true, the following fields are added:
- keypoints_bbox (bounding box): Bounding boxes for keypoints.
- keypoints_classes (segment mask): Segment masks for keypoints.
- keypoints (2-dimensional array of uint32): Keypoints data.
- keypoints_skeleton (2-dimensional array of uint16): Skeleton data for keypoints.
If stuffs is true, the following fields are added:
- stuffs_bbox (bounding boxes): Bounding boxes for stuffs.
- stuffs_classes (segment mask): Segment masks for stuffs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embedding_size` | `int` | Size of the embeddings. | required |
| `quantize` | `bool` | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. | `False` |
| `objects` | `bool` | Whether to include object-related fields. | `True` |
| `keypoints` | `bool` | Whether to include keypoint-related fields. | `False` |
| `stuffs` | `bool` | Whether to include stuff-related fields. | `False` |
Examples:

Create a dataset with the standard schema:

```python
import deeplake

ds = deeplake.create("s3://bucket/dataset",
                     schema=deeplake.schemas.COCOImages(768))
```

Include keypoint fields in addition to object annotations:

```python
ds = deeplake.create("s3://bucket/dataset",
                     schema=deeplake.schemas.COCOImages(
                         embedding_size=768,
                         keypoints=True,
                         objects=True
                     ))
```

Customize the schema before creating the dataset:

```python
ds = deeplake.create("s3://bucket/dataset",
                     schema=deeplake.schemas.COCOImages(768)
                     .rename("image", "raw_image")
                     .add("camera_id", deeplake.types.Text()))
```
Custom Schema Template
Create custom schema templates:
```python
import deeplake

# Define a custom schema
schema = deeplake.schemas.SchemaTemplate({
    "id": deeplake.types.UInt64(),
    "image": deeplake.types.Image(),
    "embedding": deeplake.types.Embedding(512),
    "metadata": deeplake.types.Dict()
})

# Modify the schema before creating the dataset
schema.add("timestamp", deeplake.types.UInt64())
schema.remove("metadata")
schema.rename("embedding", "image_embedding")

# Create the dataset with the custom schema
ds = deeplake.create("s3://bucket/dataset", schema=schema)
```