Fiddler Evals SDK

init()

Initialize the Fiddler client with connection parameters and global configuration.

This function establishes a connection to the Fiddler platform and configures the global client state. It handles authentication, server compatibility validation, logging setup, and creates the singleton connection instance used throughout the client library.

Parameters

  • url (str) – The base URL to your Fiddler platform instance

  • token (str) – Authentication token obtained from the Fiddler UI Credentials tab

  • proxies (dict | None) – Dictionary mapping protocol to proxy URL for HTTP requests

  • timeout (float | tuple[float, float] | None) – HTTP request timeout settings (float or tuple of connect/read timeouts)

  • verify (bool) – Whether to verify server’s TLS certificate (default: True)

  • validate (bool) – Whether to validate server/client version compatibility (default: True)

  • Raises:

  • ValueError – If url or token parameters are empty

  • IncompatibleClient – If server version is incompatible with client version

  • ConnectionError – If unable to connect to the Fiddler platform

  • Return type: None

Examples

Basic initialization:

from fiddler_evals import init

init(
    url="https://your-fiddler-instance.com",
    token="your-auth-token"
)

Initialization with custom timeout and proxy:

init(
    url="https://your-fiddler-instance.com",
    token="your-auth-token",
    timeout=(10.0, 60.0),  # 10s connect, 60s read timeout
    proxies={"https": "https://proxy.company.com:8080"}
)

Initialization for development with relaxed settings:

init(
    url="https://dev-fiddler-instance.com",
    token="dev-token",
    verify=False,  # Skip SSL verification
    validate=False,  # Skip version compatibility check
)

The client implements automatic retry strategies for transient failures. Configure retry duration via FIDDLER_CLIENT_RETRY_MAX_DURATION_SECONDS environment variable (default: 300 seconds).
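
For example, a longer retry window can be set through the environment before initializing the client. A minimal sketch (the variable name comes from the note above; the value is illustrative):

import os

# Allow transient-failure retries for up to 10 minutes instead of the default 300 seconds
os.environ["FIDDLER_CLIENT_RETRY_MAX_DURATION_SECONDS"] = "600"

from fiddler_evals import init

init(
    url="https://your-fiddler-instance.com",
    token="your-auth-token",
)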

Logging is performed under the ‘fiddler’ namespace at INFO level. If no root logger is configured, a stderr handler is automatically attached unless auto_attach_log_handler=False.
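
If you manage logging yourself, you can configure the ‘fiddler’ logger with the standard library before calling init(); a minimal sketch based on the note above:

import logging

# Configure a root logger so the SDK does not attach its own stderr handler,
# and control the 'fiddler' namespace explicitly
logging.basicConfig(level=logging.INFO)
logging.getLogger("fiddler").setLevel(logging.INFO)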

Connection

Manages authenticated connections to the Fiddler platform.

The Connection class handles all aspects of connecting to and communicating with the Fiddler platform, including authentication, HTTP client management, server version compatibility checking, and organization context management.

This class provides the foundation for all API interactions with Fiddler, managing connection parameters, authentication tokens, and ensuring proper communication protocols are established.

Example

# Creating a basic connection
connection = Connection(
    url="https://your-fiddler-instance.com",
    token="your-auth-token"
)

# Creating a connection with custom timeout and proxy
connection = Connection(
    url="https://your-fiddler-instance.com",
    token="your-auth-token",
    timeout=(5.0, 30.0),  # (connect_timeout, read_timeout)
    proxies={"https": "https://proxy.company.com:8080"}
)

# Creating a connection without SSL verification
connection = Connection(
    url="https://your-fiddler-instance.com",
    token="your-auth-token",
    verify=False,  # Not recommended for production
    validate=False  # Skip version compatibility check
)

Initialize a connection to the Fiddler platform.

Parameters

  • url (str) – The base URL to your Fiddler platform instance

  • token (str) – Authentication token obtained from the Fiddler UI

  • proxies (dict | None) – Dictionary mapping protocol to proxy URL for HTTP requests

  • timeout (float | tuple[float, float] | None) – HTTP request timeout settings (float or tuple of connect/read timeouts)

  • verify (bool) – Whether to verify server’s TLS certificate (default: True)

  • validate (bool) – Whether to validate server/client version compatibility (default: True)

  • Raises:

  • ValueError – If url or token parameters are empty

  • IncompatibleClient – If server version is incompatible with client version

  • Return type: None

property client : RequestClient

Get the HTTP request client instance for API communication.

  • Returns: Configured HTTP client with authentication headers, proxy settings, and timeout configurations.

  • Return type: RequestClient

property server_info : ServerInfo

Get server information and metadata from the Fiddler platform.

  • Returns: Server information including version, organization details, and platform configuration.

  • Return type: ServerInfo

property server_version : VersionInfo

Get the semantic version of the connected Fiddler server.

  • Returns: Semantic version object representing the server version.

  • Return type: VersionInfo

property organization_name : str

Get the name of the connected organization.

  • Returns: Name of the organization associated with this connection.

  • Return type: str

property organization_id : UUID

Get the UUID of the connected organization.

  • Returns: Unique identifier of the organization associated with this connection.

  • Return type: UUID
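
Once a connection has been established, these properties can be read for quick diagnostics; a minimal sketch assuming a Connection instance named connection, as created in the example above:

# Inspect server and organization context for the active connection
print(f"Server version: {connection.server_version}")
print(f"Organization: {connection.organization_name} ({connection.organization_id})")

# Low-level HTTP client used for API calls (rarely needed directly)
client = connection.client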

Project

Represents a project container for organizing GenAI Apps and resources.

A Project is the top-level organizational unit in Fiddler that groups related GenAI Applications, datasets, and monitoring configurations. Projects provide logical separation, access control, and resource management for GenAI App monitoring workflows.

Key Features:

  • GenAI Apps Organization: Container for related GenAI apps

  • Resource Isolation: Separate namespaces prevent naming conflicts

  • Access Management: Project-level permissions and access control

  • Monitoring Coordination: Centralized monitoring and alerting configuration

  • Lifecycle Management: Coordinated creation, updates, and deletion of resources

Project Lifecycle:

  1. Creation: Create project with unique name within organization

  2. App creation: Create GenAI applications with Application().create()

  3. Configuration: Set up monitoring, alerts, and evaluators.

  4. Operations: Publish logs, monitor performance, manage alerts

  5. Maintenance: Update configurations

  6. Cleanup: Delete project when no longer needed (removes all contained resources)

Example

# Create a new project for fraud detection models
project = Project.create(name="fraud-detection-2024")
print(f"Created project: {project.name} (ID: {project.id})")

Projects are permanent containers - once created, the name cannot be changed. Deleting a project removes all contained models, datasets, and configurations. Consider the organizational structure carefully before creating projects.

classmethod get_by_id(id_)

Retrieve a project by its unique identifier.

Fetches a project from the Fiddler platform using its UUID. This is the most direct way to retrieve a project when you know its ID.

  • Parameters:

  • id_ (UUID | str) – The unique identifier (UUID) of the project to retrieve. Can be provided as a UUID object or string representation.

  • Returns: The project instance with all metadata and configuration.

  • Return type: Project

  • Raises:

  • NotFound – If no project exists with the specified ID.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get project by UUID
project = Project.get_by_id(id_="550e8400-e29b-41d4-a716-446655440000")
print(f"Retrieved project: {project.name}")
print(f"Created: {project.created_at}")

This method makes an API call to fetch the latest project state from the server. The returned project instance reflects the current state in Fiddler.

classmethod get_by_name(name)

Retrieve a project by name.

Finds and returns a project using its name within the organization. This is useful when you know the project name but not its UUID. Project names are unique within an organization, making this a reliable lookup method.

Parameters

  • name (str) – The name of the project to retrieve. Project names are unique within an organization and are case-sensitive.

  • Returns: The project instance matching the specified name.

  • Return type: Project

  • Raises:

  • NotFound – If no project exists with the specified name in the organization.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get project by name
project = Project.get_by_name(name="fraud-detection")
print(f"Found project: {project.name} (ID: {project.id})")
print(f"Created: {project.created_at}")

# Get project for specific environment
prod_project = Project.get_by_name(name="fraud-detection-prod")
staging_project = Project.get_by_name(name="fraud-detection-staging")

Project names are case-sensitive and must match exactly. Use this method when you have a known project name from configuration or user input.

classmethod list()

List all projects in the organization.

Retrieves all projects that the current user has access to within the organization. Returns an iterator for memory efficiency when dealing with many projects.

  • Yields: Project – Project instances for all accessible projects.

  • Raises: ApiError – If there’s an error communicating with the Fiddler API.

  • Return type: Iterator[Project]

Example

# List all projects
for project in Project.list():
    print(f"Project: {project.name}")
    print(f"  ID: {project.id}")
    print(f"  Created: {project.created_at}")

# Convert to list for counting and filtering
projects = list(Project.list())
print(f"Total accessible projects: {len(projects)}")

# Find projects by name pattern
prod_projects = [
    p for p in Project.list()
    if "prod" in p.name.lower()
]
print(f"Production projects: {len(prod_projects)}")

# Get project summaries
for project in Project.list():
    print(f"{project.name}")

This method returns an iterator for memory efficiency. Convert to a list with list(Project.list()) if you need to iterate multiple times or get the total count. The iterator fetches projects lazily from the API.

classmethod create(name)

Create a new project on the Fiddler platform.

Creates and persists a project with the given name, making it available for adding GenAI Apps, configuring monitoring, and other operations. The project must have a unique name within the organization.

Parameters

  • name (str) – Project name; must be unique within the organization. Should follow slug-like naming conventions: use lowercase letters, numbers, hyphens, and underscores, and start with a letter or number.

  • Returns: The created project instance, updated with server-assigned fields such as ID, creation timestamp, and other metadata.

  • Return type: Project

  • Raises:

  • Conflict – If a project with the same name already exists in the organization.

  • ValidationError – If the project configuration is invalid (e.g., invalid name format).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Create a new project
project = Project.create(name="customer-churn-analysis")
print(f"Created project with ID: {project.id}")
print(f"Created at: {project.created_at}")

# Project is now available for adding GenAI Apps
assert project.id is not None

After successful creation, the project instance is returned with server-assigned metadata. The project is immediately available for adding GenAI Apps and other resources.

classmethod get_or_create(name)

Get an existing project by name or create a new one if it doesn’t exist.

This is a convenience method that attempts to retrieve a project by name, and if not found, creates a new project with that name. Useful for idempotent project setup in automation scripts and deployment pipelines.

Parameters

  • name (str) – The name of the project to retrieve or create. Must follow project naming conventions (slug-like format).

  • Returns: Either the existing project with the specified name, or a newly created project if none existed.

  • Return type: Project

  • Raises:

  • ValidationError – If the project name format is invalid.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Safe project setup - get existing or create new
project = Project.get_or_create(name="fraud-detection-prod")
print(f"Using project: {project.name} (ID: {project.id})")

# Idempotent setup in deployment scripts
project = Project.get_or_create(name="llm-pipeline-staging")

# Use in configuration management
environments = ["dev", "staging", "prod"]
projects = {}

for env in environments:
    projects[env] = Project.get_or_create(name=f"fraud-detection-{env}")

This method is idempotent - calling it multiple times with the same name will return the same project. It logs when creating a new project for visibility in automation scenarios.

Application

Represents a GenAI Application container for organizing GenAI application resources.

An Application is a logical container within a Project that groups related GenAI application resources including datasets, experiments, evaluators, and monitoring configurations. Applications provide resource organization, access control, and lifecycle management for GenAI App monitoring workflows.

Key Features:

  • Resource Organization: Container for related GenAI application resources

  • Project Context: Applications are scoped within projects for isolation

  • Access Management: Application-level permissions and access control

  • Monitoring Coordination: Centralized monitoring and alerting configuration

  • Lifecycle Management: Coordinated creation, updates, and deletion of resources

Application Lifecycle:

  1. Creation: Create application with unique name within a project

  2. Configuration: Set up datasets, evaluators, and monitoring

  3. Operations: Publish logs, monitor performance, manage alerts

  4. Maintenance: Update configurations and resources

  5. Cleanup: Delete application when no longer needed

Example

# Create a new application for fraud detection
application = Application.create(name="fraud-detection-app", project_id=project_id)
print(f"Created application: {application.name} (ID: {application.id})")

Applications are permanent containers - once created, the name cannot be changed. Deleting an application removes all contained resources and configurations. Consider the organizational structure carefully before creating applications.

classmethod get_by_id(id_)

Retrieve an application by its unique identifier.

Fetches an application from the Fiddler platform using its UUID. This is the most direct way to retrieve an application when you know its ID.

  • Parameters:

  • id_ (UUID | str) – The unique identifier (UUID) of the application to retrieve. Can be provided as a UUID object or string representation.

  • Returns: The application instance with all metadata and configuration.

  • Return type: Application

  • Raises:

  • NotFound – If no application exists with the specified ID.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get application by UUID
application = Application.get_by_id(id_="550e8400-e29b-41d4-a716-446655440000")
print(f"Retrieved application: {application.name}")
print(f"Created: {application.created_at}")
print(f"Project: {application.project.name}")

This method makes an API call to fetch the latest application state from the server. The returned application instance reflects the current state in Fiddler.

classmethod get_by_name(name, project_id)

Retrieve an application by name within a project.

Finds and returns an application using its name within the specified project. This is useful when you know the application name and project but not its UUID. Application names are unique within a project, making this a reliable lookup method.

Parameters

  • name (str) – The name of the application to retrieve. Application names are unique within a project and are case-sensitive.

  • project_id (UUID | str) – The UUID of the project containing the application. Can be provided as a UUID object or string representation.

  • Returns: The application instance matching the specified name.

  • Return type: Application

  • Raises:

  • NotFound – If no application exists with the specified name in the project.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get project instance
project = Project.get_by_name(name="fraud-detection-project")

# Get application by name within a project
application = Application.get_by_name(
    name="fraud-detection-app",
    project_id=project.id
)
print(f"Found application: {application.name} (ID: {application.id})")
print(f"Created: {application.created_at}")
print(f"Project: {application.project.name}")

Application names are case-sensitive and must match exactly. Use this method when you have a known application name from configuration or user input.

classmethod list(project_id)

List all applications in a project.

Retrieves all applications that the current user has access to within the specified project. Returns an iterator for memory efficiency when dealing with many applications.

Parameters

  • project_id (UUID | str) – The UUID of the project to list applications from. Can be provided as a UUID object or string representation.

  • Yields: Application – Application instances for all accessible applications in the project.

  • Raises: ApiError – If there’s an error communicating with the Fiddler API.

  • Return type: Iterator[Application]

Example

# Get project instance
project = Project.get_by_name(name="fraud-detection-project")

# List all applications in a project
for application in Application.list(project_id=project.id):
    print(f"Application: {application.name}")
    print(f"  ID: {application.id}")
    print(f"  Created: {application.created_at}")

# Convert to list for counting and filtering
applications = list(Application.list(project_id=project.id))
print(f"Total applications in project: {len(applications)}")

# Find applications by name pattern
fraud_apps = [
    app for app in Application.list(project_id=project.id)
    if "fraud" in app.name.lower()
]
print(f"Fraud detection applications: {len(fraud_apps)}")

This method returns an iterator for memory efficiency. Convert to a list with list(Application.list(project_id)) if you need to iterate multiple times or get the total count. The iterator fetches applications lazily from the API.

classmethod create(name, project_id)

Create a new application in a project.

Creates a new application within the specified project on the Fiddler platform. The application must have a unique name within the project.

Parameters

  • name (str) – Application name, must be unique within the project.

  • project_id (UUID | str) – The UUID of the project to create the application in. Can be provided as a UUID object or string representation.

  • Returns: The newly created application instance with server-assigned fields.

  • Return type: Application

  • Raises:

  • Conflict – If an application with the same name already exists in the project.

  • ValidationError – If the application configuration is invalid (e.g., invalid name format).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get project instance
project = Project.get_by_name(name="fraud-detection-project")

# Create a new application for fraud detection
application = Application.create(
    name="fraud-detection-app",
    project_id=project.id
)
print(f"Created application with ID: {application.id}")
print(f"Created at: {application.created_at}")
print(f"Project: {application.project.name}")

After successful creation, the application instance is returned with server-assigned metadata. The application is immediately available for adding datasets, evaluators, and other resources.

classmethod get_or_create(name, project_id)

Get an existing application by name or create a new one if it doesn’t exist.

This is a convenience method that attempts to retrieve an application by name within a project, and if not found, creates a new application with that name. Useful for idempotent application setup in automation scripts and deployment pipelines.

Parameters

  • name (str) – The name of the application to retrieve or create.

  • project_id (UUID | str) – The UUID of the project to search/create the application in. Can be provided as a UUID object or string representation.

  • Returns: Either the existing application with the specified name, or a newly created application if none existed.

  • Return type: Application

  • Raises:

  • ValidationError – If the application name format is invalid.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get project instance
project = Project.get_by_name(name="fraud-detection-project")

# Safe application setup - get existing or create new
application = Application.get_or_create(
    name="fraud-detection-app",
    project_id=project.id
)
print(f"Using application: {application.name} (ID: {application.id})")

# Idempotent setup in deployment scripts
application = Application.get_or_create(
    name="llm-pipeline-app",
    project_id=project.id
)

# Use in configuration management
environments = ["dev", "staging", "prod"]
applications = {}

for env in environments:
    applications[env] = Application.get_or_create(
        name=f"fraud-detection-{env}",
        project_id=project.id,
    )

This method is idempotent - calling it multiple times with the same name and project_id will return the same application. It logs when creating a new application for visibility in automation scenarios.

Dataset

Represents a Dataset container for organizing evaluation test cases.

A Dataset is a logical container within an Application that stores structured test cases with inputs and expected outputs for GenAI evaluation. Datasets provide organized storage, metadata management, and tagging capabilities for systematic testing and validation of GenAI applications.

Key Features:

  • Test Case Storage: Container for structured evaluation test cases

  • Application Context: Datasets are scoped within applications for isolation

  • Metadata Management: Custom metadata and tagging for organization

  • Evaluation Foundation: Structured data for GenAI application testing

  • Lifecycle Management: Coordinated creation, updates, and deletion of datasets

Dataset Lifecycle:

  1. Creation: Create dataset with unique name within an application

  2. Configuration: Add test cases and metadata

  3. Evaluation: Use dataset for testing GenAI applications

  4. Maintenance: Update test cases and metadata as needed

  5. Cleanup: Delete dataset when no longer needed

Example

# Create a new dataset for fraud detection tests
dataset = Dataset.create(
    name="fraud-detection-tests",
    application_id=application_id,
    description="Test cases for fraud detection model",
    metadata={"source": "production", "version": "1.0"},
)
print(f"Created dataset: {dataset.name} (ID: {dataset.id})")

Datasets are permanent containers - once created, the name cannot be changed. Deleting a dataset removes all contained test cases and metadata. Consider the organizational structure carefully before creating datasets.

active : bool = True

description : str | None = None

classmethod get_by_id(id_)

Retrieve a dataset by its unique identifier.

Fetches a dataset from the Fiddler platform using its UUID. This is the most direct way to retrieve a dataset when you know its ID.

  • Parameters:

  • id_ (UUID | str) – The unique identifier (UUID) of the dataset to retrieve. Can be provided as a UUID object or string representation.

  • Returns: The dataset instance with all metadata and configuration.

  • Return type: Dataset

  • Raises:

  • NotFound – If no dataset exists with the specified ID.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get dataset by UUID
dataset = Dataset.get_by_id(id_="550e8400-e29b-41d4-a716-446655440000")
print(f"Retrieved dataset: {dataset.name}")
print(f"Created: {dataset.created_at}")
print(f"Application: {dataset.application.name}")

This method makes an API call to fetch the latest dataset state from the server. The returned dataset instance reflects the current state in Fiddler.

classmethod get_by_name(name, application_id)

Retrieve a dataset by name within an application.

Finds and returns a dataset using its name within the specified application. This is useful when you know the dataset name and application but not its UUID. Dataset names are unique within an application, making this a reliable lookup method.

Parameters

  • name (str) – The name of the dataset to retrieve. Dataset names are unique within an application and are case-sensitive.

  • application_id (UUID | str) – The UUID of the application containing the dataset. Can be provided as a UUID object or string representation.

  • Returns: The dataset instance matching the specified name.

  • Return type: Dataset

  • Raises:

  • NotFound – If no dataset exists with the specified name in the application.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get application instance
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)

# Get dataset by name within an application
dataset = Dataset.get_by_name(
    name="fraud-detection-tests",
    application_id=application.id
)
print(f"Found dataset: {dataset.name} (ID: {dataset.id})")
print(f"Created: {dataset.created_at}")
print(f"Application: {dataset.application.name}")

Dataset names are case-sensitive and must match exactly. Use this method when you have a known dataset name from configuration or user input.

classmethod list(application_id)

List all datasets in an application.

Retrieves all datasets that the current user has access to within the specified application. Returns an iterator for memory efficiency when dealing with many datasets.

Parameters

  • application_id (UUID | str) – The UUID of the application to list datasets from. Can be provided as a UUID object or string representation.

  • Yields: Dataset – Dataset instances for all accessible datasets in the application.

  • Raises: ApiError – If there’s an error communicating with the Fiddler API.

  • Return type: Iterator[Dataset]

Example

# Get application instance
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)

# List all datasets in an application
for dataset in Dataset.list(application_id=application.id):
    print(f"Dataset: {dataset.name}")
    print(f"  ID: {dataset.id}")
    print(f"  Created: {dataset.created_at}")

# Convert to list for counting and filtering
datasets = list(Dataset.list(application_id=application.id))
print(f"Total datasets in application: {len(datasets)}")

# Find datasets by name pattern
test_datasets = [
    ds for ds in Dataset.list(application_id=application.id)
    if "test" in ds.name.lower()
]
print(f"Test datasets: {len(test_datasets)}")

This method returns an iterator for memory efficiency. Convert to a list with list(Dataset.list(application_id)) if you need to iterate multiple times or get the total count. The iterator fetches datasets lazily from the API.

classmethod create(name, application_id, description=None, metadata=None, active=True)

Create a new dataset in an application.

Creates a new dataset within the specified application on the Fiddler platform. The dataset must have a unique name within the application.

Parameters

  • name (str) – Dataset name, must be unique within the application.

  • application_id (UUID | str) – The UUID of the application to create the dataset in. Can be provided as a UUID object or string representation.

  • description (str | None) – Optional human-readable description of the dataset.

  • metadata (dict | None) – Optional custom metadata dictionary for additional dataset information.

  • active (bool) – Optional boolean flag to indicate if the dataset is active.

  • Returns: The newly created dataset instance with server-assigned fields.

  • Return type: Dataset

  • Raises:

  • Conflict – If a dataset with the same name already exists in the application.

  • ValidationError – If the dataset configuration is invalid (e.g., invalid name format).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get application instance
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)

# Create a new dataset for fraud detection tests
dataset = Dataset.create(
    name="fraud-detection-tests",
    application_id=application.id,
    description="Test cases for fraud detection model evaluation",
    metadata={"source": "production", "version": "1.0", "environment": "test"},
)
print(f"Created dataset with ID: {dataset.id}")
print(f"Created at: {dataset.created_at}")
print(f"Application: {dataset.application.name}")

After successful creation, the dataset instance is returned with server-assigned metadata. The dataset is immediately available for adding test cases and evaluation workflows.

classmethod get_or_create(name, application_id, description=None, metadata=None, active=True)

Get an existing dataset by name or create a new one if it doesn’t exist.

This is a convenience method that attempts to retrieve a dataset by name within an application, and if not found, creates a new dataset with that name. Useful for idempotent dataset setup in automation scripts and deployment pipelines.

Parameters

  • name (str) – The name of the dataset to retrieve or create.

  • application_id (UUID | str) – The UUID of the application to search/create the dataset in. Can be provided as a UUID object or string representation.

  • description (str | None) – Optional human-readable description of the dataset.

  • metadata (dict | None) – Optional custom metadata dictionary for additional dataset information.

  • active (bool) – Optional boolean flag to indicate if the dataset is active.

  • Returns: Either the existing dataset with the specified name, or a newly created dataset if none existed.

  • Return type: Dataset

  • Raises:

  • ValidationError – If the dataset name format is invalid.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get application instance
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)

# Safe dataset setup - get existing or create new
dataset = Dataset.get_or_create(
    name="fraud-detection-tests",
    application_id=application.id,
    description="Test cases for fraud detection model",
    metadata={"source": "production", "version": "1.0"},
)
print(f"Using dataset: {dataset.name} (ID: {dataset.id})")

# Idempotent setup in deployment scripts
dataset = Dataset.get_or_create(
    name="llm-evaluation-tests",
    application_id=application.id,
)

# Use in configuration management
test_types = ["unit", "integration", "performance"]
datasets = {}

for test_type in test_types:
    datasets[test_type] = Dataset.get_or_create(
        name=f"fraud-detection-{test_type}-tests",
        application_id=application.id,
    )

This method is idempotent - calling it multiple times with the same name and application_id will return the same dataset. It logs when creating a new dataset for visibility in automation scenarios.

update()

Update the dataset's description, metadata, and/or active flag.

Parameters

  • description (str | None) – Optional new description for the dataset. If provided, replaces the existing description. Set to empty string to clear.

  • metadata (dict | None) – Optional new metadata dictionary for the dataset. If provided, replaces the existing metadata completely. Use empty dict to clear.

  • active (bool | None) – Optional boolean flag to indicate if the dataset is active.

  • Returns: The updated dataset instance with new metadata and configuration.

  • Return type: Dataset

  • Raises:

  • ValueError – If no update parameters are provided (all are None).

  • ValidationError – If the update data is invalid (e.g., invalid metadata format).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Update description and metadata
updated_dataset = dataset.update(
    description="Updated test cases for fraud detection model v2.0",
    metadata={"source": "production", "version": "2.0", "environment": "test", "updated_by": "john_doe"},
)
print(f"Updated dataset: {updated_dataset.name}")
print(f"New description: {updated_dataset.description}")

# Update only metadata
dataset.update(metadata={"last_updated": "2024-01-15", "status": "active"})

# Clear description
dataset.update(description="")

# Batch update multiple datasets
for dataset in Dataset.list(application_id=application_id):
    if "test" in dataset.name:
        dataset.update(description="Updated test cases for fraud detection model v2.0")

This method performs a complete replacement of the specified fields. For partial updates, retrieve current values, modify them, and pass the complete new values. The dataset name and ID cannot be changed.
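
For example, a partial metadata update can be done by merging the new keys into the current values and passing the full dictionary back to update(); a small sketch, assuming the current metadata is available on the instance as dataset.metadata:

# dataset.metadata is assumed to hold the current metadata dict (may be None)
current_metadata = dataset.metadata or {}
dataset.update(metadata={**current_metadata, "status": "active"})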

delete()

Delete the dataset permanently from the Fiddler platform.

Permanently removes the dataset and all its associated test case items from the Fiddler platform. This operation cannot be undone.

The method performs safety checks before deletion:

  1. Verifies that no experiments are currently associated with the dataset

  2. Prevents deletion if any experiments reference this dataset

  3. Only proceeds with deletion if the dataset is safe to remove

  • Parameters: None – This method takes no parameters.

  • Returns: This method does not return a value.

  • Return type: None

  • Raises:

  • ApiError – If there’s an error communicating with the Fiddler API.

  • ApiError – If the dataset cannot be deleted due to existing experiments.

  • NotFound – If the dataset no longer exists.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="old-test-dataset", application_id=application_id)

# Check if dataset is safe to delete
try:
    dataset.delete()
    print(f"Successfully deleted dataset: {dataset.name}")
except ApiError as e:
    print(f"Cannot delete dataset: {e}")
    print("Dataset may have associated experiments")

# Clean up unused datasets in bulk
unused_datasets = [
    Dataset.get_by_name(name="temp-dataset-1", application_id=application_id),
    Dataset.get_by_name(name="temp-dataset-2", application_id=application_id),
]

for dataset in unused_datasets:
    try:
        dataset.delete()
        print(f"Deleted: {dataset.name}")
    except ApiError:
        print(f"Skipped {dataset.name} - has associated experiments")

This operation is irreversible. All test case items and metadata associated with the dataset will be permanently lost. Ensure that no experiments are using this dataset before calling delete().

insert()

Add multiple test case items to the dataset.

Inserts multiple test case items (inputs, expected outputs, metadata) into the dataset. Each item represents a single test case for evaluation purposes. Items can be provided as dictionaries or NewDatasetItem objects.

  • Parameters: items (list[dict] | list[NewDatasetItem]) –

    List of test case items to add to the dataset. Each item can be:

  • A dictionary containing test case data with keys:

    • inputs: Dictionary containing input data for the test case

    • expected_outputs: Dictionary containing expected output data

    • metadata: Optional dictionary with additional test case metadata

    • extras: Optional dictionary for additional custom data

    • source_name: Optional string identifying the source of the test case

    • source_id: Optional string identifier for the source

  • A NewDatasetItem object with the same structure

  • Returns: List of UUIDs for the newly created dataset items.

  • Return type: builtins.list[UUID]

  • Raises:

  • ValueError – If the items list is empty.

  • ValidationError – If any item data is invalid (e.g., missing required fields).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Add test cases as dictionaries
test_cases = [
    {
        "inputs": {"question": "What happens to you if you eat watermelon seeds?"},
        "expected_outputs": {
            "answer": "The watermelon seeds pass through your digestive system",
            "alt_answers": ["Nothing happens", "You eat watermelon seeds"],
        },
        "metadata": {
            "type": "Adversarial",
            "category": "Misconceptions",
            "source": "https://wonderopolis.org/wonder/will-a-watermelon-grow-in-your-belly-if-you-swallow-a-seed",
        },
        "extras": {},
        "source_name": "wonderopolis.org",
        "source_id": "1",
    },
]

# Insert test cases
item_ids = dataset.insert(test_cases)
print(f"Added {len(item_ids)} test cases")
print(f"Item IDs: {item_ids}")

# Add test cases as NewDatasetItem objects
from fiddler_evals.pydantic_models.dataset import NewDatasetItem

items = [
    NewDatasetItem(
        inputs={"question": "What is the capital of France?"},
        expected_outputs={"answer": "Paris"},
        metadata={"difficulty": "easy"},
        extras={},
        source_name="test_source",
        source_id="item1",
    ),
]

item_ids = dataset.insert(items)
print(f"Added {len(item_ids)} test cases")

This method automatically generates UUIDs and timestamps for each item. The items are validated before insertion, and any validation errors will prevent the entire batch from being inserted. Use this method for bulk insertion of test cases into datasets.

insert_from_pandas()

Insert test case items from a pandas DataFrame into the dataset.

Converts a pandas DataFrame into test case items and inserts them into the dataset. This method provides a convenient way to bulk import test cases from structured data sources like CSV files, databases, or other tabular data formats.

The method intelligently maps DataFrame columns to different test case components:

  • Input columns: Data that will be used as inputs for evaluation

  • Expected output columns: Expected results or answers for the test cases

  • Metadata columns: Additional metadata associated with each test case

  • Extras columns: Custom data fields for additional test case information

  • Source columns: Information about the origin of each test case

Column Mapping Logic:

  1. If input_columns is specified, those columns become inputs

  2. If input_columns is None, all unmapped columns become inputs

  3. Remaining unmapped columns are automatically assigned to extras

  4. Source columns are always mapped to source_name and source_id

Parameters

  • df (pd.DataFrame) – The pandas DataFrame containing test case data. Must not be empty and must have at least one column.

  • input_columns (list[str] | None) – Optional list of column names to use as input data. If None, all unmapped columns become inputs.

  • expected_output_columns (list[str] | None) – Optional list of column names containing expected outputs or answers for the test cases.

  • metadata_columns (list[str] | None) – Optional list of column names to use as metadata. These columns will be stored as test case metadata.

  • extras_columns (list[str] | None) – Optional list of column names for additional custom data. Unmapped columns are automatically added to extras.

  • id_column (str, default "id") – Column name containing the ID for each test case.

  • source_name_column (str, default "source_name") – Column name containing the source identifier for each test case.

  • source_id_column (str, default "source_id") – Column name containing the source ID for each test case.

  • Returns: List of UUIDs for the newly created dataset items.

  • Return type: builtins.list[UUID]

  • Raises:

  • ValueError – If the DataFrame is empty or has no columns.

  • ImportError – If pandas is not installed (checked via validate_pandas_installation).

  • ValidationError – If any generated test case data is invalid.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Example DataFrame with test case data
import pandas as pd

df = pd.DataFrame({
    'question': ['What is fraud?', 'How to detect fraud?', 'What are fraud types?'],
    'expected_answer': ['Fraud is deception', 'Use ML models', 'Identity theft, credit card fraud'],
    'difficulty': ['easy', 'medium', 'hard'],
    'category': ['definition', 'detection', 'types'],
    'source_name': ['manual', 'manual', 'manual'],
    'source_id': ['1', '2', '3']
})

# Insert with explicit column mapping
item_ids = dataset.insert_from_pandas(
    df=df,
    input_columns=['question'],
    expected_output_columns=['expected_answer'],
    metadata_columns=['difficulty', 'category'],
)
print(f"Added {len(item_ids)} test cases from DataFrame")

# Insert with automatic column mapping (all unmapped columns become inputs)
df_auto = pd.DataFrame({
    'user_query': ['Is this transaction suspicious?', 'Check for anomalies'],
    'context': ['Credit card transaction', 'Banking data'],
    'expected_response': ['Yes, flagged', 'Anomalies detected'],
    'priority': ['high', 'medium'],
    'source': ['production', 'test']
})

item_ids = dataset.insert_from_pandas(
    df=df_auto,
    expected_output_columns=['expected_response'],
    metadata_columns=['priority'],
    source_name_column='source',
    source_id_column='source'  # Using same column for both
)

# Complex DataFrame with many columns
df_complex = pd.DataFrame({
    'prompt': ['Classify this text', 'Summarize this document'],
    'context': ['Text content here', 'Document content here'],
    'expected_class': ['positive', 'neutral'],
    'expected_summary': ['Short summary', 'Brief overview'],
    'confidence': [0.95, 0.87],
    'language': ['en', 'en'],
    'domain': ['sentiment', 'summarization'],
    'version': ['1.0', '1.0'],
    'created_by': ['user1', 'user2'],
    'review_status': ['approved', 'pending']
})

item_ids = dataset.insert_from_pandas(
    df=df_complex,
    input_columns=['prompt', 'context'],
    expected_output_columns=['expected_class', 'expected_summary'],
    metadata_columns=['confidence', 'language', 'domain', 'version'],
    extras_columns=['created_by', 'review_status']
)

This method requires pandas to be installed. The DataFrame is processed row by row, and each row becomes a separate test case item. Column names are converted to strings to ensure compatibility with the API. Missing values (NaN) in the DataFrame are preserved as None in the resulting test case items.

insert_from_csv_file()

Insert test case items from a CSV file into the dataset.

Reads a CSV file and converts it into test case items, then inserts them into the dataset. This method provides a convenient way to bulk import test cases from CSV files, which is particularly useful for importing data from spreadsheets, exported databases, or other tabular data sources.

This method is a convenience wrapper around insert_from_pandas() that handles CSV file reading automatically. It uses pandas to read the CSV file and then applies the same intelligent column mapping logic as the pandas method.

Column Mapping Logic:

  1. If input_columns is specified, those columns become inputs

  2. If input_columns is None, all unmapped columns become inputs

  3. Remaining unmapped columns are automatically assigned to extras

  4. Source columns are always mapped to source_name and source_id

Parameters

  • file_path (str | Path) – Path to the CSV file to read. Can be a string or Path object. Supports both relative and absolute paths.

  • input_columns (list[str] | None) – Optional list of column names to use as input data. If None, all unmapped columns become inputs.

  • expected_output_columns (list[str] | None) – Optional list of column names containing expected outputs or answers for the test cases.

  • metadata_columns (list[str] | None) – Optional list of column names to use as metadata. These columns will be stored as test case metadata.

  • extras_columns (list[str] | None) – Optional list of column names for additional custom data. Unmapped columns are automatically added to extras.

  • id_column (str, default "id") – Column name containing the ID for each test case.

  • source_name_column (str, default "source_name") – Column name containing the source identifier for each test case.

  • source_id_column (str, default "source_id") – Column name containing the source ID for each test case.

  • Returns: List of UUIDs for the newly created dataset items.

  • Return type: builtins.list[UUID]

  • Raises:

  • FileNotFoundError – If the CSV file does not exist at the specified path.

  • ValueError – If the CSV file is empty or has no columns.

  • ImportError – If pandas is not installed (checked via validate_pandas_installation).

  • ValidationError – If any generated test case data is invalid.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Example CSV file: test_cases.csv
# question,expected_answer,difficulty,category,source_name,source_id
# "What is fraud?","Fraud is deception","easy","definition","manual","1"
# "How to detect fraud?","Use ML models","medium","detection","manual","2"
# "What are fraud types?","Identity theft, credit card fraud","hard","types","manual","3"

# Insert with explicit column mapping
item_ids = dataset.insert_from_csv_file(
    file_path="test_cases.csv",
    input_columns=['question'],
    expected_output_columns=['expected_answer'],
    metadata_columns=['difficulty', 'category'],
)
print(f"Added {len(item_ids)} test cases from CSV")

# Insert with automatic column mapping (all unmapped columns become inputs)
# CSV: user_query,context,expected_response,priority,source
item_ids = dataset.insert_from_csv_file(
    file_path="evaluation_data.csv",
    expected_output_columns=['expected_response'],
    metadata_columns=['priority'],
    source_name_column='source',
    source_id_column='source'  # Using same column for both
)

# Import from CSV with relative path
item_ids = dataset.insert_from_csv_file("data/test_cases.csv")
print(f"Imported {len(item_ids)} test cases from CSV")

# Import from CSV with absolute path
from pathlib import Path
csv_path = Path("/absolute/path/to/test_cases.csv")
item_ids = dataset.insert_from_csv_file(csv_path)

# Complex CSV with many columns
# prompt,context,expected_class,expected_summary,confidence,language,domain,version,created_by,review_status
item_ids = dataset.insert_from_csv_file(
    file_path="complex_test_cases.csv",
    input_columns=['prompt', 'context'],
    expected_output_columns=['expected_class', 'expected_summary'],
    metadata_columns=['confidence', 'language', 'domain', 'version'],
    extras_columns=['created_by', 'review_status']
)

# Batch import multiple CSV files
csv_files = ["test_cases_1.csv", "test_cases_2.csv", "test_cases_3.csv"]
all_item_ids = []

for csv_file in csv_files:
    item_ids = dataset.insert_from_csv_file(csv_file)
    all_item_ids.extend(item_ids)
    print(f"Imported {len(item_ids)} items from {csv_file}")

print(f"Total imported: {len(all_item_ids)} items")

This method requires pandas to be installed. The CSV file is read using pandas.read_csv() with default parameters. For advanced CSV reading options (custom delimiters, encoding, etc.), use pandas.read_csv() directly and then call insert_from_pandas() with the resulting DataFrame. Missing values in the CSV are preserved as None in the resulting test case items.
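
For instance, a semicolon-delimited, Latin-1 encoded CSV could be read with pandas first and then inserted through insert_from_pandas(); a minimal sketch (file name, read options, and column names are illustrative):

import pandas as pd

# Read the CSV with non-default options, then reuse the pandas-based insert path
df = pd.read_csv("test_cases_semicolon.csv", sep=";", encoding="latin-1")

item_ids = dataset.insert_from_pandas(
    df=df,
    input_columns=["question"],
    expected_output_columns=["expected_answer"],
)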

insert_from_jsonl_file()

Insert test case items from a JSONL (JSON Lines) file into the dataset.

Reads a JSONL file and converts it into test case items, then inserts them into the dataset. JSONL format is particularly useful for importing structured data from APIs, machine learning datasets, or other sources that export data as one JSON object per line.

JSONL Format: Each line in the file must be a valid JSON object. Empty lines are skipped. The method parses each line as a separate JSON object and extracts the specified keys to create test case items.

Column Mapping: Unlike the CSV/pandas methods, this method requires explicit specification of input_keys since JSON objects don’t have a predefined column structure. All other key/column mappings work the same way as in the other insert methods.

Parameters

  • file_path (str | Path) – Path to the JSONL file to read. Can be a string or Path object. Supports both relative and absolute paths.

  • input_keys (list[str]) – Required list of key names to use as input data. These must correspond to keys in the JSON objects.

  • expected_output_keys (list[str] | None) – Optional list of key names containing expected outputs or answers for the test cases.

  • metadata_keys (list[str] | None) – Optional list of key names to use as metadata. These keys will be stored as test case metadata.

  • extras_keys (list[str] | None) – Optional list of key names for additional custom data. Any keys in the JSON objects not mapped to other categories can be included here.

  • id_key (str, default "id") – Key name containing the ID for each test case.

  • source_name_key (str, default "source_name") – Key name containing the source identifier for each test case.

  • source_id_key (str, default "source_id") – Key name containing the source ID for each test case.

  • Returns: List of UUIDs for the newly created dataset items.

  • Return type: builtins.list[UUID]

  • Raises:

  • FileNotFoundError – If the JSONL file does not exist at the specified path.

  • ValueError – If the JSONL file is empty or has no valid JSON objects.

  • json.JSONDecodeError – If any line in the file contains invalid JSON.

  • ValidationError – If any generated test case data is invalid.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Example JSONL file: test_cases.jsonl
# {"question": "What is fraud?", "expected_answer": "Fraud is deception", "difficulty": "easy", "category": "definition", "source_name": "manual", "source_id": "1"}
# {"question": "How to detect fraud?", "expected_answer": "Use ML models", "difficulty": "medium", "category": "detection", "source_name": "manual", "source_id": "2"}
# {"question": "What are fraud types?", "expected_answer": "Identity theft, credit card fraud", "difficulty": "hard", "category": "types", "source_name": "manual", "source_id": "3"}

# Insert with explicit column mapping
item_ids = dataset.insert_from_jsonl_file(
    file_path="test_cases.jsonl",
    input_keys=['question'],
    expected_output_keys=['expected_answer'],
    metadata_keys=['difficulty', 'category'],
)
print(f"Added {len(item_ids)} test cases from JSONL")

# Batch import multiple JSONL files
jsonl_files = ["test_cases_1.jsonl", "test_cases_2.jsonl", "test_cases_3.jsonl"]
all_item_ids = []

for jsonl_file in jsonl_files:
    item_ids = dataset.insert_from_jsonl_file(
        jsonl_file,
        input_keys=['question'],
    )
    all_item_ids.extend(item_ids)
    print(f"Imported {len(item_ids)} items from {jsonl_file}")

print(f"Total imported: {len(all_item_ids)} items")

This method reads the file line by line and parses each line as JSON. Empty lines are automatically skipped. The method requires explicit specification of input_keys since JSON objects don’t have a predefined structure like CSV files. Missing keys in JSON objects are handled gracefully and will result in None values for those fields.

add_testcases()

Add multiple test case items to the dataset.

Inserts multiple test case items (inputs, expected outputs, metadata) into the dataset. Each item represents a single test case for evaluation purposes. Items can be provided as dictionaries or NewDatasetItem objects.

  • Parameters: items (list[dict] | list[NewDatasetItem]) –

    List of test case items to add to the dataset. Each item can be:

  • A dictionary containing test case data with keys:

    • inputs: Dictionary containing input data for the test case

    • expected_outputs: Dictionary containing expected output data

    • metadata: Optional dictionary with additional test case metadata

    • extras: Optional dictionary for additional custom data

    • source_name: Optional string identifying the source of the test case

    • source_id: Optional string identifier for the source

  • A NewDatasetItem object with the same structure

  • Returns: List of UUIDs for the newly created dataset items.

  • Return type: builtins.list[UUID]

  • Raises:

  • ValueError – If the items list is empty.

  • ValidationError – If any item data is invalid (e.g., missing required fields).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Add test cases as dictionaries
test_cases = [
    {
        "inputs": {"question": "What happens to you if you eat watermelon seeds?"},
        "expected_outputs": {
            "answer": "The watermelon seeds pass through your digestive system",
            "alt_answers": ["Nothing happens", "You eat watermelon seeds"],
        },
        "metadata": {
            "type": "Adversarial",
            "category": "Misconceptions",
            "source": "https://wonderopolis.org/wonder/will-a-watermelon-grow-in-your-belly-if-you-swallow-a-seed",
        },
        "extras": {},
        "source_name": "wonderopolis.org",
        "source_id": "1",
    },
]

# Insert test cases
item_ids = dataset.add_testcases(test_cases)
print(f"Added {len(item_ids)} test cases")
print(f"Item IDs: {item_ids}")

# Add test cases as NewDatasetItem objects
from fiddler_evals.pydantic_models.dataset import NewDatasetItem

items = [
    NewDatasetItem(
        inputs={"question": "What is the capital of France?"},
        expected_outputs={"answer": "Paris"},
        metadata={"difficulty": "easy"},
        extras={},
        source_name="test_source",
        source_id="item1",
    ),
]

item_ids = dataset.add_testcases(items)
print(f"Added {len(item_ids)} test cases")

This method automatically generates UUIDs and timestamps for each item. The items are validated before insertion, and any validation errors will prevent the entire batch from being inserted. Use this method for bulk insertion of test cases into datasets.

add_items()

Add multiple test case items to the dataset.

Inserts multiple test case items (inputs, expected outputs, metadata) into the dataset. Each item represents a single test case for evaluation purposes. Items can be provided as dictionaries or NewDatasetItem objects.

  • Parameters: items (list[dict] | list[NewDatasetItem]) –

    List of test case items to add to the dataset. Each item can be:

  • A dictionary containing test case data with keys:

    • inputs: Dictionary containing input data for the test case

    • expected_outputs: Dictionary containing expected output data

    • metadata: Optional dictionary with additional test case metadata

    • extras: Optional dictionary for additional custom data

    • source_name: Optional string identifying the source of the test case

    • source_id: Optional string identifier for the source

  • A NewDatasetItem object with the same structure

  • Returns: List of UUIDs for the newly created dataset items.

  • Return type: builtins.list[UUID]

  • Raises:

  • ValueError – If the items list is empty.

  • ValidationError – If any item data is invalid (e.g., missing required fields).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Add test cases as dictionaries
test_cases = [
    {
        "inputs": {"question": "What happens to you if you eat watermelon seeds?"},
        "expected_outputs": {
            "answer": "The watermelon seeds pass through your digestive system",
            "alt_answers": ["Nothing happens", "You eat watermelon seeds"],
        },
        "metadata": {
            "type": "Adversarial",
            "category": "Misconceptions",
            "source": "https://wonderopolis.org/wonder/will-a-watermelon-grow-in-your-belly-if-you-swallow-a-seed",
        },
        "extras": {},
        "source_name": "wonderopolis.org",
        "source_id": "1",
    },
]

# Add test cases
item_ids = dataset.add_items(test_cases)
print(f"Added {len(item_ids)} test cases")
print(f"Item IDs: {item_ids}")

# Add test cases as NewDatasetItem objects
from fiddler_evals.pydantic_models.dataset import NewDatasetItem

items = [
    NewDatasetItem(
        inputs={"question": "What is the capital of France?"},
        expected_outputs={"answer": "Paris"},
        metadata={"difficulty": "easy"},
        extras={},
        source_name="test_source",
        source_id="item1",
    ),
]

item_ids = dataset.add_items(items)
print(f"Added {len(item_ids)} test cases")

This method automatically generates UUIDs and timestamps for each item. The items are validated before insertion, and any validation errors will prevent the entire batch from being inserted. Use this method for bulk insertion of test cases into datasets.

get_testcases()

Retrieve all test case items in the dataset.

Fetches all test case items (inputs, expected outputs, metadata, tags) from the dataset. Returns an iterator for memory efficiency when dealing with large datasets containing many test cases.

  • Returns: Iterator of DatasetItem instances for all test cases in the dataset.

  • Return type: Iterator[DatasetItem]

  • Raises: ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Get all test cases in the dataset
for item in dataset.get_testcases():
    print(f"Test case ID: {item.id}")
    print(f"Inputs: {item.inputs}")
    print(f"Expected outputs: {item.expected_outputs}")
    print(f"Metadata: {item.metadata}")
    print("---")

# Convert to list for analysis
all_items = list(dataset.get_testcases())
print(f"Total test cases: {len(all_items)}")

# Filter items by metadata
high_priority_items = [
    item for item in dataset.get_testcases()
    if item.metadata.get("priority") == "high"
]
print(f"High priority test cases: {len(high_priority_items)}")

# Process items in batches
batch_size = 100
for i, item in enumerate(dataset.get_testcases()):
    if i % batch_size == 0:
        print(f"Processing batch {i // batch_size + 1}")
    # Process item...

This method returns an iterator for memory efficiency. Convert to a list with list(dataset.get_testcases()) if you need to iterate multiple times or get the total count. The iterator fetches items lazily from the API.

get_items()

Retrieve all test case items in the dataset.

Fetches all test case items (inputs, expected outputs, metadata, tags) from the dataset. Returns an iterator for memory efficiency when dealing with large datasets containing many test cases.

  • Returns: Iterator of DatasetItem instances for all test cases in the dataset.

  • Return type: Iterator[DatasetItem]

  • Raises: ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing dataset
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application_id)

# Get all test cases in the dataset
for item in dataset.get_items():
    print(f"Test case ID: {item.id}")
    print(f"Inputs: {item.inputs}")
    print(f"Expected outputs: {item.expected_outputs}")
    print(f"Metadata: {item.metadata}")
    print("---")

# Convert to list for analysis
all_items = list(dataset.get_items())
print(f"Total test cases: {len(all_items)}")

# Filter items by metadata
high_priority_items = [
    item for item in dataset.get_items()
    if item.metadata.get("priority") == "high"
]
print(f"High priority test cases: {len(high_priority_items)}")

# Process items in batches
batch_size = 100
for i, item in enumerate(dataset.get_items()):
    if i % batch_size == 0:
        print(f"Processing batch {i // batch_size + 1}")
    # Process item...

This method returns an iterator for memory efficiency. Convert to a list with list(dataset.get_items()) if you need to iterate multiple times or get the total count. The iterator fetches items lazily from the API.
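
For a quick peek at a few items without materializing the whole dataset, a small hedged sketch using only the standard library:

from itertools import islice

# Inspect the first five test cases; the iterator keeps fetching lazily.
for item in islice(dataset.get_items(), 5):
    print(item.id, item.inputs)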

Experiment

Represents an Experiment for tracking evaluation runs and results.

An Experiment is a single evaluation run of a test suite against a specific application/LLM/Agent version and evaluators. Experiments provide comprehensive tracking, monitoring, and result management for GenAI evaluation workflows, enabling systematic testing and performance analysis.

Key Features:

  • Evaluation Tracking: Complete lifecycle tracking of evaluation runs

  • Status Management: Real-time status updates (PENDING, IN_PROGRESS, COMPLETED, etc.)

  • Dataset Integration: Linked to specific datasets for evaluation

  • Result Storage: Comprehensive storage of results, metrics, and error information

  • Error Handling: Detailed error tracking with traceback information

Experiment Lifecycle:

  1. Creation: Create experiment with dataset and application references

  2. Execution: Experiment runs evaluation against the dataset

  3. Monitoring: Track status and progress in real-time

  4. Completion: Retrieve results, metrics, and analysis

  5. Cleanup: Archive or delete completed experiments

Example

# Use this class to list
experiments = Experiment.list(
    application_id=application_id,
    dataset_id=dataset_id,
)

Experiments are permanent records of evaluation runs. Once created, the name cannot be changed, but metadata and description can be updated. Failed experiments retain error information for debugging and analysis.
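
As a hedged sketch of the monitoring step, an experiment's status can be re-read from the server until it reaches a terminal state (refreshing via get_by_id is an assumption about the pattern; adapt it to your workflow):

import time

experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)
while experiment.status in (ExperimentStatus.PENDING, ExperimentStatus.IN_PROGRESS):
    time.sleep(10)  # poll interval; tune as needed
    experiment = Experiment.get_by_id(id_=experiment.id)  # refresh state from the server
print(f"Final status: {experiment.status}")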

description : str | None = None

error_reason : str | None = None

error_message : str | None = None

traceback : str | None = None

duration_ms : int | None = None

get_app_url()

Get the application URL for this experiment

  • Return type: str

classmethod get_by_id(id_)

Retrieve an experiment by its unique identifier.

Fetches an experiment from the Fiddler platform using its UUID. This is the most direct way to retrieve an experiment when you know its ID.

  • Parameters: id_ (UUID | str) – The unique identifier (UUID) of the experiment to retrieve. Can be provided as a UUID object or string representation.

  • Returns: The experiment instance with all metadata and configuration.

  • Return type: Experiment

  • Raises:

  • NotFound – If no experiment exists with the specified ID.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get experiment by UUID
experiment = Experiment.get_by_id(id_="550e8400-e29b-41d4-a716-446655440000")
print(f"Retrieved experiment: {experiment.name}")
print(f"Status: {experiment.status}")
print(f"Created: {experiment.created_at}")
print(f"Application: {experiment.application.name}")

This method makes an API call to fetch the latest experiment state from the server. The returned experiment instance reflects the current state in Fiddler.

classmethod get_by_name(name, application_id)

Retrieve an experiment by name within an application.

Finds and returns an experiment using its name within the specified application. This is useful when you know the experiment name and application but not its UUID. Experiment names are unique within an application, making this a reliable lookup method.

Parameters

Parameter
Type
Required
Default
Description

name

str

None

The name of the experiment to retrieve. Experiment names are unique within an application and are case-sensitive.

application_id

UUID | str

None

The UUID of the application containing the experiment. Can be provided as a UUID object or string representation.

  • Returns: The experiment instance matching the specified name.

  • Return type: Experiment

  • Raises:

  • NotFound – If no experiment exists with the specified name in the application.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get application instance
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)

# Get experiment by name within an application
experiment = Experiment.get_by_name(
    name="fraud-detection-eval-v1",
    application_id=application.id
)
print(f"Found experiment: {experiment.name} (ID: {experiment.id})")
print(f"Status: {experiment.status}")
print(f"Created: {experiment.created_at}")
print(f"Dataset: {experiment.dataset.name}")

Experiment names are case-sensitive and must match exactly. Use this method when you have a known experiment name from configuration or user input.

classmethod list(application_id, dataset_id=None)

List all experiments in an application.

Retrieves all experiments that the current user has access to within the specified application. Returns an iterator for memory efficiency when dealing with many experiments.

Parameters

Parameter
Type
Required
Default
Description

application_id

UUID | str

None

The UUID of the application to list experiments from. Can be provided as a UUID object or string representation.

dataset_id

UUID | str | None

None

The UUID of the dataset to list experiments from. Can be provided as a UUID object or string representation.

  • Yields: Experiment – Experiment instances for all accessible experiments in the application.

  • Raises: ApiError – If there’s an error communicating with the Fiddler API.

  • Return type: Iterator[Experiment]

Example

# Get application instance
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application.id)

# List all experiments in an application
for experiment in Experiment.list(application_id=application.id, dataset_id=dataset.id):
    print(f"Experiment: {experiment.name}")
    print(f"  ID: {experiment.id}")
    print(f"  Status: {experiment.status}")
    print(f"  Created: {experiment.created_at}")
    print(f"  Dataset: {experiment.dataset.name}")

# Convert to list for counting and filtering
experiments = list(Experiment.list(application_id=application.id, dataset_id=dataset.id ))
print(f"Total experiments in application: {len(experiments)}")

# Find experiments by status
completed_experiments = [
    exp for exp in Experiment.list(application_id=application.id, dataset_id=dataset.id)
    if exp.status == ExperimentStatus.COMPLETED
]
print(f"Completed experiments: {len(completed_experiments)}")

# Find experiments by name pattern
eval_experiments = [
    exp for exp in Experiment.list(application_id=application.id, dataset_id=dataset.id)
    if "eval" in exp.name.lower()
]
print(f"Evaluation experiments: {len(eval_experiments)}")

This method returns an iterator for memory efficiency. Convert to a list with list(Experiment.list(application_id)) if you need to iterate multiple times or get the total count. The iterator fetches experiments lazily from the API.

classmethod create(name, application_id, dataset_id, description=None, metadata=None)

Create a new experiment in an application.

Creates a new experiment within the specified application on the Fiddler platform. The experiment must have a unique name within the application and will be linked to the specified dataset for evaluation.

Note: It is not recommended to use this method directly. Instead, use the evaluate method. Creating and managing an experiment without the evaluate wrapper is an extremely advanced use case and should be avoided.

Parameters

Parameter
Type
Required
Default
Description

name

str

None

Experiment name, must be unique within the application.

application_id

UUID | str

None

The UUID of the application to create the experiment in. Can be provided as a UUID object or string representation.

dataset_id

UUID | str

None

The UUID of the dataset to use for evaluation. Can be provided as a UUID object or string representation.

description

str | None

None

Optional human-readable description of the experiment.

metadata

dict | None

None

Optional custom metadata dictionary for additional experiment information.

  • Returns: The newly created experiment instance with server-assigned fields.

  • Return type: Experiment

  • Raises:

  • Conflict – If an experiment with the same name already exists in the application.

  • ValidationError – If the experiment configuration is invalid (e.g., invalid name format).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get application and dataset instances
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application.id)

# Create a new experiment for fraud detection evaluation
experiment = Experiment.create(
    name="fraud-detection-eval-v1",
    application_id=application.id,
    dataset_id=dataset.id,
    description="Comprehensive evaluation of fraud detection model v1.0",
    metadata={"model_version": "1.0", "evaluation_type": "comprehensive", "baseline": "true"}
)
print(f"Created experiment with ID: {experiment.id}")
print(f"Status: {experiment.status}")
print(f"Created at: {experiment.created_at}")
print(f"Application: {experiment.application.name}")
print(f"Dataset: {experiment.dataset.name}")

After successful creation, the experiment instance is returned with server-assigned metadata. The experiment is immediately available for execution and monitoring. The initial status will be PENDING.

classmethod get_or_create(name, application_id, dataset_id, description=None, metadata=None)

Get an existing experiment by name or create a new one if it doesn’t exist.

This is a convenience method that attempts to retrieve an experiment by name within an application, and if not found, creates a new experiment with that name. Useful for idempotent experiment setup in automation scripts and deployment pipelines.

Parameters

Parameter
Type
Required
Default
Description

name

str

None

The name of the experiment to retrieve or create.

application_id

UUID | str

None

The UUID of the application to search/create the experiment in. Can be provided as a UUID object or string representation.

dataset_id

UUID | str

None

The UUID of the dataset to use for evaluation. Can be provided as a UUID object or string representation.

description

str | None

None

Optional human-readable description of the experiment.

metadata

dict | None

None

Optional custom metadata dictionary for additional experiment information.

  • Returns: Either the existing experiment with the specified name, or a newly created experiment if none existed.

  • Return type: Experiment

  • Raises:

  • ValidationError – If the experiment name format is invalid.

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get application and dataset instances
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application.id)

# Safe experiment setup - get existing or create new
experiment = Experiment.get_or_create(
    name="fraud-detection-eval-v1",
    application_id=application.id,
    dataset_id=dataset.id,
    description="Comprehensive evaluation of fraud detection model v1.0",
    metadata={"model_version": "1.0", "evaluation_type": "comprehensive"}
)
print(f"Using experiment: {experiment.name} (ID: {experiment.id})")

# Idempotent setup in deployment scripts
experiment = Experiment.get_or_create(
    name="llm-benchmark-eval",
    application_id=application.id,
    dataset_id=dataset.id,
    metadata={"baseline": "true"}
)

# Use in configuration management
model_versions = ["v1.0", "v1.1", "v2.0"]
experiments = {}

for version in model_versions:
    experiments[version] = Experiment.get_or_create(
        name=f"fraud-detection-eval-{version}",
        application_id=application.id,
        dataset_id=dataset.id,
        metadata={"model_version": version}
    )

This method is idempotent - calling it multiple times with the same name and application_id will return the same experiment. It logs when creating a new experiment for visibility in automation scenarios.

update()

Update experiment description, metadata, and status.

Updates the experiment’s description, metadata, and/or status. This method allows you to modify the experiment’s configuration after creation, including updating the experiment status and error information for failed experiments.

Parameters

Parameter
Type
Required
Default
Description

description

str | None

None

Optional new description for the experiment. If provided, replaces the existing description. Set to empty string to clear.

metadata

dict | None

None

Optional new metadata dictionary for the experiment. If provided, replaces the existing metadata completely. Use empty dict to clear.

status

ExperimentStatus | None

None

Optional new status for the experiment (e.g. IN_PROGRESS, COMPLETED, FAILED). When set to FAILED, error_reason, error_message, and traceback must also be provided.

error_reason

str | None

None

Required when status is FAILED. The reason for the experiment failure.

error_message

str | None

None

Required when status is FAILED. Detailed error message for the failure.

traceback

str | None

None

Required when status is FAILED. Stack trace information for debugging.

duration_ms

int | None

None

Optional duration in milliseconds for the experiment execution

  • Returns: The updated experiment instance with new metadata and configuration.

  • Return type: Experiment

  • Raises:

  • ValueError – If no update parameters are provided (all are None) or if status is FAILED but error_reason, error_message, or traceback are missing.

  • ValidationError – If the update data is invalid (e.g., invalid metadata format).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Update description and metadata
updated_experiment = experiment.update(
    description="Updated comprehensive evaluation of fraud detection model v1.1",
    metadata={"model_version": "1.1", "evaluation_type": "comprehensive", "updated_by": "john_doe"}
)
print(f"Updated experiment: {updated_experiment.name}")
print(f"New description: {updated_experiment.description}")

# Update only metadata
experiment.update(metadata={"last_updated": "2024-01-15", "status": "active"})

# Update experiment status to completed
experiment.update(status=ExperimentStatus.COMPLETED)

# Mark experiment as failed with error details
experiment.update(
    status=ExperimentStatus.FAILED,
    error_reason="Evaluation timeout",
    error_message="The evaluation process exceeded the maximum allowed time",
    traceback="Traceback (most recent call last): File evaluate.py, line 42..."
)

# Clear description
experiment.update(description="")

# Batch update multiple experiments
for experiment in Experiment.list(application_id=application_id):
    if experiment.status == ExperimentStatus.COMPLETED:
        experiment.update(metadata={"archived": "true"})

This method performs a complete replacement of the specified fields. For partial updates, retrieve current values, modify them, and pass the complete new values. The experiment name and ID cannot be changed. When updating status to FAILED, all error-related parameters are required.
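
Because metadata is replaced wholesale, a "partial" metadata update is a read-modify-write. A hedged sketch (it assumes the instance exposes its current metadata as experiment.metadata, which is not shown explicitly in this reference):

# Merge one new key into the existing metadata instead of overwriting it.
current_metadata = dict(experiment.metadata or {})  # attribute name assumed
current_metadata["reviewed"] = "true"
experiment.update(metadata=current_metadata)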

delete()

Delete the experiment.

Permanently deletes the experiment and all associated data from the Fiddler platform. This action cannot be undone and will remove all experiment results, metrics, and metadata.

  • Raises: ApiError – If there’s an error communicating with the Fiddler API.

  • Return type: None

Example

# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Delete the experiment
experiment.delete()
print("Experiment deleted successfully")

# Delete multiple experiments
for experiment in Experiment.list(application_id=application_id):
    if experiment.status == ExperimentStatus.FAILED:
        print(f"Deleting failed experiment: {experiment.name}")
        experiment.delete()

This operation is irreversible. Once deleted, the experiment and all its associated data cannot be recovered. Consider archiving experiments instead of deleting them if you need to preserve historical data.

add_items()

Add outputs of LLM/Agent/Application against dataset items to the experiment.

Adds outputs of LLM/Agent/Application (task or target function) against dataset items to the experiment, representing individual test case outcomes. Each item contains the outputs of LLM/Agent/Application results, timing information, and status for a specific dataset item.

Parameters

Parameter
Type
Required
Default
Description

items

list[NewExperimentItem]

None

List of NewExperimentItem instances containing outputs of the LLM/Agent/Application against dataset items. Each item should include: dataset_item_id (UUID of the dataset item being evaluated); outputs (dictionary containing the outputs of the task function for the dataset item); duration_ms (duration of the execution in milliseconds); status (status of the task function output/scoring for the dataset item, e.g. PENDING, COMPLETED, FAILED); error_reason (reason for failure, if applicable); error_message (detailed error message, if applicable)

  • Returns: List of UUIDs for the newly created experiment items.

  • Return type: builtins.list[UUID]

  • Raises:

  • ValueError – If the items list is empty.

  • ValidationError – If any item data is invalid (e.g., missing required fields).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Create evaluation result items
from fiddler_evals.pydantic_models.experiment import NewExperimentItem
from datetime import datetime, timezone

items = [
    NewExperimentItem(
        dataset_item_id=dataset_item_id_1,
        outputs={"answer": "The watermelon seeds pass through your digestive system"},
        duration_ms=1000,
        end_time=datetime.now(tz=timezone.utc),
        status="COMPLETED",
        error_reason=None,
        error_message=None
    ),
    NewExperimentItem(
        dataset_item_id=dataset_item_id_2,
        outputs={"answer": "The precise origin of fortune cookies is unclear"},
        duration_ms=1000,
        end_time=datetime.now(tz=timezone.utc),
        status="COMPLETED",
        error_reason=None,
        error_message=None
    )
]

# Add items to experiment
item_ids = experiment.add_items(items)
print(f"Added {len(item_ids)} evaluation result items")
print(f"Item IDs: {item_ids}")

# Add items from evaluation results
# (assumes `raw_results` is a list of dicts produced by your own task runs,
#  each carrying the dataset_item_id it was evaluated against)
items = [
    {
        "dataset_item_id": str(result["dataset_item_id"]),
        "outputs": {"answer": result["answer"]},
        "duration_ms": result["duration_ms"],
        "end_time": result["end_time"],
        "status": "COMPLETED"
    }
    for result in raw_results
]
item_ids = experiment.add_items([NewExperimentItem(**item) for item in items])

# Batch add items with error handling
try:
    item_ids = experiment.add_items(items)
    print(f"Successfully added {len(item_ids)} items")
except ValueError as e:
    print(f"Validation error: {e}")
except Exception as e:
    print(f"Failed to add items: {e}")

This method is typically used after running evaluations to store the results in the experiment. Each item represents the evaluation of a single dataset item and contains all relevant timing, output, and status information.

get_items()

Retrieve all experiment result items from the experiment.

Fetches all experiment result items (outputs, timing, status) that were generated by the task function against dataset items. Returns an iterator for memory efficiency when dealing with large experiments containing many result items.

  • Returns: Iterator of ExperimentItem instances for all result items in the experiment.

  • Return type: Iterator[ExperimentItem]

  • Raises: ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Get all result items from the experiment
for item in experiment.get_items():
    print(f"Item ID: {item.id}")
    print(f"Dataset Item ID: {item.dataset_item_id}")
    print(f"Outputs: {item.outputs}")
    print(f"Status: {item.status}")
    print(f"Duration: {item.duration_ms}")
    if item.error_reason:
        print(f"Error: {item.error_reason} - {item.error_message}")
    print("---")

# Convert to list for analysis
all_items = list(experiment.get_items())
print(f"Total result items: {len(all_items)}")

# Filter items by status
completed_items = [
    item for item in experiment.get_items()
    if item.status == "COMPLETED"
]
print(f"Completed items: {len(completed_items)}")

# Filter items by error status
failed_items = [
    item for item in experiment.get_items()
    if item.status == "FAILED"
]
print(f"Failed items: {len(failed_items)}")

# Process items in batches
batch_size = 100
for i, item in enumerate(experiment.get_items()):
    if i % batch_size == 0:
        print(f"Processing batch {i // batch_size + 1}")
    # Process item...

# Analyze outputs
for item in experiment.get_items():
    if item.outputs.get("confidence", 0) < 0.8:
        print(f"Low confidence item: {item.id}")

This method returns an iterator for memory efficiency. Convert to a list with list(experiment.get_items()) if you need to iterate multiple times or get the total count. The iterator fetches items lazily from the API.
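
A hedged sketch of a simple aggregation over the result items, using the fields shown in the example above:

# Average task duration across completed items.
durations = [
    item.duration_ms
    for item in experiment.get_items()
    if item.status == "COMPLETED" and item.duration_ms is not None
]
if durations:
    print(f"Average duration: {sum(durations) / len(durations):.0f} ms")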

add_results()

Add evaluation results to the experiment.

Adds complete evaluation results to the experiment, including both the experiment item data (outputs, timing, status) and all associated evaluator scores. This method is typically used after running evaluations to store the complete results of the evaluation process for a batch of dataset items.

This method will only append the results to the experiment.

Note: It is not recommended to use this method directly. Instead, use the evaluate method. Creating and managing an experiment without the evaluate wrapper is an extremely advanced use case and should be avoided.

Parameters

Parameter
Type
Required
Default
Description

items

list[ExperimentItemResult]

None

List of ExperimentItemResult instances, each containing: experiment_item (a NewExperimentItem with outputs, timing, and status) and scores (a list of Score objects from evaluators for this item)

  • Returns: None; the results are added to the experiment on the server.

  • Return type: None

  • Raises:

  • ValueError – If the items list is empty.

  • ValidationError – If any item data is invalid (e.g., missing required fields).

  • ApiError – If there’s an error communicating with the Fiddler API.

Example

# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Create experiment item with outputs
experiment_item = NewExperimentItem(
    dataset_item_id=dataset_item.id,
    outputs={"prediction": "fraud", "confidence": 0.95},
    duration_ms=1000,
    end_time=datetime.now(tz=timezone.utc),
    status="COMPLETED"
)

# Create scores from evaluators
scores = [
    Score(
        name="accuracy",
        evaluator_name="AccuracyEvaluator",
        value=1.0,
        label="Correct",
        reasoning="Prediction matches ground truth"
    ),
    Score(
        name="confidence",
        evaluator_name="ConfidenceEvaluator",
        value=0.95,
        label="High",
        reasoning="High confidence in prediction"
    )
]

# Create result combining item and scores
result = ExperimentItemResult(
    experiment_item=experiment_item,
    scores=scores
)

# Add results to experiment
experiment.add_results([result])

This method is typically called after running evaluations to store complete results. The results include both the experiment item data and all evaluator scores, providing a complete record of the evaluation process.

ExperimentStatus

Values:

  • PENDING - PENDING

  • IN_PROGRESS - IN_PROGRESS

  • COMPLETED - COMPLETED

  • FAILED - FAILED

  • CANCELLED - CANCELLED

ExperimentItemStatus

Values:

  • SUCCESS - SUCCESS

  • FAILED - FAILED

  • SKIPPED - SKIPPED

ScoreStatus

The status of a score. Values:

  • SUCCESS - SUCCESS

  • FAILED - FAILED

  • SKIPPED - SKIPPED

NewDatasetItem

Model to create a new dataset item

model_config : ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'ignore'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

DatasetItem

Dataset item from Fiddler API

model_config : ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'ignore'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

Score

A single output of an evaluator.

  • status (ScoreStatus)

  • reasoning (str | None)

  • error_reason (str | None)

  • error_message (str | None)

model_config : ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'ignore'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

evaluate()

Evaluate a dataset using a task function and a list of evaluators.

This is the main entry point for running evaluation experiments. It creates an experiment, runs the evaluation task on all dataset items, and executes the specified evaluators to generate scores.

The function automatically:

  1. Creates a new experiment with a unique name

  2. Runs the evaluation task on each dataset item

  3. Executes all evaluators on the task outputs

  4. Returns comprehensive results with timing and error information

Key Features:

  • Automatic Experiment Creation: Creates experiments with unique names

  • Task Execution: Runs custom evaluation tasks on dataset items

  • Evaluator Orchestration: Executes multiple evaluators on outputs

  • Error Handling: Gracefully handles task and evaluator failures

  • Result Collection: Returns detailed results with timing information

  • Flexible Configuration: Supports custom parameter mapping for evaluators

  • Concurrent Processing: Supports concurrent processing of dataset items

Use Cases:

  • Model Evaluation: Evaluate LLM models on test datasets

  • A/B Testing: Compare different model versions or configurations

  • Quality Assurance: Validate model performance across different inputs

  • Benchmarking: Run standardized evaluations on multiple models

Parameters

Parameter
Type
Required
Default
Description

task

Callable[[Dict[str, Any], Dict[str, Any], Dict[str, Any]], Dict[str, Any]]

None

Function that processes dataset items and returns outputs. Must accept (inputs, extras, metadata) and return dict of outputs.

name_prefix

str | None

None

Optional prefix for the experiment name. If not provided, uses the dataset name as prefix. A unique ID is always appended.

description

str | None

None

Optional description for the experiment.

metadata

dict | None

None

Optional metadata dictionary for the experiment.

score_fn_kwargs_mapping

Dict[str, str | Callable[[Dict[str, Any]], Any]] | None

None

Optional evaluation-level mapping for transforming evaluator parameters. Maps parameter names to either string keys or transformation functions. This mapping has lower priority than evaluator-level mappings set in the evaluator constructor, allowing evaluators to define sensible defaults while still permitting customization at the evaluation level.

max_workers

int

None

Maximum number of workers to use for concurrent processing. Use more than 1 only if the eval task function is thread-safe.

  • Returns: List of ExperimentItemResult objects, each containing the experiment item data and scores for one dataset item.

  • Return type: ExperimentResult

  • Raises:

  • ValueError – If dataset is empty or evaluators are invalid.

  • RuntimeError – If no connection is available for API calls.

  • ApiError – If there’s an error creating the experiment or communicating with the Fiddler API.

Example

from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness, RegexSearch
from fiddler_evals import Dataset

# Get dataset
dataset = Dataset.get_by_name("my-eval-dataset")

# Define evaluation task
def eval_task(inputs, extras, metadata):
    # Your model inference logic here
    question = inputs["question"]
    answer = my_model.generate(question)
    return {"answer": answer, "question": question}

# Example 1: Basic evaluation with parameter mapping
results = evaluate(
    dataset=dataset,
    task=eval_task,
    evaluators=[AnswerRelevance(), Conciseness()],
    name_prefix="my-model-eval",
    description="Evaluation of my model on Q&A dataset",
    metadata={"model_version": "v1.0", "temperature": 0.7},
    score_fn_kwargs_mapping={
        "output": "answer",
        "question": lambda x: x["inputs"]["question"]
    }
)

# Example 2: Multiple evaluator instances with score_name_prefix for differentiation
evaluators = [
    RegexSearch(
        r"\d+",
        score_name_prefix="question",
        score_name="has_number",
        score_fn_kwargs_mapping={"output": "question"}
    ),
    RegexSearch(
        r"\d+",
        score_name_prefix="answer",
        score_name="has_number",
        score_fn_kwargs_mapping={"output": "answer"}
    )
]
results = evaluate(
    dataset=dataset,
    task=eval_task,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        "question": lambda x: x["inputs"]["question"],
        # Note: "answer" mapping not needed since evaluator defines it
    }
)
# Process results
for result in results:
    item_id = result.experiment_item.dataset_item_id
    status = result.experiment_item.status
    print(f"Item {item_id}: {status}")

    for score in result.scores:
        print(f"  {score.name}: {score.value} ({score.status})")

The function processes dataset items sequentially by default; set max_workers greater than 1 to process items concurrently, provided the task function is thread-safe. The experiment name is automatically made unique by appending the current datetime.

Parameter Mapping Priority: When both evaluator-level and evaluation-level mappings are present, evaluator-level mappings take precedence. This allows evaluators to define sensible defaults while still permitting customization at the evaluation level.

Mapping Priority (highest to lowest):

  1. Evaluator-level score_fn_kwargs_mapping (set in evaluator constructor)

  2. Evaluation-level score_fn_kwargs_mapping (passed to evaluate function)

  3. Default parameter resolution
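
A hedged sketch of this priority order, reusing the hypothetical MyEvaluator from the Evaluator constructor examples later in this document (its score method is assumed to take an output parameter):

# Evaluator-level mapping wins for this evaluator...
evaluator = MyEvaluator(score_fn_kwargs_mapping={"output": "summary"})

results = evaluate(
    dataset=dataset,
    task=eval_task,
    evaluators=[evaluator],
    # ...while the evaluation-level mapping only applies where no
    # evaluator-level mapping is defined.
    score_fn_kwargs_mapping={"output": "answer"},
)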

Evaluator

Abstract base class for creating custom evaluators in Fiddler Evals.

The Evaluator class provides a flexible framework for creating builtin and custom evaluators that can assess LLM outputs against various criteria. Each evaluator is responsible for a single, specific evaluation task (e.g., hallucination detection, answer relevance, exact match, etc.).

Parameter Mapping: Evaluators can define their own parameter mappings using score_fn_kwargs_mapping in the constructor. These mappings specify how data from the evaluation context (inputs, outputs, expected_outputs) should be passed to the evaluator's score method. Mapping Priority (highest to lowest):

  1. Evaluator-level score_fn_kwargs_mapping (set in constructor)

  2. Evaluation-level score_fn_kwargs_mapping (passed to the evaluate function)

  3. Default parameter resolution

This allows evaluators to define sensible defaults while still permitting customization at the evaluation level.

Creating Custom Evaluators: To create a custom evaluator, inherit from this class and implement the score method with parameters specific to your evaluation needs.

Example - Custom evaluator with parameter mapping:

class ExactMatchEvaluator(Evaluator):
    def __init__(self, output_key: str = "answer", score_name_prefix: str | None = None):
        super().__init__(
            score_name_prefix=score_name_prefix,
            score_fn_kwargs_mapping={"output": output_key},
        )

    def score(self, output: str, expected_output: str) -> Score:
        is_match = output.strip().lower() == expected_output.strip().lower()
        return Score(
            name=f"{self.score_name_prefix}exact_match",
            value=1.0 if is_match else 0.0,
            reasoning=f"Match: {is_match}",
        )
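
A hedged usage sketch for the class above (the score name concatenation simply follows the example's own naming convention):

evaluator = ExactMatchEvaluator(output_key="answer", score_name_prefix="qa_")
score = evaluator.score(output="Paris", expected_output="paris")
print(score.value)  # 1.0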

Parameters

Parameter
Type
Required
Default
Description

score_name_prefix

str | None

None

Optional prefix to prepend to score names. Useful for distinguishing scores when using multiple instances of the same evaluator on different fields or with different configurations.

score_fn_kwargs_mapping

ScoreFnKwargsMappingType | None

None

Optional mapping for parameter transformation. Maps parameter names to either string keys or transformation functions. This mapping takes precedence over evaluation-level mappings when running the evaluate method.

The score method signature is intentionally flexible using *args and **kwargs to allow each evaluator to define its own parameter requirements. This design enables maximum flexibility while maintaining a consistent interface across all evaluators in the framework.

Initialize the evaluator with parameter mapping configuration.

Parameters

Parameter
Type
Required
Default
Description

score_name_prefix

str | None

None

Optional prefix to prepend to score names. Useful for distinguishing scores when using multiple instances of the same evaluator on different fields or with different configurations.

score_fn_kwargs_mapping

Dict[str, str | Callable[[Dict[str, Any]], Any]] | None

None

Optional mapping for parameter transformation. Maps parameter names to either string keys or transformation functions. This mapping takes precedence over evaluation-level mappings when running the evaluate method.

  • Return type: None

Example

# Simple string mapping
evaluator = MyEvaluator(score_fn_kwargs_mapping={"output": "answer"})

# Complex transformation function
evaluator = MyEvaluator(score_fn_kwargs_mapping={
    "question": lambda x: x["inputs"]["question"],
    "response": "answer"
})

# Using score name prefix for multiple instances
evaluator1 = RegexSearch(r"\d+", score_name_prefix="question")
evaluator2 = RegexSearch(r"\d+", score_name_prefix="answer")
# Results in scores named "question_has_number" and "answer_has_number"
  • Raises: ScoreFunctionInvalidArgs – If the mapping contains invalid parameter names that don't match the evaluator's score method signature.

property name : str

abstractmethod score(*args, **kwargs)

Evaluate inputs and return a score or list of scores.

This method must be implemented by all concrete evaluator classes. Each evaluator can define its own parameter signature based on what it needs for evaluation.

Common parameter patterns:

  • Output-only: score(self, output: str) -> Score

  • Input-Output: score(self, input: str, output: str) -> Score

  • Comparison: score(self, output: str, expected_output: str) -> Score

  • All parameters: score(self, input: str, output: str, context: list[str]) -> Score

Parameters

Parameter
Type
Required
Default
Description

*args

Any

None

Positional arguments specific to the evaluator’s needs.

  • Returns: A single Score object or list of Score objects representing the evaluation results. Each Score should include:

  • name: The score name (e.g., “has_zipcode”)

  • evaluator_name: The evaluator name (e.g., “RegexMatch”)

  • value: The score value (typically 0.0 to 1.0)

  • status: SUCCESS, FAILED, or SKIPPED

  • reasoning: Optional explanation of the score

  • error: Optional error information if evaluation failed

  • Return type: Score | list[Score]

  • Raises:

  • ValueError – If required parameters are missing or invalid.

  • TypeError – If parameters have incorrect types.

  • Exception – Any other evaluation-specific errors.

EvalFn

Evaluator that wraps a user-provided function for dynamic evaluation.

This class allows users to create evaluators from any callable function, automatically handling parameter passing, validation, and result conversion to Score objects.

Key Features:

  • Dynamic Function Wrapping: Converts any callable into an evaluator

  • Argument Validation: Validates that provided arguments match function signature

  • Smart Result Conversion: Automatically converts various return types to Score

  • Error Handling: Gracefully handles function execution and argument errors

  • Parameter Flexibility: Supports functions with any parameter signature

Parameters

Parameter
Type
Required
Default
Description

fn

Callable

None

The callable function to wrap as an evaluator.

score_name

str | None

None

Optional custom name for the score. If not provided, uses the function name.

Example

def equals(a, b):
    return a == b

evaluator = EvalFn(equals, score_name="exact_match")
score = evaluator.score(a="hello", b="hello")
print(score.value)  # 1.0

def length_check(text, min_length=5):
    return len(text) >= min_length

evaluator = EvalFn(length_check)

# Invalid arguments raise TypeError
try:
    evaluator.score(wrong_param="value")
except TypeError as e:
    print(f"Error: {e}")

property name : str

score()

Execute the wrapped function and convert result to Score.

Calls the wrapped function with the provided arguments and converts the result to a Score object. Validates that the provided arguments match the function’s signature.

Parameters

Parameter
Type
Required
Default
Description

*args

Any

None

Positional arguments to pass to the wrapped function.

  • Returns: A Score object representing the function’s evaluation result.

  • Return type: Score

  • Raises: TypeError – If the provided arguments don’t match the function signature.

The function result is converted to a Score as follows:

  • bool: 1.0 for True, 0.0 for False

  • int/float: Direct value conversion

  • Score: Returns as-is
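
A hedged sketch of these conversion rules; word_count is a made-up function whose int return is used directly as the score value:

def word_count(text):
    return len(text.split())

evaluator = EvalFn(word_count, score_name="word_count")
score = evaluator.score(text="the quick brown fox")
print(score.value)  # 4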

AnswerRelevance

Evaluator to assess how well an answer addresses a given question.

The AnswerRelevance evaluator measures whether an LLM’s answer is relevant and directly addresses the question being asked. This is a critical metric for ensuring that LLM responses stay on topic and provide meaningful value to users.

Key Features:

  • Relevance Assessment: Determines if the answer directly addresses the question

  • Binary Scoring: Returns 1.0 for relevant answers, 0.0 for irrelevant ones

  • Detailed Reasoning: Provides explanation for the relevance assessment

  • Fiddler API Integration: Uses Fiddler’s built-in relevance evaluation model

Use Cases:

  • Q&A Systems: Ensuring answers stay on topic

  • Customer Support: Verifying responses address user queries

  • Educational Content: Checking if explanations answer the question

  • Research Assistance: Validating that responses are relevant to queries

Scoring Logic:

  • 1.0 (Relevant): Answer directly addresses the question with relevant information

  • 0.0 (Irrelevant): Answer doesn’t address the question or goes off-topic

Parameters

Parameter
Type
Required
Default
Description

prompt

str

None

The question being asked.

response

str

None

The LLM’s response to evaluate.

  • Returns: A Score object containing:

  • value: 1.0 if relevant, 0.0 if irrelevant

  • label: String representation of the boolean result

  • reasoning: Detailed explanation of the assessment

  • Return type: Score

Example

from fiddler_evals.evaluators import AnswerRelevance

evaluator = AnswerRelevance()

# Relevant answer
score = evaluator.score(
    prompt="What is the capital of France?",
    response="The capital of France is Paris."
)
print(f"Relevance: {score.value}")  # 1.0
print(f"Reasoning: {score.reasoning}")

# Irrelevant answer
score = evaluator.score(
    prompt="What is the capital of France?",
    response="I like pizza and Italian food."
)
print(f"Relevance: {score.value}")  # 0.0

This evaluator uses Fiddler’s built-in relevance assessment model and requires an active connection to the Fiddler API.

name = 'answer_relevance'

score()

Score the relevance of an answer to a question.

Parameters

Parameter
Type
Required
Default
Description

prompt

str

None

The question being asked.

response

str

None

The LLM’s response to evaluate.

  • Returns: A Score object containing:

  • value: 1.0 if relevant, 0.0 if irrelevant

  • label: String representation of the boolean result

  • reasoning: Detailed explanation of the assessment

  • Return type: Score

Coherence

Evaluator to assess the coherence and logical flow of a response.

The Coherence evaluator measures whether a response is well-structured, logically consistent, and flows naturally from one idea to the next. This metric is important for ensuring that responses are easy to follow and understand, with clear connections between different parts of the text.

Key Features:

  • Coherence Assessment: Determines if the response has logical flow and structure

  • Binary Scoring: Returns 1.0 for coherent responses, 0.0 for incoherent ones

  • Optional Context: Can optionally use a prompt for context-aware evaluation

  • Detailed Reasoning: Provides explanation for the coherence assessment

  • Fiddler API Integration: Uses Fiddler’s built-in coherence evaluation model

Use Cases:

  • Content Quality: Ensuring responses are well-structured and logical

  • Educational Content: Verifying explanations flow logically

  • Technical Documentation: Checking if instructions are coherent

  • Creative Writing: Assessing narrative flow and consistency

  • Conversational AI: Ensuring responses make sense in context

Scoring Logic:

  • 1.0 (Coherent): Response has clear logical flow and structure

  • 0.0 (Incoherent): Response lacks logical flow or has structural issues

Parameters

Parameter
Type
Required
Default
Description

response

str

None

The response to evaluate for coherence.

prompt

str, optional

None

The original prompt that generated the response. Used for context-aware coherence evaluation.

  • Returns: A Score object containing:

  • name: "is_coherent"

  • evaluator_name: “Coherence”

  • value: 1.0 if coherent, 0.0 if incoherent

  • label: String representation of the boolean result

  • reasoning: Explanation for the coherence assessment

  • Return type: Score

  • Raises: ValueError – If the response is empty or None, or if no scores are returned from the API.

Example

from fiddler_evals.evaluators import Coherence

evaluator = Coherence()

# Coherent response
score = evaluator.score(
    response="First, we need to understand the problem. Then, we can identify potential solutions. Finally, we should test our approach."
)
print(f"Coherence: {score.value}")  # 1.0

# Incoherent response
incoherent_score = evaluator.score(
    response="The sky is blue. I like pizza. Quantum physics is complex. Let's go shopping."
)
print(f"Coherence: {incoherent_score.value}")  # 0.0

# With context
contextual_score = evaluator.score(
    prompt="Explain the process of making coffee",
    response="First, grind the beans. Then, heat the water. Next, pour water over grounds. Finally, enjoy your coffee."
)
print(f"Coherence: {contextual_score.value}")  # 1.0

# Check coherence
if score.value == 1.0:
    print("Response is coherent and well-structured")

This evaluator uses Fiddler’s built-in coherence assessment model and requires an active connection to the Fiddler API. The optional prompt parameter can provide additional context for more accurate coherence evaluation, especially when the response needs to be evaluated in relation to a specific question or task.

name = 'coherence'

score()

Score the coherence of a response.

Parameters

Parameter
Type
Required
Default
Description

response

str

None

The response to evaluate for coherence.

prompt

str, optional

None

The original prompt that generated the response.

  • Returns: A Score object for coherence assessment.

  • Return type: Score

Conciseness

Evaluator to assess how concise and to-the-point an answer is.

The Conciseness evaluator measures whether an LLM’s answer is appropriately brief and direct without unnecessary verbosity. This metric is important for ensuring that responses are efficient and don’t waste users’ time with irrelevant details or excessive elaboration.

Key Features:

  • Conciseness Assessment: Determines if the answer is appropriately brief

  • Binary Scoring: Returns 1.0 for concise answers, 0.0 for verbose ones

  • Detailed Reasoning: Provides explanation for the conciseness assessment

  • Fiddler API Integration: Uses Fiddler’s built-in conciseness evaluation model

Use Cases:

  • Customer Support: Ensuring responses are direct and helpful

  • Technical Documentation: Verifying explanations are clear and brief

  • Educational Content: Checking if explanations are appropriately detailed

  • API Responses: Ensuring responses are efficient and focused

Scoring Logic:

  • 1.0 (Concise): Answer is appropriately brief and to-the-point

  • 0.0 (Verbose): Answer is unnecessarily long or contains irrelevant details

Parameters

Parameter
Type
Required
Default
Description

response

str

None

The LLM’s response to evaluate for conciseness.

  • Returns: A Score object containing:

  • value: 1.0 if concise, 0.0 if verbose

  • label: String representation of the boolean result

  • reasoning: Detailed explanation of the assessment

  • Return type: Score

Example

from fiddler_evals.evaluators import Conciseness

evaluator = Conciseness()

# Concise answer
score = evaluator.score("The capital of France is Paris.")
print(f"Conciseness: {score.value}")  # 1.0
print(f"Reasoning: {score.reasoning}")

# Verbose answer
score = evaluator.score(
    "Well, that's a great question about France. Let me think about this... "
    "France is a beautiful country in Europe, and it has many wonderful cities... "
    "The capital city of France is Paris, which is located in the north-central part..."
)
print(f"Conciseness: {score.value}")  # 0.0

This evaluator uses Fiddler’s built-in conciseness assessment model and requires an active connection to the Fiddler API.

name = 'conciseness'

score()

Score the conciseness of an answer.

Parameters

Parameter
Type
Required
Default
Description

response

str

None

The LLM’s response to evaluate for conciseness.

  • Returns: A Score object containing:

  • value: 1.0 if concise, 0.0 if verbose

  • label: String representation of the boolean result

  • reasoning: Detailed explanation of the assessment

  • Return type: Score

Toxicity

Evaluator to assess text toxicity using Fiddler’s unbiased toxicity model.

The Toxicity evaluator uses Fiddler’s implementation of the unitary/unbiased-toxic-roberta model to evaluate the toxicity level of text content. This evaluator helps identify potentially harmful, offensive, or inappropriate language in text, providing a probability score for toxicity assessment.

Key Features:

  • Toxicity Assessment: Evaluates text for toxic, harmful, or offensive content

  • Probability-Based Scoring: Returns probability scores (0.0-1.0) for toxicity

  • Unbiased Model: Uses unitary/unbiased-toxic-roberta for fair toxicity detection

  • Fiddler Integration: Leverages Fiddler’s optimized toxicity evaluation model

  • Single Score Output: Returns a single toxicity probability score

Toxicity Categories Evaluated:

  • toxicity_prob: Probability that the text contains toxic content

Use Cases:

  • Content Moderation: Filtering user-generated content for toxicity

  • Social Media Monitoring: Detecting harmful language in posts and comments

  • Chatbot Safety: Ensuring AI responses are not toxic or offensive

  • Community Guidelines: Enforcing platform policies on appropriate language

  • Content Filtering: Automatically flagging potentially harmful content

Scoring Logic: The toxicity score represents the probability that the text contains toxic content (a small helper sketch follows this list):

  • 0.0-0.3: Low toxicity (likely safe content)

  • 0.3-0.7: Medium toxicity (may contain some harmful language)

  • 0.7-1.0: High toxicity (likely contains toxic or offensive content)
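
A small hedged helper mirroring these bands (the thresholds are the illustrative bands above, not an API contract):

def toxicity_band(prob: float) -> str:
    # Map a toxicity probability to the rough bands described above.
    if prob < 0.3:
        return "low"
    if prob < 0.7:
        return "medium"
    return "high"

print(toxicity_band(0.75))  # "high"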

Parameters

Parameter
Type
Required
Default
Description

text

str

None

The text content to evaluate for toxicity.

  • Returns: A Score object containing:

  • name: The toxicity category name ("toxicity_prob")

  • evaluator_name: “Toxicity”

  • value: Probability score (0.0-1.0) for toxicity

  • Return type: Score

  • Raises: ValueError – If the text is empty or None, or if no scores are returned from the API.

Example

from fiddler_evals.evaluators import Toxicity

evaluator = Toxicity()

# Safe content
score = evaluator.score("Hello, how are you today?")
print(f"Toxicity: {score.value}")
# Toxicity: 0.02

# Potentially toxic content
toxic_score = evaluator.score("This is absolutely terrible and offensive!")
print(f"Toxicity: {toxic_score.value}")
# Toxicity: 0.75

# Highly toxic content
very_toxic_score = evaluator.score("You are a worthless piece of garbage!")
print(f"Toxicity: {very_toxic_score.value}")
# Toxicity: 0.95

# Filter based on toxicity threshold
if score.value > 0.7:
    print("Content flagged as potentially toxic")

This evaluator is designed for toxicity assessment and should be used as part of a comprehensive content moderation strategy. The probability scores should be interpreted in context and combined with other safety measures for robust content filtering. The model is trained to be unbiased and fair across different demographics and contexts.

name = 'toxicity'

score()

Score the toxicity of text content.

Parameters

Parameter
Type
Required
Default
Description

text

str

None

The text content to evaluate for toxicity.

  • Returns: A Score object for toxicity probability.

  • Return type: Score

Sentiment

Evaluator to assess text sentiment using Fiddler’s sentiment analysis model.

The Sentiment evaluator uses Fiddler’s implementation of the cardiffnlp/twitter-roberta-base-sentiment-latest model to evaluate the sentiment polarity of text content. This evaluator helps identify the emotional tone and attitude expressed in text, providing both sentiment labels and confidence scores for sentiment classification.

Key Features:

  • Sentiment Classification: Evaluates text for positive, negative, or neutral sentiment

  • Dual Score Output: Returns both sentiment label and probability confidence

  • Fiddler Integration: Leverages Fiddler’s optimized sentiment evaluation model

  • Multi-Score Output: Returns both sentiment label and probability scores

Sentiment Categories Evaluated:

  • sentiment: The predicted sentiment label (positive, negative, neutral)

  • sentiment_prob: Probability score (0.0-1.0) for the predicted sentiment

Use Cases:

  • Social Media Monitoring: Analyzing sentiment in tweets, posts, and comments

  • Customer Feedback Analysis: Understanding customer satisfaction and opinions

  • Brand Monitoring: Tracking public sentiment about products or services

  • Content Moderation: Identifying emotionally charged or problematic content

  • Market Research: Analyzing public opinion and sentiment trends

Scoring Logic: The sentiment evaluation provides two complementary scores:

  • sentiment: The predicted sentiment label:

    • “positive”: Text expresses positive emotions or opinions

    • “negative”: Text expresses negative emotions or opinions

    • “neutral”: Text expresses neutral or balanced sentiment

  • sentiment_prob: Confidence score (0.0-1.0) for the prediction:

    • 0.0-0.3: Low confidence in sentiment prediction

    • 0.3-0.7: Medium confidence in sentiment prediction

    • 0.7-1.0: High confidence in sentiment prediction

Parameters

Parameter
Type
Required
Default
Description

text

str

None

The text content to evaluate for sentiment.

  • Returns: A list of Score objects containing:

  • sentiment: Score object with sentiment label (positive/negative/neutral)

  • sentiment_prob: Score object with probability score (0.0-1.0)

  • Return type: list[Score]

  • Raises: ValueError – If the text is empty or None, or if no scores are returned from the API.

Example

from fiddler_evals.evaluators import Sentiment

evaluator = Sentiment()

# Positive sentiment
scores = evaluator.score("I love this product! It's amazing!")
print(f"Sentiment: {scores[0].label}")
print(f"Confidence: {scores[1].value}")
# Sentiment: positive
# Confidence: 0.95

# Negative sentiment
negative_scores = evaluator.score("This is terrible and disappointing!")
print(f"Sentiment: {negative_scores[0].label}")
print(f"Confidence: {negative_scores[1].value}")
# Sentiment: negative
# Confidence: 0.88

# Neutral sentiment
neutral_scores = evaluator.score("The weather is okay today.")
print(f"Sentiment: {neutral_scores[0].label}")
print(f"Confidence: {neutral_scores[1].value}")
# Sentiment: neutral
# Confidence: 0.72

# Filter based on sentiment and confidence
if scores[0].label == "positive" and scores[1].value > 0.8:
    print("High confidence positive sentiment detected")

This evaluator is optimized for social media and informal text analysis using the cardiffnlp/twitter-roberta-base-sentiment-latest model. It performs best on short, conversational text similar to Twitter posts. For formal or academic text, consider using specialized sentiment analysis models. The dual-score output provides both categorical classification and confidence assessment for robust sentiment analysis workflows.

name = 'sentiment_analysis'

score()

Score the sentiment of text content.

Parameters

Parameter
Type
Required
Default
Description

text

str

None

The text content to evaluate for sentiment.

  • Returns: A list of Score objects for sentiment label and probability.

  • Return type: list[Score]

RegexSearch

Regex search scans the entire string from beginning to end, looking for the first occurrence where the regex pattern matches.

property match_fn : Callable

Match function to use for the regex evaluator.

RegexMatch

Regex match attempts to match the regex pattern only at the beginning of the string.

property match_fn : Callable

Match function to use for the regex evaluator.
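This distinction mirrors Python’s standard re functions: re.match succeeds only when the pattern matches at the start of the string, while re.search scans the whole string for the first match. A minimal sketch of that difference, assuming the evaluators expose these standard-library functions via match_fn:

import re

pattern = r"\d{3}-\d{4}"
text = "Call 555-0199 for details"

# re.match succeeds only when the pattern matches at the very start of the string.
print(bool(re.match(pattern, text)))   # False: the string begins with "Call"

# re.search scans the whole string and returns the first match found anywhere.
print(bool(re.search(pattern, text)))  # True: matches "555-0199" mid-string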

FTLPromptSafety

Evaluator to assess prompt safety using Fiddler’s Trust Model.

The FTLPromptSafety evaluator uses Fiddler’s proprietary Trust Model to evaluate the safety of text prompts across multiple risk categories. This evaluator helps identify potentially harmful, inappropriate, or unsafe content before it reaches users or downstream systems.

Key Features:

  • Multi-Dimensional Safety Assessment: Evaluates 11 different safety categories

  • Probability-Based Scoring: Returns probability scores (0.0-1.0) for each risk category

  • Comprehensive Risk Coverage: Covers illegal, hateful, harassing, and other harmful content

  • Fiddler Trust Model: Uses Fiddler’s proprietary safety evaluation model

  • Batch Scoring: Returns multiple scores for comprehensive safety analysis

Safety Categories Evaluated:

  • illegal_prob: Probability of containing illegal content or activities

  • hateful_prob: Probability of containing hate speech or discriminatory language

  • harassing_prob: Probability of containing harassing or threatening content

  • racist_prob: Probability of containing racist language or content

  • sexist_prob: Probability of containing sexist language or content

  • violent_prob: Probability of containing violent or graphic content

  • sexual_prob: Probability of containing inappropriate sexual content

  • harmful_prob: Probability of containing content that could cause harm

  • unethical_prob: Probability of containing unethical or manipulative content

  • jailbreaking_prob: Probability of containing prompt injection or jailbreaking attempts

  • max_risk_prob: Maximum risk probability across all categories

Use Cases:

  • Content Moderation: Filtering user-generated content for safety

  • Prompt Validation: Ensuring user prompts are safe before processing

  • AI Safety: Protecting AI systems from harmful or manipulative inputs

  • Compliance: Meeting regulatory requirements for content safety

  • Risk Assessment: Evaluating potential risks in text content

Scoring Logic: Each safety category returns a probability score between 0.0 and 1.0 (a triage sketch based on these bands follows this list):

  • 0.0-0.3: Low risk (safe content)

  • 0.3-0.7: Medium risk (requires review)

  • 0.7-1.0: High risk (likely unsafe content)
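A small triage sketch based on these bands is shown below; the import and score() behavior follow the example later in this section, while the band cutoffs and labels are illustrative rather than evaluator-enforced:

from fiddler_evals.evaluators import FTLPromptSafety

evaluator = FTLPromptSafety()

def triage(text: str) -> str:
    # Bucket a prompt into the descriptive risk bands above. The 0.3 / 0.7
    # cutoffs are illustrative guidance, not thresholds enforced by the model.
    scores = evaluator.score(text)
    max_risk = max(score.value for score in scores)
    if max_risk < 0.3:
        return "low risk"
    if max_risk < 0.7:
        return "medium risk (needs review)"
    return "high risk (likely unsafe)"

print(triage("What is the weather like today?"))  # expected: low risk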

Parameters

Parameter
Type
Required
Default
Description

text

str

None

The text prompt to evaluate for safety.

  • Returns: A list of Score objects, one for each safety category:

    • name: The safety category name (e.g., “illegal_prob”)

    • evaluator_name: “FTLPromptSafety”

    • value: Probability score (0.0-1.0) for that category

  • Return type: list[Score]

  • Raises: ValueError – If the text is empty or None.

Example

from fiddler_evals.evaluators import FTLPromptSafety
evaluator = FTLPromptSafety()

# Safe content
scores = evaluator.score("What is the weather like today?")
for score in scores:
    print(f"{score.name}: {score.value}")
# illegal_prob: 0.01
# hateful_prob: 0.02
# harassing_prob: 0.01
# …

# Potentially unsafe content
unsafe_scores = evaluator.score("How to hack into someone's computer?")
for score in unsafe_scores:
    if score.value > 0.5:
        print(f"High risk detected: {score.name} = {score.value}")

# Filter based on maximum risk
max_risk_score = next(s for s in unsafe_scores if s.name == "max_risk_prob")
if max_risk_score.value > 0.7:
    print("Content flagged as potentially unsafe")

This evaluator is designed for prompt safety assessment and should be used as part of a comprehensive content moderation strategy. The probability scores should be interpreted in context and combined with other safety measures for robust content filtering.

name = 'ftl_prompt_safety'

score()

Score the safety of a text prompt.

Parameters

Parameter
Type
Required
Default
Description

text

str

None

The text prompt to evaluate for safety.

  • Returns: A list of Score objects, one for each safety category.

  • Return type: list[Score]

FTLResponseFaithfulness

Evaluator to assess response faithfulness using Fiddler’s Trust Model.

The FTLResponseFaithfulness evaluator uses Fiddler’s proprietary Trust Model to evaluate how faithful an LLM response is to the provided context. This evaluator helps ensure that responses accurately reflect the information in the source context and don’t contain hallucinated or fabricated information.

Key Features:

  • Faithfulness Assessment: Evaluates how well the response reflects the context

  • Probability-Based Scoring: Returns probability scores (0.0-1.0) for faithfulness

  • Context-Response Alignment: Compares response against provided context

  • Fiddler Trust Model: Uses Fiddler’s proprietary faithfulness evaluation model

  • Hallucination Detection: Identifies responses that go beyond the context

Faithfulness Categories Evaluated:

  • faithful_prob: Probability that the response is faithful to the context

Use Cases:

  • RAG Systems: Ensuring responses stay grounded in retrieved context

  • Document Q&A: Verifying answers are based on provided documents

  • Fact-Checking: Validating that responses don’t contain fabricated information

  • Content Validation: Ensuring responses accurately reflect source material

  • Hallucination Detection: Identifying responses that go beyond the context

Scoring Logic: The faithfulness score represents the probability that the response is faithful to the context (a grounding-check sketch based on these bands follows this list):

  • 0.0-0.3: Low faithfulness (likely contains hallucinated information)

  • 0.3-0.7: Medium faithfulness (some information may not be grounded)

  • 0.7-1.0: High faithfulness (response accurately reflects context)
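A minimal grounding-check sketch built on these bands is shown below; it assumes the list[Score] return with a “faithful_prob” entry described above, and the 0.7 default threshold is illustrative rather than a recommended value:

from fiddler_evals.evaluators import FTLResponseFaithfulness

evaluator = FTLResponseFaithfulness()

def is_grounded(response: str, context: str, threshold: float = 0.7) -> bool:
    # Returns True when the faithfulness probability meets the threshold.
    # The 0.7 default mirrors the descriptive band above; tune per application.
    scores = evaluator.score(response=response, context=context)
    faithful = next(s for s in scores if s.name == "faithful_prob")
    return faithful.value >= threshold

print(is_grounded("Paris is the capital of France.",
                  "The capital of France is Paris."))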

Parameters

Parameter
Type
Required
Default
Description

response

str

None

The LLM response to evaluate for faithfulness.

context

str

None

The source context that the response should be faithful to.

  • Returns: A list of Score objects containing:

    • name: The faithfulness category name (“faithful_prob”)

    • evaluator_name: “FTLResponseFaithfulness”

    • value: Probability score (0.0-1.0) for faithfulness

  • Return type: list[Score]

  • Raises: ValueError – If the response or context is empty or None.

Example

from fiddler_evals.evaluators import FTLResponseFaithfulness
evaluator = FTLResponseFaithfulness()

# Faithful response
context = "The capital of France is Paris. It is located in Western Europe."
response = "Paris is the capital of France."
scores = evaluator.score(response=response, context=context)
for score in scores:
    print(f"{score.name}: {score.value}")
# faithful_prob: 0.95

# Unfaithful response with hallucination
context = "The capital of France is Paris."
response = "The capital of France is Paris, and it has a population of 2.1 million people."
scores = evaluator.score(response=response, context=context)
for score in scores:
    print(f"{score.name}: {score.value}")
# faithful_prob: 0.65 (population info not in context)

# Highly unfaithful response
context = "The capital of France is Paris."
response = "The capital of France is London."
scores = evaluator.score(response=response, context=context)
for score in scores:
    print(f"{score.name}: {score.value}")
# faithful_prob: 0.05

# Filter based on faithfulness threshold
faithful_score = next(s for s in scores if s.name == "faithful_prob")
if faithful_score.value < 0.7:
    print("Response flagged as potentially unfaithful")

This evaluator is designed for response faithfulness assessment and should be used in conjunction with other evaluation metrics for comprehensive response quality assessment. The probability scores should be interpreted in context and combined with other quality measures for robust response validation.

name = 'ftl_response_faithfulness'

score()

Score the faithfulness of a response to its context.

Parameters

Parameter
Type
Required
Default
Description

response

str

None

The LLM response to evaluate for faithfulness.

context

str

None

The source context that the response should be faithful to.

  • Returns: A list of Score objects for faithfulness probability.

  • Return type: list[Score]

TopicClassification

Evaluator to classify text topics using Fiddler’s zero-shot topic classification model.

The TopicClassification evaluator uses Fiddler’s implementation of the mortizlaurer/roberta-base-zeroshot-v2-0-c model to classify text content into predefined topic categories. This evaluator helps identify the main subject matter or theme of text content, providing both topic labels and confidence scores for topic classification.

Key Features:

  • Topic Classification: Classifies text into predefined topic categories

  • Dual Score Output: Returns both the topic label and a probability confidence score

  • Zero-Shot Model: Uses mortizlaurer/roberta-base-zeroshot-v2-0-c for flexible topic classification

Topic Categories Evaluated:

  • top_topic: The predicted topic name from the provided topics list

  • top_topic_prob: Probability score (0.0-1.0) for the predicted topic

Use Cases:

  • Content Categorization: Automatically organizing content by topic

  • Document Classification: Sorting documents by subject matter

  • News Analysis: Categorizing news articles by topic

  • Customer Support: Routing tickets by topic or issue type

  • Content Moderation: Identifying content themes for policy enforcement

Scoring Logic: The topic classification provides two complementary scores:

  • top_topic: The predicted topic name from the provided topics list

    • Selected from the topics provided during initialization

    • Represents the most relevant topic for the input text

  • top_topic_prob: Confidence score (0.0-1.0) for the prediction

    • 0.0-0.3: Low confidence in topic prediction

    • 0.3-0.7: Medium confidence in topic prediction

    • 0.7-1.0: High confidence in topic prediction

Parameters

Parameter
Type
Required
Default
Description

topics

list[str]

None

List of topic categories to classify text into.

  • Returns: A list of Score objects containing:

    • top_topic: Score object with predicted topic name

    • top_topic_prob: Score object with probability score (0.0-1.0)

  • Return type: list[Score]

  • Raises: ValueError – If the text is empty or None, or if no scores are returned from the API.

Example

from fiddler_evals.evaluators import TopicClassification
evaluator = TopicClassification(topics=["technology", "sports", "politics", "entertainment"])

# Technology topic
scores = evaluator.score("The new AI model shows promising results in natural language processing.")
print(f"Topic: {scores[0].label}")
print(f"Confidence: {scores[1].value}")
# Topic: technology
# Confidence: 0.92

# Sports topic
sports_scores = evaluator.score("The team won the championship with an amazing performance!")
print(f"Topic: {sports_scores[0].label}")
print(f"Confidence: {sports_scores[1].value}")
# Topic: sports
# Confidence: 0.88

# Politics topic
politics_scores = evaluator.score("The new policy will affect millions of citizens.")
print(f"Topic: {politics_scores[0].label}")
print(f"Confidence: {politics_scores[1].value}")
# Topic: politics
# Confidence: 0.85

# Filter based on topic and confidence
if scores[0].label == "technology" and scores[1].value > 0.8:
    print("High confidence technology topic detected")

This evaluator uses zero-shot classification, meaning it can classify text into any set of topics provided during initialization without requiring training data for those specific topics. The mortizlaurer/roberta-base-zeroshot-v2-0-c model is particularly effective for general-purpose topic classification across diverse domains. The dual-score output provides both categorical classification and confidence assessment for robust topic analysis workflows.
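As one illustration of that flexibility, the sketch below routes support tickets to queues by topic; the topic list, fallback queue, and 0.5 confidence cutoff are hypothetical choices, while the constructor and score() usage follow the example above:

from fiddler_evals.evaluators import TopicClassification

# Hypothetical ticket categories; zero-shot classification accepts any label set.
router = TopicClassification(topics=["billing", "technical issue", "account access"])

def route_ticket(ticket_text: str, min_confidence: float = 0.5) -> str:
    # scores[0] carries the topic label and scores[1] its probability,
    # matching the example above; the 0.5 cutoff is illustrative.
    scores = router.score(ticket_text)
    topic, confidence = scores[0].label, scores[1].value
    return topic if confidence >= min_confidence else "triage"

print(route_ticket("I was charged twice for my subscription this month."))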

name = 'topic_classification'

Initialize the TopicClassification evaluator.

Parameters

Parameter
Type
Required
Default
Description

topics

list[str]

None

List of topic categories to classify text into.

  • Raises: ValueError – If the topics are empty or None.

score()

Score the topic classification of text content.

Parameters

Parameter
Type
Required
Default
Description

text

str

None

The text content to evaluate for topic classification.

  • Returns: A list of Score objects for topic name and probability.

  • Return type: list[Score]