# Experiment

Represents an Experiment for tracking evaluation runs and results.

An Experiment is a single evaluation run of a test suite against a specific application/LLM/Agent version and evaluators. Experiments provide comprehensive tracking, monitoring, and result management for GenAI evaluation workflows, enabling systematic testing and performance analysis.

Key Features:

* **Evaluation Tracking**: Complete lifecycle tracking of evaluation runs
* **Status Management**: Real-time status updates (PENDING, IN\_PROGRESS, COMPLETED, etc.)
* **Dataset Integration**: Linked to specific datasets for evaluation
* **Result Storage**: Comprehensive storage of results, metrics, and error information
* **Error Handling**: Detailed error tracking with traceback information

Experiment Lifecycle:

1. **Creation**: Create experiment with dataset and application references
2. **Execution**: Experiment runs evaluation against the dataset
3. **Monitoring**: Track status and progress in real-time
4. **Completion**: Retrieve results, metrics, and analysis
5. **Cleanup**: Archive or delete completed experiments

## Example

```python
# List experiments in an application, optionally filtered by dataset
experiments = Experiment.list(
    application_id=application_id,
    dataset_id=dataset_id,
)
```

{% hint style="info" %}
Experiments are permanent records of evaluation runs. Once created, the name cannot be changed, but metadata and description can be updated. Failed experiments retain error information for debugging and analysis.
{% endhint %}
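
The monitoring step of the lifecycle can be sketched as a simple polling loop. This is a minimal illustration, not part of the SDK: `wait_for_terminal_status` and `TERMINAL_STATUSES` are hypothetical names, the status values are handled as plain strings, and treating only `COMPLETED` and `FAILED` as terminal is an assumption. The status source is a stand-in callable so the sketch runs on its own.

```python
import time
from typing import Callable

# Assumption: COMPLETED and FAILED are treated as terminal here;
# the platform may define further statuses.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    poll_interval_s: float = 5.0,
    timeout_s: float = 600.0,
) -> str:
    """Poll fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_s} seconds")
        time.sleep(poll_interval_s)


# Stand-in for a real lookup that re-fetches the experiment and returns its status name.
_fake_statuses = iter(["PENDING", "IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
final_status = wait_for_terminal_status(lambda: next(_fake_statuses), poll_interval_s=0.0)
print(final_status)
```

In practice the callable would re-fetch the experiment on every call, since a previously retrieved instance may not reflect later status changes on the server.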

## description *: str | None* *= None*

## error\_reason *: str | None* *= None*

## error\_message *: str | None* *= None*

## traceback *: str | None* *= None*

## duration\_ms *: int | None* *= None*

## get\_app\_url()

Get the application URL for this experiment.

**Return type:** str

## *classmethod* get\_by\_id(id\_)

Retrieve an experiment by its unique identifier.

Fetches an experiment from the Fiddler platform using its UUID. This is the most direct way to retrieve an experiment when you know its ID.

## Parameters

| Parameter | Type          | Required | Default | Description |
| --------- | ------------- | -------- | ------- | ----------- |
| `id_`     | `UUID \| str` | ✗        | `None`  |             |

## Returns

The experiment instance with all metadata and configuration.

**Return type:** `Experiment`

## Raises

* **NotFound** -- If no experiment exists with the specified ID.
* **ApiError** -- If there's an error communicating with the Fiddler API.

## Example

```python
# Get experiment by UUID
experiment = Experiment.get_by_id(id_="550e8400-e29b-41d4-a716-446655440000")
print(f"Retrieved experiment: {experiment.name}")
print(f"Status: {experiment.status}")
print(f"Created: {experiment.created_at}")
print(f"Application: {experiment.application.name}")
```

{% hint style="info" %}
This method makes an API call to fetch the latest experiment state from the server. The returned experiment instance reflects the current state in Fiddler.
{% endhint %}

## *classmethod* get\_by\_name(name, application\_id)

Retrieve an experiment by name within an application.

Finds and returns an experiment using its name within the specified application. This is useful when you know the experiment name and application but not its UUID. Experiment names are unique within an application, making this a reliable lookup method.

## Parameters

| Parameter        | Type          | Required | Default | Description                                                                                                       |
| ---------------- | ------------- | -------- | ------- | ----------------------------------------------------------------------------------------------------------------- |
| `name`           | `str`         | ✗        | `None`  | The name of the experiment to retrieve. Experiment names are unique within an application and are case-sensitive. |
| `application_id` | `UUID \| str` | ✗        | `None`  |                                                                                                                   |

## Returns

The experiment instance matching the specified name.

**Return type:** `Experiment`

## Raises

* **NotFound** -- If no experiment exists with the specified name in the application.
* **ApiError** -- If there's an error communicating with the Fiddler API.

## Example

```python
# Get application instance
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)

# Get experiment by name within an application
experiment = Experiment.get_by_name(
    name="fraud-detection-eval-v1",
    application_id=application.id
)
print(f"Found experiment: {experiment.name} (ID: {experiment.id})")
print(f"Status: {experiment.status}")
print(f"Created: {experiment.created_at}")
print(f"Dataset: {experiment.dataset.name}")
```

{% hint style="info" %}
Experiment names are case-sensitive and must match exactly. Use this method when you have a known experiment name from configuration or user input.
{% endhint %}

## *classmethod* list(application\_id, dataset\_id=None)

List all experiments in an application.

Retrieves all experiments that the current user has access to within the specified application. Returns an iterator for memory efficiency when dealing with many experiments.

## Parameters

| Parameter        | Type                  | Required | Default | Description |
| ---------------- | --------------------- | -------- | ------- | ----------- |
| `application_id` | `UUID \| str`         | ✗        | `None`  |             |
| `dataset_id`     | `UUID \| str \| None` | ✗        | `None`  |             |

## Yields

`Experiment` -- Experiment instances for all accessible experiments in the application.

## Raises

**ApiError** -- If there's an error communicating with the Fiddler API.

**Return type:** *Iterator*\[[*Experiment*](#experiment)]

## Example

```python
# Get application instance
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application.id)

# List experiments in the application that were run against this dataset
for experiment in Experiment.list(application_id=application.id, dataset_id=dataset.id):
    print(f"Experiment: {experiment.name}")
    print(f"  ID: {experiment.id}")
    print(f"  Status: {experiment.status}")
    print(f"  Created: {experiment.created_at}")
    print(f"  Dataset: {experiment.dataset.name}")

# Convert to list for counting and filtering
experiments = list(Experiment.list(application_id=application.id, dataset_id=dataset.id))
print(f"Total experiments in application: {len(experiments)}")

# Find experiments by status
completed_experiments = [
    exp for exp in Experiment.list(application_id=application.id, dataset_id=dataset.id)
    if exp.status == ExperimentStatus.COMPLETED
]
print(f"Completed experiments: {len(completed_experiments)}")

# Find experiments by name pattern
eval_experiments = [
    exp for exp in Experiment.list(application_id=application.id, dataset_id=dataset.id)
    if "eval" in exp.name.lower()
]
print(f"Evaluation experiments: {len(eval_experiments)}")
```

{% hint style="info" %}
This method returns an iterator for memory efficiency. Convert it to a list with `list(Experiment.list(application_id=application_id))` if you need to iterate multiple times or get the total count. The iterator fetches experiments lazily from the API.
{% endhint %}
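
The single-pass behavior of an iterator is easy to overlook. The sketch below uses a plain generator, `fake_experiment_names`, as a hypothetical stand-in for `Experiment.list(...)`, so it runs without a Fiddler connection; it only illustrates why materializing into a list helps when several passes or a count are needed.

```python
from typing import Iterator


def fake_experiment_names() -> Iterator[str]:
    """Hypothetical stand-in for Experiment.list(...): yields names lazily, one at a time."""
    for name in ("eval-v1", "eval-v2", "eval-v3"):
        yield name


names = fake_experiment_names()
first_pass = [n for n in names]
second_pass = [n for n in names]  # the iterator is already exhausted at this point

# Materialize once when several passes or a total count are needed.
cached = list(fake_experiment_names())
eval_v2 = [n for n in cached if n.endswith("v2")]

print(len(first_pass), len(second_pass), len(cached), eval_v2)
```

Materializing also avoids repeating the underlying API requests for each pass, at the cost of holding every experiment in memory.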

## *classmethod* create(name, application\_id, dataset\_id, description=None, metadata=None)

Create a new experiment in an application.

Creates a new experiment within the specified application on the Fiddler platform. The experiment must have a unique name within the application and will be linked to the specified dataset for evaluation.

Note: Calling this method directly is not recommended; use the evaluate method instead. Creating and managing an experiment without the evaluate wrapper is an advanced use case and should generally be avoided.

## Parameters

| Parameter        | Type           | Required | Default | Description                                             |
| ---------------- | -------------- | -------- | ------- | ------------------------------------------------------- |
| `name`           | `str`          | ✗        | `None`  | Experiment name, must be unique within the application. |
| `application_id` | `UUID \| str`  | ✗        | `None`  |                                                         |
| `dataset_id`     | `UUID \| str`  | ✗        | `None`  |                                                         |
| `description`    | `str \| None`  | ✗        | `None`  |                                                         |
| `metadata`       | `dict \| None` | ✗        | `None`  |                                                         |

## Returns

The newly created experiment instance with server-assigned fields.

**Return type:** `Experiment`

## Raises

* **Conflict** -- If an experiment with the same name already exists in the application.
* **ValidationError** -- If the experiment configuration is invalid (e.g., invalid name format).
* **ApiError** -- If there's an error communicating with the Fiddler API.

## Example

```python
# Get application and dataset instances
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application.id)

# Create a new experiment for fraud detection evaluation
experiment = Experiment.create(
    name="fraud-detection-eval-v1",
    application_id=application.id,
    dataset_id=dataset.id,
    description="Comprehensive evaluation of fraud detection model v1.0",
    metadata={"model_version": "1.0", "evaluation_type": "comprehensive", "baseline": "true"}
)
print(f"Created experiment with ID: {experiment.id}")
print(f"Status: {experiment.status}")
print(f"Created at: {experiment.created_at}")
print(f"Application: {experiment.application.name}")
print(f"Dataset: {experiment.dataset.name}")
```

{% hint style="info" %}
After successful creation, the experiment instance is returned with server-assigned metadata. The experiment is immediately available for execution and monitoring. The initial status will be PENDING.
{% endhint %}

## *classmethod* get\_or\_create(name, application\_id, dataset\_id, description=None, metadata=None)

Get an existing experiment by name or create a new one if it doesn't exist.

This is a convenience method that attempts to retrieve an experiment by name within an application, and if not found, creates a new experiment with that name. Useful for idempotent experiment setup in automation scripts and deployment pipelines.

## Parameters

| Parameter        | Type           | Required | Default | Description                                       |
| ---------------- | -------------- | -------- | ------- | ------------------------------------------------- |
| `name`           | `str`          | ✗        | `None`  | The name of the experiment to retrieve or create. |
| `application_id` | `UUID \| str`  | ✗        | `None`  |                                                   |
| `dataset_id`     | `UUID \| str`  | ✗        | `None`  |                                                   |
| `description`    | `str \| None`  | ✗        | `None`  |                                                   |
| `metadata`       | `dict \| None` | ✗        | `None`  |                                                   |

## Returns

Either the existing experiment with the specified name, or a newly created experiment if none existed.

**Return type:** `Experiment`

## Raises

* **ValidationError** -- If the experiment name format is invalid.
* **ApiError** -- If there's an error communicating with the Fiddler API.

## Example

```python
# Get application and dataset instances
application = Application.get_by_name(name="fraud-detection-app", project_id=project_id)
dataset = Dataset.get_by_name(name="fraud-detection-tests", application_id=application.id)

# Safe experiment setup - get existing or create new
experiment = Experiment.get_or_create(
    name="fraud-detection-eval-v1",
    application_id=application.id,
    dataset_id=dataset.id,
    description="Comprehensive evaluation of fraud detection model v1.0",
    metadata={"model_version": "1.0", "evaluation_type": "comprehensive"}
)
print(f"Using experiment: {experiment.name} (ID: {experiment.id})")

# Idempotent setup in deployment scripts
experiment = Experiment.get_or_create(
    name="llm-benchmark-eval",
    application_id=application.id,
    dataset_id=dataset.id,
    metadata={"baseline": "true"}
)

# Use in configuration management
model_versions = ["v1.0", "v1.1", "v2.0"]
experiments = {}
for version in model_versions:
    experiments[version] = Experiment.get_or_create(
        name=f"fraud-detection-eval-{version}",
        application_id=application.id,
        dataset_id=dataset.id,
        metadata={"model_version": version}
    )
```

{% hint style="info" %}
This method is idempotent - calling it multiple times with the same name and application\_id will return the same experiment. It logs when creating a new experiment for visibility in automation scenarios.
{% endhint %}
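
The get-or-create pattern itself can be sketched with an in-memory stand-in. This is an illustration of the general look-up-then-create idea under stated assumptions, not the SDK's implementation: `FakeExperimentStore` and its local `NotFound` class are hypothetical, and records are plain dictionaries keyed by application and name.

```python
from dataclasses import dataclass, field


class NotFound(Exception):
    """Local stand-in for the NotFound error listed under get_by_name."""


@dataclass
class FakeExperimentStore:
    """Hypothetical in-memory stand-in for the server, keyed by (application_id, name)."""

    records: dict = field(default_factory=dict)

    def get_by_name(self, name: str, application_id: str) -> dict:
        try:
            return self.records[(application_id, name)]
        except KeyError:
            raise NotFound(name) from None

    def create(self, name: str, application_id: str, dataset_id: str) -> dict:
        record = {"name": name, "application_id": application_id, "dataset_id": dataset_id}
        self.records[(application_id, name)] = record
        return record

    def get_or_create(self, name: str, application_id: str, dataset_id: str) -> dict:
        # Look up first; create only when the lookup reports NotFound.
        try:
            return self.get_by_name(name, application_id)
        except NotFound:
            return self.create(name, application_id, dataset_id)


store = FakeExperimentStore()
first = store.get_or_create("llm-benchmark-eval", "app-1", "dataset-1")
second = store.get_or_create("llm-benchmark-eval", "app-1", "dataset-2")
print(first is second, second["dataset_id"], len(store.records))
```

In this sketch the second call returns the existing record unchanged, so the differing `dataset_id` argument is ignored. It is worth checking how the real method treats arguments that differ from an existing experiment before relying on that behavior.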

## update()

Update experiment description, metadata, and status.

Updates the experiment's description, metadata, and/or status. This method allows you to modify the experiment's configuration after creation, including updating the experiment status and error information for failed experiments.

## Parameters

| Parameter       | Type                       | Required | Default | Description |
| --------------- | -------------------------- | -------- | ------- | ----------- |
| `description`   | `str \| None`              | ✗        | `None`  |             |
| `metadata`      | `dict \| None`             | ✗        | `None`  |             |
| `status`        | `ExperimentStatus \| None` | ✗        | `None`  |             |
| `error_reason`  | `str \| None`              | ✗        | `None`  |             |
| `error_message` | `str \| None`              | ✗        | `None`  |             |
| `traceback`     | `str \| None`              | ✗        | `None`  |             |
| `duration_ms`   | `int \| None`              | ✗        | `None`  |             |

## Returns

The updated experiment instance with new metadata and configuration.

**Return type:** `Experiment`

## Raises

* **ValueError** -- If no update parameters are provided (all are None) or if status is FAILED but error\_reason, error\_message, or traceback are missing.
* **ValidationError** -- If the update data is invalid (e.g., invalid metadata format).
* **ApiError** -- If there's an error communicating with the Fiddler API.

## Example

```python
# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Update description and metadata
updated_experiment = experiment.update(
    description="Updated comprehensive evaluation of fraud detection model v1.1",
    metadata={"model_version": "1.1", "evaluation_type": "comprehensive", "updated_by": "john_doe"}
)
print(f"Updated experiment: {updated_experiment.name}")
print(f"New description: {updated_experiment.description}")

# Update only metadata
experiment.update(metadata={"last_updated": "2024-01-15", "status": "active"})

# Update experiment status to completed
experiment.update(status=ExperimentStatus.COMPLETED)

# Mark experiment as failed with error details
experiment.update(
    status=ExperimentStatus.FAILED,
    error_reason="Evaluation timeout",
    error_message="The evaluation process exceeded the maximum allowed time",
    traceback="Traceback (most recent call last): File evaluate.py, line 42..."
)

# Clear description
experiment.update(description="")

# Batch update multiple experiments
for experiment in Experiment.list(application_id=application_id):
    if experiment.status == ExperimentStatus.COMPLETED:
        experiment.update(metadata={"archived": "true"})
```

{% hint style="info" %}
This method performs a complete replacement of the specified fields. For partial updates, retrieve current values, modify them, and pass the complete new values. The experiment name and ID cannot be changed. When updating status to FAILED, all error-related parameters are required.
{% endhint %}
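
Because specified fields are replaced wholesale, changing one metadata key means sending the full dictionary. A minimal sketch of that read-modify-write step follows, assuming the current metadata is available as a plain dictionary; `merged_metadata` is a hypothetical helper, not an SDK function.

```python
from typing import Any, Mapping, Optional


def merged_metadata(
    current: Optional[Mapping[str, Any]],
    changes: Mapping[str, Any],
) -> dict:
    """Return a new dict holding the current metadata overlaid with the changes."""
    merged = dict(current or {})
    merged.update(changes)
    return merged


current = {"model_version": "1.1", "evaluation_type": "comprehensive"}
new_metadata = merged_metadata(current, {"archived": "true"})
print(new_metadata)
# The complete dict would then be passed along, e.g. experiment.update(metadata=new_metadata)
```

The helper copies rather than mutates its input, so the original values remain available if the update call fails.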

## delete()

Delete the experiment.

Permanently deletes the experiment and all associated data from the Fiddler platform. This action cannot be undone and will remove all experiment results, metrics, and metadata.

## Raises

**ApiError** -- If there's an error communicating with the Fiddler API.

**Return type:** None

## Example

```python
# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Delete the experiment
experiment.delete()
print("Experiment deleted successfully")

# Delete multiple experiments
for experiment in Experiment.list(application_id=application_id):
    if experiment.status == ExperimentStatus.FAILED:
        print(f"Deleting failed experiment: {experiment.name}")
        experiment.delete()
```

{% hint style="info" %}
This operation is irreversible. Once deleted, the experiment and all its associated data cannot be recovered. Consider archiving experiments instead of deleting them if you need to preserve historical data.
{% endhint %}

## add\_items()

Add outputs of LLM/Agent/Application against dataset items to the experiment.

Adds the outputs of the LLM/Agent/Application (task or target function) for each dataset item to the experiment, representing individual test case outcomes. Each item contains the outputs, timing information, and status for a specific dataset item.

## Parameters

| Parameter | Type                      | Required | Default | Description |
| --------- | ------------------------- | -------- | ------- | ----------- |
| `items`   | `list[NewExperimentItem]` | ✗        | `None`  | List of `NewExperimentItem` instances containing the outputs of the LLM/Agent/Application against dataset items. Each item should include: `dataset_item_id`: UUID of the dataset item being evaluated; `outputs`: dictionary containing the outputs of the task function for the dataset item; `duration_ms`: duration of the execution in milliseconds; `status`: status of the task function output / scoring for the dataset item (`PENDING`, `COMPLETED`, `FAILED`, etc.); `error_reason`: reason for failure, if applicable; `error_message`: detailed error message, if applicable. |

## Returns

List of UUIDs for the newly created experiment items.

**Return type:** builtins.list\[UUID]

## Raises

* **ValueError** -- If the items list is empty.
* **ValidationError** -- If any item data is invalid (e.g., missing required fields).
* **ApiError** -- If there's an error communicating with the Fiddler API.

## Example

```python
# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Create evaluation result items
from fiddler_evals.pydantic_models.experiment import NewExperimentItem
from datetime import datetime, timezone

items = [
    NewExperimentItem(
        dataset_item_id=dataset_item_id_1,
        outputs={"answer": "The watermelon seeds pass through your digestive system"},
        duration_ms=1000,
        end_time=datetime.now(tz=timezone.utc),
        status="COMPLETED",
        error_reason=None,
        error_message=None
    ),
    NewExperimentItem(
        dataset_item_id=dataset_item_id_2,
        outputs={"answer": "The precise origin of fortune cookies is unclear"},
        duration_ms=1000,
        end_time=datetime.now(tz=timezone.utc),
        status="COMPLETED",
        error_reason=None,
        error_message=None
    )
]

# Add items to experiment
item_ids = experiment.add_items(items)
print(f"Added {len(item_ids)} evaluation result items")
print(f"Item IDs: {item_ids}")

# Add items from evaluation results
# `results` is assumed to be a list of dicts produced by your own evaluation loop
new_items = [
    NewExperimentItem(
        dataset_item_id=result["dataset_item_id"],
        outputs={"answer": result["answer"]},
        duration_ms=result["duration_ms"],
        end_time=result["end_time"],
        status="COMPLETED",
    )
    for result in results
]

# Batch add items with error handling
try:
    item_ids = experiment.add_items(new_items)
    print(f"Successfully added {len(item_ids)} items")
except ValueError as e:
    print(f"Validation error: {e}")
except Exception as e:
    print(f"Failed to add items: {e}")
```

{% hint style="info" %}
This method is typically used after running evaluations to store the results in the experiment. Each item represents the evaluation of a single dataset item and contains all relevant timing, output, and status information.
{% endhint %}
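
The timing and status fields of an item can be captured while the task function runs. The sketch below is one possible way to do that, not SDK behavior: `run_task` is a hypothetical helper that returns a plain dictionary of field values mirroring the item fields described above, and it uses string IDs in place of real UUIDs.

```python
import time
from datetime import datetime, timezone
from typing import Callable


def run_task(task: Callable[[dict], dict], inputs: dict, dataset_item_id: str) -> dict:
    """Run task(inputs) and return field values shaped like the item fields above."""
    started = time.perf_counter()
    outputs: dict = {}
    status, error_reason, error_message = "COMPLETED", None, None
    try:
        outputs = task(inputs)
    except Exception as exc:  # record the failure instead of aborting the whole run
        status = "FAILED"
        error_reason = type(exc).__name__
        error_message = str(exc)
    return {
        "dataset_item_id": dataset_item_id,
        "outputs": outputs,
        "duration_ms": int((time.perf_counter() - started) * 1000),
        "end_time": datetime.now(tz=timezone.utc),
        "status": status,
        "error_reason": error_reason,
        "error_message": error_message,
    }


def failing_task(inputs: dict) -> dict:
    raise ValueError("empty question")


ok = run_task(lambda inputs: {"answer": inputs["question"].upper()}, {"question": "hi"}, "item-1")
bad = run_task(failing_task, {"question": ""}, "item-2")
print(ok["status"], ok["outputs"], bad["status"], bad["error_reason"])
```

Each returned dictionary could then be unpacked into `NewExperimentItem(**fields)`, provided the model accepts these field names and value types.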

## get\_items()

Retrieve all experiment result items from the experiment.

Fetches all experiment result items (outputs, timing, status) that were generated by the task function against dataset items. Returns an iterator for memory efficiency when dealing with large experiments containing many result items.

## Returns

Iterator of ExperimentItem instances for all result items in the experiment.

**Return type:** Iterator\[`ExperimentItem`]

## Raises

**ApiError** -- If there's an error communicating with the Fiddler API.

## Example

```python
# Get existing experiment
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Get all result items from the experiment
for item in experiment.get_items():
    print(f"Item ID: {item.id}")
    print(f"Dataset Item ID: {item.dataset_item_id}")
    print(f"Outputs: {item.outputs}")
    print(f"Status: {item.status}")
    print(f"Duration: {item.duration_ms}")
    if item.error_reason:
        print(f"Error: {item.error_reason} - {item.error_message}")
    print("---")

# Convert to list for analysis
all_items = list(experiment.get_items())
print(f"Total result items: {len(all_items)}")

# Filter items by status
completed_items = [
    item for item in experiment.get_items()
    if item.status == "COMPLETED"
]
print(f"Completed items: {len(completed_items)}")

# Filter items by error status
failed_items = [
    item for item in experiment.get_items()
    if item.status == "FAILED"
]
print(f"Failed items: {len(failed_items)}")

# Process items in batches
batch_size = 100
for i, item in enumerate(experiment.get_items()):
    if i % batch_size == 0:
        print(f"Processing batch {i // batch_size + 1}")
    # Process item...

# Analyze outputs
for item in experiment.get_items():
    if item.outputs.get("confidence", 0) < 0.8:
        print(f"Low confidence item: {item.id}")
```

{% hint style="info" %}
This method returns an iterator for memory efficiency. Convert it to a list with `list(experiment.get_items())` if you need to iterate multiple times or get the total count. The iterator fetches items lazily from the API.
{% endhint %}
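
When items should be processed in fixed-size groups without loading everything at once, the iterator can be consumed in chunks. A minimal sketch using only the standard library follows; `in_batches` is a hypothetical helper, and `range(250)` stands in for `experiment.get_items()`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size elements without materializing the whole iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# range(250) stands in for experiment.get_items()
batch_sizes = [len(batch) for batch in in_batches(range(250), 100)]
print(batch_sizes)
```

Only one batch is held in memory at a time, which keeps the benefit of the lazy iterator while still allowing per-batch work.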

## add\_results()

Add evaluation results to the experiment.

Adds complete evaluation results to the experiment, including both the experiment item data (outputs, timing, status) and all associated evaluator scores. This method is typically used after running evaluations to store the complete results of the evaluation process for a batch of dataset items.

This method will only append the results to the experiment.

Note: Calling this method directly is not recommended; use the evaluate method instead. Creating and managing an experiment without the evaluate wrapper is an advanced use case and should generally be avoided.

## Parameters

| Parameter | Type                         | Required | Default | Description                                                                                                                                                                          |
| --------- | ---------------------------- | -------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `items`   | `list[ExperimentItemResult]` | ✗        | `None`  | List of ExperimentItemResult instances containing: experiment\_item: NewExperimentItem with outputs, timing, and status; scores: List of Score objects from evaluators for this item |

## Returns

Results are added to the experiment on the server.

**Return type:** None

## Raises

* **ValueError** -- If the items list is empty.
* **ValidationError** -- If any item data is invalid (e.g., missing required fields).
* **ApiError** -- If there's an error communicating with the Fiddler API.

## Example

```python
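# Imports used below; Score and ExperimentItemResult must also be imported from the SDK
from datetime import datetime, timezone
from fiddler_evals.pydantic_models.experiment import NewExperimentItem

# Get existing experiment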
experiment = Experiment.get_by_name(name="fraud-detection-eval-v1", application_id=application_id)

# Create experiment item with outputs
experiment_item = NewExperimentItem(
    dataset_item_id=dataset_item.id,
    outputs={"prediction": "fraud", "confidence": 0.95},
    duration_ms=1000,
    end_time=datetime.now(tz=timezone.utc),
    status="COMPLETED"
)

# Create scores from evaluators
scores = [
    Score(
        name="accuracy",
        evaluator_name="AccuracyEvaluator",
        value=1.0,
        label="Correct",
        reasoning="Prediction matches ground truth"
    ),
    Score(
        name="confidence",
        evaluator_name="ConfidenceEvaluator",
        value=0.95,
        label="High",
        reasoning="High confidence in prediction"
    )
]

# Create result combining item and scores
result = ExperimentItemResult(
    experiment_item=experiment_item,
    scores=scores
)

# Add results to experiment
experiment.add_results([result])
```

{% hint style="info" %}
This method is typically called after running evaluations to store complete results. The results include both the experiment item data and all evaluator scores, providing a complete record of the evaluation process.
{% endhint %}
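
Before storing a batch of results, it can be useful to sanity-check the evaluator scores locally. The sketch below averages score values per score name. It is an illustration only: `ScoreStub` is a hypothetical stand-in carrying just the `name` and `value` fields shown in the example above, and `mean_by_name` is not an SDK function.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class ScoreStub:
    """Hypothetical stand-in carrying only the name and value fields of a score."""

    name: str
    value: float


def mean_by_name(scores: Iterable[ScoreStub]) -> Dict[str, float]:
    """Group score values by score name and return the mean of each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for score in scores:
        grouped[score.name].append(score.value)
    return {name: mean(values) for name, values in grouped.items()}


scores = [
    ScoreStub("accuracy", 1.0),
    ScoreStub("accuracy", 0.0),
    ScoreStub("confidence", 0.95),
    ScoreStub("confidence", 0.85),
]
summary = mean_by_name(scores)
print(summary)
```

A summary like this is only a local convenience; the stored results remain the per-item scores passed to `add_results`.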
