Tutorial

Installation

tl;dr: python3 -mpip install coniferest

You usually want to create a virtual environment first, for example with venv:

python3 -mvenv venv
source venv/bin/activate

We need an up-to-date pip:

python3 -mpip install -U pip

Finally, install coniferest:

python3 -mpip install coniferest
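
To confirm from within Python that the package is visible in the active environment, a small check based on the standard library's importlib.metadata can help. This is only a sketch: the helper name installed_version is made up for illustration and is not part of coniferest.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# `installed_version` is a hypothetical helper, not a coniferest function
version = installed_version("coniferest")
if version is None:
    print("coniferest is not installed in this environment")
else:
    print("coniferest version:", version)
```

If the first message is printed inside a freshly activated virtual environment, the install command was most likely run with a different interpreter.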

If you run into any problems, please file an issue on GitHub.

Example: non-active anomaly detection

Let’s generate a simple 2-D dataset with a single outlier as the last object, and run the IsolationForest model (see Liu et al. 2008) on it:

from coniferest.datasets import single_outlier
from coniferest.isoforest import IsolationForest

data, _metadata = single_outlier(10_000)
model = IsolationForest(random_seed=0)
model.fit(data)
scores = model.score_samples(data)
print("Index of the outlier:", scores.argmin())
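
To build some intuition for why the outlier gets the lowest score, here is a toy pure-Python sketch of the idea behind isolation forests: random axis-aligned splits separate an unusual point from the rest in fewer steps than a typical point. It is not the coniferest implementation; it skips subsampling and score normalization, and all names in it (isolation_depth, mean_depth) are made up for illustration. As with score_samples above, a lower value means a more anomalous object.

```python
import random


def isolation_depth(point, sample, rng, depth=0, max_depth=12):
    """Count random axis-aligned splits needed to isolate `point` within `sample`."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    axis = rng.randrange(len(point))
    low = min(row[axis] for row in sample)
    high = max(row[axis] for row in sample)
    if low == high:
        return depth
    threshold = rng.uniform(low, high)
    # Keep only the side of the split that contains the point
    if point[axis] < threshold:
        side = [row for row in sample if row[axis] < threshold]
    else:
        side = [row for row in sample if row[axis] >= threshold]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


rng = random.Random(0)
# 127 "regular" objects drawn from a 2-D standard normal distribution
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(127)]
data.append((10.0, 10.0))  # single outlier as the last object

n_trees = 50
# Average depth over many random "trees"; a shorter path means more anomalous
mean_depth = [
    sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees
    for point in data
]
outlier_index = min(range(len(data)), key=mean_depth.__getitem__)
print("Index of the outlier:", outlier_index)
```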

Example: ZTF light curves of M31 field

Let’s use the built-in dataset of ZTF light curve features adopted from Malanchev et al. (2021):

from coniferest.datasets import ztf_m31

data, metadata = ztf_m31()
print(data.shape)

Here data is a 2-D feature dataset (the first axis is for objects, the second is for features) and metadata is a 1-D array of ZTF DR object IDs. Next we need an active anomaly detection model to find outliers in this dataset. We will use the PineForest model developed by the SNAD team. It is based on the Isolation forest model and uses a tree filtering algorithm to learn from the decisions made by the user:

from coniferest.pineforest import PineForest

model = PineForest(
    # Number of trees to use for predictions
    n_trees=256,
    # Number of new trees to grow for each decision
    n_spare_trees=768,
    # Fix random seed for reproducibility
    random_seed=0,
)

Now we are ready to run an active anomaly detection Session:

from coniferest.session import Session
from coniferest.session.callback import (
    TerminateAfter, viewer_decision_callback,
)

session = Session(
    data=data,
    metadata=metadata,
    model=model,
    # Prompt for a decision and open object's page on the SNAD Viewer
    decision_callback=viewer_decision_callback,
    on_decision_callbacks=[
        # Terminate session after 10 decisions
        TerminateAfter(10),
    ],
)
session.run()

This will prompt you to make a decision for the object with the highest outlier score and show you this object in the browser. Each decision you make retrains the model and updates the outlier scores. After 10 decisions the session is terminated, but you can also stop it earlier by pressing Ctrl+C. Please note that Session mutates the model, so you should re-create the model if you want to start a new session with a blank model.

For the first candidate you should get the recurrent variable ZTF DR 695211200075348 / M31N 2013-11b / MASTER OTJ004126.22+414350.0. The SNAD team reported this object as an anomaly in Malanchev et al. (2021); it is believed to be a recurrent nova or a long-period variable star. If you are interested in this type of object, reply Y; otherwise mark it as not interesting by replying N, and continue your investigation. The objects shown to you will depend on the sequence of your replies.

After the session is finished, you can explore the Session object for the decisions you made and the final state of the model:

from pprint import pprint

print('Decisions:')
pprint({metadata[idx]: label.name for idx, label in session.known_labels.items()})
print('Final scores:')
pprint({metadata[idx]: session.scores[idx] for idx in session.known_labels})

coniferest also provides AADForest, an “Active Anomaly Detection” model based on Isolation forest (see Das et al., 2017 and Ishida et al., 2021 for details). You can use it instead of the PineForest model in the example above:

from coniferest.aadforest import AADForest

model = AADForest(
    # Use 1024 trees, a trade-off between speed and accuracy
    n_trees=1024,
    # Fix random seed for reproducibility
    random_seed=0,
)

Use your own data

In the examples above we used built-in datasets, but you can easily use your own data. The only requirement is that your data should be a 2-D array of features, where the first axis is for objects and the second is for features. You also need a 1-D array of object metadata, which can be anything you want, but typically it holds object IDs or names.

Let’s generate a simple 2-D dataset and run the PineForest model on it, using a simple interactive prompt as the decision callback:

import numpy as np
from coniferest.pineforest import PineForest
from coniferest.session import Session
from coniferest.session.callback import (
    TerminateAfter, prompt_decision_callback,
)

n_objects = 1000
n_features = 10

rng = np.random.default_rng(0)
data = rng.normal(size=(n_objects, n_features))
# prompt_decision_callback will print object's metadata, so it should
# have a pretty representation. Integer index is good for this example
metadata = np.arange(n_objects)

session = Session(
    data=data,
    metadata=metadata,
    model=PineForest(random_seed=0),
    decision_callback=prompt_decision_callback,
    on_decision_callbacks=TerminateAfter(10),
)
session.run()

Session life cycle

The examples above use the Session class to run active anomaly detection. Let’s take a closer look at how it works. Please refer to the Session API documentation for more details on the class constructor arguments, methods, and attributes.

  1. Initialize the session with the Session() constructor

  2. Call Session.run() method to start the session. It will do the following:

    1. Session initializes the model (calls .fit() on it) with the data

    2. The decision loop starts and lasts until the session is terminated or all data points are labeled:

      1. The model is fit for the current Session.known_labels (calls .fit_known() on it)

      2. on_refit_callbacks are called

      3. If all data points are labeled, the session is terminated

      4. decision_callback is called, the decision is stored in Session.known_labels

      5. on_decision_callbacks are called

      6. If the session was terminated in one of the callbacks, the decision loop ends. Otherwise, go back to the first step of the decision loop

  3. Inspect the Session object attributes such as known_labels, scores and model to get the results of the session

Note that Session mutates the input model, so if you use the same model object again for a new session, it will continue from the previous model state.
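
The life cycle above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the actual Session code: ToyModel, run_session, the dictionary-based state and the string labels are all hypothetical stand-ins, and the candidate is assumed to be the unlabeled object with the lowest score (lower means more anomalous, as in the IsolationForest example).

```python
from statistics import fmean

ANOMALY, REGULAR = "A", "R"  # stand-ins for the coniferest Label values


class ToyModel:
    """Hypothetical 1-D model: lower score means more anomalous."""

    def fit(self, data):
        self.center = fmean(data)

    def fit_known(self, data, known_labels):
        # Re-estimate the center from everything not labeled as an anomaly
        kept = [x for i, x in enumerate(data) if known_labels.get(i) != ANOMALY]
        if kept:
            self.center = fmean(kept)

    def score_samples(self, data):
        return [-abs(x - self.center) for x in data]


def run_session(data, metadata, model, decision_callback,
                on_refit_callbacks=(), on_decision_callbacks=()):
    state = {"known_labels": {}, "terminated": False}
    model.fit(data)                                      # step 2.1
    while not state["terminated"]:                       # step 2.2
        model.fit_known(data, state["known_labels"])     # refit on known labels
        for callback in on_refit_callbacks:
            callback(state)
        unlabeled = [i for i in range(len(data))
                     if i not in state["known_labels"]]
        if not unlabeled:                                # everything is labeled
            break
        scores = model.score_samples(data)
        candidate = min(unlabeled, key=scores.__getitem__)
        label = decision_callback(metadata[candidate], data[candidate], state)
        state["known_labels"][candidate] = label
        for callback in on_decision_callbacks:           # may request termination
            callback(metadata[candidate], data[candidate], state)
    return state


def decide(meta, value, state):
    return ANOMALY if abs(value) > 1 else REGULAR


def stop_after_three(meta, value, state):
    if len(state["known_labels"]) >= 3:
        state["terminated"] = True


data = [0.1, -0.2, 0.0, 5.0, 0.4, -4.0]
metadata = ["a", "b", "c", "d", "e", "f"]
result = run_session(data, metadata, ToyModel(), decide,
                     on_decision_callbacks=[stop_after_three])
print({metadata[i]: label for i, label in result["known_labels"].items()})
```

Because the toy model is refit after every decision, the order in which candidates appear depends on the earlier answers, which mirrors the behavior described for the real session.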

Tune the Session with callbacks

Now we know how Session works, but how can we change its behavior? The answer is callbacks. Callbacks are functions that are called at different stages of the session life cycle. Let’s implement a simple callback for each stage:

import numpy as np
from coniferest.pineforest import PineForest
from coniferest.label import Label
from coniferest.session import Session


def my_on_refit_callback(session):
    print('Refitting model with known labels:')
    print(session.known_labels)


def my_decision_callback(metadata, data, session):
    """Say YES when the first feature is positive"""
    print(f'Labeling object {metadata}')
    return Label.ANOMALY if data[0] > 0.0 else Label.REGULAR


def my_on_decision_callback(metadata, data, session):
    print(f'Decision made for {metadata}: {session.last_decision}.')


def terminate_after_5_anomalies(metadata, data, session):
    if session.known_anomalies.size >= 5:
        session.terminate()


class RecordCallback:
    def __init__(self):
        self.records = []

    def __call__(self, metadata, data, session):
        self.records.append(f'{metadata} -> {session.last_decision}')

    def print_report(self):
        print('Records:')
        print('\n'.join(self.records))


record_callback = RecordCallback()

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
metadata = np.arange(data.shape[0])
model = PineForest(random_seed=0)

session = Session(
    data=data,
    metadata=metadata,
    model=model,
    decision_callback=my_decision_callback,
    # We can pass a single function/callable as a callback
    on_refit_callbacks=my_on_refit_callback,
    # Or a list of callables
    on_decision_callbacks=[
        my_on_decision_callback,
        record_callback,
        terminate_after_5_anomalies,
    ],
)
session.run()

print()
record_callback.print_report()
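
Stateful callbacks like RecordCallback can also drive termination. The sketch below stops a session after a run of consecutive "regular" decisions. It relies only on the callback signature, session.last_decision and session.terminate() used above; the Label enum and FakeSession here are stand-ins so that the snippet runs without coniferest, and StopAfterRegularStreak is a hypothetical name, not a library class.

```python
from enum import Enum


class Label(Enum):
    """Stand-in for coniferest.label.Label, so the sketch runs on its own."""
    ANOMALY = "anomaly"
    REGULAR = "regular"


class StopAfterRegularStreak:
    """Terminate the session after `limit` consecutive REGULAR decisions."""

    def __init__(self, limit):
        self.limit = limit
        self.streak = 0

    def __call__(self, metadata, data, session):
        if session.last_decision == Label.REGULAR:
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.limit:
            session.terminate()


class FakeSession:
    """Minimal stub exposing only what the callback touches."""

    def __init__(self):
        self.last_decision = None
        self.terminated = False

    def terminate(self):
        self.terminated = True


callback = StopAfterRegularStreak(limit=2)
session = FakeSession()
history = []
for decision in [Label.REGULAR, Label.ANOMALY, Label.REGULAR, Label.REGULAR]:
    session.last_decision = decision
    callback(metadata=None, data=None, session=session)
    history.append(session.terminated)
print(history)
```

With a real session, such an object would be passed in on_decision_callbacks, using the real Label class for the comparison.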

Use prior knowledge with known labels

The final component of the Session constructor we haven’t discussed yet is known_labels. It allows you to provide prior knowledge about a few samples in the dataset. This is useful when you know that some samples are anomalies or regular objects, so you can find more objects you like or get fewer false positives.

Let’s use a simulated dataset with 1024 regular objects and three “outlier” classes, each with 32 objects. Of these three classes, only one is considered anomalous, while the other two are some kind of “bogus” objects that we don’t want to find.

# Install matplotlib if you don't have it
import matplotlib.pyplot as plt
import numpy as np
from coniferest.datasets import non_anomalous_outliers
from coniferest.label import Label

data, metadata = non_anomalous_outliers(
    # Number of regular objects
    inliers=1024,
    # Number of objects per "outlier" class
    outliers=32,
    # Classification of "outlier" classes
    regions=[Label.R, Label.R, Label.A],
)

# Plot the data
index = metadata == Label.R
plt.scatter(*data[index, :].T, marker='.', color='#22114C', label='regular')
plt.scatter(*data[~index, :].T, marker='*', color='#22114C', label='anomaly')
plt.legend()
plt.show()

[Figure: _images/known_labels_dataset.png]

Let’s see what IsolationForest finds among the top 32 outliers in this dataset:

from coniferest.isoforest import IsolationForest

scores = IsolationForest(random_seed=0).fit(data).score_samples(data)
top32 = scores.argsort()[:32]

color = np.full_like(metadata, '#22114C', dtype=object)
color[top32] = '#FCBD43'

plt.cla()
plt.scatter(*data[index, :].T, marker='.', color=color[index], label='regular')
plt.scatter(*data[~index, :].T, marker='*', color=color[~index], label='anomaly')
plt.legend()
plt.show()

[Figure: _images/known_labels_isoforest.png]

Here we show the 32 candidates in “SNAD yellow”. Not bad, but what can we do with active anomaly detection?

from coniferest.pineforest import PineForest
from coniferest.session import Session
from coniferest.session.callback import TerminateAfter

session = Session(
    data,
    metadata,
    model=PineForest(random_seed=0),
    # metadata consists of true labels, so we can use it as a decision
    decision_callback=lambda metadata, data, session: metadata,
    on_decision_callbacks=TerminateAfter(32),
)
session.run()

color = np.full_like(metadata, '#22114C', dtype=object)
color[list(session.known_labels)] = '#FCBD43'

plt.cla()
plt.scatter(*data[index, :].T, marker='.', color=color[index], label='regular')
plt.scatter(*data[~index, :].T, marker='*', color=color[~index], label='anomaly')
plt.show()

[Figure: _images/known_labels_pineforest.png]

Looks good, right? But what if we know that one of the “outlier” objects is not an anomaly? Let’s suppose that before running active anomaly detection we investigated the leftmost object and found that it is a bogus object. We can mark it as a regular object and run active anomaly detection again:

from coniferest.pineforest import PineForest
from coniferest.session import Session
from coniferest.session.callback import TerminateAfter

# Key is index of the object, value is its label.
known_labels = {data[:, 0].argmin(): Label.REGULAR}

session = Session(
    data,
    metadata,
    known_labels=known_labels,
    model=PineForest(random_seed=0),
    decision_callback=lambda metadata, data, session: metadata,
    on_decision_callbacks=TerminateAfter(32),
)
session.run()

color = np.full_like(metadata, '#22114C', dtype=object)
new_labels = set(session.known_labels) - set(known_labels)
color[list(new_labels)] = '#FCBD43'

plt.cla()
plt.scatter(*data[index, :].T, marker='.', color=color[index], label='regular')
plt.scatter(*data[~index, :].T, marker='*', color=color[~index], label='anomaly')
plt.show()

[Figure: _images/known_labels_pineforest_known_regular.png]

Note that the object we marked as regular was not even selected in the previous run, but here it steered the model away from selecting other objects of its class.
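
Since known_labels is keyed by object index, prior knowledge recorded by object ID has to be translated first. A minimal sketch of such a translation follows. The helper known_labels_from_ids and the example IDs are hypothetical, and the Label enum is a stand-in for coniferest.label.Label so that the snippet runs on its own.

```python
from enum import Enum


class Label(Enum):
    """Stand-in for coniferest.label.Label, so the sketch runs on its own."""
    ANOMALY = "anomaly"
    REGULAR = "regular"


def known_labels_from_ids(metadata, labels_by_id):
    """Translate {object ID: label} into {row index: label}."""
    index_of = {object_id: idx for idx, object_id in enumerate(metadata)}
    missing = [object_id for object_id in labels_by_id
               if object_id not in index_of]
    if missing:
        raise KeyError(f"IDs not present in metadata: {missing}")
    return {index_of[object_id]: label
            for object_id, label in labels_by_id.items()}


# Hypothetical object IDs, in the same order as the rows of the data array
metadata = ["obj-a", "obj-b", "obj-c", "obj-d"]
known_labels = known_labels_from_ids(
    metadata,
    {"obj-c": Label.REGULAR, "obj-a": Label.ANOMALY},
)
print(known_labels)
```

The resulting dictionary has the same shape as the known_labels argument used above. Raising on unknown IDs is a design choice that avoids silently dropping prior knowledge when an ID is mistyped.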