We saw in Part 1 the basic structure of a decision tree. We are now going to create a class to handle the samples and labels of a data set. This class will be used in the remaining parts of this serie.

Handling the samples and labels of a data set

We create a class LabeledData that contains both the samples and labels of a data set.

At each node of a decision tree, the data set will have to be divided into new subsets. In practice, a subset will be a combination of references to the samples and labels of the original superset as well as a vector of indices that indicates which samples are actually included in the subset. The whole original data set has therefore an indices vector made of all the indices of the samples.

class LabeledData[A, B] private (
    private val referenceSamples: Vector[A],
    private val referenceLabels: Vector[B],
    private val indices: Vector[Int]) {

  def size: Int = indices.length

  def isEmpty: Boolean = size == 0

  def subset(indices: Vector[Int]): LabeledData[A, B] = 
    new LabeledData(referenceSamples, referenceLabels, indices)

  def emptySet: LabeledData[A, B] = subset(Vector())

  def groupBy[C](f: A => C): Map[C, LabeledData[A, B]] = 
    ((indices groupBy {idx => f(referenceSamples(idx))}) 
      mapValues subset)

  def partition(f: A => Boolean): (LabeledData[A, B], LabeledData[A, B]) = {
    val groups = groupBy(f)
    (groups(true), groups(false))

  def union(that: LabeledData[A, B]): LabeledData[A, B] = {
      referenceSamples == that.referenceSamples && 
        referenceLabels == that.referenceLabels,
      "Union is only allowed for subsets of the same superset.")
    subset((indices ++ that.indices).distinct)  

  def countDistinctLabels: Int = 
    (indices map referenceLabels).distinct.length

  def mapSamples[C](f: A => C): Vector[C] = 
    indices map {idx => f(referenceSamples(idx))}

object LabeledData {

  def apply[A, B](samples: Vector[A], labels: Vector[B]): 
      LabeledData[A, B] = {
      samples.length == labels.length, 
      "Samples and labels should have the same number of elements.")
    new LabeledData[A, B](samples, labels, samples.indices.toVector)

Let’s quickly show how to use some methods of the above class:

val referenceSamples = Vector(
  Vector(1, 2, 3),
  Vector(4, 5, 6),
  Vector(7, 8, 9))

val referenceLabels = Vector(0.1, 0.4, 0.7)

val data = LabeledData(referenceSamples, referenceLabels)

println(s"Data Set size: ${data.size}")
println(s"Data Set is empty: ${data.isEmpty}")

val s0 = data.subset(Vector(0, 1))
val s1 = data.subset(Vector(1, 2))

println(s"Subset 0 size: ${s0.size}")
println(s"Subset 1 size: ${s1.size}")

val u = s0.union(s1)

println(s"Union size: ${u.size}")

val m = data mapSamples {_.sum}
(0 to 2) map (i => s"Sum of ${referenceSamples(i)} is ${m(i)}") foreach println
Data Set size: 3
Data Set is empty: false
Subset 0 size: 2
Subset 1 size: 2
Union size: 3
Sum of Vector(1, 2, 3) is 6
Sum of Vector(4, 5, 6) is 15
Sum of Vector(7, 8, 9) is 24


In this part, we created a class to handle the samples and labels of a data set. We will see in Part 3 how to compute the values of the leaves of a prediction tree to fit a data set.