RLCS: An Introduction

An issue with “AI”: explainability

Let’s call it Machine Learning. Today that’s mostly Neural networks. And mostly, that means it’s all black boxes.


There are ways around that. (e.g. Trees and other “open book” algorithms…)


John H. Holland proposed an algorithm with the Cognitive System One program (1976). Later, people came up with variations… Today we focus on Michigan-style LCS.

The goal: A new R package

Have you ever found something that no one else has done?

Learning Classifier System: Algorithm

Keywords

So how does it work?

Imagine you receive samples of data points (aka states) with their corresponding classes (aka actions). All your input entries (aka instances) form an environment.


In this case, we will accept binary string states for the environment instances (examples):

State Action/Class From the…
0010111100110010 “setosa” Iris dataset
001010 1 “Mux 6”

Wait, what? Binary input?!

Well, for now… Yes. There are alternatives. But for now…


Here using 4-bits with Gray encoding of “double quartiles” per variable, we can create binary string:

  Sepal.Length Sepal.Width Petal.Length Petal.Width  slb  swb  plb  pwb
1          5.1         3.5          1.4         0.2 0010 1111 0011 0010
2          4.9         3.0          1.4         0.2 0011 0101 0011 0010
3          4.7         3.2          1.3         0.2 0000 1101 0000 0010
             state  class
1 0010111100110010 setosa
2 0011010100110010 setosa
3 0000110100000010 setosa

Key concepts

1 - Generating a rule

The key: “#” means “I don’t care”

Covering a state with a probability of “#” values means making a rule that matches the input state and class/action.

Something that could match other (partially) similar input:

> generate_cover_rule_for_unmatched_instance('010001', 0.2)
[1] "0#0001"
> generate_cover_rule_for_unmatched_instance('010001', 0.8)
[1] "##0###"

2 - Matching

You generate new rules when you see new environment instances that do not match any rule yet. Suppose you have a set of rules. That’s your population (of rules).


  • If one(+) rule(s) in your population matches your new instance state -> increase the match count of the corresponding classifier.

  • If one(+) rule(s) in your population matches your new instance state && class/action –> increase the correct count.

3 - Rule Discovery

After a few epochs of exposing the LCS to your full environment, you will have a few rules that match correctly a given instance, the correct set.


Match and Correct count are indicators of how good each rule is. But are there other better possibilities?


Take all the correct set, and apply a Genetic Algorithm to that set, to generate new rules!

3 - Rule Discovery (GA)

mut_point <- which(runif(nchar(t_instance_state)) < mut_prob)

4 - Population Size

Matching must go through all the population every time an environment instance is presented to the LCS.

\[O(match) = epochs*N(environment)*N(population)\]

Where \(N()\) means “size of”.

\[e.g. 1.000 * 5.000 * 1.000 = 5.000.000.000\]


One option: Reduce the population of rules.

4 - Population Size


  • Subsumption: A “perfect” classifier that has 100% accuracy might be simpler (more # characters) than other classifiers in the population with same classification. Keep only the best classifiers. (Implemented: Accuracy-based subsumption)

  • Compaction: You can keep all classifiers that have e.g. 60%+ accuracy after a number of epochs.

  • Deletion: But you can also cap the population size, keeping only e.g. 10.000 best classifiers.

5 - Prediction

Imagine a new sample/instance, never seen before. (Test environment)


Prediction is about returning the match set for that new instance.


Majority (possibly weighted by numerosity, accuracy…) of proposed class/action will be the prediction.

That’s it! It also means, this is natively an ensemble learning algorithm.

6 - Other uses!

Why talk about “environment” and “action”? This comes from the world of Reinforcement Learning.


And because one can read the rules, and “understand” the population, you can also use the LCS to interpret the results and thus do data mining!

All with the same algorithm!

Demo Time

Iris

[1] "Training Runtime: 5.08055098454158" (minutes)
[1] "Training Set Size: 119"
[1] "Confusion 'Matrix' for Class setosa:"
setosa 
     9 
[1] "Confusion 'Matrix' for Class versicolor:"
    setosa versicolor  virginica 
         1         10          2 
[1] "Confusion 'Matrix' for Class virginica:"
versicolor  virginica 
         1          7 

Iris

visualizing one classifier - iris

Images Classifier

Epistasis

LCS can somehow recognize how two different parts interact. Aptly… The term comes from of genetics (genes modified by other genes…). (e.g. XOR…)

  condition action match_count correct_count accuracy numerosity first_seen
     10##1#      1       15913         15913        1        324        688
     000###      0       15575         15575        1        357       3394
     001###      1       14842         14842        1        298       9231
     11###0      0       13149         13149        1        263      22839
     ...

Also included: Data Mining

Given that the rules are “expressive”, sometimes you can ask the LCS to find rules that appear in your data, NOT to classify future samples, but to identify what is important in different classes of your data.


REAL WORLD anecdote: inventory of 10K rows with 20 columns, each duly binary encoded. I learnt something about my inventory!

RL, TOO!

First self-brain-storm on RL with LCS

RL Video

Some R code

A (future, sorry!) R package

A package to implement a simple version of the Learning Classifier System algorithm:

  • Binary Alphabet, tournament/one-point crossover GA, accuracy based, Michigan-style LCS

  • With examples, demonstrating the implementation for:

    • Data Mining

    • Supervised Learning

    • Reinforcement Learning

How did I approach the thing?

  • Lists. Lists, everywhere. Which might have been a bad idea… (data.table?)

  • from there, lapply() & al. is then my best friend

  • Start small, grow fast (because I get obsessed)

  • then clean it

  • then clean it some more

  • ever postponing the move to a Package, though

Examples

Before

lcs_res <- rlcs_meta_train(train_environment,
                           1, ## Warmup with just one epoch
                           wildcard_prob,
                           rd_trigger,
                           parents_selection_mode,
                           mutation_probability,
                           tournament_pressure,
                           deletion_trigger) ## Deletion won't be triggered

Too many parameters! (Uncle Bob wouldn’t like it)

Examples

After, using an object (reference class, “R5”, in this case)

default_lcs_hyperparameters <- RLCS_hyperparameters()
example_lcs <- rlcs_train(train_environment, default_lcs_hyperparameters)

Examples

Or, you know…

source("run_params/datamining_examples_recommended_hyperparameters_v001.R")
basic_hyperparameters <- RLCS_hyperparameters(
  wildcard_prob = wildcard_prob,
  ## defaults for rd_trigger, mutation_probability,
  ## parents_selection_mode && tournament_pressure
  n_epochs = n_epochs,
  deletion_trigger = deletion_trigger,
  deletion_threshold = deletion_threshold
)

## It makes it more readable here:
example_lcs <- rlcs_train(train_environment, basic_hyperparameters)

Examples

Before

inc_match_count <- function(M_pop) { ## All versions
  lapply(M_pop, \(x) {
    x$match_count <- x$match_count + 1
    x
  })
}

inc_correct_count <- function(C_pop) { ## SL Specific
  lapply(C_pop, \(x) {
    x$correct_count <- x$correct_count + 1
    x
  })
}

inc_action_count <- function(A_pop) { ## RL Specific
  lapply(A_pop, \(x) {
    x$action_count <- x$action_count + 1
    x
  })
}

Examples

After, using a function factory

## Function factory to increase parameter counts
inc_param_count <- function(param) {
  param <- as.name(param)
  function(pop) {
    lapply(pop, \(x) {
      x[[param]] <- x[[param]] + 1
      x
    })
  }
}

inc_match_count <- inc_param_count("match_count")
inc_correct_count <- inc_param_count("correct_count")
inc_action_count <- inc_param_count("action_count")

Examples

Before

## Support function for human-compatible printing:
make_pop_printable <- function(classifier) {
    
    df <- plyr::rbind.fill(lapply(1:length(classifier), \(i) {
        t_c <- classifier[[i]]
        data.frame(id = t_c$id,
                   condition = t_c$condition_string,
                   action = t_c$action,
                   match_count = t_c$match_count,
                   correct_count = t_c$correct_count,
                   accuracy = t_c$accuracy,
                   numerosity = t_c$numerosity,
                   first_seen = t_c$first_seen)
    }))
    df[order(df$accuracy, df$numerosity, decreasing = T),]
}

(Even the parameter name is wrong…)

Examples

After - S3 object (dependency! plyr::rbind.fill)

print.rlcs_population <- function(pop) {
  if(length(pop) == 0) return(NULL)
  
  pop <- lcs_best_sort_sl(pop)
  pop <- unclass(pop)
  plyr::rbind.fill(lapply(1:length(pop), \(i) {
    t_c <- pop[[i]]
    data.frame(condition = t_c$condition_string, action = t_c$action,
               match_count = t_c$match_count, correct_count = t_c$correct_count,
               accuracy = t_c$accuracy, numerosity = t_c$numerosity,
               first_seen = t_c$first_seen)
  }))
}
print(example_lcs_population)

Then again


This is all work in progress.


I plan to make it into a CRAN Package.

So: document more, write more tests, reorganize functions…

More Resources

Supplementary

Why nobody has done it yet?

It’s not fast

  • There are many sequential steps, rather unavoidable ones at that

  • Not ideal to compete with a world of GPUs and parallel processing (yet ;))

It’s “complex”

  • Or so does the Wikipedia entry say…

  • When it comes to “alphabets”, it does get messy, I’ll admit

Execution Speed

For instance, this is a “slow” algorithm. Option: RCpp for Matching? (under testing right now!)

profviz

Parallel Computing

Parallel computing? %dopar% was tested (it works, but…)

  • vertical and horizontal partitioning
    • Break data set (vertical). Two options

      • instances subsets (reduce population covered per thread/core)

      • Substrings of states (reduce search space)

      • both are “risky”

    • run fewer iterations (epochs) on full dataset, but on several cores in parallel

Parallel Computing

Depending on the dataset, it can be done… Or not…

Reinforcement Learning Conundrum

We’ve seen it works, but… How do you package an RL algorithm?

You must make assumptions about the “world” your agent is going to interact with. This makes things complicated:

  • What to include inside the package? What not?

  • What to expose from the package? What not?

And a few other such questions slow me down a bit…

Thank you!