Further Down AI Powered Chatbot Rabbit Hole

Summary

In my previous article Chatbots, AI and Docker! I talked about a little of the theory behind this but for this article I wanted to fully go down the rabbit hole and produced my own chatbot. To do this, I had to find an updated chatterbot fork, learn a little more python to handle dependencies better, create my own fork of corpus/training material and learn Google Cloud Run. Ultimately you can skip straight to the source if you like. That’s the great part of GitOps/IaC

And Then Some

Previously, I had a workable local instance that I was able to host in podman/kind but I wanted to put this in my hosting environment on Google. In order to personalize this, I wanted to be able to add some training data and use some better practices. Having previously used Google App Engine, assumedly that would naturally be the landing place for this. I then ran into some hiccups and came across Cloud Run which was not originally available and seemed like a suitable fit as it is built for containerized workflows. It provided me a way to use my existing Dockerfile to unify the build and deploy. For tools, I have a separate build and test workflow in my cloudbuild.yaml.

Get on with Chatbots!

In my last article, I mentioned I had to find a fork of chatterbot because it has not been recently maintained. In reality though it only allowed command line prompting which is not terribly useful for a wider audience to test. I came across this amazing medium post which I have to give full credit for (and do in the html as well). The skin is pretty amazing. It also provides a wealth of in depth details.

For the web framework, I opted to use Flask and gunicorn which was fairly trivial to get going after finding that great medium post above.

Training Data

Without any training data AI/ML does not really exist. It needs to be pre-trained and/or train “on the job”. For this, chatterbot-corpus comes into play. This is a pre-built training data set for the chatterbot library. It has some decent basic training. I wanted to be able to add my own and based on the input of casmith, its in python so shouldn’t it be able to converse with Monty Python quotes? So I did and created my own section for that.

categories:
- Monty
- humor
conversations:
- - What is your name?
  - My name is Sir Lancelot of Camelot.
- - What is your quest?
  - To seek the Holy Grail.
- - What is the air speed velocity of an unladen swallow?
  - What do you mean? An African or European swallow?
- - How do know so much about swallows?
  - Well, you have to know these things when you're a king, you know.

I have the real-time training disabled or rather put my chatterbot into read only mode because the internet can be a cruel place and I don’t need my creation coming home with a foul mouth! For my lab, the training is loaded at image creation time. This is primarily because its using the default sqlite back end. I could easily use a database for this and load the training out of band so it doesn’t require a deploy.

Logic Adapters

You may be thinking this is a simple bot that’s just doing string matching to figure out how to respond. For the most part you’re correct. This is not deep learning and it doesn’t fully understand what you are asking. With that said its very extensible with multiple logic adapters. The default is a “BestMatch” based on text input. Others allow it to report time and do math. It will weigh the confidence of the response on each adapter to let the highest scoring/weighing response win. Pretty neat!

chatbot = ChatBot(
    "Sure, Not",
    logic_adapters=[
        'chatterbot.logic.BestMatch',
        'chatterbot.logic.MathematicalEvaluation',
        'chatterbot.logic.TimeLogicAdapter'
    ],
    read_only=True
)

Over To The Infrastructure

For all of this, it starts with a Dockerfile. I already had this but it was a little bloated with build dependencies. Therefore, I created a multistage image using virtual python environment as guided by https://pythonspeed.com/articles/multi-stage-docker-python. I am not new to multistage images. My golang images use it. I was, however, new to doing this with Python. Not only did it reduce my image size down 100MB but it also removed 30 vulnerabilities from the images because of a dependency on git for some of the python libraries.

Cloud Run

To get deployed to Cloud Run, it was pretty simple although there were a few trial an errors due to permissions. The service account needed Cloud Run Admin access. Aside from that, this pumped everything through and let me keep my singular Dockerfile.

steps:
  # Docker Build
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 
           'us.gcr.io/${PROJECT_ID}/chatbot:${SHORT_SHA}', '.']

  # Docker push to Google Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push',  'us.gcr.io/${PROJECT_ID}/chatbot:${SHORT_SHA}']

  # Deploy to Cloud Run
  - name: google/cloud-sdk
    args: ['gcloud', 'run', 'deploy', 'chatbot', 
           '--image=us.gcr.io/${PROJECT_ID}/chatbot:${SHORT_SHA}', 
           '--region', 'us-central1', '--platform', 'managed', 
           '--allow-unauthenticated', '--port', '5000', '--memory', '256Mi',
           '--no-cpu-boost']

# Store images in Google Artifact Registry 
images:
  - us.gcr.io/${PROJECT_ID}/chatbot:${SHORT_SHA}

It really was this simple since I had a working local environment and working Dockerfile. Just don’t look at my commit history 🙂 Quite a few silly mistakes were made if you look deep enough.

Caveat

Google App Engine lets you use custom domain mapping and bring your own certificates. I use Cloudflare to protect my entire environment and for this in GAE I placed a Cloudflare Origin certificate to help prevent it from being accessed by the outside world as no browser would trust it bypassing Cloudflare.

Google Cloud run has a preview feature of custom domain mapping. The easiest of the options doesn’t support custom certificates and therefore wants to issue you a certificate. The temp workaround for this is to not proxy through Cloudflare until the certificate is issued and then turn on proxy. Rinse and repeat yearly when the cert needs to be renewed.

I have to imagine this will get rectified once out of preview to be feature parity with Google App Engine since it seems Cloud Run intends to replace GAE.

Credits

For Multi-stage help with Python Docker Images – https://pythonspeed.com/articles/multi-stage-docker-python

For the entire UI of this demo/test – https://medium.com/@kumaramanjha2901/building-a-chatbot-in-python-using-chatterbot-and-deploying-it-on-web-7a66871e1d9b

Chatbots, AI and Docker!

Summary

I have started my learning journey about AI. With that I started reading Artificial Intelligence & Generative AI for Beginners. One of the use cases it went through for NLP (Natural Language Processing) was Chatbots.

To the internet I went – ready to go down a rabbit hole and came across a Python library called ChatterBox. I knew I did not want to bloat and taint my local environment so I started using a Docker instance in Podman.

Down the Rabbit Hole

I quickly realized the project has not been actively maintained in a number of years and had some specific and dated dependencies. For example, it seemed to do best with python 3.6 whereas the latest at the time if this writing is 3.12.

This is where Docker shines though. It is really easy to find older images and declare which versions you want. The syntax of Dockerfile is such that you can specify the image and layer the commands you want to run on it. It will work every time, no matter where it is deployed from there.

I eventually found a somewhat updated fork of it here which simplified things but it still had its nuances. chatterbox-corpus (the training data) required PyYaml 3.13 but to get this to work it needed 5.

Dockerfile

FROM python:3.6-slim

WORKDIR /usr/src/app

#COPY requirements.txt ./
#RUN pip install --no-cache-dir -r requirements.txt
RUN pip install spacy==2.2.4
RUN pip install pytz pyyaml chatterbot_corpus
RUN python -m spacy download en

RUN pip install --no-cache-dir chatterbot==1.0.8

COPY ./chatter.py .

CMD [ "python", "./chatter.py" ]

Here we can see, I needed a specific version of Python(3.6) whereas at the time of writing the latest is 3.12. It also required a specific spacy package version. With this I have a repeatable environment that I can reproduce and share (to peers or even to production!)

Dockerfile2

Just for grins, when I was able to use the updated fork it did not take much!

FROM python:3.12-slim

WORKDIR /usr/src/app

#COPY requirements.txt ./
#RUN pip install --no-cache-dir -r requirements.txt
RUN pip install spacy==3.7.4 --only-binary=:all:
RUN python -m spacy download en_core_web_sm

RUN apt-get update && apt-get install -y git
RUN pip install git+https://github.com/ShoneGK/ChatterPy

RUN pip install chatterbot-corpus

RUN pip uninstall -y PyYaml
RUN pip install --upgrade PyYaml

COPY ./chatter.py .

CMD [ "python", "./chatter.py" ]

Without Docker

Without Docker(podman) I would have tainted my local environment with many different dependencies. At the point of getting it all working, I couldn’t be sure it would work properly on another machine. Even if it did, was their environment tainted as well? With Docker, I knew I could easily repeat the process from a fresh image to validate.

Previous projects I worked on that were python related could have also tainted my local to cause unexpected results on other machines or excessive hours troubleshooting something unique to my machine. All of that avoided with a container image!

Declarative Version Management

When it becomes time to update to the next version of Python, it will be a really easy feat. Many tools will even parse these types of files and do dependency management like Dependabot or Snyk

Mozilla SOPS To Protect My cloudflared Secrets In Kubernetes

Summary

Aren’t these titles getting ridiculous? When talking about some of these stacks, you need a laundry list of names to drop. In this case I was working on publishing my CloudFlare Tunnels FTW work that houses my kind lab into my public GitHub Repository. I wanted to tie in FluxCD to it and essentially be able to easily blow away the cluster and recreate with secrets all through FluxCD.

I was able to successfully achieve that with all but the private key which needs to be manually loaded into the cluster so it can decrypt the sensitive information.

Why Do We Care About This?

While trying to go fully GitOps for Kubernetes, everything is stored in a Git Repository. This makes change management extremely simple and reduces complexities of compliance. Things like policy bots can automate change approval processes and document. But generally everything in Git is clear text.

Sure, there are private repositories but do all the the developers that work on the project need to read sensitive records like passwords for that project? Its best that they don’t and as a developer you really don’t want that responsibility!

Mozilla SOPS To The Rescue!

Mozilla SOPS is very well documented. In my case I’m using Flux which also has great documentation. For my lab, this work is focusing on “cluster3” which simply deploys my https://www.woohoosvcs.com and https://tools.woohoosvcs.com in my kind lab for local testing before pushing out to production.

Create Key with Age

Age appears to be the preferred encryption tool to use right now. It is pretty simple to use and going by the flux documentation we simply need to run

age-keygen -o age.agekey

This will create a file that contains both the public and private key. The public key will be in the comment and the command line will output the public key. We will need the private key later to add as a secret manually to decrypt. I’m sure there are ways of getting this into the cluster securely but for this blog article this is the only thing done outside of GitOps.

Let’s Get To the Details!

With Flux I have a bootstrap script to load flux into the environment. I also have a generate_cluster3.sh script that creates the yaml.

The pertinent lines to add to it above the standard are the following. The first line indicates that sops is a decryption provider. The second indicates the name of the secret to be stored. Flux requires this to be in the flux-system namespace

    --decryption-provider=sops \
    --decryption-secret=sops-age \

From there you simpley need to run the bootstrap_cluster3.sh which just loads the yaml manifests for flux. With flux you can do this on the command line but I preferred to have this generation and bootstrapping in Git. As you want to upgrade flux there’s also a upgrade_cluster3.sh script that is really a one liner.

flux install --export > ./clusters/cluster3/flux-system/gotk-components.yaml

This will update the components. If you’re already bootstrapped and running flux, you can run this and commit to push out the upgrades to use flux to upgrade itself!

In the root of the cluster3 folder I have .sops.yaml. This tells the kustomization module in flux what to decrypt and which public key to use.

Loading Private Key Via Secret

Once you have run the bootstrap_cluster3.sh you can then load the private key via

cat age.agekey | kubectl create secret generic sops-age \
  --namespace=flux-system --from-file=age.agekey=/dev/stdin

Caveat

This lab won’t work for you out of the box. This is because it requires a few confidential details

  1. My cloudflared secret is encrypted with my public key. You do not have my private key so you cannot load it into your cluster to decrypt it
  2. I have some private applications I am pushing into my kind cluster. You will have to clone and modify for your needs