OpenShift LightSpeed (OLS) is an AI-powered assistant that runs on OpenShift and answers product questions using backend LLM services. Currently OpenAI, Azure OpenAI, OpenShift AI, RHEL AI, and Watsonx are officially supported as backends. Other providers, even ones that are not fully supported, can be used as well. For example, it is possible to use BAM (IBM's research environment). It is also possible to run InstructLab locally, configure a model, and connect to it.
- Prerequisites
- Installation
- Configuration
- 1. Configure OpenShift LightSpeed (OLS)
- 2. Configure LLM providers
- 3. Configure OLS Authentication
- 4. Configure OLS TLS communication
- 5. (Optional) Configure the local document store
- 6. (Optional) Configure conversation cache
- 7. (Optional) Incorporating additional CA(s). You have the option to include an extra TLS certificate into the OLS trust store as follows.
- 8. (Optional) Configure the number of workers
- 9. Registering a new LLM provider
- 10. TLS security profiles
- 11. Fine tuning
- Usage
- Project structure
- New `pdm` commands available in project repository
- Additional tools
- Contributing
- License
- Python 3.11 or Python 3.12
- please note that Python 3.13 is not officially supported yet, because OpenShift LightSpeed depends on some packages that cannot be used with this Python version
- all sources are made (backward) compatible with Python 3.11; it is checked on CI
- Git, pip and PDM
- An LLM API key or API secret (in case of Azure OpenAI)
- (Optional) extra certificates to access LLM API
git clone https://github.com/openshift/lightspeed-service.git
cd lightspeed-service
make install-deps
This step depends on provider type
Please look into (OpenAI api key)
Please look at the following articles describing how to retrieve an API key or secret from Azure: Get subscription and tenant IDs in the Azure portal and How to get client id and client secret in Azure Portal. Currently both ways of authenticating to Azure OpenAI are supported: by API key or by client secret.
Please look at Generating API keys for authentication
(TODO: to be updated)
(TODO: to be updated)
1. Get a BAM API Key at [https://bam.res.ibm.com](https://bam.res.ibm.com)
* Login with your IBM W3 Id credentials.
* Copy the API Key from the Documentation section.
![BAM API Key](docs/bam_api_key.png)
2. BAM API URL: https://bam-api.res.ibm.com
This depends on the configuration, but usually there is no need to generate or use an API key.
Here is a proposed scheme for storing API keys on your development workstation. It is similar to how private keys are stored for OpenSSH. It keeps copies of files containing API keys from getting scattered around and forgotten:
$ cd <lightspeed-service local git repo root>
$ find ~/.openai -ls
72906922 0 drwx------ 1 username username 6 Feb 6 16:45 /home/username/.openai
72906953 4 -rw------- 1 username username 52 Feb 6 16:45 /home/username/.openai/key
$ ls -l openai_api_key.txt
lrwxrwxrwx. 1 username username 26 Feb 6 17:41 openai_api_key.txt -> /home/username/.openai/key
$ grep openai_api_key.txt olsconfig.yaml
credentials_path: openai_api_key.txt
OLS configuration is in YAML format. It is loaded from a file referred to by the `OLS_CONFIG_FILE` environment variable and defaults to `olsconfig.yaml` in the current directory.
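The lookup rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the service's actual loader; the function name `resolve_config_path` is hypothetical, and treating an empty variable like an unset one is an assumption.

```python
import os
from pathlib import Path

# Default file name taken from the text above.
DEFAULT_CONFIG_FILE = "olsconfig.yaml"


def resolve_config_path(environ=None) -> Path:
    """Return the configuration file path selected by OLS_CONFIG_FILE or the default."""
    env = os.environ if environ is None else environ
    # Assumption: an empty value is treated the same as an unset variable.
    return Path(env.get("OLS_CONFIG_FILE") or DEFAULT_CONFIG_FILE)
```

A relative result such as `olsconfig.yaml` is then interpreted against the current working directory when the file is opened.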
You can find an example configuration in the `examples/olsconfig.yaml` file in this repository.
The example configuration file defines six LLM providers: BAM, OpenAI, Azure OpenAI, Watsonx, OpenShift AI VLLM (RHOAI VLLM), and RHELAI (RHEL AI), with BAM as the default provider. If you prefer to use a different LLM provider than BAM, such as OpenAI, ensure that the provider definition points to a file containing a valid API key for that provider (OpenAI, Watsonx, etc.), and change the `default_model` and `default_provider` values to reference the selected provider and model.
The example configuration also defines a locally running provider, InstructLab, which is OpenAI-compatible and can use several models. Please look at the instructlab pages for detailed information on how to set up and run this provider.
API credentials are in turn loaded from files specified in the config YAML by the `credentials_path` attributes. If these paths are relative, they are relative to the current working directory. To use the example olsconfig.yaml as is, place your BAM API Key into a file named `bam_api_key.txt` in your working directory.
[!NOTE]
There are two supported methods to provide credentials for Azure OpenAI. The first method is compatible with other providers, i.e. `credentials_path` contains a directory name containing one file with the API token. In the second method, that directory should contain three files named `tenant_id`, `client_id`, and `client_secret`. Please look at the following articles describing how to retrieve this information from Azure: Get subscription and tenant IDs in the Azure portal and How to get client id and client secret in Azure Portal.
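The second method can be pictured as reading three small files from one directory. The following is a minimal sketch under that assumption; the function name `read_azure_secret_directory` is hypothetical and the error handling is only one possible choice.

```python
from pathlib import Path

# File names taken from the note above.
AZURE_SECRET_FILES = ("tenant_id", "client_id", "client_secret")


def read_azure_secret_directory(directory: str) -> dict[str, str]:
    """Read tenant_id, client_id and client_secret from the given directory."""
    root = Path(directory)
    missing = [name for name in AZURE_SECRET_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing credential files: {', '.join(missing)}")
    # Each file holds exactly one value; trailing newlines are stripped.
    return {
        name: (root / name).read_text(encoding="utf-8").strip()
        for name in AZURE_SECRET_FILES
    }
```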
Multiple models can be configured, but `default_model` will be used unless specified differently via a REST API request:

type: openai
url: "https://api.openai.com/v1"
credentials_path: openai_api_key.txt
models:
  - name: gpt-4-1106-preview
  - name: gpt-4o-mini
Make sure the `url` and `deployment_name` are set correctly.

- name: my_azure_openai
  type: azure_openai
  url: "https://myendpoint.openai.azure.com/"
  credentials_path: azure_openai_api_key.txt
  deployment_name: my_azure_openai_deployment_name
  models:
    - name: gpt-4o-mini
Make sure the `project_id` is set up correctly.

- name: my_watsonx
  type: watsonx
  url: "https://us-south.ml.cloud.ibm.com"
  credentials_path: watsonx_api_key.txt
  project_id: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
  models:
    - name: ibm/granite-3-8b-instruct
It is possible to use RHELAI as a provider too. That provider is OpenAI-compatible and can be configured the same way as other OpenAI providers. For example, if RHEL AI is running as an EC2 instance and the `granite-7b-lab` model is deployed, the configuration might look like:

- name: my_rhelai
  type: openai
  url: "http://{PATH}.amazonaws.com:8000/v1/"
  credentials_path: openai_api_key.txt
  models:
    - name: granite-7b-lab
To use RHOAI (Red Hat OpenShift AI) as a provider, the following configuration can be used (the `mistral-7b-instruct` model is supported by RHOAI, as well as other models):

- name: my_rhoai
  type: openai
  url: "http://{PATH}:8000/v1/"
  credentials_path: openai_api_key.txt
  models:
    - name: mistral-7b-instruct
It is possible to configure the service to use a local ollama server. Please look at the `examples/olsconfig-local-ollama.yaml` file, which describes all required steps.
- Common providers configuration options

  - `name`: unique name, can be any proper YAML literal
  - `type`: provider type: any of `bam`, `openai`, `azure_openai`, `rhoai_vllm`, `rhelai_vllm`, or `watsonx`
  - `url`: URL to be used to call LLM via REST API
  - `credentials_path`: path to secret (token) used to call LLM via REST API
  - `models`: list of models configuration (model name + model-specific parameters)

    Notes:
    - `Context window size` varies based on provider/model.
    - `Max response tokens` depends on user need and should be in reasonable proportion to the context window size. If the value is too low, there is a risk of response truncation. If it is too high, too many tokens are reserved for the response and the history/RAG context is truncated unnecessarily.
    - These settings are optional. If they are not set, defaults are used (which may be incorrect and may cause truncation and potentially an error by exceeding the context window).

- Specific configuration options for WatsonX

  - `project_id`: as specified on WatsonX AI page

- Specific configuration options for Azure OpenAI

  - `api_version`: as specified in official documentation; if not set, `2024-02-15-preview` is used by default.
  - `deployment_name`: as specified in AzureAI project settings

- Default provider and default model

  - One provider and its model need to be selected as default. When no provider+model is specified in REST API calls, the default provider and model are used:

        ols_config:
          default_provider: my_bam
          default_model: ibm/granite-3-8b-instruct
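The fallback to the default provider and model can be illustrated with a small Python sketch. It is not the service's request handling code: the names `Defaults` and `select_provider_and_model` are hypothetical, and rejecting a request that names only one of the two values is an assumption made here for clarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defaults:
    """Values of default_provider and default_model from ols_config."""

    provider: str
    model: str


def select_provider_and_model(defaults, provider=None, model=None):
    """Return the (provider, model) pair to use for one REST API request."""
    if provider is None and model is None:
        # Nothing specified in the request: fall back to the configured defaults.
        return defaults.provider, defaults.model
    if provider is None or model is None:
        # Assumption: provider and model are only meaningful as a pair.
        raise ValueError("provider and model must be specified together")
    return provider, model
```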
[!NOTE] Currently, only K8S-based authentication can be used. In future versions, more authentication mechanisms will be configurable.
This section provides guidance on how to configure authentication within OLS. It includes instructions on enabling or disabling authentication, configuring authentication through OCP RBAC, overriding authentication configurations, and specifying a static authentication token in development environments.
- Enabling and Disabling Authentication

  Authentication is enabled by default in OLS. To disable authentication, modify the `dev_config` in your configuration file as shown below:

      dev_config:
        disable_auth: true

- Configuring Authentication with OCP RBAC

  OLS utilizes OCP RBAC for authentication, which requires connectivity to an OCP cluster. It automatically selects the configuration from the first available source, either an in-cluster configuration or a KubeConfig file.

- Overriding Authentication Configuration

  You can customize the authentication configuration by overriding the default settings. The configurable options include:

  - Kubernetes Cluster API URL (`k8s_cluster_api`): The URL of the K8S/OCP API server where tokens are validated.
  - CA Certificate Path (`k8s_ca_cert_path`): Path to a CA certificate for clusters with self-signed certificates.
  - Skip TLS Verification (`skip_tls_verification`): If true, the Kubernetes client skips TLS certificate validation for the OCP cluster.

  To apply any of these overrides, update your configuration file as follows:

      ols_config:
        authentication_config:
          k8s_cluster_api: "https://api.example.com:6443"
          k8s_ca_cert_path: "/Users/home/ca.crt"
          skip_tls_verification: false

- Providing a Static Authentication Token in Development Environments

  For development environments, you may wish to use a static token for authentication purposes. This can be configured in the `dev_config` section of your configuration file:

      dev_config:
        k8s_auth_token: your-user-token

  Note: using a static token requires you to set the `k8s_cluster_api` option described under Overriding Authentication Configuration, because it disables loading the OCP config from in-cluster/kubeconfig.
This section provides instructions on configuring TLS (Transport Layer Security) for the OLS Application, enabling secure connections via HTTPS. TLS is enabled by default; however, if necessary, it can be disabled through the `dev_config` settings.
- Enabling and Disabling TLS

  By default, TLS is enabled in OLS. To disable TLS, adjust the `dev_config` in your configuration file as shown below:

      dev_config:
        disable_tls: true

- Configuring TLS in local Environments:

  - Generate Self-Signed Certificates: To generate self-signed certificates, run the following command from the project's root directory:

        ./scripts/generate-certs.sh

  - Update OLS Configuration: Modify your config.yaml to include paths to your certificate and its private key:

        ols_config:
          tls_config:
            tls_certificate_path: /full/path/to/certs/cert.pem
            tls_key_path: /full/path/to/certs/key.pem

  - Launch OLS with HTTPS: After applying the above configurations, OLS will run over HTTPS.

- Configuring OLS in OpenShift:

  For deploying in OpenShift, Service-Served Certificates can be utilized. Update your ols-config.yaml as shown below, based on the example provided in the examples directory:

      ols_config:
        tls_config:
          tls_certificate_path: /app-root/certs/cert.pem
          tls_key_path: /app-root/certs/key.pem

- Using a Private Key with a Password

  If your private key is encrypted with a password, specify a path to a file that contains the key password as follows:

      ols_config:
        tls_config:
          tls_key_password_path: /app-root/certs/password.txt
The following command downloads a copy of the whole image containing RAG embedding model and vector database:
make get-rag
Please note that the link to the specific image to be downloaded is stored in the file `build.args` (that file is auto-updated by bots when a new RAG is re-generated).
Conversation cache can be stored in memory (its content will be lost after shutdown) or in a PostgreSQL database. The storage type is specified in the `olsconfig.yaml` configuration file.
- Cache stored in memory:

      ols_config:
        conversation_cache:
          type: memory
          memory:
            max_entries: 1000

- Cache stored in PostgreSQL:

      conversation_cache:
        type: postgres
        postgres:
          host: "foobar.com"
          port: "1234"
          dbname: "test"
          user: "user"
          password_path: postgres_password.txt
          ca_cert_path: postgres_cert.crt
          ssl_mode: "require"

  In this case, the file `postgres_password.txt` contains the password required to connect to PostgreSQL. A CA certificate can also be specified via `ca_cert_path` (`postgres_cert.crt` in this example) to verify a trusted TLS connection with the server. All these files need to be accessible.
7. (Optional) Incorporating additional CA(s). You have the option to include an extra TLS certificate into the OLS trust store as follows.
ols_config:
  extra_ca:
    - "path/to/cert_1.crt"
    - "path/to/cert_2.crt"
This action may be required for self-hosted LLMs.
By default, the number of workers is set to 1. You can increase the number of workers to scale up the REST API by modifying the `max_workers` config option in `olsconfig.yaml`.

ols_config:
  max_workers: 4
Please look here for more info.
A TLS security profile can be set for the service itself and also for any configured provider. To specify a TLS security profile for the service, the following section can be added into the `ols` section in the `olsconfig.yaml` configuration file:

tlsSecurityProfile:
  type: OldType
  ciphers:
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
  minTLSVersion: VersionTLS13

- `type` can be set to: OldType, IntermediateType, ModernType, or Custom
- `minTLSVersion` can be set to: VersionTLS10, VersionTLS11, VersionTLS12, or VersionTLS13
- `ciphers` is a list of enabled ciphers. The values are not checked.
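To make the `minTLSVersion` values more concrete, the sketch below maps them onto the constants of Python's standard `ssl` module and applies one to a client context. This mapping is an illustration only; the table name `MIN_TLS_VERSIONS` and the function `client_context` are hypothetical, and the service may translate the profile differently.

```python
import ssl

# minTLSVersion values from the list above mapped to standard library constants.
MIN_TLS_VERSIONS = {
    "VersionTLS10": ssl.TLSVersion.TLSv1,
    "VersionTLS11": ssl.TLSVersion.TLSv1_1,
    "VersionTLS12": ssl.TLSVersion.TLSv1_2,
    "VersionTLS13": ssl.TLSVersion.TLSv1_3,
}


def client_context(min_tls_version: str) -> ssl.SSLContext:
    """Create a client-side SSL context that refuses older protocol versions."""
    context = ssl.create_default_context()
    # A KeyError here means the profile contains an unknown version name.
    context.minimum_version = MIN_TLS_VERSIONS[min_tls_version]
    return context
```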
Please look into the `examples` folder that contains `olsconfig.yaml` with a filled-in TLS security profile for the service.
Additionally the TLS security profile can be set for any configured provider. In this case the `tlsSecurityProfile` needs to be added into the `olsconfig.yaml` file into the `llm_providers/{selected_provider}` section. For example:
llm_providers:
  - name: my_openai
    type: openai
    url: "https://api.openai.com/v1"
    credentials_path: openai_api_key.txt
    models:
      - name: gpt-4-1106-preview
      - name: gpt-4o-mini
    tlsSecurityProfile:
      type: Custom
      ciphers:
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
      minTLSVersion: VersionTLS13
[!NOTE]
The `tlsSecurityProfile` is fully optional. When it is not specified, the LLM call won't be affected by specific SSL/TLS settings.
The service uses a so-called system prompt to put the question into context before it is sent to the selected LLM. The default system prompt is fine-tuned for questions about OpenShift and Kubernetes. It is possible to use a different system prompt via the configuration option `system_prompt_path` in the `ols_config` section. That option must contain the path to a text file with the actual system prompt (it can contain multiple lines). An example of such configuration:

ols_config:
  system_prompt_path: "system_prompts/system_prompt_for_product_XYZZY"
Additionally, an optional string parameter `system_prompt` can be specified in the `/v1/query` endpoint to override the configured system prompt. This override mechanism can be used only when the `dev_config.enable_system_prompt_override` configuration option is set to `true` in the service configuration file. Please note that the default value for this option is `false`, so by default the system prompt cannot be changed: when `dev_config.enable_system_prompt_override` is `false` and `/v1/query` is invoked with the `system_prompt` parameter, the value of that parameter is ignored.
OLS service can be started locally. In this case GradIO web UI is used to interact with the service. Alternatively the service can be accessed through REST API.
[!TIP]
To enable GradIO web UI you need to have the following `dev_config` section in your configuration file:

dev_config:
  enable_dev_ui: true
  ...
  ...
  ...
If a Python virtual environment is already set up, the service can be started with the following command:
make run
It is also possible to initialize the virtual environment and start the service with a single command:
pdm start
There is an all-in-one image that has the document store included already.
- Follow the steps above to create your config yaml and your API key file(s).
- Place your config yaml and your API key file(s) in a known location (e.g. `/path/to/config`).
- Make sure your config yaml references the config folder for the path to your key file(s) (e.g. `credentials_path: config/openai_api_key.txt`).
- Run the all-in-one container. Example invocation:

      podman run -it --rm -v /path/to/config:/app-root/config:Z \
        -e OLS_CONFIG_FILE=/app-root/config/olsconfig.yaml -p 8080:8080 \
        quay.io/openshift-lightspeed/lightspeed-service-api:latest
In the `examples` folder is a set of YAML manifests, `openshift-lightspeed.yaml`. This includes all the resources necessary to get OpenShift Lightspeed running in a cluster. It is configured expecting to only use OpenAI as the inference endpoint, but you can easily modify these manifests, looking at the `olsconfig.yaml` to see how to alter it to work with BAM as the provider.
There is a commented-out OpenShift Route with TLS Edge termination available if you wish to use it.
To deploy, assuming you already have an OpenShift environment to target and that you are logged in with sufficient permissions:
- Make the change to your API keys and/or provider configuration in the manifest file
- Create a namespace/project to hold OLS
oc apply -f examples/openshift-lightspeed-tls.yaml -n created-namespace
Once deployed, it is probably easiest to `oc port-forward` into the pod where OLS is running so that you can access it from your local machine.
To send a request to the server you can use the following curl command:
curl -X 'POST' 'http://127.0.0.1:8080/v1/query' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"query": "write a deployment yaml for the mongodb image"}'
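The same request can be prepared from Python using only the standard library. The sketch below mirrors the curl command above (same URL, headers, and JSON body); the helper name `build_query_request` is hypothetical, and nothing is assumed here about the shape of the response.

```python
import json
import urllib.request


def build_query_request(
    query: str, base_url: str = "http://127.0.0.1:8080"
) -> urllib.request.Request:
    """Build the POST request for the /v1/query endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/query",
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
```

With the service running locally, the request is sent with `urllib.request.urlopen(build_query_request("write a deployment yaml for the mongodb image"))`, and the JSON reply can be read from the returned response object.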
The Swagger UI web page is available at the standard `/docs` endpoint. If the service is running on localhost on port 8080, Swagger UI can be accessed at http://localhost:8080/docs.
The OpenAPI schema is available in docs/openapi.json. It is possible to re-generate the schema document by using:
make schema
When the OLS service is started, the OpenAPI schema is available on the `/openapi.json` endpoint. For example, for a service running on localhost on port 8080, it can be accessed and pretty-printed with the following command:
curl 'http://127.0.0.1:8080/openapi.json' | jq .
The service exposes metrics in Prometheus format on the `/metrics` endpoint. Scraping them is straightforward:
curl 'http://127.0.0.1:8080/metrics'
There is a minimal Gradio UI you can use when running the OLS server locally. To use it, enable the UI in the `olsconfig.yaml` file:

dev_config:
  enable_dev_ui: true
Then start the OLS server per Run the server and then browse to the built-in Gradio interface at http://localhost:8080/ui
By default this interface will ask the OLS server to retain and use your conversation history for subsequent interactions. To disable this behavior, expand the `Additional Inputs` configuration at the bottom of the page and uncheck the `Use history` checkbox. When not using history, each message you submit to OLS will be treated independently with no context of previous interactions.
OLS API documentation is available at http://localhost:8080/docs
To enable CPU profiling, please deploy your own pyroscope server and specify its URL in the `dev_config` as shown below. OLS will then send profiles to the specified endpoint.

dev_config:
  pyroscope_url: https://your-pyroscope-url.com
To enable memory profiling, simply start the server with the below command.
make memray-run
Once you have executed a few queries and want to look at the memory flamegraphs, run the command below; it generates an HTML file.
make memray-flamegraph
A Helm chart is available for installing the service in OpenShift.
Before installing the chart, you must configure the `auth.key` parameter in the Values file.
To install the chart with the release name `ols-release` in the namespace `openshift-lightspeed`:
helm upgrade --install ols-release helm/ --create-namespace --namespace openshift-lightspeed
The command deploys the service in the default configuration.
The default configuration contains OLS fronting with a kube-rbac-proxy.
To uninstall/delete the chart with the release name `ols-release`:
helm delete ols-release --namespace openshift-lightspeed
Chart customization is available using the Values file.
- REST API handlers
- Configuration loader
- LLM providers registry
- LLM loader
- Interface to LLM providers
- Doc retriever from vector storage
- Question validator
- Docs summarizer
- Conversation cache
- (Local) Web-based user interface
Overall architecture with all main parts is displayed below:
OpenShift LightSpeed service is based on the FastAPI framework (Uvicorn) with Langchain for LLM interactions. The service is split into several parts described below.
Handles REST API requests from clients (mainly from UI console, but can be any REST API-compatible tool), handles requests queue, and also exports Prometheus metrics. The Uvicorn framework is used as a FastAPI implementation.
Manages authentication flow for REST API endpoints. Currently K8S/OCP-based authorization is used, but in the future it will be implemented in a more modular way to allow registering other authentication checkers.
Retrieves user queries, validates them, redacts them, calls LLM, and summarizes feedback.
Redacts the question based on the regex filters provided in the configuration file.
Validates questions and provides one-word responses. It is an optional component.
Summarizes documentation context.
Unified interface used to store and retrieve conversation history with optionally defined maximum length.
Currently there exist three conversation history cache implementations:
- in-memory cache
- Redis cache
- Postgres cache
Entries stored in cache have compound keys that consist of `user_id` and `conversation_id`. It is possible for one user to have multiple conversations and thus multiple `conversation_id` values at the same time. Global cache capacity can be specified. The capacity is measured as the number of entries; entry sizes are ignored in this computation.
In-memory cache is implemented as a queue with a defined maximum capacity specified as the number of entries that can be stored in a cache. That number is the limit for all cache entries, it doesn't matter how many users are using the LLM. When the new entry is put into the cache and if the maximum capacity is reached, the oldest entry is removed from the cache.
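A capacity-bounded cache of this kind can be sketched with an ordered dictionary keyed by `(user_id, conversation_id)`. This is an illustration of the eviction idea only, not the service's cache class; the name `InMemoryConversationCache` is hypothetical, and the choice to treat an updated conversation as the newest entry is an assumption.

```python
from collections import OrderedDict


class InMemoryConversationCache:
    """Bounded cache keyed by (user_id, conversation_id); evicts the oldest entry."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def put(self, user_id: str, conversation_id: str, history: list) -> None:
        key = (user_id, conversation_id)
        if key in self._entries:
            # Assumption: updating a conversation makes it the newest entry.
            self._entries.move_to_end(key)
        elif len(self._entries) >= self._max_entries:
            # The capacity is global, so the oldest entry of any user is removed.
            self._entries.popitem(last=False)
        self._entries[key] = history

    def get(self, user_id: str, conversation_id: str):
        """Return the stored history, or None when the entry is not cached."""
        return self._entries.get((user_id, conversation_id))
```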
Entries are stored in Redis as a dictionary. LRU policy can be specified that allows Redis to automatically remove the oldest entries.
Entries are stored in one Postgres table with the following schema:
Column | Type | Nullable | Default | Storage |
-----------------+-----------------------------+----------+---------+----------+
user_id | text | not null | | extended |
conversation_id | text | not null | | extended |
value | bytea | | | extended |
updated_at | timestamp without time zone | | | plain |
Indexes:
"cache_pkey" PRIMARY KEY, btree (user_id, conversation_id)
"cache_key_key" UNIQUE CONSTRAINT, btree (key)
"timestamps" btree (updated_at)
Access method: heap
During a new record insertion the maximum number of entries is checked and when the defined capacity is reached, the oldest entry is deleted.
Manages LLM providers implementations. If a new LLM provider type needs to be added, it is registered by this machinery and its libraries are loaded to be used later.
Currently there exist the following LLM providers implementations:
- OpenAI
- Azure OpenAI
- RHEL AI
- OpenShift AI
- WatsonX
- BAM
- Fake provider (to be used by tests and benchmarks)
Sequence of operations performed when user asks a question:
The context window size is limited for all supported LLMs, which means that a token truncation algorithm needs to be applied to longer queries, queries with long conversation history, etc. Current truncation logic/context window token check:
- Tokens for the current prompt (system instruction + user query + attachment, if any) plus tokens reserved for the response (default 512) must not exceed the model context window size; otherwise OLS raises an error.
- The tokens above are always consumed. If any tokens are left after that, the RAG context is used either completely or truncated, depending on how many tokens are left.
- Finally, if tokens are still available after using the complete RAG context, the history is used (or truncated).
- The service sets a flag to True if the history is truncated due to the token limit.
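The order in which the token budget is spent can be sketched as follows. This is a simplified model of the steps listed above that works on token counts only; the names `TokenBudget` and `allocate_tokens` are hypothetical, and real truncation operates on tokenized text rather than on plain numbers.

```python
from dataclasses import dataclass

# Default number of tokens reserved for the response, as stated above.
DEFAULT_RESPONSE_TOKENS = 512


@dataclass(frozen=True)
class TokenBudget:
    rag_tokens: int          # tokens of RAG context that fit
    history_tokens: int      # tokens of conversation history that fit
    history_truncated: bool  # True when some history had to be dropped


def allocate_tokens(
    context_window: int,
    prompt_tokens: int,
    rag_tokens: int,
    history_tokens: int,
    response_tokens: int = DEFAULT_RESPONSE_TOKENS,
) -> TokenBudget:
    """Split the context window between prompt, response, RAG context and history."""
    reserved = prompt_tokens + response_tokens
    if reserved > context_window:
        raise ValueError("prompt and reserved response tokens exceed the context window")
    available = context_window - reserved
    # RAG context is served first, history gets whatever is left afterwards.
    rag_used = min(rag_tokens, available)
    available -= rag_used
    history_used = min(history_tokens, available)
    return TokenBudget(rag_used, history_used, history_used < history_tokens)
```

For example, with a 4096-token window, a 1000-token prompt, 2000 tokens of RAG context and 1000 tokens of history, the RAG context fits completely and the history is cut to the remaining 584 tokens.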
╭───────────────────────────────────┬──────┬────────────────────────────────────────────────╮
│ Name │ Type │ Description │
├───────────────────────────────────┼──────┼────────────────────────────────────────────────┤
│ benchmarks │ cmd │ pdm run make benchmarks │
│ check-types │ cmd │ pdm run make check-types │
│ coverage-report │ cmd │ pdm run make coverage-report │
│ generate-schema │ cmd │ pdm run make schema │
│ integration-tests-coverage-report │ cmd │ pdm run make integration-tests-coverage-report │
│ requirements │ cmd │ pdm run make requirements.txt │
│ security-check │ cmd │ pdm run make security-check │
│ start │ cmd │ pdm run make run │
│ test │ cmd │ pdm run make test │
│ test-e2e │ cmd │ pdm run make test-e2e │
│ test-integration │ cmd │ pdm run make test-integration │
│ test-unit │ cmd │ pdm run make test-unit │
│ unit-tests-coverage-report │ cmd │ pdm run make unit-tests-coverage-report │
│ version │ cmd │ pdm run make print-version │
╰───────────────────────────────────┴──────┴────────────────────────────────────────────────╯
This script re-generates the OpenAPI schema for the Lightspeed Service REST API.
scripts/generate_openapi_schema.py
pdm generate-schema
Generate list of packages to be prefetched in Cachi2 and used in Konflux for hermetic build.
This script performs several steps:
- removes torch+cpu dependency from project file
- generates requirements.txt file from pyproject.toml + pdm.lock
- removes all torch dependencies (including CUDA/Nvidia packages)
- downloads torch+cpu wheel
- computes hashes for this wheel
- adds the URL to wheel + hash to resulting requirements.txt file
- downloads script `pip_find_builddeps` from the Cachito project
- generates requirements-build.in file
- compiles requirements-build.in file into requirements-build.txt file
Please note that this script depends on a tool that is downloaded from the repository containing the Cachito system. This tool is run locally without any additional security checks, so some care is needed (for example, run this script from within a containerized environment).
scripts/generate_packages_to_prefetch.py
usage: generate_packages_to_prefetch.py [-h] [-p]
options:
-h, --help show this help message and exit
-p, --process-special-packages
Enable or disable processing special packages like torch etc.
-c, --cleanup Enable or disable work directory cleanup
-w WORK_DIRECTORY, --work-directory WORK_DIRECTORY
Work directory to store files generated during different stages
of processing
When the SQLAlchemy package is not locked to the latest version in `pyproject.toml` and `pdm.lock`, this script will fail due to an issue in `pip`. To fix this issue, follow these steps:
- Look at https://pypi.org/project/SQLAlchemy/ to retrieve the latest SQLAlchemy version
- Update the `pyproject.toml` file accordingly using `SQLAlchemy=={latest_version}`
- Run `pdm update sqlalchemy`
A dictionary containing the credentials of the S3 bucket must be specified, containing the keys:
AWS_BUCKET
AWS_REGION
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
- See contributors guide.
- See the open issues for a full list of proposed features (and known issues).
Published under the Apache 2.0 License