spike: IngestorCluster queue config via ConfigMap + init container (zero-restart first-boot) #1733
vivekr-splunk wants to merge 12 commits into develop
…om IngestorClusterStatus

CredentialSecretVersion and ServiceAccount status fields are no longer needed. The new event-driven ConfigMap approach (CSPL-4556) uses controller watches (Queue CR, ObjectStorage CR, Secret) to trigger reconcile and splctrl.ApplyConfigMap as the idempotency gate. The ingestorQueueConfigRev pod annotation replaces the credential version change signal.

- Remove CredentialSecretVersion and ServiceAccount from IngestorClusterStatus
- Remove scale-up credential reset in ApplyIngestorCluster
- Remove PhaseReady credential/serviceAccount comparison and REST restart block
- Fix namespaceScopedSecret → _ (no longer passed to pod manager)
- Remove CredentialSecretVersion from test fixture
- Regenerate CRD YAML via make manifests

CSPL-4556
…orQueueConfigMapName

Add five package-level constants and one exported naming function to support the ConfigMap-based ingestor queue configuration delivery:

- ingestorQueueConfigAppName: Splunk app directory "100-sok-ingestorcluster"
- ingestorQueueConfigTemplateStr: ConfigMap name pattern "splunk-<name>-ingestor-queue-config"
- ingestorQueueConfigRevAnnotation: pod annotation key "ingestorQueueConfigRev"
- ingestorQueueConfigMountPath: "/mnt/splunk-queue-config"
- commandForIngestorQueueConfig: init container bash command (mkdir + ln -sfn + cp local.meta)
- GetIngestorQueueConfigMapName(crName): formats ConfigMap name

CSPL-4554
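A minimal sketch of the naming function described in this commit, assuming the template is applied with `fmt.Sprintf`. The constant values come from the commit message; the `%s` placeholder form and the exact declarations in names.go are assumptions.

```go
package main

import "fmt"

// Values taken from the commit message. The "%s" placeholder form of the
// template is an assumption; the commit only shows the "<name>" pattern.
const (
	ingestorQueueConfigAppName       = "100-sok-ingestorcluster"
	ingestorQueueConfigTemplateStr   = "splunk-%s-ingestor-queue-config"
	ingestorQueueConfigRevAnnotation = "ingestorQueueConfigRev"
	ingestorQueueConfigMountPath     = "/mnt/splunk-queue-config"
)

// GetIngestorQueueConfigMapName formats the ConfigMap name for the
// IngestorCluster CR called crName.
func GetIngestorQueueConfigMapName(crName string) string {
	return fmt.Sprintf(ingestorQueueConfigTemplateStr, crName)
}

func main() {
	fmt.Println(GetIngestorQueueConfigMapName("example"))
}
```

Deriving the name in one function keeps the reconciler and the StatefulSet builder from drifting apart on the naming pattern.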
…ecksum for queue config

Add five pure functions to build the ConfigMap content for the ingestor queue configuration app. These functions are side-effect-free and called by buildAndApplyIngestorQueueConfigMap in the next commit.

- computeIngestorConfChecksum: SHA-256 of outputsConf+defaultModeConf stored in local.meta [app/install] so Splunk detects content changes
- generateIngestorOutputsConf: builds outputs.conf reusing getQueueAndObjectStorageInputsForIngestorConfFiles; embeds credentials when non-empty (same pattern as GetSmartstoreVolumesConfig)
- generateIngestorDefaultModeConf: builds default-mode.conf reusing getPipelineInputsForConfFile(false)
- generateIngestorAppConf: builds app.conf with state=enabled
- generateIngestorLocalMeta: builds local.meta with system access + install_source_checksum stanza

CSPL-4554
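As an illustration of the checksum helper named here, the sketch below hashes the concatenation of the two conf bodies with SHA-256 and hex-encodes the result. The signature is an assumption; only the inputs and the hash are stated in the commit message. A later commit in this PR removes the checksum again.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// computeIngestorConfChecksum returns the hex-encoded SHA-256 of
// outputsConf followed by defaultModeConf. Signature is assumed.
func computeIngestorConfChecksum(outputsConf, defaultModeConf string) string {
	sum := sha256.Sum256([]byte(outputsConf + defaultModeConf))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Placeholder conf bodies, used only to show the call shape.
	fmt.Println(computeIngestorConfChecksum("[stanza]\n", "[pipeline]\n"))
}
```

The result is always 64 hex characters, is deterministic for identical input, and changes when either conf body changes, which matches the properties the unit test in this PR checks.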
Build and idempotently apply the ingestor queue config ConfigMap on every reconcile. splctrl.ApplyConfigMap is the idempotency gate: it skips the write and returns (false, nil) when content is unchanged, so no extra work is done on no-op reconciles triggered by the controller watches.

- Takes resolved QueueOSConfig (Queue, OS, AccessKey, SecretKey)
- Builds all four ConfigMap keys using the INI builder functions
- Sets owner reference so ConfigMap is GC'd with the IngestorCluster CR
- Returns (changed bool, error) matching splctrl.ApplyConfigMap signature

CSPL-4554
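The idempotency gate can be pictured with a small in-memory model. This is a sketch only: `store` and `applyConfigMap` are hypothetical stand-ins for the Kubernetes API and for splctrl.ApplyConfigMap, whose real implementation is not shown in this PR page.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// store is a hypothetical in-memory stand-in for the Kubernetes API.
type store map[string]map[string]string

// applyConfigMap writes data under name only when it differs from what is
// already stored, and reports whether a write happened.
func applyConfigMap(s store, name string, data map[string]string) (bool, error) {
	if name == "" {
		return false, errors.New("configmap name must not be empty")
	}
	if existing, ok := s[name]; ok && reflect.DeepEqual(existing, data) {
		return false, nil // unchanged content: skip the write
	}
	// Copy so later mutation of the caller's map does not alter the store.
	copied := make(map[string]string, len(data))
	for k, v := range data {
		copied[k] = v
	}
	s[name] = copied
	return true, nil
}

func main() {
	s := store{}
	data := map[string]string{"outputs.conf": "[example]\n"}
	first, _ := applyConfigMap(s, "splunk-example-ingestor-queue-config", data)
	second, _ := applyConfigMap(s, "splunk-example-ingestor-queue-config", data)
	fmt.Println(first, second)
}
```

Because an unchanged apply reports `false`, the reconciler can call it on every pass without generating writes or restarts.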
…orStatefulSet

Add the init container that pre-delivers queue config files before Splunk starts, enabling zero-restart first-boot configuration (CSPL-4553).

setupIngestorInitContainer:
- Adds ConfigMap volume (mnt-splunk-queue-config) to the pod spec
- Appends init-ingestor-queue-config init container with the same security context as the existing setupInitContainer (runAsUser=41812, drop ALL, SeccompProfileTypeRuntimeDefault)
- Mounts /opt/splk/etc (etc PVC or ephemeral) + ConfigMap at mount path
- Runs commandForIngestorQueueConfig: mkdir + ln -sfn (conf files) + cp local.meta (cp not ln -sfn: Splunk replaces metadata symlinks with regular files)
- Sets ingestorQueueConfigRev pod annotation to ConfigMap ResourceVersion so the Restart EPIC detects content changes and triggers a rolling restart

getIngestorStatefulSet updated to call setupIngestorInitContainer after setupAppsStagingVolume.

CSPL-4553
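The annotation step can be sketched as follows. The helper `setQueueConfigRev` is hypothetical; only the annotation key and the idea of storing the ConfigMap ResourceVersion come from the commit message.

```go
package main

import "fmt"

// Annotation key from the commit message.
const ingestorQueueConfigRevAnnotation = "ingestorQueueConfigRev"

// setQueueConfigRev (hypothetical helper) records the ConfigMap
// ResourceVersion in the pod template annotations and reports whether the
// stored value changed.
func setQueueConfigRev(annotations map[string]string, resourceVersion string) (map[string]string, bool) {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if annotations[ingestorQueueConfigRevAnnotation] == resourceVersion {
		return annotations, false
	}
	annotations[ingestorQueueConfigRevAnnotation] = resourceVersion
	return annotations, true
}

func main() {
	ann, changed := setQueueConfigRev(nil, "1001")
	fmt.Println(ann[ingestorQueueConfigRevAnnotation], changed)
	_, changed = setQueueConfigRev(ann, "1001")
	fmt.Println(changed)
}
```

A changed annotation alters the pod template, which is what lets the StatefulSet roll pods when the ConfigMap content changes.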
…figMap, remove REST path

Replace the PhaseReady credential-comparison/REST-update/RestartSplunk flow with unconditional ConfigMap delivery on every reconcile. Controller watches (Queue CR, ObjectStorage CR, Secret) and splctrl.ApplyConfigMap's idempotency gate together ensure correctness without polling logic.

ApplyIngestorCluster:
- Add ResolveQueueAndObjectStorage + buildAndApplyIngestorQueueConfigMap calls before getIngestorStatefulSet (runs on every reconcile)
- Remove scale-up reset block that set CredentialSecretVersion="0"
- PhaseReady block no longer has secretChanged/serviceAccountChanged comparison or REST update/restart path

Removed entirely:
- ingestorClusterPodManager struct
- newIngestorClusterPodManager var
- getClient method
- updateIngestorConfFiles method
- getQueueAndPipelineInputsForIngestorConfFiles wrapper

Removed imports: go-logr/logr, splunk/client (splclient) (splutil, splctrl, splcommon still used by remaining code)

ingestorcluster_test.go: remove REST-path mock block from TestApplyIngestorCluster; remove TestGetQueueAndPipeline* and TestUpdateIngestorConfFiles functions; remove addRemoteQueueHandlers and newTestIngestorQueuePipelineManager helpers; remove now-unused imports (fmt, net/http, strings, logr, splclient, splcommon).

CSPL-4556 CSPL-4553
… update fixtures

Replace REST-path mock-based tests with ConfigMap delivery tests.

ingestorcluster_test.go:
- TestApplyIngestorCluster: remove REST mock setup (newIngestorClusterPodManager override, addRemoteQueueHandlersForIngestor, HTTP handler setup); the second reconcile is now a plain ApplyIngestorCluster call, so no HTTP mocks are needed
- TestGetIngestorStatefulSet: switch from configTester to configTester2 (no space-stripping) because the bash init container command contains meaningful spaces; configTester's strings.ReplaceAll was mangling the command string
- Add TestComputeIngestorConfChecksum: 64-char hex, deterministic, content-sensitive
- Add TestGenerateIngestorOutputsConf: IRSA (no creds), static creds embedded
- Add TestGenerateIngestorDefaultModeConf: all 6 pipeline stanzas present
- Add TestGenerateIngestorAppConf: required stanzas present
- Add TestGenerateIngestorLocalMeta: checksum and export stanzas present
- Remove imports: fmt, net/http, strings, logr, splclient, splcommon (splutil, spltest, assert, appsv1/corev1/metav1 still used)

Fixture files (all 4 statefulset_ingestor*.json):
- Regenerated with json.Marshal format (& as \u0026) for exact configTester2 match
- Add initContainers: init-ingestor-queue-config with bash command, security context matching setupInitContainer pattern, two volume mounts
- Add volume: mnt-splunk-queue-config ConfigMap volume

go test ./pkg/splunk/enterprise/... passes with no regressions.

CSPL-4555 CSPL-4556
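The two fixture details mentioned here can be reproduced with the standard library. The sketch below shows that Go's `json.Marshal` escapes `&` as `\u0026` by default, and that stripping spaces destroys a shell command. `marshalCommand` and `stripSpaces` are hypothetical helpers; `stripSpaces` only approximates what the commit says configTester does.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// marshalCommand returns the JSON encoding of cmd. encoding/json escapes
// '&', '<' and '>' by default, so "&&" becomes "\u0026\u0026".
func marshalCommand(cmd string) string {
	b, err := json.Marshal(cmd)
	if err != nil {
		return ""
	}
	return string(b)
}

// stripSpaces (hypothetical) removes every space, which is harmless for
// compact JSON comparison but breaks a shell command embedded in it.
func stripSpaces(s string) string {
	return strings.ReplaceAll(s, " ", "")
}

func main() {
	cmd := "mkdir -p /tmp/app && ln -sfn /src /dst"
	fmt.Println(marshalCommand(cmd))
	fmt.Println(stripSpaces(cmd))
}
```

This is why the fixtures need the `\u0026` form for a byte-exact comparison, and why a comparison helper that preserves spaces is required once a bash command is part of the StatefulSet spec.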
Replace pod-by-pod REST calls with ClusterManager bundle push pattern, mirroring the existing SmartStore indexes.conf delivery mechanism.

Changes:
- pkg/splunk/common/paths.go: add path constants for outputs.conf, inputs.conf, and default-mode.conf under manager-apps and mnt
- pkg/splunk/enterprise/names.go: add commandForCMSmartstoreAndQueue and setSymbolicLinkCmanagerWithQueue constants for extended symlink setup when queue config is present
- pkg/splunk/enterprise/util.go: preserve queue config keys in ApplySmartstoreConfigMap so CM reconcile does not overwrite them; extend resetSymbolicLinks to use queue-aware command when present
- pkg/splunk/enterprise/indexercluster.go: add applyIdxcQueueConfigToCM which writes outputs.conf/inputs.conf/default-mode.conf into the CM's smartstore ConfigMap and sets BundlePushTracker.NeedToPushManagerApps; add generateIdxcOutputsConf/InputsConf/DefaultModeConf INI builders; remove old secretChanged/updateIndexerConfFiles/RestartSplunk block
- pkg/splunk/enterprise/clustermanager.go: use extended init container command when ConfigMap has queue config keys
- pkg/splunk/enterprise/configuration.go: include queue config keys in smartstore volume Items when present
- pkg/splunk/enterprise/ingestorcluster.go: fix defaultMode:420 on ConfigMap volume and remove invalid install_source_checksum from local.meta
- test fixtures and test assertions updated accordingly
…er bundle push

Add a separate dedicated init container (init-cm-queue-config) on the ClusterManager pod to symlink outputs.conf, inputs.conf, default-mode.conf from the new standalone CM queue config ConfigMap (splunk-<cm>-clustermanager-queue-config) into manager-apps/splunk-operator/local/.

This init container is separate from the existing smartstore 'init' container: it only runs when the queue config ConfigMap exists, does not touch any smartstore volume or Items list, and avoids the Init:0/1 failure that occurred when mixing queue config into the smartstore ConfigMap.

Changes:
- names.go: add cmQueueConfigMapTemplateStr constant, GetCMQueueConfigMapName() function
- clustermanager.go: add setupCMQueueConfigInitContainer(), call from getClusterManagerStatefulSet
- indexercluster.go: applyIdxcQueueConfigToCM now writes to GetCMQueueConfigMapName() instead of smartstore ConfigMap; fully decoupled, owned by ClusterManager CR
- util.go: resetSymbolicLinks removes stale setSymbolicLinkCmanagerWithQueue reference; adds queue config symlink reset after smartstore symlinks when CM queue config CM exists
- clustermanager_test.go: update Get call counts (+1 per reconcile for queue config CM lookup)
…orcluster

Extract applyQueueConfigMap, buildQueueConfStanza, generateQueueConfigAppConf, and generateQueueConfigLocalMeta as package-level shared helpers so both IngestorCluster and ClusterManager (IndexerCluster bundle push) use the same implementation without duplication.

Changes:
- ingestorcluster.go: extract applyQueueConfigMap (idempotent CM apply with owner ref), buildQueueConfStanza (INI stanza builder), generateQueueConfigAppConf (parameterized label), generateQueueConfigLocalMeta (shared local.meta)
- buildAndApplyIngestorQueueConfigMap: simplified to call shared helpers
- generateIngestorOutputsConf: simplified to call buildQueueConfStanza
- computeIngestorConfChecksum and generateIngestorLocalMeta(checksum) removed (local.meta no longer embeds a checksum, which is simpler and avoids the invalid field)
- ingestorcluster_test.go: update TestGenerateIngestorAppConf and TestGenerateIngestorLocalMeta to call shared function names
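To make the idea of a shared INI stanza builder concrete, here is a rough sketch. The real signature of buildQueueConfStanza is not shown in this PR page, so `buildStanza`, the `kv` type, and the rule of skipping empty values are all assumptions; the stanza and setting names in the example are placeholders, not real Splunk settings.

```go
package main

import (
	"fmt"
	"strings"
)

// kv is a hypothetical ordered key/value pair for one conf setting.
type kv struct {
	Key   string
	Value string
}

// buildStanza (hypothetical) renders one INI stanza. Settings with an empty
// value are skipped, which is one way to omit credentials when none are set.
func buildStanza(name string, settings []kv) string {
	var b strings.Builder
	fmt.Fprintf(&b, "[%s]\n", name)
	for _, s := range settings {
		if s.Value == "" {
			continue
		}
		fmt.Fprintf(&b, "%s = %s\n", s.Key, s.Value)
	}
	return b.String()
}

func main() {
	// Placeholder stanza and keys, for illustration only.
	fmt.Print(buildStanza("example_stanza", []kv{
		{Key: "example.endpoint", Value: "https://queue.example.invalid"},
		{Key: "example.access_key", Value: ""}, // empty: omitted from output
	}))
}
```

Using an ordered slice rather than a map keeps the output stable across reconciles, which matters when the rendered text feeds an idempotent ConfigMap comparison.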
…vent reconcile loop

Omitting DefaultMode on ConfigMapVolumeSource caused MergePodUpdates to see a volume diff on every reconcile (k8s stores defaultMode:420 but revised spec had nil), triggering an infinite StatefulSet update and CM pod recycle loop.
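The loop described here comes down to comparing a nil pointer with a pointer to 420. The sketch below uses a simplified, hypothetical `configMapVolume` struct instead of the real corev1.ConfigMapVolumeSource so that it runs with the standard library alone.

```go
package main

import (
	"fmt"
	"reflect"
)

// configMapVolume is a simplified, hypothetical stand-in for
// corev1.ConfigMapVolumeSource.
type configMapVolume struct {
	Name        string
	DefaultMode *int32
}

// withDefaultMode fills in 420 (0644 octal) when DefaultMode is unset,
// mirroring the value the API server stores.
func withDefaultMode(v configMapVolume) configMapVolume {
	if v.DefaultMode == nil {
		mode := int32(420)
		v.DefaultMode = &mode
	}
	return v
}

// volumesDiffer reports whether the desired and stored specs differ.
func volumesDiffer(desired, stored configMapVolume) bool {
	return !reflect.DeepEqual(desired, stored)
}

func main() {
	stored := withDefaultMode(configMapVolume{Name: "mnt-splunk-queue-config"})
	desired := configMapVolume{Name: "mnt-splunk-queue-config"}

	fmt.Println(volumesDiffer(desired, stored))                  // nil vs 420
	fmt.Println(volumesDiffer(withDefaultMode(desired), stored)) // 420 vs 420
}
```

Setting the mode explicitly in the desired spec makes it equal to what is stored, so the diff disappears and the update loop stops.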
…th C4 architecture diagram

- Rewrote doc with what-are-ingestors concept intro, full CR reference tables, and a step-by-step critical user journey (IAM -> Queue -> ObjectStorage -> IngestorCluster -> IndexerCluster -> verify data flow)
- Added Day-2 operations (scaling, credential rotation, pause, deletion order) and troubleshooting section
- Added C4 Container diagram rendered as PNG with Google-style muted Material Design colour palette
959608b to 996fbf0
Pull request overview
This spike replaces REST-based queue configuration delivery for IngestorCluster with a Kubernetes-native ConfigMap + init container approach, enabling zero-restart first-boot for Splunk pods. The same ConfigMap-based delivery is also implemented for IndexerCluster queue configuration via ClusterManager bundle push.
Changes:
- IngestorCluster queue config delivered via ConfigMap + init container symlinks, eliminating first-boot REST restarts
- IndexerCluster queue config delivered via ClusterManager bundle push using a dedicated ConfigMap and init container
- Removed `CredentialSecretVersion` and `ServiceAccount` status fields from IngestorClusterStatus API
- Event-driven reconciliation using ConfigMap ResourceVersion annotations to trigger rolling restarts on credential/topology changes
- Support for credential-free IRSA mode (empty AWS credentials allowed when using pod identity)
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `api/v4/ingestorcluster_types.go` | Removed obsolete CredentialSecretVersion and ServiceAccount status fields |
| `config/crd/bases/enterprise.splunk.com_ingestorclusters.yaml` | Updated CRD to remove deleted status fields |
| `pkg/splunk/enterprise/names.go` | Added queue config naming constants and init container commands |
| `pkg/splunk/enterprise/ingestorcluster.go` | Complete rewrite: ConfigMap-based delivery with init containers, removed REST configuration path |
| `pkg/splunk/enterprise/ingestorcluster_test.go` | Rewrote tests for ConfigMap approach, removed REST-based test code |
| `pkg/splunk/enterprise/indexercluster.go` | Added CM bundle push-based queue config delivery |
| `pkg/splunk/enterprise/clustermanager.go` | Added init container for CM queue config symlinks |
| `pkg/splunk/enterprise/configuration.go` | Updated smartstore ConfigMap volume to conditionally include queue config files |
| `pkg/splunk/enterprise/util.go` | Modified credential validation to allow empty credentials for IRSA, preserved queue config in CM ConfigMap |
| `pkg/splunk/common/paths.go` | Added path constants for queue config files |
| `test/testenv/*.go` | Added IMAGE_PULL_SECRET support and retry logic for deployment updates |
| `test/index_and_ingestion_separation/index_and_ingestion_separation_test.go` | Updated integration test to verify ConfigMap and init container instead of status fields |
| `pkg/splunk/enterprise/testdata/fixtures/*.json` | Updated test fixtures to reflect new StatefulSet structure with init containers |
| `docs/IndexIngestionSeparation.md` | Complete documentation rewrite as critical user journey with architecture diagram |
    commandForCMQueueConfig = "mkdir -p " + splcommon.OperatorClusterManagerAppsLocal +
        " && ln -sfn " + cmQueueConfigMountPath + "/outputs.conf " + splcommon.OperatorClusterManagerAppsLocalOutputsConf +
        " && ln -sfn " + cmQueueConfigMountPath + "/inputs.conf " + splcommon.OperatorClusterManagerAppsLocalInputsConf +
        " && ln -sfn " + cmQueueConfigMountPath + "/default-mode.conf " + splcommon.OperatorClusterManagerAppsLocalDefaultModeConf

    // setSymbolicLinkCmanagerQueueConfig resets queue config symlinks on the CM pod after a bundle push.
    setSymbolicLinkCmanagerQueueConfig = "ln -sfn " + cmQueueConfigMountPath + "/outputs.conf " + splcommon.OperatorClusterManagerAppsLocalOutputsConf +
Similar to the IngestorCluster init container command, this command lacks error handling. The command creates directories and symlinks but has no validation that the operations succeed. Consider adding set -e at the beginning of the command string to ensure the init container fails fast if any operation fails, rather than proceeding with incomplete configuration.
This is especially important for ClusterManager since a misconfigured bundle push could affect all indexer peers.
Suggested change:

    - commandForCMQueueConfig = "mkdir -p " + splcommon.OperatorClusterManagerAppsLocal +
    -     " && ln -sfn " + cmQueueConfigMountPath + "/outputs.conf " + splcommon.OperatorClusterManagerAppsLocalOutputsConf +
    -     " && ln -sfn " + cmQueueConfigMountPath + "/inputs.conf " + splcommon.OperatorClusterManagerAppsLocalInputsConf +
    -     " && ln -sfn " + cmQueueConfigMountPath + "/default-mode.conf " + splcommon.OperatorClusterManagerAppsLocalDefaultModeConf
    - // setSymbolicLinkCmanagerQueueConfig resets queue config symlinks on the CM pod after a bundle push.
    - setSymbolicLinkCmanagerQueueConfig = "ln -sfn " + cmQueueConfigMountPath + "/outputs.conf " + splcommon.OperatorClusterManagerAppsLocalOutputsConf +
    + commandForCMQueueConfig = "set -e; mkdir -p " + splcommon.OperatorClusterManagerAppsLocal +
    +     " && ln -sfn " + cmQueueConfigMountPath + "/outputs.conf " + splcommon.OperatorClusterManagerAppsLocalOutputsConf +
    +     " && ln -sfn " + cmQueueConfigMountPath + "/inputs.conf " + splcommon.OperatorClusterManagerAppsLocalInputsConf +
    +     " && ln -sfn " + cmQueueConfigMountPath + "/default-mode.conf " + splcommon.OperatorClusterManagerAppsLocalDefaultModeConf
    + // setSymbolicLinkCmanagerQueueConfig resets queue config symlinks on the CM pod after a bundle push.
    + setSymbolicLinkCmanagerQueueConfig = "set -e; ln -sfn " + cmQueueConfigMountPath + "/outputs.conf " + splcommon.OperatorClusterManagerAppsLocalOutputsConf +
    // Only add the init container if the queue config ConfigMap exists.
    _, err := splctrl.GetConfigMap(ctx, client, types.NamespacedName{Name: configMapName, Namespace: cr.GetNamespace()})
    if err != nil {
        // ConfigMap doesn't exist yet - no queue config configured for this CM.
The ClusterManager init container setup function silently returns when the ConfigMap doesn't exist. While this is intentional (the ConfigMap is only created when IndexerCluster has queueRef set), the silent return makes debugging difficult. Consider logging at Info level when the ConfigMap is not found to make it clear that queue config init is being skipped.
This would help operators understand why init containers may or may not be present on ClusterManager pods.
Suggested change:

    - // ConfigMap doesn't exist yet - no queue config configured for this CM.
    + logger := log.FromContext(ctx)
    + if k8serrors.IsNotFound(err) {
    +     // ConfigMap doesn't exist yet - no queue config configured for this CM.
    +     logger.Info("Queue config ConfigMap not found; skipping queue config init container",
    +         "configMap", configMapName,
    +         "namespace", cr.GetNamespace(),
    +         "clusterManager", cr.GetName())
    + } else {
    +     // Unexpected error retrieving ConfigMap - skip init container but log the error for debugging.
    +     logger.Error(err, "Failed to get queue config ConfigMap; skipping queue config init container",
    +         "configMap", configMapName,
    +         "namespace", cr.GetNamespace(),
    +         "clusterManager", cr.GetName())
    + }
    commandForIngestorQueueConfig = "mkdir -p /opt/splk/etc/apps/" + ingestorQueueConfigAppName + "/local && " +
        "mkdir -p /opt/splk/etc/apps/" + ingestorQueueConfigAppName + "/metadata && " +
        "ln -sfn " + ingestorQueueConfigMountPath + "/app.conf /opt/splk/etc/apps/" + ingestorQueueConfigAppName + "/local/app.conf && " +
        "ln -sfn " + ingestorQueueConfigMountPath + "/outputs.conf /opt/splk/etc/apps/" + ingestorQueueConfigAppName + "/local/outputs.conf && " +
        "ln -sfn " + ingestorQueueConfigMountPath + "/default-mode.conf /opt/splk/etc/apps/" + ingestorQueueConfigAppName + "/local/default-mode.conf && " +
        "cp " + ingestorQueueConfigMountPath + "/local.meta /opt/splk/etc/apps/" + ingestorQueueConfigAppName + "/metadata/local.meta"
The init container command uses string concatenation to build shell commands, which can be fragile. The command creates directories and symlinks but doesn't include any error handling or verification. Consider adding shell command error checking (e.g., using set -e or explicit || error traps) to ensure that if any step fails, the init container fails rather than silently proceeding with an incomplete configuration.
For example, if the ConfigMap mount is not available or the directory creation fails, the symlinks would be created pointing to non-existent targets, and Splunk would start without queue configuration—leading to hard-to-debug runtime failures.
why make it less readable?
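For readers following this thread, one possible way to keep the command readable is to build it from a list of steps joined with `&&`. This is a sketch, not the PR's code: `buildIngestorQueueConfigCommand` is a hypothetical helper, and the constant values are the ones given in the commit messages. In a POSIX shell, a chain joined with `&&` stops at the first command that fails and the chain exits non-zero.

```go
package main

import (
	"fmt"
	"strings"
)

// Constant values as given in the commit messages of this PR.
const (
	ingestorQueueConfigAppName   = "100-sok-ingestorcluster"
	ingestorQueueConfigMountPath = "/mnt/splunk-queue-config"
	appDir                       = "/opt/splk/etc/apps/" + ingestorQueueConfigAppName
)

// buildIngestorQueueConfigCommand (hypothetical) assembles the init
// container command from individual steps joined with " && ".
func buildIngestorQueueConfigCommand() string {
	steps := []string{
		"mkdir -p " + appDir + "/local",
		"mkdir -p " + appDir + "/metadata",
	}
	// Conf files are symlinked from the ConfigMap mount.
	for _, f := range []string{"app.conf", "outputs.conf", "default-mode.conf"} {
		steps = append(steps, "ln -sfn "+ingestorQueueConfigMountPath+"/"+f+" "+appDir+"/local/"+f)
	}
	// local.meta is copied rather than symlinked.
	steps = append(steps, "cp "+ingestorQueueConfigMountPath+"/local.meta "+appDir+"/metadata/local.meta")
	return strings.Join(steps, " && ")
}

func main() {
	fmt.Println(buildIngestorQueueConfigCommand())
}
```

The resulting string is the same as the concatenated constant quoted in this thread, so the choice between the two forms is about readability rather than behaviour.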
    ## Example
    **How SOK configures ingestor pods:**
    SOK builds a Kubernetes ConfigMap named `splunk-<name>-ingestor-queue-config` containing the Splunk app `100-sok-ingestorcluster` with `outputs.conf`, `default-mode.conf`, `app.conf`, and `local.meta`. An init container (`init-ingestor-queue-config`) runs before Splunk starts and symlinks these files from the ConfigMap mount into the Splunk app directory. Splunk boots with the correct queue configuration already in place, so no restart is required.
I really don't like configuration by convention (having configmap named in X format). Can we reduce convention-based naming with simple configuration?
For me it looks like each of those topics should be a separate PR
Summary
This spike replaces the REST-based ingestor configuration path with a Kubernetes-native ConfigMap + init container approach. Splunk boots with the correct queue configuration already in place, so no in-place REST restart is needed on first provisioning or scale-up.

What changed
- ConfigMap `splunk-<name>-ingestor-queue-config` containing a Splunk app (`100-sok-ingestorcluster`) with `outputs.conf`, `default-mode.conf`, `app.conf`, and `local.meta`
- `init-ingestor-queue-config` init container symlinks the ConfigMap contents into the Splunk app directory before Splunk starts: zero restarts on first boot
- `ingestorQueueConfigRev` pod annotation tracks ConfigMap `ResourceVersion`; credential or topology changes trigger a rolling restart automatically
- Added `DeletionTimestamp` check before `ApplySplunkConfig` in `ApplyIngestorCluster` to prevent `manual-app-update` ConfigMap creation in a terminating namespace
- Removed `CredentialSecretVersion` assertions from integration Test 3 (field removed from `IngestorClusterStatus`)
- Rewrote `IndexIngestionSeparation.md` as a critical user journey with a C4 Container architecture diagram

Commits
- feat(api): remove stale `CredentialSecretVersion` and `ServiceAccount` from `IngestorClusterStatus`
- feat(names): add ingestor queue config naming constants and `GetIngestorQueueConfigMapName`
- feat(enterprise): add INI builder functions and `computeIngestorConfChecksum`
- feat(enterprise): add `buildAndApplyIngestorQueueConfigMap`
- feat(enterprise): add `setupIngestorInitContainer` and update `getIngestorStatefulSet`
- refactor(enterprise): rewrite `ApplyIngestorCluster`: event-driven ConfigMap, remove REST path
- test(enterprise): rewrite ingestor unit tests for ConfigMap delivery, update fixtures
- feat(enterprise): deliver IndexerCluster queue config via CM bundle push
- feat(enterprise): add CM queue config init container for IndexerCluster bundle push
- refactor(enterprise): extract shared queue config helpers from ingestorcluster
- fix(enterprise): set `defaultMode=420` on CM queue config volume to prevent reconcile loop
- docs: rewrite `IndexIngestionSeparation.md` as critical user journey with C4 architecture diagram

Test plan
- `index_and_ingestion_separation` integration tests pass (3/3, ~885s parallel run)
- `go test ./pkg/splunk/enterprise/... -count=1`: all unit tests pass
- `go build ./...`: clean build