Skip to content

Appframeworks s1 tests stability fix#1712

Closed
gabrielm-splunk wants to merge 4 commits intodevelopfrom
appframeworksS1-tests-stability-fix
Closed

Appframeworks s1 tests stability fix#1712
gabrielm-splunk wants to merge 4 commits intodevelopfrom
appframeworksS1-tests-stability-fix

Conversation

@gabrielm-splunk
Copy link
Collaborator

@gabrielm-splunk gabrielm-splunk commented Feb 19, 2026

Description

Small fix to appframeworksS1 tests as the standalone specs that reference the MC cause some issues (as identified by claude). Will post claude explanation of bug and fix in comments

Key Changes

Just adding Replicas: 1 to standalone specs with MC ref

Testing and Verification

Ran tests locally

Related Issues

Stemmed from 10.2 certification: https://splunk.atlassian.net/browse/CSPL-4531

PR Checklist

  • Code changes adhere to the project's coding standards.
  • Relevant unit and integration tests are included.
  • Documentation has been updated accordingly.
  • All tests pass locally.
  • The PR description follows the project's guidelines.

@gabrielm-splunk
Copy link
Collaborator Author

** Claude analysis **

Bug Explanation: MonitoringConsole Restart Loop Due to Empty SPLUNK_STANDALONE_URL

The Problem

When a Standalone CR with a MonitoringConsoleRef was deployed, the MonitoringConsole pod would enter a restart loop, timing out on its startup probe after ~6.7 minutes and continuously restarting.

Root Cause

The bug stems from a race condition in the operator's reconciliation logic combined with Go's zero-value semantics:

  1. Go Zero Values: In Go, when you create a struct without specifying a field, numeric types default to 0. So when the test created a StandaloneSpec without setting Replicas, it defaulted to 0:

spec := enterpriseApi.StandaloneSpec{
CommonSplunkSpec: enterpriseApi.CommonSplunkSpec{...},
AppFrameworkConfig: appFrameworkSpec,
// Replicas is not set, so it defaults to int32(0)
}
2. ConfigMap Creation Order: During reconciliation, the operator creates the MonitoringConsole's ConfigMap before applying the default replica count. The flow is:
- pkg/splunk/enterprise/standalone.go:228 calls ApplyMonitoringConsoleEnvConfigMap()
- This happens before line 291-292 where the default is applied:
if cr.Spec.Replicas == 0 {
cr.Spec.Replicas = 1
}
3. URL Generation with 0 Replicas: The function GetSplunkStatefulsetUrls() in pkg/splunk/enterprise/names.go:267-272 generates URLs based on replica count:
func GetSplunkStatefulsetUrls(..., replicas int32, ...) string {
urls := make([]string, replicas) // With replicas=0, this creates an empty slice
for i := int32(0); i < replicas; i++ {
urls[i] = GetSplunkStatefulsetURL(...)
}
return strings.Join(urls, ",") // Returns empty string ""
}
4. MonitoringConsole Startup Failure: The MonitoringConsole pod starts with an empty SPLUNK_STANDALONE_URL in its ConfigMap. Inside the container, an Ansible playbook tries to configure peer connections to the standalones listed in this URL. With an
empty/invalid URL list, the playbook hangs indefinitely waiting for non-existent hosts.
5. Startup Probe Timeout: After ~6.7 minutes (400 seconds), the Kubernetes startup probe kills the container, causing a restart. This creates an infinite restart loop.

Why the Fix Works

By explicitly setting Replicas: 1 in the test spec:

spec := enterpriseApi.StandaloneSpec{
CommonSplunkSpec: enterpriseApi.CommonSplunkSpec{...},
Replicas: 1, // Explicitly set
AppFrameworkConfig: appFrameworkSpec,
}

The Replicas field is already 1 when ApplyMonitoringConsoleEnvConfigMap() is called, so GetSplunkStatefulsetUrls() correctly generates SPLUNK_STANDALONE_URL=splunk--standalone-0...svc.cluster.local instead of an empty string.

Potential Operator-Level Fix

The proper fix in the operator code would be to apply default values before creating dependent resources like ConfigMaps. This would involve reordering the logic in pkg/splunk/enterprise/standalone.go to apply defaults before line 228.

@coveralls
Copy link
Collaborator

coveralls commented Feb 19, 2026

Pull Request Test Coverage Report for Build 22240607963

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 829 unchanged lines in 10 files lost coverage.
  • Overall coverage decreased (-2.0%) to 83.919%

Files with Coverage Reduction New Missed Lines %
pkg/splunk/enterprise/types.go 4 94.87%
pkg/splunk/enterprise/monitoringconsole.go 26 79.38%
pkg/splunk/enterprise/clustermanager.go 30 74.54%
pkg/splunk/client/enterprise.go 32 84.98%
pkg/splunk/enterprise/upgrade.go 38 26.97%
pkg/splunk/enterprise/configuration.go 49 94.87%
internal/controller/indexercluster_controller.go 85 39.33%
pkg/splunk/enterprise/afwscheduler.go 125 92.11%
pkg/splunk/enterprise/indexercluster.go 205 72.44%
pkg/splunk/enterprise/util.go 235 87.46%
Totals Coverage Status
Change from base Build 22145448988: -2.0%
Covered Lines: 11820
Relevant Lines: 14085

💛 - Coveralls

@kasiakoziol
Copy link
Collaborator

I think that if this happens as described in the comment, then it would be better to fix it in the root instead of making false positive results. Is someone looking into permanent fix?

@kasiakoziol
Copy link
Collaborator

Also, validateStandaloneSpec is executed earlier than ApplyMonitoringConsoleEnvConfigMap and it sets replicas to 1

func validateStandaloneSpec(ctx context.Context, c splcommon.ControllerClient, cr *enterpriseApi.Standalone) error {
	if cr.Spec.Replicas < 0 {
		return fmt.Errorf("replicas must be >= 0")
	}
	if cr.Spec.Replicas == 0 {
		cr.Spec.Replicas = 1
	}

…t claude was able to identify while running CI for 10.2 certification"

This reverts commit 38298fd.
@gabrielm-splunk
Copy link
Collaborator Author

I think that if this happens as described in the comment, then it would be better to fix it in the root instead of making false positive results. Is someone looking into permanent fix?

@kasiakoziol I have implemented a similar fix in the operator-level code

@gabrielm-splunk
Copy link
Collaborator Author

Also, validateStandaloneSpec is executed earlier than ApplyMonitoringConsoleEnvConfigMap and it sets replicas to 1

func validateStandaloneSpec(ctx context.Context, c splcommon.ControllerClient, cr *enterpriseApi.Standalone) error {
	if cr.Spec.Replicas < 0 {
		return fmt.Errorf("replicas must be >= 0")
	}
	if cr.Spec.Replicas == 0 {
		cr.Spec.Replicas = 1
	}

Yes, however it does seem that there is a related race condition as outlined here by claude:

The existing code in validateStandaloneSpec() sets      
  cr.Spec.Replicas = 1 if it's 0, but this only modifies the in-memory object during reconciliation. The CR stored in etcd
   still had Replicas = 0, causing a race condition where the MC ConfigMap could be created/read before the Standalone    
  reconciler ran.

@vivekr-splunk
Copy link
Collaborator

should we not set standalone min replicas to 1 instead of 0. I am unsure if its correct, if i create standalone instance , i always see 1 replica created.

Copy link
Collaborator

@vivekr-splunk vivekr-splunk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure standalone do not create any replicas. i have run this locally many times and I can see it creating single replica by default. look at validate() function in standalone its should have default set to 1 for standalone

@kasiakoziol
Copy link
Collaborator

@patrykw-splunk is already working on adding kubebuilder validations wherever applicable in v4.

Copy link
Collaborator

@patrykw-splunk patrykw-splunk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we don't have a consensus here. I think that from engineering side it would be good @gabrielm-splunk if you could provide a demo/poc regarding this behaviour to confirm it

@github-actions github-actions bot locked and limited conversation to collaborators Feb 24, 2026
@gabrielm-splunk
Copy link
Collaborator Author

Didn't want this to be a point of contention/high-effort fix. This seemed to help the appframeworksS1 test consistently pass, so I included it to ease the release process, but since we've gotten some pushback, will just omit and proceed with the 3.1.0 release process

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants