Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Some cloud resource state info not getting saved if exited the process due to OS_SIGNALS or network failures #257

Closed
Horiodino opened this issue Dec 21, 2023 · 7 comments
Labels
help wanted Extra attention is needed kind/bug Something isn't working needs-triage priority/high highest priority

Comments

@Horiodino
Copy link
Contributor

Horiodino commented Dec 21, 2023

Describe 🐞

we have ksctl dashbaord and cli , for that currently static db #226 , for that what we can do is db for dashboard and we use local storage for the cli .
for now we can use contexts, dashboard and cli in the core and based upon that we can use it in savestatehelper funtion , we use context as const but need to check condition regarding context

You can check the code below for better context

func (obj *AzureProvider) CreateNetworkInterface(ctx context.Context, storage resources.StorageFactory,
	nicName string, subnetID string, publicIPID string, networkSecurityGroupID string, index int, role consts.KsctlRole) error {

	// ...

	pollerResponse, err := obj.client.BeginCreateNIC(nicName, parameters, nil)
	if err != nil {
		return err
	}
	done := make(chan struct{})
	var errCreate error
	go func() {
		defer close(done)
		obj.mxState.Lock()
		defer obj.mxState.Unlock()

		switch role {
		case consts.RoleWp:
			azureCloudState.InfoWorkerPlanes.NetworkInterfaceNames[index] = nicName
		case consts.RoleCp:
			azureCloudState.InfoControlPlanes.NetworkInterfaceNames[index] = nicName

		case consts.RoleLb:
			azureCloudState.InfoLoadBalancer.NetworkInterfaceName = nicName
		case consts.RoleDs:
			azureCloudState.InfoDatabase.NetworkInterfaceNames[index] = nicName
		}
		if err := saveStateHelper(storage); err != nil {
			errCreate = err
			return
		}
	}()
	<-done
	if errCreate != nil {
		return errCreate
	}
	log.Print("Creating the network interface...", "name", nicName)

	var errCreatenic error //just to make sure its nil
	donePoll := make(chan struct{})
	go func() {
		defer close(donePoll)
		resp, err := obj.client.PollUntilDoneCreateNetInterface(ctx, pollerResponse, nil)
		if err != nil {
			errCreatenic = err
			return
		}

		obj.mxState.Lock()
		defer obj.mxState.Unlock()

		switch role {
		case consts.RoleWp:
			azureCloudState.InfoWorkerPlanes.NetworkInterfaceIDs[index] = *resp.ID
			azureCloudState.InfoWorkerPlanes.PrivateIPs[index] = *resp.Properties.IPConfigurations[0].Properties.PrivateIPAddress
		case consts.RoleCp:
			azureCloudState.InfoControlPlanes.NetworkInterfaceIDs[index] = *resp.ID
			azureCloudState.InfoControlPlanes.PrivateIPs[index] = *resp.Properties.IPConfigurations[0].Properties.PrivateIPAddress

		case consts.RoleLb:
			azureCloudState.InfoLoadBalancer.NetworkInterfaceID = *resp.ID
			azureCloudState.InfoLoadBalancer.PrivateIP = *resp.Properties.IPConfigurations[0].Properties.PrivateIPAddress
		case consts.RoleDs:
			azureCloudState.InfoDatabase.NetworkInterfaceIDs[index] = *resp.ID
			azureCloudState.InfoDatabase.PrivateIPs[index] = *resp.Properties.IPConfigurations[0].Properties.PrivateIPAddress
		}

		if err := saveStateHelper(storage); err != nil {
			errCreatenic = err
			return
		}
	}()
	<-donePoll
	if errCreatenic != nil {
		return errCreatenic
	}
	log.Debug("Printing", "azureCloudState", azureCloudState)

	log.Success("Created network interface", "name", nicName)
	return nil
}

Here you can see that if there is program termination just after resp got the result from azure API but just before it got assigned to the azureCloudState. so there is a potential data loss.

Provider in effect

  • Azure
  • Civo
  • Local

Reproduce 💻 ➡️ 💻

It's hard to get the timing correct but there is fairly good chances that it will happen some point in time:
to reproduce just exit the program after all Nic are made, after that wait 2 or 3 seconds and exit.

Possible Fix & Expected behavior🔧

we could have a global waitgroup which acts as a protection that before the program terminates the waitgroup.Wait will happen and the left-out process happens

So when there is a signal like SIGTERM
the select {} gets executes <-ctx.Done()
and inside that there is a func call cleanup()
which in tern does the wg.Wait()

Screenshots 🖼️

Operating System

  • all

ksctl version

current and all previous versions

Additional context

Add any other context about the problem here.
Check Contribution's guidelines

@Horiodino Horiodino added kind/bug Something isn't working help wanted Extra attention is needed priority/high highest priority needs-triage labels Dec 21, 2023
@dipankardas011
Copy link
Member

We need to increase the knowledge base

@dipankardas011
Copy link
Member

@Horiodino Can you included from where you find the bug?

@dipankardas011
Copy link
Member

May be a global sync.WaitGroup can help with this problem along with context.Context to know when the signal of stop is triggered and in the Cleanup() we can wait till the waitgroup has reached to 0
cc @prateek041 , @Horiodino

@Horiodino
Copy link
Contributor Author

or what we can do is we can just wait just for saving the data, then we return exit-code! but it does not make any point if it returns after saving or after running the entire function

@Horiodino
Copy link
Contributor Author

@Horiodino Can you included from where you find the bug?

I was just testing Aws-Ha and for some reason my system crashed and after that I see that the Api request has been successfully created resource, but the data is not saved as it takes some few seconds to allocate resources and to save that info we need to wait for some time.

@dipankardas011
Copy link
Member

yes i am planning for sync.Waitgrpup to tell us then somethingis not yet completed
so we can include further Syscall in the context for now I am thinking about SIGTERM and SIGINT signals
later we can add more
@Horiodino

@dipankardas011 dipankardas011 self-assigned this Dec 29, 2023
@dipankardas011 dipankardas011 changed the title bug for the data inconsistencies bug: for the data inconsistencies Jan 5, 2024
@dipankardas011 dipankardas011 removed their assignment Jan 5, 2024
@dipankardas011 dipankardas011 changed the title bug: for the data inconsistencies bug: Some cloud resource state info not getting saved if exited the process due to OS_SIGNALS or network failures Feb 3, 2024
@dipankardas011
Copy link
Member

dipankardas011 commented Feb 4, 2024

we should use the Process Context be listening for this
https://pkg.go.dev/os#Interrupt

may be this helps

// Run starts the ksctl server.
func Run(ctx context.Context, wg *sync.WaitGroup) {
	defer wg.Done()

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)

	// Start the server.
	go func() {
		if err := startServer(ctx); err != nil {
			slog.Info("Failed to start server: %v", err)
		}
	}()

	// Wait for a signal to terminate.
	select {
	case <-sig:
		slog.Info("Received termination signal. Shutting down...")
	case <-ctx.Done():
		slog.Info("Context done. Shutting down...")
	}

	// Wait for all goroutines to finish.

	// Close the server.
	if err := closeServer(); err != nil {
		slog.Info("Failed to close server: %v", err)
	}
}

@dipankardas011 dipankardas011 closed this as not planned Won't fix, can't repro, duplicate, stale Jun 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed kind/bug Something isn't working needs-triage priority/high highest priority
Projects
None yet
Development

No branches or pull requests

2 participants