
Helm chart with redis fails due to 'failed to set key to value' #336

Closed
mrsheepuk opened this issue Jun 22, 2020 · 9 comments · Fixed by #337

@mrsheepuk (Contributor)

Describe the bug
Installing cloudinfo using the 0.7.1 helm chart with the default cloudinfo image version (0.9.15) fails to produce a working environment when redis is enabled in the chart: the scraper cannot write to it, and logs an error like the one below for every value it tries to write:

{"application":"cloudinfo","cistore":"redis","environment":"production","key":"/banzaicloud.com/cloudinfo/providers/google/regions/asia-northeast3/prices/n1-highcpu-2","level":"error","msg":"failed to set key to value","time":"2020-06-22T09:21:22Z","value":{"onDemandPrice":0.0909891416015625,"spotPrice":{}}}

Steps to reproduce the issue:
helm install cloudinftest banzaicloud-stable/cloudinfo -f ./cloudinfo-deploy.yml

... with the following values in cloudinfo-deploy.yml (including correct and valid google credentials):

providers:
  google:
    enabled: true
    credentials: **redacted**
store:
  redis:
    enabled: true

redis:
  enabled: true

  # TODO: support redis password
  usePassword: false
  cluster:
    enabled: false

Expected behavior
The scraper works as expected and persists values into redis for the application.

Screenshots
n/a

Additional context
Tested only with Google Cloud enabled at this time.

@mrsheepuk (Contributor, Author) commented Jun 22, 2020

I've tested this by additionally specifying the image tag 0.12.1:

image:
  tag: 0.12.1

... and it appears to work as expected, so it looks like this is specific to the default version in the current helm chart.

@mrsheepuk (Contributor, Author)

Sorry, the above comment isn't correct - I believed it was working with 0.12.1, but now I'm seeing the same behaviour, so it seems to be transient / timing-related on startup.

Scaling the scraper deployment down to 0 and back to 1 after deploying seems to fix it consistently, so it might be related to #184.
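
For illustration, that workaround is just two kubectl commands - the deployment name here is an assumption derived from the release name in the repro above, so check kubectl get deployments for the real one:

kubectl scale deployment cloudinftest-cloudinfo-scraper --replicas=0
kubectl scale deployment cloudinftest-cloudinfo-scraper --replicas=1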

@sagikazarmark (Member)

@mrsheepuk thanks for reporting.

It indeed looks like a timing issue, but I would assume the scraper component recovers once it's able to connect to redis. Isn't that the case? Or does the helm install process fail entirely?

@mrsheepuk (Contributor, Author)

Thanks for responding @sagikazarmark - the helm install completes successfully and the pods are all up and running as expected; it's just the logged errors that make it look like it's not working.

I'll double-check whether it effectively recovers or not - I was basing this on what was logged, but it may be that it doesn't log anything when it works, so I assumed it wasn't working when in fact it had recovered correctly. I'll report back shortly after confirming this.

@sagikazarmark (Member)

Well, I'm not completely sure, TBH. The scraper scrapes the providers at a predefined interval (24 hours by default), so chances are that if the first scrape fails but the pod doesn't exit, it simply waits for the next cycle, which is not ideal.

Let me know if it recovers.

@mrsheepuk (Contributor, Author)

> The scraper will scrape the providers at a predefined interval (by default 24 hours), so chances are if the first scrape fails, but the pod doesn't exit, then it probably waits for the next cycle, which is not ideal.

That's what I assumed was happening. However, now that I'm trying to reproduce the problem, it is of course working perfectly every time on 0.12.1, with no errors logged.

Testing again with the default version (0.9.15), however, I can see that it doesn't recover cleanly - everything is 'up and running', but no data is persisted or available through the front end, and the log shows no attempt to re-scrape or re-persist anything. I imagine it would recover after 24 hours, as you suggest.

I'll keep trying to reproduce this on 0.12.1 - it certainly seems to happen much less often on that version.

@sagikazarmark (Member)

There are no changes between 0.9.x and 0.12.x that would "fix" the recovery, so if it works, it's probably down to timing (i.e. it takes time to download the new image).

I think the scraper should be resilient enough to retry scraping if the result cannot be saved, or hold it in memory until it can.
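
For illustration, a retry along those lines could look something like the sketch below - this is not cloudinfo code, and the function name and parameters are assumptions:

package main

import (
	"errors"
	"log"
	"time"
)

// retryWithBackoff re-runs op with exponential backoff until it succeeds or
// the attempts are exhausted; the scraper could wrap its "persist result"
// step in something like this instead of logging the failure and moving on.
func retryWithBackoff(op func() error, maxAttempts int, base time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(base * time.Duration(1<<attempt)) // 1x, 2x, 4x, ... the base delay
	}
	return err
}

func main() {
	// stand-in for a SET that fails while redis is still starting up
	err := retryWithBackoff(func() error {
		return errors.New("redis not ready")
	}, 5, 100*time.Millisecond)
	if err != nil {
		log.Printf("giving up after retries: %v", err) // surface the failure instead of ignoring it
	}
}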

@mrsheepuk (Contributor, Author)

This is the first time I've looked at the cloudinfo code, so I might have something wrong, but it looks to me like any failure writing into redis is logged but essentially ignored. For example (I've added a few comments inline):

func (rps *redisProductStore) set(key string, value interface{}) (interface{}, bool) {
	// snipped code 
	if _, err = conn.Do("SET", key, mJson); err != nil {
		rps.log.Error("failed to set key to value", map[string]interface{}{"key": key, "value": value})
		// this return is not checked in (e.g.) StorePrice, so nothing 
		// other than this log message will know this has failed
		return nil, false
	}

	return mJson, true
}

And the way this is being called:

func (rps *redisProductStore) StorePrice(provider, region, instanceType string, val types.Price) {
	// not checking return value here, so callers of StorePrice won't know this
	// has failed:
	rps.set(rps.getKey(cloudinfo.PriceKeyTemplate, provider, region, instanceType), val)
}

The 'sledgehammer' approach to this specific problem would be something that repeatedly pings a test value into redis on startup, waiting for it to become available/healthy before completing startup - but that wouldn't help if there were a transient problem with the redis instance at some later point in time. A more nuanced approach would require changing all of these StoreXxx interfaces to return success/failure, so that the call sites could retry as needed. A sketch of the first option follows.
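
A minimal, self-contained sketch of what that could look like, assuming the redigo client that the store code above appears to use - the function name and wiring are hypothetical, not actual cloudinfo API:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

// waitForRedis blocks until redis answers PING, retrying with a fixed delay,
// and gives up after maxAttempts so a misconfigured or unreachable store
// fails startup loudly instead of being logged and ignored.
func waitForRedis(pool *redis.Pool, maxAttempts int, delay time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		conn := pool.Get()
		_, lastErr = conn.Do("PING")
		conn.Close()
		if lastErr == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("redis not available after %d attempts: %v", maxAttempts, lastErr)
}

func main() {
	pool := &redis.Pool{Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", "localhost:6379") // address is an assumption
	}}
	if err := waitForRedis(pool, 30, 2*time.Second); err != nil {
		log.Fatal(err) // exit instead of starting the scraper against a dead store
	}
	// ... continue normal startup ...
}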

If I get time later, I'll see if I can work out a PR to attempt to address one or other of these options.

@mrsheepuk (Contributor, Author)

@sagikazarmark I've raised a PR to simply fail on startup if the product store is not available; let me know what you think.

mrsheepuk added a commit to mrsheepuk/cloudinfo that referenced this issue Jul 6, 2020