
Helm chart with redis fails due to 'failed to set key to value' #336

Closed
mrsheepuk opened this issue Jun 22, 2020 · 9 comments · Fixed by #337

@mrsheepuk (Contributor)

Describe the bug
Installing cloudinfo using the 0.7.1 helm chart with the default cloudinfo image version (0.9.15) fails to produce a working environment when redis is enabled in the chart: the scraper cannot write to it, and logs an error like the one below for every value it tries to write:

{"application":"cloudinfo","cistore":"redis","environment":"production","key":"/banzaicloud.com/cloudinfo/providers/google/regions/asia-northeast3/prices/n1-highcpu-2","level":"error","msg":"failed to set key to value","time":"2020-06-22T09:21:22Z","value":{"onDemandPrice":0.0909891416015625,"spotPrice":{}}}

Steps to reproduce the issue:
helm install cloudinftest banzaicloud-stable/cloudinfo -f ./cloudinfo-deploy.yml

... with the following values in cloudinfo-deploy.yml (including correct and valid google credentials):

providers:
  google:
    enabled: true
    credentials: **redacted**
store:
  redis:
    enabled: true

redis:
  enabled: true

  # TODO: support redis password
  usePassword: false
  cluster:
    enabled: false

Expected behavior
The scraper works as expected and persists values into redis for the application.

Screenshots
n/a

Additional context
Tested only with Google Cloud enabled at this time.

@mrsheepuk (Contributor, Author) commented Jun 22, 2020

I've tested this by additionally specifying the image tag 0.12.1:

image:
  tag: 0.12.1

... and it appears to work as expected, so it looks like this is specific to the default version in the current helm chart.

@mrsheepuk (Contributor, Author)

Sorry, the above comment isn't correct - I believed it was working with 0.12.1, but now I'm seeing the same behaviour, so it seems to be transient / timing-related on startup.

Scaling the scraper deployment down to 0 and back to 1 after deploying seems to fix it consistently, so it might be related to #184.
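
For illustration, that workaround is just two kubectl commands - the deployment name here is an assumption derived from the release name in the repro above, so check kubectl get deployments for the real one:

kubectl scale deployment cloudinftest-cloudinfo-scraper --replicas=0
kubectl scale deployment cloudinftest-cloudinfo-scraper --replicas=1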

@sagikazarmark (Member)

@mrsheepuk thanks for reporting.

It indeed looks like a timing issue, but I would assume the scraper component recovers once it's able to connect to redis. Isn't that the case? Or does the helm install process fail entirely?

@mrsheepuk (Contributor, Author)

Thanks for responding @sagikazarmark - the helm install completes successfully and the pods are all up and running as expected; it's just the logged errors that make it look like it's not working.

I'll double-check whether it effectively recovers or not - I was basing this on what was logged, but it may be that it doesn't log anything when it works, so I assumed it wasn't working when in fact it had recovered correctly. I'll report back shortly after confirming this.

@sagikazarmark (Member)

Well, I'm not completely sure, TBH. The scraper scrapes the providers at a predefined interval (24 hours by default), so chances are that if the first scrape fails but the pod doesn't exit, it simply waits for the next cycle, which is not ideal.

Let me know if it recovers.

@mrsheepuk (Contributor, Author)

> The scraper will scrape the providers at a predefined interval (by default 24 hours), so chances are if the first scrape fails, but the pod doesn't exit, then it probably waits for the next cycle, which is not ideal.

That's what I assumed was happening. However, now that I'm trying to reproduce the problem, it is of course working perfectly every time on 0.12.1, with no errors logged.

Testing again with the default version (0.9.15), however, I can see that it doesn't recover cleanly - everything is 'up and running', but no data is persisted or available through the front end, and the log shows no attempt to re-scrape or re-persist anything. I imagine it would recover after 24 hours, as you suggest.

I'll keep trying to reproduce this on 0.12.1 - it certainly seems to happen much less often on that version.

@sagikazarmark (Member)

There are no changes between 0.9.x and 0.12.x that would "fix" the recovery, so if it works, it's probably down to timing (i.e. it takes time to download the new image).

I think the scraper should be resilient enough to retry scraping if the result cannot be saved, or hold it in memory until it can.
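
For illustration, a retry along those lines could look something like the sketch below - this is not cloudinfo code, and the function name and parameters are assumptions:

package main

import (
	"errors"
	"log"
	"time"
)

// retryWithBackoff re-runs op with exponential backoff until it succeeds or
// the attempts are exhausted; the scraper could wrap its "persist result"
// step in something like this instead of logging the failure and moving on.
func retryWithBackoff(op func() error, maxAttempts int, base time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(base * time.Duration(1<<attempt)) // 1x, 2x, 4x, ... the base delay
	}
	return err
}

func main() {
	// stand-in for a SET that fails while redis is still starting up
	err := retryWithBackoff(func() error {
		return errors.New("redis not ready")
	}, 5, 100*time.Millisecond)
	if err != nil {
		log.Printf("giving up after retries: %v", err) // surface the failure instead of ignoring it
	}
}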

@mrsheepuk (Contributor, Author)

This is the first time I've looked at the cloudinfo code, so I might have something wrong, but it looks to me like any failure writing into redis is logged but essentially ignored. For example (I've added a few comments inline):

func (rps *redisProductStore) set(key string, value interface{}) (interface{}, bool) {
	// snipped code 
	if _, err = conn.Do("SET", key, mJson); err != nil {
		rps.log.Error("failed to set key to value", map[string]interface{}{"key": key, "value": value})
		// this return is not checked in (e.g.) StorePrice, so nothing 
		// other than this log message will know this has failed
		return nil, false
	}

	return mJson, true
}

And the way this is being called:

func (rps *redisProductStore) StorePrice(provider, region, instanceType string, val types.Price) {
	// not checking return value here, so callers of StorePrice won't know this
	// has failed:
	rps.set(rps.getKey(cloudinfo.PriceKeyTemplate, provider, region, instanceType), val)
}

The 'sledgehammer' approach to this specific problem would be something that repeatedly pings a test value into redis on startup, waiting for it to become available/healthy before completing startup - but that wouldn't help if there were a transient problem with the redis instance at some later point in time. A more nuanced approach would require changing all of these StoreXxx interfaces to return success/failure, so that the call sites could retry as needed. A sketch of the first option follows.
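
A minimal, self-contained sketch of what that could look like, assuming the redigo client that the store code above appears to use - the function name and wiring are hypothetical, not actual cloudinfo API:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

// waitForRedis blocks until redis answers PING, retrying with a fixed delay,
// and gives up after maxAttempts so a misconfigured or unreachable store
// fails startup loudly instead of being logged and ignored.
func waitForRedis(pool *redis.Pool, maxAttempts int, delay time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		conn := pool.Get()
		_, lastErr = conn.Do("PING")
		conn.Close()
		if lastErr == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("redis not available after %d attempts: %v", maxAttempts, lastErr)
}

func main() {
	pool := &redis.Pool{Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", "localhost:6379") // address is an assumption
	}}
	if err := waitForRedis(pool, 30, 2*time.Second); err != nil {
		log.Fatal(err) // exit instead of starting the scraper against a dead store
	}
	// ... continue normal startup ...
}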

If I get time later, I'll see if I can work out a PR to attempt to address one or other of these options.

@mrsheepuk (Contributor, Author)

@sagikazarmark I've raised a PR to simply fail on startup if the product store is not available; let me know what you think.

mrsheepuk added a commit to mrsheepuk/cloudinfo that referenced this issue Jul 6, 2020