Collector remains in failed state after disk full #308

jvincze84 · 2018-11-09T14:05:01Z

Hi,

Problem description

If the host runs out of disk space ("no space left on device"), the collector remains in the Failed state even after space has been freed, until it is manually restarted in the Graylog web interface.

Steps to reproduce the problem

Wait for the disk to fill up.
Free up some space.
Check the collector status in the Graylog web interface. It should be in the Failed state.
After restarting the collector, it will be in the Running state.

Environment

Sidecar Version: 0.1.5
Graylog Version: Graylog 2.4.4+4659dbe
Operating System: Red Hat Enterprise Linux Server release 7.1 (Maipo)
Elasticsearch Version: 5.6.5
MongoDB Version: mongodb-linux-x86_64-rhel70-3.6.2

Do you have any suggestions?
How can I configure Sidecar and/or Filebeat not to give up restarting the collector after 3 attempts?

Thank you in advance,
Janos Vincze

mariussturm · 2018-11-09T14:13:24Z

Hi,
thanks for the feedback!
Currently there is no way to tell the Sidecar to keep trying to restart the collector. But we could consider adding this in the next major release.

jvincze84 · 2018-11-09T14:35:58Z

Thank you very much for your super fast reply.

jvincze84 · 2018-11-10T09:02:01Z

Hi,

I wrote a shell script that tries to restart failing collectors through the Graylog API.
It may not be the best solution, but I hope it helps somebody else as well.
Be aware that this script is not fully tested.

The GL_* variables must be set before use.
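Since the script depends on the GL_* variables being filled in, a small guard can catch an empty value before any API call is made. The following is only a minimal sketch: `require_vars` is a hypothetical helper that is not part of the original script, and it assumes bash (it uses indirect expansion).

```shell
#!/usr/bin/env bash
set -o nounset

# Hypothetical helper (not part of the original script): return non-zero and
# print a message when any of the named variables is unset or empty.
require_vars() {
  local name
  for name in "$@"; do
    # ${!name:-} expands the variable whose name is stored in $name,
    # falling back to an empty string so that nounset does not abort.
    if [[ -z "${!name:-}" ]]; then
      echo "Missing required variable: ${name}" >&2
      return 1
    fi
  done
}
```

In the script it could be called right after the GL_* assignments, for example `require_vars GL_USER GL_PASS GL_HOST GL_PORT || exit 1`.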

#!/usr/bin/env bash
set -o errexit
set -o nounset
#set -o xtrace

###
## Redirect ALL output to a FILE
## LOG="[LOG file location]"  
## exec >> $LOG 2>&1  
###

GL_USER='janos.vincze'
GL_PASS=''
GL_HOST=''
GL_PORT='80'


# Print a timestamped log line.
function LOG() {
  echo "[ $(date +%F\ %T) ]  - ${1}"
}


TMPFILE_COLLECTORS=$( mktemp /tmp/gl-tmp.XXXXXXXXX )
TMPFILE_FAILED=$( mktemp /tmp/gl-tmp.XXXXXXXXXX )
i=0

LOG "========================= SCRIPT STARTED ========================="
LOG "Query All Collectors And Status"
# Fetch all collectors and their status from the Graylog REST API.
curl "http://${GL_USER}:${GL_PASS}@${GL_HOST}:${GL_PORT}/api/plugins/org.graylog.plugins.collector/collectors" 2>/dev/null > "${TMPFILE_COLLECTORS}"
LOG "Selecting collectors whose filebeat backend status is non-zero while the collector itself is active"
# Keep only the IDs of active collectors with a non-zero filebeat status.
jq -r '.collectors[] | select(.active == true and .node_details.status.backends.filebeat.status != 0) | .id' "${TMPFILE_COLLECTORS}" > "${TMPFILE_FAILED}"

# Ask Graylog to restart the filebeat backend on every failed collector.
while IFS='' read -r ID || [[ -n "$ID" ]]; do
  # Look up the node name for this collector ID (used for logging only).
  NODE_NAME=$( jq -r --arg id "${ID}" '.collectors[] | select(.id == $id) | .node_id' "${TMPFILE_COLLECTORS}" )
  LOG "Restarting collector sidecar on node: ${NODE_NAME} (ID: ${ID})"
  LOG "###### RESPONSE ######"
  curl -i -X PUT "http://${GL_USER}:${GL_PASS}@${GL_HOST}:${GL_PORT}/api/plugins/org.graylog.plugins.collector/collectors/${ID}/action" -H 'Content-Type: application/json' -d'
[
  {
    "backend": "filebeat",
    "properties": {
      "restart": true
    }
  }
]' 2>/dev/null | while read -r line; do echo "--------------------------> $line"; done
  LOG "######################"

  i=$((i+1))

done < "${TMPFILE_FAILED}"



# Clean up the temporary files and report the result.
rm -f "${TMPFILE_COLLECTORS}" "${TMPFILE_FAILED}"
if [ "$i" -eq 0 ]; then
  LOG "Yuuupi, there are no failing collectors"
else
  LOG "There were $i failed collectors"
fi
LOG "========================= SCRIPT FINISHED ========================="
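To see what the selection step in the script does, the jq filter can be tried on a small hand-written document. This is only a sketch: the sample JSON below is hypothetical, with a field layout inferred from the filter itself rather than taken from a real Graylog response, and `failed_ids` is an illustrative wrapper name. It assumes bash and that `jq` is installed.

```shell
#!/usr/bin/env bash

# Hypothetical sample data; the field layout is inferred from the jq filter
# in the script, not copied from a real Graylog API response.
SAMPLE='{"collectors":[
  {"id":"id-1","node_id":"web-1","active":true,
   "node_details":{"status":{"backends":{"filebeat":{"status":0}}}}},
  {"id":"id-2","node_id":"web-2","active":true,
   "node_details":{"status":{"backends":{"filebeat":{"status":2}}}}},
  {"id":"id-3","node_id":"web-3","active":false,
   "node_details":{"status":{"backends":{"filebeat":{"status":2}}}}}
]}'

# Read collector JSON on stdin and print the IDs of active collectors
# whose filebeat backend reports a non-zero status.
failed_ids() {
  jq -r '.collectors[]
         | select(.active == true
                  and .node_details.status.backends.filebeat.status != 0)
         | .id'
}

failed_ids <<< "${SAMPLE}"
```

With this sample only `id-2` is printed: `id-1` has a zero status and `id-3` is inactive, so neither would be restarted.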

Best Regards,
Janos Vincze

mariussturm added the triaged and feature labels on Nov 9, 2018
