HDDS-11618. Enable HA modes for OM and SCM #10

Open · wants to merge 5 commits into base: main
Changes from 3 commits
48 changes: 48 additions & 0 deletions charts/ozone/templates/_helpers.tpl
@@ -51,6 +51,26 @@ app.kubernetes.io/instance: {{ .Release.Name }}
{{- $pods | join "," }}
{{- end }}

{{/* List of comma separated om ids */}}
{{- define "ozone.om.cluster.ids" -}}
{{- $pods := list }}
{{- $replicas := .Values.om.replicas | int }}
{{- range $i := until $replicas }}
{{- $pods = append $pods (printf "ozone-om-%d" $i) }}
{{- end }}
{{- $pods | join "," }}
{{- end }}

{{/* List of comma separated scm ids */}}
{{- define "ozone.scm.cluster.ids" -}}
{{- $pods := list }}
{{- $replicas := .Values.scm.replicas | int }}
{{- range $i := until $replicas }}
{{- $pods = append $pods (printf "ozone-scm-%d" $i) }}
{{- end }}
{{- $pods | join "," }}
{{- end }}

{{/* Common configuration environment variables */}}
{{- define "ozone.configuration.env" -}}
- name: OZONE-SITE.XML_hdds.datanode.dir
@@ -59,14 +79,42 @@ app.kubernetes.io/instance: {{ .Release.Name }}
value: /data
- name: OZONE-SITE.XML_ozone.metadata.dirs
value: /data/metadata
{{- if gt (int .Values.scm.replicas) 1 }}
- name: OZONE-SITE.XML_ozone.scm.ratis.enable
value: "true"
- name: OZONE-SITE.XML_ozone.scm.service.ids
value: cluster1
- name: OZONE-SITE.XML_ozone.scm.nodes.cluster1
value: {{ include "ozone.scm.cluster.ids" . }}
{{- range $i, $val := until ( .Values.scm.replicas | int ) }}
- name: {{ printf "OZONE-SITE.XML_ozone.scm.address.cluster1.ozone-scm-%d" $i }}
value: {{ printf "%s-scm-%d.%s-scm-headless.%s.svc.cluster.local" $.Release.Name $i $.Release.Name $.Release.Namespace }}
{{- end }}
- name: OZONE-SITE.XML_ozone.scm.primordial.node.id
value: "ozone-scm-0"
{{- else }}
- name: OZONE-SITE.XML_ozone.scm.block.client.address
value: {{ include "ozone.scm.pods" . }}
- name: OZONE-SITE.XML_ozone.scm.client.address
value: {{ include "ozone.scm.pods" . }}
- name: OZONE-SITE.XML_ozone.scm.names
value: {{ include "ozone.scm.pods" . }}
{{- end}}
{{- if gt (int .Values.om.replicas) 1 }}
- name: OZONE-SITE.XML_ozone.om.ratis.enable
value: "true"
- name: OZONE-SITE.XML_ozone.om.service.ids
value: cluster1
- name: OZONE-SITE.XML_ozone.om.nodes.cluster1
value: {{ include "ozone.om.cluster.ids" . }}
{{- range $i, $val := until ( .Values.om.replicas | int ) }}
- name: {{ printf "OZONE-SITE.XML_ozone.om.address.cluster1.ozone-om-%d" $i }}
value: {{ printf "%s-om-%d.%s-om-headless.%s.svc.cluster.local" $.Release.Name $i $.Release.Name $.Release.Namespace }}
{{- end }}
{{- else }}
- name: OZONE-SITE.XML_ozone.om.address
value: {{ include "ozone.om.pods" . }}
{{- end}}
- name: OZONE-SITE.XML_hdds.scm.safemode.min.datanode
value: "3"
- name: OZONE-SITE.XML_ozone.datanode.pipeline.limit
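For illustration, with scm.replicas set to 3, a release named ozone, and the default namespace (values assumed purely for this example), the SCM branch of the configuration block above renders roughly as:

    - name: OZONE-SITE.XML_ozone.scm.ratis.enable
      value: "true"
    - name: OZONE-SITE.XML_ozone.scm.service.ids
      value: cluster1
    - name: OZONE-SITE.XML_ozone.scm.nodes.cluster1
      value: ozone-scm-0,ozone-scm-1,ozone-scm-2
    - name: OZONE-SITE.XML_ozone.scm.address.cluster1.ozone-scm-0
      value: ozone-scm-0.ozone-scm-headless.default.svc.cluster.local
    - name: OZONE-SITE.XML_ozone.scm.address.cluster1.ozone-scm-1
      value: ozone-scm-1.ozone-scm-headless.default.svc.cluster.local
    - name: OZONE-SITE.XML_ozone.scm.address.cluster1.ozone-scm-2
      value: ozone-scm-2.ozone-scm-headless.default.svc.cluster.local
    - name: OZONE-SITE.XML_ozone.scm.primordial.node.id
      value: "ozone-scm-0"

The ozone.om.cluster.ids helper and the OM branch render analogously with om.* keys and ozone-om-* node IDs.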
@@ -28,6 +28,10 @@ spec:
ports:
- name: ui
port: {{ .Values.datanode.service.port }}
- name: ratis-ipc
port: 9858
- name: ipc
port: 9859
Comment on lines +31 to +34
@ptlrs (Nov 7, 2024):
Can the port numbers in all the services and statefulsets be referenced from the values.yaml file? We would like to avoid hardcoding them.

Contributor Author:
Yes, definitely! I just proposed this; it was also a question I had myself.
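A rough sketch of that idea, assuming hypothetical values.yaml keys such as datanode.ports.ratisIpc and datanode.ports.ipc (the key names are illustrative, not part of this PR):

    # values.yaml (hypothetical keys)
    datanode:
      ports:
        ratisIpc: 9858
        ipc: 9859

    # the service/statefulset templates would then reference them instead of literals:
    - name: ratis-ipc
      port: {{ .Values.datanode.ports.ratisIpc }}
    - name: ipc
      port: {{ .Values.datanode.ports.ipc }}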

selector:
{{- include "ozone.selectorLabels" . | nindent 4 }}
app.kubernetes.io/component: datanode
4 changes: 4 additions & 0 deletions charts/ozone/templates/datanode/datanode-statefulset.yaml
@@ -65,6 +65,10 @@ spec:
ports:
- name: ui
containerPort: {{ .Values.datanode.service.port }}
- name: ratis-ipc
containerPort: 9858
- name: ipc
containerPort: 9859
livenessProbe:
httpGet:
path: /
4 changes: 4 additions & 0 deletions charts/ozone/templates/om/om-service-headless.yaml
@@ -28,6 +28,10 @@ spec:
ports:
- name: ui
port: {{ .Values.om.service.port }}
{{- if gt (int .Values.om.replicas) 1 }}
- name: ratis
port: 9872
{{- end }}
Reviewer:
Is this port being exposed for one Ozone Manager (OM) to communicate with another OM when they form a Ratis ring?
My understanding is that the current selector will match this service with all OM pods. Consequently, messages sent to this service will be forwarded to a random OM pod instead of a specific OM pod.

Reviewer:
If the number of OM pods is manually modified with kubectl scale, then this port will perhaps never be exposed. We should consider whether there is a downside to always exposing the port.

Reviewer:
As per the OM documentation:

    This logical name is called serviceId and can be configured in the ozone-site.xml.
    The defined serviceId can be used instead of a single OM host using client interfaces.

Perhaps there should be a different service which maps and groups pods according to the serviceId.

Contributor Author:

    Is this port being exposed for one Ozone Manager (OM) to communicate with another OM when they form a Ratis ring?

Yes, that was my reasoning.

    My understanding is that the current selector will match this service with all OM pods. Consequently, messages sent to this service will be forwarded to a random OM pod instead of a specific OM pod.

OK, if that is the case, maybe we can remove this exposed port. I found and used the following ticket for the port configuration: https://issues.apache.org/jira/browse/HDDS-4677. I'm not really familiar with the architecture, so if we can remove it, I will do so.

Contributor Author:

    If the number of OM pods is manually modified with kubectl scale, then this port will perhaps never be exposed. We should consider whether there is a downside to always exposing the port.

Great point! I hadn't considered that scenario. What could be the downside of exposing the port within an internal Kubernetes network? Perhaps we can use a Helm lookup mechanism to check the current replica count as an alternative, but the simplest approach is to always expose the port.
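A rough sketch of the lookup idea, assuming the OM StatefulSet is named {{ .Release.Name }}-om (an assumption based on the naming used elsewhere in the chart). Note that lookup returns an empty map during helm template/--dry-run and on a fresh install, so a fallback to .Values.om.replicas is still needed:

    {{/* Hypothetical helper: current OM replica count from the live cluster, falling back to values */}}
    {{- define "ozone.om.currentReplicas" -}}
    {{- $sts := lookup "apps/v1" "StatefulSet" .Release.Namespace (printf "%s-om" .Release.Name) }}
    {{- if $sts }}
    {{- $sts.spec.replicas }}
    {{- else }}
    {{- .Values.om.replicas }}
    {{- end }}
    {{- end }}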

selector:
{{- include "ozone.selectorLabels" . | nindent 4 }}
app.kubernetes.io/component: om
5 changes: 5 additions & 0 deletions charts/ozone/templates/om/om-statefulset.yaml
@@ -31,6 +31,7 @@ metadata:
app.kubernetes.io/component: om
spec:
replicas: {{ .Values.om.replicas }}
podManagementPolicy: Parallel
Reviewer:
As per the OM documentation:

    To convert a non-HA OM to HA, or to add new OM nodes to an existing HA OM ring, the new OM node(s) need to be bootstrapped.

Shouldn't there be an init container which calls --bootstrap for OM?

Contributor Author:
Yes, I think you are right.

Contributor Author:
At this point I am stuck on the bootstrap init container. I believe this step is only necessary for new cluster nodes, so the problem reoccurs when we scale (the scaling issue, once again).
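A minimal sketch of what such an init container could look like in the OM StatefulSet, mirroring the SCM bootstrap init container added later in this diff. Whether ozone om --bootstrap can safely run on every pod (re)start, and whether the OM volume and persistence keys mirror the SCM ones, are assumptions that would need to be confirmed:

    initContainers:
      - name: bootstrap
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        # Assumption: bootstrapping is idempotent enough to run on every (re)start.
        args: ["ozone", "om", "--bootstrap"]
        env:
          {{- include "ozone.configuration.env" . | nindent 12 }}
        volumeMounts:
          - name: config
            mountPath: {{ .Values.configuration.dir }}
          # Assumption: the OM data volume and persistence path mirror the SCM ones.
          - name: {{ .Release.Name }}-om
            mountPath: {{ .Values.om.persistence.path }}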

serviceName: {{ .Release.Name }}-om-headless
selector:
matchLabels:
@@ -71,6 +72,10 @@ spec:
containerPort: 9862
- name: ui
containerPort: {{ .Values.om.service.port }}
{{- if gt (int .Values.om.replicas) 1 }}
- name: ratis
containerPort: 9872
{{- end }}
livenessProbe:
httpGet:
path: /
12 changes: 12 additions & 0 deletions charts/ozone/templates/scm/scm-service-headless.yaml
@@ -28,6 +28,18 @@ spec:
ports:
- name: ui
port: {{ .Values.scm.service.port }}
- name: rpc-datanode
port: 9861
- name: block-client
port: 9863
- name: rpc-client
port: 9860
{{- if gt (int .Values.scm.replicas) 1 }}
- name: ratis
port: 9894
- name: grpc
port: 9895
{{- end }}
selector:
{{- include "ozone.selectorLabels" . | nindent 4 }}
app.kubernetes.io/component: scm
27 changes: 27 additions & 0 deletions charts/ozone/templates/scm/scm-statefulset.yaml
@@ -31,6 +31,7 @@ metadata:
app.kubernetes.io/component: scm
spec:
replicas: {{ .Values.scm.replicas }}
podManagementPolicy: Parallel
Reviewer:
Is a Parallel policy disruptive when we use kubectl scale up/down? It would be interesting to see whether the Ratis rings are disrupted. Perhaps a PreStop hook for graceful shutdown of OM/SCM/Datanodes is required, in a new Jira.

Contributor Author:
I see a different problem: if we use kubectl scale, the cluster-ID configuration generated by the helpers is also no longer correct. I have no idea how to solve this at the moment...
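If the PreStop-hook idea is picked up in a follow-up Jira, a minimal sketch could look like the following. The sleep-based delay and its duration are assumptions: the intent is only to hold back SIGTERM briefly so Service endpoints are removed and Ratis traffic can drain; the exact graceful-stop behaviour of OM/SCM would need to be verified:

    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 60  # give the process time to shut down after SIGTERM
          containers:
            - name: scm
              lifecycle:
                preStop:
                  exec:
                    # Brief delay before SIGTERM is sent to the container process.
                    command: ["/bin/sh", "-c", "sleep 10"]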

serviceName: {{ .Release.Name }}-scm-headless
selector:
matchLabels:
@@ -61,6 +62,24 @@ spec:
mountPath: {{ .Values.configuration.dir }}
- name: {{ .Release.Name }}-scm
mountPath: {{ .Values.scm.persistence.path }}
{{- if gt (int .Values.scm.replicas) 1 }}
- name: bootstrap
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
args: ["ozone", "scm", "--bootstrap"]
Reviewer:
From the SCM HA docs:

    The initialization of the first SCM-HA node is the same as a non-HA SCM:
    ozone scm --init
    Second and third nodes should be bootstrapped instead of init:
    ozone scm --bootstrap

Here, we call init and then bootstrap for every SCM pod (re)start. We would instead have to perform pod-ID-specific actions.

It is not clear from the documentation whether init and bootstrap should only be performed once during the lifetime of the pod or upon every pod restart.

Reviewer:
It appears from the documentation that init on pod-1 needs to complete before bootstrap on pod-x. This would perhaps require changing podManagementPolicy: Parallel to OrderedReady.

Contributor Author:

    Here, we call init and then bootstrap for every SCM pod (re)start. We would instead have to perform pod-ID-specific actions.

I used the following doc (the Auto-bootstrap section of the SCM HA documentation):

    Auto-bootstrap

    In some environments (e.g. Kubernetes) we need to have a common, unified way to initialize SCM HA quorum. As a reminder, the standard initialization flow is the following:

    On the first, "primordial" node: ozone scm --init
    On second/third nodes: ozone scm --bootstrap

    This can be improved: primordial SCM can be configured by setting ozone.scm.primordial.node.id in the config to one of the nodes.

    ozone.scm.primordial.node.id scm1

    With this configuration both scm --init and scm --bootstrap can be safely executed on all SCM nodes. Each node will only perform the action applicable to it based on the ozone.scm.primordial.node.id and its own node ID.

    Note: SCM still needs to be started after the init/bootstrap process.

    ozone scm --init
    ozone scm --bootstrap
    ozone scm --daemon start

    For Docker/Kubernetes, use ozone scm to start it in the foreground.

My understanding is that we can run these commands on every instance and each node will automatically detect what to do, as long as ozone.scm.primordial.node.id is set correctly.

Contributor Author:

    It appears from the documentation that init on pod-1 needs to complete before bootstrap on pod-x. This would perhaps require changing podManagementPolicy: Parallel to OrderedReady.

I used Parallel because of the Ratis ring. The problem with OrderedReady is that the hostnames of not-yet-started pods cannot be resolved through the headless service. So the first SCM node cannot start, because it depends on resolving the others, which in turn cannot start because they are only started once the first node has finished successfully.

Contributor Author:
However, this is only relevant for the bootstrap process, so you might be right: we probably only need the bootstrap once in persistent mode. Maybe someone from the Ozone contributors can provide a definitive answer to this question.
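One option worth exploring for the DNS problem described above (an assumption on my part, not something this PR implements): a headless Service can publish DNS records for pods before they become Ready via publishNotReadyAddresses, which would let SCM peers resolve each other during parallel startup:

    apiVersion: v1
    kind: Service
    metadata:
      name: {{ .Release.Name }}-scm-headless
    spec:
      clusterIP: None
      # Publish A records for pods even before they pass their probes, so peers
      # can resolve each other while the Ratis ring is forming.
      publishNotReadyAddresses: true
      selector:
        {{- include "ozone.selectorLabels" . | nindent 4 }}
        app.kubernetes.io/component: scm
      ports:
        - name: ratis
          port: 9894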

env:
{{- include "ozone.configuration.env" . | nindent 12 }}
{{- with $env }}
{{- tpl (toYaml .) $ | nindent 12 }}
{{- end }}
{{- with $envFrom }}
envFrom: {{- tpl (toYaml .) $ | nindent 12 }}
{{- end }}
volumeMounts:
- name: config
mountPath: {{ .Values.configuration.dir }}
- name: {{ .Release.Name }}-scm
mountPath: {{ .Values.scm.persistence.path }}
{{- end }}
containers:
- name: scm
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
@@ -82,10 +101,18 @@ spec:
ports:
- name: rpc-client
containerPort: 9860
- name: block-client
containerPort: 9863
- name: rpc-datanode
containerPort: 9861
- name: ui
containerPort: {{ .Values.scm.service.port }}
{{- if gt (int .Values.scm.replicas) 1 }}
- name: ratis
containerPort: 9894
- name: grpc
containerPort: 9895
{{- end }}
Comment on lines +110 to +115
Reviewer:
Since we are exposing ports, we should also have a readiness probe to toggle access to these ports, in a separate Jira.
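A minimal sketch of what such a readiness probe on the SCM container could look like (the probe type, port choice, and thresholds are assumptions for illustration):

    readinessProbe:
      # Only add the pod to Service endpoints once the client RPC port accepts
      # connections; "rpc-client" matches the named containerPort above.
      tcpSocket:
        port: rpc-client
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 6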

livenessProbe:
httpGet:
path: /