HDDS-11618. Enable HA modes for OM and SCM #10
@@ -28,6 +28,10 @@ spec:
   ports:
     - name: ui
      port: {{ .Values.om.service.port }}
+   {{- if gt (int .Values.om.replicas) 1 }}
+   - name: ratis
+     port: 9872
+   {{- end }}
Comment: Is this port being exposed for one Ozone Manager (OM) to communicate with another OM when they form a Ratis ring?

Comment: If the number of OM pods is manually modified by […]

Comment: As per the OM documentation […] Perhaps there should be a different service which maps and groups pods as per the […]

Comment: Yes, this was my reason.

Comment: OK, if this is the case, maybe we can remove this export. I found and used the following ticket for the port configuration: https://issues.apache.org/jira/browse/HDDS-4677. I'm not really familiar with the architecture. If we can remove it, I will do so :)

Comment: Great point! I hadn't considered that scenario. What could be the downside of exposing the port within an internal Kubernetes network? Perhaps we can use a Helm lookup mechanism to check the current replica count as an alternative, but the simplest approach is to always expose the port.
   selector:
     {{- include "ozone.selectorLabels" . | nindent 4 }}
     app.kubernetes.io/component: om
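A rough sketch of the Helm `lookup` idea mentioned in the thread above, for reference only. The StatefulSet name used here is an assumption, and `lookup` returns an empty result during `helm template` and dry-run installs, so the sketch falls back to `.Values.om.replicas`:

```yaml
# Sketch only (not part of this PR): derive the replica count from the live
# StatefulSet if it exists, otherwise fall back to the configured value.
{{- $sts := lookup "apps/v1" "StatefulSet" .Release.Namespace (printf "%s-om" .Release.Name) }}
{{- $replicas := int .Values.om.replicas }}
{{- if $sts }}
{{- $replicas = int $sts.spec.replicas }}
{{- end }}
{{- if gt $replicas 1 }}
- name: ratis
  port: 9872
{{- end }}
```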
@@ -31,6 +31,7 @@ metadata:
     app.kubernetes.io/component: om
 spec:
   replicas: {{ .Values.om.replicas }}
+  podManagementPolicy: Parallel
Comment: As per the OM documentation […]

Comment: Shouldn't there be an init container which calls […]

Comment: Yes, I think you are right.

Comment: At this point, I get stuck in the bootstrap init container. I believe this step is only necessary for new cluster nodes. Thus, the problem reoccurs when we scale (once again, the scaling issue).
   serviceName: {{ .Release.Name }}-om-headless
   selector:
     matchLabels:
@@ -71,6 +72,10 @@ spec:
           containerPort: 9862
         - name: ui
           containerPort: {{ .Values.om.service.port }}
+        {{- if gt (int .Values.om.replicas) 1 }}
+        - name: ratis
+          containerPort: 9872
+        {{- end }}
         livenessProbe:
           httpGet:
             path: /
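Regarding the init-container question in the thread above, a minimal sketch of what such a container for the OM StatefulSet could look like, mirroring the bootstrap container this PR adds for SCM. The `.Values.om.persistence.path` key and the choice of `ozone om --init` versus `ozone om --bootstrap` are assumptions; whether the command can safely run on every replica is exactly the open question in the thread:

```yaml
# Sketch only (not part of this PR): an OM init container analogous to the SCM one.
initContainers:
  - name: init
    image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
    args: ["ozone", "om", "--init"]   # newly added OMs may need "--bootstrap" instead
    env:
      {{- include "ozone.configuration.env" . | nindent 8 }}
    volumeMounts:
      - name: config
        mountPath: {{ .Values.configuration.dir }}
      - name: {{ .Release.Name }}-om
        mountPath: {{ .Values.om.persistence.path }}
```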
@@ -31,6 +31,7 @@ metadata:
     app.kubernetes.io/component: scm
 spec:
   replicas: {{ .Values.scm.replicas }}
+  podManagementPolicy: Parallel
Comment: Is a […]

Comment: I see a different problem: if we use […]
   serviceName: {{ .Release.Name }}-scm-headless
   selector:
     matchLabels:
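For context on `podManagementPolicy: Parallel`: with the default `OrderedReady` policy each pod waits for the previous ordinal to become Ready, which is what produces the mutual-wait situation described in the bootstrap thread further down. A related setting, shown here purely as a sketch and not something this PR changes, is `publishNotReadyAddresses` on the headless service, which lets the DNS names of not-yet-ready peers resolve while the Ratis ring is forming:

```yaml
# Sketch only (assumed shape of the existing headless service, plus one extra field):
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-scm-headless
spec:
  clusterIP: None
  publishNotReadyAddresses: true   # resolve pod hostnames before readiness
  selector:
    {{- include "ozone.selectorLabels" . | nindent 4 }}
    app.kubernetes.io/component: scm
```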
@@ -61,6 +62,24 @@ spec:
            mountPath: {{ .Values.configuration.dir }}
          - name: {{ .Release.Name }}-scm
            mountPath: {{ .Values.scm.persistence.path }}
+      {{- if gt (int .Values.scm.replicas) 1 }}
+      - name: bootstrap
+        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
+        args: ["ozone", "scm", "--bootstrap"]
Comment: From the SCM HA docs: […]

Comment: Here, we call […] It is not clear from the documentation whether […]

Comment: It appears from the documentation that […]

Comment: I used the following doc: […] This can be found in the Auto-bootstrap section there. I understood that we can run these commands on every instance and they will automatically detect what to do if […]

Comment: I used this because of the Ratis ring. The problem with the other configuration is that the hostnames of unstarted pods from the headless service cannot be resolved. So the first SCM node cannot start because it depends on resolving the other ones, which in turn cannot start because they only start once the first node has finished successfully.

Comment: However, this is only relevant for the bootstrap process. So, you might be right. We probably only need the bootstrap once in persistent mode. Maybe someone from the Ozone contributors can provide a definitive answer to this question.
+        env:
+          {{- include "ozone.configuration.env" . | nindent 12 }}
+          {{- with $env }}
+          {{- tpl (toYaml .) $ | nindent 12 }}
+          {{- end }}
+        {{- with $envFrom }}
+        envFrom: {{- tpl (toYaml .) $ | nindent 12 }}
+        {{- end }}
+        volumeMounts:
+          - name: config
+            mountPath: {{ .Values.configuration.dir }}
+          - name: {{ .Release.Name }}-scm
+            mountPath: {{ .Values.scm.persistence.path }}
+      {{- end }}
       containers:
         - name: scm
           image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
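Following up on the init-versus-bootstrap thread above, a sketch of an ordinal-aware variant of the bootstrap container. It assumes the pod with ordinal 0 is configured as the primordial SCM; the shell logic and that assumption are not part of this PR:

```yaml
# Sketch only: run "scm --init" on the first pod and "scm --bootstrap" on the rest.
{{- if gt (int .Values.scm.replicas) 1 }}
- name: bootstrap
  image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
  command: ["/bin/sh", "-c"]
  args:
    - |
      # Derive the StatefulSet ordinal from the pod hostname, e.g. "...-scm-2" -> "2".
      ordinal="${HOSTNAME##*-}"
      if [ "$ordinal" = "0" ]; then
        ozone scm --init        # primordial SCM: one-time initialisation
      else
        ozone scm --bootstrap   # remaining SCMs join the existing Ratis ring
      fi
{{- end }}
```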
@@ -82,10 +101,18 @@ spec:
         ports:
           - name: rpc-client
             containerPort: 9860
+          - name: block-client
+            containerPort: 9863
           - name: rpc-datanode
             containerPort: 9861
           - name: ui
             containerPort: {{ .Values.scm.service.port }}
+          {{- if gt (int .Values.scm.replicas) 1 }}
+          - name: ratis
+            containerPort: 9894
+          - name: grpc
+            containerPort: 9895
+          {{- end }}
Comment on lines +110 to +115: Since we are exposing ports, we should also have a readiness probe to toggle access to these ports, in a separate Jira.
         livenessProbe:
           httpGet:
             path: /
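A sketch of the readiness probe suggested in the comment above, so that the SCM pod is only added to Service endpoints once it is actually serving. Probing the web UI port mirrors the existing liveness probe; the path, port name, and timings are assumptions:

```yaml
# Sketch only (follow-up Jira material, not part of this PR):
readinessProbe:
  httpGet:
    path: /
    port: ui
  initialDelaySeconds: 10
  periodSeconds: 10
```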
Comment: Can the port numbers in all the services and statefulsets be referenced from the values.yaml file? We would like to avoid hardcoding them.

Comment: Yes, definitely! I just proposed this. This was also my personal question ^^
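A minimal sketch of how the hardcoded ports could move into values.yaml; the key names below are assumptions, not something already agreed in this PR:

```yaml
# values.yaml (sketch)
om:
  ports:
    ratis: 9872
scm:
  ports:
    ratis: 9894
    grpc: 9895
```

The templates would then reference these values, e.g. `containerPort: {{ .Values.scm.ports.ratis }}` instead of the literal 9894.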