feat: Add nftables for firewalling #5
base: master
Conversation
Force-pushed from 7e99dc8 to 946d047
Force-pushed from 946d047 to 9d228a4
tcp dport 8551 accept comment "Consensus Client Engine API"
tcp dport 30303 accept comment "Execution Client P2P"
tcp dport { 8545, 8546 } accept comment "Execution Client RPC"
tcp dport 27017 accept comment "Searcher Input Channel"
What's listening on this one?
Based on Angela's doc, it is the searcher's input channel
ct state established,related accept

# Production mode rules
tcp dport 8551 accept comment "Consensus Client Engine API"
Hm, should we be accepting requests to CL API apart from p2p?
I think Lighthouse listens on 9000 by default; that port should probably be listened for and accepted here too
Yeah, you are right, but I was following the table in Angela's doc here.
I guess this is specific to the deployment of the searcher stuff within podman. We need to sync with Angela for more context on that doc.
Spoke with searcher, they prefer to run CL on their machine too. So let's remove 8551, open 9000 on our trusted CL node outside, and forward to 9000 on the searcher's machine.
# Production mode rules
tcp dport 30303 accept comment "Execution Client P2P"
tcp dport { 8545, 8546 } accept comment "Execution Client RPC"
Hm, should we accept EL RPC to anyone external?
I wonder if there's a good way for us to restrict RPC to only between containers
I am not sure about that, but I guess princess is running an EL node within podman too, or that was my impression from Angela's doc.
However, this is just a draft and not the final nftables confs. I'd need @bakhtin to check them out and provide me with the conf that we usually require for production instances instead.
We wanted to open 8545 for the hackathon so we wouldn't have to wait for their EL node inside to sync, but removing this from the spec now.
I guess one caveat is that this only works if the untrusted user themselves cannot listen on those ports. That can be the policy, but it should be up front and obvious that it's a requirement.
I'm guessing the script enabling and disabling maintenance mode is wip?
I wonder if we can make the user containers have explicit networks https://github.com/containers/podman/blob/main/docs/tutorials/basic_networking.md
One interesting thing is putting containers in a single pod (see Communicate between two rootless containers in a pod) -- that way they can communicate over localhost, and we would not have to open any of the ports on the host machine.
Yes, the maintenance and production mode switching script is the next step after finalizing this nftables firewalling stuff. I guess we need both the nftables firewalling for the root/user and also explicit network configuration for the user's containers.
Interesting, we could look into that approach, but I'm not sure what the hard requirements for the princess use-case are. Wouldn't their LH/EL nodes within the pod containers require external access?
Force-pushed from c329e38 to 7243a78
Force-pushed from 7243a78 to 41a607d
if (!wait_for_namespace()) {
    fprintf(stderr, "Network namespace not available, running in default namespace\n");
    execl("/bin/sh", "sh", NULL);
    perror("execl");
    return 1;
We should probably just crash if the namespace is not available and not allow any SSH login, as a drastic security measure
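A minimal sketch of that fail-closed variant, reusing the wait_for_namespace() helper from this PR (the wording of the error message is illustrative): instead of falling back to a shell in the default namespace, the login is refused outright.

    if (!wait_for_namespace()) {
        // Fail closed: no searcher namespace means no shell at all.
        fprintf(stderr, "Network namespace not available, refusing login\n");
        return 1;
    }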
// Temporarily elevate privileges
if (setuid(0) != 0) {
    perror("setuid");
    return 1;
}
Unfortunately, despite setting the binary as SUID to execute with root privileges, it didn't work, because entering the namespace requires root privileges and was returning "operation not permitted". So we need this hack here where we escalate the privileges just to enter the namespace and then set them back to the searcher user in line 80.
I'd add dropping privileges as atexit just in case
There should be some logic that puts the machine in automatic maintenance mode if the nftables firewall fails to be set up correctly. Ideally maintenance mode shouldn't require firewall rules to work, so if the firewall setup fails for some reason, we're still in a situation where no live data can be leaked.
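A rough sketch of that fail-safe, assuming the /etc/nftables-<mode>.conf naming already used in this PR; the helper name is illustrative. If the requested ruleset cannot be applied, the maintenance ruleset is applied instead, and if even that fails the program refuses to continue.

    #include <stdio.h>
    #include <stdlib.h>

    // Hypothetical helper: apply the requested ruleset, fall back to the
    // maintenance ruleset (no live data exposed) if that fails.
    static int apply_ruleset_fail_safe(const char *to_mode) {
        char cmd[256];
        int n = snprintf(cmd, sizeof(cmd),
                         "/usr/sbin/nft -f /etc/nftables-%s.conf", to_mode);
        if (n > 0 && (size_t)n < sizeof(cmd) && system(cmd) == 0)
            return 0;

        fprintf(stderr, "failed to apply %s rules, falling back to maintenance\n", to_mode);
        if (system("/usr/sbin/nft -f /etc/nftables-maintenance.conf") != 0) {
            // Unknown firewall state: refuse to continue.
            fprintf(stderr, "fallback to maintenance rules failed, aborting\n");
            exit(EXIT_FAILURE);
        }
        return -1;
    }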
    return 0;
}

int enter_namespace() {
What are the restrictions of this namespace? Is it only network-wise?
This is all described in detail in the searcher-net-init script below. One of the main headaches I have with this approach is entering the namespace via ns-enter.
Could you try setting up a wrapper shell that is calling ns-enter?
yeah, I simplified this yesterday through the dropbear configuration to enter a similar wrapper shell for the transition.
define BUILDER_ENDPOINT = 0.0.0.0 # Replace with actual builder endpoint IP

table inet filter {
    chain input {
Should we not accept by default and drop everything coming from & going to BUILDER_ENDPOINT?
I thought of this too, but I think in the maintenance mode the searcher would need this to sync their stuff when deploying the podman containers for their EL/CL. Wouldn't they need to send to and receive from peer connections?
That's why I only opened those and dropped everything else, including the state diff stream, which is sensitive and not necessary in the maintenance stage.
ct state established,related accept

# Allow only builder endpoint connections to these ports
ip saddr $BUILDER_ENDPOINT tcp dport 9000 accept comment "FB-operated Consensus Client P2P"
I guess more accurate would be "Builder-operated..."
yeah, I took the naming from the table in the document but that can be changed for sure
type filter hook output priority 0; policy drop

# Allow established/related connections
ct state established,related accept
hm, if a searcher establishes a connection during maintenance to their own server, wouldn't this rule let them continue sending stuff? Or are you maybe dropping the established connections on transition to production?
I am currently dropping/killing everything the searcher establishes during the transition. But that is why I also want a way to test this somehow.
It is kind of difficult to check which connections to keep established and which ones to drop. Also, we need to make sure that an undesired established connection doesn't try to re-establish itself upon termination during the transition. But that shouldn't be possible, because the transition nftables configuration should drop anything not allowed.
ip daddr $BUILDER_ENDPOINT tcp dport 8645 accept comment "Builder RPC"

# Allow fluent-bit to access log endpoint
ip daddr $SEARCHER_LOG_HOST tcp dport $SEARCHER_LOG_PORT accept comment "Logs Collector"
Should this be only from the root network ns?
yes, only the root should be able to access this and intercept sent/received stuff over that address:port
ip daddr $SEARCHER_LOG_HOST tcp dport $SEARCHER_LOG_PORT accept comment "Logs Collector"

# Allow traffic to searcher namespace
oifname "veth0" accept comment "Searcher network interface"
Isn't veth0 root ns? I think the rule is correct but the comment is not
# Searcher Namespace Root Namespace External
# [veth1 10.0.0.2] [veth0 10.0.0.1] [eth0]
IIUC, this opens that port to the outside, so it is more like veth0 -> eth0 -> external world.
Internally, veth1 is connected to veth0, and anything from the searcher trying to reach that address will be intercepted at veth0 and handled by our delayed fluent-bit.
flush ruleset

table inet filter {
    chain input {
I'd guess we should allow EL and CL
Yes, we probably should keep those EL/CL connections. I believe I did this as a strict transition configuration and would then re-allow them afterwards.
I have no strong opinion on this, so we can re-add the EL/CL rules here again, without the state diff ofc.
// Wait for processes to terminate
sleep(2);

// Force kill remaining processes
Hm, why are you killing processes? Wouldn't dropping connections be enough?
I tried dropping, but the already established connections were still intact; after several tries with different approaches, killing was what did it. But I am for sure open to suggestions on how and which connections to drop.
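If the blanket kill ever needs narrowing, one option is to signal only processes whose network namespace is the searcher namespace, by comparing nsfs device+inode. This is only a sketch under assumptions: the function name is hypothetical, and the /var/run/netns/searcher path is inferred from the $NAMESPACE usage in the init script, not confirmed by this PR.

    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Send `sig` to every process whose net namespace matches `ns_path`.
    static void kill_namespace_processes(const char *ns_path, int sig) {
        struct stat target;
        if (stat(ns_path, &target) != 0)
            return;

        DIR *proc = opendir("/proc");
        if (!proc)
            return;

        struct dirent *ent;
        while ((ent = readdir(proc)) != NULL) {
            char *end;
            long pid = strtol(ent->d_name, &end, 10);
            if (*end != '\0' || pid <= 0)
                continue;  // not a PID directory

            char path[64];
            struct stat st;
            snprintf(path, sizeof(path), "/proc/%ld/ns/net", pid);
            if (stat(path, &st) != 0)
                continue;  // process gone or not accessible

            if (st.st_dev == target.st_dev && st.st_ino == target.st_ino
                && (pid_t)pid != getpid())
                kill((pid_t)pid, sig);
        }
        closedir(proc);
    }

    // usage: kill_namespace_processes("/var/run/netns/searcher", SIGTERM);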
long pid;

// Elevate privileges
if (setuid(0) != 0 || setgid(0) != 0) {
I'd register reverting privileges as an atexit handler before elevating them
https://en.cppreference.com/w/c/program/atexit
Interesting, yeah. We can abstract the privilege drop into a separate function and wrap it with atexit.
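A minimal sketch of that pattern, assuming the SUID-root setup described above (the helper name is illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static uid_t real_uid;
    static gid_t real_gid;

    // Registered with atexit() so the process never exits while still root,
    // even on early error paths.
    static void drop_privileges(void) {
        if (setgid(real_gid) != 0 || setuid(real_uid) != 0) {
            perror("drop_privileges");
            _exit(EXIT_FAILURE);
        }
    }

    int main(void) {
        real_uid = getuid();
        real_gid = getgid();
        atexit(drop_privileges);

        // Temporarily elevate privileges, as in the current code.
        if (setuid(0) != 0 || setgid(0) != 0) {
            perror("elevate");
            return 1;
        }
        // ... privileged work: setns(), applying nft rules ...
        return 0;  // drop_privileges runs here via atexit
    }

One caveat: atexit handlers do not run on _exit() or on fatal signals, so explicit drops on those paths would still be needed.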
// Temporarily elevate privileges
if (setuid(0) != 0) {
    perror("setuid");
    return 1;
}
I'd add dropping privileges as atexit just in case
    return 1;
}

if (strcmp(argv[1], "transition-to-maintenance") == 0) {
General comment that strcmp is a buffer overrun waiting to happen - I'd strongly recommend strnlen_s + strncmp for all the comparisons (or C++ but it's a bit too late for that)
I can try to refactor it all to C++ or a different language if necessary, like Go.
First I just wanted to quickly have something running and doing the logic we need in a compiled binary.
I avoided Rust for the sake of delivery speed, but we can also consider it if needed.
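For reference, a small sketch of the bounded comparison suggested above (a sketch only: strnlen_s requires optional Annex K support, so this uses plain strncmp with the bound taken from the literal; the macro name is illustrative):

    #include <string.h>

    // Compare an argument against a string literal, bounded by the literal's
    // size (including the terminating NUL), so a longer argument never matches.
    #define ARG_IS(arg, literal)  (strncmp((arg), (literal), sizeof(literal)) == 0)

    // usage: if (ARG_IS(argv[1], "transition-to-maintenance")) { ... }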
}

chain output {
    type filter hook output priority 0; policy drop
I'd guess established output is alright?
In the transition, you need to make sure that the killed undesired connections do not re-establish themselves during the transition. So anything not explicitly added should by default not be allowed to establish during the transition.
# Configure veth1 inside the namespace
log "Configuring veth1 in namespace $NAMESPACE..."
if ! ip netns exec "$NAMESPACE" ip addr add 10.0.0.2/24 dev veth1; then
Shouldn't this be 10.0.0.2/32 or 10.0.2.0/24? Similar comment for the 10.0.0.1
Could you elaborate more on why?
}

// Apply final rules
snprintf(cmd, sizeof(cmd), "/usr/sbin/nft -f /etc/nftables-%s.conf", to_mode);
I'd avoid reusing the buffer here
I agree; in this case the size is hardcoded, but I see we don't clear the buffer, which indeed might be a problem.
I will refactor later on.
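A small sketch of what that refactor could look like, using a dedicated buffer and an explicit truncation check instead of reusing cmd (the helper name is illustrative; the path convention is the one already used in this PR):

    #include <stdio.h>
    #include <stdlib.h>

    // Build the final nft command in its own buffer and refuse to run it if
    // it was truncated, so a stale or clipped command never reaches system().
    static int apply_final_rules(const char *to_mode) {
        char final_cmd[256];
        int n = snprintf(final_cmd, sizeof(final_cmd),
                         "/usr/sbin/nft -f /etc/nftables-%s.conf", to_mode);
        if (n < 0 || (size_t)n >= sizeof(final_cmd)) {
            fprintf(stderr, "nft command truncated, refusing to run\n");
            return -1;
        }
        return system(final_cmd);
    }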
    return ret;
}

static int execute_transition_script(const char *from_mode, const char *to_mode) {
I'd guess the function would be simpler if you hardcoded the transitions as one function each
I had them as one function each, but it became annoying: each time I had to fix an issue/bug, I had to handle it in each one separately due to the duplication. So here I kept the common code.
I can separate them again if we think that's better or more readable.
uid_t real_gid = getgid();

log_message(LOG_INFO, "Applying transition rules...");
if (setuid(0) != 0) {
I don't think you drop the privilege
good catch. This only elevates and down below it should drop it again
}

static int read_current_mode(char *mode, size_t size) {
    if (acquire_lock() != 0) {
I'd probably lock the whole execution -- I don't think concurrent access is handled
So you mean not at the function level but rather directly in main: lock it and only unlock it when finished instead?
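A minimal sketch of locking the whole run, assuming an illustrative lock path: the exclusive flock is taken at the top of main and is released automatically when the process exits, so two concurrent mode transitions cannot interleave.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>

    int main(void) {
        // One exclusive lock for the entire run.
        int lock_fd = open("/run/mode-controller.lock", O_CREAT | O_RDWR, 0600);
        if (lock_fd < 0 || flock(lock_fd, LOCK_EX | LOCK_NB) != 0) {
            fprintf(stderr, "another mode transition is already in progress\n");
            return 1;
        }

        // ... read current mode, run the transition, write the new mode ...

        return 0;  // fd closed on exit, lock released
    }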
export PS1='searcher@tdx:\w\$ '

# Check current mode and handle transition if needed
mode=$(mode-controller status | awk '{print $NF}')
An alternative would be to not read the mode here, and instead call mode-controller with no arguments and let it figure out the mode. I think /usr/bin/ns-enter can be called regardless of mode, which is super convenient.
I had it all in mode-controller originally, but there were issues with the user not being set correctly.
Also, the execution status check was not working correctly when SSHing into the machine while in production mode.
That's why we need this check in the login shell, so it checks before passing off to the mode-controller.