Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add nftables for firewalling #5

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from
Draft

Conversation

MoeMahhouk
Copy link
Collaborator

No description provided.

@MoeMahhouk MoeMahhouk requested review from bakhtin and Ruteri October 21, 2024 16:57
@MoeMahhouk MoeMahhouk force-pushed the firewalling branch 2 times, most recently from 7e99dc8 to 946d047 Compare October 21, 2024 17:31
tcp dport 8551 accept comment "Consensus Client Engine API"
tcp dport 30303 accept comment "Execution Client P2P"
tcp dport { 8545, 8546 } accept comment "Execution Client RPC"
tcp dport 27017 accept comment "Searcher Input Channel"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's listening on this one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on Angela's doc, it is the searcher's input channel

ct state established,related accept

# Production mode rules
tcp dport 8551 accept comment "Consensus Client Engine API"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, should we be accepting requests to CL API apart from p2p?
I think Lighthouse listens at 9000 by default, should probably be listened for and accepted

Copy link
Collaborator Author

@MoeMahhouk MoeMahhouk Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, you are right but I was following the table in Angela's doc here.
I guess, this is specific to the deployment of searcher stuff within podman. We need more context sync with Angela on that doc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke with searcher, they prefer to run CL on their machine too. So let's remove 8551, open 9000 on our trusted CL node outside, and forward to 9000 on the searcher's machine.


# Production mode rules
tcp dport 30303 accept comment "Execution Client P2P"
tcp dport { 8545, 8546 } accept comment "Execution Client RPC"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, should we accept EL RPC to anyone external?
I wonder if there's a good way for us to restrict RPC to only between containers

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about that but I guess princess is running an EL node within podman too. Or that was my impression from Angela's doc.
However, this is just a draft and not the final nftable confs. I would require @bakhtin to check them out and provide me with the conf that we usually require from production instances instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wanted to open 8545 for the hackathon so we wouldn't have to wait for their EL node inside to sync, but removing this from the spec now.

Copy link
Collaborator

@Ruteri Ruteri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one caveat is this only works if the untrusted user themselves cannot listen on those ports, which can be the policy but should be up front and obvious that it's the requirement
I'm guessing the script enabling and disabling maintenance mode is wip?

I wonder if we can make the user containers have explicit networks https://github.com/containers/podman/blob/main/docs/tutorials/basic_networking.md

@Ruteri
Copy link
Collaborator

Ruteri commented Oct 21, 2024

One interesting thing is putting containers in a single pod (see Communicate between two rootless containers in a pod) -- that way they can communicate over localhost, and we would not have to open any of the ports on the host machine.
This could get rid of the requirement for allowing RPC ports in particular

@MoeMahhouk
Copy link
Collaborator Author

I guess one caveat is this only works if the untrusted user themselves cannot listen on those ports, which can be the policy but should be up front and obvious that it's the requirement I'm guessing the script enabling and disabling maintenance mode is wip?

I wonder if we can make the user containers have explicit networks https://github.com/containers/podman/blob/main/docs/tutorials/basic_networking.md

yes, the maintenance and production mode switching script is the next step after finalizing those nftable firewalling stuff.

I guess we need both the nftables firewalling for the root/user and also explicit network configuration for the user's containers.
If our devops (@bakhtin) already have such confs available, I can re-use the in-prod ones. Otherwise, we need to craft those for this use case.

@MoeMahhouk
Copy link
Collaborator Author

One interesting thing is putting containers in a single pod (see Communicate between two rootless containers in a pod) -- that way they can communicate over localhost, and we would not have to open any of the ports on the host machine. This could get rid of the requirement for allowing RPC ports in particular

Interesting, we could look into that approach but not sure what's the hard requirements for princess use-case is. wouldn't their LH/EL nodes within the pod containers require any external access?
Is it only going to be the port for providing the state diffs through it? Not sure how their deployment works since they do it manually.
Some of it is described here

@MoeMahhouk MoeMahhouk force-pushed the firewalling branch 2 times, most recently from c329e38 to 7243a78 Compare October 23, 2024 15:49
Comment on lines 61 to 65
if (!wait_for_namespace()) {
fprintf(stderr, "Network namespace not available, running in default namespace\n");
execl("/bin/sh", "sh", NULL);
perror("execl");
return 1;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably just crash it if the namespace is not available and not allow any ssh login as a drastic security measurement

Comment on lines +68 to +72
// Temporarily elevate privileges
if (setuid(0) != 0) {
perror("setuid");
return 1;
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, despite setting the binary as SUID to execute with root privileges, it didn't work because setting the namespace requires root privileges and was returning operation not permitted. So we need this hack here were we escalate the privileges just to enter the name space and then set it back to the searcher user in line 80

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add dropping privileges as atexit just in case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be some logic that puts the machine in automatic maintenance mode if the nftables firewall failed to be setup correctly. Ideally maintenance mode shouldn't require firewall rules to work, so if the firewall setup fails for some reason, we're still in a situation where no live data can be leaked.

return 0;
}

int enter_namespace() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the restrictions of this namespace? is it only network wise?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all described in details in the searcher-net-init script below.

@fnerdman
Copy link
Contributor

One of the main headaches I have with this approach is entering the namespace via .profile.

Could you try setting up a wrapper shell that is calling ns-enter?

# 1. Create the restricted shell
cat > /bin/rsh << 'EOF'
#!/bin/busybox sh
exec /bin/sh
EOF

# 2. Set correct permissions
chmod 755 /bin/rsh
chown root:root /bin/rsh

# 3. Add to /etc/shells
echo "/bin/rsh" >> /etc/shells

@MoeMahhouk
Copy link
Collaborator Author

One of the main headaches I have with this approach is entering the namespace via .profile.

Could you try setting up a wrapper shell that is calling ns-enter?

# 1. Create the restricted shell
cat > /bin/rsh << 'EOF'
#!/bin/busybox sh
exec /bin/sh
EOF

# 2. Set correct permissions
chmod 755 /bin/rsh
chown root:root /bin/rsh

# 3. Add to /etc/shells
echo "/bin/rsh" >> /etc/shells

yeah, I simplified this yesterday through the dropbear configuration to enter a similar wrapper shell for the transition.
The latest commit that I pushed added the mode-controller feature with the wrapper shell to work well together

define BUILDER_ENDPOINT = 0.0.0.0 # Replace with actual builder endpoint IP

table inet filter {
chain input {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not accept by default and drop everything coming from & going to BUILDER_ENDPOINT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of this too but I think in the maintenance mode the searcher would need this to sync their stuff when deploying the podman for their EL/CL. Wouldn't they need to send and receive from peer connections?

That's why I only opened those and dropped everything else including the state diff stream which is sensitive and not necessary in the maintenance stage.

ct state established,related accept

# Allow only builder endpoint connections to these ports
ip saddr $BUILDER_ENDPOINT tcp dport 9000 accept comment "FB-operated Consensus Client P2P"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess more accurate would be "Builder-operated..."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I took the naming from the table in the document but that can be changed for sure

type filter hook output priority 0; policy drop

# Allow established/related connections
ct state established,related accept
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, if a searcher establishes a connection during maintenance to their own server, wouldn't this rule let them continue sending stuff? Or are you maybe dropping the established connections on transition to production?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am currently dropping/killing everything the searcher establishes during transition. But that is why I also want a way to test this somehow.
It is kind of difficult to check which ones to keep established and which one to drop. Also, we need to make sure that the undesired established connection doesn't try to re-establish itself upon termination during the transition. But that shouldn't be possible because the transition nftable configuration should drop anything not allowed.

ip daddr $BUILDER_ENDPOINT tcp dport 8645 accept comment "Builder RPC"

# Allow fluent-bit to access log endpoint
ip daddr $SEARCHER_LOG_HOST tcp dport $SEARCHER_LOG_PORT accept comment "Logs Collector"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be only from the root network ns?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, only the root should be able to access this and intercept sent/received stuff over that address:port

ip daddr $SEARCHER_LOG_HOST tcp dport $SEARCHER_LOG_PORT accept comment "Logs Collector"

# Allow traffic to searcher namespace
oifname "veth0" accept comment "Searcher network interface"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't veth0 root ns? I think the rule is correct but the comment is not

    #                 Searcher Namespace        Root Namespace            External
    #                 [veth1 10.0.0.2]         [veth0 10.0.0.1]           [eth0]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this opens that port to the outside. so it is more like veth0 -> eth0-> external world.
Internally, the veth1 is connected to veth0 and anything from the searcher trying to reach that address will be intercepted by veth0 to handle with our delayed fluent-bit.

flush ruleset

table inet filter {
chain input {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd guess we should allow EL and CL

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we probably should keep those EL/CL connections. I believe I did this as a strict transition configuration and then re-allow them.
I have no strong opinion on this. So we can re-add the EL/CL rules here again without the state diff ofc

// Wait for processes to terminate
sleep(2);

// Force kill remaining processes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, why are you killing processes? Wouldn't dropping connections be enough?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried dropping but the already established ones where still intakt and upon several tries with different approaches, the killing was doing it. But I am for sure open to suggestions on how and which connections to drop

long pid;

// Elevate privileges
if (setuid(0) != 0 || setgid(0) != 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd set reverting privilege as atexit before elevating them
https://en.cppreference.com/w/c/program/atexit

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, yeah. We can abstract the drop privileges in a separate function and wrap it with atexit

Comment on lines +68 to +72
// Temporarily elevate privileges
if (setuid(0) != 0) {
perror("setuid");
return 1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add dropping privileges as atexit just in case

return 1;
}

if (strcmp(argv[1], "transition-to-maintenance") == 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment that strcmp is a buffer overrun waiting to happen - I'd strongly recommend strnlen_s + strncmp for all the comparisons (or C++ but it's a bit too late for that)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try to refactor all to C++ or different language if necessary. Like Go.
First I just wanted to quickly have something running and doing the logic we need in a compiled binary.
I avoided Rust for the sake of delivery speed but we can also consider it if needed

}

chain output {
type filter hook output priority 0; policy drop
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd guess established output is alright?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the transition, you need to make sure that the killed undesired connection do not re-establish themselves during transition. So anything not explicitly added should be by default not allowed to establish during transition


# Configure veth1 inside the namespace
log "Configuring veth1 in namespace $NAMESPACE..."
if ! ip netns exec "$NAMESPACE" ip addr add 10.0.0.2/24 dev veth1; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be 10.0.0.2/32 or 10.0.2.0/24? Similar comment for the 10.0.0.1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate more on why?

}

// Apply final rules
snprintf(cmd, sizeof(cmd), "/usr/sbin/nft -f /etc/nftables-%s.conf", to_mode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid reusing the buffer here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but in this case it is hardcoded size but I see we don't clear it which indeed might be a problem.
I will refactor later on.

return ret;
}

static int execute_transition_script(const char *from_mode, const char *to_mode) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd guess the function would be simpler if you hardcoded the transitions as one function each

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had them as one function each but it became annoying each time to refactor an issue/bug, I had to handle them in each separately due to duplication. So here I left the common code.
I can separate them again if we see it better or more readable.

uid_t real_gid = getgid();

log_message(LOG_INFO, "Applying transition rules...");
if (setuid(0) != 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you drop the privilege

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. This only elevates and down below it should drop it again

}

static int read_current_mode(char *mode, size_t size) {
if (acquire_lock() != 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably lock the whole execution -- I don't think concurrent access is handled

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean not on function level but rather directly in the main. Lock it and only unlock it when finished instead?

export PS1='searcher@tdx:\w\$ '

# Check current mode and handle transition if needed
mode=$(mode-controller status | awk '{print $NF}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative would be to not read the mode here, instead call mode-controller with no arguments and let it figure out the mode.
I think /usr/bin/ns-enter can be called regardless of mode which is super convenient

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it all in mode-controller originally but there were issues regarding the user not being set correctly.
Also the execution status check was not working correctly upon ssh-in into the machine while in production mode.
That's why we need this check in the login-shell for it to check prior to passing it to the mode-controller

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants