Logging differences based on startup network state #87
Comments
Thank you for posting issues: they all help to improve Thespian in some way!
You've encountered one of the trickier contortions in the networking support: namely that each Actor should self-identify with an address that is reachable by other actors. For example, if an actor simply identified itself as being at [...]
Note that Thespian currently does not support an Actor Convention that spans multiple networks; I'm not sure it ever will (a VPN would probably be a better approach to that scenario anyhow). This isn't your situation, but I thought I should mention this limitation while on the subject.
In your case, you aren't using a Convention, so you are reasonably expecting the use of localhost. However, the Thespian code above was written to support eventual conventions, so in the case where there is no Convention Leader specified, it will default to using Google's DNS server (at 8.8.8.8) to identify the network most likely to provide general access to the current system. This is (obviously) why shutting down the network on your system causes logging output to be suspended: the network providing the address routing is down.
Also, in case I haven't adequately explained this elsewhere, here's how logging works:
Each of these nodes uses the same transport functionality for communication. You hadn't noticed the effects yet (primarily because it's for internal support) in your network disconnect situation, but your TopActor also lost communication to the Admin when your network went down. You would also have lost connectivity between Actors themselves, had you been running multiple Actors. When there is a delivery problem, the effects depend on which system base is used:
I believe the above correctly describes the effects of your testing in the different scenarios. Let me know if I've overlooked something somewhere though.
Solution
In your case, the solution is to indicate to Thespian which network you'd like it to use by specifying the Convention Admin address (even though you won't be running a Convention); see https://thespianpy.com/doc/using.html#hH-9d33a877-b4f0-4012-9510-442d81b0837c. Something like:
This should be enough to inform Thespian that it should use localhost for the addresses of Actors (and the logger). Let me know if this resolves your issues and questions. I'm also going to improve the Thespian documentation so that it explains this better.
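A minimal sketch of the kind of configuration described above, assuming the capability is passed to `ActorSystem` at startup. The Thespian documentation names this capability `Convention Address.IPv4`, so that key is used here; the key name, the `host:port` string form, and port 1900 should all be checked against the installed Thespian version. The function names `convention_capabilities` and `start_local_system` are hypothetical helpers, not part of Thespian.

```python
def convention_capabilities(host="127.0.0.1", port=1900):
    """Build a capabilities dict that pins the convention address to host:port.

    The key name follows the Thespian documentation; treat it as an
    assumption to verify for your installed version.
    """
    return {"Convention Address.IPv4": f"{host}:{port}"}


def start_local_system(log_defs=None):
    """Start a multiprocTCPBase ActorSystem that should address Actors via localhost."""
    # Imported lazily so the helper above can be used without Thespian installed.
    from thespian.actors import ActorSystem

    return ActorSystem(
        "multiprocTCPBase",
        capabilities=convention_capabilities(),
        logDefs=log_defs,
    )
```

Calling `start_local_system(log_defs=logcfg)` in place of a bare `ActorSystem(...)` call would be the intended use. Passing a different `host` to `convention_capabilities` selects another interface.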
Kevin, I tried your solution but ended up with the same result as before. I also tested a slightly different setup after testing the change you suggested. The results:
Here is my current setup that I used as an additional test. I have a Raspberry Pi with the following network adapters:
Network A: 192.168.1.1/24
Network B: 192.168.2.1/24
To test network connection, I am simply unplugging the cable going to Network Card A.
{'Convention Admin.IPv4': '192.168.2.1:1900'}
I set up the convention address as the address of Network Card B. I did this because unplugging card A's network cable should have no effect on card B. However, even with this setup, unplugging card A from the network still freezes all logging in my child actor.
This almost seems like the specified convention admin IP isn't being recognized, but there is no difference in the thesplog.log file. I even tried manually updating the capabilities using the following code to see if anything changed in the thesplog, but I didn't see any difference in the logs aside from the lines in the logs below the code.
asys = ActorSystem('multiprocTCPBase', logDefs=logcfg, capabilities={'Convention Admin.IPv4': '192.168.2.1:1900'})
asys.updateCapability('Convention Admin.IPv4', '192.168.2.1')
asys.updateCapability('Admin Port', 1901)
I was expecting the parent (admin) address in the last line of the log to show [...]
However, after initializing the Top actor, I ran the following lines:
top_status = asys.ask(top_actor, Thespian_StatusReq(), 10)
res = asys.ask(top_status.adminAddress, Thespian_StatusReq(), 10)
print(res.capabilities)
This printed out the capabilities containing the correct IP and Admin Port values.
Hi again. Sorry to keep posting new issues for you.
Environment
Tested on Debian 10 (buster) with an ARM Cortex A8 (1 core)
Also tested on an Ubuntu 20.04 LTS VM with 4 cores allocated to test if this was a threading issue.
Python Versions: 3.7.3, 3.8.10, 3.9.5
Manually set the following environment variables:
export THESPLOG_FILE=./thespian.log
export THESPLOG_THRESHOLD=DEBUG
Problem
The Thespian logging system has unexpected behavior when network connectivity is lost on the host machine, specifically when network connectivity is lost in the form of the network adapter becoming disabled or disconnected. This is an issue since if WiFi connection is lost or an ethernet cable becomes unplugged, the network adapter is disabled.
There are two main scenarios that I've discovered when testing this issue. The two files below are used in both scenarios.
start.py
This is the startup script for this test. It sets up the actorsystem with logging then creates the top-level actor. It then enters a loop where it waits for a Ctrl+C (SIGINT) signal. When this signal is caught, the actor system is shut down and the program exits.
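The original start.py is not reproduced in this text, so the following is only a sketch of a startup script matching the description above. The contents of `logcfg`, the log format, and the `top.Top` import path are assumptions made for illustration; only the log file name `actor_log.log` and the `multiprocTCPBase` system base come from the report.

```python
import time

# Hypothetical logging configuration: every record goes to actor_log.log.
logcfg = {
    "version": 1,
    "formatters": {
        "normal": {"format": "%(asctime)s %(levelname)-8s %(message)s"},
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "filename": "actor_log.log",
            "formatter": "normal",
            "level": "DEBUG",
        },
    },
    "loggers": {"": {"handlers": ["file"], "level": "DEBUG"}},
}


def main():
    # Imported here so the configuration above can be inspected without
    # Thespian installed; top.Top is the assumed top-level actor class.
    from thespian.actors import ActorSystem
    from top import Top

    asys = ActorSystem("multiprocTCPBase", logDefs=logcfg)
    asys.createActor(Top)
    try:
        while True:  # idle until Ctrl+C raises KeyboardInterrupt
            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        asys.shutdown()  # stop the actors and the Admin process
```

Running the script would then amount to calling `main()`. Catching `KeyboardInterrupt` is used here in place of an explicit signal handler because Python raises it on SIGINT by default.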
top.py
This is the top-level actor. It has two wakeup loops that we are using to figure out what is happening in the logs. The first wakeup loop is print, which just prints out the number of times that it has woken up. We use this to ensure that the system is still running while the network is disconnected. The second loop is network, which checks if we have network connectivity every second and prints out "CONNECTED" or "DISCONNECTED" whenever the network state changes.
Scenario 1: Startup Online
In this scenario, the Thespian ActorSystem is created and initialized while there is an active network connection on the host machine (connected both to the local network and able to ping out to the internet).
Log Files: actor_log, thesplog
[...] 8.8.8.8)
python start.py
[...] actor_log.log, disable network (disconnect from wifi, unplug ethernet cable, disable network adapter, etc.)
actor_log.log is no longer receiving any updates while thespian.log continues showing the top actor continuing execution as expected.
actor_log.log will receive all of the queued messages that it missed while the network was disconnected.
In this scenario, it seems that the ActorSystem is queueing all messages to the outward-facing logging actor and waiting for the network connection to reestablish. However, this seems odd to me since I would have thought that Thespian would use the localhost (127.0.0.1) for all internal operations such as allowing the logging actor to write to its output file. I suppose I don't fully understand the inner workings of the TCP/UDP base.
My main concern with this is that the Thespian internal logging may continue queueing these messages during long periods of network loss and that this will cause a system crash. I noticed that the longer I waited before reconnecting the network, the longer it took for the queued messages to get written to the actor_log.log file.
Scenario 2: Startup Offline
In this scenario, the Thespian ActorSystem is created and initialized while there is not an active network connection on the host machine (network cable unplugged, wifi not connected, adapter disabled, etc.).
Log Files: actor_log, thesplog
python start.py
[...] actor_log.log.
[...] actor_log.log.
[...] actor_log.log.
This scenario is how I was expecting the ActorSystem's logging to operate in the event of network loss (such as in scenario 1's case).
Additional Notes
With multiprocUDPBase, Scenario 1 just errored out when I lost network connectivity. Log Files: actor_log, thesplog
With multiprocUDPBase, Scenario 2 operated the same as with multiprocTCPBase.
Gists
Here is a Gist containing the log files that were mentioned in this problem: https://gist.github.com/bryang-spindance/9bfbe442f04aa588f5af20305c187d61