You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently if a third-party server responds with a 4xx or 5xx error Via's NGINX component (that handles PDF proxying) directly proxies that response so Via itself responds with the same 4xx or 5xx. The problems with this are:
It can cause noise in our monitoring because users proxying broken third-party servers can cause Via to send 4xx or 5xx responses. If it's "normal" for Via to 4xx or 5xx then that can cover up actual issues
It can cause false alarms if users proxying broken third-party servers can trigger alarms that think Via is unhealthy or down because it's 4xx'ing or 5xx'ing, waking up on-call engineers. This can cause us to ignore or disable alarms and could cover up actual problems
History
Previously Via actually didn't proxy error responses from third-parties but this was implemented by completely swallowing the third-party status code and body. This led to difficulty in diagnosing a real issue in production when Via's proxying of Google Drive really was broken. As a result we removed the error hiding code but this leaves Via open to the monitoring noise and false alarms mentioned above.
This doesn't apply to Google Drive
Via now special-cases Google Drive URLs and routes them to a separate Python endpoint. This issue is just about all the other PDF URLs that get routed to the NGINX endpoint
Some file hosts might be special
We don't want arbitrary broken third-party servers to be able to cause Via to respond with arbitrary 4xx and 5xx codes and create monitoring and alerting noise. But we probably do want error responses from Google Drive (for example) to trigger errors from Via: we do want alarms to go off if proxying Google Drive is failing. This is because Google Drive is:
A particularly important file host for our users, because the LMS app has a built-in Google Drive integration
A file host that we trust to be reliable and well-behaved
Google Drive is already special-cased and routed to a separate Python endpoint so any changes to the NGINX endpoint won't affect Google Drive anyway. But the same reasoning might apply to other files hosts that the LMS app has pickers for: Canvas and Blackboard Files, Microsoft OneDrive. It might also apply to some hosts that we don't have pickers for if we know users use that host a lot, e.g. maybe Dropbox, or Box.com.
Canvas and Blackboard files may not be clear cut since not all Canvas and Blackboard instances are hosted by Instructure and Blackboard. Schools can host their own. So they may not always be as reliable.
Google Drive happens to be easy to special-case because all its URLs are drive.google.com but this may not be so easy for all hosts that we might want to special case, some of them might not be easy to identify.
The original status code and response still needs to be accessible somewhere
For debugging.
Solution
Some ideas:
Replace third-party error statuses with our own singular "third-party was broken" status code, so as not to create monitoring noise or trigger false alarms
Don't replace the third-party error bodies. Leave those intact for debugging
Note that there's no point in Via returning a nice error page here. Via's response goes to PDF.js's JavaScript code, it is not shown to the user. (For debugging you can find the response body in dev tools, which is why proxying through the third-party's response body can be useful for debugging.)
Copy the third-party's original status header into a custom header, for debugging
Exempt certain important domains (such as Microsoft OneDrive) that we do want to trigger alarms. We should continue to proxy error statuses for these domains
We might be able to use NGINX's ngx_http_sub_module to achieve some of the above or, failing that, the Lua module or the JavaScript module. The JavaScript module in particular seems squarely aimed at our use-case here.
The text was updated successfully, but these errors were encountered:
seanh
changed the title
Don't completely swallow third-party PDF error responses
Don't directly proxy error statuses from arbitrary third parties
Sep 13, 2021
Problem
Currently if a third-party server responds with a 4xx or 5xx error Via's NGINX component (that handles PDF proxying) directly proxies that response so Via itself responds with the same 4xx or 5xx. The problems with this are:
History
Previously Via actually didn't proxy error responses from third-parties but this was implemented by completely swallowing the third-party status code and body. This led to difficulty in diagnosing a real issue in production when Via's proxying of Google Drive really was broken. As a result we removed the error hiding code but this leaves Via open to the monitoring noise and false alarms mentioned above.
This doesn't apply to Google Drive
Via now special-cases Google Drive URLs and routes them to a separate Python endpoint. This issue is just about all the other PDF URLs that get routed to the NGINX endpoint
Some file hosts might be special
We don't want arbitrary broken third-party servers to be able to cause Via to respond with arbitrary 4xx and 5xx codes and create monitoring and alerting noise. But we probably do want error responses from Google Drive (for example) to trigger errors from Via: we do want alarms to go off if proxying Google Drive is failing. This is because Google Drive is:
Google Drive is already special-cased and routed to a separate Python endpoint so any changes to the NGINX endpoint won't affect Google Drive anyway. But the same reasoning might apply to other files hosts that the LMS app has pickers for: Canvas and Blackboard Files, Microsoft OneDrive. It might also apply to some hosts that we don't have pickers for if we know users use that host a lot, e.g. maybe Dropbox, or Box.com.
Canvas and Blackboard files may not be clear cut since not all Canvas and Blackboard instances are hosted by Instructure and Blackboard. Schools can host their own. So they may not always be as reliable.
Google Drive happens to be easy to special-case because all its URLs are
drive.google.com
but this may not be so easy for all hosts that we might want to special case, some of them might not be easy to identify.The original status code and response still needs to be accessible somewhere
For debugging.
Solution
Some ideas:
We might be able to use NGINX's
ngx_http_sub_module
to achieve some of the above or, failing that, the Lua module or the JavaScript module. The JavaScript module in particular seems squarely aimed at our use-case here.The text was updated successfully, but these errors were encountered: