Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EES-4823: Pre-empt content urls that will redirect #5013

Open
wants to merge 8 commits into
base: dev
Choose a base branch
from

Conversation

sambiramairelogic
Copy link
Contributor

Google has crawled over 200 links on our site which hit redirects, and is marking us down for this (in Michael's spreadsheet)

A sizable chunk of these are renamed publications/methodologies as per <contentApi>/redirects, and a few others also hit redirect rules introduced to server.js (i.e., ending in a trailing slash or having a hostname starting with www).

This PR applies all our redirect rules to hrefs before they are rendered by ContentHtml to pre-empt these redirects. This unfortunately means making an API call to the content API each time we render a link, which is hardly ideal. I've left a TODO comment in the files below to prompt a discussion on this - there's a few ways we could go.

@sambiramairelogic sambiramairelogic changed the title Ees 4823/preempt content urls that will redirect EES-4823: Pre-empt content urls that will redirect Jun 28, 2024

const redirectService = {
async list(): Promise<Redirects> {
return (await fetch(`${contentApiUrl}/redirects`)).json();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eurgh, so apparently this causes build errors in the frontend. We could re-implement the frontend version of this service and have two, but that's messy. Problem as things stand I need to make this call available from ContentHtml which is in common.

@ntsim I'm gonna hold off solving this pending your thoughts on my big TODO: comment in AnchorHtml, as I suspect there might be a few changes
image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This service was implemented like this because Axios doesn't work with Next's middleware runtime (which is using this service).

There's potentially other libraries that work across all runtimes, but it's probably not worth investing any time on this as Axios works fine for like 99% of our use-cases.

Would just revert this to how it was previously as this is the most prudent approach.

If we go with the context approach for redirects, we'll also want to move this service back into the frontend module as well.

Comment on lines +123 to +125
await waitFor(() => {
expect(screen.getByText('Featured table')).toBeInTheDocument();
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we implement the context based redirects as suggested, we don't need to do any waiting around for link text to be correct. Make sure to undo all these changes in this PR!


const redirectService = {
async list(): Promise<Redirects> {
return (await fetch(`${contentApiUrl}/redirects`)).json();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This service was implemented like this because Axios doesn't work with Next's middleware runtime (which is using this service).

There's potentially other libraries that work across all runtimes, but it's probably not worth investing any time on this as Axios works fine for like 99% of our use-cases.

Would just revert this to how it was previously as this is the most prudent approach.

If we go with the context approach for redirects, we'll also want to move this service back into the frontend module as well.

Comment on lines 16 to 33
const [sanitisedHref, setSanitisedHref] = useState<string | null>(null);

// TODO: Discuss how we want to achieve this in PR review
// One option is to make this a hook
// Another is to make this call during frontend's GetServerSideProps and expose to common through a context or something
// Another is to leave it like this - but there's still a question of what to render until the API has responded
// Also, can we handle caching a bit more smartly?
useEffect(() => {
const sanitiseHref = async (rawHref: string) => {
setSanitisedHref(await applyRedirectRules(rawHref));
};

sanitiseHref(href);
}, [href]);

if (!sanitisedHref) {
return <a href={undefined}>[link loading...]</a>;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a fair few problems with this approach which you've already identified, but a fundamental problem is that it's not likely to be parsed correctly by crawlers and could break all our content links.

There aren't strong guarantees that a crawler can read client rendered content correctly. Ensuring crawlers have everything they need in the initial server rendered document is the safest approach.

There's some additional issues too:

  • Without any caching, making an API call per link is just a non-starter as content could have many links inside it
  • These links would also be rendered in admin too. This would likely break without adjusting the implementation

I think the options for a server-rendered approach are either:

  • Context (as you've mentioned)
  • Transforming HTML strings server-side e.g. in the backend API, or even frontend

Given we're already transforming the HTML strings as part of React rendering, it's probably more efficient to go down the context route.

Implementation

If we agree the context approach is correct, then I'd propose the following implementation:

The context itself

This should be declared in a similar way to our other contexts, which should consist of a provider component (RedirectsContextProvider) and a hook (useRedirectsContext) to access the context.

The context should not fetch its own redirects from the service. It should just receive these as prop so we can leave it to the consuming application to implement the fetching.

Typically, we implement these contexts so that if the provider hasn't been set up, an exception is thrown by the hook when it's used. However, this will be tougher to do as a fair few tests may need to be updated to add the context provider in (without any redirects).

It might be simpler to implement this so that it doesn't throw, but you might want to try the throwing approach first to see how many tests would need to be updated.

If we do decide to throw and to update the tests, the simplest approach would be to update our custom render function in common. This can then be swapped in really easily into the tests:

+ import render from '@common-test/render'
- import { render } from '@testing-library/react'

Adding globally

It'd be good for to implement the context globally so we can provide the redirects across any link, not just links in content.

This approach is less fiddly as we don't need to wrap areas of the page with the context, which would otherwise create lots of changes. It also means that we don't forget to add the context for new pages.

To do this, we can just add the context in _app.tsx. We can fetch the redirects in App.getInitialProps and pass these through as props to Providers. The Providers component will need to be modified pass the props through to the context.

Caching

A missing part of the puzzle will be to cache the redirects rather than re-fetch them on every getInitialProps (which is on every page render).

The simplest cache implementation will be to store the redirects in a module-scoped variable using a simple time-based expiration. We're already doing this type of caching in the redirectPages middleware, but we could move this to redirectsService so other places can benefit from this caching as well.

We don't have to do this, but for a nice bonus, I'd personally be tempted to re-implement the caching via a util function that could be used for this type of caching elsewhere e.g.

import createMemoryCache from '@common/utils/cache/createMemoryCache';

const redirects = createMemoryCache({
  get: () => (await fetch(`${contentApiUrl}/redirects`)).json(),
  duration: getCacheTime()
});

const redirectService = {
  list(): Promise<Redirects> {
    return redirects.get();
  }
}

Admin

To simplify things, I'd probably suggest that we don't bother fetching the redirects in admin. The SEO penalty doesn't apply to admin so it's probably not worth the hassle to implement these here as well.

If we implement the useRedirectsContext hook to not throw when there isn't a provider, we can just omit setting up a RedirectsContextProvider for the admin.

href: string;
}

export default function AnchorHtml({
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would just call this Link, or ContentLink if you want to avoid confusion with the other Link components.

@@ -87,11 +88,11 @@ export default function ContentHtml({
return !node.attribs.href.includes(
'explore-education-statistics.service.gov.uk',
) && typeof node.attribs['data-featured-table'] === 'undefined' ? (
<a href={url} target="_blank" rel="noopener noreferrer">
<AnchorHtml href={url} target="_blank" rel="noopener noreferrer">
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need redirects for external links? We don't have redirects for external URLs, so this doesn't seem necessary

Comment on lines 9 to 14
let newUrl: string = url;

newUrl = await noTrailingSlashUnlessOnlyHost(newUrl);
newUrl = await pathMustNotEndInSlash1000(newUrl);
newUrl = await publicSiteHostMustNotStartWithWww(newUrl);
newUrl = await mustNotTriggerAContentApiRedirect(newUrl);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this approach is potentially over-complicating things by trying to be functional.

I'm not really sure it's worth the effort as we don't typically write code like this, and it's less performant as it has to parse into a URL object multiple times and is using regexes over the entire URL in multiple places.

Would suggest that we just construct the URL object once and potentially just simplifying this back to some if statements (with comments) which modify the url i.e.

const newUrl = new URL(url);

// Do a check
if (blah) {
  newUrl.pathname = ...;
}

// Do another check
if (blah) {
  newUrl.hostname = ...;
}

// Using functions is still fine for more
// contained pieces of logic.
await callAFunction(newUrl);

return newUrl.toString();

export default async function applyRedirectRules(url: string): Promise<string> {
let newUrl: string = url;

newUrl = await noTrailingSlashUnlessOnlyHost(newUrl);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't modify as suggested in comment below, then we should remove any unnecessary awaits. This is only necessary on the function that calls redirectService.list.

redirectPathStarts,
} from '@common/services/redirectService';

const prodPublicUrl = 'https://explore-education-statistics.service.gov.uk/';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the right approach to use this as the base? If we're on non-prod environments, won't this potentially result in incorrect links in those environments?

I mentioned this in your other PR, but we should probably set the host's URL using setHostUrl / getHostUrl functions and use this instead?

Comment on lines 58 to 90
if (
Object.values(redirectPathStarts).find(path =>
parsedPath.startsWith(path),
) &&
parsedPath.split('/').length > 1
) {
const redirectPath = Object.keys(redirectPathStarts).reduce((acc, key) => {
const redirectType = key as RedirectType;
// If starts with "methodologies/" or "publications/"
if (parsedPath.startsWith(redirectPathStarts[redirectType])) {
const pathSegments = parsedPath.split('/');

const rewriteRule = redirectsFromContentApi[redirectType]?.find(
({ fromSlug }) => pathSegments[2].toLowerCase() === fromSlug,
);

if (rewriteRule) {
return pathSegments
.map(segment =>
segment.toLowerCase() === rewriteRule?.fromSlug
? rewriteRule?.toSlug
: segment.toLowerCase(),
)
.join('/');
}
}
return acc;
}, '');

if (redirectPath) {
return new URL(`${redirectPath}${parsedUrl.search}`, prodPublicUrl).href;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the reduce and we're splitting out the path segments too many times.

This can be simplified to:

const redirectPathStart = Object.entries(redirectPathStarts).find(
  ([_, path]) => parsedPath.startsWith(path),
);
const pathSegments = parsedPath.split('/');

if (redirectPathStart && pathSegments.length > 1) {
  const [key] = redirectPathStart;
  const redirectType = key as RedirectType;

  const rewriteRule = redirectsFromContentApi[redirectType].find(
    ({ fromSlug }) => pathSegments[2].toLowerCase() === fromSlug,
  );

  if (rewriteRule) {
    const path = pathSegments
      .map(segment =>
        segment.toLowerCase() === rewriteRule?.fromSlug
          ? rewriteRule?.toSlug
          : segment.toLowerCase(),
      )
      .join('/');

    return new URL(`${path}${parsedUrl.search}`, prodPublicUrl).href;
  }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants