[ST-1268] Added canonical tag translation support and higher timeout for search engine (#200)

* added support for canonical tag translation

* lint.

* fixed mocked request.

* Added user-agent based tests.

* ported url_language_switcher from html-swapper.

* simplify filtering

* updated tests names to be more descriptive.

* Use ActiveSupport base test case class

* removes testing on private methods.

* updated README.

* fixed missing import.

* fixed missing imports and allowed hosts

* Docker build optimization.

* fixed host checking.

* change test page title to wovnrb

* added test cases for url_language_switcher and url.

* rubocop

* minor version bump.

* fixed hard-coded version number in tests
zeyuwu authored Feb 16, 2022
1 parent e79ec36 commit 680a03f
Showing 22 changed files with 1,728 additions and 74 deletions.
43 changes: 30 additions & 13 deletions README.en.md
Expand Up @@ -74,19 +74,22 @@ After completing setup, start the Ruby Application, and make sure the WOVN.io li

The following is a list of the WOVN.io Ruby Library's valid parameters.
Parameter Name | Required | Default Setting
----------------------| -------- | ----------------
project_token | yes | ''
default_lang | yes | 'ja'
supported_langs | yes | ['ja', 'en']
url_pattern | yes | 'path'
lang_param_name | | 'wovn'
query | | []
ignore_class | | []
translate_fragment | | true
ignore_paths | | []
install_middleware | | true
compress_api_requests | | true
Parameter Name | Required | Default Setting
-------------------------------| -------- | ----------------
project_token | yes | ''
default_lang | yes | 'ja'
supported_langs | yes | ['ja', 'en']
url_pattern | yes | 'path'
lang_param_name | | 'wovn'
query | | []
ignore_class | | []
translate_fragment | | true
ignore_paths | | []
install_middleware | | true
compress_api_requests | | true
api_timeout_seconds | | 1.0
api_timeout_search_engine_bots | | 5.0
translate_canonical_tag | | true
### 2.1. project_token
Expand Down Expand Up @@ -201,3 +204,17 @@ WOVN.rb needs to be added after any compression middleware.
### 2.11 compress_api_requests
By default, requests to the translation API will be sent with gzip compression. Set to false to disable compression.
### 2.12 api_timeout_seconds
Configures how long, in seconds, wovnrb waits for a response from the translation API before the
request is considered timed out. This setting defaults to `1.0`.
### 2.13 api_timeout_search_engine_bots
Similar to `api_timeout_seconds`, but applied when handling requests made by search engine bots, which are identified by their User-Agent header.
Currently, bots from Google, Yahoo, Bing, Yandex, DuckDuckGo and Baidu are recognized. This setting
defaults to `5.0`.
### 2.14 translate_canonical_tag
Configures whether wovnrb automatically translates an existing canonical tag in the HTML. When set to `true`, wovnrb
adds the current language code to the canonical URL according to your `url_pattern` setting.
This setting defaults to `true`.
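
As a rough illustration of the three parameters documented above, the settings might look like the hash below. The token value is a placeholder, and assigning the hash to `config.wovnrb` in a Rails application is an assumption about the usual setup, not something shown in this diff.

```ruby
# Hypothetical settings hash; 'TOKEN' is a placeholder, not a real project token.
wovnrb_settings = {
  'project_token' => 'TOKEN',
  'default_lang' => 'ja',
  'supported_langs' => %w[ja en],
  'url_pattern' => 'path',
  # Added by this commit; the values shown are the documented defaults.
  'api_timeout_seconds' => 1.0,
  'api_timeout_search_engine_bots' => 5.0,
  'translate_canonical_tag' => true
}

# In a Rails app this hash would presumably be assigned in config/application.rb,
# e.g. `config.wovnrb = wovnrb_settings` (assumed, not part of this diff).
puts wovnrb_settings['api_timeout_search_engine_bots']
```

Leaving the three new keys out should have the same effect as the values above, since they match the defaults listed in the table.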
9 changes: 9 additions & 0 deletions docker/rails/Dockerfile
Expand Up @@ -3,4 +3,13 @@ FROM ruby:2.6.5
WORKDIR /usr/src/app

COPY ./TestSite/ .

RUN gem update --system
RUN gem uninstall bundler && rm /usr/local/bin/bundle && rm /usr/local/bin/bundler
RUN gem install bundler:2.1.4
RUN apt update && apt install npm -y
RUN npm install --global yarn
RUN yarn install --check-files
RUN bundle install

CMD ["/bin/bash", "start.sh"]
2 changes: 0 additions & 2 deletions docker/rails/TestSite/Gemfile
Expand Up @@ -52,5 +52,3 @@ end

# Windows does not include zoneinfo files, so bundle the tzinfo-data gem
gem 'tzinfo-data', platforms: [:mingw, :mswin, :x64_mingw, :jruby]

gem 'wovnrb', path: './wovnrb'
2 changes: 2 additions & 0 deletions docker/rails/TestSite/config/environments/development.rb
Expand Up @@ -59,4 +59,6 @@
# Use an evented file watcher to asynchronously detect changes in source code,
# routes, locales, etc. This feature depends on the listen gem.
config.file_watcher = ActiveSupport::EventedFileUpdateChecker

config.hosts << "example.com"
end
2 changes: 2 additions & 0 deletions docker/rails/TestSite/config/environments/production.rb
Expand Up @@ -109,4 +109,6 @@
# config.active_record.database_selector = { delay: 2.seconds }
# config.active_record.database_resolver = ActiveRecord::Middleware::DatabaseSelector::Resolver
# config.active_record.database_resolver_context = ActiveRecord::Middleware::DatabaseSelector::Resolver::Session

config.hosts << "example.com"
end
2 changes: 2 additions & 0 deletions docker/rails/TestSite/config/environments/test.rb
Expand Up @@ -46,4 +46,6 @@

# Raises error for missing translations.
# config.action_view.raise_on_missing_translations = true

config.hosts << "example.com"
end
2 changes: 1 addition & 1 deletion docker/rails/TestSite/public/index.html
Expand Up @@ -4,6 +4,6 @@
<title>test</title>
</head>
<body>
<h1>Hello WOVN.php world!</h1>
<h1>Hello WOVN.rb world!</h1>
</body>
</html>
13 changes: 2 additions & 11 deletions docker/rails/TestSite/start.sh
@@ -1,13 +1,4 @@
cd /usr/src/app
gem update --system
gem uninstall bundler
rm /usr/local/bin/bundle
rm /usr/local/bin/bundler
gem install bundler:2.1.4
update --bundler
# ./wovnrb should not be cached
echo "gem 'wovnrb', path: './wovnrb'" >> Gemfile
bundle install
apt update
apt install npm -y
npm install --global yarn
yarn install --check-files
bin/rails server -b 0.0.0.0 -e development -p 4000
4 changes: 3 additions & 1 deletion lib/wovnrb.rb
Expand Up @@ -5,6 +5,7 @@
require 'wovnrb/lang'
require 'wovnrb/services/html_converter'
require 'wovnrb/services/html_replace_marker'
require 'wovnrb/url_language_switcher'
require 'nokogiri'
require 'active_support'
require 'json'
Expand Down Expand Up @@ -76,7 +77,8 @@ def switch_lang(headers, body)
html_body = Helpers::NokogumboHelper.parse_html(string_body)

if !wovn_ignored?(html_body) && !amp_page?(html_body)
html_converter = HtmlConverter.new(html_body, @store, headers)
url_lang_switcher = Wovnrb::UrlLanguageSwitcher.new(@store)
html_converter = HtmlConverter.new(html_body, @store, headers, url_lang_switcher)

if needs_api?(html_body, headers)
converted_html, marker = html_converter.build_api_compatible_html
Expand Down
7 changes: 6 additions & 1 deletion lib/wovnrb/api_translator.rb
Expand Up @@ -111,6 +111,7 @@ def build_api_params(body)
'lang_code' => lang_code,
'url_pattern' => url_pattern,
'lang_param_name' => lang_param_name,
'translate_canonical_tag' => translate_canonical_tag,
'product' => 'WOVN.rb',
'version' => VERSION,
'body' => body
Expand All @@ -130,7 +131,7 @@ def api_uri
end

def api_timeout
@store.settings['api_timeout_seconds']
@headers.search_engine_bot? ? @store.settings['api_timeout_search_engine_bots'] : @store.settings['api_timeout_seconds']
end

def settings_hash
Expand All @@ -157,6 +158,10 @@ def custom_lang_aliases
@store.settings['custom_lang_aliases']
end

def translate_canonical_tag
@store.settings['translate_canonical_tag']
end

def page_url
"#{@headers.protocol}://#{@headers.url}"
end
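
The `api_timeout` change above can be tried in isolation with the sketch below. `FakeHeaders` is a hypothetical stand-in for `Wovnrb::Headers` that models only the `search_engine_bot?` predicate, and the settings hash carries only the two timeout keys.

```ruby
# Hypothetical stand-in for Wovnrb::Headers; only the predicate used here is modeled.
FakeHeaders = Struct.new(:bot) do
  def search_engine_bot?
    bot
  end
end

# Same selection logic as the new api_timeout, with its inputs passed explicitly.
def api_timeout(headers, settings)
  if headers.search_engine_bot?
    settings['api_timeout_search_engine_bots']
  else
    settings['api_timeout_seconds']
  end
end

settings = { 'api_timeout_seconds' => 1.0, 'api_timeout_search_engine_bots' => 5.0 }
puts api_timeout(FakeHeaders.new(true), settings)
puts api_timeout(FakeHeaders.new(false), settings)
```

Passing the headers and settings as arguments is only a convenience for the sketch; the real method reads them from instance variables.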
Expand Down
21 changes: 19 additions & 2 deletions lib/wovnrb/headers.rb
Expand Up @@ -56,6 +56,10 @@ def initialize(env, settings)
@pathname = @pathname.gsub(/\/$/, '')
end

def url_with_scheme
"#{@protocol}://#{@url}"
end

def unmasked_pathname_without_trailing_slash
@unmasked_pathname.chomp('/')
end
Expand Down Expand Up @@ -197,11 +201,24 @@ def out(headers)
end

def dirname
if pathname.include?('/')
pathname.end_with?('/') ? pathname : pathname[0, pathname.rindex('/') + 1]
if pathname_with_trailing_slash_if_present.include?('/')
pathname_with_trailing_slash_if_present.end_with?('/') ? pathname_with_trailing_slash_if_present : pathname_with_trailing_slash_if_present[0, pathname_with_trailing_slash_if_present.rindex('/') + 1]
else
'/'
end
end

def search_engine_bot?
return false unless @env.key?('HTTP_USER_AGENT')

bots = %w[Googlebot/ bingbot/ YandexBot/ YandexWebmaster/ DuckDuckBot-Https/ Baiduspider/ Slurp Yahoo]
bots.any? { |bot| @env['HTTP_USER_AGENT'].include?(bot) }
end

def to_absolute_path(path)
absolute_path = path.blank? ? '/' : path
absolute_path = absolute_path.starts_with?('/') ? absolute_path : URL.join_paths(dirname, absolute_path)
URL.normalize_path_slash(path, absolute_path)
end
end
end
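
The substring matching in `search_engine_bot?` can be sketched as a standalone function, as below. The `env` hashes and User-Agent strings are illustrative inputs, and the constant name `BOT_MARKERS` is made up for this example. Unlike the method in the diff, the sketch also returns `false` when the header is present but `nil`.

```ruby
# Marker list copied from the diff; the constant name is hypothetical.
BOT_MARKERS = %w[Googlebot/ bingbot/ YandexBot/ YandexWebmaster/ DuckDuckBot-Https/ Baiduspider/ Slurp Yahoo].freeze

# Returns true when the Rack-style env carries a User-Agent containing any marker.
def search_engine_bot?(env)
  user_agent = env['HTTP_USER_AGENT']
  return false unless user_agent

  BOT_MARKERS.any? { |marker| user_agent.include?(marker) }
end

googlebot = { 'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' }
browser = { 'HTTP_USER_AGENT' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/98.0' }

puts search_engine_bot?(googlebot)
puts search_engine_bot?(browser)
puts search_engine_bot?({})
```

The match is case-sensitive, so a User-Agent spelled `googlebot/2.1` would not be treated as a bot by this logic.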
18 changes: 17 additions & 1 deletion lib/wovnrb/services/html_converter.rb
@@ -1,9 +1,10 @@
module Wovnrb
class HtmlConverter
def initialize(dom, store, headers)
def initialize(dom, store, headers, url_lang_switcher)
@dom = dom
@headers = headers
@store = store
@url_lang_switcher = url_lang_switcher
end

def build
Expand Down Expand Up @@ -32,6 +33,7 @@ def transform_html
replace_snippet
replace_hreflangs
inject_lang_html_tag
translate_canonical_tag if @store.settings['translate_canonical_tag']
end

def replace_snippet
Expand All @@ -48,6 +50,7 @@ def replace_dom(marker)
insert_snippet(adds_backend_error_mark: true)
insert_hreflang_tags
inject_lang_html_tag
translate_canonical_tag if @store.settings['translate_canonical_tag']

html
end
Expand Down Expand Up @@ -143,6 +146,19 @@ def insert_hreflang_tags
end
end

def translate_canonical_tag
canonical_node = @dom.at_css('link[rel="canonical"]')
return unless canonical_node

lang_code = @headers.lang_code
return if lang_code == @store.settings['default_lang'] && @store.settings['custom_lang_aliases'][lang_code].nil?

canonical_url = canonical_node['href']

translated_canonical_url = @url_lang_switcher.add_lang_code(canonical_url, lang_code, @headers)
canonical_node['href'] = translated_canonical_url
end

# Remove wovn snippet code from dom
def strip_snippet
@dom.xpath('//script').each do |script_node|
Expand Down
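
To make the effect of `translate_canonical_tag` concrete, here is a minimal sketch of the URL rewrite for `url_pattern = 'path'` only. `UrlLanguageSwitcher#add_lang_code` is not shown in this excerpt, so `add_lang_code_to_path` is a hypothetical, simplified substitute, and the sketch ignores `custom_lang_aliases` and the other URL patterns.

```ruby
require 'uri'

# Hypothetical stand-in for UrlLanguageSwitcher#add_lang_code,
# handling only absolute http(s) URLs and the 'path' URL pattern.
def add_lang_code_to_path(url, lang_code)
  uri = URI.parse(url)
  uri.path = "/#{lang_code}#{uri.path.empty? ? '/' : uri.path}"
  uri.to_s
end

# Mirrors the guard in translate_canonical_tag, minus custom language aliases:
# the canonical URL is left alone for the default language.
def translate_canonical(href, lang_code, default_lang)
  return href if lang_code == default_lang

  add_lang_code_to_path(href, lang_code)
end

puts translate_canonical('https://example.com/products/1', 'en', 'ja')
puts translate_canonical('https://example.com/products/1', 'ja', 'ja')
```

In the actual converter the `href` comes from the `link[rel="canonical"]` node and is written back to it; the sketch covers only the string transformation.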
136 changes: 136 additions & 0 deletions lib/wovnrb/services/url.rb
@@ -0,0 +1,136 @@
module Wovnrb
  # URL utility ported from html-swapper
  class URL
    module FileExtension
      IMG_FILES = 'jpe|jpe?g|bmp|gif|png|btif|tiff?|psd|djvu?|xif|wbmp|webp|p(n|b|g|p)m|rgb|tga|x(b|p)m|xwd|pic|ico|fh(c|4|5|7)?|xif|f(bs|px|st)'.freeze
      AUDIO_FILES = 'mp(3|2)|m(p?2|3|p?4|pg)a|midi?|kar|rmi|web(m|a)|aif(f?|c)|w(ma|av|ax)|m(ka|3u)|sil|s3m|og(a|g)|uvv?a'.freeze
      VIDEO_FILES = 'm(x|4)u|fl(i|v)|3g(p|2)|jp(gv|g?m)|mp(4v?|g4|(?!$)e?g?)|m(1|2)v|ogv|m(ov|ng)|qt|uvv?(h|m|p|s|v)|dvb|mk(v|3d|s)|f4v|as(x|f)|w(m(v|x)|vx)|xvid'.freeze
      DOC_FILES = '(7|g)?zip|tar|rar|7z|gz|ez|aw|atom(cat|svc)?|(cc)?xa?ml|cdmi(a|c|d|o|q)?|epub|g(ml|px|xf)|jar|js|ser|class|json(ml)?|do(c|t)(m|x)?|xls(m|x)?|xps|pp(a|tx?|s)m?|potm?|sldm|mp(p|t)|bin|dms|lrf|mar|so|dist|distz|m?pkg|bpk|dump|rtf|tfi|pdf|pgp|apk|o(t|d)(b|c|ft?|g|h|i|p|s|t)'.freeze
    end

    # TODO: Maybe this should be applied to all get_attribute calls rather than just href
    def self.normalize_url(href)
      return nil unless href

      href.delete("\u200b").strip
    end

    def self.absolute_url?(href)
      href =~ %r{^(https?:)?//}i
    end

    def self.absolute_path?(href)
      href.match?(%r{^/})
    end

    def self.relative_path?(href)
      !absolute_url?(href) && !absolute_path?(href)
    end

    # @param parsed_uri [Addressable::URI]
    def self.path_and_query(parsed_uri)
      parsed_uri.path + (parsed_uri.query ? "?#{parsed_uri.query}" : '')
    end

    def self.path_and_query_and_hash(parsed_uri)
      uri = parsed_uri.path
      uri += "?#{parsed_uri.query}" if parsed_uri.query
      uri += "##{parsed_uri.fragment}" if parsed_uri.fragment
      uri
    end

    def self.host_with_port(parsed_uri)
      if parsed_uri.port
        "#{parsed_uri.host}:#{parsed_uri.port}"
      else
        parsed_uri.host.to_s
      end
    end

    def self.resolve_absolute_uri(base_url, href)
      # This resolves ./../ and also handles href already being absolute
      Addressable::URI.join(base_url, href)
    rescue Addressable::URI::InvalidURIError, ArgumentError => e
      Rollbar.warning('Failed to resolve absolute URI', original_error: e, base_url: base_url, href: href)
      raise
    end

    def self.resolve_absolute_path(base_url, href)
      normalized_uri = resolve_absolute_uri(base_url, href)
      path = normalized_uri.path
      query = normalized_uri.query ? "?#{normalized_uri.query}" : ''
      fragment = normalized_uri.fragment ? "##{normalized_uri.fragment}" : ''

      path + query + fragment
    end

    # Set the path lang to
    def self.prepend_path(url, dir)
      url.sub(%r{(.+\.[^/]+)(/|$)}, "\\1/#{dir}\\2")
    end

    def self.trim_slashes(path)
      path.gsub(%r{^/|/$}, '')
    end

    def self.prepend_path_slash(path)
      path ||= ''
      return path if path.starts_with?('/')

      "/#{path}"
    end

    def self.join_paths(*paths)
      paths.inject('') do |left, right|
        case [left.end_with?('/'), right.start_with?('/')]
        when [true, true]
          left + right[1..-1]
        when [false, false]
          left + (right.blank? ? right : "/#{right}")
        else
          left + right
        end
      end
    end

    # @param uri [Addressable::URI]
    # @param new_protocol [String | nil]
    # @return copy of uri [Addressable::URI]
    def self.change_protocol(uri, new_protocol)
      result = uri.dup
      result.scheme = new_protocol
      result
    end

    def self.valid_protocol?(href)
      scheme_matches = /^\s*(?<scheme>[a-zA-Z]+):/.match(href)
      scheme = scheme_matches ? scheme_matches[:scheme] : nil

      scheme.nil? || %w[http https].include?(scheme)
    end

    def self.file?(href_with_query_and_hash)
      href = remove_query_and_hash(href_with_query_and_hash)
      img_files = %r{^(https?://)?.*(\.(#{FileExtension::IMG_FILES}))((\?|#).*)?$}io
      audio_files = %r{^(https?://)?.*(\.(#{FileExtension::AUDIO_FILES}))((\?|#).*)?$}io
      video_files = %r{^(https?://)?.*(\.(#{FileExtension::VIDEO_FILES}))((\?|#).*)?$}io
      doc_files = %r{^(https?://)?.*(\.(#{FileExtension::DOC_FILES}))((\?|#).*)?$}io
      href.match?(img_files) || href.match?(audio_files) || href.match?(video_files) || href.match?(doc_files)
    end

    def self.remove_query_and_hash(href)
      href.gsub(/[#?].*/, '')
    end

    # if original path does not end in slash, remove it from new path
    # if original path ends in slash, add it to new path
    def self.normalize_path_slash(original_path, new_path)
      if !original_path.end_with?('/') && new_path.end_with?('/')
        new_path = new_path.chomp('/')
      elsif original_path.end_with?('/') && !new_path.end_with?('/')
        new_path += '/'
      end
      new_path
    end
  end
end
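
The two path helpers that `Headers#to_absolute_path` relies on can be tried without Rails using the plain-Ruby restatement below. It is a sketch rather than the library code: the methods are top-level functions instead of `URL` class methods, and `String#empty?` replaces ActiveSupport's `blank?`, which behaves differently for whitespace-only strings.

```ruby
# Joins path segments, collapsing a doubled slash at each seam.
def join_paths(*paths)
  paths.inject('') do |left, right|
    case [left.end_with?('/'), right.start_with?('/')]
    when [true, true]
      left + right[1..-1]
    when [false, false]
      left + (right.empty? ? right : "/#{right}")
    else
      left + right
    end
  end
end

# Makes new_path end (or not end) with a slash, matching original_path.
def normalize_path_slash(original_path, new_path)
  if !original_path.end_with?('/') && new_path.end_with?('/')
    new_path.chomp('/')
  elsif original_path.end_with?('/') && !new_path.end_with?('/')
    "#{new_path}/"
  else
    new_path
  end
end

puts join_paths('/dir/', 'page')
puts join_paths('/a/', '/b')
puts normalize_path_slash('/a', '/en/a/')
puts normalize_path_slash('/a/', '/en/a')
```

Because `inject` starts from an empty string, a first segment without a leading slash gains one, so `join_paths('a', 'b')` yields `/a/b`.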