SEO (with tips for MediaWiki and WordPress)


There are so many articles, books, software packages, etc. devoted to SEO (Search Engine Optimization). While there is a lot of great information out there, the vast majority of content on the subject seems to be devoted to the last 5% / fine tuning of a website. What about the other 95% of SEO, the basics? It ain't glamorous, but it is important.

And don't forget about CMS (Content Management Systems) that allow for easy publishing of websites. How does one control and make changes to the items Search Engines deem important? Should it all be left up to software to automatically manage that? Nope. So again, how is that controlled and manipulated?

An element of the internet that is missing today is some Google Sized Directory Service (not a Search Engine), but a Directory Service that is maintained by people and has a certain amount of bias towards high quality web sites. ODP (Open Directory Project) is about all that's left of this, although it could be argued that Wikipedia is a reflection of this.

Some topics covered in this article;

  • SEO (Search Engine Optimization)
  • Tips for MediaWiki and WordPress related to SEO
  • SiteMaps (and don't forget Image SiteMaps too)
  • Robots (as in GoogleBot) - robots.txt (and yes, Google has a wizard / generator for that)
  • Google Search Console (https://en.wikipedia.org/wiki/Google_Search_Console)
  • WebMaster tools for Google and other search sites (Bing, etc.)
  • Other Google Services (Google Analytics, Website Optimizer, Google Insights (merged into Google Trends, Transparency Report))
  • Custom Error Pages
  • Special Consideration for Mobile Devices
  • Other analysis tools like Sawmill
  • Broken Links; Google says have a custom 404 error page, sure that's good, but what about fixing the broken links? (Plugin or Extension anyone?)
  • Google states the following
    • We do not use PRIORITY settings in SiteMap XML Files
    • Date Modified is used (but not the way you think, IE, it seems as if they also compare content too and may punish people who abuse the modified or edited date)

Basics

There are some basics that apply to every website and web page. First CONTENT!!! Notice the emphasis on that. If the purpose of your website is to garner traffic for the sake of traffic or selling advertising space, then fork you (Watch The Good Place on NBC and you'll understand that term if you don't already). For all the other people on the planet who are trying to create or write something useful or artistic, you're probably already doing the right thing and there doesn't need to be any further explanation. Although I would point out, spelling, grammar, etc. are important too.

Beyond that, why not start with the king of search, Google, and find out what they have to say on the subject. After all, they are the search engine. So for right now, forget about all of those books that have some interesting information on the last five percent of important things to do for SEO and focus on the first 95%.

  • Quality Content. It's self explanatory.
  • Title Element for each page: Make it accurate and unique for each page, but not long winded.
  • Meta Element, Description Attribute: A simple, concise sentence that summarizes a web page. Avoid the same description on every page.
  • Per many sources (noted in the Wikipedia article on the Meta Element), the Keyword Attribute is almost completely ignored since the late noughties.
  • Additional "Tips of the Current Era": Look at extensions for MediaWiki, like WikiSEO (discussed later), and the different settings it has. These are some fairly clear indications as to what is important to SEO.
  • Create a robots.txt file in the root of the website that includes a reference to a sitemap
  • Sitemap: https://Wiki.TerraBase.info/sitemap/sitemap-index-Wiki.TerraBase.info.xml
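Putting the basics above together, the head of a page might look something like this (all values are hypothetical placeholders):

```html
<head>
  <!-- Accurate and unique per page, but not long winded -->
  <title>SEO Basics - WhatEverSiteName</title>
  <!-- One simple, concise sentence; avoid the same description on every page -->
  <meta name="description" content="A plain language overview of the first 95% of SEO." />
  <!-- Largely ignored by search engines since the late noughties -->
  <meta name="keywords" content="SEO,basics" />
</head>
```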

Robots

For a simple site, there aren't many restrictions that need to be put in place from a technical perspective. To see what the "Big Boys" do with their robots.txt file, just go to their site and view the file (https://www.Microsoft.com/robots.txt).

Sample robots.txt file for a Wiki

User-agent: *
# "Disallow: /" and "Disallow: /*" are equivalent
Disallow: /
Allow: /wiki

User-agent: googlebot
Disallow: /
Allow: /wiki

# Host can be www.WhatEverSiteName.com or WhatEverSiteName.com (supposedly not supported by Google)
Host: www.WhatEverSiteName.com
# Crawl-delay is a number of seconds to delay some action; exactly what depends on the bot
Crawl-delay: 10

Sitemap: https://WhatEverSite/sitemap/sitemap.xml

For technical details on the robots.txt file: https://developers.google.com/search/reference/robots_txt

To test the robots.txt file, there are several ways including Google, but here's another: https://technicalseo.com/tools/robots-txt/
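The rules above can also be sanity checked locally with Python's standard library robotparser. One caveat: Python's parser is first-match-wins, while Google uses the most specific rule, so the Allow line is listed before the Disallow in this sketch (URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Simplified version of the sample rules; Allow comes first because
# urllib.robotparser applies rules in order, unlike Google's
# longest-match semantics.
rules = """\
User-agent: *
Allow: /wiki
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("googlebot", "https://example.com/wiki/SomePage"))    # True
print(rp.can_fetch("googlebot", "https://example.com/index.php?x=1"))    # False
```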

Indexing can also be controlled via HTML within the Header Tags;

<meta name="robots" content="noindex" />
<meta name="googlebot" content="noarchive,nosnippet" />


MediaWiki

HTML Meta and Title Extension

To change the <Title> Element, add this extension: Add HTML Meta and Title

Using the "Magic Word" PAGENAME along with the $wgSitename variable (in LocalSettings.php), one can keep the automatically generated Title and add to it;

<seo title="{{PAGENAME}} - WhatEverAdditionalText" metakeywords="WhatEverKeyWords,AnotherKeyWord,Etc." metadescription="WhatEverDescription,AnotherDescription,Etc." />

I tried additional Meta Attributes (see W3Schools for more) and none of them worked, even without "meta" prepended to the actual attribute name. google-site-verification is another item that does work. WikiSEO is a more advanced extension that does a bunch more. It can add a description, which is useful as MediaWiki doesn't include one by default, and Google (and others) tend to use the description over page contents, so it gives one the option to customize a Description META Tag.

Not so much for SEO, but nice since it is displayed on every page, is the "Tag Line" (example on the Wikipedia site: "From Wikipedia, the free encyclopedia"). It lives in the "Special Page" MediaWiki:Tagline, which by default contains "From {{SITENAME}}" (SITENAME being a "Magic Word" for the name of the website) and can be edited to anything (although "Create Source" must be selected first before it can be edited).

WikiSEO Extensions

WikiSEO Extension;

### Search Engine Optimization
###
wfLoadExtension( 'WikiSEO' );

### SiteMap Extension
###
### .../maintenance/generateSitemap.php is a built in utility for generating a SiteMap
### Basic Command Syntax: php generateSitemap.php (but first run chmod 744 generateSitemap.php; notice the CAPITAL S)
### Example: php generateSitemap.php --server=https://WhatEverWebSite --fspath=/var/www/html/WhatEverWebSite/sitemap --urlpath=https://WhatEverWebSite/sitemap --identifier=WhatEverWebSite --compress=no --skip-redirects

This script is run to generate a new XML SiteMap (Google differentiates between a site map for human reading VS machine reading);

php /var/www/html/Wiki.TerraBase.info/maintenance/generateSitemap.php --server=https://wiki.terrabase.info --fspath=/var/www/html/Wiki.TerraBase.info/sitemap --urlpath=sitemap --identifier=Wiki.TerraBase.info --compress=no --skip-redirects

There's no way to name the file (sitemap.xml) as the script auto generates the names (sitemap-WhatEverWebSiteName-NS_0-0.xml, followed by others, but the "index" file is the important one)

It can be run manually or by a Systemd Timer or a Cron job (see Other Thoughts for creating a Systemd Timer on CentOS 7 or 8)

In the Google Search Console, a SiteMap can be deleted by selecting it, then using the "..." menu at the upper right. Also, several individuals have noted that the interface shows a "Couldn't fetch" error message, but it usually means "pending" (just click around and come back and it should show success).

AutoSitemap Extension;

For a Systemd Timer, run systemctl list-timers --all to find the timer

### This extension causes an error message with VisualEditor, but the page saves and the SiteMap File is created
### A strategy might be to enable it manually when a new SiteMap File is needed.
######wfLoadExtension( 'AutoSitemap' );
######$wgAutoSitemap["server"] = "https://WhatEverWebSite";
######$wgAutoSitemap["filename"] = "sitemap.xml";

AddHTMLMetaAndTitle Extension (WikiSEO accomplishes the same thing and a whole lot more);

### To add to the original information MediaWiki Generates, use the PAGENAME Magic Word: <seo title="{{PAGENAME}} - MaryPoppins" metakeywords="word3,word4" metadescription="word5,word6" />
wfLoadExtension( 'AddHTMLMetaAndTitle' );


SiteMap Script and Cron or a System Timer

Friendly URL

First, MediaWiki has some advice on short URLs here: https://www.mediawiki.org/wiki/Manual:Short_URL

They also have a warning about putting MediaWiki in the root of a website or virtual host as opposed to a sub directory (IE https://MyWiki.com VS https://MyWiki.com/WikiFolder, they recommend the latter). The explanation they give is here: https://www.mediawiki.org/wiki/Manual:Wiki_in_site_root_directory, and I must say they make it a bit more threatening than it needs to be, at least as far as not being able to create articles with certain names (which isn't obvious in the way they word it). Although, a workaround I had to use to get Apache mod_rewrites to work sort of points to something deeper that relies on what they're recommending. Plus Wikipedia uses that format too, so why spit into the wind?

Unlike WordPress, there are no extensions that change the typical "index.php" URL that MediaWiki displays. Well, there are (ShortURL, URLShortener, Surl, etc.), but none of them do it without the need to add code to an Apache configuration file or .htaccess file (look them up). So why not just do it without the extensions (as they don't seem to make it easy like the WordPress plugins do)? Turns out it is fairly easy. A great tool to make it easier is the MediaWiki ShortURL Builder. A word of caution if a MediaWiki site is "private": that tool will not be able to access the section of the wiki necessary to create the proper script, so make sure this setting is set correctly: $wgGroupPermissions['*']['read'] = true; (change it back to false to make it a private wiki again).

In an Apache configuration file, inside a VirtualHost Directive (making sure that Apache has the mod_rewrite module available);

RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
RewriteRule ^(.*)$ %{DOCUMENT_ROOT}/index.php [L]

RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
RewriteRule ^/?images/thumb/[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ %{DOCUMENT_ROOT}/thumb.php?f=$1&width=$2 [L,QSA,B]

RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
RewriteRule ^/?images/thumb/archive/[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ %{DOCUMENT_ROOT}/thumb.php?f=$1&width=$2&archived=1 [L,QSA,B]

And in the LocalSettings.php file;

### For Friendly URLs, the following settings are used in conjunction with settings in the VirtualHost Directives in HTTPD.conf (obtained from https://shorturls.redwerks.org/)
### Additional notes from: https://www.mediawiki.org/wiki/Manual:Short_URL/Apache
### AND: An Alias Directive was also added (this was a key point in making it work, per recommendations from MediaWiki):
### Web Root Directory Note: https://www.mediawiki.org/wiki/Manual:Wiki_in_site_root_directory
###$wgScriptPath is defined earlier in the file at its default location
###$wgScriptPath = "";
$wgScriptExtension = ".php";
$wgArticlePath = "/$1";
$wgUsePathInfo = true;

###$wgEnableUploads is defined earlier in the file at its default location
###$wgEnableUploads  = true;
$wgGenerateThumbnailOnParse = false;

And finally, the workaround I used to address the above mentioned issue about placing the wiki in the root directory of a website; this goes in an Apache configuration file (inside a VirtualHost Directive);

Alias			/wiki /var/www/html/Wiki.TerraBase.info

Just so you know, there are several ways MediaWiki displays content: .../index.php?WhatEverQuery or ...index.php/WhatEverPageTitle, explained here: https://www.mediawiki.org/wiki/Manual:Short_URL

A lot of the above for Apache is explained here: https://www.mediawiki.org/wiki/Manual:Short_URL/Apache The one thing they didn't include was addressing the question: What if the wiki is in the root directory of a website or virtual host? IE, the instructions they wrote assume one followed "best practices" by having the wiki in a sub-directory of a web site. My workaround using the Alias Directive is noted above.
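Putting the pieces together, here is a sketch of how the rewrite rules and the Alias workaround might sit inside a VirtualHost (paths and names are hypothetical; adapt them to the actual site):

```apache
<VirtualHost *:443>
    ServerName WhatEverSiteName.com
    DocumentRoot /var/www/html/WhatEverSiteName

    # Workaround for a wiki in the site root: alias /wiki to the root install
    Alias /wiki /var/www/html/WhatEverSiteName

    RewriteEngine On
    # Anything that is not a real file or directory goes to index.php
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
    RewriteRule ^(.*)$ %{DOCUMENT_ROOT}/index.php [L]
</VirtualHost>
```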

WordPress


Yoast and others

Google XML Sitemaps Plugin

Udinra All Image Sitemap Plugin

WPMUDEV Plugins (various)

Google XML Sitemaps does get along with Yoast. Sort of. Instead of using the sitemap_index.xml file name, it uses sitemap.xml.

WP Sitemap Page generates a human readable sitemap if one puts a shortcode on each page to be "sitemapped". Note, this is not for search engines.

Other SiteMap plugins focus on very specific search engines, like Google News (XML Sitemap and Google News).

Yoast doesn't like to "share" the sitemap_index.xml file name, but an Apache rewrite could change it; then the individual Yoast generated sitemaps could be used with a static sitemap_index.xml. OR, better yet, just notify Google and define a different index name in the robots.txt file. But if the latter choice is taken, then the automatic notification (AKA ping) of Google by Yoast would need to be modified. So why not keep Yoast "stock" and modify at the Apache level? And with a quick bit of experimenting, I just discovered that a physical file in the root directory will override anything generated by WordPress. Copy the MediaWiki format or follow these instructions: https://www.google.com/sitemaps/protocol.html
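For reference, a minimal static sitemap file in the sitemaps.org format looks like this (URLs and dates are placeholders). Note there is no <priority> element, since Google states above that it does not use it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://WhatEverSiteName.com/</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
  <url>
    <loc>https://WhatEverSiteName.com/about/</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
</urlset>
```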

Taxonomy: In WordPress, taxonomy is a blanket term that includes Categories and Tags (a "sub category", equivalent to Keywords). These items are used in internal WordPress searches.


Other Google Services

Transparency Report (IE, is your website compromised (has it been hacked) and contains malware)

It gives a date for the last time the information was updated, so it's a sort of roundabout way of knowing the last time a Google Bot drove by.

Reputation (Content-Agnostic Malware Protection (CAMP))

What appears to be an accidental comment in the Google Search Engine Optimization Guide: "...could split the reputation of that content between URLs)." What is "reputation"? In 2020, a search for Google and Reputation steers one to their malware detection / reporting site. I also ran across some sites that manipulate search results and "get rid of negative stuff" by pushing "bad Google Search results" down with "other results". An interesting way to manipulate search results with SEO, IE, polluting search results with anything other than the bad stuff. A web site idea for a search directory might be ScrollDown.com, where people go digging around for truthful items that might otherwise be lost.

Other Thoughts

$wgWhitelistReadRegexp

...ran into an issue with the $wgWhitelistReadRegexp variable. The regular expression syntax as written did not work as it should have. Below is a note posted on the MediaWiki site that is self-explanatory;

Hyphens and Periods with MediaWiki 1.33 and PHP 7.3

There might be an issue with $wgWhitelistReadRegexp when attempting to allow hyphens or periods (and possibly other special characters).


For the below noted items, on a private MediaWiki site the following setting was configured: $wgGroupPermissions['*']['read'] = false;

As expected, anonymous users were successfully able to access articles in the Main NameSpace if $wgWhitelistReadRegexp was set to this: '/ :*/'

However, anonymous users were blocked from articles that contained no spaces and contained at least one hyphen in the title. Logged in users could view the article normally. Articles with spaces in the titles and a hyphen were visible for anonymous users. Example article title: DD-WRT

Adding this to the $wgWhitelistReadRegexp array did not correct the issue: '/ :.*\-.*/' (neither did '/ :*-/', '/ :*\-/', etc.)

That was puzzling, because the regular expression .*\-.* used with TitleBlacklist extension works as expected, so I anticipated the same regular expression would work with the $wgWhitelistReadRegexp array, but it did not, so I tried others, but nothing allowed anonymous users to view the article.

Periods caused a similar issue with article titles that had spaces or no spaces. No other accepted title characters, like ), (, ;, etc., caused any issues.


Is there a different regular expression syntax used with $wgWhitelistReadRegexp or is this a bug?

Blacklist

To solve the issue with the $wgWhitelistReadRegexp variable, use the TitleBlacklist Extension (well, not really a solution, but a prevention);

###wfLoadExtension is defined earlier in the file at its default location
###wfLoadExtension( 'TitleBlacklist' );
### This extension is necessary to block creation of articles with a hyphen / dash ( - ) or a period in the title as the wgWhitelistReadRegexp setting can't handle periods (it can handle dashes)
$wgTitleBlacklistSources = array(
    array(
         'type' => 'localpage',
         'src'  => 'MediaWiki:Titleblacklist',
    ),
### The below items can be used to define other sources of a blacklist, but for my purposes a NameSpace is fine.
###    array(
###         'type' => 'url',
###         'src'  => 'https://meta.wikimedia.org/w/index.php?title=Title_blacklist&action=raw',
###    ),
###    array(
###         'type' => 'file',
###         'src'  => '/home/wikipedia/blacklists/titles',
###    ),
);
### The above noted MediaWiki:Titleblacklist has the following items added
### This is for the hyphen / dash character: .*\-.*
### This is for the period: .*\..*
### ALL Titleblacklist entries seem to start with .
### In the above examples, which start out with .*WhatEverWord.*, for special characters prepend a backslash ( \ ) and then the character
###
### This following permission allows the above settings to apply to the root / administrator too, otherwise that account is free to create any titled article.
$wgGroupPermissions['sysop']['tboverride'] = false;
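Just to illustrate what those two blacklist entries match (Python's re module is used here purely for demonstration; TitleBlacklist uses PHP's PCRE, but these simple patterns behave the same in both, matched against the whole title):

```python
import re

# The two patterns added to MediaWiki:Titleblacklist above
hyphen = re.compile(r".*\-.*")
period = re.compile(r".*\..*")

print(bool(hyphen.fullmatch("DD-WRT")))        # True: title would be blocked
print(bool(period.fullmatch("SomeFile.txt")))  # True: title would be blocked
print(bool(hyphen.fullmatch("Plain Title")))   # False: title is allowed
```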

Systemd Timer for CentOS 7 & 8 and others

Based on information from: https://www.certdepot.net/rhel7-use-systemd-timers/

Create a file that ends in .service in the /usr/lib/systemd/system/ directory similar to this (change paths as appropriate);

[Unit]
Description=Create XML SiteMap for WhatEverURL

[Service]
Type=simple
ExecStart=/usr/bin/php /var/www/html/WhatEverURL/maintenance/generateSitemap.php --server=https://WhatEverURL --fspath=/var/www/html/WhatEverURL/sitemap --urlpath=https://WhatEverURL/sitemap --identifier=WhatEverURL --compress=no --skip-redirects

[Install]
WantedBy=multi-user.target

Create a file that ends in .timer in the same directory;

[Unit]
Description=Timer Service that runs the sitemap.WhatEverURL.service

[Timer]
OnCalendar=*-*-* 00/12:00:00
Unit=sitemap.WhatEverURL.service

[Install]
WantedBy=multi-user.target

Then enable the timer with this command: systemctl enable WhatEverName.timer

And start it with this command: systemctl start WhatEverName.timer

Check its status with: systemctl is-enabled WhatEverName.timer OR systemctl is-active WhatEverName.timer

The .service itself can be tested by running it directly: systemctl start WhatEverName.service

Handy Command to Find Information about Services

whereis WhatEverProgramOrService

It doesn't tell everything about a service, though. Case in point, BIND or NAMED: it doesn't mention the /var/spool/ directory for zone files.