Hosting your website’s sitemap on S3 is relatively common, and until fairly recently things would just “work”. Google Search Console now makes you jump through some additional hoops; this article walks through them.
If your website is hosted somewhere without persistent storage, you have two choices:
- Generate a sitemap on every deploy
- Host your sitemap somewhere with persistent storage
I prefer to host my sitemap somewhere with persistent storage.
In this article, we’re talking about S3, but the same issue is going to happen anytime you host a sitemap on a different domain to your website.
S3 is one of Amazon’s many web services. This one provides a simple storage mechanism for files.
In our setup, when the sitemap regeneration happens, the sitemap XML file is uploaded to an S3 bucket.
Why Google Search Console
I like to do everything in my power to make sure search engines know about my pages, including giving search engines my sitemaps.
Search Console lets you say to Google “I own this domain” and give it your sitemap to help with crawling.
A sitemap helps the crawler prioritise; you’re saying “don’t guess which pages I want you to look at, here is a list”.
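To make “here is a list” concrete, this is a minimal sketch of what a sitemap file contains, built with plain Ruby. The `build_sitemap` helper and the example URLs are mine, purely for illustration; in a real project a gem does this for you.

```ruby
# Builds a minimal sitemap document from a list of absolute page URLs.
# Hypothetical helper for illustration only.
def build_sitemap(urls)
  entries = urls.map do |url|
    # encode(xml: :text) escapes &, < and > so the URL is safe inside <loc>
    "  <url><loc>#{url.encode(xml: :text)}</loc></url>"
  end

  [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    *entries,
    "</urlset>"
  ].join("\n") + "\n"
end

puts build_sitemap(["https://example.com/", "https://example.com/pins?page=2&sort=new"])
```

Each `<url>` entry is one page you are asking the crawler to look at. The sitemap protocol also allows optional fields such as `<lastmod>`, which this sketch leaves out.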
Google Search Console needs to confirm that you control the domains you want to track. There are a few ways you can do this, but the two we care about are:
- at the DNS level
- with an uploaded file
To verify you own an entire domain, you can add a DNS record (typically a TXT record) that Google Search Console provides. You should do this for your main website.
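If you want to check that the record has actually gone live before asking Google to verify, a rough sketch using Ruby’s standard `resolv` library could look like this. It assumes the record is a TXT record starting with `google-site-verification=`; the helper names and the `lookup:` parameter are mine, added so the parsing can be tried without touching DNS.

```ruby
require "resolv"

# Looks up the TXT records for `domain` and returns each record as one string.
def txt_records(domain)
  Resolv::DNS.open do |dns|
    dns.getresources(domain, Resolv::DNS::Resource::IN::TXT).map { |record| record.strings.join }
  end
end

# Returns the Google verification tokens found among the TXT records.
# `lookup` is injectable (hypothetical parameter) so this can run offline.
def google_verification_tokens(domain, lookup: method(:txt_records))
  prefix = "google-site-verification="
  lookup.call(domain)
        .select { |record| record.start_with?(prefix) }
        .map { |record| record.delete_prefix(prefix) }
end
```

Calling `google_verification_tokens("example.com")` performs a real DNS lookup. An empty array means the record has not propagated yet, or was added to the wrong host.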
I will talk about the file upload shortly.
The “normal” process
If you store your sitemap on the same domain as your website, then you don’t have any extra steps. Once you’ve verified domain ownership, you can add your sitemap in the “sitemaps” section of Google Search Console.
You should see the pages it has found almost immediately in that same “sitemaps” section.
The process for externally hosted sitemaps
A while ago you could do a redirect from a location within your domain to another source, and Google Search Console would honour it. Now it won’t.
Now you need to verify ownership of the external domain. In our S3 case that will look something like https://s3bucketname.aws.url.com; you can get the exact domain from your S3 account.
Verifying ownership feels wrong because, as S3 users, we don’t really “own” the subdomain that AWS assigns to our bucket, but in Google’s eyes we do.
Because we’re dealing with a subdomain, we can’t email Werner Vogels and ask him to add a DNS record for us.
Luckily Google Search Console allows us to upload an HTML file as a way of verifying ownership. We need to go down this path and upload the file Google Search Console gives us to our S3 bucket.
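Before clicking verify, it is worth checking that the uploaded file is publicly readable at the root of the bucket URL, since that is where Google will request it. Here is a minimal sketch using Ruby’s standard library. The bucket host is the placeholder from above, the filename is made up, and the `fetch:` parameter is my own addition so the check can be exercised without network access.

```ruby
require "net/http"
require "uri"

# Builds the URL Google will request: the file must sit at the bucket root.
def verification_file_url(bucket_host, filename)
  "https://#{bucket_host}/#{filename}"
end

# Returns true when the URL answers with HTTP 200.
# `fetch` (hypothetical parameter) returns the status code as a string.
def publicly_readable?(url, fetch: ->(u) { Net::HTTP.get_response(URI(u)).code })
  fetch.call(url) == "200"
end

# Example with a placeholder host and a made-up filename:
# url = verification_file_url("s3bucketname.aws.url.com", "google1234567890abcdef.html")
# puts publicly_readable?(url)
```

If this returns false, the usual suspects are the file not being public or being uploaded under a prefix rather than at the bucket root.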
Once verified, we can submit the sitemap within this AWS property.
It doesn’t matter that this sitemap lists URLs on our primary domain; Google accepts this because we have verified both properties.
This process feels longer than it needs to be.
If you have a sitemap but don’t want to set things up manually, you can add a Sitemap line to your robots.txt file, which search engines will take into account.
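The line in question is the `Sitemap:` directive followed by the full URL of the sitemap. As a small sketch, this hypothetical helper adds the directive to existing robots.txt content without duplicating it; the sitemap URL in the example is a placeholder.

```ruby
# Returns robots.txt content with a Sitemap directive for `sitemap_url`,
# leaving the content unchanged if that exact directive is already present.
# Hypothetical helper for illustration only.
def with_sitemap_directive(robots_txt, sitemap_url)
  directive = "Sitemap: #{sitemap_url}"
  lines = robots_txt.lines.map(&:chomp)
  return robots_txt if lines.any? { |line| line.strip == directive }

  (lines + [directive]).join("\n") + "\n"
end

puts with_sitemap_directive("User-agent: *\nAllow: /\n", "https://s3bucketname.aws.url.com/sitemap.xml.gz")
```

The directive takes an absolute URL, so it can point at the bucket even though robots.txt itself lives on your main domain.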
Bonus - Google Search Console sitemap setup for a Rails app
Initially, this article was going to be all about my exact issue, but I realised it was probably too niche. However, if you’re interested, the website I needed indexed was my Disney pin trading site. It is a Ruby on Rails project hosted on Heroku.
Heroku doesn’t have persistent storage, so for files to exist between deploys, you need to use something else. I use S3.
For generating the sitemap, I use sitemap_generator, a fantastic gem that I’ve used for years to accomplish this task.
I’ve never had an issue understanding its docs, and setup usually doesn’t take that long.
The steps we took were:
- follow the sitemap_generator README to install the gem, including the additional steps for S3 buckets
- set up the relevant S3 bucket on Amazon Web Services
- set up config/sitemap_generator.rb to tell it the pages you want to appear on the sitemap
- verify you own the main domain in Google Search Console
- verify you own the bucket URL in Google Search Console
- in Google Search Console under the https://my_bucket.aws.url.com property, add your sitemap
- verify that Google has at least seen the entries in the sitemap
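For the config step, a rough sketch of a sitemap_generator config that uploads to S3 might look like the following. This is not my exact file: the hosts, the `SITEMAP_BUCKET` environment variable and the listed paths are placeholders, and it assumes AWS credentials and region come from the environment. Adapter options have changed between gem versions, so check the README for the version you install.

```ruby
# Sketch of a sitemap_generator config; all hosts and paths are placeholders.
require "aws-sdk-s3"

# The site whose pages are listed in the sitemap (assumption: your main domain).
SitemapGenerator::Sitemap.default_host = "https://www.example.com"

# Where the sitemap files are served from: the bucket URL verified in Search Console.
SitemapGenerator::Sitemap.sitemaps_host = "https://my_bucket.aws.url.com/"

# Write to tmp/ first, since the app's own filesystem is not persistent.
SitemapGenerator::Sitemap.public_path = "tmp/"

# Upload through the AWS SDK adapter. Assumption: the bucket name is in
# SITEMAP_BUCKET and the SDK finds credentials and region in the environment.
SitemapGenerator::Sitemap.adapter =
  SitemapGenerator::AwsSdkAdapter.new(ENV.fetch("SITEMAP_BUCKET"))

SitemapGenerator::Sitemap.create do
  # Hypothetical paths; list whatever pages you want crawled.
  add "/pins", changefreq: "daily"
  add "/about", changefreq: "monthly"
end
```

With something like this in place, the gem’s `rake sitemap:refresh` task regenerates the sitemap and uploads it to the bucket.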