0 comments on “Sitecore Azure Search: Top 10 Tips”

Sitecore Azure Search: Top 10 Tips

Its been a while since I first wrote about Azure Search and we have a few more tips and tricks on how to optimise Azure Search implementations.

Before proceeding if you missed our previous posts check out some tools we created for Azure Search Helix setup and Geo-Spatial Searching.

Also, check out the slides from our presentation at last years Melbourne Sitecore User Group.

Ok let us jump into the top 10 tips:

Tip 1) Create custom indexes for targeted searching

The default out of the box indexes will attempt to cover just about everything in your Sitecore databases. They do so to support Sitecore CMS UI searches out of the box.  It’s not a problem if you want to use the default indexes (web, master) to search with, however for optimal searches and faster re-indexing time a custom index will help performance.

By stepping back and looking at the different search requirements across the site you can map out your custom indexes and the data that each will require.

Consider also that if the custom indexes need to be used across multiple Feature Helix modules the configuration files and search repositories may need to live in an appropriate Foundation module. More about feature vs foundation can be found here.

Tip 2) Keep your indexes lean

This tip follows on from the first Tip.

Essentially the default Azure Search configuration out of the box will have:

<indexAllFields>true</indexAllFields>

This can include a lot of fields and your probably not going to need every single Sitecore field in order to present the user with meaningful data on the front end interfaces.

The other option is to specify only the fields that you need in your indexes:

<include hint="list:IncludeField"> 
<Text>{A60ACD61-A6DB-4182-8329-C957982CEC74}</Text> 
</include>

The end result will limit the amount of JSON payload that needs to be sent across the network and also the amount of payload that the Sitecore Azure Search Provider needs to process.

Particularly if you are returning thousands of search results you can see what happens when “IndexAllFields” is on via Fiddler.

This screenshot is via a local development machine and Azure Search instance at the Microsoft hosting centre.

Fiddler Index

JSONFIelds

  • So for a single query “IndexAllFields” can result in:
    • 2 MB plus JSON payload size.
    • Document results with all Sitecore metadata included. That could be around 100 fields.

If your query results in Document counts in the thousands obviously the payload will grow rapidly. By reducing the fields in your indexes (removing un-necessary data)  you can speed up query, transfer and processing times and get the data displayed quicker.

Tip 3) Make use of direct azure connections

Sitecore has done a lot of the heavy lifting for you in the Sitecore Azure Search Provider. It’s a bit like a wrapper that does all the hard work for you. In some cases however you may find that writing your own queries that connect via the Azure Search DLL gives you better performance.

Tip 4) Monitor performance via Azure Search Portal

It’s really important to monitor your Azure Search Instance via Azure Portal. This will give you critical clues as to whether your scaling settings are appropriate.

In particular look out for high latency times as this will indicate that your search queries are getting throttled. As a result, you may need to scale up your Azure Search Instance.

In order to monitor your latency times go to:

  1. Login to Azure Portal
  2. Navigate to your Azure Search Instance.
  3. Click on metrics in the left-hand navigation
    • metrics
  4. Select the “Search Latency” checkbox and scan over the last week.
    • graph
  5. You will see some peaks these usually indicate heavy periods of re-indexing. During re-indexing, the Azure Search instance is under heavy load. As long as your peaks under 0.5-second mark your ok.  If you see Search Latency up into the 2-second timeframe you probably need to either adjust how your indexes are used (caching and re-indexing) or scale up to avoid the flow on effects of slow search.

Tip 5) Cache Wrappers

In the code that uses Azure Search, it would be advisable to use cache wrappers around the searches when possible. For your most common searches, this should prevent Azure Search getting hit repeatedly with the same query.

For a full example of cache wrapper checkout the section titled Sitecore.Caching.CustomCache in my previous blog post.

Tip 6) Disable Indexing on CD

This is a hot tip that we got from Sitecore Support when we started to encounter high search latency during re-indexing.

Most likely in your production setup, you will have a single Azure Search instance shared between CM and CD environments.

You need to factor in that CM should be the server that controls the re-indexing (writing) and CD will most likely be the server doing the queries (reading).

Re-indexing is triggered via the event queue and every server subscribes and reacts to these events. Each server with the out of the box search configuration will cause the Azure Search indexes to be updated.  In a shared Azure Search (or SOLR instance) this only needs to be updated by a single server. Each additional re-index is overkill and just doubling up on re-indexing workload.

You can, therefore, adjust the configuration on the CD servers so that it does not cause re-indexing to happen.

The trick is in your index configuration files to use Configuration Roles to specify the indexing strategy on each server.

 <strategies hint="list:AddStrategy">
 <!--
 NOTE: order of these is controls the execution order 
 -->
 <strategy role:require="Standalone OR ContentManagement" ref="contentSearch/indexConfigurations/indexUpdateStrategies/onPublishEndAsync"/>
 <strategy role:require="ContentDelivery" ref="contentSearch/indexConfigurations/indexUpdateStrategies/manual"/>
 </strategies>

Setting the index update strategy to manual on your CD servers will take a big load off your remote indexes.

Particularly if you have multiple CD servers using the same indexes. Each additional CD server would cause additional updates to the index without the above setting.

Tip 7) Rigid Indexes – Have a deployment plan

If your deployment includes additions and changes to the indexes and you need 100% availability of search data, a deployment plan for re-indexing will be required.

Grant chatted about the problem in his post here. To get around this you could consider using the blue / green paradigm during deployments.

  • This would mean having a set blue indexes and a set of green indexes.
  • Using slot swaps for your deployments.
    • One slot points to green in configuration.
    • One slot (production) points to blue in configuration.
  • To save on costs you could decommission the staging slot between deployments.

Tip 8) HttpClient should be a singleton or static

The basic idea here is that you should keep the number of HttpClient instances in your code to an absolute minimum if you want optimal performance.

The Sitecore Azure Search provider actually spins up 2 x HttpClient connections for every single index. This in itself is not ideal and unfortunately, there is not a lot you can do about this code in the core product itself.

In your own connections to other APIs, however, HttpClient SendAsync is perfectly thread safe.

By using HttpClient singletons you stand to gain big in the performance stakes. One great blog article worth reading runs you through the performance benefits. 

It’s also worth noting that in the Azure Search documentation Microsoft themselves say you should treat HttpClient as a singleton.

Tip 9) Monitor your resources

In Azure web apps you have finite resources with your app server plans. Opening multiple connections with HttpClient and not disposing of them properly can have severe consequences.

For instance, we found a bug in the core Sitecore product that was caused by the connection retryer. It held open ports forever whenever we hit out Azure Search plan usage limits.  The result was that we hit outbound open connection limits for sockets and this caused our Sitecore instance to ground to a slow halt.

Sitecore has since resolved the issue mentioned above after a lengthy investigation working alongside the Aceik team. This was tracked under reference number 203909.

To monitor the number of sockets in Azure we found a nice page on the MSDN site.

Tip 10) Make use of OData Expressions

This tip relates strongly to tip 3.  Azure search has some really powerful OData Expressions that you can make use of by a direct connection.  Once you have had a play with direct connections it is surprisingly easy to spin up really fast queries.

Operators include:

  • OrderBy, Filter (by field), Search
  • Logical operators (and, or, not).
  • Comparison expressions (eq, ne, gt, lt, ge, le).
  • any with no parameters. This tests whether a field of type Collection(Edm.String) contains any elements.
  • any and all with limited lambda expression support.
  • Geospatial functions geo.distance and geo.intersects. The geo.distance function returns the distance in kilometres between two points.

See the complete list here.


 

Q&A

Q) Anything on multiple region setups? Or latency considerations?

A) Multi-region setups:   Although I can’t comment from experience the configuration documentation does state that you can specify multiple Azure Search instances using a pipe separator in the connection string.

<add name="cloud.search" connectionString="serviceUrl=https://searchservice1.search.windows.net;apiVersion=2015-02-28;apiKey=AdminKey1|serviceUrl=https://searchservice2.search.windows.net;apiVersion=2015-02-28;apiKey=AdminKey2" /> 

Unfortunately, the documentation does not go into much detail. It simply states that “Sitecore supports a Search service with geo-replicated scenarios” which one would hope means under the hood it has all the smarts to take care of this.

I’m curious about this as well and opened a stack overflow ticket. Let’s see if anyone else in the community can answer this for us.

Search Latency: 

Search latency can be directly improved by adding more replicas via the scaling setting in Azure Portal

replicas

Two replicas should be your starting point for an Azure Search instance to support Sitecore. Once you launch your site you will need to follow the instruction in tip 4 above monitor search latency.  If the latency graph is showing consistent spikes and high latency times above 0.5 seconds it’s probably time to add some more replicas.

1 comment on “Sitecore Geo-Spatial Results with Azure Search”

Sitecore Geo-Spatial Results with Azure Search

This is the second blog post in this series. In the first blog post, I spoke about setting up Azure Search.

In this post, I am going to talk about using a Geo-Spatial lookup query that Azure Search has built in and the associated Index data type called Edm.GeographyPoint

The Azure Search overview can be found here and the configuration document here.

If you’re simply looking for a working example jump over to our GitHub repository that contains the full working helix site.

https://github.com/Aceik/Sitecore-Azure-Search-Spatial

You can also find the latest version of the Sitecore package here:

https://github.com/Aceik/Sitecore-Azure-Search-Spatial/tree/master/lib/Package


Notice on the configuration document that it has a list of EDM data types and it also references the document from Microsoft which has the same list.

A notable exception is the that EDM.GeographyPoint is mentioned in the Microsoft document but not the Sitecore EDM list. So if you need to implement a geo-spatial lookup and get back results based on a latitude and longitude that isn’t possible with the Sitecore Azure Search provider out of the box.

If you dig a little deeper you can see that EDM.GeographyPoint is not in the search code. Have a look at the source code in Sitecore.ContentSearch.Azure.Schema.dll CloudSearchTypeMapper.GetNativeType()

We asked Sitecore support when and if EDM.GeographyPoint would be supported by the Sitecore provider and the answer was no plans in the immediate future.

The solution we came up with to solve this problem essentially involved two steps:

  • Getting data of type EDM.GeographyPoint into the Azure Search Index.
  • Performing a Spatial Query on the Azure Search Index (using geo.distance) in a timely manner.

To support both of the above operations we had to do a little digging into the core Sitecore Azure search code. Read on to find out about the solution we came up with or jump on over to the GitHub repository to look over the code.

Getting data into the index

As noted above the EDM.GeographyPoint data type simply isn’t coded into the Sitecore DLLs.  You can see this by looking at CloudSearchTypeMapper.GetNativeType() inside the DLL Sitecore.ContentSearch.Azure.Schema.dll.

If you’re still not convinced have a look in the Sitecore.ContentSearch.Azure.Http.dll

EdmTypes

The next thing we tried was adding Edm.GeographyPoint as a field converter.

Once again we hit a brick wall with this solution as that doesn’t change the fact that Edm.GeographyPoint is not actually a known cloud type in the core DLLs.

If you look at the error message you get when attempting to add Edm.GeographyPoint as a Converter it tells us that CloudIndexFieldStorageValueFormatter.cs is attempting to look up the native EDM type.

Exception: System.NotSupportedException
Message: Not supported type: ‘Edm.GeographyPoint’
Source: Sitecore.ContentSearch.Azure at Sitecore.ContentSearch.Azure.Schema.CloudSearchTypeMapper.GetNativeType(String edmTypeName) at Sitecore.ContentSearch.Azure.Converters.CloudIndexFieldStorageValueFormatter.FormatValueForIndexStorage(Object value, String fieldName) at Sitecore.ContentSearch.Azure.CloudSearchDocumentBuilder.AddField(String fieldName, Object fieldValue, Boolean append)
….

Working around this limitation took a bit of trial and error but eventually, we nailed it down to the following steps:

  1. Created a new search configuration called spatialAzureIndexConfiguration.
  2. Create a new index configuration based on the new configuration in step 1.
  3. In the search configuration from step 1.
    • Add the following computed field.
    • <fields hint="raw:AddComputedIndexField">
       <field fieldName="geo_location" type="Aceik.Foundation.CloudSpatialSearch.IndexWrite.Computed.GeoLocationField, Aceik.Foundation.CloudSpatialSearch" />
       </fields>
    • Add a new cloud type mapper to map to EDM.GeographyPoint
    • <cloudTypeMapper ref="contentSearch/indexConfigurations/defaultCloudIndexConfiguration/cloudTypeMapper">
       <maps hint="raw:AddMap">
       <map type="Aceik.Foundation.CloudSpatialSearch.Models.GeoJson, Aceik.Foundation.CloudSpatialSearch" cloudType="Edm.GeographyPoint"/>
       </maps>
       </cloudTypeMapper>
    • Replace the standard index field storage value formatter with the following the following class.
    • <indexFieldStorageValueFormatter type="Aceik.Foundation.CloudSpatialSearch.IndexWrite.Formatter.GeoCloudIndexFieldStorageValueFormatter, Aceik.Foundation.CloudSpatialSearch">
       <converters hint="raw:AddConverter">
         ...
       </converters>
       </indexFieldStorageValueFormatter>
  4. Our new formatter GeoCloudIndexFieldStorageValueFormatter.cs  inherits from the formatter in the core dlls and overrides the method FormatValueForIndexStorage. It detects if the field name contains “_geo” and basically prevents CloudSearchTypeMapper.GetNativeType from ever running and falling over.

 

The computed field GeoLocationField returns a GeoJson POCO which is serialised to the GeoJson format (http://geojson.org/) which is required for storage in Azure Search Index.

Getting data out of the index

The GitHub solution demonstrates two solutions.

Both solutions above at the end of the day allow you to perform an OData Expression query against your Azure Search Indexes. Using the “geo.distance” filter allows us to perform spatial queries using the computing power of the cloud.

The only downside is that search results returned don’t actually include the distance as a value. They are correctly filtered and ordered by the distance yet the actual distance value itself is not returned. We suggest voting for this change on this ticket :). For now, we suggest using the System.Device.Location.GeoCoordinate to figure out the distance between two coordinates.


References:

In researching for possible solutions already out there we had a look at the following implementations for Lucene and SOLR:

  1. Sitecore Solr Spatial: https://github.com/ehabelgindy/sitecore-solr-spatial
  2. Sitecore Lucene Spatial: https://github.com/aokour/Sitecore.ContentSearch.Spatial

Special Thanks:

  • Ed Khoze –  For the research, he did and help with the initial prototype. Plus google address lookup advice.
  • Jose Dominguez – For getting the OData filtering query working.