Solr, the Other In-Memory NoSQL Database?

Using Solr for general purpose analysis

If you ask well-informed technical people what Apache Solr is used for, the most likely response is that Apache Solr + Lucene is an open source text search engine. Documents are indexed into Solr, and once indexed, those same documents can be easily searched using free-form queries in much the same way you would query Google. Still others might add that Solr has some very capable geo-location indexing capabilities that support radius, bounded-box, and defined-area searches. Both of these answers are well informed and correct.

What may be less well known is that Apache Solr (+Lucene) can be used effectively for certain indexed data queries and provide lightning-fast response times too! By leveraging Solr in this way you can benefit by either extending your current use of Solr or adding Solr to your existing cluster to better leverage your existing data assets.

This article will share how Solr can be leveraged to provide exceptional query response times for a wide variety of business-style queries. Guidance will be provided on how to index documents into a Solr cluster and issue complex queries against the indexed documents. After the nuts and bolts are covered, important considerations for using Solr in this way will be discussed. The article will finish with a review of Solr's capabilities as compared to other in-memory NoSQL engines such as MongoDB.

In short, this article provides a great overview of how to leverage Solr as a NoSQL in-memory database.

Let's Get Some Data

In searching for some data to index into Solr I had a few criteria. I wanted the number of fields to be small, so that the data set could be easily understood. I also wanted a data set that isn't typical text-search fare, but rather resembles business data. Lastly, I wanted a data set with some numerical values so that Solr's comparison and range filtering capabilities could be easily demonstrated and understood.

After a little searching on-line I found the following data set which I believe meets all of my criteria:

The above data set is a simple listing of electricity rates by zip code for 2011. The data set contains the following fields and types:

Field Name      Field Type   Field Description
zip             int          Rate zip code
eiaid           int          Energy provider id
utility_name    string       Company name
state           string       State served
service_type    string       Service type (e.g., Bundled)
ownership       string       Ownership type (e.g., Investor Owned)
comm_rate       double       Commercial rate $ / KWH
ind_rate        double       Industrial rate $ / KWH
res_rate        double       Residential rate $ / KWH

The csv file from the above URL has been downloaded and, for brevity, renamed to rates.csv. What follows are the first few lines from that csv file:


35218,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35219,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35214,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35215,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35216,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35210,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35211,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35212,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35213,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

Let's Create and Load a Schema


Solr can infer a schema from indexed data, but doing so leaves it up to Solr to determine the fields and types. To be assured of the appropriate indexing and type semantics, defining a schema is recommended. In our example we will query certain fields and apply comparison and range filters to them. As a result, we must make sure these fields are indexed and defined with the proper field type before we index data into Solr. We also take care not to index fields that will not be searched or faceted, which minimizes the memory needed to fulfill your business needs.

First we instruct Solr to create a default configuration set on the local file system. To do this we issue the following command, where /tmp/electric_rates is the local directory where Solr will place our default configuration set:

solrctl --zk localhost:2181/solr instancedir --generate /tmp/electric_rates

In the /tmp/electric_rates directory there will now be a file named schema.xml. This is a rather large xml file containing definitions that are leveraged elsewhere; our main area of concern is the field definitions. All of the example field definitions can be removed. Listed below are the field definitions we will use for our example electric rate data set:

<field name="zip" type="int" indexed="true" stored="true" required="true"/>

<field name="eiaid" type="int" indexed="false" stored="true"/>

<field name="utility_name" type="string" indexed="true" stored="true" omitNorms="true"/>

<field name="state" type="string" indexed="true" stored="true" omitNorms="true"/>

<field name="service_type" type="string" indexed="false" stored="true" omitNorms="true"/>

<field name="ownership" type="string" indexed="false" stored="true" omitNorms="true"/>

<field name="comm_rate" type="double" indexed="true" stored="true"/>

<field name="ind_rate" type="double" indexed="true" stored="true"/>

<field name="res_rate" type="double" indexed="true" stored="true"/>

You will note that there are a few "int" fields, a few "string" fields, and a few "double" fields. Also note that only some fields are designated as "indexed='true'" as these are fields that we will query on or apply grouping functions to. The "omitNorms" setting informs Solr that we will NOT be using these fields in any form of boosting searches. Using boosting in searches is an advanced way to instruct Solr that a specific field is more or less important in certain "boosted" queries.

After the schema.xml file has been edited, the instance directory must be uploaded to ZooKeeper using the following command:

solrctl --zk localhost:2181/solr instancedir --create electric_collection /tmp/electric_rates

Next we instruct Solr to create a new collection with the following command:

solrctl --zk localhost:2181/solr collection --create electric_collection

Finally, to index the data into Solr we will use the already configured csv request handler. This is an excellent utility for small data sets, but it is not recommended for larger ones; for those you might consider the MapReduceIndexerTool, which I leave to the reader to investigate. The following command will get our data indexed.

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/update/csv?header=true&rowid=id&stream.file=/tmp/rates.csv&stream.contentType=text/csv;charset=utf-8"

Upon completion you will note that 37791 documents were indexed into Solr. Obviously this is not a large data set, but the intention is to demonstrate query capabilities first and response times only as secondary information.
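A quick way to double-check that count is a match-all query with rows=0, which returns only the hit count and no documents. The sketch below assembles such a query URL; it assumes the same single-shard collection core used above, and a running Solr instance is needed to actually issue it:

```shell
# Build a match-all count query against the collection core used above.
SOLR_CORE="http://localhost:8983/solr/electric_collection_shard1_replica1"

# rows=0 returns only numFound (the document count), no document bodies.
QUERY="${SOLR_CORE}/select?q=*:*&rows=0&wt=json"
echo "$QUERY"

# Against a live instance:  curl "$QUERY"
```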

Now Let's Get Some Business Answers and Make it Fast!


To demonstrate Solr's query capabilities, let's answer some business-style questions targeting our newly indexed data set. For each question, I will provide the query along with a breakdown of each query element. To keep the article shorter I will not list the full Solr response, only the answers in very short form.

How many utility companies serve the state of Maryland (MD)?

To fulfill the above question we need to apply a filter to the state field, specifying only results from 'MD'. To determine how many utility companies exist in MD, we ask Solr to group the results on the utility_name field, limiting each group to just 1 result since we only care how many total groups there are. The following query fulfills the business needs requested above:
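A sketch of this query, assembled from standard Solr request parameters and the collection URL used during indexing (a running Solr instance is needed to actually issue it):

```shell
# Build the grouped Maryland query: filter on state, group on utility_name,
# return up to 10 groups with 1 document each.
SOLR_CORE="http://localhost:8983/solr/electric_collection_shard1_replica1"
QUERY="${SOLR_CORE}/select?q=*:*&fq=state:MD&wt=json&indent=true"
QUERY="${QUERY}&group=true&group.field=utility_name&rows=10&group.limit=1"
echo "$QUERY"

# Against a live instance:  curl "$QUERY"
```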



Listed below are the query elements decomposed for better understanding:

Query Element                                                           Purpose
http://localhost:8983/solr/electric_collection_shard1_replica1/select   Solr collection select URL
fq=state:MD                                                             State="MD" filter
wt=json                                                                 Results in json format
indent=true                                                             Indent the results
group=true                                                              Group results
group.field=utility_name                                                Group by utility_name
rows=10                                                                 Limit # of groups to 10
group.limit=1                                                           Only 1 result per group

The number of groups returned is 4 and the result was returned in 23 milliseconds!

Which Maryland utility has the cheapest residential rates?

To fulfill the above question we only need to add one element to the prior query, instructing Solr to sort the groups in ascending order by res_rate so that the cheapest residential rate appears first; we can also limit the number of groups returned to just 1.
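The modified query can be sketched the same way; only the group sort and group count change (note the URL-encoded space in the sort clause):

```shell
# Cheapest MD residential rate: sort groups ascending by res_rate and
# return only the top group.
SOLR_CORE="http://localhost:8983/solr/electric_collection_shard1_replica1"
QUERY="${SOLR_CORE}/select?q=*:*&fq=state:MD&wt=json&indent=true"
QUERY="${QUERY}&group=true&group.field=utility_name&rows=1"
QUERY="${QUERY}&group.sort=res_rate%20asc"
echo "$QUERY"

# Against a live instance:  curl "$QUERY"
```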


Listed below are the new or modified query elements decomposed for better understanding:

Query Element             Purpose
rows=1                    Limit # of groups to 1
group.sort=res_rate asc   Sort groups by res_rate ascending

The cheapest utility in MD is "The Potomac Edison Company" @ 0.03079 / KWH and the result was returned in 4 milliseconds!

What are the minimum and maximum residential power rates excluding missing data elements?

To fulfill this query we need to filter out data rows where res_rate = 0.0, as these are missing data elements. We accomplish this using an "frange" query that excludes the lower bound of 0.0. To get the minimum and maximum res_rate we instruct Solr to generate statistics for the res_rate indexed field. The query to answer the above business question is listed below:
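A sketch of the stats query; the frange filter's braces and spaces must be URL-encoded when passed on a curl command line, and a running Solr instance is needed to actually issue it:

```shell
# Min/max res_rate via the stats component, excluding res_rate <= 0.0.
SOLR_CORE="http://localhost:8983/solr/electric_collection_shard1_replica1"

# URL-encoded form of: fq={!frange l=0.0 incl=false}res_rate
FRANGE="fq=%7B!frange%20l%3D0.0%20incl%3Dfalse%7Dres_rate"

QUERY="${SOLR_CORE}/select?q=*:*&${FRANGE}&wt=json&indent=true&rows=0"
QUERY="${QUERY}&stats=true&stats.field=res_rate"
echo "$QUERY"

# Against a live instance:  curl "$QUERY"
```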


Listed below are the query elements decomposed for better understanding:

Query Element                                                           Purpose
http://localhost:8983/solr/electric_collection_shard1_replica1/select   Solr collection select URL
q=*:*                                                                   Consider all documents
fq=%7B!frange%20l%3D0.0%20incl%3Dfalse%7Dres_rate                       Range query excluding lower bound of 0.0 on res_rate field
wt=json                                                                 Results in json format
indent=true                                                             Indent the results
rows=0                                                                  No document results to be returned
stats=true                                                              Generate statistics
stats.field=res_rate                                                    Return stats on the res_rate field

The frange filter is restated below without URL-encoded characters:

fq={!frange l=0.0 incl=false}res_rate

The res_rate minimum is 0.0260022258659 and the res_rate maximum is 0.849872773537. Results were returned in 5 milliseconds.

What is the state and zip code with the highest res_rate?

To fulfill the above business request we take the maximum res_rate returned from the prior query and use it as a filter for the next query, as listed below:
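Using the maximum from the stats query as an exact-match filter, the lookup can be sketched as (a running Solr instance is needed to actually issue it):

```shell
# Find the document carrying the maximum residential rate.
SOLR_CORE="http://localhost:8983/solr/electric_collection_shard1_replica1"
QUERY="${SOLR_CORE}/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1"
echo "$QUERY"

# Against a live instance:  curl "$QUERY"
```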


Listed below are the query elements decomposed for better understanding:

Query Element                                                           Purpose
http://localhost:8983/solr/electric_collection_shard1_replica1/select   Solr collection select URL
q=res_rate:0.849872773537                                               Select the target res_rate documents
wt=json                                                                 Results in json format
indent=true                                                             Indent the results
rows=1                                                                  Return only 1 result if found

The highest residential electric rates are found in Alaska in zip code 99634. The results were returned in 1 millisecond!


Guidelines for Using Solr to Meet Your Analysis Needs


It is worth pointing out that Solr should not be thought of as a general purpose in-memory NoSQL engine. With that in mind here are some guidelines to help understand when it might be appropriate to leverage Solr's query capabilities:

1. Your use case requires very fast query response times
2. The data you need to analyze is already stored in Hadoop
3. You can easily define a schema for the data to be indexed
4. You need to query (filter) on many fields
5. The amount of data to be indexed into Solr will not exceed your Solr cluster's capabilities

If many or all of the above criteria apply then using Solr for your data analysis might just be a great fit.

Comparing Solr to MongoDB


MongoDB is one of several NoSQL database engines in existence today, and it often gets consideration when someone is investigating fast, scalable, general purpose databases. For comparison purposes, the table below details each engine's support for the listed features.




Feature                                                       Solr                                  MongoDB
Supports in-memory analysis                                   Yes                                   Yes
Requires schema definition                                    Yes (highly recommended)              No
Supports dynamic addition of new indexes on existing fields   No (requires re-indexing documents)   Yes
Scales to support more data                                   Yes                                   Yes
Supports SQL syntax                                           No                                    No
General purpose in-memory database                            No                                    Yes
Supports the HDFS file system                                 Yes                                   No
As you can see, Solr provides lightning-fast query response times to a wide variety of business-style queries. Its query language is not nearly as well known as SQL, but Solr has some excellent capabilities that can be leveraged with some thought and practice.

To get the answers needed above we leveraged grouping, group sorting, field selection (filtering), statistics generation and range selection. While Solr should not be considered to be a general purpose NoSQL in-memory database system, it can still be leveraged to yield some very capable analysis results with awesome response times. As such it should be viewed as another tool in the toolbox that when used correctly can simplify the life of the Hadoop Eco-System Architect!

System Specifications


All of the above queries were issued against a single Solr instance running in a virtual machine:

OS:                      CentOS 6.6
CDH version:             5.0.0
Solr memory available:   5.84 GB

Related Links


Solr quick start:

Solr reference guide:

Solr in Action (book)


More Stories By Pete Whitney

Pete Whitney is a Solutions Architect for Cloudera. His primary role at Cloudera is guiding and assisting Cloudera's clients through successful adoption of Cloudera's Enterprise Data Hub and surrounding technologies.

Previously Pete served as VP of Cloud Development for FireScope Inc. In the advertising industry Pete designed and delivered DG Fastchannel’s internet-based advertising distribution architecture. Pete also excelled in other areas including design enhancements in robotic machine vision systems for FSI International Inc. These enhancements included mathematical changes for improved accuracy, improved speed, and automated calibration. He also designed a narrow spectrum light source, and a narrow spectrum band pass camera filter for controlled machine vision imaging.

Pete graduated Cum Laude from the University of Texas at Dallas, and holds a BS in Computer Science. Pete can be contacted via Email at