architecture-reviews

A repository of reviews and analysis of the architecture of open source systems

View project on GitHub

ElasticSearch - Evolution

1. History and Evolution

1.1 Main Functional Differences

Even from the begining, Elasticsearch was a search engine with different features:

  • Distributed and Highly Available Search Engine.
    • Each index is fully sharded with a configurable number of shards.
    • Each shard can have one or more backups.
    • Read / Search operations performed on either primary or backup shards.
  • Multi Tenant with Multi Types.
    • Support for more than one index.
    • Support for more than one type per index.
    • Index level configuration (number of shards, index storage, …).
  • Various set of APIs
    • HTTP RESTful API
    • Native Java API.
    • All APIs perform automatic node operation rerouting.
  • Document oriented
    • No need for upfront schema definition.
    • Schema can be defined per type for customization of the indexing process.
  • Reliable, Asynchronous Write Behind for long term persistency.
  • (Near) Real Time Search.
  • Built on top of Lucene
    • Each shard is a fully functional Lucene index
    • All the power of Lucene easily exposed through simple configuration / plugins.
  • Per operation consistency
    • Single document level operations are atomic, consistent, isolated and durable.
  • Open Source under Apache 2 License.

And now, it hasn’t really change it’s main functionality, but has added quality ones. The list of actual features is the next one:

  • Scalability and resiliency
    • Clustering and high availability
    • Automatic node recovery
    • Automatic data rebalancing
    • Horizontal scalability
    • Rack awareness
    • Cross-cluster replication
    • Cross-datacenter replication
  • Management
    • Index lifecycle management
    • Hot-warm architecture
    • Frozen indices
    • Snapshot lifecycle management
    • Snapshot and restore
    • Source-only snapshots
    • Data rollups
    • Data streams
    • Transforms
    • Upgrade Assistant API
    • API key management
  • Security
    • Elasticsearch secure settings
    • Encrypted communications
    • Encryption at rest support
    • Role-based access control (RBAC)
    • Attribute-based access control (ABAC)
    • Field- and document-level security
    • Audit logging
    • IP filtering
    • Security realms
    • Single sign-on (SSO)
    • Third-party security integration
  • Alerting
    • Highly available, scalable alerting
    • Notifications via email, Slack, PagerDuty, ServiceNow, or webhooks
  • Clients
    • Language clients
    • Elasticsearch DSL
    • Elasticsearch SQL
    • Event Query Language (EQL)
    • JDBC client
    • ODBC client
    • Tableau Connector for Elasticsearch
    • CLI tools
  • REST APIs
    • Document APIs
    • Search APIs
    • Aggregations APIs
    • Ingest APIs
    • Management APIs
  • Integrartions
    • Elasticsearch-Hadoop
    • Apache Hive
    • Apache Pig
    • Apache Spark
    • Apache Storm
    • Business intelligence (BI)
    • Plugins and integrations
  • Deployment
    • Download and install
    • Elastic Cloud
    • Elastic Cloud Enterprise
    • Elastic Cloud on Kubernetes
    • Helm Charts
    • Docker containerization

1.2 Evolution in Numbers

For the following Charts, the following versions of the ElasticSearch repository were used (with their corresponding attributes):

  • v0.4 (2010)
    • Files: 1.350
    • Lines of Code: 181.946
    • Commit: b3337c312765e51cec7bde5883bbc0a08f56fb65
  • v0.15 (2011)
    • Files: 2.329
    • Lines of Code: 280.801
    • Commit: dac2a888fbd4a1fbff50efe3d6600ce2ed0bb95b
  • v0.90 GA (2013)
    • Files: 2.957
    • Lines of Code: 561.527
    • Commit: cb75ce0caa65af294167ae0d1514deec7815431b
  • v1.0.0 (2014)
    • Files: 4.501
    • Lines of Code: 758.745
    • Commit: a46900e9c72c0a623d71b54016357d5f94c8ea32
  • v2.0.0-beta1 (2015)
    • Files: 6.027
    • Lines of Code: 1.123.593
    • Commit: bfa3e47383d0adc690329a2fa1094ceb64cae651
  • v2.3.5 (2016)
    • Files: 6.492
    • Lines of Code: 1.229.010
    • Commit: 90f439ff60a3c0f497f91663701e64ccd01edbb4
  • v5.5.0 (2017)
    • Files: 8.796
    • Lines of Code: 1.657.088
    • Commit: 260387d2e4c0422c00adb04d160690ed1da05209
  • v6.4.0 (2018)
    • Files: 13.292
    • Lines of Code: 2.303.684
    • Commit: 595516e45d054483dcdaf12f0468a59cb34bb568
  • v6.8.5 (2019)
    • Files: 15.868
    • Lines of Code: 2.729.228
    • Commit: 78990e93431aebcc3dcd7fa20fab4b5b29ab7988
  • v7.9.3 Last GitHub Build (2020)
    • Files: 20.022
    • Lines of Code: 3.406.533
    • Commit: c4138e51121ef06a6404866cddc601906fe5c868

N° of Files in Time

N° of Lines of Code in Time

2. Architecture Decision Records

ADR1: Apache Lucene as base search software

Context: (1st commit)

  • An esqueleton is needed to develop the software
  • Apache Lucene is a search software, that includes Lucene Core (Java library with indexing and search features) and Solr, a search server.

Decision: Use Apache Lucene as the base software for the Elasticsearch project

Status: Implemented

Consequences:

  • 10 years after, it is still the base software used by ElasticSearch.
  • Easy to use and construct over it.
  • Apache Lucene has continuous updates that make the project to never get deprecated in its core

ADR2: Implement Hamcrest matcher library

Context: (1st commit)

  • Because ElasticSearch is a small software, it can use external libraries for complex algorithms without losing performance.
  • Hamcrest provides a matcher for different types of data structures (from simple text to collections and objects).

Decision: Use the Hamcrest library to find matching results to the query.

Status: Implemented

Consequences:

  • In the begining, it was a good tool for a much smaller software like what ElasticSearch was 10 years ago.
  • Today Hamcrest is mainly used in testing, where the performance is not important.

ADR3: Use Google Collect

Context: (1st commit)

  • Google Collect provides different tools and functions to make it easy to program in Java.

Decision: Use the Google Collect library for different purposes in code.

Status: Implemented

Consequences: Google Collect changed it’s name and now is Guava, but the tool is still used in different parts of the software.

ADR4: Use Docker to Install ElasticSearch

Context: (~v5.0.0)

  • Docker provides an easy way to deploy software in different types of Operative Systems
  • ElasticSearch should provide installation methods that adapt to the most quantity of users as possible

Decision: We will use Docker to provide the option to install images of ElasticSearch

Status: Implemented

Consequences:

  • Installing ElasticSearch as a Docker image allows to create single or multi node clusters in an easy way
  • Users can have multiple instances or versions of ElasticSearch
  • Docker gives an easy way to use ElasticSearch in production environments

ADR5: Create ElasticSearch images using centos as the base image

Context: (~v5.0.0)

  • ElasticSearch images need a base OS image to have an easier way to work wth Docker
  • ElasticSearch images should last in time with good maintenance and must have good security

Decision:

  • We will use centos:7 to provide a base for the ElasticSearch images

Status: Implemented

Consequences:

  • centos:7 is one of the latest releases of its kind, so it receives good maintenance and support
  • Implementation of the image is easy as centos has a lot of documentation
  • centos management panel support makes ElasticSearch images to host multiple sites on the server in an easier way