MongoDB Interview Questions

Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.

Beginner

Yes, it will allow you to connect.

How would you print out, in the shell, the name of all the products without extraneous characters or braces, sorted alphabetically, ascending?

var c = db.products.find({}).sort({name:1}); c.forEach( function(doc){ print(doc.name) } );

A covered query is a query that can be satisfied entirely using an index and does not have to examine any documents. An index covers a query when both of the following apply:

  1. all the fields in the query are part of an index, and
  2. all the fields returned in the results are in the same index. 

For example, a collection inventory has the following index on the type and item fields:

db.inventory.createIndex( { type: 1, item: 1 } )

This index will cover the following operation which queries on the type and item fields and returns only the item field:

db.inventory.find(
   { type: "food", item: /^c/ },
   { item: 1, _id: 0 }
)

MongoDB includes a database profiler which shows performance characteristics of each operation against the database. With this profiler you can find queries (and write operations) which are slower than they should be and use this information for determining when an index is needed.
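
For instance, you can turn the profiler on and inspect recent slow operations from the shell. This is a minimal sketch; the 100 ms threshold is just an illustrative value:

// Profile operations slower than 100 ms (level 1 = slow operations only, level 2 = all operations)
db.setProfilingLevel(1, 100)

// Look at the most recent entries captured in the system.profile collection
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()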

There are no major disadvantages with MongoDB; however, below are a few minor issues you may run into, most of which have been addressed or mitigated in recent releases.

  • MongoDB is ideal for implementing things like analytics/caching where the impact of a small data loss is negligible. With MongoDB's ACID/multi-document transaction support this concern has been mitigated, and MongoDB is now a good fit for a much wider range of applications.
  • A 32-bit edition has a 2GB data limit. Beyond that it can corrupt the entire DB, including the existing data. The 64-bit edition does not suffer from this limitation. In practice this can be ignored, since the 32-bit edition is not recommended for production anyway.
  • The default installation of MongoDB uses asynchronous, batched commits: the server acknowledges a write and commits all changes in a batch at a later point in time. If there is a server crash or power failure, commits still buffered in memory will be lost. This behaviour can be disabled, but then it will perform about as well as, or worse than, a traditional RDBMS.

Journaling limits the data loss up to a certain point in this case.

The combination of a database name and a collection name is called a namespace in MongoDB.
MongoDB stores BSON objects in collections, and the concatenation of the database name and the collection name (with a period in between), for example mydb.products, is the collection's namespace.

Each journal (group) write is consistent and won’t be replayed during recovery unless it is complete.

No. Writes to disk are lazy by default. A write may only hit the disk a couple of seconds later. For example, if the database receives a thousand increments to an object within one second, it will only be flushed to disk once.
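
If a particular write must be acknowledged only after it has reached the on-disk journal, you can request that per operation with a write concern. A minimal sketch; the orders collection and document are hypothetical:

// Acknowledge the write only after it has been written to the on-disk journal (j: true)
db.orders.insertOne(
  { item: "book", qty: 1 },
  { writeConcern: { w: 1, j: true } }
)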

Replica sets use elections to determine which set member will become primary. Replica sets can trigger an election in response to a variety of events, such as adding a new member to the replica set, initiating a replica set, performing replica set maintenance with methods such as rs.stepDown() or rs.reconfig(), and the secondary members losing connectivity to the primary for more than the configured timeout (10 seconds by default).

For example, when the primary node becomes unavailable for longer than the configured timeout, the automatic failover process is triggered. One of the remaining secondaries calls for an election to select a new primary, and the set automatically resumes normal operations.

The replica set cannot process write operations until the election completes successfully. The replica set can continue to serve read queries if such queries are configured to run on secondaries.

The median time before a cluster elects a new primary should not typically exceed 12 seconds, assuming default replica configuration settings.

 This includes time required to mark the primary as unavailable and call and complete an election. You can tune this time period by modifying the settings.electionTimeoutMillis replication configuration option. Factors such as network latency may extend the time required for replica set elections to complete, which in turn affects the amount of time your cluster may operate without a primary. These factors are dependent on your particular cluster architecture.
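
A sketch of tuning this option on an existing replica set from the shell; the 5000 ms value is only an example:

// Lower the election timeout from the default 10000 ms to 5000 ms
var cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 5000
rs.reconfig(cfg)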

Your application connection logic should include tolerance for automatic failovers and the subsequent elections.

It may take 10-30 seconds for the primary to be declared down by the other members and a new primary to be elected. During this window of time, the cluster is down for primary operations, i.e. writes and strongly consistent reads. However, eventually consistent queries may be executed against secondaries at any time (in slaveOk mode), including during this window.

For the WiredTiger storage engine, you can specify the maximum size of the cache that WiredTiger will use for all data. This can be done using storage.wiredTiger.engineConfig.cacheSizeGB option.
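
The cache size itself is a startup setting (storage.wiredTiger.engineConfig.cacheSizeGB in the configuration file, or --wiredTigerCacheSizeGB on the command line), but you can inspect the configured maximum and current usage from the shell. A small sketch:

// WiredTiger cache statistics from serverStatus
var cache = db.serverStatus().wiredTiger.cache
print(cache["maximum bytes configured"])      // the configured cache ceiling
print(cache["bytes currently in the cache"])  // how much of it is in use right now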

GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB. Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document.
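
Assuming a file has already been uploaded (for example with the mongofiles tool or a driver) into the default fs bucket, this sketch shows the two collections GridFS uses:

// File metadata lives in fs.files (filename, length, chunkSize, uploadDate, ...)
var f = db.fs.files.findOne()

// The file's content is split into ordered chunks in fs.chunks (binary payload omitted here)
db.fs.chunks.find({ files_id: f._id }, { data: 0 }).sort({ n: 1 })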

The explain() command can be used for this information. The possible modes are: 'queryPlanner', 'executionStats', and 'allPlansExecution'.
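
For example, reusing the inventory collection from earlier:

// Show the winning plan plus runtime statistics (keys examined, documents examined, etc.)
db.inventory.find({ type: "food", item: /^c/ }).explain("executionStats")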

Change streams can listen for changes across a whole cluster (deployment).

They can also listen for changes in documents across all collections in a single database.

The change event document provides information such as the txnNumber for operations that are part of a transaction.
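
A minimal sketch of opening change streams at the three scopes; the products collection is hypothetical, and database- and deployment-level streams assume a recent MongoDB version (4.0+):

// Watch a single collection
var csCollection = db.products.watch()

// Watch all collections in the current database
var csDatabase = db.watch()

// Watch the whole deployment (cluster)
var csCluster = db.getMongo().watch()

// Print change events as they arrive (insert, update, delete, ...; events from transactions carry txnNumber)
while (csCollection.hasNext()) {
  printjson(csCollection.next())
}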

The WiredTiger storage engine maintains lists of empty records in data files as it deletes documents. This space can be reused by WiredTiger, but will not be returned to the operating system except under very specific circumstances.

The amount of empty space available for reuse by WiredTiger is reflected in the output of db.collection.stats() under the heading wiredTiger.block-manager.file bytes available for reuse.

To allow the WiredTiger storage engine to release this empty space to the operating system, we can de-fragment our data file. This can be achieved using the compact command.
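
A sketch using the inventory collection from earlier:

// How much space WiredTiger could hand back for this collection
db.inventory.stats().wiredTiger["block-manager"]["file bytes available for reuse"]

// Rewrite and defragment the collection's data file, releasing unneeded space to the OS
db.runCommand({ compact: "inventory" })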

With WiredTiger, MongoDB utilizes both the WiredTiger internal cache and the filesystem cache. Starting in 3.4, the WiredTiger internal cache, by default, will use the larger of either:

  • 50% of (RAM - 1 GB), or
  • 256 MB.

For example, on a system with a total of 4GB of RAM the WiredTiger cache will use 1.5GB of RAM (0.5 * (4GB - 1 GB) = 1.5 GB). 

Conversely, a system with a total of 1.25 GB of RAM will allocate 256 MB to the WiredTiger cache, because 256 MB is larger than half of the total RAM minus one gigabyte (0.5 * (1.25 GB - 1 GB) = 128 MB < 256 MB).

By default, WiredTiger uses Snappy block compression for all collections and prefix compression for all indexes. Compression defaults are configurable at a global level and can also be set on a per-collection and per-index basis during collection and index creation.
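
A sketch of overriding the defaults for a single collection and index; the archive collection and createdAt field are hypothetical:

// Use zlib instead of the default snappy block compression for this collection only
db.createCollection("archive", {
  storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})

// Disable prefix compression for one index only
db.archive.createIndex(
  { createdAt: 1 },
  { storageEngine: { wiredTiger: { configString: "prefix_compression=false" } } }
)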

MongoDB is implemented in C++. Drivers and client libraries are typically written in their respective languages, although some drivers use C extensions for better performance.

MongoDB uses the dot notation to access the elements of an array and to access the fields of an embedded document.

Using the dot notation:

db.people.update({ }, { $set: { "address.street": "Main Street" } })

Nested fields are referenced with the dot notation and must be enclosed in double quotes, as shown above.
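
The same notation works for querying; for example:

// Match documents whose embedded address document has this street
db.people.find({ "address.street": "Main Street" })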

By default, the mongo shell prints 20 documents per batch and waits for the "it" command to iterate further. You can change that batch size with the following command:
DBQuery.shellBatchSize = <number of documents you want to print on the shell>
Example: DBQuery.shellBatchSize = 1000

MongoDB supports the following CRUD operations:

  • Create
  • Read
  • Update
  • Delete
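
A minimal sketch of each operation in the shell, using a hypothetical users collection:

db.users.insertOne({ name: "Asha", age: 30 })                 // Create
db.users.find({ age: { $gte: 18 } })                          // Read
db.users.updateOne({ name: "Asha" }, { $set: { age: 31 } })   // Update
db.users.deleteOne({ name: "Asha" })                          // Delete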

B & C

Unique indexes have certain properties and restrictions.

For example, they ensure that no two documents have the same value for a key that carries a unique index, and you may not specify a unique constraint on a field that is defined as a hashed index.
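
A small sketch, using a hypothetical users collection with a unique index on email:

// Enforce uniqueness on the email field
db.users.createIndex({ email: 1 }, { unique: true })

db.users.insertOne({ email: "a@example.com" })   // succeeds
db.users.insertOne({ email: "a@example.com" })   // fails with an E11000 duplicate key error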

Disk I/O will probably be lower with zlib than without compression.

The zlib algorithm provides higher compression rates, at the cost of more CPU.

For that reason, it is likely that the compressed data will have a smaller footprint, resulting in less disk I/O.

1 & 2  

All fields used in the selection filter of the query must be in the index, so the system can find the documents that satisfy the selection filter without having to retrieve the document from the collection.

All fields returned in the results must be in the index, so again there is no need to retrieve the full document. A common mistake is not to provide a projection that filters out the field _id, which is returned by default. If the _id field is not a field in the index definition, it is not available, and the query system will need to fetch the full document to retrieve the value.

On the other hand, it is fine to ask for more fields than the ones used in the selection filter: as long as those fields are part of the index values, the system has all the information needed to avoid fetching the full document from the collection.

MongoDB keeps what it can of the indexes in RAM. They’ll be swapped out on an LRU (least recently used) basis. You’ll often see documentation that suggests you should keep your “working set” in memory: if the portions of the indexes you’re actually accessing fit in memory, you’ll be fine.

It is the working set size plus MongoDB’s indexes which should ideally reside in RAM at all times i.e. the amount of available RAM should ideally be at least the working set size plus the size of indexes plus what the rest of the OS (Operating System) and other software running on the same machine needs. 

If the available RAM is less than that, LRU eviction kicks in and we may therefore see a significant slowdown. One thing to keep in mind is that an index caches whole B-tree buckets, not individual index keys. So if we had a uniform distribution of keys in an index, including for historical data, we might need more of the index in RAM than with a compound index on time plus something else; with the latter, keys in the same B-tree bucket are usually from the same time era, so this caveat does not apply. Also, keep in mind that field names in BSON are stored in the records (but not in the index), so under memory pressure they should be kept short.

MongoDB pre-allocates data files to reserve space (and avoid file system fragmentation) when setting up the server. That is why MongoDB data files are large.

The voting is done by a majority of voting members.

Imagine a Replica Set with three (voting) members. Let’s say that Node A is primary, and nodes B+C are secondaries. Node A goes down, so nodes B+C go to election. They still do form a majority (two out of three). The election is first decided by priority. If both Nodes B & C have the same priority, then the one who is most up to date in respect to the failed primary (oplog) wins. Let’s say it’s Node B.

Once node A comes back alive, there is no new election. Node B remains the master, and C+A are now secondaries.

On the other hand, if two nodes go down you no longer have a majority, so the replica set can’t accept updates (apply writes) any more until at least one of the two failed servers comes back alive (and is reachable by the single surviving node).

Now imagine a Replica Set with four (voting) members. Let’s say that Node A is primary, and nodes B+C+D are secondaries. Node A goes down, so nodes B+C+D go to election. They of course form a majority (three out of four).

However, if two nodes go down you don’t have a majority (two out of four), so the replica set is again in read-only mode.

So that’s why an odd number is recommended. If you lose a single member in a 3-member replica set, it’s the same as losing a single member in a 4-member replica set: you still retain a quorum (majority) and a new primary can be elected. On the other hand, if you lose two members of a 3-member replica set or a 4-member replica set (or n/2 members of an n-member replica set), the impact is again the same: no new primary can be elected.

So, to make a long story short, there is no redundancy gain by having an even number of members in a replica set.
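
A sketch of initiating a three-member replica set (hostnames and priorities are illustrative) and checking which member is currently primary:

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017", priority: 2 },   // preferred primary
    { _id: 1, host: "mongo2:27017", priority: 1 },
    { _id: 2, host: "mongo3:27017", priority: 1 }
  ]
})

rs.status()   // shows each member's state, including which one is PRIMARY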

The reason for issuing the rs.slaveOk() or db.getMongo().setSlaveOk() command while querying secondaries is as follows.

We have to set “slave okay” mode to let the mongo shell know that we are allowing reads from a secondary. This protects our applications from performing eventually consistent reads by accident.
We can do this in the shell with:

rs.slaveOk()

After that we can query normally from secondaries.

A note about “eventual consistency”: under normal circumstances, replica set secondaries have all the same data as primaries within a second or less.

What is eventual consistency?

A property of a distributed system that allows changes to the system to propagate gradually. In a database system, this means that readable members are not required to reflect the latest writes at all times. In MongoDB, reads to a primary have strict consistency; reads to secondaries have eventual consistency.

Under very high load, data that we’ve written to the primary may take a while to replicate to the secondaries. This is known as “replica lag”, and reading from a lagging secondary is known as an “eventually consistent” read, because, while the newly written data will show up at some point (barring network failures, etc), it may not be immediately available.

Please note that we only need to set slaveOk when querying secondaries, and only once per session.

Imagine you have a three-member replica set and your secondaries are falling behind. What are the plausible causes, and why?

  • Network issues
  • Slower hardware on the secondaries

“The application is writing to the secondaries but not the primary” is clearly wrong, because an application can only write to the primary.

Network issues may lead to the replication subsystem not being able to quickly get the changes happening on the Primary resulting in replication lag.

Having faster hardware for the Primary can also lead to replication lag. Imagine the Primary operating at full capacity. While this is happening, the secondaries with slower hardware may not be able to apply all the writes happening on the Primary at the same speed.
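
A quick way to check for replication lag from the shell (newer shells name the first helper rs.printSecondaryReplicationInfo()):

// How far each secondary is behind the primary's oplog
rs.printSlaveReplicationInfo()

// Per-member state and last applied operation time
rs.status().members.forEach(function (m) {
  print(m.name, m.stateStr, m.optimeDate)
})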

No. If you don’t call getLastError (aka “Safe Mode”), the server behaves exactly the same as if you had. The getLastError call simply lets you get confirmation that the write operation was successfully committed. Of course, you will often want that confirmation, but the safety of the write and its durability are independent of it.

Is it possible to configure the cache size for MMAPv1 in MongoDB?

No, it is not possible. MMAPv1 uses memory-mapped files managed by the operating system, so its cache size cannot be configured directly; the operating system decides how much memory is used for the mapped data.

Advanced

Sharding is a method for storing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.

In most situations a Sharded Cluster will create/split and distribute chunks automatically without user intervention. However, in a limited number of cases, MongoDB cannot create enough chunks or distribute data fast enough to support the required throughput.

For example, you may want to ingest a large volume of data into a cluster that is unbalanced, or where the ingestion of data will itself lead to data imbalance, such as with monotonically increasing or decreasing shard keys. Pre-splitting the chunks of an empty sharded collection can help with throughput in these cases.
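
A sketch of pre-splitting an empty sharded collection at chosen shard-key boundaries (the database, collection, shard key, and split points are all hypothetical):

sh.enableSharding("logs")
sh.shardCollection("logs.events", { deviceId: 1 })

// Create chunk boundaries up front so the initial load spreads across shards
sh.splitAt("logs.events", { deviceId: 1000 })
sh.splitAt("logs.events", { deviceId: 2000 })
sh.splitAt("logs.events", { deviceId: 3000 })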

The update will go through immediately on the old Shard and then the change will be replicated to the new Shard before ownership transfers.

No, chunk moves are consistent and deterministic. The move will retry and when completed, the data will be only on the new Shard.

A single MongoDB instance cannot keep up with your application's write load and you have exhausted other options.

When the data set is too big to fit in a single MongoDB instance.

When we want to improve read performance for the application.

The data set is taking too much time to backup and restore.

Taking too much time to backup and restore is a function of your operational requirements. Breaking the dataset over shards means that each server has more resources available to handle the subset of data it owns, and operations such as moving data across machines for replication, backups, and restores will also be faster.

An insufficiently granular (“low cardinality”) shard key can result in large chunks that cannot be split.

The reason is that documents with the same shard key value are colocated in the same chunk. If a lot of documents share that value, the chunk can become very big, and the system is unable to split it because there is no value between the bounds of the chunk. For example, if the shard key is the name of a country, all documents with the value INDIA are placed in the same chunk, and this chunk can’t be split, since there is no other value between INDIA and INDIA.

High IO wait times in the CPU stats

IO wait is the key piece of information. That means the disk is unable to promptly take all the requests sent to it.

SSDs are usually faster than spinning disks; however, you can have a system performing very well with spinning disks if they are not used at full capacity.

A high number of page faults, and resident memory approaching physical memory, are usually symptoms that the system does not have enough physical memory.

We should disable/stop the Balancer service before backing up a running sharded cluster.

One of the requirements for doing a backup of a sharded cluster is to ensure that no group of documents (chunks) are getting migrated by one shard to another shard while you are copying the data for the given shard.

For this reason, you need to ensure the balancer is disabled while you take the file system snapshots.

If you are using Ops Manager or Cloud Manager for your backups, then those tools will stop the balancer for you, and they will also insert a synchronization token in all shards and in the config server, so you can have a consistent backup.
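
A sketch of the shell helpers involved, run against a mongos:

sh.stopBalancer()       // disable chunk migrations before taking the backup
sh.getBalancerState()   // should now report false

// ... take the file system snapshots / run the backup ...

sh.startBalancer()      // re-enable migrations afterwards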

In the context of scaling MongoDB:

  1.  Replication creates additional copies of the data and allows for automatic failover to another node. It may help with horizontal scaling of reads if you are OK with reading data that potentially isn’t the latest.
  2.  Sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It’s important to choose a good shard key; for example, a poor choice of shard key could lead to “hot spots” of data only being written to a single shard.

A sharded environment does add more complexity because MongoDB has to manage distributing data and requests between shards — additional configuration and routing processes are added to manage those aspects.

Replication and sharding are typically combined to create a sharded cluster where each shard is supported by a replica set.

From a client application point of view you also have some control in relation to the replication/sharding interaction, in particular:

  1.  Read preferences
  2.  Write concerns
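
A minimal sketch of both, using a hypothetical products collection:

// Read preference: allow this query to run on a secondary if one is available
db.products.find().readPref("secondaryPreferred")

// Write concern: wait until a majority of replica set members acknowledge the write
db.products.insertOne(
  { sku: "abc-123", qty: 5 },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)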

Replication is a mostly traditional master/slave setup: data is synced to backup members, and if the primary fails, one of them can take its place. It is a reasonably simple tool. It’s primarily meant for redundancy, although you can scale reads by adding replica set members. That’s a little complicated, but works very well for some apps.

Sharding sits on top of replication, usually. “Shards” in MongoDB are just replica sets with something called a “router” in front of them. Your application will connect to the router, issue queries, and it will decide which replica set (shard) to forward things on to. It’s significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).

We always suggest starting unsharded for simplicity and quick startup, unless your initial data set will not fit on a single server. Upgrading from unsharded to sharded is easy and seamless, so there is not a lot of advantage to setting up sharding before your data set is large.
