MongoDB uses BSON rather than plain JSON because BSON includes metadata describing each document/object and extends the JSON model with additional data types, ordered fields, and an encoding that is efficient to serialize and deserialize across different languages.
In MongoDB, a write operation is atomic on the level of a single document, even if the operation modifies multiple embedded documents within a single document.
MongoDB 4.0 introduced multi-document transactions, which allow atomic updates to multiple documents in a replica set.
In MongoDB 3.6, only updates to single documents in replica sets or sharded clusters were atomic.
To elaborate, starting from MongoDB 4.0 the update operations below are atomic (a transaction sketch follows the list):
An update to a single document in a replica set
An update to multiple documents in a replica set using transactions
An update to a single document in a sharded cluster
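A minimal sketch of a multi-document transaction in the mongo shell, using a hypothetical bank database and accounts collection:
session = db.getMongo().startSession()
session.startTransaction()
accounts = session.getDatabase("bank").accounts
accounts.updateOne( { _id: "A" }, { $inc: { balance: -100 } } )   // debit one account
accounts.updateOne( { _id: "B" }, { $inc: { balance: 100 } } )    // credit another
session.commitTransaction()   // both updates become visible atomically
session.endSession()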
If one of the main queries of the system pulls related information from different collections, it usually gives better performance to group that information in a single document.
Keeping the information together will remove the need to do the corresponding joins in the application. The single document may also be a better match to the representation of that object in the application.
If you use MongoDB with a direct mapping of each collection to a table in the relational model, you are not taking advantage of some of the benefits brought by the document model. So yes, using a denormalized model is encouraged.
As for duplicating information in different collections (think of an address in an order for example), this is perfectly acceptable if you need to do it to get good performance.
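As an illustration, a sketch of an order document (hypothetical fields) that embeds the shipping address rather than referencing it in a separate collection:
{
  _id: 1,
  items: [ { sku: "A-1", qty: 2 } ],
  shippingAddress: { street: "1 Main St", city: "Seattle", state: "WA" }
}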
Two documents satisfy the following criteria:
{ "city" : "Seattle", "state" : "WA" }
Out of these two documents, the document with _id:ObjectId("57fd48257268886f789b3402") already contains forest in the likes array and therefore, $addToSet is not going to add anything to this document.
The document for _id:ObjectId("57fd48257268886f789b33ff") is the only one that will be updated since it meets both criteria, and forest is not in the likes array.
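A minimal sketch of the kind of update being described, assuming the documents live in a collection named people:
db.people.updateMany(
  { city: "Seattle", state: "WA" },      // both criteria must match
  { $addToSet: { likes: "forest" } }     // adds "forest" only if it is not already in the array
)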
Both db.collection.insertOne() and db.collection.updateMany() are write operations, and write concerns apply to write operations.
find is a read operation. Read operations can be influenced by read concerns or read preferences.
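For example, a write concern can be passed as an option to a write operation (the document fields here are hypothetical):
db.collection.insertOne(
  { item: "card" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }   // wait for a majority of members, up to 5 seconds
)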
The WiredTiger storage engine supports document-level concurrency, allowing multiple documents in the same collection to be written simultaneously.
Replication is handled at a higher level in the mongod process. A storage engine's mission is to store and retrieve documents from memory (cache) and disk. Replication, sharding, processing of MongoDB Query Language queries, and more are all done in higher layers of the mongod process.
mongodump will export the documents in BSON. It is also the preferred way to transfer documents from one instance of MongoDB to another instance.
However, if you need to export to a CSV file, you would use mongoexport.
The correct answer is the one that includes --type=csv, which tells mongoexport which format to use for the output. The default type is JSON.
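As a sketch, with hypothetical database, collection, and field names:
mongoexport --db=sales --collection=orders --type=csv --fields=name,total --out=orders.csv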
When a chunk is in flight, reads and writes from the application can still access the documents in that chunk. Modifications to those documents are propagated to the shard the chunk is being migrated to.
Until the chunk is fully migrated, the shard sending it (the donor) is the only location where all the documents are present in their latest form. For that reason, the donor shard processes the reads.
Auto-generated ObjectIds are monotonically increasing values. Using them as shard keys results in one shard receiving all the insert operations. Hashing those ObjectId values creates a uniform distribution of values, spreading insert operations across all shards.
Unfortunately, a range of documents that may have been colocated in the same chunk will now be distributed randomly across all the chunks. The consequence is that any range query on a hashed shard key must be sent to all shards, making those queries less efficient and also impacting the scalability of the system.
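A sketch of sharding a collection on a hashed _id, with a hypothetical namespace:
sh.shardCollection( "mydb.mycollection", { _id: "hashed" } )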
Documents with the same values for their shard key will be colocated in the same chunk. If a lot of documents have the same values, this may result in a very big chunk. The system is unable to split this chunk as there is no value between the bounds of the chunk. For example, if a shard key is the name of a country, all documents with USA are placed in the same chunk, and this chunk can't be split, as there is no other value between USA and USA.
Chunks that cannot be split are called jumbo chunks.
For a given database in a cluster, not all collections may be sharded. As a matter of fact, you are likely to shard only the very large collections. For ease of management and to provide features like $lookup across collections, it makes sense to group all non-sharded collections together, and this location is referred to as the Primary Shard for the given database. Other databases in the cluster are likely to have a different Primary Shard, to balance space and load across the shards.
As a note, the term Primary Shard is used here, so be careful not to confuse this notion with the Primary replica in a replica set.
By specifying the "beginning of line" regex operation, you constrain the query and MongoDB can efficiently navigate the index to the correct location.
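A minimal sketch, assuming an index on a username field (collection and field names are hypothetical):
db.users.find( { username: /^kir/ } )   // anchored: can seek directly to the "kir" range of the index
db.users.find( { username: /kir/ } )    // unanchored: must examine every index key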
You should use the $match stage as early as possible in your pipeline. Filtering out documents that are not part of the answer early on means the rest of the pipeline processes fewer documents and runs faster.
Most stages, with a few exceptions, can be used multiple times in a pipeline.
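A sketch of this principle, with hypothetical collection and field names:
db.orders.aggregate([
  { $match: { status: "shipped" } },                                 // filter first
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } }     // then aggregate the smaller set
])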
The $sample stage in the Aggregation Framework returns a subset of documents in a random fashion. In the above pipeline, we ask the stage to return 3 documents, so the documents returned could be any document in any order.
This $sample stage is useful when you want to test something against a large dataset, as processing all the documents would take too much time.
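For example, to pull 3 random documents from a collection:
db.collection.aggregate( [ { $sample: { size: 3 } } ] )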
$unwind takes a field (an array) as its argument. It then replaces the original document with N documents, one for each value in the array. So taking the "apples" document, you would get 2 resulting documents:
{ "_id" : "apples", "traits" : "sweet" } { "_id" : "apples", "traits" : "crispy" }
Similarly, the "oranges" document would unwind into 3 documents, for a total of 5 documents.
Horizontal scaling is defined as adding more servers, while vertical scaling is defined as increasing the resources of a server. Horizontal scaling is achieved by sharding, not by replication.
A good practice when using replication is to have replicas in different geographical regions. If one region becomes unavailable due to a major failure in a data center or the network connection to it, the applications will continue to operate without downtime.
Note that replication helps in case of physical disasters, but does not protect against logical disasters like the deletion of a database. For that reason, replication does not replace backups.
If you have a delayed member in your replica set, for example, a delay of one hour, it will take one hour before changes on the Primary are replicated to this member.
If a user were to drop a collection or database on the Primary, you would have one hour to go to this delayed member to retrieve the destroyed data.
You can also query older versions of your documents; however, you can't choose an arbitrary historical version to retrieve, as you only get the one that existed one hour ago.
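A sketch of configuring such a member (the member index and delay are hypothetical; in MongoDB 5.0+ the option is secondaryDelaySecs rather than slaveDelay):
cfg = rs.conf()
cfg.members[2].priority = 0        // a delayed member must not be electable
cfg.members[2].hidden = true       // and should be hidden from the application
cfg.members[2].slaveDelay = 3600   // one hour behind the Primary
rs.reconfig(cfg)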
The Oplog collection only contains an entry for a given write query if the operation has modified a document.
Because the deleteMany operation is not deleting any document, there will be no Oplog entry to record.
You cannot set the version of MongoDB a given member is using through the replica set configuration. This would be difficult to control, as nodes may not have the desired version installed yet.
The other options make more sense to be controlled in a global configuration, as you want to be able to change them from one location (the Primary) and have the changes take effect without having to restart the mongod processes.
The _id must be unique in a replica set, as those values are used in the Oplog to reference documents to update. This characteristic of uniqueness is enforced by the system.
In a sharded cluster, _id must also be unique across the sharded collection because documents may migrate to another shard, and identical values would prevent a document from being inserted into the receiver shard, failing the migration of the chunk. It is the responsibility of the application to ensure uniqueness of _id for a given collection in a sharded cluster if it is not the shard key.
Also note that if _id is used as the shard key, the system will automatically enforce the uniqueness of the values, as chunk ranges are assigned to a single shard, and the shard can ensure uniqueness on the values in that range.
As for the shard key index, if it's not _id, it is perfectly acceptable to have identical values for different documents. However, beware of having too many documents with the same values, as this will lead to jumbo chunks.
The common thread in these answer choices is a bottleneck on resources, or, in the case of "taking too much time to back up and restore", a function of your operational requirements. Breaking the dataset over shards means that each server has more resources available to handle the subset of data it owns, and operations that move data across machines for replication, backups, and restores will also be faster.
Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication provides a level of fault tolerance against the loss of a single database server.
If a member of the replica set becomes unavailable due to maintenance or a hardware crash, the other members will still be able to provide the applications access to the documents.
db.team.find( { } , { scores : { $slice : [ 0 , 5 ] } } )
This query will return the desired results.
The $slice projection operator controls how many elements of an array are returned in the result.
The correct answer is that this operation cannot be done in a single query.
To understand why, recall that the _id field of a document is immutable.
In fact, trying this operation with:
updateOne({_id: 3}, {$set: { _id: 7, c: 4 }, $unset: { a: "", b: "" }})
produces the following error:
"Performing an update on the path '_id' would modify the immutable field '_id'"
The $or operator means that any document that satisfies one of the conditions will be retrieved.
{ "_id" : 2, "a" : 2, "c" : 0, "b" : 1 }
{ "_id" : 5, "a" : 3, "c" : 0, "b" : 12 }
{ "_id" : 8, "a" : 11, "c" : 1, "b" : 0 }
{ "_id" : 9, "a" : 17, "c" : 1, "b" : 1 }
{ "_id" : 10, "a" : 3, "c" : 1, "b" : 1 }
In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key. If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for the _id field.
In this question, a document with an _id field is inserted.
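To illustrate the auto-generation described above (the collection and field here are hypothetical):
db.things.insertOne( { a: 1 } )
// { "acknowledged" : true, "insertedId" : ObjectId("...") }   <-- _id generated by the driver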
The query
db.things.find( { b: 1} ).sort( {c: 1, a: 1} )
will require every document to be loaded into RAM in order to fulfill the query. This is because the initial match on the b key does not use any existing index or index prefix.
Unique indexes have certain properties and restrictions: they ensure that no two documents have the same value for the key that carries the unique index, and you may not specify a unique constraint on a field that has a hashed index.
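For example, creating a unique index on a hypothetical email field:
db.users.createIndex( { email: 1 }, { unique: true } )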
There are 2 indexes:
{ _id: 1 } { "name" : 1, "date" : 1, "phone" : 1 }
The order of the fields in the index is important; however, the order of the fields in the query is not significant, as the query planner will "reorder" the query terms to match a prefix of, or the full, compound index.
The query on _id will use the first index. Because _id is guaranteed to be unique, it's possible for the planner to make this optimization. To be sure, there will still be a FETCH stage to get the document and ensure the date predicate is fulfilled.
The query on date, name will also use an index, the name_1_date_1_phone_1 index, because a prefix is specified (name, date).
The query on the title is using a field for which there is no index.
As for the query on phone and info, phone is indexed; however, it is the third field of the compound index, so the index can't be used without the preceding fields.
Covered queries are the best queries!
The underlying index supports the entire query, so no document information is required to be fetched from disk. With a covered query, you are servicing the operation entirely from the index, which is usually faster than examining each document.
Because the field user.login is indexed and the regex beginning of line operator is being used (^), the index myIndex will be used for this query.
All fields used in the selection filter of the query must be in the index, so the system can find the documents that satisfy the selection filter without having to retrieve the document from the collection.
All fields returned in the results must be in the index, so again there is no need to retrieve the full document. A common mistake is not to provide a projection that filters out the field _id, which is returned by default. If the _id field is not a field in the index definition, it is not available, and the query system will need to fetch the full document to retrieve the value.
On the other hand, it is OK to ask for more fields than the ones provided in the selection filter; as long as those fields are part of the index, the system has all the information needed to avoid fetching the full document from the collection.
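A sketch of a covered query against the name_1_date_1_phone_1 index mentioned earlier (the collection name and value are hypothetical):
db.contacts.find(
  { name: "Alice" },                // the filter uses only indexed fields
  { _id: 0, name: 1, date: 1 }      // the projection excludes _id and returns only indexed fields
)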
All answer choices are correct!
Consider just the first choice. Any document that could match "manufacturer is Matteo AND name is Barbara AND date is 2018-07-02" would also have to match "date is 2018-07-02 AND name is Barbara AND manufacturer is Matteo".
Because of this fact, the optimizer is able to rearrange the search terms, using the existing index for each query.
The operators updateOne and insertOne are correct because adding indexes does impact write performance.
Remember, write operations that modify an indexed field may require MongoDB to update the indexes associated with the document.
That said, not having the appropriate index for a given query will cause a collection scan, and those are undesirable.
The first stage, $group, groups by the region and uses the $sum accumulator expression to count the number of documents in each group.
Next, these documents flow into the $match stage, where documents with a count that is less than 3 (3 out of the 5 groups) are filtered out, returning two documents.
{ "_id" : "SE2", "count" : 3 } { "_id" : "NW1", "count" : 3 }
The Oplog needs to be idempotent to ensure that if the server has to resume applying Oplog entries, it always arrives at the same end state, regardless of whether it reapplies entries that were already applied. For example, if the server crashes while applying oplog5 and it is difficult to tell whether oplog5 was applied, idempotency lets you restart at oplog4 without issues.
Another goal is to have the new state of the document be independent of the previous state. This means operators like $inc, which rely on the previous value to determine the new value, need to be transformed into the actual values observed. For example, if an increment operation modifies a field from the value 4 to the value 5, the operation is transformed to simply set 5 on that field. Replaying this operation many times always leads to the same result.
Any chunk that covers values in the range of 20,000 to 40,000 can be accessed by the find() query. Looking at the provided output for sh.status(), we identify the chunks for this range as:
{ "productId" : 18684 } -->> { "productId" : 27851 } on : shard0003 { "t" : 4, "i" : 0 } { "productId" : 27851 } -->> { "productId" : 36852 } on : shard0004 { "t" : 5, "i" : 0 } { "productId" : 36852 } -->> { "productId" : 46047 } on : shard0005 { "t" : 6, "i" : 0 }
Those three chunks cover all values for 20,000 to 40,000. Because they are on three different shards, the query will have to be routed to those three shards, each shard returning the corresponding documents it has for the range.
Phone number is the best selection here.
With weight, eye_color, and started_driving_at, we run the risk of low cardinality and wouldn't get a good distribution.
_id would not make a good shard key because it isn't something meaningful we could query the database with under normal circumstances, like searching for a customer record when they call into a call center.
Because the first query accounts for 90% of your read workload, it should be the one driving the selection of the shard key.
A combination of fields from that query would make the best shard key. Of the two options that use a subset of those fields, the one using company and lastName is preferred over currentEmployee and company, as the currentEmployee field is likely a boolean with few distinct values, potentially leading to many documents with the same value and resulting in jumbo chunks (chunks too big to be split).