Mongodb document fundamentals
These are my notes from: https://learn.mongodb.com/courses/m320-mongodb-data-modeling
- documents displayed in json but stored as bson
- a document is like a joined row
- Documents are like a map or an associative array
- by default, documents in a collection dont have to have the same schema
- by default, fields in a document dont have to have the same data type
Mongodb anti patterns
- massive arrays
- massive number of collections
- bloated documents
- unnecessary indexes
- queries without indexes
- data that is accessed together but stored in different collections
Mongodb patterns
- keep frequently documents in RAM
- keep indexes in RAM
- solid state drives or hard disk
- infrequently used data can use hard disks
- You either model for simplicity or performance
Mongodb datatypes
json datatypes
- string
- object
- array
- boolean
- null
bson datatypes
- dates
- numbers
- object id(primary key)
every doc requires _id field which is the primary key mongodb will auto add it, if its not included in the insert
_id is a required field in every MongoDB document, but it’s not a data type. If an inserted document does not include an_id field, MongoDB will automatically create the_id field and populate it with a value of type ObjectId.
Mongodb to RDBMS translation
Relational vs mongodb data modeling
example relational data model:
Modeling in Mongodb:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
{
"_id": "5ad88534e3632e1a35a58d00",
"customerID": 12345,
"name": {
"first": "John",
"last": "Doe"
},
"address": [
{
"location": "work",
"address": {
"street": "16 Hatfields",
"city": "London",
"postal_code": "SE1 8DJ"
},
"country": "United Kingdom",
"geo": {
"type": "Point",
"coord": [
51.5065752,
-0.109081
]
}
}
],
"email": "john.doe@acme.com",`
"phone": [
{
"location": "work",
"number": "+44-1234567890"
}
],
"dob": "1977-04-01T05:00:00Z",
"interests": [
"devops",
"data science"
],
"annualSpend": 1292815.75
}
Mongo db data relationships
data that is accessed together should be stored together
- one-to-one
- one-to-many
- Many-to-many
Ways to model relationships
- Embedding
- Referencing
Embedding and Referencing example: actor document (cast) is embedded within movie document
NOT candidates for Embedding:
A document is frequently read, but contains data that is rarely accessed. Embedding this data only increases the in-memory requirements
One part of a document is frequently updated and constantly growing in size, while the remainder of the document is relatively static
The combined document size would exceed MongoDB’s 16MB document limit
Mongodb indexing
By default, MongoDB creates an index on the document’s _id primary key field All user-defined indexes are secondary indexes. Indexes can be created on any part of the JSON document – including inside sub-documents and array elements
Index types:
- unique
- compound
- array
- TTL Indexes
In some cases data should expire automatically. Time to Live (TTL) indexes allow the user to specify a period of time after which the database will automatically delete the data. A common use of TTL indexes is applications that maintain a rolling window of history (e.g., most recent 100 days) for user actions such as clickstreams
Wildcard Indexes For workloads with many ad-hoc query patterns or that handle highly polymorphic document structures, wildcard indexes give you a lot of extra flexibility. You can define a filter that automatically indexes all matching fields, subdocuments, and arrays in a collection. As with any index, they also need to be stored and maintained, so will add overhead to the database. If your application’s query patterns are known in advance, then you should use more selective indexes on the specific fields accessed by the queries
Partial Indexes Partial Indexes can be viewed as a more flexible evolution of Sparse Indexes, where the DBA can specify an expression that will be checked to determine whether a document should be included in a particular index. e.g. for an “orders” collection, an index on state and delivery company might only be needed for active orders and so the index could be made conditional on {orderState: “active”} – thereby reducing the impact to memory, storage, and write performance while still optimizing searches over the active orders.
Hash Indexes Hash indexes compute a hash of the value of a field and index the hashed value. The primary use of this index is to enable hash-based sharding, a simple and uniform distribution of documents across
Text Search Indexes MongoDB provides a specialized index for text search that uses advanced, language-specific linguistic rules for stemming, tokenization and stop words shards
Optimizing Performance
- MongoDB’s explain() method
- MongoDB Compass GUI
- Mongodb query profiler
- mongodb ops manager MongoDB Atlas and Ops Manager eliminates this effort with the Performance Advisor which monitors queries that took more than 100ms to execute, and automatically suggests new indexes to improve performance.
MongoDB Aggregation Pipeline
MongoDB provides the Aggregation Pipeline natively within the database, which delivers similar functionality to the GROUP BY, JOIN, Materialized View, and related SQL features.
MongoDB Charts
MongoDB Charts is the quickest and easiest way create and share visualizations of your MongoDB data in real time,
Mongodb transactions
Multi-Record ACID Transactional Model Because documents can bring together related data that would otherwise be modelled across separate parent-child tables in a tabular schema, MongoDB’s atomic single-document operations provide transaction semantics that meet the data integrity needs of the majority of applications.
MongoDB 4.0 added support for multi-document ACID transactions in the 4.0 release and extended them in 4.2 with Distributed Transactions that operate across scaled-out, sharded clusters.
Mongodb Consistency
MongoDB handles the complexity of maintaining multiple copies of data via replication. Read and write operations are directed to the primary replica by default for strong consistency, but users can choose to read from secondary replicas for reduced network latency, especially when users are geographically dispersed, or for isolating operational and analytical workloads
When reading data from any cluster member, users cantune MongoDB’s consistency model to match application requirements
Foreign Keys
add validation to embedded sub documents which would act like foreign keys in a relational model
mongodb schema validation
is very tunable, can be applied to sub documents only
Example json schema validation:
- Each document must contain a field named lineItems
- The document may optionally contain other fields
- lineItems must be an array where each element:
- Must contain a title (string), price (number no smallerthan 0)
- May optionally contain a boolean named purchased
- Must contain no further fields
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
db.createCollection("orders",
{
validator: {
$jsonSchema: {
properties: {
lineItems: {
type: "array",
items: {
properties: {
title: {
type: "string"
},
price: {
type: "number",
minimum: 0.0
},
purchased: { type: "boolean" }
},
required: [
"_id",
"title",
"price"
],
additionalProperties: false
}
}
},
required: ["lineItems"]
}
}
}
)
On-Demand Materialized Views Using the $merge stage, outputs from aggregation pipeline queries can now be merged with existing stored result sets whenever you run the pipeline, enabling you to create materialized views that are refreshed on-demand. Rather than a full stop replacement of the existing collection’s content, you can increment and enrich views of your result sets as new data is processed by the aggregation pipeline. With MongoDB’s Materialized Views you have the flexibility to output results to sharded collections – enabling you to scale-out your views as data volumes grow. You can also write the output to collections in different databases further isolating operational and analytical workloads from one another. As the materialized views are stored in a regular MongoDB collection, you can apply indexes to each view, enabling you to optimize query access patterns, and run deeper analysis against them using MongoDB Charts, or the BI and Apache Spark connectors.
Data modeling for Mongodb
- identify workloads how your use your data
- document relationships
- Embedding vs Reference
How to identify workloads
- apply design patterns
Modeling for simplicity vs performance
simplicity
- fewer collections
- bigger embedded documents
- objects map well to documents
- fewer disc IO
Performance
- Sharding
- supports very fast reads or writes
Performance criteria latency operations per second more collections
Identify workloads
qualify and quantify operations
Examples of types of data and their durability
Relationships and Cardinality
a mother and her children embed as children should be small amount
a twitter user and his/er followers reference as a user may have millions of followers
Better entity diagram:
- movie has between 1 and 1000 actors
- has 2 sets of financials
- has between 0 and 100 reviews
- with 30 reviews being the norm
One to Many
embed on the side of the most queried collection
One to many using references
- array of references
- cascade deletes must be handled by application
- use referencing when the associated documents are not always needed with the most queried documents
one to many exmaple
here when i delete or update a store, no need to handle the zip codes since they’re kept in the zips collection
Join operations like SQL joins with $ operator
One to Zillions
Document this relationship in the code as it requires special handling
Mongodb Patterns
embed address with orders as the shipping address does not changed for shipped and delivered orders so duplication ok
ok to embed actors with movies as its unlikely to change so duplication ok
how to maintain this sum? maybe data can be stale and some application process updates the gross sum
foreign keys and cascading deletes not supported
Wildcard indexes in mongodb 4.2 may replace the need for indexes on the third section.
Note that wildcard indexes may remove the need to use the pattern below
What if i wanted to know the releases between two dates?
Extended reference pattern
Here we query orders than we do orders per customer
Subset pattern
Lab data attribute pattern
The purpose of this lab is complete the Pattern Attribute file in the Pattern-Attribute editor tab and then to run thevalidate_m320 in the tab mongodb to ensure this is the correct solution for the lab. The lab covers theAttribute pattern and converting our existing schema to apply this pattern.
User Story
The museum we work at has grown from a local attraction to one that is seen as saving very popular items. For this reason, other museums in the World have started exchanging pieces of art with our museum. Our database was tracking if our pieces are on display and where they are in the museum. To track the pieces we started exchanging with other museum, we added an array called events, in which we created an entry for each date a piece was loaned and the museum it was loaned to.
This schema has a problem where each time we start a exchange with a new museum, we must then add a new index. For example, when we started working with The Prado in Madrid, we added this index: { “events.prado” : 1 } Please review the Problem Schema and the Problem Document editor tabs to help understand the old schema and how we need to change the schema to apply the pattern.
Tasks
To address this issue, you will need to change the schema to:
use a single index on all event dates. transform the field that tracks the date when a piece was acquired, date_acquisition, so that it is also indexed with the values above.
Problem schema
1
2
3
4
5
6
7
8
9
10
11
12
13
{
"_id": "<objectId>",
"title": "<string>",
"artist": "<string>",
"date_acquisition": "<date>",
"location": "<string>",
"on_display": "<bool>",
"in_house": "<bool>",
"events": [{
"moma": "<date>",
"louvres": "<date>"
}]
}
Problem document
1
2
3
4
5
6
7
8
9
10
11
12
13
{
"_id": ObjectId("5c5348f5be09bedd4f196f18"),
"title": "Cookies in the sky",
"artist": "Michelle Vinci",
"date_acquisition": ISODate("2017-12-25T00:00:00.000Z"),
"location": "Blue Room, 20A",
"on_display": false,
"in_house": false,
"events": [{
"moma": ISODate("2019-01-31T00:00:00.000Z"),
"louvres": ISODate("2020-01-01T00:00:00.000Z")
}]
}
Answer schema
1
2
3
4
5
6
7
8
9
10
11
12
{
"_id": "<objectId>",
"title": "<string>",
"artist": "<string>",
"location": "<string>",
"on_display": "<bool>",
"in_house": "<bool>",
"events": [{
"k": "<string>",
"v": "<date>"
}]
}
Subset pattern lab
User Story
You are the lead developer for an online organic recycled clothing store.
Due to the growing number of environmentally-conscious consumers, our store’s inventory has increased exponentially. We now also have an increasingly large pool of makers and suppliers.
We recently found that our shopping app is getting slower due to the fact that the frequently-used documents can no longer all fit in RAM. This is happening largely due to having all product reviews, questions, and specs stored in the same document, which grows in size as reviews and ratings keep coming in.
To resolve this issue, we want to reduce the amount of data immediately available to the user in the app and only load additional data when the user asks for it.
Problem schema
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
{
"_id": "<objectId>",
"item_code": "<string>",
"name": "<string>",
"maker_brand": "<string>",
"price": "<decimal>",
"description": "<string>",
"materials": ["<string>"],
"country": "<string>",
"image": "<string>",
"available_sizes": {
"mens": ["<string>"],
"womens": ["<string>"]
},
"package_weight_kg": "<decimal>",
"average_rating": "<decimal>",
"reviews": [{
"author": "<string>",
"text": "<string>",
"rating": "<int>"
}],
"questions": [{
"author": "<string>",
"text": "<string>",
"likes": "<int>"
}],
"stock_amount": "<int>",
"maker_address": {
"building_number": "<string>",
"street_name": "<string>",
"city": "<string>",
"country": "<string>",
"postal_code": "<string>"
},
"makers": ["<string>"]
}
Problem document
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
{
"_id": ObjectId("5c9be463f752ec6b191c3c7e"),
"item_code": "AS45OPD",
"name": "Recycled Kicks",
"maker_brand": "Shoes From The Gutter",
"price": 100.00,
"description": "These amazing Kicks are made from recycled plastics and
fabrics.They come in a variety of sizes and are completely unisex in design.
If your feet don't like them within the first 30 days, we'll return your
money no questions asked.
",
"materials": [
"recycled cotton",
"recycled plastic",
"recycled food waste",
],
"country": "Russia",
"image": "https:///www.shoesfromthegutter.com/kicks/AS45OPD.img",
"available_sizes": {
"mens": ["5", "6", "8", "8W", "10", "10W", "11", "11W", "12", "12W"],
"womens": ["5", "6", "7", "8", "9", "10", "11", "12"]
},
"package_weight_kg": 2.00,
"average_rating": 4.8,
"reviews": [{
"author": "i_love_kicks",
"text": "best shoes ever! comfortable, awesome colors and design!",
"rating": 5
},
{
"author": "i_know_everything",
"text": "These shoes are no good because I ordered the wrong size.",
"rating": 1
},
"..."
],
"questions": [{
"author": "i_love_kicks",
"text": "Do you guys make baby shoes?",
"likes": 1223
},
{
"author": "i_know_everything",
"text": "Why do you make shoes out of garbage?",
"likes": 0
},
"..."
],
"stock_amount": 10000,
"maker_address": {
"building_number": 7,
"street_name": "Turku",
"city": "Saint-Petersburg",
"country": "RU",
"postal_code": 172091
},
"makers": ["Ilya Muromets", "Alyosha Popovich", "Ivan Grozniy", "Chelovek Molekula"],
}
Answer document
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
{
"_id": ObjectId("5c9be463f752ec6b191c3c7e"),
"item_code": "AS45OPD",
"name": "Recycled Kicks",
"maker_brand": "Shoes From The Gutter",
"price": 100.00,
"description": "These amazing Kicks are made from recycled plastics and
fabrics.They come in a variety of sizes and are completely unisex in design.
If your feet don't like them within the first 30 days, we'll return your
money no questions asked.
",
"materials": [
"recycled cotton",
"recycled plastic",
"recycled food waste",
],
"country": "Russia",
"image": "https:///www.shoesfromthegutter.com/kicks/AS45OPD.img",
"available_sizes": {
"mens": ["5", "6", "8", "8W", "10", "10W", "11", "11W", "12", "12W"],
"womens": ["5", "6", "7", "8", "9", "10", "11", "12"]
},
"package_weight_kg": 2.00,
"average_rating": 4.8,
"top_five_reviews": [{
"author": "i_love_kicks",
"text": "best shoes ever! comfortable, awesome colors and design!",
"rating": 5
},
{
"author": "i_know_everything",
"text": "These shoes are no good because I ordered the wrong size.",
"rating": 1
},
"..."
],
"top_five_questions": [{
"author": "i_love_kicks",
"text": "Do you guys make baby shoes?",
"likes": 1223
},
{
"author": "i_want_to_know_everything",
"text": "How are these shoes made?",
"likes": 1120
},
"..."
],
"stock_amount": 10000,
}
Answer schema
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
{
"_id": "<objectId>",
"item_code": "<string>",
"name": "<string>",
"maker_brand": "<string>",
"price": "<decimal>",
"description": "<string>",
"materials": ["<string>"],
"country": "<string>",
"image": "<string>",
"available_sizes": {
"mens": ["<string>"],
"womens": ["<string>"]
},
"package_weight_kg": "<decimal>",
"average_rating": "<decimal>",
"top_five_reviews": [{
"author": "<string>",
"text": "<string>",
"rating": "<int>"
}],
"top_five_questions": [{
"author": "<string>",
"text": "<string>",
"likes":"<int>"
}],
"stock_amount": "<int>"
}