We also loaded the data into an in-memory Pandas DataFrame in seconds. The performance and ease of working with Pandas quickly became obvious, beating the traditional SQL approach outright. Checkmate!

UI for MongoDB

Database UI tools

As with all things, your approach to the task at hand can make things easier or harder for yourself.
You can use the command line in MongoDB right from the shell, or you can work with a User Interface.
Our toolbar, left, shows us using Compass, Robo3T, and Studio3T as UI tools.
When we have a script, we prefer the command line.
When we are designing our approach or exploring the data, we use a UI, and mostly that is the Compass UI.
Loading a 1.1GB CSV file into MongoDB

Both Compass and Studio3T offer an import routine, as does the MongoDB shell. A few attempts to load the big file with Compass failed; Studio3T ingested the data in 1 minute 48 seconds.
Importing data from a disk file requires the common source to target mapping.
Stage 1 — setting up the import source in Studio3T. Stage 2 — mapping the source to the target destination.
Similar to the DDL requirement for Postgres and other SQL databases, you do need to define the correct data types in the Import options during Stage 2.
Working with Studio3T

In Studio3T, double-click on the Type column (not really intuitive) and select the correct field type from the drop-down menu.
The dataset, loans.csv, contains a matrix of 2.26m rows x 145 columns.
Working through 145 columns to decide the data type is a little tedious, but you can do this activity up front or as cleaning scripts later.
Python Pandas does a better job here, usually inferring the correct data type from the input.
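To illustrate that inference, here is a minimal sketch using a tiny stand-in for loans.csv (the column names mirror the real data set, but the values are invented):

```python
import io

import pandas as pd

# A tiny stand-in for loans.csv: pandas infers int64, float64 and object
# dtypes from the values, so we rarely need to declare 145 types by hand.
csv = io.StringIO(
    "loan_amnt,int_rate,grade\n"
    "2500,13.56,C\n"
    "30000,18.94,D\n"
)
df = pd.read_csv(csv)
print(df.dtypes)

# When inference is not enough, we can still pin a column explicitly,
# much like a DDL column definition:
csv.seek(0)
df2 = pd.read_csv(csv, dtype={"loan_amnt": "float64"})
```

The explicit `dtype=` mapping is the Pandas analogue of the field types we set one by one in Studio3T's import dialog.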
If your organization already has a database such as Cassandra or MongoDB in place, then you won’t even need to worry about importing data.
Even for one-off exercises, I tend to use MongoDB as a temporary storage mechanism as opposed to, say, the Pickle serialization approach.
Organizing a DDL and importing into Postgres was a 30-minute exercise.
The actual data loading took over 8 minutes.
Loading the file into MongoDB with Studio3T is more like 10 minutes with the time lost in deciding the correct field types one by one.
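For scripted loads, the mongoimport command-line tool is an alternative to the UI importers. A minimal sketch, assuming a local MongoDB server and that loans.csv sits in the working directory (the database and collection names are our choice):

```shell
# Import the CSV, taking field names from the header row.
mongoimport --db lending --collection loans \
  --type csv --headerline --file loans.csv
```

mongoimport also supports typed headers via the `--columnsHaveTypes` option, which is one way to avoid the one-by-one type selection described above.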
Exploring the data set

With Python Pandas, using an in-memory approach, we could issue the df.describe() or df.info() method calls. We did find that 1.1GB grew to 2.4GB with Pandas, which would probably cause memory-overflow issues on weaker off-the-shelf workstations.
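That 1.1GB-to-2.4GB blow-up comes largely from 64-bit default dtypes and object columns. A small sketch (toy column, not the real data) of how to measure and reduce the footprint with downcasting:

```python
import numpy as np
import pandas as pd

# Toy stand-in for one loans.csv column; real values range 0-23,
# yet pandas stores them as 64-bit integers by default.
df = pd.DataFrame({"acc_open_past_24mths": np.arange(1000, dtype="int64")})
before = df.memory_usage(deep=True).sum()

# Downcast to the smallest sufficient integer type to reclaim memory.
df["acc_open_past_24mths"] = pd.to_numeric(
    df["acc_open_past_24mths"], downcast="integer"
)
after = df.memory_usage(deep=True).sum()
print(before, after)
```

Applied across 145 columns, this kind of downcasting can claw back a large share of the in-memory overhead.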
Exploit Compass to analyse the Schema for you!

Compass has a feature called Analyze, which is sort of the equivalent of issuing the Analyze command in SQL with Postgres, if you will allow me that analogy.

analyse loans;

With the Analyse Schema operation finished, navigate to the Schema tab. Isn’t that wonderful, and without considerable effort?
We can see that the field ‘acc_open_past_24mths’ ranges from min: 0 to max: 23, along with the shape of the distribution of the values.
Each of the 145 fields is thoroughly analysed and ready for us to examine them.
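The same min/max summary that the Schema tab computes can be expressed as an aggregation pipeline. A hedged sketch, where the pipeline is ours and the connection details are hypothetical:

```python
# Min/max summary for one field, equivalent in spirit to what the
# Compass Schema tab reports for 'acc_open_past_24mths'.
summary_pipeline = [
    {
        "$group": {
            "_id": None,
            "min": {"$min": "$acc_open_past_24mths"},
            "max": {"$max": "$acc_open_past_24mths"},
        }
    }
]

# With a live server you would run something like (hypothetical names):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# result = list(client["lending"]["loans"].aggregate(summary_pipeline))
```

This is handy when you want the same numbers Compass shows, but from a script rather than the UI.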
Summaries, structuring, and advanced grouping or groupby()

The fun doesn’t stop at just a bit of exploration. Both Compass and Studio3T offer Map-Reduce, Aggregate, and a lot of other beneficial predefined operations.
A screenshot of the Studio3T menu bar.
Here is an aggregate operation, we defined earlier, which was fun to put together.
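To give a flavour of such an operation (this is an illustrative sketch, not the exact pipeline from our script), a $group stage is MongoDB's analogue of the Pandas groupby() mentioned above:

```python
import pandas as pd

# A $group stage averaging loan amount per grade; field names mirror the
# loans data set, but the pipeline itself is an illustrative assumption.
pipeline = [
    {"$group": {"_id": "$grade", "avg_loan": {"$avg": "$loan_amnt"}}},
    {"$sort": {"_id": 1}},
]

# The same summary in Pandas, on a toy frame:
df = pd.DataFrame({"grade": ["A", "A", "B"], "loan_amnt": [1000, 3000, 2000]})
by_grade = df.groupby("grade")["loan_amnt"].mean()
print(by_grade)
```

Both Compass and Studio3T let you build such pipelines stage by stage in the UI before exporting them to code.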
Example output from our aggregate script above

Cleaning

OK, you are saying, what about the cleaning? A Data Scientist, allegedly, spends 80% of their time getting, shaping, cleaning, and exploring the data. What about the cleaning, you ask? Enter validation!

MongoDB Compass — validation tab

Here is one we prepared earlier! An example of running the validation rules we developed.
Adding Document Validation Rules Using MongoDB Compass 1.5, by Andrew Morgan on Databases, extends our work and provides a mechanism to select the documents which fail validation, and shows how to fix them using $nor. Again, using Compass we can explore our data set, validate the Schema, fix or impute missing values, and run complex map-reduce operations from the UI, without significant pain or a powerful mainframe under our desks.
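A minimal sketch of such a validation rule, built as a $jsonSchema validator; the rule itself (a positive loan amount) and the database/collection names are our assumptions, not the exact rules from the screenshots:

```python
# Document-validation rule in the style of the Compass Validation tab:
# every document must carry a non-negative numeric loan_amnt.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["loan_amnt"],
        "properties": {
            "loan_amnt": {"bsonType": ["int", "double"], "minimum": 0},
        },
    }
}

# Hypothetical application against a live server:
# db.command("collMod", "loans", validator=validator)

# Documents failing validation can then be selected with $nor,
# the trick described in Andrew Morgan's post:
failing_filter = {"$nor": [validator]}
```

Running a find() with `failing_filter` returns exactly the documents that would be rejected, ready for fixing or imputation.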
MongoDB and Mongo Compass offer an excellent approach to the initial exploration of data sets.

Photo by Chris Barbalis on Unsplash

Indeed, we have shown how powerful and extremely useful MongoDB is.
If there is one small, tiny, cloud in an otherwise ‘Sky Blue’ arrangement, it is the learning curve.
You need to go to MongoDB University and practice hard.
Instead of moving your data, why not bring your analytical skills to where the information lives? Take advantage of the tools illustrated here and, I am quite sure, with a bit of study you can figure out most things. Go do it! Forget in-memory approaches: go do it the MongoDB, CouchDB, Cloudant way.