Data cleaning in Python: some examples from cleaning Airbnb data

Left: the Shard (source: https://www.visitbritainshop.com/world/the-view-from-the-shard/).

Right: Dulwich high street (source: https://de.wikipedia.org/wiki/Dulwich_(London)).

I also experimented with using latitude and longitude instead of borough in order to get more fine-grained results, but as a future blog post will show, it was not entirely successful.

Amenities (so very many amenities)

In the dataset from Insideairbnb.com, amenities were stored as one big block of text. In order to figure out what the various options were and which listings had them, I first made a giant string of all the amenities values, tidied it up a bit, split out the individual amenities separated by commas, and created a set of the resultant list (fortunately the dataset was small enough to allow this, but I would have needed a more efficient way to do this with a much larger dataset).

Here’s a list of all the amenities it is possible to have:

'24-hour check-in', 'Accessible-height bed', 'Accessible-height toilet', 'Air conditioning', 'Air purifier', 'Alfresco bathtub', 'Amazon Echo', 'Apple TV', 'BBQ grill', 'Baby bath', 'Baby monitor', 'Babysitter recommendations', 'Balcony', 'Bath towel', 'Bathroom essentials', 'Bathtub', 'Bathtub with bath chair', 'Beach essentials', 'Beach view', 'Beachfront', 'Bed linens', 'Bedroom comforts', 'Bidet', 'Body soap', 'Breakfast', 'Breakfast bar', 'Breakfast table', 'Building staff', 'Buzzer/wireless intercom', 'Cable TV', 'Carbon monoxide detector', 'Cat(s)', 'Ceiling fan', 'Ceiling hoist', 'Central air conditioning', 'Changing table', "Chef's kitchen", 'Children’s books and toys', 'Children’s dinnerware', 'Cleaning before checkout', 'Coffee maker', 'Convection oven', 'Cooking basics', 'Crib', 'DVD player', 'Day bed', 'Dining area', 'Disabled parking spot', 'Dishes and silverware', 'Dishwasher', 'Dog(s)', 'Doorman', 'Double oven', 'Dryer', 'EV charger', 'Electric profiling bed', 'Elevator', 'En suite bathroom', 'Espresso machine', 'Essentials', 'Ethernet connection', 'Exercise equipment', 'Extra pillows and blankets', 'Family/kid friendly', 'Fax machine', 'Fire extinguisher', 'Fire pit', 'Fireplace guards', 'Firm mattress', 'First aid kit', 'Fixed grab bars for shower', 'Fixed grab bars for toilet', 'Flat path to front door', 'Formal dining area', 'Free parking on premises', 'Free street parking', 'Full kitchen', 'Game console', 'Garden or backyard', 'Gas oven', 'Ground floor access', 'Gym', 'HBO GO', 'Hair dryer', 'Hammock', 'Handheld shower head', 'Hangers', 'Heat lamps', 'Heated floors', 'Heated towel rack', 'Heating', 'High chair', 'High-resolution computer monitor', 'Host greets you', 'Hot tub', 'Hot water', 'Hot water kettle', 'Indoor fireplace', 'Internet', 'Iron', 'Ironing Board', 'Jetted tub', 'Keypad', 'Kitchen', 'Kitchenette', 'Lake access', 'Laptop friendly workspace', 'Lock on bedroom door', 'Lockbox', 'Long term stays allowed', 'Luggage dropoff allowed', 'Memory foam mattress', 'Microwave', 'Mini fridge', 'Mobile hoist', 'Mountain view', 'Mudroom', 'Murphy bed', 'Netflix', 'Office', 'Other', 'Other pet(s)', 'Outdoor kitchen', 'Outdoor parking', 'Outdoor seating', 'Outlet covers', 'Oven', 'Pack ’n Play/travel crib', 'Paid parking off premises', 'Paid parking on premises', 'Patio or balcony', 'Pets allowed', 'Pets live on this property', 'Pillow-top mattress', 'Pocket wifi', 'Pool', 'Pool cover', 'Pool with pool hoist', 'Printer', 'Private bathroom', 'Private entrance', 'Private gym', 'Private hot tub', 'Private living room', 'Private pool', 'Projector and screen', 'Propane barbeque', 'Rain shower', 'Refrigerator', 'Roll-in shower', 'Room-darkening shades', 'Safe', 'Safety card', 'Sauna', 'Security system', 'Self check-in', 'Shampoo', 'Shared gym', 'Shared hot tub', 'Shared pool', 'Shower chair', 'Single level home', 'Ski-in/Ski-out', 'Smart TV', 'Smart lock', 'Smoke detector', 'Smoking allowed', 'Soaking tub', 'Sound system', 'Stair gates', 'Stand alone steam shower', 'Standing valet', 'Steam oven', 'Step-free access', 'Stove', 'Suitable for events', 'Sun loungers', 'TV', 'Table corner guards', 'Tennis court', 'Terrace', 'Toilet paper', 'Touchless faucets', 'Walk-in shower', 'Warming drawer', 'Washer', 'Washer / Dryer', 'Waterfront', 'Well-lit path to entrance', 'Wheelchair accessible', 'Wide clearance to bed', 'Wide clearance to shower', 'Wide doorway', 'Wide entryway', 'Wide hallway clearance', 'Wifi', 'Window guards', 'Wine cooler', 'toilet'

In the list above, some amenities are more important than others (e.g. a balcony is more likely to increase price than a fax machine), and some are likely to be fairly uncommon (e.g. ‘Electric profiling bed’).
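The set-building step described above (join every amenities value into one string, tidy it, split on commas, take the set) can be sketched in a few lines of plain Python. This is a minimal sketch rather than the code from the post: the sample strings, the brace-and-quote format they use, and the helper name `unique_amenities` are all assumptions made for illustration.

```python
# Hypothetical sample values; the exact format of the real column is an assumption.
raw_amenities = [
    '{TV,"Cable TV",Wifi,Kitchen}',
    '{Wifi,Kitchen,"Hair dryer"}',
    '{}',
]


def unique_amenities(values):
    """Return the set of distinct amenity names found across all listings."""
    # Make one giant string out of every listing's amenities text.
    combined = ",".join(values)
    # Tidy up: drop the braces and double quotes that wrap the names.
    for ch in '{}"':
        combined = combined.replace(ch, "")
    # Split on commas, trim whitespace, and discard empty fragments.
    return {item.strip() for item in combined.split(",") if item.strip()}


amenity_set = unique_amenities(raw_amenities)
print(sorted(amenity_set))
# ['Cable TV', 'Hair dryer', 'Kitchen', 'TV', 'Wifi']
```

Building one large string keeps the code short, but it holds the whole column in memory at once, which is the scaling concern raised above. Updating a set one listing at a time would avoid that.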

Based on previous experience in the industry, and further research into which amenities guests consider more important, a selection of the more important amenities was extracted.
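One way to extract such a selection is to turn each listing's amenities text into a handful of 0/1 flags, with related amenities grouped under one flag. The sketch below is illustrative only: the grouping in `AMENITY_GROUPS`, the flag names, the helper names, and the input format are assumptions, not the groupings actually used in the project.

```python
# Hypothetical grouping of raw amenity names into model features.
AMENITY_GROUPS = {
    "balcony": {"Balcony", "Patio or balcony"},
    "white_goods": {"Washer", "Dryer", "Dishwasher", "Washer / Dryer"},
    "parking": {
        "Free parking on premises",
        "Free street parking",
        "Paid parking on premises",
        "Paid parking off premises",
    },
}


def parse_amenities(text):
    """Split one listing's amenities text into a set of exact amenity names."""
    cleaned = text.strip("{}").replace('"', "")
    return {item.strip() for item in cleaned.split(",") if item.strip()}


def amenity_flags(text, groups=AMENITY_GROUPS):
    """Return a 0/1 flag per group: 1 if the listing has any amenity in it."""
    present = parse_amenities(text)
    return {flag: int(bool(present & names)) for flag, names in groups.items()}


# Assumed input format: comma-separated names wrapped in braces.
listing = '{Wifi,"Hair dryer",Dishwasher,"Free street parking"}'
flags = amenity_flags(listing)
print(flags)
# {'balcony': 0, 'white_goods': 1, 'parking': 1}
```

Matching on exact names rather than substrings is a deliberate choice here: a substring search for "Dryer" would also match "Hair dryer" and wrongly flag the listing as having white goods.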

A subset of these was then chosen for inclusion in the final model, depending on how sparse the data for each amenity was.

For example, if almost all properties have (or almost none have) a particular amenity, that feature will not be very useful in differentiating between listings or helping explain differences in prices.
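A sparsity check of this kind can be sketched as follows, using the 90% cut-off that this post applies. It is a minimal sketch under assumptions: listings are represented as plain dictionaries of 0/1 flags, and the function name `balanced_flags` and the sample data are hypothetical.

```python
def balanced_flags(rows, flag_names, threshold=0.9):
    """Return the flags that are neither nearly always 1 nor nearly always 0.

    A flag is dropped when more than `threshold` of the listings have it,
    or when more than `threshold` of the listings lack it.
    """
    if not rows:
        return []
    kept = []
    for name in flag_names:
        share_with = sum(row[name] for row in rows) / len(rows)
        share_without = 1 - share_with
        if share_with <= threshold and share_without <= threshold:
            kept.append(name)
    return kept


# Hypothetical data: 10 listings, 4 with a balcony, all with wifi, none with a fax.
listings = [
    {"balcony": int(i < 4), "wifi": 1, "fax_machine": 0}
    for i in range(10)
]
kept = balanced_flags(listings, ["balcony", "wifi", "fax_machine"])
print(kept)
# ['balcony']
```

Here `wifi` is dropped because every listing has it and `fax_machine` because none does, so neither could help explain price differences in this sample.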

The whole convoluted code for this can be found on GitHub; its final section removes columns where over 90% of the listings either had or did not have a particular amenity.

These are the amenities that I ended up keeping:

- Balcony
- Bed linen
- Breakfast
- TV
- Coffee machine
- Basic cooking equipment
- White goods (specifically a washer, dryer and/or dishwasher)
- Child-friendly
- Parking
- Outdoor space
- Greeted by host
- Internet
- Long term stays allowed
- Pets allowed
- Private entrance
- Safe or security system
- Self check-in

Summary

After these (and many other) cleaning and pre-processing steps, the Airbnb data was in a suitable form to begin exploration and modelling, and in future I’ll be writing more about this.

If you found this post interesting or helpful, please let me know via the medium of claps and/or comments, and you can follow me in order to be notified about future posts.

Thanks for reading!
