# Data Analysis on Restaurants in Downtown Brooklyn

During my time in downtown Brooklyn, one of the things that drove me mad was looking at restaurant reviews. Friends and I would decide to go to a restaurant, and we’d look at the ratings. When we did, we’d get an aggregate rating of 3.5 or 4. But what does that mean? It really doesn’t mean anything unless we understand the distribution of the data.

It drove me nuts. I had to find out the average, and what the review landscape looked like.

So I did. I cobbled together a program in Python using Scrapy, Pandas, and Matplotlib. I would have left out Sir Scrapy, but the review website had this random feature where if you queried a restaurant for all its reviews, it would give you 3. Grr…

Process:

I used a certain restaurant review website’s own filters to home in on downtown Brooklyn within a one-mile radius. The website’s API request gave me two longitude/latitude pairs. They are shown below:

Interesting. So the two points are either the endpoints of a circle’s diameter or two opposite corners of a rectangle. I think it’s a rectangle, so we’re going to refer to this area as the rectangle.
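
If the two pairs really are opposite corners of a rectangle, checking whether a restaurant falls inside it takes only a few lines. This is a minimal sketch under that assumption; the `BoundingBox` class is a hypothetical helper and the coordinates are placeholders, not the values the website returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned rectangle given by two opposite (lat, lon) corners."""
    lat1: float
    lon1: float
    lat2: float
    lon2: float

    def contains(self, lat: float, lon: float) -> bool:
        # Sort each axis so the corners can be supplied in any order.
        south, north = sorted((self.lat1, self.lat2))
        west, east = sorted((self.lon1, self.lon2))
        return south <= lat <= north and west <= lon <= east


# Placeholder corners for illustration only; NOT the website's values.
box = BoundingBox(40.685, -73.995, 40.700, -73.975)
print(box.contains(40.6925, -73.9850))  # a point between the two corners
print(box.contains(40.7500, -73.9850))  # a point north of the rectangle
```

If the pairs turned out to be the endpoints of a circle’s diameter instead, the same check would need a distance calculation rather than two range comparisons.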

This search query gave me a starting point for all the restaurants, but each results page displayed only 10 or 20 of them. I scraped them, and then used a program to move to the next page. I stored the set of restaurants in a directory. Then, for every restaurant, I automated an HTTP request to its main page, which was the anchor point for that restaurant’s set of reviews. I went through all those reviews and collected them too.
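
The "scrape a page, then move to the next one" loop can be sketched roughly as follows. This is not the Scrapy code I used; it is a plain-Python illustration in which `crawl_paginated`, the page format, and the fake URLs are all made up, and `fetch` stands in for whatever actually performs the HTTP request and parses the response.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical page shape: {"items": [...], "next": next-url-or-None}.
Page = Dict[str, Any]


def crawl_paginated(start_url: str, fetch: Callable[[str], Page],
                    max_pages: int = 1000) -> List[str]:
    """Follow "next" links from start_url and collect every item."""
    items: List[str] = []
    seen = set()
    url: Optional[str] = start_url
    # Stop on the last page, on a link loop, or at the page cap.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        items.extend(page["items"])
        url = page.get("next")
    return items


# Stand-in for real HTTP requests: three fake result pages.
FAKE_SITE = {
    "/search?page=1": {"items": ["A", "B"], "next": "/search?page=2"},
    "/search?page=2": {"items": ["C", "D"], "next": "/search?page=3"},
    "/search?page=3": {"items": ["E"], "next": None},
}

restaurants = crawl_paginated("/search?page=1", FAKE_SITE.__getitem__)
print(restaurants)
```

The same loop shape works for the second stage: start at each restaurant’s main page and follow its review pages until there is no next link.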

Results:

Number of restaurants in the bounding box: ~500.

Number of restaurant reviews: ~70,000.

The average of all review ratings: 3.70

Standard deviation of all ratings: 1.32
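
For reference, these summary numbers are cheap to compute once the ratings are in a list. The sketch below uses Python’s standard `statistics` module so it runs without extra packages, and the ratings are made up for illustration, not taken from the scraped data. It uses the population standard deviation; with ~70,000 reviews the difference from the sample version is negligible.

```python
from collections import Counter
from statistics import mean, pstdev

# Made-up star ratings for illustration; NOT the scraped data.
ratings = [5, 4, 4, 5, 3, 1, 5, 4, 2, 5]

print(f"count: {len(ratings)}")
print(f"mean:  {mean(ratings):.2f}")
print(f"std:   {pstdev(ratings):.2f}")  # population standard deviation

# Number of reviews per star value: the data behind a ratings histogram.
for stars, n in sorted(Counter(ratings).items()):
    print(stars, n)
```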

Graphs:

So people in Brooklyn tend to rate 4s and 5s much more often than 1s and 2s. I wonder: if you did this for every city, could you gauge a friendliness metric for each one, and see if it correlates with the happiness ranking of every country? That would be AMAZING.

This is what you get if you take the average rating of each restaurant, round it to the nearest 0.5, and plot the results.

This is simply the number of reviews for each restaurant, labeled by restaurant name. It seems to follow some power-law distribution, but I’m not quite sure. It may just be one outlier.
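
Rounding to the nearest 0.5 has one small trap in Python: the built-in `round` sends ties to the nearest even number, so `round(8.5)` is 8 and a 4.25 average would land in the 4.0 bucket. Here is a minimal sketch that rounds ties upward instead. `round_to_half` is a hypothetical helper, and which tie rule the original plot used is not something I recorded.

```python
import math
from collections import Counter


def round_to_half(x: float) -> float:
    """Round to the nearest 0.5, with ties going up (4.25 -> 4.5)."""
    return math.floor(x * 2 + 0.5) / 2


# A few restaurant averages taken from the highest-rated list in this post.
averages = [4.952380952, 4.75, 4.625, 4.396930946, 4.298634812]

# Count how many restaurants fall into each 0.5-wide bucket.
buckets = Counter(round_to_half(a) for a in averages)
for bucket, n in sorted(buckets.items()):
    print(bucket, n)
```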

Top 20 restaurants by review count:

I guess this can be interpreted as popularity or to some degree how much people care. One can perhaps use this information to extrapolate the length of time the restaurants have been around.

| Restaurant | Number of Reviews |
| --- | --- |
| Grimaldi’s Pizzeria | 4440 |
| Juliana’s Pizza | 1955 |
| Junior’s Restaurant | 1573 |
| Joya | 1259 |
| Rocco’s Tacos & Tequila Bar | 1196 |
| The River Café | 1038 |
| Habana Outpost | 962 |
| Shake Shack | 788 |
| Mile End Delicatessen Brooklyn | 744 |
| Clover Club | 728 |
| Yaso Tangbao | 703 |
| Ki Sushi | 671 |
| Hanco’s | 658 |
| Vinegar Hill House | 634 |
| Alamo Drafthouse Cinema Downtown Brooklyn | 621 |
| Forno Rosso | 604 |
| Two 8 Two Bar & Burger | 603 |
| Dekalb Market Hall | 589 |
| Bedouin Tent | 586 |
| Sottocasa Pizzeria | 586 |
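
One rough way to poke at the power-law guess is to fit a straight line to log(review count) against log(rank), using the 20 counts listed above. This is only a sketch: `loglog_slope` is a hypothetical helper, 20 points out of ~500 restaurants is a thin sample, and a straight-looking log-log line is consistent with a power law without proving one. Fitting once with and once without the top entry gives a feel for how much that single restaurant drives the shape.

```python
import math

# Review counts of the top 20 restaurants, copied from the list above.
counts = [4440, 1955, 1573, 1259, 1196, 1038, 962, 788, 744, 728,
          703, 671, 658, 634, 621, 604, 603, 589, 586, 586]


def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank), ranks from 1."""
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


print(f"slope with the top entry:    {loglog_slope(counts):.2f}")
# Dropping the top entry re-ranks the remaining 19 restaurants from 1.
print(f"slope without the top entry: {loglog_slope(counts[1:]):.2f}")
```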

Top 50 restaurants by average review rating (ignore the straight 5s; they probably have an insufficient number of reviews. Perhaps I will add the review count next to the rating later…)

| Restaurant | Average Rating |
| --- | --- |
| Moshman Dental | 5 |
| Pipitone’s Pizza | 5 |
| Bird’s Eye Vietnamese | 5 |
| VALENTINE’S CAFE | 5 |
| First Wok | 5 |
| New Fresco Tortilla Plus | 5 |
| Smith Gourmet Deli | 4.952380952 |
| Thai on Wheels | 4.888888889 |
| Lillo Cucina Italiana | 4.870588235 |
| GMC Temaxcal Deli & Grocery | 4.833333333 |
| Simple NYC-Downtown Brooklyn | 4.80952381 |
| Yumpling Food Truck | 4.790697674 |
| Cafe Gitane | 4.75 |
| Sunny Delicatessen | 4.75 |
| Pret A Manger | 4.666666667 |
| Ashland | 4.625 |
| Govinda’s Vegetarian | 4.621118012 |
| Dariush Persian Cuisine | 4.580645161 |
| Grand Canyon Restaurant | 4.577777778 |
| dot & line | 4.52173913 |
| Rice & Miso | 4.519230769 |
| dumboLUV | 4.5 |
| Kazi Halal | 4.5 |
| Saint Julivert | 4.5 |
| Yemen Cafe & Restaurant | 4.452173913 |
| E-bite | 4.444444444 |
| Sanpanino | 4.444444444 |
| ACE Thai Kitchen | 4.414634146 |
| Sushi Gallery | 4.4140625 |
| Bread & Spread | 4.412698413 |
| Lavatera Grill | 4.409090909 |
| Forcella Fried Pizza | 4.407407407 |
| Chicks Isan | 4.404761905 |
| Doner Kebab NYC | 4.401360544 |
| Mr. Fulton | 4.4 |
| W XYZ Bar | 4.4 |
| Yossi’s Cart | 4.4 |
| Juliana’s Pizza | 4.396930946 |
| Shawarma & Grill | 4.375 |
| Makina Cafe | 4.375 |
| Koji Izakaya | 4.358974359 |
| Daigo Handroll Bar | 4.333333333 |
| Metro Buffet | 4.333333333 |
| Warung Roadside | 4.333333333 |
| Taiki | 4.327586207 |
| Sultan Restaurant & Cafe Lounge | 4.3125 |
| Espresso Me | 4.306122449 |
| Piz-zetta | 4.305220884 |
| Downtown Natural Market | 4.304347826 |
| Sottocasa Pizzeria | 4.298634812 |

To-do list:

• I need to ask someone who is really knowledgeable in statistics, or do research on whether there’s a law that relates a restaurant’s number of reviews to its true rating. What should it converge on? For example, a restaurant with 1000 reviews and a 3.5 should be weighted differently from a restaurant with 10 reviews and a 3.5.
• It would be interesting to plot the average words per review or aggregate words per restaurant.
• What would be the most commonly used and the least commonly used words? I want to run the reviews through a basic NLP program to stem the words, remove stop words, etc.
• What if we took a list of words, gave each one a positive or negative numerical value, and averaged those values out? Would the rankings be different?
• What if you plot out the datestamps of the reviews, and use some metric for happiness/economic activity, and see if there’s a correlation between that and the stock market or the consumer sentiment index? Would there be a correlation?
• See what another review service from a search engine’s data looks like.
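
On the first to-do item, one common approach I have since come across is a shrinkage estimate, often called a Bayesian average: pull each restaurant’s average toward the overall mean, with a pull that weakens as the review count grows. The sketch below is one possible way to weight ratings, not a settled answer. `shrunk_rating` is a hypothetical helper, `prior_mean` is the 3.70 overall average from the results above, and `prior_weight=25` is an arbitrary choice that would need tuning.

```python
def shrunk_rating(avg: float, n_reviews: int,
                  prior_mean: float = 3.70,
                  prior_weight: float = 25) -> float:
    """Pull an average toward prior_mean; fewer reviews means a stronger pull."""
    total = prior_weight * prior_mean + n_reviews * avg
    return total / (prior_weight + n_reviews)


print(round(shrunk_rating(3.5, 1000), 2))  # 3.5  (1000 reviews: barely moves)
print(round(shrunk_rating(3.5, 10), 2))    # 3.64 (10 reviews: pulled toward 3.70)
print(round(shrunk_rating(5.0, 3), 2))     # 3.84 (a perfect 5 from 3 reviews)
```

Sorting by this score instead of the raw average would push the perfect 5s with only a handful of reviews down the list, which is the effect the to-do item is after.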

Examining all this data is a lot of fun! But for now, my experiments are on hiatus. There are already enough things to read and build! But if anyone wants some Brooklyn data, let me know!

Tips:

1. You should probably use this dataset to get interesting insights: https://www.yelp.com/dataset/download
2. DO NOT SIMPLY SCRAPE ON YOUR LOCAL NETWORK. If you mess up, it’ll get your IP banned. Your ISP will refresh your IP within a month, but it’ll annoy the people who share the internet connection with you. Either use a VPN, or spin up a cloud computing instance and use that to download your data.
3. Rate limit your scraper. Don’t try to download all this data within the span of a minute. Pace your scraper to download a reasonable amount per minute so you don’t overload their APIs. Or get caught. There is no need to rush.
4. You should learn regular expressions. I was talking to someone who was collecting data with DOM traversal. That may be more painful than building a reasonable regular expression. My two cents.
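
To make tip 3 concrete, here is a minimal sketch of pacing requests in plain Python. `fetch_all` is a hypothetical helper, `fetch` stands in for whatever performs the real request, and the 2-second default is an arbitrary example rather than a limit the website publishes. If you are using Scrapy, its `DOWNLOAD_DELAY` setting covers the same need without custom code.

```python
import time
from typing import Callable, Iterable, List, Optional


def fetch_all(urls: Iterable[str], fetch: Callable[[str], str],
              min_interval: float = 2.0,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Call fetch(url) for each URL, starting at most one per min_interval seconds."""
    results: List[str] = []
    last_start: Optional[float] = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the interval since the last request.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results


# Demo with a fake fetch and a tiny interval so it finishes quickly.
pages = fetch_all(["/a", "/b", "/c"],
                  fetch=lambda u: f"<html>{u}</html>",
                  min_interval=0.01)
print(len(pages))
```

Passing `sleep` and `clock` as parameters is only there so the pacing can be checked without actually waiting.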

This post first appeared on Byshiny’s.
