Recreate Cleaning Script from MGG-LC-Collab #48

Open
regan008 opened this issue May 17, 2023 · 0 comments
Labels
Data-errors missing data, inconsistent data, or other error in transcription

Comments

@regan008
Collaborator

For the MGG-LC collab we cleaned the city names so that neighborhoods (like Studio City) are recorded under their main city (Los Angeles). We also fixed a lot of typos, but the script only covers CA, NY, IL, and FL. We probably need to do this for all the states.

Code used:

library(dplyr)    # %>%, select(), mutate(), mutate_if(), filter()
library(forcats)  # fct_collapse()

mgg.data <- readRDS(file = "mgg-data.rds")
mgg.data <- mgg.data %>% select(-ID, -streetaddress, -lat, -lon)

mgg.data <- mgg.data %>% mutate_if(is.character, trimws)
mgg.data.notfiltered <- mgg.data %>% filter(state != "CA" & state != "NY" & state != "FL" & state != "IL")
mgg.ca <- mgg.data %>% filter(state == "CA")
mgg.ca <- mgg.ca %>% mutate(city = fct_collapse(mgg.ca$city, 
                                                Alleghany = "Alleghanny",
                                                Clearlake =	"Clear Lake",
                                                "Clearlake Oaks" = "Clear Lake Oaks",
                                                "Culver City"	= "Culber City",
                                                "El Monte"	= "South El Monte",
                                                Hawthorne	= "Hawthrone",
                                                Laguna	= "South Laguna",
                                                "Lake Tahoe"	= "South Lake Tahoe",
                                                Lompoc	= "Lampoc",
                                                "Los Angeles"	= c("Highland Park", "Hollywood", "Hollywood Station", "Hollywood West", "Midtown", "Midtown- Los Angeles", "N. Hollywood", "North Hollywood", "Silver Lake", "Silverlake", "Studio City", "Valley", "Van Nuys", "West Hollywood", "West Los Angeles"),
                                                Pasadena = "South Pasadena",
                                                Philipsville = "Philipsville, Humbolt County", 
                                                "Redwood City" = "Redwood CIty",
                                                "Russian River"	= "Russian RIver",
                                                Sacramento	= c("South Sacramento", "West Sacramento"),
                                                "San Francisco"	= c("Castro Village", "Castro Village- S.F.", "Folsom", "Folsom- S.F.", "Haight, Filmore & West- S.F.", "Mission District","Polk Street Area", "Polk Street Area- S.F.", "Tenderloin", "Tenderloin to North Beach- S.F."),
                                                "San Gabriel"	= "South San Gabriel",
                                                Tahoe	= "Tahoe Vista"))
  
  
mgg.fl <- mgg.data %>% filter(state == "FL")
mgg.fl <- mgg.fl %>% mutate(city = fct_collapse(mgg.fl$city,
                                              "Daytona Beach"	= "S. Daytona Beach",
                                              Dunedin =	"Dunepin",
                                              "Ft. Lauderdale"	= c("Ft. Lauderdale Beach",	"Ft. Launderdale"),
                                              Gainesville	= "Gainsville",
                                              Hialeah	= "Hilaleah",
                                              Jacksonville	= "Jacksonville Beach",
                                              "Madeira Beach" =	"Medeira Beach",
                                              Miami = "North Miami",
                                              "Miami Beach" = "North Miami Beach",
                                              "Pinellas Park" = c("Pineallis Park", "Pinellis Park"),
                                              "Ponce de Leon" = "Ponce De Leon"))
        

mgg.il <- mgg.data %>% filter(state == "IL")
mgg.il <- mgg.il %>% mutate(city = fct_collapse(mgg.il$city,
                                                Carbondale = "Carbondle",
                                                Chicago	= c("Chicago Heights", "Chicago- Near North Side", "Chicago- New Town", "Chicago-North Side", "Chicago-South Side"),
                                                "Des Plaines" =	"Des Plains",
                                                "Elk Grove Village" =	"Elk Grove",
                                                Hinsdale = "Hindale"))


mgg.ny <- mgg.data %>% filter(state == "NY")
mgg.ny <- mgg.ny %>% mutate(city = fct_collapse(mgg.ny$city,
                                              Babylon	= c("Babylon, L.I", "Babylon, L.I."),
                                              Baldwin	= "Baldwin, L.I.",
                                              "Bay Shore" = "Bayshore",
                                              Bayville = "Bayville, L.I.",
                                              Bellmore = c("Bellmore, L.I.", "N. Bellmore"),
                                              Binghamton = "Binghampton",
                                              Bronx	= "The Bronx",
                                              Brooklyn = "Brooklyn Heights",
                                              Copiague = c("Copaigue", "Copiaque"),
                                              "Croton Falls" = "Crofton Falls",
                                              "Deer Park" = c("Deep Park", "Deer Park, L.I."),
                                              "East Hampton" = "Easthampton",
                                              "East Meadow" = "E. Meadow",
                                              "East Northport" = "E. Northport",
                                              "Fire Island" = "Fire Island, L.I.",
                                              "Floral Park" = "Floral Park, L.I.",
                                              "Hauppauge" = "Hauppauge, L.I.",
                                              "Huntington" = "Huntington, L.I.",
                                              "Lake Ronkonkoma" = c("Lake Ronkonkoma, L.I.", "Lake Ronkonnoma"),
                                              "New York City" = c("Chelsea", "Greenwich Village", "Greenwich Village & Chelsea", "Manhattan", "Midtown", "New York", "Uptown", "Uptown- N.Y.C."),
                                              "Niagara Falls" = "Niagra Falls",
                                              "North Bellmore" = "North Bellmore, L.I.",
                                              "Northport"	= "Northpoint",
                                              Queens = c("Forest Hill", "Forest Hills", "Forrest Hill", "Jackson Heights"),
                                              Sayville = "Sayville, L.I.",
                                              Smithtown = "Smithtown, L.I.",
                                              Southampton = c("South Hampton", "Southhampton"),
                                              Utica = "N. Utica",
                                              Wantagh	= "Wantagh, L.I.",
                                              "West Hempstead" = "West Hempstead, L.I.",
                                              Westbury = "Westbury, L.I."
                                              ))
mgg.data <- rbind(mgg.data.notfiltered, mgg.il, mgg.ny, mgg.fl, mgg.ca)
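
One possible way to extend this to all states without writing a separate `fct_collapse()` block per state is to keep the corrections in a lookup table and join it against the data. This is only a sketch: `city_fixes`, `city_clean`, `clean_cities()`, and the toy data are hypothetical names for illustration and are not part of the MGG-LC script. The rows shown are a few of the corrections from the code above.

```r
library(dplyr)
library(tibble)

# Hypothetical lookup table: one row per (state, variant spelling) pair.
# Keying on state as well as city lets "Midtown" resolve differently in CA and NY.
city_fixes <- tribble(
  ~state, ~city,         ~city_clean,
  "CA",   "Studio City", "Los Angeles",
  "CA",   "Culber City", "Culver City",
  "CA",   "Midtown",     "Los Angeles",
  "NY",   "Midtown",     "New York City",
  "IL",   "Des Plains",  "Des Plaines"
)

# Toy stand-in for mgg.data (assumption: the real data has character
# columns named `state` and `city`, as in the script above).
toy <- tibble(
  state = c("CA", "CA", "NY", "IL", "TX"),
  city  = c("Studio City", "Culber City", "Midtown", "Des Plains", "Austin")
)

# Replace a city name when a fix exists for that state; otherwise keep it.
clean_cities <- function(data, fixes) {
  data %>%
    left_join(fixes, by = c("state", "city")) %>%
    mutate(city = coalesce(city_clean, city)) %>%
    select(-city_clean)
}

cleaned <- clean_cities(toy, city_fixes)
```

With this layout, covering a new state means adding rows to the table rather than new code. Each (state, city) pair should appear only once in the lookup table, otherwise the join will duplicate rows in the data.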
@regan008 regan008 added the Data-errors missing data, inconsistent data, or other error in transcription label May 17, 2023