The Python Time Zone Rabbit Hole

Thanks to the recent asshat controversy on Slashdot, and a fellow Slashdotter’s request for the link to the comment that prompted the controversy in the first place, I wrote a Python script to scrape my ~8,000 comments from Slashdot and dump them into a spreadsheet for future reference. I’m planning to write essays about my various misadventures in Silicon Valley, and my comment history is a rich treasure trove of stories I’ve written over the years. While working on the script, I came across a programming rabbit hole, converting one timestamp string into a different timestamp string, that kept me up for three nights.

The original timestamp that I extracted from each comment was a text string like this, “on Friday April 07, 2017 @06:03PM” (as it appeared on the website), and it got written into the CSV (Comma Separated Values) file just like that. After the initial script was working, I opened the 5MB CSV file in Excel to scroll through the data and see what I needed to change in the script. The timestamp string wasn’t in a sortable format. I had to change the timestamp into this format: “2017-04-07 18:03:00”. There are two ways of doing this in Python: using a datetime object or slicing and dicing the string.
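The sorting problem is easy to demonstrate: ISO 8601-style timestamps sort chronologically as plain strings, while the site’s display format sorts alphabetically. A minimal sketch (the sample values here are illustrative):

```python
# The display format sorts by month *name*, not by date.
display = ["on Friday April 07, 2017 @06:03PM",
           "on Friday August 05, 2016 @01:15PM"]
# The ISO-style format sorts chronologically as plain text.
iso = ["2017-04-07 18:03:00",
       "2016-08-05 13:15:00"]

print(sorted(display))  # "April ... 2017" lands before "August ... 2016"
print(sorted(iso))      # earliest timestamp first, as expected
```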

from datetime import datetime

def get_timestamp(string):
    return datetime.strptime(string, "on %A %B %d, %Y @%I:%M%p")

print(get_timestamp("on Friday April 07, 2017 @09:03PM"))

My first attempt (see code fragment above) was quite simple, using the strptime function of the datetime object to parse the timestamp string according to the matching format string (“on %A %B %d, %Y @%I:%M%p”). When I opened up the CSV file and compared the timestamps against the corresponding timestamps on Slashdot, each timestamp was correct except that the hour in 24-hour time was off by three hours. Every timestamp in the CSV file was off by three hours. I quickly learned that Python’s datetime objects are time zone unaware (or “naive”) by default and, in general, not very easy to use with different time zones.
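A quick check makes the naivety visible: the parsed object carries no time zone information at all, so it can’t know that three hours are missing.

```python
from datetime import datetime

dt = datetime.strptime("on Friday April 07, 2017 @09:03PM",
                       "on %A %B %d, %Y @%I:%M%p")
print(dt.tzinfo)  # None -- a naive datetime, attached to no time zone
print(dt.hour)    # 21 -- the 12-to-24-hour conversion itself is fine
```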

def convert_timestamp(string):

    months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
              'August', 'September', 'October', 'November', 'December']
    month_number = {x: str(y).zfill(2) for y, x in enumerate(months, 1)}

    # remove "on" and split string into list
    string = string[3:].split(' ')

    # slice and dice into date/time components
    month = month_number[string[1]]                       # '04'
    day = string[2][:-1]                                  # '07'
    year = string[3]                                      # '2017'
    hour, minute = string[4][1:][:-2].split(':')          # '06' / '03'
    period = string[4][-2:]                               # 'PM'
    second = '00'                                         # add missing value

    # convert 12-hour time to 24-hour time
    if period == 'AM' and hour == '12':
        hour = '00'                                       # midnight edge case
    elif period == 'PM' and hour != '12':
        hour = str(int(hour) + 12)                        # 1-11 PM -> 13-23

    date_str = '-'.join([year, month, day])               # 2017-04-07
    time_str = ':'.join([hour, minute, second])           # 18:03:00
    return ' '.join([date_str, time_str])                 # 2017-04-07 18:03:00

print(convert_timestamp("on Friday April 07, 2017 @09:03PM"))

My second attempt (see code fragment above) was to slice and dice the timestamp string into the corresponding string values for month, day, year, hour and minute. The second value got added for completeness. If the period was “PM” instead of “AM”, the hour got converted from 12-hour time to 24-hour time. The date, “2017-04-07”, and the time, “18:03:00”, are joined together into one string, “2017-04-07 18:03:00”. When I ran the script and looked at the CSV file, the resulting timestamps were identical to the timestamps created by the datetime object.

Every timestamp was still off by three hours.

When I work on a website scraping script, I always save the scraped data into text files while refining the parsing and output sections, to avoid re-scraping the website. That reduces the risk of my IP address being flagged by the website or firewall as a spammer and/or scraper. The completed script will scrape, parse and write each page directly to the CSV file.
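The cache-then-parse workflow can be sketched roughly like this. The function name, cache path and fetch callable are hypothetical, not the actual script’s names; the point is only that a page is fetched once and every later run reads the saved copy.

```python
import os

def fetch_cached(url, cache_path, fetch):
    """Fetch a page once; reuse the copy saved on disk for later runs."""
    if os.path.exists(cache_path):
        # cached copy exists -- no request hits the website
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    html = fetch(url)  # e.g. urllib.request.urlopen(url).read().decode()
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

While the parsing code is still in flux, every run after the first one works entirely from the cached files.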

The slice-and-dice function converted the timestamp strings as found in those text files. When I viewed the timestamps on the website, they were correct for the Pacific time zone. When I looked at the timestamps in the text files, they were all off by three hours (“09:03PM” instead of “06:03PM”). So both the datetime object and slice-and-dice functions were working properly. The logical conclusion is that the Slashdot server is located in the Eastern time zone and that what I assumed about the data was wrong. Therein lay both the problem and the solution.

from datetime import datetime
from pytz import timezone

def set_timezone(ts_str, tz_alt='US/Eastern'):
    ts_format, tz_def = "on %A %B %d, %Y @%I:%M%p", 'US/Eastern'
    tz_obj = timezone(tz_def).localize(datetime.strptime(ts_str, ts_format))
    return tz_obj if tz_alt == tz_def else tz_obj.astimezone(timezone(tz_alt))

timestamp = set_timezone("on Friday April 07, 2017 @09:03PM", 'US/Pacific')

print(timestamp.strftime("%Y-%m-%d %H:%M:%S %Z%z"))

The third and final attempt (see code fragment above) uses the pytz package to add time zone definitions to the datetime object and to provide the functionality to translate between different time zones. Using the datetime object from the first code fragment, the timestamps get localized as “US/Eastern” and then translated into “US/Pacific” to match the timestamps on the website. The resulting timestamp with time zone info in the CSV file has this format: “2017-04-07 18:03:00 PDT-0700”. A nice thing about the pytz package is that it also handles Daylight Saving Time seamlessly. If you don’t need the time zone info in the timestamp, remove “%Z%z” from the format string.
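The Daylight Saving Time handling is worth seeing in action. Using the same pytz calls as above on a winter date and a summer date (the sample dates here are mine, picked to land on either side of the DST switch), the zone abbreviation and UTC offset come out correctly for each:

```python
from datetime import datetime
from pytz import timezone

fmt = "on %A %B %d, %Y @%I:%M%p"
for s in ("on Sunday January 01, 2017 @09:03PM",   # winter: EST -> PST
          "on Tuesday July 04, 2017 @09:03PM"):    # summer: EDT -> PDT
    eastern = timezone('US/Eastern').localize(datetime.strptime(s, fmt))
    pacific = eastern.astimezone(timezone('US/Pacific'))
    print(pacific.strftime("%Y-%m-%d %H:%M:%S %Z%z"))
```

The winter timestamp prints with “PST-0800” and the summer one with “PDT-0700”; no DST bookkeeping is needed in the calling code.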

Comments

  1. I’d be interested in the script. Have you opened the source so others can use it?

    1. Not yet. I’m still working on the script. I’m planning to release it under an open source license on GitHub during the Fourth of July weekend next month.

Comments are closed.