Extracting WAV file header information using a Python script

I am currently working on WAV file playback in an embedded device. To store WAV (PCM) sounds on an embedded device, you need to strip the headers and extract the uncompressed PCM data from the file. My previous post shows how to convert the binary data from the WAV file into a C array that can be included in a project as a source file. In this post, I show you how to figure out where the PCM data lies within the WAV file.

WAV File Format

I won’t go into the details, but suffice it to say that a WAV file consists of ‘chunks’, and only one of them, the ‘data’ chunk, actually contains the audio samples. You can get into the details here (simplified version) and here (more detailed). For example, the start.wav file you can find in the %SYSTEMROOT%\Media directory in Windows XP has the following chunks, in this order:
fmt , fact, data, LIST
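
To make the chunk layout a bit more concrete, here is a minimal sketch of how the chunks in a file can be walked. It is not the script from this post, just an illustration. It assumes the usual RIFF layout: a 12-byte RIFF header, followed by chunks that each start with a 4-byte ID and a 4-byte little-endian size, with odd-sized chunks padded to an even length. The function name iter_chunks and the hand-built example bytes are made up for this illustration.

```python
import struct


def iter_chunks(wav_bytes):
    """Yield (chunk_id, offset, size) for each chunk after the RIFF header."""
    offset = 12  # skip 'RIFF', the 4-byte RIFF size and 'WAVE'
    while offset + 8 <= len(wav_bytes):
        chunk_id = wav_bytes[offset:offset + 4]
        (size,) = struct.unpack('<L', wav_bytes[offset + 4:offset + 8])
        yield chunk_id, offset, size
        # Move past the 8-byte chunk header, the payload and any pad byte
        offset += 8 + size + (size & 1)


# Hand-built example (an assumption, not a real file): a 16-byte fmt chunk
# describing 8 kHz, 16-bit mono PCM, followed by 4 bytes of sample data.
fmt_body = struct.pack('<HHLLHH', 1, 1, 8000, 16000, 2, 16)
body = (b'WAVE'
        + b'fmt ' + struct.pack('<L', len(fmt_body)) + fmt_body
        + b'data' + struct.pack('<L', 4) + b'\x00\x01\x02\x03')
example = b'RIFF' + struct.pack('<L', len(body)) + body

for chunk_id, offset, size in iter_chunks(example):
    print(chunk_id, offset, size)
```

In this example the fmt chunk sits at offset 12 and the data chunk at offset 36, which is the same layout the converted file further down ends up with.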

To embed a PCM sound, the ‘data’ chunk must contain uncompressed PCM waveform data. The start.wav file has compressed Microsoft ADPCM data (we will see later how I know that). You can open the file in Audacity and re-encode it to remove the compression and save it at a bitrate/sampling frequency/bits per sample setting that is easiest to handle for your embedded device. Once you have the new file with uncompressed PCM information, you can extract that information by figuring out the location of the data chunk as below.
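
When choosing the sampling frequency and bits per sample for the re-encoded file, it helps to estimate how much storage the sound will take on the device. Below is a rough sketch of that arithmetic for uncompressed PCM, where each frame takes channels × bytes per sample. The function names are mine and exist only for this illustration.

```python
def pcm_block_align(num_channels, bits_per_sample):
    """Bytes per frame (one sample for every channel)."""
    return num_channels * bits_per_sample // 8


def pcm_byte_rate(sample_rate, num_channels, bits_per_sample):
    """Bytes of PCM data needed per second of audio."""
    return sample_rate * pcm_block_align(num_channels, bits_per_sample)


def pcm_size_bytes(duration_ms, sample_rate, num_channels, bits_per_sample):
    """Approximate size of the raw PCM data for a sound of the given length."""
    num_frames = duration_ms * sample_rate // 1000
    return num_frames * pcm_block_align(num_channels, bits_per_sample)


# One second of 8 kHz, 16-bit mono versus 22.05 kHz, 16-bit stereo
print(pcm_size_bytes(1000, 8000, 1, 16))
print(pcm_size_bytes(1000, 22050, 2, 16))
```

By this estimate, one second of 8 kHz, 16-bit mono audio needs 16000 bytes, while 22.05 kHz, 16-bit stereo needs 88200 bytes, which is why a lower setting is usually easier to live with on a small device.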

Locating the Data Chunk

Posted below is the source code for the Python script that extracts information from the WAV header. The script can be executed from the command line like this:
python WavHeader.py start.wav
Of course, start.wav should be in the same directory as WavHeader.py. If it is not, supply the full path to start.wav instead. The output looks like this:

Subchunks Found:
fmt ,  fact,  data,  LIST,
Data Chunk located at offset [82] of data length [1024] bytes
BitsPerSample:  4
NumChannels:  2
ChunkSize:  1184
Format:  WAVE
Filename:  start.wav
ByteRate:  22311
Subchunk1Size:  50
AudioFormat:  2
BlockAlign:  1024
SampleRate:  22050

Of particular interest is the AudioFormat field. If it doesn’t read 1, it is not uncompressed PCM data. If you look up the Common Wave Compression Codes table at this page, you will find that 2 corresponds to Microsoft ADPCM compression. This is of no use to me, so I converted the file using Audacity and ran the same script to get this:

Subchunks Found:
fmt ,  data,
Data Chunk located at offset [36] of data length [2024] bytes
BitsPerSample:  16
NumChannels:  1
ChunkSize:  2060
Format:  WAVE
Filename:  start_pcm_8khz_16bit.wav
ByteRate:  16000
Subchunk1Size:  16
AudioFormat:  1
BlockAlign:  2
SampleRate:  8000

As you can see, the AudioFormat field now has a value of 1 (uncompressed PCM). The data chunk is at offset 36 and holds 2024 bytes of sample data. The chunk itself starts with 8 bytes of subchunk header (the ‘data’ identifier followed by a 4-byte length), so the samples begin at offset 36 + 8 = 44. Also notice that the data chunk is now the last chunk in the file, whereas earlier there was a LIST chunk after it. So, if I read 2024 bytes starting at offset 44, I get the raw PCM data, which I can then convert into whatever format I want. One more thing to note is the byte order of the data words. WAV uses little-endian byte ordering, so 16-bit samples are stored low byte first, and you need to handle that when using the extracted data in an embedded device.
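
The extraction step can be sketched roughly as follows. This is only a sketch under the assumptions of this post: the file holds 16-bit PCM, and the offset of the data chunk (36 for my converted file) is already known from the script output. The function name extract_pcm_samples and the hand-built demo bytes are hypothetical.

```python
import struct


def extract_pcm_samples(wav_bytes, data_chunk_offset):
    """Return the 16-bit samples stored in the data chunk at the given offset."""
    chunk_id = wav_bytes[data_chunk_offset:data_chunk_offset + 4]
    if chunk_id != b'data':
        raise ValueError("No data chunk at offset %d" % data_chunk_offset)
    # The 4 bytes after the ID hold the payload length (little-endian)
    (size,) = struct.unpack(
        '<L', wav_bytes[data_chunk_offset + 4:data_chunk_offset + 8])
    start = data_chunk_offset + 8  # samples begin after the 8-byte header
    payload = wav_bytes[start:start + size]
    count = len(payload) // 2  # two bytes per 16-bit sample
    # '<h' decodes each sample as a little-endian signed 16-bit integer
    return list(struct.unpack('<%dh' % count, payload[:count * 2]))


# Demo on made-up bytes: 36 filler bytes standing in for the headers,
# then a data chunk holding the two samples 1 and -1.
demo = b'\x00' * 36 + b'data' + struct.pack('<L', 4) + b'\x01\x00\xff\xff'
print(extract_pcm_samples(demo, 36))
```

For the converted file, the call would be extract_pcm_samples(open('start_pcm_8khz_16bit.wav', 'rb').read(), 36). If the target device is big-endian, the two bytes of each sample have to be swapped before the data is stored.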

The Code

In all its glory:

# WavHeader.py
#   Extract basic header information from a WAV file
import logging
import os
import struct
import sys


def PrintWavHeader(strWAVFile):
    """ Extracts the header fields from a WAV file, locates the data
            chunk and writes it all out in a human-readable format
    """
    def DumpHeaderOutput(structHeaderFields):
        for key in structHeaderFields.keys():
            print("%s:  %s" % (key, structHeaderFields[key]))
        # end for

    # Open file
    try:
        fileIn = open(strWAVFile, 'rb')
    except IOError:
        logging.error("Could not open input file %s" % (strWAVFile))
        return
    # end try

    # Read in the RIFF header and the start of the fmt chunk
    bufHeader = fileIn.read(38)

    # Verify that the correct identifiers are present
    # (byte strings, because the file is opened in binary mode)
    if (len(bufHeader) < 36) or \
       (bufHeader[0:4] != b"RIFF") or \
       (bufHeader[12:16] != b"fmt "):
        logging.error("Input file not a standard WAV file")
        fileIn.close()
        return
    # endif

    stHeaderFields = {'ChunkSize' : 0, 'Format' : '',
        'Subchunk1Size' : 0, 'AudioFormat' : 0,
        'NumChannels' : 0, 'SampleRate' : 0,
        'ByteRate' : 0, 'BlockAlign' : 0,
        'BitsPerSample' : 0, 'Filename': ''}

    # Parse fields ('<' = little-endian, 'L' = 32 bits, 'H' = 16 bits)
    stHeaderFields['ChunkSize'] = struct.unpack('<L', bufHeader[4:8])[0]
    stHeaderFields['Format'] = bufHeader[8:12].decode('latin-1')
    stHeaderFields['Subchunk1Size'] = struct.unpack('<L', bufHeader[16:20])[0]
    stHeaderFields['AudioFormat'] = struct.unpack('<H', bufHeader[20:22])[0]
    stHeaderFields['NumChannels'] = struct.unpack('<H', bufHeader[22:24])[0]
    stHeaderFields['SampleRate'] = struct.unpack('<L', bufHeader[24:28])[0]
    stHeaderFields['ByteRate'] = struct.unpack('<L', bufHeader[28:32])[0]
    stHeaderFields['BlockAlign'] = struct.unpack('<H', bufHeader[32:34])[0]
    stHeaderFields['BitsPerSample'] = struct.unpack('<H', bufHeader[34:36])[0]

    # Locate & read data chunk
    chunksList = []
    dataChunkLocation = 0
    fileIn.seek(0, 2) # Seek to end of file
    inputFileSize = fileIn.tell()
    nextChunkLocation = 12 # skip the RIFF header
    while 1:
        # Read subchunk header
        fileIn.seek(nextChunkLocation)
        bufHeader = fileIn.read(8)
        if len(bufHeader) < 8:
            break # truncated file, no complete subchunk header left
        # endif
        chunksList.append(bufHeader[0:4].decode('latin-1'))
        if bufHeader[0:4] == b"data":
            dataChunkLocation = nextChunkLocation
        # endif
        chunkSize = struct.unpack('<L', bufHeader[4:8])[0]
        # Chunks are word-aligned: an odd-sized chunk is followed by a pad byte
        nextChunkLocation += (8 + chunkSize + (chunkSize & 1))
        if nextChunkLocation >= inputFileSize:
            break
        # endif
    # end while

    # Dump subchunk list
    print("Subchunks Found:")
    print(" ".join("%s, " % (chunkName) for chunkName in chunksList))

    # Dump data chunk information
    if dataChunkLocation != 0:
        fileIn.seek(dataChunkLocation)
        bufHeader = fileIn.read(8)
        print("Data Chunk located at offset [%s] of data length [%s] bytes" %
              (dataChunkLocation, struct.unpack('<L', bufHeader[4:8])[0]))
    # endif

    # Print output
    stHeaderFields['Filename'] = os.path.basename(strWAVFile)
    DumpHeaderOutput(stHeaderFields)

    # Close file
    fileIn.close()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Invalid argument. Exactly one wave file location required as argument")
    else:
        PrintWavHeader(sys.argv[1])
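
If you don’t have a suitable WAV file at hand to try the script on, one option is to generate a small one with Python’s standard wave module. The sketch below is my own addition, not part of WavHeader.py: it writes 1012 frames of silence as 8 kHz, 16-bit mono PCM, which I would expect to give the same layout as my converted file (data chunk at offset 36, 2024 bytes of data). The helper name write_test_wav and the output file name are made up.

```python
import os
import tempfile
import wave


def write_test_wav(path, num_frames=1012, sample_rate=8000):
    """Write num_frames of silence as 16-bit mono PCM."""
    wav = wave.open(path, 'wb')
    try:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(b'\x00\x00' * num_frames)
    finally:
        wav.close()


# Write into a temporary directory so nothing existing gets overwritten
test_path = os.path.join(tempfile.mkdtemp(), 'test_pcm_8khz_16bit.wav')
write_test_wav(test_path)
print(test_path, os.path.getsize(test_path))
```

Running WavHeader.py on the generated file should then report an fmt chunk followed by a data chunk, with AudioFormat 1.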

If you have any queries, leave me a comment below. If you found this script useful, leave me a comment below so I stay motivated to continue blogging about issues like this. Feel free to use the script as you like with the understanding that I am not guaranteeing anything and am not liable for any issues arising out of your use of this code.

