Extracting WAV file header information using a Python script
I am currently working on WAV file playback in an embedded device. To store WAV (PCM) sounds on an embedded device, you need to strip the headers and extract the uncompressed PCM data from the file. My previous post shows how to convert the binary data from the WAV file into a C array that can be included in a project as a source file. In this post, I show you how to figure out where the PCM data lies within the WAV file.
WAV File Format
I won’t go into the details, but suffice it to say that the WAV format consists of ‘chunks’, and only one of them, the ‘data’ chunk, actually contains the audio data. You can get into the details here (simplified version) and here (more detailed). For example, the start.wav file you can find in the %SYSTEMROOT%\Media directory in Windows XP has the following chunks, in this order:
fmt , fact, data, LIST
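As a rough sketch of what a chunk looks like on disk: each one starts with an 8-byte header, a 4-character ID followed by a 32-bit little-endian size that counts only the payload after the header. The helper name `parse_chunk_header` and the sample bytes below are made up for illustration and are not part of WavHeader.py.

```python
import struct

def parse_chunk_header(raw):
    """Split an 8-byte RIFF chunk header into (chunk_id, payload_size)."""
    if len(raw) != 8:
        raise ValueError("a chunk header is exactly 8 bytes")
    # '<4sL' = 4 raw ID bytes, then an unsigned 32-bit little-endian size
    chunk_id, size = struct.unpack('<4sL', raw)
    return chunk_id.decode('ascii'), size

# Hypothetical header bytes: a 'data' chunk announcing 2024 bytes of payload
example = b'data' + struct.pack('<L', 2024)
print(parse_chunk_header(example))
```

Note that the ID is always four characters, which is why the format chunk is spelled "fmt " with a trailing space.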
To embed a PCM sound, the ‘data’ chunk must contain uncompressed PCM waveform data. The start.wav file contains compressed Microsoft ADPCM data (we will see later how I know that). You can open the file in Audacity and re-encode it as uncompressed PCM, choosing whichever sampling frequency, bits per sample and channel count are easiest for your embedded device to handle. Once you have the new file with uncompressed PCM data, you can extract that data by figuring out the location of the data chunk as below.
Locating the Data Chunk
Posted below is the source code for the Python script that extracts information from the WAV header. The script can be executed from the command line like this:
python WavHeader.py start.wav
Of course, start.wav should be in the same directory as WavHeader.py. If it is not, supply the full path to start.wav instead. The output looks like this:
Subchunks Found:
fmt , fact, data, LIST,

Data Chunk located at offset [82] of data length [1024] bytes
BitsPerSample: 4
NumChannels: 2
ChunkSize: 1184
Format: WAVE
Filename: start.wav
ByteRate: 22311
Subchunk1Size: 50
AudioFormat: 2
BlockAlign: 1024
SampleRate: 22050
Of particular interest is the AudioFormat field. If it doesn’t read 1, it is not uncompressed PCM data. If you look up the Common Wave Compression Codes table at this page, you will find that 2 corresponds to Microsoft ADPCM compression. This is of no use to me, so I converted the file using Audacity and ran the same script to get this:
Subchunks Found:
fmt , data,

Data Chunk located at offset [36] of data length [2024] bytes
BitsPerSample: 16
NumChannels: 1
ChunkSize: 2060
Format: WAVE
Filename: start_pcm_8khz_16bit.wav
ByteRate: 16000
Subchunk1Size: 16
AudioFormat: 1
BlockAlign: 2
SampleRate: 8000
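The AudioFormat values in the two outputs above (2 and 1) can be turned into names with a small lookup. This is only a sketch: the table lists a handful of well-known format codes rather than the full registry, and the names `WAVE_FORMAT_NAMES`, `describe_audio_format` and `is_plain_pcm` are mine, not part of the script.

```python
# A few well-known AudioFormat codes; this is not an exhaustive table
WAVE_FORMAT_NAMES = {
    0x0001: "PCM (uncompressed)",
    0x0002: "Microsoft ADPCM",
    0x0003: "IEEE float",
    0x0006: "ITU G.711 A-law",
    0x0007: "ITU G.711 mu-law",
    0x0011: "IMA ADPCM",
    0xFFFE: "WAVE_FORMAT_EXTENSIBLE",
}

def describe_audio_format(code):
    """Return a readable name for an AudioFormat code, if it is in the table."""
    return WAVE_FORMAT_NAMES.get(code, "unknown (0x%04X)" % code)

def is_plain_pcm(code):
    """True only for the plain uncompressed PCM code."""
    return code == 1

for code in (2, 1):
    print(code, describe_audio_format(code), is_plain_pcm(code))
```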
As you can see, the AudioFormat field now has a value of 1 (uncompressed PCM). I also know that the data chunk starts at offset 36 and holds 2024 bytes of data. The 8 bytes at offset 36 are the subchunk header (the ‘data’ ID and the size field) and are not counted in those 2024 bytes. Also, notice that the data chunk is now the last chunk in the file, whereas earlier there was a LIST chunk after it. So, if I read 2024 bytes starting at offset 36 + 8 = 44, I get the raw PCM data, which I can then convert into whatever format I want. One more thing to note is the byte order of the data words. WAV uses little-endian byte ordering, so each 16-bit sample is stored low byte first, and the embedded device needs to process the data accordingly.
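To make the offset arithmetic concrete, here is a minimal sketch of pulling the samples out once the script has reported the chunk offset and data length. It assumes 16-bit PCM, as in the converted file; the function name `read_pcm_samples` is hypothetical.

```python
import struct

def read_pcm_samples(path, data_offset, data_length):
    """Return the 16-bit samples stored in a WAV data chunk.

    data_offset is the chunk offset reported by the script (36 above);
    data_length is the payload size in bytes (2024 above).
    Assumes 16-bit PCM; multi-channel data comes back interleaved.
    """
    with open(path, 'rb') as f:
        f.seek(data_offset + 8)        # skip the 8-byte subchunk header
        raw = f.read(data_length)
    if len(raw) != data_length:
        raise ValueError("file is shorter than the data chunk claims")
    count = len(raw) // 2              # two bytes per 16-bit sample
    # '<h' = signed 16-bit little-endian, repeated count times
    return list(struct.unpack('<%dh' % count, raw[:count * 2]))
```

For the converted file above, the call would be `read_pcm_samples('start_pcm_8khz_16bit.wav', 36, 2024)`, which should yield 1012 samples since each one takes two bytes.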
The Code
In all its glory:
# WavHeader.py
# Extract basic header information from a WAV file (written for Python 3)

import logging
import os
import struct
import sys

logging.basicConfig(level=logging.DEBUG)


def DumpHeaderOutput(structHeaderFields):
    for key, value in structHeaderFields.items():
        print("%s: %s" % (key, value))


def PrintWavHeader(strWAVFile):
    """ Extracts the header fields of a WAV file, locates its data chunk
        and writes everything out in a human-readable format
    """
    # Open file
    try:
        fileIn = open(strWAVFile, 'rb')
    except IOError:
        logging.debug("Could not open input file %s" % (strWAVFile))
        return

    # The with block closes the file on every exit path
    with fileIn:
        # Read the RIFF header and the fixed part of the fmt chunk
        bufHeader = fileIn.read(36)

        # Verify that the correct identifiers are present
        if (len(bufHeader) < 36 or
                bufHeader[0:4] != b"RIFF" or
                bufHeader[12:16] != b"fmt "):
            logging.debug("Input file not a standard WAV file")
            return

        # Parse fields (all multi-byte values are little-endian)
        stHeaderFields = {
            'ChunkSize': struct.unpack('<L', bufHeader[4:8])[0],
            'Format': bufHeader[8:12].decode('ascii', 'replace'),
            'Subchunk1Size': struct.unpack('<L', bufHeader[16:20])[0],
            'AudioFormat': struct.unpack('<H', bufHeader[20:22])[0],
            'NumChannels': struct.unpack('<H', bufHeader[22:24])[0],
            'SampleRate': struct.unpack('<L', bufHeader[24:28])[0],
            'ByteRate': struct.unpack('<L', bufHeader[28:32])[0],
            'BlockAlign': struct.unpack('<H', bufHeader[32:34])[0],
            'BitsPerSample': struct.unpack('<H', bufHeader[34:36])[0],
            'Filename': os.path.basename(strWAVFile),
        }

        # Locate the data chunk
        chunksList = []
        dataChunkLocation = 0
        dataChunkSize = 0
        fileIn.seek(0, 2)  # Seek to end of file
        inputFileSize = fileIn.tell()
        nextChunkLocation = 12  # skip the RIFF header

        # Stop once fewer than 8 bytes (one subchunk header) remain
        while nextChunkLocation + 8 <= inputFileSize:
            # Read subchunk header
            fileIn.seek(nextChunkLocation)
            bufHeader = fileIn.read(8)
            chunkSize = struct.unpack('<L', bufHeader[4:8])[0]
            if bufHeader[0:4] == b"data":
                dataChunkLocation = nextChunkLocation
                dataChunkSize = chunkSize
            chunksList.append(bufHeader[0:4].decode('ascii', 'replace'))
            # Chunks are word-aligned: an odd-sized chunk has one pad byte
            nextChunkLocation += 8 + chunkSize + (chunkSize & 1)

    # Dump subchunk list
    print("Subchunks Found: ")
    print("".join("%s, " % (chunkName) for chunkName in chunksList))
    print()

    # Dump data chunk information
    if dataChunkLocation != 0:
        print("Data Chunk located at offset [%s] of data length [%s] bytes" %
              (dataChunkLocation, dataChunkSize))

    # Print output
    DumpHeaderOutput(stHeaderFields)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Invalid argument. Exactly one wave file location required as argument")
    else:
        PrintWavHeader(sys.argv[1])
If you have any queries, leave me a comment below. If you found this script useful, leave me a comment below so I stay motivated to continue blogging about issues like this. Feel free to use the script as you like with the understanding that I am not guaranteeing anything and am not liable for any issues arising out of your use of this code.
Good one, gave me confidence to try out my audio classification project.
Thanks. Great article, very useful for audio chip drivers testing 🙂
Hey, this article is really good!
The script is ideal for smooth and fast testing of the correctness of WAV files.
I used it for the development of my tool that repairs WAV headers, so I could just perform a check on the subchunks with your script! Really nice, and it helped me a lot to get into the material!
ty!
This code is legit. Works on both compressed and uncompressed WAV files. I adapted it to get the length of each wav file.
Thanks for sharing!