CHM Text Searches – Lesson Learned

My current project involves the CHM (Microsoft Compiled HTML Help) files for Windows Embedded Standard 7. The project was underway before I joined the team, and it had even changed managers during that time, which always creates at least a bit of chaos.

We dropped our CHM files to the product team for integration last week and then discovered we had not completely purged the project’s internal name and the prior release name from the contents in favor of the final release name. This happened even though four other people and I had all searched the Help for the string. It even happened despite a tool that automatically checks, at build time, for whatever strings you tell it to.

After dealing with the aftermath and rebuild/redrop (my boss was out of town), I looked into why this had happened and discovered I had let my tester mindset go to sleep and had not really thought in depth about how the searches were being done and the potential holes in these approaches.

Here is how the checks were set up:

  • The tool was checking for the term in the files at build time.
  • The individual searches done ad-hoc were using the search functionality provided in the Help by Windows when reading the CHM file.

The following discoveries were made as we looked into where this failed:

  • The tool does not check the index. Several instances of terms entered into the tool were found in the index.
  • Not all “bad terms” were entered into the tool’s database.
  • The Search functionality in Help does not search the index and no one thought to look there because the index was generated from files thought to be clean.
  • The Search functionality in Help matches whole words unless wildcards are used. All ad hoc searches were done with the same plain string, and no one thought to add a wildcard before, after, or on both sides of the “bad string.” As a result, no instance where the “bad string” was part of a longer word could be found, though quite a few existed.
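The substring gap in that last bullet is easy to reproduce outside of Help. The following is a minimal Python sketch, assuming the Help search behaves roughly like a whole-word match; the term `CodenameX` and the sample text are made up for illustration and are not the real project names.

```python
import re

def whole_word_hits(term, text):
    """Approximate a whole-word search: the term must stand alone as a word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)

def substring_hits(term, text):
    """Find the term anywhere, including inside longer words."""
    return re.findall(re.escape(term), text, flags=re.IGNORECASE)

# Hypothetical sample text; "CodenameX" stands in for the internal name.
sample = (
    "Install CodenameX first. "
    "See CodenameXSetup.exe and the MyCodenameX_Tools folder."
)

print(len(whole_word_hits("CodenameX", sample)))  # standalone occurrences only
print(len(substring_hits("CodenameX", sample)))   # every occurrence
```

The whole-word version sees only the standalone occurrence, while the substring version also finds the term buried inside the file name and the folder name, which is the kind of hit our ad hoc searches missed.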

It seems obvious now, of course, but it really does help to look at these occurrences and see what the root cause was so that it can be caught in the future. In this case, we took away the following actions for the next release:

  • All terms must be entered in the tool’s database.
  • The CHM must be decompiled and the HTML contents searched for the “bad terms”. This will catch substrings and index occurrences.
  • Ad hoc searching must use wildcards (and regex where the search tool supports it) to look for both whole-string and substring occurrences.
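The decompile-and-search action could be scripted along these lines. This is only a sketch under assumptions: it presumes the CHM has already been decompiled into a folder (for example with `hh.exe -decompile <output folder> <file.chm>`), that the topics and index land there as `.htm`, `.html`, `.hhk`, and `.hhc` files, and that they are readable as UTF-8. The function name `scan_tree`, the folder name, and the term list are hypothetical.

```python
import os
import re

def scan_tree(root, bad_terms, extensions=(".htm", ".html", ".hhk", ".hhc")):
    """Return (path, line number, matched text) for every bad term found.

    Matching is case-insensitive and substring-based, so a term inside a
    longer word or inside an index entry is reported as well.
    """
    if not bad_terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in bad_terms), re.IGNORECASE)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue  # skip images, stylesheets, and other non-text files
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on unexpected encodings
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    for match in pattern.finditer(line):
                        hits.append((path, lineno, match.group(0)))
    return hits

if __name__ == "__main__":
    # Hypothetical folder and terms; substitute the real decompile output.
    for path, lineno, text in scan_tree("decompiled_chm", ["CodenameX"]):
        print(f"{path}:{lineno}: {text}")
```

Because this reads the raw files rather than going through the Help viewer, it does not depend on how the viewer's search treats whole words or the index.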

Lesson learned, and I paid the price, both in terms of this instance’s particular issues and in terms of forgetting to look for holes in the process and testing, even if that is not my job.


One Response to “CHM Text Searches – Lesson Learned”

  1. Dave Says:

    Instead of decompiling the CHM file — what about searching the source prior to compilation? A full text search across all source files (with wildcards, of course) should catch this problem. Unless you are using variables in your source content that are only specified at run time.

    Searching the source will ignore any boilerplate or template text added during publishing though.
