
How to check text file encoding from command-line?

14 replies to this topic

#1
tomasz86

    www.windows2000.tk

  • Member
  • 2,525 posts
  • Joined 27-November 10
Is there any simple way to check text file encoding from command-line?

I'm working on a script which adds lines to the I386\HIVE*.INF files in Windows 2000/XP source. In an English system those HIVE*.INF files are coded in ANSI but in a Korean Windows XP they're coded in UCS-2 Little Endian.

I don't really need to know the specific encoding. I'd just like to know whether the file is ANSI or not.


#2
allen2

    Not really Newbie

  • Member
  • 1,814 posts
  • Joined 13-January 06
You could use file -i from the GnuWin32 build of the file utility.

#3
jaclaz

    The Finder

  • Developer
  • 14,674 posts
  • Joined 23-July 04

tomasz86 wrote:
    Is there any simple way to check text file encoding from command-line?

    I'm working on a script which adds lines to the I386\HIVE*.INF files in Windows 2000/XP source. In an English system those HIVE*.INF files are coded in ANSI but in a Korean Windows XP they're coded in UCS-2 Little Endian.

    I don't really need to know the specific encoding. I'd just like to know whether the file is ANSI or not.

UCS-2 Little Endian sounds to me a lot like "Unicode". :whistle:

The simplest check you can make is looking for a hex 00 (if there is at least one, it's Unicode, conversely, if there are none, it's not Unicode - and it is very likely to be "plain ANSI text").
http://betterexplain...ticles/unicode/
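
A minimal sketch of the hex 00 check, written in Python rather than with the tools discussed in this thread. The function names contains_nul and file_has_nul and the 4096-byte sample size are assumptions made for illustration. It relies on the text containing at least some ASCII characters, which UCS-2/UTF-16 stores as a character byte plus a 00 byte.

```python
def contains_nul(data: bytes) -> bool:
    """Return True if the byte string contains at least one 0x00 byte."""
    return b"\x00" in data


def file_has_nul(path: str, sample_size: int = 4096) -> bool:
    """Apply the same check to the first sample_size bytes of a file."""
    with open(path, "rb") as f:
        return contains_nul(f.read(sample_size))


if __name__ == "__main__":
    # "[Version]" stored as UCS-2 LE interleaves a 00 byte after every ASCII character.
    print(contains_nul("[Version]".encode("utf-16-le")))
    # The same text in a single-byte ANSI code page has no 00 bytes.
    print(contains_nul("[Version]".encode("cp1252")))
```

Note that a UTF-8 file has no 00 bytes either, so this check only separates UCS-2/UTF-16 style files from single-byte ones.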

jaclaz

#4
tain

    Cyber Ops

  • Super Moderator
  • 3,683 posts
  • Joined 24-September 05

Nice one, jaclaz!

#5
tomasz86

Thanks for your suggestions!

@allen2

It would be nice to be able to do it without any external application, although I'll have a look at it if I can't manage to do it with just the default Windows tools.


@jaclaz

Can you view a text file in hex in the command-line? After googling the only "method" which I've managed to find is to use "debug.exe" but it actually displays "00"s in ANSI files too.

I wonder what you think about this. If you ECHO something to a text file coded in UCS-2 Little Endian from CMD (without the /U switch) the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then just search for it with FINDSTR. If it can't find it then it will mean that the file is UCS-2 Little Endian.

Edited by tomasz86, 07 October 2012 - 04:00 PM.

#6
jumper

    2014 All-American Masters HJ'er

  • Member
  • 498 posts
  • Joined 21-January 11
  • OS:98SE
Attached file: endian.zip (988 bytes) - endian.exe (console-32 app), endian.bat (sample usage)

endian.exe source snippet based on header info detailed at http://betterexplain...ticles/unicode/:
  // read first three (or more) bytes from file into byte array s[], then:
  return
    (s[0]==255) && (s[1]==254)? 255 :
    (s[0]==254) && (s[1]==255)? 254 :
    (s[0]==0xEF) && (s[1]==0xBB) && (s[2]==0xBF)? 239 : 0;
Sample usage: endian.bat
@echo off
%0\..\endian %1

IF ERRORLEVEL 255 GOTO UCS2LE
IF ERRORLEVEL 254 GOTO UCS2BE
IF ERRORLEVEL 239 GOTO UTF8

echo ANSI
GOTO End

:UCS2LE
echo UCS-2 Little Endian
GOTO End

:UCS2BE
echo UCS-2 Big Endian
GOTO End

:UTF8
echo UTF-8

:End
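
For readers without the attachment, here is a rough Python sketch of the same header check. It is not the endian.exe source. The function name bom_code is an assumption; the return values mirror the 255/254/239/0 codes tested by the batch file above.

```python
import sys


def bom_code(header: bytes) -> int:
    """Map the first bytes of a file to the exit codes tested in endian.bat."""
    if header[:2] == b"\xff\xfe":
        return 255  # UCS-2 / UTF-16 little endian BOM
    if header[:2] == b"\xfe\xff":
        return 254  # UCS-2 / UTF-16 big endian BOM
    if header[:3] == b"\xef\xbb\xbf":
        return 239  # UTF-8 BOM
    return 0        # no recognised BOM, treated as ANSI by the batch file


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read only the first three bytes and hand the code back as the exit status.
    with open(sys.argv[1], "rb") as f:
        sys.exit(bom_code(f.read(3)))
```

Saved under a hypothetical name such as bomcode.py, it could stand in for the endian call in the batch file, since the IF ERRORLEVEL tests only look at the exit status.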
 
If debug.exe is available, I think it can be used to achieve the same results. Debug can be scripted to open the file, analyze the first three bytes, and create a temp COM file that sets an appropriate ERRORLEVEL. If Debug doesn't set the ERRORLEVEL upon exit itself, the temp file can be eliminated by running the temp program within Debug.
Design feedback requested:
IHAtool - IpHlpApi tester; call various functions and report results
--status-> framework is solid; 22 api's fully supported; preview release coming soon
ComDlg32 wrapper - ComDlgEx meets IpHlpApi wrapper
--status-> PrintDlgExW working in latest SumatraPDF 8^)
Future projects: ImportPatcher40 - dialog interface; Kexter - IP40+Ktree+Kexstubs

#7
jaclaz


tomasz86 wrote:
    @jaclaz

    Can you view a text file in hex in the command-line? After googling the only "method" which I've managed to find is to use "debug.exe" but it actually displays "00"s in ANSI files too.

    I wonder what you think about this. If you ECHO something to a text file coded in UCS-2 Little Endian from CMD (without the /U switch) the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then just search for it with FINDSTR. If it can't find it then it will mean that the file is UCS-2 Little Endian.

Well, you have gsar already used in your "standard" set of tools, haven't you?

I doubt - no offence intended :) - that an ANSI file contains hex 00, but you can (better :thumbup ) use gsar to find the initial FFFE as jumper suggested.
In any case, FOR /F won't "like" Unicode, thus:


@ECHO OFF
::ISANSI.CMD - small example batch to check if a file is ANSI or UNICODE
FOR /F %%A in  (%1) DO ECHO ANSI&GOTO :EOF
ECHO UNICODE
BUT this won't work the same:
@ECHO OFF
::NOANSI.CMD - small example batch that will always return ANSI
FOR /F %%A in  ('TYPE %1') DO ECHO ANSI&GOTO :EOF
ECHO UNICODE

jaclaz

#8
tomasz86

Well, in this particular script the only "non-Windows" tool is gsar.exe...

But at the moment I actually managed to solve this specific problem with this very simple checking:

SETLOCAL ENABLEDELAYEDEXPANSION
FINDSTR/IL "[Version]" I386\hivedef.inf >NUL
IF !ERRORLEVEL! EQU 0 (
	TYPE temp.txt>>I386\hivedef.inf
) ELSE (
	CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
)
FINDSTR can't find "[Version]" if the file is encoded in UCS-2 Little Endian.
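
A small Python illustration of why the search fails, assuming FINDSTR compares the raw bytes of the file against the ANSI bytes of the search string. The function name findstr_like is hypothetical and only mimics a case-insensitive literal search.

```python
NEEDLE = b"[Version]"


def findstr_like(data: bytes) -> bool:
    """Case-insensitive literal byte search, roughly what FINDSTR /I /L does."""
    return NEEDLE.lower() in data.lower()


# The same line as it would be stored in an ANSI file and in a UCS-2 LE file.
ansi_bytes = "[Version]\r\n".encode("cp1252")
ucs2le_bytes = b"\xff\xfe" + "[Version]\r\n".encode("utf-16-le")

if __name__ == "__main__":
    print(findstr_like(ansi_bytes))    # ANSI: the bytes match directly
    print(findstr_like(ucs2le_bytes))  # UCS-2 LE: a 00 byte sits between the letters
    print(ucs2le_bytes[:8])            # shows the BOM and the interleaved 00 bytes
```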

Edited by tomasz86, 08 October 2012 - 02:26 AM.

#9
allen2

The BOM (if present) might help to detect the encoding of the file.
And this vbs will report the encoding of the file passed as argument:
' Two-digit hex value of the byte at position pos, or "" if the data is shorter.
' Hex() alone returns "0" or "E" for small values, so the result is padded.
Function ByteHex(data, pos)
  ByteHex = ""
  If pos <= LenB(data) Then ByteHex = Right("0" & Hex(AscB(MidB(data, pos, 1))), 2)
End Function

' Report the encoding indicated by the BOM at the start of the file, if any.
Function encoding(fpn)
  Dim file, data, a, b, c, d
  Set file = CreateObject("ADODB.Stream")
  file.Type = 1 ' binary
  file.Open
  file.LoadFromFile fpn
  data = file.Read
  file.Close
  a = ByteHex(data, 1)
  b = ByteHex(data, 2)
  c = ByteHex(data, 3)
  d = ByteHex(data, 4)
  encoding = "unknown or ASCII (no BOM)"
  If a = "EF" And b = "BB" And c = "BF" Then encoding = "UTF-8"
  If (a = "FE" And b = "FF" And Not c = "00") Then encoding = "UTF-16 (BE)"
  If (a = "FF" And b = "FE") Then encoding = "UTF-16 (LE)"
  If (a = "00" And b = "00" And c = "FE" And d = "FF") Then encoding = "UTF-32 (BE)"
  ' Checked after UTF-16 (LE) on purpose: the longer BOM overrides the shorter one.
  If (a = "FF" And b = "FE" And c = "00" And d = "00") Then encoding = "UTF-32 (LE)"
  If (a = "2B" And b = "2F" And c = "76" And (d = "38" Or d = "39" Or d = "2B" Or d = "2F")) Then encoding = "UTF-7"
  If (a = "F7" And b = "64" And c = "4C") Then encoding = "UTF-1"
  If (a = "DD" And b = "73" And c = "66" And d = "73") Then encoding = "UTF-EBCDIC"
  If (a = "0E" And b = "FE" And c = "FF") Then encoding = "SCSU"
  If (a = "FB" And b = "EE" And c = "28") Then encoding = "BOCU-1"
  If (a = "84" And b = "31" And c = "95" And d = "33") Then encoding = "GB-18030"
End Function
WScript.Echo encoding(WScript.Arguments.Item(0))

But many files don't contain a BOM.
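
As a cross-check of the BOM approach, here is a short Python sketch using the BOM constants from the standard codecs module. The table BOMS and the function sniff_bom are names made up for this example, and only the UTF-8/16/32 marks are covered. The longer UTF-32 LE mark has to be tested before UTF-16 LE because it begins with the same two bytes.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
]


def sniff_bom(header: bytes) -> str:
    """Return the encoding named by the BOM at the start of header, or "no BOM"."""
    for bom, name in BOMS:
        if header.startswith(bom):
            return name
    return "no BOM"


if __name__ == "__main__":
    print(sniff_bom(b"\xff\xfe[\x00V\x00"))  # BOM followed by UCS-2 LE text
    print(sniff_bom(b"[Version]"))           # plain text without any BOM
```

A "no BOM" result says nothing about the actual encoding; it only means the file does not announce one.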

#10
jaclaz

It is clear how we all give a different meaning to the word "simple". :whistle:
FOR /F %%A in  (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
:skip

jaclaz

#11
Yzöwl

    Wise Owl

  • Super Moderator
  • 4,557 posts
  • Joined 13-October 04
  • OS:Windows 7 x64

As some system files are encoded in ANSI, is it necessary to append using the same encoding? What happens if you convert the file to ANSI and use / append as such?

#12
tomasz86


jaclaz wrote:
    It is clear how we all give a different meaning to the word "simple". :whistle:

    FOR /F %%A in  (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
    CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
    :skip

Thank you :w00t: I meant simpler when compared to the other options where other tools must be used.


Yzöwl wrote:
    As some system files are encoded in ANSI, is it necessary to append using the same encoding? What happens if you convert the file to ANSI and use / append as such?

It should be possible in the case of these HIVE*.INF files because they use only English and Korean characters, but what about Unicode files that contain a lot of different languages, like this one?

Attached file: Netrtle.7z (32.31 KB)

If you check the [Strings.*] sections you'll see that there are a lot of them for several different languages. If I try to convert such a file to ANSI then many characters are lost.
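
To make the character loss concrete, here is a hedged Python sketch that lists which characters of a string cannot be stored in a given ANSI code page. The function name unencodable and the sample strings are invented for the example and are not taken from the attached INF file.

```python
def unencodable(text: str, codepage: str) -> list:
    """Return the characters of text that the given code page cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


if __name__ == "__main__":
    sample = "Realtek 어댑터"  # hypothetical mixed English/Korean string
    # Western European ANSI (cp1252) cannot hold the Hangul characters...
    print(unencodable(sample, "cp1252"))
    # ...while the Korean ANSI code page (cp949) can hold both parts.
    print(unencodable(sample, "cp949"))
```

A file with string sections for many languages has no single ANSI code page that covers all of them, which is where the loss comes from.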

#13
allen2

Completely off topic, but the netrtle.inf file you attached contains an "r" at the beginning of the first line which shouldn't be there, no matter the encoding or the INF file.

#14
tomasz86


allen2 wrote:
    Completely off topic, but the netrtle.inf file you attached contains an "r" at the beginning of the first line which shouldn't be there, no matter the encoding or the INF file.

It's just a typo of mine :blushing: You can safely ignore it.

#15
jumper

For those with pre-ME systems that don't support FINDSTR, here's a batch file that uses FIND to process one file from the command line or a whole directory of files:

@echo off
if not %1*==* goto TEST

for %%s in (*.inf) do call %0 %%s
goto EXIT

:TEST
find>nul /i "[Version]" %1
IF ERRORLEVEL 1 goto NONANSI

:ANSI
echo %1 is ANSI
goto EXIT

:NONANSI
echo %1 is non-ANSI

:EXIT
